WO2023231987A1 - Text recognition method and electronic device - Google Patents

Text recognition method and electronic device

Info

Publication number: WO2023231987A1 (PCT/CN2023/096921)
Authority: WO, WIPO (PCT)
Prior art keywords: text, content, electronic device, text content, area
Application number: PCT/CN2023/096921
Other languages: French (fr), Chinese (zh)
Inventors: 滕益华, 吴觊豪, 洪芳宇
Original assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023231987A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/146: Aligning or centring of the image pick-up or image-field
    • G06V30/19: Recognition using electronic means

Definitions

  • Embodiments of the present application relate to the field of terminal devices, and in particular, to a text recognition method and an electronic device.
  • the user can use the text recognition function of the application to identify the text in the picture or interface.
  • the text recognition function is implemented based on optical character recognition (OCR) technology.
  • the application can recognize the text in the picture based on OCR technology and output the recognition results.
  • the output results of current OCR technology after text recognition can differ considerably from the original text, which affects the user experience.
  • this application provides a text recognition method and electronic device.
  • the electronic device can output a text recognition result that meets the user's needs based on the image and text content of the text area.
  • embodiments of the present application provide a text recognition method.
  • the method includes: the electronic device performs text area detection on an object to be recognized, and obtains an image of a first text area, where the first text area includes text content.
  • the electronic device performs text content recognition on the acquired first text area to obtain the first text content.
  • the electronic device performs classification based on the image of the first text area and the first text content, and obtains a classification result.
  • the electronic device displays the text recognition result of the first text area based on the classification result.
  • the step of displaying the text recognition result may specifically include: if the classification result is the first category, the first text content is filtered out of the text recognition result; if the classification result is the second category, the text recognition result includes the text content after the first text content has been corrected; and if the classification result is the third category, the text recognition result includes the first text content.
  • in this way, the electronic device can jointly consider the image information (i.e., the image of the text area) and the text information (i.e., the text content): when the text content contained in the text area is largely missing, the recognition result of the text content (i.e., the first text content) is filtered out; when less of the text content is missing, the corrected result is output; and when the text content is not missing, the corresponding text is output directly.
  • in this way, correct and semantically smooth results can be presented in the text recognition results, while results with semantic errors are filtered out, so that a complex, anthropomorphic decision-making effect is obtained to improve the user experience.
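  • as an illustrative sketch only (the function and variable names below are hypothetical and not taken from the patent), the three-way decision described above could be expressed as:

        def display_result(category, raw_text, corrected_text):
            """Map the classification result to the displayed recognition result."""
            if category == 1:          # first category: filter the first text content
                return None            # nothing is displayed for this text area
            if category == 2:          # second category: output the corrected content
                return corrected_text
            return raw_text            # third category: output the content as-is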
  • the text recognition result is optionally displayed in the text recognition result display box 405 in FIG. 4. That is to say, if the classification result indicates the first category (i.e., filtering), the result corresponding to the first text area in the text recognition result display box 405 is empty, that is, the text content recognition result corresponding to the first text area (i.e., the first text content) is not displayed. If the classification result indicates the second category (i.e., outputting the corrected text content) or the third category (i.e., directly outputting the text content), the text recognition result display box 405 includes the corrected text content corresponding to the first text area or the text content of the first text area.
  • the text recognition result may be a result corresponding to the text area itself.
  • if the text recognition result is the result indicated by the first category (i.e., filtering), the text recognition result corresponding to the first text area displayed by the electronic device is empty (a blank may be shown, or no blank may be left).
  • otherwise, the electronic device may display, in the text recognition result display box 405, the text content corresponding to the first text area (which may be the corrected text content or the result of text content recognition).
  • the classification result is optionally a numerical value, and the numerical value is used to represent the classification item.
  • the classification result may also include three numerical values, where the classification corresponding to the largest value is the classification corresponding to the first text area.
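  • for example (a minimal sketch, assuming the classification head emits one score per category; the values are invented for illustration):

        scores = [0.1, 0.7, 0.2]  # hypothetical scores for categories 1, 2 and 3
        category = max(range(len(scores)), key=lambda i: scores[i]) + 1
        # category == 2 here, so the corrected text content would be displayed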
  • the electronic device performs classification based on the image of the first text area and the first text content, and obtains the classification result, including: the electronic device obtains intermediate representation information based on the image of the first text area and the first text content.
  • the electronic device classifies the intermediate representation information and obtains the classification result.
  • the intermediate representation information may be called multi-modal information.
  • the intermediate representation information may be used to characterize the image features of the image of the first text area and the text features of the first text content.
  • the electronic device classifies the intermediate representation information and obtains the classification result, including: the electronic device classifies the intermediate representation information through the classification model and obtains the classification result. In this way, the electronic device can classify the intermediate representation information through the pre-trained classification model to obtain the corresponding classification result.
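  • a minimal sketch of such a classification model, assuming the intermediate representation is pooled into a fixed-size vector and using PyTorch purely for illustration (the dimensions and class count are assumptions, not taken from the patent):

        import torch
        import torch.nn as nn

        class Classifier(nn.Module):
            """Hypothetical classification head over the intermediate representation."""
            def __init__(self, dim=512, num_classes=3):
                super().__init__()
                self.head = nn.Linear(dim, num_classes)

            def forward(self, fused):      # fused: (batch, dim) representation
                return self.head(fused)    # (batch, 3) scores: filter/correct/output

        logits = Classifier()(torch.randn(1, 512))
        category = logits.argmax(dim=-1).item() + 1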
  • before the electronic device displays the text recognition result of the first text area based on the classification result, the method further includes: the electronic device corrects the intermediate representation information to obtain the corrected first text content. For example, before, at the same time as, or after classifying the intermediate representation information, the electronic device corrects the intermediate representation information to obtain the corrected text content.
  • the electronic device can determine whether to output the corrected text content based on the classification result. For example, if the corrected text content does not need to be output (e.g., the classification result is the first category or the third category), the corrected text content is discarded.
  • the electronic device correcting the intermediate representation information to obtain the corrected target text content includes: the electronic device corrects the intermediate representation information through the correction model to obtain the corrected first text content.
  • in this way, the electronic device can use the pre-trained correction model to correct the intermediate representation information to obtain the corrected text content.
  • the electronic device obtaining the intermediate representation information based on the image of the first text area and the first text content includes: the electronic device performs image encoding on the image of the first text area to obtain the first image encoding information. The electronic device performs text encoding on the first text content to obtain the first text encoding information. The electronic device performs multi-modal encoding on the first image encoding information and the first text encoding information through a multi-modal encoding model to obtain the intermediate representation information. In this way, the electronic device can obtain higher-dimensional semantic information by encoding the image and text content of the text area, and can perform multi-modal encoding on the first image encoding information and the first text encoding information through a pre-trained multi-modal encoding model to obtain intermediate representation information with high-dimensional semantics.
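  • the encoding pipeline might look like the following sketch (PyTorch, with invented dimensions and architectures; the patent does not fix any of these choices):

        import torch
        import torch.nn as nn

        dim = 512
        image_encoder = nn.Sequential(                     # image encoding
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # split into 16x16 patches
            nn.Flatten(2))                                 # (B, dim, num_patches)
        text_encoder = nn.Embedding(30000, dim)            # text encoding (token ids)
        fusion = nn.TransformerEncoder(                    # multi-modal encoding model
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)

        image = torch.randn(1, 3, 32, 256)                 # image of the first text area
        tokens = torch.randint(0, 30000, (1, 16))          # first text content

        img_code = image_encoder(image).transpose(1, 2)    # first image encoding info
        txt_code = text_encoder(tokens)                    # first text encoding info
        intermediate = fusion(torch.cat([img_code, txt_code], dim=1))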
  • the multi-modal coding model, the classification model and the correction model form a neural network
  • the training data of the neural network includes a second text area and second text content corresponding to the second text area, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  • the neural network can be trained cyclically, so that the neural network can complete the corresponding functions, that is, fusing, classifying and correcting image and text content.
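  • one possible (hypothetical) joint training step, assuming the fused representation feeds both a classification head and a correction decoder, with a combined loss; the model.fuse/classify/correct API is invented for illustration:

        import torch.nn as nn

        cls_loss_fn = nn.CrossEntropyLoss()   # supervises the classification result
        cor_loss_fn = nn.CrossEntropyLoss()   # supervises the corrected token sequence

        def train_step(model, optimizer, batch):
            fused = model.fuse(batch["image"], batch["tokens"])   # hypothetical API
            cls_logits = model.classify(fused)                    # (B, 3)
            cor_logits = model.correct(fused)                     # (B, T, vocab)
            loss = (cls_loss_fn(cls_logits, batch["category"]) +
                    cor_loss_fn(cor_logits.flatten(0, 1),
                                batch["target_tokens"].flatten()))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()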
  • the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized.
  • the text recognition method in this application can implement different processing methods for text content, that is, the text recognition results finally displayed are semantically coherent text content.
  • filtering or correction methods are used to avoid the impact of the semantically incoherent text content on the text recognition results.
  • optionally, when the first text area includes partially missing text content, the classification result is the first category or the second category.
  • the partially missing text content may mean that each character in the text area is missing part of its information; for example, the upper half or the lower half may be missing. Optionally, partially missing text content may also mean that at least one character in the text area is missing part of its information.
  • the semantics expressed by the first text content are different from the semantics expressed by the text content in the first text area.
  • in this way, the text content recognition results can be screened, and text content whose semantics differ from the original is filtered or corrected, thereby improving the user experience.
  • the object to be identified is a picture, a web page or a document.
  • embodiments of the present application provide a text recognition method.
  • the method includes: an electronic device detects a text area of an object to be recognized, and obtains an image of a first text area; the first text area includes text content.
  • the electronic device performs text content recognition on the first text area to obtain the first text content.
  • the electronic device displays the text recognition result of the first text area based on the image of the first text area and the first text content.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content is semantically coherent text content, or if the image of the first text area indicates that the first text area does not include partially missing text content, the text recognition result includes the first text content; if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes text content with semantic errors, the first text content is filtered from the text recognition result, or the text recognition result includes the text content after the first text content is corrected.
  • in this way, the electronic device can jointly consider the image information (i.e., the image of the text area) and the text information (i.e., the text content): when the text content contained in the text area is largely missing, the recognition result of the text content (i.e., the first text content) is filtered out; when less of the text content is missing, the corrected result is output; and when the text content is not missing, the corresponding text is output directly.
  • in this way, correct and semantically smooth results can be presented in the text recognition results, while results with semantic errors are filtered out, so that a complex, anthropomorphic decision-making effect is obtained to improve the user experience.
  • the electronic device can detect, based on the image of the text area, whether the text content in the text area is truncated, that is, whether the text includes missing content.
  • if the text content is not truncated, the first text content can be output directly.
  • if the text content is truncated, it is detected whether the semantics of the first text content are coherent. If the semantics of the first text content are coherent, the first text content can be output directly. If the semantics of the first text content are incoherent, it is further detected whether the first text content can be corrected. If the first text content can be corrected, the corrected text content is output. If the first text content cannot be corrected, the first text content is filtered.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically incoherent text content, the electronic device detects whether the first text content can be corrected. If the first text content cannot be corrected, the first text content is filtered from the text recognition result. If the first text content can be corrected, the text recognition result includes the text content after the first text content is corrected. In this way, when the electronic device detects that the text content in the first text area is truncated and the semantics of the first text content are incoherent, it can further detect whether the first text content can be corrected.
  • if the first text content can be corrected, the electronic device corrects the first text content and outputs the corrected text content; if it cannot be corrected, the electronic device filters the first text content. That is to say, the text recognition result of the first text area displayed by the electronic device is either empty, or the corrected text content, or the original semantically coherent text content, so as to avoid the impact of incorrect text content recognition results on the user.
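  • the decision flow described above can be sketched as follows (is_truncated, is_coherent, is_correctable and correct are hypothetical stand-ins for the classification, semantic and correction models):

        def recognize_line(image, raw_text):
            if not is_truncated(image):    # image indicates no missing content
                return raw_text            # output the first text content directly
            if is_coherent(raw_text):      # truncated image but coherent text
                return raw_text
            if is_correctable(raw_text):   # incoherent but correctable
                return correct(raw_text)   # output the corrected text content
            return None                    # cannot be corrected: filter the content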
  • the method also includes: the electronic device corrects the first text content through the correction model to obtain text content after the first text content is corrected. In this way, the electronic device can correct the first text content through the pre-trained correction model to obtain semantically coherent text content.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device classifies the image of the first text area through a classification model to obtain a classification result; the classification result is used to indicate whether the first text area includes partially missing text content.
  • the electronic device can classify the image of the text area through the pre-trained classification model to detect whether the text content in the text area is truncated.
  • the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result; the semantic analysis result is used to indicate whether the first text content includes text content with semantic errors.
  • the electronic device can perform semantic analysis on the text content through the pre-trained semantic model to obtain semantic analysis results.
  • the semantic analysis result can be a numerical value
  • the electronic device can preset a semantic coherence threshold, and the threshold is used to indicate the semantic coherence of the text content. If the value of the semantic analysis result is greater than or equal to the threshold, the first text content is semantically coherent. If the value of the semantic analysis result is less than the threshold, the first text content is semantically incoherent.
  • the semantic analysis result is also used to indicate whether the first text content can be corrected, and the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device determines, based on the semantic analysis result, whether the first text content can be corrected.
  • the electronic device may set a correction threshold that is different from the semantic coherence threshold. If the value of the semantic analysis result is greater than or equal to the correction threshold, the first text content may be corrected. If the value of the semantic analysis result is less than the correction threshold, the first text content cannot be corrected.
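  • for illustration, a sketch with two invented threshold values (the patent does not specify concrete numbers):

        COHERENCE_THRESHOLD = 0.8   # hypothetical semantic-coherence threshold
        CORRECTION_THRESHOLD = 0.5  # hypothetical correction threshold

        def interpret(score):
            if score >= COHERENCE_THRESHOLD:
                return "coherent"      # output the first text content directly
            if score >= CORRECTION_THRESHOLD:
                return "correctable"   # output the corrected text content
            return "filter"            # cannot be corrected; filter the content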
  • the correction model, the classification model, and the semantic model form a neural network
  • the training data of the neural network includes a second text area and a second text corresponding to the second text area. content, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  • the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized.
  • the semantics expressed by the semantically incorrect text content are different from the semantics expressed by the corresponding text content in the first text area.
  • the object to be identified is a picture, a web page or a document.
  • embodiments of the present application provide an electronic device.
  • the electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method in the first aspect or any possible implementation of the first aspect.
  • embodiments of the present application provide an electronic device.
  • the electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method in the second aspect or any possible implementation of the second aspect.
  • embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • embodiments of the present application provide a computer-readable medium for storing a computer program.
  • the computer program includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
  • embodiments of the present application provide a computer program, which includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • embodiments of the present application provide a computer program, which includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
  • Figure 1 is a schematic diagram of the hardware structure of an exemplary electronic device
  • Figure 2 is a schematic diagram of the software structure of an exemplary electronic device
  • Figure 3 is a schematic diagram of a text recognition scene containing truncated text
  • Figure 4 is a schematic diagram illustrating an application scenario for applying the text recognition method in the embodiment of the present application
  • Figure 5 is a schematic flow chart of an exemplary text recognition method
  • Figure 6 is a schematic diagram of exemplary text recognition
  • Figure 7 is an exemplary text image encoding schematic diagram
  • Figure 8 is an exemplary schematic diagram of image information encoding
  • Figure 9 is an exemplary schematic diagram of image information encoding
  • Figure 10 is an exemplary schematic diagram of image patch flattening
  • Figure 11 is an exemplary text content encoding schematic diagram
  • Figure 12 is a schematic diagram of an exemplary text information encoding process
  • Figure 13 is a schematic diagram of an exemplary acquisition process of intermediate representation information
  • Figure 14a is a schematic diagram of an exemplary multi-modal encoding
  • Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder
  • Figure 14c is a schematic diagram of an exemplary classification process
  • Figure 15 is an exemplary text modification schematic diagram
  • Figure 16 is a schematic diagram of the processing flow of the correction module
  • Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder.
  • Figure 18a is a schematic diagram of an exemplary application scenario
  • Figure 18b is a schematic diagram of another application scenario
  • Figure 18c is a schematic diagram of another exemplary application scenario
  • Figure 18d is a schematic diagram of another exemplary application scenario
  • Figure 18e is a schematic diagram of another exemplary application scenario.
  • Figure 19 is a schematic flow chart of an exemplary text recognition method
  • Figure 20 is an exemplary schematic diagram of text image processing
  • Figure 21 is an exemplary processing flow of the semantic model
  • Figure 22 is a schematic structural diagram of an exemplary device.
  • FIG. 1 shows a schematic structural diagram of an electronic device 100 .
  • the electronic device 100 shown in FIG. 1 is only an example of an electronic device; the electronic device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration.
  • the various components shown in Figure 1 may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application specific integrated circuits.
  • the electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, and a distance sensor 180F.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100 .
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has recently used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory. This avoids repeated access and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive wireless charging input through the wireless charging coil of the electronic device 100 . While the charging management module 140 charges the battery 142, it can also provide power to the electronic device through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, internal memory 121, external memory, display screen 194, camera 193, wireless communication module 160, etc.
  • the power management module 141 can also be used to monitor battery capacity, battery cycle times, battery health status (leakage, impedance) and other parameters.
  • the power management module 141 may also be provided in the processor 110 .
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in combination with tuning switches.
  • the mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be disposed in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 may be provided in the same device as at least part of the modules of the processor 110.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • after being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor, which outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194.
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110 and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), etc.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, frequency modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • the display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 193.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos on the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes instructions stored in the internal memory 121 to execute various functional applications and data processing of the electronic device 100 .
  • the internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and at least one application program required for a function (such as a sound playback function or an image playback function).
  • the storage data area may store data created during use of the electronic device 100 (such as audio data, phone book, etc.).
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals. Audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110 , or some functional modules of the audio module 170 may be provided in the processor 110 .
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of this application takes the Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100 .
  • the embodiments of the present application can also be applied to other systems such as the Hongmeng system.
  • for the implementation, reference may be made to the technical solutions in the embodiments of the present application; this application does not give examples one by one.
  • FIG. 2 is a software structure block diagram of the electronic device 100 according to the embodiment of the present application.
  • the layered architecture of the electronic device 100 divides the software into several layers, and each layer has clear roles and division of labor.
  • the layers communicate through software interfaces.
  • the Android system is divided into four layers, from top to bottom: application layer, application framework layer, Android runtime and system libraries, and kernel layer.
  • the application layer can include a series of application packages.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, text recognition, text processing, etc.
  • the text recognition application program may also be called a text recognition module or a text recognition engine in the embodiment of the present application, which is not limited by this application.
  • the text recognition module can be used to identify the text area and text content in the image to be recognized (see below for specific concepts).
  • the text processing application program may also be called a text processing module, which is used to further process the output results of the text recognition module (for specific processing procedures, please refer to the embodiment below). It should be noted that in the embodiment of the present application, the text processing module further processes the results of the text recognition module as an example for description. In other embodiments, the text recognition module can also perform the steps performed by the text processing module. It can also be understood that the steps performed by the text recognition module and the text processing module can be performed by one module, which is not limited in this application.
  • the application framework layer provides an application programming interface (API) and programming framework for applications in the application layer.
  • API application programming interface
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, content provider, view system, phone manager, resource manager, notification manager, etc.
  • a window manager is used to manage window programs.
  • the window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make this data accessible to applications.
  • the data can include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, etc.
  • a view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide communication functions of the electronic device 100 .
  • for example, call status management (including connected, hung up, etc.).
  • the resource manager provides various resources to applications, such as localized strings, icons, pictures, layout files, video files, etc.
  • the notification manager allows applications to display notification information in the status bar, which can be used to convey notification-type messages and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also present notifications that appear in the status bar at the top of the system in the form of charts or scroll-bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a beep sounds, the electronic device vibrates, or the indicator light flashes.
  • System libraries can include multiple functional modules. For example: surface manager (surface manager), media libraries (Media Libraries), 3D graphics processing library, 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support multiple audio and video encoding formats.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, composition, and layer processing.
  • 2D Graphics Engine is a drawing engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display drivers, camera drivers, audio drivers, sensors, Bluetooth drivers, Wi-Fi drivers and other drivers.
  • the components included in the system framework layer, system library and runtime layer shown in Figure 2 do not constitute specific limitations on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently.
  • FIG 3 is a schematic diagram illustrating a text recognition scenario containing truncated text.
  • a picture 302 is displayed in the display interface 301 of the mobile phone.
  • the display interface 301 may be an application interface, for example, it may be an interface of a system application such as a gallery application interface.
  • the interface 301 may also be an application interface of a third-party application such as a chat application. That is to say, in the embodiment of the present application, the system in the mobile phone can have its own text recognition function (ie, the text recognition module in Figure 2).
  • the gallery application can call the text recognition module of the mobile phone to perform text recognition on pictures.
  • the third-party application in the mobile phone can also have its own text recognition function.
  • the implementation process of the text recognition function of different third-party applications can be the same or different, which is not limited by this application.
  • the third-party application in the mobile phone can also call the text recognition module of the mobile phone, which is not limited in this application.
  • the picture 302 includes text and images (of course, the picture 302 may also include only text).
  • the embodiment of the present application only takes the text recognition scene of a picture as an example for explanation. In other embodiments, the method can also be applied to a text recognition scene in an application interface; for example, text recognition can be performed on a page displayed by a browser application, which is not limited by this application.
  • the picture 302 can be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 302 can also be generated by the mobile phone through the camera function; the picture 302 can also be a downloaded picture, etc., which is not limited in this application.
  • the text in the picture 302 includes multiple lines, where the first line of text and the last line of text displayed in the picture 302 are cut off by the border of the picture 302.
  • in the embodiments of the present application, this type of text may be referred to as truncated text.
  • Figure 3 only takes vertical truncation of text as an example for illustration.
  • the technical solutions in the embodiments of the present application can also be applied to recognition scenarios of horizontal truncation of text and diagonal truncation of text. Specific examples will be described below.
  • the "vertical truncation of text” described in the embodiments of this application may be truncation perpendicular to the text running direction.
  • for example, when the interface is slid up and down, text lines may be blocked by the upper or lower edge of the screen, or by fixed or floating status bars.
  • take picture 302 as a screenshot of a webpage as an example: the user slides the webpage up and down while browsing it.
  • the first line currently displayed on the webpage may be truncated by the upper edge of the webpage (which can also be understood as the upper border of the display box).
  • the user takes a screenshot of the currently displayed webpage, and the mobile phone generates picture 302 in response to the received screenshot operation.
  • the first line of text displayed in picture 302 is the "vertically truncated text".
  • transverse truncation of text refers to truncation along the text line direction.
  • text lines may be truncated laterally due to taking pictures or scanning.
  • the "oblique stage text” may be a truncation in a direction that has a certain angle with the text running direction.
  • the user can long press the picture 302.
  • the application displays the option box 303 in response to the received long press operation on the picture 302 .
  • the option box 303 includes but is not limited to: sharing options, collection options, text extraction options 304, etc.
  • the location and size of the option box 303 as well as the number and names of the options included therein are only illustrative examples and are not limited by this application.
  • the user clicks the extract text option 304 to indicate extracting the text in the picture 302 .
  • the mobile phone starts the text recognition function (as mentioned above, the text recognition function can be the text recognition function that comes with the application, or it can be the text recognition function of the calling system, which is not limited in this application).
  • the text recognition function optionally adopts OCR technology.
  • the OCR technology is mainly divided into two steps.
  • the first step is text area detection
  • the second step is text content recognition (which may also be called text recognition).
  • the text area detection step optionally includes detecting at least one text area in the image, that is, identifying the area containing text in the image.
  • the step of identifying text content may optionally include identifying the text in the acquired text area, that is, identifying the specific text content in the text area.
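  • as a sketch of the two-step pipeline (detect_text_regions and recognize_text are hypothetical placeholders, not a real OCR library API):

        def ocr(picture):
            results = []
            for region in detect_text_regions(picture):  # step 1: text area detection
                text = recognize_text(region)            # step 2: content recognition
                results.append((region, text))
            return results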
  • the display interface 301 includes but is not limited to: a reduced picture 302 and a text recognition result display box 305 .
  • the interface layout in the display interface 301 in the embodiment of this application is only a schematic example, and this application does not limit it.
  • the text recognition result display box 305 includes, but is not limited to: the "Erase selected text" option, text recognition results, and other options.
  • other options include but are not limited to: "Select All” option, "Search” option, "Copy” option, "Translate” option, etc. Each of the other options can be used to process the text recognition results accordingly.
  • the text recognition result in the text recognition result display box 305 is the result recognized through the text recognition function.
  • the results recognized by the text recognition function may not be accurate.
  • for example, the original text of the first line in the webpage is "The first round of the game, the audience cheered when the All-** and others appeared, 5", and because the text on the page was cut off by the upper border while the user browsed the webpage, the first line of text in the screenshot 302 is truncated.
  • the output result is "Ri Kong L Dai, Shi Hong Roast Shou Ba Yuan's Tu Cong Ding, 5", which is quite different from the original text.
  • for this type of recognition result, the original text cannot be restored even through semantic reasoning and other technologies, which affects the user experience.
  • the recognition result corresponding to the untruncated text line in the picture 302 (for example, the second line of text in the picture 302) is no different from the original text.
  • Embodiments of the present application provide a text recognition method that uses text images and text content as inputs to a model (which can be called a text recognition model or a text recognition network), and obtains the encoding information corresponding to each modality through respective modal encoding.
  • the text processing module fuses the modal information of the encoding information corresponding to the text image and the encoding information corresponding to the text content, and uses the fused result as the attention input of the classification decoder and the correction decoder.
  • in this way, the model is equivalent to implicitly and comprehensively considering the image information (mainly truncation) and the text information (mainly semantic coherence), and uses high-dimensional multi-modal semantic information to make more refined decisions for different input combinations, so as to achieve an anthropomorphic, complex decision-making effect.
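  • a minimal sketch of this structure (PyTorch, with assumed sizes), where the fused multi-modal encoding serves as the attention memory of a correction decoder and also feeds a classification head:

        import torch
        import torch.nn as nn

        dim = 512
        fused = torch.randn(1, 48, dim)   # fused multi-modal encoding (see above)

        decoder = nn.TransformerDecoder(  # correction decoder attends over 'fused'
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        queries = torch.randn(1, 16, dim)                  # target token embeddings
        corrected_states = decoder(tgt=queries, memory=fused)
        cls_logits = nn.Linear(dim, 3)(fused.mean(dim=1))  # classification decision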
  • the user can determine the obscured text through semantics.
  • users can correctly read the corresponding text content.
  • the technical solution in the embodiments of the present application can achieve an anthropomorphic reading effect: when the text is largely blocked (i.e., truncated), no result is output; when the text is slightly blocked, the corrected result is output; and when the text is not blocked, the corresponding text is output.
  • in this way, correct and semantically smooth results can be presented in the text recognition results, while results with semantic errors are filtered out, improving the user experience.
  • Figure 4 is a schematic diagram of an application scenario for applying the text recognition method in the embodiment of the present application. Please refer to (1) of Figure 4.
  • taking the gallery application as an example, after the user clicks the thumbnail of picture 402 displayed in the gallery application, the gallery application may display picture 402 in the display interface 401.
  • the display interface 401 also includes, but is not limited to, options (or controls) such as sharing options and collection options.
  • the gallery application can call the text recognition module and text processing module of the system to perform text recognition and processing on the picture 402 (which may also be called a picture to be recognized or an image to be recognized).
  • text recognition includes two parts: text area detection and text content recognition.
  • the text recognition module can perform a text area detection step to detect whether the picture 402 includes a text area.
  • the picture 402 includes pictures and text (of course, it may also include only text, which is not limited in this application). Accordingly, the text recognition module may detect at least one text area included in the picture 402.
  • the "extract text in the picture” option 403 can be displayed in the display interface 401 .
  • the user can click the "Extract text in picture” option 403 to instruct the text content in the picture 402 to be extracted.
  • in response to the received user operation, the gallery application performs text recognition on picture 402 through the text recognition module, that is, performs the text content recognition step to obtain the text content corresponding to each text area.
  • the text processing module can further process the recognition results (including text areas and text content) obtained by the text recognition module. Please refer to (2) of Figure 4.
  • the display interface 401 includes but is not limited to: a reduced picture 402 and an extracted text display box 404.
  • the extracted text display box 404 includes but is not limited to: a text recognition result display box 405 and other options.
  • Other options include, but are not limited to: "Erase selected text” option, "Read full text” option, “Select all” option, “Search” option, "Copy” option and “Translate” option, etc.
  • the text recognition result display box 405 includes the text content recognized by the text recognition module, as shown in (2) of Figure 4. In this embodiment of the present application, for truncated text (such as the first line of text), the mobile phone does not display the corresponding text in the text recognition result display box 405.
  • the text processing module can adopt a non-output (i.e., non-display) method to avoid the problem of large differences between the text recognition results and the original text.
  • the text processing module may display the corresponding text in the text recognition result display box 405 .
  • the text processing module can also modify (or correct) the text content recognized by the text recognition module to obtain the correct text (which can also be understood as text close to or the same as the original text), and output (i.e., display in the text recognition result display box 405) the corrected result. That is to say, in this embodiment of the present application, text with semantic errors is filtered or corrected, so that the text recognition results displayed in the text recognition result display box 405 are semantically correct and coherent, thereby improving the user experience.
  • the embodiments of this application only take the text recognition and processing scenario of pictures as an example for explanation. In other embodiments, the method can also be applied to text recognition and processing scenarios in application interfaces; for example, text recognition and processing may be performed on a page displayed by a browser application, which is not limited by this application.
  • the picture 402 can be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 402 can also be generated by the mobile phone through the camera function; the picture 402 can also be a downloaded picture, etc., which is not limited in this application.
  • the text recognition function that comes with the chat application can perform text recognition on the image to be recognized and obtain the corresponding text recognition results.
  • the chat application can call the text processing module of the mobile phone to further process the text recognition results.
  • the chat application may also have its own text recognition module and text processing module, and implement the steps implemented by the text recognition module and text processing module involved in the embodiments of this application.
  • the chat application can also call the text recognition module and text processing module of the mobile phone, which is not limited in this application.
  • the steps performed by the text recognition module described in the embodiments of this application are only illustrative examples.
  • the steps performed by the text recognition module in the mobile phone and the text recognition module that comes with the application may be the same or different. Specific details may refer to existing technical embodiments, and are not limited in this application.
  • the text recognition module in a mobile phone can use OCR technology to perform text recognition and obtain corresponding recognition results, including text images and text content (the concepts of text images and text content will be explained below).
  • the text recognition module in the chat application can use other technologies to perform text recognition and obtain corresponding recognition results, which also include text images and text content.
• the recognition results obtained by the text recognition module of the chat application and the text recognition module of the mobile phone may be the same or different.
  • the text recognition module in the mobile phone may recognize 5 text areas and obtain the corresponding text content.
  • the text recognition module in the chat application may recognize 6 text areas and obtain the corresponding text content, which is not limited by this application. That is to say, the text processing module in the embodiment of the present application can further process the recognition results of any text recognition module (which can be a mobile phone and/or an application) to obtain results that meet user needs.
  • text truncation by a border is used as an example for explanation.
  • text truncation may also be caused by image occlusion or other reasons, which is not limited by this application.
• the text recognition module can perform the text area detection step on each picture in the gallery application while the mobile phone is in standby or the gallery application is in the background. That is to say, the text recognition module can perform the text area detection step on the pictures in the gallery application in advance, so that after the user clicks on a picture including a text area, the "Extract text in the picture" option box can be displayed immediately, improving the overall efficiency of text recognition and processing.
  • FIG. 5 is a schematic flowchart of an exemplary text recognition method.
  • the text recognition module can obtain the results recognized based on OCR technology.
  • the results include at least one text image and text content corresponding to each text image.
  • Figure 6 is a schematic diagram of text recognition. Please refer to Figure 6 .
• the text recognition module uses OCR technology to perform text area detection on picture 601 (that is, picture 402; for a detailed description, please refer to picture 402, which will not be repeated here) to obtain at least one text area.
• text area detection can be understood as follows: after the OCR technology detects the areas containing text in the picture 601, it segments at least one text area in the picture 601 to obtain at least one text image (that is, the image corresponding to at least one text area in the picture 601).
  • the text recognition module detects a text area 602a containing text in the picture 601.
• the text recognition module can segment the text area 602a (for example, along the dotted line) to obtain the image corresponding to the text area 602a, which is referred to as the text image 602a.
  • the text recognition module can sequentially segment areas containing text in the picture 601. For example, the image of the text area 603a can be obtained, which is referred to as the text image 603a. In the embodiment of this application, only the text area 602a and the text area 603a are used as examples for description. The text recognition module can obtain more text areas in the picture 601.
• It should be noted that after the text recognition module recognizes a text area through OCR technology, the area can undergo affine or perspective transformation correction and other processing to obtain the corresponding text image.
  • the size of the single text image may be the same as the size of the actual area occupied by the text content in the text image, or may be larger than the size of the actual area occupied by the text content.
  • the size of the text image 602a is larger than the size of the area actually occupied by the text content. That is, there is a blank area between the frame of the text image and the text content (ie, the edge of the text content).
  • the text recognition module can perform text content recognition on at least one acquired text area (ie, text image) through OCR technology.
• the text recognition module performs text content recognition on the text image 602a and obtains the text content recognition result 602b (which can also be called the text content 602b). That is, it is recognized that the text content in the text image 602a is "Ri Kong L Loan, Shi Hong Roasted Shou Ba Yuan's Soil From Ding, 5" (a semantically meaningless string, since the original line of text is truncated).
  • the text recognition module continues to recognize other text images to obtain the corresponding text content recognition results.
• the text recognition module uses OCR technology to perform text content recognition on the text image 603a to obtain the corresponding text content recognition result 603b (also called the text content 603b), that is, it is recognized that the text content in the text image 603a is "The champion also showed superb strength, 107B in the first round".
  • this embodiment only takes the text image 602a and the text image 603a as an example for explanation.
• the text recognition module can perform text content recognition on each acquired text image based on OCR technology to obtain the corresponding text content; this application will not explain them one by one. It should be further noted that the text recognition module can perform text content recognition on the text images in parallel or sequentially, which is not limited by this application.
• the text processing module obtains the recognition results obtained by the text recognition module, including but not limited to: the text image 602a and the corresponding text content 602b, and the text image 603a and the corresponding text content 603b.
  • the text processing module executes the process in Figure 5 for each text image input by the text recognition module and the text content corresponding to the text image.
  • the text recognition module can output the recognition results to the text processing module for further processing after acquiring the images corresponding to all text areas of the recognized image (for example, picture 601) and the corresponding text content.
  • the text recognition module can execute the process in Figure 5 on the obtained text images and text content one by one.
  • the text recognition module can also process multiple text images and text contents in parallel, which is not limited in this application.
  • the text recognition module can also output the text content and the corresponding text image to the text processing module for processing. This application does not limit this, and the description will not be repeated below.
• the text processing module passes the text image 602a and the text content 602b through a coding model (which can also be called a coding module) to obtain the corresponding encoding information.
  • the encoding model may include, but is not limited to, an image encoding model (which may be called an image encoding module) and a text encoding model (which may also be called a text encoding module).
  • the image coding model can be used to code the text image 602a to obtain image coding information corresponding to the text image 602a.
  • the image encoding model can encode text images into machine-recognizable or understandable semantic information.
  • the text encoding module can be used to encode text content 602b to obtain text encoding information. It can also be understood that the text encoding module encodes text content into machine-recognizable or understandable semantic information.
  • the text processing module may process the text image 602a and the text content 602b sequentially or in parallel, which is not limited in this application.
  • the text processing module can first process the text image 602a to obtain the image encoding information, and then process the text content 602b to obtain the text encoding information.
  • the text processing module may first encode the text content 602b, and then encode the text image 602a.
  • the text processing module can simultaneously encode the text image 602a and the text content 602b, which is not limited in this application.
• the text processing module fuses the image encoding information corresponding to the text image 602a and the text encoding information corresponding to the text content 602b through a multi-modal model (which may also be called a multi-modal coding module, a multi-modal fusion module, etc., which is not limited in this application) to obtain multi-modal encoding information, which can also be called intermediate representation information.
• the text processing module corrects the intermediate representation information through a correction model (which can also be called a correction module), and passes the intermediate representation information through a classification model (which can also be called a classification module) to classify the intermediate representation information and obtain classification results.
• the classification results include three classification items: filtering, correcting and outputting, and direct output.
• the filtering classification item optionally means filtering the text content, that is, not displaying the corresponding text content in the text recognition result.
• the correcting-and-outputting classification item optionally means outputting the corrected text; it can also be understood that the text content is corrected first and then displayed in the text recognition result. The direct output classification item optionally means displaying the text content as recognized in the text recognition result.
  • the text processing module can directly display the text content recognized by the text recognition module through OCR technology in the text recognition results.
• In one example, if the classification result corresponding to the intermediate representation information is the filtering classification item, the text processing module filters the text content 602b, that is, the text content 602b is not displayed in the text recognition result, to avoid the impact of semantically incorrect text on the text recognition results.
• In another example, if the classification result of the intermediate representation information is the corrected output classification item, the text processing module can display the corrected result of the intermediate representation information in the text recognition result.
• In yet another example, if the classification result of the intermediate representation information is the direct output classification item, the text processing module displays the text content 602b in the text recognition result.
• FIG. 7 is an exemplary schematic diagram of text image encoding. Referring to Figure 7, the text processing module (specifically, the image encoding model, which will not be repeated below) converts the image information into two-dimensional image encoding information E_v.
• It should be noted that the structure of the encoding information (such as two-dimensional encoding information) obtained by encoding text images and text content is determined by the architecture of the encoder. The encoder architecture can be set according to actual needs; for example, in other embodiments, three-dimensional image information can also be converted into higher-dimensional or lower-dimensional image encoding information, which is not limited in this application and will not be repeated below.
• In the embodiments of this application, image information encoding of text images through Patch Embedding and Positional Encoding is used as an example for explanation. In other embodiments, encoding can also be performed through other encoding methods, which is not limited in this application.
  • the text processing module divides the text image 602a into N patches.
  • Figure 8 is an exemplary image information encoding schematic diagram.
  • the text processing module can change the height of the text image 602a (which can also be width, or width and height) to resize (adjust) the height of the text image 602a to a preset pixel value.
• the text processing module can adjust the height of the text image 602a to 32 pixels (or 64 pixels; the value can be set according to actual requirements, which is not limited in this application).
• the width of the text image 602a is adjusted in proportion to the height (i.e., preserving the aspect ratio of the image 602a). As shown in FIG. 8, the adjusted height of the text image 602a is H and the width (also called length) is W, which is taken as an example for explanation. It should be noted that in other embodiments the text image may not be resized, which is not limited in this application.
  • the text processing module divides the text image 602a into N Image Patches.
  • the values of h and w can be the same or different, for example, they can both be 16 pixels, and can be set according to actual requirements, which is not limited in this application.
• N is a positive integer; when the image size is not an integer multiple of the patch size, N can be obtained by rounding up.
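• As an informal illustration of the resizing and patch counting described above, the following sketch computes N; tiling the patch grid over both height and width is an assumption here, and the values 32 and 16 merely follow the examples in the description:

```python
import math

def count_patches(img_w: int, img_h: int, target_h: int = 32, patch: int = 16) -> int:
    """Resize the image height to target_h (keeping the aspect ratio), then
    count the h*w patches, rounding up at the borders."""
    new_w = round(img_w * target_h / img_h)  # width scaled by the same ratio
    return math.ceil(new_w / patch) * math.ceil(target_h / patch)

# e.g. a 300x48 text image resized to 200x32 yields 13 * 2 = 26 patches
print(count_patches(300, 48))
```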
  • the text processing module performs Patch Embedding on N Image Patches.
  • FIG. 9 is an exemplary schematic diagram of image information encoding. Please refer to Figure 9.
  • An exemplary Patch Embedding process includes but is not limited to the following steps:
• Step a: The text processing module flattens each Image Patch to obtain the one-dimensional vector P_i corresponding to each Image Patch.
• Exemplarily, the width of each Image Patch is w, the height is h, and the number of channels is c, so the size of each Image Patch is (h*w*c). The text processing module flattens an Image Patch to obtain a one-dimensional vector of length (h*w*c). For the i-th image patch, the one-dimensional vector is recorded as P_i, expressed as:
• P_i = [p_1, p_2, …, p_(h·w·c)] (1)
• FIG. 10 is an exemplary schematic diagram of flattening an Image Patch. Referring to Figure 10, Image Patch 801 in Figure 8 is taken as an example; the size of Image Patch 801 is (h*w*c). After the text processing module flattens Image Patch 801, the corresponding one-dimensional vector P_1 is obtained, expressed as:
• P_1 = [p_1, p_2, …, p_(h·w·c)]
• By analogy, the text processing module can flatten each Image Patch to obtain N vectors P_i, that is, P_1 … P_N as shown in Figure 9.
• Step b: The text processing module passes the N one-dimensional vectors P_i through a fully connected layer to obtain N one-dimensional tensors with a preset length.
• Exemplarily, the text processing module passes the N one-dimensional vectors P_i through a fully connected layer whose output length is embedding_size (which can be set according to actual needs, and is not limited in this application), and obtains N one-dimensional tensors E_vi of length embedding_size, expressed as:
• E_vi = FC(P_i)
• For example, the text processing module passes P_1 through the fully connected layer of output length embedding_size and obtains a one-dimensional tensor E_v1 of length embedding_size, expressed as:
• E_v1 = FC(P_1)
• The text processing module performs the same processing on the N one-dimensional vectors according to the above method to obtain E_v1 … E_vN. It should be noted that in this embodiment the preset length is embedding_size as an example for explanation; in other embodiments the preset length can be other values, which is related to the fully connected layer used, and this application does not limit it.
• Step c: The text processing module arranges the N one-dimensional tensors E_vi in order to obtain a two-dimensional tensor with dimension (N, embedding_size).
• Exemplarily, the text processing module arranges the N one-dimensional tensors E_v1 … E_vN in order to obtain the two-dimensional tensor E_v0, expressed as:
• E_v0 = [E_v1; E_v2; …; E_vN]
• The dimension of E_v0 is (N, embedding_size).
  • the image encoding method in the embodiment of the present application is only a schematic example.
• For example, the text processing module can also apply a convolution kernel with size (h*w), stride h (or w), and embedding_size output channels to the Image Patches. The specific method can be set according to actual needs; the purpose is to encode the N Image Patches to obtain machine-encoded information with higher-level semantics.
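• The following minimal PyTorch sketch illustrates Patch Embedding as described in steps a to c, together with the convolutional alternative mentioned above; all sizes are illustrative assumptions rather than the patent's actual parameters:

```python
import torch
import torch.nn as nn

h, w, c, embedding_size = 16, 16, 3, 256  # illustrative sizes

class PatchEmbed(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(h * w * c, embedding_size)  # step b: fully connected layer

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (N, c, h, w); step a: flatten each patch to length h*w*c
        p = patches.flatten(start_dim=1)                # (N, h*w*c), the vectors P_i
        return self.fc(p)                               # step c: E_v0 of shape (N, embedding_size)

# Equivalent convolutional formulation mentioned in the text:
conv_embed = nn.Conv2d(c, embedding_size, kernel_size=(h, w), stride=(h, w))

patches = torch.randn(12, c, h, w)   # N = 12 patches
E_v0 = PatchEmbed()(patches)         # two-dimensional tensor of shape (12, 256)
```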
• the text processing module can concatenate (concat) E_v0 with the classification head E_cls to obtain the two-dimensional tensor E_v1.
• The dimension of E_cls is optionally (1, embedding_size); this dimension can be set according to actual requirements and is not limited in this application.
• The classification head E_cls is a learnable parameter of the neural network.
• E_v1 = [E_cls, E_v0] (2)
• Taking E_v0 in the above embodiment as an example, the text processing module splices E_v0 and E_cls according to equation (2) to obtain E_v1.
• The dimension of E_v1 is (N+1, embedding_size).
• Next, the text processing module performs Positional Encoding on E_v1.
• Exemplarily, the text processing module adds the two-dimensional tensor E_v1 obtained above and the two-dimensional position code E_pos to obtain the image encoding information E_v.
• It should be noted that the dimension of the position code is related to the dimension of the result of the above processing; this application only takes two dimensions as an example for explanation, and does not limit it.
• E_pos is a learnable parameter of the neural network, and its dimension is (N+1, embedding_size). N+1 is recorded as N_v.
• E_v1 passes through Positional Encoding to obtain the image encoding information E_v, which is expressed as:
• E_v = E_v1 + E_pos
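• A minimal sketch of the classification-head concatenation and Positional Encoding above, assuming learnable E_cls and E_pos as stated:

```python
import torch
import torch.nn as nn

N, embedding_size = 12, 256
E_v0 = torch.randn(N, embedding_size)            # output of Patch Embedding

E_cls = nn.Parameter(torch.zeros(1, embedding_size))      # learnable classification head
E_pos = nn.Parameter(torch.zeros(N + 1, embedding_size))  # learnable position code

E_v1 = torch.cat([E_cls, E_v0], dim=0)  # equation (2): (N+1, embedding_size)
E_v = E_v1 + E_pos                      # image encoding information E_v
```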
• FIG. 11 is an exemplary schematic diagram of text content encoding. Referring to Figure 11, the text processing module (specifically, the text encoding model, which will not be repeated below) performs text information encoding on the text content 602b, including Word Embedding and Positional Encoding, thereby converting the text information into encoding information with higher-level semantic characteristics (also called text encoding information), recorded as E_t.
  • Figure 12 is a schematic diagram of an exemplary text information encoding process. Please refer to Figure 12. The process includes but is not limited to the following steps:
  • the text processing module performs word segmentation processing on the text content 602b.
  • the text processing module segments text content 602b according to a preset character length to obtain a segmentation result (which may also be called a segmentation sequence).
  • the preset character length can also be set according to actual needs, for example, it can be two characters, which is not limited in this application.
• In other embodiments, the segment lengths may also be unequal; for example, "eye shape" may be divided into one segment and "mountain" into another, which is not limited in this application.
  • the text processing module obtains the text serial number sequence corresponding to the word segmentation sequence.
  • the text processing module can be preset with a text serial number table (which can also be called text serial number information, character code table, etc., which is not limited in this application).
• the text serial number table is used to indicate the correspondence between text (words or characters) and serial numbers.
  • the corresponding serial number of "item” in the text serial number table is "12".
  • the corresponding serial number of "relationship” in the text serial number table is "52".
  • the corresponding relationship between text and serial numbers can be set according to actual needs, and is not limited in this application. It should be noted that the correspondence between text and serial numbers can be saved in a table or in other ways, which is not limited in this application.
  • the text contained in the text sequence number table can cover dictionaries or any books in professional fields, etc., which is not limited by this application.
• the text processing module can look up the serial number (also called the text serial number) corresponding to each segment (word or character) in the segmentation sequence w based on the text serial number table to obtain the text serial number sequence.
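• A toy illustration of the serial number lookup; the miniature table and the <unk> fallback are assumptions, while a real table would cover a dictionary or domain-specific corpus:

```python
# Hypothetical miniature text serial number table.
serial_table = {"item": 12, "relationship": 52, "<unk>": 0}

def to_serial_sequence(segments):
    """Map each segment of the word segmentation result to its serial number."""
    return [serial_table.get(seg, serial_table["<unk>"]) for seg in segments]

print(to_serial_sequence(["item", "relationship"]))  # [12, 52]
```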
• the text processing module passes the text serial number sequence n through word embedding to obtain the two-dimensional tensor E_t0.
• the two-dimensional tensor E_t0 can be expressed as:
• E_t0 = WordEmbedding(n)
• The dimension of the two-dimensional tensor E_t0 is (m, embedding_size), where m is the length of the serial number sequence. The specific value of E_t0 is related to the embedding layer and is not limited in this application.
• The text processing module adds E_t0 to the position code E_pos′ to obtain the text information encoding E_t.
• It should be noted that the dimension of the position code is related to the dimension of the result of the above processing; this application only takes two dimensions as an example for explanation, and does not limit it.
• E_pos′ is a learnable parameter of the neural network, and its dimension is (m, embedding_size). m is recorded as N_t.
• The text processing module adds E_t0 and E_pos′ to obtain the text information encoding E_t, which is expressed as:
• E_t = E_t0 + E_pos′
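• A minimal sketch of the text encoding path above (Word Embedding plus a learnable position code), with illustrative sizes:

```python
import torch
import torch.nn as nn

vocab_size, embedding_size, m = 1000, 256, 8  # illustrative sizes

word_embed = nn.Embedding(vocab_size, embedding_size)   # Word Embedding layer
E_pos_t = nn.Parameter(torch.zeros(m, embedding_size))  # learnable position code E_pos'

n = torch.randint(0, vocab_size, (m,))  # text serial number sequence of length m
E_t0 = word_embed(n)                    # (m, embedding_size)
E_t = E_t0 + E_pos_t                    # text encoding information E_t
```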
• It should be noted that the positional encoding in the embodiments of this application can be an embedding layer with learnable parameters similar to Bert Positional Embedding, or a positional encoding based on sine/cosine transformation similar to the native Transformer architecture, which can be set according to actual needs and is not limited in this application.
  • Figure 13 is a schematic flowchart illustrating an exemplary process for obtaining intermediate representation information. Please refer to Figure 13. Specifically, it includes but is not limited to the following steps:
• the text processing module performs feature fusion on the image encoding information E_v and the text encoding information E_t to obtain the mixed semantic encoding E_m (which can also be called mixed encoding information, which is not limited in this application).
• the mixed semantic encoding E_m can be expressed as:
• E_m = [E_v; E_t]
• The dimension of the mixed semantic encoding E_m is (N_v + N_t, embedding_size).
• It should be noted that the fusion of the image encoding information E_v and the text encoding information E_t is described here only with splicing as an example; in other embodiments, other methods can also be used, such as addition, which is not limited in this application.
  • the text processing module passes the mixed semantic encoding E m through the multi-modal encoder to obtain multi-modal encoding information (ie, intermediate representation information).
  • Figure 14a is an exemplary multi-modal coding schematic diagram. Please refer to Figure 14a.
• the text processing module passes the mixed semantic encoding E_m through the multi-modal encoder 1301 to obtain multi-modal encoding information (i.e., intermediate representation information), denoted as E_IR.
• The multi-modal encoder can also be understood as being used to extract, from the input encoding information, high-dimensional semantic information that combines image information and text information.
• the multi-modal encoder (Encoder) 1301 is composed of stacked Transformer Encoders; for example, the number of stacks is L.
• Each Transformer Encoder mainly consists of a Multi-Head Attention layer, Layer Normalization (Norm in Figure 14a), and a feed-forward neural network (Feed Forward in Figure 14a).
  • Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder 1301. Please refer to Figure 14b.
• In this example, the stacking number L is 3; that is, the multi-modal encoder 1301 includes a multi-modal encoder 1301a, a multi-modal encoder 1301b, and a multi-modal encoder 1301c. It should be noted that the number of encoders described in the embodiments of this application is only a schematic example and can be set according to actual needs, which is not limited in this application. Exemplarily, the mixed semantic encoding E_m passes through the multi-modal encoder 1301a, and an output result is obtained.
  • the output result of the multi-modal encoder 1301a is used as the input of the multi-modal encoder 1301b and continues to be encoded.
  • the multi-modal encoder 1301b performs encoding based on the output result of the multi-modal encoder 1301a, and obtains the output result, which is used as the input of the multi-modal encoder 1301c.
• TE denotes a single Transformer Encoder within the multi-modal encoder 1301.
• The dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size); that is, E_IR = Encoder(E_m).
  • the multi-modal encoder is a Transformer Encoder as an example for explanation.
  • the multi-modal encoder can also be similar to a bidirectional recurrent neural network, or a simpler convolutional neural network encoder, which can be set according to actual needs and is not limited in this application.
  • the method by which the text processing module obtains multi-modal coding is not limited to the method of splicing image coding information and text coding information through a multi-modal encoder.
  • the text processing module can also convert images into The coded information and text coded information pass through their respective encoders and then are fused.
  • the text processing module passes the image encoding information through the image encoder to obtain high-dimensional image semantic information, and passes the text encoding information through the text encoder to obtain high-dimensional text semantic information.
  • the text processing module dimensionally aligns the high-dimensional image semantic information and the high-dimensional text semantic information and splices them together to obtain intermediate representation information.
  • the specific method can be set according to actual needs, and the purpose is to obtain high-dimensional image semantic features and text semantic features.
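• The splicing-based fusion followed by stacked Transformer Encoders can be sketched as follows; using PyTorch's built-in TransformerEncoder is an assumption standing in for the multi-modal encoder 1301:

```python
import torch
import torch.nn as nn

embedding_size, n_heads, L = 256, 8, 3  # L = number of stacked encoders

E_v = torch.randn(13, embedding_size)   # image encoding information (N_v rows)
E_t = torch.randn(8, embedding_size)    # text encoding information (N_t rows)

E_m = torch.cat([E_v, E_t], dim=0)      # splicing-based fusion: (N_v + N_t, embedding_size)

layer = nn.TransformerEncoderLayer(d_model=embedding_size, nhead=n_heads)
encoder = nn.TransformerEncoder(layer, num_layers=L)

# Add a batch dimension of 1; E_IR keeps the shape (N_v + N_t, embedding_size).
E_IR = encoder(E_m.unsqueeze(1)).squeeze(1)  # intermediate representation information
```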
  • the text processing module (specifically, it can be a classification model, which will not be repeated below) can classify the intermediate representation information to determine whether to output text content 602b based on the classification results.
  • Figure 14c is a schematic diagram of an exemplary classification process.
  • the text processing module can pass the multi-modal encoding information (ie, intermediate representation information) through the classification model to obtain the classification result.
  • the classification model may include, but is not limited to, a classification decoder, and an argmax layer (or softmax layer).
• As an example, the classification decoder is a fully connected layer, such as an MLP (Multi-Layer Perceptron).
  • the MLP may include multiple hidden layers. It should be noted that in the embodiment of the present application, only the fully connected layer (such as MLP) is used as the classification decoder as an example for explanation.
  • the classification decoder can also be other decoders, such as but not limited to decoders such as Transformer Decoder or Recurrent Neural Network (RNN) Decoder, which can be set according to actual needs.
• This application does not limit this; the purpose is to output corresponding classification results based on the input intermediate representation information.
  • argmax layer is used as an example for explanation. In other embodiments, the argmax layer and the softmax layer may also be used, and may be set according to actual needs, and are not limited in this application. Its purpose is to output the classification item corresponding to the maximum score.
• the classification results include but are not limited to three classification items: a (filtering), b (correcting and outputting), and c (direct output).
  • the classification result obtained includes the scores corresponding to the three classification items.
• the text processing module can pass the scores corresponding to the three classification items through the argmax layer or softmax layer to obtain the final decision category.
• the dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size).
• A slice along the first dimension of the multi-modal encoding information E_IR (the position corresponding to the classification head) can be taken to obtain a one-dimensional tensor E_IR0 of length embedding_size, expressed as:
• E_IR0 = E_IR[0]
• the text processing module passes the one-dimensional tensor E_IR0 through the fully connected layer and outputs a one-dimensional tensor T_out of length 3 (that is, equal to the number of classification items).
  • the fully connected layer can be an MLP, and the MLP can include multiple hidden layers.
• T_out includes the scores corresponding to the above three classification items a, b, and c:
• T_out = [f(a), f(b), f(c)]
• where f(a) is the score corresponding to classification item a (the filtering classification item), f(b) is the score corresponding to classification item b (the corrected output classification item), and f(c) is the score corresponding to classification item c (the direct output classification item).
• the text processing module passes T_out through the argmax layer to output the classification item corresponding to the maximum score.
  • MLP is only used as a fully connected layer as an example for explanation.
  • the fully connected layer can also be other decoders, such as but not limited to decoders such as Transformer Decoder or Recurrent Neural Network (RNN) Decoder, which can be set according to actual needs.
  • Its purpose is to output corresponding classification results based on the input intermediate representation information.
  • the argmax layer is used as an example for explanation.
  • the argmax layer and the softmax layer may also be used, and may be set according to actual needs, and are not limited in this application. Its purpose is to output the classification item corresponding to the maximum score.
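• A minimal sketch of the classification step, with an assumed hidden size; the decision indices simply mirror the three classification items described above:

```python
import torch
import torch.nn as nn

embedding_size = 256
E_IR0 = torch.randn(embedding_size)  # vector at the classification-head position

# MLP classification decoder with one hidden layer (sizes are illustrative).
mlp = nn.Sequential(
    nn.Linear(embedding_size, 128),
    nn.ReLU(),
    nn.Linear(128, 3),               # scores f(a), f(b), f(c)
)

T_out = mlp(E_IR0)
decision = T_out.argmax().item()     # 0 = filter, 1 = correct and output, 2 = direct output
```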
  • the text processing module can filter the corresponding text content, that is, the corresponding text content is not displayed in the text recognition result. For example, when the text processing module processes the text image 602a and the text content 602b, it detects that the classification result is category a, that is, the filtered classification item, then the text processing module filters the text content 602b, as shown in (2) of Figure 4 indicates that the text recognition results do not include the truncated first line of text, thereby avoiding errors in the truncated text recognition results and affecting the user experience.
  • the text processing module can display the corresponding text content in the text recognition result. For example, when the text processing module processes the text image 603a and the text content 603b, it is detected that the classification result is category c, that is, the classification result is a direct output classification item. The text processing module determines that the text content 603b can be output directly. As shown in (2) of Figure 4, the text processing module can display the text content 603b at a corresponding position in the text recognition result.
• If the output result is b, that is, the classification result is the corrected output classification item, the result recognized by OCR technology includes some errors and needs to be corrected before it can be output.
• After the text processing module detects that the classification result corresponding to a single piece of multi-modal encoding information (i.e., intermediate representation information) is the corrected output classification item, the text processing module can display the text content corrected by the correction module in the text recognition result. It should be noted that if the classification result is category a or category c, the text processing module discards (or ignores) the correction result output by the correction module.
  • FIG. 15 is an exemplary text modification schematic diagram. Please refer to Figure 15.
• In the embodiments of this application, a correction module including a Transformer Decoder is used as an example for explanation.
  • the text processing module passes multi-modal coding information (i.e. intermediate representation information) through Transformer Decoder1501, fully connected layer and argmax layer to obtain the corrected text content.
• In other embodiments, the correction module can also use other architectures, such as but not limited to: a forward decoder based on a recurrent neural network, a Bert Decoder architecture, a decoder similar to stepwise monotonic attention, etc., which can be set according to actual needs and is not limited in this application. The purpose is to correct the input intermediate representation information to obtain the corrected text.
• Transformer Decoder1501 includes Q stacked Transformer Decoders, where Q is a positive integer greater than 0.
  • a single Transformer Decoder can be represented as TD.
• a single TD includes but is not limited to: a Masked Multi-Head Attention layer, a Multi-Head Attention layer, Layer Normalization (i.e., Norm in Figure 15), and a feed-forward neural network (i.e., Feed Forward in Figure 15).
  • the K vector and V vector of the Transformer Decoder are multi-modal encoding information (that is, the output of the Encoder), and the Q vector is the output of the Masked multi-head attention layer.
  • Figure 16 is a schematic diagram of the processing flow of the correction module. Please refer to Figure 16.
• the recognition result of the OCR technology obtained by the text processing module includes text content and a text image, where the recognized text content corresponds to "volcanic eruption" but the character meaning "explosion" is recognized incorrectly as the similar character meaning "violent".
  • the text processing module obtains multi-modal coding information corresponding to text content and text images.
  • the text processing module obtains the corresponding classification results based on the multi-modal coding information, and the classification results are the corrected output classification items. Specific details can be found above and will not be repeated here. Please refer to Figure 16.
• the text processing module inputs the multi-modal encoding information into Transformer Decoder1501 as the K vector and V vector, and the start character <s> is input into Transformer Decoder1501 as the Q vector through Output Embedding and Positional Encoding.
• Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder. Please refer to Figure 17, assuming that the stack number Q of Transformer Decoder1501 in the embodiment of this application is 2.
  • Output Embedding can be Word Embedding.
• For the specific implementation, refer to the method in the above embodiments, or to implementations in other prior art embodiments, which will not be described further in this application.
  • the stack number Q of Transformer Decoder 1501 in the embodiment of the present application is 2 (can be set according to actual requirements, and is not limited by this application), including Transformer Decoder 1501a and Transformer Decoder 1501b.
• the text processing module inputs the multi-modal encoding information into Transformer Decoder1501a as the K vector and V vector, and the start character <s> is input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• The output of Transformer Decoder1501a is input as the Q vector of Transformer Decoder1501b, and the multi-modal encoding information is input into Transformer Decoder1501b as the K vector and V vector.
• the output of Transformer Decoder1501b is recorded as E_dout1.
• E_dout1 passes through the fully connected layer to obtain E_out1, where the dimension of E_out1 is (seq_len, N_vocab).
• the text processing module slices E_out1 along its first dimension, takes the last position, and obtains a one-dimensional tensor of length N_vocab.
• the text processing module passes the one-dimensional tensor through the argmax layer (it can also be the argmax and softmax layers, which can be set according to actual needs, and is not limited in this application).
  • N vocab is optionally the number of texts included in the text sequence number table. For example, if the dictionary includes 100 words and corresponding sequence numbers, the value of N vocab is 100.
  • the value of seq_len is the number of output characters.
• the number of output characters is 5, including "fire", "mountain", "explosion", "fa" and the end character <end>.
  • the value output by the argmax layer is used to indicate the sequence number in the dictionary.
• The text processing module can determine the corresponding word or character based on the serial number. In this example, the text processing module may determine that the corresponding character is "fire". In other words, the text processing module passes the multi-modal encoding information and the start character <s> through Transformer Decoder1501 to obtain the character "fire".
• Next, the multi-modal encoding information is again input as the K vector and V vector, and the "fire" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout2 based on the multi-modal encoding information, the "fire" character and the start character <s>.
• E_dout2 passes through the fully connected layer to obtain E_out2, and E_out2 obtains the corresponding value through the argmax layer.
• the text processing module can determine the corresponding character based on the value, such as "mountain".
• In other words, the text processing module passes the multi-modal encoding information, the "fire" character and the start character <s> through Transformer Decoder1501 to obtain the character "mountain".
• Next, the multi-modal encoding information is input as the K vector and V vector, and the "fire" character, the "mountain" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character, the "mountain" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout3 based on the multi-modal encoding information, the "fire" character, the "mountain" character and the start character <s>.
• E_dout3 passes through the fully connected layer to obtain E_out3, and E_out3 obtains the corresponding value through the argmax layer.
• the text processing module can determine the corresponding character based on the value, for example, "explosion".
• the text processing module passes the multi-modal encoding information, the "fire" character, the "mountain" character and the start character <s> through Transformer Decoder1501 to obtain the character "explosion".
• In this way, the incorrect character "violent" in the OCR recognition result is corrected to "explosion".
• Next, the multi-modal encoding information is input as the K vector and V vector, and the "fire" character, the "mountain" character, the "explosion" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character, the "mountain" character, the "explosion" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout4 based on the multi-modal encoding information, the "fire" character, the "mountain" character, the "explosion" character and the start character <s>.
• E_dout4 passes through the fully connected layer to obtain E_out4, and E_out4 obtains the corresponding value through the argmax layer.
• the text processing module can determine the corresponding character based on the value, for example, "fa". In other words, the text processing module passes the multi-modal encoding information, the "fire" character, the "mountain" character, the "explosion" character and the start character <s> through Transformer Decoder1501 to obtain the character "fa". For details not described here, please refer to the related content above for obtaining the character "fire", which will not be repeated.
• Next, the multi-modal encoding information is input as the K vector and V vector, and the "fire" character, the "mountain" character, the "explosion" character, the "fa" character and the start character <s> are input into Transformer Decoder1501 as the Q vector.
• Exemplarily, the "fire" character, the "mountain" character, the "explosion" character, the "fa" character and the start character <s> are input into Transformer Decoder1501a as the Q vector through Output Embedding and Positional Encoding.
• Transformer Decoder1501 outputs E_dout5 based on the multi-modal encoding information, the "fire" character, the "mountain" character, the "explosion" character, the "fa" character and the start character <s>.
• E_dout5 passes through the fully connected layer to obtain E_out5, and E_out5 obtains the corresponding value through the argmax layer.
• In this case, the text processing module determines that the output result is the end character <end>, which ends the loop.
  • the text processing module can obtain the correction result output by the correction module, that is, "volcanic eruption.”
  • the text processing module displays the obtained correction results in the recognition results.
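• The character-by-character correction walkthrough above amounts to greedy autoregressive decoding, sketched below; the model components, vocabulary size, and the serial numbers of <s> and <end> are all illustrative assumptions:

```python
import torch
import torch.nn as nn

embedding_size, N_vocab, max_len = 256, 100, 20
BOS, EOS = 1, 2  # assumed serial numbers of <s> and <end>

embed = nn.Embedding(N_vocab, embedding_size)          # Output Embedding
layer = nn.TransformerDecoderLayer(d_model=embedding_size, nhead=8)
decoder = nn.TransformerDecoder(layer, num_layers=2)   # Q = 2 stacked decoders
fc = nn.Linear(embedding_size, N_vocab)                # maps E_dout to E_out

E_IR = torch.randn(21, 1, embedding_size)  # multi-modal encoding information (K and V)

tokens = [BOS]
for _ in range(max_len):
    tgt = embed(torch.tensor(tokens)).unsqueeze(1)  # (seq_len, 1, embedding_size)
    E_dout = decoder(tgt, memory=E_IR)              # K/V come from the encoder output
    E_out = fc(E_dout)                              # (seq_len, 1, N_vocab)
    next_id = E_out[-1, 0].argmax().item()          # last position through argmax
    if next_id == EOS:                              # end character <end> ends the loop
        break
    tokens.append(next_id)                          # feed the new character back in

corrected_serials = tokens[1:]  # serial numbers of the corrected text
```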
• the models involved in the embodiments of this application can form a text processing model, which can also be understood as a neural network.
  • the input data of the model are mainly text images (including truncated and untruncated samples) and the corresponding text recognition content (i.e., text content).
• In the training phase, the text to be corrected is manually revised to obtain the corrected text, which serves as the supervision data for the output of the text correction decoder.
• the training process of the text processing model is supervised training. The classification decoder (i.e., the classification model) and the text correction decoder (i.e., the correction model) are trained using the teacher-forcing method at each time step. Since the two decoders share the encoder (that is, the backbone of the neural network of the text processing model), the actual training process is joint training.
• In addition to the vertical truncation described above, truncated text may also include horizontally truncated text and obliquely truncated text.
• For horizontally truncated text, the text recognition module can usually predict the text content during the OCR recognition process to obtain the correct text. That is to say, horizontal truncation generally may not cause the semantic errors described above for vertically truncated text.
• When the solution in the embodiment of the present application is applied to horizontally truncated text, the text can likewise be processed accordingly; the processed result may differ only slightly from the recognition result of the OCR technology.
• Obliquely truncated text is similar to horizontally truncated text.
• The correct text content can usually be obtained through prediction and other means within the OCR technology. That is to say, after processing through the solution in the embodiment of the present application, the difference between the output result and the OCR recognition result is small.
• It should be noted that for obliquely truncated text, OCR technology may not be able to recognize all text areas. For example, as shown in Figure 18a, the angle between the text line and the horizontal direction is assumed to be 30°.
• When OCR technology performs text area detection, the recognized text area only includes the part shown by the dotted line.
• When OCR technology recognizes the text content of the detected text area, based on its prediction function, it can output text content that is consistent with the original text. It can also be understood that for text with a large oblique angle, the corresponding recognition result may not have semantic errors.
  • FIG. 18b is a schematic diagram of an exemplary application scenario. Please refer to Figure 18b.
• In the image to be recognized, the lower part of a line of text is truncated.
  • the text processing module can also process the OCR recognition result corresponding to the text line based on the solution described in the above embodiment.
  • "partial occlusion" can optionally be the part of the entire line of text. The upper part (or the lower part, or any part) of part of the text is blocked.
• Figure 18c is a schematic diagram of another application scenario. Referring to Figure 18c, part of the text in a text line of the image to be recognized is occluded. That is, the original text is "multimodal encoding information (intermediate representation information)", and the "intermediate representation information" part is partially occluded.
• When the text recognition module performs OCR recognition on the text line, multiple text areas may be obtained.
• For example, the text recognition module may identify the text area corresponding to "multi-modal encoding information" as well as the text area corresponding to the occluded "(intermediate representation information)", together with the text content corresponding to the two text areas. Then, the text processing module can apply the processing solution in the embodiment of the present application to the images of the two text areas and the corresponding text content. Optionally, when the text recognition module performs OCR recognition on the text line, it is also possible to obtain a single text area. For example, as shown in Figure 18e, the text recognition module may divide the occluded text part and the unoccluded text part into the same text area.
• Based on the solution in the embodiment of the present application, the image and text content of this type of text area can also be processed.
  • the technical solution in the embodiment of the present application can be applied to a variety of scenes where the text is occluded, thereby meeting the needs for text recognition in different scenarios.
• Exemplarily, the embodiment of this application can effectively solve the text recognition problem for text lines with an occlusion rate of 20% to 50% (the range may also float, which is not limited by this application). It should be noted that, as mentioned above, if the occlusion rate of the text line is too high (for example, 80%), the corresponding text area may not be detected at the OCR stage. If the occlusion rate is low, the OCR recognition result may already be correct.
• In that case, the text processing module can directly output the corresponding text content, or output it after correction.
  • Figure 19 is a schematic flowchart of another text recognition method provided by an embodiment of the present application. Please refer to Figure 19. This method includes but is not limited to:
  • the text processing module passes the text image through the classification model to obtain the classification result.
  • the text processing module determines whether the text content is truncated based on the classification results.
  • the text processing module can pre-process the text image.
  • the pre-processing can be resizing the text image.
  • FIG. 20 is an exemplary schematic diagram of text image processing. Please refer to FIG. 20 .
• the text processing module inputs the text image 602a (which may also be the pre-processed text image) into the classification model.
  • the classification model can classify the text image 602a and obtain a classification result.
• the training data used by the classification model in the training phase includes, but is not limited to, text images corresponding to truncated text and text images corresponding to non-truncated text.
  • the classification model can be supervised with a cross-entropy loss function.
• the classification model may include but is not limited to mainstream classification networks based on Convolutional Neural Networks (CNN), for example VGG, ResNet, EfficientNet, etc., or classification models based on the Transformer structure, such as ViT (Vision Transformer) and its variants. Its purpose is mainly to output the probability of a binary classification problem, that is, the score corresponding to the truncated classification item or the non-truncated classification item.
• the classification model is recorded as CLS, and the text image includes parameters in three dimensions: width, height, and number of channels.
• the output score is optionally a value greater than 0 and less than 1, where the closer the value is to 1, the higher the truncation probability.
  • the text processing module can set a truncation threshold, for example, 0.5, which can be set according to actual needs and is not limited in this application. In one example, if the output result score is greater than or equal to the truncation threshold (0.5), the text content corresponding to the text image is determined to be truncated text. In another example, if the output result score is less than the truncation threshold (0.5), the text content corresponding to the text image is determined to be non-truncated text.
• If the text processing module determines that the text content corresponding to the text image is non-truncated text, the corresponding text content can be directly output, that is, displayed in the recognition result.
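• A minimal sketch of the truncation classifier and its threshold test; using torchvision's ResNet-18 as the CLS backbone is purely an assumption standing in for the mainstream classification networks listed above:

```python
import torch
from torchvision.models import resnet18

cls = resnet18(num_classes=1)  # CLS: one logit for the "truncated" class

def is_truncated(text_image: torch.Tensor, threshold: float = 0.5) -> bool:
    """text_image: (1, 3, H, W); the 0.5 threshold follows the description."""
    score = torch.sigmoid(cls(text_image)).item()  # value in (0, 1)
    return score >= threshold  # closer to 1 means higher truncation probability

print(is_truncated(torch.randn(1, 3, 32, 128)))
```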
  • the text processing module passes the text content through the semantic model to obtain semantic judgment results.
• If the text processing module determines that the text content corresponding to the text image is truncated text, the text processing module inputs the text content corresponding to the text image into the semantic model (which may also be called a semantic judgment module).
  • Figure 21 is an exemplary processing flow of the semantic model. Please refer to Figure 21.
  • the processing flow of the semantic model includes but is not limited to the following steps:
  • the text processing module segments the text content into words and obtains the word segmentation results.
• Exemplarily, the text processing module (specifically, the semantic model) performs word segmentation on the text content 602b and obtains the corresponding segmentation serial number sequence. For the specific steps of word segmentation and obtaining the serial number sequence, please refer to the relevant content in the above embodiments, which will not be repeated here.
• the text processing module passes the word segmentation results through Word Embedding and Positional Encoding to obtain E_text.
• Exemplarily, the text processing module (specifically, the semantic model) passes the obtained text serial number sequence through Word Embedding and Positional Encoding to obtain the text encoding information E_text.
• the text processing module passes E_text through the encoding module to obtain F_text.
• Exemplarily, the text processing module passes E_text through the encoding module (i.e., the encoder), and can obtain encoding information with high-dimensional semantic features, that is, F_text.
• Encoding modules include but are not limited to: a CNN encoder, an RNN encoder, a BiRNN (bidirectional recurrent neural network) encoder (such as a bidirectional LSTM (Long Short-Term Memory) network), a Transformer Encoder, etc., which is not limited in this application.
• the processing flow of the encoder can be referred to the relevant descriptions of Figure 14a and Figure 14b, which will not be repeated here; during implementation, E_text takes the place of the multi-modal encoding information in Figure 14a and Figure 14b.
• F_text = Encoder(E_text) (10)
• the text processing module passes F_text through the decoding module (i.e., the decoder) to obtain the output score score_t (which is the semantic judgment result).
• score_t = Decoder(F_text) (11)
  • the decoding module includes but is not limited to: MLP (ie fully connected layer) decoder, CNN decoder, RNN decoder and Transformer decoder, which can be set according to actual needs and is not limited in this application.
  • the decoding module please refer to the relevant contents of Figure 15, Figure 16 and Figure 17, and will not be described again here.
• score_t is the result of a binary classification problem; it can be understood that the output result is used to indicate semantic coherence or incoherence.
• It should be noted that in this example the argmax layer may not be included in the decoder. In other embodiments, an argmax layer may also be included, which is not limited in this application.
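• A minimal sketch of the semantic model, assuming a bidirectional LSTM encoder and an MLP decoder from the options listed above; all sizes are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, embedding_size, hidden = 1000, 128, 128  # illustrative sizes

embed = nn.Embedding(vocab_size, embedding_size)
encoder = nn.LSTM(embedding_size, hidden, bidirectional=True)    # BiRNN encoder
decoder = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())  # MLP decoder

def semantic_score(serial_sequence: torch.Tensor) -> float:
    E_text = embed(serial_sequence).unsqueeze(1)  # (m, 1, embedding_size)
    F_text, _ = encoder(E_text)                   # equation (10)
    return decoder(F_text[-1, 0]).item()          # score_t in (0, 1), equation (11)

print(semantic_score(torch.tensor([12, 52, 7])))
```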
• the input of the semantic model is mainly a line of text or a string, and the output is a category (i.e., a semantically coherent type or a semantically incoherent type).
• For training, the semantic model collects a corpus, and each item is manually annotated as to whether its semantics are coherent.
  • the semantic model can also obtain positive and negative training samples through data generation and other methods.
  • the score t output by the decoding module can be used to indicate semantic coherence.
• score_t can optionally be a value greater than 0 and less than 1.
  • the text processing module can set a semantic coherence threshold, such as 0.5, which can be set according to actual needs and is not limited in this application.
• In one example, if score_t is greater than or equal to the semantic coherence threshold (i.e., 0.5), the text processing module can determine that the corresponding text content is semantically coherent. In other words, the result of the OCR technology's recognition of the truncated text is correct. Correspondingly, the text processing module can directly output the text content, that is, display the corresponding text content in the text recognition result.
• In another example, if score_t is less than the semantic coherence threshold (i.e., 0.5), the text processing module may determine that the corresponding text content is semantically incoherent. That is to say, there is a semantic error in the recognition result of the truncated text by the OCR technology, and the text processing module continues to perform step (5).
  • the text processing module can also detect semantic coherence in other ways, for example, it can be based on a grammatical error checking model.
• the grammatical error checking model can output a candidate set of grammatical error positions based on the input text content, and a threshold judgment can be set based on the ratio of the size of the candidate set to the total number of tokens (minimal semantic units).
  • the text processing module can obtain the probability of each token through the forward language model, and make a judgment based on the average probability and a preset threshold. For specific details, please refer to the relevant content in the prior art embodiments and will not be described again here.
  • (5) the text processing module determines whether the text content can be corrected.
  • optionally, the text processing module can continue to determine whether the text content can be corrected based on the result output by the semantic model.
  • the text processing module can set a correction threshold, such as 0.2, which can be set according to actual needs and is not limited in this application.
  • if score_t is greater than or equal to the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content can be corrected; the text processing module then corrects the text content and outputs it.
  • optionally, the text processing module can use the text content as the input of the correction module and perform correction through the correction module; for the processing flow of the correction module, please refer to the relevant contents of Figure 15, Figure 16 and Figure 17, which will not be described again here.
  • if score_t is less than the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content cannot be corrected; the text processing module then filters the text content, that is, the corresponding text content is not displayed in the text recognition result. The overall decision flow is sketched below.
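  • putting the two thresholds together, the following sketch shows the resulting decision flow, using the example values 0.5 and 0.2 from above; the correct() helper is a hypothetical stand-in for the correction module.

```python
from typing import Optional

def decide(score_t: float, text: str,
           coherence_threshold: float = 0.5,
           correction_threshold: float = 0.2) -> Optional[str]:
    """Step (4)/(5) decision: output, correct, or filter the recognized text."""
    if score_t >= coherence_threshold:
        return text           # semantically coherent: output directly
    if score_t >= correction_threshold:
        return correct(text)  # incoherent but correctable: correct and output
    return None               # not correctable: filter from the result

def correct(text: str) -> str:
    # hypothetical stand-in for the correction module (see Figures 15 to 17)
    return text

print(decide(0.8, "hello world"))  # 'hello world'
print(decide(0.3, "hrllo world"))  # corrected text (here unchanged by the stub)
print(decide(0.1, "h#l%o"))        # None (filtered)
```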
  • the text processing module can also detect whether the text content can be corrected based on other detection methods. For example, as mentioned above, the text processing module can perform processing based on the grammatical error checking model in the semantic coherence judgment process; the text processing module can further determine, based on the output results of the grammatical error checking model, the number of grammatical errors or the proportion of grammatically erroneous characters in the total number of characters, and determine whether the text content can be corrected based on this proportion. For another example, as mentioned above, the semantic coherence judgment can calculate the average probability based on the forward language model, and the text processing module can determine whether the text content can be corrected based on the average probability (for example, by setting a corresponding correction threshold).
  • the text processing module can also use other correction methods; for example, the output results of the grammar-based error checking model can be used to correct the text content through confusion set recall and candidate ranking.
  • optionally, the text processing module can obtain a confusion set for the error positions based on the output results of the grammatical error checking model by calling a statistical language model, a neural language model, or a BERT-style bidirectional language model, and then recall the corrected text through candidate ranking and error screening mechanisms. A minimal sketch follows.
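  • the following sketch illustrates confusion set recall with candidate ranking, assuming the grammatical error checking model has already flagged the error positions; the confusion sets and the scoring function below are toy stand-ins for the language models named above.

```python
from itertools import product
from typing import Dict, List, Set

CONFUSION: Dict[str, Set[str]] = {"0": {"o", "0"}, "1": {"l", "1"}}  # toy confusion sets

def score(candidate: str) -> float:
    # toy stand-in for a language-model score: prefer alphabetic characters
    return sum(c.isalpha() for c in candidate) / len(candidate)

def recall_correction(text: str, error_positions: List[int]) -> str:
    options = [CONFUSION.get(text[i], {text[i]}) for i in error_positions]
    best = text
    for combo in product(*options):        # enumerate confusion-set candidates
        chars = list(text)
        for pos, ch in zip(error_positions, combo):
            chars[pos] = ch
        candidate = "".join(chars)
        if score(candidate) > score(best):  # candidate ranking
            best = candidate
    return best

print(recall_correction("he110", [2, 3, 4]))  # -> 'hello'
```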
  • the models involved in the steps in Figure 19 can form a neural network.
  • the training method of the neural network can refer to the relevant description of neural network training in the above embodiments, and will not be described again here.
  • FIG. 22 shows a schematic block diagram of a device 2200 according to an embodiment of the present application.
  • the device 2200 may include: a processor 2201 and a transceiver/transceiver pin 2202, and optionally, a memory 2203.
  • the components of the device 2200 are coupled together through a bus 2204, which includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity, the various buses are all referred to as bus 2204 in the figure.
  • optionally, the memory 2203 may be used to store the instructions in the foregoing method embodiments.
  • the processor 2201 can be used to execute the instructions in the memory 2203, control the receiving pin to receive signals, and control the transmitting pin to send signals.
  • the device 2200 may be the electronic device, or a chip of the electronic device, in the above method embodiments.
  • this embodiment also provides a computer storage medium that stores computer instructions.
  • when the computer instructions are run on an electronic device, they cause the electronic device to execute the above related method steps to implement the method in the above embodiments.
  • this embodiment also provides a computer program product.
  • when the computer program product is run on a computer, it causes the computer to perform the above related steps to implement the method in the above embodiments.
  • embodiments of the present application also provide a device.
  • this device may specifically be a chip, a component, or a module.
  • the device may include a processor and a memory that are connected, where the memory is used to store computer execution instructions.
  • when the device is running, the processor can execute the computer execution instructions stored in the memory, so that the chip executes the methods in each of the above method embodiments.
  • the electronic device, computer storage medium, computer program product, or chip provided in this embodiment are all used to execute the corresponding methods provided above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above, which will not be repeated here.
  • computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium can be any available medium that can be accessed by a general-purpose or special-purpose computer.
  • "A and/or B" can represent three situations: A exists alone, A and B exist simultaneously, or B exists alone.
  • the terms "first" and "second" in the description and claims of the embodiments of this application are used to distinguish different objects, rather than to describe a specific order of objects.
  • for example, the first target object, the second target object, etc. are used to distinguish different target objects, rather than to describe a specific order of the target objects.
  • "multiple processing units" refers to two or more processing units; "multiple systems" refers to two or more systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Provided in embodiments of the present application are a text recognition method and an electronic device. The method comprises: an electronic device can obtain an image and first text content in a first text region of an object to be recognized; the electronic device classifies the image and the first text content in the first text region to display a text recognition result of the first text region on the basis of a classification result, wherein the classification result comprises a first classification, a second classification, and a third classification. The text recognition result corresponding to the first classification filters the first text content. The text recognition result corresponding to the second classification comprises text content after correction of the first text content. The text recognition result corresponding to the third classification comprises the first text content. In this way, the electronic device can comprehensively consider an image and text content in a text region to avoid semantically incorrect text content in text recognition results, thereby improving user experience.

Description

Text recognition method and electronic device
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office of China on May 30, 2022, with application number 202210597895.6 and titled "Text recognition method and electronic device", the entire content of which is incorporated into this application by reference.
Technical field
Embodiments of the present application relate to the field of terminal devices, and in particular, to a text recognition method and an electronic device.
Background
With the continuous development of communication technology, mobile phones and other terminals have become an indispensable part of people's daily lives. Users can use mobile phones not only to communicate with other users, but also to browse or process various types of information.
During use, if the user is interested in content displayed on the mobile phone, such as certain text in a picture or an application interface, the user can use the text recognition function of an application to recognize the text in the picture or interface. The text recognition function is usually implemented based on optical character recognition (OCR) technology. Taking a picture as an example, the application can recognize the text in the picture based on OCR technology and output the recognition result. However, in text recognition scenarios that include truncated text, the output of current OCR technology differs considerably from the original text, which affects the user experience.
Summary
In order to solve the above technical problems, this application provides a text recognition method and an electronic device. In this method, the electronic device can output a text recognition result that meets the user's needs based on the image and the text content of a text area.
In a first aspect, embodiments of the present application provide a text recognition method. The method includes: the electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, where the first text area includes text content. The electronic device performs text content recognition on the acquired first text area to obtain first text content. Then, the electronic device performs classification based on the image of the first text area and the first text content to obtain a classification result. Subsequently, the electronic device displays the text recognition result of the first text area based on the classification result. The step of displaying the text recognition result may specifically include: if the classification result is the first category, the text recognition result filters out the first text content; if the classification result is the second category, the text recognition result includes the corrected text content of the first text content; if the classification result is the third category, the text recognition result includes the first text content. In this way, by comprehensively considering the image information (i.e., the image of the text area) and the text information (i.e., the text content), the electronic device can filter out the text content recognition result (i.e., the first text content) when much of the text content contained in the text area is missing, output the corrected result when little of the text content is missing, and output the corresponding text when the text content is not missing. As a result, correct and semantically smooth results can be presented in the text recognition result, while semantically erroneous results (i.e., text content) are filtered out, so that an anthropomorphic, complex decision-making effect can be obtained to improve the user experience.
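Illustratively, the dispatch on the classification result described in the first aspect can be sketched as follows. The category names and the corrected-text argument are hypothetical; in a real implementation they would come from the classification model and the correction model, respectively.

```python
from enum import Enum
from typing import Optional

class Category(Enum):
    FILTER = 1   # first category: filter out the first text content
    CORRECT = 2  # second category: display the corrected text content
    KEEP = 3     # third category: display the first text content as-is

def text_recognition_result(category: Category, first_text: str,
                            corrected_text: str) -> Optional[str]:
    if category is Category.FILTER:
        return None  # nothing is displayed for this text area
    if category is Category.CORRECT:
        return corrected_text
    return first_text

print(text_recognition_result(Category.CORRECT, "truncted txt", "truncated text"))
```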
Illustratively, the text recognition result is optionally displayed in the text recognition result display box 405 in FIG. 4. That is to say, if the text recognition result is the result indicated by the first category (i.e., filtering), the result corresponding to the first text area in the text recognition result display box 405 is empty, that is, the text content recognition result corresponding to the first text area (i.e., the first text content) is not displayed. If the text recognition result is the result indicated by the second category (i.e., outputting the corrected text content) or by the third category (i.e., directly outputting the text content), the text recognition result display box 405 includes the corrected text content corresponding to the first text area or the text content of the first text area.
Illustratively, the text recognition result may be a result corresponding to the text area itself. For example, if the text recognition result is the result indicated by the first category (i.e., filtering), the text recognition result corresponding to the first text area displayed by the electronic device is empty (a blank space may or may not be left). If the text recognition result is the result indicated by the second category (i.e., outputting the corrected text content) or by the third category (i.e., directly outputting the text content), the electronic device may display the text content corresponding to the first text area in the text recognition result display box 405 (which may be the corrected text content, or the result of text content recognition).
Illustratively, the classification result is optionally a numerical value, and the numerical value is used to represent the classification item.
Illustratively, the classification result may also include three numerical values, and the classification corresponding to the largest numerical value is the classification corresponding to the first text area.
According to the first aspect, the electronic device performing classification based on the image of the first text area and the first text content to obtain a classification result includes: the electronic device obtains intermediate representation information based on the image of the first text area and the first text content; the electronic device classifies the intermediate representation information to obtain the classification result. In this way, the electronic device uses high-dimensional multi-modal semantic information to make more refined decisions on different input combinations, thereby achieving an anthropomorphic, complex decision-making effect.
Illustratively, the intermediate representation information may be called multi-modal information.
Illustratively, the intermediate representation information may be used to characterize the image features of the image of the first text area and the text features of the first text content.
According to the first aspect, or any implementation of the first aspect above, the electronic device classifying the intermediate representation information to obtain the classification result includes: the electronic device classifies the intermediate representation information through a classification model to obtain the classification result. In this way, the electronic device can classify the intermediate representation information through a pre-trained classification model to obtain the corresponding classification result.
According to the first aspect, or any implementation of the first aspect above, before the electronic device displays the text recognition result of the first text area based on the classification result, the method further includes: the electronic device corrects the intermediate representation information to obtain the corrected text content of the first text content. Illustratively, the electronic device corrects the intermediate representation information to obtain the corrected text content before, at the same time as, or after classifying the intermediate representation information. The electronic device can determine, based on the classification result, whether to output the corrected text content. Illustratively, if the corrected text content does not need to be output, for example when the classification result is the first category or the third category, the corrected text content is discarded.
According to the first aspect, or any implementation of the first aspect above, the electronic device correcting the intermediate representation information to obtain the corrected target text content includes: the electronic device corrects the intermediate representation information through a correction model to obtain the corrected text content of the first text content. In this way, the electronic device can correct the intermediate representation information through a pre-trained correction model to obtain the corrected text content.
According to the first aspect, or any implementation of the first aspect above, the electronic device obtaining the intermediate representation information based on the image of the first text area and the first text content includes: the electronic device performs image encoding on the image of the first text area to obtain first image encoding information; the electronic device performs text encoding on the first text content to obtain first text encoding information; and the electronic device performs multi-modal encoding on the first image encoding information and the first text encoding information through a multi-modal encoding model to obtain the intermediate representation information. In this way, by encoding the image of the text area and the text content, the electronic device can obtain higher-dimensional semantic information. The electronic device can perform multi-modal encoding on the first image encoding information and the first text encoding information through a pre-trained multi-modal encoding model to obtain intermediate representation information with high-dimensional semantics.
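Illustratively, the encoding pipeline described above can be sketched as follows in PyTorch. The encoder architectures and all dimensions are simplified assumptions, since this application does not fix the specific image encoder, text encoder, or multi-modal encoding model.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, img_dim=64, txt_vocab=5000, emb_dim=64, out_dim=128):
        super().__init__()
        self.image_encoder = nn.Sequential(  # produces the first image encoding information
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, img_dim))
        self.text_encoder = nn.EmbeddingBag(txt_vocab, emb_dim)  # first text encoding info
        self.fusion = nn.Linear(img_dim + emb_dim, out_dim)      # multi-modal encoding

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        e_img = self.image_encoder(image)
        e_txt = self.text_encoder(token_ids)
        # fuse both modalities into the intermediate representation information
        return self.fusion(torch.cat([e_img, e_txt], dim=-1))

model = MultimodalEncoder()
z = model(torch.randn(1, 3, 32, 128), torch.randint(0, 5000, (1, 12)))
print(z.shape)  # torch.Size([1, 128])
```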
According to the first aspect, or any implementation of the first aspect above, the multi-modal encoding model, the classification model, and the correction model form a neural network. The training data of the neural network includes a second text area and second text content corresponding to the second text area, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content. In this way, by inputting images and text content of different types of text areas (including text areas with and without missing text), the neural network can be trained iteratively, so that the neural network can complete the corresponding functions, that is, it can fuse, classify, and correct the image and text content of a text area.
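Illustratively, the two kinds of training samples described above could be synthesized as in the following sketch, which assumes Pillow and simulates a partially missing text area by cropping away the lower part of a rendered text-line image; the cut ratio and the labels are illustrative assumptions, not values from this application.

```python
from PIL import Image

def make_training_pair(line_image: Image.Image, text: str, cut_ratio: float = 0.4):
    w, h = line_image.size
    truncated = line_image.crop((0, 0, w, int(h * (1 - cut_ratio))))  # missing lower part
    negative = (truncated, text, "partially_missing")  # second text area sample
    positive = (line_image, text, "complete")          # third text area sample
    return positive, negative

# Usage (assumes a rendered text-line image on disk):
# pos, neg = make_training_pair(Image.open("line.png"), "hello world")
```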
According to the first aspect, or any implementation of the first aspect above, the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized. In this way, the text recognition method in this application can implement different processing methods for text content; that is, the text recognition results finally displayed are all semantically coherent text content. For semantically incoherent text content in the text content recognition results, filtering or correction is used to avoid the impact of semantically incoherent text content on the text recognition results.
According to the first aspect, or any implementation of the first aspect above, if the first text area includes partially missing text content, the text recognition result is the first category or the second category. Illustratively, partially missing text content may mean that every character in the text area is missing part of its information, for example the upper half or the lower half. Illustratively, partially missing text may also mean that at least one character in the text area is missing part of its information.
According to the first aspect, or any implementation of the first aspect above, the semantics expressed by the first text content are different from the semantics expressed by the text content in the first text area. In this way, in embodiments of the present application, the text content recognition results can be screened to filter or correct text content whose semantics differ from the original, thereby improving the user experience.
According to the first aspect, or any implementation of the first aspect above, the object to be recognized is a picture, a web page, or a document.
In a second aspect, embodiments of the present application provide a text recognition method. The method includes: the electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, where the first text area includes text content. The electronic device performs text content recognition on the first text area to obtain first text content. The electronic device displays the text recognition result of the first text area based on the image of the first text area and the first text content. The electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content is semantically coherent text content, or if the image of the first text area indicates that the first text area does not include partially missing text content, the text recognition result includes the first text content; if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically erroneous text content, the text recognition result filters out the first text content or the text recognition result includes the corrected text content of the first text content. In this way, by comprehensively considering the image information (i.e., the image of the text area) and the text information (i.e., the text content), the electronic device can filter out the text content recognition result (i.e., the first text content) when much of the text content contained in the text area is missing, output the corrected result when little of the text content is missing, and output the corresponding text when the text content is not missing. As a result, correct and semantically smooth results can be presented in the text recognition result, while semantically erroneous results (i.e., text content) are filtered out, so that an anthropomorphic, complex decision-making effect can be obtained to improve the user experience.
Illustratively, the electronic device can detect, based on the image of the text area, whether the text content in the text area is truncated, that is, whether it includes text with missing content. In one example, if the text content is not truncated, the first text content can be output directly. In another example, if the text content is truncated, it is detected whether the semantics of the first text content are coherent. If the semantics of the first text content are coherent, the first text content can be output directly. If the semantics of the first text content are incoherent, it is further detected whether the first text content can be corrected. If the first text content can be corrected, the corrected text content is output; if the first text content cannot be corrected, the first text content is filtered out.
According to the second aspect, the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically incoherent text content, the electronic device detects whether the first text content can be corrected. If the first text content cannot be corrected, the text recognition result filters out the first text content. If the first text content can be corrected, the text recognition result includes the corrected text content of the first text content. In this way, when the electronic device detects that the text content in the first text area is truncated and the semantics of the first text content are incoherent, it can further detect whether the first text content can be corrected. If it can be corrected, the electronic device can correct the first text content and output the corrected text content. If it cannot be corrected, the electronic device filters out the first text content. That is to say, the text recognition result of the first text area displayed by the electronic device is either empty, or the corrected text content, or the originally semantically coherent text content, so as to avoid the impact of erroneous text content recognition results on the user.
According to the second aspect, or any implementation of the second aspect above, if the first text content can be corrected, the method further includes: the electronic device corrects the first text content through a correction model to obtain the corrected text content of the first text content. In this way, the electronic device can correct the first text content through a pre-trained correction model to obtain semantically coherent text content.
According to the second aspect, or any implementation of the second aspect above, the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device classifies the image of the first text area through a classification model to obtain a classification result; the classification result is used to indicate whether the first text area includes partially missing text content. In this way, the electronic device can classify the image of the text area through a pre-trained classification model to detect whether the text content in the text area is truncated.
According to the second aspect, or any implementation of the second aspect above, if the image of the first text area indicates that the first text area includes partially missing text content, the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result; the semantic analysis result is used to indicate whether the first text content includes semantically erroneous text content. In this way, the electronic device can perform semantic analysis on the text content through a pre-trained semantic model to obtain the semantic analysis result.
Illustratively, the semantic analysis result can be a numerical value, and the electronic device can preset a semantic coherence threshold, which is used to indicate the semantic coherence of the text content. If the value of the semantic analysis result is greater than or equal to the threshold, the first text content is semantically coherent; if the value of the semantic analysis result is less than the threshold, the first text content is semantically incoherent.
According to the second aspect, or any implementation of the second aspect above, the semantic analysis result is also used to indicate whether the first text content can be corrected, and the electronic device displaying the text recognition result of the first text area based on the image of the first text area and the first text content includes: the electronic device determines, based on the semantic analysis result, whether the first text content can be corrected. The electronic device can set a correction threshold, which is different from the semantic coherence threshold. If the value of the semantic analysis result is greater than or equal to the correction threshold, the first text content can be corrected; if the value of the semantic analysis result is less than the correction threshold, the first text content cannot be corrected.
According to the second aspect, or any implementation of the second aspect above, the correction model, the classification model, and the semantic model form a neural network. The training data of the neural network includes a second text area and second text content corresponding to the second text area, as well as a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content. In this way, by inputting images and text content of different types of text areas (including text areas with and without missing text), the neural network can be trained iteratively, so that the neural network can complete the corresponding functions, that is, it can perform truncation judgment, semantic analysis, and correction on the image and text content of a text area.
According to the second aspect, or any implementation of the second aspect above, the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area also includes the text content corresponding to a third text area in the object to be recognized.
According to the second aspect, or any implementation of the second aspect above, the semantics expressed by the semantically erroneous text content are different from the semantics expressed by the corresponding text content in the first text area.
According to the second aspect, or any implementation of the second aspect above, the object to be recognized is a picture, a web page, or a document.
In a third aspect, embodiments of the present application provide an electronic device. The electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to execute the instructions of the method in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide an electronic device. The electronic device includes: one or more processors; a memory; and one or more computer programs, where the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to execute the instructions of the method in the second aspect or any possible implementation of the second aspect.
In a fifth aspect, embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable medium for storing a computer program, where the computer program includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
In a seventh aspect, embodiments of the present application provide a computer program, which includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
In an eighth aspect, embodiments of the present application provide a computer program, which includes instructions for executing the method in the second aspect or any possible implementation of the second aspect.
Description of the drawings
Figure 1 is a schematic diagram of the hardware structure of an exemplary electronic device;
Figure 2 is a schematic diagram of the software structure of an exemplary electronic device;
Figure 3 is a schematic diagram of an exemplary text recognition scenario containing truncated text;
Figure 4 is a schematic diagram of an exemplary application scenario applying the text recognition method in the embodiments of the present application;
Figure 5 is a schematic flowchart of an exemplary text recognition method;
Figure 6 is a schematic diagram of exemplary text recognition;
Figure 7 is a schematic diagram of exemplary text image encoding;
Figure 8 is a schematic diagram of exemplary image information encoding;
Figure 9 is a schematic diagram of exemplary image information encoding;
Figure 10 is a schematic diagram of exemplary Image Patch flattening;
Figure 11 is a schematic diagram of exemplary text content encoding;
Figure 12 is a schematic diagram of an exemplary text information encoding process;
Figure 13 is a schematic diagram of an exemplary process for obtaining intermediate representation information;
Figure 14a is a schematic diagram of exemplary multi-modal encoding;
Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder;
Figure 14c is a schematic diagram of an exemplary classification process;
Figure 15 is a schematic diagram of exemplary text correction;
Figure 16 is a schematic diagram of the processing flow of the correction module;
Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder;
Figure 18a is a schematic diagram of an exemplary application scenario;
Figure 18b is a schematic diagram of another exemplary application scenario;
Figure 18c is a schematic diagram of yet another exemplary application scenario;
Figure 18d is a schematic diagram of yet another exemplary application scenario;
Figure 18e is a schematic diagram of yet another exemplary application scenario;
Figure 19 is a schematic flowchart of an exemplary text recognition method;
Figure 20 is a schematic diagram of exemplary text image processing;
Figure 21 is an exemplary processing flow of the semantic model;
Figure 22 is a schematic structural diagram of an exemplary device.
Detailed description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
Figure 1 shows a schematic structural diagram of the electronic device 100. It should be understood that the electronic device 100 shown in Figure 1 is only one example of an electronic device, and the electronic device 100 may have more or fewer components than shown in the figure, may combine two or more components, or may have a different component configuration. The various components shown in Figure 1 may be implemented in hardware, software, or a combination of hardware and software including one or more signal processing and/or application-specific integrated circuits.
The electronic device 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. Different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
The processor 110 may also be provided with a memory for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory. This avoids repeated access and reduces the waiting time of the processor 110, thus improving the efficiency of the system.
The charging management module 140 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive the charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive wireless charging input through the wireless charging coil of the electronic device 100. While charging the battery 142, the charging management module 140 can also supply power to the electronic device through the power management module 141.
The power management module 141 is used to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and so on. The power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle count, and battery health status (leakage, impedance). In some other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be provided in the same device.
The wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and so on.
Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in combination with a tuning switch.
The mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification, and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be sent into a medium- or high-frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs sound signals through audio devices (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 194. In some embodiments, the modem processor may be an independent device. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same device as the mobile communication module 150 or other functional modules.
The wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 can also receive signals to be sent from the processor 110, perform frequency modulation and amplification on them, and convert them into electromagnetic waves through the antenna 2 for radiation.
In some embodiments, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technology. The electronic device 100 implements the display function through the GPU, the display screen 194, the application processor, and so on. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, quantum dot light-emitting diodes (QLED), and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device 100 can implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and so on.
The ISP is used to process the data fed back by the camera 193. The camera 193 is used to capture still images or videos. An object passes through the lens to generate an optical image that is projected onto the photosensitive element. In some embodiments, the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving music, video, and other files in the external memory card.
The internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions. The processor 110 executes the instructions stored in the internal memory 121 to perform various functional applications and data processing of the electronic device 100. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required for at least one function (such as a sound playback function, an image playback function, etc.). The data storage area may store data created during use of the electronic device 100 (such as audio data, a phone book, etc.). In addition, the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
The electronic device 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, the application processor, and so on.
The audio module 170 is used to convert digital audio information into an analog audio signal output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 can also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110.
The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. The embodiments of this application take an Android system with a layered architecture as an example to describe the software structure of the electronic device 100. In other embodiments, the embodiments of this application may also be applied to other systems such as the Hongmeng system (HarmonyOS); their implementations may all refer to the technical solutions in the embodiments of this application and are not illustrated one by one here.

FIG. 2 is a block diagram of the software structure of the electronic device 100 according to an embodiment of this application.

The layered architecture of the electronic device 100 divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.

The application layer may include a series of application packages.

As shown in FIG. 2, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Videos, Messages, text recognition, and text processing. In the embodiments of this application, the text recognition application may also be called a text recognition module or a text recognition engine, which is not limited in this application. The text recognition module may be used to identify the text areas and the text content in a picture to be recognized (the specific concepts are described below). The text processing application may also be called a text processing module, and is used to further process the output of the text recognition module (for the specific processing flow, refer to the embodiments below). It should be noted that the embodiments of this application are described using the example in which the text processing module further processes the results of the text recognition module. In other embodiments, the steps performed by the text processing module may also be performed by the text recognition module; in other words, the steps performed by the text recognition module and the text processing module may be performed by a single module, which is not limited in this application.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.

As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and the like.

The window manager is used to manage window programs. The window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, and so on.

The content provider is used to store and retrieve data and make the data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, a phone book, and the like.

The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system may be used to build applications. A display interface may consist of one or more views. For example, a display interface that includes a message notification icon may include a view for displaying text and a view for displaying pictures.

The phone manager is used to provide the communication functions of the electronic device 100, for example, management of the call state (including connected, hung up, and so on).

The resource manager provides various resources for applications, such as localized strings, icons, pictures, layout files, and video files.

The notification manager enables an application to display notification information in the status bar. It may be used to convey notification-type messages that disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, and the like. The notification manager may also present notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog window. For example, text information is prompted in the status bar, an alert tone sounds, the electronic device vibrates, or an indicator light flashes.
The system libraries may include multiple functional modules, for example, a surface manager, media libraries, a three-dimensional graphics processing library, and a 2D graphics engine (for example, SGL).

The surface manager is used to manage the display subsystem and provides the fusion of 2D and 3D layers for multiple applications.

The media libraries support playback and recording in multiple commonly used audio and video formats, as well as static image files. The media libraries may support multiple audio and video encoding formats.

The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, composition, layer processing, and the like.

The 2D graphics engine is a drawing engine for 2D drawing.

The kernel layer is the layer between the hardware and the software. The kernel layer includes at least a display driver, a camera driver, an audio driver, a sensor driver, a Bluetooth driver, and a Wi-Fi driver.

It can be understood that the components included in the system framework layer, the system libraries, and the runtime layer shown in FIG. 2 do not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than shown in the figure, or combine some components, or split some components, or arrange the components differently.
FIG. 3 is a schematic diagram of an exemplary text recognition scenario containing truncated text. Referring to (1) of FIG. 3, a picture 302 is displayed in the display interface 301 of a mobile phone. For example, the display interface 301 may be an application interface, for example, the interface of a system application such as the gallery application, or the application interface of a third-party application such as a chat application. That is, in the embodiments of this application, the system of the mobile phone may provide its own text recognition function (that is, the text recognition module in FIG. 2); for example, the gallery application may invoke the text recognition module of the mobile phone to perform text recognition on a picture. Optionally, a third-party application on the mobile phone may also provide its own text recognition function; the implementations of the text recognition functions of different third-party applications may be the same or different, which is not limited in this application. Optionally, a third-party application on the mobile phone may also invoke the text recognition module of the mobile phone, which is not limited in this application.

Still referring to (1) of FIG. 3, for example, the picture 302 includes both text and images (of course, the picture 302 may also include only text). It should be noted that the embodiments of this application are described using the text recognition scenario of a picture only as an example. In other embodiments, the method may also be applied to text recognition scenarios in an application interface; for example, the scenario may be text recognition on a page displayed by a browser application, which is not limited in this application.

Optionally, the picture 302 may be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 302 may also be generated by the mobile phone through the camera function; the picture 302 may also be a downloaded picture, or the like, which is not limited in this application.
For example, the text in the picture 302 includes multiple lines, where the first line and the last line displayed in the picture 302 are cut off by the border of the picture 302. In the embodiments of this application, this type of text is referred to as "truncated text". It should be noted that FIG. 3 uses vertically truncated text as an example only; the technical solutions in the embodiments of this application can equally be applied to recognition scenarios with horizontally truncated text and obliquely truncated text, and specific examples are described below. For example, the "vertically truncated text" described in the embodiments of this application is optionally text truncated perpendicular to the direction in which the text lines run. It can be understood that text lines are blocked by the upper and lower edges of the screen, or by certain fixed or frozen status bars, as the interface is scrolled up and down. For example, taking the picture 302 as a screenshot of a web page: while browsing, the user scrolls the page up and down, so the first line currently displayed on the page may be cut off by the upper edge of the page (which can also be understood as the upper border of the display frame). The user takes a screenshot of the currently displayed page, and the mobile phone takes the screenshot in response to the received user operation to generate the picture 302. The first line of text displayed in the picture 302 is then "vertically truncated text". For example, the "horizontally truncated text" described in the embodiments of this application is text truncated along the direction in which the text lines run; for example, photographing or scanning may cause a text line to be truncated horizontally. For example, "obliquely truncated text" is optionally text truncated in a direction at an angle to the direction in which the text lines run.
Still referring to (1) of FIG. 3, the user may long-press the picture 302. Referring to (2) of FIG. 3, for example, the application displays an option box 303 in response to the received long-press operation on the picture 302. Optionally, the option box 303 includes but is not limited to: a share option, a favorites option, an extract-text option 304, and so on. The position and size of the option box 303, and the number and names of the options it contains, are only illustrative examples and are not limited in this application.

For example, the user taps the extract-text option 304 to instruct the extraction of the text in the picture 302. In response to the received user operation, the mobile phone starts the text recognition function (as described above, the text recognition function may be the application's own text recognition function, or the text recognition function of the system invoked by the application, which is not limited in this application).
In the embodiments of this application, the text recognition function optionally adopts OCR technology. OCR technology mainly consists of two steps: the first step is text area detection, and the second step is text content recognition. For example, the text area detection step optionally detects at least one text area in an image, that is, identifies the areas of the image that contain text. For example, the text content recognition step optionally recognizes the text in the obtained text areas, that is, identifies the specific text content in each text area. For the detailed steps of text area detection and text content recognition, reference may be made to the related content in prior-art embodiments, which is not repeated in this application.
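As a rough sketch of this two-step pipeline, the following Python outline separates the two steps. Here `detect_text_regions` and `recognize_text` are hypothetical stand-ins for an OCR engine's detector and recognizer, not APIs defined by this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextLine:
    image: bytes   # the cropped region ("text image")
    content: str   # the recognized string ("text content")

def detect_text_regions(picture: bytes) -> List[bytes]:
    """Step 1 (hypothetical): return one cropped image per detected text area."""
    raise NotImplementedError

def recognize_text(region: bytes) -> str:
    """Step 2 (hypothetical): read the character string out of one text area."""
    raise NotImplementedError

def run_ocr(picture: bytes) -> List[TextLine]:
    # Detection first, then content recognition on every detected area.
    return [TextLine(img, recognize_text(img)) for img in detect_text_regions(picture)]
```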
Referring to (3) of FIG. 3, for example, the display interface 301 includes but is not limited to: the reduced picture 302 and a text recognition result display box 305. It should be noted that the interface layout of the display interface 301 in the embodiments of this application is only an illustrative example and is not limited in this application. For example, the text recognition result display box 305 includes but is not limited to: a "smudge to select text" option, the text recognition result, and other options. Optionally, the other options include but are not limited to: a "select all" option, a "search" option, a "copy" option, a "translate" option, and so on. Each of the other options may be used to process the text recognition result accordingly.
Still referring to (3) of FIG. 3, for example, the text recognition result in the text recognition result display box 305 is the result recognized by the text recognition function. However, in this example, because the first line of text in the picture 302 is truncated (for example, vertically truncated as described above), the first line of text is displayed incompletely. Accordingly, the result produced by the text recognition function may be inaccurate. For example, as shown in (3) of FIG. 3, the original first line of text on the web page is "首轮比赛,全**等人亮相时全场欢呼,5" (roughly, "In the first round, the whole audience cheered when 全** and the others appeared, 5"), but because the first line was cut off by the upper border while the page was being browsed, the first line of text in the screenshot picture 302 is truncated. When the application performs text recognition on the picture 302, the output for the first line is "日孔L贷,士红烤守八元的土从叮,5", which differs greatly from the original text and contains semantic and logical errors. For this type of recognition result, even techniques such as semantic inference cannot restore the original text, which affects the user experience. For example, the recognition result corresponding to an untruncated text line in the picture 302 (for example, the second line of text in the picture 302) does not differ from the original text.
The embodiments of this application provide a text recognition method. The method takes the text image and the text content as inputs to a model (which may be called a text recognition model or a text recognition network), and obtains the encoded information of each modality through the corresponding modality encoding. The text processing module fuses the encoded information corresponding to the text image with the encoded information corresponding to the text content, and the fused modality information serves as the attention input of a classification decoder and a correction decoder. In principle, the model implicitly takes both the image information (mainly the truncation situation) and the textual information (mainly the degree of semantic coherence) into comprehensive consideration, and uses high-dimensional multimodal semantic information to make more fine-grained decisions on different input combinations, thereby achieving a human-like, complex decision-making effect. Reflected in the final result, this complex decision yields three classification outcomes: direct filtering, for the case where occlusion makes the semantics uncorrectable; corrected output, for the case where occlusion makes the semantics incoherent but correctable; and direct output without correction, for the case where there is no occlusion, or there is occlusion but it does not affect the semantics. In other words, the text recognition method provided in the embodiments of this application offers a more human-like processing scheme. Under normal circumstances, if too much of the text is occluded, a user reading it with the naked eye cannot recognize the correct information, and the user can also judge that the content read from the truncated text is incorrect. If only a little of the text is occluded, the user can infer the occluded characters from the semantics. For unoccluded text, the user can read the corresponding content correctly. The technical solution in the embodiments of this application achieves this human-like reading effect: when the text is heavily occluded (that is, truncated), no result is output; when the occlusion is light, the corrected result is output; and when there is no occlusion, the corresponding text is output. As a result, the text recognition result presents only correct, semantically fluent content, while semantically erroneous results (that is, text content) are filtered out, improving the user experience.
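A minimal PyTorch-style sketch of the structure just described is given below. All layer types and sizes are assumptions chosen for illustration; the application does not specify the encoder architecture, and the two decoders are reduced here to single linear heads.

```python
import torch
import torch.nn as nn

class TextRecognitionNet(nn.Module):
    """Sketch: encode each modality, fuse, then classify and correct."""

    def __init__(self, d_model: int = 256, vocab_size: int = 8000):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.fusion = nn.TransformerEncoder(layer(), num_layers=2)
        self.classifier = nn.Linear(d_model, 3)          # filter / correct / output as-is
        self.corrector = nn.Linear(d_model, vocab_size)  # per-position corrected characters

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor):
        # image_emb: (B, Nv, d_model); text_emb: (B, Nt, d_model)
        fused = self.fusion(torch.cat([self.image_encoder(image_emb),
                                       self.text_encoder(text_emb)], dim=1))
        decision = self.classifier(fused[:, 0])  # first token drives the 3-way decision
        corrected = self.corrector(fused)        # correction branch attends over fused states
        return decision, corrected
```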
FIG. 4 is a schematic diagram of an exemplary application scenario of the text recognition method in the embodiments of this application. Referring to (1) of FIG. 4, taking the gallery application as an example, after the user taps the thumbnail corresponding to a picture 402 displayed in the gallery application, the gallery application may display the picture 402 in the display interface 401. Optionally, the display interface 401 also includes, but is not limited to, options (or controls) such as a share option and a favorites option.
For example, the gallery application may invoke the system's text recognition module and text processing module to perform text recognition and processing on the picture 402 (which may also be called the picture to be recognized or the image to be recognized). As described above, in the embodiments of this application, text recognition includes two parts: text area detection and text content recognition. Optionally, after receiving the user's operation of tapping the thumbnail corresponding to the picture 402, the text recognition module may perform the text area detection step to detect whether the picture 402 includes a text area. In this example, the picture 402 includes both pictures and text (of course, it may also include only text, which is not limited in this application). Accordingly, the text recognition module may detect at least one text area included in the picture 402. After the text recognition module detects that the picture 402 includes a text area, an "extract text in picture" option 403 may be displayed in the display interface 401. The user may tap the "extract text in picture" option 403 to instruct the extraction of the text content in the picture 402. In response to the received user operation, the gallery application performs text recognition on the picture 402 through the text recognition module, that is, performs the text content recognition step, to obtain the corresponding text content in each text area. In the embodiments of this application, the text processing module may further process the recognition results (including the text areas and the text content) obtained by the text recognition module. Referring to (2) of FIG. 4, the display interface 401 includes but is not limited to: the reduced picture 402 and an extracted-text display box 404. Optionally, the extracted-text display box 404 includes but is not limited to: a text recognition result display box 405 and other options. The other options include but are not limited to: a "smudge to select text" option, a "read full text aloud" option, a "select all" option, a "search" option, a "copy" option, a "translate" option, and so on. It should be noted that the layout of the controls in the display interfaces shown in the embodiments of this application is only an illustrative example and is not limited in this application. For example, the text recognition result display box 405 includes the text content recognized by the text recognition module. As shown in (2) of FIG. 4, in the embodiments of this application, for truncated text (such as the first line of text), the mobile phone does not display the corresponding text in the text recognition result display box 405. That is, for text recognition results that may contain semantic errors or garbled characters, the text processing module may simply not output (that is, not display) them, to avoid a large difference between the text recognition result and the original text. Still referring to (2) of FIG. 4, for untruncated text, the text processing module may display the corresponding text in the text recognition result display box 405. Optionally, in the embodiments of this application, the text processing module may also correct the text content recognized by the text recognition module to obtain the correct text (which may also be understood as text close to or identical to the original text), and output (that is, display in the text recognition result display box 405) the corrected result. That is, in the embodiments of this application, by filtering or correcting semantically erroneous text, the text recognition results displayed in the text recognition result display box 405 are semantically and logically correct and coherent, improving the user experience.
It should be noted that the embodiments of this application are described using the text recognition and processing scenario of a picture only as an example. In other embodiments, the method may also be applied to text recognition and processing scenarios in an application interface; for example, the scenario may be text recognition and processing of a page displayed by a browser application, which is not limited in this application.

It should be further noted that the picture 402 may be generated after the mobile phone performs a screenshot operation in response to a user operation; the picture 402 may also be generated by the mobile phone through the camera function; the picture 402 may also be a downloaded picture, or the like, which is not limited in this application.
It should be further noted that the embodiments of this application are described using only the scenario in which the gallery application invokes the text recognition module and the text processing module as an example. The steps performed by the text recognition module and the text processing module in the embodiments of this application may also be applied to other applications. For example, a chat application's own text recognition function may perform text recognition on a picture to be recognized and obtain the corresponding text recognition results; the chat application may then invoke the mobile phone's text processing module to further process the text recognition results. As another example, the chat application may also provide its own text recognition module and text processing module, and implement the steps performed by the text recognition module and the text processing module involved in the embodiments of this application. As yet another example, the chat application may also invoke both the text recognition module and the text processing module of the mobile phone, which is not limited in this application.

It should be further noted that the steps performed by the text recognition module described in the embodiments of this application are only illustrative examples. The steps performed by the text recognition module of the mobile phone and by a text recognition module built into an application may be the same or different; for details, reference may be made to prior-art embodiments, which is not limited in this application. For example, the text recognition module of the mobile phone may use OCR technology to perform text recognition and obtain the corresponding recognition results, including the text images and the text content (the concepts of text image and text content are explained below). The text recognition module in a chat application may use other technologies to perform text recognition and obtain corresponding recognition results, likewise including text images and text content. Optionally, the recognition results obtained by the chat application's text recognition module and by the mobile phone's text recognition module may be the same or different; for example, the mobile phone's text recognition module may recognize 5 text areas and obtain the corresponding text content, while the chat application's text recognition module may recognize 6 text areas and obtain the corresponding text content, which is not limited in this application. That is, the text processing module in the embodiments of this application may further process the recognition results of any text recognition module (whether of the mobile phone and/or of an application) to obtain results that meet the user's needs.
It should be further noted that the operations that trigger the text recognition and processing functions may be the same or different across applications. The user operation involved in this application (that is, tapping the "extract text" option) is only an illustrative example and is not limited in this application.

It should be further noted that the embodiments of this application are described using only the scenario in which the first line of text is truncated as an example. In other embodiments, the text recognition method in the embodiments of this application can equally be applied to scenarios that include truncation of the last line of text.

It should be further noted that the embodiments of this application use text truncated by a border as an example. In other embodiments, the truncation may also be caused by image occlusion or other reasons, which is not limited in this application.

In a possible implementation, the text recognition module may perform the text area detection step on the pictures in the gallery application while the mobile phone is in standby or the gallery application is in the background. That is, the text recognition module may perform the text area detection step on the pictures in the gallery application in advance, so that after the user taps a picture that includes a text area, the "extract text in picture" option box can be displayed immediately, improving the overall efficiency of text recognition and processing.
The text recognition method in the embodiments of this application is described in detail below with reference to the accompanying drawings. FIG. 5 is a schematic flowchart of an exemplary text recognition method. Referring to FIG. 5, the text recognition module may obtain the results recognized based on OCR technology, which include at least one text image and the text content corresponding to each text image. For example, FIG. 6 is a schematic diagram of exemplary text recognition. Referring to FIG. 6, the text recognition module performs text area detection on a picture 601 (that is, the picture 402; for a detailed description, refer to the picture 402, which is not repeated here) through OCR technology to obtain at least one text area. Specifically, text area detection can be understood as follows: after the OCR technology detects the areas of the picture 601 that contain text, it segments at least one text area of the picture 601 to obtain at least one text image (that is, the image corresponding to at least one text area of the picture 601). For example, as shown in FIG. 6, the text recognition module detects a text area 602a containing text in the picture 601; the text recognition module may segment the text area 602a (for example, along the dotted line) to obtain the image corresponding to the text area 602a, referred to as the text image 602a for short.

For example, the text recognition module may segment the areas of the picture 601 that contain text one after another; for example, it may obtain the image of a text area 603a, referred to as the text image 603a for short. The embodiments of this application use only the text area 602a and the text area 603a as examples; the text recognition module may obtain more text areas in the picture 601.
In a possible implementation, after recognizing a text area through OCR technology, the text recognition module may obtain the corresponding text image after processing such as affine or perspective transformation correction.

In another possible implementation, the size of a single text image may be the same as the size of the actual area occupied by the text content in the text image, or may be larger than the size of the actual area occupied by the text content. For example, for the text image 602a, the size of the text image is larger than the size of the area actually occupied by the text content; that is, there is a blank area between the border of the text image and the text content (that is, the edge of the text content).
Still referring to FIG. 6, the text recognition module may perform text content recognition on the at least one obtained text area (that is, text image) through OCR technology. Still taking the text image 602a and the text image 603a as examples, the text recognition module performs text content recognition on the text image 602a and obtains a text content recognition result 602b (which may also be called the text content 602b); that is, it recognizes the text content in the text image 602a as "日孔L贷,士红烤守八元的土从叮,5". The text recognition module continues to recognize the other text images to obtain the corresponding text content recognition results. For example, the text recognition module performs text content recognition on the text image 603a through OCR technology to obtain the corresponding text content recognition result 603b (which may also be called the text content 603b); that is, it recognizes the text content in the text image 603a as "位冠军也展示了高超的实力,第一轮107B" (roughly, "the champion also demonstrated superb strength; in the first round, 107B"). It should be noted that this embodiment uses only the text image 602a and the text image 603a as examples; the text recognition module may perform text content recognition on each obtained text image based on OCR technology to obtain the corresponding text content, which is not described one by one in this application. It should be further noted that the text recognition module may perform text content recognition on the text images in parallel or sequentially, which is not limited in this application.

Still referring to FIG. 5, for example, the text processing module obtains the recognition results produced by the text recognition module, including but not limited to: the text image 602a and the corresponding text content 602b, and the text image 603a and the corresponding text content 603b. The text processing module performs the flow in FIG. 5 on each text image input by the text recognition module and the text content corresponding to that text image. It should be noted that the text recognition module may output the recognition results to the text processing module for further processing after obtaining the images corresponding to all text areas of the image to be recognized (for example, the picture 601) and the corresponding text content. The text processing module may perform the flow in FIG. 5 on the obtained text images and text content one by one; it may also process multiple text images and text contents in parallel, which is not limited in this application. Optionally, the text recognition module may also output a text content and the corresponding text image to the text processing module for processing as soon as that text content is obtained, which is not limited in this application and is not repeated below.
Continuing to refer to FIG. 5, for example, taking the text image 602a and the text content 602b as an example, the text processing module passes the text image 602a and the text content 602b through an encoding model (which may also be called an encoding module) to obtain the image encoding information corresponding to the text image 602a and the text encoding information corresponding to the text content 602b. Optionally, the encoding model may include but is not limited to an image encoding model (which may be called an image encoding module) and a text encoding model (which may also be called a text encoding module). For example, the image encoding model may be used to encode the text image 602a to obtain the image encoding information corresponding to the text image 602a; that is, the image encoding model may encode a text image into machine-recognizable or machine-understandable semantic information. For example, the text encoding module may be used to encode the text content 602b to obtain the text encoding information; it can also be understood that the text encoding module encodes text content into machine-recognizable or machine-understandable semantic information.

It should be noted that the structures of the image encoding information and the text encoding information may adopt the corresponding encoder architecture according to the encoding process. The encoders described in the embodiments of this application are only illustrative examples and may be configured according to actual needs, which is not limited in this application.

It should be further noted that the text processing module may process the text image 602a and the text content 602b sequentially or in parallel, which is not limited in this application. For example, the text processing module may first process the text image 602a to obtain the image encoding information and then process the text content 602b to obtain the text encoding information; or it may first encode the text content 602b and then encode the text image 602a; or it may encode the text image 602a and the text content 602b at the same time, which is not limited in this application.
Still referring to FIG. 5, for example, still taking the text image 602a and the text content 602b as an example, the text processing module fuses the image encoding information corresponding to the text image 602a and the text encoding information corresponding to the text content 602b through a multimodal model (which may also be called a multimodal encoding module, a multimodal fusion module, or the like, which is not limited in this application) to obtain multimodal encoding information, which may also be called intermediate representation information.
For example, the text processing module corrects the intermediate representation information through a correction model (which may also be called a correction module), and the text processing module passes the intermediate representation information through a classification model (which may also be called a classification module) to classify the intermediate representation information and obtain a classification result. In the embodiments of this application, the classification results fall into three categories: filter, correct and output, and output directly. The filter category optionally means filtering out the text content, that is, not displaying the corresponding text content in the text recognition result. The correct-and-output category optionally means outputting the corrected text; it can also be understood that the text content may be corrected before being displayed in the text recognition result. The output-directly category optionally means displaying the text content in the text recognition result; that is, the text processing module may display the text content recognized by the text recognition module through OCR technology directly in the text recognition result. Taking the intermediate representation corresponding to the text image 602a and the text content 602b as an example: in one example, if the classification result of the intermediate representation information is the filter category, the text processing module filters out the text content 602b, that is, does not display the text content 602b in the text recognition result, so that semantically erroneous text does not affect the text recognition result. In another example, if the classification result of the intermediate representation information is the correct-and-output category, the text processing module may display the corrected result of the intermediate representation information in the text recognition result. In yet another example, if the classification result of the intermediate representation information is output directly, the text processing module displays the text content 602b in the text recognition result.
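The three classification outcomes can be consumed as in the following sketch; the class indices and names are assumptions for illustration only.

```python
FILTER, CORRECT_AND_OUTPUT, OUTPUT_DIRECTLY = 0, 1, 2  # assumed class indices

def render_line(decision: int, text_content: str, corrected_text: str):
    if decision == FILTER:
        return None                # heavy truncation: drop the line from the result
    if decision == CORRECT_AND_OUTPUT:
        return corrected_text      # light truncation: show the corrected text
    return text_content            # no truncation: show the OCR output unchanged
```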
Taking the text image 602a and the text content 602b as examples, each flow in FIG. 5 is described in detail below. FIG. 7 is a schematic diagram of exemplary text image encoding. Referring to FIG. 7, in the embodiments of this application, the process in which the text processing module (specifically, the image encoding model) encodes the image information of the text image 602a includes Patch Embedding and Positional Encoding, thereby converting the three-dimensional image information into two-dimensional image encoding information Ev.

It should be noted that, as described above, the structure of the encoding information obtained by encoding the text image and the text content (for example, two-dimensional encoding information) is determined by the architecture of the encoder, and the encoder architecture may be configured according to actual needs. For example, in other embodiments, the three-dimensional image information may also be converted into higher-dimensional or lower-dimensional image encoding information, which is not limited in this application and is not repeated below.

It should be further noted that the embodiments of this application describe encoding the image information of the text image through Patch Embedding and Positional Encoding only as an example. In other embodiments, other encoding methods may also be used, which is not limited in this application.
The specific process of Patch Embedding and Positional Encoding includes but is not limited to the following:

(1) The text processing module divides the text image 602a into N patches.
FIG. 8 is a schematic diagram of exemplary image information encoding. Referring to FIG. 8, optionally, the text processing module (specifically, the image encoding model, which is not repeated below) may resize the height of the text image 602a (or the width, or both the width and the height) so that the height of the text image 602a is adjusted to a preset pixel value. For example, the text processing module may adjust the height of the text image 602a to 32 pixels (or 64 pixels; this may be configured according to actual needs and is not limited in this application). Accordingly, the width of the text image 602a is scaled with the height in proportion (that is, according to the aspect ratio of the image 602a). As shown in FIG. 8, the embodiments of this application use the example in which the resized text image 602a has height H and width (which may also be called length) W. It should be noted that in other embodiments the text image may also be left unresized, which is not limited in this application.
Still referring to FIG. 8, for example, the text processing module divides the text image 602a into N Image Patches. In the embodiments of this application, assuming that the width of an Image Patch is w and its height is h, the number of Image Patches obtained by the text processing module is:

N = (H*W)/(h*w)      (1)
Optionally, the values of h and w may be the same or different; for example, both may be 16 pixels. They may be configured according to actual needs, which is not limited in this application.

Optionally, the value of N is a positive integer; for example, N may be obtained by rounding up.
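Putting the resize step and formula (1) together, a small numeric sketch follows; the patch size and preset height are the example values above, while the original image size is an assumption.

```python
import math

def patch_count(H: int, W: int, h: int = 16, w: int = 16) -> int:
    """Formula (1), rounded up so that N is a positive integer."""
    return math.ceil((H * W) / (h * w))

H0, W0 = 48, 600              # assumed original size of the text image
H = 32                        # preset height in pixels
W = round(W0 * H / H0)        # width scaled by the aspect ratio -> 400
N = patch_count(H, W)         # ceil(32 * 400 / (16 * 16)) = 50
```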
(2) The text processing module performs Patch Embedding on the N Image Patches.

FIG. 9 is a schematic diagram of exemplary image information encoding. Referring to FIG. 9, for example, the Patch Embedding flow includes but is not limited to the following steps:

Step a. The text processing module flattens each Image Patch to obtain the one-dimensional vector Pi corresponding to each Image Patch.
Specifically, the width of each Image Patch is w, its height is h, and its number of channels is c; accordingly, the size of each Image Patch is (h*w*c). The text processing module flattens the Image Patch to obtain a one-dimensional vector of length (h*w*c). For the i-th image block, this one-dimensional vector is denoted Pi, where Pi is expressed as:

Pi = [p1, p2, ……, p(h*w*c)]
For example, FIG. 10 is a schematic diagram of exemplary Image Patch flattening. Referring to FIG. 10, take the Image Patch 801 in FIG. 8 as an example. The size of the Image Patch 801 is (h*w*c). After the text processing module flattens the Image Patch 801, the corresponding one-dimensional vector P1 is obtained, where P1 is expressed as:

P1 = [p1, p2, ……, p(h*w*c)]
That is, P1 is a one-dimensional vector of length (h*w*c). Based on the above method, the text processing module may flatten each Image Patch to obtain N vectors Pi, that is, P1……Pn as shown in FIG. 9.
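Step a corresponds to a plain reshape; a NumPy sketch with the example patch size:

```python
import numpy as np

h, w, c = 16, 16, 3                  # patch height, width, and channel count
patch = np.zeros((h, w, c))          # one Image Patch of size (h, w, c)
P_i = patch.reshape(-1)              # flattened one-dimensional vector
assert P_i.shape == (h * w * c,)     # length h*w*c = 768
```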
Step b. The text processing module passes the N one-dimensional vectors Pi through a fully connected layer to obtain N one-dimensional tensors of a preset length.
For example, still referring to FIG. 9, the text processing module passes each of the N one-dimensional vectors Pi through a fully connected layer whose output length is embedding_size (which may be configured according to actual needs and is not limited in this application) to obtain N one-dimensional tensors Evi of length embedding_size, where Evi is expressed as:

Evi = [e1, e2, ……, e(embedding_size)]
For example, as shown in FIG. 9, the text processing module passes P1 through a fully connected layer of output length embedding_size to obtain a one-dimensional tensor Ev1 of length embedding_size, where Ev1 is expressed as:

Ev1 = [e1, e2, ……, e(embedding_size)]
Following the above method, the text processing module performs the same processing on all N one-dimensional vectors to obtain Ev1……Evn.

It should be noted that the embodiments of this application use a preset length of embedding_size only as an example. In other embodiments, the preset length may be another value, which depends on the fully connected layer used and is not limited in this application.
Step c. The text processing module arranges the N one-dimensional tensors Evi in order to obtain a two-dimensional tensor of dimension N*embedding_size.
For example, still referring to FIG. 9, the text processing module arranges the N one-dimensional tensors Ev1……Evn in order to obtain a two-dimensional tensor Ev0, where Ev0 is expressed as:

Ev0 = [Ev1; Ev2; ……; Evn]
Here, the dimension of Ev0 is (N*embedding_size).
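Steps b and c together map the N flattened patches to Ev0. In the sketch below, the shared fully connected layer is applied to all patches at once, which yields the (N, embedding_size) tensor directly; all sizes are illustrative.

```python
import torch
import torch.nn as nn

N, embedding_size = 50, 128
fc = nn.Linear(16 * 16 * 3, embedding_size)  # shared fully connected layer

P = torch.randn(N, 16 * 16 * 3)   # the N flattened vectors P1……Pn, stacked row-wise
E_v0 = fc(P)                      # steps b and c at once: shape (N, embedding_size)
```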
It should be noted that the image encoding method in the embodiments of this application is only an illustrative example. For example, in other embodiments, the text processing module may instead apply, to the Image Patches, a convolution kernel whose kernel size is (h*w), whose stride is h (or w), and whose number of output channels is embedding_size. The specific method may be configured according to actual needs; the purpose is to encode the N Image Patches into machine-encoded information with higher-level semantics.
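The convolutional variant mentioned above can be sketched as follows: applying a kernel of size (h, w) with stride (h, w) to the whole image produces the same per-patch projection as flattening plus a fully connected layer. The sizes are illustrative.

```python
import torch
import torch.nn as nn

h, w, c, embedding_size = 16, 16, 3, 128
patchify = nn.Conv2d(c, embedding_size, kernel_size=(h, w), stride=(h, w))

image = torch.randn(1, c, 32, 400)          # resized text image, H=32, W=400
patches = patchify(image)                   # shape (1, embedding_size, 32/h, 400/w)
E_v0 = patches.flatten(2).transpose(1, 2)   # (1, N, embedding_size), N = 2 * 25 = 50
```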
Optionally, in the embodiments of this application, the text processing module may concatenate (concat) Ev0 with a classification head Ecls to obtain a two-dimensional tensor Ev1. Optionally, the dimension of Ecls is (1, embedding_size); this dimension may be configured according to actual needs and is not limited in this application. Optionally, the classification head Ecls is a learnable parameter of the neural network.
For example, Ev1 may be expressed as:

Ev1 = [Ecls, Ev0]         (2)
For example, assume that the classification head Ecls is expressed as:

Ecls = [c1, c2, ……, c(embedding_size)]
Taking Ev0 from the above embodiment as an example, the text processing module concatenates Ev0 with Ecls to obtain Ev1 as given in formula (2); that is, a two-dimensional tensor whose first row is Ecls and whose remaining rows are the rows of Ev0.
Here, the dimension of Ev1 is (N+1, embedding_size).

It should be noted that the embodiments of this application use concatenation of Ev0 and Ecls only as an example. In other embodiments, other methods such as addition or fusion may also be used, which is not limited in this application.
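Formula (2) is a row-wise concatenation; a sketch with Ecls as a learnable parameter (sizes illustrative):

```python
import torch
import torch.nn as nn

N, embedding_size = 50, 128
E_v0 = torch.randn(N, embedding_size)
E_cls = nn.Parameter(torch.zeros(1, embedding_size))  # learnable classification head

E_v1 = torch.cat([E_cls, E_v0], dim=0)  # formula (2): shape (N + 1, embedding_size)
```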
(3) The text processing module performs Positional Encoding on Ev1.
For example, the text processing module adds the two-dimensional tensor Ev1 obtained above to a two-dimensional positional encoding Epos to obtain the image encoding information Ev. For example, the image encoding information Ev may be expressed as:

Ev = Ev1 + Epos         (3)
需要说明的是,位置编码的维度与上文处理后的结果的维度有关,本申请仅以二维为例进行说明,本申请不做限定。It should be noted that the dimension of the position coding is related to the dimension of the result after the above processing. This application only takes two dimensions as an example for explanation, and this application does not limit it.
举例说明,假设Epos表示为:
For example, assume that E pos is expressed as:
其中,Epos的维度为(N+1,embedding_size)。可选地,Epos为神经网络可学习参数。本申请实施例中,为方便表示,记Nv=N+1。Among them, the dimension of E pos is (N+1, embedding_size). Optionally, E pos is a learnable parameter of the neural network. In the embodiment of this application, for convenience of expression, N v =N+1 is recorded.
如图9所示,以上文中的Ev1为例,相应的,Ev1通过Positional Encoding,得到图像编码信息Ev表示为:
As shown in Figure 9, taking E v1 in the above example as an example, correspondingly, E v1 obtains the image encoding information E v through Positional Encoding, which is expressed as:
需要说明的是,本申请实施例中仅以图像编码与位置编码进行结合的方式为相加(add)为例进行说明,在其他实施例中还可以是其它结合方式,本申请不做限定。It should be noted that in the embodiment of the present application, the method of combining image coding and position coding is only added as an example for explanation. In other embodiments, other combination methods are also possible, and this application does not limit it.
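A minimal sketch of formula (3) under the learnable-position-code reading; N, embedding_size, and the random initialization are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
N, embedding_size = 18, 768

E_v1 = torch.randn(N + 1, embedding_size)                 # result of the concat step
E_pos = nn.Parameter(torch.randn(N + 1, embedding_size))  # learnable position code

# Formula (3): combine image code and position code by addition.
E_v = E_v1 + E_pos                                        # image encoding information
```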
Figure 11 is a schematic diagram of exemplary text content encoding. Referring to Figure 11, in the embodiments of this application, the text processing module (specifically, a text encoding model; this will not be repeated below) performs text information encoding (which may also be called character information encoding) on the text content 602b through Word Embedding and Positional Encoding, thereby converting the character information into text encoding information (which may also be called character encoding information) with higher-level semantic features, denoted E_t.

It should be noted that the embodiments of this application describe text information encoding of the text content only through Word Embedding and Positional Encoding as an example. In other embodiments, other encoding methods may also be used, and this application does not limit it.
Figure 12 is a schematic diagram of an exemplary text information encoding process. Referring to Figure 12, the process includes but is not limited to the following steps:

(1) The text processing module performs word segmentation on the text content 602b.

As shown in Figure 12, exemplarily, the text processing module segments the text content 602b according to a preset character length, obtaining a segmentation result (which may also be called a segmentation sequence).

In the embodiments of this application, the preset character length is one character. That is, the text processing module treats each character (including punctuation marks) as one word, obtaining m words (for example, m is 18, i.e., the content is divided into 18 words), that is, a word-segmentation sequence w of length m, which can be expressed as:

w = [w_1, w_2, …, w_m]

It should be noted that in other embodiments the preset character length can also be set according to actual needs, for example two characters, which this application does not limit. Optionally, the preset character lengths may also be unequal; for example, "目形" may be divided into one word and "山" into another, which this application does not limit.

(2) The text processing module obtains the text serial number sequence corresponding to the segmentation sequence.

In the embodiments of this application, the text processing module may be preset with a text serial number table (which may also be called text serial number information, a character code table, etc.; this application does not limit the name). The text serial number table indicates the correspondence between text (words or characters) and serial numbers. For example, the serial number corresponding to "目" in the text serial number table is "12"; for another example, the serial number corresponding to "关系" is "52". The correspondence between text and serial numbers can be set according to actual needs and is not limited in this application. It should be noted that the correspondence between text and serial numbers may be saved as a table or in other ways, which this application does not limit.

Optionally, the text contained in the text serial number table may cover a dictionary, or any book in a professional field, etc., which this application does not limit.

As shown in Figure 12, exemplarily, the text processing module may look up, based on the text serial number table, the serial number (which may also be called the text serial number) corresponding to each segment (character or word) in the segmentation sequence w, obtaining the text serial number sequence n, which can be expressed as:

n = [n_1, n_2, …, n_m]
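To make steps (1) and (2) concrete, the following minimal Python sketch segments a string character by character and looks up each character in a toy serial-number table; the table contents are invented for illustration and do not reflect any actual character code table:

```python
# A toy text serial-number table; the real table could cover a whole dictionary.
vocab = {"火": 1, "山": 2, "爆": 3, "发": 4, "。": 5}

text = "火山爆发。"
w = list(text)                     # segmentation sequence, one character per word
n = [vocab[token] for token in w]  # text serial-number sequence
print(w)                           # ['火', '山', '爆', '发', '。']
print(n)                           # [1, 2, 3, 4, 5]
```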
(3) The text processing module passes the text serial number sequence n through word embedding, obtaining the two-dimensional tensor E_t0.

Exemplarily, the text processing module passes the text serial number sequence n through an embedding layer, obtaining the two-dimensional tensor E_t0, which can be expressed as:

E_t0 = Embedding(n)         (4)

For example, in the embodiments of this application, the dimension of the two-dimensional tensor E_t0 is (m, embedding_size).

It should be noted that the dimension of E_t0 is related to the embedding layer and is not limited in this application.
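A minimal sketch of formula (4); vocab_size and embedding_size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only.
vocab_size, embedding_size = 100, 768

embedding = nn.Embedding(vocab_size, embedding_size)

# Formula (4): map the serial-number sequence n to the two-dimensional tensor E_t0.
n = torch.tensor([1, 2, 3, 4, 5])  # serial numbers from the previous step
E_t0 = embedding(n)                # (m, embedding_size), here m = 5
```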
(4) The text processing module adds E_t0 to the position code, obtaining the text information encoding E_t.

Exemplarily, as shown in Figure 12, the text processing module adds E_t0 to the position code E_pos′, obtaining the text information encoding E_t, which can be expressed as:

E_t = E_t0 + E_pos′         (5)

It should be noted that the dimension of the position code is related to the dimension of the result of the preceding processing. This application takes two dimensions only as an example for description and does not limit it.

For example, assume E_pos′ is a tensor of dimension (m, embedding_size). Optionally, E_pos′ is a learnable parameter of the neural network. In the embodiments of this application, for convenience of expression, N_t = m.

As shown in Figure 12, taking the E_t0 above as an example, the text processing module adds E_t0 and E_pos′, obtaining the text information encoding E_t as in formula (5).

It should be noted that the embodiments of this application use addition (add) as the way of combining the character code and the position code only as an example for description. In other embodiments, other combinations are also possible, and this application does not limit it.

Exemplarily, the positional encoding in the embodiments of this application may be an embedding layer with learnable parameters, similar to BERT Positional Embedding, or a positional encoding based on sine/cosine transforms, similar to the native Transformer architecture. It can be set according to actual needs and is not limited in this application.
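As a hedged illustration of the second option named above, the sketch below computes the sine/cosine position code in the style of the native Transformer and applies it as in formula (5); the sequence length and embedding_size are assumptions, and the learnable variant would simply be an nn.Parameter or nn.Embedding of the same shape:

```python
import math
import torch

def sinusoidal_position_code(m: int, embedding_size: int) -> torch.Tensor:
    """Sine/cosine positional encoding in the style of the native Transformer."""
    position = torch.arange(m, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, embedding_size, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / embedding_size))
    E_pos = torch.zeros(m, embedding_size)
    E_pos[:, 0::2] = torch.sin(position * div_term)  # even columns
    E_pos[:, 1::2] = torch.cos(position * div_term)  # odd columns
    return E_pos

# Formula (5) with the sine/cosine variant; m = 5 and embedding_size = 768 are
# illustrative assumptions.
E_t0 = torch.randn(5, 768)
E_t = E_t0 + sinusoidal_position_code(5, 768)
```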
Still referring to Figure 5, exemplarily, after the text processing module obtains the image encoding information and the character encoding information, it can obtain intermediate representation information based on them. Figure 13 is a schematic flowchart of an exemplary process for obtaining intermediate representation information. Referring to Figure 13, the process specifically includes but is not limited to the following steps:

(1) The text processing module performs feature fusion on the image encoding information E_v and the text encoding information E_t, obtaining the mixed semantic encoding E_m (which may also be called mixed encoding information; this application does not limit the name).

Exemplarily, the text processing module (specifically, a multi-modal encoding model; this will not be repeated below) concatenates the image encoding information E_v and the text encoding information E_t, obtaining the mixed semantic encoding E_m, which can for example be expressed as:

E_m = [E_v, E_t]         (6)

For example, combining the image encoding information E_v and the text encoding information E_t described above, the dimension of the mixed semantic encoding E_m is (N_v + N_t, embedding_size).

It should be noted that the embodiments of this application use concatenation as the fusion method of the image encoding information E_v and the text encoding information E_t only as an example for description. In other embodiments, other methods such as addition may also be used, and this application does not limit it.
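A minimal sketch of the concatenation in formula (6); N_v, N_t, and embedding_size are illustrative assumptions:

```python
import torch

# Assumed sizes for illustration: N_v = N + 1 image positions, N_t = m text positions.
N_v, N_t, embedding_size = 19, 5, 768

E_v = torch.randn(N_v, embedding_size)  # image encoding information
E_t = torch.randn(N_t, embedding_size)  # text encoding information

# Formula (6): concatenate along the sequence dimension to get the mixed encoding.
E_m = torch.cat([E_v, E_t], dim=0)      # (N_v + N_t, embedding_size)
```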
(2) The text processing module passes the mixed semantic encoding E_m through a multi-modal encoder, obtaining multi-modal encoding information (i.e., intermediate representation information).

Figure 14a is a schematic diagram of exemplary multi-modal encoding. Referring to Figure 14a, the text processing module passes the mixed semantic encoding E_m through the multi-modal encoder 1301, obtaining the multi-modal encoding information (i.e., intermediate representation information), denoted E_IR. Exemplarily, the multi-modal encoder can also be understood as extracting, based on the input mixed encoding information, high-dimensional semantic information that fuses the image information and the text information.

Optionally, the multi-modal encoder (Encoder) 1301 is composed of stacked Transformer Encoders, for example, L of them. Each Transformer Encoder mainly consists of a multi-head attention layer (Multi-Head Attention layer), layer normalization (Layer Normalization) (i.e., (Norm) in Figure 14a), and a feed-forward neural network (i.e., Feed Forward in Figure 14a).

Figure 14b is a schematic diagram of the processing flow of the multi-modal encoder 1301. Referring to Figure 14b, the embodiments of this application are described with a stack count L of 3 as an example; that is, the multi-modal encoder 1301 includes multi-modal encoder 1301a, multi-modal encoder 1301b, and multi-modal encoder 1301c. It should be noted that the number of encoders described in the embodiments of this application is only an illustrative example, can be set according to actual needs, and is not limited in this application. Exemplarily, the mixed semantic encoding E_m passes through multi-modal encoder 1301a, which produces an output result. The output result of multi-modal encoder 1301a then serves as the input of multi-modal encoder 1301b, which continues the encoding. Multi-modal encoder 1301b encodes based on the output of multi-modal encoder 1301a, and its output in turn serves as the input of multi-modal encoder 1301c. Multi-modal encoder 1301c encodes based on the output of multi-modal encoder 1301b, and its output is the multi-modal encoding information E_IR, which can be expressed as:

E_IR = TE(TE(TE(E_m)))         (7)

where TE denotes a single multi-modal encoder within the multi-modal encoder 1301. The dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size).

It should be noted that for the internal processing flow of each layer in the multi-modal encoder 1301, reference may be made to the relevant content of prior-art embodiments, which this application does not repeat.
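A minimal sketch of formula (7) using stacked Transformer encoder layers; the head count, feed-forward width, and input shape are illustrative assumptions, with only the stack count L = 3 taken from the text:

```python
import torch
import torch.nn as nn

# Assumed hyper-parameters; the text fixes only the stack count L = 3 here.
embedding_size, n_heads = 768, 8

encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_size, nhead=n_heads,
                                           batch_first=True)
multimodal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

# Formula (7): E_IR = TE(TE(TE(E_m))).
E_m = torch.randn(1, 24, embedding_size)  # (batch, N_v + N_t, embedding_size)
E_IR = multimodal_encoder(E_m)            # multi-modal encoding information
```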
It should further be noted that the embodiments of this application describe only a Transformer Encoder as the multi-modal encoder as an example. In other embodiments, the multi-modal encoder may also be, for example, a bidirectional recurrent neural network, or a simpler convolutional neural network encoder; it can be set according to actual needs and is not limited in this application.

It should further be noted that the way the text processing module obtains the multi-modal encoding is not limited to concatenating the image encoding information and the text encoding information and passing the result through a multi-modal encoder. In other embodiments, the text processing module may also pass the image encoding information and the character encoding information through their respective encoders first and then fuse the results. For example, the text processing module passes the image encoding information through an image encoder to obtain high-dimensional image semantic information, and passes the text encoding information through a text encoder to obtain high-dimensional text semantic information. The text processing module then aligns the dimensions of the high-dimensional image semantic information and the high-dimensional text semantic information and concatenates them, obtaining the intermediate representation information. The specific method can be set according to actual needs; its purpose is to obtain high-dimensional image semantic features and text semantic features.
Continuing to refer to Figure 5, exemplarily, the text processing module (specifically, a classification model; this will not be repeated below) may classify the intermediate representation information, so as to decide, based on the classification result, whether to output the text content 602b.

Figure 14c is a schematic diagram of an exemplary classification process. Referring to Figure 14c, exemplarily, the text processing module may pass the multi-modal encoding information (i.e., intermediate representation information) through the classification model, obtaining a classification result. Exemplarily, the classification model may include but is not limited to a classification decoder and an argmax layer (or a softmax layer). The embodiments of this application are described with the classification decoder being a fully connected layer, and the fully connected layer being an MLP (Multi-layer Perceptron), as an example. Exemplarily, the MLP may include multiple hidden layers. It should be noted that the embodiments of this application use only a fully connected layer (for example, an MLP) as the classification decoder as an example for description. In other embodiments, the classification decoder may also be another decoder, for example including but not limited to a Transformer Decoder or a recurrent neural network (Recurrent Neural Network, RNN) Decoder; it can be set according to actual needs and is not limited in this application. Its purpose is to output the corresponding classification result based on the input intermediate representation information. It should further be noted that the embodiments of this application use only an argmax layer as an example for description; in other embodiments, an argmax layer together with a softmax layer may also be used, set according to actual needs and not limited in this application. Its purpose is to output the classification item corresponding to the maximum score.

Optionally, in the embodiments of this application, the classification results include but are not limited to three classification items:

(a) filter

(b) correct and then output

(c) output directly

After the multi-modal encoding information passes through the classification decoder, the resulting classification result includes the scores corresponding to the three classification items. The text processing module may pass the scores corresponding to the three classification items through the argmax layer or the softmax layer, obtaining the final decision category.
By way of example: as stated above, the dimension of the multi-modal encoding information E_IR is (N_v + N_t, embedding_size). Optionally, in the embodiments of this application, the first row of E_IR (i.e., a slice along its first dimension) may be taken, obtaining a one-dimensional tensor E_IR0 of length embedding_size.

The text processing module passes the one-dimensional tensor E_IR0 through a fully connected layer, outputting a one-dimensional tensor T_out of length 3 (i.e., the same as the number of classification items). Optionally, the fully connected layer may be an MLP, and the MLP may include multiple hidden layers. Correspondingly, T_out can be expressed as:

T_out = MLP(E_IR0)         (8)

T_out has dimension 3, which can be understood as T_out containing the scores corresponding to the three classification items a, b, and c above; for example, it can be expressed as:

T_out = [f(a), f(b), f(c)]

where f(a) is the score corresponding to classification item a (i.e., the filter classification item), f(b) is the score corresponding to classification item b (i.e., the correct-then-output classification item), and f(c) is the score corresponding to classification item c (i.e., the direct-output classification item).

Exemplarily, the text processing module passes T_out through the argmax layer, outputting the classification item corresponding to the maximum score. It should be noted that the embodiments of this application use only an MLP as the fully connected layer as an example for description. In other embodiments, other decoders may also be used, for example including but not limited to a Transformer Decoder or an RNN Decoder, set according to actual needs and not limited in this application; their purpose is to output the corresponding classification result based on the input intermediate representation information. Likewise, an argmax layer together with a softmax layer may be used instead of the argmax layer alone, its purpose being to output the classification item corresponding to the maximum score.
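A minimal sketch of formula (8) and the argmax step; the MLP hidden width, the input shape, and the English labels are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Assumed sizes; the hidden width of the MLP is an illustrative choice.
embedding_size = 768
mlp = nn.Sequential(nn.Linear(embedding_size, 256), nn.ReLU(), nn.Linear(256, 3))

E_IR = torch.randn(24, embedding_size)    # multi-modal encoding information
E_IR0 = E_IR[0]                           # first row, length embedding_size

T_out = mlp(E_IR0)                        # formula (8): scores [f(a), f(b), f(c)]
labels = ["filter", "correct and output", "output directly"]
decision = labels[T_out.argmax().item()]  # argmax layer picks the decision category
```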
In one example, if f(a) is the maximum value, the output result is a; that is, the classification result is the filter classification item. Correspondingly, the text processing module can filter out the corresponding text content, i.e., not display it in the text recognition result. For example, if, while processing the text image 602a and the text content 602b, the text processing module detects that the classification result is category a (the filter classification item), the text processing module filters out the text content 602b. Then, as shown in (2) of Figure 4, the text recognition result does not include the truncated first line of text, thereby preventing an erroneous recognition result for the truncated text from affecting the user experience.

In another example, if f(c) is the maximum value, the output result is c; that is, the classification result is the direct-output classification item. In other words, the result recognized by the OCR technology is correct. Correspondingly, the text processing module can display the corresponding text content in the text recognition result. For example, while processing the text image 603a and the text content 603b, the text processing module detects that the classification result is category c, i.e., the direct-output classification item. The text processing module determines that the text content 603b can be output directly; as shown in (2) of Figure 4, the text processing module can display the text content 603b at the corresponding position in the text recognition result.

In yet another example, if f(b) is the maximum value, the output result is b; that is, the classification result is the correct-then-output classification item. This can be understood as the result recognized by the OCR technology containing some errors, which need to be corrected before output. As described above (i.e., in Figure 5), every piece of multi-modal encoding information (i.e., intermediate representation information) obtained by the text processing module is passed through the correction module for correction. After the text processing module detects that the classification result corresponding to a given piece of multi-modal encoding information is the correct-then-output classification item, the text processing module can display the text content corrected by the correction module in the text recognition result. It should be noted that if the classification result is category a or category c, the text processing module discards (or ignores) the correction result output by the correction module.
Figure 15 is a schematic diagram of exemplary text correction. Referring to Figure 15, the embodiments of this application are described with the correction module including a Transformer Decoder as an example. The text processing module passes the multi-modal encoding information (i.e., intermediate representation information) through the Transformer Decoder 1501, a fully connected layer, and an argmax layer, obtaining the corrected text content.

It should be noted that in other embodiments the correction module may also use other architectures, for example including but not limited to: a forward decoder based on a recurrent neural network, a BERT Decoder architecture, a decoder similar to stepwise monotonic attention, etc., set according to actual needs and not limited in this application. The purpose in each case is to correct the input intermediate representation information, obtaining the corrected text.

Referring to Figure 15, the Transformer Decoder 1501 includes Q stacked Transformer Decoders, where Q may be a positive integer greater than 0. A single Transformer Decoder can be denoted TD; a single TD includes but is not limited to: a masked multi-head attention layer, a multi-head attention layer, layer normalization (i.e., (Norm) in Figure 15), and a feed-forward neural network (i.e., Feed Forward in Figure 15). For the specific processing details of each layer, reference may be made to the relevant content of prior-art embodiments, which this application does not repeat.

Optionally, in the Transformer Decoder architecture, the K vector and V vector of the Transformer Decoder are the multi-modal encoding information (i.e., the output of the Encoder), and the Q vector is the output of the masked multi-head attention layer.
Figure 16 is a schematic diagram of the processing flow of the correction module. Referring to Figure 16, exemplarily, assume that the OCR recognition result obtained by the text processing module includes text content and a text image, where the text content is "火山暴发"; that is, the character "暴" has been recognized incorrectly (the original text is "火山爆发", "volcanic eruption"). Based on the method in the above embodiments, the text processing module obtains the multi-modal encoding information corresponding to the text content and the text image. Furthermore, based on the multi-modal encoding information, the text processing module obtains the corresponding classification result, which is the correct-then-output classification item. For specific details, refer to the description above; they are not repeated here. Referring to Figure 16, exemplarily, the text processing module inputs the multi-modal encoding information into the Transformer Decoder 1501 as the K vector and V vector, and the start symbol <s>, after Output Embedding and Positional Encoding, is input into the Transformer Decoder 1501 as the Q vector. Figure 17 is a schematic diagram of the processing flow of the Transformer Decoder.

Optionally, the Output Embedding may be Word Embedding; for its specific implementation, refer to the method in the above embodiments or to implementations in other prior-art embodiments, which this application does not repeat.

Exemplarily, assume that the stack count Q of the Transformer Decoder 1501 in the embodiments of this application is 2 (it can be set according to actual needs and is not limited in this application), comprising Transformer Decoder 1501a and Transformer Decoder 1501b. Exemplarily, the text processing module inputs the multi-modal encoding information into Transformer Decoder 1501a as the K vector and V vector, and the start symbol <s>, after Output Embedding and Positional Encoding, is input into Transformer Decoder 1501a as the Q vector. The output of Transformer Decoder 1501a serves as the Q-vector input of Transformer Decoder 1501b, and the multi-modal encoding information again serves as the K vector and V vector input to Transformer Decoder 1501b. The output of Transformer Decoder 1501b is denoted E_dout1. E_dout1 passes through the fully connected layer, obtaining E_out1, where the dimension of E_out1 is (seq_len, N_vocab). Optionally, the text processing module slices E_out1 along its first dimension and takes the final position, obtaining a one-dimensional tensor of length N_vocab. The text processing module passes this one-dimensional tensor through the argmax layer (or argmax and softmax layers, set according to actual needs and not limited in this application). Here, N_vocab is optionally the number of texts included in the text serial number table; for example, if the dictionary includes 100 words and their serial numbers, the value of N_vocab is 100. Exemplarily, the value of seq_len is the number of output characters; for example, in the embodiments of this application the number of output characters is 5, comprising "火", "山", "爆", "发", and the terminator <end>. Exemplarily, the value output by the argmax layer indicates a serial number in the dictionary, and the text processing module can determine the corresponding character or word based on the serial number. In this example, the text processing module determines that the corresponding character is "火". That is, by passing the multi-modal encoding information and the start symbol <s> through the Transformer Decoder 1501, the character "火" is obtained.

Still referring to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the character "火" together with the start symbol <s> is input into the Transformer Decoder 1501 as the Q vector. Optionally, the character "火" and the start symbol <s> pass through Output Embedding and Positional Encoding and are input into Transformer Decoder 1501a as the Q vector. Based on the multi-modal encoding information, the character "火", and the start symbol <s>, the Transformer Decoder 1501 outputs E_dout2. E_dout2 passes through the fully connected layer, obtaining E_out2, and E_out2 passes through the argmax layer, obtaining the corresponding value. Based on this value, the text processing module can determine the corresponding character, for example "山". That is, by passing the multi-modal encoding information, the character "火", and the start symbol <s> through the Transformer Decoder 1501, the character "山" is obtained. For details not described here, refer to the description above of obtaining the character "火"; they are not repeated.

Continuing to refer to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the characters "火" and "山" together with the start symbol <s>, after Output Embedding and Positional Encoding, are input into the Transformer Decoder 1501 as the Q vector. Based on these inputs, the Transformer Decoder 1501 outputs E_dout3, which passes through the fully connected layer, obtaining E_out3; E_out3 passes through the argmax layer, obtaining the corresponding value. Based on this value, the text processing module can determine the corresponding character, for example "爆". Thus the erroneous character "暴" in the OCR recognition result is corrected to "爆". For details not described here, refer to the description above of obtaining the character "火"; they are not repeated.

Continuing to refer to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the characters "火", "山", and "爆" together with the start symbol <s>, after Output Embedding and Positional Encoding, are input into the Transformer Decoder 1501 as the Q vector. Based on these inputs, the Transformer Decoder 1501 outputs E_dout4, which passes through the fully connected layer, obtaining E_out4; E_out4 passes through the argmax layer, obtaining the corresponding value. Based on this value, the text processing module can determine the corresponding character, for example "发". For details not described here, refer to the description above of obtaining the character "火"; they are not repeated.

Continuing to refer to Figure 16, the multi-modal encoding information serves as the K vector and V vector, and the characters "火", "山", "爆", and "发" together with the start symbol <s>, after Output Embedding and Positional Encoding, are input into the Transformer Decoder 1501 as the Q vector. Based on these inputs, the Transformer Decoder 1501 outputs E_dout5, which passes through the fully connected layer, obtaining E_out5; E_out5 passes through the argmax layer, obtaining the corresponding value. The text processing module determines that the output result is the terminator <end>, which ends the loop.
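As a hedged sketch of the autoregressive loop of Figures 16 and 17: at each step the tokens decoded so far (starting from <s>) form the Q-vector input, and the multi-modal encoding information supplies the K and V vectors. The vocabulary size, the <s>/<end> ids, and the hyper-parameters are illustrative assumptions, and the positional encoding of the decoded tokens is omitted to keep the sketch short:

```python
import torch
import torch.nn as nn

# Assumed sizes and token ids for illustration only.
embedding_size, n_vocab, max_len = 768, 100, 20
START, END = 0, 1                                    # <s> and <end> ids

output_embedding = nn.Embedding(n_vocab, embedding_size)
decoder_layer = nn.TransformerDecoderLayer(d_model=embedding_size, nhead=8,
                                           batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)  # Q = 2 stacked TDs
fc = nn.Linear(embedding_size, n_vocab)              # fully connected layer

E_IR = torch.randn(1, 24, embedding_size)            # K and V vectors (memory)
tokens = [START]                                     # start from the symbol <s>
for _ in range(max_len):
    tgt = output_embedding(torch.tensor([tokens]))   # Q-vector input so far
    E_dout = decoder(tgt=tgt, memory=E_IR)           # (1, len(tokens), embedding_size)
    E_out = fc(E_dout)                               # (1, len(tokens), n_vocab)
    next_id = E_out[0, -1].argmax().item()           # final position through argmax
    if next_id == END:                               # terminator <end> ends the loop
        break
    tokens.append(next_id)                           # feed back for the next step
```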
Exemplarily, after the text processing module detects that the classification result is b, i.e., the correct-then-output classification item, it can obtain the correction result output by the correction module, namely "火山爆发". The text processing module displays the obtained correction result in the recognition result.

It should be noted that the models involved in the embodiments of this application, including but not limited to the image encoding model, the text encoding model, the multi-modal encoding model, the classification model, and the correction model, may together form a text processing model, which can also be understood as a neural network. During training of the text processing model, the input data are mainly text images (containing both truncated and untruncated samples) and the corresponding recognized text content. Each text image input to the model and its corresponding text content form one pair of training samples. Each pair of training samples can be labeled manually, the labels being the three categories mentioned above; that is, the input text images and text content are classified through manual annotation. In particular, for cases that can be corrected, the text to be corrected is manually revised to obtain the corrected text, which serves as the supervision data for the output of the text correction decoder. Optionally, the training of the text processing model is supervised training: the loss function of the classification decoder (i.e., the classification model) is the categorical cross-entropy loss, while the text correction decoder (i.e., the correction model) is trained similarly to the native Transformer autoregressive decoder, using teacher forcing at each time step. Because the two decoders share the encoder (i.e., the backbone of the text processing model's neural network), the actual training process is joint training.
In one possible implementation, as described above, truncated text may also include horizontally truncated text and obliquely truncated text. It should be noted that for horizontally truncated text, for example where the first character of each line is truncated, the text recognition module can usually predict the text content during OCR-based recognition and thereby obtain the correct text. That is, horizontally truncated text generally may not suffer from the semantic errors of vertically truncated text described above. Correspondingly, when the solution of the embodiments of this application is applied to horizontally truncated text, it can likewise process it accordingly, and the processed result may differ little from the recognition result of the OCR technology. Obliquely truncated text is similar to horizontally truncated text: for text lines with a small oblique angle (for example, less than or equal to 10°), OCR technology can obtain the correct text content through prediction and similar means. That is, after processing by the solution of the embodiments of this application, the output differs little from the OCR recognition result. For text with a larger oblique angle (for example, greater than 10°), OCR technology may not be able to recognize the whole text area. For example, as shown in Figure 18a, assuming the text line is at an angle of 30°, the text area recognized during OCR text-area detection includes only the part shown within the dotted line. When the OCR technology then recognizes the text content of the detected text area, its prediction function allows it to output text content consistent with the original. It can also be understood that for text with a large oblique angle, the corresponding recognition result may not suffer from semantic errors.

It should be noted that the technical solution in the embodiments of this application can effectively solve the problem of semantic errors in the recognition results of partially occluded text. In the embodiments of this application, "partial occlusion" is optionally the occlusion of the upper part of all characters in an entire line of text, such as the scene in (1) of Figure 4 where the first line of text is occluded. In one example, "partial occlusion" is optionally the occlusion of the lower part of an entire line of text. For example, Figure 18b is a schematic diagram of one exemplary application scenario: the image to be recognized includes a text line whose lower part is cut off, and the text processing module can likewise process the OCR recognition result corresponding to that text line based on the solution described in the above embodiments. In another example, "partial occlusion" is optionally the occlusion of the upper part (or the lower part, or any part) of some of the characters in a line. For example, Figure 18c is a schematic diagram of another exemplary application scenario: part of the text of a line in the image to be recognized is occluded. That is, the original text is "多模态编码信息（中间表征信息）", and the portion "中间表征信息" is partially occluded. Optionally, when the text recognition module performs OCR recognition on this text line, it may obtain multiple text areas. For example, as shown in Figure 18d, the text recognition module may recognize the text area corresponding to "多模态编码信息" and the text area corresponding to the occluded "（中间表征信息）", together with the text content corresponding to the two text areas. The text processing module can then apply the processing solution of the embodiments of this application to the images of the two text areas and the corresponding text content. Optionally, when the text recognition module performs OCR recognition on the text line, it may also obtain a single text area; for example, as shown in Figure 18e, the text recognition module may assign the occluded part and the unoccluded part of the text to the same text area. The embodiments of this application can likewise process the image and text content of this kind of text area.

In other words, the technical solution of the embodiments of this application can be applied to a variety of scenes in which text is occluded, thereby meeting the needs of text recognition in different scenarios. Optionally, the embodiments of this application can effectively solve the text recognition problem for text lines with an occlusion rate of 20% to 50% (the range may also vary somewhat; this application does not limit it). It should be noted that, as stated above, if the occlusion rate of a text line is too high (for example, 80%), the corresponding text area may not be detected at the OCR stage; and if the occlusion rate is low, the OCR recognition result may be correct, in which case the text processing module can output the corresponding text content directly or after correction.
Figure 19 is a schematic flowchart of another text recognition method provided by an embodiment of this application. Referring to Figure 19, the method includes but is not limited to:

(1) The text processing module passes the text image through a classification model, obtaining a classification result.

(2) Based on the classification result, the text processing module determines whether the text content is truncated.

Exemplarily, the text processing module may preprocess the text image; for example, the preprocessing may be resizing the text image. For specific details, refer to the relevant content of the above embodiments, which is not repeated here.
Exemplarily, Figure 20 is a schematic diagram of exemplary text image processing. Referring to Figure 20 and still taking the text image 602a above as an example, the text processing module inputs the text image 602a (or the preprocessed text image) into the classification model. The classification model can classify the text image 602a and obtain a classification result.

Optionally, the training data used by the classification model in the training phase include but are not limited to text images corresponding to truncated text and text images corresponding to untruncated text.

Optionally, training of the classification model can be supervised with a cross-entropy loss function.

Optionally, the classification model may include but is not limited to mainstream classification networks based on convolutional neural networks (Convolutional Neural Network, CNN) (for example VGG, ResNet, EfficientNet, etc.), or the ViT (Vision Transformer) classification model based on the Transformer structure and its variants. Its main purpose is to output the probabilities of a binary classification problem, i.e., the scores corresponding to the truncated and non-truncated classification items.
Exemplarily, denoting the classification model CLS, the output result of the classification model (which may also be called the classification result) can be expressed as:

score = CLS(I)         (9)

where I denotes the text image; the text image has parameters in three dimensions: width, height, and number of channels. For the specific concepts, refer to the relevant content of Figure 10, which is not repeated here.

Optionally, the output score is a value between 0 and 1, where the closer the value is to 1, the higher the probability of truncation. The text processing module can set a truncation threshold, for example 0.5, which can be set according to actual needs and is not limited in this application. In one example, if the output score is greater than or equal to the truncation threshold (0.5), the text content corresponding to the text image is determined to be truncated text. In another example, if the output score is less than the truncation threshold (0.5), the text content corresponding to the text image is determined to be non-truncated text.
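A minimal sketch of formula (9) and the thresholding step. The tiny CNN below merely stands in for the VGG/ResNet/EfficientNet/ViT-style networks named above and is purely an illustrative assumption, as are the image dimensions:

```python
import torch
import torch.nn as nn

# A stand-in binary classifier; not any network specified by this application.
cls_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1), nn.Sigmoid(),
)

I = torch.randn(1, 3, 32, 256)  # text image: batch, channels, height, width
score = cls_model(I).item()     # formula (9): score = CLS(I), a value in (0, 1)
is_truncated = score >= 0.5     # compare against the truncation threshold
```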
(3) Output the text.

Exemplarily, if the text processing module determines that the text content corresponding to the text image is non-truncated text, it can directly output the corresponding text content, i.e., display it in the recognition result. For parts not described here, refer to the relevant content of the above embodiments, which is not repeated.

(4) The text processing module passes the text content through a semantic model, obtaining a semantic judgment result.

Exemplarily, if the text processing module determines that the text content corresponding to the text image is truncated text, the text processing module inputs the text content corresponding to the text image into the semantic model (which may also be called a semantic judgment module).

Exemplarily, Figure 21 shows the processing flow of the semantic model. Referring to Figure 21, the processing flow of the semantic model includes but is not limited to the following steps:
a. The text processing module segments the text content into words, obtaining a segmentation result.

Exemplarily, still taking the text content 602b of the above embodiment as an example, the text processing module (specifically, the semantic model) segments the text content 602b and obtains the corresponding serial-number sequence of the segmentation. For the specific steps of segmentation and of obtaining the text serial number sequence, refer to the relevant content of the above embodiments, which is not repeated here.

b. The text processing module passes the segmentation result through Word Embedding and Positional Encoding, obtaining E_text.

Exemplarily, the text processing module (specifically, the semantic model) passes the obtained text serial number sequence through Word Embedding and Positional Encoding, obtaining the text encoding information E_text. For specific details, refer to the description of Figure 12 in the above embodiments, which is not repeated here.

c. The text processing module passes E_text through an encoding module, obtaining F_text.

Exemplarily, by passing E_text through an encoding module (i.e., an encoder (Encoder)), the text processing module can obtain encoded information with high-dimensional semantic features, i.e., F_text. The encoding module includes but is not limited to: a CNN encoder, an RNN encoder, a BiRNN (bidirectional recurrent neural network) encoder (such as a bidirectional LSTM (Long Short-Term Memory network)), a Transformer Encoder, etc., which this application does not limit. For the processing flow of the encoder, refer to the relevant descriptions of Figures 14a and 14b, which are not repeated here; in this implementation, E_text takes the place of the multi-modal encoding information in Figures 14a and 14b.
For example, denoting the encoder as Encoder, F_text can be expressed as:
F_text = Encoder(E_text)       (10)
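Continuing the sketch, step c with a bidirectional LSTM encoder (one of the options listed above) might look as follows; the hidden size is an illustrative assumption:

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encodes E_text into F_text, i.e. F_text = Encoder(E_text), formula (10)."""
    def __init__(self, d_model: int = 256, hidden: int = 128):
        super().__init__()
        # Bidirectional LSTM: forward and backward states are concatenated,
        # so the output size is 2 * hidden (= d_model with these defaults).
        self.lstm = nn.LSTM(d_model, hidden, batch_first=True, bidirectional=True)

    def forward(self, e_text: torch.Tensor) -> torch.Tensor:
        f_text, _ = self.lstm(e_text)  # (batch, seq_len, 2 * hidden)
        return f_text

e_text = torch.randn(1, 4, 256)   # stands in for the E_text computed above
f_text = BiLSTMEncoder()(e_text)
```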
d. The text processing module passes F_text through the decoding module to obtain the output score score_t (which is the semantic judgment result).
For example, denoting the decoding module (i.e., the decoder) as Decoder, score_t can be expressed as:
score_t = Decoder(F_text)       (11)
Optionally, the decoding module includes but is not limited to: an MLP (i.e., fully connected) decoder, a CNN decoder, an RNN decoder and a Transformer decoder, which may be set according to actual needs; this application does not limit it. For the specific processing flow of the decoding module, refer to the relevant content of Figure 15, Figure 16 and Figure 17; it is not repeated here. Optionally, since the output result in this example, score_t, is the result of a binary classification problem (it can be understood as indicating that the semantics are coherent, or that they are incoherent), the decoder may omit the argmax layer. In other embodiments an argmax layer may also be included; this application does not limit it.
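Step d with an MLP decoder might be sketched as follows; the mean-pooling step and the layer sizes are assumptions, and, as noted above, a sigmoid output is used instead of an argmax layer for the binary coherence decision:

```python
import torch
import torch.nn as nn

class MLPDecoder(nn.Module):
    """Maps F_text to score_t in (0, 1), i.e. score_t = Decoder(F_text), formula (11)."""
    def __init__(self, d_model: int = 256, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f_text: torch.Tensor) -> torch.Tensor:
        pooled = f_text.mean(dim=1)  # mean-pool over the sequence positions
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)  # one score_t per line

f_text = torch.randn(1, 4, 256)  # stands in for the F_text computed above
score_t = MLPDecoder()(f_text)   # e.g. tensor([0.53])
```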
Illustratively, in this example the input of the semantic model is mainly a single line or string, and the output is a category (i.e., the semantically coherent type or the semantically incoherent type). During training, the semantic model collects corpus data, and each item is manually annotated to determine whether its semantics are coherent. Optionally, the semantic model can also obtain positive and negative training samples through data generation and other means.
Illustratively, similar to the classification model, the score_t output by the decoding module can be used to indicate semantic coherence. For example, score_t may be a value greater than 0 and less than 1, and the text processing module can set a semantic coherence threshold, for example 0.5, which may be set according to actual needs; this application does not limit it.
In one example, if the value of score_t is greater than or equal to the semantic coherence threshold (i.e., 0.5), the text processing module can determine that the corresponding text content is semantically coherent. That is to say, the OCR recognition result for the truncated text is correct; correspondingly, the text processing module can directly output the text content, i.e., display the corresponding text content in the text recognition result.
In another example, if the value of score_t is less than the semantic coherence threshold (i.e., 0.5), the text processing module can determine that the corresponding text content is semantically incoherent. That is to say, the OCR recognition result for the truncated text contains a semantic error, and the text processing module continues with step (5).
It should be noted that the way the semantic coherence model detects the semantic coherence of text in the embodiments of this application is only an illustrative example. In other embodiments, the text processing module can also detect semantic coherence in other ways. For example, it can be based on a grammatical error checking model: the grammatical error checking model outputs, from the input text content, a candidate set of grammatical error positions, and a threshold judgment is made on the ratio of the candidate set size to the total number of tokens (minimal semantic units). As another example, the text processing module can use a forward language model to obtain the probability of each token, and make the judgment against a preset threshold based on the average probability. For details, refer to the relevant content of prior-art embodiments; they are not repeated here.
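For instance, the forward-language-model variant mentioned above might be sketched as follows; the model name "gpt2" is a hypothetical stand-in, since the application does not specify which language model is used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_logprob(text: str, model_name: str = "gpt2") -> float:
    """Average per-token log-probability under a forward (causal) language model.
    Higher (closer to 0) suggests more coherent text; a preset threshold decides."""
    tok = AutoTokenizer.from_pretrained(model_name)
    lm = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                     # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Log-probability the model assigns to each actual next token.
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()
```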
(5) The text processing module determines whether the text content can be corrected.
In this embodiment of the application, the text processing module can continue, based on the result output by the semantic model, to further determine whether the text content can be corrected. For example, the text processing module can set a correction threshold, for example 0.2, which may be set according to actual needs; this application does not limit it.
In one example, if the value of score_t is greater than or equal to the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content is correctable, and the text processing module can correct the text content and then output it. For example, the text processing module can use the text content as the input of the correction module and perform the correction through the correction module; for the processing flow of the correction module, refer to the relevant content of Figure 15, Figure 16 and Figure 17, which is not repeated here.
In another example, if the value of score_t is less than the correction threshold (i.e., 0.2), the text processing module can determine that the corresponding text content is not correctable; the text processing module then filters out that text content, i.e., the corresponding text content is not displayed in the text recognition result.
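Taken together, the decision logic of steps (4) and (5) can be summarized in the following sketch; the 0.5 and 0.2 thresholds are the example values given above, and `correct` stands for the correction module:

```python
from typing import Callable, Optional

COHERENCE_THRESHOLD = 0.5   # example value from step (4)
CORRECTION_THRESHOLD = 0.2  # example value from step (5)

def handle_truncated_text(text: str, score_t: float,
                          correct: Callable[[str], str]) -> Optional[str]:
    """Returns the text to display in the recognition result, or None to filter it."""
    if score_t >= COHERENCE_THRESHOLD:
        return text            # coherent: output the OCR result as-is
    if score_t >= CORRECTION_THRESHOLD:
        return correct(text)   # incoherent but correctable: output corrected text
    return None                # not correctable: filter out of the result
```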
It should be noted that the way of determining correctability in step (5), i.e., detection based on the semantic coherence output, is only an illustrative example. In other embodiments, the text processing module can also detect whether the text content is correctable based on other detection methods. For example, as described above, the text processing module can perform the semantic coherence judgment with a grammatical error checking model; it can then further determine, based on that model's output, the number of grammatical errors or the proportion of characters involved in grammatical errors, and judge from that proportion whether the text content is correctable. As another example, as described above, the semantic coherence judgment can compute an average probability from a forward language model, and the text processing module can judge from that average probability (for example, by setting a corresponding correction threshold) whether the text content is to be corrected.
It should further be noted that, in addition to correcting the text content based on the correction method described in the embodiments of this application (which can also be understood as a neural machine translation method), the text processing module can adopt other correction methods. For example, based on the output of the grammatical error checking model, the text content can be corrected through confusion-set recall and candidate ranking. As another example, the text processing module can, based on the output of the grammatical error checking model, call a statistical language model, a neural language model, or BERT's bidirectional language model to obtain a confusion set for each error position, and then recall the corrected text through candidate ranking and error-screening mechanisms. For specific implementations, refer to the relevant content of prior-art embodiments; they are not repeated here.
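The following is a minimal sketch of the confusion-set recall and candidate ranking approach described above; the confusion sets, the error positions and the scoring function are all assumed inputs (the scoring function could be, for example, the mean_token_logprob sketch given earlier):

```python
from typing import Callable, Dict, List

def correct_by_confusion_set(
    tokens: List[str],
    error_positions: List[int],            # flagged by a grammar-error-checking model
    confusion_sets: Dict[str, List[str]],  # token -> plausible replacements
    score: Callable[[str], float],         # sentence scorer, e.g. a language model
) -> str:
    """Tries single-token substitutions at each flagged position and keeps the
    highest-scoring sentence (confusion-set recall followed by candidate ranking)."""
    best = "".join(tokens)
    best_score = score(best)
    for pos in error_positions:
        for cand in confusion_sets.get(tokens[pos], []):
            trial = "".join(tokens[:pos] + [cand] + tokens[pos + 1:])
            s = score(trial)
            if s > best_score:
                best, best_score = trial, s
    return best
```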
It should further be noted that the models involved in the steps of Figure 19 can form a neural network; for the training method of this neural network, refer to the descriptions of neural network training in the foregoing embodiments, which are not repeated here.
In one example, Figure 22 shows a schematic block diagram of an apparatus 2200 according to an embodiment of this application. The apparatus 2200 may include: a processor 2201 and a transceiver/transceiver pin 2202, and optionally a memory 2203.
The components of the apparatus 2200 are coupled together through a bus 2204, where the bus 2204 includes, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are all referred to as the bus 2204 in the figure.
Optionally, the memory 2203 may be used to store the instructions of the foregoing method embodiments. The processor 2201 may be configured to execute the instructions in the memory 2203, to control the receive pin to receive signals, and to control the transmit pin to send signals.
The apparatus 2200 may be the electronic device in the foregoing method embodiments, or a chip of that electronic device.
All relevant content of the steps involved in the foregoing method embodiments can be cited in the functional descriptions of the corresponding functional modules, and is not repeated here.
This embodiment further provides a computer storage medium storing computer instructions which, when run on an electronic device, cause the electronic device to perform the above related method steps to implement the method in the foregoing embodiments.
This embodiment further provides a computer program product which, when run on a computer, causes the computer to perform the above related steps to implement the method in the foregoing embodiments.
In addition, an embodiment of this application further provides an apparatus, which may specifically be a chip, a component or a module; the apparatus may include a processor and a memory connected to each other, where the memory is configured to store computer-executable instructions. When the apparatus runs, the processor can execute the computer-executable instructions stored in the memory, so that the chip performs the methods in the foregoing method embodiments.
The electronic device, computer storage medium, computer program product or chip provided in this embodiment is used to perform the corresponding method provided above. Therefore, for the beneficial effects it can achieve, refer to the beneficial effects of the corresponding method provided above; they are not repeated here.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of this application may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a general-purpose or special-purpose computer.
The term "and/or" in this document merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A exists alone, both A and B exist, or B exists alone.
The terms "first" and "second" in the specification and claims of the embodiments of this application are used to distinguish different objects, not to describe a particular order of objects. For example, a first target object and a second target object are used to distinguish different target objects, not to describe a particular order of target objects.
In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, illustration or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as more preferred or advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the relevant concept in a concrete manner.
In the description of the embodiments of this application, unless otherwise specified, "a plurality of" means two or more. For example, a plurality of processing units means two or more processing units, and a plurality of systems means two or more systems.
The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the specific implementations described above. The specific implementations described above are merely illustrative, not restrictive. Inspired by this application, a person of ordinary skill in the art can devise many other forms without departing from the purpose of this application and the scope protected by the claims, all of which fall within the protection of this application.

Claims (27)

  1. A text recognition method, characterized by comprising:
    an electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, the first text area including text content;
    the electronic device performs text content recognition on the first text area to obtain first text content;
    the electronic device performs classification based on the image of the first text area and the first text content to obtain a classification result;
    the electronic device displays, based on the classification result, a text recognition result of the first text area;
    wherein the electronic device displaying, based on the classification result, the text recognition result of the first text area comprises:
    if the classification result is a first classification, the text recognition result filters out the first text content; if the classification result is a second classification, the text recognition result includes text content obtained after the first text content is corrected; and if the classification result is a third classification, the text recognition result includes the first text content.
  2. The method according to claim 1, wherein the electronic device performing classification based on the image of the first text area and the first text content to obtain the classification result comprises:
    the electronic device obtains intermediate representation information based on the image of the first text area and the first text content;
    the electronic device classifies the intermediate representation information to obtain the classification result.
  3. The method according to claim 2, wherein the electronic device classifying the intermediate representation information to obtain the classification result comprises:
    the electronic device classifies the intermediate representation information through a classification model to obtain the classification result.
  4. The method according to claim 3, wherein before the electronic device displays, based on the classification result, the text recognition result of the first text area, the method further comprises:
    the electronic device corrects the intermediate representation information to obtain the text content obtained after the first text content is corrected.
  5. The method according to claim 4, wherein the electronic device correcting the intermediate representation information to obtain the corrected target text content comprises:
    the electronic device corrects the intermediate representation information through a correction model to obtain the text content obtained after the first text content is corrected.
  6. The method according to claim 5, wherein the electronic device obtaining the intermediate representation information based on the image of the first text area and the first text content comprises:
    the electronic device performs image encoding on the image of the first text area to obtain first image encoding information;
    the electronic device performs text encoding on the first text content to obtain first text encoding information;
    the electronic device performs multi-modal encoding on the first image encoding information and the first text encoding information through a multi-modal encoding model to obtain the intermediate representation information.
  7. The method according to claim 6, wherein the multi-modal encoding model, the classification model and the correction model form a neural network, and training data of the neural network includes a second text area and second text content corresponding to the second text area, and a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  8. The method according to claim 1, wherein the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area further includes text content corresponding to a third text area in the object to be recognized.
  9. The method according to claim 1, wherein, if the first text area includes partially missing text content, the text recognition result is the first classification or the second classification.
  10. The method according to claim 9, wherein semantics expressed by the first text content are different from semantics expressed by the text content in the first text area.
  11. The method according to any one of claims 1 to 10, wherein the object to be recognized is a picture, a web page or a document.
  12. A text recognition method, characterized by comprising:
    an electronic device performs text area detection on an object to be recognized to obtain an image of a first text area, the first text area including text content;
    the electronic device performs text content recognition on the first text area to obtain first text content;
    the electronic device displays, based on the image of the first text area and the first text content, a text recognition result of the first text area;
    wherein the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    if the image of the first text area indicates that the first text area includes partially missing text content and the first text content is semantically coherent text content, or the image of the first text area indicates that the first text area does not include partially missing text content, the text recognition result includes the first text content; if the image of the first text area indicates that the first text area includes partially missing text content and the first text content includes semantically erroneous text content, the text recognition result filters out the first text content or the text recognition result includes text content obtained after the first text content is corrected.
  13. The method according to claim 12, wherein the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    if the image of the first text area indicates that the first text area includes partially missing text content, and the first text content includes semantically incoherent text content, the electronic device detects whether the first text content can be corrected;
    if the first text content cannot be corrected, the text recognition result filters out the first text content;
    if the first text content can be corrected, the text recognition result includes text content obtained after the first text content is corrected.
  14. The method according to claim 13, wherein, if the first text content can be corrected, the method further comprises:
    the electronic device corrects the first text content through a correction model to obtain the text content obtained after the first text content is corrected.
  15. The method according to claim 14, wherein the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    the electronic device classifies the image of the first text area through a classification model to obtain a classification result, the classification result being used to indicate whether the first text area includes partially missing text content.
  16. The method according to claim 15, wherein, if the image of the first text area indicates that the first text area includes partially missing text content, the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    the electronic device performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result, the semantic analysis result being used to indicate whether the first text content includes semantically erroneous text content.
  17. The method according to claim 16, wherein the semantic analysis result is further used to indicate whether the first text content can be corrected, and the electronic device displaying, based on the image of the first text area and the first text content, the text recognition result of the first text area comprises:
    the electronic device determines, based on the semantic analysis result, whether the first text content can be corrected.
  18. The method according to claim 17, wherein the correction model, the classification model and the semantic model form a neural network, and training data of the neural network includes a second text area and second text content corresponding to the second text area, and a third text area and third text content corresponding to the third text area; the second text area includes partially missing text content, and the text content in the third text area is complete text content.
  19. The method according to claim 12, wherein the text recognition result of the first text area is displayed in a text recognition area, and the text recognition area further includes text content corresponding to a third text area in the object to be recognized.
  20. The method according to claim 9, wherein semantics expressed by the semantically erroneous text content are different from semantics expressed by the corresponding text content in the first text area.
  21. The method according to any one of claims 12 to 20, wherein the object to be recognized is a picture, a web page or a document.
  22. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    and one or more computer programs, wherein the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method according to any one of claims 1 to 11.
  23. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    and one or more computer programs, wherein the one or more computer programs are stored in the memory, and when the computer programs are executed by the one or more processors, the electronic device is caused to perform the method according to any one of claims 12 to 21.
  24. A computer-readable storage medium, characterized by comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method according to any one of claims 1 to 11.
  25. A computer-readable storage medium, characterized by comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method according to any one of claims 12 to 21.
  26. A computer program product, characterized by comprising a computer program which, when executed by an electronic device, causes the electronic device to perform the method according to any one of claims 1 to 11.
  27. A computer program product, characterized by comprising a computer program which, when executed by an electronic device, causes the electronic device to perform the method according to any one of claims 12 to 21.
PCT/CN2023/096921 2022-05-30 2023-05-29 Text recognition method and electronic device WO2023231987A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210597895.6 2022-05-30
CN202210597895.6A CN117197811A (en) 2022-05-30 2022-05-30 Text recognition method and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023231987A1 true WO2023231987A1 (en) 2023-12-07

Family

ID=88987403

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/096921 WO2023231987A1 (en) 2022-05-30 2023-05-29 Text recognition method and electronic device

Country Status (2)

Country Link
CN (1) CN117197811A (en)
WO (1) WO2023231987A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117478435A (en) * 2023-12-28 2024-01-30 中汽智联技术有限公司 Whole vehicle information security attack path generation method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269135A1 (en) * 2014-03-19 2015-09-24 Qualcomm Incorporated Language identification for text in an object image
CN110059694A (en) * 2019-04-19 2019-07-26 山东大学 The intelligent identification Method of lteral data under power industry complex scene
CN111090991A (en) * 2019-12-25 2020-05-01 北京百度网讯科技有限公司 Scene error correction method and device, electronic equipment and storage medium
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN113128494A (en) * 2019-12-30 2021-07-16 华为技术有限公司 Method, device and system for recognizing text in image
CN114140782A (en) * 2021-11-26 2022-03-04 北京奇艺世纪科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114419646A (en) * 2022-01-17 2022-04-29 马上消费金融股份有限公司 Image classification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117197811A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
WO2022142014A1 (en) Multi-modal information fusion-based text classification method, and related device thereof
US11024058B2 (en) Encoding and decoding a stylized custom graphic
US10140549B2 (en) Scalable image matching
US9721156B2 (en) Gift card recognition using a camera
US10354199B2 (en) Transductive adaptation of classifiers without source data
US9436883B2 (en) Collaborative text detection and recognition
CN111465918B (en) Method for displaying service information in preview interface and electronic equipment
US11893767B2 (en) Text recognition method and apparatus
US20240031644A1 (en) Video playback device and control method thereof
WO2023231987A1 (en) Text recognition method and electronic device
CN111242273B (en) Neural network model training method and electronic equipment
KR102122561B1 (en) Method for recognizing characters on document images
WO2022227218A1 (en) Drug name recognition method and apparatus, and computer device and storage medium
CN111754414B (en) Image processing method and device for image processing
WO2024103775A1 (en) Answer generation method and apparatus, and storage medium
CN114154467B (en) Structure picture restoration method, device, electronic equipment, medium and program product
CN116994169A (en) Label prediction method, label prediction device, computer equipment and storage medium
US20210073458A1 (en) Comic data display system, method, and program
CN115691486A (en) Voice instruction execution method, electronic device and medium
US12124696B2 (en) Electronic device and method to provide sticker based on content input
US20220326846A1 (en) Electronic device and method to provide sticker based on content input
US20240031655A1 (en) Video Playback Method, Terminal Device, Apparatus, System, and Storage Medium
WO2024187949A1 (en) Image description generation method and electronic device
US20240046616A1 (en) Apparatus and method for classifying immoral images using deep learning technology
US20230156317A1 (en) Electronic device for obtaining image at user-intended moment and method for controlling the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23815166

Country of ref document: EP

Kind code of ref document: A1