CN116978030A - Text information recognition method and training method of text information recognition model - Google Patents
Text information recognition method and training method of text information recognition model
- Publication number
- CN116978030A CN116978030A CN202310823027.XA CN202310823027A CN116978030A CN 116978030 A CN116978030 A CN 116978030A CN 202310823027 A CN202310823027 A CN 202310823027A CN 116978030 A CN116978030 A CN 116978030A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- target
- feature
- identified
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
Abstract
The application relates to a text information recognition method, apparatus, computer device, storage medium, and computer program product based on artificial intelligence technology. The method comprises the following steps: performing visual encoding processing on an image to be recognized to obtain a target feature map of the image, wherein the target feature map comprises a plurality of visual features; acquiring a question text about the image to be recognized, and performing question decoding processing on the question text based on the plurality of visual features to obtain text features of the question text; determining the degree of correlation between each of the plurality of visual features and the text features; and screening target visual features from the plurality of visual features based on the degrees of correlation, and performing decoding based on the target visual features so as to recognize, from the image, the text information corresponding to the question indicated by the question text. By adopting the method, the recognition accuracy of the text information can be improved.
Description
Technical Field
The application relates to the technical field of image processing, in particular to a text information recognition method and a training method of a text information recognition model.
Background
OCR (Optical Character Recognition) refers to the process by which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text using a character recognition method. In an end-to-end scenario, it is often necessary to output the text content of a specific field rather than a raw text recognition result.
In the conventional approach, an input image is first processed by a detection network to obtain a character recognition result, which is then analyzed through semantic understanding so that the text content of specific fields can be extracted from it.
However, text content extracted in this manner is not only affected by the accuracy of the image recognition; poor semantic understanding further reduces its accuracy.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text information recognition method, apparatus, computer device, computer-readable storage medium, and computer program product, as well as a training method, apparatus, computer device, computer-readable storage medium, and computer program product for a text information recognition model, all capable of improving recognition accuracy.
In one aspect, the application provides a text information recognition method. The method comprises the following steps:
performing visual encoding processing on an image to be recognized to obtain a target feature map of the image to be recognized, wherein the target feature map comprises a plurality of visual features;
acquiring a target question text about the image to be recognized, and performing question decoding processing on the target question text based on the plurality of visual features to obtain text features of the target question text;
determining a degree of correlation between each of the plurality of visual features and the text features;
screening target visual features from the plurality of visual features based on the degrees of correlation between the plurality of visual features and the text features;
and performing decoding processing based on the target visual features so as to recognize, from the image to be recognized, the text information corresponding to the question indicated by the target question text.
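The correlation and screening steps above can be sketched in a few lines. The claims do not fix a particular correlation measure, so the cosine similarity and top-k selection below are assumptions made purely for illustration, not the patented method itself:

```python
import numpy as np

def screen_visual_features(visual_feats, text_feat, k=3):
    """Screen the k visual features most correlated with the text feature.

    visual_feats: (N, D) array, one row per visual feature in the target
    feature map; text_feat: (D,) text feature of the target question text.
    Cosine similarity stands in for the unspecified degree of correlation.
    """
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = v @ t                       # correlation of each visual feature
    top = np.argsort(scores)[::-1][:k]   # indices of the k best matches
    return visual_feats[top], scores[top]

# Toy example: 5 visual features of dimension 4; the question feature is
# deliberately constructed to lie close to visual feature 2.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))
question_feat = feats[2] + 0.01 * rng.normal(size=4)
selected, scores = screen_visual_features(feats, question_feat, k=2)
```

The target visual features returned here would then be passed to the final decoding step to generate the answer text.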
In some embodiments, the method further comprises: acquiring an original question text; determining a field to be recognized in the image to be recognized; and obtaining the target question text about the image to be recognized based on the original question text when the original question text is associated with the field to be recognized.
In some embodiments, obtaining the target question text about the image to be recognized based on the original question text, in the case where the original question text is associated with the field to be recognized, includes: converting the original question text into a preset standardized question text based on the field to be recognized; and taking the standardized question text as the target question text about the image to be recognized.
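The conversion to a preset standardized question text can be illustrated with a small mapping table. The field names, keywords, and question strings below are hypothetical examples invented for this sketch; the patent does not specify them:

```python
# Hypothetical preset standardized question texts, keyed by field.
STANDARD_QUESTIONS = {
    "payee_name": "What is the payee name?",
    "amount": "What is the transaction amount?",
}

# Hypothetical keywords used to decide whether an original question text
# is associated with a field to be recognized.
FIELD_KEYWORDS = {
    "payee_name": ("payee", "receiver"),
    "amount": ("amount", "pay", "cost"),
}

def to_target_question(original_question, fields_to_recognize):
    """Return the standardized question for the first field the original
    question text is associated with, or None when no field matches."""
    text = original_question.lower()
    for field in fields_to_recognize:
        if any(kw in text for kw in FIELD_KEYWORDS.get(field, ())):
            return STANDARD_QUESTIONS[field]
    return None
```

For instance, `to_target_question("How much did I pay?", ["payee_name", "amount"])` yields the standardized amount question, which is then used as the target question text.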
In some embodiments, performing the question decoding processing on the target question text to obtain the text features of the target question text includes: encoding the target question text to obtain an embedded vector of the target question text; and decoding the embedded vector of the target question text based on the plurality of visual features to obtain the text features of the target question text.
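One plausible realization of decoding the embedded vector against the visual features is single-head cross-attention. This is an assumption for illustration, since the embodiment above does not fix the interaction mechanism:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_decode(question_embedding, visual_feats):
    """Decode a question embedding against the visual features.

    The question embedding attends over the visual features, and the
    attention-weighted sum of those features is taken as the text feature,
    so the resulting text feature is grounded in the image content.
    """
    d = visual_feats.shape[1]
    attn = softmax(visual_feats @ question_embedding / np.sqrt(d))
    return attn @ visual_feats

rng = np.random.default_rng(1)
visual = rng.normal(size=(6, 8))   # 6 visual features from the feature map
q_embed = rng.normal(size=8)       # embedded vector of the target question text
text_feat = question_decode(q_embed, visual)
```

Because the attention weights are non-negative and sum to one, the text feature is a convex combination of the visual features, which is one way to realize the association between question text and image regions described above.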
In another aspect, the application further provides a text information recognition apparatus. The apparatus comprises:
a processing module, configured to perform visual encoding processing on an image to be recognized to obtain a target feature map of the image to be recognized, wherein the target feature map comprises a plurality of visual features;
an acquisition module, configured to acquire a target question text about the image to be recognized, and perform question decoding processing on the target question text based on the plurality of visual features to obtain text features of the target question text;
a determining module, configured to determine a degree of correlation between each of the plurality of visual features and the text features;
a screening module, configured to screen target visual features from the plurality of visual features based on the degrees of correlation between the plurality of visual features and the text features;
the processing module being further configured to perform decoding processing based on the target visual features so as to recognize, from the image to be recognized, the text information corresponding to the question indicated by the target question text.
In another aspect, the application further provides a computer device. The computer device comprises a memory storing a computer program and a processor that implements the steps of the above text information recognition method when executing the computer program.
In another aspect, the application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above text information recognition method.
In another aspect, the application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above text information recognition method.
With the text information recognition method, apparatus, computer device, storage medium, and computer program product described above, visual encoding processing is performed on the image to be recognized to obtain its target feature map, and question decoding processing is performed on the target question text in combination with the plurality of visual features included in that feature map to obtain the text features of the target question text. An association is thereby established between the target question text and the visual features of the image, mapping out the image regions in which the answer to the question indicated by the target question text may appear. The degree of correlation between each visual feature and the text features is then determined, the target visual features are screened out accordingly, and decoding processing is performed on the target visual features to recognize, from the image, the text information corresponding to the question in the target question text. The answer to the question text is thus extracted directly from image features, realizing end-to-end recognition of image text information: no separate modeling is needed for the intermediate process, which improves recognition efficiency, while processing the visual features jointly with the text features further improves recognition accuracy.
In another aspect, the application provides a training method for the text information recognition model. The method comprises the following steps:
acquiring pre-training sample pairs, wherein the pre-training sample pairs comprise at least one of a first sample pair consisting of a sample initial image and a sample text position, a second sample pair consisting of a first sample text and a second sample text, and a third sample pair consisting of a sample segmentation image and a third sample text;
pre-training a first local network in the text information recognition model based on the first sample pair, pre-training a second local network in the text information recognition model based on the second sample pair, and pre-training a third local network in the text information recognition model based on the third sample pair, to obtain an initial text information recognition model;
acquiring a target sample pair, wherein the target sample pair comprises a target sample image, a target question text, and a target question answer;
and training the initial text information recognition model based on the target sample pair to obtain a trained text information recognition model, wherein the trained text information recognition model is used for performing text information recognition on an image to be recognized, so as to recognize, from the image, the text information corresponding to the question in the question text matched with the image.
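The two-stage scheme above (pre-train, then fine-tune on the target sample pairs) can be demonstrated on a toy model. The linear model and mean-squared-error loss below are stand-ins chosen so the example is self-contained; the actual model is a neural text information recognition model:

```python
import numpy as np

def train(w, xs, ys, lr=0.1, steps=200):
    """Plain gradient descent on mean squared error for a linear model."""
    for _ in range(steps):
        grad = 2 * xs.T @ (xs @ w - ys) / len(xs)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Stage 1: pre-training on plentiful generic pre-training sample pairs.
x_pre = rng.normal(size=(200, 2))
y_pre = x_pre @ true_w + 0.1 * rng.normal(size=200)
w = train(np.zeros(2), x_pre, y_pre)

# Stage 2: fine-tuning on a small set of target sample pairs for one
# service scenario, starting from the pre-trained weights, not from scratch.
x_tgt = rng.normal(size=(20, 2))
y_tgt = x_tgt @ true_w + 0.05 * rng.normal(size=20)
w = train(w, x_tgt, y_tgt, lr=0.05, steps=100)

mse = float(np.mean((x_tgt @ w - y_tgt) ** 2))
```

Only the cheap second stage needs to be repeated per service scenario, which is the deployment-cost saving the application emphasizes.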
In another aspect, the application further provides a training apparatus for the text information recognition model. The apparatus comprises:
an acquisition module, configured to acquire pre-training sample pairs, wherein the pre-training sample pairs comprise at least one of a first sample pair consisting of a sample initial image and a sample text position, a second sample pair consisting of a first sample text and a second sample text, and a third sample pair consisting of a sample segmentation image and a third sample text;
a pre-training module, configured to pre-train a first local network in the text information recognition model based on the first sample pair, pre-train a second local network in the text information recognition model based on the second sample pair, and pre-train a third local network in the text information recognition model based on the third sample pair, to obtain an initial text information recognition model;
the acquisition module being further configured to acquire a target sample pair, wherein the target sample pair comprises a target sample image, a target question text, and a target question answer;
a training module, configured to train the initial text information recognition model based on the target sample pair to obtain a trained text information recognition model, wherein the trained text information recognition model is used for performing text information recognition on an image to be recognized, so as to recognize, from the image, the text information corresponding to the question in the question text matched with the image.
In another aspect, the application further provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program and the processor implements the steps of the above training method of the text information recognition model when executing the computer program.
In another aspect, the application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above training method of the text information recognition model.
In another aspect, the application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above training method of the text information recognition model.
With the training method, apparatus, computer device, storage medium, and computer program product for the text information recognition model described above, the text information recognition model is first pre-trained on the constructed pre-training sample pairs so that it acquires basic text understanding capability, and is then fine-tuned for a given service scenario: a target sample pair corresponding to that scenario is obtained and the model is trained on it to yield the trained text information recognition model. Because the model can be fine-tuned separately for different service scenarios, the cost of deploying and training the model is greatly reduced, the consumption of computing resources is lowered, and model deployment efficiency is improved.
Drawings
FIG. 1 is an application environment diagram of a text information recognition method in some embodiments;
FIG. 2 is a schematic diagram of an application scenario for text information recognition in some embodiments;
- FIG. 3 is a flow chart of a text information recognition method in some embodiments;
- FIG. 4 is a schematic diagram of the effect of recognizing question text in an image in some embodiments;
FIG. 5 is a schematic diagram of a visual encoding process in some embodiments;
- FIG. 6A is a schematic diagram of a question decoding process in some embodiments;
- FIG. 6B is a schematic diagram of a question decoding process in other embodiments;
FIG. 7 is a schematic diagram of the effect of text recognition of an image in some embodiments;
- FIG. 8A is a schematic diagram of feature transformation of text features in some embodiments;
- FIG. 8B is a schematic diagram of feature transformation of text features in other embodiments;
FIG. 9 is a schematic diagram of determining an image region based on primary visual features in some embodiments;
FIG. 10 is a flow diagram of text information recognition in some embodiments;
FIG. 11 is a flow diagram of a training method for a text information recognition model in some embodiments;
- FIG. 12 is a block diagram of a text information recognition apparatus in some embodiments;
- FIG. 13 is a block diagram of a training apparatus for the text information recognition model in some embodiments;
FIG. 14 is an internal block diagram of a computer device in some embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
When OCR technology is applied to scenarios where text information is recognized from images, the OCR-based detection results generally need to be reprocessed before the text information the user actually wants can be output. This involves a number of intermediate tasks, such as image recognition, Chinese word segmentation, and syntactic analysis. The user, however, cares only about the final result: for a transaction receipt, for example, the user wants OCR recognition to directly output text information such as the payer name, payee name, and transaction amount.
Common OCR technology extracts the characters contained in images through image recognition and outputs the text information of specific fields after semantic understanding of those characters. On the one hand, several neural network models must be run in sequence, so recognition efficiency is low; on the other hand, the accuracy of the final text information depends on every intermediate task: if the OCR recognition is wrong, all subsequent processing is built on the erroneous recognition result, and accuracy drops sharply.
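The dependence on every intermediate task compounds multiplicatively. Assuming, for illustration, that each stage succeeds independently with probability p, a cascade of n stages succeeds with probability p**n:

```python
# Three pipeline stages (detection, character recognition, semantic
# understanding), each 90% accurate and assumed independent: the cascade
# is correct on only about 73% of inputs, which motivates a single
# end-to-end model with no intermediate hand-offs.
stage_accuracies = [0.90, 0.90, 0.90]
pipeline_accuracy = 1.0
for p in stage_accuracies:
    pipeline_accuracy *= p
print(round(pipeline_accuracy, 3))  # 0.729
```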
In view of this, an embodiment of the present application provides an end-to-end text information recognition method, which performs visual encoding processing on the image to be recognized and directly outputs the specific text information through decoding processing. Based on end-to-end training, intermediate tasks such as text detection, character recognition, and feature extraction are fused into a single model, so no dedicated modeling of the intermediate process is needed; this effectively reduces optimization cost, and the interaction among the intermediate tasks can further improve the model's recognition effect. Moreover, since the text information is generated directly from image features, recognition efficiency is also greatly increased. In addition, the method is unaffected by OCR errors, does not depend on the accuracy of OCR detection and recognition results, is easy to implement, can handle complex usage scenarios, and is suitable for various business types.
The text information recognition method provided by the embodiments of the present application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104. The terminal 102 and the server 104 may be connected directly or indirectly through wired or wireless communication, which is not limited herein. The data storage system may store data that the server 104 needs to process, and may be integrated on the server 104 or placed on the cloud or another server.
In some embodiments, the terminal 102 or the server 104 acquires an image to be recognized and performs visual encoding processing on it to obtain a target feature map of the image. The terminal 102 or the server 104 acquires a target question text about the image and performs question decoding processing on the target question text based on the plurality of visual features included in the target feature map, obtaining text features of the target question text. The terminal 102 or the server 104 then determines the degree of correlation between each of the plurality of visual features and the text features, screens target visual features from the plurality of visual features based on those degrees of correlation, and performs decoding processing based on the target visual features, so as to recognize, from the image, the text information corresponding to the question in the target question text.
In other embodiments, the terminal 102 acquires the image to be recognized and/or the target question text about it and sends them to the server 104. The server 104 performs visual encoding processing on the image to obtain its target feature map, and performs question decoding processing on the target question text based on the plurality of visual features included in the target feature map to obtain text features of the target question text. Accordingly, the server 104 determines the degree of correlation between each of the plurality of visual features and the text features, screens target visual features from the plurality of visual features based on those degrees of correlation, and performs decoding processing based on the target visual features to recognize, from the image, the text information corresponding to the question in the target question text. The server 104 may then return this text information to the terminal 102.
The terminal 102 may be, but not limited to, one or more of various desktop computers, notebook computers, smart phones, tablet computers, internet of things devices, portable wearable devices, etc., and the internet of things devices may be one or more of smart speakers, smart televisions, smart air conditioners, or smart vehicle devices, etc. The portable wearable device may be one or more of a smart watch, a smart bracelet, or a headset device, etc.
The server 104 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data, and artificial intelligence platforms.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) technologies generally include technologies such as image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning, map construction, and the like, and common biometric technologies such as face recognition, fingerprint recognition, and the like.
Among them, Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence, and generally includes techniques such as text processing, semantic understanding, machine translation, question answering, and knowledge graphs.
With the research and advancement of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robots, smart healthcare, and smart customer service. It is believed that, as technology develops, artificial intelligence will be applied in more fields and become increasingly valuable.
In some embodiments, the terminal may be loaded with an application (APP) that supports OCR, including one or more of a conventional application that needs to be installed separately and an applet that can be used without downloading and installation, such as a browser client, a web page client, or a software client.
In some embodiments, the terminal may call a hardware device (e.g., a camera device) through an application program to capture, scan, or record an image, or download an image from a network through the application program, or read a locally stored image, or receive an image transmitted by a server, etc., and perform text information recognition on the image to recognize specific text information from the image.
For example, as shown in fig. 2, the terminal may perform text information recognition on an electronic transaction certificate or transfer record generated by the application program, thereby obtaining text information contained therein, such as a consumption amount, a commodity number, a payment time, and the like.
In some embodiments, the terminal may initiate a service call to the server, which runs a related business process for text information identification, and returns the identified text information to the terminal.
The scheme provided by the embodiments of the present application relates to artificial intelligence technologies such as text information recognition, and is specifically described in the following embodiments. As shown in fig. 3, a text information recognition method is provided, which can be applied to a terminal or a server, or performed cooperatively by the terminal and the server. The following description takes as an example the method being applied to a computer device, which may be a terminal or a server. The method comprises the following steps:
step S302, performing visual coding processing on an image to be identified to obtain a target feature map of the image to be identified, wherein the target feature map comprises a plurality of visual features.
The image to be identified refers to an image containing text information, and the purpose of text information identification on the image to be identified is to extract the required text information from the image to be identified.
The image to be identified can be obtained by shooting or scanning a physical entity, for example, a computer device shoots or scans a document such as a bill, a certificate, etc., so as to obtain the image to be identified.
In some scenarios, after the computer device performs operations such as online shopping, transaction transfer and the like through the application program, the application program generates electronic transaction credentials or transfer records, and the computer device can obtain an image to be identified through screenshot, saving and the like. In other words, the image to be identified may be obtained by capturing or storing the electronic entity.
Generally, the text information contained in the image to be identified differs across service scenarios, as do the fields to which the text information to be extracted belongs. For example, in a business scenario in which text information is recognized from a business license, the image to be identified is a business license image, and the field to which the text information to be extracted belongs is typically one or more of credit code, business name, legal representative, business scope, or business term.
For another example, in a business scenario of identifying text information of a document (such as a consumption document, a payment document, a remittance document, etc.), the image to be identified is a document image, and the field to which the text information to be extracted belongs is usually one or more fields of a transaction date, a remittance party name, a remittance party account number, a payee party name, a payee party account number, or a remittance amount.
For another example, in a business scenario of identifying text information of a ticket (for example, a train ticket or an air ticket), the image to be identified is a ticket image, and the field to which the text information to be extracted belongs is usually one or more of departure place, destination, travel date, or train number/flight number.
It should be understood that the above service scenarios and types of images to be identified are merely examples; in a specific application scenario, they may be adjusted appropriately according to the actual situation. For example, the image to be identified may also be a card (such as an identity card, a driving license, or a passport), a certificate, an express waybill, a test paper, or a diagnostic report, and the applicable service scenario may also be text information recognition in traffic, medical, or education scenarios. It should be clear to a person skilled in the art that reasonable variations and appropriate adjustments of the above service scenarios and types of images to be identified fall within the scope of the present application.
The visual coding process refers to a process of performing a series of image processes such as feature extraction on an image to be identified to obtain a feature representation of the image to be identified. The Feature representation of the image to be identified is typically represented by a Feature Map (Feature Map).
The computer device performing visual coding processing on the image to be identified means performing feature extraction on the image to be identified. In some embodiments, the computer device performs the visual coding processing through a visual encoder (Vision Encoder). The visual encoder comprises at least a backbone network (Backbone) for feature extraction. Backbone networks include, but are not limited to, ViT (Vision Transformer), Swin Transformer (a Transformer-based deep learning network), or CNN (Convolutional Neural Network), etc.
In some embodiments, the computer device may perform feature extraction on the image to be identified by convolution. Specifically, the computer device performs convolution processing on the image to be identified, thereby obtaining a feature map of the image. Each convolution layer can efficiently convolve the image to be identified through a plurality of convolution kernels, each of which convolves a local region in the image; the mapping of image features output by each convolution kernel then constitutes the feature map output by the corresponding convolution layer. In some embodiments, the computer device treats each image feature in the feature map as one of the extracted plurality of visual features. For example, the computer device may perform feature extraction on the image to be identified through a CNN, in which case the extracted feature map is the target feature map of the image to be identified, and the image features in the feature map are the visual features.
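As a minimal NumPy sketch of this convolutional feature extraction (the toy image and hand-picked kernels are illustrative assumptions; a real backbone uses many learned kernels), each kernel convolves local regions of the image, and stacking the per-kernel outputs yields a feature map:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid-mode 2D convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each kernel convolves a local region of the image
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv_layer(image, kernels):
    """Stack per-kernel outputs into a (C, H', W') feature map."""
    return np.stack([conv2d_single(image, k) for k in kernels])

image = np.arange(36, dtype=float).reshape(6, 6)     # toy "image to be identified"
kernels = [np.ones((3, 3)) / 9.0,                     # averaging kernel
           np.array([[1, 0, -1]] * 3, dtype=float)]   # vertical-edge kernel
feature_map = conv_layer(image, kernels)
print(feature_map.shape)  # (2, 4, 4): two kernels -> two feature channels
```

Each position in `feature_map` plays the role of one visual feature mapped to a local image region.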
In some embodiments, the computer device performs a visual encoding process on the image to be identified, including performing a resizing process, a pooling process, an upsampling process, or a downsampling process on the image to be identified in addition to performing feature extraction on the image to be identified.
In other embodiments, the computer device performs visual coding processing on the image to be identified as follows: the image to be identified is segmented into a plurality of image slices (patches), each serving as a visual token, i.e., a visual feature. The computer device applies linear embedding and feature transformation to each image slice through multiple network layers, merges adjacent image slices at each network layer to reduce the feature dimension, and outputs the merged image slices through the last network layer; the merged image slices serve as the finally extracted visual features that form a feature map. For example, the computer device may perform feature extraction on the image to be identified through a Swin Transformer, where the plurality of visual features output by the Transformer blocks in the Swin Transformer structure constitute the target feature map of the image to be identified. To improve recognition accuracy, the computer device may further upsample the feature map composed of the visual features to increase its resolution, and use the upsampled feature map as the target feature map.
Illustratively, the computer device resizes the image to be identified to a size of W×H to fit the backbone network input, where W is the width and H is the height. The computer device then performs feature extraction on the resized image through the backbone network, which outputs a feature map. The computer device pools the feature map to a fixed size; for example, it sets a downsampling multiple n and applies average pooling to adjust the feature map to a width of W/n and a height of H/n. To make the subsequent localization of the answer to the question in the image more precise, the computer device may further upsample the feature map to increase its resolution, thereby improving the accuracy of subsequent text information recognition.
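The average-pooling step above can be sketched as follows; the feature-map sizes and the downsampling multiple n=2 are illustrative assumptions, not values from the patent:

```python
import numpy as np

def avg_pool(fmap, n):
    """Average-pool an (H, W, C) feature map by a downsampling multiple n,
    yielding an (H/n, W/n, C) map as described above."""
    h, w, c = fmap.shape
    assert h % n == 0 and w % n == 0, "H and W must be divisible by n"
    # group each n x n block of positions, then average within each block
    return fmap.reshape(h // n, n, w // n, n, c).mean(axis=(1, 3))

H, W, C = 8, 8, 4                 # hypothetical backbone feature-map size
fmap = np.random.rand(H, W, C)
pooled = avg_pool(fmap, n=2)
print(pooled.shape)               # (4, 4, 4): width W/n, height H/n
```

The reshape groups every n×n spatial block together so that the mean over the two block axes implements average pooling without explicit loops.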
Step S304, a target question text about the image to be identified is obtained, and question decoding processing is carried out on the target question text based on a plurality of visual characteristics, so that text characteristics of the target question text are obtained.
The target question text refers to a question text (query) to be subjected to a question decoding process. The question text (query) describes a question posed to text information to be acquired. In other words, the text information that is output by text information recognition of the image to be recognized is an answer corresponding to the question indicated by the question text.
The question text may be text organized in the form of sentences, such as question sentences or declarative sentences, or text organized in the form of phrases or keywords, or one or more text sequences extracted from arbitrarily shaped text. For example, if the image to be identified is a travel ticket containing text information for fields such as departure place, destination, travel date, or train number/flight number, the question text may be a question sentence such as "where is the departure place" or "what is the travel date", or a phrase such as "origin station" or "train number". Accordingly, the text information output by text information recognition of the image to be identified may be the answer corresponding to the question indicated by the question text, for example, "City A", "January 1", etc.
In some embodiments, the question text may also be one or more text sequences query1, query2, …, queryN obtained after recognition and detection are performed on an image. As shown in fig. 4, the image may include arbitrarily shaped text information, and the computer device may output the detected text sequences after recognizing the image.
The image used when recognition is performed to obtain the question text is not the same image as the image to be identified.
In some embodiments, the target question text needs to be adapted to the image to be identified, so that the answer corresponding to the question indicated by the question text can be found in the text information contained in the image to be identified. For example, if the image to be identified is a money transfer document but the question text describes a question about the text information in a transit card, the corresponding answer cannot be identified from the image to be identified.
In other embodiments, in the case that the target question text does not fit the image to be identified, the computer device may also perform text information identification, and since the corresponding answer cannot be identified from the image to be identified, the computer device may output preset text information to indicate that there is no answer corresponding to the question indicated by the target question text. For example, in the case where the target question text does not fit the image to be recognized, the computer device outputs preset text information such as "no related information found", "no query result exists", and the like.
The target question text being adapted to the image to be identified means that the target question text at least contains a field to which the text information contained in the image to be identified belongs, or that the semantics of the target question text are the same as or similar to the semantics of such a field. For example, if the image to be identified is a travel ticket containing text information for fields such as departure place, destination, travel date, or train number/flight number, the target question text may be "where is the departure place", which contains the field "departure place" to which text information in the image belongs. Alternatively, the target question text may be phrased differently while having semantics that are the same as or similar to "where is the departure place". Semantic sameness or similarity can be characterized by text similarity: when the similarity is higher than a preset threshold, the semantics are considered the same or similar.
In some embodiments, the computer device obtains target question text regarding the image to be identified, which may be one or more of obtaining question text entered by a user by voice or the like, or obtaining preset fixed question text or the like.
In an actual application scenario, under the condition that the type of the image to be identified is fixed, a field to which text information to be extracted in the image to be identified belongs is usually fixed, and then the computer equipment can preset a fixed target problem text so as to extract the text information of a preset field from the image to be identified. For example, the image to be identified is an express bill, and the computer device may preset the fixed target question text to be "express bill number", "mail address", "contact phone", etc. Illustratively, the computer device obtains the target question text for the image to be identified by invoking the question text preset within the application.
The question decoding process refers to a process of performing a series of text processing, such as feature extraction, on the target question text to obtain a feature representation of the target question text. The feature representation of the question text is the text feature. In order to characterize in which image areas of the image the answer corresponding to the question indicated by the target question text may be located, the computer device further combines the visual features of the image to be identified during the question decoding processing of the target question text, so that an association relationship is established between the target question text and the visual features of the image to be identified.
In some embodiments, the computer device performs the question decoding processing on the question text based on the plurality of visual features to obtain the text features of the question text as follows: the computer device decodes the target question text and introduces the plurality of visual features for position prediction, so that an association relationship is established between the text features and the visual features; the image area in the image to be identified in which the answer corresponding to the question indicated by the target question text is located can then be determined from this association relationship.
Step S306, determining correlations between the plurality of visual features and the text features, respectively.
The correlation represents the correlation between the text feature and the visual feature, and further can represent the correlation between the image area mapped by the visual feature in the image to be identified and the text feature. The stronger the correlation between the text feature and the visual feature, the more relevant the text feature and the text information contained in the image area is explained.
Specifically, the computer device determines a degree of correlation between a plurality of visual features and text features, respectively, comprising: and carrying out feature operation on each visual feature based on the visual feature and the text feature to obtain the correlation degree between the visual feature and the text feature. Wherein the characteristic operation includes, but is not limited to, one or more of an addition and subtraction operation, a dot multiplication operation, a cross multiplication operation, or the like.
In the case where there are a plurality of question texts, each of the question texts having a respective text feature, the computer apparatus determines a degree of correlation between the plurality of visual features and the text features, respectively, comprising: and for each visual feature, carrying out feature operation on the visual feature and each text feature respectively to obtain the correlation degree between the visual feature and each text feature.
Illustratively, the computer device takes the product of the visual feature ε_pixel and the text feature ε_mask as the correlation ε_score between the two: ε_score = ε_pixel · ε_mask. Wherein sw represents the width of the up-sampled feature map, sh represents the height of the up-sampled feature map, and N is the number of question texts, so that ε_score contains one score for each of the N question texts at each of the sw × sh feature positions.
In some embodiments, the computer device determines a degree of correlation between the plurality of visual features and the text feature, respectively, comprising: determining a correlation score corresponding to each of the plurality of visual features based on the product of each of the plurality of visual features and the text feature; and obtaining the correlation degree between the plurality of visual features and the text features respectively based on the correlation scores corresponding to the plurality of visual features.
In particular, the computer device multiplies each of the plurality of visual features with the text feature to obtain a plurality of products, each product represented as a vector. The computer device then converts each product into a correlation score by linearly activating it, where the linear activation may be performed through an activation function. Illustratively, the computer device inputs the product to a fully connected layer to output the correlation score.
Further, the computer device obtains a degree of correlation between the plurality of visual features and the text feature, respectively, based on the respective corresponding correlation scores of the plurality of visual features. In some embodiments, the computer device uses the respective relevance scores of the plurality of visual features as a relevance between the respective ones of the plurality of visual features and the text feature. Alternatively, the computer device may normalize the respective correlation scores of the plurality of visual features, or adjust the respective correlation scores to a uniform percentage, so as to obtain correlations between the plurality of visual features and the text feature.
In the above embodiment, the correlation degree is determined based on the product of the visual feature and the text feature, and the correlation between the text feature and an image area in the image to be identified can be measured by the magnitude of the correlation score. Thus, when text information is subsequently identified, determining the image area to be identified based on the correlation can improve the accuracy of text information recognition.
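A minimal NumPy sketch of this product-based correlation; random arrays stand in for real visual and text features, and the learned fully connected layer is omitted so the raw dot product serves as the score (all sizes are illustrative assumptions):

```python
import numpy as np

def correlation_scores(visual, text):
    """visual: (sw*sh, C) visual features; text: (N, C) text features.
    The dot product of each visual feature with each text feature gives
    one correlation score per (position, question) pair."""
    return visual @ text.T        # shape (sw*sh, N)

sw, sh, C, N = 4, 3, 8, 2
visual = np.random.rand(sw * sh, C)   # stand-in for features from the feature map
text = np.random.rand(N, C)           # stand-in for N question-text features
scores = correlation_scores(visual, text)
# one normalization option mentioned in the text: scale each question's
# score map to a uniform range
norm = (scores - scores.min(0)) / (scores.max(0) - scores.min(0) + 1e-9)
print(scores.shape)  # (12, 2): one score per visual feature per question text
```

Each column of `scores` can be reshaped back to (sh, sw) to visualize which image regions correlate with a given question.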
Step S308, a target visual feature is screened from the plurality of visual features based on the correlation degree between the plurality of visual features and the text feature.
The target visual feature refers to a visual feature that mainly contributes to identifying text information corresponding to a problem indicated by a problem text in an image to be identified, and may also be referred to as a main visual feature. The target visual feature may be considered as a visual feature corresponding to an image area in the image to be recognized that contains text information corresponding to the question indicated by the question text.
Accordingly, the visual features corresponding to other image areas in the image to be identified may be considered secondary visual features, which are typically global features or visual features corresponding to image areas with low relevance to the text information corresponding to the question indicated by the question text.
Specifically, the computer device screens out, as the target visual feature, the visual features satisfying the correlation threshold condition from among the plurality of visual features based on correlations between the plurality of visual features and the text feature, respectively. Wherein the relevance threshold condition includes, but is not limited to, one or more of a threshold number of visual features, a threshold score, and the like.
Illustratively, the computer device sorts the plurality of visual features by relevance, and selects the first K visual features in order from large to small, and takes the K visual features as target visual features. As another example, the computer device selects as the target visual feature a visual feature having a correlation score greater than a predetermined threshold value of the correlation score.
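The top-K screening described above can be sketched as follows; the toy one-dimensional visual features and hand-picked relevance values are illustrative assumptions:

```python
import numpy as np

def screen_topk(visual, relevance, k):
    """Sort visual features by relevance from large to small and keep the
    first K as target visual features, as in the example above."""
    order = np.argsort(relevance)[::-1][:k]   # indices of the K largest
    return visual[order], order

visual = np.array([[0.1], [0.9], [0.4], [0.7]])   # four visual features
relevance = np.array([0.05, 0.92, 0.30, 0.81])    # relevance to a text feature
target, idx = screen_topk(visual, relevance, k=2)
print(idx.tolist())  # [1, 3]: the two most relevant visual features
```

The score-threshold variant mentioned in the text would instead keep `visual[relevance > threshold]`.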
In step S310, decoding processing is performed based on the target visual feature to identify text information corresponding to the question indicated by the question text from the image to be identified.
The decoding process based on the target visual features refers to a process of performing text recognition on the target visual features and decoding a text recognition result into a text sequence. In some embodiments, the computer device performs character matching based on the target visual features for text recognition to obtain image text recognition results. Decoding the image text recognition result by the computer device through a text decoder to convert the image text recognition result into a text sequence (token sequence); the text sequence constitutes readable text information, which is text information corresponding to a question indicated by the question text, as an answer to the question indicated by the question text.
In some embodiments, the computer device decodes the image text recognition results through a text decoder including, but not limited to, one or more of BERT (Bidirectional Encoder Representations from Transformers), a Transformer, or an LLM (Large Language Model). The text decoder may decode the corresponding text sequence for OCR results in tasks such as VQA (Visual Question Answering), information extraction, and text recognition.
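As an illustrative sketch only (the patent does not specify the decoder internals, and the vocabulary and logits below are hypothetical), a text decoder ultimately maps per-step scores over a token vocabulary to a token sequence, which forms the readable answer; a toy greedy version:

```python
import numpy as np

vocab = ["<eos>", "A", "city", "1", "month"]   # hypothetical token vocabulary

def greedy_decode(step_logits):
    """Pick the highest-scoring token at each step, stopping at <eos>;
    the resulting token sequence forms the readable answer text."""
    tokens = []
    for logits in step_logits:
        tid = int(np.argmax(logits))
        if tid == 0:               # <eos> token ends the sequence
            break
        tokens.append(vocab[tid])
    return " ".join(tokens)

step_logits = np.array([[0.1, 2.0, 0.3, 0.1, 0.2],   # -> "A"
                        [0.1, 0.2, 3.0, 0.1, 0.2],   # -> "city"
                        [5.0, 0.1, 0.2, 0.3, 0.1]])  # -> <eos>, stop
print(greedy_decode(step_logits))  # A city
```

Real decoders such as BERT-style or Transformer decoders produce these per-step scores autoregressively from the target visual features; only the final logits-to-tokens step is shown here.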
In the above text information recognition method, the target feature map of the image to be identified is obtained by performing visual coding processing on the image, and the question text about the image is subjected to question decoding processing in combination with the plurality of visual features in the target feature map to obtain the text features of the question text. An association relationship is thereby established between the question text and the visual features of the image, characterizing the image areas to which the answer to the question indicated by the question text may map. Then, by determining the correlation between each visual feature and the text feature, target visual features are screened out based on the correlation, characterizing the image areas to which the answer most probably maps. Finally, decoding processing is performed based on the target visual features, and the text information corresponding to the question in the question text is identified from the image to be identified. The answer corresponding to the question text is thus extracted directly from image features, realizing end-to-end image text information recognition, avoiding separate modeling of intermediate processes, and improving recognition efficiency; processing the visual features and text features jointly further improves recognition accuracy.
In some embodiments, performing visual encoding processing on an image to be identified to obtain a target feature map of the image to be identified, including: extracting features of an image to be identified to obtain an initial low-resolution feature map; performing up-sampling treatment on the initial low-resolution feature map to obtain an initial high-resolution feature map; and obtaining a target feature map of the image to be identified based on the initial high-resolution feature map.
Image resolution (PPI) refers to the number of Pixels Per Inch of an image, and is typically used to measure the amount of information stored in an image. In general, the higher the resolution of an image, the higher the sharpness perceived from a visual effect; the lower the image resolution, the more blurred the visual effect.
The low-resolution feature map refers to a feature map with image resolution lower than a preset resolution. The preset resolution may be set according to the actual situation, so as to be used as a reference standard for distinguishing the feature map from the low resolution or the high resolution. Accordingly, the high-resolution feature map refers to a feature map in which the image resolution is not lower than a preset resolution.
Specifically, the computer device performs feature extraction on the image to be identified to obtain an initial low-resolution feature map, for example through a backbone network such as a CNN. In order to improve resolution and thereby the accuracy of subsequent recognition, the computer device performs upsampling processing on the initial low-resolution feature map to improve its resolution, resulting in an initial high-resolution feature map. The computer device can then obtain the target feature map of the image to be identified based on the initial high-resolution feature map, wherein the target feature map comprises a plurality of visual features.
In some embodiments, the computer device obtains a target feature map of the image to be identified based on the initial high resolution feature map, comprising: the initial high resolution feature map is taken as a target feature map. In other embodiments, the computer device obtains a target feature map of the image to be identified based on the initial high resolution feature map, comprising: and carrying out post-processing based on the image features in the initial high-resolution feature map, and taking the feature map obtained by the post-processing as a target feature map. Among them, post-processing includes, but is not limited to, encoding processing, feature transformation processing, and the like.
Illustratively, as shown in fig. 5, the computer device performs feature extraction on the image 401 to be identified through the backbone network to obtain an initial low-resolution feature map F0, and performs upsampling processing on F0 to obtain an initial high-resolution feature map F1; based on the initial high-resolution feature map F1, a plurality of visual features ε_pixel of the image to be identified are obtained.
In the above embodiment, the resolution of the feature map is improved by performing the upsampling process on the feature map obtained by extracting the features of the image to be identified, so that the accuracy of text information identification on the basis of the visual features obtained based on the high-resolution feature map can be higher.
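A nearest-neighbour stand-in for the upsampling step described above (real systems typically use bilinear or learned upsampling; the map sizes and factor are illustrative assumptions):

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of an (H, W, C) feature map: each
    feature value is repeated factor x factor times spatially, raising
    the map's resolution."""
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)

low_res = np.random.rand(4, 4, 8)        # initial low-resolution map F0
high_res = upsample_nearest(low_res, 2)  # initial high-resolution map F1
print(high_res.shape)  # (8, 8, 8)
```

Every 2×2 block of the output shares the value of one input position, doubling the spatial resolution without changing the channel count.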
In order to enable the visual features in the image to be identified to have the position information in the image, in some embodiments, the visual features of the image to be identified can be additionally subjected to position coding, so that the features in the feature map carry the position information, and the accuracy of subsequent identification is improved. To this end, in some embodiments, the initial high resolution feature map includes a plurality of initial image features, and accordingly, the computer device obtains a target feature map of the image to be identified based on the initial high resolution feature map, including: respectively determining the respective position information of a plurality of initial image features included in the initial high-resolution feature map; and for each initial image feature, carrying out position coding on the initial image feature based on the position information of the initial image feature, and obtaining a target feature map of the image to be identified.
Specifically, the computer device first determines the respective position information of the plurality of initial image features included in the initial high-resolution feature map. The position information of an image feature may be represented by the region coordinates of the image region in the image to be identified to which the image feature maps. The region coordinates are, for example, one or more of the coordinates of the upper-left corner, the coordinates of the lower-right corner, or the coordinates of both the upper-left and lower-right corners of the image region.
Illustratively, to enhance the correlation between image features belonging to the same image region, the region coordinates of one image region are represented by at least two coordinates, for example by (x0, x1, w0) and (y0, y1, h0). Where (x0, y0) and (x1, y1) are the upper-left and lower-right coordinates of the image region respectively, w0 is the width of the image region, and h0 is its height. By representing the region coordinates of an image region with at least two coordinates, the position information of the image region in the image to be identified and the word-order information of the text information contained in the image region can be represented simultaneously.
Furthermore, the computer device performs position coding on each targeted initial image feature based on the position information of that initial image feature, thereby obtaining the target feature map of the image to be identified. In some embodiments, the computer device may position-encode the initial image features through a position encoding function to obtain a plurality of visual features carrying the position information.
For any initial image feature, the computer device obtains the position information of the initial image feature, and performs position coding on the initial image feature through a position embedding function to obtain a visual feature carrying the position information, and a specific formula can be expressed as follows:
X_i = PosEmb2Dx(x0, x1, w0)
Y_i = PosEmb2Dy(y0, y1, h0)
where PosEmb2Dx and PosEmb2Dy are position embedding functions, and X_i and Y_i are the position-encoded visual features carrying the position information, i.e., the coordinate information of the image region to which the feature maps in the image to be identified.
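The patent does not give the internals of PosEmb2Dx and PosEmb2Dy; the following is a minimal Python sketch, assuming each is a learnable lookup table indexed by the discretised coordinates (the table sizes, feature dimension, and the additive combination of edge and size embeddings are illustrative assumptions, not the patented implementation):

```python
import numpy as np

def make_embedding_table(num_positions, dim, seed):
    # Hypothetical learnable lookup table, randomly initialised here.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_positions, dim)).astype(np.float32)

ALPHA, DIM = 1000, 16  # coordinates assumed discretised to [0, ALPHA]
table_x = make_embedding_table(ALPHA + 1, DIM, seed=0)
table_y = make_embedding_table(ALPHA + 1, DIM, seed=1)

def pos_emb_2d_x(x0, x1, w0):
    # X_i = PosEmb2Dx(x0, x1, w0): sum of the embeddings for the left
    # edge, right edge, and width of the image region (an assumption).
    return table_x[x0] + table_x[x1] + table_x[w0]

def pos_emb_2d_y(y0, y1, h0):
    return table_y[y0] + table_y[y1] + table_y[h0]

def position_encode(feature, region):
    # Add the x- and y-direction position embeddings to the image feature.
    x0, y0, x1, y1 = region
    return feature + pos_emb_2d_x(x0, x1, x1 - x0) + pos_emb_2d_y(y0, y1, y1 - y0)
```

In this sketch, every initial image feature receives the embedding of its region's edges and size, so features mapping to nearby regions receive correlated position codes.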
In the above embodiment, the initial image features included in the initial high-resolution feature map are subjected to position coding, so that the subsequent visual features carry position information, that is, the layout of the image region corresponding to each visual feature within the image to be identified, thereby further improving the accuracy of subsequent identification.
In some embodiments, determining respective location information for each of a plurality of image features included in the initial high resolution feature map includes: for each initial image feature, respectively determining an image area of the initial image feature mapped in the image to be identified; determining size information and vertex coordinate information of an image area in an image to be identified; position information of the targeted image feature is determined based on the size information and the vertex coordinate information.
Specifically, for each of the plurality of initial image features included in the initial high-resolution feature map, the computer device determines the image region in the image to be identified to which that initial image feature maps. In general, when extracting features from an image to be identified, the computer device performs convolution and other processing on different image regions of the image, thereby obtaining a feature map. The computer device can therefore determine, in turn, the image region in the image to be identified to which each image feature in the feature map maps.
In some embodiments, the computer device normalizes and discretizes the region coordinates of each image region, so that the region coordinates are unified to integers within the range [0, α], where α is a preset value, typically related to the scale of the image to be identified. The position information is then marked by the vertex coordinates and the width and height of each image region. In this way, the size information and vertex coordinate information of each image region in the image to be identified can be obtained. The size information refers to the size of the image region, such as its width and height. The vertex coordinate information refers to the coordinates of one or more of the upper left corner, upper right corner, lower left corner, lower right corner, and the like.
Illustratively, the computer device represents the region coordinates of one image region by (x0, x1, w0) and (y0, y1, h0), where (x0, y0) and (x1, y1) are the upper-left and lower-right coordinates of the image region, respectively, w0 is the width of the image region, and h0 is the height of the image region.
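The normalisation and discretisation step above can be sketched as follows; the quantisation scale `alpha` and the rounding scheme are assumptions, since the source only states that coordinates are unified to integers in [0, α]:

```python
def discretize_coords(region, img_w, img_h, alpha=1000):
    """Normalise pixel coordinates relative to the image size, then
    discretise them to integers in [0, alpha]."""
    x0, y0, x1, y1 = region
    x0, x1 = round(x0 / img_w * alpha), round(x1 / img_w * alpha)
    y0, y1 = round(y0 / img_h * alpha), round(y1 / img_h * alpha)
    # Represent the region as (x0, x1, w0) and (y0, y1, h0).
    return (x0, x1, x1 - x0), (y0, y1, y1 - y0)
```

For example, the left half of a 1280x480 image maps to ((0, 500, 500), (0, 1000, 1000)) with the default scale, regardless of the image's pixel resolution.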
The computer device then performs position coding on the image feature corresponding to the image region according to the size information and vertex coordinate information of that image region in the image to be identified, thereby obtaining the position information of the targeted image feature.
In the above embodiment, by performing position coding on the image features included in the feature map, the subsequent visual features carry position information, that is, the layout condition of the image area corresponding to each image feature in the image to be identified, so that the accuracy of subsequent identification can be further improved.
As stated earlier, in some embodiments, the computer device obtains question text regarding an image to be identified, meaning that question text is obtained that is adapted to the image to be identified. Thus, in the case where the question text is adapted to the image to be recognized, the computer device can recognize an answer corresponding to the question indicated by the question text in the text information contained in the image to be recognized. To this end, in some embodiments, obtaining question text about an image to be identified includes: acquiring an original problem text; determining a field to be identified in the image to be identified; in case the original question text is associated with the field to be identified, the question text about the image to be identified is derived based on the original question text.
In particular, the computer device obtains the original question text, which may be any form of text, such as one or more sentences, one or more phrases, or a sequence of words identified from an image, or the like.
Further, the computer device determines the field to be identified in the image to be identified according to the acquired image. The field to be identified refers to a field to which the text information in the image to be identified belongs. For example, if the image to be identified is a travel ticket containing text information such as the departure place, destination, travel date, or train/flight number, the field to be identified may be one or more of the fields to which that text information belongs, such as "departure place", "date", or "train number".
Thus, in case the original question text is associated with the field to be identified, the computer device obtains the question text regarding the image to be identified based on the original question text. In some embodiments, the computer device obtains question text about the image to be identified based on the original question text, comprising: the computer device takes this original question text as question text for the image to be identified. In other embodiments, the computer device obtains question text about the image to be identified based on the original question text, including: the computer device performs text conversion based on the original question text, and takes the converted text as the question text about the image to be recognized.
In the case where the original question text is not associated with the field to be identified, such as where the field to be identified is a field in the traffic domain and the original question text corresponds to a field in the financial domain, the computer device does not treat the original question text as the question text for the image to be identified.
In the above embodiment, by determining whether the original question text is associated with the image to be identified, and further determining the question text about the image to be identified if there is an association relationship, accuracy of text information corresponding to the question indicated by the question text can be ensured.
In some scenarios, the type of the image to be identified is a preset type, for example, the application program provides an identification function for the document, and the type of the image to be identified is an image of the document type. Accordingly, the text information to be extracted in the image to be recognized is also usually text information of a fixed field. For example, for an image to be identified as an express bill, text information in the image to be identified is usually text information of fields such as "recipient address", "sender address", or "contact phone".
In some cases, after the original question text is obtained, the computer device may convert the original question text into a preset field, and use the preset field as the question text about the image to be identified, so as to further improve the accuracy of identification.
To this end, in some embodiments, where the original question text is associated with the field to be identified, deriving the question text for the image to be identified based on the original question text includes: under the condition that the original question text is associated with the field to be identified, converting the original question text into a preset standardized question text based on the field to be identified; the standardized question text is taken as the question text about the image to be recognized.
Specifically, in the case where the original question text is associated with the field to be identified, the computer device converts the original question text into a preset standardized question text based on the field to be identified. The preset standardized question text may be a preset field, a preset fixed question text, or the like. For example, the computer device converts the original question text "departure date" into the preset fixed question text "what is the departure date", or converts the original question text "what day does the train depart" into the preset field "departure date". Further, the computer device may treat the standardized question text as the question text about the image to be identified.
In the above embodiment, the original problem text is converted to the standardized problem text, so that the semantic interference caused by the problem text form is greatly reduced, and the accuracy of text information recognition is improved.
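A minimal sketch of the conversion from original question text to standardized question text; the field keyword lists and canonical questions below are hypothetical placeholders, since the source does not specify how association with a field is tested:

```python
# Hypothetical mapping from a field to its preset standardized question text.
STANDARD_QUESTIONS = {
    "departure date": "what is the departure date",
    "train number": "what is the train number",
}

# Hypothetical keyword lists linking original question texts to fields.
FIELD_KEYWORDS = {
    "departure date": {"date", "departure", "day"},
    "train number": {"train", "number"},
}

def standardize_question(original_question, fields_to_identify):
    """Return the standardized question text if the original question is
    associated with a field to be identified, otherwise None."""
    words = set(original_question.lower().split())
    for field in fields_to_identify:
        if words & FIELD_KEYWORDS.get(field, set()):
            # Associated with a field to be identified: use the preset text.
            return STANDARD_QUESTIONS[field]
    return None  # not associated: do not use it as the question text
```

Returning None here corresponds to the case where the original question text is not associated with any field to be identified and is therefore not used.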
In some embodiments, the computer device performs question decoding processing on the target question text to obtain text features of the target question text, including: encoding the target question text to obtain an embedded vector of the target question text; and decoding the embedded vector of the target question text based on the plurality of visual features to obtain the text features of the target question text. Specifically, the computer device encodes the target question text to obtain its embedded vector, which is a feature vector obtained by sequentially arranging the feature representations (embeddings) corresponding to each word in the target question text. The computer device then decodes the embedded vector of the target question text based on the extracted visual features, thereby obtaining the text features of the target question text. In some embodiments, the computer device determines the visual feature corresponding to each word in the target question text, for example by matching embedded vectors and visual features in the same order, and inputs the embedded vector of a given word together with its corresponding visual feature into the decoder for decoding, so as to obtain the text feature corresponding to that word. Further, the computer device determines the text features of the target question text based on the text features corresponding to each word in the target question text.
Illustratively, as shown in fig. 6A, the computer device inputs the target question text (query) into the question decoder for question decoding processing, so as to obtain the corresponding text feature Q. In the case where there are multiple question texts, as shown in fig. 6B, for question text 1, question text 2, ..., question text n, where any one of them may be the target question text, the computer device performs question decoding processing through the question decoder, so as to obtain the corresponding text features Q1, Q2, ..., Qn.
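A toy sketch of the question decoding step, assuming a single cross-attention operation in which each question-token embedding attends to the visual features; the real question decoder's architecture is not specified by the source, and the embedding table and pooling here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 100, 8
embed_table = rng.standard_normal((VOCAB, DIM))

def embed_question(token_ids):
    # Embedded vector of the question text: per-token embeddings in order.
    return embed_table[token_ids]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_decode(question_emb, visual_feats):
    # One cross-attention step: each question token attends to the visual
    # features, and the results are pooled into the text feature Q.
    attn = softmax(question_emb @ visual_feats.T / np.sqrt(DIM))
    token_feats = question_emb + attn @ visual_feats  # residual connection
    return token_feats.mean(axis=0)                   # pooled text feature
```

The cross-attention is what lets the resulting text feature reflect the visual content, as described above for decoding the embedded vector based on the plurality of visual features.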
Typically, a user may wish to obtain text information for multiple fields from an image to be identified at once. Accordingly, the computer device obtains a plurality of target question texts. To this end, in some embodiments, the method further comprises: acquiring a plurality of question texts, and determining a target question text to be subjected to question decoding processing from among the plurality of question texts. The computer device denotes the embedded vector of the target question text to be decoded as Q_i, and the embedded vector of each question text other than the target question text as Q_j, where i ≠ j.
Accordingly, the computer device decodes the embedded vector of the target question text to obtain the text feature of the target question text, including: and decoding the embedded vectors of the target question text based on the embedded vectors corresponding to the question text except the target question text in the plurality of question texts to obtain the text characteristics of the target question text.
Specifically, for the obtained plurality of question texts, the computer device respectively performs coding processing on each question text, so as to obtain respective embedded vectors corresponding to the respective question texts. The computer device then determines a target question text from the plurality of question texts that is currently to be subject to the question decoding process. And for the target problem text to be subjected to problem decoding processing currently, the computer equipment performs decoding processing according to the multiple visual characteristics, the embedded vectors corresponding to the target problem text and the embedded vectors corresponding to the problem text except the target problem text, so as to obtain the text characteristics of the target problem text.
Illustratively, the computer device performs the decoding processing through multiple decoding layers. For the target question text i currently to be subjected to question decoding processing, decoding is performed based on the plurality of visual features F, the embedded vector Q_i corresponding to the target question text, and the embedded vectors Q_j corresponding to the question texts other than the target question text, yielding the text feature h_i of the target question text, which can be expressed as:
h_i = DecodingLayer(F, Q_i, {Q_j | j ≠ i}, h_{i-1})
where h_{i-1} denotes the text feature output by the previous decoding layer.
In the above embodiment, by acquiring a plurality of question texts and considering, during question decoding, the influence of the other question texts on the target question text currently being decoded, the accuracy of interpreting the question text can be improved, further improving the accuracy of text information identification.

The image text information recognition task differs from the usual image recognition task in that a large amount of text information is usually concentrated in a very small area of the image, typically less than 5% of the entire image. A large number of irrelevant regions are therefore included among the visual features of the image to be identified, which increases the difficulty of decoding and reduces the decoding speed.
To this end, in some embodiments, screening the target visual feature from the plurality of visual features based on a degree of correlation between the plurality of visual features and the text feature, respectively, includes: screening a plurality of first candidate visual features meeting a correlation threshold condition from the plurality of visual features based on correlations between the plurality of visual features and the text features respectively; and performing feature conversion on the first candidate visual features to obtain target visual features.
Specifically, the computer device screens out a plurality of first candidate visual features satisfying a correlation threshold condition from among the plurality of visual features based on correlations between the plurality of visual features and the text feature, respectively. Wherein the relevance threshold condition includes, but is not limited to, one or more of a threshold number of visual features, a threshold score, and the like.
After screening the plurality of first candidate visual features, the computer device performs feature conversion on the plurality of first candidate visual features to obtain a plurality of target visual features. The feature conversion performed by the computer device on the plurality of first candidate visual features includes, but is not limited to: one or more of dimension conversion, upsampling, downsampling, or the like.
Wherein, in some embodiments, performing feature conversion on the plurality of first candidate visual features to obtain the plurality of target visual features comprises: for each of the plurality of first candidate visual features, multiplying that first candidate visual feature by its corresponding correlation, thereby obtaining a target visual feature.
Specifically, for each of the plurality of first candidate visual features, the computer device multiplies the first candidate visual feature by its corresponding correlation score to obtain a target visual feature. For example, the computer device multiplies the first candidate visual feature z_l by its associated score s_l to obtain the target visual feature z̃_l = s_l · z_l, where l = 1, 2, ..., K, and K is the number of selected first candidate visual features.
Through the mode, the relevance can be converted from the numerical value to the characteristic, so that the method can be applied to back propagation for training, and further the accuracy of text recognition is improved.
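A minimal numpy sketch of the screening and feature-conversion steps, assuming the relevance threshold condition is a top-k selection over dot-product scores (the source allows other threshold conditions as well):

```python
import numpy as np

def screen_target_features(visual_feats, text_feat, k):
    """Screen k first candidate visual features by correlation with the
    text feature, and weight each by its correlation score."""
    # Correlation score of each visual feature with the text feature.
    scores = visual_feats @ text_feat
    # Relevance threshold condition: keep the k highest-scoring features.
    top = np.argsort(scores)[::-1][:k]
    first_candidates = visual_feats[top]
    # Feature conversion: multiply each candidate by its correlation score,
    # so the relevance takes part in back-propagation as a feature.
    targets = first_candidates * scores[top][:, None]
    # The remaining visual features become the second candidates.
    second_candidates = np.delete(visual_feats, top, axis=0)
    return targets, second_candidates, top
```

Multiplying by the score (rather than merely selecting) is what converts the correlation from a numerical value into a feature, as described above.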
Meanwhile, in order to enable irrelevant image features in an image to be identified to also play a role in text information identification, in some embodiments, the method further comprises: among the plurality of visual features, other visual features than the plurality of first candidate visual features are taken as second candidate visual features; and converting the second candidate visual features into secondary visual features, the number of which is the same as that of the plurality of target visual features, and performing feature stitching on the first candidate visual features and the secondary visual features to obtain the target visual features.
Specifically, the computer device treats, among the plurality of visual features, other visual features than the plurality of first candidate visual features as second candidate visual features. The computer device converts the second candidate visual feature into a number of secondary visual features equal to the number of the plurality of target visual features. After obtaining the first candidate visual feature and the secondary visual feature, the computer device performs feature stitching on the first candidate visual feature and the secondary visual feature to obtain the target visual feature. For example, the computer device concatenates the K secondary visual features after the K first candidate visual features, resulting in the target visual feature.
Thus, the computer device decodes the target visual feature to identify text information corresponding to the question indicated by the question text from the image to be identified.
Through the method, text information identification is carried out based on the first candidate visual features and the second candidate visual features with the same dimensions, so that the image region more relevant to the problem text can be focused, the focusing degree of other regions is reduced, information omission is avoided, and the accuracy of text information identification is improved.
The computer device may, for example, merge the plurality of second candidate visual features via an attention mechanism, thereby converting the plurality of second candidate visual features into K secondary visual features.
Illustratively, the computer device may compute the secondary visual features ẑ by the following formula:
ẑ = softmax((Q · W_v · Z^T) / √d) · Z
where Z denotes the second candidate visual features, d denotes the feature dimension, Q and W_v are attention parameters in the attention mechanism, T denotes the matrix transpose, K is the number of selected first candidate visual features, and ẑ denotes the resulting K secondary visual features.
In this way, visual features with low correlation are projected alongside the target visual features with high correlation, so that other image regions of the image are not missed, while the interference of the secondary visual features with the target visual features is prevented.
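A sketch of converting the second candidate visual features into K secondary features via attention and stitching them after the first candidates; the learned query matrix and projection below are illustrative stand-ins for the attention parameters Q and W_v, whose exact shapes the source does not specify:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def merge_second_candidates(second_feats, queries, w_v):
    # `queries` has K rows, one per secondary feature to produce; each row
    # attends over all second candidate features, scaled by sqrt(d).
    d = second_feats.shape[1]
    attn = softmax(queries @ (second_feats @ w_v).T / np.sqrt(d))  # (K, n2)
    return attn @ second_feats  # K secondary visual features

def build_decoder_input(first_candidates, secondary_feats):
    # Feature stitching: concatenate the K secondary features after the
    # K first candidate features.
    return np.concatenate([first_candidates, secondary_feats], axis=0)
```

The stitched output has 2K rows, so the decoder sees both the question-relevant regions and a compressed summary of the rest of the image.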
In some embodiments, the computer device performs a decoding process based on the target visual feature to identify text information corresponding to a question indicated by the target question text from the image to be identified, including: coding processing is carried out based on the target visual characteristics, so that a plurality of text coding characteristics which are sequentially arranged are obtained; decoding the first text coding feature in the plurality of text coding features to obtain text information corresponding to the first text coding feature; for each text coding feature except the first text coding feature in the plurality of text coding features, decoding according to the text coding feature and the previous text coding feature of the text coding feature to obtain text information corresponding to the text coding feature; based on the text information corresponding to the first text coding feature and the text information corresponding to each text coding feature except the first text coding feature, obtaining the text information corresponding to the problem indicated by the target problem text.
In the process of decoding the target visual features, in order to improve accuracy, the embodiment of the application generates texts in an autoregressive mode, and each text can pay attention to the coding condition of the previous text. Specifically, the computer device performs encoding processing based on the target visual feature to obtain a plurality of text encoding features which are sequentially arranged. The computer equipment performs decoding processing based on the plurality of text coding features, so that text information corresponding to each text coding feature after decoding, namely an answer corresponding to the question indicated by the target question text, is obtained.
For the first text encoding feature among the plurality of text encoding features, which usually corresponds to the first word of the text information, the computer device performs decoding according to the plurality of visual features and the embedded vector corresponding to the target question text, so as to obtain the text information corresponding to the first text encoding feature.
For each text encoding feature other than the first among the plurality of text encoding features, the computer device performs decoding according to the target visual features, the targeted text encoding feature, and the text encoding features preceding it, so as to obtain the text information corresponding to the targeted text encoding feature. Here, the preceding text encoding features refer to those ordered before the targeted text encoding feature. The computer device combines the text information corresponding to each of the text encoding features in order, thereby obtaining the text information corresponding to the question indicated by the target question text, that is, the answer to the question.
In some embodiments, the computer device may perform the decoding processing on the target visual features through a text decoder. The text decoder is intended to decode the corresponding text sequence for various tasks such as VQA, information extraction, and text recognition (OCR). It generates text in an autoregressive manner while attending to previously generated tokens and the encoded visual features. The text decoder includes, but is not limited to, one or more of BERT, Transformer, LLM, and the like. Illustratively, the mathematical form of the text decoder may be expressed by the following formula:
h_t = TextDecoder(F_0, Q_0, y_{<t})
where F_0 is the target visual feature, Q_0 is the currently corresponding question text, y_{<t} denotes the previously generated tokens, and h_t denotes the hidden state of the text decoder at time step t.
In the above embodiment, by performing decoding processing based on the target visual characteristics to generate text, end-to-end image text recognition for the problem text can be realized, and quick question-answering based on images is realized without relying on the accuracy of image segmentation and text detection in the conventional OCR recognition technology. And by generating text based on the target visual features in an autoregressive manner, the accuracy of the generated text is improved, and the accuracy of end-to-end image text recognition is further improved.
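The autoregressive generation loop can be sketched as follows; the decoder step here is a deliberately simplified stand-in (mean pooling of the visual context and previous token embeddings) rather than the BERT/Transformer/LLM decoders the text mentions, and BOS/EOS token ids are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, DIM = 20, 8
embed = rng.standard_normal((VOCAB, DIM))
out_proj = rng.standard_normal((DIM, VOCAB))
BOS, EOS = 0, 1  # assumed special token ids

def decoder_step(visual_ctx, generated_ids):
    # Toy stand-in for the text decoder: pools the visual context and the
    # embeddings of previously generated tokens into a hidden state h_t.
    return visual_ctx.mean(axis=0) + embed[generated_ids].mean(axis=0)

def generate(visual_ctx, max_len=10):
    # Autoregressive generation: each step attends to the tokens produced
    # so far, mirroring h_t = TextDecoder(F_0, Q_0, y_<t).
    ids = [BOS]
    for _ in range(max_len):
        h = decoder_step(visual_ctx, ids)
        next_id = int(np.argmax(h @ out_proj))  # greedy decoding
        ids.append(next_id)
        if next_id == EOS:
            break
    return ids[1:]  # generated answer tokens, without BOS
```

Greedy argmax is used here for simplicity; each generated token conditions all subsequent steps, which is what gives each word access to the encoding of the words before it.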
In some embodiments, the above method further comprises: responding to the triggering operation of the input device, and acquiring an image to be identified through interaction between the input device and a physical entity to be identified; wherein the input device at least comprises an image pickup device.
Specifically, when the computer device is a terminal, the terminal may acquire the image to be recognized in response to a trigger operation on the input device. Wherein the input device includes, but is not limited to, an imaging device, etc. For example, the terminal shoots a physical entity (such as an entity file, a card, a ticket, etc.) to be identified by calling the camera device to realize interaction, thereby obtaining an image to be identified.
By the method, the user can extract the digital form of the characters in the physical entity to be identified by shooting the image of the physical entity to be identified, the extracted text information can be applied to various subsequent services, and the service processing efficiency is improved. For example, a user shoots an express delivery bill, and after acquiring an image of the express delivery bill, the terminal executes the text information identification method, so that the logistics related information in the text information identification method is extracted, and the processing efficiency of express delivery business is improved.
The application also provides an application scene, which applies the text information identification method. Specifically, the application of the text information identification method in the application scene is as follows: the method comprises the steps that visual coding processing is carried out on an image to be identified by computer equipment, so that a plurality of visual features of the image to be identified are obtained; acquiring a problem text about an image to be identified, and performing problem decoding processing on the problem text based on a plurality of visual features to obtain text features of the problem text; determining the correlation degree between the plurality of visual features and the text features respectively; screening target visual features from the plurality of visual features based on correlation between the plurality of visual features and the text features respectively; and decoding based on the target visual characteristics so as to identify text information corresponding to the problem indicated by the problem text from the image to be identified.
Of course, the text information recognition provided by the embodiments of the application is not limited to this scenario; it can also be applied to other application scenarios, such as text information recognition in traffic, medical, or education scenarios.
Illustratively, as shown in fig. 7, for different images to be identified, the computer device determines a degree of correlation according to text features obtained based on the text of the question and visual features extracted from the images to be identified, where the degree of correlation represents a degree of correlation of an image region of the visual features mapped in the images to be identified and an answer corresponding to the question indicated by the text of the question. The computer equipment screens the visual features based on the correlation degree to obtain target visual features, further determines the most relevant image area based on the target visual features, and further obtains answers corresponding to the questions indicated by the question text through text recognition.
In a specific example, the text information identification provided by the embodiment of the application comprises the following steps: the computer equipment performs feature extraction on the image to be identified to obtain an initial low-resolution feature map, and performs up-sampling processing on the initial low-resolution feature map to obtain an initial high-resolution feature map. Further, for each of a plurality of image features included in the initial high resolution feature map, the computer device determines an image region in the image to be identified for which the image feature is mapped, and determines relative size information and vertex coordinate information of the image region in the image to be identified, respectively. Thus, the computer device determines the position information of the image feature targeted based on the relative size information and the vertex coordinate information. For each image feature of the plurality of image features, the computer device performs a position encoding of the image feature for which it is based on the position information of the image feature for which it is made, resulting in a plurality of visual features of the image to be identified.
Meanwhile, the computer equipment acquires the original problem text and determines the field to be identified in the image to be identified. In the case that the original question text is associated with the field to be recognized, converting the original question text into a preset standardized question text based on the field to be recognized. Thus, the computer device can treat the standardized question text as question text about the image to be recognized.
Then, the computer equipment performs feature extraction on the target problem text to obtain an embedded vector corresponding to the target problem text; aiming at a first text in the target problem text, performing self-coding on the first text according to a plurality of visual features and embedded vectors corresponding to the target problem text to obtain coding features of the first text; aiming at each text except the first text in the target question text, performing self-coding according to a plurality of visual features, an embedded vector corresponding to the target question text and coding features of a previous text of the aimed text to obtain coding features of the aimed text; and obtaining the text characteristics of the target question text based on the coding characteristics of the first text and the coding characteristics of each text except the first text in the target question text.
In the case that there are a plurality of question texts, the computer device acquires the plurality of question texts and determines a target question text to be subjected to the question decoding process from among the plurality of question texts. Illustratively, for a first text in the target question text, the computer device self-encodes the first text according to a plurality of visual features to obtain encoded features of the first text, including: aiming at a first text in the target problem text, the first text is self-coded according to a plurality of visual features and embedded vectors corresponding to the problem texts, and coding features of the first text are obtained. Accordingly, for each text of the target question text except for the first text, the computer device performs self-encoding according to the plurality of visual features and the encoding features of the preceding text of the target text to obtain the encoding features of the target text, including: and aiming at each text except the first text in the target question text, carrying out self-coding according to a plurality of visual features, embedded vectors corresponding to the question texts and coding features of the previous text of the aimed text to obtain the coding features of the aimed text.
Further, the computer device determines a correlation score for each of the plurality of visual features based on the product of that visual feature and the text feature, and obtains the degree of correlation between each visual feature and the text feature from its correlation score.
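The correlation computation can be sketched as an inner product followed by normalisation. The softmax normalisation is an assumption; the patent only specifies a product-based score.

```python
import numpy as np

def relevance_scores(visual_features, text_feature):
    # Correlation score of each visual feature = inner product with the
    # text feature; a softmax turns scores into per-region correlations.
    scores = visual_features @ text_feature
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()
```

With identical features the correlations are uniform; in practice they peak on the image regions that contain the answer text.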
Thus, the computer device may screen, from the plurality of visual features, a plurality of first candidate visual features that satisfy a correlation threshold condition, based on the correlations between the plurality of visual features and the text feature. For each of the plurality of first candidate visual features, the computer device multiplies that first candidate visual feature by its corresponding correlation to obtain a target visual feature. The computer device also takes, from the plurality of visual features, the visual features other than the plurality of first candidate visual features as second candidate visual features, and converts the second candidate visual features into secondary visual features equal in number to the plurality of target visual features. The computer device then performs feature stitching on the target visual features and the secondary visual features to obtain the final target visual features. Thus, the computer device can decode the target visual features to identify, from the image to be identified, the text information corresponding to the question indicated by the question text.
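A minimal sketch of the screening and stitching step follows. The mean pooling used to turn the remaining features into secondary features is a stand-in for the learned conversion, which the patent leaves unspecified.

```python
import numpy as np

def select_target_features(visual_features, relevance, k):
    """Screen the k most correlated visual features, weight each by its
    correlation, convert the remaining features into k secondary
    features, and stitch the two sets together."""
    order = np.argsort(relevance)[::-1]
    first_idx, rest_idx = order[:k], order[k:]
    # Target visual features: candidate feature multiplied by its correlation.
    target = visual_features[first_idx] * relevance[first_idx, None]
    # Secondary features: pool the remaining features down to k rows.
    rest = visual_features[rest_idx]
    secondary = np.tile(rest.mean(axis=0), (k, 1))
    return np.concatenate([target, secondary], axis=0)
```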
After the text feature is obtained, the computer device also performs feature conversion on it: since the text feature is a discrete feature representation, it must be converted into a form that can be processed jointly with the visual features. Illustratively, as shown in FIG. 8A, the computer device inputs the text feature Q into an MLP (Multilayer Perceptron) for linear transformation, resulting in the converted text feature ε_mask. In the case of a plurality of question texts, as shown in FIG. 8B, the computer device obtains the text features Q1, Q2, …, Qn corresponding to question text 1, question text 2, …, question text n respectively, linearly transforms each of them through the MLP, and takes the transformed text features as the text features ε_mask.
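The MLP conversion can be sketched as a small two-layer perceptron; the layer sizes, the ReLU non-linearity, and the weight initialisation are all illustrative assumptions.

```python
import numpy as np

def mlp_convert(text_feature, w1, b1, w2, b2):
    # Two-layer perceptron with ReLU: maps the discrete text feature Q
    # into epsilon_mask, a form comparable with the visual features.
    hidden = np.maximum(0.0, text_feature @ w1 + b1)
    return hidden @ w2 + b2
```

With identity weights and zero biases the conversion only clips negatives, which makes the mechanics easy to verify by hand.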
Furthermore, based on the visual features ε_pixel and the converted text feature ε_mask, the computer device obtains the correlation ε_score between each visual feature and the text feature. The magnitude of the correlation represents the probability that the text information corresponding to the question text is located in the corresponding image area of the image to be identified. As shown in FIG. 9, the computer device obtains the target visual features from the K visual features with the highest correlation, and maps the target visual features to the corresponding image areas in the image to be identified. The computer device may then perform decoding processing based on the target visual features to obtain the text information contained in those image areas, and thereby the answer to the question indicated by the question text. In some embodiments, the computer device may also perform the decoding processing on the combination of the secondary visual features and the target visual features to improve the accuracy of text information recognition.
Based on the above examples, the flow of text information recognition performed by the computer device may be as shown in FIG. 10. The computer device acquires an image to be identified and a plurality of question texts q1, q2, …, qn. The computer device performs visual encoding processing on the image to be identified through the visual encoder to obtain a plurality of visual features ε_pixel. The computer device performs question decoding processing on the plurality of question texts through the question decoder to obtain the text features Q corresponding to the plurality of question texts, and converts the text features Q through a multi-layer perceptron to obtain the converted text features ε_mask. Furthermore, based on the visual features ε_pixel and the converted text features ε_mask, the computer device obtains the correlation ε_score between each visual feature and the text features, and determines, according to the correlation ε_score, the image areas in the image to be identified to which the target visual features map. Meanwhile, the computer device determines the secondary visual features, stitches the target visual features with the secondary visual features, and inputs the stitched features into the text decoder for decoding processing, thereby identifying the text information contained in the corresponding image areas.
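The full flow of FIG. 10 can be condensed into a toy pipeline. Every component here (the attention pooling standing in for the question decoder, the pooled secondary features, the nearest-vocabulary-row "decoder") is a simplified stand-in for the corresponding trained module, not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def recognize(image_features, question_embedding, k=4):
    """Toy end-to-end flow: visual features -> text feature -> correlation
    -> top-k target features + secondary features -> 'decoded' token ids."""
    # Question decoding (stand-in): attend the question over visual features.
    attn = softmax(image_features @ question_embedding)
    text_feature = attn @ image_features
    # Correlation and screening.
    rel = softmax(image_features @ text_feature)
    top = np.argsort(rel)[::-1][:k]
    target = image_features[top] * rel[top, None]
    secondary = np.tile(np.delete(image_features, top, axis=0).mean(0), (k, 1))
    fused = np.concatenate([target, secondary], axis=0)
    # Text decoding (stand-in): nearest 'vocabulary' row per fused feature.
    vocab = rng.normal(size=(100, image_features.shape[1]))
    return (fused @ vocab.T).argmax(axis=1)
```

The output is one hypothetical token id per stitched feature; a real text decoder would instead decode the stitched features autoregressively into the answer string.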
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which need not be performed at the same time but may be performed at different times, and which need not be performed sequentially but may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, as shown in fig. 11, the embodiment of the application further provides a training method of the text information recognition model, and the method can be applied to a terminal or a server, or can be cooperatively executed by the terminal and the server. The following description will take an example in which the method is applied to a computer device, which may be a terminal or a server. The method comprises the following steps:
Step S1102, obtaining a pre-training sample pair; the pre-training sample pair includes at least one of a first sample pair consisting of a sample initial image and a sample text position, a second sample pair consisting of a first sample text and a second sample text, and a third sample pair consisting of a sample segmentation image and a third sample text.
Specifically, the computer device first pre-trains the text information recognition model such that the text information recognition model acquires the underlying image recognition capabilities and text recognition capabilities.
Wherein the pre-training sample pair may comprise a first sample pair consisting of a sample initial image and sample text positions. That is, given an image as the sample initial image, the corresponding label is the position of every text in the sample initial image, i.e. the sample text positions; this trains the model to locate text in an image, i.e. gives it text detection capability.
The pre-training sample pair may further include a second sample pair consisting of a first sample text and a second sample text. That is, given a text segment of an image, such as the text of the first line of the image (the first sample text), it is input into the text information recognition model, which must predict the corresponding text content (the second sample text), thereby enabling the text information recognition model to understand text at different positions.
The pre-training sample pair may further comprise a third sample pair consisting of a sample segmentation image and a third sample text. That is, the computer device takes the image area to which certain text information belongs, i.e. the sample segmentation image, and trains the model with the text information contained in that image area, i.e. the third sample text, so that the text information recognition model can recognize text content from a segmented image.
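The three kinds of pre-training sample pairs can be represented as simple records. All field names, file names, and label values below are hypothetical illustrations of the data layout, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class FirstSamplePair:       # trains text detection (locate text in an image)
    sample_initial_image: str          # image identifier (hypothetical)
    sample_text_positions: list        # [(x, y, w, h), ...], one box per text

@dataclass
class SecondSamplePair:      # trains positional text understanding
    first_sample_text: str             # e.g. "text of the first line"
    second_sample_text: str            # content the model must predict

@dataclass
class ThirdSamplePair:       # trains recognition from segmented regions
    sample_segmentation_image: str     # cropped region identifier
    third_sample_text: str             # text contained in that region

pairs = [
    FirstSamplePair("doc_001.png", [(10, 12, 80, 16), (10, 40, 120, 16)]),
    SecondSamplePair("first line of doc_001", "INVOICE NO. 2023-0042"),
    ThirdSamplePair("doc_001_region0.png", "INVOICE NO. 2023-0042"),
]
```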
Step S1104, for the text information recognition model to be trained, pre-training a first local network in the text information recognition model based on the first sample pair, pre-training a second local network in the text information recognition model based on the second sample pair, and pre-training a third local network of the text information recognition model based on the third sample pair, to obtain an initial text information recognition model.
In order for the text information recognition model to have sufficient text understanding capability, large-scale document pre-training is required. The computer device therefore pre-trains, for the text information recognition model to be trained, a first local network in the model based on the first sample pair, where the first local network comprises at least a local network for recognizing images and a local network for recognizing question text, for example at least the visual encoder and the question decoder.

The computer device pre-trains a second local network in the text information recognition model based on the second sample pair, where the second local network comprises at least a local network for recognizing question text, for example at least the question decoder.

And the computer device pre-trains a third local network of the text information recognition model based on the third sample pair, where the third local network comprises at least a local network for recognizing text information, for example at least the text decoder.
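The routing of sample pairs to local networks can be sketched as follows. The model layout and the pair "kind" labels are illustrative assumptions about how such a dispatcher might look.

```python
def pretrain_step(model, sample_pair):
    """Route each pre-training pair to the local network(s) it trains:
    first pair -> visual encoder + question decoder, second pair ->
    question decoder, third pair -> text decoder (names hypothetical)."""
    kind = sample_pair["kind"]
    if kind == "first":
        return model["visual_encoder"] + "+" + model["question_decoder"]
    if kind == "second":
        return model["question_decoder"]
    if kind == "third":
        return model["text_decoder"]
    raise ValueError(f"unknown sample pair kind: {kind}")
```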
Step S1106, obtaining a target sample pair; the target sample pair comprises a target sample image, target question text and target question answers.
After pre-training has given the model basic text understanding capability, the computer device may obtain corresponding target sample pairs for different business scenarios; a target sample pair includes a target sample image, a target question text, and a target question answer corresponding to the business scenario in which the model will later be used. For example, for text information recognition in the transportation domain, the computer device obtains target sample pairs related to certificates, cards, and the like in that domain, such as a target certificate image, a target question text corresponding to the target certificate image, and a target question answer corresponding to the target question text.
Therefore, the text information recognition model can be fine-tuned for different business scenarios, greatly reducing both the cost of deploying the model and the cost of training it.
Step S1108, training the initial text information recognition model based on the target sample pair to obtain a trained text information recognition model; the trained text information recognition model is used for performing text information recognition on an image to be identified, so as to identify, from the image to be identified, the text information corresponding to the question in the question text matched with the image to be identified.
Specifically, the computer device trains the initial text information recognition model based on the target sample pairs until a training termination condition is reached, thereby obtaining the trained text information recognition model. The training termination conditions include, but are not limited to, one or more of: the training duration reaching a threshold, the number of training iterations reaching a threshold, or the model loss being minimized.
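The termination logic can be sketched as follows. The concrete thresholds are illustrative, and `step_fn` stands in for one real training iteration that updates the model and returns the new loss.

```python
import time

def train(initial_loss, step_fn, max_steps=1000,
          max_seconds=3600.0, loss_threshold=1e-3):
    """Fine-tune until any termination condition holds: step budget,
    wall-clock budget, or loss below a threshold (values illustrative)."""
    start, loss = time.monotonic(), initial_loss
    for step in range(max_steps):
        if loss <= loss_threshold:
            return step, loss          # loss minimised sufficiently
        if time.monotonic() - start >= max_seconds:
            return step, loss          # training duration reached
        loss = step_fn(loss)           # one training iteration (stand-in)
    return max_steps, loss             # training count reached
```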
The computer device then deploys the trained text information recognition model into the business scenario, where the model performs text information recognition on an image to be identified so as to identify, from the image to be identified, the text information corresponding to the question in the question text matched with the image to be identified.
According to the above training method of the text information recognition model, the model is first pre-trained by constructing pre-training sample pairs so that it acquires basic text understanding capability, and is then fine-tuned for different business scenarios: a target sample pair corresponding to the business scenario is obtained, and the model is trained on it to obtain the trained text information recognition model. Because the model can be fine-tuned per business scenario, the cost of deploying and training the model is greatly reduced, the consumption of computing resources is lowered, and model deployment efficiency is improved.
Based on the same inventive concept, the embodiment of the application also provides a text information recognition device for realizing the above related text information recognition method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the text information recognition device or devices provided below may refer to the limitation of the text information recognition method hereinabove, and will not be repeated herein.
In some embodiments, as shown in fig. 12, there is provided a text information recognition apparatus 1200 including: a processing module 1201, an acquisition module 1202, a determination module 1203, and a screening module 1204, wherein:
the processing module 1201 is configured to perform visual encoding processing on an image to be identified, so as to obtain a target feature map of the image to be identified, where the target feature map includes a plurality of visual features.
The obtaining module 1202 is configured to obtain a question text related to an image to be identified, and perform a question decoding process on the question text based on a plurality of visual features, so as to obtain text features of the question text.
A determining module 1203 is configured to determine a degree of correlation between the plurality of visual features and the text features, respectively.
And a screening module 1204, configured to screen the target visual feature from the plurality of visual features based on the correlation between the plurality of visual features and the text feature, respectively.
The processing module 1201 is further configured to perform decoding processing based on the target visual feature, so as to identify text information corresponding to the question indicated by the question text from the image to be identified.
In some embodiments, the processing module is further configured to perform feature extraction on the image to be identified to obtain an initial low-resolution feature map; perform up-sampling processing on the initial low-resolution feature map to obtain an initial high-resolution feature map; and obtain the target feature map of the image to be identified based on the initial high-resolution feature map.
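The up-sampling of the low-resolution feature map can be sketched with nearest-neighbour repetition; the patent does not fix the interpolation method, so this is only one possible choice.

```python
import numpy as np

def upsample_nearest(feature_map, factor=2):
    # Nearest-neighbour up-sampling of an (H, W, C) feature map:
    # each spatial cell is repeated `factor` times along both axes,
    # turning a low-resolution map into a higher-resolution one.
    return feature_map.repeat(factor, axis=0).repeat(factor, axis=1)
```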
In some embodiments, the processing module is further configured to determine respective location information of a plurality of initial image features included in the initial high resolution feature map; and for each initial image feature, carrying out position coding on the initial image feature based on the position information of the initial image feature, and obtaining a target feature map of the image to be identified.
In some embodiments, the initial high resolution feature map includes a plurality of initial image features; the processing module is further configured to determine, for each initial image feature, the image area to which that initial image feature maps in the image to be identified; determine the size information and vertex coordinate information of that image area in the image to be identified; and determine the position information of that initial image feature based on the size information and the vertex coordinate information.
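The position information derived from a mapped image region can be sketched as follows. Normalising by the image size is an assumption the patent does not mandate, but it keeps features from differently sized images comparable.

```python
def region_position_info(x0, y0, x1, y1, image_w, image_h):
    """Position information of the image region a feature maps to:
    size (w, h) plus the four vertex coordinates, normalised by the
    overall image dimensions."""
    w, h = x1 - x0, y1 - y0
    vertices = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return {
        "size": (w / image_w, h / image_h),
        "vertices": [(vx / image_w, vy / image_h) for vx, vy in vertices],
    }
```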
In some embodiments, the obtaining module is further configured to obtain an original question text; determining a field to be identified in the image to be identified; in case the original question text is associated with the field to be identified, the question text about the image to be identified is derived based on the original question text.
In some embodiments, the obtaining module is further configured to, in a case where the original question text is associated with the field to be identified, convert the original question text into a preset standardized question text based on the field to be identified; the standardized question text is taken as the question text about the image to be recognized.
In some embodiments, the obtaining module is further configured to: perform feature extraction on the target question text to obtain an embedded vector corresponding to the target question text; for the first text in the target question text, self-encode the first text according to the plurality of visual features and the embedded vector corresponding to the target question text to obtain the encoding feature of the first text; for each text in the target question text other than the first text, perform self-encoding according to the plurality of visual features, the embedded vector corresponding to the target question text, and the encoding feature of the preceding text to obtain the encoding feature of that text; and obtain the text feature of the target question text based on the encoding feature of the first text and the encoding features of the other texts in the target question text.
In some embodiments, the obtaining module is further configured to encode the target question text to obtain an embedded vector of the target question text; and decoding the embedded vector of the target question text based on the plurality of visual features to obtain text features of the target question text.
In some embodiments, the obtaining module is further configured to obtain a plurality of question texts, and determine a target question text to be subjected to the question decoding process from the plurality of question texts; the acquisition module is also used for decoding the embedded vectors of the target question text based on the embedded vectors corresponding to the question text except the target question text in the plurality of question texts, so as to obtain the text characteristics of the target question text.
In some embodiments, the obtaining module is further configured to obtain a plurality of question texts and determine, from among them, a target question text to be subjected to the question decoding process; for the first text in the target question text, to self-encode the first text according to the plurality of visual features and the embedded vectors corresponding to the plurality of question texts, obtaining the encoding feature of the first text; and, for each text in the target question text other than the first text, to perform self-encoding according to the plurality of visual features, the embedded vectors corresponding to the plurality of question texts, and the encoding feature of the preceding text, obtaining the encoding feature of that text.
In some embodiments, the determining module is further configured to determine a relevance score for each of the plurality of visual features based on a product of each of the plurality of visual features and the text feature; and obtaining the correlation degree between the plurality of visual features and the text features respectively based on the correlation scores corresponding to the plurality of visual features.
In some embodiments, the screening module is further configured to screen a plurality of first candidate visual features from the plurality of visual features that satisfy a relevance threshold condition based on relevance between the plurality of visual features and the text feature, respectively; and performing feature conversion on the first candidate visual features to obtain target visual features.
In some embodiments, the screening module is further configured to, for each of the plurality of first candidate visual features, multiply that first candidate visual feature by its corresponding correlation to obtain a target visual feature.
In some embodiments, the screening module is further configured to identify, among the plurality of visual features, a visual feature other than the plurality of first candidate visual features as a second candidate visual feature; and converting the second candidate visual features into secondary visual features, the number of which is the same as that of the plurality of target visual features, and performing feature stitching on the first candidate visual features and the secondary visual features to obtain the target visual features.
In some embodiments, the screening module is further configured to perform feature stitching on the target visual features and the secondary visual features, and to decode the stitched features so as to identify, from the image to be identified, the text information corresponding to the question indicated by the question text.
In some embodiments, the processing module is further configured to perform encoding processing based on the target visual features to obtain a plurality of sequentially arranged text encoding features; decode the first of the text encoding features to obtain its corresponding text information; for each text encoding feature other than the first, decode it together with its preceding text encoding feature to obtain its corresponding text information; and obtain the text information corresponding to the question indicated by the target question text based on the text information corresponding to the first text encoding feature and that of each subsequent text encoding feature.
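The sequential decoding described above can be sketched generically: the first feature is decoded alone, each later feature together with its predecessor, and the pieces concatenate into the answer. Here `decode_step` stands in for the trained text decoder.

```python
def decode_text(encoded_features, decode_step):
    """Decode sequentially arranged text encoding features into text.

    decode_step(feature, prev_feature) returns the text piece for one
    feature; prev_feature is None only for the first feature."""
    pieces = []
    prev = None
    for feat in encoded_features:
        pieces.append(decode_step(feat, prev))
        prev = feat
    return "".join(pieces)
```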
In some embodiments, the obtaining module is further configured to obtain the image to be identified through interaction between the input device and the physical entity to be identified in response to a triggering operation on the input device; wherein the input device at least comprises an image pickup device.
The respective modules in the text information recognition apparatus described above may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
Based on the same inventive concept, the embodiment of the application also provides a training device for the text information recognition model, which is used for realizing the training method of the related text information recognition model. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the training device for one or more text information recognition models provided below may be referred to the limitation of the training method for text information recognition models hereinabove, and will not be repeated here.
In some embodiments, as shown in fig. 13, there is provided a training apparatus 1300 of a text information recognition model, comprising: an acquisition module 1301, a pre-training module 1302, and a training module 1303, wherein:
an acquisition module 1301, configured to acquire a pair of pre-training samples; the pre-training sample pair includes at least one of a first sample pair consisting of a sample initial image and a sample text position, a second sample pair consisting of a first sample text and a second sample text, and a third sample pair consisting of a sample segmentation image and a third sample text.
The pre-training module 1302 is configured to pre-train a first local network in the text information recognition model based on a first sample, pre-train a second local network in the text information recognition model based on a second sample, and pre-train a third local network in the text information recognition model based on a third sample, so as to obtain an initial text information recognition model.
The acquisition module 1301 is further configured to acquire a target sample pair; the target sample pair comprises a target sample image, target question text and target question answers.
The training module 1303 is configured to train the initial text information recognition model based on the target sample, so as to obtain a trained text information recognition model; the trained text information recognition model is used for recognizing text information of the image to be recognized so as to recognize text information corresponding to the problem in the problem text matched with the image to be recognized from the image to be recognized.
The respective modules in the training device of the text information recognition model can be realized in whole or in part by software, hardware and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In some embodiments, a computer device is provided, which may be a server or a terminal. Taking a server as an example, its internal structure may be as shown in FIG. 14. The computer device includes a processor, a memory, an input/output interface (I/O), and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing image-related data. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a text information recognition method or a training method of a text information recognition model.
It will be appreciated by those skilled in the art that the structure shown in fig. 14 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In some embodiments, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In some embodiments, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the processes of the methods in the above embodiments may be implemented by a computer program stored on a non-volatile computer-readable storage medium, which, when executed, may include the processes of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take a variety of forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided herein may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.
Claims (17)
1. A method for identifying text information, the method comprising:
performing visual coding processing on an image to be identified to obtain a target feature map of the image to be identified, wherein the target feature map comprises a plurality of visual features;
acquiring a target question text about the image to be identified, and performing question decoding processing on the target question text based on the plurality of visual features to obtain text features of the target question text;
determining a degree of correlation between the plurality of visual features and the text features, respectively;
screening target visual features from the plurality of visual features based on correlations between the plurality of visual features and the text features, respectively;
and performing decoding processing based on the target visual features, so as to identify, from the image to be identified, text information corresponding to the question indicated by the target question text.
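Read as a pipeline, claim 1 scores every visual feature against the question's text feature and keeps only the relevant ones before decoding. A minimal sketch of that screening step, using a hypothetical dot-product score and top-k selection (the claim fixes neither choice):

```python
import numpy as np

def screen_visual_features(visual_feats, text_feat, top_k=2):
    """Score each visual feature against the text feature by dot
    product, then keep the top_k most relevant ones (claims 1, 6, 7).
    Both the score and the selection rule are illustrative choices."""
    scores = visual_feats @ text_feat          # one score per visual feature
    keep = np.argsort(scores)[::-1][:top_k]    # indices of the most relevant
    return visual_feats[keep], scores[keep]

# Toy example: 4 visual features in a 3-dimensional embedding space.
visual = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
text = np.array([1.0, 0.0, 0.0])               # question text feature
target, scores = screen_visual_features(visual, text, top_k=2)
```

Only the two features aligned with the question embedding survive the screening; the decoder then sees a much smaller, question-relevant set.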
2. The method according to claim 1, wherein the performing the visual encoding process on the image to be identified to obtain the target feature map of the image to be identified includes:
extracting features of an image to be identified to obtain an initial low-resolution feature map;
performing up-sampling processing on the initial low-resolution feature map to obtain an initial high-resolution feature map;
and obtaining a target feature map of the image to be identified based on the initial high-resolution feature map.
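The up-sampling step in claim 2 can be sketched with the simplest possible choice, nearest-neighbour repetition; the actual model could equally use bilinear or learned up-sampling:

```python
import numpy as np

def upsample_nearest(feat_map, scale=2):
    """Claim 2: up-sample a low-resolution feature map (H, W, C) to a
    higher resolution. Nearest-neighbour repetition is an illustrative
    stand-in for whatever up-sampling the model actually uses."""
    return feat_map.repeat(scale, axis=0).repeat(scale, axis=1)

low = np.arange(8.0).reshape(2, 2, 2)          # 2x2 map with 2 channels
high = upsample_nearest(low, scale=2)          # 4x4 map, same channels
```

Each spatial cell of the low-resolution map becomes a scale-by-scale block in the high-resolution map, and the channel dimension is untouched.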
3. The method of claim 2, wherein the initial high resolution feature map includes a plurality of initial image features; the obtaining the target feature map of the image to be identified based on the initial high-resolution feature map comprises the following steps:
determining respective position information of the plurality of initial image features;
and for each initial image feature, carrying out position coding on the initial image feature based on the position information of the initial image feature to obtain a target feature map of the image to be identified.
4. A method according to claim 3, wherein said determining the respective location information of the plurality of initial image features comprises:
for each initial image feature, respectively determining an image area of the initial image feature mapped in the image to be identified;
determining size information and vertex coordinate information of the image area in the image to be identified;
and determining position information of the targeted initial image feature based on the size information and the vertex coordinate information.
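Claims 3 and 4 derive each feature's position code from the size and vertex coordinates of the image region it maps to. One hypothetical layout of that position information, normalised by the image dimensions (the claims do not prescribe a specific layout):

```python
def position_info(x0, y0, x1, y1, img_w, img_h):
    """Claim 4: build a position descriptor for one feature-map cell
    from the image region it maps to. The six-value layout and the
    normalisation are illustrative assumptions."""
    w, h = x1 - x0, y1 - y0                    # size information
    return [x0 / img_w, y0 / img_h,            # top-left vertex
            x1 / img_w, y1 / img_h,            # bottom-right vertex
            w / img_w, h / img_h]              # region size

# A 16x16 patch at the top-left corner of a 64x64 image to be identified.
info = position_info(0, 0, 16, 16, 64, 64)
```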
5. The method according to claim 1, wherein the method further comprises:
acquiring a plurality of question texts, and determining a target question text to be subjected to question decoding processing from the plurality of question texts;
the performing decoding processing on the embedded vector of the target question text to obtain the text features of the target question text comprises:
decoding the embedded vector of the target question text based on the embedded vectors corresponding to question texts other than the target question text among the plurality of question texts, to obtain the text features of the target question text.
6. The method of claim 1, wherein the determining the degree of correlation between the plurality of visual features and the text features, respectively, comprises:
determining a correlation score corresponding to each of the plurality of visual features based on a product of each of the plurality of visual features and the text feature;
and obtaining the correlation degree between the visual features and the text features respectively based on the correlation scores corresponding to the visual features.
7. The method of claim 1, wherein the screening the target visual feature from the plurality of visual features based on the correlation between the plurality of visual features and the text feature, respectively, comprises:
screening a plurality of first candidate visual features meeting a correlation threshold condition from the plurality of visual features based on correlations between the plurality of visual features and the text feature respectively;
and performing feature conversion on the plurality of first candidate visual features to obtain a plurality of target visual features.
8. The method of claim 7, wherein the performing feature conversion on the plurality of first candidate visual features to obtain a plurality of target visual features comprises:
and for each of the plurality of first candidate visual features, multiplying the targeted first candidate visual feature by the correlation corresponding to the targeted first candidate visual feature to obtain the target visual feature.
9. The method of claim 7, wherein the performing feature conversion on the plurality of first candidate visual features to obtain a plurality of target visual features further comprises:
taking, among the plurality of visual features, the visual features other than the plurality of first candidate visual features as second candidate visual features;
and converting the second candidate visual features into secondary visual features, the number of which is the same as that of the first candidate visual features, and performing feature stitching on the first candidate visual features and the secondary visual features to obtain target visual features.
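Claims 8 and 9 together describe the feature conversion: weight each first candidate by its correlation, compress the remaining features into the same number of secondary features, and stitch the two sets together. A toy sketch, with mean-pooling standing in for whatever conversion the model actually learns:

```python
import numpy as np

def convert_and_stitch(first_cands, first_scores, second_cands):
    """Claims 8-9: weight each first candidate visual feature by its
    correlation, pool the second candidates into the same number of
    'secondary' features, and concatenate both sets. The mean-pool
    conversion is a hypothetical placeholder."""
    weighted = first_cands * first_scores[:, None]       # claim 8
    k = len(first_cands)
    pooled = second_cands.mean(axis=0, keepdims=True)    # claim 9 conversion
    secondary = np.repeat(pooled, k, axis=0)             # match counts
    return np.concatenate([weighted, secondary], axis=0) # feature stitching

first = np.ones((2, 3))                # 2 first candidates, 3-dim each
scores = np.array([0.9, 0.1])          # their correlation degrees
second = np.zeros((5, 3))              # 5 remaining visual features
target = convert_and_stitch(first, scores, second)
```

The stitched output keeps the strongly correlated features (scaled by their scores) while still retaining a compressed summary of the rest of the image.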
10. The method according to claim 1, wherein the decoding process based on the target visual feature to identify text information corresponding to a question indicated by the target question text from the image to be identified includes:
performing coding processing based on the target visual features to obtain a plurality of sequentially arranged text coding features;
decoding a first text coding feature in the plurality of text coding features to obtain text information corresponding to the first text coding feature;
for each text coding feature other than the first text coding feature among the plurality of text coding features, decoding the text coding feature together with its preceding text coding feature to obtain text information corresponding to that text coding feature;
and obtaining text information corresponding to the question indicated by the target question text based on the text information corresponding to the first text coding feature and the text information corresponding to each text coding feature other than the first text coding feature.
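The sequential decoding of claim 10 (the first feature decoded on its own, every later feature decoded together with its predecessor) reduces to a simple loop once the two decode operations are abstracted as callables; both decoders below are hypothetical placeholders:

```python
def decode_sequential(text_enc_feats, decode_first, decode_next):
    """Claim 10: decode the first text coding feature alone, decode
    each later feature with its predecessor, then join the pieces
    into the answer text. The decoders are caller-supplied stubs."""
    pieces = [decode_first(text_enc_feats[0])]
    for prev, cur in zip(text_enc_feats, text_enc_feats[1:]):
        pieces.append(decode_next(prev, cur))
    return "".join(pieces)

# Toy decoders that just echo the feature labels.
answer = decode_sequential(["A", "B", "C"],
                           decode_first=lambda f: f,
                           decode_next=lambda prev, cur: cur.lower())
```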
11. The method according to claim 1, wherein the method further comprises:
responding to a triggering operation of an input device, and acquiring the image to be identified through interaction between the input device and a physical entity to be identified; wherein the input device comprises at least an image pickup device.
12. A method for training a text information recognition model, the method comprising:
acquiring a pre-training sample pair; the pre-training sample pair comprises at least one of a first sample pair consisting of a sample initial image and a sample text position, a second sample pair consisting of a first sample text and a second sample text, and a third sample pair consisting of a sample segmentation image and a third sample text;
pre-training a first local network in the text information recognition model based on the first sample pair, pre-training a second local network in the text information recognition model based on the second sample pair, and pre-training a third local network of the text information recognition model based on the third sample pair to obtain an initial text information recognition model;
obtaining a target sample pair; the target sample pair comprises a target sample image, a target question text and a target question answer;
training the initial text information recognition model based on the target sample pair to obtain a trained text information recognition model; the trained text information recognition model is used for performing text information recognition on an image to be identified, so as to identify, from the image to be identified, text information corresponding to a question in a question text matched with the image to be identified.
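The two-stage training of claim 12 can be summarised as a routing skeleton: each pre-training sample pair updates its own local network, then the whole model is fine-tuned on (image, question, answer) triples. Every class and key name below is hypothetical:

```python
class _StubNet:
    """Hypothetical trainable component; records what it was fit on."""
    def __init__(self):
        self.seen = []
    def fit(self, sample, label=None):
        self.seen.append((sample, label))

def train_text_recognition_model(model, pretrain_pairs, target_pairs):
    """Claim 12 as a skeleton: stage 1 routes each pre-training sample
    pair to its matching local network; stage 2 fine-tunes the full
    model on (image, question, answer) triples."""
    routing = {"first": "first_net", "second": "second_net", "third": "third_net"}
    for kind, pair in pretrain_pairs:
        model[routing[kind]].fit(pair)                   # stage 1: pre-training
    for image, question, answer in target_pairs:
        model["full"].fit((image, question), answer)     # stage 2: fine-tuning
    return model

model = {name: _StubNet() for name in ("first_net", "second_net", "third_net", "full")}
model = train_text_recognition_model(
    model,
    pretrain_pairs=[("first", ("initial_image", "text_position")),
                    ("second", ("first_text", "second_text")),
                    ("third", ("segmented_image", "third_text"))],
    target_pairs=[("target_image", "what is the total?", "42")])
```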
13. A text information recognition device, the device comprising:
the processing module is used for carrying out visual coding processing on the image to be identified to obtain a target feature map of the image to be identified, wherein the target feature map comprises a plurality of visual features;
the acquisition module is used for acquiring a question text related to the image to be identified, and carrying out question decoding processing on the question text based on the plurality of visual features to obtain text features of the question text;
a determining module, configured to determine a degree of correlation between the plurality of visual features and the text features, respectively;
the screening module is used for screening target visual features from the plurality of visual features based on the correlation degree between the plurality of visual features and the text features;
and the processing module is further used for performing decoding processing based on the target visual features, so as to identify, from the image to be identified, text information corresponding to the question indicated by the question text.
14. A training device for a text information recognition model, the device comprising:
the acquisition module is used for acquiring a pre-training sample pair; the pre-training sample pair comprises at least one of a first sample pair consisting of a sample initial image and a sample text position, a second sample pair consisting of a first sample text and a second sample text, and a third sample pair consisting of a sample segmentation image and a third sample text;
The pre-training module is used for pre-training a first local network in the text information recognition model based on the first sample pair, pre-training a second local network in the text information recognition model based on the second sample pair, and pre-training a third local network of the text information recognition model based on the third sample pair to obtain an initial text information recognition model;
the acquisition module is also used for acquiring a target sample pair; the target sample pair comprises a target sample image, a target question text and a target question answer;
the training module is used for training the initial text information recognition model based on the target sample pair to obtain a trained text information recognition model; the trained text information recognition model is used for performing text information recognition on an image to be identified, so as to identify, from the image to be identified, text information corresponding to a question in a question text matched with the image to be identified.
15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when the computer program is executed.
16. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 12.
17. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310823027.XA CN116978030A (en) | 2023-07-05 | 2023-07-05 | Text information recognition method and training method of text information recognition model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116978030A true CN116978030A (en) | 2023-10-31 |
Family
ID=88477567
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310823027.XA Pending CN116978030A (en) | 2023-07-05 | 2023-07-05 | Text information recognition method and training method of text information recognition model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116978030A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Altwaijry et al. | Arabic handwriting recognition system using convolutional neural network | |
US20190294921A1 (en) | Field identification in an image using artificial intelligence | |
CN114596566B (en) | Text recognition method and related device | |
US11756244B1 (en) | System and method for handwriting generation | |
CN110781925A (en) | Software page classification method and device, electronic equipment and storage medium | |
CN112037239B (en) | Text guidance image segmentation method based on multi-level explicit relation selection | |
Daihong et al. | Facial expression recognition based on attention mechanism | |
CN111522979B (en) | Picture sorting recommendation method and device, electronic equipment and storage medium | |
CN117893859A (en) | Multi-mode text image classification method and device, electronic equipment and storage medium | |
das Neves et al. | A fast fully octave convolutional neural network for document image segmentation | |
CN116266259A (en) | Image and text structured output method and device, electronic equipment and storage medium | |
CN114332484A (en) | Key point detection method and device, computer equipment and storage medium | |
CN116704269A (en) | Data processing method, device, equipment and storage medium | |
CN113837157B (en) | Topic type identification method, system and storage medium | |
CN116978030A (en) | Text information recognition method and training method of text information recognition model | |
CN115512340A (en) | Intention detection method and device based on picture | |
Cao et al. | FL-GAN: feature learning generative adversarial network for high-quality face sketch synthesis | |
US11132514B1 (en) | Apparatus and method for applying image encoding recognition in natural language processing | |
Junior et al. | A fast fully octave convolutional neural network for document image segmentation | |
CN114692715A (en) | Sample labeling method and device | |
Ajala | Object Detection and Recognition Using YOLO: Detect and Recognize URL (s) in an Image Scene | |
CN116259050B (en) | Method, device, equipment and detection method for positioning and identifying label characters of filling barrel | |
CN116110111B (en) | Face recognition method, electronic equipment and storage medium | |
CN117670522A (en) | Request processing method and device, storage medium and electronic equipment | |
Zhang et al. | Dual-branch multi-scale densely connected network for image splicing detection and localization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||