CN116758572A - Text acquisition method and related equipment thereof - Google Patents

Text acquisition method and related equipment thereof

Info

Publication number
CN116758572A
Authority
CN
China
Prior art keywords
text
target
vector representation
target image
position information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310488188.8A
Other languages
Chinese (zh)
Inventor
柏昊立
毛志铭
侯璐
魏建生
刘群
蒋欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202310488188.8A priority Critical patent/CN116758572A/en
Publication of CN116758572A publication Critical patent/CN116758572A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text acquisition method and related equipment, which can acquire an accurate target text from a target image. The method of the application includes the following steps: when it is desired to extract a target text from a target image, a target image containing a plurality of texts may be acquired first and input to a target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information of the target text, among the plurality of texts, in the target image. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image.

Description

Text acquisition method and related equipment thereof
Technical Field
The embodiment of the application relates to the field of artificial intelligence (artificial intelligence, AI), in particular to a text acquisition method and related equipment thereof.
Background
With the rapid development of AI technology, more and more users use a pre-trained neural network model (which may also be referred to as a pre-trained model) to complete an analysis process for an image in which a plurality of texts are presented, that is, the pre-trained neural network model can fully understand the image to extract a target text from the plurality of texts presented by the image.
In the related art, the pre-trained neural network model may include an encoder and a decoder. When it is desired to extract a target text from a plurality of texts presented by the image, the image may be input into a neural network model. The encoder may then encode the image to obtain features of the image and provide the features of the image to the decoder. The decoder may then decode based on the features of the image, resulting in the target text.
In the above process, the neural network model understands the content of the image based on the features of the image so as to extract the target text from the plurality of texts presented by the image. However, when the neural network model understands the image in this way, the factors it considers are relatively limited, so the target text finally obtained by the model may not be accurate.
Disclosure of Invention
The embodiment of the application provides a text acquisition method and related equipment thereof, which can acquire accurate target text from a target image.
A first aspect of an embodiment of the present application provides a text obtaining method, where the method is implemented by using a target model, and the method includes:
when it is desired to acquire the target text from the target image, the target image may be acquired first. It should be noted that the content presented by the target image includes a plurality of texts, and the plurality of texts includes the target text to be acquired.
After the target image is obtained, the target image can be input into the target model, so the target model can first encode the target image to obtain the features of the target image. After the features of the target image are obtained, the target model can process the features of the target image to obtain the position information of the target text in the target image. After the position information of the target text in the target image is obtained, the target model can further process the features of the target image and the position information of the target text in the target image to obtain the target text.
It should be noted that the input of the target model includes not only the target image input from the outside, but also the position information of the target text in the target image that the target model itself has obtained, and the output of the target model includes not only the target text but also the position information of the target text in the target image; that is, the target text and the position information of the target text in the target image are two outputs of the target model. At this point, the target text has been successfully acquired from the target image.
From the above method, it can be seen that: when it is desired to extract a target text from a target image, a target image containing a plurality of texts may be acquired first and input to a target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information of the target text, among the plurality of texts, in the target image. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, it considers not only the features of the target image but also the position information of the target text in the target image, so the factors it considers are comprehensive and the content of the target image can be fully and accurately understood; the target text that the target model extracts in this way from the plurality of texts presented by the target image is therefore usually the correct text.
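To make the inference flow above easier to follow, the following Python sketch illustrates it under stated assumptions: an encoder that produces image features and an autoregressive decoder with hypothetical decode_position and decode_text methods. None of the class or method names come from the application; this is an illustrative sketch, not the application's implementation.

```python
# Minimal sketch of the inference flow described above.
# All module and method names are hypothetical; they are NOT defined by the application.
import torch
import torch.nn as nn


class TargetModelSketch(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # e.g. a vision backbone producing image features
        self.decoder = decoder   # an autoregressive decoder over position/text vectors

    @torch.no_grad()
    def forward(self, target_image: torch.Tensor):
        # Step 1: encode the target image to obtain its features.
        image_features = self.encoder(target_image)

        # Step 2: obtain the position information of the target text
        # from the image features (autoregressively, see the later sketches).
        position_info = self.decoder.decode_position(image_features)

        # Step 3: obtain the target text from both the image features
        # and the position information.
        target_text = self.decoder.decode_text(image_features, position_info)

        # Both values are returned, mirroring the two outputs of the target model.
        return target_text, position_info
```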
In one possible implementation, obtaining, based on the features, the position information of the target text in the target image includes: decoding the 1st vector representation of the position information of the target text in the target image through the i-th vector representation of the position information based on the features to obtain the (i+1)-th vector representation of the position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the features. In the foregoing implementation, if the number of target texts is one, after the features of the target image are obtained, the target model may decode the preset vector representation based on the features of the target image, so as to obtain the 1st vector representation of the position information of the target text in the target image. The target model may then decode the 1st vector representation of the position information of the target text in the target image based on the features of the target image to obtain the 2nd vector representation of the position information of the target text in the target image. In this way, the target model can accurately obtain the position information, presented in the form of vector representations, of the target text in the target image.
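Purely as an illustration of the autoregressive decoding just described, the following sketch produces the vector representations of the position information one by one, starting from a preset vector; the decoder call signature, the preset vector, and the fixed number of steps are assumptions standing in for details the application does not spell out here.

```python
def decode_position_sketch(decoder, image_features, preset_vector, num_steps=4):
    """Sketch of autoregressively producing the vector representations of the
    position information. `decoder` is assumed to map (image features, previously
    decoded vectors) to the next vector representation; `num_steps` stands in for
    whatever stopping rule the model actually uses."""
    # The 1st vector representation is obtained by decoding the preset vector.
    position_vectors = [decoder(image_features, [preset_vector])]
    for i in range(1, num_steps):
        # Decode the 1st..i-th vector representations to obtain the (i+1)-th one.
        position_vectors.append(decoder(image_features, [preset_vector] + position_vectors))
    return position_vectors
```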
In one possible implementation, obtaining the target text based on the features and the position information includes: decoding the 1st vector representation of the target text through the j-th vector representation of the target text based on the features to obtain the (j+1)-th vector representation of the target text, j = 1, ..., Y-1, where Y is greater than or equal to 1 and the 1st vector representation of the target text is obtained by decoding the position information based on the features. In the foregoing implementation, if the number of target texts is one, after the position information of the target text in the target image is obtained, the target model may decode the position information of the target text in the target image based on the features of the target image, so as to obtain the 1st vector representation of the target text. The target model may then decode the position information of the target text in the target image and the 1st vector representation of the target text based on the features of the target image to obtain the 2nd vector representation of the target text. In this way, the target model can accurately obtain the target text presented in the form of vector representations.
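A similar minimal sketch, under the same assumed decoder signature as above, of decoding the target text once its position information has been obtained; the fixed number of steps is again an assumption standing in for the model's real stopping rule.

```python
def decode_text_sketch(decoder, image_features, position_vectors, num_steps=16):
    """Sketch of decoding the target text after its position information is known.
    The decoder signature and `num_steps` are illustrative assumptions."""
    # The 1st vector representation of the target text is obtained by decoding
    # the position information based on the image features.
    text_vectors = [decoder(image_features, position_vectors)]
    for j in range(1, num_steps):
        # Decode the position information plus the 1st..j-th text vectors
        # to obtain the (j+1)-th vector representation of the target text.
        text_vectors.append(decoder(image_features, position_vectors + text_vectors))
    return text_vectors
```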
In one possible implementation, the target text includes a first text and a second text, the position information includes first position information of the first text in the target image and second position information of the second text in the target image, and obtaining, based on the features, the position information of the target text in the target image includes: decoding the 1st vector representation of the first position information through the i-th vector representation of the first position information based on the features to obtain the (i+1)-th vector representation of the first position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the first position information is obtained by decoding a preset vector representation based on the features; and decoding the first position information, the first text, and the 1st vector representation of the second position information through the k-th vector representation of the second position information based on the features to obtain the (k+1)-th vector representation of the second position information, k = 1, ..., Z-1, where Z is greater than or equal to 1 and the 1st vector representation of the second position information is obtained by decoding the first position information and the first text based on the features. In the foregoing implementation, if the number of target texts is two, the two target texts may be referred to as a first text and a second text, respectively. After obtaining the features of the target image, the target model may decode the preset vector representation based on the features of the target image, thereby obtaining the 1st vector representation of the first position information of the first text in the target image. The target model may then decode the 1st vector representation of the first position information based on the features of the target image, resulting in the 2nd vector representation of the first position information. In this way, the target model can obtain the complete first position information presented in the form of vector representations. After the first position information is obtained, the target model may process the first position information based on the features of the target image, thereby obtaining the first text. After the first text is obtained, the target model may decode the first position information and the first text based on the features of the target image to obtain the 1st vector representation of the second position information of the second text in the target image. The target model may then decode the first position information, the first text, and the 1st vector representation of the second position information based on the features of the target image to obtain the 2nd vector representation of the second position information. In this way, the target model can accurately obtain the second position information presented in the form of vector representations.
In one possible implementation, obtaining the target text based on the features and the position information includes: decoding the first position information and the 1st vector representation of the first text through the j-th vector representation of the first text based on the features to obtain the (j+1)-th vector representation of the first text, j = 1, ..., Y-1, where Y is greater than or equal to 1 and the 1st vector representation of the first text is obtained by decoding the first position information based on the features; and decoding the first position information, the first text, the second position information, and the 1st vector representation of the second text through the t-th vector representation of the second text based on the features to obtain the (t+1)-th vector representation of the second text, t = 1, ..., W-1, where W is greater than or equal to 1 and the 1st vector representation of the second text is obtained by decoding the first position information, the first text, and the second position information based on the features. In the foregoing implementation, if the number of target texts is two, the two target texts may be referred to as a first text and a second text, respectively. After obtaining the first position information of the first text in the target image, the target model may first decode the first position information based on the features of the target image, thereby obtaining the 1st vector representation of the first text. The target model may then decode the first position information and the 1st vector representation of the first text based on the features of the target image to obtain the 2nd vector representation of the first text. In this way, the target model may obtain the first text presented in the form of vector representations. After the first text is obtained, the target model can process the first position information and the first text based on the features of the target image, so as to obtain the second position information of the second text in the target image. After obtaining the second position information of the second text in the target image, the target model may decode the first position information, the first text, and the second position information based on the features of the target image, thereby obtaining the 1st vector representation of the second text. The target model may then decode the first position information, the first text, the second position information, and the 1st vector representation of the second text based on the features of the target image to obtain the 2nd vector representation of the second text. In this way, the target model can accurately obtain the second text presented in the form of vector representations.
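The following sketch illustrates, under the same assumptions as the earlier sketches, the decoding order described above for two target texts: first position information, first text, second position information, then second text, with everything decoded so far carried along as context. The decode_segment helper is a hypothetical stand-in for one autoregressive decoding pass.

```python
def decode_segment(decoder, image_features, context, num_steps=4):
    """Assumed helper: decode `num_steps` new vector representations, each
    conditioned on the image features plus everything decoded so far."""
    segment = []
    for _ in range(num_steps):
        segment.append(decoder(image_features, context + segment))
    return segment


def decode_two_texts_sketch(decoder, image_features):
    """Illustrative decoding order when there are two target texts:
    first position -> first text -> second position -> second text."""
    context = []                                                    # everything decoded so far
    first_pos = decode_segment(decoder, image_features, context)    # first position information
    context = context + first_pos
    first_text = decode_segment(decoder, image_features, context)   # first text
    context = context + first_text
    second_pos = decode_segment(decoder, image_features, context)   # second position information
    context = context + second_pos
    second_text = decode_segment(decoder, image_features, context)  # second text
    return (first_pos, first_text), (second_pos, second_text)
```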
In one possible implementation, the method further includes: and converting all vector representations of the position information to obtain coordinates of the area occupied by the target text in the target image. In the foregoing implementation, the target model may convert the location information of the target text (in the target image) presented in the form of a vector representation into the location information of the target text presented in the form of coordinates to provide the user with a visual effect of the target text in the target image.
In one possible implementation, the method further includes: and converting all vector representations of the target text to obtain all characters of the target text. In the foregoing implementation, the target model may further convert the target text presented in the form of vector representations into target text presented in the form of characters (words), and further provide the user with a visual effect of the target text in the target image.
In one possible implementation, the conversion performed by the target model on the target text and the position information may be at least one of: feature extraction based on a recurrent neural network, feature extraction based on a multi-layer perceptron, and feature extraction based on a temporal convolutional network.
In one possible implementation, the coordinates of the region are at least one of: vertex coordinates of an upper left corner of the region and vertex coordinates of a lower right corner of the region; or, the vertex coordinates of the upper right corner of the region and the vertex coordinates of the lower left corner of the region; or, vertex coordinates of four corners of the region; or, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the upper left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region.
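As an illustration of the conversion steps described above, the following sketch assumes a multi-layer-perceptron conversion head (one of the options listed) that maps the vector representations of the position information to four coordinates, here assumed to be the top-left and bottom-right vertices of the region, and maps the vector representations of the target text to characters. The dimensions, vocabulary size and pooling are illustrative assumptions, not details taken from the application.

```python
import torch
import torch.nn as nn


class ConversionHeadsSketch(nn.Module):
    """Assumed MLP-based conversion of vector representations into
    (a) the coordinates of the region occupied by the target text and
    (b) the characters of the target text. Dimensions and vocabulary
    size are placeholders."""
    def __init__(self, hidden_dim: int = 256, vocab_size: int = 8000):
        super().__init__()
        # Maps the pooled position vector representations to 4 numbers,
        # assumed here to be (x1, y1) top-left and (x2, y2) bottom-right.
        self.box_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4),
        )
        # Maps each text vector representation to a distribution over characters.
        self.char_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, position_vectors: torch.Tensor, text_vectors: torch.Tensor):
        # position_vectors: (num_position_vectors, hidden_dim)
        # text_vectors:     (num_text_vectors, hidden_dim)
        box = self.box_head(position_vectors.mean(dim=0))       # pooled -> 4 coordinates
        char_ids = self.char_head(text_vectors).argmax(dim=-1)  # one character per vector
        return box, char_ids
```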
A second aspect of an embodiment of the present application provides a model training method, including: acquiring a target image, wherein the target image comprises a plurality of texts; processing the target image through a model to be trained to obtain the position information of the target text in the target image and the target text, wherein the texts comprise the target text, and the model to be trained is used for: encoding the target image to obtain the characteristics of the target image; acquiring position information based on the features; acquiring a target text based on the characteristics and the position information; training the model to be trained based on the target text to obtain a target model.
The target model obtained by training with this method has a text acquisition function. Specifically, when it is necessary to extract a target text from a target image, the target image containing a plurality of texts may be acquired first and input to the target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information of the target text, among the plurality of texts, in the target image. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, it considers not only the features of the target image but also the position information of the target text in the target image, so the factors it considers are comprehensive and the content of the target image can be fully and accurately understood; the target text that the target model extracts in this way from the plurality of texts presented by the target image is therefore usually the correct text.
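A minimal training-loop sketch of the model training method of the second aspect, assuming that ground-truth characters and region coordinates are available and that the loss is a cross-entropy term over characters plus an L1 term over coordinates; the optimizer, loss functions, and data format are assumptions, not details taken from the application.

```python
import torch
import torch.nn as nn


def train_sketch(model_to_train, data_loader, epochs: int = 1, lr: float = 1e-4):
    """Illustrative training loop for the model to be trained.
    `data_loader` is assumed to yield (target_image, gt_characters, gt_box)."""
    optimizer = torch.optim.AdamW(model_to_train.parameters(), lr=lr)
    char_loss_fn = nn.CrossEntropyLoss()   # assumed loss over predicted characters
    box_loss_fn = nn.L1Loss()              # assumed loss over predicted coordinates

    for _ in range(epochs):
        for target_image, gt_characters, gt_box in data_loader:
            # The model outputs both the target text (character logits)
            # and the position information (region coordinates).
            char_logits, pred_box = model_to_train(target_image)
            loss = char_loss_fn(char_logits, gt_characters) + box_loss_fn(pred_box, gt_box)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model_to_train   # the trained target model
```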
In one possible implementation, the model to be trained is configured to decode, based on the features, the 1st vector representation of the position information of the target text in the target image through the i-th vector representation of the position information, to obtain the (i+1)-th vector representation of the position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the features.
In one possible implementation, the model to be trained is configured to decode, based on the features, the 1st vector representation of the target text through the j-th vector representation of the target text, to obtain the (j+1)-th vector representation of the target text, j = 1, ..., Y-1, where Y is greater than or equal to 1 and the 1st vector representation of the target text is obtained by decoding the position information based on the features.
In one possible implementation, the target text includes a first text and a second text, the position information includes first position information of the first text in the target image and second position information of the second text in the target image, and the model to be trained is configured to: decode the 1st vector representation of the first position information through the i-th vector representation of the first position information based on the features to obtain the (i+1)-th vector representation of the first position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the first position information is obtained by decoding a preset vector representation based on the features; and decode the first position information, the first text, and the 1st vector representation of the second position information through the k-th vector representation of the second position information based on the features to obtain the (k+1)-th vector representation of the second position information, k = 1, ..., Z-1, where Z is greater than or equal to 1 and the 1st vector representation of the second position information is obtained by decoding the first position information and the first text based on the features.
In one possible implementation, the model to be trained is configured to: decode the first position information and the 1st vector representation of the first text through the j-th vector representation of the first text based on the features to obtain the (j+1)-th vector representation of the first text, j = 1, ..., Y-1, where Y is greater than or equal to 1 and the 1st vector representation of the first text is obtained by decoding the first position information based on the features; and decode the first position information, the first text, the second position information, and the 1st vector representation of the second text through the t-th vector representation of the second text based on the features to obtain the (t+1)-th vector representation of the second text, t = 1, ..., W-1, where W is greater than or equal to 1 and the 1st vector representation of the second text is obtained by decoding the first position information, the first text, and the second position information based on the features.
In one possible implementation, the model to be trained is further configured to convert all vector representations of the location information to obtain coordinates of an area occupied by the target text in the target image.
In one possible implementation, the model to be trained is further configured to convert all vector representations of the target text to obtain all characters of the target text. Training the model to be trained based on the target text, wherein obtaining the target model comprises the following steps: training the model to be trained based on the characters and the coordinates to obtain a target model.
In one possible implementation, the conversion performed by the model to be trained on the target text and the position information may be at least one of: feature extraction based on a recurrent neural network, feature extraction based on a multi-layer perceptron, and feature extraction based on a temporal convolutional network.
In one possible implementation, the coordinates of the region are at least one of: vertex coordinates of an upper left corner of the region and vertex coordinates of a lower right corner of the region; or, the vertex coordinates of the upper right corner of the region and the vertex coordinates of the lower left corner of the region; or, vertex coordinates of four corners of the region; or, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the upper left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region.
A third aspect of an embodiment of the present application provides a text obtaining apparatus, including a target model, including: the first acquisition module is used for acquiring a target image, wherein the target image comprises a plurality of texts; the coding module is used for coding the target image to obtain the characteristics of the target image; the second acquisition module is used for acquiring the position information of the target text in the target image based on the characteristics, wherein the plurality of texts comprise the target text; and the third acquisition module is used for acquiring the target text based on the characteristics and the position information.
From the above apparatus, it can be seen that: when it is desired to extract a target text from a target image, a target image containing a plurality of texts may be acquired first and input to a target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information of the target text, among the plurality of texts, in the target image. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, it considers not only the features of the target image but also the position information of the target text in the target image, so the factors it considers are comprehensive and the content of the target image can be fully and accurately understood; the target text that the target model extracts in this way from the plurality of texts presented by the target image is therefore usually the correct text.
In one possible implementation, the second obtaining module is configured to decode, based on the features, the 1st vector representation of the position information of the target text in the target image through the i-th vector representation of the position information, to obtain the (i+1)-th vector representation of the position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the features.
In one possible implementation, the third obtaining module is configured to decode the 1st vector representation of the target text through the j-th vector representation of the target text based on the features, to obtain the (j+1)-th vector representation of the target text, where j = 1, ..., Y-1, Y is greater than or equal to 1, and the 1st vector representation of the target text is obtained by decoding the position information based on the features.
In one possible implementation, the target text includes a first text and a second text, the position information includes first position information of the first text in the target image and second position information of the second text in the target image, and the second acquisition module is configured to: decode the 1st vector representation of the first position information through the i-th vector representation of the first position information based on the features to obtain the (i+1)-th vector representation of the first position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the first position information is obtained by decoding a preset vector representation based on the features; and decode the first position information, the first text, and the 1st vector representation of the second position information through the k-th vector representation of the second position information based on the features to obtain the (k+1)-th vector representation of the second position information, k = 1, ..., Z-1, where Z is greater than or equal to 1 and the 1st vector representation of the second position information is obtained by decoding the first position information and the first text based on the features.
In one possible implementation, the third obtaining module is configured to: decode the first position information and the 1st vector representation of the first text through the j-th vector representation of the first text based on the features to obtain the (j+1)-th vector representation of the first text, j = 1, ..., Y-1, where Y is greater than or equal to 1 and the 1st vector representation of the first text is obtained by decoding the first position information based on the features; and decode the first position information, the first text, the second position information, and the 1st vector representation of the second text through the t-th vector representation of the second text based on the features to obtain the (t+1)-th vector representation of the second text, t = 1, ..., W-1, where W is greater than or equal to 1 and the 1st vector representation of the second text is obtained by decoding the first position information, the first text, and the second position information based on the features.
In one possible implementation, the apparatus further includes: the first conversion module is used for converting all vector representations of the position information to obtain coordinates of an area occupied by the target text in the target image.
In one possible implementation, the apparatus further includes: and the second conversion module is used for converting all vector representations of the target text to obtain all characters of the target text.
In one possible implementation, the conversion performed by the target model on the target text and the position information may be at least one of: feature extraction based on a recurrent neural network, feature extraction based on a multi-layer perceptron, and feature extraction based on a temporal convolutional network.
In one possible implementation, the coordinates of the region are at least one of: vertex coordinates of an upper left corner of the region and vertex coordinates of a lower right corner of the region; or, the vertex coordinates of the upper right corner of the region and the vertex coordinates of the lower left corner of the region; or, vertex coordinates of four corners of the region; or, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the upper left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region.
A fourth aspect of an embodiment of the present application provides a model training apparatus, including: the acquisition module is used for acquiring a target image, wherein the target image comprises a plurality of texts; the processing module is used for processing the target image through a model to be trained to obtain the position information of the target text in the target image and the target text, wherein the texts comprise the target text, and the model to be trained is used for: encoding the target image to obtain the characteristics of the target image; acquiring position information based on the features; acquiring a target text based on the characteristics and the position information; the training module is used for training the model to be trained based on the target text to obtain a target model.
The target model obtained by training with this apparatus has a text acquisition function. Specifically, when it is necessary to extract a target text from a target image, the target image containing a plurality of texts may be acquired first and input to the target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information of the target text, among the plurality of texts, in the target image. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, it considers not only the features of the target image but also the position information of the target text in the target image, so the factors it considers are comprehensive and the content of the target image can be fully and accurately understood; the target text that the target model extracts in this way from the plurality of texts presented by the target image is therefore usually the correct text.
In one possible implementation, the model to be trained is configured to decode, based on the features, the 1st vector representation of the position information of the target text in the target image through the i-th vector representation of the position information, to obtain the (i+1)-th vector representation of the position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the features.
In one possible implementation, the model to be trained is configured to decode, based on the features, the 1st vector representation of the target text through the j-th vector representation of the target text, to obtain the (j+1)-th vector representation of the target text, j = 1, ..., Y-1, where Y is greater than or equal to 1 and the 1st vector representation of the target text is obtained by decoding the position information based on the features.
In one possible implementation, the target text includes a first text and a second text, the position information includes first position information of the first text in the target image and second position information of the second text in the target image, and the model to be trained is configured to: decode the 1st vector representation of the first position information through the i-th vector representation of the first position information based on the features to obtain the (i+1)-th vector representation of the first position information, i = 1, ..., X-1, where X is greater than or equal to 1 and the 1st vector representation of the first position information is obtained by decoding a preset vector representation based on the features; and decode the first position information, the first text, and the 1st vector representation of the second position information through the k-th vector representation of the second position information based on the features to obtain the (k+1)-th vector representation of the second position information, k = 1, ..., Z-1, where Z is greater than or equal to 1 and the 1st vector representation of the second position information is obtained by decoding the first position information and the first text based on the features.
In one possible implementation, the model to be trained is configured to: decode the first position information and the 1st vector representation of the first text through the j-th vector representation of the first text based on the features to obtain the (j+1)-th vector representation of the first text, j = 1, ..., Y-1, where Y is greater than or equal to 1 and the 1st vector representation of the first text is obtained by decoding the first position information based on the features; and decode the first position information, the first text, the second position information, and the 1st vector representation of the second text through the t-th vector representation of the second text based on the features to obtain the (t+1)-th vector representation of the second text, t = 1, ..., W-1, where W is greater than or equal to 1 and the 1st vector representation of the second text is obtained by decoding the first position information, the first text, and the second position information based on the features.
In one possible implementation, the model to be trained is further configured to convert all vector representations of the location information to obtain coordinates of an area occupied by the target text in the target image.
In one possible implementation, the model to be trained is further configured to convert all vector representations of the target text to obtain all characters of the target text. And the training module is used for training the model to be trained based on the characters and the coordinates to obtain a target model.
In one possible implementation, the conversion performed by the model to be trained on the target text and the position information may be at least one of: feature extraction based on a recurrent neural network, feature extraction based on a multi-layer perceptron, and feature extraction based on a temporal convolutional network.
In one possible implementation, the coordinates of the region are at least one of: vertex coordinates of an upper left corner of the region and vertex coordinates of a lower right corner of the region; or, the vertex coordinates of the upper right corner of the region and the vertex coordinates of the lower left corner of the region; or, vertex coordinates of four corners of the region; or, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the upper left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region.
A fifth aspect of an embodiment of the present application provides a text acquisition apparatus, the apparatus including a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the text acquisition apparatus performs the method as described in the first aspect or any one of the possible implementations of the first aspect.
A sixth aspect of an embodiment of the present application provides a model training apparatus, the apparatus comprising a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the model training apparatus performs the method as described in the second aspect or any one of the possible implementations of the second aspect.
A seventh aspect of the embodiments of the present application provides a circuit system, the circuit system comprising processing circuitry configured to perform the method described in the first aspect, any one of the possible implementations of the first aspect, the second aspect, or any one of the possible implementations of the second aspect.
An eighth aspect of the embodiments of the present application provides a chip system, the chip system comprising a processor for invoking a computer program or computer instructions stored in a memory to cause the processor to perform a method as described in any one of the first aspect, any one of the possible implementations of the first aspect, the second aspect, or any one of the possible implementations of the second aspect.
In one possible implementation, the processor is coupled to the memory through an interface.
In one possible implementation, the system on a chip further includes a memory having a computer program or computer instructions stored therein.
A ninth aspect of the embodiments of the present application provides a computer storage medium storing a computer program which, when executed by a computer, causes the computer to carry out the method according to any one of the first aspect, the second aspect or any one of the possible implementations of the second aspect.
A tenth aspect of embodiments of the present application provides a computer program product storing instructions which, when executed by a computer, cause the computer to carry out the method according to any one of the first aspect, the second aspect or any one of the possible implementations of the second aspect.
In the embodiment of the application, when the target text is required to be extracted from the target image, the target image containing a plurality of texts can be acquired first, and the target image is input into the target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information of the target text, among the plurality of texts, in the target image. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, it considers not only the features of the target image but also the position information of the target text in the target image, so the factors it considers are comprehensive and the content of the target image can be fully and accurately understood; the target text that the target model extracts in this way from the plurality of texts presented by the target image is therefore usually the correct text.
Drawings
FIG. 1 is a schematic diagram of a structure of an artificial intelligence main body frame;
FIG. 2a is a schematic structural diagram of a text obtaining system according to an embodiment of the present application;
FIG. 2b is another schematic structural diagram of a text obtaining system according to an embodiment of the present application;
FIG. 2c is a schematic diagram of a related device for text retrieval according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a system 100 architecture according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a structure of a target model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a text obtaining method according to an embodiment of the present application;
FIG. 6 is another schematic diagram of a target model according to an embodiment of the present application;
FIG. 7 is another schematic diagram of a target model according to an embodiment of the present application;
FIG. 8 is another schematic diagram of a target model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a document question and answer provided by an embodiment of the present application;
FIG. 10 is a schematic diagram of another structure of a target model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of information extraction according to an embodiment of the present application;
FIG. 12 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a text obtaining device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 16 is a schematic diagram of a training apparatus according to an embodiment of the present application;
FIG. 17 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a text acquisition method and related equipment thereof, which can acquire accurate target text from a target image.
The terms "first", "second" and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the terms so used are interchangeable under appropriate circumstances, and that they are merely a way of distinguishing between objects having the same attributes when describing the embodiments of the application. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
With the rapid development of AI technology, more and more users use a pre-trained neural network model (which may also be referred to as a pre-trained model) to complete an analysis process for an image in which a plurality of texts are presented, that is, the pre-trained neural network model can fully understand the image to extract a target text from the plurality of texts presented by the image.
In the related art, the pre-trained neural network model may include an encoder and a decoder. When a user needs to extract a target text from a plurality of texts presented by an image, the image may be input into a neural network model. The encoder may then encode the image to obtain features of the image and provide the features of the image to the decoder. The decoder may then decode based on the characteristics of the image, thereby obtaining and returning the target text to the user. For example, when a user needs to extract a name of a passenger from an image of a train ticket, the image may be input into a pre-training model, the pre-training model may extract features of the image, and based on the features of the image, extract a text of "name of passenger" from a plurality of texts such as "name of passenger", "shift", "time", "departure place" and "destination" presented by the image, and return the text to the user.
In the above process, the neural network model understands the content of the image based on the features of the image so as to extract the target text from the plurality of texts presented by the image. However, when the neural network model understands the image in this way, the factors it considers are relatively limited, so the target text finally obtained by the model may not be accurate.
Further, in the above process, the neural network model generally outputs only the plain-text target text to the user, and can provide neither a reasonable explanation of its output (i.e., it cannot explain how the model extracted the target text) nor visual interaction (i.e., it cannot additionally provide content related to the target text other than the text itself), so the user experience is reduced.
Furthermore, in the above process, the output length of the neural network model is limited; in some special scenarios the user often needs to obtain a long text, and the model cannot meet this requirement, so the user experience is further reduced.
To solve the above-described problems, embodiments of the present application provide a text acquisition method that can be implemented in combination with artificial intelligence (artificial intelligence, AI) technology. AI technology is a technical discipline that utilizes digital computers or digital computer controlled machines to simulate, extend and extend human intelligence, and obtains optimal results by sensing environments, acquiring knowledge and using knowledge. In other words, artificial intelligence technology is a branch of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Data processing using artificial intelligence is a common application of artificial intelligence.
First, the overall workflow of the artificial intelligence system will be described. Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an artificial intelligence subject framework, and the framework is described below in terms of two dimensions, namely the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing. For example, it may cover the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data - information - knowledge - wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the technology for providing and processing it) to the industrial ecology of the system.
(1) Infrastructure of
The infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the base platform. Communicating with the outside through the sensor; the computing power is provided by a smart chip (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform comprises a distributed computing framework, a network and other relevant platform guarantees and supports, and can comprise cloud storage, computing, interconnection and interworking networks and the like. For example, the sensor and external communication obtains data that is provided to a smart chip in a distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decision-making into products and achieving practical deployment. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
Next, several application scenarios of the present application are described.
Fig. 2a is a schematic structural diagram of a text obtaining system according to an embodiment of the present application, where the text obtaining system includes a user device and a data processing device. The user device includes intelligent terminals such as a mobile phone, a personal computer, or an information processing center. The user device is the initiating end of text acquisition; as the initiator of a text acquisition request, the user usually initiates the request through the user device.
The data processing device may be a device or a server having a data processing function, such as a cloud server, a web server, an application server, or a management server. The data processing device receives the text processing request from the intelligent terminal through an interactive interface, and then performs text processing by means of machine learning, deep learning, searching, reasoning, decision-making, and the like, using a memory that stores data and a processor that processes data. The memory in the data processing device may be a general term that includes a database storing historical data; the database may be located on the data processing device or on another network server.
In the text acquisition system shown in fig. 2a, the user device may receive an instruction from the user, for example, the user device may acquire a target image containing a plurality of texts input/selected by the user, and then initiate a request to the data processing device, so that the data processing device executes an image processing application for the target image obtained by the user device, thereby obtaining a corresponding processing result for the image. For example, the user device may acquire one target image (content presented by the target image includes a plurality of texts) input by the user, and then initiate a processing request of the target image to the data processing device, so that the data processing device performs a series of processing on the target image, thereby obtaining a processing result of the target image, that is, target texts in a plurality of texts included in the target image and position information of the target texts in the target image.
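Purely as an illustration of the interaction in fig. 2a, the sketch below shows a user device sending a target image to a data processing device and receiving both the target text and its position information; the endpoint URL, payload fields and response format are assumptions and are not defined by the application.

```python
# Illustrative only: the endpoint, field names and response format are assumptions,
# not part of the application.
import base64
import json
import urllib.request


def request_text_extraction(image_bytes: bytes, server_url: str = "http://example.com/extract"):
    # Send the target image to the (assumed) data processing device endpoint.
    payload = json.dumps(
        {"target_image": base64.b64encode(image_bytes).decode("ascii")}
    ).encode("utf-8")
    req = urllib.request.Request(server_url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        result = json.loads(resp.read())
    # Expected (assumed) response: the target text and its position in the target image.
    return result.get("target_text"), result.get("position_info")
```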
In fig. 2a, a data processing device may perform the text retrieval method of an embodiment of the present application.
Fig. 2b is another schematic structural diagram of a text obtaining system according to an embodiment of the present application, in fig. 2b, a user device directly serves as a data processing device, and the user device can directly obtain an input from a user and directly process the input by hardware of the user device, and a specific process is similar to that of fig. 2a, and reference is made to the above description and will not be repeated here.
In the text capturing system shown in fig. 2b, the user device may receive an instruction from a user, for example, the user device may capture a target image input by the user (where the content presented by the target image includes a plurality of texts), and then perform a series of processing on the target image, so as to obtain a processing result of the target image, that is, target text in the plurality of texts included in the target image and location information of the target text in the target image.
In fig. 2b, the user equipment itself may perform the text acquisition method according to the embodiment of the present application.
Fig. 2c is a schematic diagram of a related device for text acquisition according to an embodiment of the present application.
The user device in fig. 2a and 2b may be the local device 301 or the local device 302 in fig. 2c, and the data processing device in fig. 2a may be the executing device 210 in fig. 2c, where the data storage system 250 may store data to be processed of the executing device 210, and the data storage system 250 may be integrated on the executing device 210, or may be disposed on a cloud or other network server.
The processors in fig. 2a and 2b may perform data training/machine learning/deep learning through a neural network model or other models (e.g., a model based on a support vector machine), and perform the image processing application on the image by using the model finally obtained through data training or learning, thereby obtaining the corresponding processing result.
Fig. 3 is a schematic diagram of the architecture of a system 100 provided by an embodiment of the present application. In fig. 3, an execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through a client device 140. In the embodiment of the present application, the input data may include: each task to be scheduled, callable resources, and other parameters.
When the execution device 110 preprocesses the input data, or when the computation module 111 of the execution device 110 performs processing related to computation or the like (for example, implementing the function of the neural network in the present application), the execution device 110 may call the data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the data, instructions, and the like obtained by the corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing results to the client device 140 for presentation to the user.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule for different targets or different tasks, where the corresponding target model/rule may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result. Wherein the training data may be stored in database 130 and derived from training samples collected by data collection device 160.
In the case shown in FIG. 3, the user may manually give the input data, and this may be operated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112; if the user's authorization is required for the client device 140 to automatically send the input data, the user may set corresponding permissions in the client device 140. The user may view, at the client device 140, the result output by the execution device 110, and the specific presentation may be in the form of display, sound, action, or the like. The client device 140 may also serve as a data collection terminal that collects the input data of the I/O interface 112 and the output result of the I/O interface 112 shown in the figure as new sample data and stores the new sample data in the database 130. Of course, the data may also not be collected by the client device 140; instead, the I/O interface 112 may directly store the input data of the I/O interface 112 and the output result of the I/O interface 112 shown in the figure as new sample data into the database 130.
It should be noted that fig. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, apparatuses, modules, and the like shown in the figure does not constitute any limitation. For example, in fig. 3, the data storage system 150 is an external memory with respect to the execution device 110, while in other cases the data storage system 150 may also be disposed in the execution device 110. As shown in fig. 3, the neural network may be obtained by training performed by the training device 120.
The embodiment of the application also provides a chip, which comprises the NPU. The chip may be provided in an execution device 110 as shown in fig. 3 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 3 to complete the training work of the training device 120 and output the target model/rule.
The neural network processor NPU is mounted as a coprocessor to a main central processing unit (central processing unit, CPU) (host CPU) which distributes tasks. The core part of the NPU is an operation circuit, and the controller controls the operation circuit to extract data in a memory (a weight memory or an input memory) and perform operation.
In some implementations, the arithmetic circuitry includes a plurality of processing units (PEs) internally. In some implementations, the operational circuit is a two-dimensional systolic array. The arithmetic circuitry may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operational circuitry is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit takes the data corresponding to the matrix B from the weight memory and caches the data on each PE in the arithmetic circuit. The operation circuit takes the matrix A data and the matrix B from the input memory to perform matrix operation, and the obtained partial result or the final result of the matrix is stored in an accumulator (accumulator).
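To make the above dataflow concrete, the following is a minimal NumPy sketch of a matrix multiplication whose partial results are added into an accumulator tile by tile, loosely mirroring the description above; the tiling scheme and the function name are illustrative assumptions, not the actual circuit of the NPU.

import numpy as np

def matmul_with_accumulator(A, B, tile=2):
    """Multiply A (M x K) by B (K x N), accumulating partial results per K-tile."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    accumulator = np.zeros((M, N))        # plays the role of the accumulator
    for k0 in range(0, K, tile):          # stream tiles of A and B through the array
        A_tile = A[:, k0:k0 + tile]
        B_tile = B[k0:k0 + tile, :]       # "cached" weight data for this pass
        accumulator += A_tile @ B_tile    # partial result added into the accumulator
    return accumulator

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)
C = matmul_with_accumulator(A, B)
assert np.allclose(C, A @ B)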
The vector calculation unit may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, etc. For example, the vector computation unit may be used for network computation of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit can store the vector of processed outputs to a unified buffer. For example, the vector calculation unit may apply a nonlinear function to an output of the arithmetic circuit, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to an arithmetic circuit, for example for use in subsequent layers in a neural network.
The unified memory is used for storing input data and output data.
A direct memory access controller (DMAC) transfers the input data in the external memory to the input memory and/or the unified memory, stores the weight data in the external memory into the weight memory, and stores the data in the unified memory into the external memory.
And a bus interface unit (bus interface unit, BIU) for implementing interaction among the main CPU, the DMAC and the instruction fetch memory through a bus.
The instruction fetching memory (instruction fetch buffer) is connected with the controller and used for storing instructions used by the controller;
and the controller is configured to invoke the instructions cached in the instruction fetch memory, so as to control the working process of the operation accelerator.
Typically, the unified memory, the input memory, the weight memory, and the instruction fetch memory are On-Chip memories, and the external memory is a memory external to the NPU, which may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or other readable and writable memory.
Because the embodiments of the present application relate to a large number of applications of neural networks, for convenience of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes xs and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

h(x) = f(Σ_{s=1}^{n} Ws·xs + b)   (1)

where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
The operation of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). From the physical level, the operation of each layer in the neural network can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by Wx, operation 4 is completed by +b, and operation 5 is completed by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of this class of things. W is a weight vector, and each value in the vector represents the weight value of one neuron in this layer of the neural network. The vector W determines the spatial transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
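As an illustration of formula (1) and of the per-layer operation y = a(Wx + b), the following Python sketch computes the output of a single neural unit and of a small layer; the sigmoid activation and the concrete sizes are assumptions chosen only for this example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Single neural unit: f(sum_s Ws * xs + b), as in formula (1)
x = np.array([0.5, -1.0, 2.0])       # inputs xs
W_unit = np.array([0.1, 0.4, -0.2])  # weights Ws
b_unit = 0.3                         # bias b
out = sigmoid(np.dot(W_unit, x) + b_unit)

# One layer: y = a(Wx + b); W transforms the input space, +b translates, a() "bends"
W = np.random.randn(4, 3)   # a layer of 4 neurons, each with 3 weights
b = np.zeros(4)
y = sigmoid(W @ x + b)      # output of the layer, which can feed the next layer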
Since it is desirable that the output of the neural network be as close as possible to the value that is actually desired, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually desired target value and then adjusting the weight vectors according to the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer of the neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment continues until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or the objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, and the training of the neural network then becomes a process of reducing this loss as much as possible.
(2) Back propagation algorithm
The neural network can adopt the back propagation (BP) algorithm to correct the parameter values in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation process dominated by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
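The following is a minimal PyTorch sketch of the forward pass, error loss, and back propagation described above; the squared-error loss, sigmoid output, and learning rate of 0.1 are illustrative assumptions rather than the configuration used in the application.

import torch

x = torch.tensor([1.0, 2.0])                # input signal
w = torch.randn(2, requires_grad=True)      # initial (pre-configured) parameters
b = torch.zeros(1, requires_grad=True)

y_pred = torch.sigmoid(w @ x + b)           # forward propagation produces a predicted value
target = torch.tensor([1.0])                # actually desired target value
loss = (y_pred - target).pow(2).sum()       # error loss between prediction and target

loss.backward()                             # back propagation of the error loss information
with torch.no_grad():                       # update parameters so the error loss decreases
    w -= 0.1 * w.grad
    b -= 0.1 * b.grad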
The method provided by the application is described below from the training side of the neural network and the application side of the neural network.
The model training method provided by the embodiment of the application relates to the processing of a data sequence, and can be particularly applied to methods such as data training, machine learning, deep learning and the like, and intelligent information modeling, extraction, preprocessing, training and the like of symbolizing and formalizing training data (for example, a target image in the model training method of the embodiment of the application) are performed, so that a trained neural network (such as the target model in the application) is finally obtained; in addition, the text obtaining method provided by the embodiment of the present application may use the trained neural network, and input data (for example, the target image in the text obtaining method of the embodiment of the present application) into the trained neural network to obtain output data (for example, the target text in the text obtaining method of the embodiment of the present application and the position information of the target text in the target image). It should be noted that, the model training method and the text obtaining method provided by the embodiments of the present application are applications based on the same concept, and may be understood as two parts in a system or two stages of an overall process: such as a model training phase and a model application phase.
The text acquisition method provided by the embodiment of the application can be realized through a target model (which may also be referred to as a pre-trained text (document) model); the structure of the target model is briefly introduced first. Fig. 4 is a schematic structural diagram of a target model provided in an embodiment of the present application. As shown in fig. 4, the input end of the target model may be used to receive a target image from the outside, and the output end of the target model may output the target text and the position information of the target text in the target image. In order to facilitate understanding of the workflow of the target model shown in fig. 4, the workflow is described below with reference to fig. 5. Fig. 5 is a schematic flow chart of a text acquisition method according to an embodiment of the present application. As shown in fig. 5, the method includes:
501. A target image is acquired, where the target image includes a plurality of texts.
In this embodiment, when the target text needs to be acquired from the target image that contains it, the target image may be acquired first. The content presented by the target image includes a plurality of texts. For example, when the target image is an image of an income declaration form, the image contains texts such as "declarant name: Zhang XX", "declaration time: October 02, 20XX", "income amount: 200XXX yuan", "sex of the declarant: male", and "contact information of the declarant: 130XXXXXXXX". For another example, when the target image is an image of a diabetes statistics sheet, the image contains texts such as "introduction to diabetes: diabetes is a common disease...", "prevalence of diabetes: 0.X", "prevalence by sex: male 4.X% and female 3.X%", "annual economic loss caused by diabetes: 6X hundred million", and "preventable probability of diabetes: 6X%".
It will be appreciated that the number of target texts to be acquired may be one or more, that is, one or more of the plurality of texts contained in the target image need to be acquired. Still as in the above example, when the target image is an image of an income declaration form, the target texts to be acquired are, for example, the two texts "declarant name: Zhang XX" and "declaration time: October 02, 20XX". For another example, when the target image is an image of a diabetes statistics sheet, the target text to be acquired is the text "preventable probability of diabetes: 6X%".
502. The target image is encoded to obtain features of the target image.
After the target image is obtained, the target image can be input into a target model, and the target model can encode the target image first, so that the characteristics of the target image are obtained.
Specifically, as shown in fig. 4, the target model may include an encoder and a decoder. The target model may then acquire the features of the target image in the following manner:
after inputting the target image into the target model, the encoder of the target model may encode the target image, thereby obtaining features of the target image, and transmit the features of the target image to the decoder of the target model.
For example, as shown in fig. 6 (fig. 6 is another schematic structural diagram of a target model provided in an embodiment of the present application), it is assumed that the content presented by a certain image includes text 1, text 2, text 3, ..., text n (n is a positive integer greater than or equal to 2), and the target model includes an encoder and a decoder. When text 1, text 2, ..., text m in the image need to be obtained (m is less than or equal to n, and m is a positive integer greater than or equal to 1), the image may be input into the target model. Then, upon receiving the image, the encoder of the target model may encode the image to obtain the visual features of the image. The encoding process is shown in the following formula:
H=Encoder(image) (2)
In the above formula, image is the image, and H is the visual features of the image. After obtaining the visual features of the image, the encoder provides them to the decoder.
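The application does not fix a particular encoder architecture; the following PyTorch sketch assumes a small convolutional encoder that turns an image into a sequence of visual feature vectors H, which would then be handed to the decoder. All layer sizes and the class name are assumptions made only for illustration.

import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, image):                   # image: (batch, 3, height, width)
        fmap = self.backbone(image)             # (batch, d_model, height/4, width/4)
        return fmap.flatten(2).transpose(1, 2)  # H: (batch, num_positions, d_model)

encoder = ImageEncoder()
image = torch.randn(1, 3, 224, 224)
H = encoder(image)   # visual features of the image, to be provided to the decoder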
503. Based on the features, position information of the target text in the target image is acquired, and the plurality of texts comprise the target text.
After the features of the target image are obtained, the target model can process the features of the target image, so as to obtain the position information of the target text in the target image, and output the position information of the target text in the target image to the outside; that is, the position information of the target text in the target image is one of the outputs of the target model.
Specifically, the target model may obtain the location information of the target text in the target image in a plurality of ways:
(1) If the number of target texts is one, after obtaining the features of the target image, the decoder may first decode a preset vector representation (which may also be referred to as a start vector representation of the sequence, whose content is usually nonsensical) based on the features of the target image, so as to obtain the 1 st vector representation (token) of the position information of the target text in the target image. The decoder may then decode the 1 st vector representation of the location information of the target text in the target image based on the features of the target image, resulting in the 2 nd vector representation of the location information of the target text in the target image. The decoder may then decode the 1 st vector representation of the location information of the target text in the target image and the 2 nd vector representation of the location information of the target text in the target image based on the characteristics of the target image, resulting in a 3 rd vector representation of the location information of the target text in the target image. In this way, the decoder can obtain the position information of the target text presented in the form of vector representation in the target image.
Still as in the example above, it is only necessary to obtain text 1 from the image. Then, after obtaining the visual features of the image, the decoder may decode the starting vector representation (beginning of sequence, BOS) of the sequence based on the visual features of the image, resulting in a vector representation 1 of the position information 1 of text 1 (in the image), i.e. equivalent to the position information 1 being presented in the form of a vector representation.
(2) If the number of target texts is two, the two target texts may be referred to as a first text and a second text, respectively. After obtaining the features of the target image, the decoder may first decode the preset vector representation based on the features of the target image, thereby obtaining the 1 st vector representation of the first position information of the first text in the target image. The decoder may then decode the 1 st vector representation of the first location information based on the features of the target image, resulting in a 2 nd vector representation of the first location information. The decoder may then decode the 1 st vector representation of the first location information and the 2 nd vector representation of the first location information based on the characteristics of the target image, resulting in a 3 rd vector representation of the first location information. In this way, the decoder can obtain the complete first position information in the form of a vector representation.
After obtaining the first position information, the decoder may process the first position information based on the features of the target image, thereby obtaining the first text; this process is described in detail below.
After obtaining the first text, the decoder may decode the first position information and the first text based on the features of the target image to obtain the 1st vector representation of the second position information of the second text in the target image. The decoder may then decode the first position information, the first text, and the 1st vector representation of the second position information based on the features of the target image, resulting in the 2nd vector representation of the second position information. Next, the decoder may decode the first position information, the first text, and the 1st to 2nd vector representations of the second position information based on the features of the target image, to obtain the 3rd vector representation of the second position information. Finally, the decoder may decode the first position information, the first text, and the 1st to (Z-1)-th vector representations of the second position information based on the features of the target image, to obtain the Z-th vector representation of the second position information (Z being a positive integer greater than or equal to 1). In this way, the decoder can obtain the second position information presented in the form of vector representations.
Still as in the example above, it is only necessary to obtain text 1 and text 2 from the image. Then, after obtaining the visual features of the image, the decoder may decode the starting vector representation of the sequence based on the visual features of the image, resulting in a vector representation 1 of the position information 1 of the text 1 (in the image), i.e. equivalent to the position information 1 being presented in the form of a vector representation.
Then, the decoder may process the position information 1 based on the visual features of the image to obtain text 1; this process is described in detail below.
The decoder may then decode the vector representation 1 of the position information 1, the vector representation 2 of the text 1, the vector representation 3 of the text 1 and the vector representation 4 of the text 1 based on the visual features of the image, resulting in a vector representation 5 of the position information 2 of the text 2 (in the image), i.e. equivalent to the position information 2 being presented in the form of a vector representation.
(3) If the number of the target texts is three or more, that is, there are the first text, the second text, the third text, and so on, in this case, the procedure of the decoder acquiring the first position information of the first text in the target image, the second position information of the second text in the target image, the third position information of the third text in the target image, and so on is similar to the procedure described in the above (2), and will not be repeated here.
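The autoregressive procedure described in (1) to (3) above can be sketched as follows: each new vector representation of a piece of position information is decoded from the features of the image together with everything generated so far, starting from the preset (start-of-sequence) vector representation. The decoder_step function below is a hypothetical stand-in for the real decoder, and the fixed number of tokens is an assumption made only to keep the example short.

def generate_position_tokens(decoder_step, features, prefix, num_tokens):
    """Autoregressively produce the vector representations of one piece of
    position information. `prefix` holds everything generated so far
    (BOS for the first text, or earlier position info / text for later ones)."""
    tokens = []
    for _ in range(num_tokens):
        # decode the prefix plus all position tokens produced so far
        next_token = decoder_step(features, prefix + tokens)
        tokens.append(next_token)
    return tokens

# Illustrative usage with a dummy decoder step (returns placeholder tokens)
features = "H"                                   # stands for the visual features of the image
dummy_step = lambda feats, ctx: f"vec_{len(ctx)}"
bos = ["BOS"]
position_1 = generate_position_tokens(dummy_step, features, bos, num_tokens=4)
# position_1 == ['vec_1', 'vec_2', 'vec_3', 'vec_4']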
504. The target text is acquired based on the features and the position information.
After obtaining the position information of the target text in the target image, the decoder can further process the features of the target image and the position information of the target text in the target image, so as to obtain the target text, and output the target text to the outside; that is, the target text is the other output of the target model.
Specifically, the target model may obtain the target text in a number of ways:
(1) If the number of target texts is one, after obtaining the position information of the target text in the target image, the decoder may decode that position information based on the features of the target image, so as to obtain the 1st vector representation of the target text. The decoder may then decode the position information of the target text in the target image and the 1st vector representation of the target text based on the features of the target image, resulting in the 2nd vector representation of the target text. Next, the decoder may decode the position information of the target text in the target image and the 1st to 2nd vector representations of the target text based on the features of the target image, resulting in the 3rd vector representation of the target text. In this way, the decoder can obtain the target text presented in the form of vector representations.
Still as in the example above, it is only necessary to obtain text 1 from the image. After obtaining the position information 1 of text 1, the decoder may decode the position information 1 based on the visual features of the image, thereby obtaining the vector representation 2 of text 1. The decoder may then further decode the position information 1 and the vector representation 2 of text 1 based on the visual features of the image, thereby obtaining the vector representation 3 of text 1. Next, the decoder may decode the position information 1, the vector representation 2 of text 1, and the vector representation 3 of text 1 based on the visual features of the image, thereby obtaining the vector representation 4 of text 1. This corresponds to text 1 being presented in the form of vector representations.
(2) If the number of target texts is two, the two target texts may be referred to as a first text and a second text, respectively. After obtaining the first position information of the first text in the target image, the decoder may first decode the first position information based on the features of the target image, thereby obtaining the 1st vector representation of the first text. The decoder may then decode the first position information and the 1st vector representation of the first text based on the features of the target image, resulting in the 2nd vector representation of the first text. Next, the decoder may decode the first position information and the 1st to 2nd vector representations of the first text based on the features of the target image, resulting in the 3rd vector representation of the first text. Finally, the decoder may decode the first position information and the 1st to (Y-1)-th vector representations of the first text based on the features of the target image, resulting in the Y-th vector representation of the first text. In this way, the decoder can obtain the first text presented in the form of vector representations.
After the first text is obtained, the decoder may process the first location information and the first text based on the features of the target image, so as to obtain second location information of the second text in the target image.
After obtaining the second position information of the second text in the target image, the decoder may decode the first position information, the first text, and the second position information based on the features of the target image, thereby obtaining the 1 st vector representation of the second text. The decoder may then decode the first location information, the first text, the second location information, and the 1 st vector representation of the second text based on the characteristics of the target image, resulting in a 2 nd vector representation of the second text. The decoder may then decode the first location information, the first text, the second location information, the 1 st vector representation of the second text, and the 2 nd vector representation of the second text based on the characteristics of the target image, resulting in a 3 rd vector representation of the second text. In this way, the decoder can obtain the second text presented in the form of a vector representation.
Still as in the example above, it is necessary to obtain text 1 and text 2 from the image. After obtaining the position information 1 of text 1, the decoder may decode the position information 1 based on the visual features of the image, thereby obtaining the vector representation 2 of text 1. The decoder may then further decode the position information 1 and the vector representation 2 of text 1 based on the visual features of the image, thereby obtaining the vector representation 3 of text 1. Next, the decoder may decode the position information 1, the vector representation 2 of text 1, and the vector representation 3 of text 1 based on the visual features of the image, thereby obtaining the vector representation 4 of text 1. This corresponds to text 1 being presented in the form of vector representations.
The decoder may then process the position information 1 and text 1 based on the visual features of the image, so as to obtain the position information 2 of text 2; this process is described in the relevant section above and is not repeated here.
The decoder may then further decode the vector representation 1 of the position information 1, the vector representations 2 to 4 of text 1, and the vector representation 5 of the position information 2 based on the visual features of the image, thereby obtaining the vector representation 6 of text 2. Next, the decoder may also decode the vector representation 1 of the position information 1, the vector representations 2 to 4 of text 1, the vector representation 5 of the position information 2, and the vector representation 6 of text 2 based on the visual features of the image, thereby obtaining the vector representation 7 of text 2. This corresponds to text 2 being presented in the form of vector representations.
(3) If the number of the target texts is three or more, that is, there are the first text, the second text, the third text, and so on, in this case, the process of the decoder acquiring the first text, the second text, the third text, and so on is similar to the process described in the above (2), and a detailed description thereof will be omitted.
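Putting steps 503 and 504 together, the overall generation order for several target texts (first position information, then the text itself, then the next position information, and so on) can be sketched as follows; decoder_step, the fixed lengths, and the stopping rule are all illustrative assumptions, since in the real model the decoder itself determines how many vector representations each item needs.

def autoregressive_decode(decoder_step, features, context, length):
    out = []
    for _ in range(length):
        out.append(decoder_step(features, context + out))
    return out

def extract_texts(decoder_step, features, num_texts, pos_len=4, text_len=3):
    """For each target text, generate its position information first and then
    its characters, each conditioned on everything generated before it."""
    generated = ["BOS"]                 # preset start vector representation
    results = []
    for _ in range(num_texts):
        pos = autoregressive_decode(decoder_step, features, generated, pos_len)
        generated += pos
        txt = autoregressive_decode(decoder_step, features, generated, text_len)
        generated += txt
        results.append((pos, txt))
    return results

# dummy decoder step, for illustration only
dummy_step = lambda feats, ctx: f"vec_{len(ctx)}"
results = extract_texts(dummy_step, "H", num_texts=2)
# results[0] holds position info + text of the first text, results[1] those of the second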
More specifically, as shown in fig. 7 (fig. 7 is another schematic structural diagram of a target model provided in an embodiment of the present application), the target model may further include a converter in addition to the encoder and the decoder. It should be noted that, in the target model shown in fig. 4, (the decoder of) the target model outputs the target text presented in the form of vector representations and the position information of the target text in the target image, whereas in the target model shown in fig. 7, the decoder may send the target text presented in the form of vector representations to the converter, and the converter may convert all vector representations of the target text to obtain all characters of the target text, so that the converter can output the target text presented in the form of characters (words) to the outside. Similarly, the decoder may send the position information of the target text in the target image, presented in the form of vector representations, to the converter, and the converter may convert all vector representations of that position information to obtain the coordinates of the region occupied by the target text in the target image, so that the converter can output the position information of the target text in the target image, presented in the form of coordinates, to the outside.
More specifically, in the object model shown in fig. 7, the converter may be at least one of: a recurrent neural network, a multi-layer perceptron, and a time convolution network. Accordingly, the conversion performed by the converter on the target text and the position information may be at least one of the following: feature extraction based on a cyclic neural network, feature extraction based on a multi-layer perceptron and feature extraction based on a time convolution network.
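As one of the options listed above, the following PyTorch sketch assumes a multi-layer perceptron converter that maps every vector representation of the target text to a character over an assumed vocabulary; the dimensions and vocabulary size are placeholders, not values from the application.

import torch
import torch.nn as nn

d_model, vocab_size = 256, 5000    # assumed sizes
text_converter = nn.Sequential(    # multi-layer perceptron variant of the converter
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, vocab_size),
)

text_vectors = torch.randn(7, d_model)                   # all vector representations of the target text
char_ids = text_converter(text_vectors).argmax(dim=-1)   # one character id per vector representation
# char_ids would then be mapped back to characters (words) through the vocabulary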
More specifically, the coordinates of the region occupied by the target text in the image generally include the vertex coordinates of the upper-left corner of the region and the vertex coordinates of the lower-right corner of the region. It should be noted that the converter may obtain the vertex coordinates of the region in the following chained manner: the converter first computes, from all the vector representations of the position information of the target text in the target image, a feature of the abscissa of the upper-left vertex of the region, and then computes the upper-left vertex abscissa Z1 from that feature. Next, using the same vector representations together with Z1, the converter computes a feature of the ordinate of the upper-left vertex, and then computes the upper-left vertex ordinate Z2 from that feature. The process continues in the same way until the upper-left vertex abscissa Z1, the upper-left vertex ordinate Z2, the lower-right vertex abscissa Z3, and the lower-right vertex ordinate Z4 of the region are all obtained.
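The chained computation of Z1 to Z4 described above can be sketched as follows; the pooling of the position vector representations, the linear layers, and the dimensions are assumptions made only to illustrate the order of the computation, not the actual converter.

import torch
import torch.nn as nn

class CoordinateConverter(nn.Module):
    """Produces Z1..Z4 (upper-left x/y, lower-right x/y) one after another, each
    computed from the position vector representations plus the coordinates
    already obtained, mirroring the chained description above."""
    def __init__(self, d_model=256):
        super().__init__()
        self.feature_net = nn.Linear(d_model + 4, d_model)  # 4 slots hold the Z values produced so far
        self.out = nn.Linear(d_model, 1)

    def forward(self, pos_vectors):               # (num_tokens, d_model)
        pooled = pos_vectors.mean(dim=0)          # summary of all position vector representations
        coords = []
        for _ in range(4):                        # Z1, Z2, Z3, Z4 in turn
            prev = torch.tensor(coords + [0.0] * (4 - len(coords)))
            h = torch.relu(self.feature_net(torch.cat([pooled, prev])))  # feature of the next coordinate
            coords.append(self.out(h).item())     # the next coordinate value
        return coords                             # [Z1, Z2, Z3, Z4]

converter = CoordinateConverter()
coords = converter(torch.randn(4, 256))           # e.g. 4 position vector representations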
It should be understood that in this embodiment, the coordinates of the region are only schematically described as including the coordinates of the vertex in the upper left corner of the region and the coordinates of the vertex in the lower right corner of the region. In practical applications, the coordinates of the region may also be any of the following: (1) Vertex coordinates of an upper right corner of the region and vertex coordinates of a lower left corner of the region; (2) vertex coordinates of four corners of the region; (3) Vertex coordinates of an upper left corner of the region, vertex coordinates of a lower left corner of the region, and center point coordinates of the region; (4) Vertex coordinates of an upper right corner of the region, vertex coordinates of a lower right corner of the region, and center point coordinates of the region; (5) Vertex coordinates of an upper right corner of the region, vertex coordinates of an upper left corner of the region, and center point coordinates of the region; (6) Vertex coordinates of a lower right corner of the region, vertex coordinates of a lower left corner of the region, and center point coordinates of the region; (7) The vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, the center point coordinates of the region, and so forth.
Further, in the target model shown in fig. 4 or fig. 7, the total output length of the decoder is not limited. In general, when the total number of required vector representations (of position information and text) is less than or equal to a preset threshold (e.g., 1024), the decoder may output all vector representations at once; when the total number of required vector representations is greater than the preset threshold, the decoder may output the vector representations in batches. For example, when 2000 vector representations are required, the decoder may output the 1st to 1024th vector representations as the first batch, and then continue decoding with the last 25% of the vector representations in the first batch as input, thereby outputting the 1025th to 2000th vector representations as the second batch. Therefore, the target model provided by the embodiment of the application can output a sufficient number of texts and texts of sufficient length.
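The batched output scheme described above can be sketched as follows; the decode callable is a hypothetical stand-in for the decoder, and only the threshold of 1024 and the 25% carry-over come from the example above.

def decode_in_batches(decode, total_needed, threshold=1024, carry_ratio=0.25):
    """`decode(context, n)` stands in for the decoder producing n vector
    representations given a context; it is an assumption, not the real model."""
    outputs = []
    context = []
    while len(outputs) < total_needed:
        n = min(threshold, total_needed - len(outputs))
        batch = decode(context, n)
        outputs.extend(batch)
        carry = max(1, int(len(batch) * carry_ratio))
        context = batch[-carry:]          # last 25% of this batch seeds the next one
    return outputs

# e.g. 2000 required vector representations -> a batch of 1024, then a batch of 976
dummy_decode = lambda ctx, n: [f"vec_{len(ctx)}_{i}" for i in range(n)]
vecs = decode_in_batches(dummy_decode, 2000)
assert len(vecs) == 2000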
Furthermore, the target model provided by the embodiment of the application is a pre-trained model. In order to adapt to downstream tasks, (the structure and parameters of) the target model can be fine-tuned according to the requirements of different downstream tasks. The process is described below in connection with several downstream tasks:
Assuming that the downstream task is a document question-and-answer task, as shown in fig. 8 (fig. 8 is another schematic diagram of the object model provided by the embodiment of the present application), the decoder in the object model is still connected to the coordinate converter and the text converter (the structure of the object model shown in fig. 8 and the structure of the object model shown in fig. 7 may be the same), at this time, the BOS input to the decoder is a prompt (question) input by the user, and the decoder outputs the position information of the answer in the vector representation form and the answer in the vector representation form based on the feature of the image input by the user. After conversion by the coordinate converter, coordinates of the answer can be obtained, and after conversion by the text converter, the answer in the text form can be obtained. Then, the coordinates of the answer and the answer in text form can be appended to the image entered by the user and returned to the user.
For example, as shown in fig. 9 (fig. 9 is a schematic diagram of a document question and answer provided by an embodiment of the present application), a user inputs an image of a high-speed railway ticket and asks "Where is X Haipeng going?". The target model can process the image to obtain the answer "Zalan Flat Station" in text form and the coordinates of the answer in the image; then "Zalan Flat Station" can be framed with a box in the image, and the corresponding highlighted text is attached to the image, which is returned to the user for browsing.
Assuming that the downstream task is an information extraction task, as shown in fig. 10 (fig. 10 is another schematic diagram of the object model provided in the embodiment of the present application), the decoder in the object model is connected to the information extractor (that is, the converter in the object model shown in fig. 7 is replaced by the information extractor), at this time, the BOS input to the decoder is still a nonsensical vector representation, and the decoder outputs the position information of the object information in the vector representation form and the object information in the vector representation form based on the characteristics of the image input by the user. After comprehensive processing by the information extractor, the target information in the form of characters can be obtained.
For example, as shown in fig. 11 (fig. 11 is a schematic diagram of information extraction provided in the embodiment of the present application), a user inputs an image of a high-speed railway ticket. The target model can process the image, obtain information such as the time, destination, name, departure place, and train number in text form, and return the information to the user for browsing.
Furthermore, the object model provided by the embodiment of the present application may be compared with a model provided by a related technology, where the comparison result is shown in table 1:
TABLE 1

Model | Decoding output | Long document decoding length | Computational complexity
Prior art | Text | Not limited | Higher
Embodiment of the application | Text + position information | Not limited | Lower
Based on the results shown in table 1, the object model provided by the embodiment of the application not only can decode more types of output, but also has lower computational complexity, thereby effectively reducing the power consumption of a decoder in the model and accelerating the decoding speed.
Still further, the target model provided by the embodiment of the present application (for example, "Ours" in Table 2) may be compared with models provided by other related technologies (for example, the remaining models in Table 2 other than "Ours", such as BERT), and the comparison result is shown in Table 2:
TABLE 2
Based on table 2, it can be seen that the target model provided by the embodiment of the application has better performance.
In the embodiment of the application, when the target text needs to be extracted from the target image, the target image containing a plurality of texts can be acquired first, and the target image is input into the target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information, in the target image, of the target text among the plurality of texts. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, not only the features of the target image but also the position information of the target text in the target image are considered, so the factors considered are comprehensive and the content of the target image can be fully and accurately understood. It can be seen that the target text extracted in this way by the target model from the plurality of texts presented in the target image is usually the correct text.
Further, what the target model finally outputs includes not only the target text in the form of characters (words) but also the coordinates of the region occupied by the target text in the target image. The two types of information can be superimposed on the target image and returned to the user for browsing, so that the target text required by the user can be provided to the user through a visual interactive interface, and the basis on which the target model extracted the target text can be explained to the user.
Further, the output length of the target model is not limited, so that even if a user needs to acquire a longer text or a larger number of texts, the target model can meet the requirements of the user, thereby improving the user experience.
The foregoing is a detailed description of the text acquisition method provided by the embodiment of the present application, and the model training method provided by the embodiment of the present application will be described below. Fig. 12 is a schematic flow chart of a model training method according to an embodiment of the present application, as shown in fig. 12, the method includes:
1201. A target image is acquired, where the target image includes a plurality of texts.
In this embodiment, when the model to be trained needs to be trained, a batch of training data is acquired first, where the batch of training data includes a target image, and the content presented by the target image includes a plurality of texts. For the plurality of texts contained in the target image, all the real characters of the target text among the plurality of texts are known, and the real coordinates of the region occupied by the target text in the target image are also known.
1202. Processing the target image through a model to be trained to obtain the position information of the target text in the target image and the target text, wherein the texts comprise the target text, and the model to be trained is used for: encoding the target image to obtain the characteristics of the target image; acquiring position information based on the features; and acquiring the target text based on the characteristics and the position information.
After the target image is obtained, the target image can be input into the model to be trained. The model to be trained can then encode the target image to obtain the features of the target image. Next, the model to be trained may obtain the position information of the target text in the target image based on the features of the target image. Finally, the model to be trained may obtain the target text based on the features of the target image and the position information of the target text in the target image.
In one possible implementation, the model to be trained is configured to decode, based on the feature, the 1st vector representation of the position information of the target text in the target image to the i-th vector representation of the position information, to obtain the (i+1)-th vector representation of the position information, where i = 1, ..., X-1, X is greater than or equal to 1, and the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the feature.
In one possible implementation, the model to be trained is configured to decode, based on the feature, the 1st vector representation of the target text to the j-th vector representation of the target text, to obtain the (j+1)-th vector representation of the target text, where j = 1, ..., Y-1, Y is greater than or equal to 1, and the 1st vector representation of the target text is obtained by decoding the position information based on the feature.
In one possible implementation, the target text includes a first text and a second text, the position information includes first position information of the first text in the target image and second position information of the second text in the target image, and the model to be trained is configured to: decode the 1st vector representation of the first position information to the i-th vector representation of the first position information based on the feature to obtain the (i+1)-th vector representation of the first position information, where i = 1, ..., X-1, X is greater than or equal to 1, and the 1st vector representation of the first position information is obtained by decoding a preset vector representation based on the feature; and decode the first position information, the first text, and the 1st to k-th vector representations of the second position information based on the feature to obtain the (k+1)-th vector representation of the second position information, where k = 1, ..., Z-1, Z is greater than or equal to 1, and the 1st vector representation of the second position information is obtained by decoding the first position information and the first text based on the feature.
In one possible implementation, the model to be trained is configured to: decode the first position information and the 1st to j-th vector representations of the first text based on the feature to obtain the (j+1)-th vector representation of the first text, where j = 1, ..., Y-1, Y is greater than or equal to 1, and the 1st vector representation of the first text is obtained by decoding the first position information based on the feature; and decode the first position information, the first text, the second position information, and the 1st to t-th vector representations of the second text based on the feature to obtain the (t+1)-th vector representation of the second text, where t = 1, 2, ..., and the 1st vector representation of the second text is obtained by decoding the first position information, the first text, and the second position information based on the feature.
In a possible implementation, the model to be trained is further configured to convert all vector representations of the position information to obtain (predicted) coordinates of the region occupied by the target text in the target image.
In one possible implementation, the model to be trained is also used to transform all vector representations of the target text, resulting in all (predicted) characters of the target text.
In one possible implementation, the conversion performed by the model to be trained on the target text and the location information may be at least one of: feature extraction based on a cyclic neural network, feature extraction based on a multi-layer perceptron and feature extraction based on a time convolution network.
In one possible implementation, the coordinates of the region are at least one of: vertex coordinates of an upper left corner of the region and vertex coordinates of a lower right corner of the region; or, the vertex coordinates of the upper right corner of the region and the vertex coordinates of the lower left corner of the region; or, vertex coordinates of four corners of the region; or, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the upper left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region.
For the description of step 1202, reference may be made to the relevant description of steps 502 to 504 in the embodiment shown in fig. 5, which is not repeated here.
1203. Training the model to be trained based on the target text to obtain a target model.
After the target text is obtained, the model to be trained can be trained based on the target text, so as to obtain the target model in the embodiment shown in fig. 5.
Specifically, training of the target model may be accomplished by:
After all the characters of the target text and the coordinates of the region occupied by the target text in the target image are obtained, the difference between the characters of the target text and the real characters of the target text can be calculated through a preset first loss function, so as to obtain a first loss. The first loss may be obtained by a formula of the following form:

L_Read = -Σ_r log P(y'_r | X, y_1, ..., y_{r-1})

In the above formula, L_Read is the first loss, X is the features of the target image, y_{r-1} is the (r-1)-th character of the target text, y_r is the r-th character of the target text, and y'_r is the r-th real character of the target text. It follows that the first loss may be used to indicate the difference between the characters of the target text and the real characters of the target text.
Then, the difference between the coordinates of the region occupied by the target text in the target image and the real coordinates of that region can be calculated through a preset second loss function, so as to obtain a second loss. The second loss may be obtained by a formula of the following form:

L_Locate = -Σ_s Σ_u log P(z'^s_u | X, z^s_1, ..., z^s_{u-1})

In the above formula, L_Locate is the second loss, z^s_{u-1} is the (u-1)-th coordinate of the region occupied by the s-th target text in the target image, z^s_u is the u-th coordinate of the region occupied by the s-th target text in the target image, and z'^s_u is the u-th real coordinate of the region occupied by the s-th target text in the target image. It follows that the second loss may be used to indicate the difference between the coordinates of the region occupied by the target text in the target image and the real coordinates of that region.
The first loss and the second loss may then be superimposed to obtain the target loss. Then, the parameters of the model to be trained can be updated by using the target loss, so as to obtain an updated model to be trained, and the model to be trained after the parameters are updated is continuously trained by using the next batch of training data until the model training conditions (such as target loss convergence and the like) are met, so that the target model is obtained.
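One training iteration over the target loss can then be sketched as follows; the model, optimizer, and the two loss callables are placeholders standing for the pieces described above, so only the superposition of the losses and the parameter update are illustrated.

import torch

def training_step(model, optimizer, target_image, real_chars, real_coords,
                  read_loss_fn, locate_loss_fn):
    pred_chars, pred_coords = model(target_image)         # forward pass of the model to be trained
    L_read = read_loss_fn(pred_chars, real_chars)         # first loss
    L_locate = locate_loss_fn(pred_coords, real_coords)   # second loss
    target_loss = L_read + L_locate                       # superimposed target loss
    optimizer.zero_grad()
    target_loss.backward()                                # gradients of the target loss
    optimizer.step()                                      # update the parameters of the model to be trained
    return target_loss.item()

# Repeating this step with successive batches of training data until the training
# condition is met (for example, convergence of the target loss) yields the target model.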
The target model obtained through training in the embodiment of the application has a text acquisition function. Specifically, when it is necessary to extract a target text from a target image, the target image containing a plurality of texts may be acquired first and input to the target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information, in the target image, of the target text among the plurality of texts. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, not only the features of the target image but also the position information of the target text in the target image are considered, so the factors considered are comprehensive and the content of the target image can be fully and accurately understood. It can be seen that the target text extracted in this way by the target model from the plurality of texts presented in the target image is usually the correct text.
The text acquisition method and the model training method provided by the embodiment of the application are described in detail, and the text acquisition device and the model training device provided by the embodiment of the application are described below. Fig. 13 is a schematic structural diagram of a text obtaining apparatus according to an embodiment of the present application, as shown in fig. 13, where the apparatus includes:
a first obtaining module 1301, configured to obtain a target image, where the target image includes a plurality of texts;
the encoding module 1302 is configured to encode the target image to obtain characteristics of the target image;
a second obtaining module 1303, configured to obtain, based on the features, position information of a target text in the target image, where the multiple texts include the target text;
a third obtaining module 1304 is configured to obtain the target text based on the feature and the location information.
In the embodiment of the application, when the target text needs to be extracted from the target image, the target image containing a plurality of texts can be acquired first, and the target image is input into the target model. The target model may then encode the target image, resulting in features of the target image. The target model may then process the features of the target image to obtain the position information, in the target image, of the target text among the plurality of texts. Finally, the target model can further process the features of the target image and the position information of the target text in the target image, so that the target text is obtained. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, not only the features of the target image but also the position information of the target text in the target image are considered, so the factors considered are comprehensive and the content of the target image can be fully and accurately understood. It can be seen that the target text extracted in this way by the target model from the plurality of texts presented in the target image is usually the correct text.
In one possible implementation, the second obtaining module 1303 is configured to decode, based on the features, the 1st to i-th vector representations of the position information of the target text in the target image to obtain the (i+1)-th vector representation of the position information, where i = 1, ..., X-1 and X is greater than or equal to 1, and the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the features.
In one possible implementation, the third obtaining module 1304 is configured to decode, based on the features, the position information and the 1st to j-th vector representations of the target text to obtain the (j+1)-th vector representation of the target text, where j = 1, ..., Y-1 and Y is greater than or equal to 1, and the 1st vector representation of the target text is obtained by decoding the position information based on the features.
In one possible implementation, the target text includes a first text and a second text, and the position information includes first position information of the first text in the target image and second position information of the second text in the target image. The second obtaining module 1303 is configured to: decode, based on the features, the 1st to i-th vector representations of the first position information to obtain the (i+1)-th vector representation of the first position information, where i = 1, ..., X-1 and X is greater than or equal to 1, and the 1st vector representation of the first position information is obtained by decoding a preset vector representation based on the features; and decode, based on the features, the first position information, the first text and the 1st to k-th vector representations of the second position information to obtain the (k+1)-th vector representation of the second position information, where k = 1, ..., Z-1 and Z is greater than or equal to 1.
In one possible implementation, the third obtaining module 1304 is configured to: decode, based on the features, the first position information and the 1st to j-th vector representations of the first text to obtain the (j+1)-th vector representation of the first text, where j = 1, ..., Y-1 and Y is greater than or equal to 1, and the 1st vector representation of the first text is obtained by decoding the first position information based on the features; and decode, based on the features, the first position information, the first text, the second position information and the 1st to t-th vector representations of the second text to obtain the (t+1)-th vector representation of the second text, where t = 1, ..., W-1 and W is greater than or equal to 1.
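The iterative decoding described in the preceding implementations can be pictured with the following minimal sketch, which assumes a hypothetical step function decode_step(features, prefix) that returns the next vector representation; it only illustrates the described iteration and is not the claimed implementation.

def decode_sequence(features, decode_step, prefix, max_len, is_end):
    # prefix initially holds the conditioning already decoded, for example the preset
    # vector representation, or earlier position information and text.
    outputs = []
    for _ in range(max_len):
        next_vec = decode_step(features, prefix)  # decode the prefix to get the next vector representation
        if is_end(next_vec):
            break
        outputs.append(next_vec)
        prefix = prefix + [next_vec]
    return outputs

# Schematically, the decoding order described above would be:
#   first_position  = decode_sequence(features, step, [PRESET], ...)
#   first_text      = decode_sequence(features, step, [PRESET] + first_position, ...)
#   second_position = decode_sequence(features, step, [PRESET] + first_position + first_text, ...)
#   second_text     = decode_sequence(features, step, the above plus second_position, ...)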
In one possible implementation, the apparatus further includes: the first conversion module is used for converting all vector representations of the position information to obtain coordinates of an area occupied by the target text in the target image.
In one possible implementation, the apparatus further includes: and the second conversion module is used for converting all vector representations of the target text to obtain all characters of the target text.
In one possible implementation, the conversion performed by the target model on the target text and the position information may be at least one of: feature extraction based on a recurrent neural network, feature extraction based on a multi-layer perceptron, and feature extraction based on a temporal convolutional network.
In one possible implementation, the coordinates of the region are at least one of: vertex coordinates of an upper left corner of the region and vertex coordinates of a lower right corner of the region; or, the vertex coordinates of the upper right corner of the region and the vertex coordinates of the lower left corner of the region; or, vertex coordinates of four corners of the region; or, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the upper left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region.
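As a concrete and purely illustrative example of the conversion from vector representations to region coordinates, the following PyTorch sketch uses a small multi-layer perceptron head, one of the options mentioned above, to map the vector representations of the position information to a box in the "upper-left corner + lower-right corner" format; the hidden size, the pooling step and the normalized output range are assumptions, not details taken from the application.

import torch
import torch.nn as nn

class PositionToBox(nn.Module):
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4),  # (x1, y1, x2, y2)
        )

    def forward(self, position_vectors):
        # position_vectors: (num_vectors, hidden_dim), all vector representations of the position information
        pooled = position_vectors.mean(dim=0)
        return torch.sigmoid(self.mlp(pooled))  # coordinates normalized to [0, 1]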
Fig. 14 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application, as shown in fig. 14, where the apparatus includes:
an acquisition module 1401 for acquiring a target image, the target image comprising a plurality of texts;
the processing module 1402 is configured to process the target image through a model to be trained to obtain position information of a target text in the target image and the target text, where the plurality of texts include the target text, and the model to be trained is configured to: encode the target image to obtain features of the target image; acquire the position information based on the features; and acquire the target text based on the features and the position information;
the training module 1403 is configured to train the model to be trained based on the target text, so as to obtain a target model.
The target model obtained through training in the embodiment of the application has a text acquisition function. Specifically, when a target text needs to be extracted from a target image, the target image containing a plurality of texts may first be acquired and input into the target model. The target model may then encode the target image to obtain features of the target image. Next, the target model may process the features of the target image to obtain position information, in the target image, of the target text among the plurality of texts. Finally, the target model may further process the features of the target image and the position information of the target text in the target image, thereby obtaining the target text. At this point, the target text has been successfully extracted from the target image. In the foregoing process, when the target model understands the content of the target image, it considers not only the features of the target image but also the position information of the target text in the target image. The factors considered are therefore comprehensive, and the content of the target image can be fully and accurately understood, so the target text extracted by the target model in this way from the plurality of texts presented in the target image is usually the correct text.
In one possible implementation, the model to be trained is configured to decode, based on the features, the 1st to i-th vector representations of the position information of the target text in the target image to obtain the (i+1)-th vector representation of the position information, where i = 1, ..., X-1 and X is greater than or equal to 1, and the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the features.
In one possible implementation, the model to be trained is configured to decode, based on the features, the position information and the 1st to j-th vector representations of the target text to obtain the (j+1)-th vector representation of the target text, where j = 1, ..., Y-1 and Y is greater than or equal to 1, and the 1st vector representation of the target text is obtained by decoding the position information based on the features.
In one possible implementation, the target text includes a first text and a second text, and the position information includes first position information of the first text in the target image and second position information of the second text in the target image. The model to be trained is configured to: decode, based on the features, the 1st to i-th vector representations of the first position information to obtain the (i+1)-th vector representation of the first position information, where i = 1, ..., X-1 and X is greater than or equal to 1, and the 1st vector representation of the first position information is obtained by decoding a preset vector representation based on the features; and decode, based on the features, the first position information, the first text and the 1st to k-th vector representations of the second position information to obtain the (k+1)-th vector representation of the second position information, where k = 1, ..., Z-1 and Z is greater than or equal to 1.
In one possible implementation, the model to be trained is configured to: decode, based on the features, the first position information and the 1st to j-th vector representations of the first text to obtain the (j+1)-th vector representation of the first text, where j = 1, ..., Y-1 and Y is greater than or equal to 1, and the 1st vector representation of the first text is obtained by decoding the first position information based on the features; and decode, based on the features, the first position information, the first text, the second position information and the 1st to t-th vector representations of the second text to obtain the (t+1)-th vector representation of the second text, where t = 1, ..., W-1 and W is greater than or equal to 1.
In one possible implementation, the model to be trained is further configured to convert all vector representations of the location information to obtain coordinates of an area occupied by the target text in the target image.
In one possible implementation, the model to be trained is further configured to convert all vector representations of the target text to obtain all characters of the target text. The training module 1403 is configured to train the model to be trained based on the characters and the coordinates, so as to obtain a target model.
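The application does not fix a particular loss function; as one possible, assumed form, the training step of the training module 1403 could combine a character-level cross-entropy loss with a coordinate regression loss, roughly as sketched below (the weighting factor alpha and the L1 box loss are illustrative assumptions).

import torch.nn.functional as F

def training_step(char_logits, char_targets, pred_boxes, gt_boxes, optimizer, alpha=1.0):
    # char_logits: (num_chars, vocab_size); char_targets: (num_chars,) ground-truth character indices
    text_loss = F.cross_entropy(char_logits, char_targets)
    # pred_boxes, gt_boxes: (num_texts, 4) normalized region coordinates
    box_loss = F.l1_loss(pred_boxes, gt_boxes)
    loss = text_loss + alpha * box_loss  # alpha balances the two terms (an assumption)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()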
In one possible implementation, the conversion performed by the model to be trained on the target text and the position information may be at least one of: feature extraction based on a recurrent neural network, feature extraction based on a multi-layer perceptron, and feature extraction based on a temporal convolutional network.
In one possible implementation, the coordinates of the region are at least one of: vertex coordinates of an upper left corner of the region and vertex coordinates of a lower right corner of the region; or, the vertex coordinates of the upper right corner of the region and the vertex coordinates of the lower left corner of the region; or, vertex coordinates of four corners of the region; or, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the upper left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region; or, the vertex coordinates of the upper right corner of the region, the vertex coordinates of the lower right corner of the region, the vertex coordinates of the upper left corner of the region, the vertex coordinates of the lower left corner of the region, and the center point coordinates of the region.
It should be noted that, because the information interaction and execution processes between the modules/units of the above apparatus are based on the same concept as the method embodiments of the application, the technical effects they bring are the same as those of the method embodiments; for specific details, reference may be made to the descriptions in the foregoing method embodiments, which are not repeated here.
The embodiment of the application also relates to an execution device, and fig. 15 is a schematic structural diagram of the execution device provided by the embodiment of the application. As shown in fig. 15, the execution device 1500 may be embodied as a mobile phone, a tablet, a notebook, a smart wearable device, a server, etc., which is not limited herein. The text obtaining apparatus described in the corresponding embodiment of fig. 13 may be disposed on the executing device 1500, so as to implement the function of text obtaining in the corresponding embodiment of fig. 5. Specifically, the execution apparatus 1500 includes: a receiver 1501, a transmitter 1502, a processor 1503 and a memory 1504 (where the number of processors 1503 in the execution device 1500 may be one or more, one processor is exemplified in fig. 15), wherein the processor 1503 may include an application processor 15031 and a communication processor 15032. In some embodiments of the application, the receiver 1501, transmitter 1502, processor 1503 and memory 1504 may be connected by a bus or other means.
Memory 1504 may include read-only memory and random access memory and provides instructions and data to the processor 1503. A portion of the memory 1504 may also include non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 1504 stores processor-executable operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1503 controls the operation of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The method disclosed in the above embodiment of the present application may be applied to the processor 1503 or implemented by the processor 1503. The processor 1503 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 1503 or by instructions in the form of software. The processor 1503 may be a general-purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, and may further include an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The processor 1503 may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1504 and the processor 1503 reads the information in the memory 1504 and in combination with its hardware performs the steps of the above method.
The receiver 1501 may be used to receive input digital or character information and to generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1502 may be used to output numeric or character information through a first interface; the transmitter 1502 may also be configured to send instructions to the disk set through the first interface to modify data in the disk set; the transmitter 1502 may also include a display device such as a display screen.
In an embodiment of the present application, in one case, the processor 1503 is configured to obtain the target text from the target image through the target model in the corresponding embodiment of fig. 5.
The embodiment of the application also relates to training equipment, and fig. 16 is a schematic structural diagram of the training equipment provided by the embodiment of the application. As shown in fig. 16, the training device 1600 is implemented by one or more servers, the training device 1600 may vary considerably in configuration or performance, and may include one or more central processing units (central processing units, CPU) 1616 (e.g., one or more processors) and memory 1632, one or more storage media 1630 (e.g., one or more mass storage devices) storing applications 1642 or data 1644. Wherein memory 1632 and storage medium 1630 may be transitory or persistent. The program stored on the storage medium 1630 may include one or more modules (not shown), each of which may include a series of instruction operations in the training device. Still further, central processor 1616 may be configured to communicate with storage medium 1630 to execute a series of instruction operations in storage medium 1630 on training device 1600.
The training device 1600 may also include one or more power supplies 1626, one or more wired or wireless network interfaces 1650, one or more input/output interfaces 1658, and/or one or more operating systems 1641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Specifically, the training device may perform the model training method in the corresponding embodiment of fig. 12, so as to obtain the target model.
The embodiment of the application also relates to a computer storage medium in which a program for performing signal processing is stored; when the program runs on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Embodiments of the present application also relate to a computer program product storing instructions that, when executed by a computer, cause the computer to perform the steps performed by the aforementioned execution device, or cause the computer to perform the steps performed by the aforementioned training device.
The execution device, training device or terminal device provided in the embodiment of the present application may be a chip, where the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, pins or circuitry, etc. The processing unit may execute the computer-executable instructions stored in the storage unit to cause the chip in the execution device to perform the data processing method described in the above embodiment, or to cause the chip in the training device to perform the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc., and the storage unit may also be a storage unit in the wireless access device side located outside the chip, such as a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a random access memory (random access memory, RAM), etc.
Specifically, referring to fig. 17, fig. 17 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be represented as a neural network processor NPU 1700; the NPU 1700 is mounted as a coprocessor on a host CPU (Host CPU), which distributes tasks. The core part of the NPU is the arithmetic circuit 1703, and the controller 1704 controls the arithmetic circuit 1703 to fetch matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 1703 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 1703 is a two-dimensional systolic array. The arithmetic circuit 1703 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1703 is a general-purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1702 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes the data of matrix A from the input memory 1701 and performs a matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in the accumulator (accumulator) 1708.
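A plain-Python illustration of this multiply-accumulate behaviour (purely for exposition, not a description of the hardware) is given below: partial products of A and B are summed into an accumulator before being written to C.

def matmul_accumulate(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0                      # plays the role of the accumulator 1708
            for k in range(inner):
                acc += A[i][k] * B[k][j]   # partial results accumulated step by step
            C[i][j] = acc
    return C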
The unified memory 1706 is used for storing input data and output data. The weight data is transferred directly to the weight memory 1702 through the direct memory access controller (Direct Memory Access Controller, DMAC) 1705. The input data is also carried into the unified memory 1706 through the DMAC.
The bus interface unit (Bus Interface Unit, BIU) 1717 is used for the AXI bus to interact with the DMAC and the instruction fetch memory (Instruction Fetch Buffer, IFB) 1709. The bus interface unit 1717 is used by the instruction fetch memory 1709 to obtain instructions from the external memory, and is also used by the direct memory access controller 1705 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1706 or to transfer weight data to the weight memory 1702 or to transfer input data to the input memory 1701.
The vector calculation unit 1707 includes a plurality of operation processing units, which perform further processing on the output of the arithmetic circuit 1703 as needed, such as vector multiplication, vector addition, exponential operation, logarithmic operation and magnitude comparison. It is mainly used for non-convolution/non-fully-connected layer computation in the neural network, such as batch normalization (Batch Normalization), pixel-level summation and up-sampling of a predicted label plane.
In some implementations, the vector calculation unit 1707 can store the processed output vectors in the unified memory 1706. For example, the vector calculation unit 1707 may apply a linear or nonlinear function to the output of the arithmetic circuit 1703, such as performing linear interpolation on the predicted label plane extracted by a convolutional layer, or applying a nonlinear function to a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 1707 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 1703, for example for use in subsequent layers of the neural network.
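For illustration only, the kind of post-processing attributed to the vector calculation unit above (batch normalization followed by an activation) could look as follows in PyTorch; the shapes and the absence of learned affine parameters are assumptions made for brevity.

import torch

def vector_unit_postprocess(x, eps=1e-5):
    # x: (batch, channels) output of the matrix operation
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    normalized = (x - mean) / torch.sqrt(var + eps)  # batch normalization without affine parameters
    return torch.relu(normalized)                    # activation value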
The instruction fetch memory (instruction fetch buffer) 1709 is connected to the controller 1704 and is used for storing instructions used by the controller 1704.
The unified memory 1706, the input memory 1701, the weight memory 1702 and the instruction fetch memory 1709 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
It should be further noted that the above-described apparatus embodiments are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the application, the connection relationship between modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus necessary general-purpose hardware, or of course by means of special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by computer programs can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits or dedicated circuits. However, for the present application, a software program implementation is in most cases the better implementation. Based on such understanding, the technical solution of the present application, or the part thereof that contributes to the prior art, may be embodied essentially in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, etc.) to perform the methods according to the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (Solid State Disk, SSD)), or the like.

Claims (19)

1. A method for obtaining text, wherein the method is implemented by a target model, the method comprising:
acquiring a target image, wherein the target image comprises a plurality of texts;
encoding the target image to obtain the characteristics of the target image;
acquiring position information of a target text in the target image based on the characteristics, wherein the plurality of texts comprise the target text;
and acquiring the target text based on the characteristics and the position information.
2. The method of claim 1, wherein the obtaining location information of target text in the target image based on the features comprises:
decoding, based on the characteristics, the 1st to i-th vector representations of the position information of the target text in the target image to obtain the (i+1)-th vector representation of the position information, i = 1, ..., X-1, X being greater than or equal to 1, wherein the 1st vector representation of the position information is obtained by decoding a preset vector representation based on the characteristics.
3. The method of claim 2, wherein the obtaining the target text based on the feature and the location information comprises:
decoding, based on the feature, the position information and the 1st to j-th vector representations of the target text to obtain the (j+1)-th vector representation of the target text, j = 1, ..., Y-1, Y being greater than or equal to 1, wherein the 1st vector representation of the target text is obtained by decoding the position information based on the feature.
4. The method of claim 1, wherein the target text comprises a first text and a second text, the location information comprises first location information of the first text in the target image and second location information of the second text in the target image, and wherein the obtaining location information of the target text in the target image based on the feature comprises:
decoding, based on the feature, the 1st to i-th vector representations of the first location information to obtain the (i+1)-th vector representation of the first location information, i = 1, ..., X-1, X being greater than or equal to 1, the 1st vector representation of the first location information being obtained by decoding a preset vector representation based on the feature;
decoding, based on the feature, the 1st to k-th vector representations of the second position information to obtain the (k+1)-th vector representation of the second position information, k = 1, ..., Z-1, Z being greater than or equal to 1.
5. The method of claim 4, wherein the obtaining the target text based on the feature and the location information comprises:
decoding, based on the feature, the 1st to j-th vector representations of the first text to obtain the (j+1)-th vector representation of the first text, j = 1, ..., Y-1, Y being greater than or equal to 1, the 1st vector representation of the first text being obtained by decoding the position information based on the feature;
decoding, based on the feature, the first position information, the first text, the second position information and the 1st to t-th vector representations of the second text to obtain the (t+1)-th vector representation of the second text, t = 1, ..., W-1, W being greater than or equal to 1.
6. The method according to any one of claims 1 to 5, further comprising:
and converting all vector representations of the position information to obtain coordinates of an area occupied by the target text in the target image.
7. The method according to any one of claims 1 to 6, further comprising:
and converting all vector representations of the target text to obtain all characters of the target text.
8. A method of model training, the method comprising:
acquiring a target image, wherein the target image comprises a plurality of texts;
processing the target image through a model to be trained to obtain the position information of the target text in the target image and the target text, wherein the plurality of texts comprise the target text, and the model to be trained is used for: encoding the target image to obtain the characteristics of the target image; acquiring the position information based on the characteristics; acquiring the target text based on the characteristics and the position information;
and training the model to be trained based on the target text to obtain a target model.
9. The method according to claim 8, wherein the model to be trained is configured to decode, based on the feature, the 1st to i-th vector representations of the position information of the target text in the target image to obtain the (i+1)-th vector representation of the position information, i = 1, ..., X-1, X being greater than or equal to 1, the 1st vector representation of the position information being obtained by decoding a preset vector representation based on the feature.
10. The method of claim 9, wherein the model to be trained is configured to decode, based on the feature, the position information and the 1st to j-th vector representations of the target text to obtain the (j+1)-th vector representation of the target text, j = 1, ..., Y-1, Y being greater than or equal to 1, the 1st vector representation of the target text being obtained by decoding the position information based on the feature.
11. The method of claim 8, wherein the target text comprises a first text and a second text, the location information comprises first location information of the first text in the target image and second location information of the second text in the target image, and the model to be trained is configured to:
decoding, based on the feature, the 1st to i-th vector representations of the first location information to obtain the (i+1)-th vector representation of the first location information, i = 1, ..., X-1, X being greater than or equal to 1, the 1st vector representation of the first location information being obtained by decoding a preset vector representation based on the feature;
decoding, based on the feature, the 1st to k-th vector representations of the second position information to obtain the (k+1)-th vector representation of the second position information, k = 1, ..., Z-1, Z being greater than or equal to 1.
12. The method of claim 11, wherein the model to be trained is configured to:
decoding, based on the feature, the 1st to j-th vector representations of the first text to obtain the (j+1)-th vector representation of the first text, j = 1, ..., Y-1, Y being greater than or equal to 1, the 1st vector representation of the first text being obtained by decoding the position information based on the feature;
decoding, based on the feature, the first position information, the first text, the second position information and the 1st to t-th vector representations of the second text to obtain the (t+1)-th vector representation of the second text, t = 1, ..., W-1, W being greater than or equal to 1.
13. The method according to any one of claims 8 to 12, wherein the model to be trained is further configured to convert all vector representations of the location information to obtain coordinates of an area occupied by the target text in the target image.
14. The method of claim 13, wherein the model to be trained is further configured to transform all vector representations of the target text to obtain all characters of the target text;
training the model to be trained based on the target text, and obtaining a target model comprises the following steps:
and training the model to be trained based on the characters and the coordinates to obtain a target model.
15. A text acquisition device, wherein the device comprises a target model, and the device comprises:
the first acquisition module is used for acquiring a target image, wherein the target image comprises a plurality of texts;
the encoding module is used for encoding the target image to obtain the characteristics of the target image;
the second acquisition module is used for acquiring the position information of a target text in the target image based on the characteristics, wherein the plurality of texts comprise the target text;
And the third acquisition module is used for acquiring the target text based on the characteristics and the position information.
16. A model training apparatus, the apparatus comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring a target image, and the target image comprises a plurality of texts;
the processing module is used for processing the target image through a model to be trained to obtain the position information of the target text in the target image and the target text, wherein the plurality of texts comprise the target text, and the model to be trained is used for: encoding the target image to obtain the characteristics of the target image; acquiring the position information based on the characteristics; acquiring the target text based on the characteristics and the position information;
and the training module is used for training the model to be trained based on the target text to obtain a target model.
17. A text acquisition apparatus, wherein the apparatus comprises a memory and a processor; the memory stores code, and the processor is configured to execute the code; when the code is executed, the text acquisition apparatus performs the method of any one of claims 1 to 14.
18. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method of any one of claims 1 to 14.
19. A computer program product, characterized in that it stores instructions that, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 14.
CN202310488188.8A 2023-04-28 2023-04-28 Text acquisition method and related equipment thereof Pending CN116758572A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310488188.8A CN116758572A (en) 2023-04-28 2023-04-28 Text acquisition method and related equipment thereof


Publications (1)

Publication Number Publication Date
CN116758572A true CN116758572A (en) 2023-09-15

Family

ID=87946698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310488188.8A Pending CN116758572A (en) 2023-04-28 2023-04-28 Text acquisition method and related equipment thereof

Country Status (1)

Country Link
CN (1) CN116758572A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination