WO2023202268A1

WO2023202268A1 - Text information extraction method and apparatus, target model acquisition method and apparatus, and device

Info

Publication number: WO2023202268A1
Application number: PCT/CN2023/081379
Authority: WO
Inventors: 姜媚
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2022-04-19
Filing date: 2023-03-14
Publication date: 2023-10-26
Also published as: CN114511864B; CN114511864A

Abstract

A text information extraction method and apparatus, a target model acquisition method and apparatus, and a device, relating to the technical field of image processing. The method comprises: acquiring (201) a target text image; acquiring (202) association information between at least one pair of target text segments; on the basis of the association information between each pair of target text segments, determining (203) an association result between each pair of target text segments; and extracting (204) text information in the target text image on the basis of the association result between each pair of target text segments.

Description

Text information extraction method, target model acquisition method, device and equipment

This application claims priority to the Chinese patent application with application number 202210411039.7 and the invention title "Text Information Extraction Method, Target Model Obtaining Method, Device and Equipment" submitted on April 19, 2022, the entire content of which is incorporated by reference. in this application.

Technical field

Embodiments of the present application relate to the field of image processing technology, and in particular to a text information extraction method, a target model acquisition method, apparatus and equipment.

Background technique

Text images containing text information such as menu images and receipt images are common in daily life. Such text images are structured text images or semi-structured text images. How to accurately extract text information from structured text images and semi-structured text images has become an urgent problem to be solved in the field of image processing technology.

Contents of the invention

Embodiments of the present application provide a text information extraction method, a target model acquisition method, a device and equipment.

On the one hand, embodiments of the present application provide a method for extracting text information. The method includes: acquiring a target text image, which includes multiple target text segments; and acquiring an association between at least a pair of target text segments. Information, the association information between any pair of target text segments is used to characterize the possibility of association between any pair of target text segments; based on the association information between the at least one pair of target text segments, determine the and extracting text information in the target text image based on the correlation result between the at least one pair of target text segments.

On the other hand, embodiments of the present application provide a method for obtaining a target model. The method includes: obtaining a sample text image, where the sample text image includes multiple sample text segments; and obtaining at least one pair of sample text segments. based on the predicted correlation information between the at least one pair of sample text segments and the labeled correlation information between the at least one pair of sample text segments; , obtain the target model.

On the other hand, embodiments of the present application provide a text information extraction device. The device includes: an acquisition module, used to acquire a target text image, where the target text image includes multiple target text segments; the acquisition module, It is also used to obtain association information between at least one pair of target text segments, and the association information between any pair of target text segments is used to characterize the possibility of association between any pair of target text segments; the determination module, configured to determine an association result between the at least one pair of target text segments based on the association information between the at least one pair of target text segments; an extraction module configured to determine the association result between the at least one pair of target text segments based on the association between the at least one pair of target text segments As a result, text information in the target text image is extracted.

On the other hand, embodiments of the present application provide a device for acquiring a target model. The device includes: a first acquisition module for acquiring a sample text image, where the sample text image includes a plurality of sample text segments; a second acquisition module. The acquisition module is used to acquire the predicted association information between at least one pair of sample text segments and the annotation association information between the at least one pair of sample text segments; the third acquisition module is used to acquire the predicted association information between the at least one pair of sample text segments based on the at least one pair of sample text segments. The target model is obtained by predicting the correlation information between the at least one pair of sample text segments and the annotation correlation information between the at least one pair of sample text segments.

On the other hand, embodiments of the present application provide an electronic device. The electronic device includes a processor and a memory. At least one computer program is stored in the memory. The at least one computer program is loaded and executed by the processor. , so that the electronic device implements the above text information extraction method or the above target model acquisition method.

On the other hand, a computer-readable storage medium is also provided. At least one computer program is stored in the computer-readable storage medium. The at least one computer program is loaded and executed by the processor to enable the computer to realize the above text information. Extraction method or acquisition method of the above target model.

On the other hand, a computer program or computer program product is also provided. At least one computer program is stored in the computer program or computer program product. The at least one computer program is loaded and executed by the processor, so that the computer program The computer implements the above text information extraction method or the above target model acquisition method.

Description of the drawings

Figure 1 is a schematic diagram of the implementation environment of a text information extraction method or a target model acquisition method provided by an embodiment of the present application;

Figure 2 is a flow chart of a text information extraction method provided by an embodiment of the present application;

Figure 3 is a schematic diagram of a target text image provided by an embodiment of the present application;

Figure 4 is a schematic diagram of extracting image features of the image area where the target text segment is located according to an embodiment of the present application;

Figure 5 is a flow chart of a method for obtaining a target model provided by an embodiment of the present application;

Figure 6 is an example diagram of distances between features of a sample text segment provided by an embodiment of the present application;

Figure 7 is a schematic diagram of training of a neural network model provided by an embodiment of the present application;

Figure 8 is a schematic diagram of extracting text information from a target text image provided by an embodiment of the present application;

Figure 9 is another schematic diagram of extracting text information from a target text image provided by an embodiment of the present application;

Figure 10 is a schematic diagram of extracting text information from another target text image provided by an embodiment of the present application;

Figure 11 is a schematic diagram of extracting text information from another target text image provided by an embodiment of the present application;

Figure 12 is a schematic structural diagram of a text information extraction device provided by an embodiment of the present application;

Figure 13 is a schematic structural diagram of a device for obtaining a target model provided by an embodiment of the present application;

Figure 14 is a schematic structural diagram of a terminal device provided by an embodiment of the present application;

Figure 15 is a schematic structural diagram of a server provided by an embodiment of the present application.

Detailed ways

In order to make the purpose, technical solutions and advantages of the present application clearer, the embodiments of the present application will be further described in detail below with reference to the accompanying drawings.

In related technologies, text recognition is first performed on a target text image to obtain multiple text segments. The target text image is a structured text image or a semi-structured text image. Then, determine the category of each text segment. Next, obtain the association between preset categories, such as the association between dish name and dish price. Based on the association between categories and the categories of each text segment, multiple text segments are associated to obtain association results of multiple text segments, and text information in the target text image is extracted based on the association results of multiple text segments.

Since any two text segments may correspond to the same category, when associating multiple text segments based on the association between categories and the category of each text segment, the accuracy of the association results is poor, resulting in the target text image being The accuracy of the extracted text information is also poor.

FIG. 1 is a schematic diagram of an implementation environment of a text information extraction method or a target model acquisition method provided by an embodiment of the present application. As shown in FIG. 1 , the implementation environment includes a terminal device 101 and a server 102 . Among them, the text information extraction method or the target model acquisition method in the embodiment of the present application is executed by the terminal device 101, or executed by the server 102, or jointly executed by the terminal device 101 and the server 102.

The terminal device 101 is a smartphone, a game console, a desktop computer, a tablet computer, a laptop computer, a smart TV, a smart vehicle-mounted device, a smart voice interaction device, a smart home appliance, etc. The server 102 is one server, or a server cluster composed of multiple servers, or any one of a cloud computing platform and a virtualization center, which is not limited in this embodiment of the present application. The server 102 communicates with the terminal device 101 through a wired network or a wireless network. The server 102 has functions such as data processing, data storage, and data sending and receiving, which are not limited in the embodiment of the present application. The number of terminal devices 101 and servers 102 is not limited, for example, one or more.

Based on the above implementation environment, the embodiment of the present application provides a method for extracting text information. Taking the flow chart of a method of extracting text information provided by the embodiment of the present application as shown in Figure 2 as an example, the method consists of the terminal in Figure 1 The device 101 or the server 102 executes, or the terminal device 101 and the server 102 jointly execute. For the convenience of description, the terminal device 101 or the server 102 is collectively referred to as an electronic device. The following description takes the electronic device as a terminal as an example. As shown in Figure 2, the method includes steps 201 to 204.

Step 201: The terminal obtains a target text image, which includes multiple target text segments.

Target text images refer to images containing text segments, including structured text images and unstructured text images. Among them, text segments refer to words, phrases or sentences composed of characters. Different text segments are usually separated by blank areas in the target text image.

In the embodiment of the present application, each target text segment includes at least one character, and each character is any one of alphabetic characters, numbers, special symbols (such as punctuation marks, currency symbols, etc.), etc. When the target text segment includes multiple characters, the multiple characters form at least one word and can also form at least one sentence.

For example, the target text image is a structured text image, and the structured text image is an image that expresses text through a two-dimensional table structure, and the text in the image has organization, regularity, etc. The structured text image includes multiple target text segments. For each target text segment in the structured text image, there is at least one other target text segment associated with this target text segment, and the other target text segments are the target texts in the multiple target text segments except this target text segment. part.

Please refer to Figure 3. Figure 3 is a schematic diagram of a target text image provided by an embodiment of the present application, in which (1) is a structured text image. It can be seen from the structured text image that the target text segment "Item A" is related to the target text segment "×10", the target text segment "Item B" is related to the target text segment "×15", and the target text segment "Item C” is associated with the target text segment “×3”, the target text segment “Item D” is associated with the target text segment “×9”, and the target text segment “Item E” is associated with the target text segment “×1”. Therefore, each target text segment in the structured text image has at least one other target text segment associated with this target text segment.

For example, the target text image is a semi-structured text image, and the semi-structured text image includes a structured text area and an unstructured text area. Among them, the structured text area is an image area that expresses text through a two-dimensional table structure. The text in this part of the image area has organization, regularity, etc. The unstructured text area is an image area that expresses text through irregular and unorganized data structures. The semi-structured text image includes multiple target text segments. For semi-structured text images, each target text segment in one part of the target text segments has at least one other target text segment associated with this target text segment, while each target text segment in the other part of the target text segments does not exist. Other target text segments associated with this target text segment.

Please continue to refer to Figure 3, where (2) is a semi-structured text image. It can be seen from the semi-structured text image that the target text segment "Dish A" is related to the target text segment "9 yuan", the target text segment "Dish B" is related to the target text segment "from 8 yuan", and the target text segment "Dish C" is related to the target text segment "13 yuan", and the target text segment "Dish D" is related to the target text segment "10 yuan". The target text segment "Price List" and the target text segment "Dish A", "9 yuan", "Dish B", "Starting from 8 yuan", "Dish C", "13 yuan", "Dish D", "10 Yuan" are not related. Therefore, for semi-structured text images, each target text segment in one part of the target text segments has at least one other target text segment associated with this target text segment, while each target text segment in another part of the target text segments has at least one other target text segment associated with it. There is no other target text segment associated with this target text segment.

The embodiments of this application do not limit the image content, acquisition method, quantity, etc. of the target text image. For example, the target text image is a receipt image, a menu image, etc., and the target text image is a photographed image, or an image downloaded from the network.

Step 202: The terminal obtains association information between at least one pair of target text segments, and the association information between any pair of target text segments is used to represent the possibility of association between any pair of target text segments.

In some embodiments, for multiple target text segments in the target text image, any two target text segments among the multiple target text segments are regarded as a pair of target text segments, thereby obtaining at least one pair of target text segments. Obtain the correlation information between at least one pair of target text segments, that is, obtain the correlation information between each pair of target text segments. For example, the correlation information between each pair of target text segments is a non-negative number. When the correlation information between each pair of target text segments is a number greater than or equal to 0 and less than or equal to 1, the number between each pair of target text segments is The correlation information between them is called the correlation probability between each pair of target text segments.

The association information between each pair of target text segments is used to characterize the possibility of association between the pair of target text segments. Among them, the greater the correlation information between each pair of target text segments, the higher the possibility of correlation between the pair of target text segments. That is to say, the correlation information between each pair of target text segments is related to the pair of target text segments. The probability of association between text segments is proportional to that.

In some embodiments, a target model is obtained; an association between at least one pair of target text segments is obtained according to the target model information. Among them, the method of obtaining the target model is described in the relevant description of Figure 5 below, and will not be described again here.

In some embodiments, the target model includes at least one of an image feature extraction network or a text feature extraction network, and the target model determines and outputs the content of the target text image based on an output of at least one of the image feature extraction network or the text feature extraction network. Association information between at least one pair of target text segments. Among them, the image feature extraction network is used to extract image features of the image area where each target text segment is located in at least a pair of target text segments, and the text feature extraction network is used to extract text features of each target text segment in at least a pair of target text segments. .

In some embodiments, obtaining the association information between each pair of target text segments includes: obtaining at least one of the characteristics of each pair of target text segments or the relative position characteristics between each pair of target text segments, each pair of target text segments The features of each target text segment in the segment include at least one of the image features of the image area where the target text segment is located or the text features of the target text segment. The relative position features between each pair of target text segments are used to characterize this The relative position between the image areas where the target text segments are located; determine the relative position between each pair of target text segments based on at least one of the characteristics of at least one pair of target text segments or the relative position characteristics between at least one pair of target text segments. associated information.

In some embodiments, the features of each target text segment are image features of the image area where the target text segment is located, or text features of the target text segment, or include both image features of the image area where the target text segment is located. Includes text features of the target text segment.

In some embodiments, the characteristics of each target text segment include image characteristics of the image area where the target text segment is located. The image characteristics are obtained by: obtaining the image characteristics of the target text image; based on the image characteristics of the target text image and the The position information of the image area where the target text segment is located determines the image characteristics of the image area where the target text segment is located.

In some embodiments, the target text image is input into the image feature extraction network, and the image feature extraction network outputs the image features of the image area where each of the target text segments is located in at least one pair of target text segments. Among them, the image features of the image area where the target text segment is located are used to characterize the texture information of the image area where the target text segment is located.

For example, image detection is performed on the target text image to obtain position information of the image area where each target text segment in the target text image is located. This embodiment of the present application does not limit the position information of the image area where each target text segment is located. For example, the image area where each target text segment is located is a rectangle, a circle, etc. The position information of the image area where each target text segment is located includes at least one of the center point coordinates, vertex coordinates, side length, perimeter, area, radius, etc. of the image area where each target text segment is located. Among them, the coordinates include the abscissa and ordinate, and the side length includes the height and width.

Optionally, the image feature extraction network includes a first extraction network and a second extraction network. After the target text image is input into the image feature extraction network, the first extraction network extracts the image features of the target text image based on the pixel information of each pixel in the target text image (or the normalized target text image). The target text The image features of the image are used to characterize the texture information of the target text image. The second extraction network determines the image features of the image area where each target text segment is located (recorded as the first area feature) based on the image features of the target text image and the position information of the image area where each target text segment is located.

In some embodiments, the target text image is input to the first extraction network, and the first extraction network sequentially performs convolution processing and normalization processing on the target text image to normalize the target text image obtained by the convolution processing to On the standard distribution, it prevents gradient oscillation during training and reduces the problem of model overfitting. Then, the image features of the target text image are determined and output based on the normalized target text image. Optionally, based on the pixel information of each pixel point in the target text image, at least one of the average value of the pixel information and the variance of the pixel information is determined. The target text image after convolution is normalized using at least one of the average value of the pixel information and the variance of the pixel information. This method of normalization is called instance normalization (IN). Since the target text image has large layout and image differences, retaining the shallow appearance information of the image through instance normalization is conducive to the integration and adjustment of the global information of the image, and improves the stability of training and the generalization of the model. .

Among them, the embodiments of the present application do not limit the network structure, network size, etc. of the first extraction network and the second extraction network. Illustratively, both the first extraction network and the second extraction network are convolutional neural networks (Convolutional Neural Networks, CNN). The first extraction network is the backbone network using U-Net architecture, which is used to extract visual features of the target text image. According to the pixel information of each pixel in the target text image, the target text image is first down-sampled to obtain downsampling. Features, and then perform upsampling processing on the downsampled features to obtain the image features of the target text image.

The second extraction network is a region of interest pooling (Region Of Interest Pooling, ROI Pooling) layer, or a region of interest alignment (Region Of Interest Align, ROI Align) layer, which is used according to the image features of the target text image and each The position information of the image area where the target text segment is located determines the image characteristics of the image area where each target text segment is located. That is to say, the ROI Pooling layer or ROI Align layer performs feature extraction on the image features of the target text image again based on the position information of the image area where each target text segment is located, and obtains the image features of the image area where each target text segment is located. . For example, the image feature of the image area where each target text segment is located is a visual feature of a fixed dimension (such as 16 dimensions).

It should be noted that, in addition to the backbone network using the U-Net architecture, the first extraction network can also use the backbone network using the Feature Pyramid Networks (FPN) architecture, or the backbone network using the ResNet architecture. This application The examples do not limit this.

Please refer to Figure 4. Figure 4 is a schematic diagram of extracting image features of an image area where a target text segment is located according to an embodiment of the present application. The target text image is the image shown in (2) in Figure 3, and the target text image includes the image area where the target text segment "price list" is located, as shown by the dotted box in Figure 4. The target text image is input into the backbone network, and the backbone network outputs the image features of the target text image. According to the position information of the image area where the target text segment "price list" is located, feature extraction is performed again on the image features of the target text image to obtain the image features of the image area where the target text segment is located.

Among them, the backbone network using U-Net architecture has the design feature of cross-layer connections, which is more friendly to feature extraction of image areas. The ROI Pooling layer or ROI Align layer performs feature extraction again on the image features of the target text image obtained after upsampling to obtain the image features of the image area where each target text segment is located, which can avoid errors caused by downsampling. accumulation, improving accuracy. In addition, since the image features of the target text image are obtained based on the global information of the target text image, the image features of the image area where each target text segment is located also have the global information of the target text image, and the feature expression ability is stronger and the accuracy is higher. high.

Optionally, image detection is performed on the target text image first to obtain position information of the image area where each target text segment in the target text image is located. Based on the position information of the image area where each target text segment in the target text image is located, image segmentation is performed on the target text image to obtain the image area where each target text segment in the target text image is located. For each target text segment, based on the pixel information of each pixel in the image area where each target text segment is located, the image features of the image area where the target text segment is located are extracted (recorded as second area features).

Optionally, the first area feature and the second area feature are spliced or fused to obtain image features of the image area where each target text segment is located. For example, the first area feature is spliced before or after the second area feature to obtain the image features of the image area where each target text segment is located. Or, use the form of Kronecker product to calculate the outer product between the first area feature and the second area feature to obtain the image features of the image area where each target text segment is located. Or, divide the first area feature into a reference number of first area blocks, divide the second area feature into a reference number of second area blocks, and for each first area block, combine the first area block and the first area block. The second area block associated with the area block is fused to obtain a fusion area block, and each fusion area block is spliced to obtain the image features of the image area where each target text segment is located.

It can be understood that in structured text images or semi-structured text images, different target text segments have obvious visual distinctions in character style, character color, character size, etc. (as shown in Figure 3). The image features of the image area where each target text segment is located can represent the visual information of each target text segment. This visual information plays a role in subsequently determining the associated information between each pair of target text segments, the category of each target text segment, etc. Very good auxiliary, thereby improving accuracy.

In some embodiments, the characteristics of each target text segment include the text characteristics of the target text segment, and the method of obtaining the text characteristics includes: obtaining the word vector of each word in the target text segment; The word vectors of the words are fused to obtain the text features of the target text segment. Among them, the text features of the target text segment are used to represent the semantic information of the words contained in the target text segment itself.

Features of the target text segment include but are not limited to text features of the target text segment. In the embodiment of the present application, image detection and image segmentation are performed on the target text image in sequence to obtain the image area where each target text segment in the target text image is located. For each target text segment, perform image recognition on the image area where the target text segment is located to obtain each target text segment.

After each target text segment is obtained, the target text segment is input into the text feature extraction network. Extracted from text features The network first uses a tokenizer to segment the target text segment to obtain each word in the target text segment. Through a vector lookup table, the word vector of each word in the target text segment is determined. The word vector is a vector with a fixed dimension (such as 200 dimensions). After that, the contextual semantic relationship of the text is further learned based on the word vectors of each word in each target text segment, so as to fuse the word vectors of each word in the target text segment to obtain the text features of the target text segment. Optionally, the text feature extraction network is a Bi-directional Long Short Term Memory (Bi-LSTM) network or a TransFormer network.

Optionally, the characteristics of each target text segment include the image characteristics of the image area where the target text segment is located and the text characteristics of the target text segment. Obtaining the characteristics of each target text segment includes: for each pair of target text segments For each target text segment, the image features of the image area where the target text segment is located are divided into a target number of image feature blocks, and the text features of the target text segment are divided into a target number of text feature blocks; for each image feature block, The image feature block and the text feature block associated with the image feature block are fused to obtain a fused feature block; each fused feature block is spliced to obtain the characteristics of the target text segment.

In some embodiments, the target model can also splice or fuse the image features of the image area where each target text segment is located with the text features of the target text segment to obtain the features of the target text segment. For example, during fusion, the outer product between the image features of the image area where each target text segment is located and the text features of the target text segment is calculated in the form of a Kronecker product to obtain the features of the target text segment. .

When the dimension of at least one of the image features of the image area where the target text segment is located or the text features of the target text segment is larger, directly combine the image features of the image area where the target text segment is located and the text features of the target text segment. Fusion will take a long time. When fused in the form of Kronecker product, the dimensionality of the features of the target text segment will increase dramatically. In order to reduce computational overhead, block fusion is used for fusion.

Optionally, for each target text segment, first divide the image features of the image area where the target text segment is located into a target number of image feature blocks, which are recorded as the 1st to Nth image feature blocks respectively, where N is a positive number greater than 1. An integer and represents the target quantity. In addition, the text features of the target text segment are also divided into a target number of text feature blocks, which are respectively recorded as the 1st to Nth text feature blocks. Next, for each image feature block, the image feature block and the associated text feature block are fused to obtain a fused feature block, where the association between the image feature block and the text feature block means having the same serial number. Among them, the image feature block is recorded as the i-th image feature block, then the text feature block associated with the image feature block is the i-th text feature block, the fusion feature block is recorded as the i-th fusion feature block, and i is the value A positive integer from 1 to N. Optionally, the outer product between the i-th image feature block and the i-th text feature block is calculated in the form of a Kronecker product to obtain the i-th fused feature block. After that, each fused feature block is spliced to obtain the characteristics of the target text segment. That is to say, the 1st to Nth fused feature blocks are spliced to obtain the features of the target text segment.

Optionally, the image features of the image area where each target text segment is located and the text features of the target text segment are first spliced or fused, and then a nonlinear operation is performed to obtain the features of the target text segment. Among them, the characteristics of each target text segment are used to indicate that the target text segment itself has characteristic information, such as image features representing the texture information of the image area where it is located, or text features representing its own semantic information, or representing the above image features and the above text features. fused information.

Through the above method, the characteristics of each target text segment in each pair of target text segments are obtained, and then the characteristics of each pair of target text segments are obtained. Afterwards, the target model determines the associated information between each pair of target text segments based on the characteristics of each pair of target text segments.

Optionally, obtaining the relative position characteristics between each pair of target text segments includes: obtaining the position information of the image area where each pair of target text segments is located; based on the position information of the image area where each pair of target text segments is located and the size of the target text image information to determine the relative position characteristics between each pair of target text segments. Among them, the relative position feature between each pair of target text segments is used to characterize the position difference between the pair of target text segments in the target text image.

For each pair of target text segments, obtain the position information of the image area where the two target text segments in the pair are located. The position information of the image area where each target text segment is located includes the center point of the image area where the target text segment is located. At least one of coordinates, vertex coordinates, side length, perimeter, area, radius, etc. Size information of the target text image is also obtained, and the size information of the target text image includes at least one of side length, perimeter, area, radius, etc. of the target text image. Among them, the coordinates include the abscissa and ordinate, and the side length includes the width and height.

Next, based on the abscissa of the center point of the image area where each pair of target text segments is located, calculate the location of each pair of target text segments. The relative horizontal distance of the image area. Based on the ordinate of the center point of the image area where each pair of target text segments is located, the relative vertical distance of the image area where each pair of target text segments is located is calculated. And based on the relative horizontal distance of the image area where each pair of target text segments is located, the relative vertical distance of the image area where each pair of target text segments is located, the side length of the image area where each pair of target text segments is located, and the side length of the target text image, as follows: Formula (1) determines the relative position characteristics between each pair of target text segments.

Among them, r _ij is the relative position feature between the i-th target text segment and the j-th target text segment. d is the normalization factor to prevent numerical fluctuations calculated for images in different formats. Δx _ij =x _j -x _i , where Δx _ij represents the relative horizontal distance between the image area where the i-th target text segment and the j-th target text segment are located, and x _j is the center of the image area where the j-th target text segment is located. Point abscissa, x _i is the abscissa of the center point of the image area where the i-th target text segment is located. Δy _ij =y _j -y _i , where Δy _ij represents the relative vertical distance between the i-th target text segment and the j-th target text segment in the image area, and y _j is the center of the image area in which the j-th target text segment is located. Point ordinate, y _i is the ordinate of the center point of the image area where the i-th target text segment is located. w _i is the width of the image area where the i-th target text segment is located, h _i is the height of the image area where the i-th target text segment is located. w _j is the width of the image area where the jth target text segment is located, and h _j is the height of the image area where the jth target text segment is located. W is the width of the target text image, and H is the height of the target text image.

According to formula (1), the target model determines the relative position characteristics between at least one pair of target text segments. Afterwards, the target model determines the associated information between each pair of target text segments based on the relative position characteristics between each pair of target text segments.

Optionally, according to formula (2) shown below, the relative position features between each pair of target text segments are normalized and linearly processed to obtain the processed relative position features between each pair of target text segments.
e′ _ij ＝N _l2 (Er _ij ) Formula (2)

Among them, e′ _ij represents the processed relative position characteristics between the i-th target text segment and the j-th target text segment. N _l2 represents normalization processing, such as L2 norm normalization processing, which can improve stability. E represents linear processing, capable of projecting r _ij to fixed dimensions. r _ij is the relative position feature between the i-th target text segment and the j-th target text segment.

Then, the relative position features between each pair of target text segments (after processing) are used to determine the associated information between each pair of target text segments.

Optionally, based on the characteristics of at least one pair of target text segments and the relative position characteristics between the at least one pair of target text segments, determining the association information between each pair of target text segments includes: based on the characteristics of the at least one pair of target text segments and the relative position features between at least one pair of target text segments to construct a graph structure. The graph structure includes at least two nodes and at least one edge. Each node represents the characteristics of a target text segment, and each edge represents a pair connected by the edge. The relative position characteristics between a pair of target text segments indicated by the nodes; then, the association information between each pair of target text segments is determined based on the graph structure.

In the embodiment of the present application, the characteristics of each target text segment in at least a pair of target text segments are used as a node of the graph structure. That is, a node in the graph structure is associated with the characteristics of a target text segment. Furthermore, the relative position features between each pair of target text segments are used as the connecting edges between a pair of nodes associated with each pair of target text segments in the graph structure. Or perform normalization and linear processing on the relative position features between each pair of target text segments according to the above formula (2) to obtain the processed relative position features between each pair of target text segments. The processed relative position features are used as the connecting edges between a pair of nodes associated with each pair of target text segments in the graph structure.

Optionally, refer to the formula (3) shown below, combine the processed relative position features between each pair of target text segments (or the relative position features between each pair of target text segments) and the Features are spliced to obtain spliced features. The features obtained by fusing the spliced features (or the spliced features) using the Multi-Layer Perceptron (MLP) will be used as the edge between the two nodes associated with each pair of target text segments. In this way , can better combine edges and nodes to more accurately obtain the association information between each pair of target text segments.
e _ij =M(n _i ||e′ _ij ||n _j ) Formula (3)

Among them, e _ij is the feature obtained by fusing the spliced features using a multi-layer perceptron. M is a multi-layer perceptron that can Transform vector features into scalar features. n _i is the characteristic of the i-th target text segment. || is the splicing symbol. e′ _ij represents the processed relative position characteristics between the i-th target text segment and the j-th target text segment. n _j is the characteristic of the jth target text segment.

The graph structure is obtained through the above method, so as to simulate the layout relationship between each target text segment in the target text image through the graph structure. In some embodiments, the target model includes a Graph Convolutional Network (GCN). The graph structure of the target text image is input into the graph convolution network, and the graph convolution network determines and outputs the association information between at least one pair of target text segments. Optionally, the graph convolution network mines the structured relationship between the two nodes at both ends of the edge in the graph structure by continuously iteratively updating the graph structure, thereby obtaining the associated information between at least one pair of target text segments. Among them, the process of iteratively updating the graph structure is the process of iteratively updating the nodes of the graph structure, while the edges of the graph structure are not updated.

Optionally, in each iteration, first determine the weight of each edge based on each edge in the graph structure according to formula (4) shown below.

in, is the weight of edge e _ij in the graph structure at the lth iteration. exp is the exponent symbol. Σ is the summation symbol. k is the sequence number. e _ik represents the edge between the i-th node and the k-th node in the graph structure. The edge e _ij is the edge between the i-th node and the j-th node in the graph structure.

Then, according to the formula (5) shown below, based on each node in the graph structure, the weight of each edge of the node at one end of the graph structure, and the edges of the node at one end of the graph structure, update the this node.

in, is the updated i-th node in the graph structure at the l-th iteration. is the i-th node in the graph structure at the l-th iteration. σ represents nonlinear processing. W ^l represents the linear processing at the lth iteration. is the weight of edge e _ij in the graph structure at the l-th iteration, where edge e _ij is the edge between the i-th node and the j-th node in the graph structure.

Through the above method, an iterative update of each node in the graph structure is achieved, that is, an iterative update of the graph structure is achieved. If the iteration end condition is met, the updated graph structure is used as the final graph structure, and the final graph structure is used to determine the associated information between at least one pair of target text segments. If the iteration end conditions are not met, the updated graph structure will be used as the graph structure for the next iteration, and the graph structure will be updated again in the manner shown in formula (4) to formula (5) until the iteration end conditions are met. , obtain the final graph structure, and use the final graph structure to determine the associated information between at least a pair of target text segments. It should be noted that when iteratively updating the graph structure, in addition to iteratively updating the nodes of the graph structure, the edges of the graph structure can also be iteratively updated.

Optionally, satisfying the iteration end condition can be that the number of iterations has been reached, or it can be that the change between the graph structure before iterative update and the graph structure after iterative update is less than the change threshold, that is, the graph structure tends to be stable.

The embodiment of the present application first constructs a graph structure based on the characteristics of each pair of target text segments in the target text image and the relative position characteristics between each pair of target text segments, and then determines the associated information between at least one pair of target text segments based on the graph structure. Since each pair of target text segments is every two target text segments, the features of each pair of target text segments are equivalent to the features of every two target text segments.

For example, if the target text image includes target text segments 1 to 3, then each pair of target text segments in the target text image includes target text segments 1 and 2, target text segments 2 and 3, and target text segments 1 and 3. Then based on the characteristics of target text segment 1, the characteristics of target text segment 2, the characteristics of target text segment 3, the relative position characteristics between target text segments 1 and 2, the relative position characteristics between target text segments 2 and 3, and the target The relative position characteristics between text segments 1 and 3 are used to construct a graph structure, and then the associated information between target text segments 2 and 3 is determined based on the graph structure.

Optionally, the method of obtaining the correlation information between each pair of target text segments also includes: obtaining the category of each target text segment and the correlation information between every two target text segments; based on the category of each target text segment, From the correlation information between each two target text segments, the correlation information between each pair of target text segments is determined. Among them, the category of each target text segment It is used to represent the category to which the target text segment belongs in multiple preset categories, and represents the category to which the target text segment's own semantic information belongs. For example, although the two target text segments "10 yuan" and "15 yuan" represent different Price semantics, but they all belong to the same category "vegetable price".

In the embodiment of the present application, the category of each target text segment is determined based on the characteristics of each target text segment. Determine the associated information between any two target text segments based on the characteristics of any two target text segments, or determine the associated information between any two target text segments based on the relative position characteristics between any two target text segments, Or determine the associated information between any two target text segments based on the characteristics of any two target text segments and the relative position characteristics between any two target text segments. Among them, the embodiment of the present application does not limit the category of each target text segment. For example, if the target text image is a menu image, then the category of each target text segment is dish name, dish price, store name, dish type, others, etc. at least one of them.

Optionally, first construct a graph structure based on the characteristics of each two target text segments in the target text image and the relative position characteristics between each two target text segments, and then determine the category of each target text segment and each two target text segments based on the graph structure. Association information between target text segments. Among them, the determination of the associated information between each pair of target text segments based on the graph structure is described above about "determining the associated information between each pair of target text segments based on the graph structure". The implementation principles of the two are similar and will not be discussed here. Repeat. Next, based on the category of each target text segment, the associated information between each pair of target text segments is determined from the associated information between each two target text segments.

In some embodiments, a Long Short Term Memory (LSTM) network and a Conditional Random Field (CRF) network are used to determine the category of each target text segment based on the characteristics of the target text segment. Among them, the LSTM network and the CRF network determine the category of each character in the target text segment based on the characteristics of each target text segment, and determine the category of the target text segment based on the category of each character. Optionally, if the categories of each character in the target text segment are the same category, then the category of the target text segment is the category of any character. If the categories of each character in the target text segment are different categories, then segment the target text segment into at least two target text segments based on the categories of each character in the target text segment, and each segmented target text The categories of each character in the segment are the same, and the category of each segmented target text segment is the category of any character in the segmented target text segment.

For example, the target text segment A is "Eggs 6 yuan". In the target text segment A, the category of the character "chicken" is the name of the dish, the category of the character "egg" is the name of the dish, the category of the character "6" is the price of the dish, and the category of the character " Yuan" category is vegetable price. Then the target text segment A is divided into the target text segment A1 "eggs" and the target text segment A2 "6 yuan". The category of the target text segment A1 "eggs" is the name of the dish, and the category of the target text segment A2 "6 yuan" is Vegetable prices.

Optionally, based on the category of each target text segment, determining the associated information between each pair of target text segments from the associated information between each two target text segments, including: based on the category of each target text segment, from the target text segment Filter out the text segments to be associated with the target category from the multiple target text segments included in the text image; filter out the associated information between each two text segments to be associated from the associated information between each two target text segments, Obtain the association information between each pair of target text segments. Among them, the text segment to be associated refers to the target text segment with the target category, and also refers to the target text segment that needs to participate in the calculation process of associated information. That is, only every two text segments to be associated need to calculate the associated information, and other categories of text segments need to be calculated. There is no need to calculate associated information for each pair of target text segments.

In the embodiment of the present application, for each target text segment, if the category of the target text segment is the target category, then the target text segment is the text segment to be associated. If the category of the target text segment is not the target category, the target text segment is not a text segment to be associated. In this way, the text segments to be associated can be filtered out from multiple target text segments. The embodiment of the present application does not limit the target category. For example, the target text image is a menu image. Since the main focus is on the matching relationship between the dish name and the dish price in the menu image, the target category is the dish name and the dish price.

After filtering out the text segments to be associated from multiple target text segments, the associated information between each two text segments to be associated can be filtered out from the associated information between each two target text segments. The association information between each two text segments to be associated is regarded as the association information between a pair of target text segments.

For example, if the multiple target text segments are target text segments 1 to 3, and the text segments to be associated are target text segments 2 and 3, then the association information between target text segments 1 and 2, and the relationship between target text segments 2 and 3 Among the associated information and the associated information between target text segments 1 and 3, the associated information between target text segments 2 and 3 is directly filtered out.

Step 203: The terminal determines one of the at least one pair of target text segments based on the association information between the at least one pair of target text segments. The correlation results between.

For each pair of target text segments, an association result between each pair of target text segments is determined based on the association information between each pair of target text segments, where the association result between each pair of target text segments indicates the pair of target text segments Whether related. Optionally, if the correlation information between a certain pair of target text segments is greater than the correlation threshold, the correlation result between the pair of target text segments is determined to be correlation. If the correlation information between a certain pair of target text segments is not greater than the correlation threshold, the correlation result between the pair of target text segments is determined to be non-correlated. The embodiment of the present application does not limit the value of the correlation threshold. For example, the correlation threshold is 0.5.

Optionally, determine the category of each target text segment in at least one pair of target text segments, and obtain an association relationship between each two categories. The association relationship between two categories is used to characterize whether the two categories are associated. Among them, based on the characteristics of each target text segment, the category of each target text segment is determined, or the graph structure of the target text image is input into the graph convolution network, and the graph convolution network determines and outputs at least one pair of target text segments. Category for each target text segment. Optionally, after the graph convolution network updates the graph structure at least once, a final graph structure is obtained, and the final graph structure is used to determine the category of each target text segment in at least one pair of target text segments.

For each pair of target text segments, if the correlation information between the pair of target text segments is greater than the correlation threshold, and the correlation between the categories of the two target text segments in the pair is correlation, then determine this pair The result of the association between target text segments is association. If the correlation information between this pair of target text segments is greater than the correlation threshold, but the correlation between the categories of the two target text segments in this pair of target text segments is not relevant, then determine the relationship between this pair of target text segments The correlation result is no correlation. If the correlation information between the pair of target text segments is not greater than the correlation threshold, but the correlation between the categories of the two target text segments in the pair is correlation, then determine the relationship between the pair of target text segments. The correlation result is no correlation. If the correlation information between the pair of target text segments is not greater than the correlation threshold, and the correlation between the categories of the two target text segments in the pair is not relevant, then determine the relationship between the pair of target text segments. The result of the correlation between them is no correlation.

For example, the association threshold is 0.5, and the association between the two categories includes the association between the dish name and the dish price. Among them, the correlation information between a pair of target text segments is 0.7, and the categories of the two target text segments in the pair are dish name and dish price respectively, then determine the correlation result between the pair of target text segments for association. The correlation information between another pair of target text segments is 0.51, but the categories of both target text segments in this pair of target text segments are dish names, then it is determined that the correlation result between this pair of target text segments is not relevant. .

Step 204: The terminal extracts text information in the target text image based on the association result between at least one pair of target text segments.

In the embodiment of the present application, the text information in the target text image is extracted based on the association result between each pair of target text segments, where the text information at least includes the associated pairs obtained by combining each pair of associated target text segments, that is, Text information not only represents the content of the target text segment identified from the target text image, but also reflects the association results between the target text segments. If the association result between a pair of target text segments is association, add a target symbol (such as at least one of ":", "-", "/", etc.) between the pair of target text segments so that the A pair of target text segments are combined into an associated pair. If the association result between a pair of target text segments is not associated, the pair of target text segments cannot be combined into an associated pair.

Through the above method, it is determined whether each pair of target text segments can be combined into an associated pair, and if it can be combined into an associated pair (that is, when the association result between this pair of target text segments is an association), this pair Target text segments are combined into associated pairs. Realize the association of multiple target text segments in the target text image to obtain the text information in the target text image.

In the above method, the association information between each pair of target text segments is used to characterize the possibility of association between each pair of target text segments. Therefore, through the association information between each pair of target text segments, the relationship between each pair of target text segments is determined. When correlating results between at least one pair of target text segments, the phenomenon of correlation errors is reduced and the accuracy of the correlation results is improved, so that when extracting text information in the target text image based on the correlation results between at least one pair of target text segments, the accuracy of the text information is improved. sex.

Based on the above implementation environment, the embodiment of the present application provides a method for obtaining a target model. Taking the flow chart of a method for obtaining a target model provided by the embodiment of the present application as shown in Figure 5 as an example, the method can be shown in Figure 1 It can be executed by the terminal device 101 or the server 102, or it can also be executed by the terminal device 101 and the server 102 together. For ease of description, the terminal The device 101 and the server 102 are collectively referred to as electronic devices. The electronic device is used as a server for illustration. As shown in Figure 5, the method includes steps 501 to 503.

Step 501: The server obtains a sample text image, which includes multiple sample text segments.

Among them, the sample text image is a structured text image or a semi-structured text image. The sample text image in the embodiment of the present application is the same as the target text image mentioned above. See the description of the target text image above, which will not be described again.

Step 502: The server obtains predicted correlation information between at least one pair of sample text segments and annotation correlation information between at least one pair of sample text segments.

In some embodiments, the predicted correlation information between each pair of sample text segments is a positive number. Wherein, when the predicted association information between each pair of sample text segments is a number greater than or equal to 0 and less than or equal to 1, the predicted association information between each pair of sample text segments is called the association probability between each pair of sample text segments. Among them, the predicted correlation information between each pair of sample text segments can be found in the above description of "the correlation information between each pair of target text segments". The implementation principles of the two are the same and will not be described again.

In the embodiment of the present application, the predicted association information between each pair of sample text segments is obtained according to the neural network model. Wherein, the neural network model includes at least one of a first initial network and a second initial network. The neural network model determines and outputs predicted association information between each pair of sample text segments based on an output of at least one of the first initial network or the second initial network. Among them, the first initial network is used to extract image features of the image area where each sample text segment in each pair of sample text segments is located, and the second initial network is used to extract text features of each sample text segment in each pair of sample text segments.

It should be noted that the first initial network is trained using sample text images to obtain an image feature extraction network, so as to use the image feature extraction network to extract image features of the image area where each target text segment is located in each pair of target text segments. For a description of the initial network, see the description of the image feature extraction network above. The implementation principles of the two are the same and will not be described again. For the description of the second initial network, see the description of the text feature extraction network above. The implementation principles of the two are the same and will not be described again.

Characteristics of at least a pair of sample text segments are obtained, and the characteristics of each sample text segment in each pair of sample text segments include at least one of image features of the image area where the sample text segment is located or text features of the sample text segment. Among them, the characteristics of the sample text segment are obtained in the same way as the characteristics of the target text segment. See the relevant description of the characteristics of the target text segment above, which will not be described again here.

Optionally, the first initial network includes a first sub-network and a second sub-network. After the sample text image is input into the first initial network, the first sub-network extracts the image features of the sample text image based on the pixel information of each pixel in the sample text image. The image features of the sample text image are used to characterize the texture of the sample text image. information. The second sub-network determines the image features of the image area where each sample text segment is located based on the image features of the sample text image and the position information of the image area where each sample text segment is located. Among them, the first sub-network is trained to obtain the first extraction network. For the first sub-network, see the above description of the first extraction network. The implementation principles of the two are the same and will not be described again. The second sub-network is trained to obtain the second extraction network. For the second sub-network, see the above description of the second extraction network. The implementation principles of the two are the same and will not be described again.

Next, relative position features between at least one pair of sample text segments are obtained. The relative position features between each pair of sample text segments are used to characterize the relative position between the image areas where each pair of sample text segments are located. Among them, the relative position features between each pair of sample text segments are obtained in the same way as the relative position features between each pair of target text segments. See above for the relative position features between each pair of target text segments. Description will not be repeated here.

Predictive association information between at least one pair of sample text segments is determined based on the characteristics of the at least one pair of sample text segments and the relative position characteristics between the at least one pair of sample text segments. Optionally, a graph structure is constructed based on the characteristics of at least one pair of sample text segments and the relative position characteristics between at least one pair of sample text segments. The graph structure includes at least two nodes and at least one edge, and each node represents a sample text segment. The characteristics of each edge represent the relative position characteristics between a pair of sample text segments, and the predicted association information between at least one pair of sample text segments is determined based on the graph structure. The description of determining the predicted correlation information between at least one pair of sample text segments can be found in the above description on determining the correlation information between at least one pair of target text segments. The implementation principles of the two are the same and will not be described again here.

Wherein, the neural network model also includes a third initial network, the graph structure of the sample text image is input into the third initial network, and the third initial network determines and outputs the correlation information between at least one pair of sample text segments. Among them, the third initial network is trained to obtain a graph convolution network. The third initial network is described in the description of the graph convolution network. The implementation principles of the two are the same and will not be described again.

In the embodiment of the present application, at least one pair of sample text segments is annotated with associated information to obtain at least one pair of sample text segments. Label association information between segments. Among them, the annotated association information between each pair of sample text segments is 0 or 1. 0 indicates that the pair of sample text segments are not associated, and 1 indicates that the pair of sample text segments are associated.

Step 503: Obtain a target model based on predicted correlation information between at least one pair of sample text segments and annotation correlation information between at least one pair of sample text segments.

The loss value of the neural network model is determined using the predicted correlation information between each pair of sample text segments and the annotation correlation information between each pair of sample text segments. The neural network model is adjusted through the loss value of the neural network model to obtain the adjusted neural network model. If the training end conditions are met, the adjusted neural network model will be used as the target model. If the training end conditions are not met, the adjusted neural network model will be used as the neural network model for the next training, and the neural network model will be trained again according to steps 501 to 503 until the training end conditions are met and the target is obtained. Model.

In the embodiment of the present application, according to the formula (6) shown below, the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments are used to determine the correlation information loss value. Among them, formula (6) is the focal loss (Focal Loss) function.

Among them, L _rel is the associated information loss value. α and γ are two hyperparameters, used to control the loss ratio of positive and negative samples. The embodiments of the present application do not limit the values of α and γ. For example, α=0.25 and γ=2. p _ij represents the predicted association information between the i-th sample text segment and the j-th sample text segment. log is a logarithmic sign. y _ij represents the annotation association information between the i-th sample text segment and the j-th sample text segment. Characterizes the edge between the node associated with the i-th sample text segment and the node associated with the j-th sample text segment on the graph structure after the L-th iteration. E is a linear layer, used to map edges into predicted association information.

It should be noted that in the process of iteratively updating the graph structure, the edges of the graph structure are updated, or the edges of the graph structure are not updated. Among them, the edges in the updated graph structure are marked as Represents the edge between the i-th node and the j-th node in the graph structure after the L-th update. According to the graph structure, an N ² ×a-dimensional probability distribution matrix can be determined and output. N ² represents the number of combinations between any two nodes in the graph structure, that is, the sample text composed of any two sample text segments in the sample text image. The number of segment pairs, a, represents the dimension of predicted association information between a pair of sample text segments. Optionally, a=2. At this time, predicted correlation information of 0 indicates that there is no correlation between the pair of sample text segments, and predicted correlation information of 1 indicates that there is correlation between the pair of sample text segments. Optionally, a=1, in which case the prediction associated information is data greater than or equal to 0 and less than or equal to 1.

For sample text images (such as menu images), the number M of sample text segment pairs associated in the image will be much less than (<<) N ² . If the associated pair of sample text segments are regarded as positive samples and the unassociated pair of sample text segments are regarded as negative samples, then the number of negative samples is much greater than the number of positive samples. Therefore, the probability distribution matrix to be fitted is extremely sparse, and the proportion of positive and negative samples is seriously imbalanced. Using the above formula (6) can solve the problem of sparse probability distribution matrix and unbalanced ratio of positive and negative samples. By balancing the loss ratio of positive and negative samples, the network can avoid over-learning of negative samples, thereby improving network performance.

Optionally, the associated information loss value is used as the loss value of the neural network model. Alternatively, the loss value of the neural network model is determined based on the associated information loss value, the predicted category of each sample text segment in at least one pair of sample text segments, and the label category of each sample text segment in at least one pair of sample text segments. Alternatively, the loss value of the neural network model is determined based on the associated information loss value and the characteristics of each sample text segment in at least one pair of sample text segments.

Optionally, obtaining the target model based on the predicted correlation information and annotation correlation information between each pair of sample text segments also includes: obtaining the predicted category of each sample text segment in each pair of sample text segments and the predicted category of each sample text segment in each pair of sample text segments. The annotation category of each sample text segment; based on the predicted association information between each pair of sample text segments, the annotation association information between each pair of sample text segments, the predicted category of each sample text segment in each pair of sample text segments and each Obtain the target model for the annotation category of each sample text segment in the sample text segment.

In the embodiment of the present application, based on the characteristics of each sample text segment, the predicted category of the sample text segment is determined. Alternatively, the graph structure of the sample text image is input into a third initial network, and the third initial network determines and outputs the predicted category of each sample text segment in each pair of sample text segments. In addition, each sample text segment is annotated to obtain the annotation of the sample text segment. category.

In an exemplary embodiment of the present application, according to the formula (7) shown below, using the predicted category of each sample text segment in at least one pair of sample text segments and the annotation category of each sample text segment in at least one pair of sample text segments, Determine the class loss value. Among them, formula (7) is the cross entropy loss (Cross Entropy Loss, CE Loss) function.

Among them, L _node is the category loss value. N is the number of sample text segments in at least one pair of sample text segments. CE is the symbol of the cross-entropy loss function. E is linear processing, which is used to map the nodes associated with the i-th sample text segment on the graph structure after the L-th iteration to the probability distribution dimension to obtain the predicted category of the i-th sample text segment. is the node associated with the i-th sample text segment on the graph structure after the L-th iteration. y _i is the annotation category of the i-th sample text segment.

It should be noted that during the process of iteratively updating the graph structure, the nodes of the graph structure are updated. Among them, the nodes in the updated graph structure are marked as Represents the i-th node in the graph structure after the L-th update. Determine and output an N×b-dimensional probability distribution matrix according to the graph structure, where N is the number of nodes in the graph structure, that is, the number of sample text segments in the sample text image, and b is the number of predicted categories of the sample text segment. Optionally, when the sample text image is a menu image, b=5, including dish name, dish price, store name, dish type, and other five prediction categories respectively.

In addition, according to formula (6), the correlation information loss value is determined based on the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments. Afterwards, the loss value of the neural network model is determined based on the category loss value and the associated information loss value.

Optionally, obtaining the target model based on the predicted correlation information and annotation correlation information between each pair of sample text segments also includes: obtaining the characteristics of each sample text segment in each pair of sample text segments, and the characteristics of each sample text segment. Including at least one of the image features of the image area where the sample text segment is located or the text features of the sample text segment; based on the characteristics of each sample text segment in each pair of sample text segments and the predicted association between each pair of sample text segments information and the annotated association information between each pair of sample text segments to obtain the target model.

In the embodiment of this application, the image features of the sample text image are obtained. For each sample text segment, based on the image features of the sample text image and the position information of the image area where the sample text segment is located, the image features of the image area where the sample text segment is located are determined (recorded as the first area feature). Or, perform image detection and image segmentation on the sample text image in sequence to obtain the image area where each sample text segment is located in the sample text image. For each sample text segment, based on the pixels of each pixel in the image area where the sample text segment is located, Information, extract the image features of the image area where the sample text segment is located (recorded as the second area feature). Alternatively, the first region feature and the second region feature are spliced or fused to obtain image features of the image region where the sample text segment is located. The method for determining the image features of the image area where each sample text segment is located is as described above regarding the image features of the image area where each target text segment is located. The implementation principles of the two are the same and will not be described again.

After obtaining the image area where each sample text segment is located in the sample text image, image recognition is performed on the image area where each sample text segment is located to obtain the sample text segment. Then, use a word segmenter to perform word segmentation processing on the sample text segment, and obtain each word in the sample text segment. By using a vector lookup table, the word vectors of each word in the sample text segment are determined. Afterwards, based on the word vectors of each word in the sample text segment, the text features of the sample text segment are determined. Among them, the text characteristics of each sample text segment are as described above about the text characteristics of each target text segment. The implementation principles of the two are the same and will not be described again.

Optionally, the image features of the image area where each sample text segment is located are used as the features of the sample text segment. Or, use the text features of each sample text segment as the features of the sample text segment. Alternatively, the image features of the image area where each sample text segment is located and the text features of the sample text segment are spliced or fused to obtain the features of the sample text segment. Alternatively, the image features of the image area where each sample text segment is located and the text features of the sample text segment are first spliced or fused, and then nonlinear operations are performed to obtain the features of the sample text segment. Among them, the characteristics of each sample text segment are as described above about the characteristics of each target text segment. The implementation principles of the two are the same and will not be described again.

Optionally, obtaining the target model based on the characteristics of each sample text segment in each pair of sample text segments, the predicted correlation information between each pair of sample text segments, and the annotation correlation information between each pair of sample text segments includes: obtaining Each pair of sample text segments The annotation category of each sample text segment in The predicted correlation information between segments and the annotation correlation information between each pair of sample text segments are used to obtain the target model.

In the embodiment of the present application, each sample text segment is annotated to obtain the annotation category of the sample text segment. For each annotation category, calculate the sum of the features of each sample text segment in the annotation category, and divide the sum value by the number of sample text segments in the annotation category to obtain the average feature value of the annotation category.

Optionally, according to formula (8) shown below, the first loss value is determined based on the average feature value of each annotation category.

In addition, according to the formula (9) shown below, the second loss value is determined based on the average feature value of each annotation category.

Among them, L _pull is the first loss value. L _push is the second loss value. M is the number of sample text pairs with annotated association information of 1. Since the annotated association information between each pair of sample text segments is 1, it indicates the association between the pair of sample text segments. Therefore, M is also a sample text with an associated relationship. The number of segment pairs. m and k are both serial numbers. The average value of features characterizing the m-th annotation category. e _mk represents the characteristics of the k-th sample text segment in the m-th annotation category. ||x|| ² represents the second norm of x, and x is the independent variable. Σ is the summation symbol. is the average feature value of the j-th labeled category. Δ is a hyperparameter used to increase the distance between the features of two sample text segments belonging to different annotation categories. The embodiment of the present application does not limit the value of Δ. For example, Δ=1.

It should be noted that although the embodiment of the present application can determine the prediction correlation information of a pair of sample text segments, determining the prediction category of each sample text segment is equivalent to classifying the sample text segments. The classification problem can only optimize the boundaries between classes, which easily results in a large distance between the features of two sample text segments belonging to the same annotation category, while a small distance between the features of two sample text segments belonging to different annotation categories. The problem.

Please refer to FIG. 6 , which is an example diagram of the distance between features of a sample text segment provided by an embodiment of the present application. It can be seen from Figure 6 that R+ is greater than R-. Among them, R+ is the distance between the features of two sample text segments in the labeled category A, and R- is the distance between the features of one sample text segment in the labeled category A and the feature of one sample text segment in the labeled category B.

In order to improve the performance of determining the predicted associated information, the above formula (8) is used to calculate the first loss value based on the characteristics of the sample text segment. Since the first loss value is based on the average feature value of the m-th annotation category and the average value of the m-th annotation category. The characteristics of the k-th sample text segment are calculated so that the characteristics of each sample text segment in each annotation category are close to the average feature of the annotation category. Therefore, the first loss value is used to bring the features of each sample text segment in each annotation category closer to the feature average of the annotation category, and reduce the distance between the features of two sample text segments of the same annotation category.

Use the above formula (9) to calculate the second loss value based on the characteristics of the sample text segment. Since the second loss value is determined based on the feature average of the m-th annotation category, the feature average of the j-th annotation category and the hyperparameter Δ , so that the distance between the feature averages of any two labeled categories is at least greater than Δ. Therefore, the second loss value is used to distance the feature average of each annotation category from the feature average of another annotation category, which can widen the distance between the features of two sample text segments belonging to different annotation categories.

Through formula (8) and formula (9), the performance of the network in determining and predicting associated information can be improved. For example, with regard to Figure 6, through the first loss value and the second loss value in the embodiment of the present application, R- can be increased while reducing R+, thereby improving the accuracy of the features of the sample text image.

The features of each sample text segment include at least one of image features of the image area where the sample text segment is located or text features of the sample text segment. Optionally, first splice (or fuse) the image features of the image area where the sample text segment is located and the text features of the sample text segment, and then perform at least one layer of nonlinear operations on the spliced (or fused) features, Characteristics of the sample text segment of a fixed dimension are obtained, and the first loss value and the second loss value are calculated using the characteristics of each sample text segment.

After the first loss value and the second loss value are calculated, the loss value of the neural network model is determined based on the first loss value, the second loss value and the associated information loss value. Optionally, the loss value of the neural network model can also be determined based on the first loss value, the second loss value, the associated information loss value and the category loss value.

Optionally, the weight of the first loss value, the weight of the second loss value, the weight of the associated information loss value and the weight of the category loss value are set. The loss value of the neural network model is determined based on at least one of the first loss value, the second loss value, the associated information loss value and the category loss value combined with their respective weights. For example, the loss value of the neural network model is determined based on the associated information loss value, the category loss value, the weight of the associated information loss value, and the weight of the category loss value.

Optionally, according to formula (10) shown below, determine the loss value of the neural network model based on the first loss value, the second loss value, the associated information loss value, the category loss value and their respective weights.
L＝αL _node +βL _rel +γL _pull +γL _push formula (10)

Among them, L is the loss value of the neural network model. α is the weight of the category loss value. L _node is the category loss value. β is the weight of the associated information loss value. L _rel is the associated information loss value. γ is the weight of the first loss value and is also the weight of the second loss value. L _pull is the first loss value. L _push is the second loss value. The embodiments of this application do not limit the values of α, β, and γ. For example, α=1, β=10, and γ=0.5. Among them, the weight of the first loss value and the weight of the second loss value are the same or different.

After the loss value of the neural network model is determined, the gradient of the loss value of the neural network model is calculated, and the gradient of the loss value of the neural network model is back-transmitted layer by layer to update the model parameters of the neural network model. That is, the neural network model is adjusted through the loss value of the neural network model to obtain a target model. The target model is used to obtain the associated information between at least a pair of target text segments.

In some embodiments, the loss function of contrastive learning can also be used to calculate the contrastive learning loss value. For example, each pair of sample text segments marked with associated information is regarded as a positive sample, and the loss value of the positive sample is calculated using the characteristics of each pair of sample text segments marked with associated information. Each pair of sample text segments marked with unrelated information is regarded as a negative sample. The loss value of the negative sample is calculated using the characteristics of each pair of sample text segments marked with unrelated information. Afterwards, the loss value of the positive sample and the loss value of the negative sample are used to determine the comparative learning loss value. Using at least one of the first loss value, the second loss value, the associated information loss value, the category loss value, and the comparative learning loss value, combined with their respective weights, the loss value of the neural network model is determined.

In the above method, the target model is obtained based on the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments, so that the target model learns the correlation information between each pair of text segments, It helps reduce association errors and improve the accuracy of text information.

The above describes the text information extraction method and the target model acquisition method from the perspective of method steps. The target model acquisition method according to the embodiment of the present application will be further described below in conjunction with FIG. 7 . Figure 7 is a schematic diagram of training of a neural network model provided by an embodiment of the present application. Wherein, the neural network model includes a first initial network, a second initial network and a third initial network, and the first initial network includes a first sub-network and a second sub-network. The embodiment of the present application uses the predicted correlation information between each pair of sample text segments (that is, every two sample text segments) in the sample text image to train the neural network model.

In the embodiment of the present application, a sample text image is obtained, where the sample text image is the image shown in (2) in Figure 3 . The sample text image is input to the first sub-network, and the first sub-network outputs image features of the sample text image. The image features of the sample text image are input to the second sub-network, and the second sub-network outputs the image features of the image area where each sample text segment in the sample text image is located. Optionally, image recognition is also performed on the sample text image to obtain an image recognition result of the sample text image. The image recognition result of the sample text image includes each sample text segment. The second initial network is used to obtain text features of each sample text segment.

Then, for each sample text segment, the image features of the image area where the sample text segment is located and the text features of the sample text segment are fused to obtain the features of the sample text segment. Make at least one update to the characteristics of the sample text segment. In order to facilitate the distinction and description, the characteristics of each sample text segment are called the characteristics of the sample text segment before the update, and the characteristics obtained by updating the characteristics of the sample text segment before the update at least once are called the updated sample text. segment characteristics.

On the one hand, a nonlinear operation is performed on the characteristics of each sample text segment before the update to update the characteristics of each sample text segment before the update to obtain the characteristics of each updated sample text segment. In this way, the characteristics of each sample text segment can be obtained. Based on the characteristics of each sample text segment, the characteristic loss value is calculated according to the above-mentioned formula (8) and formula (9), where the characteristic loss value includes the above-mentioned first loss value and the second loss value.

On the other hand, the characteristics of each pre-updated sample text segment are input into the third initial network. The third initial network can construct an initial graph structure based on the characteristics of each pre-updated sample text segment and update the graph structure multiple times. , that is, the characteristics of each sample text segment before updating are updated multiple times until the final graph structure is obtained. The final graph structure includes the characteristics of each updated sample text segment. The third initial network can determine and output the predicted category of each sample text segment and the predicted association information between every two sample text segments based on the final graph structure. Next, based on the predicted category of each sample text segment, the category loss value is calculated according to the formula (7) mentioned above. Based on the predicted correlation information between each two sample text segments, the correlation information loss value is calculated according to the formula (6) mentioned above.

Afterwards, based on the feature loss value, category loss value and associated information loss value, the loss value of the neural network model is calculated according to the formula (10) mentioned above. Based on the loss value of the neural network model, the neural network model is adjusted to obtain the target model.

After obtaining the target model, the text information in the target text image is extracted based on the target model. In the embodiment of this application, the target model includes an image feature extraction network (trained by the first initial network), a text feature extraction network (trained by the second initial network), and a graph convolution network (trained by the third initial network). , and the image feature extraction network includes a first extraction network (trained by the first sub-network) and a second extraction network (trained by the second sub-network).

Target text images include menu images and license images. First, perform image recognition on the target text image to obtain the image recognition result of the target text image, and then input the target text image and the image recognition result of the target text image into the target model, and the target model outputs the category and sum of each target text segment in the target text image. Association information between each two target text segments. Afterwards, the text information in the target text image is determined based on the categories of each target text segment in the target text image and the association information between each two target text segments.

Please refer to FIG. 8 , which is a schematic diagram of extracting text information from a menu image provided by an embodiment of the present application. Among them, the menu image includes "Dish A 20 yuan", "Dish B 20 yuan", "Dish C 28 yuan", "Dish D 28 yuan", "Dish E 25 yuan", "Dish F 25 yuan" and their respective associations. picture of. By performing image recognition on the menu image, the image recognition result is obtained. The image recognition results include each text segment in the menu image (ie, the target text segment mentioned above). In other words, the image recognition results include the text segments "Dish A", "20 Yuan", "Dish B", "20 Yuan", "Dish C", "28 Yuan", "Dish D", "28 Yuan" , "Cai E", "25 yuan", "Cai F", "25 yuan". It can be seen from Figure 8 that the image recognition result only identifies each text segment in the menu image, and does not associate each text segment. The menu image and the image recognition result of the menu image are input into the target model, and the target model outputs the category of each text segment in the menu image and the association information between each two text segments in the menu image. Based on the categories of each text segment in the menu image and the associated information between each two text segments in the menu image, the text information in the menu image can be obtained, that is, "Dish A: 20 yuan", "Dish B: 20 yuan" ", "Dish C: 28 yuan", "Dish D: 28 yuan", "Dish E: 25 yuan", "Dish F: 25 yuan".

Please refer to FIG. 9 , which is a schematic diagram of extracting text information in yet another menu image provided by an embodiment of the present application. Based on the same principle as Figure 8, in the embodiment of this application, the menu image, the image recognition result of the menu image, and the target model can be used to determine the text information in the menu image, that is, "Dish A: 20 yuan" and "Dish B: "20 yuan", "Dish C: 20 yuan", "Dish D: 6/piece", "Dish E: 5/piece", "Dish F: 2/bowl" and "Dish G: 5/bowl".

It should be noted that the menu images shown in Figures 8 and 9 are both structured text images. The target model in the embodiment of the present application can also extract text information from semi-structured text images. For example, text information is extracted from the license image (semi-structured text image) shown in Figures 10 and 11.

Please refer to Figure 10, which is a schematic diagram of extracting text information from a license image provided by an embodiment of the present application. Among them, the license image includes "license", "name XXX company", "company type sole proprietorship", "legal representative XX" and "date X year X month X day". By performing image recognition on the license image, the image recognition result is obtained. Among them, the image recognition results include "license", "name XXX company", "company type sole proprietorship", "legal representative" Person XX" and "Date X year Association information between segments. Based on the categories of each text segment in the license image and the association information between each two text segments in the license image, the text information in the license image can be obtained, that is, "License" and "Name" can be obtained: "XXX Company", "Company Type: Sole Proprietor", "Legal Representative: XX" and "Date: X Month X Day".

Please refer to Figure 11, which is a schematic diagram of extracting text information from a license image according to another embodiment of the present application. Based on the principle similar to Figure 10, the license image, the image recognition result of the license image and the target model can be used to determine the text information in the license image, that is, "License", "Name: XXX Company", "Residence: XX Town" can be obtained , "Registration number: 1111111", and "Business scope: fruits and vegetables, daily necessities, cultural and sporting goods".

The embodiment of this application uses four methods to train the neural network model and obtains four target models.

Among them, the first target model is to input the sample text image and the image recognition result of the sample text image into the neural network model, and the neural network model performs the following processing: first perform batch normalization on the sample text image, and then based on batch normalization The transformed sample text image determines the image characteristics of the image area where each sample text segment in the sample text image is located. The text characteristics of each sample text segment in the sample text image are determined based on the image recognition result of the sample text image. Based on the image features of the image area where each sample text segment is located and the text features of each sample text segment, the prediction category of each sample text segment is determined and output, and the loss value of the neural network model is determined according to the formula (7) above, and based on The loss value of the neural network model is obtained by adjusting the neural network model.

The second target model is to input the sample text image and the image recognition result of the sample text image into the neural network model. The neural network model performs the following processing: first perform instance normalization on the sample text image, and then normalize it based on the instance. The sample text image determines the image characteristics of the image area where each sample text segment in the sample text image is located. The text characteristics of each sample text segment in the sample text image are determined based on the image recognition result of the sample text image. Based on the image features of the image area where each sample text segment is located and the text features of each sample text segment, the prediction category of each sample text segment is determined and output, and the loss value of the neural network model is determined according to the formula (7) above, and based on The loss value of the neural network model is obtained by adjusting the neural network model.

The third target model is to input the sample text image and the image recognition result of the sample text image into the neural network model, and the neural network model performs the following processing: perform instance normalization on the sample text image, and then based on the instance normalized sample The text image determines the image characteristics of the image area where each sample text segment in the sample text image is located. The text characteristics of each sample text segment in the sample text image are determined based on the image recognition result of the sample text image. Based on the image features of the image area where each sample text segment is located and the text features of each sample text segment, the predicted categories of each sample text segment and the characteristics of each sample text segment are determined and output, according to the above formulas (7)-(9) ) is obtained by determining the loss value of the neural network model and adjusting the neural network model based on the loss value of the neural network model.

The fourth target model is to input the sample text image and the image recognition result of the sample text image into the neural network model, and the neural network model performs the following processing: after instance normalization of the sample text image, and then based on the instance normalization The sample text image determines the image characteristics of the image area where each sample text segment in the sample text image is located. The text characteristics of each sample text segment in the sample text image are determined based on the image recognition result of the sample text image. Based on the image features of the image area where each sample text segment is located and the text features of each sample text segment, the predicted categories of each sample text segment, the characteristics of each sample text segment, and the predicted association between each two sample text pairs are determined and output The relationship is obtained by determining the loss value of the neural network model according to the above formulas (6)-(9), and adjusting the neural network model based on the loss value of the neural network model.

In the embodiment of this application, the performance index of each target model is calculated according to the formula (11) shown below.

Among them, mEF is the performance index of the target model. i is the sequence number, and F _i is the score of the i-th prediction category. P _i is the precision rate of the i-th prediction category, and R _i is the recall rate of the i-th prediction category. P is the precision rate, tp is the number of positive samples whose predicted category is consistent with the labeled category, fp is the number of negative samples whose predicted category is inconsistent with the labeled category, and fn is the number of positive samples whose predicted category is inconsistent with the labeled category. Among them, if the annotation category of a sample text segment in the sample text image is the target category, then the sample text segment is a positive sample; if the annotation category of the sample text segment is not the target category, the sample text segment is a negative sample.

Among them, the sample text images used when training these four target models are menu images. The prediction categories and annotation categories of the sample text segments in the menu images include at least one of dish name, dish price, store name, dish type, and others. Target categories include dish names and dish prices. The performance indicators of these four target models are shown in Table 1 below.

Table 1

As can be seen from Table 1, the mEFs of these four target models increase sequentially. The larger the mEF, the better the performance of the target model. Therefore, the performance of these four target models is enhanced sequentially, and the fourth target model has the best performance. Through the target model of the embodiment of the present application, the phenomenon of association errors can be effectively reduced, the accuracy of text information can be improved, and the text information in the target text image can be quickly extracted, avoiding tedious and complicated manual input.

It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application, All are authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data need to comply with relevant laws, regulations and standards of relevant countries and regions. For example, the target text images, sample text images, etc. involved in this application were all obtained with full authorization.

Figure 12 shows a schematic structural diagram of a text information extraction device provided by an embodiment of the present application. As shown in Figure 12, the device includes:

The acquisition module 1201 is used to acquire a target text image, which includes multiple target text segments;

The acquisition module 1201 is also used to obtain association information between at least one pair of target text segments, where the association information is used to characterize the possibility of association between each pair of target text segments;

The determination module 1202 is configured to determine the association result between each pair of target text segments based on the association information between each pair of target text segments;

The extraction module 1203 is used to extract text information in the target text image based on the association results between each pair of target text segments.

In a possible implementation, the acquisition module 1201 is configured to acquire at least one of the characteristics of each pair of target text segments or the relative position characteristics between each pair of target text segments. Each target in each pair of target text segments Characteristics of text segments include the At least one of the image features of the image area where the target text segment is located or the text features of the target text segment. The relative position features between each pair of target text segments are used to characterize the relative position between the image areas where each pair of target text segments are located. ; Determine the associated information between each pair of target text segments based on at least one of the characteristics of the at least one pair of target text segments or the relative position characteristics between the at least one pair of target text segments.

In a possible implementation, the characteristics of each target text segment in each pair of target text segments include the image characteristics of the image area where the target text segment is located. The acquisition module 1201 is used to obtain the image characteristics of the target text image; for at least For each target text segment in a pair of target text segments, based on the image features of the target text image and the position information of the image area where the target text segment is located, the image features of the image area where the target text segment is located are determined.

In a possible implementation, the characteristics of each target text segment in each pair of target text segments include the text characteristics of the target text segment. The acquisition module 1201 is configured to obtain each target text segment in at least one pair of target text segments. segment, obtain the word vector of each word in the target text segment; fuse the word vectors of each word in the target text segment to obtain the text features of the target text segment.

In a possible implementation, the characteristics of each target text segment in each pair of target text segments include the image characteristics of the image area where the target text segment is located and the text characteristics of the target text segment. The acquisition module 1201 is configured to at least For each target text segment in a pair of target text segments, the image features of the image area where the target text segment is located are divided into a target number of image feature blocks, and the text features of the target text segment are divided into a target number of text feature blocks; For each image feature block, the image feature block and the text feature block associated with the image feature block are fused to obtain a fused feature block; each fused feature block is spliced to obtain the characteristics of the target text segment.

In a possible implementation, the acquisition module 1201 is configured to obtain, for each pair of target text segments, the position information of the image area where each pair of target text segments is located; based on the position information of the image area where each pair of target text segments is located and the target text The size information of the image determines the relative position characteristics between each pair of target text segments.

In a possible implementation, the acquisition module 1201 is configured to construct a graph structure based on the characteristics of at least one pair of target text segments and the relative position characteristics between at least one pair of target text segments. The graph structure includes at least two nodes and at least An edge, each node represents the characteristics of a target text segment, and each edge represents the relative position characteristics between a pair of target text segments indicated by a pair of nodes connected by the edge; determine the relationship between each pair of target text segments based on the graph structure associated information.

In a possible implementation, the acquisition module 1201 is also used to acquire the category of each target text segment and the association information between every two target text segments; the acquisition module 1201 is used to acquire the category of each target text segment based on the category of each target text segment. , determine the associated information between each pair of target text segments from the associated information between each two target text segments.

In one possible implementation, the acquisition module 1201 is configured to filter out the text segments to be associated with the target category from multiple target text segments included in the target text image based on the category of each target text segment; from each target text segment, The correlation information between each two text segments to be correlated is filtered out from the correlation information between the two target text segments, and the correlation information between each pair of target text segments is obtained.

In a possible implementation, the acquisition module 1201 is also used to acquire the target model; the acquisition module 1201 is used to acquire the association information between each pair of target text segments according to the target model.

In the above technical solution, the association information between each pair of target text segments is used to characterize the possibility of association between each pair of target text segments. Therefore, each pair of target text is determined through the association information between each pair of target text segments. It can reduce the phenomenon of association errors and improve the accuracy of association results when extracting text information in target text images based on the association results between each pair of target text segments. accuracy.

It should be understood that when the device provided in Figure 12 implements its functions, the division of the above functional modules is only used as an example. In actual application, the above functions can be allocated to different functional modules according to needs, that is, the equipment The internal structure is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be described again here.

Figure 13 is a schematic structural diagram of a device for obtaining a target model provided by an embodiment of the present application. As shown in Figure 13, the device includes:

The first acquisition module 1301 is used to acquire a sample text image, which includes multiple sample text segments;

The second acquisition module 1302 is used to acquire predicted correlation information between at least one pair of sample text segments and annotation correlation information between at least one pair of sample text segments;

The third acquisition module 1303 is used to acquire the target model based on the predicted correlation information between each pair of sample text segments and the annotation correlation information between each pair of sample text segments.

In a possible implementation, the device further includes: a fourth acquisition module, configured to acquire the predicted category of each sample text segment in each pair of sample text segments and the annotation category of each sample text segment in each pair of sample text segments. ; The third acquisition module 1303 is used to predict the association information between each pair of sample text segments, the annotation association information between each pair of sample text segments, the predicted category of each sample text segment in each pair of sample text segments, and the predicted category of each sample text segment. Obtain the target model for the annotation category of each sample text segment in the sample text segment.

In a possible implementation, the device further includes: a fifth acquisition module, used to obtain the characteristics of each sample text segment in each pair of sample text segments, and the characteristics of each sample text segment include the image area where the sample text segment is located. At least one of the image features or the text features of the sample text segment; the third acquisition module 1303 is used to predict the association between each pair of sample text segments based on the characteristics of each sample text segment in each pair of sample text segments. information and the annotated association information between each pair of sample text segments to obtain the target model.

In a possible implementation, the third acquisition module 1303 is used to obtain the annotation category of each sample text segment in each pair of sample text segments; for each annotation category, based on the characteristics of each sample text segment in the annotation category , determine the feature average of the annotation category; obtain the target model based on the feature average of each annotation category, the predicted correlation information between each pair of sample text segments, and the annotation correlation information between each pair of sample text segments.

In the above technical solution, the target model is obtained based on the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments, so that the target model learns the correlation between any pair of text segments. information, helping to reduce association errors and improve the accuracy of text information.

It should be understood that when the device provided in FIG. 13 implements its functions, only the division of the above functional modules is used as an example. In practical applications, the above functions can be allocated to different functional modules according to needs, that is, the equipment The internal structure is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be described again here.

Figure 14 shows a structural block diagram of a terminal device 1400 provided by an exemplary embodiment of the present application. The terminal device 1400 includes: a processor 1401 and a memory 1402.

The processor 1401 includes one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 1401 is implemented using at least one hardware form among DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), and PLA (Programmable Logic Array, programmable logic array). . The processor 1401 also includes a main processor and a co-processor. The main processor is a processor used to process data in the wake-up state, also called CPU (Central Processing Unit, central processing unit); the co-processor is used A low-power processor used to process data in standby mode. In some embodiments, the processor 1401 is integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is responsible for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 1401 also includes an AI (Artificial Intelligence, artificial intelligence) processor, which is used to process computing operations related to machine learning.

Memory 1402 includes one or more computer-readable storage media that are non-transitory. Memory 1402 may also include high-speed random access memory, and non-volatile memory, such as one or more disk storage devices, flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1402 is used to store at least one computer program, and the at least one computer program is used to be executed by the processor 1401 to implement the methods provided by the method embodiments in this application. Text information extraction method or target model acquisition method.

In some embodiments, the terminal device 1400 optionally further includes: a peripheral device interface 1403 and at least one peripheral device. The processor 1401, the memory 1402 and the peripheral device interface 1403 are connected through a bus or a signal line. Each peripheral device is connected to the peripheral device interface 1403 through a bus, a signal line or a circuit board. Specifically, the peripheral device includes: display screen 1405.

The peripheral device interface 1403 may be used to connect at least one I/O (Input/Output, input/output) related peripheral device to the processor 1401 and the memory 1402 . In some embodiments, the processor 1401, the memory 1402, and the peripheral device interface 1403 are integrated on the same chip or circuit board; in some other embodiments, any one of the processor 1401, the memory 1402, and the peripheral device interface 1403 or Both are implemented on separate chips or circuit boards, which is not limited in this embodiment.

The display screen 1405 is used to display UI (User Interface, user interface). The UI includes graphics, text, icons, videos, and any combination thereof. When display screen 1405 is a touch display screen, display screen 1405 also has the ability to collect touch signals on or above the surface of display screen 1405 . The touch signal is input to the processor 1401 as a control signal for processing. At this time, the display screen 1405 is also used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards. In some embodiments, there is one display screen 1405, which is provided on the front panel of the terminal device 1400; in other embodiments, there are at least two display screens 1405, which are respectively provided on different surfaces of the terminal device 1400 or have a folding design; In other embodiments, the display screen 1405 is a flexible display screen that is disposed on a curved or folded surface of the terminal device 1400 . Even the display screen 1405 is set in a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 1405 is made of LCD (Liquid Crystal Display, liquid crystal display), OLED (Organic Light-Emitting Diode, organic light-emitting diode) and other materials.

Those skilled in the art can understand that the structure shown in FIG. 14 does not constitute a limitation on the terminal device 1400, which may include more or fewer components than shown, or combine certain components, or adopt different component arrangements.

Figure 15 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 1500 may vary greatly due to different configurations or performance, and includes one or more processors 1501 and one or more memories 1502. The server 1500 includes one or more processors 1501 and one or more memories 1502. At least one computer program is stored in one or more memories 1502, and the at least one computer program is loaded and executed by the one or more processors 1501 to implement the text information extraction method or the target model acquisition method provided by the above method embodiments. , for example, the processor 1501 is a CPU. Of course, the server 1500 also has components such as wired or wireless network interfaces, keyboards, and input and output interfaces for input and output. The server 1500 also includes other components for realizing device functions, which will not be described again here.

In an exemplary embodiment, a computer-readable storage medium is also provided. At least one computer program is stored in the storage medium. The at least one computer program is loaded and executed by the processor to enable the electronic device to implement any of the above. Text information extraction method or target model acquisition method.

Optionally, the above computer-readable storage medium is read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), read-only compact disc (Compact Disc Read-Only Memory, CD-ROM) , tapes, floppy disks and optical data storage devices, etc.

In an exemplary embodiment, a computer program or computer program product is also provided. At least one computer program is stored in the computer program or computer program product. The at least one computer program is loaded and executed by the processor, so that the computer implements Any of the above text information extraction methods or target model acquisition methods.

It should be understood that "plurality" mentioned in this article means two or more. "And/or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and/or B can represent three situations: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the related objects are in an "or" relationship.

The above serial numbers of the embodiments of the present application are only for description and do not represent the advantages and disadvantages of the embodiments.

The above are only exemplary embodiments of the present application and are not intended to limit the present application. Any modifications, equivalent substitutions, improvements, etc. made within the principles of the present application shall be included in the protection scope of the present application. Inside.

Claims

A text information extraction method, executed by a terminal, the method includes:

Obtaining a target text image, the target text image including a plurality of target text segments;

Obtaining association information between at least one pair of target text segments, the association information being used to characterize the possibility of association between each pair of target text segments;

Based on the association information between each pair of target text segments, determine the association results between each pair of target text segments;

Based on the association results between each pair of target text segments, text information in the target text image is extracted.
According to the method of claim 1, obtaining the association information between each pair of target text segments includes:

Obtain at least one of the characteristics of each pair of target text segments or the relative position characteristics between each pair of target text segments, and the characteristics of each target text segment in each pair of target text segments include the image of the image area where the target text segment is located. Features or at least one of the text features of the target text segment, the relative position feature between each pair of target text segments is used to characterize the relative position between the image areas where each pair of target text segments are located;

Association information between each pair of target text segments is determined based on at least one of the characteristics of the at least one pair of target text segments or the relative position characteristics between the at least one pair of target text segments.
According to the method of claim 2, the characteristics of each target text segment in each pair of target text segments include image characteristics of the image area where the target text segment is located, and the characteristics of each target text segment in each pair of target text segments are obtained, include:

Obtain image features of the target text image;

Based on the image features of the target text image and the position information of the image area where the target text segment is located, the image features of the image area where the target text segment is located are determined.
According to the method of claim 2, the characteristics of each target text segment in each pair of target text segments include the text characteristics of the target text segment, and obtaining the characteristics of each target text segment in each pair of target text segments includes:

Obtain the word vector of each word in the target text segment;

The word vectors of each word in the target text segment are fused to obtain text features of the target text segment.
According to the method of claim 2, the characteristics of each target text segment in each pair of target text segments include image features of the image area where the target text segment is located and text features of the target text segment, and each pair of target text segments is obtained. The characteristics of each target text segment in , including:

segment the image features of the image area where the target text segment is located into a target number of image feature blocks, and segment the text features of the target text segment into the target number of text feature blocks;

Fuse each of the image feature blocks and the text feature blocks associated with the image feature blocks to obtain a fused feature block;

Each fused feature block is spliced to obtain the characteristics of the target text segment.
According to the method of claim 2, obtaining the relative position characteristics between each pair of target text segments includes:

Obtain the position information of the image area where each pair of target text segments is located;

Based on the position information of the image area where each pair of target text segments is located and the size information of the target text image, the relative position characteristics between each pair of target text segments are determined.
The method according to claim 2, determining the associated information between each pair of target text segments based on the characteristics of at least one pair of target text segments and the relative position characteristics between at least one pair of target text segments, including:

Based on the characteristics of at least one pair of target text segments and the relative position characteristics between at least one pair of target text segments, a graph structure is constructed, the graph structure includes at least two nodes and at least one edge, each node represents a target text segment Features, each edge represents the relative position feature between a pair of target text segments indicated by a pair of nodes connected by the edge;

Based on the graph structure, the associated information between each pair of target text segments is determined.
The method of claim 1, further comprising:

Obtain the category of each target text segment and the association information between each two target text segments;

Obtain the associated information between each pair of target text segments, including:

Based on the category of each target text segment, each pair of target text segments is determined from the associated information between each two target text segments. Related information between this paragraph.
The method according to claim 8, determining the associated information between each pair of target text segments from the associated information between every two target text segments based on the category of each target text segment, including:

Based on the category of each target text segment, from the multiple target text segments included in the target text image, select the text segments to be associated with the category of the target category;

From the associated information between each two target text segments, the associated information between each two text segments to be associated is filtered out to obtain the associated information between each pair of target text segments.
The method according to any one of claims 1 to 9, further comprising:

Get the target model;

Obtain the associated information between each pair of target text segments, including:

Obtain association information between each pair of target text segments according to the target model.
A method for obtaining a target model, executed by a server, the method includes:

Obtain a sample text image, where the sample text image includes multiple sample text segments;

Obtain predicted correlation information and annotation correlation information between at least one pair of sample text segments;

Based on the predicted correlation information and annotation correlation information between each pair of sample text segments, the target model is obtained.
The method of claim 11, further comprising:

Obtain the predicted category and annotation category of each sample text segment in each pair of sample text segments;

The method of obtaining the target model based on the predicted correlation information and annotation correlation information between each pair of sample text segments includes:

The target model is obtained based on the predicted correlation information and labeling correlation information between each pair of sample text segments, and the predicted category and labeling category of each sample text segment in each pair of sample text segments.
The method of claim 11, further comprising:

Obtain the characteristics of each sample text segment in each pair of sample text segments, and the characteristics of each sample text segment include at least one of the image characteristics of the image area where the sample text segment is located or the text characteristics of the sample text segment;

The method of obtaining the target model based on the predicted correlation information and annotation correlation information between each pair of sample text segments includes:

The target model is obtained based on the characteristics of each sample text segment in each pair of sample text segments, and the predicted correlation information and annotation correlation information between each pair of sample text segments.
The method according to claim 13, obtaining the target model based on the characteristics of each sample text segment in each pair of sample text segments and the predicted correlation information and annotation correlation information between each pair of sample text segments, including :

Obtain the annotation category of each sample text segment in each pair of sample text segments;

For each annotation category, based on the characteristics of each sample text segment in the annotation category, determine the feature average of the annotation category;

The target model is obtained based on the feature average of each annotation category, as well as the predicted correlation information and annotation correlation information between each sample text segment.
A text information extraction device, the device includes:

An acquisition module, used to acquire a target text image, where the target text image includes multiple target text segments;

The acquisition module is also used to acquire association information between at least one pair of target text segments, where the association information is used to characterize the possibility of association between each pair of target text segments;

a determination module, configured to determine the association result between each pair of target text segments based on the association information between each pair of target text segments;

An extraction module, configured to extract text information in the target text image based on the association results between each pair of target text segments.
A device for obtaining a target model, the device includes:

The first acquisition module is used to acquire a sample text image, where the sample text image includes a plurality of sample text segments;

The second acquisition module is used to acquire the predicted correlation information and annotation correlation information between at least one pair of sample text segments;

The third acquisition module is used to obtain the target based on the predicted correlation information and annotation correlation information between each pair of sample text segments. standard model.
An electronic device. The electronic device includes a processor and a memory. At least one computer program is stored in the memory. The at least one computer program is loaded and executed by the processor, so that the electronic device implements the rights as claimed. The text information extraction method described in any one of claims 1 to 10 or the acquisition method of the target model described in any one of claims 11 to 14.
A computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is loaded and executed by a processor, so that the computer implements any one of claims 1 to 10 The text information extraction method or the acquisition method of the target model as described in any one of claims 11 to 14.
A computer program product, at least one computer program is stored in the computer program product, and the at least one computer program is loaded and executed by a processor to enable the computer to implement text information extraction as claimed in any one of claims 1 to 10 Method or method for obtaining the target model as described in any one of claims 11 to 14.