WO2023202268A1 - Text information extraction method and apparatus, target model acquisition method and apparatus, and device - Google Patents

Text information extraction method and apparatus, target model acquisition method and apparatus, and device

Info

Publication number
WO2023202268A1
Authority
WO
WIPO (PCT)
Prior art keywords
target text
pair
target
image
text
Application number
PCT/CN2023/081379
Other languages
French (fr)
Chinese (zh)
Inventor
姜媚 (Jiang Mei)
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Publication of WO2023202268A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Definitions

  • Embodiments of the present application relate to the field of image processing technology, and in particular to a text information extraction method, a target model acquisition method, an apparatus, and a device.
  • Text images containing text information, such as menu images and receipt images, are common in daily life. Such text images are structured text images or semi-structured text images. How to accurately extract text information from structured text images and semi-structured text images has become an urgent problem to be solved in the field of image processing technology.
  • Embodiments of the present application provide a text information extraction method, a target model acquisition method, a device and equipment.
  • Embodiments of the present application provide a method for extracting text information.
  • The method includes: acquiring a target text image, where the target text image includes multiple target text segments; acquiring association information between at least one pair of target text segments, where the association information between any pair of target text segments is used to characterize the possibility of association between that pair of target text segments; determining an association result between the at least one pair of target text segments based on the association information between the at least one pair of target text segments; and extracting text information in the target text image based on the association result between the at least one pair of target text segments.
  • Embodiments of the present application provide a method for obtaining a target model.
  • The method includes: obtaining a sample text image, where the sample text image includes multiple sample text segments; obtaining predicted association information between at least one pair of sample text segments and labeled association information between the at least one pair of sample text segments; and obtaining the target model based on the predicted association information between the at least one pair of sample text segments and the labeled association information between the at least one pair of sample text segments.
  • Embodiments of the present application provide a text information extraction device.
  • The device includes: an acquisition module, configured to acquire a target text image, where the target text image includes multiple target text segments, and further configured to obtain association information between at least one pair of target text segments, where the association information between any pair of target text segments is used to characterize the possibility of association between that pair of target text segments; a determination module, configured to determine an association result between the at least one pair of target text segments based on the association information between the at least one pair of target text segments; and an extraction module, configured to extract text information in the target text image based on the association result between the at least one pair of target text segments.
  • Embodiments of the present application provide a device for acquiring a target model.
  • The device includes: a first acquisition module, configured to acquire a sample text image, where the sample text image includes multiple sample text segments; a second acquisition module, configured to acquire predicted association information between at least one pair of sample text segments and annotation association information between the at least one pair of sample text segments; and a third acquisition module, configured to obtain the target model based on the predicted association information between the at least one pair of sample text segments and the annotation association information between the at least one pair of sample text segments.
  • Embodiments of the present application provide an electronic device.
  • The electronic device includes a processor and a memory. At least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor, so that the electronic device implements the above text information extraction method or the above target model acquisition method.
  • A computer-readable storage medium is also provided. At least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor to enable a computer to implement the above text information extraction method or the above target model acquisition method.
  • A computer program or computer program product is also provided. At least one computer program is stored in the computer program or computer program product, and the at least one computer program is loaded and executed by a processor, so that a computer implements the above text information extraction method or the above target model acquisition method.
  • Figure 1 is a schematic diagram of the implementation environment of a text information extraction method or a target model acquisition method provided by an embodiment of the present application.
  • Figure 2 is a flow chart of a text information extraction method provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of a target text image provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of extracting image features of the image area where a target text segment is located according to an embodiment of the present application.
  • Figure 5 is a flow chart of a method for obtaining a target model provided by an embodiment of the present application.
  • Figure 6 is an example diagram of distances between features of sample text segments provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of the training of a neural network model provided by an embodiment of the present application.
  • Figure 8 is a schematic diagram of extracting text information from a target text image provided by an embodiment of the present application.
  • Figure 9 is another schematic diagram of extracting text information from a target text image provided by an embodiment of the present application.
  • Figure 10 is a schematic diagram of extracting text information from another target text image provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of extracting text information from yet another target text image provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a text information extraction device provided by an embodiment of the present application.
  • Figure 13 is a schematic structural diagram of a device for obtaining a target model provided by an embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • In the related art, text recognition is first performed on a target text image to obtain multiple text segments.
  • The target text image is a structured text image or a semi-structured text image.
  • Then, the category of each text segment is determined.
  • Based on the association between categories and the category of each text segment, the multiple text segments are associated to obtain association results of the multiple text segments, and text information in the target text image is extracted based on the association results of the multiple text segments.
  • However, since any two text segments may correspond to the same category, when multiple text segments are associated based on the association between categories and the category of each text segment, the accuracy of the association results is poor, and as a result the accuracy of the text information extracted from the target text image is also poor.
  • FIG. 1 is a schematic diagram of an implementation environment of a text information extraction method or a target model acquisition method provided by an embodiment of the present application.
  • the implementation environment includes a terminal device 101 and a server 102 .
  • the text information extraction method or the target model acquisition method in the embodiment of the present application is executed by the terminal device 101, or executed by the server 102, or jointly executed by the terminal device 101 and the server 102.
  • the terminal device 101 is a smartphone, a game console, a desktop computer, a tablet computer, a laptop computer, a smart TV, a smart vehicle-mounted device, a smart voice interaction device, a smart home appliance, etc.
  • the server 102 is one server, or a server cluster composed of multiple servers, or any one of a cloud computing platform and a virtualization center, which is not limited in this embodiment of the present application.
  • the server 102 communicates with the terminal device 101 through a wired network or a wireless network.
  • the server 102 has functions such as data processing, data storage, and data sending and receiving, which are not limited in the embodiment of the present application.
  • The number of terminal devices 101 and servers 102 is not limited; for example, there may be one or more of each.
  • The embodiment of the present application provides a method for extracting text information.
  • The method is executed by the terminal device 101 or the server 102 in Figure 1, or jointly executed by the terminal device 101 and the server 102.
  • The terminal device 101 and the server 102 are collectively referred to as an electronic device.
  • The following description takes the case where the electronic device is a terminal as an example.
  • The method includes steps 201 to 204.
  • Step 201: The terminal obtains a target text image, which includes multiple target text segments.
  • Target text images refer to images containing text segments, including structured text images and semi-structured text images.
  • text segments refer to words, phrases or sentences composed of characters. Different text segments are usually separated by blank areas in the target text image.
  • each target text segment includes at least one character, and each character is any one of alphabetic characters, numbers, special symbols (such as punctuation marks, currency symbols, etc.), etc.
  • When the target text segment includes multiple characters, the multiple characters form at least one word, and can also form at least one sentence.
  • the target text image is a structured text image
  • the structured text image is an image that expresses text through a two-dimensional table structure, and the text in the image has organization, regularity, etc.
  • The structured text image includes multiple target text segments. For each target text segment in the structured text image, there is at least one other target text segment associated with it, where the other target text segments are the target text segments among the multiple target text segments other than this target text segment.
  • Figure 3 is a schematic diagram of a target text image provided by an embodiment of the present application, in which (1) is a structured text image. It can be seen from the structured text image that the target text segment "Item A" is associated with the target text segment "¥10", the target text segment "Item B" is associated with the target text segment "¥15", the target text segment "Item C" is associated with the target text segment "¥3", the target text segment "Item D" is associated with the target text segment "¥9", and the target text segment "Item E" is associated with the target text segment "¥1". Therefore, each target text segment in the structured text image has at least one other target text segment associated with it.
  • the target text image is a semi-structured text image
  • the semi-structured text image includes a structured text area and an unstructured text area.
  • the structured text area is an image area that expresses text through a two-dimensional table structure.
  • the text in this part of the image area has organization, regularity, etc.
  • the unstructured text area is an image area that expresses text through irregular and unorganized data structures.
  • The semi-structured text image includes multiple target text segments. For a semi-structured text image, each target text segment in one part of the target text segments has at least one other target text segment associated with it, while each target text segment in the other part of the target text segments has no other target text segment associated with it.
  • the embodiments of this application do not limit the image content, acquisition method, quantity, etc. of the target text image.
  • the target text image is a receipt image, a menu image, etc.
  • the target text image is a photographed image, or an image downloaded from the network.
  • Step 202: The terminal obtains association information between at least one pair of target text segments, where the association information between any pair of target text segments is used to represent the possibility of association between that pair of target text segments.
  • any two target text segments among the multiple target text segments are regarded as a pair of target text segments, thereby obtaining at least one pair of target text segments.
  • Obtaining the association information between at least one pair of target text segments means obtaining the association information between each pair of target text segments.
  • In some embodiments, the association information between each pair of target text segments is a non-negative number.
  • For example, the association information between each pair of target text segments is a number greater than or equal to 0 and less than or equal to 1; in this case, the association information between each pair of target text segments is called the association probability between that pair of target text segments.
  • The association information between each pair of target text segments is used to characterize the possibility of association between the pair of target text segments: the greater the association information between a pair of target text segments, the higher the possibility that the pair of target text segments are associated. That is, the association information between each pair of target text segments is proportional to the probability of association between the pair of target text segments.
  • In some embodiments, a target model is obtained, and the association information between at least one pair of target text segments is obtained according to the target model.
  • the method of obtaining the target model is described in the relevant description of Figure 5 below, and will not be described again here.
  • The target model includes at least one of an image feature extraction network or a text feature extraction network, and the target model determines and outputs the association information between at least one pair of target text segments in the target text image based on an output of at least one of the image feature extraction network or the text feature extraction network.
  • The image feature extraction network is used to extract image features of the image area where each target text segment in the at least one pair of target text segments is located.
  • The text feature extraction network is used to extract text features of each target text segment in the at least one pair of target text segments.
  • In some embodiments, obtaining the association information between each pair of target text segments includes: obtaining at least one of the features of each pair of target text segments or the relative position features between each pair of target text segments, where the features of each target text segment in each pair include at least one of the image features of the image area where the target text segment is located or the text features of the target text segment, and the relative position features between each pair of target text segments are used to characterize the relative position between the image areas where the pair of target text segments are located; and determining the association information between each pair of target text segments based on at least one of the features of the at least one pair of target text segments or the relative position features between the at least one pair of target text segments.
  • That is, the features of each target text segment are the image features of the image area where the target text segment is located, or the text features of the target text segment, or include both the image features of the image area where the target text segment is located and the text features of the target text segment.
  • In some embodiments, the features of each target text segment include the image features of the image area where the target text segment is located.
  • The image features are obtained by: obtaining the image features of the target text image; and determining the image features of the image area where the target text segment is located based on the image features of the target text image and the position information of the image area where the target text segment is located.
  • the target text image is input into the image feature extraction network, and the image feature extraction network outputs the image features of the image area where each of the target text segments is located in at least one pair of target text segments.
  • the image features of the image area where the target text segment is located are used to characterize the texture information of the image area where the target text segment is located.
  • image detection is performed on the target text image to obtain position information of the image area where each target text segment in the target text image is located.
  • This embodiment of the present application does not limit the position information of the image area where each target text segment is located.
  • the image area where each target text segment is located is a rectangle, a circle, etc.
  • the position information of the image area where each target text segment is located includes at least one of the center point coordinates, vertex coordinates, side length, perimeter, area, radius, etc. of the image area where each target text segment is located.
  • the coordinates include the abscissa and ordinate
  • the side length includes the height and width.
  • the image feature extraction network includes a first extraction network and a second extraction network.
  • The first extraction network extracts the image features of the target text image based on the pixel information of each pixel in the target text image (or the normalized target text image), where the image features of the target text image are used to characterize the texture information of the target text image.
  • the second extraction network determines the image features of the image area where each target text segment is located (recorded as the first area feature) based on the image features of the target text image and the position information of the image area where each target text segment is located.
  • In some embodiments, the target text image is input to the first extraction network, and the first extraction network sequentially performs convolution processing and normalization processing on the target text image, normalizing the convolved target text image onto a standard distribution, which prevents gradient oscillation during training and reduces model overfitting.
  • The image features of the target text image are then determined and output based on the normalized target text image.
  • During normalization, at least one of the average value of the pixel information and the variance of the pixel information is determined, and the convolved target text image is normalized using at least one of the average value of the pixel information and the variance of the pixel information.
  • This method of normalization is called instance normalization (IN). Since target text images have large layout and appearance differences, retaining the shallow appearance information of the image through instance normalization facilitates the integration and adjustment of the global information of the image, and improves the stability of training and the generalization of the model.
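  • A minimal sketch of this instance normalization step, assuming a PyTorch-style feature map of shape (batch, channels, height, width); the function name and epsilon are illustrative:

```python
import torch

def instance_normalize(feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Instance normalization (IN): each channel of each image is normalized
    # with the mean and variance of its own pixel information, which retains
    # the shallow appearance information of that image.
    mean = feat.mean(dim=(2, 3), keepdim=True)
    var = feat.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (feat - mean) / torch.sqrt(var + eps)

# Equivalent in effect to torch.nn.InstanceNorm2d(num_channels, affine=False).
```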
  • both the first extraction network and the second extraction network are convolutional neural networks (Convolutional Neural Networks, CNN).
  • For example, the first extraction network is a backbone network using the U-Net architecture, which is used to extract visual features of the target text image: according to the pixel information of each pixel in the target text image, the target text image is first down-sampled to obtain down-sampled features, and the down-sampled features are then up-sampled to obtain the image features of the target text image.
  • The second extraction network is a region-of-interest pooling (Region Of Interest Pooling, ROI Pooling) layer or a region-of-interest alignment (Region Of Interest Align, ROI Align) layer, which determines the image features of the image area where each target text segment is located according to the image features of the target text image and the position information of the image area where each target text segment is located. That is, the ROI Pooling layer or ROI Align layer performs feature extraction again on the image features of the target text image based on the position information of the image area where each target text segment is located, obtaining the image features of the image area where each target text segment is located.
  • the image feature of the image area where each target text segment is located is a visual feature of a fixed dimension (such as 16 dimensions).
  • The first extraction network can also be a backbone network using the Feature Pyramid Network (FPN) architecture, or a backbone network using the ResNet architecture.
  • Figure 4 is a schematic diagram of extracting image features of an image area where a target text segment is located according to an embodiment of the present application.
  • the target text image is the image shown in (2) in Figure 3, and the target text image includes the image area where the target text segment "price list" is located, as shown by the dotted box in Figure 4.
  • the target text image is input into the backbone network, and the backbone network outputs the image features of the target text image.
  • feature extraction is performed again on the image features of the target text image to obtain the image features of the image area where the target text segment is located.
  • Among them, the backbone network using the U-Net architecture has cross-layer connections by design, which is friendlier to feature extraction for image areas.
  • The ROI Pooling layer or ROI Align layer performs feature extraction again on the image features of the target text image obtained after up-sampling to obtain the image features of the image area where each target text segment is located, which avoids the error accumulation caused by down-sampling and improves accuracy.
  • Since the image features of the target text image are obtained based on the global information of the target text image, the image features of the image area where each target text segment is located also carry the global information of the target text image, so the feature expression ability is stronger and the accuracy is higher.
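  • A minimal sketch of this region-feature extraction, assuming torchvision's ROI Align over the backbone's output feature map; the box format, output size, and flattening are illustrative:

```python
import torch
from torchvision.ops import roi_align

def region_features(feature_map: torch.Tensor, boxes: torch.Tensor,
                    spatial_scale: float, out_size=(4, 4)) -> torch.Tensor:
    # feature_map: (1, C, H', W') image features of the target text image
    # output by the U-Net-style backbone.
    # boxes: (K, 4) [x1, y1, x2, y2] positions of the image areas where the
    # K target text segments are located, in input-image coordinates.
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index 0
    pooled = roi_align(feature_map, rois, out_size,
                       spatial_scale=spatial_scale, aligned=True)  # (K, C, *out_size)
    return pooled.flatten(1)  # one fixed-dimension visual feature per segment
```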
  • image detection is performed on the target text image first to obtain position information of the image area where each target text segment in the target text image is located.
  • image segmentation is performed on the target text image to obtain the image area where each target text segment in the target text image is located.
  • the image features of the image area where the target text segment is located are extracted (recorded as second area features).
  • Afterwards, the first area feature and the second area feature are spliced or fused to obtain the image features of the image area where each target text segment is located.
  • During splicing, the first area feature is spliced before or after the second area feature to obtain the image features of the image area where each target text segment is located.
  • During fusion, the outer product between the first area feature and the second area feature is calculated in the form of a Kronecker product to obtain the image features of the image area where each target text segment is located.
  • Alternatively, the first area feature is divided into a reference number of first area blocks and the second area feature is divided into a reference number of second area blocks; for each first area block, the first area block and the second area block associated with it are fused to obtain a fusion area block, and the fusion area blocks are spliced to obtain the image features of the image area where each target text segment is located.
  • In some embodiments, the features of each target text segment include the text features of the target text segment.
  • The text features are obtained by: obtaining the word vector of each word in the target text segment, and fusing the word vectors of the words in the target text segment to obtain the text features of the target text segment.
  • the text features of the target text segment are used to represent the semantic information of the words contained in the target text segment itself.
  • The features of the target text segment include but are not limited to the text features of the target text segment.
  • image detection and image segmentation are performed on the target text image in sequence to obtain the image area where each target text segment in the target text image is located.
  • For each target text segment, image recognition is performed on the image area where the target text segment is located to obtain the target text segment.
  • Then, the target text segment is input into the text feature extraction network. The text feature extraction network first uses a tokenizer to segment the target text segment to obtain each word in the target text segment, and determines the word vector of each word in the target text segment through a vector lookup table.
  • The word vector is a vector with a fixed dimension (such as 200 dimensions).
  • Afterwards, the contextual semantic relationship of the text is further learned based on the word vectors of the words in each target text segment, so as to fuse the word vectors of the words in the target text segment into the text features of the target text segment.
  • For example, the text feature extraction network is a Bi-directional Long Short-Term Memory (Bi-LSTM) network or a Transformer network.
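  • A minimal sketch of the Bi-LSTM variant, assuming the tokenizer yields integer word ids; the vocabulary size, dimensions, and mean-pooling fusion are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    # Tokenizer output -> word-vector lookup table -> Bi-LSTM that learns
    # the contextual semantic relationship and fuses the word vectors of a
    # target text segment into one text feature.
    def __init__(self, vocab_size: int = 30000, emb_dim: int = 200, hidden: int = 128):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb_dim)  # word-vector lookup table
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (1, num_words) ids of the words in one target text segment
        vecs = self.lookup(token_ids)   # fixed-dimension (e.g. 200-d) word vectors
        out, _ = self.bilstm(vecs)      # contextual fusion of the word vectors
        return out.mean(dim=1)          # text feature of the segment (pooling assumed)
```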
  • the characteristics of each target text segment include the image characteristics of the image area where the target text segment is located and the text characteristics of the target text segment.
  • Obtaining the features of each target text segment includes: for each target text segment in each pair of target text segments, dividing the image features of the image area where the target text segment is located into a target number of image feature blocks, and dividing the text features of the target text segment into a target number of text feature blocks; for each image feature block, fusing the image feature block and the text feature block associated with the image feature block to obtain a fused feature block; and splicing the fused feature blocks to obtain the features of the target text segment.
  • Alternatively, the target model can also splice or fuse the image features of the image area where each target text segment is located with the text features of the target text segment to obtain the features of the target text segment. For example, during fusion, the outer product between the image features of the image area where each target text segment is located and the text features of the target text segment is calculated in the form of a Kronecker product to obtain the features of the target text segment.
  • During block-wise fusion, for each target text segment, the image features of the image area where the target text segment is located are first divided into a target number of image feature blocks, recorded as the 1st to Nth image feature blocks, where N is a positive integer greater than 1 and represents the target number. The text features of the target text segment are also divided into a target number of text feature blocks, recorded as the 1st to Nth text feature blocks.
  • the image feature block and the associated text feature block are fused to obtain a fused feature block, where the association between the image feature block and the text feature block means having the same serial number.
  • If an image feature block is recorded as the i-th image feature block, then the text feature block associated with it is the i-th text feature block, the resulting fused feature block is recorded as the i-th fused feature block, and i is a positive integer from 1 to N.
  • the outer product between the i-th image feature block and the i-th text feature block is calculated in the form of a Kronecker product to obtain the i-th fused feature block.
  • Afterwards, the fused feature blocks are spliced to obtain the features of the target text segment; that is, the 1st to Nth fused feature blocks are spliced to obtain the features of the target text segment.
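  • A minimal sketch of this block-wise fusion, assuming 1-D feature vectors whose lengths are divisible by the target number N:

```python
import torch

def fuse_blockwise(img_feat: torch.Tensor, txt_feat: torch.Tensor, n: int) -> torch.Tensor:
    # Divide each feature into the 1st to Nth blocks, fuse the i-th image
    # feature block with the i-th text feature block (same serial number)
    # via a Kronecker (outer) product, then splice the fused blocks.
    img_blocks = img_feat.chunk(n)
    txt_blocks = txt_feat.chunk(n)
    fused = [torch.kron(a, b) for a, b in zip(img_blocks, txt_blocks)]
    return torch.cat(fused)  # features of the target text segment
```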
  • the image features of the image area where each target text segment is located and the text features of the target text segment are first spliced or fused, and then a nonlinear operation is performed to obtain the features of the target text segment.
  • The features of each target text segment are used to indicate the feature information that the target text segment itself has, such as image features representing the texture information of the image area where it is located, text features representing its own semantic information, or information obtained by fusing the above image features and the above text features.
  • the target model determines the associated information between each pair of target text segments based on the characteristics of each pair of target text segments.
  • In some embodiments, obtaining the relative position features between each pair of target text segments includes: obtaining the position information of the image areas where each pair of target text segments are located; and determining the relative position features between each pair of target text segments based on the position information of the image areas where each pair of target text segments are located and the size information of the target text image.
  • the relative position feature between each pair of target text segments is used to characterize the position difference between the pair of target text segments in the target text image.
  • For each pair of target text segments, the position information of the image areas where the two target text segments in the pair are located includes at least one of the center point coordinates, vertex coordinates, side length, perimeter, area, radius, etc. of the image area where each target text segment is located. Size information of the target text image is also obtained, and the size information of the target text image includes at least one of the side length, perimeter, area, radius, etc. of the target text image.
  • the coordinates include the abscissa and ordinate
  • the side length includes the width and height.
  • In formula (1), r_ij is the relative position feature between the i-th target text segment and the j-th target text segment; d is a normalization factor that prevents numerical fluctuations when computing over images of different formats; Δx_ij = x_j - x_i is the relative horizontal distance between the image areas where the i-th and j-th target text segments are located, where x_i and x_j are the center-point abscissas of the image areas where the i-th and j-th target text segments are located; Δy_ij = y_j - y_i is the relative vertical distance between those image areas, where y_i and y_j are the corresponding center-point ordinates; w_i and h_i are the width and height of the image area where the i-th target text segment is located; w_j and h_j are the width and height of the image area where the j-th target text segment is located; and W and H are the width and height of the target text image.
  • the target model determines the relative position characteristics between at least one pair of target text segments. Afterwards, the target model determines the associated information between each pair of target text segments based on the relative position characteristics between each pair of target text segments.
  • In some embodiments, the relative position features between each pair of target text segments are normalized and linearly processed to obtain the processed relative position features between each pair of target text segments, as shown in formula (2):
  • e′_ij = N_l2(E r_ij)    (2)
  • In formula (2), e′_ij represents the processed relative position features between the i-th target text segment and the j-th target text segment; N_l2 represents normalization processing, such as L2-norm normalization, which improves stability; E represents linear processing, which projects r_ij to a fixed dimension; and r_ij is the relative position feature between the i-th target text segment and the j-th target text segment.
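  • A minimal sketch combining the quantities defined for formula (1) with the projection of formula (2). The exact way r_ij is composed from these quantities is not reproduced above, so the concatenation below is an assumption, as are the dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_position_feature(box_i, box_j, W, H, d):
    # box = (cx, cy, w, h): center point, width, and height of the image
    # area where a target text segment is located.
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    dx = (xj - xi) / d  # Δx_ij / d, with d the normalization factor
    dy = (yj - yi) / d  # Δy_ij / d
    # Assumed composition of r_ij from the quantities the text defines.
    return torch.tensor([dx, dy, wi / hi, wj / hj, W / H], dtype=torch.float32)

class RelPosProjection(nn.Module):
    # Formula (2): e'_ij = N_l2(E r_ij), a linear projection E to a fixed
    # dimension followed by L2-norm normalization.
    def __init__(self, in_dim: int = 5, out_dim: int = 16):
        super().__init__()
        self.E = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, r_ij: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.E(r_ij), p=2, dim=-1)
```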
  • the relative position features between each pair of target text segments are used to determine the associated information between each pair of target text segments.
  • In some embodiments, determining the association information between each pair of target text segments includes: constructing a graph structure based on the features of the at least one pair of target text segments and the relative position features between the at least one pair of target text segments, where the graph structure includes at least two nodes and at least one edge, each node represents the features of a target text segment, and each edge represents the relative position features between the pair of target text segments indicated by the pair of nodes connected by the edge; and then determining the association information between each pair of target text segments based on the graph structure.
  • That is, the features of each target text segment in the at least one pair of target text segments are used as a node of the graph structure; a node in the graph structure is associated with the features of one target text segment. Furthermore, the relative position features between each pair of target text segments are used as the connecting edge between the pair of nodes associated with that pair of target text segments in the graph structure. Alternatively, the relative position features between each pair of target text segments are normalized and linearly processed according to formula (2) above, and the processed relative position features are used as the connecting edge between the pair of nodes associated with that pair of target text segments in the graph structure.
  • The node and edge features are then spliced and fused into a scalar edge feature, as in formula (3):
  • e_ij = M(n_i ⊕ e′_ij ⊕ n_j)    (3)
  • In formula (3), e_ij is the feature obtained by fusing the spliced features using a multi-layer perceptron; M is a multi-layer perceptron that can transform vector features into scalar features; n_i is the feature of the i-th target text segment; ⊕ is the splicing symbol; e′_ij represents the processed relative position features between the i-th target text segment and the j-th target text segment; and n_j is the feature of the j-th target text segment.
  • the graph structure is obtained through the above method, so as to simulate the layout relationship between each target text segment in the target text image through the graph structure.
  • the target model includes a Graph Convolutional Network (GCN).
  • the graph structure of the target text image is input into the graph convolution network, and the graph convolution network determines and outputs the association information between at least one pair of target text segments.
  • the graph convolution network mines the structured relationship between the two nodes at both ends of the edge in the graph structure by continuously iteratively updating the graph structure, thereby obtaining the associated information between at least one pair of target text segments.
  • the process of iteratively updating the graph structure is the process of iteratively updating the nodes of the graph structure, while the edges of the graph structure are not updated.
  • In each iteration, the weight of each edge is first determined according to formula (4) shown below:
  • α_ij^l = exp(e_ij^l) / Σ_k exp(e_ik^l)    (4)
  • In formula (4), α_ij^l is the weight of edge e_ij in the graph structure at the l-th iteration; exp is the exponent symbol; Σ is the summation symbol; k is a sequence number; e_ik represents the edge between the i-th node and the k-th node in the graph structure; and edge e_ij is the edge between the i-th node and the j-th node in the graph structure.
  • Then, based on the weights of the edges incident to a node and those edges, the node is updated according to formula (5):
  • n_i^(l+1) = σ(W^l Σ_j α_ij^l e_ij^l)    (5)
  • In formula (5), σ represents nonlinear processing; W^l represents the linear processing at the l-th iteration; and α_ij^l is the weight of edge e_ij in the graph structure at the l-th iteration, where edge e_ij is the edge between the i-th node and the j-th node in the graph structure.
  • In this way, an iterative update of each node in the graph structure is achieved, that is, an iterative update of the graph structure is achieved. If the iteration end condition is met, the updated graph structure is used as the final graph structure, and the final graph structure is used to determine the association information between at least one pair of target text segments. If the iteration end condition is not met, the updated graph structure is used as the graph structure for the next iteration, and the graph structure is updated again in the manner shown in formulas (4) and (5) until the iteration end condition is met, at which point the final graph structure is obtained and used to determine the association information between at least one pair of target text segments. It should be noted that when iteratively updating the graph structure, in addition to iteratively updating the nodes of the graph structure, the edges of the graph structure can also be iteratively updated.
  • Satisfying the iteration end condition can mean that a set number of iterations has been reached, or that the change between the graph structure before an iterative update and the graph structure after the update is less than a change threshold, that is, the graph structure tends to be stable.
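  • A minimal sketch of one iteration over formulas (3) to (5), assuming N nodes with vector features and vector-valued edges e'_ij. The MLP sizes, the choice of ReLU as the nonlinearity σ, and reading formula (5) as aggregating the weighted edge features into the node update are assumptions where the text is not explicit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphUpdateLayer(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        # M in formula (3): multi-layer perceptron that transforms the
        # spliced vector [n_i ⊕ e'_ij ⊕ n_j] into a scalar edge feature.
        self.M = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, 64),
                               nn.ReLU(), nn.Linear(64, 1))
        self.W = nn.Linear(edge_dim, node_dim)  # W^l, linear processing

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (N, node_dim) features of the target text segments.
        # edges: (N, N, edge_dim) processed relative position features e'_ij.
        n = nodes.size(0)
        n_i = nodes.unsqueeze(1).expand(n, n, -1)
        n_j = nodes.unsqueeze(0).expand(n, n, -1)
        e = self.M(torch.cat([n_i, edges, n_j], dim=-1)).squeeze(-1)  # formula (3)
        alpha = F.softmax(e, dim=-1)                                  # formula (4)
        agg = torch.einsum('ij,ijd->id', alpha, edges)                # Σ_j α_ij e'_ij
        return F.relu(self.W(agg))                                    # formula (5)
```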
  • the embodiment of the present application first constructs a graph structure based on the characteristics of each pair of target text segments in the target text image and the relative position characteristics between each pair of target text segments, and then determines the associated information between at least one pair of target text segments based on the graph structure. Since each pair of target text segments is every two target text segments, the features of each pair of target text segments are equivalent to the features of every two target text segments.
  • For example, the pairs of target text segments in the target text image include target text segments 1 and 2, target text segments 2 and 3, and target text segments 1 and 3. A graph structure is constructed based on the features of target text segment 1, the features of target text segment 2, the features of target text segment 3, the relative position features between target text segments 1 and 2, the relative position features between target text segments 2 and 3, and the relative position features between target text segments 1 and 3, and then the association information between target text segments 2 and 3 (and likewise between the other pairs) is determined based on the graph structure.
  • In some embodiments, the method of obtaining the association information between each pair of target text segments also includes: obtaining the category of each target text segment and the association information between every two target text segments; and determining the association information between each pair of target text segments from the association information between every two target text segments based on the category of each target text segment.
  • The category of each target text segment is used to represent the category, among multiple preset categories, to which the target text segment belongs, that is, the category to which the target text segment's own semantic information belongs. For example, although the two target text segments "10 yuan" and "15 yuan" express different price semantics, they both belong to the same category "dish price".
  • In some embodiments, the category of each target text segment is determined based on the features of the target text segment. The association information between any two target text segments is determined based on the features of the two target text segments, or based on the relative position features between the two target text segments, or based on both the features of the two target text segments and the relative position features between them.
  • The embodiment of the present application does not limit the category of each target text segment. For example, if the target text image is a menu image, the category of each target text segment is at least one of dish name, dish price, store name, dish type, others, etc.
  • Among them, determining the association information between every two target text segments based on the graph structure is similar to the above description of "determining the association information between each pair of target text segments based on the graph structure"; the implementation principles of the two are similar and will not be repeated here.
  • Then, the association information between each pair of target text segments is determined from the association information between every two target text segments.
  • a Long Short Term Memory (LSTM) network and a Conditional Random Field (CRF) network are used to determine the category of each target text segment based on the characteristics of the target text segment.
  • the LSTM network and the CRF network determine the category of each character in the target text segment based on the characteristics of each target text segment, and determine the category of the target text segment based on the category of each character.
  • If the characters in the target text segment all have the same category, the category of the target text segment is the category of any one character.
  • If the characters in the target text segment have different categories, the target text segment is segmented into at least two target text segments based on the categories of its characters, such that the characters within each segmented target text segment share the same category, and the category of each segmented target text segment is the category of any character within it.
  • the target text segment A is "Eggs 6 yuan".
  • the category of the character “chicken” is the name of the dish
  • the category of the character “egg” is the name of the dish
  • the category of the character “6” is the price of the dish
  • the category of the character “ Yuan” category is vegetable price.
  • the target text segment A is divided into the target text segment A1 "eggs” and the target text segment A2 "6 yuan”.
  • the category of the target text segment A1 "eggs” is the name of the dish
  • the category of the target text segment A2 "6 yuan” is Vegetable prices.
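  • A minimal sketch of this per-character splitting, assuming the LSTM+CRF has already produced one category label per character; names are illustrative:

```python
from itertools import groupby

def split_by_category(chars, cats):
    # Characters sharing one category stay in one segment; a category
    # change starts a new segmented target text segment, whose category
    # is that of any character within it.
    segments, idx = [], 0
    for cat, run in groupby(cats):
        n = len(list(run))
        segments.append(("".join(chars[idx:idx + n]), cat))
        idx += n
    return segments

# e.g. the characters of "Eggs 6 yuan", labeled dish name / dish price per
# character, give [("Eggs", "dish name"), ("6 yuan", "dish price")].
```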
  • In some embodiments, determining the association information between each pair of target text segments from the association information between every two target text segments includes: filtering out, based on the category of each target text segment, the text segments to be associated that have the target category from the multiple target text segments included in the target text image; and filtering out the association information between every two text segments to be associated from the association information between every two target text segments, thereby obtaining the association information between each pair of target text segments.
  • A text segment to be associated refers to a target text segment with the target category, that is, a target text segment that needs to participate in the calculation of association information: only every two text segments to be associated need association information calculated, and text segments of other categories do not need association information calculated for them as pairs of target text segments.
  • For each target text segment, if the category of the target text segment is the target category, the target text segment is a text segment to be associated; if the category of the target text segment is not the target category, the target text segment is not a text segment to be associated. In this way, the text segments to be associated are filtered out from the multiple target text segments.
  • the embodiment of the present application does not limit the target category.
  • For example, the target text image is a menu image. Since the main focus is the matching relationship between dish names and dish prices in the menu image, the target categories are dish name and dish price.
  • the associated information between each two text segments to be associated can be filtered out from the associated information between each two target text segments.
  • the association information between each two text segments to be associated is regarded as the association information between a pair of target text segments.
  • For example, if the multiple target text segments are target text segments 1 to 3 and the text segments to be associated are target text segments 2 and 3, then only the association information between target text segments 2 and 3 needs to be determined.
  • Step 203: The terminal determines an association result between the at least one pair of target text segments based on the association information between the at least one pair of target text segments.
  • That is, an association result between each pair of target text segments is determined based on the association information between each pair of target text segments, where the association result between each pair of target text segments indicates whether the pair of target text segments are associated.
  • If the association information between a certain pair of target text segments is greater than the association threshold, the association result between the pair of target text segments is determined to be associated; if the association information between a certain pair of target text segments is not greater than the association threshold, the association result between the pair of target text segments is determined to be not associated.
  • the embodiment of the present application does not limit the value of the correlation threshold. For example, the correlation threshold is 0.5.
  • In some embodiments, the category of each target text segment in the at least one pair of target text segments is determined, and the association relationship between every two categories is obtained.
  • the association relationship between two categories is used to characterize whether the two categories are associated.
  • The category of each target text segment is determined based on its features, or the graph structure of the target text image is input into the graph convolution network, and the graph convolution network determines and outputs the category of each target text segment in the at least one pair of target text segments.
  • When the category of each target text segment is determined by the graph convolution network, the graph convolution network updates the graph structure at least once to obtain a final graph structure, and the final graph structure is used to determine the category of each target text segment in the at least one pair of target text segments.
  • If the association information between a pair of target text segments is greater than the association threshold, and the association relationship between the categories of the two target text segments in the pair is associated, the association result between the pair of target text segments is determined to be associated. If the association information between the pair is greater than the association threshold but the association relationship between the categories of the two target text segments in the pair is not associated, the association result between the pair is determined to be not associated. If the association information between the pair is not greater than the association threshold but the association relationship between the categories is associated, the association result between the pair is determined to be not associated. If the association information between the pair is not greater than the association threshold and the association relationship between the categories is not associated, the association result between the pair is determined to be not associated.
  • the association threshold is 0.5
  • the association between the two categories includes the association between the dish name and the dish price.
  • If the association information between a pair of target text segments is 0.7 and the categories of the two target text segments in the pair are dish name and dish price respectively, the association result between the pair of target text segments is determined to be associated.
  • If the association information between another pair of target text segments is 0.51 but the categories of both target text segments in the pair are dish name, the association result between the pair of target text segments is determined to be not associated.
  • Step 204: The terminal extracts text information in the target text image based on the association result between the at least one pair of target text segments.
  • That is, the text information in the target text image is extracted based on the association result between each pair of target text segments, where the text information at least includes the associated pairs obtained by combining each pair of associated target text segments. The text information thus not only represents the content of the target text segments identified from the target text image, but also reflects the association results between the target text segments. If the association result between a pair of target text segments is associated, a target symbol (such as at least one of ":", "-", "/", etc.) is added between the pair of target text segments, so that the pair of target text segments is combined into an associated pair. If the association result between a pair of target text segments is not associated, the pair of target text segments is not combined into an associated pair.
  • In this way, it is determined for each pair of target text segments whether it can be combined into an associated pair, and when it can (that is, when the association result between the pair of target text segments is associated), the pair of target text segments is combined into an associated pair. This realizes the association of the multiple target text segments in the target text image and yields the text information in the target text image.
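  • A minimal sketch of steps 203 and 204 combined: thresholding the association information, checking category compatibility, and joining each associated pair with a target symbol. The threshold, categories, and ":" separator follow the examples above; the data shapes are illustrative:

```python
def extract_text_info(segments, categories, assoc_info,
                      compatible, threshold=0.5, sep=": "):
    # segments: {idx: text of target text segment}
    # categories: {idx: category of target text segment}
    # assoc_info: {(i, j): association information for the pair}
    # compatible: category pairs whose association relationship is
    # "associated", e.g. {("dish name", "dish price")}
    associated_pairs = []
    for (i, j), p in assoc_info.items():
        if p > threshold and (categories[i], categories[j]) in compatible:
            # association result is "associated": combine into a pair
            associated_pairs.append(segments[i] + sep + segments[j])
    return associated_pairs

# extract_text_info({1: "Eggs", 2: "6 yuan"},
#                   {1: "dish name", 2: "dish price"},
#                   {(1, 2): 0.7}, {("dish name", "dish price")})
# -> ["Eggs: 6 yuan"]
```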
  • In the embodiment of the present application, the association information between each pair of target text segments is used to characterize the possibility of association between that pair of target text segments. Determining the association result between each pair of target text segments through this association information reduces association errors and improves the accuracy of the association results, so that when the text information in the target text image is extracted based on the association results between at least one pair of target text segments, the accuracy of the text information is also improved.
  • The embodiment of the present application provides a method for obtaining a target model.
  • The method can be executed by the terminal device 101 or the server 102 shown in Figure 1, or jointly executed by the terminal device 101 and the server 102.
  • The terminal device 101 and the server 102 are collectively referred to as electronic devices.
  • The following description takes the case where the electronic device is a server as an example.
  • The method includes steps 501 to 503.
  • Step 501: The server obtains a sample text image, which includes multiple sample text segments.
  • the sample text image is a structured text image or a semi-structured text image.
  • the sample text image in the embodiment of the present application is the same as the target text image mentioned above. See the description of the target text image above, which will not be described again.
  • Step 502 The server obtains predicted correlation information between at least one pair of sample text segments and annotation correlation information between at least one pair of sample text segments.
  • the predicted correlation information between each pair of sample text segments is a positive number.
  • the predicted association information between each pair of sample text segments is a number greater than or equal to 0 and less than or equal to 1
  • the predicted association information between each pair of sample text segments is called the association probability between each pair of sample text segments.
  • the predicted correlation information between each pair of sample text segments can be found in the above description of "the correlation information between each pair of target text segments". The implementation principles of the two are the same and will not be described again.
  • the predicted association information between each pair of sample text segments is obtained according to the neural network model.
  • the neural network model includes at least one of a first initial network and a second initial network.
  • the neural network model determines and outputs predicted association information between each pair of sample text segments based on an output of at least one of the first initial network or the second initial network.
  • the first initial network is used to extract image features of the image area where each sample text segment in each pair of sample text segments is located
  • the second initial network is used to extract text features of each sample text segment in each pair of sample text segments.
  • the first initial network is trained using sample text images to obtain an image feature extraction network, so as to use the image feature extraction network to extract image features of the image area where each target text segment is located in each pair of target text segments.
  • for the first initial network, see the description of the image feature extraction network above. The implementation principles of the two are the same and will not be described again.
  • for the second initial network, see the description of the text feature extraction network above. The implementation principles of the two are the same and will not be described again.
  • Characteristics of at least a pair of sample text segments are obtained, and the characteristics of each sample text segment in each pair of sample text segments include at least one of image features of the image area where the sample text segment is located or text features of the sample text segment. Among them, the characteristics of the sample text segment are obtained in the same way as the characteristics of the target text segment. See the relevant description of the characteristics of the target text segment above, which will not be described again here.
  • the first initial network includes a first sub-network and a second sub-network.
  • the first sub-network extracts the image features of the sample text image based on the pixel information of each pixel in the sample text image.
  • the image features of the sample text image are used to characterize the texture of the sample text image. information.
  • the second sub-network determines the image features of the image area where each sample text segment is located based on the image features of the sample text image and the position information of the image area where each sample text segment is located.
  • the first sub-network is trained to obtain the first extraction network. For the first sub-network, see the above description of the first extraction network.
  • the implementation principles of the two are the same and will not be described again.
  • the second sub-network is trained to obtain the second extraction network.
  • for the second sub-network, see the above description of the second extraction network.
  • the implementation principles of the two are the same and will not be described again.
  • relative position features between at least one pair of sample text segments are obtained.
  • the relative position features between each pair of sample text segments are used to characterize the relative position between the image areas where each pair of sample text segments are located.
  • the relative position features between each pair of sample text segments are obtained in the same way as the relative position features between each pair of target text segments. See above for the relative position features between each pair of target text segments. Description will not be repeated here.
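The relative position features are computed from the position information of the image areas and the size information of the image. As a hedged illustration only (the precise formulation is given elsewhere in this description), one plausible sketch normalizes bounding-box offsets by the image size:

```python
def relative_position_feature(box_a, box_b, img_w, img_h):
    """Hypothetical relative-position feature between two text boxes.

    box_* = (x, y, w, h) of the image area where a text segment sits.
    Offsets are normalized by the image size so the feature does not
    depend on image resolution; this application computes the feature
    from the same inputs, but its exact formula is defined elsewhere
    in the description.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return [
        (xb - xa) / img_w,       # horizontal offset
        (yb - ya) / img_h,       # vertical offset
        wa / img_w, ha / img_h,  # size of the first box
        wb / img_w, hb / img_h,  # size of the second box
    ]
```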
  • Predictive association information between at least one pair of sample text segments is determined based on the characteristics of the at least one pair of sample text segments and the relative position characteristics between the at least one pair of sample text segments.
  • a graph structure is constructed based on the characteristics of at least one pair of sample text segments and the relative position characteristics between at least one pair of sample text segments.
  • the graph structure includes at least two nodes and at least one edge, and each node represents a sample text segment.
  • the characteristics of each edge represent the relative position characteristics between a pair of sample text segments, and the predicted association information between at least one pair of sample text segments is determined based on the graph structure.
  • the description of determining the predicted correlation information between at least one pair of sample text segments can be found in the above description on determining the correlation information between at least one pair of target text segments. The implementation principles of the two are the same and will not be described again here.
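A minimal sketch of the graph construction described above, with hypothetical container types; the description only requires that nodes carry segment features and edges carry relative-position features.

```python
# Hypothetical graph container: nodes[i] holds the feature vector of
# sample text segment i; edges[(i, j)] holds the relative-position
# feature between segments i and j.

class Graph:
    def __init__(self):
        self.nodes = []   # per-segment feature vectors
        self.edges = {}   # (i, j) -> relative-position feature

def build_graph(segment_features, rel_pos_features):
    """segment_features: list of per-segment feature vectors.
    rel_pos_features: dict mapping a pair (i, j) to its
    relative-position feature. Every segment becomes a node and
    every pair becomes an edge."""
    g = Graph()
    g.nodes = list(segment_features)
    for (i, j), feat in rel_pos_features.items():
        g.edges[(i, j)] = feat
    return g
```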
  • the neural network model also includes a third initial network, the graph structure of the sample text image is input into the third initial network, and the third initial network determines and outputs the correlation information between at least one pair of sample text segments.
  • the third initial network is trained to obtain a graph convolution network.
  • for the third initial network, see the description of the graph convolution network above. The implementation principles of the two are the same and will not be described again.
  • at least one pair of sample text segments is annotated with association information to obtain the annotation association information between the at least one pair of sample text segments.
  • the annotation association information between a pair of sample text segments is 0 or 1: 0 indicates that the pair of sample text segments is not associated, and 1 indicates that the pair of sample text segments is associated.
  • Step 503 Obtain a target model based on predicted correlation information between at least one pair of sample text segments and annotation correlation information between at least one pair of sample text segments.
  • the loss value of the neural network model is determined using the predicted correlation information between each pair of sample text segments and the annotation correlation information between each pair of sample text segments.
  • the neural network model is adjusted through the loss value of the neural network model to obtain the adjusted neural network model. If the training end condition is met, the adjusted neural network model is used as the target model; if the training end condition is not met, the adjusted neural network model is used as the neural network model for the next round of training, and training proceeds again according to steps 501 to 503 until the training end condition is met and the target model is obtained. A hedged sketch of this loop follows.
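To make the loop of steps 501 to 503 concrete, here is a minimal, hypothetical sketch; `model`, its `predict_association`/`adjust` methods, and the stopping rule are placeholder names, not interfaces defined by this application.

```python
def train(model, samples, labels, compute_loss, max_steps=10000, tol=1e-4):
    """Hypothetical trainer. `model` must expose predict_association()
    and adjust(loss); `compute_loss` compares predicted association
    information with annotation association information, e.g. the focal
    loss of formula (6)."""
    for step in range(max_steps):
        predicted = model.predict_association(samples)   # step 502
        loss = compute_loss(predicted, labels)
        model.adjust(loss)                               # step 503
        if loss < tol:      # placeholder end-of-training condition
            break
    return model            # the adjusted model becomes the target model
```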
  • the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments are used to determine the correlation information loss value.
  • formula (6) is the focal loss (Focal Loss) function, where $L_{rel}$ is the association information loss value, $p_{ij}$ represents the predicted association information between the i-th sample text segment and the j-th sample text segment, $\log$ denotes the logarithm, and $y_{ij}$ represents the annotation association information between the i-th and j-th sample text segments. $p_{ij}$ is obtained as $p_{ij}=E(e_{ij}^{L})$, where $e_{ij}^{L}$ is the edge between the node associated with the i-th sample text segment and the node associated with the j-th sample text segment in the graph structure after the L-th iteration, and $E$ is a linear layer used to map edges into predicted association information. $N^{2}$ represents the number of combinations of any two nodes in the graph structure, that is, the number of sample text segment pairs formed by any two sample text segments in the sample text image, and $a$ represents the dimension of the predicted association information between a pair of sample text segments; here $a$ is 2, where predicted association information of 0 indicates no association between a pair of sample text segments and predicted association information of 1 indicates association between the pair. Written in a standard focal-loss form consistent with these definitions, $L_{rel}=-\frac{1}{N^{2}}\sum_{i}\sum_{j}\bigl(1-p_{ij}^{(y_{ij})}\bigr)^{\gamma}\log p_{ij}^{(y_{ij})}$, where $p_{ij}^{(y_{ij})}$ is the predicted probability assigned to the annotated class $y_{ij}$ and $\gamma$ is the focusing parameter.
  • in practice, the number M of associated sample text segment pairs in an image is much less than (≪) N². If associated pairs of sample text segments are regarded as positive samples and unassociated pairs as negative samples, the number of negative samples far exceeds the number of positive samples; the probability distribution matrix to be fitted is therefore extremely sparse, and the ratio of positive to negative samples is seriously imbalanced. Formula (6) above addresses the sparsity of the probability distribution matrix and the imbalance between positive and negative samples: by balancing the loss contributions of positive and negative samples, the network avoids over-learning negative samples, thereby improving network performance.
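As an illustration of how a focal loss balances the few positive pairs against the many negative pairs described above, here is a hedged numpy sketch of a standard binary focal loss over the N × N association matrix; the focusing parameter `gamma` and balancing factor `alpha` are common defaults, not values fixed by this application.

```python
import numpy as np

def focal_association_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss over all N^2 segment pairs.

    p: (N, N) predicted association probabilities p_ij in [0, 1].
    y: (N, N) annotation association information y_ij in {0, 1}.
    Easy (mostly negative) pairs are down-weighted so the sparse
    positive pairs are not drowned out.
    """
    eps = 1e-9
    p = np.clip(p, eps, 1.0 - eps)
    pos = -alpha * ((1.0 - p) ** gamma) * np.log(p) * y
    neg = -(1.0 - alpha) * (p ** gamma) * np.log(1.0 - p) * (1.0 - y)
    return (pos + neg).mean()
```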
  • the associated information loss value is used as the loss value of the neural network model.
  • the loss value of the neural network model is determined based on the associated information loss value, the predicted category of each sample text segment in at least one pair of sample text segments, and the label category of each sample text segment in at least one pair of sample text segments.
  • the loss value of the neural network model is determined based on the associated information loss value and the characteristics of each sample text segment in at least one pair of sample text segments.
  • obtaining the target model based on the predicted association information and annotation association information between each pair of sample text segments also includes: obtaining the predicted category of each sample text segment in each pair of sample text segments and the annotation category of each sample text segment in each pair of sample text segments; and obtaining the target model based on the predicted association information between each pair of sample text segments, the annotation association information between each pair of sample text segments, the predicted category of each sample text segment in each pair of sample text segments, and the annotation category of each sample text segment in each pair of sample text segments.
  • the predicted category of the sample text segment is determined.
  • the graph structure of the sample text image is input into a third initial network, and the third initial network determines and outputs the predicted category of each sample text segment in each pair of sample text segments.
  • each sample text segment is annotated to obtain the annotation category of the sample text segment.
  • formula (7) is the cross entropy loss (Cross Entropy Loss, CE Loss) function, where $L_{node}$ is the category loss value, $N$ is the number of sample text segments in the at least one pair of sample text segments, $CE$ denotes the cross-entropy loss function, $E$ is a linear mapping used to map the node associated with the i-th sample text segment in the graph structure after the L-th iteration to the probability-distribution dimension to obtain the predicted category of the i-th sample text segment, and $y_{i}$ is the annotation category of the i-th sample text segment. During iteration the nodes of the graph structure are updated; the node in the updated graph structure is denoted here $v_{i}^{L}$, representing the i-th node in the graph structure after the L-th update. A form consistent with these definitions is $L_{node}=\frac{1}{N}\sum_{i=1}^{N}CE\bigl(E(v_{i}^{L}),\,y_{i}\bigr)$. The third initial network determines and outputs an $N\times b$-dimensional probability distribution matrix from the graph structure, where $N$ is the number of nodes in the graph structure, that is, the number of sample text segments in the sample text image, and $b$ is the number of predicted categories of the sample text segments.
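As a hedged illustration of the category loss of formula (7), the following numpy sketch computes a softmax cross-entropy over the N × b probability-distribution matrix; the function and variable names are assumptions for illustration.

```python
import numpy as np

def node_category_loss(logits, labels):
    """Cross-entropy over the N x b probability-distribution matrix.

    logits: (N, b) category scores for each sample text segment.
    labels: (N,) annotation category index of each segment.
    """
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    n = len(labels)
    return -log_probs[np.arange(n), labels].mean()
```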
  • the correlation information loss value is determined based on the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments.
  • the loss value of the neural network model is determined based on the category loss value and the associated information loss value.
  • obtaining the target model based on the predicted association information and annotation association information between each pair of sample text segments also includes: obtaining the characteristics of each sample text segment in each pair of sample text segments, where the characteristics of a sample text segment include at least one of the image features of the image area where the sample text segment is located or the text features of the sample text segment; and obtaining the target model based on the characteristics of each sample text segment in each pair of sample text segments, the predicted association information between each pair of sample text segments, and the annotation association information between each pair of sample text segments.
  • the image features of the sample text image are obtained. For each sample text segment, the image features of the image area where the sample text segment is located are determined based on the image features of the sample text image and the position information of that image area (recorded as the first area feature). Alternatively, image detection and image segmentation are performed on the sample text image in sequence to obtain the image area where each sample text segment is located, and, for each sample text segment, the image features of that image area are extracted based on the pixel information of each pixel in it (recorded as the second area feature).
  • the first region feature and the second region feature are spliced or fused to obtain image features of the image region where the sample text segment is located.
  • the method for determining the image features of the image area where each sample text segment is located is as described above regarding the image features of the image area where each target text segment is located. The implementation principles of the two are the same and will not be described again.
  • after obtaining the image area where each sample text segment is located in the sample text image, image recognition is performed on each such image area to obtain the sample text segment. A word segmenter then performs word segmentation on the sample text segment to obtain each word in it, a vector lookup table is used to determine the word vector of each word, and the text features of the sample text segment are determined based on these word vectors. The text features of each sample text segment are obtained in the same way as the text features of each target text segment described above; the implementation principles of the two are the same and will not be described again.
  • the image features of the image area where each sample text segment is located are used as the features of the sample text segment.
  • use the text features of each sample text segment as the features of the sample text segment.
  • the image features of the image area where each sample text segment is located and the text features of the sample text segment are spliced or fused to obtain the features of the sample text segment.
  • the image features of the image area where each sample text segment is located and the text features of the sample text segment are first spliced or fused, and then nonlinear operations are performed to obtain the features of the sample text segment.
  • the characteristics of each sample text segment are as described above about the characteristics of each target text segment. The implementation principles of the two are the same and will not be described again.
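The fusion options above can be illustrated with a short sketch; concatenation followed by a ReLU is one assumed instantiation of "splice, then nonlinear operation", not the only one contemplated.

```python
import numpy as np

def segment_feature(image_feat, text_feat):
    """One fusion option described above: concatenate the image features
    of the segment's image area with the segment's text features, then
    apply a nonlinear operation (ReLU here, as an assumed choice)."""
    fused = np.concatenate([image_feat, text_feat])
    return np.maximum(fused, 0.0)   # nonlinear operation
```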
  • obtaining the target model based on the characteristics of each sample text segment in each pair of sample text segments, the predicted association information between each pair of sample text segments, and the annotation association information between each pair of sample text segments includes: obtaining the annotation category of each sample text segment in each pair of sample text segments; for each annotation category, determining the feature average of the annotation category based on the characteristics of each sample text segment in that annotation category; and obtaining the target model based on the feature average of each annotation category, the predicted association information between each pair of sample text segments, and the annotation association information between each pair of sample text segments.
  • each sample text segment is annotated to obtain the annotation category of the sample text segment.
  • for each annotation category, the sum of the features of the sample text segments in the annotation category is calculated, and the sum is divided by the number of sample text segments in the annotation category to obtain the feature average of the annotation category.
  • the first loss value is determined based on the average feature value of each annotation category.
  • the second loss value is determined based on the average feature value of each annotation category.
  • $L_{pull}$ is the first loss value and $L_{push}$ is the second loss value. $M$ is the number of sample text segment pairs whose annotation association information is 1; since annotation association information of 1 between a pair of sample text segments indicates that the pair is associated, $M$ is also the number of sample text segment pairs with an association relationship. $m$ and $k$ are serial numbers; $\bar{e}_{m}$ (notation assigned here) represents the feature average of the m-th annotation category, $e_{mk}$ represents the characteristics of the k-th sample text segment in the m-th annotation category, and $\lVert x\rVert_{2}$ represents the second norm (L2 norm), where $x$ is the independent variable. Forms consistent with these definitions are $L_{pull}=\frac{1}{M}\sum_{m}\sum_{k}\lVert e_{mk}-\bar{e}_{m}\rVert_{2}$ and $L_{push}=\frac{1}{M(M-1)}\sum_{m}\sum_{j\neq m}\max\bigl(0,\,\Delta-\lVert\bar{e}_{m}-\bar{e}_{j}\rVert_{2}\bigr)$, where $\Delta$ denotes the hyperparameter mentioned below.
  • determining the prediction category of each sample text segment is equivalent to classifying the sample text segments.
  • optimizing only the classification problem optimizes only the boundaries between classes, which easily leads to the problem that the distance between the features of two sample text segments belonging to the same annotation category is large, while the distance between the features of two sample text segments belonging to different annotation categories is small.
  • FIG. 6 is an example diagram of the distance between features of a sample text segment provided by an embodiment of the present application. It can be seen from Figure 6 that R+ is greater than R-. Among them, R+ is the distance between the features of two sample text segments in the labeled category A, and R- is the distance between the features of one sample text segment in the labeled category A and the feature of one sample text segment in the labeled category B.
  • formula (8) above calculates the first loss value based on the characteristics of the sample text segments. Since the first loss value is calculated from the feature average of the m-th annotation category and the characteristics of the k-th sample text segment in that category, it drives the characteristics of each sample text segment in each annotation category toward the feature average of the category. The first loss value therefore pulls the features of the sample text segments in each annotation category closer to the category's feature average, reducing the distance between the features of two sample text segments of the same annotation category.
  • the second loss value is determined based on the feature average of the m-th annotation category, the feature average of the j-th annotation category, and the hyperparameter $\Delta$, so that the distance between the feature averages of any two annotation categories is at least greater than $\Delta$. The second loss value therefore pushes the feature average of each annotation category away from the feature averages of the other annotation categories, widening the distance between the features of two sample text segments belonging to different annotation categories.
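The first ("pull") and second ("push") loss values can be sketched as follows, under assumed forms consistent with the descriptions above; the grouping container, the margin value, and the averaging scheme are illustrative assumptions.

```python
import numpy as np

def pull_push_losses(features_by_category, delta=1.0):
    """Hedged sketch of the first ("pull") and second ("push") losses.

    features_by_category: {category: (n_k, d) array of segment features}.
    Pull draws each feature toward its category's feature average; push
    keeps different categories' feature averages at least `delta` apart
    (delta plays the role of the hyperparameter above; 1.0 is an assumed
    value).
    """
    means = {c: f.mean(axis=0) for c, f in features_by_category.items()}
    pull = float(np.mean([
        np.linalg.norm(f - means[c], axis=1).mean()
        for c, f in features_by_category.items()
    ]))
    cats = list(means)
    push_terms = [
        max(0.0, delta - float(np.linalg.norm(means[a] - means[b])))
        for i, a in enumerate(cats) for b in cats[i + 1:]
    ]
    push = float(np.mean(push_terms)) if push_terms else 0.0
    return pull, push
```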
  • the features of each sample text segment include at least one of image features of the image area where the sample text segment is located or text features of the sample text segment.
  • the loss value of the neural network model is determined based on the first loss value, the second loss value and the associated information loss value.
  • the loss value of the neural network model can also be determined based on the first loss value, the second loss value, the associated information loss value and the category loss value.
  • the weight of the first loss value, the weight of the second loss value, the weight of the associated information loss value and the weight of the category loss value are set.
  • the loss value of the neural network model is determined based on at least one of the first loss value, the second loss value, the associated information loss value and the category loss value combined with their respective weights.
  • the loss value of the neural network model is determined based on the associated information loss value, the category loss value, the weight of the associated information loss value, and the weight of the category loss value.
  • $L$ is the loss value of the neural network model. Denoting the weights here as $\lambda_{node}$ for the category loss value $L_{node}$, $\lambda_{rel}$ for the association information loss value $L_{rel}$, and $\lambda_{feat}$ for the first loss value $L_{pull}$ and the second loss value $L_{push}$ (the first and second loss values share one weight), a form consistent with these definitions is $L=\lambda_{node}L_{node}+\lambda_{rel}L_{rel}+\lambda_{feat}\bigl(L_{pull}+L_{push}\bigr)$.
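A hedged sketch of the weighted combination of formula (10); the weight values shown are placeholders, since the application leaves the weights configurable.

```python
def total_loss(l_node, l_rel, l_pull, l_push,
               w_node=1.0, w_rel=1.0, w_feat=1.0):
    """Weighted combination in the spirit of formula (10). The first and
    second loss values share one weight, as stated above; the default
    weight values are placeholders, not values fixed by the application."""
    return w_node * l_node + w_rel * l_rel + w_feat * (l_pull + l_push)
```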
  • the gradient of the loss value of the neural network model is calculated and back-propagated layer by layer to update the model parameters of the neural network model; that is, the neural network model is adjusted through its loss value to obtain the target model.
  • the target model is used to obtain the associated information between at least a pair of target text segments.
  • the loss function of contrastive learning can also be used to calculate the contrastive learning loss value.
  • each pair of sample text segments annotated as associated is regarded as a positive sample, and the loss value of the positive samples is calculated using the characteristics of these pairs.
  • each pair of sample text segments annotated as not associated is regarded as a negative sample.
  • the loss value of the negative samples is calculated using the characteristics of these pairs.
  • the contrastive learning loss value is determined from the loss value of the positive samples and the loss value of the negative samples. The loss value of the neural network model is determined using at least one of the first loss value, the second loss value, the association information loss value, the category loss value, and the contrastive learning loss value, combined with their respective weights.
  • the target model is obtained based on the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments, so that the target model learns the correlation information between each pair of text segments, It helps reduce association errors and improve the accuracy of text information.
  • Figure 7 is a schematic diagram of training of a neural network model provided by an embodiment of the present application.
  • the neural network model includes a first initial network, a second initial network and a third initial network, and the first initial network includes a first sub-network and a second sub-network.
  • the embodiment of the present application uses the predicted correlation information between each pair of sample text segments (that is, every two sample text segments) in the sample text image to train the neural network model.
  • a sample text image is obtained, where the sample text image is the image shown in (2) in Figure 3 .
  • the sample text image is input to the first sub-network, and the first sub-network outputs image features of the sample text image.
  • the image features of the sample text image are input to the second sub-network, and the second sub-network outputs the image features of the image area where each sample text segment in the sample text image is located.
  • image recognition is also performed on the sample text image to obtain an image recognition result of the sample text image.
  • the image recognition result of the sample text image includes each sample text segment.
  • the second initial network is used to obtain text features of each sample text segment.
  • the image features of the image area where the sample text segment is located and the text features of the sample text segment are fused to obtain the features of the sample text segment.
  • the characteristics of each sample text segment at this stage are called the characteristics of the sample text segment before the update, and the characteristics obtained by updating them at least once are called the characteristics of the updated sample text segment.
  • the characteristic loss value is calculated according to the above-mentioned formula (8) and formula (9), where the characteristic loss value includes the above-mentioned first loss value and the second loss value.
  • the characteristics of each pre-updated sample text segment are input into the third initial network.
  • the third initial network can construct an initial graph structure based on the characteristics of each pre-update sample text segment and update the graph structure multiple times, that is, update the characteristics of each pre-update sample text segment multiple times until the final graph structure is obtained.
  • the final graph structure includes the characteristics of each updated sample text segment.
  • the third initial network can determine and output the predicted category of each sample text segment and the predicted association information between every two sample text segments based on the final graph structure. Next, based on the predicted category of each sample text segment, the category loss value is calculated according to the formula (7) mentioned above. Based on the predicted correlation information between each two sample text segments, the correlation information loss value is calculated according to the formula (6) mentioned above.
  • the loss value of the neural network model is calculated according to the formula (10) mentioned above. Based on the loss value of the neural network model, the neural network model is adjusted to obtain the target model.
  • the target model includes an image feature extraction network (trained by the first initial network), a text feature extraction network (trained by the second initial network), and a graph convolution network (trained by the third initial network).
  • the image feature extraction network includes a first extraction network (trained by the first sub-network) and a second extraction network (trained by the second sub-network).
  • Target text images include menu images and license images.
  • the text information in the target text image is determined based on the categories of each target text segment in the target text image and the association information between each two target text segments.
  • FIG. 8 is a schematic diagram of extracting text information from a menu image provided by an embodiment of the present application.
  • the menu image is a picture including "Dish A 20 yuan", "Dish B 20 yuan", "Dish C 28 yuan", "Dish D 28 yuan", "Dish E 25 yuan" and "Dish F 25 yuan", each dish name appearing alongside its price. By performing image recognition on the menu image, the image recognition result is obtained.
  • the image recognition results include each text segment in the menu image (ie, the target text segment mentioned above).
  • the image recognition results include the text segments "Dish A", "20 yuan", "Dish B", "20 yuan", "Dish C", "28 yuan", "Dish D", "28 yuan", "Dish E", "25 yuan", "Dish F", "25 yuan". It can be seen from Figure 8 that the image recognition result only identifies each text segment in the menu image and does not associate the text segments with one another.
  • the menu image and the image recognition result of the menu image are input into the target model, and the target model outputs the category of each text segment in the menu image and the association information between each two text segments in the menu image.
  • the text information in the menu image can thus be obtained, that is, "Dish A: 20 yuan", "Dish B: 20 yuan", "Dish C: 28 yuan", "Dish D: 28 yuan", "Dish E: 25 yuan", "Dish F: 25 yuan".
  • FIG. 9 is a schematic diagram of extracting text information in yet another menu image provided by an embodiment of the present application.
  • similarly, the menu image, the image recognition result of the menu image, and the target model can be used to determine the text information in the menu image, that is, "Dish A: 20 yuan", "Dish B: 20 yuan", "Dish C: 20 yuan", "Dish D: 6/piece", "Dish E: 5/piece", "Dish F: 2/bowl" and "Dish G: 5/bowl".
  • the menu images shown in Figures 8 and 9 are both structured text images.
  • the target model in the embodiment of the present application can also extract text information from semi-structured text images.
  • the following describes extracting text information from the license images (semi-structured text images) shown in Figures 10 and 11.
  • Figure 10 is a schematic diagram of extracting text information from a license image provided by an embodiment of the present application.
  • the license image includes "license”, “name XXX company”, “company type sole proprietorship”, “legal representative XX” and "date X year X month X day”.
  • the image recognition result is obtained.
  • the image recognition results include "license", "name XXX company", "company type sole proprietorship", "legal representative XX" and "date X year X month X day". The license image and the image recognition result of the license image are input into the target model, and the target model outputs the category of each text segment in the license image and the association information between every two text segments.
  • the text information in the license image can thus be obtained, that is, "License", "Name: XXX Company", "Company Type: Sole Proprietorship", "Legal Representative: XX" and "Date: X year X month X day".
  • Figure 11 is a schematic diagram of extracting text information from a license image according to another embodiment of the present application.
  • the license image, the image recognition result of the license image and the target model can be used to determine the text information in the license image, that is, "License”, “Name: XXX Company", “Residence: XX Town” can be obtained , “Registration number: 1111111”, and "Business scope: fruits and vegetables, daily necessities, cultural and sporting goods”.
  • the embodiment of this application uses four methods to train the neural network model and obtains four target models.
  • the first target model is obtained by inputting the sample text image and the image recognition result of the sample text image into the neural network model, which performs the following processing: batch normalization is first performed on the sample text image, and the image features of the image area where each sample text segment is located are then determined based on the batch-normalized sample text image; the text features of each sample text segment are determined based on the image recognition result of the sample text image; the predicted category of each sample text segment is determined and output; the loss value of the neural network model is determined according to formula (7) above; and the neural network model is adjusted based on that loss value.
  • the second target model is obtained by inputting the sample text image and the image recognition result of the sample text image into the neural network model, which performs the following processing: instance normalization is first performed on the sample text image, and the image features of the image area where each sample text segment is located are then determined based on the instance-normalized sample text image; the text features of each sample text segment are determined based on the image recognition result of the sample text image; the predicted category of each sample text segment is determined and output; the loss value of the neural network model is determined according to formula (7) above; and the neural network model is adjusted based on that loss value.
  • the third target model is obtained by inputting the sample text image and the image recognition result of the sample text image into the neural network model, which performs the following processing: instance normalization is performed on the sample text image, and the image features of the image area where each sample text segment is located are determined based on the instance-normalized sample text image; the text features of each sample text segment are determined based on the image recognition result of the sample text image; the predicted category of each sample text segment and the characteristics of each sample text segment are determined and output; the loss value of the neural network model is determined according to formulas (7)-(9) above; and the neural network model is adjusted based on that loss value.
  • the fourth target model is obtained by inputting the sample text image and the image recognition result of the sample text image into the neural network model, which performs the following processing: instance normalization is performed on the sample text image, and the image features of the image area where each sample text segment is located are determined based on the instance-normalized sample text image; the text features of each sample text segment are determined based on the image recognition result of the sample text image; the predicted category of each sample text segment, the characteristics of each sample text segment, and the predicted association information between every two sample text segments are determined and output; the loss value of the neural network model is determined according to formulas (6)-(9) above; and the neural network model is adjusted based on that loss value.
  • the performance index of each target model is calculated according to formula (11), where mEF is the performance index of the target model, $i$ is a serial number, $F_{i}$ is the F-score of the i-th prediction category, $P_{i}$ is the precision rate of the i-th prediction category, and $R_{i}$ is the recall rate of the i-th prediction category. Precision is computed as $P=\frac{tp}{tp+fp}$ and recall as $R=\frac{tp}{tp+fn}$, where $tp$ is the number of positive samples whose predicted category is consistent with the annotation category, $fp$ is the number of negative samples whose predicted category is inconsistent with the annotation category, and $fn$ is the number of positive samples whose predicted category is inconsistent with the annotation category. A form consistent with these definitions is $F_{i}=\frac{2P_{i}R_{i}}{P_{i}+R_{i}}$, with mEF the mean of the $F_{i}$ over all prediction categories.
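As a hedged illustration of formula (11), the following sketch computes mEF from per-category tp/fp/fn counts; the input format is an assumption for illustration.

```python
def mEF(per_category_counts):
    """Performance index of formula (11): the mean F-score over the
    prediction categories, with P = tp/(tp+fp) and R = tp/(tp+fn)
    computed per category.

    per_category_counts: list of (tp, fp, fn) tuples, one per category.
    """
    scores = []
    for tp, fp, fn in per_category_counts:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(scores) / len(scores)
```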
  • the sample text images used when training these four target models are menu images.
  • the prediction categories and annotation categories of the sample text segments in the menu images include at least one of dish name, dish price, store name, dish type, and others.
  • Target categories include dish names and dish prices.
  • the performance indicators of these four target models are shown in Table 1 below.
  • the mEFs of these four target models increase sequentially.
  • the phenomenon of association errors can be effectively reduced, the accuracy of text information can be improved, and the text information in the target text image can be quickly extracted, avoiding tedious and complicated manual input.
  • it should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the target text images, sample text images, etc. involved in this application were all obtained with full authorization.
  • Figure 12 shows a schematic structural diagram of a text information extraction device provided by an embodiment of the present application. As shown in Figure 12, the device includes:
  • the acquisition module 1201 is used to acquire a target text image, which includes multiple target text segments;
  • the acquisition module 1201 is also used to obtain association information between at least one pair of target text segments, where the association information is used to characterize the possibility of association between each pair of target text segments;
  • the determination module 1202 is configured to determine the association result between each pair of target text segments based on the association information between each pair of target text segments;
  • the extraction module 1203 is used to extract text information in the target text image based on the association results between each pair of target text segments.
  • the acquisition module 1201 is configured to acquire at least one of the characteristics of each pair of target text segments or the relative position characteristics between each pair of target text segments, where the characteristics of each target text segment in each pair include at least one of the image features of the image area where the target text segment is located or the text features of the target text segment, and the relative position characteristics between each pair of target text segments are used to characterize the relative position between the image areas where the pair of target text segments are located; and to determine the association information between each pair of target text segments based on at least one of the characteristics of the at least one pair of target text segments or the relative position characteristics between the at least one pair of target text segments.
  • the characteristics of each target text segment in each pair of target text segments include the image features of the image area where the target text segment is located; the acquisition module 1201 is configured to obtain the image features of the target text image and, for each target text segment in the at least one pair of target text segments, determine the image features of the image area where the target text segment is located based on the image features of the target text image and the position information of that image area.
  • the characteristics of each target text segment in each pair of target text segments include the text features of the target text segment; the acquisition module 1201 is configured to, for each target text segment in the at least one pair of target text segments, obtain the word vector of each word in the target text segment and fuse the word vectors of the words in the target text segment to obtain the text features of the target text segment.
  • the characteristics of each target text segment in each pair of target text segments include the image features of the image area where the target text segment is located and the text features of the target text segment; the acquisition module 1201 is configured to, for each target text segment in the at least one pair of target text segments, divide the image features of the image area where the target text segment is located into a target number of image feature blocks and divide the text features of the target text segment into the target number of text feature blocks; for each image feature block, fuse the image feature block with the text feature block associated with it to obtain a fused feature block; and splice the fused feature blocks to obtain the characteristics of the target text segment. A sketch of this block-wise fusion follows.
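A minimal sketch of the block-wise fusion, assuming both feature vectors have the same dimension and taking element-wise addition as the per-block fusion (the application does not fix the fusion operator):

```python
import numpy as np

def blockwise_fuse(image_feat, text_feat, num_blocks):
    """Split each feature into `num_blocks` (the "target number") blocks,
    fuse corresponding blocks (element-wise sum here, an assumed fusion),
    then splice the fused blocks back into the segment's feature.
    Assumes image_feat and text_feat have the same dimension so that
    corresponding blocks have matching shapes."""
    img_blocks = np.array_split(image_feat, num_blocks)
    txt_blocks = np.array_split(text_feat, num_blocks)
    fused = [i + t for i, t in zip(img_blocks, txt_blocks)]
    return np.concatenate(fused)
```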
  • the acquisition module 1201 is configured to obtain, for each pair of target text segments, the position information of the image area where each pair of target text segments is located; based on the position information of the image area where each pair of target text segments is located and the target text The size information of the image determines the relative position characteristics between each pair of target text segments.
  • the acquisition module 1201 is configured to construct a graph structure based on the characteristics of the at least one pair of target text segments and the relative position characteristics between the at least one pair of target text segments, where the graph structure includes at least two nodes and at least one edge, each node represents the characteristics of a target text segment, and each edge represents the relative position characteristics between the pair of target text segments indicated by the pair of nodes it connects; and to determine the association information between each pair of target text segments based on the graph structure.
  • the acquisition module 1201 is also used to acquire the category of each target text segment and the association information between every two target text segments; the acquisition module 1201 is used to determine, based on the category of each target text segment, the association information between each pair of target text segments from the association information between every two target text segments.
  • the acquisition module 1201 is configured to filter out the text segments to be associated with the target category from multiple target text segments included in the target text image based on the category of each target text segment; from each target text segment, The correlation information between each two text segments to be correlated is filtered out from the correlation information between the two target text segments, and the correlation information between each pair of target text segments is obtained.
  • the acquisition module 1201 is also used to acquire the target model; the acquisition module 1201 is used to acquire the association information between each pair of target text segments according to the target model.
  • the association information between each pair of target text segments is used to characterize the possibility of association between that pair. Determining the association result between each pair of target text segments from this association information therefore reduces association errors and improves the accuracy of the association results, which in turn improves the accuracy of the text information when it is extracted from the target text image based on the association results between each pair of target text segments.
  • Figure 13 is a schematic structural diagram of a device for obtaining a target model provided by an embodiment of the present application. As shown in Figure 13, the device includes:
  • the first acquisition module 1301 is used to acquire a sample text image, which includes multiple sample text segments;
  • the second acquisition module 1302 is used to acquire predicted correlation information between at least one pair of sample text segments and annotation correlation information between at least one pair of sample text segments;
  • the third acquisition module 1303 is used to acquire the target model based on the predicted correlation information between each pair of sample text segments and the annotation correlation information between each pair of sample text segments.
  • the device further includes: a fourth acquisition module, configured to acquire the predicted category of each sample text segment in each pair of sample text segments and the annotation category of each sample text segment in each pair of sample text segments;
  • the third acquisition module 1303 is configured to obtain the target model based on the predicted association information between each pair of sample text segments, the annotation association information between each pair of sample text segments, the predicted category of each sample text segment in each pair of sample text segments, and the annotation category of each sample text segment in each pair of sample text segments.
  • the device further includes: a fifth acquisition module, used to obtain the characteristics of each sample text segment in each pair of sample text segments, and the characteristics of each sample text segment include the image area where the sample text segment is located. At least one of the image features or the text features of the sample text segment; the third acquisition module 1303 is used to predict the association between each pair of sample text segments based on the characteristics of each sample text segment in each pair of sample text segments. information and the annotated association information between each pair of sample text segments to obtain the target model.
  • the third acquisition module 1303 is used to obtain the annotation category of each sample text segment in each pair of sample text segments; for each annotation category, determine the feature average of the annotation category based on the characteristics of each sample text segment in the annotation category; and obtain the target model based on the feature average of each annotation category, the predicted association information between each pair of sample text segments, and the annotation association information between each pair of sample text segments.
  • the target model is obtained based on the predicted correlation information between at least one pair of sample text segments and the annotated correlation information between at least one pair of sample text segments, so that the target model learns the correlation between any pair of text segments. information, helping to reduce association errors and improve the accuracy of text information.
  • Figure 14 shows a structural block diagram of a terminal device 1400 provided by an exemplary embodiment of the present application.
  • the terminal device 1400 includes: a processor 1401 and a memory 1402.
  • the processor 1401 includes one or more processing cores, such as a 4-core processor, an 8-core processor, etc.
  • the processor 1401 is implemented using at least one hardware form among DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), and PLA (Programmable Logic Array, programmable logic array).
  • the processor 1401 also includes a main processor and a co-processor.
  • the main processor is a processor used to process data in the wake-up state, also called a CPU (Central Processing Unit); the co-processor is a low-power processor used to process data in the standby state.
  • the processor 1401 is integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is responsible for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1401 also includes an AI (Artificial Intelligence, artificial intelligence) processor, which is used to process computing operations related to machine learning.
  • Memory 1402 includes one or more computer-readable storage media that are non-transitory. Memory 1402 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1402 is used to store at least one computer program, and the at least one computer program is executed by the processor 1401 to implement the text information extraction method or the target model acquisition method provided by the method embodiments of this application.
  • the terminal device 1400 optionally further includes: a peripheral device interface 1403 and at least one peripheral device.
  • the processor 1401, the memory 1402 and the peripheral device interface 1403 are connected through a bus or a signal line.
  • Each peripheral device is connected to the peripheral device interface 1403 through a bus, a signal line or a circuit board.
  • the peripheral device includes: display screen 1405.
  • the peripheral device interface 1403 may be used to connect at least one I/O (Input/Output, input/output) related peripheral device to the processor 1401 and the memory 1402 .
  • in some embodiments, the processor 1401, the memory 1402, and the peripheral device interface 1403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1401, the memory 1402, and the peripheral device interface 1403 are implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the display screen 1405 is used to display UI (User Interface, user interface).
  • the UI includes graphics, text, icons, videos, and any combination thereof.
  • display screen 1405 also has the ability to collect touch signals on or above the surface of display screen 1405 .
  • the touch signal is input to the processor 1401 as a control signal for processing.
  • the display screen 1405 is also used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • in some embodiments, the display screen 1405 is a flexible display screen disposed on a curved or folded surface of the terminal device 1400; the display screen 1405 can even be set in a non-rectangular irregular shape, that is, a special-shaped screen.
  • the display screen 1405 is made of LCD (Liquid Crystal Display, liquid crystal display), OLED (Organic Light-Emitting Diode, organic light-emitting diode) and other materials.
  • FIG. 14 does not constitute a limitation on the terminal device 1400, which may include more or fewer components than shown, or combine certain components, or adopt different component arrangements.
  • FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1500 may vary greatly due to different configurations or performance, and includes one or more processors 1501 and one or more memories 1502.
  • At least one computer program is stored in one or more memories 1502, and the at least one computer program is loaded and executed by the one or more processors 1501 to implement the text information extraction method or the target model acquisition method provided by the above method embodiments.
  • the processor 1501 is a CPU.
  • the server 1500 also has components such as wired or wireless network interfaces, keyboards, and input/output interfaces.
  • the server 1500 also includes other components for realizing device functions, which will not be described again here.
  • a computer-readable storage medium is also provided, in which at least one computer program is stored; the at least one computer program is loaded and executed by the processor to enable the electronic device to implement any of the above text information extraction methods or target model acquisition methods.
  • the above computer-readable storage medium is read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), tapes, floppy disks, optical data storage devices, etc.
  • a computer program or computer program product is also provided, in which at least one computer program is stored; the at least one computer program is loaded and executed by the processor so that the computer implements any of the above text information extraction methods or target model acquisition methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

A text information extraction method and apparatus, a target model acquisition method and apparatus, and a device, relating to the technical field of image processing. The method comprises: acquiring (201) a target text image; acquiring (202) association information between at least one pair of target text segments; on the basis of the association information between each pair of target text segments, determining (203) an association result between each pair of target text segments; and extracting (204) text information in the target text image on the basis of the association result between each pair of target text segments.

Description

文本信息提取方法、目标模型的获取方法、装置及设备Text information extraction method, target model acquisition method, device and equipment
本申请要求于2022年04月19日提交的申请号为202210411039.7、发明名称为“文本信息提取方法、目标模型的获取方法、装置及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application with application number 202210411039.7 and the invention title "Text Information Extraction Method, Target Model Obtaining Method, Device and Equipment" submitted on April 19, 2022, the entire content of which is incorporated by reference. in this application.
Technical Field
Embodiments of the present application relate to the field of image processing technology, and in particular to a text information extraction method, a target model acquisition method, an apparatus, and a device.
Background
Text images containing text information, such as menu images and receipt images, are common in daily life. Such text images are structured text images or semi-structured text images. How to accurately extract text information from structured and semi-structured text images has become an urgent problem to be solved in the field of image processing technology.
Summary
Embodiments of the present application provide a text information extraction method, a target model acquisition method, an apparatus, and a device.
In one aspect, embodiments of the present application provide a text information extraction method. The method includes: acquiring a target text image, the target text image including multiple target text segments; acquiring association information between at least one pair of target text segments, where the association information between any pair of target text segments is used to characterize the possibility of association between that pair of target text segments; determining an association result between the at least one pair of target text segments based on the association information between the at least one pair of target text segments; and extracting text information in the target text image based on the association result between the at least one pair of target text segments.
In another aspect, embodiments of the present application provide a target model acquisition method. The method includes: acquiring a sample text image, the sample text image including multiple sample text segments; acquiring predicted association information between at least one pair of sample text segments and annotated association information between the at least one pair of sample text segments; and acquiring a target model based on the predicted association information between the at least one pair of sample text segments and the annotated association information between the at least one pair of sample text segments.
In another aspect, embodiments of the present application provide a text information extraction apparatus. The apparatus includes: an acquisition module, configured to acquire a target text image, the target text image including multiple target text segments, and further configured to acquire association information between at least one pair of target text segments, where the association information between any pair of target text segments is used to characterize the possibility of association between that pair of target text segments; a determination module, configured to determine an association result between the at least one pair of target text segments based on the association information between the at least one pair of target text segments; and an extraction module, configured to extract text information in the target text image based on the association result between the at least one pair of target text segments.
In another aspect, embodiments of the present application provide a target model acquisition apparatus. The apparatus includes: a first acquisition module, configured to acquire a sample text image, the sample text image including multiple sample text segments; a second acquisition module, configured to acquire predicted association information between at least one pair of sample text segments and annotated association information between the at least one pair of sample text segments; and a third acquisition module, configured to acquire a target model based on the predicted association information between the at least one pair of sample text segments and the annotated association information between the at least one pair of sample text segments.
In another aspect, embodiments of the present application provide an electronic device. The electronic device includes a processor and a memory. At least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor so that the electronic device implements the above text information extraction method or the above target model acquisition method.
In another aspect, a computer-readable storage medium is also provided. At least one computer program is stored in the computer-readable storage medium, and the at least one computer program is loaded and executed by a processor so that a computer implements the above text information extraction method or the above target model acquisition method.
In another aspect, a computer program or computer program product is also provided. At least one computer program is stored in the computer program or computer program product, and the at least one computer program is loaded and executed by a processor so that a computer implements the above text information extraction method or the above target model acquisition method.
Brief Description of the Drawings
Figure 1 is a schematic diagram of an implementation environment of a text information extraction method or a target model acquisition method provided by an embodiment of the present application;
Figure 2 is a flow chart of a text information extraction method provided by an embodiment of the present application;
Figure 3 is a schematic diagram of a target text image provided by an embodiment of the present application;
Figure 4 is a schematic diagram of extracting image features of the image area where a target text segment is located, provided by an embodiment of the present application;
Figure 5 is a flow chart of a target model acquisition method provided by an embodiment of the present application;
Figure 6 is an example diagram of distances between features of sample text segments provided by an embodiment of the present application;
Figure 7 is a schematic diagram of training a neural network model provided by an embodiment of the present application;
Figure 8 is a schematic diagram of extracting text information from a target text image provided by an embodiment of the present application;
Figure 9 is another schematic diagram of extracting text information from a target text image provided by an embodiment of the present application;
Figure 10 is another schematic diagram of extracting text information from a target text image provided by an embodiment of the present application;
Figure 11 is another schematic diagram of extracting text information from a target text image provided by an embodiment of the present application;
Figure 12 is a schematic structural diagram of a text information extraction apparatus provided by an embodiment of the present application;
Figure 13 is a schematic structural diagram of a target model acquisition apparatus provided by an embodiment of the present application;
Figure 14 is a schematic structural diagram of a terminal device provided by an embodiment of the present application;
Figure 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are further described in detail below with reference to the accompanying drawings.
In the related art, text recognition is first performed on a target text image to obtain multiple text segments, where the target text image is a structured text image or a semi-structured text image. The category of each text segment is then determined. Next, preset associations between categories are obtained, for example, an association between dish names and dish prices. Based on the associations between categories and the category of each text segment, the multiple text segments are associated to obtain association results of the multiple text segments, and text information in the target text image is extracted based on the association results of the multiple text segments.
Since any two text segments may correspond to the same category, when multiple text segments are associated based on the associations between categories and the category of each text segment, the accuracy of the association results is poor, and the accuracy of the text information extracted from the target text image is accordingly poor.
Figure 1 is a schematic diagram of an implementation environment of a text information extraction method or a target model acquisition method provided by an embodiment of the present application. As shown in Figure 1, the implementation environment includes a terminal device 101 and a server 102. The text information extraction method or the target model acquisition method in the embodiments of the present application is executed by the terminal device 101, or by the server 102, or jointly by the terminal device 101 and the server 102.
The terminal device 101 is a smartphone, a game console, a desktop computer, a tablet computer, a laptop computer, a smart TV, a smart vehicle-mounted device, a smart voice interaction device, a smart home appliance, or the like. The server 102 is one server, a server cluster composed of multiple servers, or any one of a cloud computing platform and a virtualization center, which is not limited in the embodiments of the present application. The server 102 communicates with the terminal device 101 through a wired or wireless network. The server 102 has functions such as data processing, data storage, and data transmission and reception, which are not limited in the embodiments of the present application. The numbers of terminal devices 101 and servers 102 are not limited; each may be one or more.
Based on the above implementation environment, an embodiment of the present application provides a text information extraction method. Taking the flow chart of the text information extraction method shown in Figure 2 as an example, the method is executed by the terminal device 101 or the server 102 in Figure 1, or jointly by the terminal device 101 and the server 102. For ease of description, the terminal device 101 and the server 102 are collectively referred to as electronic devices, and the following description takes a terminal as the electronic device as an example. As shown in Figure 2, the method includes steps 201 to 204.
Step 201: The terminal acquires a target text image, where the target text image includes multiple target text segments.
A target text image refers to an image containing text segments, and includes structured text images and unstructured text images. A text segment refers to a word, phrase, or sentence composed of characters; different text segments are usually separated by blank areas in the target text image.
In the embodiments of the present application, each target text segment includes at least one character, and each character is any one of a text character, a digit, a special symbol (such as a punctuation mark or a currency symbol), and the like. When a target text segment includes multiple characters, the multiple characters form at least one word and can also form at least one sentence.
Exemplarily, the target text image is a structured text image. A structured text image is an image that expresses text through a two-dimensional table structure; the text in such an image is organized and regular. A structured text image includes multiple target text segments. For each target text segment in a structured text image, there is at least one other target text segment associated with it, where the other target text segments are the target text segments other than this one among the multiple target text segments.
Please refer to Figure 3, which is a schematic diagram of a target text image provided by an embodiment of the present application, in which (1) is a structured text image. It can be seen from the structured text image that the target text segment "Item A" is associated with the target text segment "×10", "Item B" with "×15", "Item C" with "×3", "Item D" with "×9", and "Item E" with "×1". Therefore, for each target text segment in the structured text image, there is at least one other target text segment associated with it.
Exemplarily, the target text image is a semi-structured text image, which includes a structured text area and an unstructured text area. The structured text area is an image area that expresses text through a two-dimensional table structure; the text in this area is organized and regular. The unstructured text area is an image area that expresses text through an irregular, unorganized data structure. A semi-structured text image includes multiple target text segments. For a semi-structured text image, each target text segment in one part of the target text segments has at least one other target text segment associated with it, while each target text segment in the other part has no associated target text segment.
Please continue to refer to Figure 3, in which (2) is a semi-structured text image. It can be seen from the semi-structured text image that the target text segment "Dish A" is associated with the target text segment "9 yuan", "Dish B" with "from 8 yuan", "Dish C" with "13 yuan", and "Dish D" with "10 yuan". The target text segment "Price List" is associated with none of the target text segments "Dish A", "9 yuan", "Dish B", "from 8 yuan", "Dish C", "13 yuan", "Dish D", and "10 yuan". Therefore, for a semi-structured text image, each target text segment in one part of the target text segments has at least one other target text segment associated with it, while each target text segment in the other part has no associated target text segment.
The embodiments of the present application do not limit the image content, acquisition method, quantity, and the like of the target text image. Exemplarily, the target text image is a receipt image, a menu image, or the like, and is a photographed image or an image downloaded from a network.
Step 202: The terminal acquires association information between at least one pair of target text segments, where the association information between any pair of target text segments is used to characterize the possibility of association between that pair of target text segments.
In some embodiments, for the multiple target text segments in the target text image, any two of the multiple target text segments are taken as a pair of target text segments, thereby obtaining at least one pair of target text segments. Acquiring the association information between at least one pair of target text segments means acquiring the association information between each pair of target text segments. For example, the association information between each pair of target text segments is a non-negative number; when the association information between a pair of target text segments is a number greater than or equal to 0 and less than or equal to 1, it is called the association probability between that pair of target text segments.
The association information between each pair of target text segments is used to characterize the possibility of association between the pair. The larger the association information between a pair of target text segments, the higher the possibility that the pair is associated; that is, the association information between each pair of target text segments is proportional to the possibility of association between the pair.
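As a minimal sketch of the pairing step, assuming the recognized segments are held in a plain Python list (the segment strings and variable names are illustrative, not from the original):

from itertools import combinations

# Hypothetical recognized segments from the target text image.
segments = ["Dish A", "9 yuan", "Dish B", "from 8 yuan"]

# Every unordered pair of distinct segments is a candidate pair
# whose association information the model will score.
candidate_pairs = list(combinations(range(len(segments)), 2))
print(candidate_pairs)  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]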
In some embodiments, a target model is acquired, and the association information between at least one pair of target text segments is acquired according to the target model. The manner of acquiring the target model is described below in connection with Figure 5 and is not repeated here.
In some embodiments, the target model includes at least one of an image feature extraction network or a text feature extraction network, and the target model determines and outputs the association information between at least one pair of target text segments in the target text image based on the output of at least one of these two networks. The image feature extraction network is used to extract the image features of the image area where each target text segment of the at least one pair is located, and the text feature extraction network is used to extract the text features of each target text segment of the at least one pair.
In some embodiments, acquiring the association information between each pair of target text segments includes: acquiring at least one of the features of each pair of target text segments or the relative position features between each pair of target text segments, where the features of each target text segment in a pair include at least one of the image features of the image area where the target text segment is located or the text features of the target text segment, and the relative position features between each pair of target text segments are used to characterize the relative positions of the image areas where the pair of target text segments are located; and determining the association information between each pair of target text segments based on at least one of the features of the at least one pair of target text segments or the relative position features between the at least one pair of target text segments.
In some embodiments, the features of each target text segment are the image features of the image area where the target text segment is located, or the text features of the target text segment, or include both.
In some embodiments, the features of each target text segment include the image features of the image area where the target text segment is located, and these image features are acquired by: acquiring the image features of the target text image, and determining the image features of the image area where the target text segment is located based on the image features of the target text image and the position information of the image area where the target text segment is located.
In some embodiments, the target text image is input into the image feature extraction network, and the image feature extraction network outputs the image features of the image area where each target text segment of the at least one pair is located. The image features of the image area where a target text segment is located are used to characterize the texture information of that image area.
Exemplarily, image detection is performed on the target text image to obtain the position information of the image area where each target text segment in the target text image is located. The embodiments of the present application do not limit this position information. Exemplarily, the image area where each target text segment is located is a rectangle, a circle, or the like. The position information of the image area where each target text segment is located includes at least one of the center point coordinates, vertex coordinates, side lengths, perimeter, area, radius, and the like of that image area, where the coordinates include an abscissa and an ordinate, and the side lengths include a height and a width.
Optionally, the image feature extraction network includes a first extraction network and a second extraction network. After the target text image is input into the image feature extraction network, the first extraction network extracts the image features of the target text image based on the pixel information of each pixel in the target text image (or in the normalized target text image); these image features are used to characterize the texture information of the target text image. The second extraction network determines the image features of the image area where each target text segment is located (denoted as first area features) based on the image features of the target text image and the position information of the image area where each target text segment is located.
In some embodiments, the target text image is input into the first extraction network, and the first extraction network sequentially performs convolution processing and normalization processing on the target text image, so as to normalize the convolved target text image to a standard distribution, preventing gradient oscillation during training and reducing model overfitting. The image features of the target text image are then determined and output based on the normalized target text image. Optionally, at least one of the mean of the pixel information and the variance of the pixel information is determined based on the pixel information of each pixel in the target text image, and the convolved target text image is normalized using at least one of the mean and the variance. This normalization method is called instance normalization (IN). Since target text images vary greatly in layout and appearance, retaining the shallow appearance information of the image through instance normalization facilitates the integration and adjustment of the global information of the image and improves training stability and model generalization.
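As a minimal sketch of the instance normalization step described above, assuming the feature map is a PyTorch tensor (the use of PyTorch and the tensor shapes are assumptions, not from the original):

import torch

def instance_normalize(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, channels, height, width) feature map after convolution.
    # Mean and variance are computed per image and per channel, over the
    # spatial dimensions only, which is what distinguishes instance
    # normalization from batch normalization.
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

# torch.nn.InstanceNorm2d(num_features) provides the same operation.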
The embodiments of the present application do not limit the network structures, network sizes, and the like of the first extraction network and the second extraction network. Exemplarily, both the first extraction network and the second extraction network are convolutional neural networks (CNNs). The first extraction network is a backbone network adopting a U-Net architecture, and is used to extract visual features from the target text image: based on the pixel information of each pixel in the target text image, it first performs down-sampling to obtain down-sampled features, and then performs up-sampling on the down-sampled features to obtain the image features of the target text image.
The second extraction network is a region of interest pooling (ROI Pooling) layer or a region of interest align (ROI Align) layer, and is used to determine the image features of the image area where each target text segment is located based on the image features of the target text image and the position information of the image area where each target text segment is located. That is, the ROI Pooling layer or ROI Align layer performs feature extraction again on the image features of the target text image according to the position information of the image area where each target text segment is located, to obtain the image features of the image area where each target text segment is located. For example, the image features of the image area where each target text segment is located are visual features of a fixed dimension (such as 16 dimensions).
It should be noted that, in addition to a backbone network adopting the U-Net architecture, the first extraction network can also be a backbone network adopting a feature pyramid network (FPN) architecture or a ResNet architecture, which is not limited in the embodiments of the present application.
Please refer to Figure 4, which is a schematic diagram of extracting image features of the image area where a target text segment is located according to an embodiment of the present application. The target text image is the image shown in (2) of Figure 3, and includes the image area where the target text segment "Price List" is located, as indicated by the dotted box in Figure 4. The target text image is input into the backbone network, which outputs the image features of the target text image. According to the position information of the image area where the target text segment "Price List" is located, feature extraction is performed again on the image features of the target text image to obtain the image features of the image area where the target text segment is located.
The backbone network adopting the U-Net architecture has cross-layer connections, a design feature that is friendly to feature extraction for image areas. The ROI Pooling layer or ROI Align layer performs feature extraction again on the image features of the target text image obtained after up-sampling, to obtain the image features of the image area where each target text segment is located, which avoids the error accumulation caused by down-sampling and improves accuracy. In addition, since the image features of the target text image are obtained based on the global information of the target text image, the image features of the image area where each target text segment is located also carry this global information, giving stronger feature expression and higher accuracy.
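As a minimal sketch of this two-stage extraction (a U-Net-style backbone followed by ROI Align), assuming PyTorch and torchvision are available and that the text-segment boxes come from a separate detector; the shapes and the 16-dimensional projection are illustrative assumptions:

import torch
import torchvision.ops as ops

# Backbone output: assume a feature map at the input resolution, as a
# U-Net decoder would produce. (batch=1, channels=64, H=256, W=256)
features = torch.randn(1, 64, 256, 256)

# Detected boxes for each text segment, in (x1, y1, x2, y2) pixel
# coordinates of the original image; values here are illustrative.
boxes = torch.tensor([[10.0, 12.0, 120.0, 40.0],
                      [10.0, 50.0, 90.0, 78.0]])

# ROI Align crops a fixed-size grid from the shared feature map for each
# box; spatial_scale=1.0 because the map matches the image resolution.
region_feats = ops.roi_align(features, [boxes], output_size=(3, 3),
                             spatial_scale=1.0)  # (num_boxes, 64, 3, 3)

# Flatten and project to a fixed dimension (e.g. 16) per segment.
proj = torch.nn.Linear(64 * 3 * 3, 16)
segment_visual = proj(region_feats.flatten(1))  # (num_boxes, 16)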
Optionally, image detection is first performed on the target text image to obtain the position information of the image area where each target text segment in the target text image is located. Based on this position information, image segmentation is performed on the target text image to obtain the image area where each target text segment is located. For each target text segment, the image features of the image area where the target text segment is located (denoted as second area features) are extracted based on the pixel information of each pixel in that image area.
Optionally, the first area features and the second area features are concatenated or fused to obtain the image features of the image area where each target text segment is located. For example, the first area features are concatenated before or after the second area features. Alternatively, the outer product between the first area features and the second area features is computed in the form of a Kronecker product to obtain the image features of the image area where the target text segment is located. Alternatively, the first area features are divided into a reference number of first area blocks and the second area features are divided into the same reference number of second area blocks; for each first area block, the first area block and its associated second area block are fused to obtain a fused area block, and the fused area blocks are concatenated to obtain the image features of the image area where the target text segment is located.
It can be understood that in a structured or semi-structured text image, different target text segments are visually distinguishable in character style, character color, character size, and the like (as shown in Figure 3). The image features of the image area where each target text segment is located can characterize the visual information of that target text segment, and this visual information is a good aid for subsequently determining the association information between each pair of target text segments, the category of each target text segment, and so on, thereby improving accuracy.
In some embodiments, the features of each target text segment include the text features of the target text segment, and these text features are acquired by: acquiring the word vector of each word in the target text segment, and fusing the word vectors of the words in the target text segment to obtain the text features of the target text segment. The text features of a target text segment are used to characterize the semantic information of the words contained in the target text segment itself.
The features of a target text segment include but are not limited to its text features. In the embodiments of the present application, image detection and image segmentation are performed on the target text image in sequence to obtain the image area where each target text segment is located, and for each target text segment, image recognition is performed on the image area where it is located to obtain the target text segment.
After each target text segment is obtained, the target text segment is input into the text feature extraction network. The text feature extraction network first uses a tokenizer to segment the target text segment into words. The word vector of each word in the target text segment is then determined by looking up a vector table; a word vector is a vector of a fixed dimension (such as 200 dimensions). Afterwards, the contextual semantic relationships of the text are further learned based on the word vectors of the words in each target text segment, so as to fuse the word vectors of the words in the target text segment and obtain the text features of the target text segment. Optionally, the text feature extraction network is a bi-directional long short-term memory (Bi-LSTM) network or a Transformer network.
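As a minimal sketch of this text branch (tokenize, look up embeddings, fuse with a Bi-LSTM), assuming PyTorch, a whitespace tokenizer, and a toy vocabulary; all sizes and names are illustrative:

import torch
import torch.nn as nn

vocab = {"<unk>": 0, "dish": 1, "a": 2, "9": 3, "yuan": 4}  # toy table

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=200)
bilstm = nn.LSTM(input_size=200, hidden_size=64,
                 bidirectional=True, batch_first=True)

def text_feature(segment: str) -> torch.Tensor:
    # Tokenize, map words to ids, and look up their word vectors.
    ids = [vocab.get(w, vocab["<unk>"]) for w in segment.lower().split()]
    vectors = embedding(torch.tensor([ids]))        # (1, seq_len, 200)
    # The Bi-LSTM fuses the word vectors with contextual information;
    # the final hidden states of both directions form the segment feature.
    _, (h_n, _) = bilstm(vectors)                   # h_n: (2, 1, 64)
    return torch.cat([h_n[0], h_n[1]], dim=-1)      # (1, 128)

feat = text_feature("Dish A 9 yuan")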
Optionally, the features of each target text segment include the image features of the image area where the target text segment is located and the text features of the target text segment, and acquiring the features of each target text segment includes: for each target text segment in each pair, dividing the image features of the image area where the target text segment is located into a target number of image feature blocks, and dividing the text features of the target text segment into the same target number of text feature blocks; for each image feature block, fusing the image feature block with its associated text feature block to obtain a fused feature block; and concatenating the fused feature blocks to obtain the features of the target text segment.
In some embodiments, the target model can also concatenate or fuse the image features of the image area where each target text segment is located with the text features of the target text segment to obtain the features of the target text segment. Exemplarily, during fusion, the outer product between the image features of the image area where the target text segment is located and the text features of the target text segment is computed in the form of a Kronecker product to obtain the features of the target text segment.
When at least one of the image features of the image area where the target text segment is located or the text features of the target text segment has a large dimension, directly fusing the two takes a long time, and fusing in the form of a Kronecker product makes the dimension of the resulting features grow sharply. To reduce the computational overhead, block-wise fusion is adopted.
Optionally, for each target text segment, the image features of the image area where the target text segment is located are first divided into a target number of image feature blocks, denoted as the 1st to Nth image feature blocks, where N is a positive integer greater than 1 and represents the target number. The text features of the target text segment are likewise divided into a target number of text feature blocks, denoted as the 1st to Nth text feature blocks. Then, for each image feature block, the image feature block and its associated text feature block are fused to obtain a fused feature block, where the association between an image feature block and a text feature block means that they have the same sequence number: denoting an image feature block as the i-th image feature block, its associated text feature block is the i-th text feature block, and the resulting fused feature block is the i-th fused feature block, with i being any positive integer from 1 to N. Optionally, the outer product between the i-th image feature block and the i-th text feature block is computed in the form of a Kronecker product to obtain the i-th fused feature block. Afterwards, the fused feature blocks, that is, the 1st to Nth fused feature blocks, are concatenated to obtain the features of the target text segment.
Optionally, the image features of the image area where each target text segment is located and the text features of the target text segment are first concatenated or fused and then subjected to a nonlinear operation to obtain the features of the target text segment. The features of each target text segment indicate the feature information the target text segment itself carries, such as image features representing the texture information of the image area where it is located, text features representing its own semantic information, or the fused information of both.
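As a minimal sketch of the block-wise Kronecker fusion described above, assuming PyTorch tensors whose dimensions divide evenly into N blocks; the sizes are illustrative:

import torch

def blockwise_kron_fuse(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                        n_blocks: int) -> torch.Tensor:
    # Split each modality's feature vector into N equal blocks.
    img_blocks = img_feat.chunk(n_blocks)   # N blocks of size d_img/N
    txt_blocks = txt_feat.chunk(n_blocks)   # N blocks of size d_txt/N
    fused = []
    for ib, tb in zip(img_blocks, txt_blocks):
        # Kronecker (outer) product of the i-th image block and the
        # i-th text block, flattened into a vector.
        fused.append(torch.kron(ib, tb))
    # Concatenating the N fused blocks yields the segment feature:
    # N * (d_img/N) * (d_txt/N) dims instead of d_img * d_txt.
    return torch.cat(fused)

segment_feat = blockwise_kron_fuse(torch.randn(16), torch.randn(128), n_blocks=4)
print(segment_feat.shape)  # torch.Size([512]) = 4 * 4 * 32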
The features of each target text segment in each pair, and thus the features of each pair of target text segments, are obtained in the above manner. Afterwards, the target model determines the association information between each pair of target text segments based on the features of each pair of target text segments.
Optionally, acquiring the relative position features between each pair of target text segments includes: acquiring the position information of the image areas where each pair of target text segments are located; and determining the relative position features between each pair of target text segments based on this position information and the size information of the target text image. The relative position features between each pair of target text segments are used to characterize the difference in position between the pair of target text segments within the target text image.
For each pair of target text segments, the position information of the image areas where the two target text segments of the pair are located is acquired; the position information of the image area where each target text segment is located includes at least one of the center point coordinates, vertex coordinates, side lengths, perimeter, area, radius, and the like of that image area. The size information of the target text image is also acquired, and includes at least one of the side lengths, perimeter, area, radius, and the like of the target text image. The coordinates include an abscissa and an ordinate, and the side lengths include a width and a height.
Next, the relative horizontal distance between the image areas where each pair of target text segments are located is calculated based on the abscissas of the center points of those image areas, and the relative vertical distance is calculated based on the ordinates of the center points. Based on the relative horizontal distance, the relative vertical distance, the side lengths of the image areas where the pair of target text segments are located, and the side lengths of the target text image, the relative position features between each pair of target text segments are determined according to the following formula (1):
r_ij = [Δx_ij/d, Δy_ij/d, w_i/h_i, h_j/h_i, w_j/h_i]   Formula (1)
Here, r_ij is the relative position feature between the i-th target text segment and the j-th target text segment. d is a normalization factor (for example, the longer side max(W, H) of the target text image), which prevents fluctuations in the values computed for images of different layouts. Δx_ij = x_j - x_i, where Δx_ij represents the relative horizontal distance between the image areas where the i-th and j-th target text segments are located, x_j is the abscissa of the center point of the image area where the j-th target text segment is located, and x_i is the abscissa of the center point of the image area where the i-th target text segment is located. Δy_ij = y_j - y_i, where Δy_ij represents the relative vertical distance between the image areas where the i-th and j-th target text segments are located, y_j is the ordinate of the center point of the image area where the j-th target text segment is located, and y_i is the ordinate of the center point of the image area where the i-th target text segment is located. w_i and h_i are the width and height of the image area where the i-th target text segment is located, and w_j and h_j are the width and height of the image area where the j-th target text segment is located. W is the width of the target text image, and H is its height.
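A minimal sketch of computing these relative position features, assuming boxes are given as (center x, center y, width, height) and that the normalization factor d is the longer side of the image; the exact composition of r_ij follows the reconstruction above and is an assumption:

import torch

def relative_position_feature(box_i, box_j, img_w: float, img_h: float) -> torch.Tensor:
    # box = (center_x, center_y, width, height) of a text segment's area.
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    d = max(img_w, img_h)  # normalization factor across layouts (assumed)
    return torch.tensor([(xj - xi) / d,   # normalized horizontal offset
                         (yj - yi) / d,   # normalized vertical offset
                         wi / hi,         # aspect ratio of box i
                         hj / hi,         # relative height of box j
                         wj / hi])        # relative width of box j

r_ij = relative_position_feature((60, 20, 100, 24), (250, 20, 60, 24), 800, 600)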
In the manner of formula (1), the target model determines the relative position features between at least one pair of target text segments. Afterwards, the target model determines the association information between each pair of target text segments based on the relative position features between each pair.
Optionally, the relative position features between each pair of target text segments are normalized and linearly processed according to the following formula (2) to obtain the processed relative position features between each pair of target text segments:

e′_ij = N_l2(E r_ij)   Formula (2)

Here, e′_ij represents the processed relative position features between the i-th target text segment and the j-th target text segment. N_l2 represents normalization processing, for example L2-norm normalization, which improves stability. E represents linear processing, which projects r_ij to a fixed dimension. r_ij is the relative position feature between the i-th target text segment and the j-th target text segment.
Then, the (processed) relative position features between each pair of target text segments are used to determine the association information between each pair of target text segments.
Optionally, determining the association information between each pair of target text segments based on the features of at least one pair of target text segments and the relative position features between the at least one pair of target text segments includes: constructing a graph structure based on the features of the at least one pair of target text segments and the relative position features between them, where the graph structure includes at least two nodes and at least one edge, each node represents the features of one target text segment, and each edge represents the relative position features between the pair of target text segments indicated by the pair of nodes the edge connects; and then determining the association information between each pair of target text segments based on the graph structure.
In the embodiments of the present application, the features of each target text segment of the at least one pair are taken as one node of the graph structure; that is, one node of the graph structure is associated with the features of one target text segment. Further, the relative position features between each pair of target text segments are taken as the connecting edge between the pair of nodes associated with that pair of target text segments. Alternatively, the relative position features between each pair of target text segments are normalized and linearly processed according to formula (2) above, and the processed relative position features are taken as the connecting edge between the pair of nodes associated with that pair of target text segments.
Optionally, with reference to the following formula (3), the processed relative position features between each pair of target text segments (or the relative position features between each pair of target text segments) and the features of each pair of target text segments are concatenated to obtain concatenated features. The features obtained by fusing the concatenated features with a multi-layer perceptron (MLP) (or the concatenated features themselves) are taken as the edge between the two nodes associated with each pair of target text segments. In this way, the edges and the nodes can be better combined, so that the association information between each pair of target text segments is obtained more accurately:

e_ij = M(n_i || e′_ij || n_j)   Formula (3)

Here, e_ij is the feature obtained by fusing the concatenated features with the multi-layer perceptron. M is the multi-layer perceptron, which can transform a vector feature into a scalar feature. n_i is the feature of the i-th target text segment. || is the concatenation symbol. e′_ij represents the processed relative position features between the i-th target text segment and the j-th target text segment. n_j is the feature of the j-th target text segment.
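A minimal sketch of constructing an edge per formula (3), assuming PyTorch and illustrative feature sizes; the internal layers of the MLP that maps the concatenation to a scalar are assumptions:

import torch
import torch.nn as nn

dim_node, dim_edge = 128, 32

# M in formula (3): maps [n_i || e'_ij || n_j] to a scalar edge value.
edge_mlp = nn.Sequential(
    nn.Linear(dim_node * 2 + dim_edge, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def edge_feature(n_i: torch.Tensor, e_prime_ij: torch.Tensor,
                 n_j: torch.Tensor) -> torch.Tensor:
    concat = torch.cat([n_i, e_prime_ij, n_j], dim=-1)  # n_i || e'_ij || n_j
    return edge_mlp(concat)  # scalar e_ij

e_ij = edge_feature(torch.randn(dim_node), torch.randn(dim_edge),
                    torch.randn(dim_node))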
The graph structure is obtained in the above manner, so that the layout relationships among the target text segments in the target text image are modeled through the graph structure. In some embodiments, the target model includes a graph convolutional network (GCN). The graph structure of the target text image is input into the graph convolutional network, which determines and outputs the association information between at least one pair of target text segments. Optionally, the graph convolutional network mines the structured relationship between the two nodes at the ends of each edge by continuously and iteratively updating the graph structure, thereby obtaining the association information between at least one pair of target text segments. The process of iteratively updating the graph structure is the process of iteratively updating the nodes of the graph structure; the edges of the graph structure are not updated.
Optionally, in each iteration, the weight of each edge is first determined based on the edges in the graph structure according to the following formula (4):

α_ij^l = exp(e_ij) / Σ_k exp(e_ik)   Formula (4)

Here, α_ij^l is the weight of the edge e_ij in the graph structure at the l-th iteration. exp is the exponential function, Σ is the summation symbol, and k is an index. e_ik represents the edge between the i-th node and the k-th node in the graph structure, and the edge e_ij is the edge between the i-th node and the j-th node in the graph structure.
Then, according to the following formula (5), each node in the graph structure is updated based on the node itself, the weights of the edges having this node at one end, and those edges:

n_i^(l+1) = σ(W^l Σ_j α_ij^l n_j^l)   Formula (5)

Here, n_i^(l+1) is the updated i-th node in the graph structure at the l-th iteration, and n_i^l is the i-th node in the graph structure at the l-th iteration. σ represents nonlinear processing, and W^l represents the linear processing at the l-th iteration. α_ij^l is the weight of the edge e_ij in the graph structure at the l-th iteration, where the edge e_ij is the edge between the i-th node and the j-th node in the graph structure.
In the above manner, one iterative update of each node in the graph structure is performed, that is, one iterative update of the graph structure. If the iteration end condition is met, the updated graph structure is taken as the final graph structure, and the final graph structure is used to determine the association information between at least one pair of target text segments. If the iteration end condition is not met, the updated graph structure is taken as the graph structure for the next iteration, and the graph structure is updated again in the manner of formulas (4) and (5) until the iteration end condition is met, yielding the final graph structure, from which the association information between at least one pair of target text segments is determined. It should be noted that when iteratively updating the graph structure, the edges of the graph structure can also be iteratively updated in addition to the nodes.
Optionally, the iteration end condition is met when the number of iterations is reached, or when the change between the graph structure before and after an iterative update is smaller than a change threshold, that is, when the graph structure has stabilized.
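A minimal sketch of one such iteration follows, assuming scalar edge scores for formula (4) and the residual-style node update reconstructed as formula (5); the layer shapes and the `edge_score` projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

def graph_iteration(nodes: torch.Tensor,    # [N, node_dim], one row per text segment
                    edges: torch.Tensor,    # [N, N, edge_dim], edges[i, j] is e_ij
                    edge_score: nn.Linear,  # assumed projection of an edge to a scalar
                    W: nn.Linear) -> torch.Tensor:  # maps edge_dim back to node_dim
    # Formula (4): softmax over the edges leaving node i gives the weights alpha_ij.
    scores = edge_score(edges).squeeze(-1)                  # [N, N]
    alpha = torch.softmax(scores, dim=1)                    # [N, N]

    # Formula (5): aggregate the weighted incident edges, apply the linear
    # processing W^l and a nonlinearity sigma, and update each node.
    aggregated = (alpha.unsqueeze(-1) * edges).sum(dim=1)   # [N, edge_dim]
    return torch.relu(nodes + W(aggregated))                # sigma taken as ReLU here
```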
The embodiment of the present application first constructs a graph structure based on the features of each pair of target text segments in the target text image and the relative position features between each pair, and then determines the association information between at least one pair of target text segments based on the graph structure. Since each pair of target text segments is formed by every two target text segments, the features of each pair of target text segments are equivalent to the features of every two target text segments.
For example, if the target text image includes target text segments 1 to 3, each pair of target text segments in the target text image includes target text segments 1 and 2, target text segments 2 and 3, and target text segments 1 and 3. A graph structure is then constructed based on the features of target text segments 1, 2 and 3 and the relative position features between target text segments 1 and 2, between target text segments 2 and 3, and between target text segments 1 and 3, and the association information between target text segments 2 and 3 is determined based on the graph structure.
Optionally, the manner of obtaining the association information between each pair of target text segments further includes: obtaining the category of each target text segment and the association information between every two target text segments; and determining, based on the categories of the target text segments, the association information between each pair of target text segments from the association information between every two target text segments. The category of each target text segment characterizes which of multiple preset categories the segment belongs to, that is, the kind of semantic information the segment itself carries. For example, the two target text segments "10 yuan" and "15 yuan" represent different price semantics, but both belong to the same category "dish price".
In the embodiment of the present application, the category of each target text segment is determined based on the features of the segment. The association information between any two target text segments is determined based on the features of the two segments, or based on the relative position features between them, or based on both the features of the two segments and the relative position features between them. The embodiment of the present application does not limit the category of each target text segment; illustratively, if the target text image is a menu image, the category of each target text segment is at least one of dish name, dish price, store name, dish type, other, and so on.
Optionally, a graph structure is first constructed based on the features of every two target text segments in the target text image and the relative position features between them, and the category of each target text segment and the association information between every two target text segments are then determined based on the graph structure. Determining the association information between every two target text segments based on the graph structure follows the same principle as the description above of determining the association information between each pair of target text segments based on the graph structure, and is not repeated here. Next, based on the categories of the target text segments, the association information between each pair of target text segments is determined from the association information between every two target text segments.
In some embodiments, a Long Short Term Memory (LSTM) network and a Conditional Random Field (CRF) network are used to determine the category of each target text segment based on its features. The LSTM network and the CRF network determine the category of each character in the target text segment from the features of the segment, and the category of the segment is then determined from the categories of its characters. Optionally, if all characters in the target text segment share the same category, the category of the segment is the category of any of its characters. If the characters belong to different categories, the target text segment is split into at least two target text segments based on the categories of its characters, such that all characters within each split segment share one category, and the category of each split segment is the category of any character in it.
For example, target text segment A is "鸡蛋6元" ("Eggs 6 yuan"). In segment A, the characters "鸡" and "蛋" have the category dish name, and the characters "6" and "元" have the category dish price. Segment A is therefore split into target text segment A1 "鸡蛋" ("eggs"), whose category is dish name, and target text segment A2 "6元" ("6 yuan"), whose category is dish price.
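The per-character splitting step can be pictured with the short sketch below; it stands in for the LSTM+CRF output and assumes the character categories are already given.

```python
from itertools import groupby

def split_by_char_category(text: str, char_categories: list[str]) -> list[tuple[str, str]]:
    """Split a recognized text segment into sub-segments whose characters all
    share one category; each sub-segment takes that shared category."""
    segments, idx = [], 0
    for category, group in groupby(char_categories):
        length = len(list(group))
        segments.append((text[idx:idx + length], category))
        idx += length
    return segments

# The example above: "鸡蛋6元" with per-character categories.
print(split_by_char_category("鸡蛋6元", ["dish_name", "dish_name", "dish_price", "dish_price"]))
# -> [('鸡蛋', 'dish_name'), ('6元', 'dish_price')]
```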
Optionally, determining, based on the categories of the target text segments, the association information between each pair of target text segments from the association information between every two target text segments includes: filtering, based on the categories, the text segments to be associated, whose category is a target category, out of the multiple target text segments included in the target text image; and filtering the association information between every two text segments to be associated out of the association information between every two target text segments, to obtain the association information between each pair of target text segments. A text segment to be associated is a target text segment with a target category, that is, a target text segment that needs to participate in the computation of association information; association information only needs to be computed between every two text segments to be associated, not for pairs of target text segments of other categories.
In the embodiment of the present application, for each target text segment, if its category is a target category, the segment is a text segment to be associated; if its category is not a target category, it is not. In this way, the text segments to be associated are filtered out of the multiple target text segments. The embodiment of the present application does not limit the target category; illustratively, if the target text image is a menu image, the main concern is the matching relationship between dish names and dish prices in the menu image, so the target categories are dish name and dish price.
After the text segments to be associated are filtered out of the multiple target text segments, the association information between every two text segments to be associated can be filtered out of the association information between every two target text segments. The association information between every two text segments to be associated is taken as the association information between a pair of target text segments.
For example, if the multiple target text segments are target text segments 1 to 3 and the text segments to be associated are target text segments 2 and 3, then the association information between target text segments 2 and 3 is directly filtered out of the association information between target text segments 1 and 2, between target text segments 2 and 3, and between target text segments 1 and 3.
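A minimal sketch of this filtering step, assuming the categories and pair scores are kept in plain dictionaries and that the target categories are dish name and dish price as in the menu example:

```python
TARGET_CATEGORIES = {"dish_name", "dish_price"}   # assumed target categories

def filter_pairs(categories: dict[int, str],
                 pair_scores: dict[tuple[int, int], float]) -> dict[tuple[int, int], float]:
    """Keep only the association information between two segments whose
    categories are both target categories (the segments to be associated)."""
    to_associate = {i for i, c in categories.items() if c in TARGET_CATEGORIES}
    return {(i, j): s for (i, j), s in pair_scores.items()
            if i in to_associate and j in to_associate}
```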
Step 203: The terminal determines the association result between at least one pair of target text segments based on the association information between the at least one pair of target text segments.
For each pair of target text segments, the association result between the pair is determined based on the association information between the pair, where the association result indicates whether the pair of target text segments is associated. Optionally, if the association information between a pair of target text segments is greater than an association threshold, the association result between the pair is determined to be associated; if the association information is not greater than the association threshold, the association result is determined to be not associated. The embodiment of the present application does not limit the value of the association threshold; illustratively, the association threshold is 0.5.
Optionally, the category of each target text segment in the at least one pair of target text segments is determined, and the association relationship between every two categories is obtained; the association relationship between two categories characterizes whether the two categories are associated. The category of each target text segment is determined based on its features, or the graph structure of the target text image is input into the graph convolutional network, which determines and outputs the category of each target text segment in the at least one pair. Optionally, after updating the graph structure at least once, the graph convolutional network obtains the final graph structure, from which the category of each target text segment in the at least one pair is determined.
For each pair of target text segments: if the association information between the pair is greater than the association threshold and the association relationship between the categories of the two segments is associated, the association result between the pair is determined to be associated. If the association information is greater than the association threshold but the association relationship between the two categories is not associated, the association result is determined to be not associated. If the association information is not greater than the association threshold but the association relationship between the two categories is associated, the association result is determined to be not associated. If the association information is not greater than the association threshold and the association relationship between the two categories is not associated, the association result is determined to be not associated.
For example, the association threshold is 0.5, and the association relationships between categories include an association between dish name and dish price. If the association information between a pair of target text segments is 0.7 and the categories of the two segments are dish name and dish price respectively, the association result between the pair is determined to be associated. If the association information between another pair is 0.51 but the categories of both segments are dish name, the association result between that pair is determined to be not associated.
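A sketch of this decision rule, with the example threshold of 0.5 and an assumed table of related category pairs:

```python
ASSOC_THRESHOLD = 0.5                                 # example threshold from the text
RELATED_CATEGORIES = {("dish_name", "dish_price")}    # assumed category relation table

def are_associated(score: float, cat_i: str, cat_j: str) -> bool:
    """A pair is associated only when its score exceeds the threshold AND the
    two categories are related; every other combination is 'not associated'."""
    related = ((cat_i, cat_j) in RELATED_CATEGORIES
               or (cat_j, cat_i) in RELATED_CATEGORIES)
    return score > ASSOC_THRESHOLD and related
```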
Step 204: The terminal extracts the text information in the target text image based on the association result between the at least one pair of target text segments.
In the embodiment of the present application, the text information in the target text image is extracted based on the association result between each pair of target text segments. The text information contains at least the associated pairs obtained by combining each associated pair of target text segments; that is, the text information not only represents the content of the target text segments recognized from the target text image, but also reflects the association results between them. If the association result between a pair of target text segments is associated, a target symbol (such as at least one of ":", "-", "/", etc.) is added between the pair so that the pair is combined into one associated pair. If the association result between a pair is not associated, the pair cannot be combined into an associated pair.
In the above manner, it is determined whether each pair of target text segments can be combined into an associated pair, and when it can (that is, when the association result between the pair is associated), the pair is combined into an associated pair. This associates the multiple target text segments in the target text image and yields the text information in the target text image.
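A sketch of the combination step, using ":" as the target symbol:

```python
def combine_pairs(segments: dict[int, str],
                  association: dict[tuple[int, int], bool],
                  symbol: str = ":") -> list[str]:
    """Join each associated pair of target text segments with a target symbol
    to form the extracted associated pairs."""
    return [f"{segments[i]}{symbol}{segments[j]}"
            for (i, j), linked in association.items() if linked]

# e.g. combine_pairs({0: "鸡蛋", 1: "6元"}, {(0, 1): True}) -> ["鸡蛋:6元"]
```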
In the above method, the association information between each pair of target text segments characterizes the possibility of association between the pair. Therefore, when the association result between each pair is determined from the association information between the pair, association errors are reduced and the accuracy of the association results is improved, so that when the text information in the target text image is extracted based on the association results between at least one pair of target text segments, the accuracy of the text information is improved.
Based on the above implementation environment, an embodiment of the present application provides a method for obtaining a target model. Taking the flowchart of the method shown in Figure 5 as an example, the method can be executed by the terminal device 101 or the server 102 in Figure 1, or jointly by the terminal device 101 and the server 102. For ease of description, the terminal device 101 and the server 102 are collectively referred to as electronic devices, and the electronic device is described as a server by way of example. As shown in Figure 5, the method includes steps 501 to 503.
Step 501: The server obtains a sample text image, which includes multiple sample text segments.
The sample text image is a structured text image or a semi-structured text image. The sample text image in the embodiment of the present application is analogous to the target text image mentioned above; see the description of the target text image above, which is not repeated here.
Step 502: The server obtains predicted association information between at least one pair of sample text segments and annotated association information between the at least one pair of sample text segments.
In some embodiments, the predicted association information between each pair of sample text segments is a positive number. When the predicted association information between each pair of sample text segments is a number greater than or equal to 0 and less than or equal to 1, it is called the association probability between the pair. The predicted association information between each pair of sample text segments follows the description above of the association information between each pair of target text segments; the implementation principles are the same and are not repeated here.
In the embodiment of the present application, the predicted association information between each pair of sample text segments is obtained by means of a neural network model. The neural network model includes at least one of a first initial network and a second initial network, and determines and outputs the predicted association information between each pair of sample text segments based on the output of at least one of the two networks. The first initial network extracts the image features of the image region in which each sample text segment of each pair is located, and the second initial network extracts the text features of each sample text segment of each pair.
It should be noted that the first initial network is trained with sample text images to obtain the image feature extraction network, which is then used to extract the image features of the image region in which each target text segment of each pair is located. For the description of the first initial network, see the description of the image feature extraction network above; the implementation principles are the same and are not repeated here. For the description of the second initial network, see the description of the text feature extraction network above; likewise, the implementation principles are the same and are not repeated here.
The features of at least one pair of sample text segments are obtained; the features of each sample text segment in each pair include at least one of the image features of the image region in which the segment is located or the text features of the segment. The features of a sample text segment are obtained in the same way as the features of a target text segment; see the relevant description above, which is not repeated here.
Optionally, the first initial network includes a first sub-network and a second sub-network. After the sample text image is input into the first initial network, the first sub-network extracts the image features of the sample text image from the pixel information of each pixel in the image; these image features characterize the texture information of the sample text image. The second sub-network determines the image features of the image region in which each sample text segment is located, based on the image features of the sample text image and the position information of that region. The first sub-network is trained to obtain the first extraction network; see the description of the first extraction network above, as the implementation principles are the same. The second sub-network is trained to obtain the second extraction network; see the description of the second extraction network above, as the implementation principles are likewise the same.
Next, the relative position features between at least one pair of sample text segments are obtained; the relative position features between each pair of sample text segments characterize the relative position between the image regions in which the pair is located. The relative position features between each pair of sample text segments are obtained in the same way as those between each pair of target text segments; see the description above, which is not repeated here.
The predicted association information between at least one pair of sample text segments is determined based on the features of the at least one pair of sample text segments and the relative position features between them. Optionally, a graph structure is constructed based on these features and relative position features; the graph structure includes at least two nodes and at least one edge, where each node represents the features of one sample text segment and each edge represents the relative position features between a pair of sample text segments, and the predicted association information between at least one pair of sample text segments is determined based on the graph structure. For the description of determining the predicted association information between at least one pair of sample text segments, see the description above of determining the association information between at least one pair of target text segments; the implementation principles are the same and are not repeated here.
The neural network model further includes a third initial network; the graph structure of the sample text image is input into the third initial network, which determines and outputs the association information between at least one pair of sample text segments. The third initial network is trained to obtain the graph convolutional network; see the description of the graph convolutional network, as the implementation principles are the same and are not repeated here.
In the embodiment of the present application, at least one pair of sample text segments is annotated with association information, yielding the annotated association information between the at least one pair of sample text segments. The annotated association information between each pair of sample text segments is 0 or 1, where 0 indicates that the pair of sample text segments is not associated and 1 indicates that the pair is associated.
Step 503: Obtain the target model based on the predicted association information between the at least one pair of sample text segments and the annotated association information between the at least one pair of sample text segments.
The loss value of the neural network model is determined from the predicted association information between each pair of sample text segments and the annotated association information between each pair. The neural network model is adjusted through its loss value to obtain the adjusted neural network model. If the training end condition is met, the adjusted neural network model is taken as the target model. If the training end condition is not met, the adjusted neural network model is taken as the neural network model for the next round of training, and the model is trained again in the manner of steps 501 to 503 until the training end condition is met, yielding the target model.
In the embodiment of the present application, the association information loss value is determined from the predicted association information between at least one pair of sample text segments and the annotated association information between the at least one pair, according to formula (6) below, which is a focal loss (Focal Loss) function:

$L_{rel} = -\sum_{i,j}\Big[\alpha\,(1-p_{ij})^{\gamma}\,y_{ij}\log(p_{ij}) + (1-\alpha)\,p_{ij}^{\gamma}\,(1-y_{ij})\log(1-p_{ij})\Big], \quad p_{ij}=E\big(e_{ij}^{L}\big)$    Formula (6)

where $L_{rel}$ is the association information loss value; $\alpha$ and $\gamma$ are two hyperparameters used to control the loss ratio of positive and negative samples, whose values are not limited by the embodiment of the present application (illustratively, $\alpha=0.25$ and $\gamma=2$); $p_{ij}$ represents the predicted association information between the $i$-th and $j$-th sample text segments; $\log$ is the logarithm symbol; $y_{ij}$ represents the annotated association information between the $i$-th and $j$-th sample text segments; $e_{ij}^{L}$ represents the edge between the node associated with the $i$-th sample text segment and the node associated with the $j$-th sample text segment on the graph structure after the $L$-th iteration; and $E$ is a linear layer used to map edges to predicted association information.
It should be noted that in the process of iteratively updating the graph structure, the edges of the graph structure are updated, or are not updated. An edge in the updated graph structure is denoted $e_{ij}^{L}$, representing the edge between the $i$-th node and the $j$-th node in the graph structure after the $L$-th update. From the graph structure, an $N^2 \times a$-dimensional probability distribution matrix can be determined and output, where $N^2$ represents the number of combinations of any two nodes in the graph structure, that is, the number of sample text segment pairs formed by combining any two sample text segments in the sample text image, and $a$ represents the dimension of the predicted association information between a pair of sample text segments. Optionally, $a=2$; in this case, predicted association information of 0 indicates that a pair of sample text segments is not associated, and predicted association information of 1 indicates that the pair is associated. Optionally, $a=1$; in this case the predicted association information is a value greater than or equal to 0 and less than or equal to 1.
For a sample text image (such as a menu image), the number $M$ of associated sample text segment pairs in the image is far smaller than ($\ll$) $N^2$. Taking an associated pair of sample text segments as a positive sample and an unassociated pair as a negative sample, the number of negative samples far exceeds the number of positive samples. The probability distribution matrix to be fitted is therefore extremely sparse, and the ratio of positive to negative samples is severely imbalanced. Formula (6) addresses the sparsity of the probability distribution matrix and the imbalance of positive and negative samples: by balancing the loss ratio of the positive and negative samples, it prevents the network from over-learning the negative samples, thereby improving network performance.
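A minimal sketch of the focal loss of formula (6), assuming the predicted association information has already been mapped into (0, 1); the epsilon guard is an implementation detail added here:

```python
import torch

def relation_focal_loss(p: torch.Tensor,   # predicted association info p_ij in (0, 1)
                        y: torch.Tensor,   # annotated association info y_ij in {0, 1}
                        alpha: float = 0.25,
                        gamma: float = 2.0) -> torch.Tensor:
    """Focal loss: down-weights the abundant easy negative pairs so the
    sparse positive pairs are not drowned out."""
    eps = 1e-8
    pos = alpha * (1 - p).pow(gamma) * y * torch.log(p + eps)
    neg = (1 - alpha) * p.pow(gamma) * (1 - y) * torch.log(1 - p + eps)
    return -(pos + neg).sum()
```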
Optionally, the association information loss value is taken as the loss value of the neural network model. Alternatively, the loss value of the neural network model is determined based on the association information loss value, the predicted category of each sample text segment in the at least one pair, and the annotated category of each sample text segment in the at least one pair. Alternatively, the loss value of the neural network model is determined based on the association information loss value and the features of each sample text segment in the at least one pair.
Optionally, obtaining the target model based on the predicted association information and the annotated association information between each pair of sample text segments further includes: obtaining the predicted category and the annotated category of each sample text segment in each pair; and obtaining the target model based on the predicted association information between each pair of sample text segments, the annotated association information between each pair, the predicted category of each sample text segment in each pair, and the annotated category of each sample text segment in each pair.
In the embodiment of the present application, the predicted category of each sample text segment is determined based on its features. Alternatively, the graph structure of the sample text image is input into the third initial network, which determines and outputs the predicted category of each sample text segment in each pair. In addition, each sample text segment is annotated to obtain the annotated category of the segment.
In an exemplary embodiment of the present application, the category loss value is determined from the predicted category of each sample text segment in at least one pair of sample text segments and the annotated category of each such segment, according to formula (7) below, which is a cross entropy loss (Cross Entropy Loss, CE Loss) function:

$L_{node} = \frac{1}{N}\sum_{i=1}^{N} CE\big(E(n_i^{L}),\, y_i\big)$    Formula (7)

where $L_{node}$ is the category loss value; $N$ is the number of sample text segments in the at least one pair; $CE$ is the symbol of the cross-entropy loss function; $E$ is linear processing used to map the node associated with the $i$-th sample text segment on the graph structure after the $L$-th iteration onto the probability distribution dimension, yielding the predicted category of the $i$-th sample text segment; $n_i^{L}$ is the node associated with the $i$-th sample text segment on the graph structure after the $L$-th iteration; and $y_i$ is the annotated category of the $i$-th sample text segment.
It should be noted that in the process of iteratively updating the graph structure, the nodes of the graph structure are updated. A node in the updated graph structure is denoted $n_i^{L}$, representing the $i$-th node in the graph structure after the $L$-th update. From the graph structure, an $N \times b$-dimensional probability distribution matrix is determined and output, where $N$ is the number of nodes in the graph structure, that is, the number of sample text segments in the sample text image, and $b$ is the number of predicted categories of the sample text segments. Optionally, when the sample text image is a menu image, $b=5$, covering the five predicted categories of dish name, dish price, store name, dish type, and other.
In addition, according to formula (6), the association information loss value is determined based on the predicted association information between at least one pair of sample text segments and the annotated association information between the at least one pair. Afterwards, the loss value of the neural network model is determined based on the category loss value and the association information loss value.
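A one-line sketch of formula (7), assuming the node features have already been mapped to the N×b logits by the linear layer E:

```python
import torch
import torch.nn.functional as F

def node_category_loss(node_logits: torch.Tensor,        # [N, b], output of E
                       labels: torch.Tensor) -> torch.Tensor:  # [N] annotated categories
    """Cross-entropy loss of formula (7); F.cross_entropy averages over the N
    text segments by default, matching the 1/N factor."""
    return F.cross_entropy(node_logits, labels)
```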
Optionally, obtaining the target model based on the predicted association information and the annotated association information between each pair of sample text segments further includes: obtaining the features of each sample text segment in each pair, where the features of each sample text segment include at least one of the image features of the image region in which the segment is located or the text features of the segment; and obtaining the target model based on the features of each sample text segment in each pair, the predicted association information between each pair, and the annotated association information between each pair.
In the embodiment of the present application, the image features of the sample text image are obtained. For each sample text segment, the image features of the image region in which the segment is located (denoted the first region features) are determined based on the image features of the sample text image and the position information of that region. Alternatively, image detection and image segmentation are performed on the sample text image in sequence to obtain the image region in which each sample text segment is located, and for each segment the image features of its region (denoted the second region features) are extracted from the pixel information of each pixel in the region. Alternatively, the first region features and the second region features are spliced or fused to obtain the image features of the region in which the sample text segment is located. The image features of the region in which each sample text segment is located are determined in the same way as described above for the region of each target text segment; the implementation principles are the same and are not repeated here.
After the image region in which each sample text segment is located has been obtained, image recognition is performed on that region to obtain the sample text segment. A tokenizer is then used to segment the sample text segment into its words, and the word vector of each word in the segment is determined by looking it up in a vector table. Afterwards, the text features of the sample text segment are determined based on the word vectors of its words. The text features of each sample text segment follow the description above of the text features of each target text segment; the implementation principles are the same and are not repeated here.
Optionally, the image features of the image region in which each sample text segment is located are taken as the features of the segment. Alternatively, the text features of the segment are taken as its features. Alternatively, the image features of the region and the text features of the segment are spliced or fused to obtain the features of the segment. Alternatively, the image features of the region and the text features of the segment are first spliced or fused and then subjected to a nonlinear operation to obtain the features of the segment. The features of each sample text segment follow the description above of the features of each target text segment; the implementation principles are the same and are not repeated here.
Optionally, obtaining the target model based on the features of each sample text segment in each pair, the predicted association information between each pair, and the annotated association information between each pair includes: obtaining the annotated category of each sample text segment in each pair; for each annotated category, determining the feature average of that category based on the features of the sample text segments in it; and obtaining the target model based on the feature averages of the annotated categories, the predicted association information between each pair of sample text segments, and the annotated association information between each pair.
In the embodiment of the present application, each sample text segment is annotated to obtain its annotated category. For each annotated category, the sum of the features of the sample text segments in the category is computed and divided by the number of segments in the category, yielding the feature average of the category.
Optionally, the first loss value is determined from the feature average of each annotated category according to formula (8) below:

$L_{pull} = \frac{1}{M}\sum_{m=1}^{M}\sum_{k}\big\|\,e_{mk}-\bar{e}_{m}\,\big\|_{2}$    Formula (8)

In addition, the second loss value is determined from the feature average of each annotated category according to formula (9) below:

$L_{push} = \frac{1}{M(M-1)}\sum_{m=1}^{M}\sum_{j\neq m}\max\big(0,\ \Delta-\big\|\,\bar{e}_{m}-\bar{e}_{j}\,\big\|_{2}\big)$    Formula (9)
where $L_{pull}$ is the first loss value and $L_{push}$ is the second loss value. $M$ is the number of sample text pairs whose annotated association information is 1; since annotated association information of 1 between a pair of sample text segments indicates that the pair is associated, $M$ is also the number of associated sample text segment pairs. $m$ and $k$ are sequence numbers. $\bar{e}_{m}$ represents the feature average of the $m$-th annotated category, and $e_{mk}$ represents the features of the $k$-th sample text segment in the $m$-th annotated category. $\|x\|_2$ represents the two-norm of $x$, where $x$ is the argument. $\Sigma$ is the summation symbol. $\bar{e}_{j}$ is the feature average of the $j$-th annotated category. $\Delta$ is a hyperparameter used to widen the distance between the features of two sample text segments belonging to different annotated categories; the embodiment of the present application does not limit the value of $\Delta$ (illustratively, $\Delta=1$).
It should be noted that although the embodiment of the present application can determine the predicted association information of a pair of sample text segments, determining the predicted category of each sample text segment amounts to classifying the segments. A classification problem can only optimize the boundaries between classes, which easily causes the distance between the features of two sample text segments of the same annotated category to be large while the distance between the features of two segments of different annotated categories is small.
Please refer to Figure 6, which is an example diagram of the distances between features of sample text segments provided by an embodiment of the present application. As can be seen from Figure 6, R+ is greater than R-, where R+ is the distance between the features of two sample text segments in annotated category A, and R- is the distance between the features of a sample text segment in annotated category A and the features of a sample text segment in annotated category B.
To improve the performance of determining predicted association information, formula (8) above is used to compute the first loss value from the features of the sample text segments. Since the first loss value is computed from the feature average of the $m$-th annotated category and the features of the $k$-th sample text segment in that category, it drives the features of every sample text segment in each annotated category toward the feature average of the category. The first loss value therefore pulls the features of the segments in each annotated category toward the category's feature average, reducing the distance between the features of two segments of the same annotated category.
Formula (9) above is used to compute the second loss value from the features of the sample text segments. Since the second loss value is determined from the feature average of the $m$-th annotated category, the feature average of the $j$-th annotated category and the hyperparameter $\Delta$, it makes the distance between the feature averages of any two annotated categories at least greater than $\Delta$. The second loss value therefore pushes the feature average of each annotated category away from that of every other annotated category, widening the distance between the features of two sample text segments belonging to different annotated categories.
Through formulas (8) and (9), the performance of the network in determining predicted association information can be improved. For example, for Figure 6, the first and second loss values of the embodiment of the present application can shrink R+ while enlarging R-, thereby improving the accuracy of the features of the sample text image.
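A sketch of formulas (8) and (9), assuming the segment features are grouped by annotated category (or associated pair) into a list of tensors; the normalization factors follow the reconstruction above:

```python
import torch

def pull_push_loss(features: list[torch.Tensor],   # features[m] is [K_m, D]: segments of group m
                   delta: float = 1.0) -> tuple[torch.Tensor, torch.Tensor]:
    """Pull each group's features toward the group mean (formula 8), and push
    the means of different groups at least delta apart (formula 9)."""
    means = [f.mean(dim=0) for f in features]       # feature average per group
    M = len(features)

    pull = sum((f - mu).norm(dim=-1).sum() for f, mu in zip(features, means)) / M

    push = features[0].new_zeros(())
    for m in range(M):
        for j in range(M):
            if j != m:
                push = push + torch.clamp(delta - (means[m] - means[j]).norm(), min=0)
    push = push / max(M * (M - 1), 1)
    return pull, push
```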
The features of each sample text segment include at least one of the image features of the image region in which the segment is located or the text features of the segment. Optionally, the image features of the region and the text features of the segment are first spliced (or fused), and the spliced (or fused) features are then subjected to at least one layer of nonlinear operations to obtain fixed-dimension features of the segment, so that the first loss value and the second loss value can be computed from the features of the sample text segments.
After the first loss value and the second loss value are computed, the loss value of the neural network model is determined from the first loss value, the second loss value and the association information loss value. Optionally, the loss value of the neural network model can also be determined from the first loss value, the second loss value, the association information loss value and the category loss value.
Optionally, weights are set for the first loss value, the second loss value, the association information loss value and the category loss value. The loss value of the neural network model is determined from at least one of the first loss value, the second loss value, the association information loss value and the category loss value, combined with their respective weights. For example, the loss value of the neural network model is determined from the association information loss value, the category loss value, the weight of the association information loss value and the weight of the category loss value.
Optionally, the loss value of the neural network model is determined from the first loss value, the second loss value, the association information loss value, the category loss value and their respective weights, according to formula (10) below:

$L = \alpha L_{node} + \beta L_{rel} + \gamma L_{pull} + \gamma L_{push}$    Formula (10)

where $L$ is the loss value of the neural network model; $\alpha$ is the weight of the category loss value $L_{node}$; $\beta$ is the weight of the association information loss value $L_{rel}$; and $\gamma$ is the weight of both the first loss value $L_{pull}$ and the second loss value $L_{push}$. The embodiment of the present application does not limit the values of $\alpha$, $\beta$ and $\gamma$; illustratively, $\alpha=1$, $\beta=10$ and $\gamma=0.5$. The weight of the first loss value and the weight of the second loss value are the same, or different.
After the loss value of the neural network model is determined, the gradient of the loss value is computed and back-propagated layer by layer to update the model parameters. That is, the neural network model is adjusted through its loss value to obtain the target model, which is used to obtain the association information between at least one pair of target text segments.
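A sketch of formula (10) and one training step; the optimizer choice is an assumption, and `l_node`, `l_rel`, `l_pull`, `l_push` stand for the loss values computed above:

```python
import torch

def total_loss(l_node, l_rel, l_pull, l_push,
               alpha: float = 1.0, beta: float = 10.0, gamma: float = 0.5):
    """Weighted sum of formula (10), with the example weights from the text."""
    return alpha * l_node + beta * l_rel + gamma * l_pull + gamma * l_push

# One hypothetical training step: the loss gradient is back-propagated layer
# by layer to update the model parameters (optimizer choice assumed).
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = total_loss(l_node, l_rel, l_pull, l_push)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```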
In some embodiments, a contrastive-learning loss function can also be used to compute a contrastive learning loss value. For example, each pair of sample text segments whose annotated association information is associated is taken as a positive sample, and the loss value of the positive samples is computed from the features of these pairs; each pair whose annotated association information is not associated is taken as a negative sample, and the loss value of the negative samples is computed from the features of these pairs. The contrastive learning loss value is then determined from the loss values of the positive and negative samples. The loss value of the neural network model is determined from at least one of the first loss value, the second loss value, the association information loss value, the category loss value and the contrastive learning loss value, combined with their respective weights.
In the above method, the target model is obtained based on the predicted association information between at least one pair of sample text segments and the annotated association information between the at least one pair of sample text segments, so that the target model learns the association information between each pair of text segments, which helps to reduce association errors and improve the accuracy of the extracted text information.
The text information extraction method and the method for acquiring the target model have been described above in terms of method steps. The method for acquiring the target model according to the embodiments of the present application is further described below with reference to FIG. 7. FIG. 7 is a schematic diagram of the training of a neural network model provided by an embodiment of the present application. The neural network model includes a first initial network, a second initial network and a third initial network, and the first initial network includes a first sub-network and a second sub-network. The embodiments of the present application train the neural network model using the predicted association information between each pair of sample text segments (that is, every two sample text segments) in a sample text image.
In the embodiment of the present application, a sample text image is acquired, where the sample text image is the image shown in (2) of FIG. 3. The sample text image is input into the first sub-network, which outputs the image features of the sample text image. The image features of the sample text image are input into the second sub-network, which outputs the image features of the image area in which each sample text segment in the sample text image is located. Optionally, image recognition is also performed on the sample text image to obtain an image recognition result of the sample text image, and the image recognition result includes each sample text segment. The second initial network is used to obtain the text features of each sample text segment.
Then, for each sample text segment, the image features of the image area in which the sample text segment is located and the text features of the sample text segment are fused to obtain the features of the sample text segment, and the features of the sample text segment are updated at least once. For ease of description, the features of each sample text segment are referred to as the features of the sample text segment before updating, and the features obtained by updating them at least once are referred to as the features of the updated sample text segment.
On the one hand, a nonlinear operation is performed on the features of each sample text segment before updating, so as to update these features once and obtain the features of each updated sample text segment. In this way, the features of each sample text segment are obtained. Based on the features of each sample text segment, the feature loss value is calculated according to formulas (8) and (9) mentioned above, where the feature loss value includes the first loss value and the second loss value mentioned above.
On the other hand, the features of each sample text segment before updating are input into the third initial network. The third initial network constructs an initial graph structure based on these features and updates the graph structure multiple times, that is, the features of each sample text segment before updating are updated multiple times until the final graph structure is obtained, where the final graph structure includes the features of each updated sample text segment. Based on the final graph structure, the third initial network determines and outputs the predicted category of each sample text segment and the predicted association information between every two sample text segments. Then, based on the predicted category of each sample text segment, the category loss value is calculated according to formula (7) mentioned above; based on the predicted association information between every two sample text segments, the association information loss value is calculated according to formula (6) mentioned above.
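The exact update rule applied by the third initial network follows the formulas given earlier in the document and is not reproduced here. Purely as an illustration, one graph-convolution-style update over the segment graph could look like the following; the normalized adjacency and the ReLU-projected aggregation are assumptions, not the prescribed rule.

    import torch

    def graph_update(node_feats: torch.Tensor, adj: torch.Tensor,
                     weight: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, D) features of the N sample text segments (the nodes);
        # adj: (N, N) normalized adjacency derived from the edges;
        # weight: (D, D) learnable projection of one update step.
        return torch.relu(adj @ node_feats @ weight)

Applying such an update several times corresponds to updating the features of each sample text segment multiple times until the final graph structure is obtained.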
Afterwards, based on the feature loss value, the category loss value and the association information loss value, the loss value of the neural network model is calculated according to formula (10) mentioned above. Based on the loss value of the neural network model, the neural network model is adjusted to obtain the target model.
After the target model is obtained, the text information in the target text image is extracted based on the target model. In the embodiments of the present application, the target model includes an image feature extraction network (trained from the first initial network), a text feature extraction network (trained from the second initial network) and a graph convolution network (trained from the third initial network), and the image feature extraction network includes a first extraction network (trained from the first sub-network) and a second extraction network (trained from the second sub-network).
The target text image includes, for example, a menu image or a license image. Image recognition is first performed on the target text image to obtain an image recognition result of the target text image; the target text image and its image recognition result are then input into the target model, which outputs the category of each target text segment in the target text image and the association information between every two target text segments. Afterwards, the text information in the target text image is determined based on the category of each target text segment in the target text image and the association information between every two target text segments.
Referring to FIG. 8, FIG. 8 is a schematic diagram of the extraction of text information from a menu image provided by an embodiment of the present application. The menu image includes "Dish A 20 yuan", "Dish B 20 yuan", "Dish C 28 yuan", "Dish D 28 yuan", "Dish E 25 yuan", "Dish F 25 yuan" and the picture associated with each. Image recognition is performed on the menu image to obtain an image recognition result, and the image recognition result includes each text segment in the menu image (i.e., the target text segments mentioned above), namely the text segments "Dish A", "20 yuan", "Dish B", "20 yuan", "Dish C", "28 yuan", "Dish D", "28 yuan", "Dish E", "25 yuan", "Dish F" and "25 yuan". As can be seen from FIG. 8, the image recognition result merely identifies each text segment in the menu image and does not associate the text segments with each other. The menu image and its image recognition result are input into the target model, which outputs the category of each text segment in the menu image and the association information between every two text segments in the menu image. Based on the category of each text segment in the menu image and the association information between every two text segments in the menu image, the text information in the menu image can be obtained, namely "Dish A: 20 yuan", "Dish B: 20 yuan", "Dish C: 28 yuan", "Dish D: 28 yuan", "Dish E: 25 yuan" and "Dish F: 25 yuan".
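The application does not prescribe how the categories and association results are turned into key-value pairs. The sketch below assumes the model returns one category per segment and an association score per pair of segments, and matches each dish name to its most likely price using a score threshold; the category labels, the threshold and the greedy strategy are all illustrative assumptions.

    def pair_segments(segments, categories, assoc_score, threshold=0.5):
        # segments: recognized text segments, e.g. ["Dish A", "20 yuan", ...];
        # categories: parallel list of predicted categories, e.g. "name" or "price";
        # assoc_score[i][j]: association likelihood between segments i and j.
        names = [i for i, c in enumerate(categories) if c == "name"]
        prices = [j for j, c in enumerate(categories) if c == "price"]
        results = []
        for i in names:
            best = max(prices, key=lambda j: assoc_score[i][j], default=None)
            if best is not None and assoc_score[i][best] >= threshold:
                results.append(f"{segments[i]}: {segments[best]}")
        return results

For the menu of FIG. 8, such a pairing step would yield entries such as "Dish A: 20 yuan".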
Referring to FIG. 9, FIG. 9 is a schematic diagram of the extraction of text information from another menu image provided by an embodiment of the present application. Based on the same principle as FIG. 8, in the embodiment of the present application, the text information in the menu image can be determined using the menu image, its image recognition result and the target model, namely "Dish A: 20 yuan", "Dish B: 20 yuan", "Dish C: 20 yuan", "Dish D: 6/piece", "Dish E: 5/piece", "Dish F: 2/bowl" and "Dish G: 5/bowl".
It should be noted that the menu images shown in FIG. 8 and FIG. 9 are both structured text images. The target model in the embodiments of the present application can also extract text information from semi-structured text images, for example the license images (semi-structured text images) shown in FIG. 10 and FIG. 11.
Referring to FIG. 10, FIG. 10 is a schematic diagram of the extraction of text information from a license image provided by an embodiment of the present application. The license image includes "License", "Name XXX Company", "Company type sole proprietorship", "Legal representative XX" and "Date year X, month X, day X". Image recognition is performed on the license image to obtain an image recognition result, and the image recognition result includes "License", "Name XXX Company", "Company type sole proprietorship", "Legal representative XX" and "Date year X, month X, day X". The license image and its image recognition result are input into the target model, which outputs the category of each text segment in the license image and the association information between every two text segments in the license image. Based on the category of each text segment in the license image and the association information between every two text segments in the license image, the text information in the license image can be obtained, namely "License", "Name: XXX Company", "Company type: sole proprietorship", "Legal representative: XX" and "Date: year X, month X, day X".
Referring to FIG. 11, FIG. 11 is a schematic diagram of the extraction of text information from another license image provided by an embodiment of the present application. Based on a principle similar to FIG. 10, the text information in the license image can be determined using the license image, its image recognition result and the target model, namely "License", "Name: XXX Company", "Residence: XX Town", "Registration number: 1111111" and "Business scope: fruits and vegetables, daily necessities, cultural and sporting goods".
In the embodiments of the present application, the neural network model is trained in four ways, and four target models are obtained.
The first target model is obtained by inputting the sample text image and its image recognition result into the neural network model, which performs the following processing: batch normalization is first performed on the sample text image, and the image features of the image area in which each sample text segment is located are then determined based on the batch-normalized sample text image; the text features of each sample text segment in the sample text image are determined based on the image recognition result of the sample text image; the predicted category of each sample text segment is determined and output based on the image features of the image area in which each sample text segment is located and the text features of each sample text segment. The loss value of the neural network model is determined according to formula (7) above, and the neural network model is adjusted based on this loss value to obtain the first target model.
The second target model is obtained by inputting the sample text image and its image recognition result into the neural network model, which performs the following processing: instance normalization is first performed on the sample text image, and the image features of the image area in which each sample text segment is located are then determined based on the instance-normalized sample text image; the text features of each sample text segment are determined based on the image recognition result of the sample text image; the predicted category of each sample text segment is determined and output based on the image features of the image area in which each sample text segment is located and the text features of each sample text segment. The loss value of the neural network model is determined according to formula (7) above, and the neural network model is adjusted based on this loss value to obtain the second target model.
The third target model is obtained by inputting the sample text image and its image recognition result into the neural network model, which performs the following processing: instance normalization is performed on the sample text image, and the image features of the image area in which each sample text segment is located are then determined based on the instance-normalized sample text image; the text features of each sample text segment are determined based on the image recognition result of the sample text image; the predicted category and the features of each sample text segment are determined and output based on the image features of the image area in which each sample text segment is located and the text features of each sample text segment. The loss value of the neural network model is determined according to formulas (7) to (9) above, and the neural network model is adjusted based on this loss value to obtain the third target model.
The fourth target model is obtained by inputting the sample text image and its image recognition result into the neural network model, which performs the following processing: instance normalization is performed on the sample text image, and the image features of the image area in which each sample text segment is located are then determined based on the instance-normalized sample text image; the text features of each sample text segment are determined based on the image recognition result of the sample text image; the predicted category of each sample text segment, the features of each sample text segment and the predicted association information between every two sample text segments are determined and output based on the image features of the image area in which each sample text segment is located and the text features of each sample text segment. The loss value of the neural network model is determined according to formulas (6) to (9) above, and the neural network model is adjusted based on this loss value to obtain the fourth target model.
In the embodiments of the present application, the performance index of each target model is calculated according to formula (11) below.

mEF = (1/N)·Σ_i F_i,  F_i = 2·P_i·R_i / (P_i + R_i),  P = tp / (tp + fp),  R = tp / (tp + fn)    Formula (11)

Here, mEF is the performance index of the target model; i is the index of a predicted category and N is the number of predicted categories; F_i is the score of the i-th predicted category; P_i is the precision of the i-th predicted category and R_i is the recall of the i-th predicted category. P is the precision, tp is the number of positive samples whose predicted category is consistent with the annotated category, fp is the number of negative samples whose predicted category is inconsistent with the annotated category, and fn is the number of positive samples whose predicted category is inconsistent with the annotated category. If the annotated category of a sample text segment in the sample text image is the target category, the sample text segment is a positive sample; if the annotated category of the sample text segment is not the target category, the sample text segment is a negative sample.
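Formula (11) can be checked with a short computation. The sketch below assumes the per-category counts tp, fp and fn have already been tallied by comparing predicted categories with annotated categories.

    def mean_entity_f(counts):
        # counts: one (tp, fp, fn) tuple per target category.
        scores = []
        for tp, fp, fn in counts:
            p = tp / (tp + fp) if tp + fp else 0.0   # precision P = tp / (tp + fp)
            r = tp / (tp + fn) if tp + fn else 0.0   # recall R = tp / (tp + fn)
            scores.append(2 * p * r / (p + r) if p + r else 0.0)  # F_i
        return sum(scores) / len(scores) if scores else 0.0       # mEF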
The sample text images used to train these four target models are menu images. The predicted categories and annotated categories of the sample text segments in the menu images each include at least one of dish name, dish price, store name, dish type and others, and the target categories include dish name and dish price. The performance indexes of these four target models are shown in Table 1 below.
Table 1
As can be seen from Table 1, the mEF values of these four target models increase in turn. Since a larger mEF indicates a better-performing target model, the performance of these four target models increases in turn, and the fourth target model performs best. The target model of the embodiments of the present application can effectively reduce association errors, improve the accuracy of the text information, and quickly extract the text information in the target text image, avoiding tedious and complicated manual input.
It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the target text images and sample text images involved in this application were all obtained with full authorization.
FIG. 12 is a schematic structural diagram of a text information extraction apparatus provided by an embodiment of the present application. As shown in FIG. 12, the apparatus includes:

an acquisition module 1201, configured to acquire a target text image, the target text image including a plurality of target text segments;

the acquisition module 1201 being further configured to acquire association information between at least one pair of target text segments, the association information being used to characterize the possibility of association between each pair of target text segments;

a determination module 1202, configured to determine an association result between each pair of target text segments based on the association information between each pair of target text segments; and

an extraction module 1203, configured to extract the text information in the target text image based on the association result between each pair of target text segments.
In a possible implementation, the acquisition module 1201 is configured to acquire at least one of features of each pair of target text segments or relative position features between each pair of target text segments, where the features of each target text segment in each pair of target text segments include at least one of the image features of the image area in which the target text segment is located or the text features of the target text segment, and the relative position features between each pair of target text segments are used to characterize the relative position between the image areas in which the pair of target text segments are located; and to determine the association information between each pair of target text segments based on at least one of the features of the at least one pair of target text segments or the relative position features between the at least one pair of target text segments.
In a possible implementation, the features of each target text segment in each pair of target text segments include the image features of the image area in which the target text segment is located, and the acquisition module 1201 is configured to acquire image features of the target text image, and, for each target text segment in the at least one pair of target text segments, to determine the image features of the image area in which the target text segment is located based on the image features of the target text image and the position information of the image area in which the target text segment is located.
In a possible implementation, the features of each target text segment in each pair of target text segments include the text features of the target text segment, and the acquisition module 1201 is configured to, for each target text segment in the at least one pair of target text segments, acquire a word vector of each word in the target text segment, and to fuse the word vectors of the words in the target text segment to obtain the text features of the target text segment.
In a possible implementation, the features of each target text segment in each pair of target text segments include the image features of the image area in which the target text segment is located and the text features of the target text segment, and the acquisition module 1201 is configured to, for each target text segment in the at least one pair of target text segments, segment the image features of the image area in which the target text segment is located into a target number of image feature blocks and segment the text features of the target text segment into the target number of text feature blocks; to fuse, for each image feature block, the image feature block with the text feature block associated with the image feature block to obtain a fused feature block; and to splice the fused feature blocks to obtain the features of the target text segment.
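As an illustration of this block-wise fusion, the sketch below assumes that the image features and the text features of a segment have the same length, that blocks sharing an index are the associated ones, and that element-wise addition is the fusion operation; none of these choices is mandated by the description above.

    import torch

    def fuse_segment_features(img_feat: torch.Tensor, txt_feat: torch.Tensor,
                              num_blocks: int) -> torch.Tensor:
        # img_feat, txt_feat: (D,) features of one target text segment,
        # with D divisible by num_blocks (the target number).
        img_blocks = img_feat.chunk(num_blocks)   # target number of image feature blocks
        txt_blocks = txt_feat.chunk(num_blocks)   # target number of text feature blocks
        fused = [i + t for i, t in zip(img_blocks, txt_blocks)]  # fuse associated blocks
        return torch.cat(fused)                   # splice into the segment's features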
In a possible implementation, the acquisition module 1201 is configured to acquire, for each pair of target text segments, the position information of the image areas in which the pair of target text segments are located, and to determine the relative position features between each pair of target text segments based on the position information of the image areas in which the pair of target text segments are located and the size information of the target text image.
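One way to realize such relative position features is sketched below; it assumes each image area is given as (x, y, width, height) and normalizes offsets and sizes by the image dimensions, while the exact layout of the feature vector is an assumption.

    def relative_position_feature(box_a, box_b, img_w, img_h):
        # box_a, box_b: (x, y, w, h) of the image areas of a pair of target text segments;
        # img_w, img_h: size information of the target text image.
        xa, ya, wa, ha = box_a
        xb, yb, wb, hb = box_b
        return [(xb - xa) / img_w,   # horizontal offset, normalized by image width
                (yb - ya) / img_h,   # vertical offset, normalized by image height
                wa / img_w, ha / img_h,
                wb / img_w, hb / img_h]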
In a possible implementation, the acquisition module 1201 is configured to construct a graph structure based on the features of at least one pair of target text segments and the relative position features between the at least one pair of target text segments, the graph structure including at least two nodes and at least one edge, each node representing the features of one target text segment, and each edge representing the relative position features between the pair of target text segments indicated by the pair of nodes connected by the edge; and to determine the association information between each pair of target text segments based on the graph structure.
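The graph structure itself admits a very simple representation: nodes carry segment features and edges carry the relative position features of the corresponding pair. The dictionary-based sketch below is illustrative only.

    def build_graph(segment_feats, rel_pos):
        # segment_feats: list of per-segment feature vectors (one node each);
        # rel_pos[(i, j)]: relative position feature between segments i and j.
        nodes = {i: f for i, f in enumerate(segment_feats)}
        edges = {(i, j): rel_pos[(i, j)]
                 for i in nodes for j in nodes
                 if i < j and (i, j) in rel_pos}
        return nodes, edges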
In a possible implementation, the acquisition module 1201 is further configured to acquire the category of each target text segment and the association information between every two target text segments; and the acquisition module 1201 is configured to determine the association information between each pair of target text segments from the association information between every two target text segments based on the category of each target text segment.
In a possible implementation, the acquisition module 1201 is configured to select, based on the category of each target text segment, text segments to be associated whose category is the target category from the multiple target text segments included in the target text image; and to select the association information between every two text segments to be associated from the association information between every two target text segments, so as to obtain the association information between each pair of target text segments.
In a possible implementation, the acquisition module 1201 is further configured to acquire a target model; and the acquisition module 1201 is configured to acquire the association information between each pair of target text segments according to the target model.
In the above technical solution, the association information between each pair of target text segments is used to characterize the possibility of association between the pair of target text segments. Therefore, when the association result between each pair of target text segments is determined from the association information between the pair, association errors can be reduced and the accuracy of the association results is improved, so that the accuracy of the text information is improved when the text information in the target text image is extracted based on the association results between each pair of target text segments.
It should be understood that, when the apparatus provided in FIG. 12 implements its functions, the division into the above functional modules is merely illustrative. In practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided in the above embodiment and the method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
FIG. 13 is a schematic structural diagram of an apparatus for acquiring a target model provided by an embodiment of the present application. As shown in FIG. 13, the apparatus includes:

a first acquisition module 1301, configured to acquire a sample text image, the sample text image including a plurality of sample text segments;

a second acquisition module 1302, configured to acquire predicted association information between at least one pair of sample text segments and annotated association information between the at least one pair of sample text segments; and

a third acquisition module 1303, configured to acquire a target model based on the predicted association information between each pair of sample text segments and the annotated association information between each pair of sample text segments.
In a possible implementation, the apparatus further includes a fourth acquisition module, configured to acquire the predicted category of each sample text segment in each pair of sample text segments and the annotated category of each sample text segment in each pair of sample text segments; and the third acquisition module 1303 is configured to acquire the target model based on the predicted association information between each pair of sample text segments, the annotated association information between each pair of sample text segments, the predicted category of each sample text segment in each pair of sample text segments and the annotated category of each sample text segment in each pair of sample text segments.
In a possible implementation, the apparatus further includes a fifth acquisition module, configured to acquire the features of each sample text segment in each pair of sample text segments, the features of each sample text segment including at least one of the image features of the image area in which the sample text segment is located or the text features of the sample text segment; and the third acquisition module 1303 is configured to acquire the target model based on the features of each sample text segment in each pair of sample text segments, the predicted association information between each pair of sample text segments and the annotated association information between each pair of sample text segments.
In a possible implementation, the third acquisition module 1303 is configured to acquire the annotated category of each sample text segment in each pair of sample text segments; to determine, for each annotated category, a feature average of the annotated category based on the features of the sample text segments in the annotated category; and to acquire the target model based on the feature average of each annotated category, the predicted association information between each pair of sample text segments and the annotated association information between each pair of sample text segments.
In the above technical solution, the target model is acquired based on the predicted association information between at least one pair of sample text segments and the annotated association information between the at least one pair of sample text segments, so that the target model learns the association information between any pair of text segments, which helps to reduce association errors and improve the accuracy of the text information.
It should be understood that, when the apparatus provided in FIG. 13 implements its functions, the division into the above functional modules is merely illustrative. In practical applications, the above functions can be allocated to different functional modules as required, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided in the above embodiment and the method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
FIG. 14 shows a structural block diagram of a terminal device 1400 provided by an exemplary embodiment of the present application. The terminal device 1400 includes a processor 1401 and a memory 1402.
The processor 1401 includes one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1401 is implemented in at least one hardware form among a DSP (Digital Signal Processing) unit, an FPGA (Field-Programmable Gate Array) and a PLA (Programmable Logic Array). The processor 1401 also includes a main processor and a coprocessor: the main processor is a processor used to process data in the wake-up state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, the processor 1401 is integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1401 also includes an AI (Artificial Intelligence) processor used to process computing operations related to machine learning.
The memory 1402 includes one or more computer-readable storage media, which are non-transitory. The memory 1402 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1402 is used to store at least one computer program, which is executed by the processor 1401 to implement the text information extraction method or the method for acquiring a target model provided by the method embodiments of this application.
In some embodiments, the terminal device 1400 optionally further includes a peripheral device interface 1403 and at least one peripheral device. The processor 1401, the memory 1402 and the peripheral device interface 1403 are connected by a bus or signal line. Each peripheral device is connected to the peripheral device interface 1403 through a bus, a signal line or a circuit board. Specifically, the peripheral device includes a display screen 1405.
The peripheral device interface 1403 can be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1401 and the memory 1402. In some embodiments, the processor 1401, the memory 1402 and the peripheral device interface 1403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1401, the memory 1402 and the peripheral device interface 1403 are implemented on a separate chip or circuit board, which is not limited in this embodiment.
The display screen 1405 is used to display a UI (User Interface). The UI includes graphics, text, icons, video and any combination thereof. When the display screen 1405 is a touch display screen, it also has the ability to collect touch signals on or above its surface; the touch signal is input to the processor 1401 as a control signal for processing. In this case, the display screen 1405 is also used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 1405, arranged on the front panel of the terminal device 1400; in other embodiments, there are at least two display screens 1405, arranged on different surfaces of the terminal device 1400 or in a folding design; in still other embodiments, the display screen 1405 is a flexible display screen arranged on a curved or folded surface of the terminal device 1400. The display screen 1405 can even be set in a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 1405 is made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode) display.
Those skilled in the art can understand that the structure shown in FIG. 14 does not constitute a limitation on the terminal device 1400, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 1500 may vary greatly due to different configurations or performance, and includes one or more processors 1501 and one or more memories 1502, where at least one computer program is stored in the one or more memories 1502, and the at least one computer program is loaded and executed by the one or more processors 1501 to implement the text information extraction method or the method for acquiring a target model provided by the above method embodiments; for example, the processor 1501 is a CPU. Of course, the server 1500 also has components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and the server 1500 also includes other components for implementing device functions, which will not be described here.
In an exemplary embodiment, a computer-readable storage medium is also provided, the storage medium storing at least one computer program, which is loaded and executed by a processor to cause an electronic device to implement any of the above text information extraction methods or methods for acquiring a target model.
Optionally, the above computer-readable storage medium is a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program or computer program product is also provided, the computer program or computer program product storing at least one computer program, which is loaded and executed by a processor to cause a computer to implement any of the above text information extraction methods or methods for acquiring a target model.
It should be understood that "a plurality of" mentioned herein means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships can exist; for example, A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
The above serial numbers of the embodiments of the present application are only for description and do not represent the advantages or disadvantages of the embodiments.
The above are only exemplary embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, etc. made within the principles of the present application shall be included in the protection scope of the present application.

Claims (19)

1. A text information extraction method, executed by a terminal, the method comprising:
    acquiring a target text image, the target text image including a plurality of target text segments;
    acquiring association information between at least one pair of target text segments, the association information being used to characterize the possibility of association between each pair of target text segments;
    determining an association result between each pair of target text segments based on the association information between each pair of target text segments; and
    extracting text information from the target text image based on the association result between each pair of target text segments.
2. The method according to claim 1, wherein acquiring the association information between each pair of target text segments comprises:
    acquiring at least one of features of each pair of target text segments or relative position features between each pair of target text segments, wherein the features of each target text segment in each pair of target text segments include at least one of image features of the image area in which the target text segment is located or text features of the target text segment, and the relative position features between each pair of target text segments are used to characterize the relative position between the image areas in which the pair of target text segments are located; and
    determining the association information between each pair of target text segments based on at least one of the features of the at least one pair of target text segments or the relative position features between the at least one pair of target text segments.
3. The method according to claim 2, wherein the features of each target text segment in each pair of target text segments include the image features of the image area in which the target text segment is located, and acquiring the features of each target text segment in each pair of target text segments comprises:
    acquiring image features of the target text image; and
    determining the image features of the image area in which the target text segment is located based on the image features of the target text image and position information of the image area in which the target text segment is located.
4. The method according to claim 2, wherein the features of each target text segment in each pair of target text segments include the text features of the target text segment, and acquiring the features of each target text segment in each pair of target text segments comprises:
    acquiring a word vector of each word in the target text segment; and
    fusing the word vectors of the words in the target text segment to obtain the text features of the target text segment.
5. The method according to claim 2, wherein the features of each target text segment in each pair of target text segments include the image features of the image area in which the target text segment is located and the text features of the target text segment, and acquiring the features of each target text segment in each pair of target text segments comprises:
    segmenting the image features of the image area in which the target text segment is located into a target number of image feature blocks, and segmenting the text features of the target text segment into the target number of text feature blocks;
    fusing each image feature block with the text feature block associated with the image feature block to obtain a fused feature block; and
    splicing the fused feature blocks to obtain the features of the target text segment.
6. The method according to claim 2, wherein acquiring the relative position features between each pair of target text segments comprises:
    acquiring position information of the image areas in which each pair of target text segments are located; and
    determining the relative position features between each pair of target text segments based on the position information of the image areas in which the pair of target text segments are located and size information of the target text image.
7. The method according to claim 2, wherein determining the association information between each pair of target text segments based on the features of at least one pair of target text segments and the relative position features between at least one pair of target text segments comprises:
    constructing a graph structure based on the features of the at least one pair of target text segments and the relative position features between the at least one pair of target text segments, the graph structure including at least two nodes and at least one edge, each node representing the features of one target text segment, and each edge representing the relative position features between the pair of target text segments indicated by the pair of nodes connected by the edge; and
    determining the association information between each pair of target text segments based on the graph structure.
8. The method according to claim 1, further comprising:
    acquiring a category of each target text segment and association information between every two target text segments;
    wherein acquiring the association information between each pair of target text segments comprises:
    determining the association information between each pair of target text segments from the association information between every two target text segments based on the category of each target text segment.
9. The method according to claim 8, wherein determining the association information between each pair of target text segments from the association information between every two target text segments based on the category of each target text segment comprises:
    selecting, based on the category of each target text segment, text segments to be associated whose category is a target category from the plurality of target text segments included in the target text image; and
    selecting the association information between every two text segments to be associated from the association information between every two target text segments, to obtain the association information between each pair of target text segments.
10. The method according to any one of claims 1 to 9, further comprising:
    acquiring a target model;
    wherein acquiring the association information between each pair of target text segments comprises:
    acquiring the association information between each pair of target text segments according to the target model.
11. A method for acquiring a target model, executed by a server, the method comprising:
    acquiring a sample text image, the sample text image including a plurality of sample text segments;
    acquiring predicted association information and annotated association information between at least one pair of sample text segments; and
    acquiring a target model based on the predicted association information and the annotated association information between each pair of sample text segments.
12. The method according to claim 11, further comprising:
    acquiring a predicted category and an annotated category of each sample text segment in each pair of sample text segments;
    wherein acquiring the target model based on the predicted association information and the annotated association information between each pair of sample text segments comprises:
    acquiring the target model based on the predicted association information and the annotated association information between each pair of sample text segments, and the predicted category and the annotated category of each sample text segment in each pair of sample text segments.
13. The method according to claim 11, further comprising:
    acquiring features of each sample text segment in each pair of sample text segments, the features of each sample text segment including at least one of image features of the image area in which the sample text segment is located or text features of the sample text segment;
    wherein acquiring the target model based on the predicted association information and the annotated association information between each pair of sample text segments comprises:
    acquiring the target model based on the features of each sample text segment in each pair of sample text segments, and the predicted association information and the annotated association information between each pair of sample text segments.
14. The method according to claim 13, wherein acquiring the target model based on the features of each sample text segment in each pair of sample text segments, and the predicted association information and the annotated association information between each pair of sample text segments comprises:
    acquiring the annotated category of each sample text segment in each pair of sample text segments;
    for each annotated category, determining a feature average of the annotated category based on the features of the sample text segments in the annotated category; and
    acquiring the target model based on the feature average of each annotated category, and the predicted association information and the annotated association information between each pair of sample text segments.
15. A text information extraction apparatus, the apparatus comprising:
    an acquisition module, configured to acquire a target text image, the target text image comprising a plurality of target text segments;
    the acquisition module being further configured to acquire association information between at least one pair of target text segments, the association information being used to characterize the possibility of association between each pair of target text segments;
    a determination module, configured to determine an association result between each pair of target text segments based on the association information between each pair of target text segments; and
    an extraction module, configured to extract text information in the target text image based on the association result between each pair of target text segments.
16. An apparatus for acquiring a target model, the apparatus comprising:
    a first acquisition module, configured to acquire a sample text image, the sample text image comprising a plurality of sample text segments;
    a second acquisition module, configured to acquire predicted association information and annotated association information between at least one pair of sample text segments; and
    a third acquisition module, configured to acquire a target model based on the predicted association information and the annotated association information between each pair of sample text segments.
17. An electronic device, comprising a processor and a memory, the memory storing at least one computer program, the at least one computer program being loaded and executed by the processor to cause the electronic device to implement the text information extraction method according to any one of claims 1 to 10 or the method for acquiring a target model according to any one of claims 11 to 14.
18. A computer-readable storage medium, storing at least one computer program, the at least one computer program being loaded and executed by a processor to cause a computer to implement the text information extraction method according to any one of claims 1 to 10 or the method for acquiring a target model according to any one of claims 11 to 14.
19. A computer program product, storing at least one computer program, the at least one computer program being loaded and executed by a processor to cause a computer to implement the text information extraction method according to any one of claims 1 to 10 or the method for acquiring a target model according to any one of claims 11 to 14.
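
To make the claimed flow concrete, a minimal sketch of the determining step in claim 10 (obtaining association information between each pair of target text segments according to the target model) and the association-result step of claims 1 and 15 follows. This is not the implementation disclosed in the application: the sigmoid, the 0.5 threshold, and all names below are illustrative assumptions, and grouping by union-find is only one plausible way to turn pairwise association results into extractable groups.

import torch

def association_results(assoc_logits: torch.Tensor, threshold: float = 0.5):
    # assoc_logits: (n, n) pairwise association scores produced by a target model.
    probs = torch.sigmoid(assoc_logits)   # association information -> probabilities
    linked = probs > threshold            # association result for each pair
    # Union-find merges mutually associated segments into groups, so that
    # text information can be read out of the image group by group.
    n = linked.size(0)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(n):
            if linked[i, j]:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())           # indices of associated text segments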
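
The training procedure of claims 11 to 14 can be sketched in the same spirit. Under the same caveat, the following is one reading, assuming PyTorch: an association loss compares predicted and annotated association information (claim 11), a classification loss compares predicted and annotated categories (claim 12), image and text features are fused per segment (claim 13), and a center-style term pulls each segment's features toward the feature average of its annotated category (claim 14). Every module, dimension, and loss weight here is a hypothetical choice, not a detail taken from the application.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetModel(nn.Module):
    def __init__(self, image_dim=64, text_dim=64, hidden=128, num_classes=4):
        super().__init__()
        self.fuse = nn.Linear(image_dim + text_dim, hidden)  # feature fusion (claim 13)
        self.classifier = nn.Linear(hidden, num_classes)     # per-segment category (claim 12)
        self.pair_scorer = nn.Linear(2 * hidden, 1)          # pairwise association (claim 11)

    def forward(self, image_feats, text_feats):
        h = torch.relu(self.fuse(torch.cat([image_feats, text_feats], dim=-1)))
        logits = self.classifier(h)
        n = h.size(0)
        # Concatenate features of every ordered segment pair (i, j).
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        assoc = self.pair_scorer(pairs).squeeze(-1)          # (n, n) association logits
        return h, logits, assoc

def training_step(model, optimizer, image_feats, text_feats,
                  gt_classes, gt_assoc, center_weight=0.1):
    h, logits, assoc = model(image_feats, text_feats)
    loss_assoc = F.binary_cross_entropy_with_logits(assoc, gt_assoc)  # claim 11
    loss_cls = F.cross_entropy(logits, gt_classes)                    # claim 12
    loss_center = 0.0
    for c in gt_classes.unique():                                     # claim 14
        members = h[gt_classes == c]
        center = members.mean(dim=0).detach()  # feature average of the annotated category
        loss_center = loss_center + ((members - center) ** 2).mean()
    loss = loss_assoc + loss_cls + center_weight * loss_center
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with random data: 6 segments, 4 annotated categories.
model = TargetModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
img, txt = torch.randn(6, 64), torch.randn(6, 64)
classes = torch.randint(0, 4, (6,))
assoc_labels = (torch.rand(6, 6) > 0.5).float()
print(training_step(model, opt, img, txt, classes, assoc_labels))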
PCT/CN2023/081379 2022-04-19 2023-03-14 Text information extraction method and apparatus, target model acquisition method and apparatus, and device WO2023202268A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210411039.7A CN114511864B (en) 2022-04-19 2022-04-19 Text information extraction method, target model acquisition method, device and equipment
CN202210411039.7 2022-04-19

Publications (1)

Publication Number Publication Date
WO2023202268A1 (en) 2023-10-26

Family

ID=81554813

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081379 WO2023202268A1 (en) 2022-04-19 2023-03-14 Text information extraction method and apparatus, target model acquisition method and apparatus, and device

Country Status (2)

Country Link
CN (1) CN114511864B (en)
WO (1) WO2023202268A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511864B (en) * 2022-04-19 2023-01-13 腾讯科技(深圳)有限公司 Text information extraction method, target model acquisition method, device and equipment
CN116030466B (en) * 2023-03-23 2023-07-04 深圳思谋信息科技有限公司 Image text information identification and processing method and device and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021128578A1 (en) * 2019-12-27 2021-07-01 深圳市商汤科技有限公司 Image processing method and apparatus, electronic device, and storage medium
CN113591864A (en) * 2021-07-28 2021-11-02 北京百度网讯科技有限公司 Training method, device and system for text recognition model framework
CN113591657A (en) * 2021-07-23 2021-11-02 京东科技控股股份有限公司 OCR (optical character recognition) layout recognition method and device, electronic equipment and medium
CN114332889A (en) * 2021-08-26 2022-04-12 腾讯科技(深圳)有限公司 Text box ordering method and text box ordering device for text image
CN114511864A (en) * 2022-04-19 2022-05-17 腾讯科技(深圳)有限公司 Text information extraction method, target model acquisition method, device and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109242400A (en) * 2018-11-02 2019-01-18 南京信息工程大学 Logistics express waybill number recognition method based on a convolutional gated recurrent neural network
CN111126389A (en) * 2019-12-20 2020-05-08 腾讯科技(深圳)有限公司 Text detection method and device, electronic equipment and storage medium
CN112801099B (en) * 2020-06-02 2024-05-24 腾讯科技(深圳)有限公司 Image processing method, device, terminal equipment and medium
CN112036395B (en) * 2020-09-04 2024-05-28 联想(北京)有限公司 Text classification recognition method and device based on target detection
CN113343982B (en) * 2021-06-16 2023-07-25 北京百度网讯科技有限公司 Entity relation extraction method, device and equipment for multi-modal feature fusion

Also Published As

Publication number Publication date
CN114511864B (en) 2023-01-13
CN114511864A (en) 2022-05-17

Similar Documents

Publication Publication Date Title
US11868889B2 (en) Object detection in images
US11544550B2 (en) Analyzing spatially-sparse data based on submanifold sparse convolutional neural networks
US10402703B2 (en) Training image-recognition systems using a joint embedding model on online social networks
US10726208B2 (en) Consumer insights analysis using word embeddings
WO2023202268A1 (en) Text information extraction method and apparatus, target model acquisition method and apparatus, and device
CN106776673B (en) Multimedia document summarization
US10685183B1 (en) Consumer insights analysis using word embeddings
US11182806B1 (en) Consumer insights analysis by identifying a similarity in public sentiments for a pair of entities
US10083379B2 (en) Training image-recognition systems based on search queries on online social networks
CN103268317B System and method for semantically annotating images
US20140250120A1 (en) Interactive Multi-Modal Image Search
CN111897964A (en) Text classification model training method, device, equipment and storage medium
WO2023065211A1 (en) Information acquisition method and apparatus
US20140337005A1 (en) Cross-lingual automatic query annotation
US10558759B1 (en) Consumer insights analysis using word embeddings
US10509863B1 (en) Consumer insights analysis using word embeddings
US10803248B1 (en) Consumer insights analysis using word embeddings
CN109471944A Training method and apparatus for a text classification model, and readable storage medium
US20210303864A1 (en) Method and apparatus for processing video, electronic device, medium and product
WO2022156525A1 (en) Object matching method and apparatus, and device
US11030539B1 (en) Consumer insights analysis using word embeddings
US10685184B1 (en) Consumer insights analysis using entity and attribute word embeddings
CN112396091B (en) Social media image popularity prediction method, system, storage medium and application
CN111814481B (en) Shopping intention recognition method, device, terminal equipment and storage medium
CN114328798A (en) Processing method, device, equipment, storage medium and program product for searching text

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23790929

Country of ref document: EP

Kind code of ref document: A1