CN117253239A - End-to-end document image translation method and device integrating layout information - Google Patents

End-to-end document image translation method and device integrating layout information

Info

Publication number
CN117253239A
CN117253239A (Application CN202311189129.7A)
Authority
CN
China
Prior art keywords
word
document image
translated
translation
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311189129.7A
Other languages
Chinese (zh)
Inventor
张志扬 (Zhang Zhiyang)
张亚萍 (Zhang Yaping)
向露 (Xiang Lu)
周玉 (Zhou Yu)
宗成庆 (Zong Chengqing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation, Chinese Academy of Sciences
Original Assignee
Institute of Automation, Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202311189129.7A priority Critical patent/CN117253239A/en
Publication of CN117253239A publication Critical patent/CN117253239A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/244 Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
    • G06V30/2445 Alphabet recognition, e.g. Latin, Kanji or Katakana
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/246 Division of the character sequences into groups prior to recognition; Selection of dictionaries using linguistic properties, e.g. specific for English or German language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an end-to-end document image translation method and device fusing layout information. The method comprises: acquiring a character recognition result of a document image to be translated, the character recognition result comprising a plurality of words in the document image and the two-dimensional coordinate information of each word, the two-dimensional coordinate information being determined based on pixel values of the document image; obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, the one-dimensional position information indicating the position of a word in the word sequence, i.e. the one-dimensional sequence formed by all words recognized from the document image; and decoding the first feature vector to obtain the translated text corresponding to the document image to be translated. The end-to-end document image translation method fusing layout information provided by the invention effectively improves the document translation effect.

Description

End-to-end document image translation method and device integrating layout information
Technical Field
The invention relates to the technical field of natural language processing, in particular to an end-to-end document image translation method and device integrating layout information.
Background
Document images are images produced by scanning or photographing text on paper or other physical surfaces. Document image translation aims to automatically translate the text embedded in a document image from a source language to a target language, and is one of the key technologies for automating document information processing.
To cope with the flexible and variable layouts of document images, most existing methods first use a deep-learning-based layout analysis model and logical order detection model to automatically analyze the layout and logical order of the document image, thereby extracting source language text in logical order, which is then fed to a translation model to obtain the target language translation. However, these methods splice multiple modules (layout analysis, logical order detection, sentence segmentation and translation) together in a cascade; each module is trained independently on specific data, and the lack of information interaction between modules hinders adaptation between them. In addition, due to the cascade structure, errors of the layout analysis module accumulate and are amplified during the forward pass, negatively affecting the translation result.
Disclosure of Invention
The invention provides an end-to-end document image translation method and device integrating layout information, which are used for solving the problem of poor translation effect of a document image analysis model with a cascade structure in the prior art.
The invention provides an end-to-end document image translation method fusing layout information, which comprises the following steps:
acquiring a character recognition result of a document image to be translated, wherein the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from the document image to be translated;
and decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
In some embodiments, the obtaining the first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word, and the one-dimensional position information of each word includes:
coding the text corresponding to each word to obtain a text feature vector;
encoding the two-dimensional coordinate information of each word to obtain a two-dimensional coordinate feature vector;
encoding the one-dimensional position information of each word to obtain a one-dimensional position feature vector;
and carrying out feature fusion on the text feature vector, the two-dimensional coordinate feature vector and the one-dimensional position feature vector to obtain the first feature vector.
In some embodiments, the decoding the first feature vector to obtain a translated text corresponding to the document image to be translated includes:
decoding the first feature vector to obtain a first hidden layer vector corresponding to each word, wherein the first hidden layer vector is used for indicating the reading order of each word;
determining a sentence boundary class label corresponding to each word based on the first hidden layer vector, wherein the sentence boundary class label is used for indicating whether each word is a sentence starting word or not;
determining semantic features of each source language sentence corresponding to all words based on the first hidden layer vector and the sentence boundary class label;
and determining the translation text corresponding to the document image to be translated based on the semantic features of each source language sentence.
In some embodiments, the determining, based on the first hidden layer vector, a sentence boundary category label corresponding to each word includes:
encoding the first hidden layer vector to obtain a second feature vector corresponding to each word, wherein the second feature vector is used for indicating the sentence boundary of the source language sentence corresponding to each word;
and determining a sentence boundary category label corresponding to each word based on the second feature vector.
In some embodiments, the determining the semantic feature of each source language sentence corresponding to all words based on the first hidden layer vector and the sentence boundary class label includes:
dividing the first hidden layer vector based on the sentence boundary category label corresponding to each word to obtain a second hidden layer vector corresponding to each source language sentence;
and taking the second hidden layer vector as the semantic feature of each source language sentence.
In some embodiments, the determining the translated text corresponding to the document image to be translated based on the semantic features of each source language sentence includes:
decoding semantic features of each source language sentence to obtain a target language sentence corresponding to each source language sentence;
and splicing the target language sentences corresponding to the source language sentences in sequence to obtain the translated text corresponding to the document image to be translated.
The invention also provides an end-to-end document image translation device fusing layout information, which comprises:
the device comprises an acquisition module, a translation module and a translation module, wherein the acquisition module is used for acquiring a character recognition result of a document image to be translated, the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
the coding module is used for obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from the document image to be translated;
and the decoding module is used for decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the end-to-end document image translation method fusing layout information according to any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an end-to-end document image translation method of fusing layout information as any one of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements the end-to-end document image translation method of fusing layout information as described in any one of the above.
The end-to-end document image translation method and device fusing layout information provided by the invention realize joint encoding of the layout information and text information of the document image to be translated by combining the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word to obtain a first feature vector; the first feature vector is then decoded to obtain the translated text corresponding to the document image to be translated. This strengthens the interaction and adaptation between the encoding and decoding processes and effectively improves the translation of document images with different formats and layout structures.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the end-to-end document image translation method fusing layout information provided by the invention;
FIG. 2 is the first schematic diagram of optical character recognition in the end-to-end document image translation method fusing layout information provided by the invention;
FIG. 3 is the second schematic diagram of optical character recognition in the end-to-end document image translation method fusing layout information provided by the invention;
FIG. 4 is a schematic diagram of the cascade model structure and the end-to-end model structure of the end-to-end document image translation method fusing layout information provided by the invention;
FIG. 5 is a schematic diagram of a model framework of an end-to-end document image translation method for fusing layout information provided by the invention;
FIG. 6 is a schematic diagram of an end-to-end document image translation device for fusing layout information provided by the invention;
fig. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the related art, a document refers to a paper, image or electronic file containing text; documents exist in large quantities in daily life and in communication channels such as the Internet.
Unlike plain text, the text in a document image is organized by a layout and a logical order. Although text content can be extracted by manually designed rules for analyzing layout and logical order, and document image translation can then be realized through plain-text machine translation, in practice documents from different domains and in different formats often exhibit inconsistent layout structures and logical orders, so such a method cannot handle different types of document images at the same time and generalizes very poorly.
The invention provides an end-to-end document image translation method fusing layout information based on the Transformer encoder-decoder architecture, which effectively improves the translation of document images with different formats and layout structures.
The end-to-end document image translation method and apparatus for fusing layout information according to the present invention will be described with reference to fig. 1 to 7.
FIG. 1 is a schematic flow chart of an end-to-end document image translation method for fusing layout information. Referring to fig. 1, the end-to-end document image translation method for fusing layout information provided by the present invention includes: step 110, step 120 and step 130.
Step 110, acquiring a character recognition result of a document image to be translated, wherein the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
step 120, obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from a document image to be translated;
and 130, decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
The execution subject of the end-to-end document image translation method fusing layout information provided by the invention can be electronic equipment, a component, an integrated circuit or a chip in the electronic equipment. The electronic device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a cell phone, tablet computer, notebook computer, palm computer, vehicle mounted electronic device, wearable device, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., and the non-mobile electronic device may be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., without limitation of the present invention.
The technical scheme of the invention is described in detail below by taking a computer to execute the end-to-end document image translation method of the fusion layout information provided by the invention as an example.
In step 110, an optical character recognition process is performed on the document image to be translated, and at least one word and two-dimensional layout coordinates of each word in the document image to be translated are extracted.
In actual implementation, a coordinate scale of 1000 is set, and the height and width of the document image to be translated are normalized to the pixel value interval [0, 1000]. The scale can be set according to actual requirements and is not specifically limited here.
As shown in fig. 2, the normalized document image 210 to be translated is processed by an optical character recognition engine to obtain text and two-dimensional layout coordinates corresponding to each word. The two-dimensional layout coordinates are used for indicating two-dimensional coordinate information of each word in the document image to be translated.
As shown in fig. 3, a rectangular bounding box is displayed around each word in the processed document image 310 to be translated. Each bounding box marks a word recognized by the optical character recognition engine and its two-dimensional coordinate position; the two-dimensional coordinate information of each word is uniquely determined by the upper-left and lower-right corner coordinates of the rectangular bounding box surrounding it.
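As a concrete illustration of this normalization and box representation, the following is a minimal sketch, not taken from the patent; the OCR output format, function name and example numbers are assumptions. It maps raw pixel bounding boxes to the [0, 1000] coordinate range described above:

```python
# Hypothetical helper: normalize OCR word boxes to the [0, 1000] range.
# The input format (word, (x0, y0, x1, y1)) is an assumption; real OCR
# engines expose their own result structures.

def normalize_boxes(ocr_words, image_width, image_height, scale=1000):
    """Map each word's pixel bounding box to the [0, scale] interval.

    ocr_words: list of (text, (x0, y0, x1, y1)) in raw pixel coordinates,
    where (x0, y0) is the upper-left and (x1, y1) the lower-right corner.
    """
    normalized = []
    for text, (x0, y0, x1, y1) in ocr_words:
        box = (
            round(x0 / image_width * scale),
            round(y0 / image_height * scale),
            round(x1 / image_width * scale),
            round(y1 / image_height * scale),
        )
        w = box[2] - box[0]  # normalized box width
        h = box[3] - box[1]  # normalized box height
        normalized.append((text, box + (w, h)))
    return normalized

# Toy example: one word on a 2480 x 3508 pixel scan.
print(normalize_boxes([("Strategy", (310, 880, 620, 940))], 2480, 3508))
```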
In step 120, a layout-aware encoder is used to jointly encode the text, the two-dimensional coordinate information and the one-dimensional position information corresponding to each word, obtaining a layout-aware feature representation, i.e. the first feature vector.
It will be appreciated that the word text recognized from the document image to be translated forms a one-dimensional sequence, and each word in the sequence is given a position index starting from 1. This position index is the one-dimensional sequence position information, i.e. the one-dimensional position information.
In step 130, based on the first feature vector (the layout-aware feature representation), reading order decoding, sentence boundary decoding and translation decoding are performed in order by a multi-step conduction decoder to obtain the final document image translation chapter, i.e. the translated text corresponding to the document image to be translated.
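Before each step is detailed, the control-flow sketch below shows how the layout-aware encoder and the three decoding stages of the multi-step conduction decoder fit together. All function names here are hypothetical placeholders for the components described in the remainder of this section; the toy stand-ins exist only to make the flow runnable:

```python
# Hypothetical end-to-end flow: encode once, then decode in three steps.
def translate_document_image(words, boxes, encoder, order_dec, boundary_dec,
                             trans_dec, splitter):
    features = encoder(words, boxes)        # step 120: first feature vector
    ordered = order_dec(features)           # step 130: reading order decoding
    labels = boundary_dec(ordered)          # step 130: sentence boundary decoding
    sentences = splitter(ordered, labels)   # split into source sentences
    return " ".join(trans_dec(s) for s in sentences)  # translation decoding

# Toy stand-ins just to exercise the control flow.
out = translate_document_image(
    ["The", "plan", ".", "Yes", "."], [(0, 0, 1, 1, 1, 1)] * 5,
    encoder=lambda w, b: w,
    order_dec=lambda f: f,
    boundary_dec=lambda h: [0, 1, 1, 0, 1],            # BOS=0, IOS=1
    trans_dec=lambda s: "<%d-word sentence translated>" % len(s),
    splitter=lambda h, l: [h[0:3], h[3:5]],
)
print(out)
```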
The end-to-end document image translation method fusing layout information jointly encodes the layout information and text information of the document image to be translated by combining the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word to obtain the first feature vector; the first feature vector is then decoded to obtain the translated text corresponding to the document image to be translated. This strengthens the interaction and adaptation between the encoding and decoding processes and effectively improves the translation of document images with different formats and layout structures.
The above steps are described in detail below with reference to fig. 4 and 5.
As shown in fig. 4, the conventional cascade model structure splices together independently trained modules such as layout analysis, logical order detection, sentence segmentation and translation, and therefore cannot be jointly optimized for the translation objective. The invention instead provides an end-to-end model structure based on a layout-aware encoder and a text decoder.
In some embodiments, step 120 may include:
encoding the text corresponding to each word to obtain a text feature vector;
encoding the two-dimensional coordinate information of each word to obtain a two-dimensional coordinate feature vector;
encoding the one-dimensional position information of each word to obtain a one-dimensional position feature vector;
and carrying out feature fusion on the text feature vector, the two-dimensional coordinate feature vector and the one-dimensional position feature vector to obtain a first feature vector.
In actual execution, the text, the two-dimensional coordinate information and the one-dimensional position information corresponding to each word are encoded by their respective embedding layers to obtain the text feature vector, the two-dimensional coordinate feature vector and the one-dimensional position feature vector.
After the text feature vector, the two-dimensional coordinate feature vector and the one-dimensional position feature vector are added together, context encoding and feature fusion are performed through a Transformer encoder to obtain the layout-aware feature representation, i.e. the first feature vector.
As shown in fig. 5, the text corresponding to the words is expressed as: "The" … "Strategy", "Vision" … "yes";
the one-dimensional position information is expressed as: $i=1 \dots i=5$, $i=6 \dots i=L$, where $L$ is the total number of recognized words;
the two-dimensional coordinate information is expressed as: $(x_0,y_0,x_1,y_1,w,h)_1 \dots (x_0,y_0,x_1,y_1,w,h)_5$, $(x_0,y_0,x_1,y_1,w,h)_6 \dots (x_0,y_0,x_1,y_1,w,h)_L$, where $(x_0,y_0)$ and $(x_1,y_1)$ are the upper-left and lower-right corner coordinates of the word's rectangular bounding box, $w$ is the width of the bounding box and $h$ is its height;
the layout-aware feature representation (first feature vector) is expressed as: $X_1^* \dots X_5^*$, $X_6^* \dots X_L^*$.
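A minimal PyTorch sketch of this joint encoding is given below. The module names, embedding sizes and layer counts are illustrative assumptions rather than the patent's exact configuration, and treating each normalized coordinate as an index into its own embedding table is likewise an assumption; the sketch only shows the mechanism of summing text, 1D position and 2D coordinate embeddings and contextualizing them with a Transformer encoder:

```python
import torch
import torch.nn as nn

class LayoutAwareEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, max_len=1024, coord_bins=1001):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # text feature vector
        self.pos_emb = nn.Embedding(max_len, d_model)     # 1D position feature vector
        # One table per coordinate channel: x0, y0, x1, y1, w, h in [0, 1000].
        self.coord_emb = nn.ModuleList(
            [nn.Embedding(coord_bins, d_model) for _ in range(6)]
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, token_ids, boxes):
        # token_ids: (B, L); boxes: (B, L, 6) integer coordinates in [0, 1000].
        B, L = token_ids.shape
        pos = torch.arange(L, device=token_ids.device).expand(B, L)
        h = self.tok_emb(token_ids) + self.pos_emb(pos)
        for c, emb in enumerate(self.coord_emb):
            h = h + emb(boxes[..., c])        # add 2D coordinate feature vector
        return self.encoder(h)                # layout-aware representation X*

enc = LayoutAwareEncoder(vocab_size=32000)
x = enc(torch.randint(0, 32000, (1, 12)), torch.randint(0, 1001, (1, 12, 6)))
print(x.shape)  # torch.Size([1, 12, 512])
```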
According to the end-to-end document image translation method fusing layout information provided by the invention, fusing the text feature vector, the two-dimensional coordinate feature vector and the one-dimensional position feature vector realizes joint encoding of text and layout features, enables joint text-layout understanding of the document image to be translated, and significantly improves the model's translation capability for document images with different formats and layout structures.
In some embodiments, step 130 may include:
decoding the first feature vector to obtain a first hidden layer vector corresponding to each word, wherein the first hidden layer vector is used for indicating the reading order of each word;
determining a sentence boundary class label corresponding to each word based on the first hidden layer vector, wherein the sentence boundary class label is used for indicating whether each word is a sentence starting word or not;
determining semantic features of each source language sentence corresponding to all words based on the first hidden layer vector and the sentence boundary class label;
and determining a translation text corresponding to the document image to be translated based on the semantic features of each source language sentence.
Based on the first feature vector (the layout-aware feature representation), a reading order decoder in the multi-step conduction decoder autoregressively computes the reading order hidden layer vector and the reading order index value corresponding to each word; a greedy decoding strategy is adopted during prediction.
The loss function for reading order decoding is designed as follows:

$$\mathcal{L}_{order} = -\sum_{i=1}^{L} \log \hat{P}_i(\mathrm{Idx}_i)$$

where $\mathrm{Idx}_i$ denotes the reading order index label of the $i$-th word, $\hat{P}_i$ denotes the reading order probability distribution of the $i$-th word predicted by the reading order decoder, and $L$ is the number of document words. This loss function is used to optimize the parameters of the reading order decoder. The reading order hidden layer vector is the first hidden layer vector.
As shown in fig. 5, the first hidden layer vectors (decoder hidden layer vectors) may be represented as $H_1^{rsd} \dots H_5^{rsd}$, $H_6^{rsd} \dots H_L^{rsd}$, with corresponding reading order index labels $\mathrm{Idx}_1 \dots \mathrm{Idx}_5$, $\mathrm{Idx}_6 \dots \mathrm{Idx}_L$.
It should be noted that the reading order index values provide a supervision signal for training the reading order decoder, so that the feature vectors it outputs correspond to a sequence in the correct reading order, which facilitates the subsequent sentence boundary decoding and translation decoding.
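Assuming the reconstruction of the loss above, the objective is an ordinary cross-entropy over per-word reading order distributions. A small sketch (shapes and names are assumptions):

```python
# Cross-entropy over predicted reading-order distributions; a sketch.
import torch
import torch.nn.functional as F

def reading_order_loss(logits, gold_idx):
    """logits: (L, L); row i is word i's distribution over the L positions.
    gold_idx: (L,) gold reading-order index Idx_i of each word."""
    return F.cross_entropy(logits, gold_idx)  # mean of -log P_i(Idx_i)

L = 7
print(reading_order_loss(torch.randn(L, L), torch.randperm(L)).item())
```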
Based on the first hidden layer vector (the reading order hidden layer vector), a sentence boundary vector corresponding to each word is obtained, and the sentence boundary category label of each word is predicted.
The sentence boundary category labels are {BOS, IOS}, where BOS denotes a sentence-initial word and IOS denotes a non-initial word.
Based on the predicted sentence boundary category labels, the first hidden layer vector (the reading order hidden layer vector) is divided into a plurality of sub-vector sequences corresponding to the semantic features of the source language sentences.
Based on the semantic features of each source language sentence, the translation decoder in the multi-step conduction decoder generates the target language translation sentence by sentence, finally obtaining the translated text corresponding to the document image to be translated.
In some embodiments, determining a sentence boundary category label for each word based on the first hidden layer vector includes:
encoding the first hidden layer vector to obtain a second feature vector corresponding to each word, wherein the second feature vector is used for indicating the sentence boundary of the source language sentence corresponding to each word;
based on the second feature vector, a sentence boundary class label corresponding to each word is determined.
In actual implementation, based on the first hidden layer vector (the reading order hidden layer vector), the sentence boundary decoder in the multi-step conduction decoder further performs context encoding on the first hidden layer vector to obtain the sentence boundary vector corresponding to each word, and predicts the sentence boundary category label of each word. The sentence boundary vector is the second feature vector.
The loss function for the sentence boundary decoder can be designed as:

$$\mathcal{L}_{boundary} = -\sum_{i=1}^{L} \log \hat{P}_i(B_i)$$

where $B_i$ denotes the sentence boundary category label of the $i$-th word in the ordered word sequence, $\hat{P}_i$ denotes the sentence boundary probability distribution of the $i$-th word predicted by the sentence boundary decoder, and $L$ is the number of document words. This loss function is used to optimize the parameters of the sentence boundary decoder.
As shown in fig. 5, the sentence boundary vectors (second feature vectors) are expressed as $H_1^{ssd} \dots H_5^{ssd}$, $H_6^{ssd} \dots H_L^{ssd}$, and the semantic features of the source language sentences as $H_1^{sent}, H_2^{sent} \dots H_M^{sent}$.
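A minimal sketch of this boundary decoding step, assuming a small Transformer re-encoder over the reading-order hidden vectors followed by a two-way BOS/IOS classifier (all layer sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceBoundaryDecoder(nn.Module):
    BOS, IOS = 0, 1  # sentence-initial vs. non-initial

    def __init__(self, d_model=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, 2)

    def forward(self, h_rsd):            # h_rsd: (B, L, d_model) ordered vectors
        h_ssd = self.context(h_rsd)      # second feature vector (boundary vector)
        return self.classifier(h_ssd)    # (B, L, 2) BOS/IOS logits

dec = SentenceBoundaryDecoder()
logits = dec(torch.randn(1, 12, 512))
gold = torch.zeros(1, 12, dtype=torch.long)              # toy gold labels
loss = F.cross_entropy(logits.reshape(-1, 2), gold.reshape(-1))
print(logits.argmax(-1).shape, loss.item())
```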
In some embodiments, determining semantic features of each source language sentence corresponding to all words based on the first hidden layer vector and the sentence boundary class label comprises:
dividing the first hidden layer vector based on the sentence boundary class label corresponding to each word to obtain a second hidden layer vector corresponding to each source language sentence;
the second hidden layer vector is used as the semantic feature of each source language sentence.
In actual execution, based on the sentence boundary label corresponding to each word, the first hidden layer vector (the reading order hidden layer vector) is divided by a sentence segmentation rule into a plurality of sub-vector sequences, i.e. the second hidden layer vector of each source language sentence, which is taken as the semantic feature of that source language sentence.
As shown in FIG. 5, $H_1^{rsd}$ to $H_5^{rsd}$ form one sub-vector sequence, and $H_6^{rsd} \dots H_L^{rsd}$ form another sub-vector sequence.
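The segmentation rule itself can be sketched as a simple split at each predicted BOS position; the function below is a hypothetical illustration, not the patent's exact rule:

```python
import torch

def split_by_boundaries(h_rsd, labels, bos=0):
    """h_rsd: (L, d) hidden vectors in reading order; labels: (L,) BOS/IOS ids."""
    starts = [i for i, lab in enumerate(labels.tolist()) if lab == bos] or [0]
    ends = starts[1:] + [len(labels)]
    return [h_rsd[s:e] for s, e in zip(starts, ends)]  # one sequence per sentence

h = torch.randn(6, 4)
labels = torch.tensor([0, 1, 1, 0, 1, 1])  # two sentences of three words each
print([seg.shape for seg in split_by_boundaries(h, labels)])
```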
In some embodiments, determining a translation text corresponding to the document image to be translated based on the semantic features of each source language sentence includes:
decoding semantic features of each source language sentence to obtain a target language sentence corresponding to each source language sentence;
and splicing the target language sentences corresponding to each source language sentence in sequence to obtain a translation text corresponding to the document image to be translated.
In actual implementation, based on the semantic features of each source language sentence, the translation decoder in the multi-step conduction decoder generates the corresponding target language translation sentence by sentence; a beam search decoding strategy may be adopted during generation. All target language sentences are spliced in sequence to obtain the final document image translation chapter, i.e. the translated text corresponding to the document image to be translated. The loss function for the translation decoder is designed as:

$$\mathcal{L}_{trans} = -\sum_{k=1}^{M}\sum_{j=1}^{|Y_k|} \log \hat{P}_{k,j}(Y_{k,j})$$

where $Y_{k,j}$ denotes the $j$-th word of the $k$-th translated sentence, $\hat{P}_{k,j}$ denotes the vocabulary probability distribution for the $j$-th word of the $k$-th translated sentence predicted by the translation decoder, $M$ is the number of sentences, and $|Y_k|$ is the number of words in the $k$-th sentence. This loss function may be used to optimize the parameters of the translation decoder.
As shown in FIG. 5, the semantic features of the source language sentences are $H_1^{sent}, H_2^{sent} \dots H_M^{sent}$, and the corresponding target language sentences are translated sentence 1, translated sentence 2, … translated sentence M.
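Assuming the reconstruction of the translation loss above, it is cross-entropy summed over the $M$ sentences and the $|Y_k|$ target words of each sentence. The decoder itself is elided in this sketch; only the loss aggregation over pre-computed per-sentence vocabulary logits is shown, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def translation_loss(sentence_logits, sentence_targets):
    """sentence_logits: list of (|Y_k|, V) vocab logits, one per sentence;
    sentence_targets: list of (|Y_k|,) gold target-word ids."""
    total = torch.tensor(0.0)
    for logits, target in zip(sentence_logits, sentence_targets):
        total = total + F.cross_entropy(logits, target, reduction="sum")
    return total  # summed -log P over all sentences and words

V = 100  # toy vocabulary size
logits = [torch.randn(5, V), torch.randn(8, V)]            # M = 2 sentences
targets = [torch.randint(0, V, (5,)), torch.randint(0, V, (8,))]
print(translation_loss(logits, targets).item())
```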
In some embodiments, two document image translation datasets, ReadingBank and DITrans, are used to verify the document image translation effect.
The document images of ReadingBank belong to the general domain, while the document images of DITrans cover three specific domains: government reports, news and advertisements. Combining the two makes it possible to verify the document image translation effect on a variety of layout types in both general and specific domains.
In addition, in order to verify the advantage of the method in cross-domain scenarios, a zero-shot cross-domain translation experiment is also carried out on DITrans.
The invention is also compared with existing cascade methods, and end-to-end document image translation baselines fusing layout information are built based on layout-text joint encoders and a text decoder; the results are shown in Table 1 and Table 2.
Table 1: Results of different document image translation methods on the ReadingBank dataset.
Table 2: Results of different document image translation methods on the DITrans dataset under two experimental settings.
As can be seen from Tables 1 and 2, the cascade document image translation methods are four existing methods in the related art: DocHandler-1, DocHandler-2, MGTrans-DETR and MGTrans-Conv.
LayoutLM-Dec and LiLT-Dec are two end-to-end document image translation baselines fusing layout information, constructed by the invention from a layout-text joint encoder and a text decoder.
LayoutDIT is the end-to-end document image translation method fusing layout information provided by the invention.
By comparison, it can be found that:
(1) The method provided by the invention achieves the best results on both datasets, whether in the general domain of the ReadingBank dataset or in the three specific domains of the DITrans dataset, with a clear improvement over existing cascade and end-to-end methods.
(2) The method provided by the invention has a clear advantage in the number of model parameters, making it parameter-efficient compared with most existing methods.
(3) The method provided by the invention shows a particularly significant performance improvement under the zero-shot cross-domain transfer learning setting on the DITrans dataset, i.e. its advantage is even more pronounced in the cross-domain setting.
Existing document image translation methods struggle to achieve joint understanding of text and layout, and their cascaded modules cannot be jointly optimized. The end-to-end document image translation method fusing layout information provided by the invention jointly encodes layout and text features, and its end-to-end modeling allows all sub-modules to be jointly optimized with the translation objective, significantly improving the model's translation capability for document images of different formats and layout structures.
The end-to-end document image translation device for fusing layout information provided by the invention is described below, and the end-to-end document image translation device for fusing layout information described below and the end-to-end document image translation method for fusing layout information described above can be referred to correspondingly.
Fig. 6 is a schematic diagram of the structure of the end-to-end document image translation device for fusing layout information provided by the invention. Referring to fig. 6, the end-to-end document image translation apparatus for fusing layout information provided by the present invention includes: acquisition module 610, encoding module 620, and decoding module 630.
An obtaining module 610, configured to obtain a character recognition result of a document image to be translated, where the character recognition result includes a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on a pixel value of the document image to be translated;
an encoding module 620, configured to obtain a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word, and the one-dimensional position information of each word, where the one-dimensional position information is used to indicate a position of the word in a word sequence, and the word sequence is used to indicate a one-dimensional sequence composed of all words identified from the document image to be translated;
and the decoding module 630 is configured to decode the first feature vector to obtain a translation text corresponding to the document image to be translated.
The end-to-end document image translation device fusing layout information provided by the invention realizes joint encoding of the layout information and text information of the document image to be translated by combining the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word to obtain a first feature vector; the first feature vector is then decoded to obtain the translated text corresponding to the document image to be translated. This strengthens the interaction and adaptation between the encoding and decoding processes and effectively improves the translation of document images with different formats and layout structures.
In some embodiments, the encoding module 620 is specifically configured to:
coding the text corresponding to each word to obtain a text feature vector;
encoding the two-dimensional coordinate information of each word to obtain a two-dimensional coordinate feature vector;
encoding the one-dimensional position information of each word to obtain a one-dimensional position feature vector;
and carrying out feature fusion on the text feature vector, the two-dimensional coordinate feature vector and the one-dimensional position feature vector to obtain the first feature vector.
In some embodiments, the decoding module 630 is specifically configured to:
decoding the first feature vector to obtain a first hidden layer vector corresponding to each word, wherein the first hidden layer vector is used for indicating the reading order of each word;
determining a sentence boundary class label corresponding to each word based on the first hidden layer vector, wherein the sentence boundary class label is used for indicating whether each word is a sentence starting word or not;
determining semantic features of each source language sentence corresponding to all words based on the first hidden layer vector and the sentence boundary class label;
and determining the translation text corresponding to the document image to be translated based on the semantic features of each source language sentence.
In some embodiments, the decoding module 630 is specifically configured to:
encoding the first hidden layer vector to obtain a second feature vector corresponding to each word, wherein the second feature vector is used for indicating the sentence boundary of the source language sentence corresponding to each word;
and determining a sentence boundary category label corresponding to each word based on the second feature vector.
In some embodiments, the decoding module 630 is specifically configured to:
dividing the first hidden layer vector based on the sentence boundary category label corresponding to each word to obtain a second hidden layer vector corresponding to each source language sentence;
and taking the second hidden layer vector as the semantic feature of each source language sentence.
In some embodiments, the decoding module 630 is specifically configured to:
decoding semantic features of each source language sentence to obtain a target language sentence corresponding to each source language sentence;
and splicing the target language sentences corresponding to the source language sentences in sequence to obtain the translated text corresponding to the document image to be translated.
Fig. 7 illustrates a physical schematic diagram of an electronic device, as shown in fig. 7, which may include: processor 710, communication interface (Communications Interface) 720, memory 730, and communication bus 740, wherein processor 710, communication interface 720, memory 730 communicate with each other via communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform an end-to-end document image translation method that fuses layout information, the method comprising:
acquiring a character recognition result of a document image to be translated, wherein the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from the document image to be translated;
and decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can perform an end-to-end document image translation method for fusing layout information provided by the above methods, where the method includes:
acquiring a character recognition result of a document image to be translated, wherein the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from the document image to be translated;
and decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
In still another aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the end-to-end document image translation method of fusing layout information provided by the above methods, the method comprising:
acquiring a character recognition result of a document image to be translated, wherein the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from the document image to be translated;
and decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An end-to-end document image translation method integrating layout information, characterized by comprising the following steps:
acquiring a character recognition result of a document image to be translated, wherein the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from the document image to be translated;
and decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
2. The method for end-to-end document image translation based on layout information fusion according to claim 1, wherein the obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word, and the one-dimensional position information of each word comprises:
coding the text corresponding to each word to obtain a text feature vector;
encoding the two-dimensional coordinate information of each word to obtain a two-dimensional coordinate feature vector;
encoding the one-dimensional position information of each word to obtain a one-dimensional position feature vector;
and carrying out feature fusion on the text feature vector, the two-dimensional coordinate feature vector and the one-dimensional position feature vector to obtain the first feature vector.
3. The method for translating an end-to-end document image with layout information fused according to claim 1, wherein decoding the first feature vector to obtain a translated text corresponding to the document image to be translated comprises:
decoding the first feature vector to obtain a first hidden layer vector corresponding to each word, wherein the first hidden layer vector is used for indicating the reading order of each word;
determining a sentence boundary class label corresponding to each word based on the first hidden layer vector, wherein the sentence boundary class label is used for indicating whether each word is a sentence starting word or not;
determining semantic features of each source language sentence corresponding to all words based on the first hidden layer vector and the sentence boundary class label;
and determining the translation text corresponding to the document image to be translated based on the semantic features of each source language sentence.
4. The method for end-to-end document image translation with layout information fused according to claim 3, wherein determining the sentence boundary class label corresponding to each word based on the first hidden layer vector comprises:
encoding the first hidden layer vector to obtain a second feature vector corresponding to each word, wherein the second feature vector is used for indicating the sentence boundary of the source language sentence corresponding to each word;
and determining a sentence boundary category label corresponding to each word based on the second feature vector.
5. The method for end-to-end document image translation with layout information fusion according to claim 3, wherein determining the semantic feature of each source language sentence corresponding to all the words based on the first hidden layer vector and the sentence boundary class label comprises:
dividing the first hidden layer vector based on the sentence boundary category label corresponding to each word to obtain a second hidden layer vector corresponding to each source language sentence;
and taking the second hidden layer vector as the semantic feature of each source language sentence.
6. The method for end-to-end document image translation based on layout information fusion according to claim 2, wherein determining the translation text corresponding to the document image to be translated based on the semantic feature of each source language sentence comprises:
decoding semantic features of each source language sentence to obtain a target language sentence corresponding to each source language sentence;
and splicing the target language sentences corresponding to the source language sentences in sequence to obtain the translated text corresponding to the document image to be translated.
7. An end-to-end document image translation device fusing layout information, comprising:
the device comprises an acquisition module, a translation module and a translation module, wherein the acquisition module is used for acquiring a character recognition result of a document image to be translated, the character recognition result comprises a plurality of words in the document image to be translated and two-dimensional coordinate information of each word, and the two-dimensional coordinate information is determined based on pixel values of the document image to be translated;
the coding module is used for obtaining a first feature vector based on the text corresponding to each word, the two-dimensional coordinate information of each word and the one-dimensional position information of each word, wherein the one-dimensional position information is used for indicating the position of the word in a word sequence, and the word sequence is used for indicating a one-dimensional sequence formed by all words identified from the document image to be translated;
and the decoding module is used for decoding the first feature vector to obtain a translation text corresponding to the document image to be translated.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the end-to-end document image translation method of fusing layout information as claimed in any one of claims 1 to 6 when the program is executed by the processor.
9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the end-to-end document image translation method of fusing layout information as claimed in any one of claims 1 to 6.
10. A computer program product comprising a computer program which when executed by a processor implements the end-to-end document image translation method of fusing layout information as claimed in any one of claims 1 to 6.
CN202311189129.7A 2023-09-14 2023-09-14 End-to-end document image translation method and device integrating layout information Pending CN117253239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311189129.7A CN117253239A (en) 2023-09-14 2023-09-14 End-to-end document image translation method and device integrating layout information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311189129.7A CN117253239A (en) 2023-09-14 2023-09-14 End-to-end document image translation method and device integrating layout information

Publications (1)

Publication Number Publication Date
CN117253239A 2023-12-19

Family

ID=89134270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311189129.7A Pending CN117253239A (en) 2023-09-14 2023-09-14 End-to-end document image translation method and device integrating layout information

Country Status (1)

Country Link
CN (1) CN117253239A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953109A (en) * 2024-03-27 2024-04-30 杭州果粉智能科技有限公司 Method, system, electronic device and storage medium for translating generated pictures


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination