WO2023134447A1 - Data processing method and related devices - Google Patents

Data processing method and related devices

Info

Publication number
WO2023134447A1
Authority
WO
WIPO (PCT)
Prior art keywords
bounding box
image
iteration
recognition result
processing
Prior art date
Application number
PCT/CN2022/142667
Other languages
English (en)
French (fr)
Other versions
WO2023134447A9 (zh)
Inventor
黄永帅
卢宁
都林
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP22920077.9A (EP4350646A1)
Publication of WO2023134447A1
Publication of WO2023134447A9

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 - Document-oriented image-based pattern recognition
    • G06V 30/41 - Analysis of document content
    • G06V 30/412 - Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 - Character recognition
    • G06V 30/24 - Character recognition characterised by the processing or recognition method
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 - Document-oriented image-based pattern recognition
    • G06V 30/41 - Analysis of document content
    • G06V 30/413 - Classification of content, e.g. text, photographs or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 - Document-oriented image-based pattern recognition
    • G06V 30/41 - Analysis of document content
    • G06V 30/414 - Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Definitions

  • the present application relates to the field of artificial intelligence, in particular to a data processing method, device, system and data processing chip.
  • Image table recognition (table recognition for short) is an artificial intelligence (AI) technology that converts tables in images into editable tables (for example, in hypertext markup language (HTML) or other formats). Image table recognition plays an important role in the automated processing of document formats.
  • in the traditional method, the row and column lines of the table in the image are first detected, and then the intersection points between all the row and column lines included in the table are calculated to restore the coordinates of each cell included in the table (that is, the cell positions).
  • all the cells are then arranged according to the cell positions, and the row and column information of the cells (for example, starting row, starting column, spanned rows or spanned columns) is obtained through a heuristic algorithm to obtain a table recognition result.
  • when the row and column lines are not obvious or are inclined, row and column lines may be missed or intersection points may be calculated incorrectly, so the accuracy of the table recognition results obtained based on this method is poor.
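  • as a rough illustration of the traditional pipeline described above, the following sketch restores cell coordinates by intersecting detected row and column lines (the function name and line coordinates are hypothetical; a real pipeline would first detect the lines in the image). If a row or column line is missed, adjacent cells merge, which is exactly the failure mode noted above:

```python
# Minimal sketch of the traditional intersection-based cell restoration.
# Hypothetical inputs: a real pipeline would detect these lines in the image.

def cells_from_lines(row_ys, col_xs):
    """Each adjacent pair of row lines and column lines bounds one cell;
    the four line intersections give the cell's corner coordinates."""
    cells = []
    for r in range(len(row_ys) - 1):
        for c in range(len(col_xs) - 1):
            cells.append({
                "row": r,
                "col": c,
                "bbox": (col_xs[c], row_ys[r], col_xs[c + 1], row_ys[r + 1]),
            })
    return cells

# A 2-row x 3-column table whose lines were detected at these pixel positions.
for cell in cells_from_lines(row_ys=[10, 50, 90], col_xs=[5, 105, 205, 305]):
    print(cell)
```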
  • the present application provides a data processing method, device, system and data processing chip, which can improve the accuracy of form recognition results.
  • in a first aspect, a data processing method is provided, including: acquiring a table image to be processed; determining a table recognition result from the table image according to a generative table recognition strategy, where the generative table recognition strategy is used to indicate that a markup language and the non-overlapping attribute of bounding boxes are used to determine the table recognition result of the table image, the bounding box is used to indicate the position, in the table associated with the table image, of the text contained in a cell, and the table recognition result is used to indicate the global structure and content of the table; and outputting the table recognition result.
  • a markup language may be used to indicate a local table structure, that is, a partial structure within the global table structure.
  • the table structure may include: the rows of the table, the columns of the table, the cells included in the table, and the bounding box corresponding to the text contained in each cell in the table.
  • the bounding box corresponding to the text may refer to an arbitrary polygonal bounding box surrounding the text included in the cell.
  • the position of the text included in the cell in the table can be understood as the position of the bounding box corresponding to the text included in the cell in the table.
  • in this way, the table can be recognized according to the markup language used to identify the table structure and the positions, in the table, of the text contained in the cells, so as to obtain the table recognition result. This avoids the problem in the traditional technology that recognizing a table based only on its row-column structure (which does not include bounding boxes) yields recognition results of poor accuracy.
  • therefore, the method provided by this application can improve the accuracy of the table recognition result.
  • the bounding box non-overlapping attribute is used to indicate that the areas corresponding to the cells included in the table do not overlap.
  • a bounding box may refer to an arbitrary polygonal box surrounding text contained in a cell.
  • the bounding box can also be called the bounding box corresponding to the text or the cell text block.
  • the cells included in the table are arranged in row order.
  • determining the table recognition result from the table image according to the generative table recognition strategy includes: obtaining the table recognition result through iterative processing according to the table image features and the markup language.
  • the table image features can be used to indicate one or more of the following features: the number of rows in the table, the number of columns in the table, the size of the table, the row-spanning feature of the table, the column-spanning feature of the table, or the layout of the table.
  • the layout of the table includes a markup language used to indicate the structure of the table, and each cell in the table or a bounding box corresponding to the text contained in each cell in the table.
  • the iterative processing includes multiple rounds of iterations
  • the method further includes: determining, according to the table image features and the markup language, the first bounding box and the local structure obtained in a first iteration, where the first iteration is any round of the multiple rounds of iterations, the first bounding box is used to indicate the bounding box of the local structure obtained by the first iteration, and the local structure is a partial structure of the global structure; and when a second iteration obtains the global structure, determining the processing result obtained by the second iteration to be the table recognition result, where the second iteration is an iteration performed after the first iteration in the iterative processing, and the processing result includes the global structure and the content.
  • the bounding box of the local structure obtained in the first iteration indicates the positions, in the table, of the text contained in the cells of that local structure. It can be understood that when the local structure does not include any cell, or every cell included in the local structure is an empty cell (that is, a cell that does not include any text), the bounding box of the local structure is empty.
  • in other words, the processing result obtained by the second iteration is determined, according to the first bounding box and the local structure obtained by the first iteration, to be the table recognition result.
  • the result of the current round of iteration is determined based on the bounding box and the local structure obtained in the previous round of iteration (for example, the first iteration).
  • the execution subject of this method is an AI model. That is, in each round of iteration, this method uses not only the generated local structure (which can be marked with a markup language) as a prior, but also the generated bounding boxes as a prior, and inputs both into the AI model together to guide the AI model's next generation step.
  • this method is equivalent to telling the AI model not only how many cells have been generated before the current round of iteration, but also the specific positions, in the table, of the cells that have already been generated, so that the AI model's attention focuses on the cells that have not yet been generated.
  • This method can effectively reduce the divergence of the attention of the AI model and help improve the accuracy of the table recognition results.
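  • conceptually, the iteration described above can be pictured as the following loop (a sketch under assumptions: `model_step`, the token names and the stop condition are illustrative placeholders, not the patent's actual interfaces); the structure tokens and bounding boxes generated so far are both fed back as priors in every round:

```python
def recognize_table(table_image_features, model_step, max_steps=512):
    """Sketch of the generative, bounding-box-conditioned iteration."""
    structure_tokens = ["<table>"]  # local structure generated so far (markup)
    bounding_boxes = []             # one box per generated non-empty cell
    for _ in range(max_steps):
        # Both the markup generated so far AND its bounding boxes are priors.
        token, box = model_step(table_image_features, structure_tokens, bounding_boxes)
        structure_tokens.append(token)
        if box is not None:         # pure structure tokens / empty cells have no box
            bounding_boxes.append(box)
        if token == "</table>":     # the global structure is complete
            break
    return structure_tokens, bounding_boxes

# Example with a dummy single-cell "model" (for illustration only):
script = [("<td>", (4, 6, 120, 32)), ("</table>", None)]
tokens, boxes = recognize_table(None, lambda f, t, b: script[len(t) - 1])
print(tokens, boxes)  # ['<table>', '<td>', '</table>'] [(4, 6, 120, 32)]
```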
  • the flow of the above multiple rounds of iterative processing can be executed by the transformer decoder in the data processing model provided by this application.
  • the transformer decoder may include two decoders, which are respectively denoted as a first decoder and a second decoder.
  • take as an example the case where the transformer decoder determines the first bounding box and local structure obtained in the first iteration according to the table image features and the markup language.
  • that the transformer decoder determines the first bounding box and the local structure obtained in the first iteration according to the table image features and the markup language may include the following steps: processing the table image features and the markup language by the first decoder to obtain a first output result, where the first output result indicates either a non-empty cell or not a non-empty cell; and performing, by the data processing model, a first operation on the first output result to obtain the local structure.
  • the first operation may include softmax (normalized exponential function) processing.
  • the aforementioned processing of the table image features and the markup language by the first decoder to obtain the first output result includes: processing the table image features and the markup language by the first decoder to obtain the output result of the first decoder; and performing, by the data processing model, linearization processing on the output result of the first decoder to obtain the first output result.
  • the first decoder includes a first residual branch, a second residual branch and a third residual branch, where the first residual branch includes a first attention head, the second residual branch includes a second attention head, and the third residual branch includes a first feed-forward neural network (FFN) layer.
  • that the first decoder processes the table image features and the markup language to obtain the output result of the first decoder includes: the first residual branch processes the target vector to obtain the output result of the first residual branch, where the target vector is a vector obtained according to the markup language; the second residual branch processes the table image features and the output result of the first residual branch to obtain the output result of the second residual branch; and the third residual branch performs the target operation on the output result of the first FFN to obtain the output result of the first decoder, where the output result of the first FFN is obtained by performing a second operation on the output result of the second residual branch.
  • the second operation may be a linear operation; specifically, the linear operation may be a linear transformation and a linear rectification (ReLU) operation.
  • the first residual branch further includes a first residual unit, and that the first residual branch processes the target vector to obtain the output result of the first residual branch includes: the first residual unit performs the target operation on the output of the first attention head to obtain the output result of the first residual branch.
  • the output of the first attention head is obtained according to a multiplication operation on the first vector, the second vector and the third vector, where the first vector is a query vector obtained according to the target vector, the second vector is a key vector obtained according to the target vector, and the third vector is a value vector obtained according to the target vector.
  • the multiplication operation may include dot product and cross product.
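  • the computation of the first residual branch can be sketched as follows (an illustrative sketch, assuming the dot-product variant mentioned above and a simple additive residual as the "target operation"; the weight matrices are hypothetical learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def first_residual_branch(target_vectors, Wq, Wk, Wv):
    """Self-attention over the target vectors (Q, K and V all derived from
    the target vector), followed by a residual addition."""
    Q = target_vectors @ Wq   # first vector: query
    K = target_vectors @ Wk   # second vector: key
    V = target_vectors @ Wv   # third vector: value
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # dot-product similarity
    attn_out = softmax(scores) @ V
    return target_vectors + attn_out          # residual ("target operation")

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d))                   # 5 target vectors of width d
out = first_residual_branch(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)                              # (5, 8)
```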
  • the second residual branch further includes a second residual unit, and that the second residual branch processes the table image features and the output result of the first residual branch to obtain the output result of the second residual branch includes: the second residual unit performs the target operation on the output of the second attention head to obtain the output result of the second residual branch, where the output of the second attention head is obtained according to a multiplication operation on the fourth vector, the fifth vector and the sixth vector, the fourth vector is a key vector obtained according to the table image features, the fifth vector is a value vector obtained according to the table image features, and the sixth vector is a query vector obtained according to the output result of the first residual branch.
  • the target vector is a vector obtained by performing a third operation on the second bounding box and the markup language according to the position encoding information, where the position encoding information indicates the position, in the table, of the local structure indicated by the markup language, and the second bounding box is used to indicate the bounding box of the local structure.
  • the third operation may include an addition operation.
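  • a sketch of the target vector construction, with the "third operation" taken as element-wise addition as stated above (the embedding widths and values are hypothetical):

```python
import numpy as np

def build_target_vector(token_embedding, bbox_embedding, positional_encoding):
    """The markup token embedding, the embedding of the second bounding box
    (an empty box contributes zeros) and the position encoding are summed
    into one vector; all three inputs share the same width."""
    return token_embedding + bbox_embedding + positional_encoding

d = 8
tok = np.ones(d)          # embedding of a markup token such as "<td>"
box = np.full(d, 0.5)     # embedding of the local structure's bounding box
pos = np.zeros(d)         # encoding of the local structure's position in the table
print(build_target_vector(tok, box, pos))
```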
  • the bounding box of the local structure is used to indicate the positions, in the table, of the text contained in the cells of the local structure. It can be understood that when the local structure does not include a cell, or no cell included in the local structure includes text, the bounding box of the local structure is empty.
  • the target vector is obtained according to the position encoding information, the second bounding box and the markup language, where the position encoding information indicates the position, in the table, of the local structure indicated by the markup language. This is beneficial to improving the robustness and accuracy of the table recognition results.
  • when the first output result indicates the non-empty cell, the data processing method further includes: processing the table image features and the first output result by the second decoder to obtain a second output result, where the second output result is used to indicate the first bounding box; and performing, by the data processing model, the target operation on the second output result to obtain the first bounding box.
  • the second decoder can obtain the second output result through multiple iterations. It can be understood that the working principle of each iteration performed by the second decoder is the same as that of each iteration performed by the first decoder, except that the input and output data of the two decoders are different.
  • the transformer decoder can include decoder #1 and decoder #2.
  • Decoder #1 can perform table recognition on the table included in the table image through multiple rounds of iterations according to the markup language used to identify the table structure and the positions, in the table, of the text contained in the cells, which avoids the problem in the traditional technique that recognizing a table based only on its row-column structure (which does not include bounding boxes) yields recognition results of poor accuracy.
  • the output of decoder #1 can be used as the input of decoder #2, so that decoder #2 determines, based on the output result of decoder #1 and the table image features, the specific position, in the table, of the text contained in the non-empty cell indicated by the output of decoder #1. In summary, this method can improve the accuracy and efficiency of table recognition.
  • the method further includes: correcting the first bounding box obtained by the first iteration.
  • table recognition may be performed based on the corrected bounding box.
  • real-time correction can be performed on the first bounding box acquired in the first iteration, which can further improve the accuracy of the first bounding box.
  • in this way, the robustness and accuracy of the output result of the next iteration can be further improved, and this method is conducive to further improving the accuracy of the table recognition results.
  • correcting the first bounding box obtained by the first iteration includes: correcting the first bounding box according to input parameters and the table image.
  • the above input parameters may be one or more parameters provided by the user according to the table image, and the one or more parameters are used to correct the first bounding box.
  • the user can determine the input parameters for correcting the first bounding box according to actual needs, and manually input the input parameters to correct the first bounding box in real time.
  • on the premise of further improving the accuracy of the table recognition results, this method can also improve user satisfaction.
  • correcting the first bounding box obtained by the first iteration includes: when the matching degree between the second bounding box and the first bounding box is greater than or equal to a preset threshold, correcting the first bounding box according to the second bounding box, where the second bounding box is obtained by processing the local structure with an error correction detection model, and the error correction detection model is a trained artificial intelligence (AI) model.
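  • the correction rule can be sketched as follows (a sketch under assumptions: the patent does not fix the matching-degree measure, so intersection-over-union is used here as a plausible stand-in, and the threshold value is illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, used here as a
    stand-in for the 'matching degree'; the measure is an assumption."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def maybe_correct(first_box, second_box, threshold=0.5):
    """Replace the predicted first bounding box with the error-correction
    model's second bounding box when the two match closely enough."""
    return second_box if iou(first_box, second_box) >= threshold else first_box

print(maybe_correct((10, 10, 100, 40), (12, 11, 98, 42)))  # corrected box
```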
  • the data processing model provided in this application may also include an error correction and detection model.
  • the error correction detection model in the data processing model can automatically correct, in real time, the first bounding box predicted by the model, which is beneficial to further improving the accuracy and recognition efficiency of the table recognition result.
  • the method further includes: correcting the table recognition result according to the table image, and outputting the corrected table recognition result.
  • the method further includes: performing feature extraction on the table image to obtain the table image features.
  • the table image features can be used to indicate one or more of the following features: the number of rows in the table, the number of columns in the table, the size of the table, the row-spanning feature of the table, the column-spanning feature of the table, or the layout of the table.
  • the layout of the table includes a markup language used to indicate the structure of the table, and each cell in the table or a bounding box corresponding to the text contained in each cell in the table.
  • the above-mentioned process of obtaining table image features can be executed by the feature extraction model in the data processing model provided by this application.
  • the feature extraction model is a neural network model with a feature extraction function, and the structure of the feature extraction model is not specifically limited.
  • any one of the following markup languages may be used to mark the table recognition result: Hypertext Markup Language (HTML), Extensible Markup Language (XML), or LaTeX.
  • the markup language can be used to mark the table recognition result, which is beneficial to subsequent further processing of the table recognition result.
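  • for illustration, a table recognition result marked in HTML (one of the markup languages listed above) might look like the following; the `data-bbox` attribute carrying each non-empty cell's pixel coordinates is a hypothetical convention, not a format fixed by this application:

```python
# A hypothetical HTML-marked recognition result: global structure as tags,
# content as cell text, and each non-empty cell's bounding box as pixel
# coordinates (x1, y1, x2, y2) in the table image.
result = """
<table>
  <tr>
    <td data-bbox="12,8,96,34">Item</td>
    <td data-bbox="110,8,190,34">Qty</td>
  </tr>
  <tr>
    <td data-bbox="12,40,96,66">Bolt</td>
    <td data-bbox="110,40,190,66">4</td>
  </tr>
</table>
"""
print(result)
```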
  • in a second aspect, a data processing device is provided, and the device includes various modules for executing the data processing method in the first aspect or any possible implementation manner of the first aspect.
  • in a third aspect, a data processing device is provided, and the data processing device has functions for implementing the data processing method described in the first aspect or any possible implementation of the first aspect, and the functions of the data processing device described in the second aspect or any possible implementation of the second aspect.
  • This function may be implemented based on hardware, or may be implemented by corresponding software based on hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the structure of the data processing device includes a processor, and the processor is configured to support the data processing device to perform corresponding functions in the foregoing method.
  • the data processing device may also include a memory, where the memory is coupled to the processor and stores the program instructions and data necessary for the data processing device.
  • the data processing device includes: a processor, a transmitter, a receiver, a random access memory, a read only memory, and a bus. Wherein, the processor is respectively coupled to the transmitter, the receiver, the random access memory and the read-only memory through the bus.
  • when the data processing device starts, the basic input/output system fixed in the read-only memory, or the bootloader in the embedded system, boots the data processing device into a normal running state. After the data processing device enters the normal running state, the application program and the operating system run in the random access memory, so that the processor executes the method in the first aspect or any possible implementation manner of the first aspect.
  • in a fourth aspect, a computer program product is provided, the computer program product including computer program code, and when the computer program code runs on a computer, the computer is caused to perform the method in the first aspect or any possible implementation manner of the first aspect.
  • in a fifth aspect, a computer-readable medium is provided, the computer-readable medium storing program code, and when the program code runs on a computer, the computer is caused to execute the method in the first aspect or any possible implementation manner of the first aspect.
  • these computer-readable storage media include, but are not limited to, one or more of the following: read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), flash memory, electrically erasable PROM (EEPROM), and a hard drive.
  • in a sixth aspect, a chip system is provided, where the chip system includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory, so as to execute the method in the first aspect or any possible implementation manner of the first aspect.
  • the chip system can be implemented in the form of a central processing unit (CPU), a microcontroller unit (MCU), a microprocessor unit (MPU), a digital signal processor (DSP), a system on chip (SoC), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or a programmable logic device (PLD).
  • in a seventh aspect, a data processing system is provided, and the data processing system is configured to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • in an eighth aspect, a data processing cluster is provided, where the cluster includes multiple data processing devices as described in the second aspect or any possible implementation of the second aspect, or the third aspect or any possible implementation of the third aspect, and the multiple data processing devices may be used to execute the method in the first aspect or any possible implementation manner of the first aspect.
  • Figure 1 is a structural schematic diagram of the main framework of artificial intelligence.
  • Fig. 2 is a schematic structural diagram of a standard transformer module.
  • Fig. 3 is a schematic diagram of a convolutional neural network structure.
  • Fig. 4 is a schematic diagram of another convolutional neural network structure.
  • FIG. 5 is a schematic structural diagram of a system architecture 500 provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a data processing model provided by an embodiment of the present application.
  • Fig. 7 is a schematic structural diagram of a decoder provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a data processing method 800 provided by an embodiment of the present application.
  • Fig. 9a is a schematic flowchart of a data processing model training method 900 provided by an embodiment of the present application.
  • Fig. 9b is a schematic diagram of a bounding box included in the table provided by the embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a data processing method 1000 provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of an execution process of a data processing method 1000 provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of an execution process of a data processing method 1000 provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a data processing device 1300 provided by an embodiment of the present application.
  • Fig. 14 is a schematic structural diagram of a training device 1400 provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a computing device 1500 provided by an embodiment of the present application.
  • Figure 1 is a schematic structural diagram of the main framework of artificial intelligence.
  • the following describes the above artificial intelligence framework from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data - information - knowledge - wisdom".
  • the "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technology for providing and processing information) to the industrial ecology of the system.
  • the infrastructure 110 provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through the basic platform.
  • computing power is provided by smart chips (hardware acceleration chips such as the central processing unit (CPU), the neural-network processing unit (NPU), the graphics processing unit (GPU), the application-specific integrated circuit (ASIC) and the field-programmable gate array (FPGA)).
  • the basic platform includes related platform assurance and support such as a distributed computing framework and a network, and may include cloud storage and computing, interconnection networks, and the like.
  • sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data 120 on the upper layer of the infrastructure 110 is used to represent data sources in the field of artificial intelligence.
  • Data 120 involves graphics, images, voice, and text, as well as IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing 130 generally includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making. Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on the data 120.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general-purpose capabilities can be formed based on the results of the data processing 130, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications 150 refer to the products and applications of artificial intelligence systems in various fields; they are the packaging of the overall artificial intelligence solution, productizing intelligent information decision-making and realizing landed applications. The application fields mainly include: intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, etc.
  • the embodiments of the present application can be applied to many fields of artificial intelligence, for example, fields such as smart manufacturing, smart transportation, smart home, smart medical care, smart security, automatic driving, smart city, or smart terminal.
  • the neural network can be composed of neural units, and a neural unit can refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, where the output of the operation unit can be: $h_{W,b}(x) = f(W^\top x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
  • Ws is the weight of xs
  • b is the bias of the neuron unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
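  • a minimal sketch of one such neural unit with a sigmoid activation (the weights and inputs are hypothetical):

```python
import math

def neuron(xs, ws, b):
    """One neural unit: weighted sum of the inputs plus the bias, passed
    through a sigmoid activation f, matching the formula above."""
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

print(neuron(xs=[0.5, -1.0, 2.0], ws=[0.8, 0.1, -0.3], b=0.05))
```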
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • the transformer model can also be called a transformer module, or a transformer structure, etc.
  • the transformer model is a multi-layer neural network based on self-attention modules. At present, it is mainly used to process natural language tasks.
  • the transformer model is mainly composed of a stacked multi-head self-attention module and a feed forward neural network (FFN).
  • the transformer model can be further divided into an encoder (also known as an encoding module) and a decoder (also known as a decoding module), whose compositions are roughly similar but not identical.
  • Fig. 2 is a schematic structural diagram of a standard transformer module. As shown in FIG. 2 , the encoder 210 is on the left, and the decoder 220 is on the right.
  • the encoder 210 may include any number of encoding sub-modules, and each encoding sub-module includes a multi-head self-attention module and a feedforward neural network.
  • Decoder 220 may include any number of decoding sub-modules, each of which includes two multi-head self-attention modules and a feed-forward neural network. The number of encoding sub-modules and the number of decoding sub-modules may or may not be equal.
  • the attention mechanism imitates the internal process of biological observation behavior, that is, a mechanism that aligns internal experience with external sensation to increase the observation precision of some areas, and it can quickly filter out high-value information from a large amount of information using limited attention resources.
  • Attention mechanism can quickly extract important features of sparse data, so it is widely used in natural language processing tasks, especially machine translation.
  • the self-attention mechanism is an improvement of the attention mechanism, which reduces the dependence on external information and is better at capturing the internal correlation of data or features.
  • the essential idea of the attention mechanism can be expressed by the following formula: $\text{Attention}(Query, Source) = \sum_{i=1}^{L_x} \text{Similarity}(Query, Key_i) \cdot Value_i$, where $L_x = \|Source\|$ denotes the length of Source.
  • the meaning of the formula is as follows: imagine that the constituent elements in Source are composed of a series of <Key, Value> data pairs. Given an element Query (abbreviated as Q) in Target, the similarity or correlation between the Query and each Key (abbreviated as K) is calculated to obtain the weight coefficient of the Value (abbreviated as V) corresponding to each Key, and the Values are then weighted and summed to obtain the final Attention value.
  • the Attention mechanism is to weight and sum the Value values of the elements in the Source, and Query and Key are used to calculate the weight coefficient corresponding to the Value.
  • Attention can be understood as selectively screening out a small amount of important information from a large amount of information and focusing on these important information, ignoring most of the unimportant information.
  • the process of focusing is reflected in the calculation of the weight coefficient.
  • the self-attention mechanism can be understood as internal Attention (intra attention).
  • in the general attention mechanism, attention occurs between the element Query in the Target and all elements in the Source, whereas the self-attention mechanism refers to attention between the internal elements of the Source or between the internal elements of the Target.
  • the specific calculation process is the same, but the calculation object has changed.
  • a convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract features independent of position.
  • the convolution kernel can be formalized as a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • CNN is a very common neural network.
  • the convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture.
  • a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons respond to inputs.
  • as shown in FIG. 3, a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230.
  • the convolution layer 221 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter for extracting specific information from the input image matrix.
  • a convolution operator is essentially a weight matrix, which is usually predefined. Taking an image as an example (other data types are similar), during the convolution operation on the image, the weight matrix is usually slid along the horizontal direction of the input image one pixel after another (or two pixels after two pixels, depending on the value of the stride), so as to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • during the convolution operation, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
  • the dimension here can be understood as determined by the "multiple" above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps extracted by the multiple weight matrices of the same size also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
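  • the sliding-window behaviour and the stacking of per-kernel outputs into the depth dimension can be sketched as follows (a single-channel simplification with hypothetical kernels; real inputs also carry a depth dimension, as noted above):

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    """Naive convolution sketch: each kernel (weight matrix) slides over the
    image 'stride' pixels at a time; the per-kernel outputs are stacked to
    form the depth dimension of the output."""
    H, W = image.shape
    kh, kw = kernels[0].shape
    oh, ow = (H - kh) // stride + 1, (W - kw) // stride + 1
    out = np.zeros((len(kernels), oh, ow))
    for d, k in enumerate(kernels):           # one output channel per kernel
        for i in range(oh):
            for j in range(ow):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[d, i, j] = (patch * k).sum()
    return out

img = np.arange(36.0).reshape(6, 6)
edge = np.array([[1.0, 0.0, -1.0]] * 3)       # hypothetical edge-extracting kernel
blur = np.full((3, 3), 1 / 9)                 # hypothetical smoothing kernel
print(conv2d(img, [edge, blur]).shape)        # (2, 4, 4): depth 2 from 2 kernels
```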
  • the weight values in these weight matrices need to be obtained through extensive training in practical applications, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 makes correct predictions.
  • the initial convolutional layer (such as 221) often extracts more general features, which can also be referred to as low-level features;
  • the features extracted by the later convolutional layers (such as 226) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • a pooling layer may follow a convolutional layer: it may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. Also, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
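  • a sketch of both pooling operators (average and maximum) over non-overlapping sub-regions:

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Each output pixel is the maximum (max pooling) or mean (average
    pooling) of the corresponding size x size sub-region, so the output is
    spatially smaller than the input."""
    H, W = image.shape
    oh, ow = H // size, W // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

img = np.arange(16.0).reshape(4, 4)
print(pool2d(img, 2, "max"))   # 2x2 output: [[5, 7], [13, 15]]
```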
  • after being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is still not able to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 3), and the parameters contained in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • the output layer 240 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error.
  • the backpropagation (as shown in FIG. 3, propagation in the direction from 240 to 210 is backpropagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
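  • a minimal numeric sketch of this step (the sizes and learning rate are hypothetical): a classification cross-entropy loss is computed at the output layer, and one gradient step updates the weights to reduce the error:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # output layer: 3 classes, 4 input features
x = rng.normal(size=4)             # features from the previous layer
y = 1                              # ideal (target) class

p = softmax(W @ x)
loss = -np.log(p[y])               # classification cross entropy
grad_z = p.copy(); grad_z[y] -= 1  # d(loss)/d(scores) for softmax + CE
W -= 0.1 * np.outer(grad_z, x)     # update the weights to reduce the loss
print(loss, -np.log(softmax(W @ x)[y]))  # loss should decrease after the step
```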
  • the convolutional neural network 200 shown in FIG. 3 is only an example of a convolutional neural network.
  • the convolutional neural network can also exist in the form of other network models.
  • the convolutional neural network used in the embodiment of the present application may only include an input layer 210 , a convolutional layer/pooling layer 220 and an output layer 240 .
  • it should be noted that the convolutional neural network 200 shown in FIG. 3 is only an example of a convolutional neural network; the convolutional neural network can also exist in the form of other network models, for example, with multiple parallel convolutional layers/pooling layers as shown in FIG. 4, where the extracted features are all input to the fully connected layer 230 for processing.
  • the present application provides a data processing method, including: acquiring a table image to be processed; determining the table recognition result from the table image according to a generative table recognition strategy, where the generative table recognition strategy is used to indicate that a markup language and the non-overlapping attribute of bounding boxes are used to determine the table recognition result of the table image, the bounding box is used to indicate the position, in the table associated with the table image, of the text contained in the cells, and the table recognition result is used to indicate the global structure and content included in the table; and outputting the table recognition result.
  • the method recognizes the table according to the markup language used to identify the table structure and the positions, in the table, of the text included in the cells of the table, to obtain the table recognition result. This avoids the problem in the traditional technology that recognizing a table based only on its row-column structure (which does not include bounding boxes) yields recognition results of poor accuracy, so the method can improve the accuracy of the table recognition result.
  • the following introduces in detail the system architecture provided by the embodiment of the present application with reference to FIG. 5.
  • FIG. 5 is a schematic structural diagram of a system architecture 500 provided by an embodiment of the present application.
  • the system architecture 500 includes an execution device 510 , a training device 520 , a database 530 , a client device 540 , a data storage system 550 and a data acquisition system 560 .
  • the execution device 510 includes a calculation module 511 , a data processing system 512 , a preprocessing module 513 and a preprocessing module 514 .
  • the calculation module 511 may include the target model/rule 501, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • the data collection device 560 is used to collect training samples.
  • the training sample may include a training image and the classification result corresponding to the training image, where the classification result of the training image may be a manually pre-labeled result.
  • the data collection device 560 stores these training samples in the database 530 .
  • the data processing model provided by this application may also be maintained in the database 530 .
  • FIG. 6 below shows a schematic structural diagram of the data processing model provided by the embodiment of the present application. For details, refer to the relevant description of FIG. 6 below, and details are not repeated here.
  • the training device 520 can train the target model/rule 501 based on the training samples maintained in the database 530 .
  • the target model/rule 501 may be a data processing model provided by this application.
  • all training samples maintained in the database 530 may be training samples collected by the data collection device 560 .
  • some of the training samples maintained in the database 530 may also be training samples collected by devices other than the data collection device 560 .
  • the training device 520 may also acquire training samples from the cloud or other places to train the target model/rule 501, and the above description should not be used as a limitation to the embodiment of the present application.
  • the target model/rule 501 trained by the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG. 5. The execution device 510 may be a terminal, such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and may also be a server, a cloud, or the like.
  • the training device 520 may transfer the data processing model provided by this application to the execution device 510 .
  • the execution device 510 includes a data processing system 512.
  • the data processing system 512 is used for data interaction with external devices.
  • the user can input data to the data processing system 512 through the client device 540 (for example, the table image to be processed in the embodiment of the present application).
  • the preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data received by the data processing system 512 . It should be understood that there may be no preprocessing module 513 and preprocessing module 514, or there may be only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 do not exist, the calculation module 511 may be used directly to process the input data.
  • when the execution device 510 preprocesses the input data, or when the calculation module 511 of the execution device 510 performs calculation and other related processing, the execution device 510 can call data, code, and the like in the data storage system 550 for the corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 550.
  • the data processing system 512 presents the processing result (for example, the table recognition result in the embodiment of the present application) to the client device 540, thereby providing it to the user.
  • the user can manually specify input data, and the "manually specify input data" can be operated through a user interface (user interface, UI) provided by the data processing system 512.
  • the client device 540 can automatically send the input data to the data processing system 512. If the client device 540 is required to obtain the user's authorization to automatically send the input data, the user can set the corresponding permission in the client device 540.
  • the user can view the results output by the execution device 510 on the client device 540, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 540 can also be used as a data collection terminal, collecting the input data input to the data processing system 512 and the output results output by the data processing system 512, as shown in the figure, as new sample data, and storing them in the database 530.
  • the data processing system 512 can also directly store, without going through the client device 540, the input data input to the data processing system 512 and the output results output by the data processing system 512 as new sample data in the database 530.
  • FIG. 5 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in FIG. 5 does not constitute any limitation.
  • the data storage system 550 is an external memory relative to the execution device 510 , and in other cases, the data storage system 550 may also be placed in the execution device 510 . It should be understood that the above execution device 510 may be deployed in the client device 540 .
  • the system architecture 500 shown in FIG. 5 above can be applied to the application phase (also called inference phase) of the data processing model provided by this application, and the training phase of the data processing model provided by this application.
  • the specific functions of the modules included in the system architecture 500 will be described below in detail when the system architecture 500 is applied to the application phase of the data processing model and the training phase of the data processing model respectively.
  • the above-mentioned system architecture shown in FIG. 5 can be applied to the application phase of the data processing model provided in this application.
  • the calculation module 511 of the execution device 510 may obtain the code stored in the data storage system 550 to implement the data processing method in the embodiment of the present application.
  • the calculation module 511 of the execution device 510 may include a hardware circuit (such as an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), a graphics processing unit (GPU), a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller), or a combination of these hardware circuits. For example, the calculation module 511 may be a hardware system with the function of executing instructions, such as a CPU or a DSP, or a hardware system without the function of executing instructions, such as an ASIC or an FPGA, or a combination of a hardware system without the function of executing instructions and a hardware system with the function of executing instructions.
  • specifically, the computing module 511 of the execution device 510 may be a hardware system with the function of executing instructions, the data processing method provided in the embodiment of the present application may be software code stored in a memory, and the computing module 511 of the execution device 510 may acquire the software code from the memory and execute the acquired software code to implement the data processing method provided in the embodiment of the present application.
  • the computing module 511 of the execution device 510 can be a combination of a hardware system that does not have the function of executing instructions and a hardware system that has the function of executing instructions.
  • some steps of the method may be implemented by the hardware system in the computing module 511 that does not have the function of executing instructions, which is not limited here.
  • the system architecture shown in FIG. 5 above may be applied to the training phase of the data processing model provided in this application.
  • the above-mentioned training device 520 can obtain code stored in a memory (not shown in FIG. 5; the memory can be integrated into the training device 520 or deployed separately from the training device 520) to implement the data processing method in the embodiment of the present application.
  • the training device 520 may include a hardware circuit (such as an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a graphics processing unit (GPU), a general-purpose processor , digital signal processor (digital signal processing, DSP), microprocessor or microcontroller, etc.), or a combination of these hardware circuits, for example, the training device 520 can be a hardware system with the function of executing instructions, such as CPU, DSP etc., or a hardware system that does not have the function of executing instructions, such as ASIC, FPGA, etc., or a combination of the above-mentioned hardware systems that do not have the function of executing instructions and hardware systems that have the function of executing instructions.
  • the training device 520 may be a hardware system capable of executing instructions; in this case, the data processing method provided in the embodiment of the present application may be software code stored in a memory, and the training device 520 acquires the software code from the memory and executes the acquired software code to realize the data processing method provided by the embodiment of the present application.
  • the training device 520 can also be a combination of a hardware system that does not have the function of executing instructions and a hardware system that has the function of executing instructions; which parts of the training device 520 are implemented by the hardware system that does not have the function of executing instructions is not limited here.
  • FIG. 6 shows the structure of the data processing model; for details related to the model, refer to the description of FIG. 6 below, which will not be repeated here.
  • FIG. 6 is a schematic structural diagram of a data processing model provided by an embodiment of the present application.
  • the data processing model may be a neural network model.
  • the data processing model includes a feature extraction model and a transformer decoder.
  • the transformer decoder can include embedding layer #1, embedding layer #2, decoder #1 composed of a stack of N decoder layers, and decoder #2 composed of a stack of M decoder layers, where N and M are positive integers.
  • the structure of any layer of decoders in the N-layer decoders may be the same as the structure of any layer of decoders in the M-layer decoders.
  • FIG. 7 below shows the structure of the decoder; for details, refer to the relevant description of FIG. 7, which will not be detailed here. The specific values of N and M can be set according to actual needs, and the values of N and M may be equal or unequal; this is not specifically limited.
  • the feature extraction model is a neural network model, and the feature extraction model is used to perform feature extraction on the form image to obtain form feature vectors (also called form image features) included in the form image.
  • the table feature vector is used to indicate one or more of the following features: the number of rows in the table, the number of columns in the table, the size of the table, the row-spanning feature of the table, the column-spanning feature of the table, or the layout of the table.
  • the layout of the table includes a markup language used to indicate the structure of the table, and each cell in the table or a bounding box corresponding to the text contained in each cell in the table.
  • the bounding box corresponding to the text may refer to any polygon surrounding the text.
  • the output of the feature extraction model may include a value vector V2 corresponding to the feature vector of the table, and a key vector K2 corresponding to the feature vector of the table.
  • the feature extraction model is not specifically limited.
  • the feature extraction model may be a CNN model.
  • the feature extraction model may be a combination model composed of a CNN model and a feature pyramid network (feature pyramid network, FPN) model.
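  • As an illustrative sketch only (a sketch under assumed names and dimensions, not the implementation of this application), a minimal CNN feature extraction model in PyTorch could flatten its feature map into a sequence of table image feature vectors as follows:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Minimal CNN feature extractor sketch; a real system could combine
    a CNN backbone with an FPN, as noted above."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, d_model, 3, stride=2, padding=1),
        )

    def forward(self, table_image: torch.Tensor) -> torch.Tensor:
        # (batch, 3, H, W) -> (batch, H/8 * W/8, d_model): a sequence of
        # table image features usable as the K2/V2 source by the decoder
        f = self.backbone(table_image)
        return f.flatten(2).transpose(1, 2)
```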
  • the embedding layer can embed the current input to obtain multiple feature vectors.
  • the core feature of the data processing model lies in its unique attention mechanism.
  • the embedding layer encodes the value, position and corresponding bounding box of each node in the current sequence, and adds these codes element by element to obtain the embedding vector.
  • the embedding layer processes the embedding vector to obtain the query vector Q1, key vector K1 and value vector V1 corresponding to the embedding vector.
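  • For illustration, a hedged sketch of such an embedding layer (all module names and dimensions are assumptions) that sums the value, position, and bounding-box encodings element by element, as described above; the projection of the resulting embedding vector to Q1, K1 and V1 would follow as separate linear maps:

```python
import torch
import torch.nn as nn

class SequenceEmbedding(nn.Module):
    """Element-wise sum of value, position, and bounding-box encodings."""
    def __init__(self, vocab_size: int, max_len: int, d_model: int = 512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # value of each node
        self.pos = nn.Embedding(max_len, d_model)      # position encoding
        self.box = nn.Linear(4, d_model)               # encodes [x1, y1, x2, y2]

    def forward(self, tokens: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq); boxes: (batch, seq, 4), all-zero for empty boxes
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.tok(tokens) + self.pos(positions) + self.box(boxes)
```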
  • Fig. 7 is a schematic structural diagram of a decoder provided by an embodiment of the present application.
  • the decoder includes, in sequence: a masked multi-head attention (masked multi-head attention) layer, summation and normalization (add & norm), a multi-head attention (multi-head attention, MHA) layer, summation and normalization, a feed-forward layer, and summation and normalization.
  • the masked multi-head attention layer together with its summation and normalization is referred to below as residual branch 1; the multi-head attention layer together with its summation and normalization is called residual branch 2; and the feed-forward layer together with its summation and normalization is called residual branch 3.
  • the masked multi-head attention layer obtains input vectors from its upper layer, adopts the self-attention mechanism, and transforms each vector based on the degree of correlation between vectors to obtain the output vector of the masked multi-head attention layer.
  • the output vectors of the masked multi-head attention layer are summed and normalized to obtain the output vector Q2 of residual branch 1. It can be understood that when the masked multi-head attention layer is a layer directly connected to the embedding layer, such as in the decoder layer directly connected to embedding layer #1 in FIG. 6, the input vector obtained by the masked multi-head attention layer is the output vector of embedding layer #1.
  • the mask multi-head attention layer is a layer directly connected to the embedding layer, and the input of the mask multi-head attention layer includes the output vector of the embedding layer (ie, Q1, K1 and V1) .
  • for a decoder layer that is not directly connected to the embedding layer (for example, the second decoder layer in decoder #1 when N is equal to 2), the input vector obtained by its masked multi-head attention layer is the output vector of the upper decoder layer.
  • the input of the multi-head attention layer includes the vector Q2 output by the residual branch 1, and the output vectors (ie V2 and K2) of the feature extraction model.
  • the multi-head attention layer adopts the self-attention mechanism and transforms each vector based on the degree of correlation between vectors to obtain the output vector. The multi-head attention (MHA) layer includes multiple attention heads.
  • the feed-forward neural network FFN layer is used to perform the following operations on the vector output by residual branch 2: a linear transformation and a linear rectification function (linear rectification function, ReLU) activation operation. Afterwards, the vectors output by the feed-forward neural network FFN layer are summed and normalized to obtain the output vector of residual branch 3.
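  • A hedged PyTorch sketch of one such decoder layer (the use of nn.MultiheadAttention and all names are assumptions, not the code of this application); it chains the three residual branches described above:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Residual branch 1 (masked self-attention), branch 2 (cross-attention
    over table image features), branch 3 (feed-forward), each with add & norm."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, img_feats, causal_mask):
        # Residual branch 1: masked multi-head attention + add & norm -> Q2
        a, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        q2 = self.norm1(x + a)
        # Residual branch 2: Q from branch 1; K2/V2 from the feature extractor
        b, _ = self.cross_attn(q2, img_feats, img_feats)
        y = self.norm2(q2 + b)
        # Residual branch 3: feed-forward (linear + ReLU) + add & norm
        return self.norm3(y + self.ffn(y))
```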
  • FIG. 8 is a schematic flowchart of a data processing method 800 provided by an embodiment of the present application. It can be understood that the method 800 can be executed by, but not limited to, the data processing model shown in FIG. 6 above.
  • the data processing model includes a feature extraction model and a transformer model, and the transformer model includes decoder #1 and decoder #2.
  • the method 800 includes step 810 to step 830. Next, step 810 to step 830 will be described in detail.
  • Step 810: acquire the form image to be processed.
  • Acquiring the table image to be processed may include: the data processing model obtains the table image to be processed.
  • the data processing model may provide a user interface (user interface, UI) for the user, and the user inputs the form image through the UI.
  • the table image may include one or more tables.
  • the form image may also be replaced by a file in portable document format (portable document format, PDF).
  • Step 820: determine the form recognition result from the form image according to the generative form recognition strategy, wherein the generative form recognition strategy is used to indicate that the form recognition result of the form image is determined by using the markup language and the non-overlapping property of the bounding boxes, a bounding box is used to indicate the position, in the table associated with the form image, of the text included in a cell, and the table recognition result is used to indicate the global structure and content included in the table.
  • the markup language may be used to indicate a local structure of the table, which is a partial structure of the table's global structure.
  • the table structure may include: rows of the table, columns of the table, cells included in the table, each cell in the table, and a bounding box corresponding to text included in each cell in the table.
  • the bounding box corresponding to the text may refer to an arbitrary polygonal bounding box surrounding the text included in the cell.
  • the position of the text included in the cell in the table can be understood as the position of the bounding box corresponding to the text included in the cell in the table.
  • the markup language may be, but is not limited to, any one of the following markup languages: hypertext markup language HTML, extensible markup language (extensible markup language, XML), or LaTeX.
  • FIG. 9b shows a schematic diagram of the bounding boxes included in the table provided by the present application. Referring to FIG. 9b, the rectangular bounding boxes corresponding to the text included in the non-empty cells (that is, cells that include text) have no overlapping areas.
  • the bounding box is used to indicate the position of the text included in the cell in the table associated with the table image.
  • the bounding box may refer to any polygon that surrounds the text contained in the cell.
  • the shape of the bounding box is not specifically limited.
  • the polygon may be, but not limited to, one of the following polygons: rectangle, square, parallelogram, or other polygons (eg, hexagon, etc.).
  • the specific position of the bounding box in the table can be determined through the coordinates of the bounding box.
  • the specific position of the rectangular bounding box in the table can be determined by the coordinates of the upper left corner of the rectangular bounding box and the coordinates of the lower right corner of the rectangular bounding box.
  • the specific position of the rectangular bounding box in the table can be determined by the coordinates of the lower left corner of the rectangular bounding box and the coordinates of the upper right corner of the rectangular bounding box.
  • the execution subject of the above step 820 may be a transformer decoder in the data processing model.
  • the form recognition result is determined from the form image according to the generative form recognition strategy as follows: the transformer decoder obtains the form recognition result through iterative processing according to the form image features and the markup language. This iterative processing may include multiple rounds of iterations, and the transformer decoder also performs the following steps: the transformer decoder determines the first bounding box and the local structure obtained in the first iteration according to the table image features and the markup language, where the first iteration is any round of the multiple rounds of iterations.
  • the first bounding box is used to indicate the bounding box of the local structure obtained in the first iteration, and the local structure is a partial structure of the global structure; when the global structure is obtained in the second iteration, the transformer decoder determines that the processing result obtained in the second iteration is the table recognition result, where the second iteration is an iteration performed after the first iteration in the iterative processing, and the processing result includes the global structure and the content.
  • the bounding box of the local structure obtained in the first iteration is the position of the text included in the cells of the local structure of the table obtained in the first iteration. It can be understood that when the local structure does not include any cell, or every cell included in the local structure is an empty cell (that is, a cell that does not include any text), the bounding box of the local structure is empty.
  • the transformer decoder determines the first bounding box and the local structure obtained in the first iteration according to the table image features and the markup language as follows: decoder #1 determines the local structure according to the table image features and the markup language, where the local structure is a partial structure of the global structure of the table; decoder #2 determines, according to the table image features and the non-empty cells of the local structure, the position in the table of the bounding box corresponding to the text included in each non-empty cell. The position in the table of the bounding box corresponding to the text included in a non-empty cell is the position in the table of that text.
  • the above markup language may include a sequence indicating the start of the markup-language sequence marking the table.
  • the markup language may include a sequence indicating the local structure of the table, and the local structure output by decoder #1 contains the local structure of the table indicated by the markup language. It can be understood that when the local structure output by decoder #1 does not include non-empty cells (that is, the local structure does not include a bounding box), decoder #2 may not perform iterative processing on the table image features and the local structure.
  • decoder #1 and decoder #2 can respectively obtain the global structure of the table by performing multiple rounds of iterations.
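  • The following is a hedged sketch of that two-decoder iteration (the callables structure_decoder and bbox_decoder, the token spellings, and the stop condition are assumptions for illustration, not the method's actual interfaces):

```python
def recognize_table(image_features, structure_decoder, bbox_decoder,
                    start_sequence, end_token="</table>", max_steps=512):
    """Decoder #1 extends the markup-language structure sequence one step
    per iteration; whenever it emits a non-empty-cell token, decoder #2
    predicts that cell's text bounding box. Stops once the global
    structure is complete."""
    sequence, boxes = list(start_sequence), []
    for _ in range(max_steps):
        token = structure_decoder(image_features, sequence, boxes)
        sequence.append(token)
        if token == "<td":                 # a non-empty cell was emitted
            boxes.append(bbox_decoder(image_features, sequence))
        if token == end_token:             # global structure obtained
            break
    return sequence, boxes                 # the table recognition result
```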
  • FIG. 10 to FIG. 12 below show schematic diagrams of the execution process in which decoder #1 and decoder #2 respectively determine the table recognition result through multiple rounds of iterations.
  • the following method may also be performed: correcting the first bounding box obtained in the first iteration.
  • This correction method can be understood as a real-time correction method, that is, to correct the bounding box acquired at a certain time during the form recognition process.
  • the execution subject that executes this correction process may be the above-mentioned decoder #1. It can be understood that, in the next iteration after the first iteration, processing can be performed according to the corrected first bounding box and the table image features, so as to obtain the bounding box and local structure of that next iteration.
  • the local structure obtained in the next iteration of the first iteration is a local structure in the global structure of the table, and the local structure obtained in the next iteration of the first iteration includes the local structure obtained in the first iteration.
  • correcting the first bounding box obtained in the first iteration includes: correcting the first bounding box according to the input parameters and the table image.
  • the input parameters include parameters for correcting the first bounding box, and the input parameters may be parameters obtained by the user according to the form image.
  • the data processing model may provide the user with a user interface (user interface, UI) having the input parameter, and the user inputs the input parameter through the UI.
  • correcting the first bounding box obtained in the first iteration includes: when the matching degree between the second bounding box and the first bounding box is greater than or equal to a preset threshold, according to the second The bounding box corrects the first bounding box, and the second bounding box is obtained by processing the local structure with the error correction detection model, and the error correction detection model is a trained artificial intelligence (AI) model.
  • the data processing model may also include the error correction and detection model. Based on this, the specific implementation manner of this correction can be understood as a manner in which the data processing model automatically corrects the first bounding box.
  • the matching degree may be determined by any of the following methods: matching by intersection-over-union (IoU), or the distance between central points. For example, when the matching degree is determined by IoU, the greater the IoU, the greater the matching degree. As another example, when the matching degree is determined by the distance from the center point, the smaller the distance, the greater the matching degree.
  • the size of the preset threshold is not specifically limited, and the size of the preset threshold can be set according to actual needs.
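  • A minimal sketch of both matching criteria and the threshold decision (the threshold value 0.5 is an assumption; boxes are (x1, y1, x2, y2) rectangles with (x1, y1) the upper left corner):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two rectangles; larger means a better match."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center_distance(box_a, box_b):
    """Euclidean distance between box centers; smaller means a better match."""
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5

def correct_bbox(box1, box2, threshold=0.5):
    """Keep box1 unless box2 matches it closely enough to correct it
    (0.5 is an assumed threshold; set it according to actual needs)."""
    return box2 if iou(box1, box2) >= threshold else box1
```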
  • the following steps may also be performed: according to the global structure of the table, the text content included in the global structure is obtained.
  • the manner of obtaining the text content included in the global structure is not specifically limited.
  • the cell image corresponding to the text bounding box can be cropped out, and the cell image can be recognized by an optical character recognition (optical character recognition, OCR) system, so as to obtain the text content included in the cell image.
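  • A hedged sketch of that step (pytesseract stands in for any OCR system; the function and path names are assumptions):

```python
from PIL import Image
import pytesseract  # one possible OCR backend; any OCR system works

def read_cell_text(table_image_path, bbox):
    """Crop the cell image delimited by the text bounding box and run
    OCR on it to recover the cell's text content."""
    img = Image.open(table_image_path)
    x1, y1, x2, y2 = bbox                 # upper-left and lower-right corners
    cell = img.crop((x1, y1, x2, y2))
    return pytesseract.image_to_string(cell).strip()
```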
  • the following method may also be performed before the above step 820: performing feature extraction on the form image to obtain form image features.
  • the feature extraction model in the data processing model may perform feature extraction on the form image to obtain the feature of the form image.
  • the table image features include one or more of the following features: the number of rows in the table, the number of columns in the table, the size of the table, the feature of crossing rows of the table, the feature of crossing columns of the table, or the layout of the table.
  • the layout of the table includes a markup language used to indicate the structure of the table, and the bounding box corresponding to each cell included in the table or to the text included in each cell.
  • any one of the following markup languages may be used to mark the table recognition result: Hypertext Markup Language HTML, Extensible Markup Language XML, or LaTeX.
  • Step 830: output the form recognition result.
  • the following method may also be executed after the above step 830: correcting the form recognition result according to the form image, and outputting the corrected form recognition result.
  • the subject that executes the method may be the data processing model.
  • This method of correction can be understood as a method of post-event correction, that is, after the form recognition result is obtained, the form recognition result can be corrected according to the form image.
  • the data processing model can identify the table included in the table image through multiple rounds of iterations according to the markup language used to identify the table structure and the text included in the cell in the table.
  • the decoder #1 included in the transformer decoder in the data processing model can determine the local structure according to the features of the table image and the markup language obtained in the previous round of iteration.
  • the decoder #2 included in the transformer decoder can determine, according to the output of decoder #1 and the table image features, the specific positions in the table of the bounding boxes of the non-empty cells indicated by that output, so that the redundancy of predicted bounding boxes can be reduced and the efficiency of obtaining table recognition results can be improved.
  • the decoder #2 predicts the bounding boxes included in all non-empty cells in the table through multiple rounds of iterations, so that the predicted bounding boxes are more accurate, which is conducive to improving the accuracy of table recognition results.
  • the data processing model can also correct the first bounding box acquired in the first iteration in real time, which can further improve the accuracy of the first bounding box.
  • the robustness and accuracy of the output result of the next iteration can be further improved, which is conducive to further improving the accuracy of the table recognition results.
  • Next, the model training method for data processing provided by the embodiment of the present application will be described, taking the model training stage as an example.
  • Fig. 9a is a schematic flowchart of a data processing model training method 900 provided by an embodiment of the present application. As shown in FIG. 9 a , the method 900 includes steps 910 to 940 . Step 910 to step 940 will be described in detail below.
  • Step 910: acquire multiple training data sets and the label information corresponding to each training data set.
  • Step 920: input a training data set into the target model, and the target model processes the training data set to obtain the training output information corresponding to the training data set.
  • Step 930: adjust the parameters of the target model according to the label information and the training output information, so as to minimize the difference between the training output information and the label information.
  • Step 940: using the adjusted parameter values, return to and continue to execute step 920 and step 930 until the obtained loss value converges, at which point the trained target model is obtained.
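  • A hedged sketch of steps 920 to 940 as a PyTorch training loop (target_model, train_loader, loss_fn, the optimizer choice, and the convergence tolerance are all placeholders, not values fixed by this application):

```python
import torch

optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)

prev_loss, converged = float("inf"), False
while not converged:
    for table_images, labels in train_loader:        # training data sets + labels
        outputs = target_model(table_images)         # step 920: training output
        loss = loss_fn(outputs, labels)              # gap from the label information
        optimizer.zero_grad()
        loss.backward()                              # step 930: adjust parameters
        optimizer.step()
    converged = abs(prev_loss - loss.item()) < 1e-4  # step 940: loss converges
    prev_loss = loss.item()
# target_model is now the trained target model
```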
  • both the above target model and the trained target model include: a feature extraction model and a transformer decoder.
  • the feature extraction model and the transformer decoder are trained together.
  • each training data set may include a table image
  • the label information corresponding to each training data set may be used to indicate the table features included in the table image.
  • the table features include one or more of the following features: the number of rows in the table, the number of columns in the table, the size of the table, the row-spanning feature of the table, the column-spanning feature of the table, or the layout of the table.
  • the layout of the table includes a markup language used to indicate the structure of the table, and a bounding box of each cell or cell text.
  • FIG. 6 shows a schematic structural diagram of a model including a feature extraction model and a transformer decoder.
  • the above target model and the trained target model may both include: a feature extraction model, a transformer decoder, and an error correction detection model.
  • the error correction detection model may be a neural network model, which is used to correct the bounding boxes corresponding to the text contained in the cells in the table.
  • the feature extraction model, the transformer decoder, and the error correction detection model are trained together.
  • each training data set may include a table image, and the label information corresponding to each training data set may be used to indicate the table features included in the table image.
  • the table features include one or more of the following features: the number of rows in the table, the number of columns in the table, the size of the table, the row-spanning feature of the table, the column-spanning feature of the table, or the layout of the table.
  • the layout of the table includes a markup language used to indicate the structure of the table, and a bounding box of each cell or cell text.
  • FIG. 10 to FIG. 12 are only intended to help those skilled in the art understand the embodiments of the present application, and are not intended to limit the embodiments of the present application to the illustrated specific values or specific scenarios. Those skilled in the art can obviously make various equivalent modifications or changes according to the examples shown in FIGS. 10 to 12 below, and such modifications and changes also fall within the scope of the embodiments of the present application.
  • the table image includes a table 1, and the case where the table 1 is represented by the HTML language is taken as an example for introduction.
  • the table 1 can also be expressed in any of the following languages, but not limited to: Extensible Markup Language XML, or LaTeX.
  • the table 1 may be as shown in Table 1, the table 1 includes multiple cells, the multiple cells include text, and the cells included in the table 1 may be strictly arranged in a row-first order.
  • FIG. 9b shows the rectangular bounding box corresponding to the text included in each non-empty cell (that is, a cell that includes text) of the table 1, and the number corresponding to each rectangular bounding box shown in FIG. 9b indicates the order in which the cells included in the table 1 are arranged by rows.
  • Table 1 shown in Table 1 above can be represented by the following simplified HTML language:
  • "<td>[]</td>" and "<td" represent the HTML sequences of non-empty cells, the encoding of the bounding box corresponding to the text included in the HTML sequence of a non-empty cell is [x1, y1, x2, y2] (each value in the range 0 to N), and the values of x1, y1, x2 and y2 are not all 0 at the same time.
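  • The full sequence for Table 1 is not reproduced here; purely as a hypothetical illustration of this simplified encoding (the coordinates are invented, and the coordinate convention is explained below), a one-row table with one non-empty and one empty cell might be serialized as:

```html
<table>
  <tr>
    <td>[]</td>   <!-- non-empty cell; bounding box [12, 8, 96, 40] -->
    <td></td>     <!-- empty cell; bounding box [0, 0, 0, 0] -->
  </tr>
</table>
```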
  • the specific position of the bounding box may be determined by the coordinates of the upper left corner of the bounding box and the coordinates of the lower right corner of the bounding box.
  • "x1, y1" may represent the coordinates of the upper left corner of the one bounding box
  • "x2, y2" may represent the coordinates of the lower right corner of the one bounding box.
  • the specific position of the bounding box may be determined by the coordinates of the lower left corner of the bounding box and the coordinates of the upper right corner of the bounding box.
  • "x1, y1” may represent the coordinates of the lower left corner of the one bounding box
  • "x2, y2" may represent the coordinates of the upper right corner of the one bounding box.
  • the bounding box corresponding to the text may refer to any polygon surrounding the text.
  • the polygon is not specifically limited.
  • the polygon may be, but not limited to, one of the following polygons: rectangle, square, parallelogram, or other polygons (eg, hexagon, etc.).
  • in the following, the case where "a bounding box corresponding to text is a rectangular bounding box, and the coordinates of the upper left corner of the rectangular bounding box and the coordinates of the lower right corner of the rectangular bounding box are used to determine the specific position of the rectangular bounding box" is used as an example for introduction.
  • the data processing method shown in FIG. 10 may be a specific example of the data processing method shown in FIG. 8 above.
  • the method 1000 shown in FIG. 10 takes as an example the case where the table image in the method 800 shown in FIG. 8 above includes one table (that is, the table shown in Table 1 above), HTML is used to mark the table recognition result and the local structure of the table obtained in each round of iteration, and the error correction detection model is used to correct the first bounding box.
  • FIG. 10 is a schematic flowchart of a data processing method 1000 provided by an embodiment of the present application. It can be understood that the method 1000 shown in FIG. 10 can be executed by the data processing model shown in FIG. 6 above. Specifically, the structural decoder 1 shown in FIG. 10 may be the decoder #1 shown in FIG. 6 , and the bounding box decoder 1 shown in FIG. 10 may be the decoder #2 shown in FIG. 6 . As shown in FIG. 10 , the method 1000 includes step 1010 to step 1080 . Next, step 1010 to step 1080 will be introduced in detail.
  • Step 1010: the feature extraction model performs feature extraction on the form image to obtain image feature 1.
  • the number of tables included in the table image is not specifically limited.
  • a form image may include 1, 2 or 5 forms, etc.
  • the case where the table image in the above step 1010 includes the table 1 shown in the above Table 1 is taken as an example for introduction.
  • the image feature 1 is used to indicate one or more of the following features: the number of rows in Table 1, the number of columns in Table 1, the size of Table 1, the row-spanning feature of Table 1, or the column-spanning feature of Table 1.
  • the feature extraction model is a submodel in the data processing model.
  • the feature extraction model is a neural network model with a feature extraction function, and the structure of the feature extraction model is not specifically limited.
  • the feature extraction model can be a CNN model.
  • the feature extraction model may also be a combined model composed of a CNN model and a feature pyramid network (feature pyramid network, FPN) model.
  • the feature extraction model may also be used to obtain form images.
  • a user inputs a form image into a data processing model, so that the feature extraction model can obtain the form image.
  • the structural decoder 1 can perform i rounds of iterative processing based on the image feature 1 and the initial input sequence to obtain the structural features of Table 1, where i is a positive integer; the structural features of Table 1 are part of the global structure of Table 1.
  • the structural features of Table 1 may include row and column information of Table 1.
  • the structural decoder 1 performs the iterative processing based on the image feature 1 and the initial input sequence as follows: in the first iteration, the structural decoder 1 processes the image feature 1 and the initial input sequence; in the (i+1)-th iteration, the structural decoder 1 processes the image feature 1, the initial input sequence, and the output result of the i-th iteration to obtain the output result of the (i+1)-th iteration.
  • In step 1020, step 1030 and step 1040, the process of the structure decoder 1 executing the first to third iterations will be described in detail. Exemplarily, (1) in FIG. 11 shows the flow of the structure decoder 1 performing the first iteration based on the image feature 1, the sequence position code and the initial input sequence, where the predicted structure sequence 1 is the output result of the first iteration.
  • (2) in FIG. 11 shows the flow of the second iteration performed by the structure decoder 1, where the predicted structure sequence 2 is the output result of the second iteration; (3) in FIG. 11 shows the flow of the third iteration performed by the structure decoder 1, where the predicted structure sequence 3 is the output result of the third iteration.
  • Step 1020: structure decoder 1 processes image feature 1, the initial input sequence and the initial bounding box to obtain predicted structure sequence 1, which is used to indicate an HTML sequence 1 that is not a non-empty cell.
  • the initial input sequence may include a sequence indicating the start of the HTML sequence marking Form 1, where the initial bounding box is empty.
  • the predicted structure sequence 1 includes an HTML sequence 1 used to indicate that the cell is not a non-empty cell.
  • the predicted structure sequence 1 does not include bounding box information, that is, the bounding box information included in the predicted structure sequence 1 is empty.
  • (1) in FIG. 11 shows the initial input sequence, the initial bounding box and the predicted structure sequence 1; the predicted structure sequence 1 is "<table>", and "<table>" is the HTML sequence 1 used to indicate content that is not a non-empty cell.
  • the structural decoder 1 processes the image feature 1, the initial input sequence and the initial bounding box to obtain the recognition result 1 as follows: the structural decoder 1 processes the image feature 1, the initial input sequence and the initial bounding box to obtain the output result of the structural decoder 1; the structural decoder 1 linearizes the output result to obtain sequence information 1, which is used to indicate the predicted structure sequence 1; and the normalized exponential function softmax is used to process the sequence information 1 to obtain the predicted structure sequence 1.
  • the structure decoder 1 processes the image feature 1, the initial input sequence and the initial bounding box to obtain the output result of the structure decoder 1, including: the structure decoder 1 processes the image feature 1, the initial The input sequence, the initial sequence position encoding and the initial bounding box are processed to obtain the output of the structure decoder 1.
  • the initial sequence position code is used to indicate the position of the initial input sequence in Table 1. Exemplarily, (1) in FIG. 11 shows the initial sequence position encoding.
  • Taking the structure decoder 1 in the embodiment of the present application as having the structure shown in FIG. 7 as an example, the following specifically introduces how "the structure decoder 1 processes the image feature 1, the initial input sequence, the initial sequence position encoding and the initial bounding box to obtain the output result of the structure decoder 1".
  • the input of the masked multi-head attention layer of residual branch 1 includes V1, K1 and Q1.
  • V1 is the value (value) vector obtained according to the target vector 1
  • K1 is the key (key) vector obtained according to the target vector 1
  • Q1 is the query (query) vector obtained according to the target vector 1
  • the output of the masked multi-head attention layer is the result of scaled dot-product attention over Q1, K1 and V1, which can be expressed by the following formula 1: Attention(Q1, K1, V1) = softmax(Q1·K1^T / √d_k1)·V1, where Attention(Q1, K1, V1) represents the output result of the masked multi-head attention layer and d_k1 represents the dimension of the K1 vector.
  • the output of the masked multi-head attention layer is then summed and normalized to obtain the output of residual branch 1.
  • the input of the multi-head attention layer of residual branch 2 includes V2, K2 and Q2.
  • V2 is the value (value) vector obtained by processing according to image feature 1
  • K2 is the key (key) vector obtained by processing according to image feature 1
  • Q2 is the query (query) vector obtained by processing according to target vector 2
  • target vector 2 is the query vector obtained by processing the output result of residual branch 1.
  • the output of the multi-head attention layer is the result of scaled dot-product attention over Q2, K2 and V2, which can be expressed by the following formula 2: Attention(Q2, K2, V2) = softmax(Q2·K2^T / √d_k2)·V2, where Attention(Q2, K2, V2) represents the output result of the multi-head attention layer and d_k2 represents the dimension of K2.
  • the input of the feedforward neural network FFN layer of the residual branch 3 includes the output of the residual branch 2, and the feedforward neural network FFN layer performs the following operations on the output of the residual branch 2: Linear transformation and linear rectification function (ReLU) processing.
  • Step 1030: structure decoder 1 processes image feature 1, input sequence 1 and bounding box 1 to obtain predicted structure sequence 2, which is used to indicate an HTML sequence 2 that is not a non-empty cell.
  • the input sequence 1 includes an HTML sequence for indicating the local structure 2 of the table 1 , which is a partial structure in the global structure of the table 1 , and the local structure 2 includes the local structure 1 and the predicted structure sequence 1 .
  • (2) in FIG. 11 shows input sequence 1.
  • the bounding box 1 is used to indicate the location of the text included in the cell corresponding to the local structure 2 .
  • (2) in FIG. 11 shows the boxes included in the bounding box 1.
  • the sequence position code 1 shown in (2) in FIG. 11 is used to indicate the position of the input sequence 1 in the table 1 .
  • the prediction structure sequence 2 is used to indicate the HTML sequence 2 that is not a non-empty cell, and at this time the prediction structure sequence 2 does not include bounding box information.
  • (2) in FIG. 11 shows the prediction structure sequence 2
  • " ⁇ tr>" is used to indicate the HTML sequence 2 that is not a non-empty cell.
  • the execution principle of the above step 1030 is the same as that of the above step 1020, except that the input data and output data of the structural decoder 1 differ; details are not repeated here.
  • (2) in FIG. 11 shows the execution flow of the above step 1030.
  • Step 1040: structure decoder 1 processes image feature 1, input sequence 2 and bounding box 2 to obtain predicted structure sequence 3, which includes an HTML sequence 3 for indicating non-empty cell 1.
  • the input sequence 2 includes an HTML sequence for indicating the local structure 3 of the table 1 , which is a partial structure in the global structure of the table 1 , and the local structure 3 includes the local structure 1 and the predicted structure sequence 2 .
  • (3) in FIG. 11 shows the input sequence 2 .
  • the bounding box 2 is used to indicate the location of the text included in the corresponding cell of the local structure 3 .
  • (3) in FIG. 11 shows the boxes included in the bounding box 2.
  • the sequence position code 2 shown in (3) in FIG. 11 is used to indicate the position of the input sequence 2 in the table 1 .
  • the prediction structure sequence 3 includes an HTML sequence 3 for indicating a non-empty cell 1.
  • the prediction structure sequence 3 may include a bounding box #1, which is a bounding box corresponding to the text in the non-empty cell 1.
  • (3) in FIG. 11 shows a prediction structure sequence 3
  • "<td" is the HTML sequence 3 used to indicate the non-empty cell 1. It can be understood that when the shape of the text area in the non-empty cell 1 is a rectangle, the bounding box #1 can be that rectangle.
  • the execution principle of the above step 1040 is the same as that of the above step 1020, except that the input data and output data of the structure decoder 1 differ; details are not repeated here.
  • (3) in FIG. 11 shows the execution flow of the above step 1040 .
  • Step 1050: the bounding box decoder 1 processes the image feature 1 and the HTML sequence information of the non-empty cell 1 to obtain the position of the bounding box #1 in the table 1, where the HTML sequence information of the non-empty cell 1 is used to indicate the HTML sequence 3 of the non-empty cell 1; the sequence information 3 includes the HTML sequence information of the non-empty cell 1, and the sequence information 3 is used to indicate the predicted structure sequence 3.
  • the position of the bounding box #1 in the table 1 is the position in the table 1 of the text in the non-empty cell 1.
  • the position of the bounding box #1 in Table 1 can be described by the coordinates of the upper left corner and the lower right corner of the rectangle.
  • (3) in FIG. 11 shows the sequence information 3 , the predicted structure sequence 3 , the HTML sequence information of the non-empty cell 1 , and the HTML sequence 3 of the non-empty cell 1 .
  • the bounding box decoder 1 processes the image feature 1 and the HTML sequence information of the non-empty cell 1 to obtain the position of the bounding box #1 in the table 1 as follows: the bounding box decoder 1 performs j rounds of iterative processing based on the image feature 1 and the HTML sequence information of the non-empty cell 1 to obtain the position of the bounding box #1 in the table 1, where j is a positive integer.
  • the bounding box decoder 1 performs the j rounds of iterative processing based on the image feature 1 and the HTML sequence information of the non-empty cell 1 as follows: in the first iteration, the bounding box decoder 1 processes the image feature 1 and the HTML sequence information of the non-empty cell 1 to obtain the output result of the first iteration; in the (j+1)-th iteration, the bounding box decoder 1 processes the image feature 1, the HTML sequence information of the non-empty cell 1, and the output result of the j-th iteration to obtain the output result of the (j+1)-th iteration.
  • the execution flow of each iteration of the bounding box decoder 1 is the same as that of the structural decoder 1 described in the above step 1020, except that the input data and output data of the decoder differ; for details, refer to the relevant descriptions in the above step 1020.
  • after the above step 1050 is executed, it can be understood that whenever the predicted structure sequence output by the structure decoder 1 (for example, the predicted structure sequence 3) is used to indicate a non-empty cell, the bounding box decoder 1 is also triggered to predict, based on that predicted structure sequence, the position in the table 1 of the bounding box included in the predicted structure sequence.
  • Step 1060: judge whether the matching degree between the bounding box #1 and the bounding box #2 is greater than or equal to a preset threshold.
  • the execution subject of the above step 1060 may be a data processing model.
  • the bounding box #2 may include a bounding box obtained by correcting the local structure 3 by an error correction detection model, and the error correction detection model is a trained artificial intelligence AI model.
  • the data processing model may also include the error correction and detection model.
  • when it is determined that the matching degree between the bounding box #1 and the bounding box #2 is greater than or equal to the preset threshold, step 1070 is executed after step 1060; when it is determined that the matching degree between the bounding box #1 and the bounding box #2 is less than the preset threshold, step 1080 is executed after step 1060.
  • the method for determining the degree of matching between the bounding box #1 and the bounding box #2 is not specifically limited. Exemplarily, the matching degree may be determined by any of the following methods: matching by intersection-over-union (IoU), or the distance between the center points.
  • the size of the preset threshold is not specifically limited, and the size of the preset threshold can be set according to actual needs.
  • Step 1070: structure decoder 1 corrects the bounding box 2 in step 1040 to the bounding box #2, and processes the image feature 1, the input sequence 2 and the bounding box #2 to obtain predicted structure sequence 4, which includes an HTML sequence 4 for indicating the non-empty cell 1.
  • the predicted structure sequence 4 includes an HTML sequence 4 for indicating the non-empty cell 1; at this time, the predicted structure sequence 4 includes the bounding box #2, which is the bounding box corresponding to the text in the non-empty cell 1 and is different from the bounding box #1.
  • the execution principle of the above step 1070 is the same as that of the above step 1020, except that the input data and output data of the structural decoder 1 differ; for details, refer to the relevant description in the above step 1020.
  • Step 1080: structure decoder 1 processes image feature 1, input sequence 3 and bounding box #1 to obtain predicted structure sequence 5, which includes an HTML sequence 5 for indicating non-empty cell 2.
  • the input sequence 3 includes an HTML sequence for indicating the local structure 4 of the table 1 , which is a partial structure in the global structure of the table 1 , and the local structure 4 includes the local structure 1 and the predicted structure sequence 3 .
  • the above step 1080 is executed after the above step 1060 .
  • when it is determined that the matching degree between the bounding box #1 and the bounding box #2 is less than the preset threshold, it can be understood that the bounding box #1 obtained in the above step 1050 is accurate.
  • the structure of Table 1 can be further determined based on the predicted structure sequence 3 obtained in the above step 1050.
  • the execution principle of the above step 1080 is the same as the execution principle of the above step 1020, except that the input data and output data of the structural decoder 1 are different, which will not be described in detail here. For details, please refer to the relevant description in the above step 1020.
  • the execution order of the above step 1070 and the above step 1080 is not specifically limited. For example, after the above step 1060, the above step 1070 may be performed first, and then the above step 1080 may be performed. For another example, after the above step 1060, the above step 1080 may be performed first, and then the above step 1070 may be performed.
  • the structure decoder 1 and the bounding box decoder 1 each need to undergo multiple iterations to obtain the global structure of Table 1; the global structure can include the row and column information of Table 1 and the bounding box corresponding to the text in each non-empty cell.
  • the data processing model may also perform the following steps: according to the global structure of Table 1, obtain the text content included in the global structure.
  • the manner of obtaining the text content included in the global structure is not specifically limited.
  • the cell image corresponding to the text bounding box can be cropped out, and the cell image can be recognized by an optical character recognition (optical character recognition, OCR) system, so as to obtain the text content included in the cell image.
  • the HTML sequence referred to in the embodiment of the present application can be equivalently converted into an extensible markup language (extensible markup language, XML) sequence or a LaTeX sequence.
  • FIG. 13 is a schematic structural diagram of a data processing device 1300 provided by an embodiment of the present application.
  • the data processing apparatus 1300 shown in FIG. 13 may execute corresponding steps of the above-mentioned data processing method.
  • the data processing device 1300 includes: an acquisition unit 1310, a processing unit 1320 and an output unit 1330,
  • the acquisition unit 1310 is used to acquire the form image to be processed; the processing unit 1320 is used to determine the form recognition result from the form image according to the generative form recognition strategy, wherein the generative form recognition strategy is used to indicate that the table recognition result of the table image is determined by using the markup language and the non-overlapping property of the bounding boxes, the bounding box is used to indicate the position, in the table associated with the table image, of the text contained in a cell, and the table recognition result is used to indicate the global structure and content included in the table; the output unit 1330 is used to output the form recognition result.
  • the non-overlapping attribute of the bounding boxes is used to indicate that the areas corresponding to the cells included in the table do not overlap.
  • the processing unit 1320 is further configured to: obtain the form recognition result through iterative processing according to the form image features and the markup language.
  • the iterative processing includes multiple rounds of iterations
  • the processing unit 1320 is further configured to: determine the first bounding box and the local structure obtained in the first iteration according to the table image features and the markup language, where the first iteration is any round of the multiple rounds of iterations, the first bounding box is used to indicate the bounding box of the local structure obtained in the first iteration, and the local structure is a partial structure of the global structure; and, when the second iteration obtains the global structure, determine that the processing result obtained in the second iteration is the table recognition result, where the second iteration is an iteration performed after the first iteration in the iterative processing, and the processing result includes the global structure and the content.
  • the processing unit 1320 is further configured to: correct the first bounding box obtained by the first iteration.
  • the processing unit 1320 is further configured to: correct the first bounding box according to the input parameters and the table image.
  • the processing unit 1320 is further configured to: when the matching degree between the second bounding box and the first bounding box is greater than or equal to a preset threshold, correct the first bounding box according to the second bounding box, where the second bounding box is obtained by processing the local structure with an error correction detection model, and the error correction detection model is a trained artificial intelligence AI model.
  • the processing unit 1320 is further configured to: correct the form recognition result according to the form image, and output the corrected form recognition result.
  • the processing unit 1320 is further configured to: perform feature extraction on the form image to obtain features of the form image.
  • any one of the following markup languages is used to mark the table recognition result: Hypertext Markup Language HTML, Extensible Markup Language XML, or LaTeX.
  • the apparatus 1300 in the embodiment of the present application may be implemented by a central processing unit (central processing unit, CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (programmable logic device, PLD), where the above-mentioned PLD can be a complex programmable logic device (complex programmable logical device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), generic array logic (generic array logic, GAL), or any combination thereof.
  • the device 1300 and its unit modules may also be software modules.
  • Fig. 14 is a schematic structural diagram of a training device 1400 provided by an embodiment of the present application.
  • the training device 1400 shown in FIG. 14 can execute corresponding steps of the above-mentioned model training method for data processing.
  • the training device 1400 includes: an acquisition unit 1410 and a processing unit 1420 .
  • the obtaining unit 1410 is configured to execute the above step 910 .
  • the processing unit 1420 is configured to execute the above step 920 , the above step 930 , and the above step 940 .
  • the training device 1400 may further include an output unit, the output unit 1430 is configured to output the trained target model obtained in the above step 940 .
  • the device 1400 in the embodiment of the present application can be implemented by a central processing unit (central processing unit, CPU), an application-specific integrated circuit (application-specific integrated circuit, ASIC), an artificial intelligence chip, a system on a chip (system on chip, SoC), an accelerator card, or a programmable logic device (programmable logic device, PLD), where the above-mentioned PLD can be a complex programmable logic device (complex programmable logical device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), generic array logic (generic array logic, GAL), or any combination thereof.
  • the device 1400 and its unit modules may also be software modules.
  • apparatus 1300 and apparatus 1400 are both embodied in the form of functional units.
  • unit here may be implemented in the form of software and/or hardware, which is not specifically limited.
  • a "unit” may be a software program, a hardware circuit or a combination of both to realize the above functions.
  • the hardware circuitry may include application-specific integrated circuits (ASICs), electronic circuits, processors for executing one or more software or firmware programs (such as shared processors, dedicated processors, or group processors) and memory, merged logic circuits, and/or other suitable components that support the described functionality.
  • the units of each example described in the embodiments of the present application can be realized by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • FIG. 15 is a schematic structural diagram of a computing device 1500 provided by an embodiment of the present application.
  • Computing device 1500 may be a server, an edge server, a computer workstation, or other devices with computing capabilities.
  • the computing device 1500 may include: at least one processor 1501, a memory unit 1502, a communication interface 1503, and a storage medium 1504, as shown by a solid line in FIG. 15 .
  • the computing device 1500 may further include an output device 1506 and an input device 1507, as shown by dashed lines in FIG. 15 .
  • the processor 1501, the memory unit 1502, the communication interface 1503, the storage medium 1504, the output device 1506, and the input device 1507 communicate through the bus 1505, and may also communicate through other means such as wireless transmission.
  • the computing device 1500 may be used to implement the same or similar functions of the above-mentioned data processing apparatus 1300 .
  • the memory unit 1502 is used to store computer instructions 15022
  • the processor 1501 can invoke the computer instructions 15022 stored in the memory unit 1502 to execute the steps of the methods performed by the data processing model in the above method embodiments.
  • the functionality of the computing device 1500 is the same or similar to that of the training device 1400 described above.
  • the memory unit 1502 is used to store computer instructions 15022, and the processor 1501 can invoke the computer instructions 15022 stored in the memory unit 1502 to execute the steps of the above training method.
  • the processor 1501 may include at least one CPU 15011.
  • the processor 1501 can also be another general-purpose processor, a digital signal processor (digital signal processing, DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), an artificial intelligence AI chip, a system-on-chip SoC or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or any conventional processor or the like.
  • processor 1501 includes two or more processors of different types.
  • the processor 1501 includes the CPU 15011 and at least one of: a general-purpose processor, a digital signal processor DSP, an application-specific integrated circuit ASIC, a field-programmable gate array FPGA, an artificial intelligence AI chip, a system-on-chip SoC or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the memory unit 1502 may include read-only memory and random-access memory, and provides instructions and data to the processor 1501 .
  • the memory unit 1502 may also include non-volatile random access memory.
  • the memory unit 1502 may also store information about the device type.
  • Memory unit 1502 can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • by way of example and not limitation, many forms of RAM are available, such as static random-access memory (SRAM), dynamic random-access memory (DRAM), synchronous dynamic random-access memory (SDRAM), double data rate synchronous dynamic random-access memory (DDR SDRAM), enhanced synchronous dynamic random-access memory (ESDRAM), synchlink dynamic random-access memory (SLDRAM), and direct rambus random-access memory (DR RAM).
  • the communication interface 1503 implements communication between the computing device 1500 and other devices or communication networks using a transceiver device such as but not limited to a transceiver.
  • for example, when the processor 1501 invokes the computer instructions 15022 stored in the memory unit 1502 to execute the steps of the methods performed by the data processing model in the above method embodiments, the table image or the data processing model can be obtained through the communication interface 1503.
  • likewise, when the processor 1501 invokes the computer instructions 15022 to execute the steps of the above training method, the training data set can be obtained through the communication interface 1503.
  • the storage medium 1504 has a storage function.
  • the storage medium 1504 may be used to temporarily store computing data in the processor 1501 and data exchanged with an external memory.
  • the storage medium 1504 may be, but not limited to, a hard disk drive (HDD).
  • the bus 1505 may include not only a data bus, but also a power bus, a control bus, and a status signal bus. However, the various buses are labeled as bus 1505 in FIG. 15 for clarity of illustration.
  • the bus 1505 may be a Peripheral Component Interconnect Express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX) bus, or the like. The bus 1505 may also be divided into an address bus, a data bus, a control bus, and so on.
  • the output device 1506 may be any of the following: keyboard, tablet, microphone, stereo, or display.
  • the input device 1507 may be any one of the following: keyboard, mouse, camera, scanner, handwriting input board, or voice input device.
  • the steps of the method executed by the data processing model in the above method embodiments may also be implemented based on multiple computing devices 1500 , or the steps of the above training method may be implemented based on multiple computing devices 1500 .
  • the plurality of computing devices 1500 may be devices included in a computer cluster.
  • for illustration, a computer cluster including two computing devices 1500 is taken as an example, where the two computing devices 1500 are used to execute the method performed by the data processing model in the above method embodiments.
  • the two computing devices 1500 are respectively referred to as device #1 and device #2, and the structures of device #1 and device #2 can be referred to in FIG. 15 .
  • Device #1 and device #2 can, but are not limited to, communicate via Ethernet to realize data transmission.
  • the memory unit included in device #1 can provide instructions and data to the processor included in device #2 through Ethernet, so that the processor included in device #2 can invoke, through Ethernet, the computer instructions stored in the memory unit included in device #1 to execute the steps of the methods performed by the data processing model in the above method embodiments.
  • the structure of the computing device 1500 listed above is only an example, and the present application is not limited thereto.
  • the computing device 1500 in the embodiment of the present application includes various hardware in a computer system in the prior art. Those skilled in the art should understand that the computing device 1500 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the above computing device 1500 may also include hardware devices for implementing other additional functions.
  • the present application also provides a data processing system. The system includes a plurality of computing devices 1500 as shown in FIG. 15; the plurality of computing devices 1500 form the data processing system in the form of a cluster, and the system is used to implement the functions of the operation steps in the above data processing method.
  • An embodiment of the present application provides a computer program product. The computer program product includes computer program code which, when run on a computer, causes the computer to execute the above data processing method.
  • An embodiment of the present application provides a computer program product. The computer program product includes computer program code which, when run on a computer, causes the computer to perform the above data processing model training method.
  • An embodiment of the present application provides a computer-readable storage medium for storing a computer program, where the computer program is used to execute the method in the foregoing method embodiment.
  • An embodiment of the present application provides a chip system, including at least one processor and an interface; the at least one processor is used to call and run a computer program, so that the chip system executes the methods in the above method embodiments.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product comprises one or more computer instructions or computer programs.
  • when the computer instructions or computer programs are loaded or executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

A data processing method and apparatus. The method includes: acquiring a table image to be processed; determining a table recognition result from the table image according to a generative table recognition strategy, where the generative table recognition strategy indicates that the table recognition result of the table image is determined by using a markup language and a bounding-box non-overlap property, the bounding box indicates the position of the text included in the cells of the table associated with the table image, and the table recognition result indicates the global structure and content of the table; and outputting the table recognition result. Recognizing a table image in this way can improve the accuracy of the table recognition result.

Description

数据处理的方法和相关设备 技术领域
本申请涉及人工智能领域,尤其涉及一种数据处理的方法、装置、系统和数据处理芯片。
背景技术
图像表格识别(简称为表格识别)是将图像中的表格转换为可编辑的表格(例如,超文本标记语言(hypertext markup language,HTML)等格式)的人工智能(artificial intelligence,AI)技术。图像表格识别在文档格式的自动化处理中扮演着重要角色。
相关技术中提供的表格识别方法,首先对图像中的表格进行行列线检测,然后计算该表格包括的所有行列线之间的交叉点,即可还原出该表格包括的每个单元格的坐标(即单元格位置)。在获得所有单元格位置后,按照单元格位置对所有单元格进行排列,并通过启发式算法获取单元格的行列信息(例如,起始行、起始列、跨行或者跨列),以得到表格识别结果。这种实现方式中,当行列线不明显或者行列线倾斜时,会存在行列线漏检或者交叉点计算错误,基于这种方式得到的表格识别结果的准确性较差。
因此,亟需一种数据处理的方法,该方法可以提高表格识别结果的准确性。
发明内容
本申请提供一种数据处理的方法、装置、系统和数据处理芯片,可以提高表格识别结果的准确性。
第一方面,提供了一种数据处理的方法,包括:获取待处理的表格图像;根据该表格图像按照生成式表格识别策略确定表格识别结果,其中,该生成式表格识别策略用于指示利用标记语言和包围框不重叠属性确定该表格图像的表格识别结果,该包围框用于指示该表格图像所关联的表格中的单元格包括的文本所在位置,该表格识别结果用于指示该表格所包括的全局结构和内容;输出该表格识别结果。
标记语言可以用于指示表格局部结构,该表格局部结构为表格全局结构中的部分结构。其中,表格结构可以包括:表格的行、表格的列、表格包括的单元格、表格中的每个单元格、以及表格中的每个单元格包括的文本对应的包围框。文本对应的包围框,可以是指包围该单元格包括的文本的任意多边形的包围框。表格中的单元格包括的文本所在位置,可以理解为,表格中的单元格包括的文本对应的包围框的位置。
上述技术方案中,能够根据用于标识表格结构的标记语言和该表格中的单元格包括的文本位于表格中的位置对表格进行识别,以得到表格识别的结果,避免了传统技术中仅根据表格的行列结构(该表格的行列结构不包括包围框)对表格进行识别存在识别结果的准确性较差的问题,本申请提供的方法可以提高表格识别结果的准确性。
在一种可能的设计中,该包围框不重叠属性用于指示该表格所包括的各个单元格所对应的区域无重叠。
其中,该表格所包括的各个单元格所对应的区域无重叠,即该表格包括的各个单元格不存在重叠,且该各个单元格包括的文本对应的包围框也不存在重叠。包围框可以是指包围一个单元格包括的文本的任意多边形的框。包围框,又可称为文本对应的包围框或单元格文本块。
可选的,在一些实现方式中,表格包括的单元格是按照行的顺序排列的。
上述技术方案中,对表格图像进行表格识别时,不仅利用了用于标记表格结构的标记语言,同时还利用了表格中的包围框不重叠属性。也就是说,该方法充分利用了表格的特征,有利于提高表格识别结果的鲁棒性和准确性。
在另一种可能的设计中,该根据该表格图像按照生成式表格识别策略确定表格识别结果,包括:根据该表格图像特征和该标记语言通过迭代处理获得该表格识别结果。
其中,该表格图像特征可以用于指示以下一种或多种特征:表格的行数目,表格的列数目,表格的大小,表格的跨行特征,表格的跨列特征,或表格的布局。其中,表格的布局包含用于指示该表格结构的标记语言,以及表格中的每个单元格或者表格中的每个单元格包括的文本对应的包围框。上述技术方案中,根据表格图像特征和标记语言通过迭代的方式预测表格识别结果,使得预测的表格识别结果更准确,可以提高表格识别结果的准确性。
在另一种可能的设计中,该迭代处理包括多轮迭代,该方法还包括:根据该表格图像特征和该标记语言确定第一迭代获得的第一包围框和局部结构,该第一迭代为该多轮迭代的任意一轮迭代处理过程,该第一包围框用于指示该第一迭代所获得的该局部结构的包围框,该局部结构为该全局结构的部分结构;当第二迭代获得该全局结构时,确定该第二迭代获得的处理结果为该表格识别结果,该第二迭代是该迭代处理中在该第一迭代处理后执行的一次迭代处理,该处理结果包括该全局结构和该内容。
其中,第一迭代所获得的局部结构的包围框,即第一迭代所获得的表格的局部结构中的单元格包括的文本所在位置。可以理解的是,当局部结构不包括任何单元格,或该局部结构包括的任意一个单元格为空单元格(即,单元格不包括任何文本)时,该局部结构的包围框为空。
可选的,当第二迭代为第一迭代之后的最近一次迭代时,在第二迭代过程中,会根据第一迭代获得的第一包围框和局部结构,确定该第二迭代获得的处理结果为该表格识别结果。
上述技术方案中,在本轮迭代(例如,第二迭代)时,通过根据上一轮迭代(例如,第一迭代)获得的包围框和局部结构,确定本轮迭代的结果。当执行该方法的主体为AI模型时,即在每轮迭代时,该方法不仅会使用已生成的局部结构(该局部结构可以利用标记语言进行标记)作为先验,并且会将已生成的包围框作为先验,一同输入到该AI模型中,指导该AI模型下一步的生成。这种方法相当于不仅告诉该AI模型在本轮迭代前已经生成了多少单元格,而且还告诉该AI模型在本轮迭代前已经生成的单元格位于表格中的具体位置,这样该AI模型的注意力就会关注未生成的单元格,该方法能够有效减轻AI模型注意力发散现象,有利于提高表格识别结果的准确性。
上述多轮迭代处理的流程,可以由本申请提供的数据处理的模型中的transformer解码器执行。该transformer解码器可以包括2个解码器,分别记为第一解码器和第二解码器。下面,以transformer解码器根据该表格图像特征和该标记语言确定第一迭代获得的第一包围框和局部结构为例进行介绍。示例性的,transformer解码器根据该表格图像特征和该标记语言确定第一迭代获得的第一包围框和局部结构,可以包括以下步骤:通过该第一解码器对该表格图像特征和该标记语言进行处理,得到第一输出结果,该第一输出结果指示非空单元格或不是非空单元格;该数据处理的模型对该第一输出结果进行第一运算,得到该局部结构。其中,第一运算可以包括归一化指数函数softmax处理。上述通过该第一解码器对该表格图像特征和该标记语言进行处理,得到第一输出结果,包括:通过该第一解码器对该表格图像特征和该标记语言进行处理,得到该第一解码器的输出结果;该数据处理的模型对该第 一解码器的输出结果进行线性化处理,得到该第一输出结果。
在一些可能的设计中,该第一解码器包括第一残差支路、第二残差支路和第三残差支路,该第一残差支路包括第一注意力头,该第二残差支路包括第二注意力头,该第三残差支路包括第一前馈神经网络FFN层,该通过该第一解码器对该表格图像特征和该标记语言进行处理,得到该第一解码器的输出结果,包括:该第一残差支路对目标向量进行处理,得到该第一残差支路的输出结果,该目标向量为根据该标记语言得到的向量;该第二残差支路对该表格图像特征和该第一残差支路的输出结果进行处理,得到该第二残差支路的输出结果;该第三残差支路对该第一FFN的输出结果进行该目标运算,得到该第一解码器的输出结果,该第一FFN的输出结果为根据该第二残差支路的输出结果进行第二运算得到的。其中,第二运算可以是线性运算,该线性运算具体可以是:线性变换和线性整流函数ReLU激活运算。
在一些可能的设计中,该第一残差支路还包括第一残差单元,该第一残差支路对目标向量进行处理,得到该第一残差支路的输出结果,包括:该第一残差单元对该第一注意力头的输出进行目标运算,得到该第一残差支路的输出结果,该第一注意力头的输出为根据第一向量,第二向量和第三向量进行乘法运算得到的,该第一向量为根据该目标向量得到的查询向量,该第二向量为根据该目标向量得到的键向量,该第三向量为根据该目标向量得到的值向量。其中,该乘法运算可以包括点乘和叉乘。
在一些可能的设计中,该第二残差支路还包括第二残差单元,该第二残差支路对该表格图像特征和该第一残差支路的输出结果进行处理,得到该第二残差支路的输出结果,包括:该第二残差单元对该第二注意力头的输出进行该目标运算,得到该第二残差支路的输出结果,该第二注意力头的输出为根据第四向量,第五向量和第六向量进行乘法运算得到的,该第四向量为根据该表格图像特征得到的键向量,该五向量为根据该表格图像特征得到的值向量,该第六向量为根据该第一残差支路的输出结果得到的查询向量。
在一些可能的设计中,该目标向量为根据位置编码信息,第二包围框和该标记语言进行第三运算得到的向量,该位置编码信息指示该标记语言指示的局部结构位于表格中的位置,该第二包围框用于指示该局部结构的包围框。其中,第三运算可以包括加法运算。该局部结构的包围框,用于指示该表格中该局部结构中的单元格包括的文本所在位置。可以理解的是,当该局部结构不包括单元格,或该局部结构包括的任意一个单元格不包括文本时,该局部结构的包围框为空。
上述技术方案中,目标向量是根据位置编码信息,第二包围框和标记语言得到的,通过该位置编码信息指示该标记语言指示的局部结构位于表格中的位置,利于提高表格识别结果的鲁棒性和准确性。
在一些可能的设计中,当该第一输出结果指示该非空单元格时,该数据处理的方法还包括:通过该第二解码器对该表格图像特征和该第一输出结果进行处理,得到第二输出结果,该第二输出结果用于指示该第一包围框;该数据处理的模型对该第二输出结果进行目标运算,得到该第一包围框。第二解码器可以通过多轮迭代获取第二输出结果。可以理解的是,第二解码器执行每次迭代的工作原理与第一解码器执行每次迭代的工作原理相同,仅是这个两个解码器的输入和输出数据存在差别。
可以理解的是,上述技术方案中,可以仅在第一解码器的输出用于指示非空单元格时触发第二解码器根据该第一解码器的输出确定该第一解码器的输出对应的包围框,该方法能够减少预测包围框的冗余和提高表格识别结果的效率。此外,该方法通过迭代的方式预测表格包括的所有包围框,使得预测的包围框更准确,还有利于提高表格识别结果的准确性。
上述技术方案可以应用于本申请提供的数据处理模型中的transformer解码器,该transformer解码器可以包括解码器#1和解码器#2。解码器#1可以根据用于标识表格结构的标记语言和该表格中的单元格包括的文本位于表格中的位置,通过多轮迭代对表格图像包括的表格进行表格识别,避免了传统技术中仅根据表格的行列结构(该表格的行列结构不包括包围框)对表格进行识别存在识别结果的准确性较差的问题。当该解码器#1的输出结果用于指示非空单元格时,该解码器#1的输出可以作为解码器#2的输入,以使解码器#2基于该解码器#1的输出结果和表格图像特征,确定该解码器#1的输出所指示的非空单元格包括的文本位于表格中的具体位置。综上,该方法可以提高表格识别结果的准确性和识别效率。
在另一种可能的设计中,该方法还包括:对该第一迭代获得的该第一包围框进行纠正。
可选的,在第一迭代之后的下一轮迭代时,可以基于该纠正后的包围框进行表格识别。
上述技术方案中,可以对第一迭代获取的第一包围框进行实时纠正,能够进一步提高第一包围框的精度。在第一迭代后的下一迭代时,基于该纠正后的第一包围框进行处理时,能够进一步提高该下一迭代的输出结果的鲁棒性和准确性,该方法有利于进一步提高表格识别结果的准确性。
在另一种可能的设计中,该对该第一迭代获得的该第一包围框进行纠正,包括:根据输入参数和该表格图像对该第一包围框进行纠正。
可选的,上述输入参数可以是用户根据表格图像获取的一个或多个参数,该一个或多个参数用于对第一包围框进行纠正。
上述技术方案中,用户可以根据实际需求确定对第一包围框进行纠正的输入参数,并通过用户手动输入该输入参数以对该第一包围框进行实时纠正,该方法在进一步提高表格识别结果的准确性的前提下,还可以提高用户使用的满意度。
在另一种可能的设计中,该对该第一迭代获得的该第一包围框进行纠正,包括:在第二包围框与该第一包围框的匹配度大于或等于预设阈值的情况下,根据该第二包围框对该第一包围框进行纠正,该第二包围框为误差纠偏检测模型对该局部结构进行处理得到的,该误差纠偏检测模型为经过训练的人工智能AI模型。
可选的,本申请提供的数据处理的模型还可以包括误差纠偏检测模型。
上述技术方案中,可以通过数据处理的模型中纠偏检测模型自动地对模型预测得到的第一包围框进行实时纠正,有利于进一步提高表格识别结果的准确性和识别效率。
在另一种可能的设计中,该方法还包括:根据该表格图像对该表格识别结果进行纠正,并输出纠正后的表格识别结果。
上述技术方案中,通过对获取的表格识别结果进行纠正,有利于进一步提高表格识别结果的准确性。
在另一种可能的设计中,该方法还包括:对该表格图像进行特征提取,获得该表格图像特征。
其中,该表格图像特征可以用于指示以下一种或多种特征:表格的行数目,表格的列数目,表格的大小,表格的跨行特征,表格的跨列特征,或表格的布局。其中,表格的布局包含用于指示该表格结构的标记语言,以及表格中的每个单元格或者表格中的每个单元格包括的文本对应的包围框。
上述获得表格图像特征的流程,可以由本申请提供的数据处理的模型中的特征提取模型执行。该特征提取模型是一种具有特征提取功能的神经网络模型,对特征提取模型的结构不作具体限定。
在另一种可能的设计中,采用以下任意一种标记语言标识该表格识别结果:超文本标记语言HTML,可扩展标记语言XML,或者拉泰赫LaTex。
上述技术方案中,可以利用标记语言标识表格识别结果,有利于后续对表格识别结果的进一步处理。
第二方面,提供了一种数据处理的装置,该装置包括用于执行第一方面或第一方面任一种可能实现方式中的数据处理方法的各个模块。
第三方面,提供了一种数据处理的装置,该数据处理的装置具有实现上述第一方面或第一方面的任意一种可能的实现方式,以及第二方面或第二方面中任意一种可能的实现方式所描述的数据处理的装置的功能。该功能可以基于硬件实现,也可以基于硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
在一种可能的实现方式中,数据处理的装置的结构中包括处理器,该处理器被配置为支持数据处理的装置执行上述方法中相应的功能。
该数据处理的装置还可以包括存储器,该存储器用于与处理器耦合,其保存数据处理的装置必要的程序指令和数据。
在另一种可能的实现方式中,该数据处理的装置包括:处理器、发送器、接收器、随机存取存储器、只读存储器以及总线。其中,处理器通过总线分别耦接发送器、接收器、随机存取存储器以及只读存储器。其中,当需要运行数据处理的装置时,通过固化在只读存储器中的基本输入/输出系统或者嵌入式系统中的bootloader引导系统进行启动,引导数据处理的装置进入正常运行状态。在数据处理的装置进入正常运行状态后,在随机存取存储器中运行应用程序和操作系统,使得该处理器执行第一方面或第一方面的任意可能的实现方式中的方法。
第四方面,提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,当该计算机程序代码在计算机上运行时,使得计算机执行上述第一方面或第一方面的任意一种可能执行的方法。
第五方面,提供了一种计算机可读介质,该计算机可读介质存储有程序代码,当该计算机程序代码在计算机上运行时,使得计算机执行上述第一方面或第一方面的任意一种可能执行的方法。这些计算机可读存储包括但不限于如下的一个或者多个:只读存储器(read-only memory,ROM)、可编程ROM(programmable ROM,PROM)、可擦除的PROM(erasable PROM,EPROM)、Flash存储器、电EPROM(electrically EPROM,EEPROM)以及硬盘驱动器(hard drive)。
第六方面,提供一种芯片系统,该芯片系统包括处理器与数据接口,其中,处理器通过该数据接口读取存储器上存储的指令,以执行上述第一方面或第一方面的任意一种可能的实现方式中的方法。在具体实现过程中,该芯片系统可以以中央处理器(central processing unit,CPU)、微控制器(micro controller unit,MCU)、微处理器(micro processing unit,MPU)、数字信号处理器(digital signal processing,DSP)、片上系统(system on chip,SoC)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field programmable gate array,FPGA)或可编辑逻辑器件(programmable logic device,PLD)的形式实现。
第七方面,提供了一种数据处理的系统,该系统包括处理器,该处理器用于执行上述第一方面或第一方面的任意一种可能的实现方式中的方法。
第八方面,提供了一种数据处理的集群,该集群包括上述第二方面或第二方面的任意一种可能的实现方式,以及第三方面或第三方面中任意一种可能的实现方式所描述的多个数据 处理的装置,该多个数据处理的装置可以用于执行上述第一方面或第一方面的任意一种可能的实现方式中的方法。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
图1是人工智能主体框架的一种结构示意图。
图2是一种标准的transformer模块的结构示意图。
图3是一种卷积神经网络结构的示意图。
图4是另一种卷积神经网络结构的示意图。
图5是本申请实施例提供的系统架构500的结构示意图。
图6是本申请实施例提供的数据处理的模型的结构示意图。
图7是本申请实施例提供的一种解码器的结构示意图。
图8是本申请实施例提供的一种数据处理的方法800的示意性流程图。
图9a是本申请实施例提供的一种数据处理的模型训练方法900的示意性流程图。
图9b是本申请实施例提供的表格包括的包围框的示意图。
图10是本申请实施例提供的一种数据处理的方法1000的示意性流程图。
图11是本申请实施例提供的数据处理的方法1000的执行过程示意图。
图12是本申请实施例提供的数据处理的方法1000的执行过程示意图。
图13是本申请实施例提供的一种数据处理的装置1300的示意性结构图。
图14是本申请实施例提供的一种训练装置1400的示意性结构图。
图15是本申请实施例提供的一种计算设备1500的结构示意图。
具体实施方式
下面将结合附图,对本申请实施例中的技术方案进行描述。
首先对人工智能系统总体工作流程进行描述,请参见图1,图1是人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人智能的底层基础设施110、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施110
基础设施110为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(中央处理单元(central processing unit,CPU)、嵌入式神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据120
基础设施110的上一层的数据120用于表示人工智能领域的数据来源。数据120涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理130
数据处理130通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。其中,机器学习和深度学习可以对数据120进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力140
对数据120经过上面提到的数据处理130后,进一步基于数据处理130的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用150
智能产品及行业应用150指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、智慧城市等。
本申请实施例可以应用在人工智能中的很多领域,例如,智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶、智慧城市或智能终端等领域。
为了便于理解,下面对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以xs和截距1为输入的运算单元,该运算单元的输出可以为:

$$h_{W,b}(x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$$
其中,s=1、2、……n,n为大于1的自然数,Ws为xs的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
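As a minimal illustration of the neural-unit output formula above (a sketch for this description, not part of the original disclosure), the following Python snippet computes one unit's output with a sigmoid activation f; the inputs, weights, and bias are made-up example values:

import math

def neuron_output(xs, ws, b):
    # f(sum_s ws[s] * xs[s] + b) with f chosen as the sigmoid, matching the formula above
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return 1.0 / (1.0 + math.exp(-z))

print(neuron_output(xs=[0.5, -1.2, 3.0], ws=[0.4, 0.1, -0.7], b=0.2))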
(2)transformer模型
transformer模型也可以称为transformer模块、或transformer结构等。transformer模型是一种基于自注意力模块的多层神经网络。目前主要是用于处理自然语言任务,transformer模型主要由层叠的多头自注意力模块与前馈神经网络(feed forward neural networks,FFN)组成。transformer模型可进一步分成编码器(也可称为编码模块)和解码器(也可称为解码模块),其构成大致相似,也有所不同。
图2是一种标准的transformer模块的结构示意图。如图2所示,左边为编码器210,右边为解码器220,编码器210可包括任意数量的编码子模块,每个编码子模块包括一个多头自注意力模块和一个前馈神经网络。解码器220可包括任意数量的解码子模块,每个解码 子模块包括两个多头自注意力模块和一个前馈神经网络。编码子模块的数量与解码子模块的数量可以相等或不相等。
(3)注意力机制(attention mechanism)
注意力机制模仿了生物观察行为的内部过程,即一种将内部经验和外部感觉对齐从而增加部分区域的观察精细度的机制,能够利用有限的注意力资源从大量信息中快速筛选出高价值信息。注意力机制可以快速提取稀疏数据的重要特征,因而被广泛用于自然语言处理任务,特别是机器翻译。而自注意力机制(self-attention mechanism)是注意力机制的改进,其减少了对外部信息的依赖,更擅长捕捉数据或特征的内部相关性。注意力机制的本质思想可以改写为如下公式:

$$\mathrm{Attention}(Query,Source)=\sum_{i=1}^{L_{x}}\mathrm{Similarity}(Query,Key_{i})\cdot Value_{i}$$
其中,Lx=||Source||代表Source的长度,公式含义即将Source中的构成元素想象成是由一系列的数据对构成,此时给定目标Target中的某个元素Query(简记为Q),通过计算Query和各个Key(简记为K)的相似性或者相关性,得到每个Key对应Value(简记为V)的权重系数,即得到了最终的Attention数值。所以本质上Attention机制是对Source中元素的Value值进行加权求和,而Query和Key用来计算对应Value的权重系数。从概念上理解,把Attention可以理解为从大量信息中有选择地筛选出少量重要信息并聚焦到这些重要信息上,忽略大多不重要的信息。聚焦的过程体现在权重系数的计算上,权重越大越聚焦于其对应的Value值上,即权重代表了信息的重要性,而Value是其对应的信息。自注意力机制可以理解为内部Attention(intra attention),Attention机制发生在Target的元素Query和Source中的所有元素之间,自注意力机制指的是在Source内部元素之间或者Target内部元素之间发生的Attention机制,也可以理解为Target=Source这种特殊情况下的注意力计算机制,其具体计算过程是一样的,只是计算对象发生了变化而已。
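The weighted-sum view of attention above can be made concrete with a short numerical sketch. The Python code below (an illustration under assumed shapes, not the patent's implementation) computes softmax(Q K^T / sqrt(d_k)) V, the scaled dot-product form that also appears as formula 1 and formula 2 later in this description:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # similarity of each Query to each Key, scaled by sqrt(d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over the keys turns similarities into weight coefficients
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of the Values is the Attention output
    return weights @ V

Q = np.random.rand(4, 8)   # 4 queries of dimension 8 (assumed sizes)
K = np.random.rand(6, 8)   # 6 keys
V = np.random.rand(6, 8)   # 6 values
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)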
(4)卷积神经网络(convolutional neuron network,CNN)
卷积神经网络是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取特征的方式与位置无关。卷积核可以以随机大小的矩阵的形式化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
CNN是一种非常常见的神经网络,如前文的基础概念介绍该,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入做出响应。
下面结合图3重点对CNN的结构进行详细的介绍。如图3所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及全连接层(fully connected layer)230。
卷积层/池化层220:
卷积层:
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,以图像为例(其他数据类型类似),在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面该的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
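To make the sliding weight-matrix operation described above concrete, here is a naive single-channel 2D convolution in Python; this is an illustrative sketch, the kernel values and stride are arbitrary, and real networks use learned multi-channel kernels:

import numpy as np

def conv2d_single_channel(image, kernel, stride=1):
    # slide the weight matrix over the input with the given stride (no padding),
    # computing a weighted sum at each position
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36.0).reshape(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # a made-up edge-like 3x3 filter
print(conv2d_single_channel(image, edge_kernel, stride=2))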
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图3中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
全连接层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为上述卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用全连接层230来生成一个或者一组所需要的类的数量的输出。因此,在全连接层230中可以包括多层隐含层(如图3所示的231、232至23n),该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……
在全连接层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图3由210至240方向的传播为前向传播)完成,反向传播(如图3由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图3所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,仅包括图3中所示的网络结构的一部分,比如,本申请实施例中所采用的卷积神经网络可以仅包括输入层210、卷积层/池化层220和输出层240。
需要说明的是,如图3所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图4所示的多个卷积层/池化层并行,将分别提取的特征均输入给全连接层230进行处理。
本申请提供了一种数据处理的方法,包括:获取待处理的表格图像;根据表格图像按照生成式表格识别策略确定表格识别结果,其中,生成式表格识别策略用于指示利用标记语言和包围框不重叠属性确定表格图像的表格识别结果,包围框用于指示表格图像所关联的表格中的单元格包括的文本所在位置,表格识别结果用于指示表格所包括的全局结构和内容;输出表格识别结果。该方法通过根据用于标识表格结构的标记语言和该表格中的单元格包括的文本位于表格中的位置对表格进行识别,以得到表格识别的结果,避免了传统技术中仅根据表格的行列结构(该表格的行列结构不包括包围框)对表格进行识别存在识别结果的准确性较差的问题,该方法可以提高表格识别结果的准确性。
接下来,结合附图5介绍本申请实施例中模型训练阶段和模型应用阶段的系统架构。
图5对本申请实施例提供的系统架构进行详细的介绍。图5是本申请实施例提供的系统架构500的结构示意图。如图5所示,系统架构500包括执行设备510、训练设备520、数据库530、客户设备540、数据存储系统550以及数据采集系统560。
执行设备510包括计算模块511、数据处理系统512、预处理模块513和预处理模块514。计算模块511中可以包括目标模型/规则501,预处理模块513和预处理模块514是可选的。
数据采集设备560用于采集训练样本。例如,针对本申请实施例的数据处理的方法来说,若样本为图像数据,则训练样本可以包括训练图像以及训练图像对应的分类结果,其中,训练图像的分类结果可以是人工预先标注的结果。在采集到训练样本之后,数据采集设备560将这些训练样本存入数据库530。应理解,数据库530中还可以维护有本申请提供的数据处理的模型。示例性的,下文中的图6示出了本申请实施例提供的数据处理的模型的结构示意图,具体参见下文中对图6的相关描述,此处不再详细赘述。
训练设备520可以基于数据库530中维护的训练样本训练得到目标模型/规则501。其中,该目标模型/规则501可以是本申请提供的数据处理的模型。
需要说明的是,在实际应用中,数据库530中维护的所有训练样本可以都是数据采集设备560采集的训练样本。可选的,数据库530中维护的部分训练样本还可以是除数据采集设备560之外的其它设备采集的训练样本。另外需要说明的是,训练设备520也有可能从云端或其他地方获取训练样本进行目标模型/规则501的训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备520训练得到的目标模型/规则501可以应用于不同的系统或设备中,如应 用于图5所示的执行设备510,所述执行设备510可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备,车载终端等,还可以是服务器或者云端等。
具体的,训练设备520可以将本申请提供的数据处理的模型传递至执行设备510。
在图5中,执行设备510包括数据处理系统512,数据处理系统512用于与外部设备进行数据交互,用户可以通过客户设备540向数据处理系统512输入数据(例如本申请实施例中的待处理的表格图像)。
预处理模块513和预处理模块514用于根据数据处理系统512接收到的输入数据进行预处理。应理解,可以没有预处理模块513和预处理模块514,或者只有一个预处理模块。当不存在预处理模块513和预处理模块514时,可以直接采用计算模块511对输入数据进行处理。
在执行设备510对输入数据进行预处理,或者在执行设备510的计算模块511执行计算等相关的处理过程中,执行设备510可以调用数据存储系统550中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统550中。
最后,数据处理系统512将处理结果(例如本申请实施例中的表格识别结果)呈现给客户设备540,从而提供给用户。
在图5所示情况下,用户可以手动给定输入数据,该“手动给定输入数据”可以通过数据处理系统512提供的用户界面(user interface,UI)进行操作。另一种情况下,客户设备540可以自动地向数据处理系统512发送输入数据,如果要求客户设备540自动发送输入数据需要获得用户的授权,则用户可以在客户设备540中设置相应权限。用户可以在客户设备540查看执行设备510输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备540也可以作为数据采集端,采集如图所示输入数据处理系统512的输入数据及输出数据处理系统512的输出结果作为新的样本数据,并存入数据库530。当然,也可以不经过客户设备540进行采集,而是由数据处理系统512直接将如图所示输入数据处理系统512的输入数据及输出数据处理系统512的输出结果,作为新的样本数据存入数据库530。
值得注意的是,图5仅是本申请实施例提供的一种系统架构的示意图,图5中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图5中,数据存储系统550相对执行设备510是外部存储器,在其它情况下,也可以将数据存储系统550置于执行设备510中。应理解,上述执行设备510可以部署于客户设备540中。
上述图5示出的系统架构500可以应用于本申请提供的数据处理的模型的应用阶段(又称为推理阶段),以及本申请提供的数据处理的模型的训练阶段。下面,具体描述当该系统架构500分别应用于数据处理的模型的应用阶段和数据处理的模型的训练阶段时,该系统架构500包括的模块的具体功能。
在一些实现方式中,上述图5示出的系统架构可以应用于本申请提供的数据处理的模型的应用阶段。这种实现方式中,上述执行设备510的计算模块511可以获取到数据存储系统550中存储的代码来实现本申请实施例中的数据处理的方法。执行设备510的计算模块511可以包括硬件电路(如专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field programmable gate array,FPGA)、图形处理器(graphics processing unit,GPU)、通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器等)、或这些硬件电路的组合,例如,训练设备520可以为具有执行指令功能的硬件系统,如CPU、DSP等,或者为不具有执行指令功能的硬件系统,如ASIC、FPGA 等,或者为上述不具有执行指令功能的硬件系统以及具有执行指令功能的硬件系统的组合。具体的,执行设备510的计算模块511可以为具有执行指令功能的硬件系统,本申请实施例提供的数据处理的方法可以为存储在存储器中的软件代码,执行设备510的计算模块511可以从存储器中获取到软件代码,并执行获取到的软件代码来实现本申请实施例提供的数据处理的方法。应理解,执行设备510的计算模块511可以为不具有执行指令功能的硬件系统以及具有执行指令功能的硬件系统的组合,本申请实施例提供的数据处理的方法的部分步骤还可以通过执行设备510的计算模块511中不具有执行指令功能的硬件系统来实现,这里并不限定。
可选的,在一些实现方式中,上述图5示出的系统架构可以应用于本申请提供的数据处理的模型的训练阶段。这种实现方式中,上述训练设备520可以获取到存储器(图5中未示出,可以集成于训练设备520或者与训练设备520分离部署)中存储的代码来实现本申请实施例中的数据处理的方法。训练设备520可以包括硬件电路(如专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)、图形处理器(graphics processing unit,GPU)、通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器等等)、或这些硬件电路的组合,例如,训练设备520可以为具有执行指令功能的硬件系统,如CPU、DSP等,或者为不具有执行指令功能的硬件系统,如ASIC、FPGA等,或者为上述不具有执行指令功能的硬件系统以及具有执行指令功能的硬件系统的组合。具体的,训练设备520可以为具有执行指令功能的硬件系统,本申请实施例提供的数据处理的方法可以为存储在存储器中的软件代码,训练设备520可以从存储器中获取到软件代码,并执行获取到的软件代码来实现本申请实施例提供的数据处理的方法。应理解,训练设备520可以为不具有执行指令功能的硬件系统以及具有执行指令功能的硬件系统的组合,本申请实施例提供的数据处理的方法的部分步骤还可以通过训练设备520中不具有执行指令功能的硬件系统来实现,这里并不限定。
上文结合图5详细介绍了本申请提供的数据处理的模型适用的系统架构,下面结合图6介绍本申请提供的数据处理的模型的结构。可以理解的是,本申请提供的数据处理的模型能够实现本申请提供的数据处理的方法。为便于描述,下文中将能够实现本申请提供的数据处理的方法的模型,简称为数据处理的模型。图6示出了该数据处理的模型的结构,与该模型相关的内容具体参见下文图6中的描述,此处不再详细赘述。
图6是本申请实施例提供的一种数据处理的模型的结构示意图。可以理解的是,数据处理的模型可以是一种神经网络模型。如图6所示,数据处理的模型包括特征提取模型和transformer解码器。其中,transformer解码器可以包括嵌入层#1,嵌入层#2,由N层解码器堆叠构成的解码器#1,以及由M层解码器堆叠构成的解码器#2,N和M为正整数。N层解码器中的任意一层解码器的结构,可以与M层解码器中的任意一层解码器的结构相同。示例性的,下文中的图7示出了该解码器的结构,具体参见图7中的相关描述,此处不再详细赘述。N和M的具体取值可以根据实际需要而设置。N和M的取值可以相等或不相等,对此不作具体限定。
下面描述本申请提供的数据处理的模型中各部分的具体工作过程:
1、特征提取模型
特征提取模型是一种神经网络模型,该特征提取模型用于对表格图像进行特征提取,以得到该表格图像包括的表格特征向量(又称为表格图像特征)。该表格特征向量用于指示以下一种或多种特征:表格的行数目,表格的列数目,表格的大小,表格的跨行特征,表格的跨 列特征,或表格的布局。其中,表格的布局包含用于指示该表格结构的标记语言,以及表格中的每个单元格或者表格中的每个单元格包括的文本对应的包围框。文本对应的包围框可以是指包围该文本的任意多边形。特征提取模型的输出可以包括表格特征向量对应的值向量V2,以及该表格特征向量对应的键向量K2。
对特征提取模型不作具体限定。在一些可能的设计中,该特征提取模型可以是CNN模型。在另一些可能的设计中,特征提取模型可以是由CNN模型和特征金字塔网络(feature pyramid network image,FPN)模型构成的组合模型。
2、嵌入层
嵌入层可以对当前输入进行嵌入处理,得到多个特征向量。数据处理的模型的核心特点在于其采用的独特的注意力机制。嵌入层对当前序列中各个节点的值、位置及其对应的包围框进行编码,并将这些编码进行逐元素相加,得到嵌入向量。嵌入层对嵌入向量进行处理,得到该嵌入向量对应的查询向量Q1,键向量K1和值向量V1。
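A minimal sketch of this embedding step follows (the width d and the random projection matrices are assumptions for illustration): the value, position, and bounding-box encodings are summed element-wise, and the result is linearly projected into the query, key, and value vectors Q1, K1, and V1:

import numpy as np

rng = np.random.default_rng(0)
d = 64   # assumed embedding width

def embed_step(token_emb, pos_emb, bbox_emb, Wq, Wk, Wv):
    # element-wise addition of the three encodings, as described above
    x = token_emb + pos_emb + bbox_emb
    # linear projections into Q1, K1, V1
    return x @ Wq, x @ Wk, x @ Wv

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
token_emb, pos_emb, bbox_emb = (rng.normal(size=(5, d)) for _ in range(3))   # 5 sequence positions
Q1, K1, V1 = embed_step(token_emb, pos_emb, bbox_emb, Wq, Wk, Wv)
print(Q1.shape, K1.shape, V1.shape)   # (5, 64) each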
3、解码器
图7是本申请实施例提供的一种解码器的结构示意图。如图7所示,解码器包括依次掩码多头注意力(masked multi-head attention)层、求和并归一化(add&norm)、多头注意力(multi-head attention,MHA)层、求和并归一化、前馈(feed forward)层、求和并归一化。为便于描述,下文中可以将掩码多头注意力(masked multi-head attention)层与求和并归一化,称为残差支路1。将多头注意力层与求和并归一化,称为残差支路2。将前馈层与求和并归一化,称为残差支路3。
其中,掩码多头注意力层从其上一层获取输入向量,采用自注意力机制,基于向量间的关联度对各个向量进行变换,得到该掩码多头注意力层的输出向量。对该掩码多头注意力层的输出向量执行求和并归一化处理,得到残差支路1的输出向量Q2。可以理解的是,当该掩码多头注意力层是与嵌入层直接相连的层,例如图6中与嵌入层#1直连的解码器,掩码多头注意力层获取的输入向量即为嵌入层#1输出的向量。示例性的,图7中示出了该掩码多头注意力层是与嵌入层直接相连的层,该掩码多头注意力层的输入包括嵌入层的输出向量(即,Q1,K1和V1)。还可以理解的是,当该掩码多头注意力层是后续的解码器包括的掩码多头注意力层,例如图6中,与上层(N等于1)的解码器#1直连的本层(N等于2)的解码器#1包括的掩码多头注意力层,该掩码多头注意力层获取的输入向量即为该上层的解码器#1的输出向量。
其中,多头注意力层的输入包括残差支路1输出的向量Q2,以及特征提取模型的输出向量(即V2和K2),采用自注意力机制,基于向量间的关联度对各个向量进行变换,得到输出向量。在多头注意力层,基于多头注意力(multi-head attention,MHA)的MHA层包括多个注意力头head。
其中,前馈神经网络FFN层,用于对残差支路2输出的向量执行以下操作:线性变换和线性整流函数(linear rectification function,ReLU)激活运算。此后,对前馈神经网络FFN层输出的向量进行求和与归一化处理得到残差支路3的输出向量。
上文介绍了本申请实施例适用的系统架构,以及执行本申请实施例的方法的数据处理的模型。下面,以模型推理阶段为例对本申请实施例提供的数据处理的方法进行说明。
图8是本申请实施例提供的一种数据处理的方法800的示意性流程图。可以理解的是,该方法800可以但不限于由上述图6所示的数据处理的模型执行,该数据处理的模型包括特征提取模型和transformer模型,该transformer模型包括解码器#1和解码器#2。如图8所 示,该方法800包括步骤810至步骤830。下面,详细介绍步骤810至步骤830。
步骤810,获取待处理的表格图像。
获取待处理的表格图像,可以包括:数据处理的模型获取待处理的表格图像。示例性的,数据处理的模型可以为用户提供用户界面(user interface,UI),用户通过该UI输入该表格图像。
对表格图像包括的表格数目不作具体限定,即表格图像可以包括一个或多个表格。可选的,该表格图像还可以替换为:可携带文档格式(portable document format,PDF)。
步骤820,根据表格图像按照生成式表格识别策略确定表格识别结果,其中,生成式表格识别策略用于指示利用标记语言和包围框不重叠属性确定表格图像的表格识别结果,包围框用于指示表格图像所关联的表格中的单元格包括的文本所在位置,表格识别结果用于指示表格所包括的全局结构和内容。
标记语言可以用于指示表格局部结构,该表格局部结构为表格全局结构中的部分结构。其中,表格结构可以包括:表格的行、表格的列、表格包括的单元格、表格中的每个单元格、以及表格中的每个单元格包括的文本对应的包围框。文本对应的包围框,可以是指包围该单元格包括的文本的任意多边形的包围框。表格中的单元格包括的文本所在位置,可以理解为,表格中的单元格包括的文本对应的包围框的位置。可选的,标记语言可以但不限于为以下任意一种标记语言:超文本标记语言HTML,可扩展标记语言(extensible markup language,XML),或者拉泰赫LaTex。
包围框不重叠属性用于指示表格所包括的各个单元格所对应的区域无重叠。表格所包括的各个单元格所对应的区域无重叠,即该各个单元格包括的文本对应的区域也无重叠。示例性的,图9b示出了本申请提供的表格包括的包围框的示意图,参见图9b所示,每个非空单元格(即,单元格包括文本)包括的文本对应的矩形包围框不存在重叠区域。
包围框用于指示表格图像所关联的表格中的单元格包括的文本所在位置。换句话说,包围框可以是指包围该单元格包括的文本的任意多边形。对包围框的形状不作具体限定。例如,该多边形可以但不限于为以下一种多边形:矩形、正方形、平行四边形、或其它多边形(例如,六边形等)。在本申请实施例中,可以通过该包围框的坐标来确定该包围框位于表格的具体位置。例如,当该包围框为矩形时,可以通过该一个矩形包围框的左上角坐标和该一个矩形包围框的右下角坐标,确定该一个矩形包围框位于表格中的具体位置。又如,当该包围框为矩形时,可以通过该一个矩形包围框的左下角坐标和该一个矩形包围框的右上角坐标,确定该一个矩形包围框位于表格中的具体位置。还应理解的是,当表格中的一个单元格中不包括任何文本时,该包围框用于指示的位置为空。一个单元格中不包括任何文本,该一个单元格又称为空单元格。
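The corner-coordinate convention above can be captured in a few lines of Python; this is an illustrative sketch, with the all-zero box standing for an empty cell as in the text:

from dataclasses import dataclass

@dataclass
class CellBox:
    # rectangular bounding box of a cell's text, located by its
    # top-left corner (x1, y1) and bottom-right corner (x2, y2)
    x1: int = 0
    y1: int = 0
    x2: int = 0
    y2: int = 0

    def is_empty(self) -> bool:
        # an empty cell carries the all-zero box (0, 0, 0, 0)
        return (self.x1, self.y1, self.x2, self.y2) == (0, 0, 0, 0)

print(CellBox().is_empty())                 # True: empty cell
print(CellBox(12, 30, 96, 54).is_empty())   # False: non-empty cell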
上述步骤820的执行主体可以是数据处理的模型中的transformer解码器。根据表格图像按照生成式表格识别策略确定表格识别结果,包括:transformer解码器根据表格图像特征和标记语言通过迭代处理获得表格识别结果。该迭代处理可以包括多轮迭代,transformer解码器还执行以下步骤:transformer解码器根据表格图像特征和标记语言确定第一迭代获得的第一包围框和局部结构,第一迭代为多轮迭代的任意一轮迭代处理过程,第一包围框用于指示第一迭代所获得的局部结构的包围框,局部结构为全局结构的部分结构;当第二迭代获得全局结构时,transformer解码器确定第二迭代获得的处理结果为表格识别结果,第二迭代是迭代处理中在第一迭代处理后执行的一次迭代处理,处理结果包括全局结构和内容。其中,第一迭代所获得的局部结构的包围框,即第一迭代所获得的表格的局部结构中的单元 格包括的文本所在位置。可以理解的是,当局部结构不包括任何单元格,或该局部结构包括的任意一个单元格为空单元格(即,单元格不包括任何文本)时,该局部结构的包围框为空。
上述步骤中,transformer解码器根据表格图像特征和标记语言确定第一迭代获得的第一包围框和局部结构,包括:解码器#1根据表格图像特征和标记语言确定局部结构,该局部结构包括表格的非空单元格,局部结构为表格全局结构中的部分结构;解码器#2根据表格图像特征和该表格的非空单元格,确定该非空单元格包括的文本对应的包围框位于表格中的位置。该非空单元格包括的文本对应的包围框位于表格中的位置,即该非空单元格包括的文本位于表格中的位置。当第一迭代为解码器#1的第一次迭代时,上述标记语言可以包括用于指示标记表格的标记语言开始的序列。当第一迭代为解码器#1的第一次迭代后的一次迭代时,上述标记语言可以包括用于指示标记表格的局部结构的序列,此时解码器#1输出的局部结构包含该标记语言所指示的表格的局部结构。可以理解的是,当解码器#1输出的局部结构不包括非空单元格时(即该局部结构不包括包围框),解码器#2可以不用对表格图像特征和该局部结构进行迭代处理。上述解码器#1和解码器#2分别通过执行多轮迭代可以获得表格的全局结构。示例性的,下文中的图10至图12示出了解码器#1和解码器#2分别通过多轮迭代确定表格识别结果的执行过程示意图,具体可以参见下文图10至图12中的相关描述,此处不再详细赘述。
可选的,在上述步骤820之后,还可以执行以下方法:对第一迭代获得的第一包围框进行纠正。这种纠正的方式,可以理解为是实时纠正的方式,即在进行表格识别的过程中对某一次获取的包围框进行纠正。执行该纠正处理的执行主体可以是上述解码器#1。可以理解的是,在第一迭代的下一轮迭代处理时,可以根据对第一包围框进行纠正后的包围框和表格图像特征进行处理,以得到该第一迭代的下一轮迭代获得的包围框和局部结构。该第一迭代的下一轮迭代获得的局部结构为表格的全局结构中的局部结构,且该第一迭代的下一轮迭代获得的局部结构包括第一迭代获取的局部结构。在一些可能的实现方式中,对第一迭代获得的第一包围框进行纠正,包括:根据输入参数和表格图像对第一包围框进行纠正。其中,输入参数包括用于对第一包围框进行纠正的参数,输入参数可以是用户根据表格图像得到的参数。具体实现时,数据处理的模型可以为用户提供具有该输入参数的用户界面(user interface,UI),用户通过该UI输入该输入参数。这种纠正的具体实现方式,可以理解为是用户手动对第一包围框进行纠正的方式。在另一些可能的实现方式中,对第一迭代获得的第一包围框进行纠正,包括:在第二包围框与第一包围框的匹配度大于或等于预设阈值的情况下,根据第二包围框对第一包围框进行纠正,第二包围框为误差纠偏检测模型对局部结构进行处理得到的,误差纠偏检测模型为经过训练的人工智能(artificial intelligence,AI)模型。可选的,数据处理的模型还可以包括该误差纠偏检测模型。基于此,这种纠正的具体实现方式,可以理解为是数据处理的模型自动对第一包围框进行纠正的方式。示例性的,下文中的步骤1060和步骤1070示出了这种自动纠正方式的流程,具体可以参见下文步骤1060和步骤1070的相关描述,此处不再详细赘述。在本申请实施例中,可以通过以下任意一种方式确定匹配度:交并比(intersection-over-union,IoU)进行匹配,或中心点的距离。例如,匹配度的确定方式为IoU时,IoU越大,匹配度越大。又如,匹配度的确定方式为中心点的距离时,距离越小,匹配度越大。对预设阈值的大小不作具体限定,可以根据实际需求设置预设阈值的大小。
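As an illustration of the matching step just described, the following Python sketch computes both candidate measures (IoU and center-point distance) and applies a correction when the IoU match reaches a threshold; the 0.7 threshold and the coordinates are made-up values, since the text leaves the threshold to practical needs:

def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def center_distance(a, b):
    # Euclidean distance between box centers; smaller means a better match
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5

predicted, detected = (10, 20, 60, 40), (12, 22, 58, 41)
if iou(predicted, detected) >= 0.7:
    predicted = detected   # replace the predicted box with the corrected one
print(predicted, center_distance(predicted, detected))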
可选的,在获取表格的全局结构后,还可以执行以下步骤:根据表格的全局结构,获取该全局结构包括的文本内容。对根据表格的全局结构,获取该全局结构包括的文本内容的方 式不作具体限定。例如,可以根据全局结构中的文本包围框,截取该文本包围框对应的单元格图像,并通过光学字符识别(optical character recognition,OCR)系统,识别该单元格图像,以得到该单元格图像包括的文本内容。
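A sketch of this crop-then-OCR step follows. Pillow and pytesseract are assumed stand-ins for the image library and the OCR system (the text does not prescribe a specific OCR implementation), and the file path and coordinates are hypothetical:

from PIL import Image
import pytesseract   # assumed OCR backend; any OCR system can play this role

def read_cell_text(table_image_path, box):
    # crop the cell image delimited by the text bounding box (x1, y1, x2, y2)
    image = Image.open(table_image_path)
    cell = image.crop(box)   # (left, upper, right, lower)
    # run OCR on the cropped cell image to recover its text content
    return pytesseract.image_to_string(cell).strip()

print(read_cell_text("table.png", (12, 30, 96, 54)))   # illustrative path and box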
可选的,在上述步骤820之前还可以执行以下方法:对表格图像进行特征提取,获得表格图像特征。具体实现时,可以是数据处理的模型中的特征提取模型对该表格图像进行特征提取,以获得该表格图像特征。表格图像特征包括以下一种或多种特征:表格的行数目,表格的列数目,表格的大小,表格的跨行特征,表格的跨列特征,或表格的布局。其中,表格的布局包含用于指示该表格结构的标记语言,以及该表格包括的每个单元格或者该表格包括的单元格文本的包围框。
可选的,可以采用以下任意一种标记语言标识表格识别结果:超文本标记语言HTML,可扩展标记语言XML,或者拉泰赫LaTex。示例性的,当上述步骤820中的标记语言为HTML时,表格识别结果可以采用HTML标识;当上述步骤820中的标记语言为XML时,表格识别结果可以采用XML标识。
步骤830,输出表格识别结果。
可选的,在上述步骤830之后还可以执行以下方法:根据表格图像对表格识别结果进行纠正,并输出纠正后的表格识别结果。执行该方法的主体可以是数据处理的模型。这种纠正的方法,可以理解为事后纠正的方式,即在得到表格识别结果后可以根据表格图像对该表格识别结果进行纠正。
应理解的是,上述方法800仅为示意,并不对本申请实施例提供的数据处理的方法构成任何限定。
在本申请实施例中,数据处理的模型能够根据用于标识表格结构的标记语言和该表格中的单元格包括的文本位于表格中的位置,通过多轮迭代对表格图像包括的表格进行识别,以得到该表格对应的表格识别结果,避免了传统技术中仅根据表格的行列结构(该表格的行列结构不包括包围框)对表格进行识别存在识别结果的准确性较差的问题,该方法可以提高表格识别结果的准确性。具体实现时,本轮迭代时,该数据处理的模型中的transformer解码器包括的解码器#1能够根据表格图像特征和上一轮迭代获取的标记语言确定局部结构。当本轮迭代的该解码器#1的输出用于指示非空单元格时,transformer解码器包括的解码器#2可以根据该解码器#1的输出和表格图像特征,确定该解码器#1的输出指示的非空单元格包括的包围框位于表格中的具体位置,这样,能够减少预测包围框的冗余和提高表格识别结果的效率。该解码器#2通过多轮迭代的方式预测表格中的所有非空单元格包括的包围框,使得预测的包围框更准确,有利于提高表格识别结果的准确性。此外,数据处理的模型还可以对第一迭代获取的第一包围框进行实时纠正,能够进一步提高第一包围框的精度。在第一迭代后的下一迭代时,基于该纠正后的第一包围框进行处理时,能够进一步提高该下一迭代的输出结果的鲁棒性和准确性,该方法有利于进一步提高表格识别结果的准确性。
下面,以模型训练阶段为例对本申请实施例提供的数据处理的模型训练方法进行说明。
图9a是本申请实施例提供的一种数据处理的模型训练方法900的示意性流程图。如图9a所示,该方法900包括步骤910至步骤940。下面具体介绍步骤910至步骤940。
步骤910,获取多个训练数据集和每个训练数据集对应的标注信息。
步骤920,将训练数据集输入目标模型,目标模型对训练数据集进行处理,以得到训练数据集对应的训练输出信息。
步骤930,根据标注信息和训练输出信息,调整目标模型的参数,以最小化训练输出信 息和标注信息的差异。
步骤940,使用调整后的参数值返回继续执行步骤920和步骤930直到得到的损失值逐渐收敛,即得到训练完成的目标模型。
在一些实现方式中,上述目标模型和训练完成的目标模型都包括:特征提取模型和transformer解码器。其中,特征提取模型和transformer解码器是一起执行模型训练的。这种实现方式中,每个训练数据集可以包括表格图像,该每个训练数据集对应的标注信息可以用于指示该表格图像包括的表格特征。表格特征包括以下一种或多种特征:表格的行数目,表格的列数目,表格的大小,表格的跨行特征,表格的跨列特征,或表格的布局。其中,表格的布局包含用于指示该表格结构的标记语言,以及每个单元格或者单元格文本的包围框。示例性的,图6示出了包括特征提取模型和transformer解码器的模型的结构示意图。
可选的,在另一些实现方式中,上述目标模型和训练完成的目标模型可以都包括:特征提取模型,transformer解码器,以及误差纠偏检测模型。其中,误差纠偏检测模型可以为神经网络模型,用于对表格中的单元格包括的文本对应的包围框进行纠正。其中,特征提取模型,transformer解码器,以及误差纠偏检测模型是一起执行模型训练的。这种实现方式中,每个训练数据集可以包括表格图像,该每个训练数据集对应的标注信息可以用于指示该表格图像包括的表格特征。表格特征包括以下一种或多种特征:表格的行数目,表格的列数目,表格的大小,表格的跨行特征,表格的跨列特征,或表格的布局。其中,表格的布局包含用于指示该表格结构的标记语言,以及每个单元格或者单元格文本的包围框。
下面,结合图10至图12介绍本申请实施例提供的一种数据处理的方法。应理解,图10至图12的例子仅仅是为了帮助本领域技术人员理解本申请实施例,而非要将申请实施例限制于所示例的具体数值或具体场景。本领域技术人员根据下面所给出的图10至图12的例子,显然可以进行各种等价的修改或变化,这样的修改和变化也落入本申请实施例的范围内。
在结合图10至图12介绍本申请实施例提供的一种数据处理的方法的之前,先对该方法实施例中涉及的表格相关内容进行介绍。
在本申请实施例中,可以以表格图像包括一个表格1,且以该一个表格1可以通过HTML语言表示为例进行介绍。可选的,该一个表格1还可以但不限于通过以下任意一种语言表示:可扩展标记语言XML,或者拉泰赫LaTex。示例性的,该一个表格1可以如表1所示,该表格1包括多个单元格,该多个单元格包括文本,该表格1包括的单元格可以是按照行优先的顺序严格排列的。示例性的,图9b示出了该表格1包括的每个非空单元格(即,单元格包括文本)包括的文本对应的矩形包围框,图9b中示出的每个矩形包围框对应的编号用于指示该表格1包括的单元格按行排列的顺序。
表1
[表1在原文中为图像(PCTCN2022142667-appb-000001),单元格的具体内容未随文给出;表1的结构可见下文的简化HTML表示。]
为了便于描述,可以将上述表1示出的表格1包括的每个单元格的内容进行简化表示,即若单元格包含文字,则将其用“[]”表示,否则用“”表示。基于此,可以通过以下简化后的HTML语言表示上述表1示出的表格1:
<html>
<table>
<tr>
<td rowspan="2">[]</td>
<td colspan="2">[]</td>
</tr>
<tr>
<td>[]</td>
<td>[]</td>
</tr>
<tr>
<td>[]</td>
<td>[]</td>
<td>[]</td>
</tr>
<tr>
<td>[]</td>
<td>[]</td>
<td></td>
</tr>
<tr>
<td>[]</td>
<td>[]</td>
<td>[]</td>
</tr>
</table>
</html>
其中,上述“<table>”、“<tr>”、“rowspan=*>”“colspan=*>”、“</td>”等不代表具体单元格的HTML序列。“<td></td>”代表空白单元格的HTML序列,空白单元格的HTML序列对应的单元格的包围框编码(又称为包围框坐标)为空(即0,0,0,0)。“<td>[]</td>”和“<td”代表非空单元格的HTML序列,非空单元格的HTML序列所包括的文本对应的包围框编码为([x1,y1,x2,y2]∈(0~N)),且x1,y1,x2和y2的取值不同时为0。在一些可能的实现方式中,可以通过一个包围框的左上角坐标和该一个包围框的右下角坐标,确定该包围框的具体位置。基于此,“x1,y1”可以表示该一个包围框的左上角坐标,“x2,y2”可以表示该一个包围框的右下角坐标。可选的,在另一些可能的实现方式中,可以通过一个包围框的左下角坐标和该一个包围框的右上角坐标,确定该包围框的具体位置。基于此,“x1,y1”可以表示该一个包围框的左下角坐标,“x2,y2”可以表示该一个包围框的右上角坐标。
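The pairing between structure tokens and bounding-box codes described above can be sketched as follows; the token strings follow the simplified HTML notation of Table 1, and the coordinates are illustrative:

# non-empty cells carry real coordinates, empty cells carry the all-zero box,
# and purely structural tokens carry no box at all
NON_EMPTY = {"<td>[]</td>", "<td"}
EMPTY = {"<td></td>"}

def align_boxes(tokens, cell_boxes):
    boxes = iter(cell_boxes)   # boxes of the non-empty cells, in row order
    aligned = []
    for tok in tokens:
        if tok in NON_EMPTY:
            aligned.append((tok, next(boxes)))
        elif tok in EMPTY:
            aligned.append((tok, (0, 0, 0, 0)))
        else:
            aligned.append((tok, None))
    return aligned

tokens = ["<tr>", "<td>[]</td>", "<td></td>", "</tr>"]
print(align_boxes(tokens, [(12, 30, 96, 54)]))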
需说明的是,文本对应的包围框可以是指包围该文本的任意多边形。对该多边形不作具体限定。例如,该多边形可以但不限于为以下一种多边形:矩形、正方形、平行四边形、或其它多边形(例如,六边形等)。为便于描述,下文描述的本申请实施例中以“一个文本对应的包围框为一个矩形包围框,且通过该一个矩形包围框的左上角坐标和该一个矩形包围框的右下角坐标,确定该一个矩形包围框的具体位置”为例进行介绍。
下面,结合图10至图12介绍本申请实施例提供的又一种数据处理的方法。图10示出的数据处理的方法可以为上述图8示出的数据处理的方法的一个具体示例。具体的,图10示出的方法1000是以上述图8示出的方法800中的表格图像包括一个表格(即,上述表1示出的表格),采用HTML标记该一个表格的表格识别结果和每轮迭代获取的表格局部结构,以及利用误差纠偏检测模型对第一包围框进行纠正为例进行介绍。
图10是本申请实施例提供的一种数据处理的方法1000的示意性流程图。可以理解的是,图10所示的方法1000可以由上述图6所示的数据处理的模型执行。具体的,图10所示的结构解码器1可以为图6所示的解码器#1,图10所示的包围框解码器1可以为图6所示的解码器#2。如图10所示,该方法1000包括步骤1010至步骤1080。下面,对步骤1010至步骤1080进行详细介绍。
步骤1010,特征提取模型对表格图像进行特征提取,得到图像特征1。
对表格图像包括的表格的数目不作具体限定。例如,表格图像可以包括1个、2个或5个表格等。为便于描述,本申请实施例中,均以上述步骤1010中的表格图像包括上述表1所示的一个表格1为例进行介绍。
图像特征1用于指示以下一种或多种特征:表格1的行数目,表格1的列数目,表格1的大小,表格1的跨行特征,或者表格1的跨列特征。
特征提取模型是数据处理的模型中的子模型。特征提取模型是一种具有特征提取功能的神经网络模型,对特征提取模型的结构不作具体限定。例如,该特征提取模型可以是CNN模型。又如,该特征提取模型还可以是由CNN模型和特征金字塔网络(feature pyramid network image,FPN)模型构成的组合模型。
可选的,在上述步骤1010之前,特征提取模型还可以用于获取表格图像。例如,用户将表格图像输入至数据处理的模型,以使该特征提取模型获取该表格图像。
在本申请实施例中,结构解码器1可以基于图像特征1和初始输入序列进行i次迭代处理,得表格1的结构特征,表格1的结构特征是表格1的全局结构中的部分特征,j为正整数。表格1的结构特征可以包括表格1的行列信息。其中,结构解码器1基于图像特征1和初始输入序列进行i次迭代处理,包括:第1次迭代处理时,结构解码器1对图像特征1和初始输入序列进行处理;第i+1次迭代处理时,结构解码器1对图像特征1,初始输入序列以及第i次迭代处理后的输出结果,得到第i+1次迭代处理后的输出结果。下面,结合步骤1020,步骤1030和步骤1040详细描述结构解码器1执行第1次至第3次迭代的处理过程。示例性的,图11中的(1)示出了结构解码器1基于图像特征1,序列位置编码和初始输入序列进行第1次迭代处理的流程,预测结构序列1为第1次迭代处理的输出结果。图11中的(2)示出了结构解码器2执行第2次迭代处理的流程,预测结构序列2为第2次迭代处理的输出结果。图11中的(3)示出了结构解码器2执行第3次迭代处理的流程,预测结构序列3为第3次迭代处理的输出结果。
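The iteration scheme above amounts to greedy autoregressive decoding. The following Python sketch shows only the shape of the loop; structure_decoder stands in for the trained model, and its (next token, next box) interface is an assumption made for illustration:

def decode_structure(image_features, structure_decoder, max_steps=512):
    # each iteration feeds the image features plus everything generated so far
    # back into the decoder, until the end-of-table token is produced
    tokens, boxes = ["<table>"], [None]
    for _ in range(max_steps):
        token, box = structure_decoder(image_features, tokens, boxes)
        tokens.append(token)
        boxes.append(box)
        if token == "</table>":   # the full (global) structure has been obtained
            break
    return tokens, boxes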
步骤1020,结构解码器1对图像特征1,初始输入序列和初始包围框进行处理,得到预测结构序列1,预测结构序列1用于指示不是非空单元格的HTML序列1。
初始输入序列可以包括用于指示标记表格1的HTML序列开始的序列,此时初始包围框为空。可以理解的是,预测结构序列1包括用于指示不是非空单元格的HTML序列1,此时预测结构序列1不包括包围框信息,即该预测结构序列1包括的包围框信息为空。示例性的,图11中的(1)示出了初始输入序列,初始包围框和预测结构序列1,预测结构序列1即“<table>”,“<table>”用于指示不是非空单元格的HTML序列1。
其中,结构解码器1对图像特征1,初始输入序列和初始包围框进行处理,得到识别结果1,包括:结构解码器1对图像特征1,初始输入序列和初始包围框进行处理,得到结构解码器1的输出结果;结构解码器1对该输出结果进行线性化,得到序列信息1,序列信息1用于指示预测结构序列1;利用归一化指数函数softmax对序列信息1并进行处理得到预测结构序列1。可选的,在一些实现方式中,结构解码器1对图像特征1,初始输入序列和初始包围框进行处理,得到结构解码器1的输出结果,包括:结构解码器1对图像特征1,初始输入序列,初始序列位置编码和初始包围框进行处理,得到结构解码器1的输出结果。初始序列位置编码用于指示初始输入序列在表格1中的位置。示例性的,图11中的(1)示出了初始序列位置编码。
下面,以本申请实施例中的结构解码器1为图7所示的结构为例,具体介绍“结构解码器1用于对图像特征1,初始输入序列,初始序列位置编码和初始包围框进行处理,得到结构解码器1的输出结果”的流程。
在本申请实施例中,残差支路1的掩码多头注意力层的输入包括V1,K1和Q1。其中,V1是根据目标向量1处理得到的值(value)向量,K1是根据目标向量1处理得到的键(key)向量,Q1是根据目标向量1处理得到的查询(query)向量,目标向量1是对初始输入序列,初始包围框和位置编码1进行加和处理得到的。掩码多头注意力层的输出为对V1,K1和Q1进行点乘运算得到的结果,该点乘运算可以通过以下公式1表示:
$$\mathrm{Attention}(Q1,K1,V1)=\mathrm{softmax}\left(\frac{Q1\cdot K1^{T}}{\sqrt{d_{k1}}}\right)\cdot V1\qquad(公式1)$$
其中,Attention(Q1,K1,V1)表示掩码多头注意力层的输出结果,d_{k1}表示K1向量的维度。
接着对掩码多头注意力层的输出进行求和并归一化处理,得到残差支路1的输出结果。
在本申请实施例中,残差支路2的多头注意力层的输入包括V2,K2和Q2。其中,V2是根据图像特征1处理得到的值(value)向量,K2是根据图像特征1处理得到的键(key)向量,Q2是根据目标向量2处理得到的查询(query)向量,目标向量2是对残差支路1的输出结果进行处理得到的查询向量。多头注意力层的输出为对V2,K2和Q2进行点乘运算得到的结果,该点乘运算可以通过以下公式2表示:
$$\mathrm{Attention}(Q2,K2,V2)=\mathrm{softmax}\left(\frac{Q2\cdot K2^{T}}{\sqrt{d_{k2}}}\right)\cdot V2\qquad(公式2)$$
其中,Attention(Q2,K2,V2)表示多头注意力层的输出结果,d_{k2}表示K2向量的维度。
接着对多头注意力层的输出进行求和并归一化处理,得到残差支路2的输出结果。
在本申请实施例中,残差支路3的前馈神经网络FFN层的输入包括残差支路2的输出结果,前馈神经网络FFN层对残差支路2的输出结果执行以下操作:线性变换和线性整流函数(linear rectification function,ReLU)处理。
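Putting the three residual branches together, a minimal PyTorch sketch of one such decoder layer follows; the width, head count, and feed-forward size are assumptions, and this illustrates the layer topology rather than the patent's exact implementation:

import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    # residual branch 1: masked multi-head self-attention + add & norm
    # residual branch 2: multi-head cross-attention over image features + add & norm
    # residual branch 3: FFN (linear, ReLU, linear) + add & norm
    def __init__(self, d=256, heads=8, ffn=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, ffn), nn.ReLU(), nn.Linear(ffn, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, img_feats):
        L = x.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        a, _ = self.self_attn(x, x, x, attn_mask=causal)   # masked self-attention
        x = self.n1(x + a)                                 # branch 1 output (source of Q2)
        a, _ = self.cross_attn(x, img_feats, img_feats)    # Q from branch 1; K2, V2 from image features
        x = self.n2(x + a)                                 # branch 2 output
        return self.n3(x + self.ffn(x))                    # branch 3 output

layer = DecoderLayer()
out = layer(torch.randn(1, 5, 256), torch.randn(1, 49, 256))
print(out.shape)   # torch.Size([1, 5, 256])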
步骤1030,结构解码器1对图像特征1,输入序列1和包围框1进行处理,得到预测结构序列2,预测结构序列2用于指示不是非空单元格的HTML序列2。
输入序列1包括用于指示表格1的局部结构2的HTML序列,局部结构2是表格1的全局结构中的部分结构,且局部结构2包含局部结构1和预测结构序列1。示例性的,图11中的(2)示出了输入序列1。
包围框1用于指示局部结构2对应的单元格包括的文本所在位置。示例性的,图11中的(2)示出了包围框1包括:Box φ,Box φ。图11中的(2)示出的序列位置编码1用于指示输入序列1在表格1中的位置。
预测结构序列2用于指示不是非空单元格的HTML序列2,此时预测结构序列2不包括包围框信息。示例性的,图11中的(2)示出了预测结构序列2,“<tr>”用于指示不是非空单元格的HTML序列2。
可以理解的是,上述步骤1030的执行原理与上述步骤1020的执行原理相同,只是结构解码器1的输入数据和输出数据存在差别,此处不再详细赘述,具体可以参见上述步骤1020中的相关描述。示例性的,图11中的(2)示出了上述步骤1030的执行流程。
步骤1040,结构解码器1对图像特征1,输入序列2和包围框2进行处理,得到预测结构序列3,预测结构序列3包括用于指示非空单元格1的HTML序列3。
输入序列2包括用于指示表格1的局部结构3的HTML序列,局部结构3是表格1的全局结构中的部分结构,且局部结构3包含局部结构1和预测结构序列2。示例性的,图11中的(3)示出了输入序列2。
包围框2用于指示局部结构3的对应的单元格包括的文本所在位置。示例性的,图11中的(3)示出了包围框2包括:Box φ,Box φ,Box φ。图11中的(3)示出的序列位置编码2用于指示输入序列2在表格1中的位置。
预测结构序列3包括用于指示非空单元格1的HTML序列3,此时,预测结构序列3可以包括包围框#1,包围框#1为非空单元格1中的文本对应的包围框。示例性的,图11中的(3)示出了预测结构序列3,“<td”为用于指示非空单元格1的HTML序列3。可以理解的是,当非空单元格1中的文本形状为矩形时,包围框#1可以为该矩形。
可以理解的是,上述步骤1040的执行原理与上述步骤1020的执行原理相同,只是结构解码器1的输入数据和输出数据存在差别,此处不再详细赘述,具体可以参见上述步骤1020中的相关描述。示例性的,图11中的(3)示出了上述步骤1040的执行流程。
步骤1050,包围框解码器1对图像特征1和非空单元格1的HTML序列信息进行处理,得到包围框#1位于表格1中的位置,非空单元格1的HTML序列信息用于指示非空单元格1的HTML序列3,序列信息3包括非空单元格1的HTML序列信息,序列信息3用于指示预测结构序列3。
包围框#1位于表格1中的位置,即非空单元格1中的文本位于表格1中的位置。例如,当该包围框#1为矩形时,可以通过该矩形的左上角坐标和右下角坐标来描述该包围框#1位于表格1中的位置。示例性的,图11中的(3)示出了序列信息3,预测结构序列3,非空单元格1的HTML序列信息,以及非空单元格1的HTML序列3。
上述步骤1050中,包围框解码器1对图像特征1和非空单元格1的HTML序列信息进行处理,得到包围框#1位于表格1中的位置,包括:包围框解码基于图像特征1和非空单元格1的HTML序列信息进行j次迭代处理,得到包围框#1位于表格1中的位置,j为正整数。其中,包围框解码器1基于图像特征1和非空单元格1的HTML序列信息进行j次迭代处理,包 括:第1次迭代处理时,包围框解码器1对图像特征1和非空单元格1的HTML序列信息进行处理,得到第1次迭代处理的输出结果;第j+1次迭代处理时,包围框解码器1对图像特征1,非空单元格1的HTML序列信息,以及第j次迭代处理的输出结果进行处理,得到第j+1次迭代处理的输出结果。示例性的,图12示出了包围框解码器1基于图像特征1和非空单元格1的HTML序列信息进行4(即,j=4)次迭代处理的流程。可以理解的是,当包围框解码器1基于j次迭代后的处理结果可以确定该包围框#1位于表格1中的位置时,该包围框解码器1可以停止迭代。包围框解码器1执行每次迭代的流程与上述步骤1020中描述的结构解码器1的执行流程相同,只是解码器的输入数据和输出数据存在差别,此处不再详细赘述,具体可以参见上述步骤1020中的相关描述。
在上述步骤1040之后执行上述步骤1050,可以理解为,在结构解码器1输出的预测结构序列(例如,预测结构序列3)用于指示非空单元格的情况下,还会触发包围框解码器1基于该预测结构序列,预测该预测结构序列包括的包围框位于表格1中的位置。
步骤1060,判断包围框#1与包围框#2的匹配度是否大于或等于预设阈值。
上述步骤1060的执行主体可以是数据处理的模型。
包围框#2可以包括误差纠偏检测模型对局部结构3进行纠正得到的包围框,该误差纠偏检测模型为经过训练的人工智能AI模型。可选的,数据处理的模型还可以包括该误差纠偏检测模型。
判断包围框#1与包围框#2的匹配度是否大于或等于预设阈值,包括:在确定包围框#1与包围框#2的匹配度大于或等于预设阈值的情况下,在步骤1060之后执行步骤1070;在确定包围框#1与包围框#2的匹配度小于预设阈值的情况下,在步骤1060之后执行步骤1080。对确定包围框#1与包围框#2的匹配度的方法不作具体限定。示例性的,可以通过以下任意一种方式确定匹配度:交并比(intersection-over-union,IoU)进行匹配,或中心点的距离。例如,匹配度的确定方式为IoU时,IoU越大,匹配度越大。又如,匹配度的确定方式为中心点的距离时,距离越小,匹配度越大。对预设阈值的大小不作具体限定,可以根据实际需求设置预设阈值的大小。
步骤1070,结构解码器1将上述步骤1040中的包围框2纠正为包围框#2,并对图像特征1,输入序列2和包围框#2进行处理,得到预测结构序列4,预测结构序列4包括用于指示非空单元格1的HTML序列4。
预测结构序列4包括用于指示非空单元格1的HTML序列4,此时,预测结构序列4包括包围框#2,包围框#2为非空单元格1中的文本对应的包围框,且包围框#2与包围框#1不同。
可以理解的是,上述步骤1070的执行原理与上述步骤1020的执行原理相同,只是结构解码器1的输入数据和输出数据存在差别,此处不再详细赘述,具体可以参见上述步骤1020中的相关描述。
步骤1080,结构解码器1对图像特征1,输入序列3和包围框#1进行处理,得到预测结构序列5,预测结构序列5包括用于指示非空单元格2的HTML序列5。
输入序列3包括用于指示表格1的局部结构4的HTML序列,局部结构4是表格1的全局结构中的部分结构,且局部结构4包含局部结构1和预测结构序列3。
可以理解的是,在确定包围框#1与包围框#2的匹配度小于预设阈值的情况下,在上述步骤1060之后执行上述步骤1080。其中,确定包围框#1与包围框#2的匹配度小于预设阈值,可以理解为,上述步骤1050中得到的包围框#1是准确的。基于此,上述步骤1080中可以基于上述步骤1050得到的预测结构序列3进一步确定表格1的结构。
上述步骤1080的执行原理与上述步骤1020的执行原理相同,只是结构解码器1的输入数据和输出数据存在差别,此处不再详细赘述,具体可以参见上述步骤1020中的相关描述。
对上述步骤1070和上述步骤1080的执行顺序不作具体限定。例如,可以在上述步骤1060之后,先执行上述步骤1070,再执行上述步骤1080。又如,可以在上述步骤1060之后,先执行上述步骤1080,再执行上述步骤1070。
上述实现方式中,结构解码器1和上述包围框解码器1分别需要经过多次迭代处理,以得到表格1的全局结构,全局结构可以包括表格1的行列信息和表格1中的每个非空单元格中的文本对应的包围框。
可选的,在数据处理的模型基于上述方法获取表格1的全局结构之后,该数据处理的模型还可以执行以下步骤:根据表格1的全局结构,获取该全局结构包括的文本内容。对根据表格的全局结构,获取该全局结构包括的文本内容的方式不作具体限定。例如,可以根据全局结构中的文本包围框,截取该文本包围框对应的单元格图像,并通过光学字符识别(optical character recognition,OCR)系统,识别该单元格图像,以得到该单元格图像包括的文本内容。
可选的,本申请实施例中所指的HTML序列,可等价转换成可扩展标记语言(extensible markup language,XML)序列、拉泰赫LaTex序列。
应理解的是,上述图10示出的方法仅为示意,并不对本申请实施例提供的数据处理的方法构成任何限定。
上文结合图5至图12,详细描述了本申请实施例适用的系统架构、数据处理的模型、本申请实施例提供的数据处理的方法、以及训练数据处理的模型的方法。下面将结合图13至图15,详细描述本申请的装置的实施例。方法实施例的描述与装置实施例的描述相互对应,因此,未详细描述的部分可以参见前面方法实施例。
图13是本申请实施例提供的一种数据处理的装置1300的示意性结构图。图13所示的数据处理的装置1300可以执行上述数据处理的方法的相应步骤。如图13所示,该数据处理的装置1300包括:获取单元1310,处理单元1320和输出单元1330,
该获取单元1310,用于获取待处理的表格图像;该处理单元1320,用于根据该表格图像按照生成式表格识别策略确定表格识别结果,其中,该生成式表格识别策略用于指示利用标记语言和包围框不重叠属性确定该表格图像的表格识别结果,该包围框用于指示该表格图像所关联的表格中的单元格包括的文本所在位置,该表格识别结果用于指示该表格所包括的全局结构和内容;该输出单元1330,用于输出该表格识别结果。
可选的,在一些可能的设计中,该包围框不重叠属性用于指示该表格所包括的各个单元格所对应的区域无重叠。
可选的,在另一些可能的设计中,该处理单元1320还用于:根据该表格图像特征和该标记语言通过迭代处理获得该表格识别结果。
可选的,在另一些可能的设计中,该迭代处理包括多轮迭代,该处理单元1320还用于:根据该表格图像特征和该标记语言确定第一迭代获得的第一包围框和局部结构,该第一迭代为该多轮迭代的任意一轮迭代处理过程,该第一包围框用于指示该第一迭代所获得的该局部结构的包围框,该局部结构为该全局结构的部分结构;当第二迭代获得该全局结构时,确定该第二迭代获得的处理结果为该表格识别结果,该第二迭代是该迭代处理中在该第一迭代处理后执行的一次迭代处理,该处理结果包括该全局结构和该内容。
可选的,在另一些可能的设计中,该处理单元1320还用于:对该第一迭代获得的该第一 包围框进行纠正。
可选的,在另一些可能的设计中,该处理单元1320还用于:根据输入参数和该表格图像对该第一包围框进行纠正。
可选的,在另一些可能的设计中,该处理单元1320还用于:在第二包围框与该第一包围框的匹配度大于或等于预设阈值的情况下,根据该第二包围框对该第一包围框进行纠正,该第二包围框为误差纠偏检测模型对该局部结构进行处理得到的,该误差纠偏检测模型为经过训练的人工智能AI模型。
可选的,在另一些可能的设计中,该处理单元1320还用于:根据该表格图像对该表格识别结果进行纠正,并输出纠正后的表格识别结果。
可选的,在另一些可能的设计中,该处理单元1320还用于:对该表格图像进行特征提取,获得该表格图像特征。
可选的,在另一些可能的设计中,采用以下任意一种标记语言标识该表格识别结果:超文本标记语言HTML,可扩展标记语言XML,或者拉泰赫LaTex。
应理解的是,本申请实施例的装置1300可以通过中央处理单元(central processing unit,CPU)实现,也可以通过专用集成电路(application-specific integrated circuit,ASIC)实现,或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。当通过软件实现上文中的数据处理的方法时,装置1300及其各个单元模块也可以为软件模块。
图14是本申请实施例提供的一种训练装置1400的示意性结构图。图14所示的训练装置1400可以执行上述数据处理的模型训练方法的相应步骤。如图14所示,该训练装置1400包括:获取单元1410和处理单元1420。
其中,获取单元1410用于执行上述步骤910。处理单元1420用于执行上述步骤920,上述步骤930,和上述步骤940。
可选的,该训练装置1400还可以包括输出单元,该输出单元1430用于输出上述步骤940得到的训练完成的目标模型。
应理解的是,本申请实施例的装置1400可以通过中央处理单元(central processing unit,CPU)实现,也可以通过专用集成电路(application-specific integrated circuit,ASIC)、人工智能芯片、片上系统(system on chip,SoC)、加速卡或可编程逻辑器件(programmable logic device,PLD)实现,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD),现场可编程门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。当通过软件实现上述数据处理的模型训练方法时,装置1400及其各个单元模块也可以为软件模块。
需要说明的是,上述装置1300和装置1400均以功能单元的形式体现。这里的术语“单元”可以通过软件和/或硬件形式实现,对此不作具体限定。
例如,“单元”可以是实现上述功能的软件程序、硬件电路或二者结合。所述硬件电路可能包括应用特有集成电路(application specific integrated circuit,ASIC)、电子电路、用于执行一个或多个软件或固件程序的处理器(例如共享处理器、专有处理器或组处理器等)和存储器、合并逻辑电路和/或其它支持所描述的功能的合适组件。
因此,在本申请的实施例中描述的各示例的单元,能够以电子硬件、或者计算机软件和 电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
图15是本申请实施例提供的一种计算设备1500的结构示意图。计算设备1500可以是服务器、边缘服务器、计算机工作站、或者其他具有计算能力的设备。该计算设备1500可以包括:至少一个处理器1501、内存单元1502、通信接口1503和存储介质1504,参见图15中的实线所示。可选的,该计算设备1500还可以包括输出设备1506和输入设备1507,参见图15中的虚线所示。其中,处理器1501、内存单元1502、通信接口1503、存储介质1504、输出设备1506、输入设备1507通过总线1505进行通信,也可以通过无线传输等其他手段实现通信。
在一些实现方式中,计算设备1500可以用于实现上述数据处理的装置1300的相同或相似的功能。此时,内存单元1502用于存储计算机指令15022,处理器1501可以调用该内存单元1502中存储的计算机指令15022以执行上述方法实施例中由数据处理的模型执行的方法的步骤。
在另一些实现方式中,计算设备1500的功能与上述训练装置1400的功能相同或相似。此时,内存单元1502用于存储计算机指令15022,处理器1501可以调用该内存单元1502中存储的计算机指令15022以执行上述训练方法的步骤。
应理解,在本申请实施例中,处理器1501可以包括至少一个CPU 15011。可选的,处理器1501还可以是其他通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现场可编程门阵列(FPGA)、人工智能AI芯片、片上系统SoC或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。可选地,处理器1501中包括两种或多种不同类型的处理器。例如,处理器1501包括CPU 15011和通用处理器、数字信号处理器DSP、专用集成电路ASIC、现场可编程门阵列FPGA、人工智能AI芯片、片上系统SoC或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件中至少一种。
内存单元1502可以包括只读存储器和随机存取存储器,并向处理器1501提供指令和数据。内存单元1502还可以包括非易失性随机存取存储器。例如,内存单元1502还可以包括存储设备类型的信息。
内存单元1502可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data date SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
通信接口1503使用例如但不限于收发器一类的收发装置,来实现计算设备1500与其他设备或通信网络之间的通信。例如,当该处理器1501可以调用该内存单元1502中存储的计算机指令15022以执行上述方法实施例中由数据处理的模型执行的方法的步骤时,可以通过 通信接口1503获取表格图像或者数据处理的模型。又如,当该处理器1501可以调用该内存单元1502中存储的计算机指令15022以执行上述训练方法的步骤时,可以通过通信接口1603获取训练数据集。
存储介质1504具有存储的功能。存储介质1504可以用于暂时存放处理器1501中的运算数据,以及与外部存储器交换的数据。存储介质1504可以但不限于是硬盘(hard disk drive,HDD)。
总线1505除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图15中将各种总线都标为总线1505。
总线1505可以是快捷外围部件互连标准(Peripheral Component Interconnect Express,PCIe)总线,或扩展工业标准结构(extended industry standard architecture,EISA)总线、统一总线(unified bus,Ubus或UB)、计算机快速链接(compute express link,CXL)、缓存一致互联协议(cache coherent interconnect for accelerators,CCIX)等。总线1505可以分为地址总线、数据总线、控制总线等。
输出设备1506可以是以下任意一种:键盘、写字板、麦克风、音响、或者显示器。
输入设备1507可以是以下任意一种:键盘、鼠标、摄像头、扫描仪、手写输入板、或者语音输入装置。
可选的,在另一些实现方式中,还可以基于多个计算设备1500实现上述方法实施例中由数据处理的模型执行的方法的步骤,或者基于多个计算设备1500实现上述训练方法的步骤。其中,该多个计算设备1500可以是计算机集群中包括的装置。示例性的,以计算机集群包括两个计算设备1500,该两个计算设备1500用于执行上述方法实施例中由数据处理的模型执行的方法为例进行介绍。为便于描述,将该两个计算设备1500分别记为装置#1和装置#2,装置#1和装置#2的结构可以参见图15所示。装置#1和装置#2可以但不限于通过以太网进行通信,以实现数据的传输。具体实现时,装置#1包括的内存单元通过以太网可以向该装置#2包括的处理器提供指令和数据,使得该装置#2包括的处理器通过以太网可以调用该装置#1包括的内存单元中存储的计算机指令,以执行上述方法实施例中由数据处理的模型执行的方法的步骤。
以上列举的计算设备1500的结构仅为示例性说明,本申请并未限定于此,本申请实施例的计算设备1500包括现有技术中计算机系统中的各种硬件。本领域的技术人员应当理解,计算设备1500还可以包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,上述计算设备1500还可包括实现其他附加功能的硬件器件。
本申请还提供一种数据处理的系统,该系统包括多个如图15所示的计算设备1500,多个计算设备1500以集群形式构成数据处理系统,该系统用于实现上述数据处理方法中操作步骤的功能。
本申请实施例提供了一种计算机程序产品,提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,当该计算机程序代码在计算机上运行时,使得计算机执行上述数据处理的方法。
本申请实施例提供了一种计算机程序产品,提供了一种计算机程序产品,该计算机程序产品包括:计算机程序代码,当该计算机程序代码在计算机上运行时,使得计算机执行上述数据处理的模型训练方法。
本申请实施例提供了一种计算机可读存储介质,用于存储计算机程序,该计算机程序包括用于执行上述方法实施例中的方法。
本申请实施例提供了一种芯片系统,包括至少一个处理器和接口;所述至少一个所述处理器,用于调用并运行计算机程序,以使所述芯片系统执行上述方法实施例中的方法。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令或计算机程序。在计算机上加载或执行所述计算机指令或计算机程序时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (23)

  1. 一种数据处理的方法,其特征在于,所述方法包括:
    获取待处理的表格图像;
    根据所述表格图像按照生成式表格识别策略确定表格识别结果,其中,所述生成式表格识别策略用于指示利用标记语言和包围框不重叠属性确定所述表格图像的表格识别结果,所述包围框用于指示所述表格图像所关联的表格中的单元格包括的文本所在位置,所述表格识别结果用于指示所述表格所包括的全局结构和内容;
    输出所述表格识别结果。
  2. 根据权利要求1所述的方法,其特征在于,所述包围框不重叠属性用于指示所述表格所包括的各个单元格所对应的区域无重叠。
  3. 根据权利要求1所述的方法,其特征在于,所述根据所述表格图像按照生成式表格识别策略确定表格识别结果,包括:
    根据所述表格图像特征和所述标记语言通过迭代处理获得所述表格识别结果。
  4. 根据权利要求3所述的方法,其特征在于,所述迭代处理包括多轮迭代,所述方法还包括:
    根据所述表格图像特征和所述标记语言确定第一迭代获得的第一包围框和局部结构,所述第一迭代为所述多轮迭代的任意一轮迭代处理过程,所述第一包围框用于指示所述第一迭代所获得的所述局部结构的包围框,所述局部结构为所述全局结构的部分结构;
    当第二迭代获得所述全局结构时,确定所述第二迭代获得的处理结果为所述表格识别结果,所述第二迭代是所述迭代处理中在所述第一迭代处理后执行的一次迭代处理,所述处理结果包括所述全局结构和所述内容。
  5. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    对所述第一迭代获得的所述第一包围框进行纠正。
  6. 根据权利要求5所述的方法,其特征在于,所述对所述第一迭代获得的所述第一包围框进行纠正,包括:
    根据输入参数和所述表格图像对所述第一包围框进行纠正。
  7. 根据权利要求5所述的方法,其特征在于,所述对所述第一迭代获得的所述第一包围框进行纠正,包括:
    在第二包围框与所述第一包围框的匹配度大于或等于预设阈值的情况下,根据所述第二包围框对所述第一包围框进行纠正,所述第二包围框为误差纠偏检测模型对所述局部结构进行处理得到的,所述误差纠偏检测模型为经过训练的人工智能AI模型。
  8. 根据权利要求1至7任一项所述的方法,其特征在于,所述方法还包括:
    根据所述表格图像对所述表格识别结果进行纠正,并输出纠正后的表格识别结果。
  9. 根据权利要求1至8任一项所述的方法,其特征在于,所述方法还包括:
    对所述表格图像进行特征提取,获得所述表格图像特征。
  10. 根据权利要求1至9任一项所述的方法,其特征在于,
    采用以下任意一种标记语言标识所述表格识别结果:超文本标记语言HTML,可扩展标记语言XML,或者拉泰赫LaTex。
  11. 一种数据处理的装置,其特征在于,所述装置包括获取单元,处理单元和输出单元,
    所述获取单元,用于获取待处理的表格图像;
    所述处理单元,用于根据所述表格图像按照生成式表格识别策略确定表格识别结果,其 中,所述生成式表格识别策略用于指示利用标记语言和包围框不重叠属性确定所述表格图像的表格识别结果,所述包围框用于指示所述表格图像所关联的表格中的单元格包括的文本所在位置,所述表格识别结果用于指示所述表格所包括的全局结构和内容;
    所述输出单元,用于输出所述表格识别结果。
  12. 根据权利要求11所述的装置,其特征在于,
    所述包围框不重叠属性用于指示所述表格所包括的各个单元格所对应的区域无重叠。
  13. 根据权利要求11所述的装置,其特征在于,所述处理单元还用于:
    根据所述表格图像特征和所述标记语言通过迭代处理获得所述表格识别结果。
  14. 根据权利要求13所述的装置,其特征在于,所述迭代处理包括多轮迭代,所述处理单元还用于:
    根据所述表格图像特征和所述标记语言确定第一迭代获得的第一包围框和局部结构,所述第一迭代为所述多轮迭代的任意一轮迭代处理过程,所述第一包围框用于指示所述第一迭代所获得的所述局部结构的包围框,所述局部结构为所述全局结构的部分结构;
    当第二迭代获得所述全局结构时,确定所述第二迭代获得的处理结果为所述表格识别结果,所述第二迭代是所述迭代处理中在所述第一迭代处理后执行的一次迭代处理,所述处理结果包括所述全局结构和所述内容。
  15. 根据权利要求14所述的装置,其特征在于,所述处理单元还用于:
    对所述第一迭代获得的所述第一包围框进行纠正。
  16. 根据权利要求15所述的装置,其特征在于,所述处理单元还用于:
    根据输入参数和所述表格图像对所述第一包围框进行纠正。
  17. 根据权利要求15所述的装置,其特征在于,所述处理单元还用于:
    在第二包围框与所述第一包围框的匹配度大于或等于预设阈值的情况下,根据所述第二包围框对所述第一包围框进行纠正,所述第二包围框为误差纠偏检测模型对所述局部结构进行处理得到的,所述误差纠偏检测模型为经过训练的人工智能AI模型。
  18. 根据权利要求11至17任一项所述的装置,其特征在于,所述处理单元还用于:
    根据所述表格图像对所述表格识别结果进行纠正,并输出纠正后的表格识别结果。
  19. 根据权利要求11至18任一项所述的装置,其特征在于,所述处理单元还用于:
    对所述表格图像进行特征提取,获得所述表格图像特征。
  20. 根据权利要求11至19任一项所述的装置,其特征在于,
    采用以下任意一种标记语言标识所述表格识别结果:超文本标记语言HTML,可扩展标记语言XML,或者拉泰赫LaTex。
  21. 一种数据处理芯片,其特征在于,所述芯片包括逻辑电路,所述逻辑电路执行如权利要求1至10任意一项所述的方法。
  22. 一种数据处理的系统,其特征在于,所述系统包括处理器,所述处理器用于执行如权利要求1至10任意一项所述的方法。
  23. 一种计算机可读存储介质,其特征在于,包括计算机可读指令,当所述计算机可读指令在计算机设备上运行时,使得所述计算机设备执行权利要求1至10任意一项所述的方法。
PCT/CN2022/142667 2022-01-12 2022-12-28 数据处理的方法和相关设备 WO2023134447A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP22920077.9A EP4350646A1 (en) 2022-01-12 2022-12-28 Data processing method and related device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210029776 2022-01-12
CN202210029776.0 2022-01-12
CN202210168027.6A CN116486422A (zh) 2022-01-12 2022-02-23 数据处理的方法和相关设备
CN202210168027.6 2022-02-23

Publications (2)

Publication Number Publication Date
WO2023134447A1 true WO2023134447A1 (zh) 2023-07-20
WO2023134447A9 WO2023134447A9 (zh) 2023-11-30

Family

ID=87221956

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142667 WO2023134447A1 (zh) 2022-01-12 2022-12-28 数据处理的方法和相关设备

Country Status (3)

Country Link
EP (1) EP4350646A1 (zh)
CN (1) CN116486422A (zh)
WO (1) WO2023134447A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665063A (zh) * 2023-07-27 2023-08-29 南京信息工程大学 基于自注意力和深度卷积并行的高光谱重建方法
CN116665063B (zh) * 2023-07-27 2023-11-03 南京信息工程大学 基于自注意力和深度卷积并行的高光谱重建方法
CN116976862A (zh) * 2023-09-20 2023-10-31 山东国研自动化有限公司 一种工厂设备信息化管理系统及方法
CN116976862B (zh) * 2023-09-20 2024-01-02 山东国研自动化有限公司 一种工厂设备信息化管理系统及方法
CN117973337A (zh) * 2024-01-24 2024-05-03 中国科学院自动化研究所 表格重建方法、装置、电子设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080028291A1 (en) * 2006-07-26 2008-01-31 Xerox Corporation Graphical syntax analysis of tables through tree rewriting
CN113033165A (zh) * 2019-12-24 2021-06-25 腾讯科技(深圳)有限公司 电子表格文件解析方法、装置和计算机可读存储介质
CN113536856A (zh) * 2020-04-20 2021-10-22 阿里巴巴集团控股有限公司 图像识别方法和系统、数据处理方法
CN112528863A (zh) * 2020-12-14 2021-03-19 中国平安人寿保险股份有限公司 表格结构的识别方法、装置、电子设备及存储介质
CN113297975A (zh) * 2021-05-25 2021-08-24 新东方教育科技集团有限公司 表格结构识别的方法、装置、存储介质及电子设备
CN113807326A (zh) * 2021-11-17 2021-12-17 航天宏康智能科技(北京)有限公司 制式表格文字识别方法和装置

Also Published As

Publication number Publication date
CN116486422A (zh) 2023-07-25
EP4350646A1 (en) 2024-04-10
WO2023134447A9 (zh) 2023-11-30


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22920077; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2022920077; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2022920077; Country of ref document: EP; Effective date: 20240102)