WO2023024614A1 - Method and apparatus for document classification, electronic device and storage medium - Google Patents

Method and apparatus for document classification, electronic device and storage medium

Info

Publication number
WO2023024614A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
document
line
fusion
Prior art date
Application number
PCT/CN2022/094788
Other languages
English (en)
Chinese (zh)
Inventor
李煜林
庾悦晨
钦夏孟
章成全
姚锟
韩钧宇
刘经拓
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2023024614A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Definitions

  • The present disclosure relates to the technical field of artificial intelligence, and in particular to the technical fields of computer vision and deep learning.
  • Documents are an important information carrier and are widely used in various business and office scenarios. In an automated office or intake system, classifying different documents is one of the most critical processes.
  • The present disclosure provides a document classification method and apparatus, an electronic device and a storage medium.
  • According to one aspect, a document classification method includes: acquiring text information and image information of the text contained in a document to be processed; performing fusion based on the text information and the image information to obtain a fusion feature; acquiring a feature sequence of the text according to the fusion feature; and determining the category of the document to be processed based on a predefined document category and the feature sequence.
  • In some embodiments, the text includes at least one line of text content, and acquiring the text information of the text contained in the document to be processed includes: acquiring the at least one line of text content and position information of the at least one line of text content.
  • Performing fusion based on the text information and the image information to obtain the fusion feature includes: adding the image information to the position information of the at least one line of text content, and then concatenating the result with the at least one line of text content to obtain the fusion feature.
  • Obtaining the feature sequence of the text according to the fusion feature includes: taking the arithmetic mean of the first single-character features of the at least one line of text content in the fusion feature, and multiplying the result of the arithmetic mean by the first position feature of the at least one line of text content to obtain the feature sequence of the text.
  • Obtaining the feature sequence of the text according to the fusion feature also includes: inputting the fusion feature into a stacked self-attention network to obtain an enhanced fusion feature, where the initial weight of the self-attention network is the fusion feature.
  • The self-attention network can be represented as follows (the formula is reconstructed from the layer description given later in this disclosure, with the scaling by √d inferred from the feature dimension d): H_l = σ((W_l1 H_{l-1}) (W_l2 H_{l-1})^T / √d) H_{l-1}, with H_0 = V.
  • W_l* represents the learnable parameter matrices of fully connected layers whose parameters are not shared, where * is a positive integer; d represents the feature dimension; H_l represents the output of the l-th layer of the self-attention network; V represents the fusion feature; σ represents the normalization function.
  • Obtaining the feature sequence of the text includes: taking the arithmetic mean of the second single-character features of the at least one line of text content composed of the feature H_l output by the self-attention network, and multiplying the result of the arithmetic mean by the second position feature of that line of text content to obtain the feature sequence of the text.
  • The fused feature can be represented as follows (reconstructed from the fusion step described above): V = [T ; F + S], i.e. the image information F is added to the position code S, and the sum is concatenated with the text encoding T.
  • T is the vector obtained by encoding the single characters of the at least one line of text content;
  • F is the vector obtained by extracting, with a region-of-interest pooling algorithm, the image information of the at least one line of text content from the whole image;
  • S is the vector obtained by encoding the position of the at least one line of text content; the vector dimensions of T, F and S are the same.
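  • By way of illustration only, the following sketch shows how such a fusion feature V = [T ; F + S] could be assembled. It is a minimal sketch in PyTorch, not the reference implementation of this disclosure; the vocabulary size, the use of 8 scalars for the four corner points, and all module names are assumptions.

```python
# Minimal sketch (assumptions noted above): build the fusion feature V by
# adding the per-line image feature F to the position code S and concatenating
# the sum with the encoded text content T along the sequence axis.
import torch
import torch.nn as nn

d = 768                                      # shared feature dimension of T, F and S
n_lines, chars_per_line = 2, 5
char_embed = nn.Embedding(8000, d)           # encodes each single character t_i (T)
coord_embed = nn.Linear(8, d)                # 4 corner points = 8 scalars -> position code S

char_ids = torch.randint(0, 8000, (n_lines * chars_per_line,))
T = char_embed(char_ids)                     # (num_chars, d)
line_corners = torch.rand(n_lines, 8)        # normalized corner coordinates per line
S = coord_embed(line_corners)                # (n_lines, d)
F = torch.randn(n_lines, d)                  # per-line ROI-pooled image features
V = torch.cat([T, F + S], dim=0)             # fusion feature, length num_chars + n_lines
```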
  • Acquiring the text information and the image information of the text contained in the document to be processed includes: using a neural network to extract the image information of the document to be processed.
  • Determining the category of the document to be processed includes:
  • using a classifier function to obtain, for the predefined document categories, the probabilities of the feature sequence of the text; and
  • taking the predefined document category with the highest probability value among the probabilities as the category of the document.
  • According to another aspect, a document classification apparatus includes:
  • an acquisition module configured to acquire text information and image information of the text contained in a document to be processed;
  • a fusion feature module configured to perform fusion based on the text information and the image information to obtain a fusion feature;
  • a feature sequence acquisition module configured to acquire a feature sequence of the text according to the fusion feature; and
  • a classification module configured to determine the category of the document to be processed based on a predefined document category and the feature sequence.
  • A method for training a document classification model includes: predefining categories of test documents and the correct probability values for documents of each category; acquiring text information and image information of the text contained in a test document; processing the text information and the image information to obtain a feature sequence of the text; determining the predicted category of the test document and the predicted probability distribution over the categories according to the feature sequence and the predefined test document categories; and
  • adjusting the parameters of the document classification model based on the correct probability values and the predicted probability distribution, and obtaining a target document classification model when preset conditions are met.
  • An electronic device includes: at least one processor, and a memory communicatively connected to the at least one processor;
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method in any one of the above technical solutions.
  • A non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to execute the method in any one of the above technical solutions.
  • A computer program product includes a computer program which, when executed by a processor, implements the method in any one of the above technical solutions.
  • The technical solution provided in the present disclosure proposes a document classification method based on multimodal feature fusion. The method takes text content, text image blocks and text coordinates as input information to enhance the semantic expression of document features.
  • FIG. 1 shows a schematic flowchart of a method for document classification provided by an embodiment of the present disclosure.
  • FIG. 2 shows a schematic flowchart of an optical character recognition method provided by an embodiment of the present disclosure.
  • FIG. 3 shows a schematic flowchart of determining a document category according to a text feature sequence and a predefined document category, provided by an embodiment of the present disclosure.
  • FIG. 4 shows a schematic diagram of a document classification apparatus provided by an embodiment of the present disclosure.
  • FIG. 5 shows a schematic flowchart of a method for training a document classification model provided by an embodiment of the present disclosure.
  • FIG. 6 is a block diagram of an electronic device used to implement the document classification method of an embodiment of the present disclosure.
  • Region-of-interest pooling (ROI Pooling): a pooling layer is sandwiched between consecutive convolutional layers to compress the amount of data and parameters and to reduce overfitting; when the input is an image, the main function of the pooling layer is to compress the image.
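  • As a hedged illustration of region-of-interest pooling (not the reference implementation of this disclosure), the following sketch uses torchvision's roi_pool to cut a fixed-size feature out of a convolutional feature map for each text-line box; the backbone feature map, the box coordinates and the 768-dimensional projection are assumptions.

```python
# Minimal ROI-pooling sketch: one fixed-size pooled feature per text-line box.
import torch
import torchvision

feature_map = torch.randn(1, 256, 64, 64)        # CNN feature map of one document image
# One box per text line, format (batch_index, x1, y1, x2, y2) in feature-map coordinates.
line_boxes = torch.tensor([[0.0, 4.0, 10.0, 60.0, 14.0],
                           [0.0, 4.0, 20.0, 40.0, 24.0]])
pooled = torchvision.ops.roi_pool(feature_map, line_boxes, output_size=(1, 4))
proj = torch.nn.Linear(256 * 1 * 4, 768)         # project to the shared 768-d space (F)
F = proj(pooled.flatten(start_dim=1))            # shape: (num_lines, 768)
```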
  • OCR: Optical Character Recognition.
  • Existing approaches to document classification include the following. Manual approach: the uploader fills in a report or an auditor classifies the document by hand, which is time-consuming, laborious and inefficient. Image classification: the document is classified by its visual information. Text classification: classification is based on the acquired document text content. Classification based on image and text content: classification results are obtained from the image and the text separately, and the final result is given by voting or predefined rules.
  • FIG. 1 shows a schematic flowchart of a method for document classification provided by an embodiment of the present disclosure. As shown in FIG. 1, the method may mainly include the following steps:
  • S101: Acquire the text information and image information of the text contained in the document to be processed. The image of the document to be classified can be captured with a camera, for example with a mobile phone, a camera, a tablet computer or a scanner.
  • The image must contain text information; otherwise, the document does not belong to the documents classified in this disclosure.
  • The text and text information in the image are obtained through a text recognition algorithm, which is one of the bases for document classification; the image information includes the color features, texture features, shape features and spatial-relationship features of the image.
  • S102: Perform fusion based on the text information and the image information to obtain a fusion feature.
  • S103: Acquire the feature sequence of the text according to the fusion feature.
  • S104: Determine the category of the document to be processed based on the predefined document category and the feature sequence.
  • The text information and the image information are processed, and what is obtained through fusion may be a feature sequence of row text, column text, or text in other arrangements.
  • The feature sequence of the text is processed, and the category of the document is determined in combination with the predefined document categories.
  • The predefined document categories are, for example, documents such as VAT invoices, taxi tickets, toll receipts, train tickets and itineraries.
  • The technical solution of the present disclosure can be adopted to classify invoices of the same type; it can also classify other types of documents, such as case sheets, prescription lists, medical record pages and inspection reports in hospital scenarios.
  • In some embodiments, the text includes at least one line of text content; acquiring the text information of the text contained in the document to be processed includes: acquiring the at least one line of text content and position information of the at least one line of text content.
  • Acquiring the at least one line of text content and the position information of the at least one line of text content includes: acquiring the at least one line of text by an optical character recognition method.
  • Optical character recognition (OCR) can be used to obtain the at least one line of text.
  • FIG. 2 shows a schematic flowchart of an optical character recognition method provided by an embodiment of the present disclosure. As shown in FIG. 2, the method may mainly include the following steps:
  • S201: Text detection algorithm, used to obtain the position information of the at least one line of text content in the document to be processed.
  • S202: Text recognition algorithm, used to obtain the at least one line of text content.
  • The so-called text includes the position of the text content, the content of the text, and the horizontal, vertical, oblique or other arrangement of the text.
  • The technical solution can recognize row text, column text, or text arranged in other ways, so that the scope of application of the present disclosure is wider.
  • The text detection algorithm includes, for example, the EAST algorithm, which is prior art and will not be described in detail here.
  • The text recognition algorithm includes, for example, the CTC algorithm, which is also prior art and will not be described in detail here.
  • Once the OCR has recognized the position and content of each line of text, the text information of the text in the document to be processed has been obtained.
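  • By way of illustration only, the per-line OCR output that the subsequent fusion step consumes could be structured as below; run_east and run_ctc_recognizer are hypothetical stand-ins for an EAST text detector (S201) and a CTC-based recognizer (S202), not real library calls.

```python
# Hedged sketch of the per-line OCR result: line content plus the four corner
# coordinates (upper-left, upper-right, lower-left, lower-right) of each line.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextLine:
    content: str                          # recognized characters of the line
    corners: List[Tuple[float, float]]    # the four corner points of the line box

def ocr_document(image) -> List[TextLine]:
    boxes = run_east(image)               # hypothetical detection step (S201)
    return [TextLine(run_ctc_recognizer(image, box), box)   # recognition step (S202)
            for box in boxes]
```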
  • Performing fusion based on the text information and the image information to obtain the fusion feature includes: adding the image information to the position information of the at least one line of text content, and then concatenating the result with the at least one line of text content to obtain the fusion feature.
  • Acquiring the feature sequence of the text according to the fusion feature includes: taking the arithmetic mean of the first single-character features of the at least one line of text content in the fusion feature, and multiplying the result of the arithmetic mean by the first position feature of the at least one line of text to obtain the feature sequence of the text.
  • The first single-character feature encodes each character t_i of the at least one line of text content into a 768-dimensional vector.
  • The dimension can also be another number.
  • The first position feature of the position information of the at least one line of text content comprises the 768-dimensional image information of a line of text extracted from the whole document with a pooling algorithm; this dimension can also be another number, but it must be consistent with the dimension of the vector encoding the single character t_i. The first position feature also includes the 4-dimensional spatial coordinates of the at least one line of text, likewise encoded as a 768-dimensional vector; this dimension can also be another number, but it must be consistent with the dimension of the vector encoding t_i.
  • The 4-dimensional spatial coordinates of the at least one line of text are the upper-left, upper-right, lower-left and lower-right coordinates of each piece of text.
  • On this basis, the document can be classified, and the category of the document can be determined according to the feature sequence of the text and the predefined document categories.
  • Acquiring the feature sequence of the text according to the fusion feature includes: inputting the fusion feature into the stacked self-attention network to obtain an enhanced fusion feature, where the initial weight of the self-attention network is the fusion feature.
  • The self-attention network can be represented as follows (formula reconstructed from the layer description below; the scaling by √d is inferred from the feature dimension): H_l = σ((W_l1 H_{l-1}) (W_l2 H_{l-1})^T / √d) H_{l-1}, with H_0 = V.
  • W_l* represents the learnable parameter matrices of fully connected layers whose parameters are not shared, where * is a positive integer.
  • d represents the feature dimension, which is 768 in the above technical solution.
  • H_l represents the output of the l-th layer of the self-attention network; V represents the fusion feature.
  • σ represents the normalization function; in this embodiment, the normalization function is the sigmoid function. Taking the fused features as the initial weights, H_l is stacked layer by layer.
  • The self-attention network first applies two fully connected layers (W_l1 and W_l2) to the input feature H_{l-1}, multiplies the two resulting feature matrices, and normalizes the product with the sigmoid function σ to obtain a weight matrix; the weight matrix is then multiplied by H_{l-1} to obtain the new feature H_l, which is output as the l-th layer.
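  • A minimal sketch of one such stacked layer is given below (PyTorch); it follows the description above (two non-shared fully connected layers, a sigmoid-normalized weight matrix, multiplication with the input), while the number of layers and the scaling by √d are assumptions rather than values fixed by this disclosure.

```python
# Minimal sketch of the stacked self-attention network described above.
import math
import torch
import torch.nn as nn

class StackedSelfAttention(nn.Module):
    def __init__(self, d: int = 768, num_layers: int = 4):   # num_layers is an assumption
        super().__init__()
        self.w1 = nn.ModuleList([nn.Linear(d, d) for _ in range(num_layers)])
        self.w2 = nn.ModuleList([nn.Linear(d, d) for _ in range(num_layers)])
        self.d = d

    def forward(self, V: torch.Tensor) -> torch.Tensor:
        H = V                                     # the fusion feature is the initial input
        for w1, w2 in zip(self.w1, self.w2):
            # weight matrix: sigmoid-normalized product of the two projections
            weights = torch.sigmoid(w1(H) @ w2(H).T / math.sqrt(self.d))
            H = weights @ H                       # output of layer l
        return H                                  # enhanced fusion feature
```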
  • Obtaining the feature sequence of the text includes: taking the arithmetic mean of the second single-character features of the at least one line of text content composed of the feature H_l output by the self-attention network, and multiplying the result of the arithmetic mean by the second position feature of that line of text content to obtain the feature sequence of the text.
  • The second single-character feature encodes each character t_i of the at least one line of text content into a 768-dimensional vector; the corresponding encoded feature output by the deep self-attention network is denoted by x.
  • The second position feature is the 768-dimensional image information of the at least one line of text extracted from the document with a pooling algorithm; the corresponding encoded feature output by the deep self-attention network is denoted by y, where y corresponds to the encoded features of the image information F and the position code S of a line of text content.
  • The 768 dimensions are one implementation of this embodiment; other dimensions may also be used, but the dimensions must be consistent before and after encoding.
  • The encoded H is expressed as: H = (x_{1,1}, x_{1,2}, …, x_{1,k1}, x_{2,1}, x_{2,2}, …, x_{2,k2}, …, x_{n,1}, …, x_{n,kn}, y_1, …, y_n).
  • The specific implementation is as follows: for all the second single-character features x_r of a line of text content (for example, the r-th row, the r-th column, or the r-th position of another arrangement), the arithmetic mean of these second single-character features is taken, and the Hadamard product of the result with the second position feature y_r gives the feature sequence of that line of text content: m_r = mean(x_r) ⊙ y_r, with M = {m_r ; r ∈ [1, N]}.
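  • The per-line averaging and Hadamard product can be sketched as follows (a hedged illustration; the layout of H, with all character tokens first and the n line tokens last, follows the expression for H given above):

```python
# Minimal sketch: m_r = mean(x_r) ⊙ y_r for each line r, and M = {m_r}.
import torch

def text_feature_sequence(H: torch.Tensor, chars_per_line: list) -> torch.Tensor:
    n = len(chars_per_line)                   # number of text lines
    x, y = H[:-n], H[-n:]                     # character tokens x, then n line tokens y
    feats, start = [], 0
    for r, k in enumerate(chars_per_line):
        x_r = x[start:start + k]              # second single-character features of line r
        feats.append(x_r.mean(dim=0) * y[r])  # arithmetic mean, then Hadamard product
        start += k
    return torch.stack(feats)                 # M, shape (n, d)
```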
  • T is the vector obtained by encoding the single characters of the at least one line of text content;
  • F is the vector obtained by extracting, with the region-of-interest pooling algorithm, the image information of the at least one line of text content from the whole image;
  • S is the vector obtained by encoding the position of the at least one line of text content.
  • Acquiring the text information and the image information of the text contained in the document to be processed includes: using a neural network to extract the image information of the document to be processed.
  • The neural network includes, for example, a convolutional neural network.
  • FIG. 3 shows a schematic flowchart of determining a document category according to a text feature sequence and a predefined document category, provided by an embodiment of the present disclosure. As shown in FIG. 3, the method may mainly include the following steps:
  • S301: Predefine document categories.
  • S302: Use a classifier function to obtain the probabilities of the feature sequence of the text over the predefined document categories; the classifier function includes, for example, the softmax function.
  • M' = mean(M): all elements m of the text feature sequence M are averaged; a fully connected layer then maps M' to a vector whose size equals the number of predefined categories, and the softmax function maps this vector to a probability distribution, expressed as follows: scores = softmax(fc(M')).
  • scores is the mapped probability distribution value; fc is the fully connected layer.
  • S303: Take the predefined document category with the highest probability value among the probabilities as the category of the document: cls = argmax(scores).
  • cls is the classification category of the document.
  • argmax is the function that takes the index of the maximum value.
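  • A hedged sketch of this classification head (mean over M, fully connected layer, softmax, argmax) is given below; the category names and dimensions are illustrative assumptions.

```python
# Minimal sketch of S302/S303: scores = softmax(fc(mean(M))), cls = argmax(scores).
import torch
import torch.nn as nn

categories = ["form", "contract", "bill", "certificate"]   # illustrative predefined set
fc = nn.Linear(768, len(categories))

M = torch.randn(5, 768)                            # feature sequence of a 5-line document
scores = torch.softmax(fc(M.mean(dim=0)), dim=-1)  # probability over the categories
cls = categories[scores.argmax().item()]           # category with the highest probability
```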
  • FIG. 4 shows a schematic diagram of a document classification apparatus provided by an embodiment of the present disclosure.
  • The document classification apparatus 400 includes an acquisition module 401, a fusion feature module 402, a feature sequence acquisition module 403 and a classification module 404.
  • The text includes at least one line of text content; when acquiring the text information of the text contained in the document to be processed, the acquisition module 401 is further configured to: acquire the at least one line of text content and position information of the at least one line of text content.
  • When used to perform fusion based on the text information and the image information to obtain the fusion feature, the fusion feature module 402 is further configured to:
  • add the image information to the position information of the at least one line of text content, and then concatenate the result with the at least one line of text content to obtain the fusion feature.
  • When used to acquire the feature sequence of the text according to the fusion feature, the feature sequence acquisition module 403 is further configured to: take the arithmetic mean of the first single-character features of the at least one line of text content in the fusion feature, and multiply the result of the arithmetic mean by the first position feature of the at least one line of text content to obtain the feature sequence of the text.
  • When used to acquire the feature sequence of the text according to the fusion feature, the feature sequence acquisition module 403 is further configured to: input the fusion feature into the stacked self-attention network to obtain an enhanced fusion feature; the initial weight of the self-attention network is the fusion feature.
  • The self-attention network can be represented as above: H_l = σ((W_l1 H_{l-1}) (W_l2 H_{l-1})^T / √d) H_{l-1}, with H_0 = V.
  • W_l* represents the learnable parameter matrices of fully connected layers whose parameters are not shared, where * is a positive integer; d represents the feature dimension; H_l represents the output of the l-th layer of the self-attention network; V represents the fusion feature; σ represents the normalization function.
  • When used to acquire the feature sequence of the text, the feature sequence acquisition module 403 is further configured to: take the arithmetic mean of the second single-character features of the at least one line of text content composed of the feature H_l output by the self-attention network, and multiply the result of the arithmetic mean by the second position feature of that line of text content to obtain the feature sequence of the text.
  • The fusion feature can be expressed as above: V = [T ; F + S].
  • T is the vector obtained by encoding the single characters of the at least one line of text content;
  • F is the vector obtained by extracting, with the region-of-interest pooling algorithm, the image information of the at least one line of text content from the whole image;
  • S is the vector obtained by encoding the position of the at least one line of text content; the vector dimensions of T, F and S are the same.
  • The acquisition module 401 is further configured to: use a neural network to extract the image information of the document to be processed.
  • The classification module 404 is further configured to: use a classifier function to obtain the probabilities of the feature sequence of the text over the predefined document categories; and
  • take the predefined document category with the highest probability value among the probabilities as the category of the document.
  • FIG. 5 shows a schematic flowchart of a method for training a document classification model provided by an embodiment of the present disclosure. As shown in FIG. 5, the method may mainly include the following steps:
  • S501: Predefine the categories of the test documents, and predefine the correct probability values for documents of each category.
  • The predefined test document categories are, for example, forms, contracts, bills, certificates, etc.
  • The correct probability values are, for example: the probability corresponding to the labeled category is 1, and the rest are 0.
  • S502: Acquire the text information and image information of the text contained in the test document.
  • S503: Process based on the text information and the image information to obtain the feature sequence of the text.
  • S504: Determine the predicted category of the test document and the predicted probability distribution of the test document over the categories, according to the feature sequence of the text and the predefined test document categories.
  • S505: Adjust the parameters of the document classification model based on the correct probability values and the predicted probability distribution, and obtain the target document classification model when preset conditions are met.
  • The preset conditions include the number of training rounds, the training time, and whether all training samples have been used; the preset conditions can also include whether the model converges in the later stage of training. For example, the correct probabilities (the probability corresponding to the labeled category is 1 and the rest are 0) and the predicted probability distribution are used with a minimum cross-entropy objective to compute and optimize the model parameters; model snapshots are saved at fixed intervals; once the model converges, that is, the cross-entropy no longer decreases in the later stage of training, the snapshot version with the minimum cross-entropy is taken as the optimal model for actual prediction.
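  • By way of illustration, a minimal training loop consistent with this description could look as follows; DocumentClassifier, the data loader and all hyperparameters are hypothetical placeholders, and nn.CrossEntropyLoss is used as the minimum cross-entropy objective against the labeled category.

```python
# Hedged training-loop sketch: minimize cross-entropy between the predicted
# distribution and the labeled category, saving snapshots at fixed intervals.
import torch
import torch.nn as nn

model = DocumentClassifier()            # hypothetical end-to-end model from the sketches above
criterion = nn.CrossEntropyLoss()       # cross-entropy against the labeled category
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
max_epochs, snapshot_interval = 50, 5   # illustrative preset conditions

for epoch in range(max_epochs):
    for batch, labels in train_loader:  # hypothetical loader of labeled documents
        loss = criterion(model(batch), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch % snapshot_interval == 0:
        torch.save(model.state_dict(), f"snapshot_{epoch}.pt")
# After training converges (cross-entropy no longer decreasing), the snapshot
# with the minimum cross-entropy is kept as the model for actual prediction.
```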
  • The technical solution provided by the present disclosure fuses multiple modalities, namely text content, text position and image information, and avoids obtaining document classification results by processing information of a single modality only.
  • The multimodal fusion approach effectively solves three problems: classification based only on the visual attributes of a document is limited to the document format and cannot handle similar-looking documents; classification based on plain text ignores the visual layout of the content in the document and the image information present in the document, which easily leads to semantic confusion; and using image and text independently of each other ignores the correlation between the two modal information sources, with a possibility of conflicting results.
  • The technical solution provided by the present disclosure can effectively resolve document confusion and improve classification accuracy.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 6 shows a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store various programs and data necessary for the operation of the device 600.
  • the computing unit 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604 .
  • A plurality of components in the device 600 are connected to the I/O interface 605, including: an input unit 606, such as a keyboard or a mouse; an output unit 607, such as various types of displays or speakers; a storage unit 608, such as a magnetic disk or an optical disk; and a communication unit 609, such as a network card, a modem or a wireless communication transceiver.
  • the communication unit 609 allows the device 600 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 601 may be various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, central processing units (CPU), graphics processing units (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSP), and any appropriate processors, controllers, microcontrollers, etc.
  • the computing unit 601 executes the various methods and processes described above, such as the document classification method.
  • The document classification method can be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 608.
  • part or all of the computer program may be loaded and/or installed on the device 600 via the ROM 602 and/or the communication unit 609.
  • When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the document classification method described above may be performed.
  • the computing unit 601 may be configured in any other appropriate way (for example, by means of firmware) to execute the document classification method.
  • Various implementations of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing devices, so that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include one or more wire-based electrical connections, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic input, speech input, or tactile input.
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: a local area network (LAN), a wide area network (WAN) and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • Steps may be reordered, added or deleted using the various forms of flow shown above.
  • Each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a document classification method and apparatus, an electronic device and a storage medium, which relate to the technical field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and can be applied in smart city and smart finance scenarios. A specific implementation is a document classification method, comprising the steps of: acquiring text information and image information of the text contained in a document to be processed; performing fusion on the basis of the text information and the image information so as to obtain a fusion feature; acquiring a feature sequence of the text according to the fusion feature; and determining the category of the document on the basis of a predefined document category and the feature sequence. The technical solution of the present disclosure solves the technical problem of document confusion in document classification and improves classification accuracy.
PCT/CN2022/094788 2021-08-27 2022-05-24 Method and apparatus for document classification, electronic device and storage medium WO2023024614A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110994014.X 2021-08-27
CN202110994014.XA CN113742483A (zh) 2021-08-27 Method and apparatus for document classification, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023024614A1 (fr)

Family

ID=78733361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094788 WO2023024614A1 (fr) Method and apparatus for document classification, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113742483A (fr)
WO (1) WO2023024614A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189193A (zh) * 2023-04-25 2023-05-30 杭州镭湖科技有限公司 Data storage visualization method and apparatus based on sample information
CN116912871A (zh) * 2023-09-08 2023-10-20 上海蜜度信息技术有限公司 ID card information extraction method and system, storage medium and electronic device
CN117112734A (zh) * 2023-10-18 2023-11-24 中山大学深圳研究院 Semantics-based intellectual property text representation and classification method, and terminal device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742483A (zh) * 2021-08-27 2021-12-03 北京百度网讯科技有限公司 Method and apparatus for document classification, electronic device and storage medium
CN114429637B (zh) * 2022-01-14 2023-04-07 北京百度网讯科技有限公司 Document classification method, apparatus, device and storage medium
CN114399775A (zh) * 2022-01-21 2022-04-26 平安科技(深圳)有限公司 Document title generation method, apparatus, device and storage medium
CN114445833B (zh) * 2022-01-28 2024-05-14 北京百度网讯科技有限公司 Text recognition method and apparatus, electronic device and storage medium
CN114626455A (zh) * 2022-03-11 2022-06-14 北京百度网讯科技有限公司 Financial information processing method, apparatus, device, storage medium and product
CN114898388B (zh) * 2022-03-28 2024-05-24 支付宝(杭州)信息技术有限公司 Document image classification method and apparatus, storage medium and electronic device
CN116152817B (zh) * 2022-12-30 2024-01-02 北京百度网讯科技有限公司 Information processing method, apparatus, device, medium and program product

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019769A1 (en) * 2018-07-15 2020-01-16 Netapp, Inc. Multi-modal electronic document classification
US20200302016A1 (en) * 2019-03-20 2020-09-24 Adobe Inc. Classifying Structural Features of a Digital Document by Feature Type using Machine Learning
CN111782808A (zh) * 2020-06-29 2020-10-16 北京市商汤科技开发有限公司 Document processing method, apparatus, device and computer-readable storage medium
CN112685565A (zh) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multimodal information fusion, and related device
CN112966522A (zh) * 2021-03-03 2021-06-15 北京百度网讯科技有限公司 Image classification method and apparatus, electronic device and storage medium
CN113033534A (zh) * 2021-03-10 2021-06-25 北京百度网讯科技有限公司 Method and apparatus for establishing a bill type recognition model and recognizing bill types
CN113742483A (zh) * 2021-08-27 2021-12-03 北京百度网讯科技有限公司 Method and apparatus for document classification, electronic device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344815B (zh) * 2018-12-13 2021-08-13 深源恒际科技有限公司 Document image classification method
CN110298338B (zh) * 2019-06-20 2021-08-24 北京易道博识科技有限公司 Document image classification method and apparatus
CN111680490B (zh) * 2020-06-10 2022-10-28 东南大学 Cross-modal document processing method and apparatus, and electronic device
CN113204615B (zh) * 2021-04-29 2023-11-24 北京百度网讯科技有限公司 Entity extraction method, apparatus, device and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200019769A1 (en) * 2018-07-15 2020-01-16 Netapp, Inc. Multi-modal electronic document classification
US20200302016A1 (en) * 2019-03-20 2020-09-24 Adobe Inc. Classifying Structural Features of a Digital Document by Feature Type using Machine Learning
CN111782808A (zh) * 2020-06-29 2020-10-16 北京市商汤科技开发有限公司 Document processing method, apparatus, device and computer-readable storage medium
CN112685565A (zh) * 2020-12-29 2021-04-20 平安科技(深圳)有限公司 Text classification method based on multimodal information fusion, and related device
CN112966522A (zh) * 2021-03-03 2021-06-15 北京百度网讯科技有限公司 Image classification method and apparatus, electronic device and storage medium
CN113033534A (zh) * 2021-03-10 2021-06-25 北京百度网讯科技有限公司 Method and apparatus for establishing a bill type recognition model and recognizing bill types
CN113742483A (zh) * 2021-08-27 2021-12-03 北京百度网讯科技有限公司 Method and apparatus for document classification, electronic device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN QIANG, FU YANWEI: "Stacked Self-Attention Networks for Visual Question Answering", PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 13 June 2019 (2019-06-13), pages 207 - 211, XP093038913, DOI: 10.1145/3323873.3325044 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189193A (zh) * 2023-04-25 2023-05-30 杭州镭湖科技有限公司 Data storage visualization method and apparatus based on sample information
CN116189193B (zh) * 2023-04-25 2023-11-10 杭州镭湖科技有限公司 Data storage visualization method and apparatus based on sample information
CN116912871A (zh) * 2023-09-08 2023-10-20 上海蜜度信息技术有限公司 ID card information extraction method and system, storage medium and electronic device
CN116912871B (zh) * 2023-09-08 2024-02-23 上海蜜度信息技术有限公司 ID card information extraction method and system, storage medium and electronic device
CN117112734A (zh) * 2023-10-18 2023-11-24 中山大学深圳研究院 Semantics-based intellectual property text representation and classification method, and terminal device
CN117112734B (zh) * 2023-10-18 2024-02-02 中山大学深圳研究院 Semantics-based intellectual property text representation and classification method, and terminal device

Also Published As

Publication number Publication date
CN113742483A (zh) 2021-12-03

Similar Documents

Publication Publication Date Title
WO2023024614A1 (fr) Method and apparatus for document classification, electronic device and storage medium
CN112966522B (zh) Image classification method and apparatus, electronic device and storage medium
US11816165B2 (en) Identification of fields in documents with neural networks without templates
US20220253631A1 (en) Image processing method, electronic device and storage medium
US11481605B2 (en) 2D document extractor
US11816710B2 (en) Identifying key-value pairs in documents
CN111709339A (zh) Bill image recognition method, apparatus, device and storage medium
CN112001368A (zh) Character structured extraction method, apparatus, device and storage medium
CN113094509B (zh) Text information extraction method, system, device and medium
US20240273134A1 (en) Image encoder training method and apparatus, device, and medium
US11972625B2 (en) Character-based representation learning for table data extraction using artificial intelligence techniques
US20230196805A1 (en) Character detection method and apparatus , model training method and apparatus, device and storage medium
CN114724156B (zh) Form recognition method and apparatus, and electronic device
CN114817612A (zh) Method for computing multimodal data matching degree and training the computation model, and related apparatus
CN114429633A (zh) Text recognition method, model training method, apparatus, electronic device and medium
CN112906368B (zh) Industry text increment method, related apparatus and computer program product
CN116311298A (zh) Information generation method, information processing method, apparatus, electronic device and medium
US20220392243A1 (en) Method for training text classification model, electronic device and storage medium
CA3060293A1 (fr) Extracteur de documents 2d
CN115880702A (zh) Data processing method, apparatus, device, program product and storage medium
CN112541055B (zh) Method and apparatus for determining text labels
CN114579876A (zh) False information detection method, apparatus, device and medium
CN114445833A (zh) Text recognition method and apparatus, electronic device and storage medium
CN113806541A (zh) Sentiment classification method, and training method and apparatus for a sentiment classification model
CN115497112B (zh) Form recognition method, apparatus, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22859962

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22859962

Country of ref document: EP

Kind code of ref document: A1