WO2023016925A1

WO2023016925A1 - System for extracting data from a document

Info

Publication number: WO2023016925A1
Application number: PCT/EP2022/071982
Authority: WO
Inventors: François BLAYO
Original assignee: Neoinstinct Sa
Priority date: 2021-08-13
Filing date: 2022-08-04
Publication date: 2023-02-16
Also published as: CH718888A2

Abstract

The invention makes it possible to automatically recognize any document based on a small number of extractions that are performed manually, these serving as a learning base for an automated system. Based on the automatic construction of models in relation to each metadatum through supervised machine learning, the method according to the invention is effective in processing any type of document and thus dealing with the variability of the content of the documents. The principle of the invention is based on each issuer of an accounting document presenting the content thereof in a structured and coherent manner. The invention consists in statistically identifying the structure of the information contained in the accounting documents produced by an issuer and in generalizing it so as then to be able to recognize the information contained in all accounting documents issued by one and the same issuer.

Description

System for extracting data from a document

Technical field of the invention

The present invention relates to a partly computer-implemented method for the recognition and automatic extraction of data from a document, in particular the automatic extraction of data from an accounting document.

State of the art

A 2015 study ("Driscoll") evaluating 832 financial divisions of companies using data from the "Open Standards Benchmarking" database of the APQC ("American Productivity & Quality Center"), highlighted what the financial divisions actually do all day. The leaders of these divisions were asked about the time spent by their organizations on transaction processing, control, decision support and management activities. The results showed that regardless of company size, about half of finance departments' time is spent processing transactions.

This means that in an average work week, employees in highly paid finance departments spend the equivalent of Monday morning to Wednesday noon making sure bills are paid, customers receive correct bills, general accounting work is done and fixed assets are accounted for, among other tasks that keep money flowing through the business.

In contrast, management requires finance teams to provide timely, reliable, and concise information on the economic impact of specific strategic and tactical actions. At the end of the first half, when there is still time to improve performance, management wants to know the impact on the cost structure of choosing strategy A rather than option B or C. They want to know how revenue and operating margin are determined by specific decisions about the allocation of the company's limited resources. For a finance team to find the time to support strategic business decisions while performing its operational tasks, manual work must be reduced.

For example, it is common for companies to receive 60% of supplier invoices on paper or in PDF (“Portable Document Format”). Someone has to manually enter this data into the company's financial systems. Some use optical character recognition. But it also takes time, and small and medium-sized businesses may not have the means to do it at scale. Thus, paper continues to clutter the system, requiring employees to be trained and managed, which costs time and money.

If the goal is to process transactions quickly, inexpensively, and without errors, businesses need to free their finance staff from moving piles of paper and focus them on understanding cost and demand drivers, resource requirements and operational constraints.

Thus, millions of documents produced every day are examined, processed, stored, verified and transformed into machine-readable data: for example, accounts payable and receivable, financial statements, administrative documents, human resources records, legal documents, payroll records, shipping documents and tax forms.

These documents generally require the data to be extracted before it can be processed.

Various techniques, such as Electronic Data Interchange (EDI), attempt to eliminate human processing efforts by encoding and transmitting documentary information in strictly formatted messages. Electronic data interchange is notorious for its custom computer systems, unfriendly software, and complex standards that have prevented its rapid spread throughout the supply chain. The vast majority of companies have avoided implementing EDI, which is perceived as too costly. Similarly, applications of XML, XBRL and other machine-readable document files are very limited compared to the use of paper documents and digital images such as PDF and JPEG (Joint Photographic Experts Group). To date, these documents are read and interpreted by human beings before being processed by computers. Specifically, there are three general methods of extracting data from documents:

• conventional extraction

• outsourcing

• automation.

Conventional data extraction requires workers with specific training, domain expertise, software knowledge, and/or cultural understanding. Data extractors must recognize documents, identify and extract relevant information from documents, and enter data appropriately and accurately into specific software. Such manual data extraction is complex, time-consuming and error-prone. As a result, the cost of data extraction is often very high, especially when it is performed by accountants, lawyers and other highly paid professionals as part of their job.

Conventional data extraction also exposes the entire document to data extraction workers. These documents may contain information of a confidential nature relating to the employment, family status, finances, legal, tax and other matters of associates and companies.

While conventional data extraction is done entirely on paper, outsourcing and automation start with converting paper into digital image files. This step is simple, easy and quick thanks to high-quality, fast and affordable scanners that are offered by many vendors. Once paper documents are converted into digital image files, document processing can be made more productive through the use of workflow software that routes documents to the least expensive workforce, internal or external. Primary processing can be done by junior staff; exceptions can be handled by better trained staff. Despite the potential productivity gains offered by workflow software through better use of human resources, the manual processing of documents remains a fundamentally expensive process.

Outsourcing requires the same education, expertise, training, software knowledge, and/or cultural understanding. As with conventional data extraction, data extractors need to recognize documents, find relevant information on documents, extract and enter data appropriately and accurately into particular software. Since outsourcing is manual, just like conventional data extraction, it is also complex, time-consuming, and error-prone. Companies often cut costs by outsourcing data extraction work to locations with lower labor costs. For example, extracting data from US tax and financial documents is a function that has been implemented using thousands of well-trained English-speaking workers in India and other low-wage countries.

The first step in outsourcing is to scan financial, tax or other documents and save the resulting image files. These image files can be accessed by data extractors through several methods. One method stores image files on the source organization's computer systems, and data extraction workers view the image files over networks (such as the Internet or private networks). Another method stores image files on third-party computer systems, and data extraction workers then view the image files on third-party servers across networks. Another method is to transmit the source organization's image files over networks and store them on the data extraction organization's computer system for data extraction workers to view.

For example, an accountant can scan various tax forms containing client financial data and transmit the scanned image files to an external company. An employee of the external company extracts the client's financial data and enters it into income tax software. The resulting tax software data file is then transmitted back to the accountant.

In this situation, many clients have seen quality issues in outsourced data extraction jobs. External service providers address these issues by hiring better trained or more experienced workers, providing them with more extensive training, extracting and entering data twice or more, or comprehensively checking the quality of their work. These measures consequently reduce the cost savings expected from outsourcing.

With outsourcing comes concerns about associated security risks such as fraud and identity theft. These security concerns apply to employees and temporary workers as well as to external and overseas workers who have access to documents containing sensitive information.

Although the transmission of scanned image files to the data extraction organization may be secured by cryptographic techniques, sensitive data and personally identifiable information are "in the clear", i.e. unencrypted, when read by the workers responsible for extracting the data before it is entered into the appropriate computer systems. Data extraction organizations publicly recognize the need for information security. Some data extraction organizations claim to investigate and perform background checks on employees. Many data extraction organizations advertise strictly limiting physical access to premises where employees enter data. Paper, writing materials, cameras or other recording technology may be prohibited on the premises. Additionally, employees can be inspected to ensure nothing is copied or removed. Since such seemingly comprehensive security measures are primarily physical in nature, they are imperfect and potentially unverifiable.

Due to these imperfections, breaches in physical security have occurred. For example, owners, managers or staff of data extraction organizations may misuse part or all of the unencrypted confidential information entrusted to them. Additionally, breaches of physical security and information systems by third parties may occur. As data extraction organizations are increasingly located abroad, Swiss citizens victimized in this way often have little or no recourse.

The third general method of data extraction involves partial automation, often combining optical character recognition, human inspection, and workflow management software.

Software tools that facilitate the automated extraction and transformation of document information are available from several vendors. The relative savings in operating costs facilitated by these tools is proportional to the degree of automation which depends in particular on the application, the quality of the customization of the software, the variety and the quality of the documents.

The first step in a partially automated data extraction operation is to scan financial, tax or other documents and save the resulting image files. The scanned images are compared against a database of known documents. Images that are not identified are routed to data extraction workers for conventional processing. The images that are identified have data extracted using templates, either location-based or tag-based, as well as optical character recognition (OCR) technology.

Optical character recognition is imperfect, with more than one percent of characters being incorrectly recognized. Moreover, paper documents are neither clean nor of high quality, suffering from being folded or damaged before scanning, distorted during scanning and degraded during post-scanning binarization. Therefore, some of the information needed to identify the data is often not recognizable and, as a result, some of the data cannot be extracted automatically.

Using conventional software tools, publishers claim to be able to extract up to 80-90% of data on a limited number of standard forms. When there is a wide range of forms, automated data extraction is very limited. Despite years of effort, many tax document automation vendors achieve identification rates of 50% or less in data extraction quality and admit many errors compared to conventional data extraction methods.

This rate decreases further when the documents are not taken from standardized forms. This is the case, for example, for supermarket receipts, urban transport tickets and all supporting documents that are not issued by a standard administrative supplier.

In an attempt to remedy these drawbacks at least in part, the document US20210117665A describes a method for use in an expense management platform to perform an improved content analysis of an imaged invoice document comprising at least one invoice. The expense management platform includes an automatic invoice analyzer (AIA) including an optical character recognition (OCR) engine, said automatic invoice analyzer (AIA) being operable to perform automated analysis of the at least one invoice. The expense management platform also includes a machine learning engine including a knowledge repository and a trained mechanism for performing visuo-linguistic analysis, wherein said mechanism includes a neural network. The method includes the steps of: receiving, via a communication interface, the imaged invoice document; pre-processing, by the automatic invoice analyzer (AIA), said at least one invoice; extracting, by the optical character recognition (OCR) engine, a set of OCR results associated with said at least one invoice; generating, by the automatic invoice analyzer (AIA), an OCR-enhanced image of the at least one invoice; applying, by the automatic invoice analyzer (AIA), a visuo-linguistic analysis to determine semantic information of at least one element of the at least one invoice; and producing, by the automatic invoice analyzer (AIA), one or more analysis results.

This solution is based on the detection of probabilities of presence of the data to be identified in the document. These probabilities are visualized in the form of heat maps, as illustrated in Figure 12 of the document US20210117665A. This involves combining a spatial representation with a semantic representation obtained after optical character recognition. The use of probabilities and heat maps is particularly complex and can limit the effectiveness of the method. Moreover, the solution proposed in US20210117665A can only process data structured in the form of a form.

In addition, the automatic extraction of metadata presents a technical difficulty which stems from the variability of the metadata format of the accounting documents. In general, for a given country, the presentation of the date, the VAT number and the amounts remains consistent. For example, a date in Switzerland will be formatted as follows: dd.mm.yyyy. In France, the format will be: dd/mm/yyyy. In the United States, the format will be: mm/dd/yyyy. There is also variability within a single country. For Switzerland, the date can also be formatted according to dd.mm.yy or even dd mmmmmm yy. The VAT identifier that appears on the accounting document is also subject to variations. For example, in Switzerland it can be formatted CHE-nnn.nnn.nnn TVA, CHE-nnn.nnn.nnn MWST, CHE-nnn.nnn.nnn IVA or CHE-nnn.nnn.nnn VAT. The total amount of the voucher can also vary depending on the format of the numbers. The separator can be a “.” or a “,”. For example: 150.00 or 150,00. The items that make up the description of each line of the accounting document can also vary in number. There can be one article as well as ten articles in the same accounting document. VAT rates can also be different within the same accounting document. The variability of the content of accounting documents creates a combinatorial complexity that makes an automatic extraction task difficult to achieve by a succession of conditions.
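The variability described above can be illustrated with a small rule-based sketch. The patterns and function names below are illustrative assumptions that cover only the formats quoted in this paragraph; every additional country, layout or spelling would require a further hand-written condition, which is the combinatorial difficulty the text points to.

```python
import re

# Hypothetical hand-written rules covering only the formats quoted in the text.
DATE_PATTERNS = [
    re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),  # Switzerland: dd.mm.yyyy
    re.compile(r"\b\d{2}\.\d{2}\.\d{2}\b"),  # Switzerland: dd.mm.yy
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),    # dd/mm/yyyy (France) or mm/dd/yyyy (US): ambiguous
]

# Swiss VAT identifier followed by one of the four suffixes listed in the text.
SWISS_VAT = re.compile(r"\bCHE-\d{3}\.\d{3}\.\d{3} (?:TVA|MWST|IVA|VAT)\b")


def find_dates(text):
    """Return every substring matching one of the hand-written date rules."""
    found = []
    for pattern in DATE_PATTERNS:
        found.extend(match.group(0) for match in pattern.finditer(text))
    return found


def parse_amount(text):
    """Parse an amount written with '.' or ',' as the decimal separator."""
    return float(text.replace(",", "."))
```

Note that the third date rule cannot tell a French date from a US one without knowing the issuer, which is one reason a succession of conditions does not scale.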

An object of the invention is to propose a system and a method for automatically extracting data from a document, for example an accounting document, structured or unstructured. Such an accounting document can for example be a supporting document, a bank statement, a supplier invoice or any document required for the establishment of an accounting ledger. Another object of the invention is to propose a simple, reliable, rapid and efficient solution for automatically extracting data from a large number of documents, in particular of a varied nature.

Summary of the invention

To this end, the invention firstly relates to a process for extracting data from a textually digitized target document, each data being characterized by its type in the form of metadata, said process comprising the steps of:

- design of a generic model from a plurality of documents each comprising at least one marked metadata, said generic model listing all the marked metadata,

- automatic production of a learning database from the generated generic model, said learning database comprising a plurality of learning documents each comprising all or part of the metadata types of the generated generic model, each type of metadata being associated with a value,

- generation of a plurality of specific models by training a plurality of neural networks in the same number as the number of specific models from the learning base, the training of each neural network resulting in the generation of a specific model representing a type of metadata listed in the generic model,

- textual and sequential reading of the target document using a sliding window and calculation, for each character of said sliding window and for each specific model, of the probability that said character belongs to the metadata corresponding to said specific model,

- identification, in the sliding window, of the specific model for which the average of the probabilities calculated for each character of at least one series of characters is greater than a predetermined threshold,

- determination of the metadata associated with the specific model identified,

- extraction of the value associated with said determined metadata.
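The last four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the two "models" are hand-written stand-ins for trained neural networks, a "series of characters" is taken to be a run of non-space characters, and the names `identify`, `toy_total_model` and `toy_date_model`, the window size and the threshold value are all hypothetical.

```python
import re


def toy_total_model(window):
    """Stand-in for a trained specific model (assumption): one probability per character."""
    return [0.9 if (c.isdigit() or c == ".") else 0.1 for c in window]


def toy_date_model(window):
    """Stand-in model that never recognizes anything."""
    return [0.05 for _ in window]


def identify(text, models, window_size=20, threshold=0.8):
    """Slide a window one character at a time and return (metadata type, value) pairs."""
    hits, seen = [], set()
    last_start = max(len(text) - window_size, 0)
    for start in range(last_start + 1):
        window = text[start:start + window_size]
        for name, model in models.items():
            probs = model(window)  # probability of each character, for this model
            # Assumption: a "series of characters" is a run of non-space characters.
            for run in re.finditer(r"\S+", window):
                a, b = start + run.start(), start + run.end()
                # Ignore runs that are cut off by the edge of the window.
                if a > 0 and not text[a - 1].isspace():
                    continue
                if b < len(text) and not text[b].isspace():
                    continue
                mean = sum(probs[run.start():run.end()]) / (b - a)
                if mean > threshold and (name, a, b) not in seen:
                    seen.add((name, a, b))
                    hits.append((name, text[a:b]))
    return hits
```

For instance, calling `identify("SOMME CHF 2914.85 MERCI", {"total": toy_total_model, "date": toy_date_model})` reports the series `2914.85` for the `total` model only, because the mean probability of that series is the only one above the threshold.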

The invention makes it possible to automatically recognize any document from a small number of extractions carried out manually which serve as a basis for learning an automated system.

Based on the automatic construction of models relating to each metadata by supervised machine learning, the method according to the invention is effective for processing any type of document and thus countering the variability of the content of documents, in particular accounting documents.

The constitution of a sufficiently exhaustive learning base makes it possible to ensure that the models offer a performance in accordance with the needs of the users. In particular, the user must find a benefit in simplicity and efficiency that will compensate for situations in which the automatic extraction of metadata from the accounting document will produce a partial result.

The principle of the invention is based on the fact that each issuer of an accounting document presents its content in a structured and coherent manner. The invention consists in statistically identifying the structure of the information contained in the accounting documents produced by an issuer and in generalizing it in order to then be able to recognize the information contained in all the accounting documents issued by the same issuer. For example, a retailer issues a receipt that contains the name of his company, the VAT number, the items purchased, the VAT rates applied for each item, the transaction total, the seller, the method of payment, the identification of the mode of payment, the address of the company. The invention described in this patent must make it possible to identify on a small number of tickets all the information previously described and then to be able to identify them for any ticket issued by this retailer. This operation must in particular make it possible to identify all the items sold, whatever their number, all VAT rates, all payment methods such as cash, bank card, payment card.

According to one characteristic of the invention, the method comprises a step of selecting a predetermined number of reference documents from an initial set of documents.

According to one aspect of the invention, the selection of the predetermined number of documents in an initial set of documents is carried out manually by an operator.

Advantageously, the method comprises a preliminary step of digitizing the target document by optical character recognition to allow textual reading.

According to one aspect of the invention, the method comprises a step of marking each piece of identifiable metadata in each selected document, preferably manually by an operator.

Advantageously, the method includes a step of recording in a memory zone the specific models generated.

Preferably, the automatic production (generation) of the training database from the generated generic model comprises the generation of at least one hundred training documents, preferably at least one thousand, more preferably at least ten thousand.

According to one aspect of the invention, each type of metadata is randomly associated with a value.

Advantageously, each document of the plurality of training documents includes all the types of metadata of the generated generic model.

Advantageously, the method includes a filtering step to ensure that each document includes at most one metadata of each type.

Preferably, a neural network is dedicated to generate each specific model.

In one embodiment, the sliding window slides one character on each iteration.

Preferably, the size of the sliding window is at least twenty characters, preferably at least fifty characters, for example one hundred characters.

The invention also relates to a computer program product characterized in that it comprises a set of program code instructions which, when executed by one or more processors, configure the processor or processors to implement a process as presented above.

The invention also relates to a module for extracting data from a textually digitized target document, said extraction module being configured to implement certain steps of the method as presented above.

The invention also relates to a system comprising an image capture module, a management module, a character recognition module, an extraction module as presented previously, a memory zone and a screen.

Brief description of the drawings

Figure 1 schematically illustrates an embodiment of the system according to the invention.

Figure 2 shows an example image of a reference document.

Figure 3 illustrates an embodiment of the learning phase.

Figure 4 illustrates an example of neural network learning from a sliding window.

Figure 5 illustrates the training substeps of the example in Figure 4.

Figure 6 illustrates an example of encoding by position of the characters framed by the <total> tag.

Figure 7 illustrates two examples of profiles of average probability of prediction of a metadata from detected characters.

Figure 8 illustrates an embodiment of the exploitation phase.

Figure 9 illustrates an example of metadata identification from prediction probability profiles of detected characters.

Figure 10 illustrates an example of an image of a document allowing the manual addition of metadata in a dedicated field.

Detailed description

Figure 1 shows a functional schematic example of an embodiment of the system 1 according to the invention.

I. System 1

The system 1 makes it possible to automatically extract one or more data from a so-called “target” document such as, for example, an accounting document, in particular of the invoice or receipt type, or any other document. In a target document, each data is characterized by its type and possibly its name. The type of data can be represented by metadata to identify said type algorithmically or by computer in a manner known per se. The role and processing of these metadata will be better understood in the light of the description which will be given below.

System 1 comprises an image capture module 10, a management module 20, a character recognition module 30, an extraction module 40, a memory zone 50 and a screen 60.

The image capture module 10, the management module 20, the character recognition module 30, the extraction module 40, the memory zone 50 and the screen 60 can be implemented by the same physical entity or by separate physical entities.

Preferably, as in the example illustrated in Figure 1, the image capture module 10 is implemented by a first physical entity, the management module 20, the character recognition module 30, the extraction module 40 and the memory zone 50 are implemented by a second physical entity, for example a server 2 or a computer, and the screen 60 constitutes a third physical entity, the three entities being connected together by a wired link, a wireless link or via one or more communication networks.

Image capture module 10

The image capture module 10 is configured to generate document images. These documents can be reference documents or target documents, as will be explained below.

The image capture module 10 can for example be a manual scanner, an automatic scanner, the camera of a smartphone, a camera and in general any device capable of generating an image of the accounting document and of producing a digital file, for example in JPEG (Joint Photographic Experts Group), TIFF, BMP (BitMaP) or PDF (Portable Document Format) format, or any other suitable format. Scanning devices include scanners connected directly to a computer, shared scanners connected to a computer via a network, and smart scanners with built-in computing functionality. Capture from smartphones with direct sending to storage systems such as Dropbox®, Tresorit® or OneDrive® can be used.

At the end of the image capture, the image(s) can be transmitted to the management module 20, which stores them in the memory zone 50 or transfers them to the character recognition module 30, or directly to the character recognition module 30.

Management module 20

The management module 20 is configured to control the various interactions with the image capture module 10, with the character recognition module 30, with the extraction module 40, with the memory zone 50 and with the screen 60. The management module 20 can include the memory zone 50 or be linked (directly or remotely) to the memory zone 50.

In order to allow interactions with the user via the screen 60, the management module 20 comprises a user interface (UI). Preferably, this user interface operates within a web browser such as, for example, Google Chrome®, Firefox®, Microsoft Edge®, Safari® or any standard browser available on the market or known in the state of the art.

The management module 20 is configured to allow the selection, preferably manually by an operator via the user interface, of a predetermined number of documents, called "reference", in an initial set of documents.

The management module 20 is configured to allow the marking, by an operator via the user interface, of each identifiable piece of metadata in each selected reference document.

Character recognition module 30

The character recognition module 30 is configured to encode a target document in textual form by optical character recognition (OCR).

Optical character recognition is a computer-based process for translating images of printed or typed text into text files. This process is implemented by software making it possible to recover the text in the image of a printed text and to save it in a file which can be used in a word processor for enrichment, and stored in a database or on another medium usable by a computer system.

The characters extracted from the image by the character recognition module 30 are transferred to the extraction module 40 via the management module 20 (or alternatively directly).

Extraction module 40

The extraction module 40 is configured to analyze the characters provided by the character recognition module 30. To this end, the extraction module 40 can be implemented by a computer, by a server, by a platform or any suitable device comprising a processor or several processors allowing the processing of the steps as will be described below.

The extraction module 40 is configured to automatically generate a generic model from a plurality of so-called “marked” documents. The marked documents correspond to reference documents each comprising at least one piece of metadata which has been marked by an operator via the management module 20.

The generic model generated by the extraction module 40 lists all the metadata marked in various reference documents by the operator. The extraction module 40 is configured to automatically generate a so-called “learning” database from the generated generic model. This is achieved by a generator which produces multiple examples from the "generic model". The generator is therefore a kind of simulator which will produce a whole set of different documents based on the generic model.

The learning database therefore comprises a large number of so-called "learning" documents, preferably at least several hundred or several thousand, each comprising all or part of the types of metadata of the generic model, each type of metadata being associated with a randomly generated data value.
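The generator described above can be sketched as follows. This is a minimal illustration under stated assumptions: the generic model is represented as a dictionary mapping each metadata type to a random value generator, each generated value is wrapped in `<type>` and `</type>` tags, and the names `GENERIC_MODEL` and `generate_learning_documents` as well as the two metadata types are hypothetical.

```python
import random

# Hypothetical generic model: each listed metadata type comes with a random value generator.
GENERIC_MODEL = {
    "date": lambda rng: f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.{rng.randint(2015, 2022)}",
    "total": lambda rng: f"{rng.uniform(1, 5000):.2f}",
}


def generate_learning_documents(model, count, seed=0):
    """Produce `count` learning documents, each with a random value for every metadata type."""
    rng = random.Random(seed)  # seeded so that the learning base is reproducible
    documents = []
    for _ in range(count):
        lines = [f"<{name}>{make_value(rng)}</{name}>" for name, make_value in model.items()]
        documents.append("\n".join(lines))
    return documents
```

Calling `generate_learning_documents(GENERIC_MODEL, 1000)` yields one thousand marked documents that differ only in their randomly drawn values.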

The extraction module 40 is configured to generate a plurality of specific models by training a plurality of neural networks in the same number from the learning base, the training of each neural network resulting in the generation of a specific model representative of a type of metadata listed in the generic model. By “same number”, we mean that a neural network is dedicated to one and only one specific model.
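The one-network-per-metadata-type arrangement can be pictured through the training data it implies. The sketch below only prepares one training set per metadata type and does not train any network; it assumes, as a hypothesis not stated in this passage, that each specific model learns from per-character binary labels derived from `<type>` and `</type>` tags. The names `char_labels` and `build_training_sets` are hypothetical.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")


def char_labels(tagged_text, metadata_type):
    """Strip all tags and label each remaining character 1 if it lies inside
    <metadata_type>...</metadata_type>, else 0."""
    plain, labels = [], []
    open_types = set()
    position = 0
    for tag in TAG.finditer(tagged_text):
        for c in tagged_text[position:tag.start()]:
            plain.append(c)
            labels.append(1 if metadata_type in open_types else 0)
        if tag.group(1):  # closing tag
            open_types.discard(tag.group(2))
        else:             # opening tag
            open_types.add(tag.group(2))
        position = tag.end()
    for c in tagged_text[position:]:
        plain.append(c)
        labels.append(1 if metadata_type in open_types else 0)
    return "".join(plain), labels


def build_training_sets(documents, metadata_types):
    """One training set per metadata type, i.e. one per specific model."""
    return {t: [char_labels(d, t) for d in documents] for t in metadata_types}
```

Each entry of the returned dictionary would then feed the single neural network dedicated to that metadata type.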

The extraction module 40 is configured to save the specific models generated in the memory area 50.

The extraction module 40 is configured to carry out a textual and sequential reading of a target document using a sliding window and to calculate, for each character of said sliding window and for each specific model, the probability of belonging of said character to the metadata corresponding to the specific model.

The extraction module 40 is configured to identify, in the sliding window, the specific model for which the mean of the probabilities calculated for each character of at least one series of characters is greater than a predetermined threshold.

The extraction module 40 is configured to determine the metadata associated with the specific model identified.

Memory area 50

The memory zone 50 is included in the management module 20 or is connected (directly or remotely) to the management module 20 in order to store the various digital documents used during the implementation of the invention, in particular the reference documents, marked documents, generic model, learning documents, specific models and target documents.

The memory zone 50 can for example be a hard disk on a local computer, on a file server, on a cloud service such as Dropbox®, S3®, Box® or OneDrive®, and in general any storage system which offers a management interface of the Application Programming Interface (API) type.

Screen 60

The screen 60 is connected to the management module 20 in order to display the documents, commands and results necessary for the implementation of the invention.

The screen 60 can be of any type: a simple display screen, a touch screen or any suitable screen.

II. Example of implementation

In the non-limiting example which will be described below, the target documents processed are slips of the cash receipt type comprising several types of data associated with different metadata. By way of example, one of the data may be a total sum of numerical values of the "price" type, noted "SOMME CHF" (sum in Swiss francs) in the slip, and which will be identified by metadata including a tag start tag, denoted <total>, and an end tag, denoted </total>. In this case, the value of the data will be of numeric type and will be indicated between the start tag and the end tag. For example, for a total sum of 2914.85 CHF on the slip, the extraction module 40 will note said sum <total>2914.85</total>.
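The tag notation of this example can be sketched as follows. The helper names `mark` and `read_marked` are hypothetical, and the regular expression assumes simple, non-nested tags whose names consist of word characters.

```python
import re


def mark(metadata_type, value):
    """Wrap a value in the start and end tags of its metadata type."""
    return f"<{metadata_type}>{value}</{metadata_type}>"


# A start tag, the shortest possible value, then the matching end tag.
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>")


def read_marked(text):
    """Return the (metadata type, value) pairs marked in a text."""
    return [(m.group(1), m.group(2)) for m in TAGGED.finditer(text)]
```

With these helpers, `mark("total", "2914.85")` produces the notation given above, and `read_marked` recovers the type and the value from it.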

The implementation of the method according to the invention assumes two distinct phases: a learning phase and an operating phase.

Learning phase

The learning phase is a preparatory phase which consists of: the selection of a set of so-called "reference" documents; the manual marking of said reference documents by an operator; the production of a generic model from the marked documents; the generation of a set of so-called “learning” documents from the generic model thus produced; and the training of neural networks on the generated learning documents in order to create specific models, each specific to a type of metadata.

This learning phase can for example be carried out for each group of similar target documents, in particular documents associated with the same issuer.

Marking

First of all, at step E0, the user of system 1 has a set of reference documents, for example selected by the user or available to the user. These documents are preferably chosen so as to present different forms and/or different types of data from the same issuer.

Reference documents can be in paper or electronic form. In their electronic form, documents are files, for example JPEG-type images or PDF-type files. In their paper form, the reference documents are submitted to the image capture module 10 which transforms them into computer files, for example of the JPEG or PDF type.

Once digitized (if they were not already in electronic form), the reference documents are sent to the character recognition module 30, which encodes them in textual form during a step E1 of optical character recognition.

Then, in a step E2, each reference document coded in textual form is presented to the user via the screen 60 so that the user can simply designate the metadata present in said reference document. The principle of the analysis consists in automatically extracting identifiable metadata from the reference document in textual form. In the example described below, the present invention provides for the extraction of nine identifiable metadata, which are:

• Date: the date of issue of the reference document;

• Identifier: the VAT number of the issuer of the reference document;

• Total: the total amount of the reference document;

• Item-name: the description of a line of the reference document;

• Item-value: the value corresponding to the item-name of the reference document;

• Item-taxcode: the VAT code corresponding to the item-name;

• Taxitem-code: the code corresponding to a VAT percentage;

• Taxitem-value: the amount of VAT corresponding to a Taxitem-code;

• Taxitem-percentage: the VAT percentage corresponding to a Taxitem-code.

In this example, identifiable metadata is of two types:

• Individual: Date, Identifier, Total, Item-name, Item-value, Item-taxcode, taxitem-code, taxitem-value, taxitem-percentage

• Composites: Items and Taxitems which are respectively composed of several Items and taxitems.

Referring to Figure 2, the image of the reference document, displayed on the screen 60 via the graphical user interface 400, is available in field 435 to allow the user to recognize it easily. The data extracted by the text extraction step are presented in field 440. The metadata to be identified are presented in fields 410 (Date), 415 (Identifier) and 423 (Total) for individual type metadata, in fields 420 (Item-name), 421 (Item-value) and 422 (Item-taxcode) for Items type metadata, and in fields 425 (Taxitem-code), 426 (Taxitem-value) and 427 (Taxitem-percentage) for Taxitems type metadata. The page can be saved via the “save” function 430. Field 450 is intended for the display of messages resulting from consistency processing on the metadata of the accounting document. This processing will not be described within the scope of the present invention.

In a step E3, the user will use the interface 400 to associate the data identified in the field 440 with the metadata to be identified (so-called “marking” step). To this end, the user selects a portion of text from field 440 and drags and drops this text into one of the metadata to be identified.

For example in Figure 6, the data 20.12.2018 present in the field 440 of the interface 400 has been dragged and dropped into the Date metadata 410. The marked documents are saved in the memory zone 50 in a step E4.

An extract from the database of an example document with the metadata identified is shown in the following Table 1:

Table 1

Metadata are identified by the tags that correspond to them. Each piece of data is enclosed between the opening tag and the closing tag of its metadata. For example, the "date" metadata is stored as <date>20.12.2018</date>.

This identification process is repeated for all available reference documents to obtain a set of marked reference documents.

Generic model

Once the reference documents have been marked, a step E5 generates a summary of all the marked reference documents in the form of a document called the “generic model”, corresponding in this example to a generic accounting document model. The extraction module 40 automatically builds (that is to say, generates) the generic model from the plurality of marked documents. The generic model lists all the metadata marked by the user in the reference documents. It is made by merging the different contents to produce a generative model of accounting documents, which consists of exhaustively listing and counting the occurrences of each line of the reference documents.

For example, in Table 2 below, which illustrates an example of a generic model, {{counts:22:34}}coop¶ means that the line “coop¶” appeared 22 times out of the 34 examples manually identified by the user; the line {{counts:1:34}}coo¶ appeared 1 time out of 34, and the line {{counts:3:34}}cood¶ appeared 3 times out of the 34 examples:

<item><name>{{text+}}</name> {{decimal}} <value>{{decimal}}</value><taxcode>{{int:1}}</taxcode></item>¶

<item><name>{{text+}}</name> i {{decimal}} <value>{{decimal}}</value><taxcode>{{int:1}}</taxcode></item>¶

<item><name>{{text+}}</name> {{decimal}} <value>{{decimal}}</value> <taxcode>{{int:1}}</taxcode></item>¶

<item><name>{{text+}}</name> {{decimal}} <value>{{decimal}}</value> <taxcode>{{int:1}}</taxcode>.</item>¶

<item><name>{{text+}}</name>{{int:1}} {{int:1}}{{decimal}} {{decimal}} <value>{{decimal}}</value> <taxcode>i</taxcode> a</item>¶

<item><name>{{text+}}</name> {{int:1}}{{decimal}} <value>{{decimal}}</value> <taxcode>{{int:1}}</taxcode>#</item>¶

<item><name>{{text+}}</name>: {{decimal}} {{decimal}} {{decimal}} <value>{{decimal}}</value> <taxcode>d</taxcode> a</item></items>¶

{{counts:9:34}}sum chf <total>{{decimal}}</total>¶

{{counts:25:34}}sum of <total>{{decimal}}</total>¶

{{counts:7:34}}species {{decimal}}¶

{{counts:7:34}}return -{{decimal}}¶

{{counts:22:34}}visa {{decimal}}¶

{{counts:2:34}}tuint {{decimal}}¶

{{counts:1:34}}tvint {{decimal}}¶

{{counts:1:34}}amexco {{decimal}}¶

{{counts:1:34}}anexco {{decimal}}¶

{{counts:22:34}}debit visa credit¶

{{counts:1:34}}VVVV uuuuuu uwv¶

{{counts:1:34}}man¶

{{counts:1:34}}{{int:2}}¶

{{counts:3:34}}xxxxxxxxxxxx{{int:4}} {{int:2}}:{{int:2}}¶

{{counts:2:34}}debit american express¶

{{counts:18:34}}xxxxxxxxxxxx{{int:4}}¶

{{counts:2:34}}xxxxxxxxxxx{{int:4}}¶

{{counts:21:34}}{{int:2}}.{{int:2}}.{{int:4}} {{int:2}}:{{int:2}}¶

{{counts:3:34}}{{int:2}}.{{int:2}}.{{int:4}}¶

{{counts:17:34}}#{{int:8}}*{{int:8}}/{{int:6}}/{{int:11}}#¶

{{counts:1:34}}#{{int:8}}*{{int:8}}/{{int:2}}/{{int:11}}#¶

{{counts:1:34}}#{{int:8}}-{{int:8}}/{{int:6}}/{{int:11}}#¶

{{counts:2:34}}#{{int:8}}*{{int:8}}/{{int:6}}/{{int:12}}¶

{{counts:2:34}}#{{int:8}}*{{int:15}}/{{int:11}}#¶

{{counts:1:34}}#{{int:8}}*{{int:8}}/{{int:2}}/{{int:11}}*¶

{{counts:22:34}}total-eft chf: {{decimal}}¶

{{counts:1:34}}total-efi chf: {{decimal}}¶

{{counts:1:34}}{{int:1}}¶

{{counts:1:34}}.. - ¶

{{counts:1:34}}total-eft che: {{decimal}}¶

{{counts:1:34}} - ¶

{{counts:30:34}}coop cooperative society,<ident>che-{{int:3}}.{{int:3}}.{{int:3}}</ident> tva¶

{{counts:1:34}}coop cooperative society,<ident>che-{{int:3}}.{{int:3}}.{{int:3}}</ident> iva¶

{{counts:1:34}}cd vat% jotal vat

{{counts:1:34}}cd vat% total iva¶

{{counts:1:34}}cd ta% total vat

{{counts:1:34}}cd tax totals

{{counts:1:34}}cd vat totals

<taxitems><taxitem><code>{{int:1}}</code> <percentage>{{decimal}}</percentage> {{decimal}} <taxvalue>{{decimal}}</taxvalue></taxitem>

<taxitem><code>{{int:1}}</code> <percentage>{{decimal}}</percentage> {{decimal}} <taxvalue>{{decimal}}</taxvalue></taxitem>

<taxitem><code>{{int:1}}</code> <percentage>{{decimal}}</percentage>{{decimal}} <taxvalue>{{decimal}}</taxvalue></taxitem>

<taxitem><code>{{int:1}}</code> <percentage>{{decimal}}</percentage>. {{decimal}} <taxvalue>{{decimal}}</taxvalue></taxitem></taxitems>¶

{{counts:3:34}}you save {{decimal}}¶

{{counts:2:34}}you save {{decimal}}¶

{{counts:18:34}}number of items purchased¶

{{counts:1:34}}number of items purchased >¶

{{counts:1:34}}item count

{{counts:22:34}}no supercard: }¶

{{counts:1:34}}no supercard:{{i sale¶

{{counts:1:34}}no supercard:{{i balance {{int:3}}¶

{{counts:1:34}}no supercard: } balance {{int:4}}¶

{{counts:1:34}}no supercard:{{i balance {{int:5}}¶

{{counts:1:34}}akciens points

{{counts:1:34}}no supercard:{{i {{int:4}}¶

{{counts:1:34}}no supercard:{{i

{{counts:2:34}}no supercaro:{{i

{{counts:1:34}}ang tens points balance {{int:4}}¶

{{counts:10:34}}old balance points {{int:4}}¶

{{counts:1:34}}old points balance balance {{int:4}} {{int:4}}¶

{{counts:2:34}}old balance points¶

{{counts:1:34}}superpoints sub-/total on excl purchase* {{decimal}} balance gg {{int:4}}¶

{{counts:1:34}}balance {{int:4}}¶

{{counts:2:34}}old points¶

{{counts:1:34}}old balance points {{int:3}}¶

{{counts:11:34}}old balance points {{int:5}}¶

{{counts:14:34}}sub/total excl* {{decimal}} {{int:2}}¶

{{counts:1:34}}sub/total.excl* {{decimal}} {{text+}}¶

{{counts:1:34}}sub/total excl * {{int:2}}¶

{{counts:1:34}}sub/total excl * {{int:1}}¶

{{counts:1:34}}sub/total excl*{{decimal}} {{int:2}} {{int:2}}¶

{{counts:1:34}}old sub/total points excl* {{decimal}} {{int:5}}¶

{{counts:5:34}}sub/total excl*{{decimal}} {{int:1}}¶

{{counts:5:34}}sub/total excl* {{decimal}}¶

{{counts:1:34}}{{text+}}ju¶

{{counts:13:34}}superpoints on purchase¶

{{counts:1:34}}superpoints on purchase sol de¶

{{counts:7:34}}superpoints on purchase {{int:2}}¶

{{counts:2:34}}superpoints on purchase {{int:5}}¶

{{counts:2:34}}superpoints on purchase {{int:4}}¶

{{counts:1:34}}superpoints on balance purchase {{int:4}}¶

{{counts:1:34}}superpoints on balance purchase {{int:2}} 1

{{counts:1:34}}superpoints on purchase {{int:2}}{{int:3}}51

{{counts:1:34}}superpoints on purchase {{int:2}}{{int:4}}51

{{counts:4:34}}new points sale¶

{{counts:1:34}}new sale points¶

{{counts:2:34}}new points¶

{{counts:1:34}}new sale points.¶

{{counts:1:34}}new balance points balance¶

{{counts:1:34}}new solid points¶

{{counts:6:34}}new balancepoints {{int:4}}¶

{{counts:1:34}}served by Mrs. Nzita¶

{{counts:1:34}}new points {{int:3}}¶

{{counts:1:34}}new points {{int:4}}¶

{{counts:1:34}}new pointsbalance balance {{int:4}}¶

{{counts:10:34}}new balance points {{int:5}}¶

{{counts:1:34}}new balancepoints. {{int:5}}¶

{{counts:17:34}}trophy points received¶

{{counts:1:34}}served by Mrs Pershantu

{{counts:2:34}}#with superpoints but no discounts¶

{{counts:1:34}}*with superpoints but no discounts¶

{{counts:1:34}}servi park.ciravegna¶

{{counts:24:34}}served by coop self-checkout¶

{{counts:5:34}}serui by coop self-checkout¶

{{counts:1:34}}serul by coop self-checkout¶

{{counts:1:34}}serut by coop self-checkout¶

{{counts:25:34}}thank you for visiting¶

{{counts:1:34}}{{int:1}} {{int:1}} {{int:1}} {{int:1}} {{int:1}} {{int:2}} {{int:1}} {{int:4}} {{int:1}} {{int:1}} {{int:1}} {{int:8}} {{int:1}} {{int:1}} {{int:1}}*{{int:2}} "u

{{counts:1:34}}thank you for visiting

{{counts:1:34}}nchini

{{counts:1:34}}looooo{{int:9}} mittiin {{int:4}} 1

{{counts:1:34}}"{{int:6}} {{int:4}} {{int:2}} {{int:1}} {{int:1}} {{int:1}} {{int:13}}"¶

{{counts:1:34}}"{{int:8}} {{int:2}} {{int:1}} {{int:1}} {{int:1}} {{int:15}}"¶

{{counts:1:34}}"{{int:20}} {{int:3}}*ot {{int:1}}'{{int:1}}"{{int:1}}" ¶

{{counts:1:34}}{{int:5}} {{int:3}} {{int:4}} {{int:8}} {{int:4}}*{{int:2}}*{{int:1}}'{{int:1}}"¶

{{counts:1:34}}"{{int:5}} {{int:2}} {{int:1}} {{int:13}}*{{int:1}}*{{int:2}}*{{int:2}}*{{int:2}}"¶

{{counts:8:34}}we thank you for your visit¶

{{counts:1:34}}{{int:10}} {{int:18}} 1

Table 2
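The listing and counting of lines described for step E5 can be illustrated by a minimal sketch. The function names `generalize` and `build_generic_model` are hypothetical, and the rules used here (a decimal is "digits, dot, digits"; any other run of digits becomes an `{{int:n}}` code; a line is counted once per document) are assumptions made for illustration, not the method as implemented by the extraction module 40. Tags and `{{text+}}` codes are left out for brevity.

```python
import re
from collections import Counter

# A decimal is digits.digits not glued to further digits or dots (so that a
# date such as 20.12.2018 is read as three integers); otherwise a digit run.
NUMBER = re.compile(r"(?<![\d.])\d+\.\d+(?![\d.])|\d+")

def generalize(line: str) -> str:
    """Replace concrete numbers by type codes such as {{decimal}} or {{int:4}}."""
    def code(match: re.Match) -> str:
        text = match.group()
        return "{{decimal}}" if "." in text else "{{int:%d}}" % len(text)
    return NUMBER.sub(code, line)

def build_generic_model(documents: list[list[str]]) -> list[str]:
    """List each generalized line with the number of documents containing it."""
    counts: Counter = Counter()
    for lines in documents:
        # dict.fromkeys removes duplicates within one document, keeping order.
        for pattern in dict.fromkeys(generalize(line) for line in lines):
            counts[pattern] += 1
    total = len(documents)
    return ["{{counts:%d:%d}}%s" % (n, total, p) for p, n in counts.items()]

reference_docs = [
    ["sum chf 18.75", "20.12.2018 14:05"],
    ["sum chf 2914.85", "21.12.2018 09:30"],
    ["sum of 5.00"],
]
generic_model = build_generic_model(reference_docs)
```

With these three toy documents, the two "sum chf" lines collapse into a single pattern counted twice, which mirrors how the counts in Table 2 summarize the 34 reference documents.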

For compound metadata of the Items or Taxitems type, the synthesis will produce a syntactic structure. For example, for an Items type metadata:

Such a structure means that an Items type metadata has been read, followed by an Item-name tag. Inside the tag, the data is made up of several text type characters. Following Item-name, several spaces were read, a numeric value was read, then a decimal value, then several spaces, followed by an Item-value composed of a decimal value, then an Item-taxcode composed of an integer value. The item then ends. The analysis also identified an item formed in a different way.

Syntactic analysis is performed in a similar way for Taxitems type metadata.

Learning database

In a step E6, the extraction module 40 automatically produces a learning database from the generated generic model, said learning database comprising a plurality of learning documents each comprising all or some of the metadata types of the generated generic model, each type of metadata being associated with a value.

An exhaustive learning base cannot be built manually, because this would require the manual identification of metadata in hundreds of accounting documents. A second technical difficulty follows from this observation: the user must be called upon to manually identify the metadata of the accounting documents, but only to a very limited extent. In practice, a user will agree to manually analyze and identify the metadata of a few accounting documents. Beyond that, the risk of the user rejecting the task is very high, and the system would therefore become unusable.

In the context of this invention, a limited number of metadata are identified and analyzed manually by a user in order to constitute a base of examples. This base is then analyzed automatically to generate a much larger learning base, which is finally used to automatically generate models of the identifiable metadata. The detailed process is described with reference to Figure 3.

At the end of the complete analysis, the generic model is used to generate the learning database, which is stored in the memory zone 50. This is achieved by producing so-called “learning” documents whose content is generated randomly while respecting the constraints of the generic model. This method is possible because the aim is to produce a large quantity of accounting documents that respect the general structure observed in a few documents, not the values they contain.

The method for generating the learning documents consists of randomly choosing lines from the generic model, such as “serui by coop self-checkout”. In the embodiment described, if the line only contains characters, it is kept as it is.

Still in the embodiment described, if the chosen line contains type codes, then values are randomly generated which correspond to these types in order to form a character sequence.

For example, the line <taxcode>{{int:1}}</taxcode></item>¶ produces the character sequence <taxcode>6</taxcode></item>¶. The type {{int:1}} has been replaced by the randomly chosen value 6.

After the random generation of the lines and line contents of a learning document, a filtering step (not shown) can be carried out to guarantee that said generated learning document is compliant. More precisely, the filtering consists in keeping only one occurrence of each Date, Identifier, Total, Items and Taxitems metadata, whereas the Item-name, Item-value, Item-taxcode, Taxitem-code, Taxitem-value and Taxitem-percentage metadata may appear any number of times.
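The generation and filtering steps just described can be sketched as follows. This is an illustration under stated assumptions: the names `fill`, `keep_single_occurrence` and `generate_document` are hypothetical, the value ranges chosen for `{{decimal}}` and `{{text+}}` are arbitrary, lines are drawn uniformly, and the filter only covers the Date, Identifier and Total tags (multi-line Items and Taxitems blocks are omitted for brevity).

```python
import random
import re

COUNTS = re.compile(r"^\{\{counts:\d+:\d+\}\}")           # occurrence prefix
TYPE_CODE = re.compile(r"\{\{\s*(?:int:\s*(\d+)|(decimal)|(text\+))\s*\}\}")
UNIQUE_TAGS = ("date", "ident", "total")                   # at most one each

def fill(pattern: str, rng: random.Random) -> str:
    """Replace every type code of a generic-model line by a random value."""
    def value(match: re.Match) -> str:
        if match.group(1):                                 # {{int:n}}: n digits
            return "".join(rng.choice("0123456789")
                           for _ in range(int(match.group(1))))
        if match.group(2):                                 # {{decimal}}
            return f"{rng.randint(0, 999)}.{rng.randint(0, 99):02d}"
        return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                       for _ in range(rng.randint(3, 10)))  # {{text+}}
    return TYPE_CODE.sub(value, COUNTS.sub("", pattern))

def keep_single_occurrence(lines: list[str]) -> list[str]:
    """Drop any line that repeats a tag which may only appear once."""
    seen: set[str] = set()
    kept = []
    for line in lines:
        tags = {tag for tag in UNIQUE_TAGS if f"<{tag}>" in line}
        if tags & seen:
            continue
        seen |= tags
        kept.append(line)
    return kept

def generate_document(model: list[str], n_lines: int,
                      rng: random.Random) -> list[str]:
    """Draw lines at random from the generic model, fill them, then filter."""
    lines = [fill(rng.choice(model), rng) for _ in range(n_lines)]
    return keep_single_occurrence(lines)

model = [
    "{{counts:9:34}}sum chf <total>{{decimal}}</total>",
    "{{counts:22:34}}visa {{decimal}}",
    "{{counts:5:34}}serui by coop self-checkout",
]
learning_document = generate_document(model, 30, random.Random(1))
```

A line made only of characters passes through `fill` unchanged, while a line with type codes receives fresh random values on every draw, so repeated calls produce as many distinct learning documents as needed.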

An example of a learning document generated by this method is shown in Table 3:

Table 3

At the end of this step E6, the memory zone 50 contains all the randomly generated documents, which will be used for learning the models.

Specific models

At the end of the generation of the learning base, the system triggers a step E7 of automatic modeling by training neural networks. To this end, the extraction module 40 generates a plurality of specific models by training, from the learning base, a plurality of neural networks equal in number to the specific models, the training of each neural network resulting in the generation of a specific model representative of a type of metadata listed in the generic model.

Step E7 consists in creating a set of neural networks which construct, by supervised learning, a model for recognizing the data present in a target document. In the context of the present invention, the choice of networks depends on the type of metadata. For individual metadata (Total, Identifier, Date), each model will be built by a neural network, so three networks will be needed for these three individual metadata. For metadata of the Items or Taxitems type, composed of several individual metadata, the models will be built at two levels: models for the recognition of the beginning and the end of the compound metadata, and models for the individual recognition of the metadata. An example of the neural network architecture for individual metadata is shown in Figure 4.

The general architecture used in this example for neural networks, which model individual metadata, consists of an input layer whose length is fixed by a parameter. In the case of the present invention, the length for the networks responsible for modeling the individual type metadata is fixed at one hundred characters.

This input layer is a sliding reading window that traverses the accounting document by shifting one character to the right for each learning step. The sliding reading window 510, limited for the illustration to twenty-five characters, contains the characters being read in the learning document.

The input characters are digitally encoded using an embedding layer (Keras). This method being known to those skilled in the art, it will not be detailed in the context of this invention. Each character is encoded with five numeric values. The network consists of three layers of neurons, the number of neurons per layer also being fixed by a parameter. In the context of the present invention, it is fixed at five times the size of the sliding window, i.e. five hundred neurons per layer.

The output layer 520 contains as many units as the size of the input window. The output values are “0” or “1” and thus perform position coding. The value “0” means that the value in the input layer at the corresponding position is not valid data for the metadata being modeled. The value “1” means that the value in the input layer at the corresponding position is valid data for the metadata being modeled. In the example of Figure 4, the modeled metadata is “Total”, and the value 18.75 corresponds to the Total value in the sliding window being read.
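To make the dimensions above concrete, the following sketch derives the layer sizes from the two parameters given in the text (a window of one hundred characters, five values per character, three layers of five times the window size). It assumes, for illustration only, that the embedded window is flattened before the first layer and that every layer is fully connected with a bias term; the names `layer_sizes` and `dense_parameter_count` are hypothetical, and the parameters of the embedding itself are not counted because the size of the character vocabulary is not stated.

```python
def layer_sizes(window: int = 100, embed_dim: int = 5,
                hidden_layers: int = 3, hidden_factor: int = 5) -> list[int]:
    """Sizes of the flattened input, the hidden layers and the output layer."""
    hidden = hidden_factor * window
    return [window * embed_dim] + [hidden] * hidden_layers + [window]

def dense_parameter_count(sizes: list[int]) -> int:
    """Weights plus biases of fully connected layers between successive sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

sizes = layer_sizes()                        # [500, 500, 500, 500, 100]
n_parameters = dense_parameter_count(sizes)  # 3 * 250500 + 50100
```

Under these assumptions each specific model has on the order of eight hundred thousand connection weights, which gives an idea of why a large generated learning base is useful.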

The parameters of the neural network whose architecture is presented in Figure 4 are fixed by a learning method whose process is presented in Figure 5 and with reference to Figure 6.

The learning process E7 begins with step E71 during which the system reads an accounting document in the memory area 50. In step E72, this document is scanned through a sliding window 376 whose size in number of characters is fixed by a parameter. A typical value for this parameter is one hundred characters. This sliding window is encoded as a vector Xi of text 377 where each position is a character and all tags are removed. The data in the sliding window is analyzed to detect the presence of a tag corresponding to the model being trained. For example, if the model being trained concerns the "Total" metadata, the system will look for the presence of the <total> and </total> tags in the text of the sliding window 376.

In parallel, a vector zi 378 of binary values, of the same length as Xi, is constructed to encode the position of the characters which correspond to the model being learned. This vector is filled with the value 1 at the positions which correspond to the characters surrounded by the tags <total> and </total>, as illustrated in FIG. 6 (step E73). In general, the vector zi position-codes the presence or the absence of characters corresponding to the text of the model being trained.

So that the sliding window is always of equal length, the content of the sliding window is filled with empty characters for the beginning and for the end of the reading of the accounting document. The two vectors Xi and zi are used to adapt the model parameters. Each individual metadata will be modeled by a neural network.
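The construction of the pair (Xi, zi) can be sketched as below. For simplicity this illustration strips the tags from the whole document first and then slides the window, which is an assumption about the order of operations; the names `encode` and `sliding_windows`, the space used as the "empty" character and the amount of padding (window size minus one on each side) are likewise hypothetical choices.

```python
import re

TAG = re.compile(r"</?[a-z_-]+>")

def encode(tagged: str, target: str) -> tuple[str, list[int]]:
    """Remove all tags; mark with 1 the characters inside <target>...</target>."""
    text, mask, inside, pos = [], [], False, 0
    for match in TAG.finditer(tagged):
        segment = tagged[pos:match.start()]
        text.append(segment)
        mask.extend([int(inside)] * len(segment))
        if match.group() == f"<{target}>":
            inside = True
        elif match.group() == f"</{target}>":
            inside = False
        pos = match.end()
    tail = tagged[pos:]
    text.append(tail)
    mask.extend([int(inside)] * len(tail))
    return "".join(text), mask

def sliding_windows(text: str, mask: list[int], size: int,
                    pad: str = " ") -> list[tuple[str, list[int]]]:
    """Return every (Xi, zi) pair, shifting the window by one character."""
    padded_text = pad * (size - 1) + text + pad * (size - 1)
    padded_mask = [0] * (size - 1) + mask + [0] * (size - 1)
    return [(padded_text[i:i + size], padded_mask[i:i + size])
            for i in range(len(padded_text) - size + 1)]

x, z = encode("sum chf <total>18.75</total> visa", "total")
pairs = sliding_windows(x, z, size=10)
```

Because of the padding, every window has the same length, including those read at the very beginning and at the very end of the document.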

In the example described here, there are nine individual pieces of metadata. There will therefore be nine neural networks whose parameters will be calculated in order to model each of them.

In the learning phase, each vector Xi, associated with its vector zi, is presented as input to the neural network. The values of Xi flow through the network to produce an output yi, which is compared to the expected output zi. The comparison method is an angle calculation of the "cosine similarity" type, known per se.

For the two vectors yi and zi, the angle θ is obtained from the scalar product and the norms of the vectors:

$$\cos\theta = \frac{y_i \cdot z_i}{\lVert y_i \rVert \, \lVert z_i \rVert}$$

As the value cos θ lies in the interval [-1, 1], the value "-1" indicates opposite vectors, "0" orthogonal vectors and "1" collinear vectors with a positive coefficient. The intermediate values make it possible to evaluate the degree of similarity.
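As a small worked illustration of this comparison, the cosine of the angle between an output vector and an expected vector can be computed as follows. The function name `cosine_similarity` is hypothetical, and returning 0.0 when one of the vectors is all zeros (for example a window that contains no character of the modeled metadata) is a convention chosen here, not something stated in the text.

```python
import math

def cosine_similarity(y: list[float], z: list[float]) -> float:
    """Cosine of the angle between y and z: scalar product over the norms."""
    dot = sum(a * b for a, b in zip(y, z))
    norms = math.sqrt(sum(a * a for a in y)) * math.sqrt(sum(b * b for b in z))
    return dot / norms if norms else 0.0   # assumed convention for zero vectors

same = cosine_similarity([0, 1, 1, 0], [0, 1, 1, 0])   # collinear vectors
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
```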

The parameters of the model consist of all the connections Wij 531, 532, 533 in Figure 5 between neurons of the different layers. They are adapted according to an adaptation algorithm in a step E74 (FIG. 6). In the context of the present invention, the adaptation of the coefficients is based on a gradient descent, the calculation of which is optimized by an ADAM (Adaptive Moment Estimation) optimizer. This process is repeated for all the examples available in the learning database.

When all the characters have been read, the sliding window 377 is shifted by one unit to the right and the steps E71 to E75 are repeated until the sliding window 377 reaches the last character of the learning document. This is done for each learning document of the learning database.
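The text names the ADAM optimizer without detailing it. For readers unfamiliar with it, the standard Adam update rule is sketched below on a plain list of parameters; this is a generic illustration, not the training code of the extraction module 40, and the function name `adam_step`, the `state` dictionary and the default hyperparameters are choices made for this example.

```python
import math

def adam_step(params: list[float], grads: list[float], state: dict,
              lr: float = 0.001, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> list[float]:
    """Apply one Adam update in place and return the parameters."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        # Exponential moving averages of the gradient and of its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Toy usage: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
state = {"t": 0, "m": [0.0], "v": [0.0]}
for _ in range(500):
    adam_step(w, [2 * (w[0] - 3.0)], state, lr=0.1)
```

In the described system the same kind of update would be applied to all the connection weights Wij, with the gradient obtained from the comparison between yi and zi.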

At the end of this process E7, the specific models are saved in the memory zone 50 in order to make them available for the exploitation phase.

Model improvement

The interpretation of the model built by this process is that for each character in the sliding window 377, the output from the network predicts the probability that the character in the sliding window does or does not belong to the section modeled by the network. Since a large number of predictions are obtained, all of the predictions obtained must be aggregated. All the output vectors yi produced from the learning base are thus added and an average is calculated.

$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$

where N is the number of examples available in the learning base.

The analysis of this average makes it possible to improve the processing intended to produce the prediction from the sliding window. Indeed, by analyzing the shape of the vectors with average probabilities over many samples, it appears that the probability that given characters belong to the section depends on the shape of the prediction. The prediction with a very marked profile will have greater credibility than a prediction with a smoother profile. An example is shown in Figure 7.

To differentiate the peaks, the method identifies the peaks within a signal based on their properties. It takes a one-dimensional array of values and finds all local maxima by simple comparison of neighboring values. It also takes into account the prominence of each peak which, in general, can be interpreted as how much the peak stands out from the surrounding region. In addition, filtering is applied to remove small spikes that may appear on the ticket characters. The prominence value is then used as a confidence measure. A very high prominence value usually means that the region is very likely to represent the section, so the region with the highest prominence is returned as the region of interest.
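The averaging and peak selection described above can be sketched in plain Python. The function names are hypothetical; the prominence is computed with the usual definition (height of the peak above the higher of the two lowest points reached on each side before a higher value or the edge of the signal), flat plateaus are ignored, and the `min_prominence` filter stands in for the additional filtering of small spikes, whose exact form is not given in the text.

```python
def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Position-wise average of the output vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def local_maxima(values: list[float]) -> list[int]:
    """Indices strictly higher than both neighbors (plateaus are ignored)."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]

def prominence(values: list[float], peak: int) -> float:
    """How far the peak rises above the higher of its two surrounding lows."""
    height = values[peak]
    left_min = right_min = height
    i = peak - 1
    while i >= 0 and values[i] <= height:            # stop at a higher value
        left_min = min(left_min, values[i])
        i -= 1
    j = peak + 1
    while j < len(values) and values[j] <= height:
        right_min = min(right_min, values[j])
        j += 1
    return height - max(left_min, right_min)

def most_prominent_peak(values: list[float], min_prominence: float = 0.0):
    """Index of the peak with the highest prominence, or None if filtered out."""
    candidates = [(prominence(values, i), i) for i in local_maxima(values)]
    candidates = [c for c in candidates if c[0] >= min_prominence]
    return max(candidates)[1] if candidates else None

signal = [0.0, 0.1, 0.9, 0.2, 0.4, 0.1, 0.0]
best = most_prominent_peak(signal, min_prominence=0.3)
```

In this toy signal the peak at index 2 stands out sharply, whereas the small bump at index 4 has a low prominence and is filtered out, which matches the idea that a marked profile is more credible than a smooth one.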

Specific preprocessing for compound metadata

Compound metadata requires specific pre-processing which is used to extract part of the document to build the model. Indeed, the Item-name, Item-value, Item-taxcode, Taxitem-code, Taxitem-value and Taxitem-percentage metadata can only appear in the context of Items or Taxitems metadata.

The learning base must therefore reflect this particular context and only contain sequences of characters that are found in this context. It is therefore necessary to identify the start and end characters of the parts of the documents containing the Items and Taxitems.

Once this processing has been carried out, the learning of the models is carried out according to the same architecture and the same process as for the individual items.
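A minimal sketch of this pre-processing, assuming that the compound parts are delimited in the learning documents by the tag pairs `<items>`...`</items>` and `<taxitems>`...`</taxitems>` (the helper name `compound_sections` is hypothetical):

```python
import re

def compound_sections(document: str, name: str) -> list[str]:
    """Return the text found between <name> and </name> in the document."""
    pattern = rf"<{re.escape(name)}>(.*?)</{re.escape(name)}>"
    return re.findall(pattern, document, flags=re.S)

document = (
    "coop\n"
    "<items><item><name>bread</name> <value>2.50</value></item></items>\n"
    "sum chf <total>2.50</total>"
)
item_context = compound_sections(document, "items")
```

Only the returned sections would then be passed through the sliding window when training the Item-name, Item-value and Item-taxcode models, so that these models never see characters from outside their context.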

At the end of all the learnings, the parameters of the models are recorded in the memory zone 50 in a step E8. These parameters will be used during the exploitation of the models when it comes to automatically recognizing the metadata of a new document called "target document" that has not been used for learning.

Operation phase

With reference to FIG. 8, the exploitation phase comprises in a step S1 the textual and sequential reading of a target document by the extraction module 40 using the sliding window and the calculation in a step S2, for each character of said sliding window and for each specific model, the probability that said character belongs to the metadata corresponding to said specific model. Inspired by the way the brain proceeds to read texts, the invention will make it possible to read a target document through a sliding window centered on each character read and which thus offers it a reading context.

At step S1, the accounting document is read in a textual form. This assumes that the image recognition phase has been carried out beforehand.

At step S2, the prediction of the metadata is triggered and the extraction module 40 calculates the probabilities of belonging to a piece of metadata for each character read in the sliding window from the models available in the memory area 50.

In a step S3, the extraction module 40 identifies, in the sliding window, the specific model for which the mean of the probabilities calculated for each character of at least one series of characters is greater than a predetermined threshold.

In a step S4, the extraction module 40 determines the metadata associated with the specific model identified.

In a step S5, the extraction module 40 extracts the value associated with said determined metadata.

Referring to Figure 9, for each position of the sliding window 377, the extraction module 40 calculates all the predictions of each model. It identifies the metadata that delivers the most credible prediction and, knowing the position of each character, the system copies them into the corresponding metadata field. In Figure 9, for the characters read in the sliding window, the “total” metadata model delivers the most credible prediction. The data in the location with the highest probabilities will be copied to the "Total" metadata field.
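Steps S2 to S5 can be illustrated for a single window position with the sketch below. The trained networks are replaced by stand-in functions returning one probability per character, and several details are assumptions made for the example: a "series of characters" is taken to be a contiguous run whose probabilities exceed a cutoff of 0.5, the threshold is set at 0.8, and the names `runs` and `extract` are hypothetical.

```python
from typing import Callable, Iterator, Optional

Model = Callable[[str], list[float]]   # one probability per window character

def runs(probs: list[float], cutoff: float = 0.5) -> Iterator[tuple[int, int]]:
    """Yield (start, end) of each contiguous run of probabilities above cutoff."""
    start = None
    for i, p in enumerate(list(probs) + [0.0]):    # sentinel closes a final run
        if p > cutoff and start is None:
            start = i
        elif p <= cutoff and start is not None:
            yield start, i
            start = None

def extract(window: str, models: dict[str, Model],
            threshold: float = 0.8) -> Optional[tuple[str, str]]:
    """Return (metadata, value) for the most credible model, or None."""
    best = None
    for name, model in models.items():
        probs = model(window)                      # S2: per-character probabilities
        for start, end in runs(probs):
            mean = sum(probs[start:end]) / (end - start)
            if mean > threshold and (best is None or mean > best[0]):
                best = (mean, name, window[start:end])   # S3 and S4
    return None if best is None else (best[1], best[2])  # S5

# Stand-ins for trained specific models (purely illustrative).
def total_model(window: str) -> list[float]:
    return [0.95 if 8 <= i < 13 else 0.02 for i in range(len(window))]

def date_model(window: str) -> list[float]:
    return [0.1] * len(window)

result = extract("sum chf 18.75 visa", {"total": total_model, "date": date_model})
```

Here the "total" stand-in is the only model whose mean probability over a run exceeds the threshold, so the characters at the corresponding positions are copied into the Total field.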

Add metadata manually

The “description” metadata can be enriched in a window displayed by the extraction module 40. With reference to FIG. 10, the user can fill in the field 720 of this interface to manually add a description of the accounting document. This step can be used to facilitate grouping of target documents automatically. The invention therefore allows efficient and reliable automatic extraction.

Claims

1. Method for extracting data from a textually digitized target document, each data being characterized by its type in the form of metadata, said method comprising the steps of:

- design (E5) of a generic model from a plurality of documents each comprising at least one marked metadata, said generic model listing all the marked metadata,

- automatic production (E6) of a learning database from the generated generic model, said learning database comprising a plurality of learning documents each comprising all or part of the metadata types of the generated generic model, each type of metadata being associated with a value,

- generation (E7) of a plurality of specific models by training a plurality of neural networks in the same number as the number of specific models from the learning base, the training of each neural network resulting in the generation of a specific model representing a type of metadata listed in the generic model,

- textual and sequential reading (S1) of the target document using a sliding window and calculation (S2), for each character of said sliding window and for each specific model, of the probability that said character belongs to the metadata corresponding to the specific model,

- identification (S3), in the sliding window, of the specific model for which the mean of the probabilities calculated for each character of at least one series of characters is greater than a predetermined threshold,

- determination (S4) of the metadata associated with the specific model identified,

- extraction (S5) of the value associated with said determined metadata.

2. Method according to claim 1, comprising a step (E0) of selecting a predetermined number of reference documents from an initial set of documents.

3. Method according to the preceding claim, in which the selection of the predetermined number of documents from an initial set of documents is carried out manually by an operator.

4. Method according to any one of the preceding claims, comprising a step of marking (E3) each identifiable metadata item in each selected document, preferably manually by an operator.

5. Method according to any one of the preceding claims, comprising a step of recording (E4), in a memory zone (50), the specific models generated.

6. Method according to any one of the preceding claims, in which the automatic production (E6) of the learning database from the generated generic model comprises the generation of at least one hundred learning documents, preferably at least a thousand, more preferably at least ten thousand.

7. Method according to any one of the preceding claims, in which each document of the plurality of training documents comprises all the types of metadata of the generated generic model.

8. Method according to any one of the preceding claims, comprising a filtering step to ensure that each document comprises at most one piece of metadata of each type.

9. Method according to any one of the preceding claims, in which the sliding window slides by one character at each iteration.

10. Extraction module (40) of data in a textually digitized target document, said extraction module (40) being configured to implement the method according to claim 1.