WO2021096381A1

WO2021096381A1 - Method and system for depersonalizing documents containing personal data

Info

Publication number: WO2021096381A1
Application number: PCT/RU2019/000819
Authority: WO
Inventors: Вадим Валерьевич ПОЛУЛЯХ; Алексей Владимирович СКУГАРЕВ; Владимир Михайлович СИДОРОВ
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2021-05-20

Abstract

The present technical solution relates to the field of computer technology, specifically to a method and system for depersonalizing images of documents containing personal data. The claimed solution is implemented as a computer-implemented method for removing personal data from a document image, which is carried out using a processor and comprises steps in which: a primary document image is received; said image is pre-processed to form an image of a specified size and resolution; the pre-processed image is reduced by a fourfold transition along the Gaussian pyramid; at least one morphological operation with a 3x3-pixel window is applied to the document image; the resulting image is enlarged by a fourfold transition along the Gaussian pyramid; and a final document image is formed with the personal data removed.

Description

METHOD AND SYSTEM FOR DEPERSONALIZING DOCUMENTS CONTAINING PERSONAL DATA

FIELD OF TECHNOLOGY

[0001] The present technical solution relates to the field of computer technology, in particular to a method and system for depersonalizing images of documents that contain personal data.

BACKGROUND ART

[0002] Today, the processing of personal data contained in various types of documents plays an increasingly important role in business activity. As policies governing the processing of such information are continually tightened, together with control over its integrity, storage, and protection against leaks, modern technologies offer increasingly advanced methods of depersonalizing and anonymizing personal data.

[0003] US patent US 7,158,979 B2 "System and method of de-identifying data" (INGENIX Inc., 02.01.2007) discloses a method based on processing textual information. To apply it to document images, OCR (Optical Character Recognition) must first be performed. One way or another, such technologies rely on Named Entity Recognition (NER). This approach cannot guarantee that all text containing personal data in the image is identified, which makes the method unsuitable for automatic operation.

[0004] US patent application US20190279011 A1 "Data anonymization using neural networks" (Microsoft Technology Licensing LLC, 09/12/2019) discloses a solution to the problem of training a classifier on a computing network node, the operator of which should not have access to the data for training. The technology is based on the property of poor interpretability of a neural network. The main idea of the invention in the said application is the idea of separating in the pretrained network an encoder operating in the computer network of the owner of the data and a classifier operating on computational nodes that have no right to read the initial data.

[0005] The retraining of the classifier can be accomplished by transmitting the correct response and encoded data. The advantage of this technology is simplicity and good scalability for various machine learning tasks. This technology makes it possible to use multiple data sources to train one model. The disadvantage of this method is the need to involve people with access to personal data for training.

[0006] Other known approaches are to split the image with personal data into information segments that are not personally identifiable. These approaches depend on the image splitting method and generally cannot guarantee correct segmentation. The main difficulty when working with personal data is the need to maintain the required level of confidentiality.

[0007] In different countries and areas of activity, the requirements may differ, but one way or another, these requirements do not allow the use of crowdsourcing sites like Yandex Toloka or Amazon Mechanical Turk. It is advisable to use these platforms to create large datasets suitable for building a document classifier using machine learning.

[0008] There are three approaches to solving the problem of creating large datasets for classifying documents with personal data:

• selection of annotators who are legally entitled to work with personal data;

• manual removal of personal information from the document;

• automatic deletion of personal data.

The first two methods are relatively expensive and difficult to implement. The implementation of the third method suffers from problems with the completeness of anonymization when implemented in an automated mode (some data may be missed by the algorithm).

DISCLOSURE OF THE INVENTION

[0009] The proposed solution addresses the technical problem of completely processing input images that contain personal data, and improves the quality of document anonymization, excluding recovery of the data, with processing performed automatically in real time.

[0010] The technical result is an improvement in the quality of removing personal data from document images in automatic mode, without losing the ability to classify the document type.

[0011] The claimed solution is implemented as a computer-implemented method for removing personal data from a document image, which is performed using a processor and comprises steps in which: a primary image of the document is obtained; the image is preprocessed to form an image of a given size and resolution; the preprocessed image is reduced by a fourfold transition along the Gaussian pyramid; at least one morphological operation with a 3x3 pixel window is applied to the document image; the resulting image is enlarged by a fourfold transition along the Gaussian pyramid; and a final image of the document with the personal data removed is formed.

[0012] In one particular embodiment of the method, at the preprocessing stage the aspect ratio of the image is brought to 5:7.

[0013] In another particular embodiment of the method, the image resolution is set to 840x600 pixels at the preprocessing stage.

[0014] In another particular embodiment of the method, the image is converted to a 24-bit color depth at the preprocessing stage.

[0015] In another particular embodiment of the method, at the preprocessing stage, the image is additionally converted into a JPEG format.

[0016] In another particular embodiment of the method, the type of morphological operation is selected from the group: opening, closing, erosion, dilation, median filter, morphological gradient, or combinations thereof.

[0017] In another particular embodiment of the method, the document contains fields with personal text data.

[0018] In another particular embodiment of the method, a feature vector is stored for the final image, on the basis of which the document type is determined.

[0019] The claimed solution is also implemented using a system for removing personal data from a document image, which includes at least one processor and at least one data storage means containing machine-readable instructions that, when executed by the processor, implement the above-described method.

BRIEF DESCRIPTION OF DRAWINGS

[0020] FIG. 1A - FIG. 1B illustrate a block diagram of the claimed method.

[0021] FIG. 2 illustrates an example of document processing.

[0022] FIG. 3 illustrates an example of a computing device for implementing the claimed solution.

IMPLEMENTATION OF THE INVENTION

[0023] FIG. 1A - FIG. 1B show the process of performing the claimed method (100) for transforming images of documents containing personal data. The document is transformed in an automated mode once the primary image (101) to be transformed has been obtained. The image can be obtained from various sources in one of the supported graphic formats, for example, PDF, TIFF, JPG, etc.

[0024] The document conversion method (100) may be performed using a standard computing device such as a computer. Document images for processing can be transmitted to said device via a data network such as the Internet, an intranet, a LAN, and the like.

[0025] Standardized document types are typically processed as document images, for example:

• TIN;

• SNILS;

• Passport of the Russian Federation, CIS countries, international passport;

• Driving license of the Russian Federation / USA / European Union;

• national ID-cards of the European Union, etc.

[0026] The above list of documents is not exhaustive and can be supplemented with new types of documents. The type of a document containing personal data can be recognized using a classifier based on a machine learning algorithm, for example, a neural network trained to recognize from an image the type of document to be processed further.

[0027] Typically, each of the above document types contains one or more text fields. Each such field is a local block of logically related text in the document. For example, the text fields in a passport include First Name, Last Name, Date of Birth, and Passport Number. A field is considered present in the document if at least one letter of its text can be read. Otherwise, the field is considered missing.

[0028] The task of anonymizing a document, that is, transforming it so as to delete personal data, can be stated as follows. Let a set $I$ of document images, for example photographic images, be given. Each image $I_k \in I$ corresponds to a feature vector $P_k$, which uniquely corresponds to one of the document types, for example, from the above list.

[0029] The coordinates of the vector $P_k$ are unknown, but it was found experimentally that they are related to the shape of the document and the location of color regions on the document. Let $A$ be the set of all fields with personal information. The image $I_k$ contains text fields with personal information $A_k \in A$. The task is to construct a transformation $f: I \to \Gamma$, where $\Gamma$ is a set of images without text fields containing personal information, such that $\forall I'_k \in \Gamma: P'_k = P_k$, where $P'_k$ denotes the feature vector of the transformed image $I'_k$.

[0030] The desired transformation f can be constructed as follows to give the final document image the required form. Preferably, the primary image of the document is obtained by photographing it with a camera, and the photograph should contain the entire document.

[0031] The height of a text field in documents is no more than 1/26 of the height of the entire photo (when the text is placed horizontally relative to the landscape orientation) and no more than 1/19 of the width of the image (when the text is placed vertically relative to the landscape orientation).

[0032] In photographic images taken with a camera, text fields have a nearly rectangular shape, the height of a field being the height of its tallest letter. Preferably, the aspect ratio of the photograph is approximately 14:5 and the photographic image of the document is in color (24-bit color depth). Meeting these requirements makes it possible to remove the printed text from the photo completely while keeping the ability to classify the document type at a sufficiently high level.

[0033] After the document image is obtained in step (101), it is preprocessed (102). At this stage, the image is converted to the specified size and resolution. Specifically, as shown in FIG. 1B, the image resolution is set to 840x600 pixels (by compression if the cropped image is larger than 840x600, otherwise by proportional stretching), and the sides of the image are brought to a ratio of 5:7. It is also preferable to convert the image to 24-bit JPEG format and to set its orientation to landscape. The transformation of the primary image can be carried out using well-known graphics processing algorithms that provide the required operations.
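The geometry of this preprocessing step can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the image is cropped, so a centered crop is assumed here, and the function name `plan_preprocessing` and its return convention are hypothetical.

```python
from fractions import Fraction

TARGET_W, TARGET_H = 840, 600  # resolution named in the text (landscape)

def plan_preprocessing(width, height):
    """Return (rotate, crop_box, scale) for bringing an image to 840x600.

    rotate   -- True if a portrait image has to be turned to landscape
    crop_box -- (left, top, right, bottom) of a centered crop (assumption)
    scale    -- horizontal factor after cropping (<1 compresses, >1 stretches)
    """
    rotate = height > width
    if rotate:
        width, height = height, width
    target = Fraction(TARGET_W, TARGET_H)  # 7/5 in landscape orientation
    if Fraction(width, height) > target:
        # Too wide: keep the full height and trim the width.
        crop_w, crop_h = int(height * target), height
    else:
        # Too tall (or already correct): keep the width, trim the height.
        crop_w, crop_h = width, int(width / target)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    crop_box = (left, top, left + crop_w, top + crop_h)
    scale = TARGET_W / crop_w
    return rotate, crop_box, scale

# A 1200x2000 portrait photo: rotate, crop to 1680x1200, then halve.
print(plan_preprocessing(1200, 2000))
```

The actual resampling and JPEG conversion would be delegated to whatever graphics library is in use; only the size arithmetic is shown here.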

[0034] If the input image is a multi-page PDF file, the document is converted into a set of JPEG images, each of which undergoes the processing of step (102). Next, a reduction transformation (103) is applied to the image preprocessed at step (102); it is performed using a fourfold transition along the Gaussian pyramid. As shown in FIG. 1B, each transition along the Gaussian pyramid further reduces the image resolution from that set in step (102).

[0035] The image pyramid is a sequence of N images in which each subsequent image is obtained from the previous one by filtering and decimating by a factor of two according to the scheme:

Original image $f_{N-1}(x, y)$

↓

Filtering of high frequencies with kernel $h(u, v)$: $g(x, y) = \iint f_{N-1}(x - u, y - v)\, h(u, v)\, du\, dv$

↓

Reduction of the size by a factor of 2: $f_N(x, y) = g(2x, 2y)$.

[0036] Image filtering is necessary to suppress high-frequency noise. The Gaussian function is used as the kernel $h(u, v)$; for this reason the pyramid is called Gaussian. According to Kotelnikov's theorem, compression in a Gaussian pyramid occurs with minimal loss of information. The image $f_N(x, y)$ is a miniature copy of the original image $f_1(x, y)$. The pixel size of the image of level $N$ is equal to $r_N = 2^{N-1}$. For the pixel coordinates of the images of two arbitrary pyramid levels with numbers $n$ and $m$, the following relations hold: $2^{n-1} x_n = 2^{m-1} x_m$, $2^{n-1} y_n = 2^{m-1} y_m$.
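One level of the pyramid (filter, then keep every second pixel) can be sketched in plain Python as below. The text does not specify the kernel size or border handling, so the 5-tap binomial kernel (1, 4, 6, 4, 1)/16 as an approximation of the Gaussian and the edge replication are assumptions, and the helper names are hypothetical.

```python
KERNEL = (1, 4, 6, 4, 1)  # assumed binomial approximation of a Gaussian, sum 16

def _blur_rows(img):
    """Convolve every row with KERNEL, replicating the edge pixels."""
    out = []
    for row in img:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0
            for k, w in enumerate(KERNEL):
                xi = min(max(x + k - 2, 0), n - 1)  # clamp to the row
                acc += w * row[xi]
            new_row.append(acc / 16)
        out.append(new_row)
    return out

def _transpose(img):
    return [list(col) for col in zip(*img)]

def pyr_down(img):
    """One pyramid level: g = f * h, then f_N(x, y) = g(2x, 2y)."""
    # The kernel is separable, so blur rows, then columns.
    blurred = _transpose(_blur_rows(_transpose(_blur_rows(img))))
    return [row[::2] for row in blurred[::2]]

# Four transitions on a small 32x16 grayscale test image.
level = [[(x * y) % 256 for x in range(32)] for y in range(16)]
for _ in range(4):
    level = pyr_down(level)
print(len(level[0]), len(level))  # width and height after four levels
```

Keeping every second sample starting from index 0 rounds odd sizes up, which is consistent with the 105x75 to 53x38 transition implied by the sizes in the text.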

[0037] As a result of steps (1031) - (1034), the document image is reduced to 53x38 pixels. The height of a line of text in documents is no more than 1/26 of the height of the image (when the text is placed horizontally relative to the landscape orientation) and no more than 1/19 of the width of the image (when the text is placed vertically relative to the landscape orientation). Thus, after the transition along the Gaussian pyramid, the height of a line of text will not exceed 2 pixels, so the letters of the text become almost indistinguishable.

[0038] At step (104), one or more morphological operations are applied to remove any information that may remain in the lines of text and to prevent partial recovery of the information. For example, the morphological opening operation with a 3x3 pixel window can be applied, which is defined by the formula $dst = open(src, element) = dilate(erode(src, element))$, where $dst$ is the final image, $src$ is the initial image, and $element$ is the window for applying the operation (in this case 3x3). The dilation operation is $dst(x, y) = \max_{(x', y'):\ element(x', y') \neq 0} src(x + x', y + y')$ (search for the maximum in the window and filling the window with this maximum), and the erosion operation is $dst(x, y) = \min_{(x', y'):\ element(x', y') \neq 0} src(x + x', y + y')$ (search for the minimum in the window and filling the window with this minimum). The following can also be used as morphological operations: closing, erosion, median filter, morphological gradient, or a combination thereof.
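The opening operation described above can be illustrated with a minimal pure-Python sketch. It assumes a full 3x3 window (every element non-zero) and clips the window at the image borders; the helper names and the test image are hypothetical.

```python
def _window_op(img, op):
    """Apply op (min or max) over a 3x3 window, clipped at the borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(y - 1, 0), min(y + 2, h))
                    for xx in range(max(x - 1, 0), min(x + 2, w))]
            row.append(op(vals))
        out.append(row)
    return out

def erode(img):
    return _window_op(img, min)

def dilate(img):
    return _window_op(img, max)

def opening(img):
    # dst = dilate(erode(src)), as in the formula above
    return dilate(erode(img))

# A 1-pixel-high bright streak (a stand-in for a reduced text line)
# on a dark background, plus a 3x3 bright block.
img = [[0] * 7 for _ in range(7)]
img[1][1:6] = [255] * 5          # thin streak
for y in range(3, 6):
    img[y][2:5] = [255] * 3      # 3x3 block
result = opening(img)
for row in result:
    print(row)
```

In this toy image the streak, being thinner than the window, disappears, while the 3x3 block survives. Opening removes thin bright details; for thin dark details on a light background the dual operation, closing, plays the same role, which is one reason the text lists several alternative operations.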

[0039] After the morphological operations have been applied to the reduced image, at step (105) the Gaussian pyramid is applied sequentially to enlarge the image (moving 4 levels up the Gaussian pyramid) and return it to the original size (possibly with minor deviations in the final resolution).
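The sizes involved can be checked with a short calculation. The sketch assumes that each reduction halves the size with rounding up and that each enlargement exactly doubles it; the rounding rule is an assumption, chosen because it reproduces the 53x38 figure given in the text.

```python
import math

def level_sizes(width, height, levels=4):
    """Image size at each pyramid level, halving with rounding up."""
    sizes = [(width, height)]
    for _ in range(levels):
        width, height = math.ceil(width / 2), math.ceil(height / 2)
        sizes.append((width, height))
    return sizes

down = level_sizes(840, 600)
small_w, small_h = down[-1]

# Moving four levels back up doubles the size four times.
restored = (small_w * 2 ** 4, small_h * 2 ** 4)

# A horizontal text line is at most 1/26 of the 600-pixel image height.
max_text_height = (600 / 26) / 2 ** 4

print(down)
print(restored)
print(round(max_text_height, 2))
```

Under these assumptions the restored image is 848x608 rather than 840x600, which illustrates the minor deviations in the final resolution, and a horizontal text line shrinks to about 1.4 pixels, below the 2-pixel bound stated earlier.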

[0040] The resulting image with the personal data removed is stored (106), for example, in a database, or transmitted for further processing. An important feature of the presented approach is that the information is deleted without the possibility of recovery. For the final image of the document, a feature vector is calculated and saved, from which the type of document can be determined. This procedure relies on the fact that the feature vector characterizing the document type is known for the input document.

[0041] FIG. 2 shows an example of processing a primary image of a document (210) containing personal data in order to remove the data from the final image of the document (220). The final document is converted into a form that excludes subsequent restoration of the original information, but by its feature vector it can still be matched to a particular standard document type.

[0042] FIG. 3 shows the architecture of a system (300) suitable for implementing the claimed method. The system can be built on a standard computing device (personal computer, server, server cluster, mainframe, etc.) and includes such components as: one or more processors (301), random access memory (302), data storage means (303), input/output interfaces (304), input/output means (305), and networking means (306).

[0043] The processor (301) executes the program logic and the computational operations required for the operation of the system (300). The processor (301) executes the necessary machine-readable instructions contained in the random access memory (302). The processor (301) (or multiple processors, a multi-core processor, etc.) can be selected from a range of devices that are currently widely used, for example, from such manufacturers as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. The processor, or one of the processors used in the system (300), should also be understood to include a graphics processor, for example, an NVIDIA GPU or Graphcore, which is likewise suitable for full or partial implementation of the claimed method and can also be used for training and applying machine learning models.

[0044] The random access memory (302) is, as a rule, implemented as RAM and contains the program logic that provides the required functionality. The data storage means (303) can be implemented as HDDs, SSDs, a RAID array, flash memory, optical storage devices (CD, DVD, MD, Blu-ray discs), etc. The means (303) provide long-term storage of various types of information.

[0045] Interfaces (304) are standard means for connecting and operating multiple devices, such as USB, RS232, RJ45, LPT, COM, HDMI, PS/2, Lightning, FireWire, and the like. The choice of interfaces (304) depends on the specific implementation of the system (300). The following can be used as data input/output means (305): keyboard, joystick, display (touchscreen display), projector, touchpad, mouse, trackball, light pen, speakers, microphone, etc.

[0046] Networking means (306) are selected from devices that provide network reception and transmission of data, for example, an Ethernet card, WLAN/Wi-Fi module, Bluetooth module, BLE module, NFC module, IrDA, RFID module, GSM modem, etc. The means (306) provide data exchange via a wired and/or wireless data transmission channel, for example, WAN, PAN, LAN, Intranet, Internet, WLAN, WMAN or GSM.

[0047] The presented description discloses only preferred examples of implementing the claimed solution and should not be interpreted as limiting other particular examples of its implementation that do not go beyond the scope of the claimed legal protection and are obvious to a person skilled in the relevant field of technology.

Claims

CLAIMS

1. A computer-implemented method for removing personal data from a document image, performed using a processor and comprising steps in which:

- a primary image of the document is obtained;

- the image is preprocessed to form an image of a given size and resolution;

- the preprocessed image is reduced by a fourfold transition along the Gaussian pyramid;

- at least one morphological operation with a 3x3 pixel window is applied to the document image;

- the resulting image is enlarged by a fourfold transition along the Gaussian pyramid;

- a final image of the document with the personal data removed is formed.

2. The method according to claim 1, characterized in that at the preprocessing stage the aspect ratio of the image is brought to 5:7.

3. The method according to claim 1, characterized in that at the stage of preprocessing, the image resolution is set to 840x600 pixels.

4. The method according to claim 1, characterized in that at the preprocessing stage the image is converted to a 24-bit color depth.

5. The method according to claim 1, characterized in that at the stage of preprocessing the image is additionally converted into JPEG format.

6. The method according to claim 1, characterized in that the type of morphological operation is selected from the group: opening, closing, erosion, dilation, median filter, morphological gradient, or combinations thereof.

7. The method according to claim 1, characterized in that the document contains fields with personal text data.

8. The method according to claim 1, characterized in that a vector of features is stored for the final image, on the basis of which the type of document is determined.

9. A system for removing personal data from a document image, the system including at least one processor and at least one data storage device containing machine-readable instructions which, when executed by the processor, implement the method according to any one of claims 1-8.