WO2021096381A1 - Method and system for anonymizing documents containing personal data - Google Patents
- Publication number
- WO2021096381A1 (PCT/RU2019/000819)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- document
- personal data
- preprocessing
- data
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
Definitions
- The present technical solution relates to the field of computer technology, in particular to a method and system for depersonalizing images of documents that contain personal data.
- US patent application US20190279011 A1 "Data anonymization using neural networks" (Microsoft Technology Licensing LLC, published 2019-09-12) discloses a solution to the problem of training a classifier on a computing network node whose operator should not have access to the training data.
- The technology is based on the poor interpretability of neural networks.
- The main idea of that application is to split a pretrained network into an encoder, which runs in the computer network of the data owner, and a classifier, which runs on computing nodes that have no right to read the initial data.
- The classifier can be retrained by transmitting the correct response together with the encoded data.
- The advantage of this technology is its simplicity and good scalability across various machine learning tasks.
- This technology makes it possible to use multiple data sources to train one model.
- The disadvantage of this method is the need to involve people with access to personal data for training.
- The first two methods are relatively expensive and difficult to implement.
- The third method suffers from incomplete anonymization when implemented in an automated mode (some data may be missed by the algorithm).
- The proposed solution addresses the technical problem of completely processing input images that contain personal data, and improves the quality of document anonymization, making recovery of the data impossible while the processing runs automatically in real time.
- The technical result is improved quality of removing personal data from document images in automatic mode, without losing the ability to classify the type of document.
- The claimed solution is carried out by a computer-implemented method for removing personal data from the image of a document, which is performed using a processor and contains the stages of: obtaining the primary image of the document; preprocessing said image to form an image of a given size and resolution; reducing the preprocessed image by a fourfold transformation using a transition along the Gaussian pyramid; applying at least one morphological operation with a 3x3 pixel window to the document image; enlarging the obtained image by a fourfold transformation using a transition along the Gaussian pyramid; and forming the final image of the document with the personal data removed.
- The aspect ratio of the image is brought to 5:7.
- The image resolution is set to 840x600 pixels at the preprocessing stage.
- The preprocessing step converts the image to a 24-bit color depth.
- The image is additionally converted into JPEG format.
- The type of morphological operation is selected from the group: opening, closing, erosion, dilation, median filter, morphological gradient, or combinations thereof.
- the document contains fields with personal text data.
- a feature vector is stored for the final image, on the basis of which the document type is determined.
- the claimed solution is also implemented using a system for removing personal data from a document image, which includes at least one processor and at least one data storage means containing machine-readable instructions that, when executed by the processor, implement the above-described method.
- FIG. 1A - FIG. 1B illustrate a block diagram of the claimed method.
- FIG. 2 illustrates an example of document processing.
- FIG. 3 illustrates an example of a computing device for implementing the claimed solution.
- FIG. 1A - FIG. 1B show the process of performing the claimed method (100) for transforming images of documents containing personal data.
- the transformation of the document is carried out in an automated mode upon obtaining the primary image (101) to be transformed.
- the image can be obtained from various sources in one of the supported graphic formats, for example, PDF, TIFF, JPG, etc.
- the document conversion method (100) may be performed using a standardized computing device such as a computer.
- Information for processing document images can be transmitted to said device via a data network such as the Internet, Intranet, LAN, and the like.
- Standardized document types are typically processed as document images, for example:
- Recognition of a document type for the content of personal data in it can be performed using a classifier based on a machine learning algorithm, for example, a neural network trained to recognize from an image the corresponding type of document to be further processed.
- each of the above document types contains one or more text fields.
- Each such field is a local block of logically related text in the document. For example, the text fields in a passport include First Name, Last Name, Date of birth and Passport Number. A field is present in the document if at least one letter of the text field can be read. Otherwise, the field is missing.
- The task of anonymizing a document, that is, transforming it to delete personal data, can be set as follows. Let a set of images of documents I, for example, photographic images, be given. Each such image I_k ∈ I corresponds to a vector of features P_k, which uniquely corresponds to one of the types, for example, from the above list of document types. The coordinates of the vector P_k are unknown, but it was experimentally found that they are related to the shape of the document and the location of color regions on the document.
- The desired transformation f can be constructed as follows to provide the desired appearance of the final image of the document. It is preferable that the primary image of the document is obtained by photographing it with a camera, and that the photograph contains the entire document.
- the height of the field with text in documents is no more than 1/26 of the height of the entire photo (when the text is placed horizontally relative to the landscape orientation) and no more than 1/19 of the width of the image (when the text is placed vertically relative to the landscape orientation).
- The image is converted to the specified size and resolution. Specifically, as shown in FIG. 1B, the image resolution is set to 840x600 pixels (by compression if the cropped image is larger than 840x600, otherwise by proportional stretching), and the sides of the image are brought to a 5:7 ratio. It is also preferable to convert the image to 24-bit JPEG and to set it to landscape orientation.
- the transformation of the primary image can be carried out using well-known graphics processing algorithms that provide the required operations.
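- The geometry of this preprocessing step can be sketched as follows. This is a minimal illustration, not the method from the source: it assumes the image is already in landscape orientation and that the crop is centered, and the function names `center_crop_box` and `scale_factor` are hypothetical. Only the 840x600 target (sides in a 5:7 ratio) comes from the text.

```python
from fractions import Fraction

TARGET_W, TARGET_H = 840, 600  # resolution named in the source (width:height = 7:5)


def center_crop_box(width, height, ratio=Fraction(TARGET_W, TARGET_H)):
    """Return (left, top, right, bottom) of the largest centered box
    whose width:height ratio equals `ratio` (centering is an assumption)."""
    if Fraction(width, height) > ratio:
        # Image is too wide: keep the full height and trim the sides.
        new_w = int(height * ratio)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Image is too tall (or already correct): keep the full width.
    new_h = int(width / ratio)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


def scale_factor(box):
    """Factor mapping the cropped box onto 840x600.
    A value below 1 means compression, above 1 proportional stretching."""
    left, _top, right, _bottom = box
    return TARGET_W / (right - left)
```

- For a 2000x1000 photograph the sketch keeps the central 1400x1000 region and then compresses it by a factor of 0.6.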
- If the input image is a PDF file with multiple pages, the document is converted into a set of JPEG images, each of which is subjected to the above processing in step (102).
- At step (103), a reduction transformation is applied to the image preprocessed at step (102), using a fourfold transition along the Gaussian pyramid. As shown in FIG. 1B, each transition along the Gaussian pyramid further reduces the image resolution from that set in step (102).
- The pyramid of images is a sequence of N images, in which each subsequent image is obtained from the previous one by filtering and decimating by a factor of two according to the scheme:
- Image filtering is necessary to suppress high frequency noise.
- The Gauss function is used as the kernel h(u, v). For this reason, the pyramid is called Gaussian. According to Kotelnikov's theorem (the Nyquist-Shannon sampling theorem), compression in a Gaussian pyramid occurs with minimal loss of information.
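- The filter-then-decimate step can be sketched in plain Python as follows. The source does not give the kernel coefficients, so the separable binomial kernel (1, 4, 6, 4, 1)/16, a common approximation of a Gaussian, and the edge replication at the borders are assumptions; the names `pyr_down` and `reduce_fourfold` are hypothetical. Images are lists of rows of grayscale values.

```python
KERNEL = (1, 4, 6, 4, 1)  # assumed binomial approximation of a Gaussian, sums to 16


def _blur_1d(row):
    """Convolve one row with KERNEL, replicating the edge pixels."""
    n = len(row)
    out = []
    for i in range(n):
        acc = 0
        for k, weight in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)  # clamp the index at the borders
            acc += weight * row[j]
        out.append(acc / 16)
    return out


def _transpose(img):
    return [list(col) for col in zip(*img)]


def pyr_down(img):
    """One pyramid level: low-pass filter, then keep every second pixel."""
    blurred = [_blur_1d(row) for row in img]  # filter along x
    blurred = _transpose([_blur_1d(col) for col in _transpose(blurred)])  # along y
    return [row[::2] for row in blurred[::2]]  # decimate by two in each direction


def reduce_fourfold(img):
    """Move four levels down the pyramid, as in step (103)."""
    for _ in range(4):
        img = pyr_down(img)
    return img
```

- Because the kernel weights sum to one after normalization, a uniformly colored region keeps its value while fine detail such as text strokes is averaged away at each level.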
- Image f_N(x, y) is a miniature copy of the original image f_1(x, y).
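- For reference, a standard textbook form of the Gaussian pyramid reduction, written with the symbols used here, is given below. It is not recovered from the source: the 5x5 kernel support and the index ranges are assumptions.

```latex
f_{n+1}(x, y) \;=\; \sum_{u=-2}^{2} \sum_{v=-2}^{2} h(u, v)\, f_{n}(2x + u,\; 2y + v),
\qquad n = 1, \dots, N - 1,
\qquad \sum_{u,v} h(u, v) = 1 .
```

- The factor 2 in the arguments of f_n expresses the decimation by two, and the weighted sum with h(u, v) expresses the low-pass filtering.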
- the following ratios are valid: 2 n- 1 y m .
- the document image is reduced to 53x38 pixels.
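- The 53x38 figure follows from halving 840x600 four times. The short sketch below reproduces the arithmetic; rounding odd sizes up at each level is an assumption (it is the default behaviour of OpenCV's `pyrDown`, for example), and the function name `pyramid_sizes` is hypothetical.

```python
def pyramid_sizes(width, height, levels):
    """List the image size at each pyramid level, halving with rounding up."""
    sizes = [(width, height)]
    for _ in range(levels):
        width, height = (width + 1) // 2, (height + 1) // 2
        sizes.append((width, height))
    return sizes


print(pyramid_sizes(840, 600, 4))
# [(840, 600), (420, 300), (210, 150), (105, 75), (53, 38)]
```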
- the height of a line of text in documents is no more than 1/26 of the height of the image (when the text is placed horizontally relative to the landscape orientation) and no more than 1/19 of the width of the image (when the text is placed vertically relative to the landscape orientation).
- the height of a line of text will not exceed 2 pixels, as a result of which the letters of the text will be almost indistinguishable.
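- As a worked check for horizontally placed text, take the 1/26 bound on line height together with the 600-pixel preprocessed height and the 38-pixel reduced height stated above:

```latex
\frac{600}{26} \approx 23.1\ \text{px (before reduction)}
\quad\longrightarrow\quad
\frac{38}{26} \approx 1.46\ \text{px (after reduction)} \;<\; 2\ \text{px}.
```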
- one or more morphological operations are applied to ensure the removal of possible remaining information in the lines of text, as well as to suppress the possible recovery of part of the information.
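- The morphological operations with a 3x3 window can be sketched on a grayscale image stored as a list of rows. This is an illustrative sketch only: the source does not say which operation is used for a given document or how borders are handled, so clipping the window at the image border is an assumption and the function names are hypothetical.

```python
def _window(img, y, x):
    """Values in the 3x3 neighbourhood of (y, x), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [img[j][i]
            for j in range(max(y - 1, 0), min(y + 2, h))
            for i in range(max(x - 1, 0), min(x + 2, w))]


def _apply(img, reducer):
    """Replace every pixel with reducer(3x3 neighbourhood)."""
    return [[reducer(_window(img, y, x)) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img):
    return _apply(img, min)


def dilate(img):
    return _apply(img, max)


def opening(img):
    return dilate(erode(img))


def closing(img):
    return erode(dilate(img))


def median(img):
    return _apply(img, lambda values: sorted(values)[len(values) // 2])
```

- On a light background, `closing` removes isolated dark pixels, which is the kind of residue a one-to-two-pixel line of text leaves after the reduction step.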
- the Gaussian pyramid is sequentially applied to enlarge the image (move 4 levels higher along the Gaussian pyramid) to return to the original size (possibly with minor deviations in the final resolution).
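- The "minor deviations in the final resolution" can be seen by doubling the reduced size four times. The sketch assumes that each upward transition exactly doubles both dimensions (the default of OpenCV's `pyrUp`, for example); the function name `expand_sizes` is hypothetical.

```python
def expand_sizes(width, height, levels):
    """List the image size at each level when moving up the pyramid."""
    sizes = [(width, height)]
    for _ in range(levels):
        width, height = width * 2, height * 2
        sizes.append((width, height))
    return sizes


print(expand_sizes(53, 38, 4))
# [(53, 38), (106, 76), (212, 152), (424, 304), (848, 608)]
```

- Under this assumption the result is 848x608 rather than 840x600, because the odd size 105x75 was rounded up to 53x38 on the way down.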
- the resulting image with the deleted personal data is stored (106), for example, in a database, or transmitted for further processing.
- An important feature of the presented approach is that information is deleted without the possibility of its recovery.
- A feature vector is calculated and saved, from which the type of document can be determined. This procedure relies on the fact that the feature vector characterizing the document type is known for the input document.
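- The source does not specify how the feature vector is computed or compared. Purely as a hypothetical illustration of the idea that it depends on the location of color regions, the sketch below uses the mean intensity of each cell of a coarse grid as the feature vector and picks the nearest reference vector by Euclidean distance. The names `grid_features` and `nearest_type`, the grid size, and the reference labels are all assumptions.

```python
import math


def grid_features(img, rows=2, cols=3):
    """Mean intensity of each cell in a rows x cols grid (hypothetical features)."""
    h, w = len(img), len(img[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = [img[y][x]
                    for y in range(r * h // rows, (r + 1) * h // rows)
                    for x in range(c * w // cols, (c + 1) * w // cols)]
            feats.append(sum(cell) / len(cell))
    return feats


def nearest_type(features, references):
    """Return the label of the reference vector closest to `features`."""
    return min(references, key=lambda name: math.dist(features, references[name]))
```

- Because the anonymized image keeps the overall layout of colored regions while losing the text, a coarse descriptor of this kind is unaffected by the removed personal data.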
- FIG. 2 shows an example of processing a primary image of a document (210) containing personal data for the purpose of removing them from the final image of a document (220).
- The final document is converted into a form that excludes subsequent restoration of the original information in it, but its feature vector still allows it to be matched to one or another standard document type.
- FIG. 3 shows the architecture of a system (300) suitable for implementing the claimed method, which can be built on a standard arrangement of computing devices (personal computer, server, server cluster, mainframe, etc.) and includes such components as: one or more processors (301), random access memory (302), data storage means (303), input / output interfaces (304), input / output means (305) and networking means (306).
- the processor (301) is designed to execute the program logic and the required computational operations necessary for the operation of the system (300).
- The processor (301) executes the necessary computer-readable instructions contained in the main memory (302).
- the processor (301) (or multiple processors, multi-core processor, etc.) can be selected from a range of devices that are currently widely used, for example, such manufacturers as: Intel TM, AMD TM, Apple TM, Samsung Exynos TM, MediaTEK TM, Qualcomm Snapdragon TM, etc.
- A graphics processor, for example, an NVIDIA GPU or Graphcore, is also suitable for full or partial implementation of the claimed solution, and can also be used for training and applying machine learning models.
- Random access memory (302) is made in the form of RAM and contains the necessary program logic that provides the required functionality.
- The data storage means (303) can take the form of HDDs, SSDs, a RAID array, flash memory, optical storage devices (CD, DVD, MD, Blu-ray discs), etc. The means (303) provide long-term storage of various types of information.
- Interfaces (304) are standard means for connecting and operating multiple devices, such as USB, RS232, RJ45, LPT, COM, HDMI, PS / 2, Lightning, FireWire, and the like. The choice of interfaces (304) depends on the specific implementation of the system (300). As means of data input / output (305) can be used: keyboard, joystick, display (touchscreen display), projector, touchpad, mouse, trackball, light pen, speakers, microphone, etc.
- Networking means (306) are selected from devices that provide network reception and transmission of data, for example, Ethernet card, WLAN / Wi-Fi module, Bluetooth module, BLE module, NFC module, IrDa, RFID module, GSM modem, etc.
- The means (306) provide data exchange via a wired and / or wireless data transmission channel, for example, WAN, PAN, LAN, Intranet, Internet, WLAN, WMAN or GSM.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Image Processing (AREA)
Abstract
The present invention relates to the field of computer technology, and in particular to a method and system for anonymizing images of documents that contain personal data. The invention is carried out by a computer-implemented method for removing personal data from a document image, which is performed using a processor and comprises the following steps: obtaining a primary image of the document; preprocessing said image, during which an image of a given size and resolution is generated; reducing the preprocessed image by a fourfold transformation using a transition along a Gaussian pyramid; applying at least one morphological operation with a 3x3 pixel window to the document image; enlarging the resulting image by a fourfold transformation using a transition along a Gaussian pyramid; and generating a final image of the document from which the personal data has been removed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2019/000819 WO2021096381A1 (fr) | 2019-11-15 | 2019-11-15 | Procédé et système d'anonymisation de documents contenant des données personnelles |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2019/000819 WO2021096381A1 (fr) | 2019-11-15 | 2019-11-15 | Procédé et système d'anonymisation de documents contenant des données personnelles |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021096381A1 (fr) | 2021-05-20 |
Family
ID=75912108
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/RU2019/000819 WO2021096381A1 (fr) | 2019-11-15 | 2019-11-15 | Procédé et système d'anonymisation de documents contenant des données personnelles |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021096381A1 (fr) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160330438A1 (en) * | 2015-05-07 | 2016-11-10 | Government Of The United States, As Represented By The Secretary Of The Air Force | Morphological Automatic Landolt C Orientation Detection |
US20170302661A1 (en) * | 2016-04-17 | 2017-10-19 | International Business Machines Corporation | Anonymizing biometric data for use in a security system |
CN110110769A (zh) * | 2019-04-24 | 2019-08-09 | 长安大学 | 一种基于宽度径向基函数网络的图像分类方法 |
US20190279011A1 (en) * | 2018-03-12 | 2019-09-12 | Microsoft Technology Licensing, Llc | Data anonymization using neural networks |
-
2019
- 2019-11-15 WO PCT/RU2019/000819 patent/WO2021096381A1/fr active Application Filing
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bayar et al. | Design principles of convolutional neural networks for multimedia forensics | |
US11023708B2 (en) | Within document face verification | |
CN105678292A (zh) | 基于卷积及递归神经网络的复杂光学文字序列识别系统 | |
CN110599387A (zh) | 一种自动去除图片水印的方法及装置 | |
CN109446345A (zh) | 核电文件校验处理方法以及系统 | |
JP6882362B2 (ja) | 身元確認書類を含む画像を識別するシステムおよび方法 | |
CN112686258A (zh) | 体检报告信息结构化方法、装置、可读存储介质和终端 | |
Revathi et al. | Comparative analysis of text extraction from color images using tesseract and opencv | |
LU93381B1 (en) | Systems, methods and devices for tamper proofing documents and embedding data in a biometric identifier | |
Dergachov et al. | Data pre-processing to increase the quality of optical text recognition systems | |
US20190188511A1 (en) | Method and system for optical character recognition of series of images | |
Tymoshenko et al. | Real-Time Ukrainian Text Recognition and Voicing. | |
RU2672395C1 (ru) | Способ обучения классификатора, предназначенного для определения категории документа | |
AU2021312111A1 (en) | Classifying pharmacovigilance documents using image analysis | |
CN111178398B (zh) | 检测身份证图像信息篡改的方法、系统、存储介质及装置 | |
WO2021096381A1 (fr) | Procédé et système d'anonymisation de documents contenant des données personnelles | |
RU2793607C1 (ru) | Способ и система обезличивания документов, содержащих персональные данные | |
US20220398399A1 (en) | Optical character recognition systems and methods for personal data extraction | |
Panchal et al. | An investigation on feature and text extraction from images using image recognition in Android | |
Alaei et al. | Document Image Quality Assessment: A Survey | |
CN111414889B (zh) | 基于文字识别的财务报表识别方法及装置 | |
Lu et al. | A novel assessment framework for learning-based deepfake detectors in realistic conditions | |
Fang | Semantic segmentation of PHT based on improved DeeplabV3+ | |
Shetty et al. | Automated Identity Document Recognition and Classification (AIDRAC)-A Review | |
Bhatt et al. | Text Extraction & Recognition from Visiting Cards |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19952369; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.09.2022) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19952369; Country of ref document: EP; Kind code of ref document: A1 |