WO2021096381A1 - Method and system for anonymizing documents containing personal data - Google Patents

Method and system for anonymizing documents containing personal data

Info

Publication number
WO2021096381A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
document
personal data
preprocessing
data
Prior art date
Application number
PCT/RU2019/000819
Other languages
English (en)
Russian (ru)
Inventor
Вадим Валерьевич ПОЛУЛЯХ
Алексей Владимирович СКУГАРЕВ
Владимир Михайлович СИДОРОВ
Original Assignee
Федеральное Государственное Автономное Образовательное Учреждение Высшего Образования "Московский Физико-Технический Институт" (Национальный Исследовательский Университет) (Мфти)
Priority date
Filing date
Publication date
Application filed by Федеральное Государственное Автономное Образовательное Учреждение Высшего Образования "Московский Физико-Технический Институт" (Национальный Исследовательский Университет) (Мфти)
Priority to PCT/RU2019/000819
Publication of WO2021096381A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication

Definitions

  • The present technical solution relates to the field of computer technology, in particular to a method and system for depersonalizing images of documents that contain personal data.
  • US patent application US20190279011 A1, "Data anonymization using neural networks" (Microsoft Technology Licensing LLC, published 2019-09-12), addresses the problem of training a classifier on a computing network node whose operator should not have access to the training data.
  • The technology relies on the poor interpretability of neural networks.
  • The main idea of that application is to split a pretrained network into an encoder, which runs in the data owner's computer network, and a classifier, which runs on computational nodes that have no right to read the initial data.
  • The classifier can be retrained by transmitting the correct response together with the encoded data.
  • The advantage of this technology is its simplicity and good scalability across machine learning tasks.
  • This technology makes it possible to use multiple data sources to train one model.
  • The disadvantage of this method is the need to involve people with access to personal data for training.
  • The first two methods are relatively expensive and difficult to implement.
  • The implementation of the third method suffers from incomplete anonymization when run in an automated mode (the algorithm may miss some data).
  • The proposed solution addresses the technical problem of completely processing input images that contain personal data, and improves the quality of document anonymization, without the possibility of recovery, with processing performed automatically in real time.
  • The technical result is improved quality of removing personal data from document images in automatic mode, without losing the ability to classify the document type.
  • The claimed solution is implemented as a computer-implemented method for removing personal data from a document image, which is performed by a processor and comprises the steps of: obtaining the primary image of the document; preprocessing said image to produce an image of a given size and resolution; reducing the preprocessed image by a fourfold transformation along the Gaussian pyramid; applying at least one morphological operation with a 3x3 pixel window to the document image; enlarging the resulting image by a fourfold transformation along the Gaussian pyramid; and forming the final image of the document with the personal data removed.
  • The aspect ratio of the image is brought to 5:7.
  • The image resolution is set to 840x600 pixels at the preprocessing stage.
  • The preprocessing step converts the image to a 24-bit color depth.
  • The image is additionally converted to JPEG format.
  • The type of morphological operation is selected from the group: opening, closing, erosion, dilation, median filter, morphological gradient, or combinations thereof.
  • The document contains fields with personal text data.
  • A feature vector is stored for the final image, on the basis of which the document type is determined.
  • The claimed solution is also implemented as a system for removing personal data from a document image, which includes at least one processor and at least one data storage means containing machine-readable instructions that, when executed by the processor, implement the above-described method.
  • FIG. 1A - FIG. 1B illustrate a block diagram of the claimed method.
  • FIG. 2 illustrates an example of document processing.
  • FIG. 3 illustrates an example of a computing device for implementing the claimed solution.
  • FIG. 1A - FIG. 1B show the process of performing the claimed method (100) for transforming images of documents containing personal data.
  • The document is transformed in an automated mode once the primary image (101) to be transformed is obtained.
  • The image can be obtained from various sources in one of the supported graphic formats, for example, PDF, TIFF, JPG, etc.
  • The document conversion method (100) may be performed on a standard computing device such as a computer.
  • Information for processing document images can be transmitted to said device via a data network such as the Internet, an intranet, a LAN, and the like.
  • Standardized document types are typically processed as document images, for example:
  • Whether a document is of a type that contains personal data can be recognized using a classifier based on a machine learning algorithm, for example, a neural network trained to recognize from an image the corresponding type of document to be further processed.
  • Each of the above document types contains one or more text fields.
  • Each such field is a local block of logically related text in the document. For example, the text fields in a passport include First Name, Last Name, Date of Birth and Passport Number. A field is present in the document if at least one letter of the text field can be read. Otherwise, the field is missing.
  • The task of anonymizing a document, that is, transforming it to delete personal data, can be stated as follows. Let a set of document images I, for example photographic images, be given. Each image I_k ∈ I corresponds to a feature vector P_k, which uniquely corresponds to one of the document types, for example, from the above list of document types. The coordinates of the vector P_k are unknown, but it was found experimentally that they are related to the shape of the document and the location of color regions on the document.
  • The desired transformation f can be constructed as follows to provide the desired appearance of the final image of the document. It is preferable that the primary image of the document is obtained by photographing it with a camera, and the photograph should contain the entire document.
  • The height of a text field in documents is no more than 1/26 of the height of the entire photo (when the text is placed horizontally relative to the landscape orientation) and no more than 1/19 of the width of the image (when the text is placed vertically relative to the landscape orientation).
  • The image is converted to the specified size and resolution. Specifically, as shown in FIG. 1B, the image resolution is set to 840x600 pixels (by compression if the cropped image is larger than 840x600, otherwise by proportional stretching), and the aspect ratio of the image is brought to 5:7. It is also preferable to convert the image to 24-bit JPEG and to set it to landscape orientation.
  • The primary image can be transformed using well-known graphics processing algorithms that provide the required operations.
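The preprocessing decisions described above (orientation, cropping to the target aspect ratio, then compression or stretching to 840x600) can be sketched as plain arithmetic. This is a minimal sketch under stated assumptions: the function name `plan_preprocessing` is hypothetical, and the text does not say how the crop window is chosen, so the largest window with the target aspect ratio is assumed here.

```python
TARGET_W, TARGET_H = 840, 600  # landscape target size taken from the text


def plan_preprocessing(width, height):
    """Describe the steps needed to bring an image to 840x600 landscape."""
    # Rotate portrait images so that the long side is horizontal.
    rotate = height > width
    if rotate:
        width, height = height, width

    # Largest crop window with the target aspect ratio (assumption).
    if width * TARGET_H > height * TARGET_W:
        # Image is too wide: keep the full height, trim the width.
        crop_w, crop_h = height * TARGET_W // TARGET_H, height
    else:
        # Image is too tall (or already matches): keep the full width.
        crop_w, crop_h = width, width * TARGET_H // TARGET_W

    # Compress if the crop is larger than the target, otherwise stretch.
    larger = crop_w > TARGET_W and crop_h > TARGET_H
    return {
        "rotate": rotate,
        "crop": (crop_w, crop_h),
        "scale": "compress" if larger else "stretch",
        "size": (TARGET_W, TARGET_H),
    }


print(plan_preprocessing(3000, 4000))
print(plan_preprocessing(700, 500))
```

The actual pixel operations (rotation, cropping, resampling, JPEG encoding) would be delegated to an image library; only the geometry is shown here.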
  • If the input image is a PDF file with multiple pages, the document is converted into a set of JPEG images, each of which is subjected to the above processing in step (102).
  • At step (103), a reduction transformation is applied to the image preprocessed at step (102); it is performed as a fourfold transition along the Gaussian pyramid. As shown in FIG. 1B, the Gaussian pyramid transformation reduces the image resolution from that set in step (102).
  • The image pyramid is a sequence of N images, in which each subsequent image is obtained from the previous one by filtering and decimating by a factor of two according to the scheme:
  • Image filtering is necessary to suppress high-frequency noise.
  • The Gaussian function is used as the kernel h(u, v), which is why the pyramid is called Gaussian. According to Kotelnikov's theorem (the sampling theorem), compression in a Gaussian pyramid occurs with minimal loss of information.
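One level of the filter-and-decimate step can be sketched in pure Python for a grayscale image stored as a list of rows. The sketch rests on assumptions the text does not confirm: the 5-tap binomial kernel 1-4-6-4-1 (a common approximation of a Gaussian), border handling by repeating edge pixels, and the function names `blur_1d` and `pyr_down`, which are hypothetical.

```python
KERNEL = (1, 4, 6, 4, 1)  # binomial approximation of a Gaussian; weights sum to 16


def blur_1d(values):
    """Smooth one row or column, repeating the edge values at the borders."""
    n = len(values)
    out = []
    for i in range(n):
        acc = 0
        for k, weight in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)  # clamp the index to the valid range
            acc += weight * values[j]
        out.append(acc / 16)
    return out


def pyr_down(image):
    """Low-pass filter the image, then keep every second row and column."""
    rows = [blur_1d(row) for row in image]              # horizontal pass
    cols = [blur_1d(list(col)) for col in zip(*rows)]   # vertical pass
    blurred = [list(row) for row in zip(*cols)]         # transpose back to rows
    return [row[::2] for row in blurred[::2]]           # decimate by two


flat = [[10] * 8 for _ in range(6)]   # 8x6 image of constant brightness
small = pyr_down(flat)
print(len(small[0]), len(small))
```

Applying `pyr_down` four times in a row corresponds to the fourfold reduction described in the text. Filtering before decimation is what suppresses the fine detail (including text strokes) that would otherwise alias into the smaller image.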
  • Image f_N(x, y) is a miniature copy of the original image f_1(x, y).
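Written with the symbols used in this description, a standard form of the filter-and-decimate step is given below. It is a reconstruction under assumptions rather than the formula from the application: the summation range of the kernel h(u, v) is left unspecified, and the level index is assumed to run from the original image f_1 to the smallest image f_N.

```latex
f_{k+1}(x, y) = \sum_{u} \sum_{v} h(u, v)\, f_k(2x + u,\; 2y + v), \qquad k = 1, \dots, N - 1
```

The factor 2 in the arguments of f_k expresses decimation by two, and the weighted sum with h(u, v) expresses the low-pass filtering that precedes it.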
  • the following ratios are valid: 2 n- 1 y m .
  • As a result, the document image is reduced to 53x38 pixels.
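The 53x38 figure follows from four successive halvings of the 840x600 preprocessed image. A minimal sketch of that arithmetic, assuming that odd dimensions are rounded up at each level (the function name `pyramid_sizes` is hypothetical):

```python
def pyramid_sizes(width, height, levels):
    """Return the image size at each level of a downward pyramid walk."""
    sizes = [(width, height)]
    for _ in range(levels):
        # Decimation by two; odd dimensions are rounded up (assumption).
        width, height = (width + 1) // 2, (height + 1) // 2
        sizes.append((width, height))
    return sizes


for w, h in pyramid_sizes(840, 600, 4):
    print(f"{w}x{h}")
```

The sequence is 840x600, 420x300, 210x150, 105x75, 53x38, which matches the size stated in the text.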
  • The height of a line of text in documents is no more than 1/26 of the height of the image (when the text is placed horizontally relative to the landscape orientation) and no more than 1/19 of the width of the image (when the text is placed vertically relative to the landscape orientation).
  • The height of a line of text will therefore not exceed 2 pixels, so the letters of the text will be almost indistinguishable.
  • One or more morphological operations are then applied to remove any information remaining in the lines of text and to prevent partial recovery of that information.
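The basic morphological operations with a 3x3 window can be sketched in pure Python for a grayscale image stored as a list of rows. This is an illustration under assumptions, not the implementation from the application: the window is truncated at the image borders, and all function names are hypothetical.

```python
def morph3x3(image, reduce_fn):
    """Apply reduce_fn (min or max) over the 3x3 neighbourhood of each pixel."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, truncated at the image borders.
            window = [
                image[yy][xx]
                for yy in range(max(y - 1, 0), min(y + 2, height))
                for xx in range(max(x - 1, 0), min(x + 2, width))
            ]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def erode(image):
    return morph3x3(image, min)


def dilate(image):
    return morph3x3(image, max)


def opening(image):
    # Erosion followed by dilation: removes small bright details.
    return dilate(erode(image))


def closing(image):
    # Dilation followed by erosion: removes small dark details.
    return erode(dilate(image))


speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 255                     # one bright pixel on a dark background
print(opening(speck)[2][2])
```

On a 53x38 image, where a text line is only a pixel or two tall, opening or closing wipes out isolated strokes of this kind, which is the residual information the step is meant to remove.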
  • The Gaussian pyramid is then applied to enlarge the image (moving 4 levels up the pyramid) and return it to the original size (possibly with minor deviations in the final resolution).
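The minor deviations in the final resolution can be seen from the sizes alone. A minimal sketch, assuming that each upward level exactly doubles both dimensions (the function name `upscaled_size` is hypothetical):

```python
def upscaled_size(width, height, levels):
    """Size after moving `levels` steps up the pyramid (each step doubles)."""
    factor = 2 ** levels
    return width * factor, height * factor


reduced = (53, 38)                      # size after the fourfold reduction
restored = upscaled_size(*reduced, 4)
deviation = (restored[0] - 840, restored[1] - 600)
print(restored, deviation)
```

Under this assumption the enlarged image is 848x608 rather than 840x600, so a final crop or resize would be needed if an exact match with the preprocessed size is required.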
  • The resulting image with the personal data removed is stored (106), for example, in a database, or transmitted for further processing.
  • An important feature of the presented approach is that information is deleted without the possibility of its recovery.
  • A feature vector is calculated and saved, from which the type of document can be determined. This procedure relies on the fact that the feature vector characterizing the document type is known for the input document.
  • FIG. 2 shows an example of processing a primary image of a document (210) containing personal data in order to remove that data from the final image of the document (220).
  • The final document is converted into a form that excludes subsequent restoration of the original information in it, but its feature vector still allows it to be matched to a standard document type.
  • FIG. 3 shows the architecture of a system (300) suitable for implementing the claimed method. It can be built on a standard computing device (personal computer, server, server cluster, mainframe, etc.) and includes the following components: one or more processors (301), random access memory (302), data storage (303), input/output interfaces (304), input/output means (305) and networking means (306).
  • The processor (301) executes the program logic and the computational operations necessary for the operation of the system (300).
  • The processor (301) executes the necessary computer-readable instructions contained in the main memory (302).
  • The processor (301) (or multiple processors, a multi-core processor, etc.) can be selected from widely used devices from manufacturers such as Intel TM, AMD TM, Apple TM, Samsung Exynos TM, MediaTek TM, Qualcomm Snapdragon TM, etc.
  • A graphics processor, for example an NVIDIA GPU or Graphcore, is also suitable for full or partial implementation of the claimed solution, and can also be used for training and applying machine learning models.
  • Random access memory (302) is implemented as RAM and contains the program logic that provides the required functionality.
  • The data storage (303) can take the form of HDDs, SSDs, a RAID array, flash memory, optical storage devices (CD, DVD, MD, Blu-ray discs), etc. The storage (303) provides long-term storage of various types of information.
  • Interfaces (304) are standard means for connecting and operating multiple devices, such as USB, RS232, RJ45, LPT, COM, HDMI, PS/2, Lightning, FireWire, and the like. The choice of interfaces (304) depends on the specific implementation of the system (300). The following can be used as data input/output means (305): keyboard, joystick, display (touchscreen display), projector, touchpad, mouse, trackball, light pen, speakers, microphone, etc.
  • Networking means (306) are selected from devices that provide network reception and transmission of data, for example, an Ethernet card, WLAN/Wi-Fi module, Bluetooth module, BLE module, NFC module, IrDA, RFID module, GSM modem, etc.
  • The networking means (306) provide data exchange via a wired and/or wireless data transmission channel, for example, WAN, PAN, LAN, Intranet, Internet, WLAN, WMAN or GSM.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Processing (AREA)

Abstract

The present invention relates to the field of computer technology, and in particular to a method and system for anonymizing images of documents that contain personal data. The invention is implemented as a computer-implemented method for removing personal data from a document image, which is performed by a processor and comprises the following steps: obtaining a primary image of the document; preprocessing said image to generate an image of a given size and resolution; reducing the preprocessed image by a fourfold transformation using a transition along a Gaussian pyramid; applying at least one morphological operation with a 3x3 pixel window to the document image; enlarging the resulting image by a fourfold transformation using a transition along a Gaussian pyramid; and generating a final image of the document from which the personal data has been removed.
PCT/RU2019/000819 2019-11-15 2019-11-15 Method and system for anonymizing documents containing personal data WO2021096381A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2019/000819 WO2021096381A1 (fr) 2019-11-15 2019-11-15 Method and system for anonymizing documents containing personal data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2019/000819 WO2021096381A1 (fr) 2019-11-15 2019-11-15 Method and system for anonymizing documents containing personal data

Publications (1)

Publication Number Publication Date
WO2021096381A1 (fr) 2021-05-20

Family

ID=75912108

Family Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2019/000819 WO2021096381A1 (fr) 2019-11-15 2019-11-15 Method and system for anonymizing documents containing personal data

Country Status (1)

Country Link
WO (1) WO2021096381A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160330438A1 (en) * 2015-05-07 2016-11-10 Government Of The United States, As Represented By The Secretary Of The Air Force Morphological Automatic Landolt C Orientation Detection
US20170302661A1 (en) * 2016-04-17 2017-10-19 International Business Machines Corporation Anonymizing biometric data for use in a security system
CN110110769A (zh) * 2019-04-24 2019-08-09 长安大学 Image classification method based on a width radial basis function network
US20190279011A1 (en) * 2018-03-12 2019-09-12 Microsoft Technology Licensing, Llc Data anonymization using neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19952369

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.09.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19952369

Country of ref document: EP

Kind code of ref document: A1