WO2023172162A1

WO2023172162A1 - Method for protecting information when printing documents

Info

Publication number: WO2023172162A1
Application number: PCT/RU2022/000383
Authority: WO
Inventors: Михаил Артурович АНИСТРАТЕНКО; Валентин Валерьевич СЫСОЕВ; Иван Александрович ОБОЛЕНСКИЙ; Дмитрий Алексеевич БОРИСОВ; Александр Артурович АНИСТРАТЕНКО
Original assignee: Публичное Акционерное Общество "Сбербанк России"
Priority date: 2022-03-10
Filing date: 2022-12-20
Publication date: 2023-09-14

Abstract

The claimed invention relates to solutions for preventing data leakage when printing documents. Implementing the present method includes: receiving, on a user computing device, information about the printing of a digital document (101), wherein said computing device is associated with a unique identifier (UID) of the user; processing said digital document before it is sent for printing, wherein processing entails identifying letters contained in the digital document (102); encoding the user UID into a set of digital watermarks which are disposed on and/or in the vicinity of the edges of the letters of the digital document (103); sending the digital document for printing with the encoded user UID (104). The invention provides more effective data leakage prevention.

Description

METHOD FOR PROTECTING INFORMATION WHEN PRINTING DOCUMENTS

TECHNICAL FIELD

[0001] The claimed solution relates to the field of information security, in particular to solutions for preventing information leakage when printing documents.

BACKGROUND OF THE ART

[0002] Data Leak Prevention (DLP) technologies are technologies for preventing leaks of confidential information from an information system to the outside, as well as technical devices (software or firmware) for such leak prevention.

[0003] Patent application US 20080091954 A1 (Morris et al., 04/17/2008) discloses a solution for checking the integrity of data presented on printed documents. The solution is based on a unique identifier that is used to analyze the contents of the document. Each segment of a document is assigned a digit or group of digits, and each page or segment of a document can be assigned a single digit in the common identifier. The digits associated with the document are combined into an authentication string. When a request for subsequent document processing is received, authenticity and document integrity are verified by reading the submitted document to obtain an authentication string and comparing the new string with the previously stored string. On a successful match, the document is considered valid, authenticated, and unaltered.

[0004] The disadvantage of this solution is that it cannot be used to identify the employee responsible for a leak that occurs when documents are printed. A further disadvantage is insufficient document protection: the code is used only to verify authenticity, so it establishes that the document is unaltered and authentic but does not prevent information leakage.

SUMMARY OF THE INVENTION

[0005] The claimed invention is aimed at solving the technical problem of creating an effective means of protecting digital information from leakage during printing.

[0006] The technical result is an increase in the efficiency of protecting data from leakage, achieved by introducing into the document digital marks that encode a unique user identifier for subsequent identification when printed documents are analyzed.

[0007] The claimed result is achieved through a method of encoding information to protect against leaks when printing documents, performed using a processor of a computer device, wherein the method comprises the steps of: receiving, on the user's computer device, information about the printing of at least one digital document containing at least text, wherein the computer device is associated with a unique identifier (UID) of the user; processing the digital document before it is sent for printing, during which the letters contained in the digital document are recognized; encoding the user's UID into a set of digital marks that are located on the outlines of the letters and/or near the outlines of the letters of the digital document; transmitting the digital document for printing with the encoded user UID.

[0008] In one of the particular examples of implementation of the method, recognition of a digital document is performed using optical character recognition (OCR).

[0009] In another particular example of a method implementation, all characters on each page of a digital document are recognized.

[0010] In another particular example of the implementation of the method, each user UID character is encoded into binary code.

[0011] In another particular example of implementing the method, the area of placement of digital marks is determined based on the bit of the binary code.

[0012] The claimed technical result is also achieved by a method for protecting information from leaks on printed documents, performed using a processor of a computer device, the method comprising the steps of: obtaining at least part of an image of a printed document with a user UID encoded using the above method; performing recognition of the resulting image; identifying letters containing digital marks in their vicinity; determining and extracting the encoded UID.

[0013] In one of the particular examples of the method, recognition of the digital document is performed using OCR.

[0014] The claimed solution is also implemented using corresponding systems comprising a processor and memory that store machine-readable instructions for implementing each of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 illustrates a flowchart of a digital mark encoding method.

[0016] FIG. 2A - 2B illustrate examples of the placement of digital marks in a digital document.

[0017] FIG. 3 illustrates a block diagram of digital mark decoding.

[0018] FIG. 4 illustrates a diagram of the frequency of disclosure of UID positions.

[0019] FIG. 5 illustrates a general view of a computing device.

IMPLEMENTATION OF THE INVENTION

[0020] FIG. 1 presents a method (100) for protecting information in digital documents from leakage by encoding the user UID into the document in the form of digital marks. At the first stage (101), information about the printing of the digital document is obtained. The method (100) is carried out on the computer device of a user, for example, an employee, and a user UID that allows the user to be identified is associated with the device. Step (101) is carried out by software logic executed by the computer device and can be implemented, for example, as a software agent or module that provides signals from the processor indicating that a digital document is being sent for printing. A digital document is typically a file and can contain text, graphics, or a combination of both.

[0021] After receiving a command on the device to intercept and analyze the document before sending it to the printer, at step (102) recognition of the mentioned digital document is performed. Document processing is done using OCR technology to ensure recognition of letters and symbols in a digital document.

[0022] After the digital document recognition step, the UID encoding process is carried out at step (103). The UID is, for example, the numeric personnel number of an employee: a digital TAB code consisting, for example, of 8 digits. This code can be represented as an array of numbers $TAB_8 = \{n_1, n_2, \ldots, n_m\}$, $n_1, \ldots, n_m \in [0 \ldots 9]$, $m = 8$. A schematic view of the code is presented in Table 1.

Table 1. Schematic representation of the personnel number.

[0023] Each element of the personnel number is a digit from 0 to 9, so each element can be represented in binary form with a width of 4 bits, i.e. as a binary number from 1 to 1100, which is a homomorphism with a shift, presented in Table 2.

Table 2. Scheme of homomorphism of the personnel number from the decimal to the binary number system.
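The shifted digit-to-binary mapping described in paragraph [0023] can be illustrated with a short sketch. It is not the claimed implementation: the function names `encode_digit` and `encode_uid` are hypothetical, and the sketch assumes that every digit is shifted by exactly 1 (the text confirms only that 0 is mapped to 0001; the full correspondence is given by Table 2).

```python
def encode_digit(digit: int, shift: int = 1) -> str:
    """Map a decimal digit 0-9 to a 4-bit string.

    Assumption: the code is the binary form of (digit + shift) with shift = 1,
    so that 0 becomes 0001 and is never represented by "no marks at all".
    """
    if not 0 <= digit <= 9:
        raise ValueError("digit must be in the range 0..9")
    return format(digit + shift, "04b")


def encode_uid(uid: str) -> list[str]:
    """Encode every digit of a personnel number into its 4-bit code."""
    if not (uid.isascii() and uid.isdigit()):
        raise ValueError("UID must consist of decimal digits only")
    return [encode_digit(int(ch)) for ch in uid]


if __name__ == "__main__":
    # The example UID used later in the description.
    print(encode_uid("00013400"))
```

Under this assumption no digit maps to 0000, which is what allows a zero in the personnel number to be distinguished from an unmarked letter.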

[0024] Mapping 0 to 0001 is necessary in order to record the presence of a 0 in the personnel number. To encode an element of the personnel number in the binary code $TAB_8^{BIN} = \{b_1, b_2, b_3, \ldots, b_i\}$, $i = 8$, 4 digits $\{s_1, s_2, s_3, s_4\}$ are required, an example of which is presented in Table 3.

Table 3. Schematic division of a binary number into digits.

[0025] Thus, any digit can be encoded into a letter through binary encoding. An example of such a division for subsequent encoding is presented in FIG. 2A - 2B. Each recognized letter (20) is divided into 4 quarters, numbered clockwise starting from the lower left corner.

[0026] If there is a 1 in the first digit of the binary representation of a digit of the personnel number, the mark is placed in the first quarter. Similar operations are carried out with all bits of the binary representation of the digit.

[0027] The space near the letter is marked, as shown in FIG. 2A - 2B, by applying a digital mark in the form of a line (21) on the surface of the letter or a point (22) in the vicinity of the letter, in the given quarter.
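The quarter-based placement of marks described in paragraphs [0025] to [0027] can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the `Box` type and the `mark_positions` function are hypothetical, the centre of each quarter is chosen arbitrarily as the mark location, and the "first digit" of the binary representation is assumed to be the leftmost bit.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Bounding box of a recognized letter in image coordinates (y grows downward)."""
    x: int  # left edge, pixels
    y: int  # top edge, pixels
    w: int  # width, pixels
    h: int  # height, pixels


# Quarters numbered clockwise starting from the lower left corner.
# Each entry is the (horizontal, vertical) fraction of the box at which
# the centre of that quarter lies (assumed mark location).
QUARTER_CENTRES = [
    (0.25, 0.75),  # quarter 1: lower left
    (0.25, 0.25),  # quarter 2: upper left
    (0.75, 0.25),  # quarter 3: upper right
    (0.75, 0.75),  # quarter 4: lower right
]


def mark_positions(box: Box, code: str) -> list[tuple[int, int]]:
    """Return pixel coordinates of the marks for a 4-bit code.

    Bit k of the code (leftmost bit first, by assumption) controls quarter k.
    """
    if len(code) != 4 or set(code) - {"0", "1"}:
        raise ValueError("code must be a string of four binary digits")
    points = []
    for bit, (fx, fy) in zip(code, QUARTER_CENTRES):
        if bit == "1":
            points.append((box.x + round(box.w * fx), box.y + round(box.h * fy)))
    return points


if __name__ == "__main__":
    letter = Box(x=100, y=200, w=20, h=40)
    print(mark_positions(letter, "0101"))  # marks in quarters 2 and 4
```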

[0028] An example of encoding marks into letters is presented in Table 4.

Table 4. Positional coding scheme

[0029] Table 4 above means that each digit position in the personnel number can be encoded into any of 4 letters. The letters to be marked are selected page by page. Let document $D$ contain $I$ pages; then document $D$ is an array of pages, $D = \{p_1, p_2, p_3, \ldots, p_I\}$, $I \in \mathbb{N}$.

[0030] On each page $p_i$, $i \in [1, I]$, the text is read character by character and written to the character array $S_{p_i}$, where $l_{p_i}$ is the number of characters on page $p_i$; from $S_{p_i}$ the Russian letters $Wrus_{p_i} \subseteq S_{p_i}$ are extracted.

[0031] Next, 8 arrays $Pos_1, Pos_2, \ldots, Pos_8$ are created, one for each position of the personnel number. Each $Pos$ array is filled with those characters from $Wrus_{p_i}$ that correspond to its position in Table 4. For example, one of these arrays is filled with all the characters from $Wrus_{p_i}$ that have the values {a, z, p, h}, regardless of case.

[0032] The arrays $Pos_1, Pos_2, \ldots, Pos_8$ are shuffled, for example, by the Knuth shuffle. Let $l_{Pos_1}, l_{Pos_2}, l_{Pos_3}, \ldots, l_{Pos_8}$ be the sizes of the resulting arrays and $P$ be the fraction of characters that receive a mark, $P \in [0.3 \ldots 0.7]$; then each array $Pos_1, Pos_2, \ldots, Pos_8$ is trimmed from the end so that only the fraction $P$ of its elements remains.

[0033] The resulting arrays $Pos_1, Pos_2, \ldots, Pos_8$ are used to apply digital marks in the manner described above. A mark is applied by cutting the letter out using OCR, adding the mark at the required pixel coordinates, and placing the marked letter back into the document to be printed. After all the marks (21, 22) have been introduced on page $p_i$, the same is done for the next page $p_{i+1}$, and so on until the last page $p_I$ of the document.
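The page-by-page selection of letters in paragraphs [0029] to [0033] can be sketched as follows. The sketch rests on several assumptions: the letter-to-position table `LETTER_TO_POSITION` is a hypothetical stand-in for Table 4 (it simply takes the first 32 letters of the alphabet modulo 8), the trimmed size is rounded down, and the arrays store character offsets within the page rather than the characters themselves. The names `knuth_shuffle` and `select_carriers` are illustrative only.

```python
import random

ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Hypothetical stand-in for Table 4: four letters per UID position (0..7).
LETTER_TO_POSITION = {ch: i % 8 for i, ch in enumerate(ALPHABET[:32])}


def knuth_shuffle(items: list, rng: random.Random) -> None:
    """Shuffle a list in place with the Knuth (Fisher-Yates) algorithm."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        items[i], items[j] = items[j], items[i]


def select_carriers(page_text: str, p: float = 0.3, seed=None) -> list[list[int]]:
    """Return, for each of the 8 UID positions, offsets of letters to be marked.

    p is the fraction of candidate letters that receive a mark.
    """
    rng = random.Random(seed)
    pos: list[list[int]] = [[] for _ in range(8)]
    for offset, ch in enumerate(page_text):
        position = LETTER_TO_POSITION.get(ch.lower())  # case-insensitive lookup
        if position is not None:
            pos[position].append(offset)
    for bucket in pos:
        knuth_shuffle(bucket, rng)
        del bucket[int(len(bucket) * p):]  # trim from the end (rounded down)
    return pos


if __name__ == "__main__":
    carriers = select_carriers("Пример текста для печати", p=0.5, seed=42)
    for number, bucket in enumerate(carriers, start=1):
        print(number, sorted(bucket))
```

Shuffling before trimming spreads the marked letters over the page instead of concentrating them at its beginning.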

[0034] Table 5 shows an example of label encoding for the user UID - 00013400.

Table 5. Example of encoding digital marks in the vicinity of letters.

[0035] After digital marks encoding the UID are added to the document sent for printing, at step (104) it is sent for printing. The printed document will contain an encoded UID that is indistinguishable to the human eye. The size of digital marks can be chosen arbitrarily (for example, marks with a radius of 1-2 pixels).
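As an illustration of a mark with a radius of 1-2 pixels, the following sketch darkens a small disc in a grayscale bitmap. The bitmap representation (a list of rows of 0-255 values) and the function name `stamp_dot` are assumptions made for the example; the description does not prescribe an image format or drawing routine.

```python
def stamp_dot(bitmap: list[list[int]], cx: int, cy: int,
              radius: int = 1, ink: int = 0) -> None:
    """Set all pixels within `radius` of (cx, cy) to `ink`, in place.

    Pixels outside the bitmap are skipped, so marks near the edge are clipped.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                bitmap[y][x] = ink


if __name__ == "__main__":
    page = [[255] * 7 for _ in range(7)]  # blank white 7x7 fragment
    stamp_dot(page, cx=3, cy=3, radius=1)
    for row in page:
        print("".join("#" if value == 0 else "." for value in row))
```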

[0036] FIG. 3 shows the sequence of steps of the method (300) for recognizing a UID on printed documents. At step (301), the computing device used to determine the UID in the printed document obtains an image of that document. The image may contain all or part of the text with the encoded UID and may be obtained, for example, by photographing with an external device (smartphone, camera, etc.) or by scanning the printed document using OCR.

[0037] Next, at step (302), letters in the document are recognized, also using OCR technology; if the document contains several pages, each page is recognized. At step (303), the digital marks in the vicinity of the recognized letters are read. The marks can be analyzed, for example, according to Table 5, which can serve as a lookup table matching marks to the corresponding digit of the user UID.

[0038] After this, the UID is decoded at step (304) and used to determine the personnel number of the employee and the corresponding user from whose computer device the document was printed.
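Steps (303) and (304) can be sketched as follows, assuming that the marks read near each letter have already been converted to 4-bit strings. The sketch is illustrative only: the function names are hypothetical, the shift of 1 is the same assumption as for encoding, and combining several readings of one position with a bitwise OR assumes that marks may be lost but are never spuriously detected.

```python
def merge_readings(readings: list[str]) -> str:
    """Combine 4-bit readings from several letters of one UID position.

    Assumption: a mark can be lost but not invented, so the union (bitwise OR)
    of all readings is the best estimate of the original code.
    """
    merged = 0
    for reading in readings:
        if len(reading) != 4 or set(reading) - {"0", "1"}:
            raise ValueError("each reading must be four binary digits")
        merged |= int(reading, 2)
    return format(merged, "04b")


def decode_digit(code: str, shift: int = 1) -> int:
    """Invert the shifted binary mapping; 0000 means the position is undisclosed."""
    value = int(code, 2) - shift
    if not 0 <= value <= 9:
        raise ValueError(f"code {code} does not correspond to a digit")
    return value


def decode_uid(readings_per_position: list[list[str]]) -> str:
    """Decode a UID from the readings collected for each of its positions."""
    return "".join(
        str(decode_digit(merge_readings(readings)))
        for readings in readings_per_position
    )


if __name__ == "__main__":
    observed = [["0001"], ["0001", "0000"], ["0001"], ["0010"],
                ["0100"], ["0100", "0001"], ["0001"], ["0001"]]
    print(decode_uid(observed))
```

Because 0 is encoded as 0001, a position for which no mark was found produces the code 0000 and is reported as an error instead of being silently decoded as a digit.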

[0039] Mathematical justification of the method

[0040] To this end, we verify that the frequencies of disclosure of the positions $TAB_8 = \{n_1, n_2, \ldots, n_m\}$, $m = 8$, are uniformly distributed over all $m$ positions, which makes it possible to show the probability of extracting a personnel number (UID) from the text of a page.

[0041] For the mathematical justification, the frequency of occurrence of letters was studied in texts of varied content; as an example, consider the distribution typical of literary works. The literary works used in the experiment were: The Silmarillion, J.R.R. Tolkien; Twenty Thousand Leagues Under the Sea, Jules G. Verne; Twenty Years Later, Alexandre Dumas; Three Musketeers, Alexandre Dumas; Gone with the Wind, Margaret Mitchell; Ivanhoe, Walter Scott; Hero of Our Time, N.V. Gogol; War and Peace, L.N. Tolstoy; Inhabited Island, Boris and Arkady Strugatsky; Crime and Punishment, F.M. Dostoevsky; The Living and the Dead, K.M. Simonov; in total 8,366,594 characters, 3919 pages. The analysis gave the following frequencies of occurrence of the letters of the Russian alphabet in these texts (Table 6).

Table 6. Table of the frequency of occurrence of letters of the Russian alphabet in fiction

[0042] To obtain the frequency of disclosure of the positions $TAB_8 = \{n_1, n_2, \ldots, n_m\}$, the following actions are performed. From Tables 4 and 5, the letters into which the digits are encoded are known. To obtain the frequency of bit disclosure for the algorithm that applies a mark in the space near a letter, the frequencies of the letters into which the marks are encoded are added up, because a position is disclosed when a mark is detected in at least one of them. As a result of the above steps, Table 7 is obtained.

Table 7. Table of frequency of disclosure of personnel number positions.

Column headings: $n_i$; Frequency of occurrence of letters; Frequency of disclosure of the digit.
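The summation that produces the per-position frequencies can be sketched as follows. The letter-to-position table is again a hypothetical stand-in for Table 4, and the frequencies are computed from whatever sample text is supplied rather than taken from Table 6; letters outside the hypothetical table are ignored. The function names are illustrative.

```python
from collections import Counter

ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Hypothetical stand-in for Table 4: four letters per UID position (0..7).
LETTER_TO_POSITION = {ch: i % 8 for i, ch in enumerate(ALPHABET[:32])}


def letter_frequencies(text: str) -> dict[str, float]:
    """Relative frequency of each mapped letter in the text (case-insensitive)."""
    letters = [ch for ch in text.lower() if ch in LETTER_TO_POSITION]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: count / total for ch, count in Counter(letters).items()}


def position_frequencies(frequencies: dict[str, float]) -> list[float]:
    """Sum the frequencies of the letters assigned to each UID position."""
    totals = [0.0] * 8
    for ch, frequency in frequencies.items():
        totals[LETTER_TO_POSITION[ch]] += frequency
    return totals


if __name__ == "__main__":
    sample = "Мороз и солнце; день чудесный!"
    for number, value in enumerate(position_frequencies(letter_frequencies(sample)), 1):
        print(f"position {number}: {value:.3f}")
```

A roughly flat result across the eight positions on a large corpus is what the uniformity argument in this section relies on.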

[0043] Based on Table 7, the diagram shown in FIG. 4 was constructed. The diagram shows that the frequency of disclosure is distributed relatively evenly across all positions.

[0044] Let us calculate the number of occurrences of each letter of the Russian alphabet in the experimental sample:

Table 8. Letter-positional quantitative characteristics of the experimental sample.

[0045] For the method of placing a dot in the space near the letter, the following assumptions are made: the fraction of characters that receive a mark is $P = 0.3$; when the image is transmitted through messengers, a fraction $M = 0.7$ of the marks is lost. On this basis, the probability of recognizing the text can be calculated when the following is available for decoding: a whole page; 1/2 page; 1/4 page.

Table 9. Explanations and probabilities of recognizing text encoded by placing a dot in the space near the letter
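One way to estimate such probabilities from $P$ and $M$ is sketched below. This is a simplified model and not the calculation behind Table 9: it assumes that the number of marked letters for a position is the candidate count multiplied by $P$ and rounded down, that each marked letter is lost independently with probability $M$, that a position is disclosed when at least one marked letter survives, and that the 8 positions are independent. The function names are hypothetical.

```python
from math import prod


def position_disclosure_probability(n_letters: int, p: float = 0.3,
                                    m: float = 0.7) -> float:
    """Probability that at least one marked letter of a position survives.

    n_letters: candidate letters for this position in the available fragment.
    p: fraction of candidates that receive a mark.
    m: fraction of marks lost in transmission.
    """
    marked = int(n_letters * p)  # rounded down (assumption)
    return 1.0 - m ** marked


def uid_recovery_probability(letters_per_position: list[int], p: float = 0.3,
                             m: float = 0.7) -> float:
    """Probability that all UID positions are disclosed (independence assumed)."""
    return prod(position_disclosure_probability(n, p, m)
                for n in letters_per_position)


if __name__ == "__main__":
    # Hypothetical candidate counts for a whole page, a half and a quarter.
    for label, count in (("whole page", 200), ("1/2 page", 100), ("1/4 page", 50)):
        print(label, uid_recovery_probability([count] * 8))
```

The candidate counts in the usage example are placeholders; in practice they would come from the letter statistics of the page fragment being decoded.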

[0046] Experimental Application Example.

[0047] During testing, about 500 pages of varied content were printed and analyzed: text, sparse text, text with tables, text with graphs, and text with formulas; in different fonts: Arial, Calibri, Times New Roman; in different text styles: regular, italic, bold, underlined; in different sizes: 12px, 14px; with different line spacing: 0.5, 1.15, 1.5; and with different character spacing: regular, sparse, compacted.

[0048] In each case, extraction of the mark was considered from: the printout directly; a photograph of the printout; a photograph of the printout sent via messenger.

[0049] Printing was carried out on a Lexmark MX711de office black-and-white laser printer on “Snow Maiden” office paper with CIE 146 whiteness according to ISO 11475.

[0050] Photographs were taken with a Samsung A51 phone under office lighting, with the paper lying horizontally on a table, at random slight angles of about 2-4% in 3 dimensions.

[0051] The photographs were transferred using the Telegram messenger, with image compression applied when sending.

[0052] During the experiment, parameters were selected, such as the size of the marks, their optimal locations and application methods. The results of the last phase of the experiment are shown in Table 10.

Table 10. Experiment result.

[0053] Table 10 shows good results for the analysis of photographs, sent via messenger, of printouts made on an office black-and-white printer. As a result of the experiment, optimal parameters were selected for introducing marks that, on the one hand, would appear on printouts as printer defects and, on the other hand, could be easily extracted from photographs sent via instant messengers.

[0054] FIG. 5 shows a general view of a computing device (500) suitable for performing the above methods. The device (500) may be, for example, a computer, a server, or another type of suitable computing device.

[0055] In general, the computing device (500) contains, connected by a common information exchange bus, one or more processors (501), memory devices such as RAM (502) and ROM (503), input/output interfaces (504), input/output devices (505), and a device for network communication (506).

[0056] The processor (501) (or multiple processors, or a multi-core processor) may be selected from a variety of devices commonly used today, such as those from Intel™, AMD™, Apple™, Samsung Exynos™, MediaTek™, Qualcomm Snapdragon™, etc. A graphics processor, for example, from Nvidia, AMD, Graphcore, etc., can also be used as the processor (501).

[0057] RAM (502) is a random access memory and is designed to store machine-readable instructions executable by the processor (501) to perform the necessary logical data processing operations. RAM (502) typically contains executable operating system instructions and related software components (applications, program modules, etc.).

[0058] ROM (503) is one or more permanent storage devices, such as a hard disk drive (HDD), a solid state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), etc.

[0059] Various types of I/O interfaces (504) are used to organize the operation of the components of the device (500) and of externally connected devices. The choice of appropriate interfaces depends on the specific implementation of the computing device; they may be, but are not limited to: PCI, AGP, PS/2, IrDA, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, DisplayPort, RJ45, RS232, etc.

[0060] To ensure user interaction with the computing device (500), various information I/O means (505) are used, for example, a keyboard, a display (monitor), a touch display, a touch pad, a joystick, a mouse, a light pen, a stylus, a touch panel, a trackball, speakers, a microphone, augmented reality tools, optical sensors, a tablet, light indicators, a projector, a camera, biometric identification tools (retina scanner, fingerprint scanner, voice recognition module), etc.

[0061] The network communication means (506) enables the device (500) to transmit data via an internal or external computer network, such as an intranet, the Internet, a LAN, or the like. One or more means (506) may be used, including, but not limited to: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, etc.

[0062] Additionally, satellite navigation tools can also be used as part of the device (500), for example, GPS, GLONASS, BeiDou, Galileo.

[0063] The submitted application materials disclose preferred examples of implementation of a technical solution and should not be interpreted as limiting other, particular examples of its implementation that do not go beyond the scope of the requested legal protection, which are obvious to specialists in the relevant field of technology.

Claims

FORMULA

1. A method for encoding information to protect against leakage when printing documents, performed using a processor of a computer device, wherein the method comprises the steps of: obtaining, on the user's computer device, information on the printing of at least one digital document containing at least text, wherein the computer device is associated with a unique identifier (UID) of the user; processing the digital document before it is sent for printing, during which the letters contained in the digital document are recognized; encoding the user's UID into a set of digital marks that are located on the outlines of the letters and/or near the outlines of the letters of the digital document; transmitting the digital document for printing with the encoded user UID.

2. The method according to claim 1, characterized in that recognition of the digital document is performed using optical character recognition (OCR).

3. The method according to claim 2, characterized in that all characters on each page of the digital document are recognized.

4. The method according to claim 1, characterized in that each user UID character is encoded into binary code.

5. The method according to claim 4, characterized in that based on the bit of the binary code, the area where the digital marks are placed is determined.

6. A method for protecting information from leaks on printed documents, performed using a processor of a computer device, the method comprising the steps of: obtaining at least part of an image of a printed document with a user UID encoded using the method according to any one of claims 1-5; performing recognition of the resulting image; identifying letters containing digital marks in their vicinity; determining and extracting the encoded UID.

7. The method according to claim 6, characterized in that recognition of a digital document is performed using OCR.

8. An information encoding system to protect against leaks when printing documents, comprising at least one processor and at least one memory associated with the processor and containing machine-readable instructions which, when executed by the processor, implement the method according to any one of claims 1-5.

9. A system for protecting information from leaks on printed documents, comprising at least one processor and at least one memory associated with the processor and containing machine-readable instructions which, when executed by the processor, implement the method according to any one of claims 6-7.