CN113936764A

CN113936764A - Method and system for desensitizing sensitive information in medical report sheet photo

Info

Publication number: CN113936764A
Application number: CN202111265566.3A
Authority: CN
Inventors: 王珏
Original assignee: Suzhou Liangyihui Network Technology Co ltd
Current assignee: Suzhou Liangyihui Network Technology Co ltd
Priority date: 2021-10-28
Filing date: 2021-10-28
Publication date: 2022-01-14

Abstract

The invention discloses a method and a system for desensitizing sensitive information in a medical report photo. The method comprises the following steps: S1, constructing a sensitive information range based on prior knowledge; S2, reading the current medical report photo, detecting the text boxes in it, and recognizing their text content; S3, detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with named entity recognition based on a sequence tagging network; S4, locating the coordinates of each piece of identified sensitive information; and S5, masking (coding) the positions of the sensitive information according to the located coordinates. The method and the system can accurately identify and locate sensitive information and finally mask it, and have the advantages of high feasibility and high accuracy.

Description

Method and system for desensitizing sensitive information in medical report sheet photo

Technical Field

The invention relates to the technical field of image desensitization, in particular to a method and a system for desensitizing sensitive information in a medical report photo.

Background

In the big data era, artificial intelligence is developing rapidly, and combining the Internet with medical care has become a new direction for the medical and health industry in China. Information sharing brings convenience to daily life, but it also challenges the security of personal sensitive information. With the wide application of medical information systems, medical reports have become very common documents, and they often contain important information such as the patient's name, patient number, clinic number, contact details and address, as well as hospital and doctor names. In many fields such as smart healthcare, online consultation and medical research, a medical report is often photographed with a device such as a mobile phone and the resulting image is uploaded. To keep the user's information secure, it is important to "desensitize" the sensitive information in such images before they are used.

Image desensitization refers to shielding sensitive information by image blurring, watermarking, mosaicking and similar means, so that the sensitive information in an image is protected to a certain extent. Existing image desensitization techniques mostly rely on template matching and template covering: a replacement template prepared in advance is placed at the image position that needs desensitization. Preparing the matching template, however, increases image pre-processing time and makes image desensitization inefficient.

For image desensitization, text data detection and identification in an image are an important link for desensitization of sensitive information, and refer to a process of analyzing, identifying and processing the image to obtain character and layout information. Text detection can detect the position of a text box, but cannot output the position of each character, which causes great difficulty in positioning subsequent sensitive information.

Desensitization methods for medical texts fall mainly into three groups: methods based on rules and dictionaries, methods based on machine learning, and combinations of the two. Rule- and dictionary-based methods are fast and effective when a large annotated corpus is lacking, but their precision and recall depend heavily on the quality of the dictionary. Machine learning algorithms generally outperform rule-based algorithms, but large annotated corpora are very difficult to obtain in the medical field, and some private data appear only rarely in medical texts; this data sparsity degrades the machine learning result.

At present, technologies exist for image desensitization and for medical text desensitization, but research on desensitizing photos of medical reports is lacking. The main reason is that medical reports come in many different layouts. In a scene without a fixed layout, how to detect and locate the information to be desensitized by associating the detected image text regions with the recognized text is the problem that those skilled in the art need to solve.

Disclosure of Invention

The invention aims to provide a sensitive information desensitization method in a medical report photo, which is high in feasibility and accuracy.

In order to solve the above problems, the present invention provides a method for desensitizing sensitive information in a medical report photo, which comprises the following steps:

S1, constructing a sensitive information range based on prior knowledge;

S2, reading the current medical report photo, detecting the text boxes in it, and recognizing their text content;

S3, detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with named entity recognition based on a sequence tagging network;

S4, locating the coordinates of each piece of identified sensitive information;

and S5, masking (coding) the positions of the sensitive information according to the located coordinates.

As a further improvement of the present invention, the step S1 includes:

S11, sorting and summarizing existing medical report photos to construct a sensitive information matching keyword list;

S12, collecting and sorting a list of hospital names, and splitting and combining them to obtain a hospital name keyword list.

As a further improvement of the present invention, the sensitive information in the sensitive information matching keyword list includes: name, patient number, patient ID, pathology number, medical record number, case number.

As a further improvement of the present invention, in step S2, detecting the text boxes in the current medical report photo and recognizing the text content includes: detecting the text boxes in the current medical report photo and recognizing their text content with an OCR (optical character recognition) model, to obtain the coordinates of each text box, its text content and the recognition confidence.

As a further improvement of the present invention, step S3 includes:

S31, traversing the text boxes, segmenting the recognized text of each text box with THULAC, tagging the parts of speech, and returning the part-of-speech tagging result nes;

S32, traversing the part-of-speech tagging result nes, and if the current item ne_text is not in the stored sensitive information list ne_list, making a further judgment according to its part of speech;

S33, performing secondary processing for keyword matching: when the keyword list keyword_list is searched, a matched keyword match_word and its corresponding information ne_text may have been detected in two different text boxes, in which case the pair cannot be desensitized together; therefore, if the length of text_str does not exceed the threshold skip_keyword_len, traversing the keyword list keyword_list and searching for the sensitive keyword match_word;

S34, if the character string text_str of the current text box contains the word "hospital", traversing the hospital list hospital_list and checking whether a hospital name occurs in text_str, and if so, adding the hospital name to the sensitive information list ne_list;

S35, detecting an email address email in the character string text_str, and if email is not empty, adding the email address to the sensitive information list ne_list;

S36, after steps S34 and S35 the sensitive information list ne_list may contain duplicates, so deduplicating it before it is traversed in the next step;

S37, traversing each piece of sensitive information in the sensitive information list ne_list, searching for its index in the character string text_str of the current text box, and returning the length of text_str and ne_index.
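Steps S34 to S37 above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the regular expression for email addresses, the function names `collect_extra_entities` and `locate_entities`, and the tuple layout of `ne_index` are assumptions made for the example; only `text_str`, `ne_list`, `hospital_list` and `ne_index` come from the text.

```python
import re

# Simplified e-mail pattern (an assumption; the text does not give one).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def collect_extra_entities(text_str, ne_list, hospital_list):
    """S34-S36: add hospital names and e-mail addresses, then deduplicate."""
    found = list(ne_list)
    # S34: scan hospital_list only when the word "hospital" (医院) occurs.
    if "医院" in text_str:
        found += [name for name in hospital_list if name in text_str]
    # S35: add every e-mail address found in the string.
    found += EMAIL_RE.findall(text_str)
    # S36: order-preserving deduplication.
    return list(dict.fromkeys(found))


def locate_entities(text_str, ne_list):
    """S37: return len(text_str) and (index, length, content) per occurrence."""
    ne_index = []
    for ne_text in ne_list:
        if not ne_text:
            continue  # nothing to locate for an empty string
        start = text_str.find(ne_text)
        while start != -1:  # an entity may occur several times
            ne_index.append((start, len(ne_text), ne_text))
            start = text_str.find(ne_text, start + 1)
    return len(text_str), ne_index
```

Using `dict.fromkeys` keeps the first occurrence of each entry and preserves insertion order, which keeps the later traversal deterministic.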

As a further improvement of the present invention, the step S32 of performing further determination based on the part of speech includes:

S321, if the part of speech of ne_text is a person name and the length of ne_text does not exceed the threshold skip_np_len, adding the current word ne_text to the sensitive information list ne_list;

S322, if the part of speech of ne_text is a place name, adding the current word ne_text to the sensitive information list ne_list;

S323, if the part of speech of ne_text is a numeral, further judging whether ne_text is a mobile phone number or an identity card number.
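The part-of-speech dispatch of steps S321 to S323 might look like the sketch below. It assumes that `nes` is a list of `(word, tag)` pairs and that the tags `np`, `ns` and `m` denote person names, place names and numerals, as in the THULAC tag set; the function name `filter_by_pos` and the `classify_number` callback are hypothetical, and the default of 10 for `skip_np_len` is the value given in the detailed description.

```python
def filter_by_pos(nes, ne_list, classify_number, skip_np_len=10):
    """S32: extend ne_list with words selected by their part-of-speech tag.

    nes             -- list of (ne_text, tag) pairs from the segmenter
    classify_number -- callback deciding whether a numeral is sensitive
    """
    for ne_text, tag in nes:
        if ne_text in ne_list:
            continue  # already stored, nothing more to decide
        if tag == "np" and len(ne_text) <= skip_np_len:
            ne_list.append(ne_text)  # S321: person name of plausible length
        elif tag == "ns":
            ne_list.append(ne_text)  # S322: place name
        elif tag == "m" and classify_number(ne_text):
            ne_list.append(ne_text)  # S323: numeral judged sensitive
    return ne_list
```

Passing the numeral check in as a callback keeps the tag dispatch separate from the number rules of step S323.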

As a further improvement of the present invention, step S323 includes:

S3231, if ne_text is 11 digits long and its first digit is 1, ne_text is considered to be a mobile phone number, and the current ne_text is added to the sensitive information list ne_list;

S3232, if ne_text is 18 characters long and conforms to the identity card number check rule, ne_text is considered to be an identity card number, and the current word ne_text is added to the sensitive information list ne_list;

S3233, if the length of the current number ne_text is greater than or equal to the threshold match_m_len, the matching keyword list keyword_list is traversed, and if a related keyword match_word is found in the character string text_str of the current text box, both the matched keyword match_word and the current word ne_text are added to the sensitive information list ne_list.
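A possible reading of steps S3231 to S3233 is sketched below. The text only says that the number must conform to "an identity card number check rule"; the sketch assumes this means the weighted mod-11 checksum of the Chinese resident identity number (GB 11643-1999). The function names are hypothetical, and the default of 5 for `match_m_len` is the value given in the detailed description.

```python
# Weights and check characters of the 18-character resident ID checksum
# (assumed to be the "check rule" the text refers to).
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CHARS = "10X98765432"


def is_mobile_number(ne_text):
    """S3231: 11 digits, first digit 1."""
    return (len(ne_text) == 11 and ne_text.isascii()
            and ne_text.isdigit() and ne_text[0] == "1")


def is_id_card_number(ne_text):
    """S3232: 18 characters whose last character matches the checksum."""
    if len(ne_text) != 18:
        return False
    body, check = ne_text[:17], ne_text[17].upper()
    if not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(d) * w for d, w in zip(body, ID_WEIGHTS))
    return ID_CHECK_CHARS[total % 11] == check


def classify_number(ne_text, text_str, keyword_list, match_m_len=5):
    """Return the strings that step S323 would add to ne_list."""
    if is_mobile_number(ne_text) or is_id_card_number(ne_text):
        return [ne_text]
    if len(ne_text) >= match_m_len:
        # S3233: a long number is sensitive only next to a known keyword.
        for match_word in keyword_list:
            if match_word in text_str:
                return [match_word, ne_text]
    return []
```

For example, a pathology number that follows its label in the same text box is returned together with the label, while a short number such as a dosage is ignored.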

As a further improvement of the present invention, step S4 includes: for the sensitive information recognition result, establishing a rectangular coordinate system in the medical report photo with the pixel as the unit, dynamically allocating the widths of Chinese and English characters according to the length of the text box, and calculating the coordinates of each piece of sensitive information from straight-line equations.

As a further improvement of the present invention, step S5 includes: traversing the coordinates of each piece of sensitive information, acquiring the RGBA values of the pixels at its four coordinates, sorting them, and taking the middle color value as the most probable background color; then framing the rectangular boundary that encloses the four coordinates, examining each pixel within that boundary, and, if the pixel lies in the region where the sensitive information is located, changing its value to the background color, thereby achieving masking (coding) desensitization.

In order to solve the above problem, the present invention also provides a system for desensitizing sensitive information in medical report photographs, which includes the following modules:

the sensitive information construction module is used for constructing a sensitive information range based on the priori knowledge;

the text detection and identification module is used for reading the current medical report photo, detecting a text box in the current medical report photo and identifying text content;

the sensitive information detection module is used for detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with named entity recognition based on a sequence tagging network;

the sensitive information positioning module is used for positioning the coordinates of each piece of identified sensitive information;

and the sensitive information desensitization module is used for masking (coding) the positions of the sensitive information according to the located coordinates.

The invention has the beneficial effects that:

The method and the system for desensitizing sensitive information in a medical report photo can accurately identify and locate the sensitive information and finally mask it, and have the advantages of high feasibility and high accuracy.

The foregoing is only an overview of the technical solutions of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

Drawings

FIG. 1 is a flow chart of a method for desensitization based on sensitive information in medical report form photographs in a preferred embodiment of the present invention;

FIG. 2 is a diagram of sensitive information location in a preferred embodiment of the present invention;

FIG. 3 is a (partial) diagram of keyword_list in a preferred embodiment of the invention;

FIG. 4 is a sample illustration of desensitization results obtained in a preferred embodiment of the invention;

FIG. 5 is a schematic diagram of a system for desensitization based on sensitive information in medical report form photographs in a preferred embodiment of the invention.

Detailed Description

The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.

As shown in FIG. 1, the method for desensitizing based on sensitive information in medical report photo in the preferred embodiment of the present invention comprises the following steps:

S1, constructing a sensitive information range based on prior knowledge; this comprises the following steps:

S11, sorting and summarizing existing medical report photos to construct a sensitive information matching keyword list; the sensitive information in this list includes: name, patient number, patient ID, pathology number, medical record number, case number, etc.

S12, collecting and sorting a list of hospital names, and splitting and combining them to obtain a hospital name keyword list; the hospital name keywords include: People's Hospital, Provincial People's Hospital, Citizen Hospital, Xinhua Hospital, First Affiliated Hospital, etc.

Optionally, the text boxes in the current medical report photo are detected and their text content is recognized with an OCR model, so as to obtain the coordinates of each text box, its text content and the recognition confidence.

Specifically, a medical report photo img is input, an OCR (optical character recognition) model is loaded, and the following model parameters are set: use of the GPU is enabled; lang = ch sets the recognition language to Chinese; use_space_char = True enables recognition of spaces; use_angle_cls = True loads the classification model; and classification is enabled (set to True). The text boxes in the image are then detected and recognized, and a recognition result is returned that comprises the text box coordinates, the text content and the recognition confidence.
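To make the shape of the recognition result concrete, the sketch below unpacks one OCR result into text boxes. The text does not specify the result layout; the nested form `[corners, (text, confidence)]` used here is an assumption modelled on common OCR toolkits, and the names `TextBox` and `unpack_ocr_result` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class TextBox(NamedTuple):
    corners: List[Tuple[float, float]]  # four (x, y) pixel coordinates
    text_str: str                       # recognized text content
    confidence: float                   # recognition confidence


def unpack_ocr_result(result):
    """Turn [[corners, (text, confidence)], ...] into TextBox records."""
    boxes = []
    for corners, (text_str, confidence) in result:
        points = [(float(x), float(y)) for x, y in corners]
        boxes.append(TextBox(points, text_str, float(confidence)))
    return boxes
```

Each `TextBox` then carries everything the later steps need: `text_str` for sensitive information detection and `corners` for positioning.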

S3, detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with named entity recognition based on a sequence tagging network; this comprises the following steps:

S31, traversing the text boxes, using THULAC (THU Lexical Analyzer for Chinese) to segment the recognized text of each text box, tagging the parts of speech, and returning the part-of-speech tagging result nes;

S32, traversing the part-of-speech tagging result nes, and if the current item ne_text is not in the stored sensitive information list ne_list (initially empty), making a further judgment according to its part of speech; this comprises the following steps:

S321, if the part of speech of ne_text is a person name and the length of ne_text does not exceed the threshold skip_np_len (default value 10), adding the current word ne_text to the sensitive information list ne_list;

Specifically, step S323 includes:

S3233, if the length of the current number ne_text is greater than or equal to the threshold match_m_len (default value 5), traversing the matching keyword list keyword_list, and if a related keyword match_word is found in the character string text_str of the current text box, adding both the matched keyword match_word and the current word ne_text to the sensitive information list ne_list.

S33, performing secondary processing for keyword matching: when the keyword list keyword_list is searched, a matched keyword match_word and its corresponding information ne_text may have been detected in two different text boxes, in which case the pair cannot be desensitized together; therefore, if the length of text_str does not exceed the threshold skip_keyword_len (default value 10), traversing the keyword list keyword_list and searching for the sensitive keyword match_word. This step desensitizes only the matched keyword match_word; even if the corresponding information ne_text is missed, the result can still be regarded as effectively desensitized, because the information no longer has a keyword match_word to identify it.

S35, detecting an email address email in the character string text_str, and if email is not empty, adding the email address to the sensitive information list ne_list;

S37, traversing each piece of sensitive information in the sensitive information list ne_list, searching for its index (there may be several) in the character string text_str of the current text box, and returning the length of text_str and ne_index (which includes the index of each piece of sensitive information in text_str, its length and its content).
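The secondary keyword pass of step S33 can be illustrated with a short sketch. It assumes that a short text box holding only a label (for example a pathology number label whose value was recognized in a separate box) should have the label itself added to `ne_list`; the function name `match_keywords_in_short_box` is hypothetical, and the default of 10 for `skip_keyword_len` is the value stated in S33.

```python
def match_keywords_in_short_box(text_str, keyword_list, ne_list,
                                skip_keyword_len=10):
    """S33: in a short text box, mark any matching keyword as sensitive."""
    if len(text_str) > skip_keyword_len:
        return ne_list  # long strings are handled by the other steps
    for match_word in keyword_list:
        if match_word in text_str and match_word not in ne_list:
            ne_list.append(match_word)
    return ne_list
```

Masking the label alone removes the context that gives the neighbouring number its meaning, which is the rationale stated in S33.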

Optionally, for the sensitive information recognition result, a rectangular coordinate system is established in the medical report photo with the pixel as the unit, the widths of Chinese and English characters are dynamically allocated according to the length of the text box, and the coordinates of each piece of sensitive information are calculated from straight-line equations.

Specifically, referring to fig. 2, the positioning steps are as follows:

S41, obtain the coordinates (x₁, y₁), (x₂, y₂), (x₃, y₃), (x₄, y₄) of the text box text_str;

S42, take the smaller of x₁ and x₄ to obtain the straight line f₁, and the larger of x₂ and x₃ to obtain the straight line f₂; obtain the straight line f₃ from (x₁, y₁) and (x₂, y₂), and the straight line f₄ from (x₃, y₃) and (x₄, y₄);

S43, divide the width between the straight lines f₁ and f₂ into blocks according to the length of text_str, counting a Chinese character as width 2 and any other character (digits, English letters, etc.) as width 1, so that the abscissa of each character can be calculated;

S44, substitute the abscissas obtained in S43 into the straight lines f₃ and f₄ to calculate the upper and lower ordinates of each character;

S45, based on the above steps and the ne_index obtained by sensitive information detection, calculate the abscissa of the left boundary of the sensitive information from its index and the width of each block; from the left boundary abscissa, the length of the sensitive information and the number of Chinese characters it contains, calculate the abscissa of its right boundary; substitute the left and right boundary abscissas into the straight lines f₃ and f₄ to obtain the corresponding upper and lower ordinates; and finally obtain the coordinates nes_location of all sensitive information.
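Steps S41 to S45 can be sketched as below. The sketch assumes that the four corners are ordered top-left, top-right, bottom-right, bottom-left (consistent with how f₁ to f₄ are formed), that "Chinese character" means a code point in the CJK Unified Ideographs block U+4E00 to U+9FFF, and that `ne_index` holds `(index, length, content)` tuples; the function names are hypothetical.

```python
def char_width(ch):
    """S43: width 2 for a Chinese character, 1 for anything else."""
    return 2 if "\u4e00" <= ch <= "\u9fff" else 1


def line_through(p, q):
    """Return y = f(x) for the straight line through points p and q."""
    (xp, yp), (xq, yq) = p, q
    if xq == xp:  # degenerate edge: fall back to a horizontal line
        return lambda x: float(yp)
    slope = (yq - yp) / (xq - xp)
    return lambda x: yp + slope * (x - xp)


def locate_sensitive(corners, text_str, ne_index):
    """S41-S45: map each (index, length, content) to four pixel corners."""
    p1, p2, p3, p4 = corners            # S41
    x_left = min(p1[0], p4[0])          # S42: line f1
    x_right = max(p2[0], p3[0])         # S42: line f2
    f3 = line_through(p1, p2)           # S42: upper edge
    f4 = line_through(p4, p3)           # S42: lower edge
    units = sum(char_width(ch) for ch in text_str)
    if units == 0:
        return []
    block = (x_right - x_left) / units  # S43: width of one block
    nes_location = []
    for start, length, _content in ne_index:
        before = sum(char_width(ch) for ch in text_str[:start])
        inside = sum(char_width(ch) for ch in text_str[start:start + length])
        left = x_left + block * before  # S45: left boundary abscissa
        right = left + block * inside   # S45: right boundary abscissa
        nes_location.append([(left, f3(left)), (right, f3(right)),
                             (right, f4(right)), (left, f4(left))])
    return nes_location
```

Because the upper and lower edges are evaluated separately, a slightly rotated or skewed text box still yields a quadrilateral that follows the text line.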

Optionally, the coordinates of each piece of sensitive information are traversed, the RGBA values of the pixels at its four coordinates are acquired and sorted, and the middle color value is taken as the most probable background color; the rectangular boundary that encloses the four coordinates is then framed, each pixel within that boundary is examined, and, if the pixel lies in the region where the sensitive information is located, its value is changed to the background color, thereby achieving masking (coding) desensitization.

Specifically, the method comprises the following steps:

S51, reading the EXIF information of the medical report photo, and if it contains orientation information, rotating the image accordingly;

S52, reading and recording the image mode img_mode, and if the image contains sensitive information, converting the image into RGB mode;

S53, traversing the coordinates of each piece of sensitive information according to the sensitive information coordinates nes_location obtained by the sensitive information positioning module, acquiring the RGBA values of the pixels at its four coordinates, sorting them, and taking the middle color value as the most probable background color; because the quadrilateral formed by the four coordinates may not be a rectangle, the rectangular boundary that encloses the four coordinates is framed first to simplify masking; each pixel within the rectangular boundary is then examined, and if it lies in the region where the sensitive information is located, its value is changed to the background color, thereby achieving the masking (coding) desensitization effect;

and S54, restoring the desensitized image to the original image mode img_mode, and finally saving the desensitized image.
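The core of step S53 can be sketched without an imaging library, treating the image as a nested list `pixels[y][x]` of RGBA tuples. Several choices here are assumptions: the "middle value after sorting" is taken per channel as the upper median of the four corner samples, the point-in-region test assumes a convex quadrilateral, and the function names are hypothetical. EXIF rotation and mode conversion (S51, S52, S54) are not shown.

```python
import math


def background_color(pixels, corners):
    """Per-channel middle value of the colors sampled at the four corners."""
    height, width = len(pixels), len(pixels[0])
    samples = []
    for x, y in corners:
        xi = min(max(int(round(x)), 0), width - 1)   # clamp to the image
        yi = min(max(int(round(y)), 0), height - 1)
        samples.append(pixels[yi][xi])
    return tuple(sorted(channel)[len(channel) // 2]
                 for channel in zip(*samples))


def inside_quad(px, py, corners):
    """True if (px, py) is in the convex quadrilateral, edges included."""
    crosses = []
    for i in range(4):
        (ax, ay), (bx, by) = corners[i], corners[(i + 1) % 4]
        crosses.append((bx - ax) * (py - ay) - (by - ay) * (px - ax))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)


def mask_region(pixels, corners):
    """S53: fill the sensitive quadrilateral with the background color."""
    color = background_color(pixels, corners)
    height, width = len(pixels), len(pixels[0])
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    # Bounding rectangle of the four coordinates, clamped to the image.
    x0, x1 = max(math.floor(min(xs)), 0), min(math.ceil(max(xs)), width - 1)
    y0, y1 = max(math.floor(min(ys)), 0), min(math.ceil(max(ys)), height - 1)
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            if inside_quad(x, y, corners):
                pixels[y][x] = color  # modified in place
    return pixels
```

Sampling the corners rather than the interior makes it likely that the chosen color is paper background and not ink, since the corners lie just outside the glyphs.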

In one embodiment, 256 sensitive information matching keywords (keyword_list) were sorted and summarized from 2544 real medical report photos; part of the list is shown in FIG. 3. In addition, 17975 hospital name keywords (hospital_list) were obtained. FIG. 4 shows a sample desensitization result obtained in one embodiment. It can be seen that the present invention can accurately desensitize sensitive information.

As shown in FIG. 5, the preferred embodiment of the invention also discloses a sensitive information desensitization system in a medical report photo, which comprises the following modules:

The system for desensitizing sensitive information in a medical report photo is used to implement the method described above; for its specific implementation, refer to the corresponding method embodiments, which are not repeated here.

Likewise, because the system of this embodiment implements the above method, its effects correspond to those of the method and are not described again here.

The above embodiments are merely preferred embodiments for fully illustrating the present invention, and the scope of the present invention is not limited thereto. The equivalent substitution or change made by the technical personnel in the technical field on the basis of the invention is all within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims

1. A method for desensitizing sensitive information in a medical report photo, comprising the steps of:

S1, constructing a sensitive information range based on prior knowledge;

2. The method for desensitizing sensitive information in medical report sheets according to claim 1, wherein said step S1 includes:

3. The method of desensitizing sensitive information in medical report form photos of claim 2, wherein the sensitive information in the sensitive information matching keyword list comprises: name, patient number, patient ID, pathology number, medical record number, case number.

4. The method for desensitizing sensitive information in medical report form photographs according to claim 1, wherein detecting the text boxes and recognizing the text content in the current medical report form photograph in step S2 comprises: detecting the text boxes in the current medical report photo and recognizing their text content with an OCR (optical character recognition) model, to obtain the coordinates of each text box, its text content and the recognition confidence.

5. The method for desensitizing sensitive information in medical report sheets according to claim 1, wherein step S3 includes:

6. The method for desensitizing sensitive information in medical report form photographs of claim 5, wherein in step S32, making further determinations based on part-of-speech comprises:

7. The method for desensitizing sensitive information in medical report sheets according to claim 6, wherein step S323 comprises:

8. The method for desensitizing sensitive information in medical report sheets according to claim 1, wherein step S4 includes: for the sensitive information recognition result, establishing a rectangular coordinate system in the medical report photo with the pixel as the unit, dynamically allocating the widths of Chinese and English characters according to the length of the text box, and calculating the coordinates of each piece of sensitive information from straight-line equations.

9. The method for desensitizing sensitive information in medical report sheets according to claim 1, wherein step S5 includes: traversing the coordinates of each piece of sensitive information, acquiring the RGBA values of the pixels at its four coordinates, sorting them, and taking the middle color value as the most probable background color; then framing the rectangular boundary that encloses the four coordinates, examining each pixel within that boundary, and, if the pixel lies in the region where the sensitive information is located, changing its value to the background color, thereby achieving masking (coding) desensitization.

10. A system for desensitizing sensitive information in medical report sheets, comprising the following modules: