CN113936764A - Method and system for desensitizing sensitive information in medical report sheet photo - Google Patents

Info

Publication number
CN113936764A
Authority
CN
China
Prior art keywords
sensitive information
text
list
medical report
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111265566.3A
Other languages
Chinese (zh)
Inventor
王珏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Liangyihui Network Technology Co ltd
Original Assignee
Suzhou Liangyihui Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Suzhou Liangyihui Network Technology Co ltd
Priority to CN202111265566.3A
Publication of CN113936764A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H15/00 ICT specially adapted for medical reports, e.g. generation or transmission thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioethics (AREA)
  • Medical Informatics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a method and a system for desensitizing sensitive information in a medical report photo. The method comprises the following steps: S1, constructing a sensitive information range based on prior knowledge; S2, reading the current medical report photo, detecting the text boxes in it and recognizing their text content; S3, detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with a named entity recognition method based on a sequence labeling network; S4, locating the coordinates of each piece of identified sensitive information; and S5, masking the sensitive information at the located coordinates to desensitize it. The method and the system can accurately identify and locate the sensitive information and finally mask it, and have the advantages of high feasibility and high accuracy.

Description

Method and system for desensitizing sensitive information in medical report sheet photo
Technical Field
The invention relates to the technical field of image desensitization, in particular to a method and a system for desensitizing sensitive information in a medical report photo.
Background
In the big data era, artificial intelligence is developing rapidly, and the combination of the Internet and healthcare has become a new direction for the development of the medical and health industry in China. Information sharing brings convenience to daily life, but it also challenges the security of personal sensitive information. With the wide application of medical information systems, medical reports have become very common documents, and they often contain important information such as the patient's name, patient number, clinic number, contact details and address, as well as the hospital and the doctor's name. In many fields such as smart healthcare, online consultation and medical research, a medical report is often photographed with a device such as a mobile phone and the resulting image is uploaded. To keep the user's information secure, it is important to desensitize the sensitive information in these images before they are used.
Image desensitization refers to shielding sensitive information by means such as image blurring, watermarking and mosaicking, so that the sensitive information in an image is protected to a certain extent. Most existing image desensitization techniques rely on template matching and template covering: a replacement template prepared in advance is placed over the image region that needs desensitization. Preparing the matching template, however, increases the image preprocessing time and makes image desensitization inefficient.
Detecting and recognizing text in an image is an important part of desensitizing sensitive information; it refers to analyzing and processing the image to obtain character and layout information. Text detection can locate a text box but cannot output the position of each character, which makes the subsequent positioning of sensitive information very difficult.
Desensitization methods for medical text fall into three groups: methods based on rules and dictionaries, methods based on machine learning, and combinations of the two. Rule- and dictionary-based methods are fast and are effective when large annotated corpora are lacking, but their precision and recall depend heavily on the quality of the dictionary. Machine learning algorithms generally outperform rule-based ones, but large annotated corpora are very hard to obtain in the medical field, and some private data appears only rarely in medical text; this sparsity degrades the machine learning result.
Techniques for image desensitization and for medical text desensitization already exist, but research on desensitizing medical report photos is lacking. The main reason is that medical reports come in many layouts. In a scenario without a fixed layout, how to associate the text detected in the image with the recognized text so as to detect and locate the information to be desensitized is a problem that those skilled in the art need to solve.
Disclosure of Invention
The invention aims to provide a method for desensitizing sensitive information in a medical report photo that has high feasibility and high accuracy.
To solve the above problem, the present invention provides a method for desensitizing sensitive information in a medical report photo, comprising the following steps:
S1, constructing a sensitive information range based on prior knowledge;
S2, reading the current medical report photo, detecting the text boxes in it and recognizing their text content;
S3, detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with a named entity recognition method based on a sequence labeling network;
S4, locating the coordinates of each piece of identified sensitive information;
S5, masking the sensitive information at the located coordinates to desensitize it.
As a further improvement of the present invention, step S1 includes:
S11, sorting and summarizing existing medical report photos to construct a keyword list for matching sensitive information;
S12, collecting and sorting a list of hospital names, then splitting and recombining the names to obtain a hospital name keyword list.
As a further improvement of the present invention, the sensitive information in the sensitive information matching keyword list includes: name, patient number, patient ID, pathology number, medical record number, case number.
As a further improvement of the present invention, in step S2, detecting the text boxes in the current medical report photo and recognizing their text content includes: detecting the text boxes and recognizing their text content with an OCR (optical character recognition) model, to obtain the text box coordinates, the text content and the recognition confidence.
As a further improvement of the present invention, step S3 includes:
S31, traversing the text boxes, using THULAC to segment the recognized text of each text box into words and tag their parts of speech, and returning the part-of-speech tagging result nes;
S32, traversing the part-of-speech tagging result nes, and, if the current item ne_text is not already in the stored sensitive information list ne_list, making a further judgment according to its part of speech;
S33, during matching against the keyword list keyword_list, if a matched keyword match_word and its corresponding information ne_text are detected in two separate text boxes, the information cannot be desensitized by that matching, so a second pass is needed: if the length of text_str does not exceed the threshold skip_keyword_len, traversing the keyword list keyword_list and searching for the keyword match_word;
S34, if the current text box string text_str contains the two-character word "hospital", traversing the hospital list hospital_list and checking whether text_str contains a hospital name, and if so, adding that hospital name to the sensitive information list ne_list;
S35, detecting an email address email in the string text_str, and if email is not empty, adding it to the sensitive information list ne_list;
S36, since the sensitive information list ne_list may contain duplicates after steps S34 and S35, deduplicating it before it is traversed in the next step;
S37, traversing each item in the sensitive information list ne_list, searching for its index in the current text box string text_str, and returning the length of text_str and ne_index.
As a further improvement of the present invention, the further judgment according to the part of speech in step S32 includes:
S321, if the part of speech of ne_text is a person name and the length of ne_text does not exceed the threshold skip_np_len, adding the current word ne_text to the sensitive information list ne_list;
S322, if the part of speech of ne_text is a place name, adding the current word ne_text to the sensitive information list ne_list;
S323, if the part of speech of ne_text is a numeral, further judging whether ne_text is a mobile phone number or an identity card number.
As a further improvement of the present invention, step S323 includes:
S3231, if ne_text is 11 digits long and begins with 1, treating ne_text as a mobile phone number and adding it to the sensitive information list ne_list;
S3232, if ne_text is 18 characters long and satisfies the identity card number check rule, treating ne_text as an identity card number and adding it to the sensitive information list ne_list;
S3233, if the length of the current numeral ne_text is greater than or equal to the threshold match_m_len, traversing the keyword list keyword_list, and if a related keyword match_word appears in the current text box string text_str, adding both the matched keyword match_word and the current word ne_text to the sensitive information list ne_list.
As a further improvement of the present invention, step S4 includes: for the sensitive information recognition result, establishing a rectangular coordinate system in the medical report photo with the pixel as the unit, dynamically allocating the widths of Chinese and English characters according to the length of the text box, and calculating the coordinates of each piece of sensitive information through linear equations.
As a further improvement of the present invention, step S5 includes: traversing the coordinates of each piece of sensitive information, obtaining the RGBA values of the pixels at its four corner coordinates, sorting them and taking the middle color value as the most probable background color; then framing the rectangular boundary that encloses the four coordinates, checking each pixel inside that boundary, and, if the pixel lies in the region where the sensitive information is located, changing its value to the background color, thereby masking and desensitizing the information.
To solve the above problem, the present invention also provides a system for desensitizing sensitive information in a medical report photo, which includes the following modules:
a sensitive information construction module, used for constructing a sensitive information range based on prior knowledge;
a text detection and recognition module, used for reading the current medical report photo, detecting the text boxes in it and recognizing their text content;
a sensitive information detection module, used for detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with a named entity recognition method based on a sequence labeling network;
a sensitive information positioning module, used for locating the coordinates of each piece of identified sensitive information;
a sensitive information desensitization module, used for masking the sensitive information at the located coordinates to desensitize it.
The invention has the beneficial effects that:
The method and the system for desensitizing sensitive information in a medical report photo can accurately identify and locate the sensitive information and finally mask it, and have the advantages of high feasibility and high accuracy.
The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the description, and so that the above and other objects, features and advantages can be understood more readily, preferred embodiments are described in detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of the method for desensitizing sensitive information in a medical report photo in a preferred embodiment of the present invention;
FIG. 2 is a diagram of sensitive information positioning in a preferred embodiment of the present invention;
FIG. 3 shows part of keyword_list in a preferred embodiment of the present invention;
FIG. 4 is a sample of the desensitization results obtained in a preferred embodiment of the present invention;
FIG. 5 is a schematic diagram of the system for desensitizing sensitive information in a medical report photo in a preferred embodiment of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
As shown in FIG. 1, the method for desensitizing sensitive information in a medical report photo in the preferred embodiment of the present invention comprises the following steps:
S1, constructing a sensitive information range based on prior knowledge; this includes:
S11, sorting and summarizing existing medical report photos to construct a keyword list for matching sensitive information; the sensitive information in this keyword list includes: name, patient number, patient ID, pathology number, medical record number, case number, etc.
S12, collecting and sorting a list of hospital names, then splitting and recombining the names to obtain a hospital name keyword list; the hospital name keywords include: People's Hospital, Provincial People's Hospital, Municipal People's Hospital, Xinhua Hospital, First Affiliated Hospital, etc.
S2, reading the current medical report photo, detecting the text boxes in it and recognizing their text content;
optionally, the text boxes are detected and their text content is recognized with an OCR model, so as to obtain the text box coordinates, the text content and the recognition confidence.
Specifically, a medical report photo img is input and an OCR (optical character recognition) model is loaded with the following parameters: the GPU is used; lang is set to ch so that the recognition language is Chinese; use_space_char is set to True so that spaces are recognized; use_angle_cls is set to True so that the classification model is loaded; and classification is enabled (True). The text boxes in the image are then detected and recognized, and the recognition result is returned, which includes the text box coordinates, the text content and the recognition confidence.
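The OCR step above can be sketched as follows. The source does not name the OCR library; the parameter names it lists match those of PaddleOCR 2.x, so this sketch assumes that library, and the layout of the returned result is also an assumption that differs between releases. The names TextBox, parse_ocr_lines and detect_text_boxes are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextBox:
    corners: List[Point]  # four (x, y) corner points of the text box
    text: str             # recognized text content
    confidence: float     # recognition confidence


def parse_ocr_lines(lines) -> List[TextBox]:
    """Convert entries shaped [corners, (text, confidence)] into TextBox records."""
    boxes = []
    for corners, (text, confidence) in lines or []:
        points = [(float(x), float(y)) for x, y in corners]
        boxes.append(TextBox(points, text, float(confidence)))
    return boxes


def detect_text_boxes(img_path: str) -> List[TextBox]:
    # Assumption: PaddleOCR 2.x is installed. It is imported lazily so that
    # parse_ocr_lines can be used and tested without it.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_gpu=True, lang="ch", use_space_char=True, use_angle_cls=True)
    result = ocr.ocr(img_path, cls=True)
    # Assumption: the result holds one list of lines per input image.
    return parse_ocr_lines(result[0])
```

Keeping the parsing separate from the model call makes the later steps independent of whichever OCR engine actually produces the boxes.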
S3, detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with a named entity recognition method based on a sequence labeling network; this includes:
S31, traversing the text boxes, using THULAC (THU Lexical Analyzer for Chinese) to segment the recognized text of each text box into words and tag their parts of speech, and returning the part-of-speech tagging result nes;
S32, traversing the part-of-speech tagging result nes, and, if the current item ne_text is not already in the stored sensitive information list ne_list (initially empty), making a further judgment according to its part of speech; this includes:
S321, if the part of speech of ne_text is a person name and the length of ne_text does not exceed the threshold skip_np_len (default 10), adding the current word ne_text to the sensitive information list ne_list;
S322, if the part of speech of ne_text is a place name, adding the current word ne_text to the sensitive information list ne_list;
S323, if the part of speech of ne_text is a numeral, further judging whether ne_text is a mobile phone number or an identity card number.
Specifically, step S323 includes:
S3231, if ne_text is 11 digits long and begins with 1, treating ne_text as a mobile phone number and adding it to the sensitive information list ne_list;
S3232, if ne_text is 18 characters long and satisfies the identity card number check rule, treating ne_text as an identity card number and adding it to the sensitive information list ne_list;
S3233, if the length of the current numeral ne_text is greater than or equal to the threshold match_m_len (default 5), traversing the keyword list keyword_list, and if a related keyword match_word appears in the current text box string text_str, adding both the matched keyword match_word and the current word ne_text to the sensitive information list ne_list.
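A minimal sketch of the part-of-speech rules in S32 and S323 is given below. It takes already tagged (word, tag) pairs as input rather than calling THULAC, and it assumes THULAC's tag names np (person name), ns (place name) and m (numeral). The identity card check uses the GB 11643-1999 checksum, which is an assumption about what the source means by "check rule". The function names and the choice to stop at the first matching keyword are illustrative, not taken from the source.

```python
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CODES = "10X98765432"  # indexed by (weighted sum mod 11)


def is_mobile_number(s: str) -> bool:
    # S3231: 11 digits, first digit is 1.
    return len(s) == 11 and s.isascii() and s.isdigit() and s[0] == "1"


def is_id_card_number(s: str) -> bool:
    # S3232: 18 characters whose last one matches the weighted checksum.
    if len(s) != 18 or not s.isascii() or not s[:17].isdigit():
        return False
    total = sum(int(d) * w for d, w in zip(s[:17], ID_WEIGHTS))
    return ID_CHECK_CODES[total % 11] == s[17].upper()


def collect_sensitive(nes, text_str, keyword_list, skip_np_len=10, match_m_len=5):
    """Apply S321-S323 to tagged pairs and return the sensitive list ne_list."""
    ne_list = []
    for ne_text, tag in nes:
        if ne_text in ne_list:
            continue
        if tag == "np" and len(ne_text) <= skip_np_len:   # S321: person name
            ne_list.append(ne_text)
        elif tag == "ns":                                  # S322: place name
            ne_list.append(ne_text)
        elif tag == "m":                                   # S323: numeral
            if is_mobile_number(ne_text) or is_id_card_number(ne_text):
                ne_list.append(ne_text)
            elif len(ne_text) >= match_m_len:              # S3233
                for match_word in keyword_list:
                    if match_word in text_str:
                        ne_list.extend([match_word, ne_text])
                        break  # illustrative choice: first matching keyword only
    return ne_list
```

A long number is only treated as sensitive when a keyword such as a record-number label appears in the same text box, which keeps ordinary measurement values in the report from being masked.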
S33, during matching against the keyword list keyword_list, if a matched keyword match_word and its corresponding information ne_text are detected in two separate text boxes, the information cannot be desensitized by that matching, so a second pass is needed: if the length of text_str does not exceed the threshold skip_keyword_len (default 10), the keyword list keyword_list is traversed to search for the keyword match_word. This step desensitizes only the matched keyword match_word; even if the corresponding information ne_text is missed, the desensitization can still be regarded as effective, because the corresponding keyword match_word is no longer present.
S34, if the current text box string text_str contains the two-character word "hospital", traversing the hospital list hospital_list and checking whether text_str contains a hospital name, and if so, adding that hospital name to the sensitive information list ne_list;
S35, detecting an email address email in the string text_str, and if email is not empty, adding it to the sensitive information list ne_list;
S36, since the sensitive information list ne_list may contain duplicates after steps S34 and S35, deduplicating it before it is traversed in the next step;
S37, traversing each item in the sensitive information list ne_list, searching for its index (there may be several) in the current text box string text_str, and returning the length of text_str and ne_index (which contains, for each item, its index in text_str, its length and its content).
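Steps S33 to S37 can be sketched as below. The email pattern is a deliberately simple regular expression chosen for illustration, the literal 医院 is assumed to be the two-character word for "hospital" that the text refers to, and the function names and the tuple layout of ne_index are hypothetical.

```python
import re

# Simple illustrative pattern; the source does not specify how emails are detected.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_all(text_str: str, item: str):
    """Return every start index at which item occurs in text_str."""
    indexes, start = [], text_str.find(item)
    while start != -1:
        indexes.append(start)
        start = text_str.find(item, start + 1)
    return indexes


def locate_sensitive(text_str, ne_list, keyword_list, hospital_list, skip_keyword_len=10):
    found = list(ne_list)
    # S33: short boxes may hold only a keyword whose value sits in another box.
    if len(text_str) <= skip_keyword_len:
        found += [kw for kw in keyword_list if kw in text_str]
    # S34: look up hospital names only when the word "hospital" is present.
    if "医院" in text_str:
        found += [h for h in hospital_list if h in text_str]
    # S35: email addresses.
    found += EMAIL_RE.findall(text_str)
    # S36: deduplicate while keeping the original order.
    found = list(dict.fromkeys(found))
    # S37: one (index, length, content) entry per occurrence.
    ne_index = [(i, len(item), item) for item in found for i in find_all(text_str, item)]
    return len(text_str), ne_index
```

For example, calling locate_sensitive on a short box that contains only a label and a name returns the positions of both, so the label is masked even when its value was recognized in a neighbouring box.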
S4, locating the coordinates of each piece of identified sensitive information;
optionally, for the sensitive information recognition result, a rectangular coordinate system is established in the medical report photo with the pixel as the unit, the widths of Chinese and English characters are dynamically allocated according to the length of the text box, and the coordinates of each piece of sensitive information are calculated through linear equations.
Specifically, referring to FIG. 2, the positioning steps are as follows:
S41, obtaining the corner coordinates of the text box of text_str: $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, $(x_4, y_4)$;
S42, taking the smaller of $x_1$ and $x_4$ to obtain the straight line $f_1$, and the larger of $x_2$ and $x_3$ to obtain the straight line $f_2$; obtaining the straight line $f_3$ from $(x_1, y_1)$ and $(x_2, y_2)$, and the straight line $f_4$ from $(x_3, y_3)$ and $(x_4, y_4)$;
S43, dividing the width between the straight lines $f_1$ and $f_2$ according to the length of text_str, where a Chinese character has width 2 and any other character (digits, English letters, etc.) has width 1, so that the abscissa of each character can be calculated;
S44, substituting the abscissas obtained in S43 into the straight lines $f_3$ and $f_4$ to calculate the upper and lower ordinates of each character;
S45, on the basis of the above steps, using the ne_index obtained by sensitive information detection: the abscissa of the left boundary of the sensitive information is calculated from its index and the width of each unit; the abscissa of its right boundary is calculated from the left boundary abscissa, the length of the sensitive information and the number of Chinese characters it contains; substituting the left and right boundary abscissas into the straight lines $f_3$ and $f_4$ gives the ordinates of the upper and lower boundaries at the left and right boundaries, and finally the coordinates nes_location of all sensitive information are obtained.
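The positioning steps S41 to S45 can be illustrated with the sketch below. It assumes the four corners are ordered top-left, top-right, bottom-right, bottom-left, which is consistent with how the lines are formed in S42, and it treats code points U+4E00 to U+9FFF as Chinese characters. Both are assumptions, and the function names are hypothetical.

```python
def char_units(ch: str) -> int:
    # S43: a Chinese character counts as 2 width units, anything else as 1.
    return 2 if "\u4e00" <= ch <= "\u9fff" else 1


def line_y(p, q, x):
    """Ordinate at abscissa x on the straight line through points p and q."""
    (xa, ya), (xb, yb) = p, q
    if xb == xa:
        return ya
    return ya + (yb - ya) * (x - xa) / (xb - xa)


def locate(corners, text_str, index, length):
    """Return the four corner coordinates of text_str[index:index + length]."""
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = corners
    left = min(x1, x4)    # line f1
    right = max(x2, x3)   # line f2
    total = sum(char_units(c) for c in text_str)
    if total == 0:
        raise ValueError("text_str must not be empty")
    unit = (right - left) / total
    # S45: left boundary from the units before the item, right boundary from its own units.
    xl = left + unit * sum(char_units(c) for c in text_str[:index])
    xr = xl + unit * sum(char_units(c) for c in text_str[index:index + length])

    def top(x):      # line f3 through (x1, y1) and (x2, y2)
        return line_y((x1, y1), (x2, y2), x)

    def bottom(x):   # line f4 through (x3, y3) and (x4, y4)
        return line_y((x4, y4), (x3, y3), x)

    return [(xl, top(xl)), (xr, top(xr)), (xr, bottom(xr)), (xl, bottom(xl))]
```

Because the upper and lower edges are evaluated as lines rather than constants, the resulting quadrilateral follows a tilted text box instead of assuming the photo is perfectly level.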
S5, masking the sensitive information at the located coordinates to desensitize it.
Optionally, the coordinates of each piece of sensitive information are traversed, the RGBA values of the pixels at its four corner coordinates are obtained and sorted, and the middle color value is taken as the most probable background color; the rectangular boundary that encloses the four coordinates is then framed, each pixel inside that boundary is checked, and, if the pixel lies in the region where the sensitive information is located, its value is changed to the background color, thereby masking and desensitizing the information.
Specifically, the method comprises the following steps:
S51, reading the EXIF information of the medical report photo, and if it contains orientation information, rotating the image accordingly;
S52, reading and recording the image mode img_mode, and if the image contains sensitive information, converting the image to RGB mode;
S53, according to the sensitive information coordinates nes_location obtained by the sensitive information positioning module, traversing the coordinates of each piece of sensitive information, obtaining the RGBA values of the pixels at its four corner coordinates, sorting them and taking the middle color value as the most probable background color; because the quadrilateral formed by the four coordinates may not be a rectangle, the rectangular boundary that encloses the four coordinates is framed first to simplify masking; each pixel inside the rectangular boundary is then checked, and if it lies in the region where the sensitive information is located, its value is changed to the background color, which achieves the masking and desensitization effect;
S54, restoring the desensitized image to the original image mode img_mode, and finally saving the desensitized image.
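The masking in S53 can be sketched on a plain nested list of pixel tuples, so that the logic is visible without an imaging library. The per-channel middle value and the convex point-in-quadrilateral test are one possible reading of the description, not necessarily the patent's implementation; the EXIF rotation and image mode handling of S51, S52 and S54 are omitted, and all function names are hypothetical.

```python
def median_color(colors):
    """Per channel, sort the sampled values and take the middle one."""
    return tuple(sorted(channel)[len(channel) // 2] for channel in zip(*colors))


def inside_quad(px, py, quad):
    """True if (px, py) lies inside or on the edge of a convex quadrilateral."""
    sign = 0
    for i in range(len(quad)):
        (xa, ya), (xb, yb) = quad[i], quad[(i + 1) % len(quad)]
        cross = (xb - xa) * (py - ya) - (yb - ya) * (px - xa)
        if cross:
            side = 1 if cross > 0 else -1
            if sign and side != sign:
                return False
            sign = side
    return True


def mask_region(pixels, quad):
    """Fill the quadrilateral quad in pixels[y][x] with the estimated background color."""
    height, width = len(pixels), len(pixels[0])

    def clamp(x, y):
        return (min(max(int(round(x)), 0), width - 1),
                min(max(int(round(y)), 0), height - 1))

    corners = [clamp(x, y) for x, y in quad]
    # Background estimate from the four corner pixels.
    background = median_color([pixels[y][x] for x, y in corners])
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    # Scan the enclosing rectangle and repaint only pixels inside the quadrilateral.
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            if inside_quad(x, y, quad):
                pixels[y][x] = background
    return background
```

Filling with a color sampled from the box corners, rather than a fixed black bar, makes the masked area blend with the paper background of the report.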
In one embodiment, 256 sensitive information matching keywords (keyword_list) were sorted and summarized from 2544 real medical report photos; part of the list is shown in FIG. 3. In addition, 17975 hospital name keywords (hospital_list) were obtained. FIG. 4 shows a sample of the desensitization results obtained in one embodiment. It can be seen that the present invention can accurately desensitize sensitive information.
As shown in FIG. 5, the preferred embodiment of the invention also discloses a system for desensitizing sensitive information in a medical report photo, which comprises the following modules:
a sensitive information construction module, used for constructing a sensitive information range based on prior knowledge;
a text detection and recognition module, used for reading the current medical report photo, detecting the text boxes in it and recognizing their text content;
a sensitive information detection module, used for detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with a named entity recognition method based on a sequence labeling network;
a sensitive information positioning module, used for locating the coordinates of each piece of identified sensitive information;
a sensitive information desensitization module, used for masking the sensitive information at the located coordinates to desensitize it.
The system for desensitizing sensitive information in a medical report photo is used to implement the method described above, so its specific implementation can be found in the description of the corresponding method embodiments and is not repeated here.
In addition, since the system of this embodiment implements the above method, its effects correspond to those of the method and are not described again here.
The method and the system for desensitizing sensitive information in a medical report photo can accurately identify and locate the sensitive information and finally mask it, and have the advantages of high feasibility and high accuracy.
The above embodiments are merely preferred embodiments given to fully illustrate the present invention, and the scope of protection of the present invention is not limited to them. Equivalent substitutions or changes made by those skilled in the art on the basis of the present invention all fall within its scope of protection, which is defined by the claims.

Claims (10)

1. A method for desensitizing sensitive information in a medical report photo, comprising the steps of:
s1, constructing a sensitive information range based on the prior knowledge;
s2, reading the current medical report photo, detecting a text box in the current medical report photo and identifying text content;
s3, detecting and identifying the sensitive information of the text content of each text box according to the sensitive information range by combining a named entity identification method in a keyword matching and sequence tagging network;
s4, positioning the coordinates of each piece of identified sensitive information;
and S5, coding and desensitizing the position of the sensitive information according to the positioned coordinates.
2. The method for desensitizing sensitive information in a medical report photo according to claim 1, wherein step S1 comprises:
S11, sorting and summarizing existing medical report photos to construct a sensitive information matching keyword list;
S12, collecting and sorting a list of hospital names, and splitting and combining the names to obtain a hospital name keyword list.
3. The method for desensitizing sensitive information in a medical report photo according to claim 2, wherein the sensitive information in the sensitive information matching keyword list comprises: name, patient number, patient ID, pathology number, medical record number, and case number.
4. The method for desensitizing sensitive information in a medical report photo according to claim 1, wherein detecting the text boxes and recognizing the text content in the current medical report photo in step S2 comprises: detecting the text boxes in the current medical report photo and recognizing their text content with an OCR (optical character recognition) model, to obtain the coordinates of each text box, its text content, and the recognition confidence.
5. The method for desensitizing sensitive information in a medical report photo according to claim 1, wherein step S3 comprises:
S31, traversing the text boxes, segmenting the recognized text of each text box into words with THULAC, tagging parts of speech, and returning the part-of-speech tagging results nes;
S32, traversing the part-of-speech tagging results nes, and if the current item ne_text is not already in the stored sensitive information list ne_list, making a further determination according to its part of speech;
S33, when matching against the keyword list keyword_list, if the matched keyword match_word and its corresponding matched information ne_text are detected in two different text boxes, desensitization cannot be carried out directly and further secondary processing is needed; if the length of the text text_str does not exceed the threshold skip_keyword_len, traversing the keyword list keyword_list and searching for the sensitive information keyword match_word;
S34, if the current text box string text_str contains the two characters of the word "hospital", traversing the hospital list hospital_list, checking whether a hospital name appears in the current string text_str, and if so, adding the hospital name to the sensitive information list ne_list;
S35, detecting an email address email in the string text_str, and if email is not empty, adding the email address to the sensitive information list ne_list;
S36, since the sensitive information list ne_list may contain duplicates after steps S34 and S35, deduplicating the list before it is traversed in the next step;
S37, traversing each piece of sensitive information in the sensitive information list ne_list, searching for its index in the current text box string text_str, and returning the length of text_str and the index ne_index.
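As an illustration only, steps S34 to S37 can be sketched in Python as follows. The function names, the e-mail regular expression, and the default marker string "医院" are assumptions made for this sketch and are not taken from the claims; only text_str, hospital_list, ne_list and ne_index follow the names used above.

```python
import re

# Deliberately simple e-mail pattern; assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def collect_extra_sensitive(text_str, hospital_list, ne_list, hospital_marker="医院"):
    """S34-S36: add hospital names and e-mail addresses, then deduplicate."""
    found = list(ne_list)
    # S34: only scan the hospital list when the marker word is present.
    if hospital_marker in text_str:
        found.extend(name for name in hospital_list if name in text_str)
    # S35: add every e-mail address found in the string.
    found.extend(EMAIL_RE.findall(text_str))
    # S36: deduplicate while keeping first-seen order.
    return list(dict.fromkeys(found))


def locate_in_text(text_str, ne_list):
    """S37: return the length of text_str and (item, ne_index) pairs."""
    hits = []
    for ne_text in ne_list:
        ne_index = text_str.find(ne_text)
        if ne_index != -1:  # skip items that do not occur in this text box
            hits.append((ne_text, ne_index))
    return len(text_str), hits
```

Keeping first-seen order during deduplication is a design choice of the sketch; the claim only requires that duplicates be removed before the list is traversed.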
6. The method for desensitizing sensitive information in a medical report photo according to claim 5, wherein making a further determination according to the part of speech in step S32 comprises:
S321, if the part of speech of ne_text is a person name and the length of ne_text does not exceed the threshold skip_np_len, adding the current word ne_text to the sensitive information list ne_list;
S322, if the part of speech of ne_text is a place name, adding the current word ne_text to the sensitive information list ne_list;
S323, if the part of speech of ne_text is a numeral, further determining whether ne_text is a mobile phone number or an identity card number.
7. The method for desensitizing sensitive information in a medical report photo according to claim 6, wherein step S323 comprises:
S3231, if ne_text is 11 digits long and its first digit is 1, treating ne_text as a mobile phone number and adding the current ne_text to the sensitive information list ne_list;
S3232, if ne_text is 18 characters long and satisfies the identity card number check rule, treating ne_text as an identity card number and adding the current word ne_text to the sensitive information list ne_list;
S3233, if the length of the current numeral ne_text is greater than or equal to the threshold match_m_len, traversing the matching keyword list keyword_list, and if a related keyword match_word is found in the current text box string text_str, adding both the matched keyword match_word and the current word ne_text to the sensitive information list ne_list.
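The checks in S3231 to S3233 can be sketched as below. This is a hedged example rather than the applicant's implementation: the function names are hypothetical, the identity card check is assumed to be the standard weighted mod-11 checksum used for 18-digit resident identity numbers, and treating the three sub-steps as a fall-through chain is also an assumption.

```python
# Weights and check codes of the mod-11 checksum for 18-digit identity numbers.
ID_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
ID_CHECK_CODES = "10X98765432"


def is_mobile_number(ne_text):
    """S3231: 11 ASCII digits, the first of which is 1."""
    return (len(ne_text) == 11 and ne_text.isascii()
            and ne_text.isdigit() and ne_text[0] == "1")


def is_id_card_number(ne_text):
    """S3232: 18 characters whose last one matches the weighted checksum."""
    if len(ne_text) != 18:
        return False
    body, check = ne_text[:17], ne_text[17].upper()
    if not (body.isascii() and body.isdigit()):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(body, ID_WEIGHTS))
    return ID_CHECK_CODES[total % 11] == check


def classify_numeral(ne_text, text_str, keyword_list, match_m_len):
    """Return the items that S323 would add to ne_list for one numeral."""
    if is_mobile_number(ne_text) or is_id_card_number(ne_text):
        return [ne_text]
    # S3233: long numerals count only when a keyword appears in the same box.
    if len(ne_text) >= match_m_len:
        for match_word in keyword_list:
            if match_word in text_str:
                return [match_word, ne_text]
    return []
```

For example, a numeral that is neither a phone number nor an identity card number is still flagged when it is long enough and shares its text box with a keyword such as a patient ID label.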
8. The method for desensitizing sensitive information in a medical report photo according to claim 1, wherein step S4 comprises: for the sensitive information identification result, establishing a rectangular coordinate system in the medical report photo with the pixel as the unit, dynamically allocating the widths of Chinese and English characters according to the length of the text box, and calculating the coordinates of each piece of sensitive information through a linear equation.
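One possible reading of this positioning step is sketched below. The claim does not state how character widths are allocated, so the sketch assumes that a full-width (for example Chinese) character takes two width units and any other character takes one, and that the text box is given as four corner points in the order top-left, top-right, bottom-right, bottom-left. The names char_units and locate_span are hypothetical.

```python
import unicodedata


def char_units(ch):
    """Assumed width model: wide/full-width characters count as 2 units, others as 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def locate_span(box, text_str, ne_index, ne_len):
    """Interpolate the corners of text_str[ne_index:ne_index + ne_len] inside box.

    box is (top_left, top_right, bottom_right, bottom_left) in pixel coordinates.
    Returns the four corners of the span in the same order.
    """
    top_left, top_right, bottom_right, bottom_left = box
    units = [char_units(ch) for ch in text_str]
    total = sum(units)
    if total == 0:
        raise ValueError("text_str must not be empty")
    # Fraction of the box length covered before the span starts and ends.
    start = sum(units[:ne_index]) / total
    end = sum(units[:ne_index + ne_len]) / total

    def lerp(p, q, t):
        # Point on the straight line from p to q at parameter t in [0, 1].
        return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)

    return [lerp(top_left, top_right, start), lerp(top_left, top_right, end),
            lerp(bottom_left, bottom_right, end), lerp(bottom_left, bottom_right, start)]
```

Interpolating along the top and bottom edges separately keeps the result usable for text boxes that are slightly rotated in the photo.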
9. The method for desensitizing sensitive information in a medical report photo according to claim 1, wherein step S5 comprises: traversing the coordinates of each piece of sensitive information, obtaining the RGBA values of the pixels at the four corresponding coordinates of each piece of sensitive information, sorting them, and taking the median color value as the most probable background color; then framing the rectangular boundary enclosed by the four coordinates, examining the pixels within the rectangular boundary, and, for pixels that lie in the region where the sensitive information is located, changing their pixel values to the background color so as to achieve coding desensitization.
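A minimal sketch of this masking step follows, under several assumptions: the image is a nested list of RGBA tuples indexed as image[y][x], the four coordinates are integer pixel positions forming a convex quadrilateral, and the median is taken per channel using the upper middle of the four sorted samples. The function names are hypothetical.

```python
def median_background(image, corners):
    """Per-channel median of the RGBA values sampled at the four corners."""
    samples = [image[y][x] for x, y in corners]
    # With four samples, index len // 2 picks the upper of the two middle values.
    return tuple(sorted(channel)[len(channel) // 2] for channel in zip(*samples))


def inside_quad(px, py, quad):
    """True if (px, py) lies inside or on the edge of a convex quadrilateral."""
    crosses = []
    for i in range(4):
        x1, y1 = quad[i]
        x2, y2 = quad[(i + 1) % 4]
        crosses.append((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)


def mask_region(image, corners):
    """Overwrite the pixels inside the quadrilateral with the background color."""
    background = median_background(image, corners)
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    # Scan only the bounding rectangle, clipped to the image size.
    for y in range(max(min(ys), 0), min(max(ys), len(image) - 1) + 1):
        for x in range(max(min(xs), 0), min(max(xs), len(image[0]) - 1) + 1):
            if inside_quad(x, y, corners):
                image[y][x] = background
    return image
```

The function modifies the image in place and also returns it, so it can be applied once per located piece of sensitive information.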
10. A system for desensitizing sensitive information in a medical report photo, comprising the following modules:
a sensitive information construction module, used for constructing a sensitive information range based on prior knowledge;
a text detection and recognition module, used for reading the current medical report photo, detecting the text boxes in the current medical report photo, and recognizing their text content;
a sensitive information detection module, used for detecting and identifying sensitive information in the text content of each text box according to the sensitive information range, by combining keyword matching with a named entity recognition method based on a sequence labeling network;
a sensitive information positioning module, used for locating the coordinates of each piece of identified sensitive information;
and a sensitive information desensitization module, used for coding and desensitizing the position of the sensitive information according to the located coordinates.
CN202111265566.3A 2021-10-28 2021-10-28 Method and system for desensitizing sensitive information in medical report sheet photo Pending CN113936764A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111265566.3A CN113936764A (en) 2021-10-28 2021-10-28 Method and system for desensitizing sensitive information in medical report sheet photo

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111265566.3A CN113936764A (en) 2021-10-28 2021-10-28 Method and system for desensitizing sensitive information in medical report sheet photo

Publications (1)

Publication Number Publication Date
CN113936764A true CN113936764A (en) 2022-01-14

Family

ID=79284743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111265566.3A Pending CN113936764A (en) 2021-10-28 2021-10-28 Method and system for desensitizing sensitive information in medical report sheet photo

Country Status (1)

Country Link
CN (1) CN113936764A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115510440A (en) * 2022-09-21 2022-12-23 中国工程物理研究院计算机应用研究所 Black box model inversion attack method and system based on NES algorithm
CN115510440B (en) * 2022-09-21 2023-09-08 中国工程物理研究院计算机应用研究所 Black box model inversion attack method and system based on NES algorithm
CN117612172A (en) * 2024-01-24 2024-02-27 成都医星科技有限公司 Desensitization position locating and desensitization method and device, electronic equipment and storage medium
CN117612172B (en) * 2024-01-24 2024-03-19 成都医星科技有限公司 Desensitization position locating and desensitization method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination