CN112149180A - Text information desensitization method for referee document - Google Patents


Info

Publication number
CN112149180A
Authority
CN
China
Prior art keywords
information
desensitization
document
text
referee document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011036947.XA
Other languages
Chinese (zh)
Inventor
葛季栋
李传艺
惠天宇
黄云云
周筱羽
骆斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202011036947.XA
Publication of CN112149180A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Tourism & Hospitality (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Technology Law (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a rule-based method for recognizing and processing sensitive-information named entities, comprising the following steps: preprocessing the user input, including filtering out items that cannot be desensitized and storing the referee document files to be desensitized locally; preprocessing the referee document, including readjusting the document structure and removing useless characters; applying desensitization rules, that is, processing the document according to the items to be desensitized that the user has input, matching with regular expressions, judging and processing the matches with feature words, and generating a set of information coordinates in the document from the sensitive information words found; reading the original text line by line and extracting the information coordinates in sequence to perform text replacement; and outputting the desensitized file. The invention simulates the real scenario in which court trial management personnel desensitize referee documents before publishing them online, derives the rules for each kind of sensitive information from the document structure, can locate sensitive information accurately, and improves the accuracy of the desensitization result.

Description

Text information desensitization method for referee document
Technical Field
The invention relates to a referee document desensitization method, in particular to a sensitive information named entity recognition and processing technology based on rule design, and belongs to the technical field of natural language processing.
Background
A referee document is the written record of important litigation activities: it records the court's trial of a case, the applicable law, and the final judgment, and so carries the relevant information and the outcome of the whole litigation process. It is not only important evidence for determining the legal rights and obligations of the litigants, but also an effective means for a superior court to supervise the cases accepted by lower courts. In recent years, with the continued spread of the Internet, the publication of referee documents in China has gradually moved from paper to online release. This meets the public's right to participate in and be informed about judicial activities, reflects the progress of modern judicial thinking and, more importantly, benefits analysis and research by legal practitioners, provides a practical basis and theoretical support for strengthening legislation, and further promotes in-depth legal research.
Information desensitization is a technical means of hiding or masking sensitive private information by specific desensitization methods, so as to effectively protect the privacy rights of the persons concerned. In the big data era, personal private information is highly exposed on the network, so desensitization techniques are widely applied to Internet data. An effective and accurate desensitization method is of great significance for the disclosure and dissemination of information on the Internet.
Because a referee document records the course of litigation, it contains a large amount of detailed personal private information about the parties concerned. If this information is not given the necessary processing before the document is published, the personal privacy rights of the parties are easily infringed, and their normal work and life may be affected to some extent. Accurate and efficient desensitization of the private information in referee documents therefore safeguards the public's right to supervise judicial work while avoiding, as far as possible, harm to the personal privacy rights of the parties.
A referee document records the whole judicial process of a case tried in court; it directly expresses the fairness of the judicial institution and ultimately reflects the value of litigation. Because it is the information carrier for the whole litigation process, it necessarily contains a large amount of strongly identifying personal information about the parties, such as the names of the parties, names of legal persons (enterprises and public institutions), times, places, addresses (roads), law firms, lawyers, amounts of money, license plates, identification numbers, taxpayer identification numbers (business licenses) and other legal person identification numbers, driving license numbers, bank card numbers (account numbers), product brands, product models, and contract numbers. If this private information is disclosed on the Internet without any processing, the personal privacy of the parties will certainly be seriously harmed. The key to balancing the public's right to know and to supervise against the privacy of personal information is therefore how to handle the private information in published referee documents correctly.
Effective and accurate desensitization of the sensitive information in a referee document separates the identifying information of the parties from the recorded litigation process, which protects the parties' privacy, preserves the document's value as a judicial reference, and protects the public's legal right to know. In addition, implementing desensitization for referee documents supports deeper study of their information structure and related processing, which may yield further practical value. The invention therefore takes the writing structure of the referee document as its basis and the sensitive information text in the document as its technical target, and studies a text information desensitization method for referee documents.
Disclosure of Invention
The invention relates to a text information desensitization method for a referee document. The preparation for desensitization is completed by preprocessing the user input, which includes filtering out the non-desensitizable items submitted by the user and storing the document files to be desensitized on a server. The referee document files are then preprocessed by rearranging the document structure and removing useless characters, so that text information can be extracted and processed more easily. Next, desensitization rules designed for each kind of sensitive information are applied: the documents are processed according to the items to be desensitized that the user has input, a suitable regular expression is selected to match the text segments containing sensitive information, and the feature words related to each target text segment are judged and processed. From the sensitive information words found in this way, the information coordinates of those words in the document are generated, giving a sensitive information coordinate set. Finally, the information coordinates are extracted in sequence against the content of the original document file, and the text content is replaced according to the coordinates. The method can effectively extract the sensitive private information words relating to the parties and hide that information. It estimates where sensitive words are likely to occur from the document structure, can accurately determine their position coordinates in the document, and matches the real working scenario in which court trial management personnel must desensitize referee documents before disclosing them online.
The text information desensitization method for a referee document according to the invention is characterized by comprising the following steps:
step (1), preprocessing the user input;
step (2), preprocessing the referee document;
step (3), applying the desensitization rules;
step (4), hiding the sensitive information;
step (5), outputting the desensitized document.
2. A method for desensitizing textual information of a referee's document according to claim 1, characterized in that the user input preprocessing in step (1) comprises the specific sub-steps of:
step (1.1), filtering the desensitization items requested by the user against the full list of desensitization information items that the back end is configured to process;
step (1.2), generating a corresponding folder in the system back end, and receiving and storing in that folder the one or more referee document files uploaded by the user for subsequent desensitization.
3. The method for desensitizing textual information of a referee document according to claim 1, wherein the preprocessing of the referee document files uploaded by the user in step (2) comprises the specific sub-steps of:
step (2.1), readjusting the document structure to make sensitive information easier to extract, wherein the specific operations include, but are not limited to, adjusting the text so that each line ends with a period and re-dividing the paragraph structure;
step (2.2), removing useless characters, such as blank spaces at the beginning and end of sentences and paragraphs.
4. The text information desensitization method of the referee document according to claim 1, characterized in that in step (3), the desensitization rules configured in the back end for the different kinds of desensitization information are applied: the corresponding regular expressions are matched against the referee document, the matches are further screened and eliminated according to feature words specified for referee documents, and finally a position coordinate set of all sensitive information in the referee document file is generated. The specific sub-steps are as follows:
step (3.1), processing the document according to the items to be desensitized input by the user: because different desensitization items require different information to be processed, the referee document is copied several times in memory, and a separate thread is started for each information item to be desensitized to carry out the further processing;
step (3.2), performing text matching for each information item to be desensitized, using the corresponding regular expression designed in advance according to the common writing formats and conventions of existing referee documents;
step (3.3), taking the text matched by the regular expression in the previous step, retaining or filtering it out according to the preset feature words for the corresponding sensitive information, and truncating or cutting it at those keywords;
step (3.4), taking the sensitive information extracted by the regular matching and feature word processing as sensitive words, searching the whole referee document for them, storing the line number and offset of each hit as a position coordinate, and finally constructing a position coordinate set covering all information of the items to be desensitized in the referee document.
5. The text information desensitization method of a referee document according to claim 1, characterized in that in step (4), the position coordinates of each piece of sensitive information are obtained from the position coordinate set produced in the previous step, and text desensitization replacement is performed at the corresponding positions in the referee document file. The specific sub-steps are as follows:
step (4.1), reading the original referee document file line by line into memory, and storing all text lines as a set of text line strings;
step (4.2), extracting in sequence the position coordinates of each piece of sensitive information from the position coordinate set, and processing each text line of the referee document file;
step (4.3), performing text desensitization replacement on the corresponding content of the corresponding text lines according to the position coordinates, that is, replacing the specified content with preset desensitization characters.
6. A method of desensitizing textual information of a referee document according to claim 1, wherein step (5) outputs a desensitized document. The final output referee document file is still in the original file format uploaded by the user, and the specific content of the referee document file is consistent with the original file except that the sensitive information position is replaced by desensitization characters.
Compared with the prior art, the invention has the following notable advantages. When the referee document file is preprocessed, rearranging the document structure simplifies the subsequent extraction of sensitive words, and removing useless characters reduces the influence of unnecessary characters on the desensitization result. The recognition rules are designed manually from the specific format and position in which each kind of sensitive word appears in referee documents, which greatly reduces the time otherwise spent building and training a model and the complexity of the method, allows the corresponding information to be matched accurately in referee document files that follow the required format, reduces the dependence on data found in methods that use only statistical models, and improves the final desensitization effect. By storing the position coordinates of all sensitive information, text desensitization replacement can be performed iteratively, line by line, so that the desensitization result is obtained relatively efficiently and the corresponding set of desensitized documents is output accurately, which matches the real working scenario in which court trial management personnel must desensitize referee documents before disclosing them online. When a new referee document is processed with the method, the desensitization rules can be adapted to its format and structure to accommodate the change.
Drawings
FIG. 1 is a flow chart of the text information desensitization method for a referee document.
FIG. 2 is a statistical table of the feature words designed for the recognition rules that extract address information.
FIG. 3 shows the regular expressions for extracting general entity information.
FIG. 4 is a flow chart of the text document desensitization process.
FIG. 5 is a table showing the comparison of the identification accuracy and accuracy of each entity
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention aims to solve the problem of document desensitization and provides a text information desensitization method for a referee document. When the referee document file is preprocessed, rearranging the document structure simplifies the subsequent extraction of sensitive words, and removing useless characters reduces the influence of unnecessary characters on the desensitization result. The recognition rules are designed manually from the specific format and position in which each kind of sensitive word appears in referee documents, which greatly reduces the time otherwise spent building and training a model and the complexity of the method, allows the corresponding information to be matched accurately in referee document files that follow the required format, reduces the dependence on data found in methods that use only statistical models, and improves the final desensitization effect. By storing the position coordinates of all sensitive information, text desensitization replacement can be performed iteratively, line by line, so that the desensitization result is obtained relatively efficiently and the corresponding set of desensitized documents is output accurately, which matches the real working scenario in which court trial management personnel must desensitize referee documents before disclosing them online. The invention mainly comprises the following steps:
step (1), preprocessing the user input;
step (2), preprocessing the referee document;
step (3), applying the desensitization rules;
step (4), hiding the sensitive information;
step (5), outputting the desensitized document.
The detailed work flow of the text information desensitization method of the referee document is shown in figure 1. The above steps will be described in detail herein.
1. Because the items to be desensitized submitted by users are of many types, and only items within the range of the desensitization rules actually designed can be processed, the user input is first screened against the desensitization rule entries, and the one or more referee document files uploaded by the user are saved. The specific steps are as follows:
Step (1.1): filter out non-desensitizable items. The form of items to be desensitized that the user submits to the system is filtered and screened; items that cannot be processed are skipped, and the processable items are stored in the system as corresponding information identification objects.
Step (1.2): save the documents to be desensitized that the user has uploaded to the local storage of the desensitization system. Users may upload referee document files in several formats, which need to be screened during saving, and different saving procedures should be used depending on how many files the user uploads at one time.
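The screening in steps (1.1) and (1.2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the item identifiers in SUPPORTED_ITEMS, the accepted extensions in ALLOWED_SUFFIXES, and both function names are hypothetical, since the source does not list the configured items or file formats.

```python
from pathlib import Path

# Hypothetical catalogue of items for which the back end has rules.
SUPPORTED_ITEMS = {"party_name", "legal_person_name", "lawyer_name",
                   "law_firm_name", "address", "general_entity"}

# Hypothetical set of accepted upload formats.
ALLOWED_SUFFIXES = {".txt", ".doc", ".docx"}


def filter_items(requested):
    """Keep requested items that have a rule, in order and without duplicates."""
    seen = set()
    kept = []
    for item in requested:
        if item in SUPPORTED_ITEMS and item not in seen:
            seen.add(item)
            kept.append(item)
    return kept


def screen_uploads(filenames):
    """Keep only uploaded file names whose extension is accepted."""
    return [name for name in filenames
            if Path(name).suffix.lower() in ALLOWED_SUFFIXES]
```

Items that fail the check are simply skipped rather than rejected with an error, which mirrors the "skipping the non-processable items" behaviour described in step (1.1).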
2. To apply the desensitization rules more effectively in the later steps, the text of the whole document is processed in detail, which requires preprocessing the referee document in step (2). The specific steps are as follows:
Step (2.1): readjust the structure of the referee document. A structurally complete referee document mainly consists of a basic case information section, a plaintiff's claims section, a defendant's defense section, an evidence section, a facts found section, the judgment result, the reasons for judgment, the cited legal provisions, and so on, and the different parts follow different writing formats.
Step (2.2): remove useless characters from the document. Because of differences and problems that may arise when a referee document is entered, the file actually received for processing may contain irrelevant interference such as blank characters and garbled characters. Such characters would lower the recognition accuracy for sensitive information when the desensitization rules are later applied, so the whole text must be traversed and checked character by character. In this way the special symbols, blank characters, and garbled characters are all removed, and only the information relating to the actual content of the referee document is retained.
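A minimal sketch of steps (2.1) and (2.2) is given below, under stated assumptions: the text is Chinese, so the sentence terminator is the full stop "。" and all whitespace can be dropped safely; the noise set (whitespace, ASCII control characters, and the Unicode replacement character) and the function name preprocess are illustrative choices, not taken from the source.

```python
import re

# Characters treated as noise in this sketch: all whitespace (including the
# full-width space U+3000), ASCII control characters, and the Unicode
# replacement character U+FFFD left behind by decoding errors.
NOISE = re.compile(r"[\s\x00-\x1f\x7f\ufffd]+")

# One sentence: a run of non-terminator characters plus an optional full stop.
SENTENCE = re.compile(r"[^。]+。?")


def preprocess(text):
    """Return a list of cleaned lines, each ending with the full stop."""
    cleaned = NOISE.sub("", text)
    lines = SENTENCE.findall(cleaned)
    # Make sure the final fragment also ends with a full stop.
    if lines and not lines[-1].endswith("。"):
        lines[-1] += "。"
    return lines
```

Putting one sentence on each line means that a later rule only has to look at a single line to see a complete statement, which is the simplification step (2.1) aims for.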
3. The specific processing flow is shown in FIG. 4. The kinds of sensitive information for which extraction rules have been designed are party names, legal person names, lawyer names, law firm names, address names, and general entity information. The specific steps are as follows:
Step (3.1): process the document according to the items to be desensitized input by the user. Because different desensitization items require different information to be processed, the referee document is first copied several times in memory; a thread is then started for each information item to be desensitized, and each thread takes one document copy for its further processing.
Step (3.2): perform regular expression matching. Before processing, a matching regular expression is designed for each kind of sensitive information according to the patterns in which it commonly appears in referee documents; the kinds covered are party names, legal person names, lawyer names, law firm names, address names, and general entity information, and the extraction regular expressions designed for general entity information are shown in FIG. 3. After step (3.1) is completed, each desensitization thread applies its corresponding regular expression to match the sensitive information.
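Steps (3.1) and (3.2) can be sketched together as one worker thread per item, each applying its own pattern to its own copy of the document lines. The two patterns below are hypothetical stand-ins (the expressions of FIG. 3 are not reproduced in this text), the license plate pattern is deliberately simplified, and a thread pool is used here in place of manually started threads.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical patterns for two kinds of general entity information.
PATTERNS = {
    # 18-character resident identity number: 17 digits plus a digit or X.
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    # Simplified plate: one Chinese character, one letter, five letters or digits.
    "license_plate": re.compile(r"[\u4e00-\u9fa5][A-Z][A-Z0-9]{5}"),
}


def match_item(item, lines):
    """Worker for one item: collect every match of that item's pattern."""
    pattern = PATTERNS[item]
    return [found for line in lines for found in pattern.findall(line)]


def match_all(items, document_lines):
    """Run one worker per requested item, each on its own copy of the lines."""
    with ThreadPoolExecutor(max_workers=max(1, len(items))) as pool:
        futures = {item: pool.submit(match_item, item, list(document_lines))
                   for item in items}
        return {item: future.result() for item, future in futures.items()}
```

Giving each worker a private copy means no thread can see another thread's intermediate edits, at the cost of holding several copies of the document in memory.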
Only the basic idea behind the design of the regular expression recognition rules for party names is explained in detail here.
The parties appearing in a referee document are usually the plaintiff and the defendant taking part in the litigation; in the broad sense they also include co-litigants, litigation representatives, and the like, but only the basic plaintiff and defendant are considered during recognition. In the writing of a specific referee document, the appellations of the parties change with the trial instance: they are generally called plaintiff and defendant at first instance, and may become appellant and appellee at second instance. In addition, a party is not limited to a citizen and may be an organization, so the two cases must be handled separately in the actual implementation.
When writing a referee document, the court generally puts the personal information of the parties at the very beginning of the document, usually in the format "party appellation + party name, other information about the party". The party's name appears on its own, without other characters attached, and the other information may include personal details such as sex and nationality, for example "Defendant So-and-so, male, Chinese, residing in a certain village." Given this format, the party appellation should be used as the identifying feature word and matched when recognizing a person's name. Because courts differ in how rigorously they write documents, all possible appellations should be considered: some courts use "plaintiff/defendant", while others use variant forms of these appellations or, at second instance, "appellant/appellee".
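The "appellation + name, other information" format described above can be illustrated with a short sketch. The appellation list (原告 plaintiff, 被告 defendant, 上诉人 appellant, 被上诉人 appellee), the optional colon, and the function name extract_party are assumptions made for illustration; the patent's actual expression is not given in the text, and further appellation variants would have to be added to the list.

```python
import re

# Hypothetical appellation list; a real rule set would need more variants.
ROLE = r"(?:被上诉人|上诉人|原告|被告)"

# Appellation at the start of the line, an optional colon, then the name,
# which runs up to the first comma.
PARTY_LINE = re.compile(r"^(" + ROLE + r")[:：]?([^，,。：:]+)[，,]")


def extract_party(line):
    """Return (appellation, name) if the line opens with a party description."""
    match = PARTY_LINE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Because the pattern is anchored at the start of the line, a sentence that merely mentions a party in the middle is not picked up; a sentence that begins with an appellation but is not a party description would still match, which is one reason the feature word screening of step (3.3) follows.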
When recognizing the name of a party that is a company or organization, abbreviations must be taken into account: for convenience, the court may give the company's full name only at the very beginning of the document and use an abbreviated name thereafter. Because a recognized information item is used for matching and desensitization replacement throughout the text, a party recognized only by its full company name might fail to match later occurrences. Therefore, once the company's full name has been recognized, its abbreviation is also searched for throughout the text. Before it is used, the abbreviation is usually introduced after the company's full name in a parenthetical of the form (hereinafter referred to as "X"), so it can be extracted according to this writing format.
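A possible way to pick up such an abbreviation is sketched below. It assumes the parenthetical uses the Chinese wording 以下简称 ("hereinafter referred to as") directly after the full name, with optional quotation marks and either full-width or half-width brackets; the pattern and the function name find_abbreviation are illustrative, not the patent's own rule.

```python
import re

# Hypothetical pattern for the parenthetical that introduces an abbreviation,
# for example （以下简称“某某公司”）.
ABBREVIATION = re.compile(r'[（(]以下简称[:：]?[“"]?([^”"）)]+)[”"]?[）)]')


def find_abbreviation(full_name, text):
    """Return the abbreviation declared right after full_name, or None."""
    start = text.find(full_name)
    if start == -1:
        return None
    # Only accept a parenthetical that begins exactly where the name ends.
    match = ABBREVIATION.match(text, start + len(full_name))
    return match.group(1) if match else None
```

The abbreviation found this way can then be added to the sensitive words for the same party, so that later occurrences of the short name are also replaced.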
Step (3.3): judge and process feature words. The text matched by the regular expression in the previous step is retained or filtered out according to the preset feature words for the corresponding sensitive information, and is truncated or cut at those keywords. The feature words differ according to the kind of sensitive information being processed; those designed for the recognition rules that extract address information are shown in FIG. 2.
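As an illustration of this retain, filter, and truncate logic for addresses, the sketch below keeps a candidate only if it contains an address feature word and cuts it after the last such word. The feature words listed (province, city, district, county, town, village, road, street, number) are hypothetical examples; the actual list is the one in FIG. 2, which is not reproduced in this text.

```python
# Hypothetical feature words; the actual list is the one in FIG. 2.
ADDRESS_FEATURE_WORDS = ("省", "市", "区", "县", "镇", "村", "路", "街", "号")


def trim_address(candidate):
    """Keep a candidate only if it has a feature word; cut after the last one."""
    end = -1
    for word in ADDRESS_FEATURE_WORDS:
        position = candidate.rfind(word)
        if position != -1:
            end = max(end, position + len(word))
    if end == -1:
        return None  # no feature word: the candidate is filtered out
    return candidate[:end]
```

For example, a regular expression match that runs on past the house number into the following words is shortened so that only the address itself is stored as a sensitive word.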
Step (3.4): generate the information coordinates of the sensitive information words in the document from the sensitive information words found. The sensitive information extracted by the regular matching and feature word processing is taken as sensitive words and searched for throughout the referee document; the line number and offset of each hit are stored as a position coordinate, and finally a position coordinate set covering all information of the items to be desensitized in the referee document is constructed.
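The construction of the coordinate set can be sketched as follows. The source specifies a line number and an offset; storing the length of the hit as a third component is an assumption added here so that a later replacement step knows how many characters to cover, and the function name build_coordinates is illustrative.

```python
def build_coordinates(lines, sensitive_words):
    """Return sorted (line number, offset, length) triples for every hit."""
    coordinates = set()
    for line_number, line in enumerate(lines):
        for word in sensitive_words:
            if not word:
                continue  # an empty word would match everywhere
            offset = line.find(word)
            while offset != -1:
                coordinates.add((line_number, offset, len(word)))
                offset = line.find(word, offset + 1)
    return sorted(coordinates)
```

Line numbers and offsets are zero-based in this sketch, and using a set removes duplicate coordinates when the same word is supplied by more than one item.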
4. The position coordinates of each piece of sensitive information are obtained from the position coordinate set produced in the previous step, so that the corresponding positions in the referee document file can be located and replaced with desensitization characters. The experimental results finally obtained for the identification of each kind of entity information are shown in FIG. 5. The specific steps are as follows:
Step (4.1): read the original referee document file line by line into memory, and store all text lines as a set of text line strings so that desensitization can be performed line by line.
Step (4.2): extract in sequence the position coordinates of each piece of sensitive information from the position coordinate set, and process each text line of the referee document file, that is, judge whether the text line corresponds to the extracted position coordinates.
Step (4.3): perform text desensitization replacement on the corresponding content of the corresponding text lines according to the position coordinates of the desensitization item information, that is, replace the specified content in sequence with the preset desensitization characters.
5. The desensitized documents obtained in the previous step are output and saved as files and provided to the user for download. The output referee document file keeps the original file format uploaded by the user, and its content is identical to the original file except that the sensitive information positions are replaced by desensitization characters.
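The line-by-line replacement can be sketched as below, assuming each coordinate is a (line number, offset, length) triple with zero-based values and that every sensitive character is replaced by one mask character. The triple layout, the "*" mask, and the function name apply_replacements are assumptions for illustration; the source only states that preset desensitization characters are substituted.

```python
def apply_replacements(lines, coordinates, mask="*"):
    """Replace each (line number, offset, length) span with mask characters."""
    result = list(lines)  # leave the caller's list untouched
    for line_number, offset, length in coordinates:
        line = result[line_number]
        result[line_number] = (line[:offset] + mask * length
                               + line[offset + length:])
    return result
```

Replacing each span with a mask of the same length keeps all remaining offsets on that line valid, so the coordinates can be applied in any order and overlapping spans cause no shift.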
The method of desensitizing textual information of a referee document according to the invention has been described in detail above with reference to the accompanying drawings. The invention has the following advantages. When the referee document file is preprocessed, the document structure is rearranged, which simplifies the subsequent extraction of sensitive words, and removing useless characters reduces the influence of unnecessary characters on the desensitization result. The recognition rules are designed manually from the specific format and position in which each kind of sensitive word appears in referee documents, which greatly reduces the time that would otherwise be spent building and training a model and lowers the complexity of the method. For referee document files that conform to the required format, the corresponding information can be matched accurately, the dependence on data that a purely statistical method would have is reduced, and the final desensitization effect is improved. Because the position coordinates of all sensitive information are stored, text desensitization replacement can be carried out iteratively line by line, so the desensitization result is obtained relatively efficiently and the corresponding set of desensitized documents is output accurately. This fits the real working scenario in which court trial management personnel must desensitize referee documents before publishing them online. When the method is applied to a new referee document, the desensitization rules can be adapted to the format and structure of that document.
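The preprocessing and rule-based extraction stages summarized above (restructuring lines to end with a period, stripping blanks, regular-expression matching, then feature-word screening) can be sketched roughly as follows. Everything specific here is an assumption for illustration: the pattern `PARTY_PATTERN`, the feature words in `EXCLUDE_FEATURE_WORDS`, and the function names are hypothetical and do not come from the patent, and only the filtering half of the feature-word step is shown.

```python
import re
from typing import List

# Hypothetical rule: matches a party name that follows "Plaintiff:" or "Defendant:".
PARTY_PATTERN = re.compile(r"(?:Plaintiff|Defendant)[:：]\s*([^,.;]+)")
# Hypothetical feature words: candidates containing them are screened out.
EXCLUDE_FEATURE_WORDS = ("Company",)


def preprocess(text: str) -> List[str]:
    """Re-split the document so each line ends with a period, and strip blanks."""
    collapsed = re.sub(r"\s+", " ", text)
    # Zero-width split after each period (requires Python 3.7+).
    sentences = re.split(r"(?<=[.。])", collapsed)
    return [s.strip() for s in sentences if s.strip()]


def extract_sensitive_words(lines: List[str]) -> List[str]:
    """Match candidates with the regex, then screen them by feature words."""
    words: List[str] = []
    for line in lines:
        for match in PARTY_PATTERN.finditer(line):
            candidate = match.group(1).strip()
            if any(fw in candidate for fw in EXCLUDE_FEATURE_WORDS):
                continue  # screened out by a feature word
            if candidate not in words:
                words.append(candidate)
    return words
```

In this sketch a new document format would be handled by editing the pattern and the feature-word tuple, which mirrors the idea of adapting the rules rather than retraining a model.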
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. Also, a detailed description of known process techniques is omitted herein for the sake of brevity. The present embodiments are to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (6)

1. A method of desensitizing textual information of a referee document, comprising the steps of:
step (1), preprocessing user input;
step (2), preprocessing the referee document;
step (3), applying desensitization rules;
step (4), hiding the sensitive information;
step (5), outputting the desensitized document.
2. The method for desensitizing textual information of a referee document according to claim 1, characterized in that the user input preprocessing in step (1) comprises the specific sub-steps of:
step (1.1) filtering the desensitization items requested by the user against the full set of desensitization information items that the background has configured as available for desensitization;
step (1.2) generating a corresponding folder in the system background, and receiving and storing in that folder the one or more referee document files uploaded by the user for subsequent desensitization.
3. The method for desensitizing textual information of a referee document according to claim 1, wherein the preprocessing of the referee document files uploaded by the user in step (2) comprises the specific sub-steps of:
step (2.1) readjusting the document structure to make sensitive information easier to extract, wherein the specific operations include, but are not limited to, adjusting the text so that each line ends with a period and re-dividing the paragraph structure;
step (2.2) removing useless characters, such as those at the beginning and end of sentences and the blank spaces at the beginning and end of paragraphs.
4. The method for desensitizing textual information of a referee document according to claim 1, characterized in that in step (3), the desensitization rules set by the background for the different kinds of desensitization information are applied: the referee document is matched with the corresponding regular expressions, the matches are further screened and eliminated according to the specified feature words of the referee document, and finally a position coordinate set of all sensitive information in the referee document file is generated. The specific sub-steps are as follows:
step (3.1) processing the referee document according to the items to be desensitized input by the user; because different desensitization items require different information to be processed, copying the referee document into multiple copies in memory, and then starting a separate thread for each information item to be desensitized to carry out its specific processing;
step (3.2) performing text matching for each information item to be desensitized using the corresponding regular expression, which is designed in advance according to the writing formats and conventions commonly found in existing referee documents;
step (3.3) retaining or filtering out the text matched by the regular expression in the previous step according to the preset feature words corresponding to the sensitive information, and truncating or cutting the text according to the keywords;
step (3.4) taking the sensitive information extracted after regular-expression matching and feature-word processing as sensitive words, searching the entire referee document for them, storing the line number and offset of each hit as a position coordinate, and finally constructing a position coordinate set covering all information items to be desensitized in the referee document.
5. The method for desensitizing textual information of a referee document according to claim 1, characterized in that in step (4), the position coordinates of each piece of sensitive information are obtained from the position coordinate set of all information items to be desensitized in the referee document obtained in the previous step, so that text desensitization replacement is performed at the corresponding position in the referee document file. The specific sub-steps are as follows:
step (4.1) reading the original referee document file into memory line by line, and storing all text lines as a set of text-line strings;
step (4.2) taking the position coordinates of each piece of sensitive information from the position coordinate set in turn, and processing each text line of the referee document file;
step (4.3) performing text desensitization replacement on the corresponding content of the corresponding text lines according to the position coordinates of the desensitization item, that is, replacing the specified content with the preset desensitization characters.
6. A method of desensitizing textual information of a referee document according to claim 1, wherein step (5) outputs a desensitized document. The final output referee document file is still in the original file format uploaded by the user, and the specific content of the referee document file is consistent with the original file except that the sensitive information position is replaced by desensitization characters.
CN202011036947.XA 2020-09-27 2020-09-27 Text information desensitization method for referee document Pending CN112149180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011036947.XA CN112149180A (en) 2020-09-27 2020-09-27 Text information desensitization method for referee document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011036947.XA CN112149180A (en) 2020-09-27 2020-09-27 Text information desensitization method for referee document

Publications (1)

Publication Number Publication Date
CN112149180A true CN112149180A (en) 2020-12-29

Family

ID=73895868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011036947.XA Pending CN112149180A (en) 2020-09-27 2020-09-27 Text information desensitization method for referee document

Country Status (1)

Country Link
CN (1) CN112149180A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363016A (en) * 2021-12-20 2022-04-15 浙江大学 Privacy protection flow detection method based on keywords

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763483A (en) * 2018-05-25 2018-11-06 南京大学 A kind of Text Information Extraction method towards judgement document
CN111553318A (en) * 2020-05-14 2020-08-18 北京华宇元典信息服务有限公司 Sensitive information extraction method, referee document processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
US9342505B2 (en) Translation protocol for large discovery projects
US7369701B2 (en) Automated docketing system
CN112613501A (en) Information auditing classification model construction method and information auditing method
US20100205020A1 (en) System and method for establishing, managing, and controlling the time, cost, and quality of information retrieval and production in electronic discovery
CN107291780A (en) A kind of user comment information methods of exhibiting and device
AU2019366169B2 (en) Sensitive data detection and replacement
US6374270B1 (en) Corporate disclosure and repository system utilizing inference synthesis as applied to a database
CN111310446A (en) Information extraction method and device for referee document
CN110599289A (en) Method for formatting official document
CN111160345A (en) Intelligent enterprise contract generation system and method
CN112149180A (en) Text information desensitization method for referee document
CN115116082A (en) One-key filing system based on OCR recognition algorithm
Lam et al. Applying large language models for enhancing contract drafting
Balk et al. IMPACT: centre of competence in text digitisation
Pollitt et al. Exploring big haystacks: Data mining and knowledge management
CN111709221A (en) Document generation method and system
CN111428041A (en) Case abstract generation method, device, system and storage medium
CN115908062A (en) Intellectual property full-period management system
CN111813947A (en) Automatic generation method and device for court inquiry synopsis
CN115858470A (en) Policy and regulation file matching method, system, server and storage medium
CN110766091B (en) Method and system for identifying trepanning loan group partner
US20220270008A1 (en) Systems and methods for enhanced risk identification based on textual analysis
CN114495138A (en) Intelligent document identification and feature extraction method, device platform and storage medium
Gronvall et al. An Empirical Study of the Application of Machine Learning and Keyword Terms Methodologies to Privilege-Document Review Projects in Legal Matters
CN112183032A (en) Text processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20201229)