CN117454426A - Method, device and system for desensitizing and collecting information of claim settlement data

Info

Publication number
CN117454426A
Authority
CN
China
Prior art keywords
data
desensitization
module
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311549869.7A
Other languages
Chinese (zh)
Inventor
张博
徐伟男
周明强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai China Health Connect Technology Co ltd
Original Assignee
Shanghai China Health Connect Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai China Health Connect Technology Co ltd
Priority to CN202311549869.7A
Publication of CN117454426A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274 Syntactic or semantic context, e.g. balancing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Business, Economics & Management (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Medical Informatics (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of data identification, and in particular to a method, a device and a system for desensitizing and collecting information of claim settlement data. The method comprises the following steps: based on the original insurance claim data, features are extracted with a deep learning image text recognition technology, and sequence features are recognized with a long short-term memory (LSTM) network, so that a text recognition data set is generated. In the invention, deep learning image text recognition and the LSTM network extract the features of insurance claim data accurately; a pre-trained model based on the Transformers architecture performs high-quality semantic analysis and privacy information labeling, so that private data is identified accurately; a support vector machine provides a scientific and reliable basis for privacy information risk assessment; OpenCV image processing enables real-time, efficient masking of private data; and a graph neural network simplifies the association of privacy information across documents. Together these provide comprehensive, multidimensional privacy protection for insurance claim data and help ensure data security.

Description

Method, device and system for desensitizing and collecting information of claim settlement data
Technical Field
The invention relates to the technical field of data identification, in particular to a method, a device and a system for desensitizing and collecting information of claim settlement data.
Background
The field of data recognition covers techniques and methods for recognizing, extracting and processing different types of data, such as text, images and sound. A central concern in this field is how to efficiently identify, classify, analyze and process data in order to obtain useful information from it. Data desensitization is one important application area; it addresses how to collect sensitive data while preserving personal privacy.
A method for desensitizing and collecting claim settlement information is a specific data collection process, typically used when gathering claim materials in the insurance business. Personal information, health data and other sensitive information must be collected to assess the validity and amount of an insurance claim, but this information must be desensitized so that personal privacy is not violated. Data desensitization means replacing sensitive information in the original data with non-sensitive data, for example replacing a real name with a code or a detailed address with geographic coordinates, so as to protect the privacy of the individual concerned. Its main purpose is to protect the privacy of insured persons when claim information is collected. Desensitization helps ensure that sensitive information is not leaked or abused during data collection and transmission, which helps maintain customer trust, comply with relevant privacy regulations, and reduce the risk of data leakage.
In actual use, existing methods for desensitizing and collecting claim settlement information do not make full use of deep learning and natural language processing, so text recognition and privacy data identification are not accurate enough. They also lack a scientific, systematic means of evaluation when deciding on a desensitization strategy, which easily leads to leakage of private data or to excessive desensitization. In cross-document privacy information association analysis they do not use techniques such as graph neural networks, so the accuracy and efficiency of association analysis need improvement. Finally, existing verification processes depend too heavily on manual work and lack automated verification, which prolongs processing time and introduces errors caused by human oversight.
Disclosure of Invention
The invention aims to solve the defects in the prior art and provides a method, a device and a system for desensitizing and collecting information of claim settlement data.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a method for desensitizing and collecting information of claim settlement data, comprising the following steps:
S1: based on original insurance claim data, performing feature extraction with a deep learning image text recognition technology, and recognizing sequence features with a long short-term memory (LSTM) network, to generate a text recognition data set;
S2: based on the text recognition data set, performing semantic analysis with a natural language processing algorithm using a pre-trained model based on the Transformers architecture, and labeling privacy information, to generate privacy information labeling data;
S3: based on the privacy information labeling data, performing risk assessment of the privacy information with a support vector machine and determining a desensitization level, to generate desensitization strategy decision data;
S4: based on the desensitization strategy decision data, masking the privacy information with an image processing method, using the image blurring and masking techniques in the OpenCV library, to generate preliminary desensitization data;
S5: based on the preliminary desensitization data, linking and analyzing information about the same person in different documents with a graph neural network technology, to generate cross-document desensitization association data;
S6: based on the cross-document desensitization association data, verifying the desensitization effect with a data consistency verification technology, combining an automated script and a manual review process, to generate a desensitized insurance claim data set;
wherein the text recognition data set comprises the text content and the set of positions of the recognized text in the picture; the privacy information labeling data comprises the text, category and position of the privacy information in the document; the desensitization strategy decision data specifies the desensitization level and processing mode of each piece of privacy information; the preliminary desensitization data is the processed claim data in which the privacy information has been blurred or covered; and the cross-document desensitization association data comprises the associations and desensitization states of the same individual's privacy information across multiple claim documents.
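The intermediate data sets listed above can be pictured as simple records. The following is a minimal sketch only: every class name, field name, category string and the threshold rule in `decide` are assumptions made for illustration, since the source states what each data set contains but not how it is stored.

```python
from dataclasses import dataclass
from typing import Tuple

# All names below are hypothetical; the source only lists the contents
# of each data set, not a concrete representation.
BBox = Tuple[int, int, int, int]  # (x, y, width, height) in image pixels


@dataclass
class TextItem:
    """One entry of the text recognition data set (output of S1)."""
    text: str
    bbox: BBox


@dataclass
class PrivacyLabel:
    """One entry of the privacy information labeling data (output of S2)."""
    text: str
    category: str  # assumed category names, e.g. "name", "id_number"
    bbox: BBox


@dataclass
class DesensitizationDecision:
    """One entry of the desensitization strategy decision data (output of S3)."""
    label: PrivacyLabel
    level: int   # assumed convention: higher means more sensitive
    method: str  # assumed values: "blur" or "mask"


def decide(label: PrivacyLabel, level: int) -> DesensitizationDecision:
    # Placeholder rule standing in for the SVM-based assessment of S3.
    method = "mask" if level >= 2 else "blur"
    return DesensitizationDecision(label=label, level=level, method=method)


item = TextItem(text="Zhang San", bbox=(40, 12, 96, 20))
label = PrivacyLabel(text=item.text, category="name", bbox=item.bbox)
decision = decide(label, level=2)
```

Carrying the bounding box from S1 through to S3 is one way to let the image processing step S4 find the pixels to blur or cover without re-running recognition.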
As a further scheme of the invention, the step of performing feature extraction on the original insurance claim data with a deep learning image text recognition technology, and recognizing sequence features with a long short-term memory network to generate a text recognition data set, specifically comprises:
S101: based on the original insurance claim data image, performing feature extraction with a deep convolutional neural network algorithm, to generate a preliminary feature map;
S102: based on the preliminary feature map, locating text regions with a region proposal network algorithm and adjusting them by bounding-box regression, to generate a text region positioning result;
S103: based on the text region positioning result, recognizing the text content and classifying characters with an optical character recognition technology, to generate a text content recognition result;
S104: based on the text content recognition result, optimizing the recognition of sequence features with a long short-term memory network and processing context information, to generate the text recognition data set;
wherein the deep convolutional neural network uses a VGG or ResNet model for deep feature extraction from images; the region proposal network is the RPN of Faster R-CNN, used to propose image regions containing text; the optical character recognition technology uses the Tesseract OCR framework to recognize and convert text information in images; and the long short-term memory network uses a BiLSTM model to handle sequential dependencies in the text data.
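The bounding-box regression adjustment in S102 can be illustrated with the box parameterization commonly used with Faster R-CNN. This is a sketch under that assumption; the source does not state which parameterization it uses, and the function names and sample numbers are hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def apply_box_deltas(anchor: Box, deltas: Box) -> Box:
    """Refine an anchor box with predicted regression deltas (dx, dy, dw, dh)."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    # Center offsets are scaled by the anchor size; sizes are scaled
    # exponentially so that the refined width and height stay positive.
    return (xa + dx * wa, ya + dy * ha, wa * math.exp(dw), ha * math.exp(dh))


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert (center_x, center_y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A wide, short anchor, as is typical for a line of text (made-up values).
anchor = (100.0, 50.0, 64.0, 16.0)
refined = apply_box_deltas(anchor, (0.25, 0.0, 0.0, 0.0))
```

Here a delta of 0.25 in x shifts the box center by a quarter of the anchor width, while zero size deltas leave the width and height unchanged.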
As a further scheme of the invention, the step of performing semantic analysis on the text recognition data set with a natural language processing algorithm using a pre-trained model based on the Transformers architecture, and labeling privacy information to generate privacy information labeling data, specifically comprises:
S201: based on the text recognition data set, performing semantic encoding with a pre-trained model based on the Transformers architecture, to generate a text semantic encoding result;
S202: based on the text semantic encoding result, recognizing and classifying entities with a named entity recognition technology, to generate an entity recognition result;
S203: based on the entity recognition result, dividing the entities into privacy classes with a text classification algorithm, to generate a privacy class classification result;
S204: based on the privacy class classification result, applying a data desensitization technology to label the sensitive information, to generate the privacy information labeling data;
wherein the pre-trained model of the Transformers architecture is a BERT, GPT or RoBERTa model used for deep semantic understanding of text; the named entity recognition technology applies a BiLSTM-CRF model to recognize entities in the text, including person names and place names; the text classification algorithm uses a support vector machine or a deep neural network algorithm to classify text data; and the data desensitization technology applies data masking, data obfuscation or a differential privacy method to the recognized sensitive information.
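Of the desensitization techniques named above, data masking is the simplest to show. The sketch below assumes that S202 and S203 have already produced labeled character spans; the span format, the `keep` parameter and the sample text are all hypothetical and not taken from the source.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, category); end is exclusive


def mask_spans(text: str, spans: List[Span], keep: int = 1, fill: str = "*") -> str:
    """Overwrite each labeled span with fill characters, keeping its first `keep` characters."""
    chars = list(text)
    for start, end, _category in spans:
        for i in range(start + keep, end):
            chars[i] = fill
    return "".join(chars)


# Made-up example text with two labeled spans: a name and a phone number.
text = "Insured: Zhang San, phone 13800001234"
spans = [(9, 18, "name"), (26, 37, "phone")]
masked = mask_spans(text, spans)
print(masked)
```

Masking in place keeps the string length unchanged, so the character positions recorded in the labeling data remain valid after desensitization.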
As a further scheme of the invention, the step of performing risk assessment of the privacy information with a support vector machine based on the privacy information labeling data, and determining a desensitization level to generate desensitization strategy decision data, specifically comprises:
S301: based on the privacy information labeling data, standardizing the data with data cleaning and feature encoding algorithms and converting the feature format, to generate preprocessed risk assessment data;
S302: based on the preprocessed risk assessment data, reducing dimensionality and extracting key features with principal component analysis, to generate screened key feature data;
S303: based on the screened key feature data, training a model with a support vector machine algorithm, including use of the kernel trick and soft margin maximization, to generate a trained SVM model;
S304: based on the trained SVM model, performing risk assessment, determining the sensitivity of the data, and determining the desensitization level according to that sensitivity, to generate the desensitization strategy decision data;
wherein the feature encoding converts text data into numerical data that a machine learning algorithm can process; principal component analysis converts correlated variables into a set of linearly uncorrelated variables by orthogonal transformation; the kernel trick maps the data into a high-dimensional space; and soft margin maximization allows some data points to fall inside the margin or to be misclassified.
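The two SVM ingredients named above can be illustrated without training a model. The sketch below assumes an RBF kernel (the source does not say which kernel is used) and shows the slack value that soft margin maximization assigns to a point; the function names and sample scores are hypothetical. In practice a library implementation such as scikit-learn's `SVC` would do the training.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float = 0.5) -> float:
    """RBF kernel: similarity of x and z in an implicit high-dimensional space."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def slack(y: int, score: float) -> float:
    """Slack xi = max(0, 1 - y * f(x)) for a label y in {-1, +1} and decision score f(x)."""
    return max(0.0, 1.0 - y * score)


def describe(y: int, score: float) -> str:
    """Classify a point by where it lies relative to the margin."""
    margin = y * score
    if margin >= 1.0:
        return "outside margin"   # xi == 0, no penalty
    if margin > 0.0:
        return "inside margin"    # 0 < xi < 1, correct side but penalized
    return "misclassified"        # xi >= 1, on the boundary or the wrong side


# Made-up decision scores for three points labeled +1 (high risk).
examples = [(1, 2.0), (1, 0.5), (1, -0.5)]
for y, score in examples:
    print(score, slack(y, score), describe(y, score))
```

The soft margin objective trades the total slack against the margin width, which is why a few points inside the margin or on the wrong side are tolerated.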
As a further scheme of the invention, the step of masking the privacy information with an image processing method based on the desensitization strategy decision data, using the image blurring and masking techniques in the OpenCV library to generate preliminary desensitization data, specifically comprises:
S401: based on the desensitization strategy decision data, identifying and locating the regions to be desensitized with an image segmentation technology, to generate mask region positioning data;
S402: based on the mask region positioning data, blurring the sensitive regions with a Gaussian blur or mean blur method, to generate blurred image data;
S403: based on the blurred image data, covering the sensitive information with an image masking technology by adding a layered mask overlay, to generate masked image data;
S404: based on the masked image data, performing image quality evaluation to confirm the desensitization effect while maintaining the clarity of non-sensitive regions, to generate the preliminary desensitization data;
wherein the image segmentation technology comprises threshold segmentation and region growing, and is used to distinguish different objects in the image; the Gaussian blur convolves the image with a Gaussian function; and the image masking technology uses a binary image to locally cover the target image.
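The blur and mask steps S402 and S403 can be sketched on a tiny grayscale image held as nested lists. This pure-Python version is for illustration only and is not the source's implementation: the function names and the rectangle format are assumptions, and with OpenCV the same effect would normally come from `cv2.blur` or `cv2.GaussianBlur` applied to an image region.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixel values 0..255
Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def mean_blur_region(img: Image, rect: Rect, k: int = 3) -> Image:
    """Return a copy of img with a k x k mean blur applied inside rect only."""
    x0, y0, w, h = rect
    rows, cols = len(img), len(img[0])
    r = k // 2
    out = [row[:] for row in img]
    for y in range(y0, min(y0 + h, rows)):
        for x in range(x0, min(x0 + w, cols)):
            # Average the neighborhood, clipped at the image border.
            window = [img[j][i]
                      for j in range(max(0, y - r), min(rows, y + r + 1))
                      for i in range(max(0, x - r), min(cols, x + r + 1))]
            out[y][x] = sum(window) // len(window)
    return out


def apply_mask(img: Image, rect: Rect, value: int = 255) -> Image:
    """Return a copy of img with rect covered by a solid value (a binary mask: covered or not)."""
    x0, y0, w, h = rect
    out = [row[:] for row in img]
    for y in range(y0, min(y0 + h, len(img))):
        for x in range(x0, min(x0 + w, len(img[0]))):
            out[y][x] = value
    return out


img = [[0, 0, 0, 0],
       [0, 90, 90, 0],
       [0, 90, 90, 0],
       [0, 0, 0, 0]]
sensitive = (1, 1, 2, 2)
blurred = mean_blur_region(img, sensitive)
masked = apply_mask(blurred, sensitive)
```

Both functions leave pixels outside the rectangle untouched, which corresponds to the requirement in S404 that non-sensitive regions stay clear.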
As a further scheme of the invention, the step of linking and analyzing information about the same person in different documents with a graph neural network technology based on the preliminary desensitization data, to generate cross-document desensitization association data, specifically comprises:
S501: based on the preliminary desensitization data, extracting node embedding features with a graph neural network, to generate an individual feature representation set;
S502: based on the individual feature representation set, aggregating neighbor information with a graph convolutional network, to generate an enhanced individual connection graph;
S503: based on the enhanced individual connection graph, performing association weight analysis with a graph attention network, to generate an association weight analysis result;
S504: based on the association weight analysis result, establishing person-based information links with a graph isomorphism network, to generate the cross-document desensitization association data;
wherein the individual feature representation set is a vectorized representation of multiple individuals in the documents, and comprises text content and structural information.
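The neighbor aggregation in S502 can be sketched as one round of mean aggregation over a small graph. This is a simplification made for illustration: an actual graph convolutional layer also applies a learned weight matrix and a nonlinearity, and the node names, features and edge below are made up.

```python
from typing import Dict, List, Tuple


def aggregate(features: Dict[str, List[float]],
              edges: List[Tuple[str, str]]) -> Dict[str, List[float]]:
    """One mean-aggregation step: each node averages its own and its neighbors' features."""
    # Start every node with a self-loop so that it keeps part of its own signal.
    neighbors = {node: {node} for node in features}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    out = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        out[node] = [sum(features[m][d] for m in group) / len(group)
                     for d in range(dim)]
    return out


# Hypothetical nodes: one name mention per claim document.
features = {
    "doc1:name": [1.0, 0.0],
    "doc2:name": [0.0, 1.0],
    "doc3:name": [4.0, 4.0],
}
edges = [("doc1:name", "doc2:name")]  # assumed candidate link between two mentions
h = aggregate(features, edges)
```

After aggregation the two linked mentions have identical representations, while the unlinked one is unchanged, which is the effect that makes linked mentions easier to associate in later steps.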
As a further scheme of the invention, the step of verifying the desensitization effect with a data consistency verification technology based on the cross-document desensitization association data, combining an automated script and a manual review process, to generate a desensitized insurance claim data set, specifically comprises:
S601: based on the cross-document desensitization association data, performing a preliminary consistency check with data hash matching, to generate a preliminary verification result;
S602: based on the preliminary verification result, detecting differences with a comparative analysis technology, to generate a refined consistency verification report;
S603: based on the refined consistency verification report, checking database rules with an automated script, to generate an automatic check result;
S604: based on the automatic check result, confirming the effect through a manual review process, to generate the desensitized insurance claim data set.
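One possible reading of the hash matching in S601 is that the desensitized value of the same individual should be identical in every document, which can be checked by comparing digests. The sketch below follows that assumed reading; the record fields, the person identifiers and the choice of SHA-256 are not taken from the source.

```python
import hashlib
from typing import Dict, List


def digest(value: str) -> str:
    """SHA-256 hex digest of a desensitized field value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def find_inconsistent(records: List[Dict[str, str]]) -> List[str]:
    """Return the person IDs whose desensitized value differs between documents."""
    first_seen: Dict[str, str] = {}
    inconsistent: List[str] = []
    for rec in records:
        h = digest(rec["desensitized"])
        # Remember the first digest per person and compare later ones against it.
        expected = first_seen.setdefault(rec["person_id"], h)
        if expected != h and rec["person_id"] not in inconsistent:
            inconsistent.append(rec["person_id"])
    return inconsistent


# Hypothetical cross-document association records.
records = [
    {"person_id": "P1", "doc": "claim_a", "desensitized": "Z********"},
    {"person_id": "P1", "doc": "claim_b", "desensitized": "Z********"},
    {"person_id": "P2", "doc": "claim_a", "desensitized": "L****"},
    {"person_id": "P2", "doc": "claim_c", "desensitized": "Li Si"},
]
print(find_inconsistent(records))
```

A mismatch such as the one for `P2` would then be passed to the difference detection of S602 and, if unresolved, to the manual review of S604.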
A device for desensitizing and collecting information of claim settlement data comprises a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of the above method for desensitizing and collecting information of claim settlement data when executing the computer program.
A system for desensitizing and collecting information of claim settlement data is used to execute the above method and comprises a feature extraction module, a semantic recognition module, a data preprocessing module, a desensitization strategy decision module, an image processing module, a relevance analysis module and a result verification module;
the data preprocessing module, based on the privacy information labeling data, performs data cleaning and feature encoding, and extracts key features with principal component analysis, to generate screened key feature data;
the desensitization strategy decision module, based on the screened key feature data, trains a model with a support vector machine algorithm and performs risk assessment, to generate desensitization strategy decision data;
the image processing module, based on the desensitization strategy decision data, locates the regions to be desensitized with an image segmentation technology, applies Gaussian blur or mean blur, covers the privacy information with an image masking technology, and generates preliminary desensitization data after image quality evaluation;
the relevance analysis module, based on the preliminary desensitization data, extracts features with a graph neural network, aggregates information with a graph convolutional network, analyzes weights with a graph attention network, and establishes information links with a graph isomorphism network, to generate cross-document desensitization association data;
the result verification module, based on the cross-document desensitization association data, verifies the validity of the desensitization result, performs data verification with a machine learning classification algorithm and performance evaluation with an evaluation index algorithm, and generates the desensitized data.
As a further scheme of the invention, the feature extraction module comprises a feature map generation sub-module, a text region positioning sub-module, a text content identification sub-module and a sequence feature optimization sub-module;
the semantic recognition module comprises a semantic coding sub-module, an entity recognition and classification sub-module, a privacy class dividing sub-module and a privacy information labeling sub-module;
the data preprocessing module comprises a data normalization sub-module, a feature conversion sub-module, a dimension reduction sub-module and a key feature extraction sub-module;
the desensitization strategy decision module comprises a model training sub-module, a kernel trick application sub-module, a soft margin maximization sub-module and a risk assessment sub-module;
the image processing module comprises a region positioning sub-module, a blurring processing sub-module, a mask application sub-module and a quality evaluation sub-module;
the relevance analysis module comprises a feature extraction sub-module, an information aggregation sub-module, a weight analysis sub-module and an information link sub-module;
the result verification module comprises a preliminary verification sub-module, a difference detection sub-module, a database checking sub-module and a manual re-verification sub-module.
Compared with the prior art, the invention has the advantages and positive effects that:
The invention uses a deep learning image text recognition technology to extract features from the original insurance claim data, and combines it with a long short-term memory network to recognize continuous sequence features accurately. A pre-trained model based on the Transformers architecture then performs high-quality semantic analysis and privacy information labeling, ensuring accurate identification of private data. The support vector machine puts the risk assessment of privacy information on a scientific and reliable footing and, combined with OpenCV image processing, provides real-time and efficient masking of private data. The graph neural network technology makes cross-document association of privacy information simple and efficient, and provides a multidimensional guarantee for the overall privacy protection of insurance claim data.
Drawings
FIG. 1 is a schematic workflow diagram of the present invention;
FIG. 2 is a detailed flowchart of step S1 of the present invention;
FIG. 3 is a detailed flowchart of step S2 of the present invention;
FIG. 4 is a detailed flowchart of step S3 of the present invention;
FIG. 5 is a detailed flowchart of step S4 of the present invention;
FIG. 6 is a detailed flowchart of step S5 of the present invention;
FIG. 7 is a detailed flowchart of step S6 of the present invention;
FIG. 8 is a system flow diagram of the present invention;
FIG. 9 is a schematic diagram of a system framework of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In the description of the present invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present invention and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, in the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Example 1
Referring to FIG. 1, the present invention provides a technical solution: a method for desensitizing and collecting information of claim settlement data, comprising the following steps:
S1: based on the original insurance claim data, performing feature extraction by adopting a deep learning image text recognition technology, and recognizing sequence features in combination with a long short-term memory network, to generate a text recognition data set;
S2: based on the text recognition data set, adopting a natural language processing algorithm, performing semantic analysis through a pre-training model based on the Transformers architecture, and labeling privacy information, to generate privacy information labeling data;
S3: based on the privacy information labeling data, performing risk assessment of the privacy information by adopting a support vector machine, and determining a desensitization level, to generate desensitization strategy decision data;
S4: based on the desensitization strategy decision data, adopting an image processing method and masking the privacy information through the image blurring and masking techniques in the OpenCV library, to generate preliminary desensitization data;
S5: based on the preliminary desensitization data, linking and analyzing the information of the same person in different documents by adopting a graph neural network technology, to generate cross-document desensitization association data;
S6: based on the cross-document desensitization association data, adopting a data consistency verification technology and combining an automation script with a manual review process to verify the desensitization effect, to generate a desensitized insurance claim data set.
The text recognition data set specifically comprises the text content and the set of its positions in the picture; the privacy information labeling data comprises the text, category and position of the privacy information in the document; the desensitization policy decision data specifies the desensitization level and processing mode of each piece of privacy information; the preliminary desensitization data is the processed claim data in which the privacy information has been blurred or covered; and the cross-document desensitization association data comprises the associations and desensitization states of the privacy information of the same individual across multiple claim documents.
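The data products listed above can be pictured as simple record types. The following is a minimal sketch only; the class names, field names, the `(x1, y1, x2, y2)` box convention and the example category strings are assumptions made for illustration and are not defined in the original text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: (x1, y1, x2, y2) in image pixels.
Box = Tuple[int, int, int, int]

@dataclass
class TextItem:
    """One entry of the text recognition data set."""
    text: str
    box: Box

@dataclass
class PrivacyLabel:
    """One entry of the privacy information labeling data."""
    text: str
    category: str      # illustrative values: "name", "id_number"
    box: Box

@dataclass
class DesensitizationDecision:
    """One entry of the desensitization policy decision data."""
    label: PrivacyLabel
    level: str         # illustrative values: "low", "medium", "high"
    mode: str          # illustrative values: "blur", "mask"

@dataclass
class CrossDocumentLink:
    """One entry of the cross-document desensitization association data."""
    person_id: str
    document_ids: List[str] = field(default_factory=list)
    desensitized: bool = False
```

Keeping the position alongside the text in every record is what lets the later image steps find the pixels that correspond to a labeled piece of privacy information.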
The method automates the processing of large volumes of claim data, which improves efficiency and reduces cost. High-precision deep learning and natural language processing ensure accurate privacy information labeling and desensitization, so that sensitive data is effectively protected. Multi-level desensitization strategies and data association techniques make the data safer: sensitive information is difficult to restore even across multiple documents. Data consistency verification ensures the accuracy and consistency of the desensitized data and reduces the risk of data quality problems. The method helps companies comply with privacy and data security regulations, reducing legal risk. Most importantly, effective privacy protection strengthens customer trust in a company's data handling and improves customer loyalty.
Referring to FIG. 2, based on the original insurance claim data, feature extraction is performed by adopting a deep learning image text recognition technology, and sequence features are recognized in combination with a long short-term memory network; the steps of generating the text recognition data set are specifically as follows:
S101: based on the original insurance claim data image, adopting a deep convolutional neural network algorithm to perform feature extraction, to generate a preliminary feature map;
S102: based on the preliminary feature map, adopting a region proposal network algorithm to locate text regions, and performing bounding-box regression adjustment, to generate a text region positioning result;
S103: based on the text region positioning result, adopting an optical character recognition technology to recognize the text content and classify the characters, to generate a text content recognition result;
S104: based on the text content recognition result, adopting a long short-term memory network to optimize the recognition of sequence features, and performing context information processing, to generate the text recognition data set;
The deep convolutional neural network specifically uses a VGG or ResNet model to extract deep features of the image; the region proposal network specifically refers to using the RPN in Faster R-CNN to propose image regions containing text; the optical character recognition technology specifically uses the Tesseract OCR framework to recognize and convert the text information in the image; and the long short-term memory network specifically uses a BiLSTM model to handle the sequential dependencies of the text data.
In S101, an image of the original insurance claim material is imported into the system and stored in a digital image format. Next, features are extracted from the image using a deep convolutional neural network (typically a model such as VGG or ResNet). This process involves multiple convolution layers and pooling layers that capture high-level features of the image. The feature extraction produces a preliminary feature map, a data structure containing the various image features.
In S102, based on the preliminary feature map, a region proposal network (typically the RPN in Faster R-CNN) detects potential regions that may contain text. These region proposals are candidate image regions containing text. Bounding-box regression adjustment is then performed on the proposals to locate the text regions more precisely, reducing positioning errors.
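The bounding-box regression adjustment in S102 can be illustrated with the box parameterization commonly used with region proposal networks, in which a proposal is refined by four predicted offsets. This is a sketch under that assumption; the original text does not specify the parameterization, and the function names are hypothetical.

```python
import math

def apply_box_deltas(anchor, deltas):
    """Refine a box (cx, cy, w, h) with regression offsets (dx, dy, dw, dh)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    new_cx = cx + dx * w          # shift the centre in proportion to the box size
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)      # scale width and height in log space
    new_h = h * math.exp(dh)
    return (new_cx, new_cy, new_w, new_h)

def to_corners(box):
    """Convert (cx, cy, w, h) to corner form (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

With all offsets equal to zero the proposal is returned unchanged, so the regression only has to learn corrections relative to the proposal.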
In S103, each located text region is identified using an optical character recognition technique (e.g., the Tesseract OCR framework). This involves extracting text content from the image and performing character classification to convert the text content in the image into text data for further processing.
In S104, based on the text content recognition result, a long short-term memory network (such as a BiLSTM model) is used to handle the sequential dependencies of the text data. This includes identifying sequence features in the text, such as dates, amounts and names. Context information processing is performed to enhance the semantic understanding of the text content. A text recognition data set is generated that comprises the text content and the set of its positions in the picture, for use in the subsequent privacy information labeling and other processing steps.
Referring to FIG. 3, based on the text recognition data set, a natural language processing algorithm is adopted, semantic analysis is performed through a pre-training model based on the Transformers architecture, and privacy information is labeled; the steps of generating the privacy information labeling data are specifically as follows:
S201: based on the text recognition data set, performing semantic coding by adopting a pre-training model based on the Transformers architecture, to generate a text semantic coding result;
S202: based on the text semantic coding result, adopting a named entity recognition technology to recognize and classify entities, to generate an entity recognition result;
S203: based on the entity recognition result, adopting a text classification algorithm to divide privacy classes, to generate a privacy class classification result;
S204: based on the privacy class classification result, applying a data desensitization technology to label sensitive information, to generate the privacy information labeling data;
The pre-training model of the Transformers architecture specifically uses a BERT, GPT or RoBERTa model to perform deep semantic understanding of the text; the named entity recognition technology comprises applying a BiLSTM-CRF model to recognize entities in the text, including person names and place names; the text classification algorithm specifically uses a support vector machine or a deep neural network algorithm to classify the text data; and the data desensitization technology specifically applies a data mask, data masquerading or a differential privacy method to process the identified sensitive information.
In S201, the text recognition data set is deeply semantically encoded using a pre-training model based on the Transformers architecture (e.g., BERT, GPT or RoBERTa). That is, the text is converted into high-dimensional semantic vectors that capture its semantic information in a form the computer can process further. The encoding result of each sample is generated as a text semantic coding result.
In S202, the model marks various entities in the text and classifies them into various entity categories, such as person names, place names, and the like. This generates an entity identification result containing location and category information of the entity.
In S203, the text data is classified using a text classification algorithm, typically a support vector machine or a deep neural network, in combination with the text semantic coding result and the entity recognition result. The purpose of classification is to classify the text sample into various privacy classes, such as low risk, medium risk, or high risk classes. A privacy class classification result is generated, which contains privacy class information of the text sample.
In S204, according to the privacy class classification result, a data desensitization technology, such as a data mask, data masquerading or a differential privacy method, is applied to process the identified privacy information. This includes replacing, deleting or encrypting sensitive information so that privacy is preserved. Privacy information labeling data is generated; it contains the desensitized text information and can be further processed or shared without revealing sensitive information.
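As an illustration of the data mask option in S204, the sketch below replaces the characters of each identified span with a fill character while keeping a short prefix. The function name, the `keep` and `fill` parameters and the span format `(start, end)` are assumptions for illustration; the original text does not prescribe a masking format.

```python
def mask_spans(text, spans, keep=1, fill="*"):
    """Mask each (start, end) span, keeping its first `keep` characters."""
    chars = list(text)
    for start, end in spans:
        # Overwrite everything after the kept prefix, up to the span end.
        for i in range(start + keep, end):
            chars[i] = fill
    return "".join(chars)
```

Because the masked string has the same length as the input, character positions recorded in the labeling data remain valid after masking.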
Referring to FIG. 4, based on the privacy information labeling data, a support vector machine is adopted to perform risk assessment of the privacy information and determine the desensitization level; the steps of generating the desensitization strategy decision data are specifically as follows:
S301: based on the privacy information labeling data, adopting data cleaning and feature coding algorithms to perform data standardization and convert the feature format, to generate preprocessed risk assessment data;
S302: based on the preprocessed risk assessment data, adopting principal component analysis to reduce dimensionality and extract key features, to generate screened key feature data;
S303: based on the screened key feature data, performing model training by adopting a support vector machine algorithm, including using the kernel trick and soft margin maximization, to generate a trained SVM model;
S304: based on the trained SVM model, performing risk assessment, determining the sensitivity degree of the data, and determining the desensitization level according to the sensitivity degree, to generate the desensitization strategy decision data;
The feature coding specifically converts text data into numerical data that a machine learning algorithm can process; the principal component analysis converts correlated variables into a set of linearly uncorrelated variables through an orthogonal transformation; the kernel trick specifically maps the data into a high-dimensional space; and soft margin maximization allows some data points to lie inside the margin or to be misclassified.
In S301, data cleaning is performed on the privacy information labeling data to ensure data quality; erroneous or anomalous data is deleted or corrected. A feature coding algorithm then converts the text data into numerical data that a machine learning algorithm can process: features of the text data are extracted using a bag-of-words model, TF-IDF or a word embedding method. Preprocessed risk assessment data is generated, comprising the converted feature vectors and the related labels.
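To make the TF-IDF option in S301 concrete, the following is a minimal sketch using only the standard library. It uses the plain, unsmoothed definition (term frequency times the logarithm of the inverse document frequency) and assumes that documents are already tokenized and non-empty; library implementations often apply smoothing and normalization that are omitted here.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, vectors) for a list of tokenized, non-empty documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, vectors
```

A token that occurs in every document receives weight zero, which is why TF-IDF tends to emphasize the terms that distinguish one record from another.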
In S302, a dimensionality reduction technique such as principal component analysis (PCA) is applied to the preprocessed data to reduce its complexity. This helps extract the most important features in the data and reduces noise and redundant information. Screened key feature data is generated, typically the features that contribute most to explaining the variance of the data.
In S303, a support vector machine (SVM) algorithm is used to train a model on the screened key feature data. Training involves selecting an appropriate kernel function, such as a linear kernel, a polynomial kernel or a Gaussian kernel, to map the data into a high-dimensional space. A soft margin parameter must also be set; it determines the tolerance of the SVM model, allowing some data points to lie inside the margin or to be misclassified, which improves the generalization ability of the model. After training is completed, a trained SVM model is generated and used for risk assessment and desensitization level decisions.
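For reference, soft margin maximization as described in S303 is usually written in the following standard textbook form. The symbols (weight vector w, bias b, slack variables ξ_i, penalty parameter C, feature map φ, kernel k and kernel width γ) are notation introduced here for illustration and do not appear in the original text.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n

k(x, x') = \phi(x)^{\top}\phi(x'),\qquad
k_{\text{Gaussian}}(x, x') = \exp\bigl(-\gamma\,\lVert x - x'\rVert^{2}\bigr)
```

A slack value between 0 and 1 corresponds to a point inside the margin, and a value above 1 to a misclassified point; a larger C penalizes such violations more heavily and therefore lowers the tolerance of the model.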
In S304, risk assessment is performed with the trained SVM model: the input data is analyzed to determine its sensitivity degree, which includes calculating the distance of each data point from the decision boundary to determine its risk level. Based on the risk assessment result, a desensitization level is determined, such as low risk, medium risk or high risk. Desensitization strategy decision data is generated, containing suggestions on how to desensitize the different data points so that sensitive information is protected.
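The distance-based level decision in S304 can be sketched as follows for the simplest case of a linear kernel. The two thresholds and the level names returned are hypothetical choices made for illustration; the original text does not state how distances are mapped to levels.

```python
import math

def margin_distance(w, b, x):
    """Signed distance of x from the hyperplane w.x + b = 0 (linear kernel assumed)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return score / math.sqrt(sum(wi * wi for wi in w))

def desensitization_level(distance, medium=0.0, high=1.0):
    """Map a signed distance to a level; both thresholds are hypothetical."""
    if distance >= high:
        return "high"
    if distance >= medium:
        return "medium"
    return "low"
```

In this sketch a point far on the positive side of the boundary is treated as clearly sensitive, while a point near the boundary receives the intermediate level.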
Referring to FIG. 5, based on the desensitization strategy decision data, an image processing method is adopted and the privacy information is masked through the image blurring and masking techniques in the OpenCV library; the steps of generating the preliminary desensitization data are specifically as follows:
S401: based on the desensitization strategy decision data, adopting an image segmentation technology to identify and locate the desensitization region, to generate mask region positioning data;
S402: based on the mask region positioning data, blurring the sensitive region by adopting a Gaussian blur or mean blur method, to generate blurred image data;
S403: based on the blurred image data, adopting an image masking technology and adding an overlay masking layer to cover the sensitive information, to generate masked image data;
S404: based on the masked image data, performing image quality evaluation to ensure the desensitization effect while maintaining the sharpness of non-sensitive regions, to generate the preliminary desensitization data;
The image segmentation technology comprises threshold segmentation and region growing and is used to distinguish different objects in the image; the Gaussian blur specifically convolves the image with a Gaussian function; and the image masking technology specifically uses a binary image to locally occlude the target image.
In S401, the desensitization region is identified and located using image segmentation techniques. The following are some common image segmentation methods:
Threshold segmentation: a threshold is used to divide the image into different regions, e.g., to separate the pixels of the image into foreground and background.
import cv2
# Read the image as grayscale
image = cv2.imread('input_image.jpg', 0)
# Apply threshold segmentation (127 and 255 are example values)
threshold_value, max_value = 127, 255
ret, binary_image = cv2.threshold(image, threshold_value, max_value, cv2.THRESH_BINARY)
Region growth: starting from the seed pixel, adjacent pixels are merged into one region according to the similarity between pixels.
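import cv2
import numpy as np
# Read the image as grayscale
image = cv2.imread('input_image.jpg', 0)
# Region growing from a seed pixel; the seed, fill value and tolerances are example values
seed_point = (10, 10)
new_val = 255
# floodFill requires a mask two pixels larger than the image in each dimension
mask = np.zeros((image.shape[0] + 2, image.shape[1] + 2), dtype=np.uint8)
retval, filled_image, mask, rect = cv2.floodFill(image, mask, seed_point, new_val, loDiff=10, upDiff=10)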
Mask region location data is generated indicating which portions are regions that require desensitization.
In S402, based on the mask region positioning data, a gaussian blur or a mean blur method is used to blur the privacy region. This helps to hide sensitive information.
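import cv2
# Read the image
image = cv2.imread('input_image.jpg')
# Kernel size (must be odd for Gaussian blur) and sigma are example values
ksize_x, ksize_y, sigmaX = 15, 15, 0
# Gaussian blur
blurred_image = cv2.GaussianBlur(image, (ksize_x, ksize_y), sigmaX)
# Mean blur (alternative)
blurred_image = cv2.blur(image, (ksize_x, ksize_y))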
In S403, a mask layer is added using an image masking technique to cover the privacy information. A binary image is created in which the area to be desensitized is marked, and its inverse is applied to the original image so that the marked area is blacked out while the rest of the image is preserved.
import cv2
import numpy as np
# Read the image
image = cv2.imread('input_image.jpg')
# Create an empty single-channel mask with the same height and width as the image
mask = np.zeros(image.shape[:2], dtype=np.uint8)
# Mark the region where desensitization is required (example coordinates)
x1, y1, x2, y2 = 50, 50, 200, 100
cv2.rectangle(mask, (x1, y1), (x2, y2), 255, -1)
# Apply the inverted mask so that only the marked region is covered
masked_image = cv2.bitwise_and(image, image, mask=cv2.bitwise_not(mask))
In S404, image quality evaluation is performed to ensure desensitization effects and to maintain sharpness of non-privacy areas. Image quality may be evaluated using an image quality evaluation index, such as PSNR (peak signal-to-noise ratio) or SSIM (structural similarity index).
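import cv2
from skimage.metrics import structural_similarity as ssim
# Read the original image and the desensitized image
original_image = cv2.imread('original_image.jpg')
masked_image = cv2.imread('masked_image.jpg')
# Compute SSIM on grayscale versions, since ssim treats every axis as spatial unless a channel axis is given
original_gray = cv2.cvtColor(original_image, cv2.COLOR_BGR2GRAY)
masked_gray = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
ssim_score = ssim(original_gray, masked_gray)
# Compute PSNR
psnr = cv2.PSNR(original_image, masked_image)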
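To show what the PSNR index measures, the sketch below computes it directly from its definition, 10·log10(peak² / MSE), on two flat sequences of pixel values. The function name `psnr_db` is hypothetical, and the peak value of 255 assumes 8-bit images.

```python
import math

def psnr_db(original, processed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, processed)) / len(original)
    # Identical inputs have zero error, so the ratio is unbounded.
    return math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)
```

A higher value means the processed pixels are closer to the original ones, so for quality evaluation the index is most informative when computed over the non-sensitive regions only.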
Referring to FIG. 6, based on the preliminary desensitization data, a graph neural network technology is adopted to link and analyze the information of the same person in different documents; the steps of generating the cross-document desensitization association data are specifically as follows:
S501: based on the preliminary desensitization data, adopting a graph neural network to extract node embedding features, to generate an individual feature representation set;
S502: based on the individual feature representation set, adopting a graph convolutional network to aggregate neighbor information, to generate an enhanced individual connection graph;
S503: based on the enhanced individual connection graph, adopting a graph attention network to perform relevance weight analysis, to generate a relevance weight analysis result;
S504: based on the relevance weight analysis result, adopting a graph isomorphism network to establish person-based information links, to generate the cross-document desensitization association data;
The individual feature representation set is specifically a vectorized representation of the multiple individuals in the documents, including text content and structural information.
In S501, node embedded features are extracted using a graph neural network, and an individual feature representation set is generated. For this purpose, individual information in the document is converted into a vector representation, and text content and structural information are converted into numerical features using word embedding techniques. Nodes are created for each individual, and feature embedding of the nodes is extracted by using a node embedding algorithm of the graph neural network, so that an individual feature representation set is generated.
In S502, based on the individual feature representation set, neighbor information is aggregated and an enhanced individual connection graph is generated. This includes building a graph data structure in which each node represents an individual in a document and each edge represents an association between individuals. A graph convolutional network then aggregates the neighbor information of each node to obtain richer information, and the enhanced individual connection graph is generated.
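The neighbor aggregation idea in S502 can be sketched with a single mean-aggregation step over an undirected graph. This is a deliberate simplification: a graph convolutional network layer would additionally apply a learned weight matrix, a nonlinearity and its own normalization, none of which are modeled here, and the function name is hypothetical.

```python
def aggregate_neighbors(features, edges):
    """One mean-aggregation step: each node averages itself and its neighbours."""
    # Every node starts with a self-loop so that its own features are retained.
    neighbors = {node: {node} for node in features}
    for a, b in edges:                      # edges are treated as undirected
        neighbors[a].add(b)
        neighbors[b].add(a)
    out = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        out[node] = [sum(features[m][k] for m in group) / len(group)
                     for k in range(dim)]
    return out
```

After one step, two connected individuals have moved toward each other in feature space, while an isolated individual is unchanged.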
In S503, relevance weight analysis is performed and a relevance weight analysis result is generated. This involves analyzing the relevance weights using a method such as a graph attention network, which emphasizes the nodes with higher relevance. The relevance weights among the nodes are calculated to produce the relevance weight analysis result, which contains the relevance weight information among the nodes and reflects the relationships among persons in different documents.
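The weighting in S503 can be illustrated by normalizing similarity scores between a node and its neighbors with a softmax, so that the weights sum to one. Note the substitution: a graph attention network scores node pairs with a learned attention function, whereas this sketch uses a plain dot product as a stand-in; the function name is hypothetical.

```python
import math

def attention_weights(node_vec, neighbor_vecs):
    """Softmax over dot-product scores between a node and each neighbour."""
    scores = {name: sum(a * b for a, b in zip(node_vec, vec))
              for name, vec in neighbor_vecs.items()}
    top = max(scores.values())              # subtract the max for numerical stability
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}
```

Neighbors whose features resemble the node receive larger weights, which is the sense in which nodes with higher relevance are emphasized.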
In S504, person-based information links are established based on the relevance weight analysis result, and cross-document desensitization association data is generated. This includes linking the nodes that represent the same person, matched according to the relevance weights. The final cross-document desensitization association data includes the linked person information and the relevance weights between them, for further analysis such as cross-document person relationship analysis and privacy desensitization processing.
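Once pairwise relevance weights are available, the linking of nodes that represent the same person can be sketched as grouping nodes whose weight reaches a threshold. The sketch uses a union-find structure as a stand-in for the graph isomorphism network named in S504; the threshold value, the function name and the node naming scheme are assumptions for illustration.

```python
def link_same_person(nodes, weighted_pairs, threshold=0.8):
    """Group nodes whose pairwise relevance weight reaches the (hypothetical) threshold."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving keeps the trees shallow
            n = parent[n]
        return n

    for a, b, weight in weighted_pairs:
        if weight >= threshold:
            parent[find(a)] = find(b)       # merge the two groups

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

Each resulting group corresponds to one person across documents, so a desensitization decision taken for one member can be applied to the whole group.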
Referring to FIG. 7, based on the cross-document desensitization association data, a data consistency verification technology is adopted and an automation script is combined with a manual review process to verify the desensitization effect; the steps of generating the desensitized insurance claim data set are specifically as follows:
S601: based on the cross-document desensitization association data, performing a preliminary consistency verification by adopting data hash matching, to generate a preliminary verification result;
S602: based on the preliminary verification result, adopting a comparative analysis technology to perform difference detection, to generate a refined consistency check report;
S603: based on the refined consistency check report, adopting an automation script to perform a database rule check, to generate an automated check result;
S604: based on the automated check result, adopting a manual review process to confirm the effect, to generate the desensitized insurance claim data set.
In S601, a preliminary data consistency verification is performed on the cross-document desensitization association data by data hash matching, and a preliminary verification result is generated. The operation is as follows:
A data hash value is generated to represent each document or data set. In the preliminary verification, the data hash values are compared to verify the consistency of the data across different documents. A preliminary verification result is generated, indicating which documents or data sets pass the consistency check and which fail.
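The hash matching in S601 can be sketched with the standard `hashlib` module. The choice of SHA-256, the canonical `key=value` rendering and the record layout (a person identifier paired with a dictionary of desensitized fields) are assumptions made for illustration.

```python
import hashlib

def record_hash(fields):
    """SHA-256 over a canonical, order-independent rendering of a record's fields."""
    canonical = "\n".join(f"{key}={fields[key]}" for key in sorted(fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def preliminary_check(records):
    """Return {person_id: bool}: True when all of a person's records hash identically."""
    seen = {}
    for person_id, fields in records:
        seen.setdefault(person_id, set()).add(record_hash(fields))
    return {person_id: len(hashes) == 1 for person_id, hashes in seen.items()}
```

Sorting the keys before hashing means that two records with the same content but a different field order still match, so only genuine content differences fail the check.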
In S602, based on the preliminary verification result, a comparative analysis technique is used to perform difference detection, and a refined consistency check report is generated. The operation is as follows:
For documents or data sets that do not pass the preliminary verification, a comparative analysis tool compares their content to detect differences. A refined consistency check report is generated, containing detailed difference information to support further processing.
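The difference detection in S602 can be sketched with the standard `difflib` module, which reports the lines that differ between two text renderings. The function name and the line-per-field rendering are assumptions for illustration; the original text does not name a specific comparison tool.

```python
import difflib

def field_differences(reference, candidate):
    """List the removed ('-') and added ('+') lines between two text renderings."""
    lines = list(difflib.unified_diff(
        reference.splitlines(), candidate.splitlines(),
        fromfile="reference", tofile="candidate", lineterm=""))
    # The first two lines are the file headers; keep only changed content lines.
    return [line for line in lines[2:] if line.startswith(("+", "-"))]
```

In the usage below, the difference list pinpoints a phone number that was masked in one document but left in clear text in another, which is the kind of detail the refined report is meant to carry.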
In S603, based on the refined consistency check report, an automation script performs a database rule check and an automated check result is generated. The operation is as follows:
Based on the refined consistency check report, an automation script checks whether each document or data set complies with the predetermined database rules. An automated check result is generated, indicating which documents or data sets comply with the rules and which do not.
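An automated rule check of the kind described in S603 might look like the sketch below, in which each rule is a regular expression that a desensitized field must match in full. The rule set, the field names and the masked formats are hypothetical examples; the original text does not define the database rules.

```python
import re

# Hypothetical rules: the pattern each field must match once desensitized.
RULES = {
    "phone": re.compile(r"\d{3}\*{4}\d{4}"),
    "id_number": re.compile(r"\d{6}\*{8}\d{3}[\dXx]"),
}

def check_record(record):
    """Return the names of fields that violate a rule (absent fields are skipped)."""
    return [name for name, pattern in RULES.items()
            if name in record and not pattern.fullmatch(record[name])]
```

Records for which the returned list is non-empty are the ones that would be passed on to the manual review step.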
In S604, a manual review process is performed to confirm the desensitization effect and generate the desensitized insurance claim data set. The operation is as follows:
A professional manually reviews the documents or data sets that do not comply with the rules, to ensure that the desensitization effect meets the requirements. According to the manual review result, a desensitized insurance claim data set is generated, comprising the documents or data sets that have passed the checks and the review and whose desensitization effect has been confirmed.
A device for desensitizing and collecting information of claim settlement data comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method for desensitizing and collecting information of claim settlement data when executing the computer program.
Referring to FIG. 8, a system for desensitizing and collecting information of claim settlement data is used for executing the above method for desensitizing and collecting information of claim settlement data, and comprises a feature extraction module, a semantic recognition module, a data preprocessing module, a desensitization strategy decision module, an image processing module, a relevance analysis module and a result verification module;
The feature extraction module extracts features by adopting a deep convolutional neural network algorithm based on the original insurance claim data image, and performs text recognition and optimization through an optical character recognition technology and a long short-term memory network, to generate a text recognition data set;
The semantic recognition module receives the text recognition data set, performs semantic coding and entity recognition by using a pre-training model of the Transformers architecture and a named entity recognition technology, adopts a text classification algorithm to divide privacy levels, and applies a data desensitization technology for labeling, to obtain privacy information labeling data;
The data preprocessing module performs data cleaning and feature coding based on the privacy information labeling data, and extracts key features by using principal component analysis, to generate screened key feature data;
The desensitization strategy decision module performs model training by adopting a support vector machine algorithm based on the screened key feature data, and performs risk assessment, to generate desensitization strategy decision data;
The image processing module uses the desensitization strategy decision data, adopts an image segmentation technology to locate the desensitization region, applies Gaussian blur or mean blur, covers the privacy information through an image masking technology, and generates preliminary desensitization data after image quality evaluation;
The relevance analysis module performs feature extraction by adopting a graph neural network based on the preliminary desensitization data, performs information aggregation by using a graph convolutional network, performs weight analysis by using a graph attention network, and establishes information links by using a graph isomorphism network, to generate cross-document desensitization association data;
The result verification module verifies the validity of the desensitization result based on the cross-document desensitization association data, performs data verification by using a machine learning classification algorithm, and performs performance evaluation by using an evaluation index algorithm, to generate desensitization data.
The data security is enhanced, the privacy information is marked efficiently and accurately through the deep convolutional neural network and the text recognition technology, and the data leakage is effectively prevented. Automatic feature extraction and information identification reduce manual intervention, improve processing efficiency and reduce cost. The data preprocessing ensures the consistency and accuracy of the data, so that the desensitized data is more credible. The system also provides a personalized desensitization strategy, and personalized strategies are formulated for different data types through a machine learning model, so that privacy protection and data availability are balanced. Image processing and quality assessment ensures the quality of the desensitized image, maintaining data availability. Cross-document desensitization correlation data facilitates comprehensive analysis of claim data, while result verification and performance assessment improves data quality and system effectiveness.
Referring to FIG. 9, the feature extraction module comprises a feature map generation sub-module, a text region positioning sub-module, a text content recognition sub-module and a sequence feature optimization sub-module;
The semantic recognition module comprises a semantic coding sub-module, an entity recognition and classification sub-module, a privacy class division sub-module and a privacy information labeling sub-module;
The data preprocessing module comprises a data normalization sub-module, a feature conversion sub-module, a dimension reduction sub-module and a key feature extraction sub-module;
The desensitization strategy decision module comprises a model training sub-module, a kernel trick application sub-module, a soft margin maximization sub-module and a risk assessment sub-module;
The image processing module comprises a region positioning sub-module, a blurring processing sub-module, a mask application sub-module and a quality evaluation sub-module;
The relevance analysis module comprises a feature extraction sub-module, an information aggregation sub-module, a weight analysis sub-module and an information link sub-module;
The result verification module comprises a preliminary verification sub-module, a difference detection sub-module, a database collation sub-module and a manual verification sub-module.
Through the feature map generation sub-module, the system uses a deep convolutional neural network algorithm to extract features from the original insurance claim data image. The text region positioning sub-module locates the text regions through an image processing technology, so that the text content recognition sub-module can recognize the text accurately. The text content recognition sub-module recognizes and optimizes the text regions by using an optical character recognition technology and a long short-term memory network, to generate a text recognition data set. The sequence feature optimization sub-module further processes and optimizes the text recognition data, improving the accuracy and quality of text recognition.
The semantic recognition module receives the text recognition dataset and performs further processing. The semantic coding submodule utilizes a pre-training model of a Transformers architecture to carry out semantic coding on text recognition data, and captures semantic information of the text. The entity recognition classification sub-module then uses named entity recognition techniques to identify and classify the text. The privacy class classification sub-module classifies the privacy class by using a text classification algorithm and determines the information needing desensitization. Finally, the privacy information labeling sub-module labels the privacy information in the text and prepares for subsequent desensitization processing.
The data preprocessing module further processes the marked privacy information. The data normalization sub-module performs normalization processing on the privacy information labeling data, and ensures consistency and accuracy of the data. The feature conversion submodule performs feature coding and conversion and prepares for subsequent data analysis and model training. The dimension reduction sub-module reduces the data dimension by using a principal component analysis method, reduces the data complexity and improves the processing efficiency. The key feature extraction submodule extracts key features from the processed data and provides important basis for subsequent model training and desensitization strategy decision.
The desensitization strategy decision module is responsible for formulating the desensitization strategy. The model training sub-module trains a support vector machine on the screened key feature data to determine the desensitization strategy. The kernel trick application sub-module improves the performance and generalization ability of the model, while the soft-margin maximization sub-module allows the model to handle imperfect data. The risk assessment sub-module evaluates the performance of the model and determines the risk and effect of the desensitization strategy.
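To make the soft-margin idea concrete, here is a minimal sketch of a linear soft-margin SVM trained by sub-gradient descent on the regularized hinge loss. It deliberately uses a linear kernel only, so the kernel trick is omitted. The two-dimensional feature vectors, the labels (+1 for high risk, -1 for low risk), and the hyperparameters are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(samples, labels, lam=0.01, lr=0.05, epochs=300):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss, one sample at a time."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = dot(w, x) + b
            # Regularization step: shrink the weights slightly.
            w = [wi * (1.0 - lr * lam) for wi in w]
            # Hinge step: only points inside the margin (or misclassified) push back.
            if y * score < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (high risk) or -1 (low risk)."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical key features after preprocessing.
samples = [[2.0, 2.0], [2.5, 1.5], [3.0, 2.5], [0.0, 0.5], [0.5, 0.0], [-0.5, 0.2]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, [3.0, 3.0]), predict(w, b, [0.0, 0.0]))
```

The `lam` term trades margin width against hinge violations, which is what lets a soft-margin model tolerate noisy or overlapping points instead of failing on them.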
The image processing module is responsible for desensitization work that involves images. The region positioning sub-module uses the desensitization strategy decision data and an image segmentation technique to locate the desensitization region. The blurring sub-module blurs the image information in the desensitization region using techniques such as Gaussian blur or mean blur. The mask application sub-module covers the privacy information with an image mask so that it is not visible. The quality evaluation sub-module evaluates the quality of the desensitized image to ensure that desensitization does not impair the usability of the image.
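The blur-then-mask sequence can be sketched without any imaging library. The example below applies a mean blur to one rectangular region of a small grayscale image stored as nested lists, then covers a second region with a binary mask. The image, the region coordinates, and the function names are assumptions for illustration; a production system would more likely call an image library such as OpenCV.

```python
def mean_blur_region(image, top, left, bottom, right, radius=1):
    """Return a copy of `image` with rows top..bottom-1 and cols left..right-1 mean-blurred."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(top, bottom):
        for c in range(left, right):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            out[r][c] = sum(window) // len(window)
    return out


def apply_mask(image, mask, fill=0):
    """Replace every pixel whose mask value is truthy with `fill`."""
    return [
        [fill if m else p for p, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Hypothetical 4x6 checkerboard image with values 0 and 255.
image = [[255 if (r + c) % 2 == 0 else 0 for c in range(6)] for r in range(4)]
blurred = mean_blur_region(image, 1, 1, 3, 4)
mask = [[1 if 1 <= r < 3 and 4 <= c < 6 else 0 for c in range(6)] for r in range(4)]
desensitized = apply_mask(blurred, mask)
for row in desensitized:
    print(row)
```

Pixels outside both regions are copied unchanged, which is the property the quality evaluation step is meant to confirm.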
The relevance analysis module handles the associations between different claim documents. The feature extraction sub-module extracts features from the preliminary desensitization data for subsequent association analysis. The information aggregation sub-module uses a graph convolutional network to aggregate related information and establish connections between items. The weight analysis sub-module uses a graph attention network to perform weight analysis and determine the importance of each association. The information link sub-module uses a graph isomorphism network to establish information links, generating cross-document desensitization associated data.
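The aggregation and linking idea can be illustrated with a heavily simplified stand-in: one round of mean aggregation over each node and its neighbors (a graph convolution layer without learned weights), followed by cosine similarity to decide whether two mentions refer to the same person. The node names, feature vectors, edges, and threshold are hypothetical, and no attention or isomorphism network is implemented here.

```python
import math


def aggregate(features, edges):
    """One round of mean aggregation over each node and its neighbors (self-loop included)."""
    neighbours = {node: {node} for node in features}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
        for node, nbrs in neighbours.items()
    }


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical mention nodes from two documents, sharing one policy node.
features = {
    "docA:person": [1.0, 0.0],
    "docA:policy": [0.8, 0.2],
    "docB:person": [0.9, 0.1],
    "docB:other": [0.0, 1.0],
}
edges = [("docA:person", "docA:policy"), ("docB:person", "docA:policy")]
embedded = aggregate(features, edges)

same = cosine(embedded["docA:person"], embedded["docB:person"])
different = cosine(embedded["docA:person"], embedded["docB:other"])
linked = same > 0.9  # hypothetical linking threshold
print(same, different, linked)
```

Mentions that share a neighbor are pulled toward each other by the aggregation step, so the similarity score reflects graph structure as well as each node's own features.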
The result verification module verifies the validity of the desensitization result. The preliminary verification sub-module performs a preliminary check on the cross-document desensitization associated data to ensure that the data is consistent and valid. The difference detection sub-module uses a machine learning classification algorithm to verify the data and detect differences and anomalies. The database checking sub-module checks the desensitized data against the database to ensure that the desensitization result is consistent with the original data. The manual rechecking sub-module performs a manual review to ensure the accuracy and compliance of the desensitization result.
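Claim 7 names data hash matching as the preliminary consistency check. One plausible reading, sketched below, is to confirm that sensitive fields no longer appear in clear text and that the hash of the non-sensitive fields is unchanged after desensitization. The record layout, field names, and the `verify` helper are assumptions for illustration only.

```python
import hashlib


def fingerprint(record):
    """Stable SHA-256 digest of a record's key/value pairs."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(original, desensitized, sensitive_fields):
    """Return a list of problems found; an empty list means the check passed."""
    issues = []
    for field in sensitive_fields:
        if desensitized.get(field) == original.get(field):
            issues.append(f"{field}: still in clear text")
    keep = [key for key in original if key not in sensitive_fields]
    before = fingerprint({key: original[key] for key in keep})
    after = fingerprint({key: desensitized.get(key) for key in keep})
    if before != after:
        issues.append("non-sensitive fields changed")
    return issues


# Hypothetical claim records.
original = {"name": "Zhang San", "id_number": "11010519491231002X", "diagnosis": "fracture"}
good = {"name": "***", "id_number": "******", "diagnosis": "fracture"}
bad = {"name": "Zhang San", "id_number": "******", "diagnosis": "sprain"}

print(verify(original, good, ["name", "id_number"]))
print(verify(original, bad, ["name", "id_number"]))
```

Records that fail this automated check would be the natural candidates to route to the manual rechecking step.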
The present invention is not limited to the above embodiments. Equivalent embodiments obtained by changing or modifying the technical content disclosed above may be applied to other fields, and any simple modification, equivalent change, or adaptation of the above embodiments made according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (10)

1. A method for desensitizing and collecting information of claim settlement data, characterized by comprising the following steps:
based on original insurance claim data, performing feature extraction using a deep learning image text recognition technique, and recognizing sequence features in combination with a long short-term memory (LSTM) network, to generate a text recognition dataset;
based on the text recognition dataset, using a natural language processing algorithm to perform semantic analysis through a pre-trained model based on the Transformer architecture and to label privacy information, generating privacy information labeling data;
based on the privacy information labeling data, performing risk assessment of the privacy information using a support vector machine and determining a desensitization level, generating desensitization strategy decision data;
based on the desensitization strategy decision data, obscuring the privacy information with an image processing method, using the image blurring and masking techniques in the OpenCV library, to generate preliminary desensitization data;
based on the preliminary desensitization data, linking and analyzing information about the same person across different documents using a graph neural network technique, to generate cross-document desensitization associated data;
based on the cross-document desensitization associated data, verifying the desensitization effect using a data consistency verification technique combined with automated scripts and a manual review process, to generate a desensitized insurance claim dataset;
wherein the text recognition dataset comprises the text content and the set of its positions in the picture; the privacy information labeling data comprises the text, category, and position in the document of each piece of privacy information; the desensitization strategy decision data specifies the desensitization level and processing mode for each piece of privacy information; the preliminary desensitization data is the processed claim data in which the privacy information has been blurred or covered; and the cross-document desensitization associated data comprises the associations and desensitization states of the privacy information of the same individual across multiple pieces of claim data.
2. The method for desensitizing and collecting information of claim settlement data according to claim 1, wherein the step of performing feature extraction using a deep learning image text recognition technique based on original insurance claim data, and recognizing sequence features in combination with a long short-term memory network, to generate a text recognition dataset, specifically comprises:
based on the original insurance claim data image, performing feature extraction using a deep convolutional neural network algorithm, to generate a preliminary feature map;
based on the preliminary feature map, locating text regions using a region proposal network algorithm and applying bounding-box regression adjustment, to generate a text region positioning result;
based on the text region positioning result, recognizing the text content using an optical character recognition technique and classifying the characters, to generate a text content recognition result;
based on the text content recognition result, optimizing the recognition of sequence features using a long short-term memory network and processing context information, to generate the text recognition dataset;
wherein the deep convolutional neural network specifically uses a VGG or ResNet model to extract deep image features; the region proposal network specifically refers to the RPN in Faster R-CNN, used to propose image regions containing text; the optical character recognition technique specifically uses the Tesseract OCR framework to recognize and convert text information in the image; and the long short-term memory network specifically uses a BiLSTM model to handle the sequential dependencies of text data.
3. The method for desensitizing and collecting information of claim settlement data according to claim 1, wherein the step of using a natural language processing algorithm, based on the text recognition dataset, to perform semantic analysis through a pre-trained model based on the Transformer architecture and to label privacy information, generating privacy information labeling data, specifically comprises:
based on the text recognition dataset, performing semantic encoding using a pre-trained model based on the Transformer architecture, to generate a text semantic encoding result;
based on the text semantic encoding result, recognizing and classifying entities using a named entity recognition technique, to generate an entity recognition result;
based on the entity recognition result, classifying privacy classes using a text classification algorithm, to generate a privacy class classification result;
based on the privacy class classification result, applying a data desensitization technique to label the sensitive information, generating the privacy information labeling data;
wherein the pre-trained model based on the Transformer architecture is specifically a BERT, GPT, or RoBERTa model used for deep semantic understanding of the text; the named entity recognition technique comprises applying a BiLSTM-CRF model to recognize entities in the text, including person names and place names; the text classification algorithm specifically uses a support vector machine or a deep neural network to classify the text data; and the data desensitization technique specifically applies data masking, data obfuscation, or differential privacy to the identified sensitive information.
4. The method for desensitizing and collecting information of claim settlement data according to claim 1, wherein the step of performing risk assessment of the privacy information using a support vector machine, based on the privacy information labeling data, and determining a desensitization level, generating desensitization strategy decision data, specifically comprises:
based on the privacy information labeling data, standardizing the data using data cleaning and feature encoding algorithms and converting the feature format, to generate preprocessed risk assessment data;
based on the preprocessed risk assessment data, reducing dimensionality and extracting key features using principal component analysis, to generate screened key feature data;
based on the screened key feature data, training a model using a support vector machine algorithm, including using the kernel trick and soft-margin maximization, to generate a trained SVM model;
based on the trained SVM model, performing risk assessment, determining the sensitivity of the data, and determining the desensitization level according to that sensitivity, to generate the desensitization strategy decision data;
wherein the feature encoding specifically converts text data into numerical data that a machine learning algorithm can process; the principal component analysis converts correlated variables into a set of linearly uncorrelated variables through an orthogonal transformation; the kernel trick specifically maps the data into a high-dimensional space; and the soft-margin maximization allows some data points to fall inside the margin or to be misclassified.
5. The method for desensitizing and collecting information of claim settlement data according to claim 1, wherein the step of obscuring the privacy information with an image processing method, based on the desensitization strategy decision data and using the image blurring and masking techniques in the OpenCV library, to generate preliminary desensitization data, specifically comprises:
based on the desensitization strategy decision data, identifying and locating the desensitization region using an image segmentation technique, to generate mask region positioning data;
based on the mask region positioning data, blurring the sensitive region using Gaussian blur or mean blur, to generate blurred image data;
based on the blurred image data, using an image masking technique to cover the sensitive information by adding a layered (cascading-style) mask layer, to generate masked image data;
based on the masked image data, performing image quality evaluation to ensure the desensitization effect while preserving the clarity of non-sensitive regions, to generate the preliminary desensitization data;
wherein the image segmentation technique comprises threshold segmentation and region growing and is used to distinguish different objects in the image; the Gaussian blur specifically convolves the image with a Gaussian function; and the image masking technique specifically uses a binary image to locally occlude the target image.
6. The method for desensitizing and collecting information of claim settlement data according to claim 1, wherein the step of linking and analyzing information about the same person across different documents using a graph neural network technique, based on the preliminary desensitization data, to generate cross-document desensitization associated data, specifically comprises:
based on the preliminary desensitization data, extracting node embedding features using a graph neural network, to generate an individual feature representation set;
based on the individual feature representation set, aggregating neighbor information using a graph convolutional network, to generate an enhanced individual connection graph;
based on the enhanced individual connection graph, performing relevance weight analysis using a graph attention network, to generate a relevance weight analysis result;
based on the relevance weight analysis result, establishing person-based information links using a graph isomorphism network, to generate the cross-document desensitization associated data;
wherein the individual feature representation set is specifically a vectorized representation of multiple individuals in the documents, comprising text content and structural information.
7. The method for desensitizing and collecting information of claim settlement data according to claim 1, wherein the step of verifying the desensitization effect using a data consistency verification technique combined with automated scripts and a manual review process, based on the cross-document desensitization associated data, to generate a desensitized insurance claim dataset, specifically comprises:
based on the cross-document desensitization associated data, performing a preliminary consistency check using data hash matching, to generate a preliminary verification result;
based on the preliminary verification result, performing difference detection using a comparative analysis technique, to generate a refined consistency check report;
based on the refined consistency check report, performing database rule checks using automated scripts, to generate an automated check result;
based on the automated check result, confirming the effect through a manual review process, to generate the desensitized insurance claim dataset.
8. A claim data information desensitizing and collecting device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the claim data information desensitizing and collecting method according to any one of claims 1 to 7 when executing the computer program.
9. A system for desensitizing and collecting information of claim settlement data, characterized in that the system is used to execute the method for desensitizing and collecting information of claim settlement data according to any one of claims 1 to 7, and comprises a feature extraction module, a semantic recognition module, a data preprocessing module, a desensitization strategy decision module, an image processing module, a relevance analysis module, and a result verification module;
the feature extraction module extracts features using a deep convolutional neural network algorithm based on an original insurance claim data image, and performs text recognition and optimization through an optical character recognition technique and a long short-term memory network, to generate a text recognition dataset;
the semantic recognition module receives the text recognition dataset, performs semantic encoding and entity recognition using a pre-trained model based on the Transformer architecture and a named entity recognition technique, classifies privacy classes using a text classification algorithm, and labels the data using a data desensitization technique, to obtain privacy information labeling data;
the data preprocessing module, based on the privacy information labeling data, performs data cleaning and feature encoding, and extracts key features using principal component analysis, to generate screened key feature data;
the desensitization strategy decision module, based on the screened key feature data, trains a model using a support vector machine algorithm and performs risk assessment, to generate desensitization strategy decision data;
the image processing module uses the desensitization strategy decision data, locates the desensitization region using an image segmentation technique, applies Gaussian blur or mean blur, covers the privacy information using an image masking technique, and generates preliminary desensitization data after image quality evaluation;
the relevance analysis module, based on the preliminary desensitization data, extracts features using a graph neural network, aggregates information using a graph convolutional network, performs weight analysis using a graph attention network, and establishes information links using a graph isomorphism network, to generate cross-document desensitization associated data;
the result verification module, based on the cross-document desensitization associated data, verifies the validity of the desensitization result, verifies the data using a machine learning classification algorithm, and evaluates performance using an evaluation index algorithm, to generate the desensitized data.
10. The system for desensitizing and collecting information of claim settlement data according to claim 9, wherein the feature extraction module comprises a feature map generation sub-module, a text region positioning sub-module, a text content recognition sub-module, and a sequence feature optimization sub-module;
the semantic recognition module comprises a semantic coding sub-module, an entity recognition and classification sub-module, a privacy class classification sub-module, and a privacy information labeling sub-module;
the data preprocessing module comprises a data normalization sub-module, a feature conversion sub-module, a dimension reduction sub-module, and a key feature extraction sub-module;
the desensitization strategy decision module comprises a model training sub-module, a kernel trick application sub-module, a soft-margin maximization sub-module, and a risk assessment sub-module;
the image processing module comprises a region positioning sub-module, a blurring sub-module, a mask application sub-module, and a quality evaluation sub-module;
the relevance analysis module comprises a feature extraction sub-module, an information aggregation sub-module, a weight analysis sub-module, and an information link sub-module;
the result verification module comprises a preliminary verification sub-module, a difference detection sub-module, a database checking sub-module, and a manual rechecking sub-module.
CN202311549869.7A 2023-11-20 2023-11-20 Method, device and system for desensitizing and collecting information of claim settlement data Pending CN117454426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311549869.7A CN117454426A (en) 2023-11-20 2023-11-20 Method, device and system for desensitizing and collecting information of claim settlement data

Publications (1)

Publication Number Publication Date
CN117454426A true CN117454426A (en) 2024-01-26

Family

ID=89581744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311549869.7A Pending CN117454426A (en) 2023-11-20 2023-11-20 Method, device and system for desensitizing and collecting information of claim settlement data

Country Status (1)

Country Link
CN (1) CN117454426A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118116609A (en) * 2024-04-23 2024-05-31 上海森亿医疗科技有限公司 Medical data item asset sensitivity identification method, system, terminal and medium

Similar Documents

Publication Publication Date Title
CN109902622B (en) Character detection and identification method for boarding check information verification
US11455525B2 (en) Method and apparatus of open set recognition and a computer readable storage medium
CN106529380B (en) Image recognition method and device
CN117454426A (en) Method, device and system for desensitizing and collecting information of claim settlement data
CN114329034A (en) Image text matching discrimination method and system based on fine-grained semantic feature difference
CN112241730A (en) Form extraction method and system based on machine learning
Sirajudeen et al. Forgery document detection in information management system using cognitive techniques
CN118211941B (en) Automatic community work order circulation method and system based on RPA
CN116401343A (en) Data compliance analysis method
CN117709317A (en) Report file processing method and device and electronic equipment
CN117523586A (en) Check seal verification method and device, electronic equipment and medium
CN111507850A (en) Authority guaranteeing method and related device and equipment
CN115795079A (en) Engineering cost analysis data acquisition and processing method and system
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
CN115543915A (en) Automatic database building method and system for personnel file directory
CN115512340A (en) Intention detection method and device based on picture
CN115294576A (en) Data processing method and device based on artificial intelligence, computer equipment and medium
Kchaou et al. Two image quality assessment methods based on evidential modeling and uncertainty: application to automatic iris identification systems
US20220398399A1 (en) Optical character recognition systems and methods for personal data extraction
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN113901817A (en) Document classification method and device, computer equipment and storage medium
Brewer et al. Reading PDFs using Adversarially trained Convolutional Neural Network based optical character recognition
CN113934922A (en) Intelligent recommendation method, device, equipment and computer storage medium
Khandan An intelligent hybrid model for identity document classification
US20230140546A1 (en) Randomizing character corrections in a machine learning classification system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination