CN113569533B - Insurance content marking method and system, computer equipment and storage medium - Google Patents

Insurance content marking method and system, computer equipment and storage medium

Info

Publication number
CN113569533B
Authority
CN
China
Prior art keywords
content
sample
file
data
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111125237.9A
Other languages
Chinese (zh)
Other versions
CN113569533A (en)
Inventor
汤海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fubao Technology Co ltd
Original Assignee
Nanjing Fubao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fubao Technology Co ltd filed Critical Nanjing Fubao Technology Co ltd
Priority to CN202111125237.9A priority Critical patent/CN113569533B/en
Publication of CN113569533A publication Critical patent/CN113569533A/en
Application granted granted Critical
Publication of CN113569533B publication Critical patent/CN113569533B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The scheme relates to an insurance content annotation method and system, a computer device, and a storage medium. The method comprises: acquiring a labeled sample file and extracting the labeled content in the sample file; analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the labeled content; extracting each sample field corresponding to the labeled content according to the positioning information, and deduplicating the sample fields to obtain the target sample fields; acquiring a BERT pointer network model and an insurance file to be annotated; and predicting target annotation content in the insurance file to be annotated according to the target sample fields and the BERT pointer network model, then annotating and displaying the target annotation content. Locating the labeled content with a matching algorithm improves the efficiency and precision of content identification and matching; predicting and automatically applying annotations with the BERT pointer network model improves annotation efficiency.

Description

Insurance content marking method and system, computer equipment and storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to an insurance content marking method, an insurance content marking system, computer equipment and a storage medium.
Background
In traditional insurance file review, judging the legal risk of each term depends mainly on professionals, which is time-consuming and labor-intensive. This not only creates a heavy workload for the legal personnel involved; less experienced legal workers may also find it difficult to identify the risk terms in a file, which easily leads to inaccurate review of the same terms and reduces review efficiency. For most files, review time can therefore be saved by marking the key terms so that staff can read them directly. With the application of NLP technology in recent years, tools have appeared on the market that use NLP algorithms to annotate automatically in order to build a knowledge base or a knowledge graph. These tools have difficulty annotating relatively complex text. For example, the same labeled content may appear many times in a full text, but because it appears at different positions, the corresponding labels can differ considerably.
Therefore, the traditional labeling method has the problems of difficult labeling and low labeling efficiency.
Disclosure of Invention
Based on this, in order to solve the above technical problem, an insurance content annotation method, system, computer device and storage medium are provided, which can improve annotation efficiency.
An insurance content annotation method, the method comprising:
acquiring a marked sample file, and extracting marked content in the sample file;
analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field;
acquiring a BERT pointer network model, and acquiring an insurance file to be marked;
and predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content.
In one embodiment, the method further comprises:
analyzing the sample file to obtain an analyzed identifiable file;
determining the typesetting style of the identifiable file, and extracting the sample field in the identifiable file according to the typesetting style;
storing the sample field into a candidate set.
In one embodiment, the analyzing the sample file through a matching algorithm to obtain the positioning information corresponding to the labeled content includes:
extracting content data and title data in the sample file, and storing the content data and the title data into the candidate set;
matching the titles in the sample files by using a regular matching algorithm based on the title data in the candidate set to obtain title positioning information;
and matching the content in the sample file through a deep learning model based on the content data in the candidate set to obtain content positioning information.
In one embodiment, the matching the content in the sample file through the deep learning model to obtain the content positioning information includes:
calculating similarity between the content data and the content in the sample file through an edit distance algorithm;
and obtaining the content positioning information according to the similarity.
In one embodiment, the extracting content data and title data in the sample file and storing the content data and the title data in the candidate set includes:
extracting content data and title data in the sample file, comparing the extracted content data with candidate content data in the candidate set, and comparing the extracted title data with candidate title data in the candidate set;
storing the content data into the candidate set when the extracted content data is different from the candidate content data; and storing the title data into the candidate set when the extracted title data is different from the candidate title data.
In one embodiment, the performing deduplication processing on each sample field to obtain each target sample field includes:
and comparing each sample field, and deleting each repeated sample field to obtain the target sample field.
In one embodiment, the training process of the BERT pointer network model includes:
acquiring an initial BERT pointer network model, and inputting training sample data into the initial BERT pointer network model to obtain a sample training result;
extracting model parameters in the initial BERT pointer network model, and adjusting the model parameters according to the sample training result to obtain target model parameters;
and adjusting the initial BERT pointer network model according to the target model parameters to generate the BERT pointer network model.
An insurance content annotation system, the system comprising:
the content extraction module is used for acquiring the marked sample file and extracting the marked content in the sample file;
the positioning module is used for analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
the field processing module is used for extracting each sample field corresponding to the marked content according to the positioning information and carrying out duplicate removal processing on each sample field to obtain each target sample field;
the data acquisition module is used for acquiring a BERT pointer network model and acquiring an insurance file to be marked;
and the content marking module is used for predicting the target marking content in the insurance file to be marked according to each target sample field and the BERT pointer network model, and marking and displaying the target marking content.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a marked sample file, and extracting marked content in the sample file;
analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field;
acquiring a BERT pointer network model, and acquiring an insurance file to be marked;
and predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a marked sample file, and extracting marked content in the sample file;
analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field;
acquiring a BERT pointer network model, and acquiring an insurance file to be marked;
and predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content.
According to the insurance content labeling method, the insurance content labeling system, the computer equipment and the storage medium, labeled sample files are obtained, and labeled content in the sample files is extracted; analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content; extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field; acquiring a BERT pointer network model, and acquiring an insurance file to be marked; and predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content. The marked content is positioned by using a matching algorithm, so that the efficiency and the precision of content identification and matching are improved; the content in the file is labeled and predicted and automatically labeled through the BERT pointer network model, and labeling efficiency is improved.
Drawings
FIG. 1 is a diagram of an application environment of an insurance content annotation method according to an embodiment;
FIG. 2 is a flow chart illustrating an insurance content annotation process according to an embodiment;
FIG. 3 is a diagram of a BERT pointer network model in one embodiment;
FIG. 4 is a block diagram of an embodiment of an insurance content annotation system;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The insurance content marking method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. As shown in FIG. 1, the application environment includes a computer device 110. The computer device 110 may obtain the labeled sample file and extract the labeled content in the sample file; the computer device 110 may analyze the sample file through a matching algorithm to obtain the positioning information corresponding to the labeled content; the computer device 110 may extract each sample field corresponding to the labeled content according to the positioning information, and perform deduplication processing on each sample field to obtain each target sample field; the computer device 110 may obtain a BERT pointer network model, obtain an insurance file to be annotated; the computer device 110 can predict the target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotate and display the target annotation content. The computer device 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, robots, tablet computers, and other devices.
In one embodiment, as shown in fig. 2, there is provided an insurance content annotation method, including the following steps:
Step 202, obtaining the marked sample file, and extracting the marked content in the sample file.
The marked sample file can be a file which is manually marked in advance by a user and is manually screened, and the file can be in a PDF format. The user can import the labeled sample file into the computer device, that is, the computer device can obtain the labeled sample file.
Because the sample file is labeled, the computer equipment can extract the labeled content in the sample file, so that the labeled content in the sample file is obtained.
Step 204, analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content.
The computer device may store matching algorithms, which may include a fuzzy matching algorithm and a regular (regular-expression) matching algorithm. The computer device can analyze the obtained sample file through the matching algorithm, that is, locate the marked content in the sample file, thereby obtaining the positioning information corresponding to the marked content.
Step 206, extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field.
The computer device may extract fields in the sample file according to the positioning information; specifically, it may extract each sample field corresponding to the marked content. Because the sample file is manually labeled by the user, the same labeled content may occur more than once, so some of the sample fields extracted by the computer device may be identical. The computer device may therefore deduplicate the sample fields, and the sample fields remaining after deduplication are used as the target sample fields. The computer device may store the resulting target sample fields in the candidate set.
Step 208, acquiring a BERT pointer network model and acquiring an insurance file to be marked.
The BERT pointer network model may be a pre-trained model stored in a computer device for identifying specific segments in a sentence as tagged content.
The insurance file to be labeled can be an insurance file which needs to be labeled and is imported into the computer equipment by the user, such as a file of an insurance contract. The computer device can obtain the BERT pointer network model and the insurance file to be marked.
Step 210, predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content.
The computer device can predict the annotation content of the imported insurance file to be annotated according to the target sample fields and the BERT pointer network model, and then annotate and display the predicted annotation content.
Here, annotating refers to matching the target annotation content with the target sample field; displaying means that one or more pieces of target annotation content are highlighted in the sample file in a specific color, and that it is possible to jump between the highlighted pieces of target annotation content.
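The annotate-and-display step can be illustrated with a minimal sketch. It assumes that the predicted annotation content is available as plain-text snippets, each mapped to the target sample field it matched; the `Highlight` type and both function names are hypothetical and are not taken from the patent. Color rendering is left to the user interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Highlight:
    start: int   # offset of the first highlighted character
    end: int     # offset one past the last highlighted character
    label: str   # target sample field the snippet was matched to


def find_highlights(text: str, annotations: Dict[str, str]) -> List[Highlight]:
    """Locate every occurrence of each annotated snippet in the text."""
    found = []
    for snippet, label in annotations.items():
        if not snippet:
            continue  # an empty snippet would match everywhere
        pos = text.find(snippet)
        while pos != -1:
            found.append(Highlight(pos, pos + len(snippet), label))
            pos = text.find(snippet, pos + len(snippet))
    return sorted(found, key=lambda h: h.start)


def next_highlight(highlights: List[Highlight], current: int) -> int:
    """Index of the highlight to jump to next, wrapping around at the end."""
    return (current + 1) % len(highlights)
```

A viewer would keep the index of the currently selected highlight and call `next_highlight` each time the user asks to jump to the next one.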
In this embodiment, the computer device obtains the labeled sample file and extracts the labeled content in the sample file; analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content; extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field; acquiring a BERT pointer network model, and acquiring an insurance file to be marked; and predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content. The marked content is positioned by using a matching algorithm, so that the efficiency and the precision of content identification and matching are improved; the content in the file is labeled and predicted and automatically labeled through the BERT pointer network model, and labeling efficiency is improved.
In one embodiment, the insurance content annotation method provided may further include a process of creating a candidate set, where the specific process includes: analyzing the sample file to obtain an analyzed identifiable file; determining the typesetting style of the identifiable file, and extracting a sample field in the identifiable file according to the typesetting style; the sample field is stored into the candidate set.
The computer device may parse the imported sample file, wherein the format of the imported sample file may be a PDF format. The computer equipment can obtain the recognizable file after analysis.
The computer device may identify the typesetting style of the identifiable file. The typesetting style represents how titles are laid out in the file and is either horizontal or vertical; the computer device can identify whether the imported sample file uses the horizontal or the vertical format. The computer device may extract the sample fields in the identifiable file according to the typesetting style. Specifically, the computer device may use the horizontal or vertical processing logic to convert the identifiable file into a "title-content" format, extract the sample fields from it, and store them in the candidate set.
In this embodiment, because parsing a PDF file takes a long time, the parsed result is saved as a CSV file after the computer device has converted the identifiable file into the "title-content" format. The saved file can be read again when needed later without parsing the PDF a second time, which saves computation.
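The caching idea can be sketched as follows. This is a minimal sketch, assuming that the parsed result is a list of (title, content) pairs and that the CSV cache sits next to the PDF with the same base name; the function names, the column names and the `parse_pdf` callback are hypothetical, since the patent does not specify them.

```python
import csv
from pathlib import Path
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (title, content)


def save_records(path: Path, records: List[Record]) -> None:
    """Write the "title-content" pairs to a CSV file with a header row."""
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["title", "content"])
        writer.writerows(records)


def load_records(path: Path) -> List[Record]:
    """Read the "title-content" pairs back from the CSV file."""
    with path.open(newline="", encoding="utf-8") as fh:
        return [(row["title"], row["content"]) for row in csv.DictReader(fh)]


def get_records(pdf_path: Path, parse_pdf: Callable[[Path], List[Record]]) -> List[Record]:
    """Reuse the CSV cache if it exists; otherwise parse the PDF and cache it."""
    cache = pdf_path.with_suffix(".csv")
    if cache.exists():
        return load_records(cache)
    records = parse_pdf(pdf_path)  # the slow step that the cache avoids
    save_records(cache, records)
    return records
```

With this arrangement the slow parser runs at most once per file; later calls only read the CSV.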
In one embodiment, the insurance content annotation method provided may further include a process of obtaining the positioning information, where the specific process includes: extracting content data and title data in the sample file, and storing the content data and the title data into a candidate set; matching the titles in the sample files by using a regular matching algorithm based on the title data in the candidate set to obtain title positioning information; and matching the content in the sample file through a deep learning model based on the content data in the candidate set to obtain content positioning information.
The computer device may extract the content data and the title data in the sample file and store them in the candidate set. The matching algorithm may include a regular (regular-expression) matching algorithm and a fuzzy matching algorithm. The computer device can locate the subtitles in the file with the regular matching algorithm: based on the title data in the candidate set, it matches the subtitles in the sample file, records and locates them directly, and returns refined title positioning information. The regular matching algorithm is suited to data that is easy to identify exactly, such as the age of the applicant or the renewal period.
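A minimal sketch of title positioning with regular expressions is given below. It assumes that the positioning information is a pair of character offsets and that titles may differ from the candidate only in whitespace; `locate_titles` and `AGE_RANGE` are hypothetical names, and the age pattern is only an example of an easily identified value.

```python
import re
from typing import Dict, List, Tuple


def locate_titles(text: str, titles: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each candidate title to the (start, end) offsets where it occurs."""
    positions = {}
    for title in titles:
        # Escape each word so it is matched literally, and let any run of
        # whitespace in the file stand in for the spaces in the candidate.
        pattern = r"\s+".join(re.escape(part) for part in title.split())
        positions[title] = [m.span() for m in re.finditer(pattern, text)]
    return positions


# Assumed example of a well-structured value: an age range such as "18 to 60 years".
AGE_RANGE = re.compile(r"(\d{1,3})\s*(?:-|to)\s*(\d{1,3})\s*years")
```

Matching is case-sensitive here, so a title such as "Waiting period" is not confused with the same words in running text that start with a lowercase letter.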
The computer device may locate the content in the file based on a fuzzy matching algorithm, and specifically, the computer device may match the content in the sample file through the deep learning model based on the content data in the candidate set to obtain the content location information.
In one embodiment, the insurance content annotation method provided may further include a process of obtaining content location information, where the specific process includes: calculating the similarity between the content data and the content in the sample file through an edit distance algorithm; and obtaining content positioning information according to the similarity.
The edit distance algorithm may calculate the similarity between two character strings through the Levenshtein distance; that is, the content in the existing candidate set is matched against the content in the sample file, and the similarity between the content data and the content in the sample file is calculated to obtain the content positioning information. The formula for calculating the similarity may be:
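$$
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } a[0] = b[0], \\
1 + \min
\begin{cases}
\operatorname{lev}(\operatorname{tail}(a), b) \\
\operatorname{lev}(a, \operatorname{tail}(b)) \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b))
\end{cases}
& \text{otherwise.}
\end{cases}
$$
Here, a and b both denote character strings, |a| denotes the length of string a, and |b| denotes the length of string b. lev denotes the edit distance between the two strings; the smaller the edit distance, the more similar the strings. tail denotes the tail of a string, that is, the string without its first character, so lev(tail(a), b) denotes the edit distance between the tail of a and b.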
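The recursive definition of the edit distance can be sketched directly in code. The `levenshtein` function below follows the recursion on string tails, with memoization to avoid repeated work. The `similarity` function normalizes the distance to the range 0 to 1; that normalization is an assumption made for illustration, since the text only states that a smaller distance means more similar strings.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed from the recursive definition (short strings only)."""

    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        # i and j mark where the remaining tails a[i:] and b[j:] begin.
        if i == len(a):
            return len(b) - j          # a is exhausted: insert the rest of b
        if j == len(b):
            return len(a) - i          # b is exhausted: delete the rest of a
        if a[i] == b[j]:
            return lev(i + 1, j + 1)   # first characters agree: no cost
        return 1 + min(
            lev(i + 1, j),             # lev(tail(a), b): delete from a
            lev(i, j + 1),             # lev(a, tail(b)): insert into a
            lev(i + 1, j + 1),         # lev(tail(a), tail(b)): substitute
        )

    return lev(0, 0)


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Because the recursion depth grows with the combined length of the strings, this form suits short fields; an iterative table-based version is preferable for long text.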
In one embodiment, the insurance content annotation method provided may further include a process of storing data into the candidate set, where the specific process includes: extracting content data and title data in the sample file, comparing the extracted content data with candidate content data in the candidate set, and comparing the extracted title data with candidate title data in the candidate set; when the extracted content data is different from the candidate content data, storing the content data into the candidate set; when the extracted title data is different from the candidate title data, storing the title data into the candidate set.
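The compare-then-store rule can be sketched as below. This is a minimal sketch, assuming that "different" means exact string inequality and that the candidate set keeps titles and contents in two separate lists; the `CandidateSet` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidateSet:
    titles: List[str] = field(default_factory=list)
    contents: List[str] = field(default_factory=list)

    def add_title(self, title: str) -> bool:
        """Store the title only if it differs from every candidate title."""
        if title in self.titles:
            return False
        self.titles.append(title)
        return True

    def add_content(self, content: str) -> bool:
        """Store the content only if it differs from every candidate content."""
        if content in self.contents:
            return False
        self.contents.append(content)
        return True
```

Each method returns whether the value was stored, so the caller can tell new entries from ones that were already present.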
In an embodiment, the insurance content labeling method provided may further include a process of performing deduplication processing on the sample field, where the specific process includes: and comparing each sample field, and deleting each repeated sample field to obtain a target sample field.
The computer device may perform deduplication on the sample fields in the candidate set, that is, remove duplicate sample fields. The contents of the candidate set are stored in an SQL database, and the SQL database allows duplicate data to be stored.
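The following is a minimal sketch of deduplication over an SQL store that permits repeated rows. It uses SQLite through Python's standard `sqlite3` module purely for illustration; the patent does not name a database product, and the table name, column names and function names are hypothetical.

```python
import sqlite3
from typing import Iterable, List

# Hypothetical schema: no UNIQUE constraint, so repeated fields can be stored.
SCHEMA = """
CREATE TABLE sample_fields (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    field TEXT NOT NULL
)
"""


def store_fields(conn: sqlite3.Connection, fields: Iterable[str]) -> None:
    """Insert the extracted sample fields as they are, duplicates included."""
    conn.executemany(
        "INSERT INTO sample_fields (field) VALUES (?)", [(f,) for f in fields]
    )
    conn.commit()


def target_sample_fields(conn: sqlite3.Connection) -> List[str]:
    """Return each distinct field once, in order of first insertion."""
    rows = conn.execute(
        "SELECT field FROM sample_fields GROUP BY field ORDER BY MIN(id)"
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_fields(conn, ["waiting period", "renewal period", "waiting period", "insured age"])
```

Here the stored rows are left untouched and duplicates are dropped when the target sample fields are read; deleting the repeated rows instead would be an equally plausible reading of the text.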
In an embodiment, the insurance content annotation method provided may further include a training process of a BERT pointer network model, and the specific process includes: acquiring an initial BERT pointer network model, and inputting training sample data into the initial BERT pointer network model to obtain a sample training result; extracting model parameters in the initial BERT pointer network model, and adjusting the model parameters according to the sample training result to obtain target model parameters; and adjusting the initial BERT pointer network model according to the target model parameters to generate the BERT pointer network model.
The content in the candidate set is allowed to deviate from the text content; for example, content can still match when one or two words differ. Content within this deviation range is located, and the title of the section in which it is located is returned. A deep learning model is adopted for this, and it is trained with a large amount of data to improve matching precision and accuracy.
The deep learning model adopted may be a BERT pointer network model, which is used to identify specific segments of sentences in fuzzy matching as labeled content. The BERT pointer network model needs to be trained continually with data so that its accuracy keeps improving.
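How a pointer-style model turns its outputs into a segment can be sketched without the model itself. The sketch assumes the model yields one start score and one end score per token, which is a common arrangement for pointer heads but is not spelled out in the patent; the decoding rule, the `max_len` limit and the function name `best_span` are likewise assumptions.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float], end_scores: List[float], max_len: int = 30
) -> Tuple[int, int]:
    """Return (start, end) token indices, end inclusive, with the highest
    combined score, subject to start <= end and a maximum span length."""
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start position.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```

The tokens between the returned indices, inclusive, would then be taken as the labeled segment of the sentence.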
The BERT pointer network model is shown in FIG. 3. The start position and end position are the tags that fuzzy matching uses to find a position in a sentence, and the segment between the two tags is the target annotation content. Because the candidate set contains refined labeled content, a candidate and the sentence it is matched against differ in length. The accuracy of automatic annotation can therefore be improved by intercepting from the sentence to be annotated a segment of the same length as the candidate, calculating the similarity with the Levenshtein distance, and then using the BERT pointer network model to intercept a middle segment of the most similar sentence as the final result.
It should be understood that, although the steps in the above flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
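The length-matching step can be sketched as a sliding window. This is a minimal sketch, assuming the window moves one character at a time across the sentence and that ties go to the earliest window; the final refinement by the BERT pointer network model is omitted, and both function names are hypothetical.

```python
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Iterative Levenshtein distance that keeps only two table rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def most_similar_window(sentence: str, candidate: str) -> Tuple[int, int, int]:
    """Return (start, end, distance) for the window of the candidate's length
    that is closest to the candidate; the earliest window wins ties."""
    width = len(candidate)
    if width == 0 or len(sentence) <= width:
        return 0, len(sentence), edit_distance(sentence, candidate)
    best = (0, width, edit_distance(sentence[:width], candidate))
    for start in range(1, len(sentence) - width + 1):
        dist = edit_distance(sentence[start:start + width], candidate)
        if dist < best[2]:
            best = (start, start + width, dist)
    return best
```

A candidate that differs from the text by a character or two still yields a window with a small distance, which is what allows content with minor deviations to be located.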
In one embodiment, as shown in fig. 4, there is provided an insurance content annotation system, including: a content extraction module 410, a positioning module 420, a field processing module 430, a data acquisition module 440, and a content annotation module 450, wherein:
a content extraction module 410, configured to obtain the labeled sample file, and extract labeled content in the sample file;
the positioning module 420 is configured to analyze the sample file through a matching algorithm to obtain positioning information corresponding to the labeled content;
the field processing module 430 is configured to extract each sample field corresponding to the labeled content according to the positioning information, and perform deduplication processing on each sample field to obtain each target sample field;
the data acquisition module 440 is used for acquiring a BERT pointer network model and acquiring an insurance file to be annotated;
and the content marking module 450 is configured to predict target marking content in the insurance file to be marked according to each target sample field and the BERT pointer network model, and mark and display the target marking content.
In one embodiment, the data acquisition module 440 is further configured to parse the sample file to obtain a parsed identifiable file; determine the typesetting style of the identifiable file, and extract a sample field in the identifiable file according to the typesetting style; and store the sample field into the candidate set.
In one embodiment, the positioning module 420 is further configured to extract content data and title data in the sample file, and store the content data and the title data in the candidate set; matching the titles in the sample files by using a regular matching algorithm based on the title data in the candidate set to obtain title positioning information; and matching the content in the sample file through a deep learning model based on the content data in the candidate set to obtain content positioning information.
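Title positioning with a regular-expression match can be sketched as follows. This is only an illustration under stated assumptions: the sample file is assumed to be already parsed into text lines, titles are assumed to occupy a line of their own with an optional clause number such as `1.` or `2.1` in front, and the function name `locate_titles` and the record layout are hypothetical.

```python
import re
from typing import Dict, List


def locate_titles(lines: List[str], candidate_titles: List[str]) -> List[Dict]:
    """Return one positioning record per line that consists of a known title."""
    # Longest titles first, so a longer title is tried before one that is its prefix.
    ordered = sorted(set(candidate_titles), key=len, reverse=True)
    if not ordered:
        return []
    # re.escape keeps the candidate text literal inside the pattern.
    alternatives = "|".join(re.escape(t) for t in ordered)
    # Optional clause numbering before the title (an assumption about the layout).
    pattern = re.compile(r"^\s*(?:\d+(?:\.\d+)*[.)]?\s+)?(" + alternatives + r")\s*$")
    hits = []
    for line_no, line in enumerate(lines):
        m = pattern.match(line)
        if m:
            hits.append({"title": m.group(1), "line": line_no,
                         "start": m.start(1), "end": m.end(1)})
    return hits


lines = [
    "Policy Terms",
    "1. Insurance Liability",
    "We pay the benefit when the insured event occurs.",
    "2.1 Waiting Period",
    "The waiting period is 90 days.",
]
candidates = ["Insurance Liability", "Waiting Period"]
for hit in locate_titles(lines, candidates):
    print(hit)
```

Each record gives the line index and the character span of the title, which is one possible shape for the title positioning information.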
In one embodiment, the positioning module 420 is further configured to calculate a similarity between the content data and the content in the sample file through an edit distance algorithm; and obtaining content positioning information according to the similarity.
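The edit-distance step can be sketched as a sliding window over the sentence. The sketch assumes character-level comparison and a similarity defined as one minus the edit distance divided by the candidate length; both choices, and the names `levenshtein` and `best_window`, are assumptions made for illustration rather than details given in the text.

```python
from typing import Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def best_window(sentence: str, candidate: str) -> Tuple[int, int, float]:
    """Return (start, end, similarity) of the window most similar to candidate."""
    n = len(candidate)
    if n == 0:
        raise ValueError("candidate must not be empty")
    best = (0, min(n, len(sentence)), -1.0)
    # Slide a window of the candidate's length across the sentence.
    for start in range(max(1, len(sentence) - n + 1)):
        window = sentence[start:start + n]
        sim = 1.0 - levenshtein(window, candidate) / n
        if sim > best[2]:
            best = (start, start + len(window), sim)
    return best


sentence = "The waiting period is 90 days."
start, end, sim = best_window(sentence, "waiting period is 90 days")
print(start, end, sim, repr(sentence[start:end]))
```

The returned character span is one way to represent the content positioning information; a caller would typically also apply a similarity threshold, whose value the text does not specify.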
In one embodiment, the data acquisition module 440 is further configured to extract content data and title data in the sample file, compare the extracted content data with candidate content data in the candidate set, and compare the extracted title data with candidate title data in the candidate set; when the extracted content data is different from the candidate content data, store the content data into the candidate set; and when the extracted title data is different from the candidate title data, store the title data into the candidate set.
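The compare-then-store behaviour of the candidate set can be sketched as below. The class name `CandidateSet`, its two lists, and the exact-match comparison after trimming whitespace are assumptions for illustration; the text only states that data is stored when it differs from what the candidate set already holds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidateSet:
    """Holds candidate content data and candidate title data separately."""
    contents: List[str] = field(default_factory=list)
    titles: List[str] = field(default_factory=list)

    def add_content(self, text: str) -> bool:
        return self._add(self.contents, text)

    def add_title(self, text: str) -> bool:
        return self._add(self.titles, text)

    @staticmethod
    def _add(bucket: List[str], text: str) -> bool:
        """Store text only if it differs from every stored entry; report whether it was stored."""
        text = text.strip()
        if not text or text in bucket:
            return False
        bucket.append(text)
        return True


cs = CandidateSet()
print(cs.add_title("Waiting Period"))    # stored
print(cs.add_title("Waiting Period "))   # same title after trimming, skipped
print(cs.add_content("The waiting period is 90 days."))
print(cs.titles, cs.contents)
```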
In one embodiment, the field processing module 430 is further configured to compare the sample fields and delete duplicate sample fields to obtain the target sample field.
In an embodiment, the insurance content annotation system provided may further include a model training module, configured to obtain an initial BERT pointer network model, and input training sample data into the initial BERT pointer network model to obtain a sample training result; extracting model parameters in the initial BERT pointer network model, and adjusting the model parameters according to the sample training result to obtain target model parameters; and adjusting the initial BERT pointer network model according to the target model parameters to generate the BERT pointer network model.
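The training loop of computing a sample training result and adjusting parameters accordingly can be illustrated with a toy example. The text does not name a loss function or an optimizer, so the cross-entropy over start and end positions and the plain gradient step used here are assumptions; the score lists stand in for model parameters, whereas a real BERT pointer network would backpropagate through the encoder. All function names are hypothetical.

```python
import math
from typing import List


def log_softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def span_loss(start_scores: List[float], end_scores: List[float],
              gold_start: int, gold_end: int) -> float:
    """Cross-entropy of the gold start position plus that of the gold end position."""
    return -(log_softmax(start_scores)[gold_start]
             + log_softmax(end_scores)[gold_end])


def adjust(scores: List[float], gold: int, lr: float = 0.5) -> List[float]:
    """One gradient step; the gradient of cross-entropy w.r.t. scores is softmax minus one-hot."""
    probs = [math.exp(v) for v in log_softmax(scores)]
    return [s - lr * (p - (1.0 if k == gold else 0.0))
            for k, (s, p) in enumerate(zip(scores, probs))]


# Toy "parameters": one start score and one end score per token position.
start, end = [0.0] * 4, [0.0] * 4
before = span_loss(start, end, gold_start=1, gold_end=2)
start, end = adjust(start, 1), adjust(end, 2)
after = span_loss(start, end, gold_start=1, gold_end=2)
print(before > after)
```

The loss falls after the adjustment, which mirrors the idea of deriving target model parameters from the sample training result.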
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an insurance content annotation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply. A particular computing device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a marked sample file, and extracting marked content in the sample file;
analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field;
acquiring a BERT pointer network model, and acquiring an insurance file to be marked;
and predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content.
In one embodiment, the processor, when executing the computer program, further performs the steps of: analyzing the sample file to obtain an analyzed identifiable file; determining the typesetting style of the identifiable file, and extracting a sample field in the identifiable file according to the typesetting style; the sample field is stored into the candidate set.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting content data and title data in the sample file, and storing the content data and the title data into a candidate set; matching the titles in the sample files by using a regular matching algorithm based on the title data in the candidate set to obtain title positioning information; and matching the content in the sample file through a deep learning model based on the content data in the candidate set to obtain content positioning information.
In one embodiment, the processor, when executing the computer program, further performs the steps of: calculating the similarity between the content data and the content in the sample file through an edit distance algorithm; and obtaining content positioning information according to the similarity.
In one embodiment, the processor, when executing the computer program, further performs the steps of: extracting content data and title data in the sample file, comparing the extracted content data with candidate content data in the candidate set, and comparing the extracted title data with candidate title data in the candidate set; when the extracted content data is different from the candidate content data, storing the content data into the candidate set; when the extracted title data is different from the candidate title data, storing the title data into the candidate set.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and comparing each sample field, and deleting each repeated sample field to obtain a target sample field.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring an initial BERT pointer network model, and inputting training sample data into the initial BERT pointer network model to obtain a sample training result; extracting model parameters in the initial BERT pointer network model, and adjusting the model parameters according to the sample training result to obtain target model parameters; and adjusting the initial BERT pointer network model according to the target model parameters to generate the BERT pointer network model.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a marked sample file, and extracting marked content in the sample file;
analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field;
acquiring a BERT pointer network model, and acquiring an insurance file to be marked;
and predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content.
In one embodiment, the computer program when executed by the processor further performs the steps of: analyzing the sample file to obtain an analyzed identifiable file; determining the typesetting style of the identifiable file, and extracting a sample field in the identifiable file according to the typesetting style; the sample field is stored into the candidate set.
In one embodiment, the computer program when executed by the processor further performs the steps of: extracting content data and title data in the sample file, and storing the content data and the title data into a candidate set; matching the titles in the sample files by using a regular matching algorithm based on the title data in the candidate set to obtain title positioning information; and matching the content in the sample file through a deep learning model based on the content data in the candidate set to obtain content positioning information.
In one embodiment, the computer program when executed by the processor further performs the steps of: calculating the similarity between the content data and the content in the sample file through an edit distance algorithm; and obtaining content positioning information according to the similarity.
In one embodiment, the computer program when executed by the processor further performs the steps of: extracting content data and title data in the sample file, comparing the extracted content data with candidate content data in the candidate set, and comparing the extracted title data with candidate title data in the candidate set; when the extracted content data is different from the candidate content data, storing the content data into the candidate set; when the extracted title data is different from the candidate title data, storing the title data into the candidate set.
In one embodiment, the computer program when executed by the processor further performs the steps of: and comparing each sample field, and deleting each repeated sample field to obtain a target sample field.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring an initial BERT pointer network model, and inputting training sample data into the initial BERT pointer network model to obtain a sample training result; extracting model parameters in the initial BERT pointer network model, and adjusting the model parameters according to the sample training result to obtain target model parameters; and adjusting the initial BERT pointer network model according to the target model parameters to generate the BERT pointer network model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (9)

1. An insurance content annotation method, characterized in that the method comprises:
acquiring a marked sample file, and extracting marked content in the sample file;
analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
extracting each sample field corresponding to the marked content according to the positioning information, and performing duplicate removal processing on each sample field to obtain each target sample field;
acquiring a BERT pointer network model for identifying specific segments in sentences as marked contents, and acquiring insurance files to be marked;
predicting target annotation content in the insurance file to be annotated according to each target sample field and the BERT pointer network model, and annotating and displaying the target annotation content;
wherein analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content comprises:
extracting content data and title data in the sample file, and storing the content data and the title data into the candidate set;
matching the titles in the sample files by using a regular matching algorithm based on the title data in the candidate set to obtain title positioning information;
and matching the content in the sample file through a deep learning model based on the content data in the candidate set to obtain content positioning information.
2. The insurance content annotation method of claim 1, further comprising:
analyzing the sample file to obtain an analyzed identifiable file;
determining the typesetting style of the identifiable file, and extracting the sample field in the identifiable file according to the typesetting style;
storing the sample field into the candidate set.
3. The insurance content annotation method according to claim 1, wherein the matching of the content in the sample file by the deep learning model to obtain the content positioning information comprises:
calculating similarity between the content data and the content in the sample file through an edit distance algorithm;
and obtaining the content positioning information according to the similarity.
4. The insurance content annotation method according to claim 1, wherein the extracting content data and title data in the sample file and storing the content data and the title data in the candidate set comprises:
extracting content data and title data in the sample file, comparing the extracted content data with candidate content data in the candidate set, and comparing the extracted title data with candidate title data in the candidate set;
storing the content data into the candidate set when the extracted content data is different from the candidate content data; when the extracted title data is different from the candidate title data, storing the title data into the candidate set.
5. The insurance content annotation method according to claim 1, wherein said performing de-duplication processing on each of said sample fields to obtain each target sample field comprises:
and comparing each sample field, and deleting each repeated sample field to obtain the target sample field.
6. The insurance content annotation method of claim 1, wherein the training process of the BERT pointer network model comprises:
acquiring an initial BERT pointer network model, and inputting training sample data into the initial BERT pointer network model to obtain a sample training result;
extracting model parameters in the initial BERT pointer network model, and adjusting the model parameters according to the sample training result to obtain target model parameters;
and adjusting the initial BERT pointer network model according to the target model parameters to generate the BERT pointer network model.
7. An insurance content annotation system, the system comprising:
the content extraction module is used for acquiring the marked sample file and extracting the marked content in the sample file;
the positioning module is used for analyzing the sample file through a matching algorithm to obtain positioning information corresponding to the marked content;
the field processing module is used for extracting each sample field corresponding to the marked content according to the positioning information and carrying out duplicate removal processing on each sample field to obtain each target sample field;
the data acquisition module is used for acquiring a BERT pointer network model for identifying specific segments in sentences as marked contents and acquiring insurance files to be marked;
a content marking module for predicting the target marking content in the insurance file to be marked according to each target sample field and the BERT pointer network model, and marking and displaying the target marking content;
wherein the positioning module is further used for extracting content data and title data in the sample file and storing the content data and the title data into the candidate set; matching the titles in the sample files by using a regular matching algorithm based on the title data in the candidate set to obtain title positioning information; and matching the content in the sample file through a deep learning model based on the content data in the candidate set to obtain content positioning information.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202111125237.9A 2021-09-26 2021-09-26 Insurance content marking method and system, computer equipment and storage medium Active CN113569533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111125237.9A CN113569533B (en) 2021-09-26 2021-09-26 Insurance content marking method and system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113569533A CN113569533A (en) 2021-10-29
CN113569533B true CN113569533B (en) 2022-02-18

Family

ID=78174407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111125237.9A Active CN113569533B (en) 2021-09-26 2021-09-26 Insurance content marking method and system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113569533B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114548072A (en) * 2022-04-25 2022-05-27 杭州实在智能科技有限公司 Automatic content analysis and information evaluation method and system for contract files

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276054A (en) * 2019-05-16 2019-09-24 湖南大学 A kind of insurance text structure implementation method
CN112270604A (en) * 2020-10-14 2021-01-26 招商银行股份有限公司 Information structuring processing method and device and computer readable storage medium
CN113011141A (en) * 2021-03-17 2021-06-22 平安科技(深圳)有限公司 Buddha note model training method, Buddha note generation method and related equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597306A (en) * 2020-12-24 2021-04-02 电子科技大学 Travel comment suggestion mining method based on BERT
CN114093468A (en) * 2021-07-27 2022-02-25 北京好欣晴移动医疗科技有限公司 Cardiovascular disease information entity labeling and identifying method, device and system

Also Published As

Publication number Publication date
CN113569533A (en) 2021-10-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant