CN112906352A - Vehicle insurance electronic insurance policy text recognition and extraction method and system - Google Patents

Vehicle insurance electronic insurance policy text recognition and extraction method and system

Info

Publication number
CN112906352A
CN112906352A CN202110247927.5A CN202110247927A CN112906352A CN 112906352 A CN112906352 A CN 112906352A CN 202110247927 A CN202110247927 A CN 202110247927A CN 112906352 A CN112906352 A CN 112906352A
Authority
CN
China
Prior art keywords
data
text
insurance
vehicle insurance
policy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110247927.5A
Other languages
Chinese (zh)
Inventor
卢瑞瑞
杨勇志
张成东
郭大朋
龙金泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daohe Cloud Technology Tianjin Co ltd
Original Assignee
Daohe Cloud Technology Tianjin Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daohe Cloud Technology Tianjin Co ltd
Priority to CN202110247927.5A
Publication of CN112906352A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/111 Mathematical or scientific formatting; Subscripts; Superscripts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Abstract

The invention provides a method and a system for recognizing and extracting the text of vehicle insurance electronic policies, and relates to the technical field of digital image processing. The method comprises the following steps: constructing a vehicle insurance electronic policy data model base for the insurance industry; extracting and processing the coordinates of each character of a PDF file in the data model base to obtain text data; filtering the text data to obtain the vehicle insurance electronic policy; matching the data set to be extracted for the vehicle insurance electronic policy, and extracting the data information on the policy according to the analysis model; and outputting the structured data and writing it into an editable document. The method can also extract electronic policies for non-vehicle insurance in the insurance industry, and is therefore more widely applicable. The invention further provides a vehicle insurance electronic policy text recognition and extraction system, which comprises a database construction module, an extraction module, a filtering module, a processing module and an output module.

Description

Vehicle insurance electronic insurance policy text recognition and extraction method and system
Technical Field
The invention relates to the technical field of digital image processing, in particular to a method and a system for recognizing and extracting a text of an electronic insurance policy of vehicle insurance.
Background
The PDF (Portable Document Format) file format encapsulates text, fonts, formatting, colors, and device- and resolution-independent graphics and images in a single file, and offers advantages such as cross-platform support, high integration and high security. As the insurance industry has moved to electronic documents, vehicle insurance electronic policies are generated and stored as PDF files. Policy data often has to be extracted from these documents for statistics and analysis, but the data in a PDF document cannot be conveniently converted into readable and writable information.
The prior art includes some general-purpose PDF information extraction techniques, such as extracting cell data from a PDF or extracting data at a specified position. However, vehicle insurance electronic policies in the insurance industry have distinct industry characteristics: their attributes, data formats and data contents follow business-specific patterns and rules, while the PDF layouts vary widely, with policy data displayed in tabular form in some documents and laid out sequentially in a flow layout in others. General-purpose PDF information extraction techniques therefore cannot meet the extraction requirements for vehicle insurance electronic policies in the insurance industry.
Disclosure of Invention
The invention aims to provide a vehicle insurance electronic policy text recognition and extraction method that extracts vehicle insurance electronic policies with high accuracy, can recognize and extract all vehicle insurance electronic policies in the insurance industry, can also extract electronic policies for non-vehicle insurance in the insurance industry, and is therefore more widely applicable.
Another object of the invention is to provide a vehicle insurance electronic policy text recognition and extraction system capable of running the vehicle insurance electronic policy text recognition and extraction method.
The embodiments of the invention are implemented as follows:
In a first aspect, an embodiment of the application provides a vehicle insurance electronic policy text recognition and extraction method, which includes: constructing a vehicle insurance electronic policy data model base for the insurance industry; extracting and processing the coordinates of each character of a PDF file in the data model base to obtain text data; filtering the text data to obtain the vehicle insurance electronic policy; matching the data set to be extracted for the vehicle insurance electronic policy, and extracting the data information on the policy according to the analysis model; and outputting the structured data and writing it into an editable document.
In some embodiments of the invention, constructing the vehicle insurance electronic policy data model base for the insurance industry includes: training and establishing a preset rule base, establishing an insurance company vehicle insurance product data set, and training and establishing an insurance company vehicle insurance product data analysis model base.
In some embodiments of the invention, extracting and processing the coordinates of each character of the PDF file in the data model base to obtain text data includes: parsing the content contained in the PDF document to generate PDF block information; and merging single-character information with the same or similar horizontal coordinates into a line of text according to a preset coordinate deviation threshold, and generating the vertical start coordinate and the horizontal coordinate of the text.
In some embodiments of the invention, filtering the text data to obtain the vehicle insurance electronic policy includes: according to the preset rule base trained and established in advance, excluding vehicle insurance electronic endorsements, electronic marks and electronic invoices by an exclusion method, and identifying vehicle insurance electronic policies by a matching method.
In some embodiments of the invention, matching the data set to be extracted for the vehicle insurance electronic policy and extracting the data information according to the analysis model includes: identifying the insurance company and its vehicle insurance product according to the preset rule base trained and established in advance, and retrieving the insurance company's vehicle insurance product data set according to the insurance company and its vehicle insurance product; and parsing each data item in that data set in turn and extracting its value from the vehicle insurance electronic policy according to the insurance company's vehicle insurance product analysis model base trained and established in advance.
In some embodiments of the invention, the method further includes: obtaining a text set containing the data through the data positioning text-acquisition model, and combining the text set into text information in order of vertical start coordinate; intercepting the text information through the data interception model to obtain the text of the value of the data item; and formatting the text of the value of the data item through the data formatting model to obtain the formatted value of the data item.
In some embodiments of the invention, obtaining the text set containing the data through the data positioning text-acquisition model and combining the text set into text information in order of vertical start coordinate includes: the positioning text-acquisition model is composed of a plurality of positioning text-acquisition functions, which are executed in sequence to complete data positioning.
In some embodiments of the invention, intercepting the text information through the data interception model to obtain the text of the value of the data item includes: the data interception model is composed of a plurality of data interception functions, which are executed in sequence to complete data interception.
In a second aspect, an embodiment of the application provides a vehicle insurance electronic policy text recognition and extraction system, which comprises: a database construction module, used for constructing a vehicle insurance electronic policy data model base for the insurance industry;
an extraction module, used for extracting and processing the coordinates of each character of a PDF file in the data model base to obtain text data;
a filtering module, used for filtering the text data to obtain the vehicle insurance electronic policy;
a processing module, used for matching the data set to be extracted for the vehicle insurance electronic policy and extracting the data information on the policy according to the analysis model;
and an output module, used for outputting the structured data and writing it into an editable document.
In some embodiments of the invention, the system includes: at least one memory for storing computer instructions; and at least one processor in communication with the memory, wherein the at least one processor, when executing the computer instructions, causes the system to implement the database construction module, the extraction module, the filtering module, the processing module and the output module.
Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:
the vehicle insurance electronic insurance policy extraction method can enable the vehicle insurance electronic insurance policy extraction accuracy rate to be high, can identify and extract all vehicle insurance electronic insurance policies in the insurance industry, can extract electronic insurance policies of non-vehicle insurance in the insurance industry, and is wider in application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a schematic diagram of the steps of a vehicle insurance electronic policy text recognition and extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of PDF block information according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the modules of a vehicle insurance electronic policy text recognition and extraction system according to an embodiment of the present invention.
Reference numerals: 10 - database construction module; 20 - extraction module; 30 - filtering module; 40 - processing module; 50 - output module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It should be noted that, in this document, the term "comprises/comprising" or any other variation thereof is intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but also other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the individual features of the embodiments can be combined with one another without conflict.
Example 1
Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic diagram of the steps of a vehicle insurance electronic policy text recognition and extraction method according to an embodiment of the present invention. The steps are as follows:
Step S100, constructing a vehicle insurance electronic policy data model base for the insurance industry;
specifically, a vehicle insurance electronic policy document identification rule base is trained and established, an insurance company vehicle insurance product data set is established, and an insurance company vehicle insurance product data analysis model base is trained and established, wherein the analysis model comprises a data positioning model, a data interception model and a data formatting model.
Step S110, extracting and processing the coordinates of each character of the PDF file in the data model base to obtain text data;
specifically, a PDF document parsing tool is used to parse the content contained in the PDF document and generate PDF block information, which comprises single-character information and coordinate information. Single-character information with the same or similar horizontal coordinates is merged into a line of text according to a preset coordinate deviation threshold, and the vertical start coordinate and the horizontal coordinate of the text are generated.
In some embodiments, the PDF block information generated by parsing the PDF document, comprising single-character information and coordinate information, is as shown in FIG. 2. After the single characters are merged into a line of text according to the preset coordinate deviation threshold, the resulting line, with its vertical start coordinate and horizontal coordinate, looks as follows:
{
  "top": 202.15,
  "left": 44.63,
  "width": 128.690002441406,
  "height": 4.48000001907349,
  "text": "Insured Ruirui"
}
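As a rough illustration of the merging step, the following Python sketch groups single characters into lines. It assumes that each character carries the `top`, `left`, `width` and `height` fields shown in the example above, that characters on one line share nearly the same `top` value, and that a threshold of 2.0 is acceptable; the `Char` class, the `merge_chars_into_lines` function and the threshold value are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """One character with its bounding box (hypothetical structure)."""
    text: str
    top: float
    left: float
    width: float
    height: float


def merge_chars_into_lines(chars, threshold=2.0):
    """Merge characters whose `top` values differ by at most `threshold`."""
    groups = []
    # Visit characters from top to bottom so that each line is contiguous.
    for ch in sorted(chars, key=lambda c: (c.top, c.left)):
        if groups and abs(ch.top - groups[-1][0].top) <= threshold:
            groups[-1].append(ch)
        else:
            groups.append([ch])

    lines = []
    for group in groups:
        group.sort(key=lambda c: c.left)  # reading order within the line
        left = group[0].left
        right = max(c.left + c.width for c in group)
        lines.append({
            "top": min(c.top for c in group),
            "left": left,
            "width": right - left,
            "height": max(c.height for c in group),
            "text": "".join(c.text for c in group),
        })
    return lines
```

Each returned dictionary has the same shape as the line record above, so later steps can sort and join lines by their `top` value.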
Step S120, filtering the text data to obtain the vehicle insurance electronic policy;
specifically, according to the vehicle insurance electronic policy document identification rule base trained and established in advance, an exclusion method is used to exclude vehicle insurance electronic endorsements, electronic marks and electronic invoices. A matching method is then used, according to the same rule base, to identify vehicle insurance electronic policies. Any remaining document is treated as an unrecognized PDF document type.
In some embodiments, the vehicle insurance electronic endorsements, electronic marks and electronic invoices are excluded by the exclusion method according to the vehicle insurance electronic policy document identification rule base trained and established in advance:
a document whose parsed text does not contain "Supervised by the China Banking and Insurance Regulatory Commission" is excluded;
a document whose parsed text contains "electronic endorsement", "electronic mark" or "value-added tax electronic invoice" is excluded;
a document whose parsed text has fewer than 10 lines is excluded;
the vehicle insurance electronic policy is then identified by the matching method according to the same rule base:
a document whose parsed text has more than 10 lines and contains both "Supervised by the China Banking and Insurance Regulatory Commission" and "policy number" is identified as a vehicle insurance electronic policy.
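The exclusion and matching rules above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the keyword strings are English stand-ins for the text that would actually appear in a policy, and `classify_document` with its three return values is a hypothetical interface rather than the patent's implementation.

```python
# English stand-ins for the keywords in the rule base (assumptions).
REGULATOR_MARK = "Supervised by the China Banking and Insurance Regulatory Commission"
EXCLUDE_KEYWORDS = (
    "electronic endorsement",
    "electronic mark",
    "value-added tax electronic invoice",
)
POLICY_KEYWORD = "policy number"
LINE_LIMIT = 10


def classify_document(lines):
    """Return 'excluded', 'policy' or 'unrecognized' for the parsed text lines."""
    text = "\n".join(lines)

    # Exclusion method: any one rule is enough to reject the document.
    if REGULATOR_MARK not in text:
        return "excluded"
    if any(keyword in text for keyword in EXCLUDE_KEYWORDS):
        return "excluded"
    if len(lines) < LINE_LIMIT:
        return "excluded"

    # Matching method: more than 10 lines, regulator mark and policy number.
    if len(lines) > LINE_LIMIT and POLICY_KEYWORD in text:
        return "policy"

    # Everything else is an unrecognized PDF document type.
    return "unrecognized"
```

Read literally, the rules leave a document of exactly 10 lines neither excluded nor matched, so the sketch reports it as unrecognized.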
Step S130, matching the data set to be extracted for the vehicle insurance electronic policy, and extracting the data information on the policy according to the analysis model;
specifically, the insurance company is identified according to the insurance company identification rule base trained and established in advance;
the insurance company's vehicle insurance product is identified according to the insurance company vehicle insurance product identification rule base trained and established in advance;
the insurance company's vehicle insurance product data set is retrieved according to the insurance company and its vehicle insurance product;
and each data item in the insurance company's vehicle insurance product data set is parsed in turn, and its value extracted from the vehicle insurance electronic policy, according to the insurance company vehicle insurance product analysis model base trained and established in advance.
Specifically, a text set containing the data is obtained through the data positioning text-acquisition model, and the text set is combined into text information in order of vertical start coordinate.
In some embodiments, the positioning text-acquisition model is composed of a plurality of positioning text-acquisition functions, which are executed in sequence to complete data positioning. Positioning functions include, but are not limited to: lines containing a string; lines containing one string but not another; lines containing several strings; the first several lines of text; the last several lines of text; a specified number of lines after the located line; a specified number of lines before the located line; and the current line together with a specified number of lines above or below it, where several lines may be specified.
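A few of these positioning functions, and their sequential execution, might look like the Python sketch below. It assumes each line is a dictionary with `top` and `text` keys; the function names (`containing`, `first_lines`, `last_lines`, `lines_after`, `locate`) are hypothetical and cover only part of the list above.

```python
def containing(*keywords):
    """Keep the lines that contain every given keyword."""
    return lambda lines: [ln for ln in lines if all(k in ln["text"] for k in keywords)]


def first_lines(count):
    """Keep the first `count` lines."""
    return lambda lines: lines[:count]


def last_lines(count):
    """Keep the last `count` lines."""
    return lambda lines: lines[max(len(lines) - count, 0):]


def lines_after(keyword, count):
    """Keep `count` lines following the first line that contains `keyword`."""
    def fn(lines):
        for i, ln in enumerate(lines):
            if keyword in ln["text"]:
                return lines[i + 1:i + 1 + count]
        return []
    return fn


def locate(lines, functions):
    """Run the positioning functions in sequence and join the result by `top`."""
    selected = sorted(lines, key=lambda ln: ln["top"])
    for fn in functions:
        selected = fn(selected)
    return "".join(ln["text"] for ln in sorted(selected, key=lambda ln: ln["top"]))
```

Because every function maps a list of lines to a list of lines, a data item can be located by chaining any number of them, for example `locate(lines, [first_lines(20), containing("premium")])`.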
Specifically, the text information is intercepted through a data interception model to obtain the text information of the value of the data item.
In some embodiments, the data interception model is composed of a plurality of data interception functions, which are executed in sequence to complete data interception. Data interception functions include, but are not limited to: intercepting from a specified string to the end; intercepting the value between two strings; splitting the text and selecting a segment by index; replacing a specified character with an empty string; replacing a specified character with another string; splitting the text and taking the last segment; returning the whole located line without splitting; intercepting a string of a specified length from the start position; checking whether the intercepted value contains a specified character; checking whether the length of the intercepted value meets the standard; checking that the intercepted value contains no Chinese characters or Chinese punctuation; checking that the intercepted value contains no digits; checking that the intercepted value contains no letters (A-Z, a-z) or digits; intercepting a specified length in reverse; and intercepting a string of a specified length backward from a specified position.
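The chained interception functions can be illustrated with the short Python sketch below. All names (`after`, `between`, `split_take`, `replace`, `intercept`) are hypothetical, only a handful of the listed functions are shown, and returning an empty string when a marker is missing is a design choice of this sketch rather than something the patent specifies.

```python
def after(marker):
    """Keep everything after the first occurrence of `marker`."""
    def fn(text):
        idx = text.find(marker)
        return text[idx + len(marker):] if idx >= 0 else ""
    return fn


def between(start, end):
    """Keep the value between the strings `start` and `end`."""
    def fn(text):
        i = text.find(start)
        if i < 0:
            return ""
        i += len(start)
        j = text.find(end, i)
        return text[i:j] if j >= 0 else ""
    return fn


def split_take(separator, index):
    """Split on `separator` and take the segment at `index` (-1 is the last)."""
    def fn(text):
        parts = text.split(separator)
        return parts[index] if -len(parts) <= index < len(parts) else ""
    return fn


def replace(old, new=""):
    """Replace `old` with `new`; by default `old` is removed."""
    return lambda text: text.replace(old, new)


def intercept(text, functions):
    """Run the interception functions in sequence."""
    for fn in functions:
        text = fn(text)
    return text
```

For instance, `intercept("Premium total: CNY 665.00 (in figures)", [between("CNY", "("), replace(" ")])` first cuts out the value between the two markers and then removes the spaces.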
Specifically, the text information of the value of the data item is formatted through the data formatting model to obtain the formatted value of the data item.
In some embodiments, the data formatting model is composed of a plurality of data formatting functions, which are executed in sequence to complete the data formatting. Data formatting functions include, but are not limited to: trimming the spaces at both ends and checking whether the value starts with a given string; trimming the spaces at both ends and checking whether the value ends with a given string; removing all spaces and checking the length; replacing spaces at the start, end and middle of the string; formatting date-related fields; formatting numeric data; removing all spaces and converting the value to a number; multiplying; dividing; trimming the spaces at both ends; removing Chinese characters and Chinese punctuation; converting an amount written in Chinese numerals into digits; removing all non-digit characters so that only digits remain; and removing all characters and symbols other than digits and letters.
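One of the listed formatting functions, converting an amount written in Chinese financial numerals into digits, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it handles the capital numerals, the units 拾/佰/仟/万/亿 and 元/角/分, and rejects any other character; `chinese_amount_to_number` and `format_value` are hypothetical names.

```python
DIGITS = {"零": 0, "壹": 1, "贰": 2, "叁": 3, "肆": 4,
          "伍": 5, "陆": 6, "柒": 7, "捌": 8, "玖": 9}
SMALL_UNITS = {"拾": 10, "佰": 100, "仟": 1000}
SECTION_UNITS = {"万": 10_000, "亿": 100_000_000}


def chinese_amount_to_number(text):
    """Convert an amount such as 陆佰陆拾伍元整 into a string such as '665.00'."""
    total = section = digit = fen = 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 拾 with no digit in front of it counts as ten.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in SECTION_UNITS:
            total += (section + digit) * SECTION_UNITS[ch]
            section = digit = 0
        elif ch in "元圆":
            total += section + digit
            section = digit = 0
        elif ch == "角":
            fen += digit * 10
            digit = 0
        elif ch == "分":
            fen += digit
            digit = 0
        elif ch in "整正":
            continue  # "exactly": carries no numeric value
        else:
            raise ValueError(f"unexpected character in amount: {ch!r}")
    total += section + digit  # amount written without a trailing 元
    return f"{total}.{fen:02d}"


def format_value(text, functions):
    """Run the formatting functions in sequence."""
    for fn in functions:
        text = fn(text)
    return text
```

A formatting chain for a premium could then be `format_value(raw, [str.strip, chinese_amount_to_number])`, trimming the spaces at both ends before the conversion.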
In some embodiments, the insurance company is identified according to the insurance company identification rule base trained and established in advance, by keyword matching on items such as the insurance company's name, short name, official website or customer service telephone number.
The insurance company's vehicle insurance product is identified according to the insurance company vehicle insurance product identification rule base trained and established in advance, either by applying rule analysis to the policy number to obtain the vehicle insurance product, or by enumerating the insurance company's vehicle insurance product names and matching them against the text extracted from the electronic policy.
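The two identification steps can be sketched in Python as below. The rule tables are invented placeholders (the company name, website, hotline, policy-number prefix and product names are all hypothetical), and treating "rule analysis of the policy number" as a prefix lookup is an assumption of this sketch.

```python
def identify_insurer(text, insurer_rules):
    """Return the first insurer with at least one keyword found in `text`."""
    for insurer, keywords in insurer_rules.items():
        if any(keyword in text for keyword in keywords):
            return insurer
    return None


def identify_product(text, policy_number, prefix_rules, product_names):
    """Identify the product from the policy number, else from enumerated names."""
    # Rule analysis on the policy number (assumed here to be a prefix lookup).
    for prefix, product in prefix_rules.items():
        if policy_number.startswith(prefix):
            return product
    # Fallback: match the enumerated product names against the policy text.
    for name in product_names:
        if name in text:
            return name
    return None


# Hypothetical rule base entries, for illustration only.
INSURER_RULES = {
    "Example Insurance Co.": ["Example Insurance", "www.example-insurance.test", "HOTLINE-0000"],
}
PREFIX_RULES = {"EX01": "Compulsory traffic accident liability insurance"}
PRODUCT_NAMES = ["Commercial motor vehicle insurance"]
```

Once both the insurer and the product are known, they can serve as the key for retrieving the matching product data set.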
The insurance company's vehicle insurance product data set is retrieved according to the insurance company and its vehicle insurance product. Each entry in the data set has the following structure: policy data item name, positioning text-acquisition function set, data interception function set and data formatting function set.
In some embodiments, taking the extraction of the "premium" on a vehicle insurance electronic policy as an example:
first, a text set containing the data is obtained through the data positioning model: "Total premium (RMB in words): 陆佰陆拾伍元整 (¥: 665.00 yuan), of which assistance fund (%) ¥: yuan";
then, the text information is intercepted through the data interception model to obtain the text of the value of the data item: "陆佰陆拾伍元整" (six hundred and sixty-five yuan exactly);
and finally, the text of the value of the data item is formatted through the data formatting model to obtain the formatted value of the data item: "665.00".
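The three stages of this example can be tied together as one data set entry, as in the following Python sketch. The `DataItem` class is a hypothetical rendering of the entry structure (item name plus three function sets), and the sample line is a simplified English stand-in for the policy text, so the lambdas below are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DataItem:
    """One data set entry: an item name and its three function sets."""
    name: str
    locators: List[Callable[[List[str]], List[str]]] = field(default_factory=list)
    interceptors: List[Callable[[str], str]] = field(default_factory=list)
    formatters: List[Callable[[str], str]] = field(default_factory=list)

    def extract(self, lines):
        # 1. Positioning: narrow the document down to the relevant lines.
        selected = list(lines)
        for fn in self.locators:
            selected = fn(selected)
        text = "".join(selected)
        # 2. Interception, then 3. formatting, each executed in sequence.
        for fn in self.interceptors + self.formatters:
            text = fn(text)
        return text


# Hypothetical entry for the premium, using a simplified sample line.
premium = DataItem(
    name="premium",
    locators=[lambda lines: [ln for ln in lines if "Total premium" in ln]],
    interceptors=[lambda t: t.split("CNY", 1)[-1], lambda t: t.split("(", 1)[0]],
    formatters=[str.strip],
)

policy_lines = [
    "Policy number: 1265405072020009747",
    "Total premium: CNY 665.00 (in figures)",
]
```

Calling `premium.extract(policy_lines)` walks through positioning, interception and formatting in that order; supporting another insurer or product then amounts to supplying a different set of entries.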
Step S140, outputting the structured data and writing it into an editable document.
Specifically, the structured data takes the form of a two-dimensional array composed of data items and data item values. The structured data parsed from the vehicle insurance electronic policy is written into the editable document using an encoding technique.
In some embodiments, the structured data is written to a .txt document in JSON format:
{
  "policy number": "1265405072020009747",
  "insurance product": "compulsory motor vehicle traffic accident liability insurance",
  "policyholder": "Ruirui",
  "insured": "Ruirui",
  "premium": "665.00"
}
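Writing the two-dimensional array to a text file as JSON can be done with the Python standard library, as in this minimal sketch. The function names `to_json_text` and `write_txt` are hypothetical, and the choice of UTF-8 with `ensure_ascii=False` (so that Chinese values stay readable in the file) is an assumption about the encoding step.

```python
import json
from pathlib import Path


def to_json_text(pairs):
    """Turn [[item, value], ...] into a JSON object string."""
    return json.dumps(dict(pairs), ensure_ascii=False, indent=2)


def write_txt(pairs, path):
    """Write the structured data to a UTF-8 text file in JSON format."""
    Path(path).write_text(to_json_text(pairs), encoding="utf-8")
```

For example, `write_txt([["policy number", "1265405072020009747"], ["premium", "665.00"]], "policy.txt")` produces a file with the same shape as the object above.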
Example 2
Referring to FIG. 3, FIG. 3 is a schematic diagram of the modules of a vehicle insurance electronic policy text recognition and extraction system provided by an embodiment of the present invention, which comprises:
the database construction module 10, used for constructing a vehicle insurance electronic policy data model base for the insurance industry;
the extraction module 20, used for extracting and processing the coordinates of each character of the PDF file in the data model base to obtain text data;
the filtering module 30, used for filtering the text data to obtain the vehicle insurance electronic policy;
the processing module 40, used for matching the data set to be extracted for the vehicle insurance electronic policy and extracting the data information on the policy according to the analysis model;
and the output module 50, used for outputting the structured data and writing it into the editable document.
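How the five modules might be wired together is sketched below in Python. This is only an illustrative skeleton: the `PolicyExtractionSystem` class, its field names and the callable signatures are assumptions, and each module is left as a pluggable function rather than a concrete implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PolicyExtractionSystem:
    """Chains modules 10 to 50; every field is a pluggable callable."""
    build_model_base: Callable[[], dict]             # module 10
    extract_text: Callable[[str, dict], list]        # module 20
    is_vehicle_policy: Callable[[list, dict], bool]  # module 30
    extract_items: Callable[[list, dict], dict]      # module 40
    write_output: Callable[[dict], None]             # module 50

    def run(self, pdf_path):
        model_base = self.build_model_base()
        lines = self.extract_text(pdf_path, model_base)
        # Documents that are not vehicle insurance policies are skipped.
        if not self.is_vehicle_policy(lines, model_base):
            return None
        data = self.extract_items(lines, model_base)
        self.write_output(data)
        return data
```

Keeping the modules as separate callables mirrors the statement later in the description that the modules may exist separately or be integrated into one part.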
Also included are a memory, a processor, and a communication interface, which are electrically connected, directly or indirectly, to each other to enable transmission or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory may be used to store software programs and modules, and the processor may execute various functional applications and data processing by executing the software programs and modules stored in the memory. The communication interface may be used for communicating signaling or data with other node devices.
The memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor including a Central Processing Unit (CPU), a Network Processor (NP), etc.; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
It will be appreciated that the configuration shown in fig. 3 is merely illustrative and may include more or fewer components than shown in fig. 3, or have a different configuration than shown in fig. 3. The components shown in fig. 3 may be implemented in hardware, software, or a combination thereof.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In summary, the method and system for recognizing and extracting the text of a vehicle insurance electronic policy provided by the embodiments of the present application achieve high extraction accuracy for vehicle insurance electronic policies, can recognize and extract all vehicle insurance electronic policies in the insurance industry, can also extract non-vehicle-insurance electronic policies in the insurance industry, and are therefore more widely applicable.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A vehicle insurance electronic policy text recognition and extraction method, characterized by comprising the following steps:
constructing a vehicle insurance electronic policy data model base for the insurance industry;
extracting the coordinates of each character in a PDF file from the data model base and processing the coordinates to obtain text data;
filtering the text data to obtain a vehicle insurance electronic policy;
matching a data set to be extracted for the vehicle insurance electronic policy, and extracting data information from the vehicle insurance electronic policy according to an analysis model;
outputting structured data and writing the structured data into an editable document.
2. The vehicle insurance electronic policy text recognition and extraction method according to claim 1, wherein constructing the vehicle insurance electronic policy data model base for the insurance industry comprises:
training and establishing a preset rule base, establishing a data set for each insurance company's vehicle insurance products, and training and establishing a data analysis model base for each insurance company's vehicle insurance products.
3. The vehicle insurance electronic policy text recognition and extraction method according to claim 1, wherein extracting the coordinates of each character in the PDF file in the data model base and processing the extracted coordinates to obtain text data comprises:
analyzing the content contained in the PDF document to generate PDF block information;
combining, by means of a preset coordinate deviation threshold, the single-character information having the same or similar horizontal coordinates into a line of text, and generating the vertical start coordinate and the horizontal coordinate of the text.
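By way of illustration only, the line-merging step of claim 3 might be sketched as follows. The sketch assumes that "the same or similar horizontal coordinates" refers to characters lying on the same horizontal line, that is, characters whose vertical position differs by no more than the preset deviation threshold. The names `Char`, `Line`, `merge_chars_into_lines` and the default threshold of 2.0 are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    x: float  # horizontal position of the character (assumed: left edge)
    y: float  # vertical position of the character (assumed: grows downward)


@dataclass
class Line:
    text: str
    x: float  # horizontal coordinate of the leftmost character in the line
    y: float  # vertical start coordinate of the line


def merge_chars_into_lines(chars: List[Char], threshold: float = 2.0) -> List[Line]:
    """Group characters whose y differs from the group's first character by <= threshold."""
    groups: List[List[Char]] = []
    for ch in sorted(chars, key=lambda c: (c.y, c.x)):
        if groups and abs(ch.y - groups[-1][0].y) <= threshold:
            groups[-1].append(ch)  # close enough vertically: same line
        else:
            groups.append([ch])  # too far away: start a new line
    lines: List[Line] = []
    for group in groups:
        top = min(c.y for c in group)
        group.sort(key=lambda c: c.x)  # restore left-to-right reading order
        lines.append(Line("".join(c.text for c in group), group[0].x, top))
    return lines
```

Comparing each character with the first character of the current group, rather than with the previous character, is one possible design choice; it keeps a slowly drifting baseline from chaining unrelated lines together.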
4. The vehicle insurance electronic policy text recognition and extraction method according to claim 1, wherein filtering the text data to obtain the vehicle insurance electronic policy comprises:
according to the preset rule base trained and established in advance, removing vehicle insurance electronic bills, electronic marks and electronic invoices by an exclusion method, and identifying the vehicle insurance electronic policy by a matching method.
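As a non-limiting sketch of the filtering in claim 4, exclusion rules may be applied first and matching rules second. The keyword tuples below are placeholders standing in for the trained preset rule base, whose actual contents the application does not disclose; `is_vehicle_policy` is likewise a hypothetical name.

```python
from typing import Iterable

# Hypothetical rule base: placeholder keywords, not the trained rules.
EXCLUDE_KEYWORDS = ("electronic invoice", "insurance mark", "premium bill")
POLICY_KEYWORDS = ("policy number", "insured")


def is_vehicle_policy(lines: Iterable[str]) -> bool:
    """Return True only if no exclusion rule fires and every matching rule is satisfied."""
    text = " ".join(lines).lower()
    # Exclusion method: bills, marks and invoices are removed first.
    if any(keyword in text for keyword in EXCLUDE_KEYWORDS):
        return False
    # Matching method: the remaining document must look like a policy.
    return all(keyword in text for keyword in POLICY_KEYWORDS)
```

Running the exclusion rules before the matching rules means that a document carrying both policy fields and an invoice heading is rejected, which is one plausible reading of the claim.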
5. The vehicle insurance electronic policy text recognition and extraction method according to claim 1, wherein matching the data set to be extracted for the vehicle insurance electronic policy and extracting data information from the vehicle insurance electronic policy according to the analysis model comprises:
identifying the insurance company and the vehicle insurance product of the insurance company according to the preset rule base trained and established in advance, and retrieving the data set of the insurance company's vehicle insurance product according to the identified insurance company and vehicle insurance product;
parsing, in sequence, the data items in the data set of the insurance company's vehicle insurance product, and extracting the data from the vehicle insurance electronic policy according to the analysis model base of the insurance company's vehicle insurance products trained and established in advance.
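The lookup described in claim 5 could, under stated assumptions, be sketched as below: the insurer and product are identified by keyword rules, the pair selects a data set (a list of data items), and each item is parsed in turn by the matching entry of an analysis model base. All rule contents, the keys `policy_no` and `plate_no`, and the helper `field_after` are hypothetical illustrations rather than the trained models of the application.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Parser = Callable[[str], Optional[str]]


def field_after(label: str) -> Parser:
    """Build a parser returning the first token that follows `label` and a colon."""
    pattern = re.compile(re.escape(label) + r"\s*:\s*(\S+)")

    def parse(text: str) -> Optional[str]:
        match = pattern.search(text)
        return match.group(1) if match else None

    return parse


# Hypothetical rule base: name -> lowercase keyword expected in the policy text.
INSURER_RULES = {"Example Insurer": "example insurance company"}
PRODUCT_RULES = {"compulsory": "compulsory traffic accident", "commercial": "commercial vehicle"}

# Hypothetical data sets and analysis models, keyed by (insurer, product).
DATA_SETS: Dict[Tuple[str, str], List[str]] = {
    ("Example Insurer", "commercial"): ["policy_no", "plate_no"],
}
ANALYSIS_MODELS: Dict[Tuple[str, str], Dict[str, Parser]] = {
    ("Example Insurer", "commercial"): {
        "policy_no": field_after("Policy No"),
        "plate_no": field_after("Plate No"),
    },
}


def first_match(rules: Dict[str, str], text: str) -> Optional[str]:
    """Return the first rule name whose keyword occurs in the text, or None."""
    lowered = text.lower()
    return next((name for name, keyword in rules.items() if keyword in lowered), None)


def extract(text: str) -> Dict[str, Optional[str]]:
    """Identify insurer and product, then parse each data item of their data set in order."""
    key = (first_match(INSURER_RULES, text), first_match(PRODUCT_RULES, text))
    if key not in DATA_SETS:
        return {}
    model = ANALYSIS_MODELS[key]
    return {item: model[item](text) for item in DATA_SETS[key]}
```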
6. The vehicle insurance electronic policy text recognition and extraction method according to claim 5, further comprising:
obtaining, through a data positioning text acquisition model, a text set containing the data, and combining the text set in sequence into text information according to the vertical start coordinates;
intercepting the text information through a data interception model to obtain the text information of the value of the data item;
formatting the text information of the value of the data item through a data formatting model to obtain the formatted value of the data item.
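For a single data item, the three stages of claim 6 (positioning, interception, formatting) might look like the following minimal sketch. It assumes a premium amount as the data item, keyword-based positioning, label-based interception and a "CNY" currency prefix; none of these specifics come from the application, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    text: str
    y: float  # vertical start coordinate of the text line


def locate(lines: List[Line], keyword: str) -> str:
    """Positioning: collect lines containing the keyword and join them in top-to-bottom order."""
    hits = sorted((ln for ln in lines if keyword in ln.text), key=lambda ln: ln.y)
    return " ".join(ln.text for ln in hits)


def intercept(text: str, start_label: str, end_label: Optional[str] = None) -> str:
    """Interception: keep the text after start_label and, if given and found, before end_label."""
    start = text.find(start_label)
    if start < 0:
        return ""
    start += len(start_label)
    end = text.find(end_label, start) if end_label else -1
    value = text[start:end] if end >= 0 else text[start:]
    return value.strip(" :")


def format_amount(raw: str) -> float:
    """Formatting: drop the assumed currency prefix and thousands separators."""
    return float(raw.replace("CNY", "").replace(",", "").strip())
```

A caller would chain the stages, for example `format_amount(intercept(locate(lines, "premium"), "Total premium", "Tax"))`.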
7. The vehicle insurance electronic policy text recognition and extraction method according to claim 6, wherein obtaining, through the data positioning text acquisition model, the text set containing the data and combining the text set in sequence into text information according to the vertical start coordinates comprises:
the positioning text acquisition model is composed of a plurality of positioning text acquisition functions, and the plurality of positioning text acquisition functions are executed in sequence to complete data positioning.
8. The vehicle insurance electronic policy text recognition and extraction method according to claim 6, wherein intercepting the text information through the data interception model to obtain the text information of the value of the data item comprises:
the data interception model is composed of a plurality of data interception functions, and the plurality of data interception functions are executed in sequence to complete data interception.
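Claims 7 and 8 both describe a model as an ordered list of functions executed one after another. One hedged way to picture this, shown here for an interception model only, is a list of string-to-string steps folded over the input. The step builders `after`, `before` and `strip_chars` are hypothetical helpers introduced for this sketch.

```python
from functools import reduce
from typing import Callable, Sequence

Step = Callable[[str], str]


def after(label: str) -> Step:
    """Step keeping only the text that follows the first occurrence of label."""
    def step(text: str) -> str:
        _, found, tail = text.partition(label)
        return tail if found else ""
    return step


def before(label: str) -> Step:
    """Step keeping only the text that precedes the first occurrence of label."""
    def step(text: str) -> str:
        head, _, _ = text.partition(label)
        return head
    return step


def strip_chars(chars: str) -> Step:
    """Step trimming the given characters from both ends."""
    return lambda text: text.strip(chars)


def run_in_sequence(steps: Sequence[Step], text: str) -> str:
    """Execute the functions of a model in order, feeding each output to the next step."""
    return reduce(lambda acc, step: step(acc), steps, text)
```

Under these assumptions, an interception model for a policy number could be written as `[after("Policy No"), before("Insured"), strip_chars(" :")]`, so that supporting a new policy layout means editing a list rather than rewriting control flow.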
9. A vehicle insurance electronic policy text recognition and extraction system, comprising:
a database construction module, configured to construct a vehicle insurance electronic policy data model base for the insurance industry;
an extraction module, configured to extract and process the coordinates of each character in the PDF file in the data model base to obtain text data;
a filtering module, configured to filter the text data to obtain the vehicle insurance electronic policy;
a processing module, configured to match the data set to be extracted for the vehicle insurance electronic policy and to extract data information from the vehicle insurance electronic policy according to the analysis model;
and an output module, configured to output the structured data and write the structured data into the editable document.
10. A vehicle insurance electronic policy text recognition and extraction system according to claim 9, comprising:
at least one memory for storing computer instructions;
at least one processor in communication with the memory, wherein the at least one processor, when executing the computer instructions, causes the system to implement: the database construction module, the extraction module, the filtering module, the processing module and the output module.
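Purely as an illustrative sketch, the modules recited in claims 9 and 10 could be wired together as injected callables. Here the output of the database construction module is assumed to be represented by the callables passed in, and CSV is assumed as the "editable document" format; the application does not specify either. The class name `PolicyExtractionSystem` and the helper `to_csv` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


def to_csv(record: Dict[str, str]) -> str:
    """Output module: write one structured record as CSV text (assumed editable format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


@dataclass
class PolicyExtractionSystem:
    extract_text: Callable[[bytes], List[str]]           # extraction module
    is_policy: Callable[[List[str]], bool]               # filtering module
    parse_fields: Callable[[List[str]], Dict[str, str]]  # processing module

    def run(self, pdf_bytes: bytes) -> Optional[str]:
        """Run the modules in order; return None when the file is filtered out."""
        lines = self.extract_text(pdf_bytes)
        if not self.is_policy(lines):
            return None
        return to_csv(self.parse_fields(lines))
```

Injecting the modules as callables keeps the sketch independent of any particular PDF parsing library, which the claims leave open.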
CN202110247927.5A 2021-03-06 2021-03-06 Vehicle insurance electronic insurance policy text recognition and extraction method and system Pending CN112906352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110247927.5A CN112906352A (en) 2021-03-06 2021-03-06 Vehicle insurance electronic insurance policy text recognition and extraction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110247927.5A CN112906352A (en) 2021-03-06 2021-03-06 Vehicle insurance electronic insurance policy text recognition and extraction method and system

Publications (1)

Publication Number Publication Date
CN112906352A true CN112906352A (en) 2021-06-04

Family

ID=76107992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110247927.5A Pending CN112906352A (en) 2021-03-06 2021-03-06 Vehicle insurance electronic insurance policy text recognition and extraction method and system

Country Status (1)

Country Link
CN (1) CN112906352A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627189A (en) * 2021-08-17 2021-11-09 青岛全掌柜科技有限公司 Entity identification information extraction, storage and display method for insurance clauses
CN113642408A (en) * 2021-07-15 2021-11-12 杭州玖欣物联科技有限公司 Method for processing and analyzing picture data in real time through industrial internet

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018036998A (en) * 2016-09-02 2018-03-08 株式会社アイリックコーポレーション Insurance policy image analysis system, description content analysis device, portable terminal and portable terminal program
CN109918679A (en) * 2019-03-22 2019-06-21 成都晟堃科技有限责任公司 A method of parsing papery declaration form data
CN110334346A (en) * 2019-06-26 2019-10-15 京东数字科技控股有限公司 A kind of information extraction method and device of pdf document
CN111666868A (en) * 2020-06-03 2020-09-15 阳光保险集团股份有限公司 Insurance policy identification method and device and computer equipment
CN112270224A (en) * 2020-10-14 2021-01-26 招商银行股份有限公司 Insurance responsibility analysis method and device and computer readable storage medium
CN112307741A (en) * 2020-12-31 2021-02-02 北京邮电大学 Insurance industry document intelligent analysis method and device

Similar Documents

Publication Publication Date Title
EP2671190B1 (en) System for data extraction and processing
CN112733639B (en) Text information structured extraction method and device
CN112906352A (en) Vehicle insurance electronic insurance policy text recognition and extraction method and system
CN112861648A (en) Character recognition method and device, electronic equipment and storage medium
CA3048356A1 (en) Unstructured data parsing for structured information
CN114218391A (en) Sensitive information identification method based on deep learning technology
CN112860905A (en) Text information extraction method, device and equipment and readable storage medium
JP2019079347A (en) Character estimation system, character estimation method, and character estimation program
CN114005126A (en) Table reconstruction method and device, computer equipment and readable storage medium
CN112380300A (en) Multi-class event element extraction and analysis method and equipment
CN107943760B (en) Method and device for optimizing fonts of PDF document editing, terminal equipment and storage medium
CN116030469A (en) Processing method, processing device, processing equipment and computer readable storage medium
CN115294592A (en) Claim settlement information acquisition method and acquisition device, computer equipment and storage medium
CN110874398B (en) Forbidden word processing method and device, electronic equipment and storage medium
CN114706886A (en) Evaluation method and device, computer equipment and storage medium
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN113888760A (en) Violation information monitoring method, device, equipment and medium based on software application
CN113343663A (en) Bill structuring method and device
CN113536782A (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN115357688B (en) Enterprise list information acquisition method and device, storage medium and electronic equipment
Al-Barhamtoshy et al. Universal metadata repository for document analysis and recognition
CN109522423B (en) Fingerprint implanting and information identifying method, device, computer equipment and storage medium
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium
CN112651725B (en) Electronic invoice parsing method and device
CN116244439A (en) Method, device, equipment and readable storage medium for analyzing intention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210604