CN115935042B

CN115935042B - Mortgage asset intelligent duplicate checking method and system based on fusion model

Info

Publication number: CN115935042B
Application number: CN202310109657.0A
Authority: CN
Inventors: 申宇峰; 李建斌
Original assignee: Rose Tree Technology Co ltd
Current assignee: Rose Tree Technology Co ltd
Priority date: 2023-01-19
Filing date: 2023-01-19
Publication date: 2023-09-26
Anticipated expiration: 2043-01-19
Also published as: CN115935042A

Abstract

The application provides a fusion-model-based method and system for intelligent duplicate checking of mortgage assets. According to the market subject name and the comparison target input by a user, the method searches the Zhongdeng network database for valid registration records of the market subject through an asynchronous query strategy and downloads them; text is extracted from all registration certificates and from the PDF and JPG files attached to them to obtain structured and semi-structured text results; the text results are compared with the comparison target to judge whether the comparison target has been mortgaged repeatedly, and a duplicate checking result is returned. The application focuses on intelligent duplicate checking of mortgage assets: through an improved OCR recognition model and a fusion model that combines PDF parsing with OCR recognition, the recognized text content is output in structured form quickly and accurately, which improves the accuracy and speed of duplicate checking.

Description

Mortgage asset intelligent duplicate checking method and system based on fusion model

Technical Field

The application relates to the technical field of intelligent mortgage asset duplication checking, in particular to a method and a system for intelligent mortgage asset duplication checking based on a fusion model.

Background

In the mortgage and financing guarantee business, to avoid the business risk that a market subject mortgages the same asset repeatedly, the mortgage assets the market subject has registered on the Zhongdeng network must be queried to judge whether an asset has already been mortgaged or pledged. The mortgage assets registered by one market subject on the Zhongdeng network usually comprise tens of thousands of small assets. Searching, querying, and downloading all of these registered assets by hand and comparing them one by one with the comparison target consumes a great deal of labor, material, and financial resources, and carries a high risk of human misjudgment.

In the mortgage financing guarantee examination business, the prior art implements asset checking based on NLP technology and intelligent classification. One example is patent application number CN202111671968.3, an asset check method based on intelligent classification, together with a system, equipment, and a computer readable storage medium. That patent classifies the recognized content according to preset asset description items, so that asset content matching the retrieval description can be found more quickly and conveniently during later keyword retrieval. To prevent omissions, original asset registration files that do not belong to any asset description type are also fed back and displayed so that users can perform a secondary check on them, which prevents omissions when checking whether an asset has been registered repeatedly.

In view of the technical scheme of the above patent, a method and a system for intelligent duplicate checking of mortgage assets are needed that can quickly and efficiently judge whether an asset a market subject intends to mortgage is at risk of repeated mortgage in the mortgage guarantee financing business.

Disclosure of Invention

The application provides a mortgage asset intelligent duplicate checking method and system based on a fusion model.

In a first aspect, a method for intelligent duplication checking of a mortgage asset based on a fusion model is provided, and the following technical scheme is adopted:

a mortgage asset intelligent duplicate checking method based on a fusion model comprises the following steps:

starting a Zhongdeng network data crawler engine in response to a user inputting a market subject name and a comparison target;

searching the Zhongdeng network database, through an asynchronous query strategy, for valid registration records under the market subject name according to the market subject name and the comparison target;

downloading the registration certificate files and attachments of the valid registration records in batches from the Zhongdeng network database, and uploading them to a block storage access data module built on the S3 object storage protocol;

converting the PDF files among the registration certificate files and their attachments into JPG files; inputting the converted registration certificate standard PDF file into a fusion model with PDF parsing and OCR recognition functions and extracting a structured text result of the registration certificate form; inputting the other converted files, namely those other than the standard PDF file, into an OCR recognition model and extracting a semi-structured text result;

aggregating the semi-structured text result and the structured text result recognized by the models, and temporarily storing the recognized text results in the block storage access data module;

comparing the semi-structured text result and the structured text result with the comparison target of the market subject, and returning that the duplicate check is not passed if any one of an invoice code, an invoice number, a contract code, or a contract number is repeated with the comparison target; if none is repeated, judging whether the contract name or the debtor name is repeated with the comparison target, transferring to manual review if so, and returning that the duplicate check is passed if not.

Preferably, after it is determined that neither the contract name nor the debtor name is repeated with the comparison target, the method further comprises judging, with a keyword recognition model, whether the semi-structured text result and the structured text result contain sensitive keywords; if they do, transferring to manual review, and if they do not, returning that the duplicate check is passed.

Preferably, the inputting the converted standard PDF file into a fusion model with PDF parsing and OCR recognition functions and extracting the structured text result of the registration certificate form includes:

processing the JPG file of the first page of the standard PDF file through the OCR recognition model, and recognizing and extracting a structured text result;

judging whether the length of a text string recognized by the OCR recognition model exceeds a set threshold; if so, calling the PDF parsing model to process the PDF file of the standard PDF file at the corresponding page number, splicing a complete long text paragraph according to the contextual semantics of the long text paragraph, and adding the complete long text paragraph at the corresponding position of the structured result dictionary to obtain the structured text result.

Preferably, the processing flow of the OCR recognition model includes:

inputting the JPG file into the OCR recognition model to obtain the text contents of a plurality of text boxes and the position coordinates of the text boxes;

for each text box recognized by the OCR recognition model, taking its text content and position coordinates, calculating the position coordinates of the center point of the text box, and storing them as a temporary variable;

for each text content having a positional relationship, finding the matching key-value text pair through the corresponding center point position coordinates;

for text content having a semantic relationship, finding the matching context string through the corresponding center point position coordinates.

Preferably, finding the matching key-value text pair for each text content having a positional relationship through the corresponding center point position coordinates includes:

calculating the difference between the ordinate of the center point of one text content and the ordinate of the center point of another text content;

if the difference is within a set threshold range, the one text content and the other text content form a key-value text pair.

Preferably, finding the matching context string for text content having a semantic relationship through the corresponding center point position coordinates includes:

calculating the difference between the abscissas of the center points of the upper and lower text strings of the text content;

if the difference between the abscissas is within a set threshold range, splicing the two text strings into one text string with complete semantics.

In a second aspect, a system for intelligent duplicate checking of mortgage assets based on a fusion model is provided, which adopts the following technical scheme:

a system for intelligent duplicate checking of mortgage assets based on a fusion model, comprising: a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor, when executing the program, implements the method for intelligent duplicate checking of mortgage assets based on a fusion model according to the first aspect.

In a third aspect, a computer readable storage medium is provided, which adopts the following technical scheme:

a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for intelligent duplicate checking of mortgage assets based on a fusion model according to the first aspect.

Compared with the prior art, the method and system provided by the application focus on comprehensive and systematic intelligent duplicate checking of mortgage assets in the mortgage guarantee financing business. The improved OCR recognition model matches the semantic and positional relationships among text contents well, and the fusion model of PDF parsing and OCR recognition outputs the recognized text content in structured form quickly and accurately, which improves the accuracy and speed of duplicate checking.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the application is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the embodiments of the application, serve to explain the application. In the drawings:

FIG. 1 is a flow chart of a method for intelligent duplicate checking of mortgage assets based on a fusion model in an embodiment of the application;

FIG. 2 is a block diagram of a system for intelligent duplicate checking of mortgage assets based on a fusion model in an embodiment of the application.

Detailed Description

The preferred embodiments of the present application will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present application only, and are not intended to limit the present application.

The application mainly comprises two functional modules: data temporary storage and fusion-model intelligent duplicate checking. Accordingly, a data temporary storage module and a fusion-model intelligent duplicate checking module need to be constructed, with the following specific steps:

1) Constructing the data temporary storage module.

The Zhongdeng network data crawler engine is started, all registration certificate files and attachments of the market subject to be checked are searched for and downloaded in batches through an asynchronous query strategy, and a block storage access data module is built using the S3 object storage protocol.

2) Constructing the fusion-model intelligent duplicate checking module.

Registration certificate standard PDF files registered by the market subject on the Zhongdeng network are selected from the storage directory of the S3 object storage block according to the file naming format, and a structured result of the registration certificate standard form is output by the fusion model of PDF parsing and OCR (Optical Character Recognition). Non-standard PDF files among the attachments of the market subject's registration certificates are converted into JPG files in batches by the PDF-to-JPG conversion unit and then undergo OCR recognition together with the other JPG files. The model recognition results are aggregated and temporarily stored in the designated storage directory of the S3 object storage block. The recognition results are then compared with the comparison target of the market subject: if an invoice code, invoice number, contract code, or contract number is repeated, the duplicate check is returned as not passed; if a contract name or debtor name is repeated, the case is transferred to manual review. Otherwise, a keyword recognition model judges whether any registration certificate file of the market subject or its attachments contains sensitive keywords (such as "all", "future", and "including but not limited to"); if so, the case is transferred to manual review, and if not, the duplicate check is returned as passed.

The application provides a mortgage asset intelligent duplicate checking method based on a fusion model, which is applied to a server, as shown in fig. 1, and comprises the following steps:

Step S100, starting the Zhongdeng network data crawler engine in response to the user inputting the market subject name (including any former name) and the comparison target.

Step S200, searching the Zhongdeng network database, through an asynchronous query strategy, for valid registration records under the market subject name according to the market subject name and the comparison target input by the user. If no valid registration record exists, the duplicate check is directly returned as passed and the program terminates; if a valid registration record exists, the program continues with step S300.

Before searching for valid registration records, the method also searches the Zhongdeng network database, through an asynchronous query strategy, for any registration record under the market subject name according to the market subject name and the comparison target input by the user. If no registration record exists, no mortgage asset has been registered and there can be no repetition, so the duplicate check is directly returned as passed and the program terminates. If a registration record exists, the Zhongdeng network database is then searched through the asynchronous query strategy for valid registration records under the market subject name.
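The pre-check described in step S200 can be sketched as follows. This is a minimal illustration only: the application does not say how the asynchronous query strategy is implemented, so the use of Python's `asyncio`, the `query_records` coroutine, and the in-memory `FAKE_DB` are all assumptions introduced here.

```python
import asyncio

# Hypothetical stand-in for the registration database; a real crawler
# would issue network requests here instead.
FAKE_DB = {
    "A Technology Co., Ltd.": [
        {"cert_no": "00834787000100861048", "valid": True},
        {"cert_no": "00000000000000000001", "valid": False},
    ],
}


async def query_records(name):
    """Hypothetical lookup of all registration records under one name."""
    await asyncio.sleep(0)  # yields control, standing in for network I/O
    return FAKE_DB.get(name, [])


async def find_valid_records(names):
    """Query every name concurrently and keep only the valid records."""
    per_name = await asyncio.gather(*(query_records(n) for n in names))
    return [rec for records in per_name for rec in records if rec["valid"]]


def precheck(names):
    """Return ('pass', []) when nothing valid is registered, else ('continue', records)."""
    valid = asyncio.run(find_valid_records(names))
    return ("continue", valid) if valid else ("pass", [])
```

A result of `"pass"` corresponds to terminating with the duplicate check passed, and `"continue"` corresponds to proceeding to step S300 with the valid records.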

Step S300, downloading all registration certificate files and attachments of the valid registration records in batches from the Zhongdeng network database, and uploading them to the block storage access data module built on the S3 object storage protocol. The data storage directory structure of the block storage access data module is as follows:

/market subject name/data/registration certificate number/registration certificate file and its attachments, for example /a technology company, ltd/data/00834787000100861048/00834787000100861048.pdf and /a technology company, ltd/data/00834787000100861048/attorney docket.pdf.

In the embodiment of the application, if a downloaded registration certificate and its attachments arrive as a zip archive, the archive is decompressed first, and the decompressed registration certificate and attachments are then uploaded to the block storage access data module.
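The decompress-then-upload rule can be sketched with Python's standard `zipfile` module. The function name `unpack_if_zip` and the choice to work on in-memory bytes are assumptions made for illustration; the upload call itself is left out because the application does not describe it.

```python
import io
import zipfile


def unpack_if_zip(filename, payload):
    """Return a list of (name, bytes) pairs that are ready for upload.

    A zip archive is expanded into its member files; any other download
    is passed through unchanged.
    """
    if not zipfile.is_zipfile(io.BytesIO(payload)):
        return [(filename, payload)]
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return [
            (info.filename, archive.read(info))
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]
```

Each returned pair would then be written under the `/market subject name/data/registration certificate number/` directory described in step S300.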

Step S400, judging in turn whether each decompressed registration certificate file and attachment is a Zhongdeng network registration certificate standard PDF file; if so, executing step S600; otherwise, executing step S500. In the embodiment of the application, whether a file is a Zhongdeng network registration certificate standard PDF file is determined by its file name format: the file name of a standard PDF file consists of the mortgage asset registration certificate number followed by ".pdf". Such a file is the main file of the mortgage asset registration, and its content is presented mainly as a registration certificate form.

Step S500, converting the non-standard PDF files among the registration certificates and attachments into JPG files in batches through the PDF-to-JPG conversion unit, inputting these JPG files together with the original JPG files among the registration certificates and attachments into the OCR recognition model, extracting a semi-structured text result, and then executing step S700.

Step S600, converting the standard PDF file into JPG files through the PDF-to-JPG conversion unit, inputting them into the fusion model with PDF parsing and OCR recognition functions, and extracting the structured text result of the registration certificate form.
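The file name test in step S400 can be sketched as below. It assumes the storage layout of step S300, in which the parent directory of each file is named after the registration certificate number; the function names and the step labels returned by `route` are illustrative only.

```python
from pathlib import PurePosixPath


def is_standard_certificate_pdf(key):
    """True when the file is named '<registration certificate number>.pdf'.

    Assumes the parent directory name is the registration certificate
    number, as in the storage layout of step S300.
    """
    path = PurePosixPath(key)
    cert_no = path.parent.name
    return path.suffix.lower() == ".pdf" and path.stem == cert_no


def route(key):
    """Pick the next step for one stored file."""
    return "S600" if is_standard_certificate_pdf(key) else "S500"
```

For the two example paths given in step S300, `route` sends `00834787000100861048.pdf` to step S600 and `attorney docket.pdf` to step S500.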

Step S700, aggregating the semi-structured text result and the structured text result recognized by the models, and temporarily storing them in the block storage access data module of the S3 object storage platform, where the storage directory structure is as follows:

/market subject name/result/registration certificate number/recognition result of the registration certificate and its attachments, for example /a technology company, ltd/result/00834787000100861048/00834787000100861048.json (a structured text result) and /a technology company, ltd/result/00834787000100861048/attorney docket 0.txt (a semi-structured text result).

Step S800, comparing the semi-structured text result and the structured text result with the comparison target of the market subject, and judging whether an invoice code, invoice number, contract code, or contract number is repeated with the comparison target. If any one of them is repeated, the duplicate check is returned as not passed, together with the text results whose invoice code, invoice number, contract code, or contract number is repeated; otherwise, step S900 is performed.

Step S900, judging whether the contract name or the debtor name in the semi-structured text result and the structured text result is repeated with the comparison target; if so, transferring to manual review; otherwise, step S1000 is performed.

In the embodiment of the application, when steps S800 and S900 are executed, fields such as invoice code, invoice number, contract code, contract number, contract name, and debtor name are first extracted from the structured text result and checked for repetition with the comparison target; the same fields are then extracted from the semi-structured text result and checked for repetition with the comparison target.

In this embodiment, the invoice code and similar identifiers are judged first and the debtor name and similar names second. This order is the judgment logic required by the business side and generally may not be swapped, because the two checks differ in priority. If an invoice code or similar identifier is repeated, the duplicate check is not passed and the system flow terminates. If not, and a debtor name or similar name is repeated, the file is transferred to manual review (the other files still need to go to the keyword recognition model to judge whether sensitive keywords exist). If neither is repeated, the keyword recognition model judges whether sensitive words exist: if so, the file is transferred to manual review; if not, the duplicate check is passed and the system flow terminates.

Step S1000, judging, with a keyword recognition model, whether any registration certificate of the market subject or its attachments (the semi-structured text result and the structured text result) contains sensitive keywords (such as "all", "future", and "including but not limited to"); if sensitive keywords are present, transferring to manual review, and if not, returning that the duplicate check is passed.

The embodiment of the application focuses on comprehensive and systematic intelligent duplicate checking of mortgage assets in the mortgage guarantee financing business. The improved OCR recognition model matches the semantic and positional relationships among text contents well, and the fusion model of PDF parsing and OCR recognition outputs the recognized text content in structured form quickly and accurately, which improves the accuracy and speed of duplicate checking.
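The priority order of steps S800 to S1000 can be sketched as a three-stage cascade. The field names, the result labels, and the use of plain substring matching in place of the keyword recognition model are all assumptions made for this illustration; the application does not specify how that model works.

```python
HARD_FIELDS = ("invoice_code", "invoice_number", "contract_code", "contract_number")
SOFT_FIELDS = ("contract_name", "debtor_name")
SENSITIVE = ("all", "future", "including but not limited to")


def repeated(fields, extracted, target):
    """Names of the fields whose extracted value equals the target's value."""
    return [f for f in fields if extracted.get(f) and extracted.get(f) == target.get(f)]


def check_duplicates(extracted, target, texts, sensitive=SENSITIVE):
    """Return (decision, evidence) following the S800 -> S900 -> S1000 order."""
    hits = repeated(HARD_FIELDS, extracted, target)
    if hits:  # S800: identifier repeated, check fails outright
        return ("fail", hits)
    hits = repeated(SOFT_FIELDS, extracted, target)
    if hits:  # S900: name repeated, a person must decide
        return ("manual_review", hits)
    # S1000: crude substring match standing in for the keyword recognition model
    found = [kw for kw in sensitive if any(kw in text for text in texts)]
    if found:
        return ("manual_review", found)
    return ("pass", [])
```

Keeping the identifier check ahead of the name check mirrors the statement that the two checks differ in priority and may not be swapped.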

The OCR recognition model recognizes the text content of each text box in a JPG file together with the position coordinates of the four vertices of the text box. An example result is [[[885.0, 46.0], [1449.0, 46.0], [1449.0, 91.0], [885.0, 91.0]], ('registration certificate number: 00834787000100861048', 0.9222105145454407)], where 'registration certificate number: 00834787000100861048' is the text string recognized by the OCR recognition model, 0.9222105145454407 is the probability the model assigns to the content of this text box being that string, and [[885.0, 46.0], [1449.0, 46.0], [1449.0, 91.0], [885.0, 91.0]] are the position coordinates of the four vertices of the text box containing the string. The main body of a Zhongdeng network registration certificate standard PDF file is mostly a table structure. The OCR recognition model can locate each text content and the four vertices of its text box, but it is difficult for it to output the recognized text content in structured form. The application therefore improves the OCR recognition model so that it better matches the semantic and positional relationships among text contents and can output the recognized text content in structured form, specifically as follows:

The JPG file is input into the OCR recognition model to obtain the text contents of a plurality of text boxes and the position coordinates of the text boxes.

For each text box recognized by the OCR recognition model, its text content and position coordinates are taken, the position coordinates of the center point of the text box are calculated and stored as a temporary variable, so that each text content has corresponding center point position coordinates.

Each text content having a positional relationship finds its matching key-value text pair by means of the corresponding center point position coordinates. That is, the difference between the ordinate of the center point of the key text and the ordinate of the center point of the value text is calculated, and if the difference is within a set threshold range, the key text and the value text form a key-value text pair.

For example, in the registration certificate both the transferor information and the transferee information contain the key "address", and each address key corresponds to an address text string as its value. Without a positional constraint, the address string in the transferor information and the address string in the transferee information are easily matched to the wrong key. The difference between the ordinate of the center point of each address key and the ordinate of the center point of each address text string is therefore calculated, and if the difference is within the set threshold range, that address key and that address text string form a key-value text pair.
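The center point calculation and the ordinate test can be sketched as follows. The threshold value, the function names, and the extra rule of choosing the nearest box to the right of the key are assumptions added so the example is complete; the application only states that the ordinate difference must fall within a set threshold.

```python
def center(box):
    """Center point of a text box given its four [x, y] vertices."""
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def pair_key_values(boxes, key_names, y_tol=10.0):
    """Match each key text with a value text lying on the same row.

    boxes: list of (four_vertices, text) as produced by the OCR step.
    key_names: texts that are treated as keys (for example "address").
    """
    items = [(text, center(box)) for box, text in boxes]
    keys = [item for item in items if item[0] in key_names]
    values = [item for item in items if item[0] not in key_names]
    pairs = []
    for key_text, (kx, ky) in keys:
        # Same row: ordinate difference within the threshold.
        # Assumption: the value sits to the right of its key.
        same_row = [
            (vx - kx, value_text)
            for value_text, (vx, vy) in values
            if abs(vy - ky) <= y_tol and vx > kx
        ]
        if same_row:
            pairs.append((key_text, min(same_row)[1]))  # nearest to the right
    return pairs
```

With two "address" keys on different rows, each key is paired only with the address string whose center lies on the same row, which avoids the transferor and transferee mismatch described above.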

Text content having a semantic relationship finds its matching context string by means of the corresponding center point position coordinates. That is, the difference between the abscissas of the center points of the upper and lower text strings of the text content is calculated, and if the difference is within a set threshold range, the two text strings are spliced into one text string with complete semantics.

For example, the field "organization code/unified social credit code" in the registration certificate is generally split across two lines, an upper part and a lower part. The difference between the abscissas of the center points of the upper and lower text strings is calculated, and if the difference is within the set threshold range, the two text strings are spliced into one text string with complete semantics.
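The abscissa test for splicing a label that wraps onto two lines can be sketched as follows. The threshold value, the function name, and the exact point at which the example label wraps are assumptions made for illustration.

```python
def splice_if_aligned(upper, lower, x_tol=10.0):
    """Join two stacked strings when their center abscissas nearly coincide.

    upper, lower: (text, (center_x, center_y)) for the upper and lower parts.
    Returns the spliced string, or None when the parts are not aligned.
    """
    upper_text, (upper_x, _) = upper
    lower_text, (lower_x, _) = lower
    if abs(upper_x - lower_x) <= x_tol:
        return upper_text + lower_text
    return None
```

For instance, an upper part centered at x = 150 and a lower part centered at x = 152 are spliced, while a lower part centered at x = 400 is treated as belonging to a different column.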

When the OCR recognition model recognizes dense long text paragraphs, it tends to take too long and its recognition accuracy drops sharply. The second half of the multi-page table structure in a Zhongdeng network registration certificate standard PDF file consists mainly of dense long text paragraphs; for example, the transfer property description in the transfer property information, and the transfer property information attachment, are dense long text paragraphs. A PDF parsing model can extract dense long text paragraphs from a PDF file quickly. One innovation of the application is therefore to fuse the PDF parsing model with the OCR recognition model so that long and short text passages in the file to be recognized are each handled by the better-suited model, and the recognized text content is output in structured form quickly and accurately. The fusion model of the PDF parsing model and the OCR recognition model works as follows:

The JPG file of the first page of the Zhongdeng network registration certificate (the first page of the standard PDF file) is processed by the improved OCR recognition model. If the length of a recognized text string exceeds the set threshold, the PDF parsing model is called to process the registration certificate PDF file at the corresponding page number, a complete long text paragraph is spliced according to the contextual semantics of the long text paragraph, and the complete long text paragraph is added at the corresponding position of the structured result dictionary. For example, the first page of a Zhongdeng network registration certificate contains most of the content of the certificate, and the transfer property description in the transfer property information near the bottom of the first page is long text content; the second page contains part of the transfer property description and the transfer property information attachment; the third page contains no useful text content. Most of the content of the registration certificate is recognized by the improved OCR recognition model, and when the long text content of the transfer property description or the transfer property information attachment is encountered, the PDF parsing model is called to extract the corresponding long text strings quickly. The improved OCR recognition model therefore only needs to process the first-page JPG file, while for the remaining pages the PDF parsing model extracts the corresponding long text strings and adds them at the corresponding positions of the structured result dictionary. Reducing the number of JPG files the improved OCR recognition model must process improves model efficiency, and fusing in the PDF parsing model improves the efficiency and accuracy with which the fusion model recognizes long text strings.
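The hand-off from OCR to PDF parsing can be sketched as a length test per field. The threshold of 50 characters, the function name `fuse_results`, and the `parse_pdf_field` callable standing in for the PDF parsing model are assumptions; the application does not name a specific parser or threshold.

```python
def fuse_results(ocr_fields, parse_pdf_field, max_len=50):
    """Build the structured result dictionary for one registration certificate.

    ocr_fields: field name -> string recognized by OCR on the first page.
    parse_pdf_field: hypothetical callable that returns the complete text of
        a field from the PDF text layer, spliced across pages.
    max_len: strings longer than this are treated as dense long text.
    """
    structured = {}
    for name, text in ocr_fields.items():
        if len(text) > max_len:
            # Long paragraph: take the PDF parser's complete text instead.
            structured[name] = parse_pdf_field(name)
        else:
            # Short field: the OCR result is kept as recognized.
            structured[name] = text
    return structured
```

Only fields that trip the length threshold cost a call to the PDF parser, which reflects the stated goal of limiting OCR work to the first page.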

The application also provides a system for intelligent duplicate checking of mortgage assets based on the fusion model, which is deployed on the server. Specifically, the system includes one or more processors and a memory; FIG. 2 takes one processor 200 and one memory 100 as an example. The processor 200 and the memory 100 may be connected by a bus or by other means, and connection by a bus is taken as an example.

Memory 100, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs and non-transitory computer executable programs, such as the method for intelligent duplicate checking of mortgage assets based on a fusion model in the embodiment of the application. The processor 200 implements the method for intelligent duplicate checking of mortgage assets based on a fusion model in the above embodiments by running the non-transitory software programs and instructions stored in the memory 100.

The memory 100 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store the data required for performing the method for intelligent duplicate checking of mortgage assets based on a fusion model in the above embodiments. In addition, memory 100 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The non-transitory software programs and instructions required to implement the method for intelligent duplicate checking of mortgage assets based on a fusion model in the above embodiments are stored in the memory and, when executed by one or more processors, perform that method, for example the method steps S100 to S1000 in FIG. 1 described above.

In addition, the application also provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for intelligent duplicate checking of mortgage assets based on the fusion model.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A mortgage asset intelligent duplicate checking method based on a fusion model is characterized by comprising the following steps:

comparing the semi-structured text result and the structured text result with a comparison target of a market subject, and returning that the duplicate check is not passed if any one of an invoice code, an invoice number, a contract code, or a contract number is repeated with the comparison target; if none is repeated, judging whether the contract name or the debtor name is repeated with the comparison target, transferring to manual review if so, and returning that the duplicate check is passed if not;

the OCR recognition model processing flow comprises the following steps:

2. The method of claim 1, further comprising, after it is determined that neither the contract name nor the debtor name is repeated with the comparison target, judging with a keyword recognition model whether the semi-structured text result and the structured text result contain sensitive keywords; if they contain sensitive keywords, transferring to manual review, and if not, returning that the duplicate check is passed.

3. The method of claim 1, wherein inputting the converted standard PDF file into a fusion model having PDF parsing and OCR recognition functions and extracting the structured text result of the enrollment proof form comprises:

4. The method of claim 1, wherein the locating the matching key-value text pairs for each text content having a positional relationship by the corresponding center point position coordinates comprises:

5. The method of claim 1, wherein the text content having semantic relationships finds matching contextual strings by corresponding center point location coordinates, comprising:

6. A system for intelligent duplicate checking of mortgage assets based on a fusion model, characterized by comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method for intelligent duplicate checking of mortgage assets based on a fusion model as claimed in any one of claims 1 to 5.

7. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for intelligent duplicate checking of mortgage assets based on a fusion model as claimed in any one of claims 1 to 5.