CN114564930A

CN114564930A - Document information integration method, apparatus, device, medium, and program product

Info

Publication number: CN114564930A
Application number: CN202210199157.6A
Authority: CN
Inventors: 高毓斌
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2022-03-02
Filing date: 2022-03-02
Publication date: 2022-05-31

Abstract

The embodiment of the invention relates to the technical field of computers, and discloses a document information integration method, a device, a system, equipment, a medium and a program product. The method comprises the following steps: obtaining a list of objects to be evaluated; obtaining corresponding unstructured information source documents according to the list of the objects to be evaluated; for each object to be evaluated, matching the content extraction rule corresponding to each document according to the document number of each document in the information source documents corresponding to that object; extracting target key information from the unstructured information source documents based on the content extraction rules; and integrating the target key information according to a preset information integration template to obtain a target structured document.

Description

Document information integration method, apparatus, device, medium, and program product

Technical Field

The embodiment of the invention relates to the technical field of computers, in particular to a method, a device, equipment, a medium and a program product for integrating document information.

Background

In some project evaluation or credit assessment scenarios in the financial field, a large amount of related information content needs to be analyzed and evaluated. However, the information content to be evaluated has a large number of document sources and a large number of documents, and it is necessary to extract and integrate effective information in each document in advance.

At present, a template is written for each type of document using a template engine (FreeMarker), and the document content is then extracted according to the document template structure. However, source documents come in many formats, the templates are difficult to configure, the exported content occupies a large amount of space, the exported content is a non-standard document, and dynamic reprocessing is difficult, so the information integration efficiency is low.

Disclosure of Invention

The embodiment of the invention provides a document information integration method, apparatus, device, medium, and program product, which aim to improve the efficiency of extracting and integrating information from multi-source unstructured documents.

In a first aspect, an embodiment of the present invention provides a document information integration method, where the method includes:

acquiring a list of objects to be evaluated, and acquiring a corresponding information source document according to the list of the objects to be evaluated, wherein the information source document is an unstructured document;

aiming at each object to be evaluated in the list of objects to be evaluated, matching a content extraction rule corresponding to each document according to the document number of each document in the information source document corresponding to the object to be evaluated, and extracting target key information according to the content extraction rule;

and integrating the target key information according to a preset information integration template to generate a target structured document.

Optionally, the matching, according to the document number of each document in the information source document corresponding to the object to be evaluated, a content extraction rule corresponding to each document, and extracting the target key information according to the content extraction rule, includes:

determining a document type corresponding to the information source document according to the document number, and matching a corresponding content extraction rule according to the document type;

and extracting target key information from the information source document of the corresponding type based on preset keyword information in the content extraction rule.

Optionally, the extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule includes:

acquiring a first preset keyword phrase in the preset keyword information;

and matching and extracting a first paragraph taking the first start keyword as a start position and the first end keyword as an end position according to the first start keyword and the first end keyword in the first preset keyword phrase.

Optionally, the extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule further includes:

acquiring a second preset keyword phrase in the preset keyword information;

matching, according to a second start keyword and a second end keyword in the second preset keyword phrase, a second paragraph taking the second start keyword as a start position and the second end keyword as an end position;

extracting, in the second paragraph, sentences containing a first preset keyword;

and extracting a table containing a second preset keyword, or the cell contents of that table.
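As a non-limiting illustration, the start/end keyword matching and the sentence-level filtering described above could be sketched as follows. The function names, the choice to exclude the end keyword from the extracted paragraph, and the punctuation-based sentence splitting are all assumptions of this sketch and are not specified by the embodiment.

```python
import re
from typing import List, Optional


def extract_span(text: str, start_kw: str, end_kw: str) -> Optional[str]:
    """Return the text from start_kw up to (not including) end_kw, or None."""
    start = text.find(start_kw)
    if start == -1:
        return None
    # Search for the end keyword only after the start keyword.
    end = text.find(end_kw, start + len(start_kw))
    if end == -1:
        return None
    return text[start:end].strip()


def sentences_with_keyword(paragraph: str, keyword: str) -> List[str]:
    """Split after sentence-ending punctuation; keep sentences containing keyword."""
    parts = re.split(r"(?<=[.!?。！？])\s*", paragraph)
    return [p.strip() for p in parts if p.strip() and keyword in p]
```

The simple splitter also breaks on decimal points, so a real rule set would likely need a more careful sentence segmenter.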

Optionally, the obtaining of the list of objects to be evaluated includes:

establishing connection with a target risk early warning system, requesting and acquiring a business risk warning list;

and taking the business risk warning list as the list of the objects to be evaluated.

Optionally, the obtaining of the corresponding information source document according to the list of the objects to be evaluated includes:

requesting and acquiring a credit granting document of an object to be evaluated from a preset credit granting service system;

wherein the credit granting document comprises at least one of the following information source documents: a survey report, a preset project evaluation rating report, a preset service application book, an invitation document, and a loan amount application book.

Optionally, the integrating the target key information according to a preset information integration template to generate a target structured document includes:

matching the target key information content with the document content hierarchical structure of the preset information integration template;

and integrating the target key information according to the matching result of the document content hierarchical structure to generate a target structured document.

Optionally, the method further includes:

and acquiring the modification of the user on the target key information extraction item in the content extraction rule, and extracting the target key information according to the modified content extraction rule.

Optionally, the method further includes:

and acquiring the modification of the user on the document content hierarchical structure in the preset information integration template, and integrating the target key information according to the modified document content hierarchical structure.

Optionally, the method further includes:

when the user downloads the target structured document, displaying to the user an information source document download prompt and a download link;

and downloading the information source document associated with the target structured document in response to the user's triggering operation on the download link.

In a second aspect, an embodiment of the present invention provides a document information integration apparatus, where the apparatus includes:

the source document acquisition module is used for acquiring a list of objects to be evaluated and acquiring a corresponding information source document according to the list of the objects to be evaluated, wherein the information source document is an unstructured document;

the information extraction module is used for matching content extraction rules corresponding to the documents according to the document numbers of the documents in the information source documents corresponding to the objects to be evaluated and extracting target key information according to the content extraction rules aiming at each object to be evaluated in the list of the objects to be evaluated;

and the information integration module is used for integrating the target key information according to a preset information integration template to generate a target structured document.

In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:

one or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the document information integration method provided by any embodiment of the invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the document information integration method provided in any embodiment of the present invention.

In a fifth aspect, an embodiment of the present invention further provides a computer program product comprising a computer program, where the computer program, when executed by a processor, implements the document information integration method provided in any embodiment of the present invention.

The embodiment of the invention has the following advantages or beneficial effects:

In the embodiment of the invention, a list of objects to be evaluated is acquired, and the corresponding unstructured information source documents are acquired according to the list. For each object to be evaluated in the list, the content extraction rule corresponding to each document is matched according to the document number of each document in the information source documents corresponding to that object, and target key information is extracted from the unstructured information source documents based on the content extraction rules. A target structured document is then obtained according to the preset information integration template and the extracted target key information. This achieves information extraction and information integration for multi-source unstructured documents without configuring a separate integration template for each type of source document, solves the technical problem that information extraction in the prior art depends on complex template configuration, and improves the efficiency of extracting and integrating information from multi-source unstructured documents. In addition, the method generates a structured document from the extracted information, which solves the prior-art problem that the unstructured document generated after information extraction is difficult to display flexibly; the generated structured document facilitates analysis of the extracted information and improves the efficiency with which a user obtains information.

Drawings

FIG. 1 is a flowchart of a document information integration method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a document information integration method according to a second embodiment of the present invention;

FIG. 3A is a flowchart of a document information integration method according to a third embodiment of the present invention;

FIG. 3B is an interaction diagram of a document information integration system, a target risk early warning system, and a preset credit granting service system according to a third embodiment of the present invention;

FIG. 3C is a flow chart of information extraction of a document information integration system according to a third embodiment of the present invention;

FIG. 4 is a flowchart of a document information integration method provided by the fourth embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a document information integration apparatus according to a fifth embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a computer device according to a sixth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a document information integration method according to an embodiment of the present invention. The method is applicable to cases where target key information is extracted from a plurality of unstructured information source documents and a target structured document is generated from the extracted target key information. For example, in a credit evaluation scenario of the financial industry, target key information may be extracted from unstructured documents in a client credit file, such as a survey report, a preset project evaluation rating report, a preset service application book, an invitation document, and a loan amount application book, to generate a structured client risk check report. Alternatively, in a physical examination scenario in the medical industry, target key information may be extracted from unstructured documents in a client physical examination archive, such as basic information documents, image report documents, examination item documents, and examination conclusion documents, to generate a structured physical examination report. Alternatively, in an academic performance evaluation scenario in the education industry, target key information may be extracted from the basic information document, the performance recording document, the performance analysis document, and the teacher evaluation document in a student file to generate a structured performance evaluation report. The above scenarios are merely examples, and the method provided by the present embodiment is not limited to them. The method can be executed by a document information integration apparatus, which can be implemented by software and/or hardware and integrated in a computer device with an application development function.

As shown in Fig. 1, the document information integration method includes the following steps:

S110, obtaining a list of objects to be evaluated, and obtaining a corresponding information source document according to the list of the objects to be evaluated, wherein the information source document is an unstructured document.

The list of objects to be evaluated may include at least one object to be evaluated, and the object to be evaluated may be an object that requires a target structured document to be generated according to an information source document corresponding to the object to be evaluated. In this embodiment, the list of objects to be evaluated may be sent by the third-party system, or may be determined according to the current selection operation of the user.

For example, in a credit assessment scenario of the financial industry, a target risk early warning system (a third-party system) may automatically determine customers meeting a preset early warning condition as objects to be evaluated, and periodically send the list of objects to be evaluated. The preset early warning condition may be a business condition that needs attention, such as a fund shortage, a project loss, a guarantee loss, or loss-making behavior in the industry. Alternatively, the credit service system may determine clients newly added within a preset time period as objects to be evaluated, and periodically send the list of objects to be evaluated. For another example, in a physical examination scenario in the medical industry, a hospital information system (a third-party system) may determine the clients of a certain enterprise as objects to be evaluated, or determine clients whose preset examination items do not meet a standard reference value as objects to be evaluated, or determine clients whose examination time is within a preset time window as objects to be evaluated, and so on.

In this embodiment, after the list of the objects to be evaluated is obtained, the information source documents corresponding to the list, that is, the information source documents of each object in the list of the objects to be evaluated, are further obtained. An information source document is a source document that corresponds to an object to be evaluated and from which information needs to be extracted. In this embodiment, the information source document is an unstructured document, such as a Word document, a PDF document, or a TXT document. One information source document may be composed of at least one of text information, table information, and picture information, and target key information may be extracted from the text information, table information, or picture information of the information source document by the method of the present embodiment.

For example, in a credit assessment scenario of the financial industry, the information source document may include at least one of a survey report, a preset project evaluation rating report, a preset service application book, an invitation document, and a loan amount application book. The survey report may include at least one of a company-type new credit customer survey report, a company-type stock customer survey report, and a business-type new credit and stock survey report. The preset project evaluation rating report may include at least one of a real estate development loan evaluation rating report and a fixed property loan evaluation rating report. The invitation document may contain the business management opinion of the reporting entity. The loan amount application book may include at least one of a company-type and business-type new credit client application book and a company-type and business-type stock client application book.

S120, aiming at each object to be evaluated in the list of the objects to be evaluated, matching a content extraction rule corresponding to each document according to the document number of each document in the information source document corresponding to the object to be evaluated, and extracting target key information according to the content extraction rule.

Specifically, after the information source document corresponding to each object to be evaluated in the list of objects to be evaluated is obtained, for each object to be evaluated, a content extraction rule corresponding to the information source document of the object to be evaluated may be determined, so as to extract the target key information from each information source document according to the content extraction rule.

In this embodiment, the content extraction rules corresponding to the documents may be matched according to the document numbers of the documents in the information source document. The document number may be a number that characterizes the document type or a document sub-type, that is, documents of the same type may have the same document number.

Specifically, a corresponding document number may be formulated in advance for each document type or document subtype, and then the content extraction rule corresponding to each document type or document subtype and the document number corresponding to the document type or document subtype are stored in an associated manner to generate a directory, so that the corresponding content extraction rule may be queried in the directory according to the obtained document number of each document in each information source document.
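As a non-limiting illustration, the directory described above can be thought of as a mapping from document number to content extraction rule. In the following minimal sketch, the class name `ContentExtractionRule`, the document numbers `"1001"` and `"2001"`, and the keyword values are hypothetical and chosen only for demonstration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentExtractionRule:
    document_subtype: str   # label of the document type or subtype
    keywords: tuple         # preset keywords used when extracting


# Hypothetical directory built in advance: document number -> rule.
RULE_DIRECTORY = {
    "1001": ContentExtractionRule(
        "survey report, new credit customer", ("main product", "capacity")
    ),
    "2001": ContentExtractionRule(
        "project evaluation rating report", ("construction scale",)
    ),
}


def match_rule(document_number: str) -> Optional[ContentExtractionRule]:
    """Look up the rule stored for a document number; None if unregistered."""
    return RULE_DIRECTORY.get(document_number)
```

Returning `None` for an unregistered number leaves it to the caller to decide whether such a document is skipped or reported.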

In this embodiment, taking a credit assessment scenario of the financial industry as an example, information source documents may be divided into types such as a survey report, a preset project evaluation rating report, a preset service application book, an invitation document, and a loan amount application book, and a survey report may be further divided into subtypes such as a company-type new credit customer survey report, a company-type stock customer survey report, and a business-type new credit and stock survey report. Because the content extraction rules corresponding to each subtype may differ, the document number preferably represents the document subtype, so that the content extraction rule matching the document subtype can be obtained accurately.

Further, after the content extraction rule corresponding to each document is obtained, the target key information can be extracted from the document according to the content extraction rule. The content extraction rule may be a keyword extraction rule, that is, target key information is selected by a keyword. The content extraction rule can also be an attention content extraction rule, namely determining the attention content of the document in the document and extracting the attention content of the document as target key information; for example, the document is divided into paragraphs, the frequency of occurrence of each phrase in each paragraph is counted, and the paragraph where the phrase with the highest frequency of occurrence is located is determined as the document attention content.
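As a non-limiting illustration, the attention content extraction rule in the example above could be sketched as follows. The sketch assumes that paragraphs are separated by blank lines, approximates "phrases" by single lower-cased word tokens, applies no stop-word filtering, and resolves ties in favor of the earlier paragraph; none of these choices is specified by the embodiment.

```python
import re
from collections import Counter


def attention_paragraph(document: str) -> str:
    """Return the paragraph whose most frequent token has the highest count."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    best_paragraph, best_count = "", 0
    for paragraph in paragraphs:
        tokens = re.findall(r"\w+", paragraph.lower())
        if not tokens:
            continue
        # Count of the single most frequent token in this paragraph.
        _, count = Counter(tokens).most_common(1)[0]
        if count > best_count:
            best_paragraph, best_count = paragraph, count
    return best_paragraph
```

Without stop-word filtering, common function words can dominate the counts, so a practical version would likely filter them or count multi-word phrases instead.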

Following the above example of a credit assessment scenario in the financial industry, for the survey report, the extracted target key information may include at least one of main product conditions, regional advantages, capacity conditions, the conditions of the top five accounts by capital transaction amount, details of accounts receivable from the top five customers, bank enterprise key information tables, summary tables of income from various businesses, and associated guarantee conditions. For the preset project evaluation rating report, the extracted target key information may include at least one of project geographic position, building content, construction scale, investment and fund sources, development period, construction progress, development cost, project income measurement and calculation, break-even analysis, project sensitivity analysis, supporting facilities around the project, real estate sell-through conditions in the project area, project net present value, and project return conditions. For the preset service application book, the extracted target key information may include at least one of credit service application, the operation management opinions of the reporting unit, guarantee conditions (including guarantee mode, risk mitigation measures, guarantee evaluation value, and mortgage rate), and main repayment sources. For the invitation document, the extracted target key information may include the business management opinions of the reporting unit. For the loan amount application book, the extracted target key information may include at least one of credit granting service usage, credit granting service amount type and distribution, changes in stock client credit granting compared with the previous period, the operation management opinions of the reporting unit, guarantee conditions (including guarantee mode, risk mitigation measures, guarantee evaluation value, mortgage rate, and changes in stock client guarantee conditions compared with the previous period), and main repayment sources.

In this embodiment, the extracted target key information is structured data, which may include a service number, a title (which may be divided into multiple levels), a logical order of document blocks (the sorting code of the extraction record item, sorted in ascending order), a structured data type (text or table), a structured data body (JSON format), a table size (rows/columns), a structured flag, a supplementary description, and a data date. One table entry may be a JSON array, in which each JSON object contains the following attributes:

type: object type (t_head: header annotation / tex: text / table: table);

data: the content, specifically a) for a header: the annotation content; b) for text: the text content; c) for a table: a JSON array in which each array object is a cell with the attributes Col (current column), Colspan (number of merged columns), Row (current row), Rowspan (number of merged rows), and Text (cell content);

hit: the list of hit keywords;

Doc: the source document.
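As a non-limiting illustration, one table entry with the attributes listed above might be built and serialized as follows. The sample values, the file name `survey_report.docx`, the exact key casing, the `Rowspan` key name, and the zero-based `Row`/`Col` indices are assumptions made for this sketch only.

```python
import json

# One table entry: a JSON array whose objects carry type, data, hit and Doc.
table_entry = [
    {
        "type": "t_head",
        "data": "Top five customers by accounts receivable",
        "hit": ["accounts receivable"],
        "Doc": "survey_report.docx",
    },
    {
        "type": "table",
        "data": [
            {"Col": 0, "Colspan": 1, "Row": 0, "Rowspan": 1, "Text": "Customer"},
            {"Col": 1, "Colspan": 1, "Row": 0, "Rowspan": 1, "Text": "Amount"},
            {"Col": 0, "Colspan": 2, "Row": 1, "Rowspan": 1, "Text": "No data reported"},
        ],
        "hit": ["accounts receivable"],
        "Doc": "survey_report.docx",
    },
]


def table_size(cells):
    """Derive (rows, columns) from the cells, counting merged spans."""
    rows = max(cell["Row"] + cell["Rowspan"] for cell in cells)
    cols = max(cell["Col"] + cell["Colspan"] for cell in cells)
    return rows, cols


# Round-trip through JSON, keeping non-ASCII text readable.
serialized = json.dumps(table_entry, ensure_ascii=False)
restored = json.loads(serialized)
```

Here `table_size` shows one way the table size (rows/columns) field could be derived from the cell attributes rather than stored separately.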

S130, integrating the target key information according to a preset information integration template to generate a target structured document.

Specifically, after the target key information is acquired, the target key information can be integrated into the target structured document according to a preset information integration template. The preset information integration template may include a document content hierarchy, that is, a hierarchy of each target key information in the target structured document, which may include the arrangement order information or the preset position information.

Certainly, the preset information integration template may further include text attribute information and cell format information, where the text attribute information includes, but is not limited to, text size, text color, font type, whether to be bolded, and text background, and the cell format information includes, but is not limited to, cell size, text line number included in a cell, combination relationship between cells, and cell alignment manner.

In the embodiment, the target key information extracted from various different types of information source documents can be integrated through the unified preset information integration template, and a corresponding integration template does not need to be established for each type of information source document independently. Specifically, the target key information may be sequentially arranged according to a document content hierarchical structure in the preset information integration template, or the target key information may be placed at a preset position to obtain the target structured document.

In an optional implementation manner, the integrating the target key information according to a preset information integration template to generate a target structured document may be: matching the target key information content with the document content hierarchical structure of the preset information integration template; and integrating the target key information according to the matching result of the document content hierarchical structure to generate a target structured document.

The document content hierarchical structure may include hierarchical structure marks of titles at various levels, and is used for organizing corresponding structured target key information by taking an object to be evaluated as a dimension. In this optional embodiment, each level of title in the target key information may be matched with the document content hierarchical structure to determine the position of each level of title in the target key information content, and then according to the position of each level of title, information corresponding to each level of title in the target key information content is filled into the preset information integration template to generate the target structured document. Through the optional implementation mode, the integrated target structured document can have a document content hierarchical structure, and a user can conveniently and quickly acquire information.
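As a non-limiting illustration, matching extracted items against a two-level document content hierarchy and filling them in template order could be sketched as follows. The `TEMPLATE` contents, the item keys `title1`, `title2`, and `content`, and the decision to omit items whose titles are absent from the template are assumptions of this sketch.

```python
# Hypothetical template: primary title -> ordered list of secondary titles.
TEMPLATE = {
    "Client basic information": ["Main product conditions", "Regional advantages"],
    "Current credit granting business information": ["Credit service usage"],
}


def integrate(items, template):
    """Place each extracted item under its matching titles, in template order."""
    by_title = {}
    for item in items:
        key = (item["title1"], item["title2"])
        by_title.setdefault(key, []).append(item["content"])

    report = []
    for level1, level2_titles in template.items():
        sections = [
            {"title": title, "content": by_title[(level1, title)]}
            for title in level2_titles
            if (level1, title) in by_title
        ]
        if sections:  # skip primary titles with no extracted content
            report.append({"title": level1, "sections": sections})
    return report
```

Because the output order is driven by the template rather than by extraction order, items extracted from different source documents end up in one consistent hierarchy.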

After generating the target structured document, the method provided by this embodiment may further include: when the user downloads the target structured document, displaying to the user an information source document download prompt and a download link; and downloading the information source document associated with the target structured document in response to the user's triggering operation on the download link.

That is, the present embodiment may also download the information source document used for generating the target structured document together with the download of the target structured document, so that the user may view and verify the content of the target structured document based on the downloaded target structured document and information source document.

In the technical scheme of this embodiment, a list of objects to be evaluated is acquired, and the corresponding unstructured information source documents are acquired according to the list. For each object to be evaluated in the list, the content extraction rule corresponding to each document is matched according to the document number of each document in the information source documents corresponding to that object, and target key information is extracted from the unstructured information source documents based on the content extraction rules. A target structured document is then obtained according to the preset information integration template and the extracted target key information. This achieves information extraction and information integration for multi-source unstructured documents without configuring a separate integration template for each type of source document, solves the technical problem that information extraction in the prior art depends on complex template configuration, and improves the efficiency of extracting and integrating information from multi-source unstructured documents. In addition, the method generates a structured document from the extracted information, which solves the prior-art problem that the unstructured document generated after information extraction is difficult to display flexibly; the generated structured document facilitates analysis of the extracted information and improves the efficiency with which a user obtains information.

Example two

Fig. 2 is a flowchart of a document information integration method according to a second embodiment of the present invention, which is optimized based on the above embodiments, and further describes a specific process of extracting target key information. Referring to fig. 2, the document information integration method provided by the present embodiment includes the following steps:

S210, obtaining a list of objects to be evaluated, and obtaining a corresponding information source document according to the list of the objects to be evaluated, wherein the information source document is an unstructured document.

S220, aiming at each object to be evaluated in the list of the objects to be evaluated, determining a document type corresponding to an information source document according to the document number, matching a corresponding content extraction rule according to the document type, and extracting target key information from the information source document of the corresponding type based on preset keyword information in the content extraction rule.

In this embodiment, each document type and the document number corresponding to each document type may be stored in association in advance, and each document type and the content extraction rule corresponding to each document type may be stored in association, so that the corresponding document type may be queried through the document number of the information source document, and then the corresponding content extraction rule may be queried through the document type. It should be noted that the number of content extraction rules corresponding to one document type may be one or multiple.

The content extraction rule in this embodiment includes preset keyword information, that is, the content extraction rule performs keyword-based extraction. Specifically, the preset keyword information may be matched against the text content of an information source document to determine the target key information. Optionally, before the preset keyword information is matched against the text content, the text content of the information source document may be recognized by a character recognition method, or the information source document may be format-converted to obtain its text content. For example, a Word document is converted into an HTML document.

Illustratively, the content extraction rules may include a structured flag, titles at each level, respective corresponding sequence numbers of the titles at each level, preset keyword information, document sub-types, extraction modes, and positioning rules. For example, following the above example of a credit assessment scenario for the financial industry, if there is a level four title, the content extraction rules may be: 0,1, customer basic information, 7, major product status, 1, main product | major product | business range, 3910001,4, w, w.

The structured flag is used to identify whether the extraction object targeted by the content extraction rule is structured data or unstructured data, for example, a value of 1 represents extraction of structured data. Each level of title corresponds to a corresponding level of title in the generated target structured document. Following the above example of the credit assessment scenario of the financial industry, the primary titles may include four titles, and from the order of the titles in the generated target structured document (the client risk check report), the primary titles may sequentially be the client basic information, the current credit granting business information, the project assessment rating (real estate) and the project assessment rating (credit-fixing type project), and for the primary title of the client basic information, the secondary titles under the category may include the client name, the registered capital, the group, the formation date, the main business, the main product condition, the regional advantage, the capacity condition, the account condition of the top five accounts of the capital transaction amount, the account detail of the top five clients which should be received and paid, and the like.

The title serial numbers are used for representing the sequencing among the titles at the same level, each level of title has the corresponding title serial number, and the value of the title serial number at each level starts from 1. The division of the titles at each level can be determined according to the actual application scene. The preset keyword information may be a single keyword or may be composed of a group of keywords, and is used for matching the corresponding text content in the unstructured information source document. Combined keywords may be separated by "|" indicating a relationship of "or". For example, "main product | business range" indicates that if the text matches with "main product" or "business range", this indicates that the text is the text that needs to be extracted by the content extraction rule.
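The "or" semantics of the "|" separator can be illustrated with a short sketch. The function name `matches_keywords` and the plain substring test are assumptions of this example; the original text does not specify how the match itself is performed.

```python
def matches_keywords(text: str, keyword_info: str) -> bool:
    """Return True if the text contains any of the "|"-separated keywords."""
    # Split on "|" and drop surrounding spaces so "a | b" and "a|b" behave alike.
    keywords = [k.strip() for k in keyword_info.split("|") if k.strip()]
    return any(k in text for k in keywords)
```

For instance, `matches_keywords("The business range covers steel trading.", "main product | business range")` matches on the second keyword, while a text that mentions neither keyword does not match.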

In this embodiment, different content needs to be extracted from different unstructured information source documents, and even where the same title exists, the content extraction rules applied to different information source documents may differ. On this basis, the document sub-type in the content extraction rule is used to associate rules with documents, so that when an unstructured information source document is processed, all content extraction rules related to it can be found according to the document sub-type corresponding to the document.
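As a rough sketch of how rules might be associated with documents through the document sub-type, the rule fields listed earlier can be held in a small record and filtered by sub-type. The class `ExtractionRule`, its field names, the function `rules_for_subtype`, and the second sample rule (including sub-type "3910002") are hypothetical; only the values of the first rule are taken from the example rule given above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractionRule:
    structured_flag: int   # 1 = structured data is extracted
    titles: List[str]      # titles at each level, outermost first
    keywords: str          # "|"-separated preset keyword information
    doc_subtype: str       # associates the rule with a kind of document
    extraction_mode: int   # what is extracted (paragraph, sentence, table, ...)
    positioning: str       # where in the document to look


RULES = [
    ExtractionRule(0, ["customer basic information", "major product status"],
                   "main product|major product|business range", "3910001", 4, "w"),
    # Hypothetical second rule for a different document sub-type.
    ExtractionRule(0, ["customer basic information", "registered capital"],
                   "registered capital", "3910002", 2, "w"),
]


def rules_for_subtype(rules: List[ExtractionRule], doc_subtype: str) -> List[ExtractionRule]:
    """Return every rule associated with the given document sub-type."""
    return [r for r in rules if r.doc_subtype == doc_subtype]
```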

The extraction mode defines the type of target key information extracted by the content extraction rule. Extraction modes include, but are not limited to: extracting a table without a header; extracting a paragraph containing a keyword; extracting a sentence containing a keyword; extracting a table with a header, where the header is in a title; extracting a table with a header, where the header is in the top row of the table; extracting mixed paragraphs; and extracting cell data from a table. The positioning rule describes where in the information source document the target key information to be extracted is located. Different extraction modes use different positioning rules.

In an optional implementation manner, the extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule may include: acquiring a first preset keyword phrase in the preset keyword information; and, according to a first start keyword and a first end keyword in the first preset keyword phrase, matching and extracting a first paragraph that takes the first start keyword as its start position and the first end keyword as its end position.

The first preset keyword phrase may include a first start keyword and a first end keyword. Specifically, a start position and an end position can be respectively determined in the information source document through the first start keyword and the first end keyword, and paragraphs between the start position and the end position can be extracted.

For example, when the information source document is an application file, the information to be extracted is the application unit's administration and management opinion, which is usually extracted as a paragraph. The two titles before and after the paragraph to be extracted may therefore be set as the first start keyword and the first end keyword, respectively, and the paragraph containing the opinion can be located and extracted by the above method.

This optional implementation manner extracts paragraph content based on the first start keyword and the first end keyword, which facilitates the extraction of text content that exists in paragraph form in the information source document. Optionally, when the first start keyword and the first end keyword are preset marks, they indicate a full-text search, that is, the full-text content is extracted; when the first start keyword and the first end keyword are titles, the entire paragraph content between the two titles is extracted.
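A minimal sketch of paragraph extraction between a start keyword and an end keyword is given below. It assumes the document has already been converted to plain text; the function name `extract_between` and the choice to include the start keyword but exclude the end keyword are assumptions of this example, not requirements stated in the original text.

```python
from typing import Optional


def extract_between(text: str, start_kw: str, end_kw: str) -> Optional[str]:
    """Return the span from start_kw up to (not including) end_kw, or None."""
    start = text.find(start_kw)
    if start == -1:
        return None  # start keyword not present
    # Search for the end keyword only after the start keyword.
    end = text.find(end_kw, start + len(start_kw))
    if end == -1:
        return None  # end keyword not present after the start
    return text[start:end]
```

When the two keywords are adjacent section titles, the returned span is the whole section headed by the first title.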

In another optional implementation manner, the extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule further includes: acquiring a second preset keyword phrase in the preset keyword information; according to a second start keyword and a second end keyword in the second preset keyword phrase, matching a second paragraph that takes the second start keyword as its start position and the second end keyword as its end position; and, in the second paragraph, extracting the sentences containing a first preset keyword.

For example, the titles in the customer discrepancy assessment report, the limit declaration book and the service declaration book are highly fixed, so these titles can be set as the second start keyword and the second end keyword to locate the paragraph containing the content to be extracted, and a fixed phrase in the sentence containing the information to be extracted can be set as the first preset keyword to select the target sentence from that paragraph.

In this optional embodiment, the preset keyword information includes a first preset keyword and a second preset keyword phrase, the second preset keyword phrase includes a second start keyword and a second end keyword, the second preset keyword phrase is used to locate a second paragraph in the information source document, and the first preset keyword is used to extract a sentence including the first preset keyword from the located second paragraph.

In this way, the target key information can be extracted accurately from the information source document. When the information source document contains several sentences that include the first preset keyword, this avoids the redundant information that would be extracted by matching the first preset keyword directly against the full text, and thereby improves the efficiency and accuracy of target key information extraction. Of course, in this optional embodiment, a target sub-paragraph containing the first preset keyword may also be extracted from the second paragraph according to the first preset keyword.
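The two-stage approach (locate a paragraph first, then select sentences inside it) can be sketched as follows. The function name `extract_sentences` and the sentence-splitting rule (split after Latin end punctuation followed by whitespace, or directly after Chinese end punctuation) are assumptions made for illustration; the original text does not define how sentences are delimited.

```python
import re
from typing import List


def extract_sentences(text: str, start_kw: str, end_kw: str, sentence_kw: str) -> List[str]:
    """Locate the paragraph between two keywords, then keep sentences with sentence_kw."""
    start = text.find(start_kw)
    if start == -1:
        return []
    end = text.find(end_kw, start + len(start_kw))
    if end == -1:
        return []
    paragraph = text[start + len(start_kw):end]
    # Assumed sentence boundaries: ". ", "! ", "? " or Chinese end punctuation.
    sentences = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", paragraph)
    return [s.strip() for s in sentences if sentence_kw in s]
```

In the usage below, "Credit limit" appears three times in the text, but only the occurrence inside the located paragraph is returned, which is the redundancy the paragraph-first approach is meant to avoid.

```python
doc = ("Intro. Credit limit is high. 2. Financials Revenue grew. "
       "Credit limit is 5 million. Costs fell. 3. Risks Credit limit may change.")
print(extract_sentences(doc, "2. Financials", "3. Risks", "Credit limit"))
```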

In another optional implementation manner, the extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule may further include: extracting a table containing a second preset keyword, or the cell contents in the table.

The second preset keyword can be used to match a corresponding title in the information source document. Specifically, the title of the table to be extracted can be matched in the information source document according to the second preset keyword, and then the table, or all cell contents in the table, can be extracted. This optional embodiment enables the extraction of table contents. Of course, the header may be extracted together with the table, where the header is in the top row of the table.

It can be understood that, in another embodiment, a table including a second preset keyword may also be determined in the information source document according to the first preset keyword, and then cell contents are extracted from the table according to the second preset keyword, and by this way, accurate extraction of part of the cell contents in the table may be achieved.
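Table and cell extraction can be sketched under the assumption that tables have already been parsed out of the source document into a simple in-memory form. The `Table` record and the functions `find_table` and `extract_cells` are hypothetical; in particular, selecting cells by matching a keyword against the first cell of each row is only one possible reading of the cell extraction described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Table:
    title: str              # title text associated with the table
    header: List[str]       # top row of the table
    rows: List[List[str]]   # remaining rows


def find_table(tables: List[Table], title_kw: str) -> Optional[Table]:
    """Return the first table whose title contains the keyword."""
    for table in tables:
        if title_kw in table.title:
            return table
    return None


def extract_cells(table: Table, row_kw: str) -> List[str]:
    """Return the cells (after the first) of every row whose first cell contains row_kw."""
    return [cell for row in table.rows
            if row and row_kw in row[0]
            for cell in row[1:]]
```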

S230, integrating the target key information according to a preset information integration template to generate a target structured document.

According to the technical scheme of this embodiment, after the information source document corresponding to the list of the objects to be evaluated is obtained, the document type corresponding to the information source document is determined according to the document number, the corresponding content extraction rule is matched according to the document type, and the target key information is then extracted from the information source document of the corresponding type based on the preset keyword information in the content extraction rule. Keyword-based information extraction is thereby realized, and the efficiency of extracting information from multi-source unstructured documents is improved.

Example three

Fig. 3A is a flowchart of a document information integration method according to a third embodiment of the present invention, which may be combined with the solutions of the foregoing embodiments, and further describes a process of document information integration from the perspective of interaction between systems. Referring to fig. 3A, the document information integration method provided by the present embodiment includes the following steps:

S310, establishing a connection with a target risk early warning system, requesting and acquiring a business risk alarm list, and taking the business risk alarm list as the list of the objects to be evaluated.

The target risk early warning system can be a system for screening each object according to risk early warning rules in a credit evaluation scene of the financial industry. The risk early warning rule can be that the item loss proportion exceeds a set proportion, or the fund shortage amount exceeds a set amount, or the guarantee information is lacked, and the like.
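As a rough illustration of how such risk early warning rules could screen objects into an alarm list, consider the sketch below. The `Candidate` record, its field names, the function `build_alarm_list`, and the threshold values are all hypothetical; the original text names the kinds of rules but gives no concrete thresholds or data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    loss_ratio: float      # project loss proportion
    funding_gap: float     # fund shortage amount
    has_guarantee: bool    # whether guarantee information is present


def build_alarm_list(candidates: List[Candidate],
                     max_loss_ratio: float = 0.2,
                     max_funding_gap: float = 1_000_000.0) -> List[str]:
    """Return the names of objects that trigger at least one early warning rule."""
    return [c.name for c in candidates
            if c.loss_ratio > max_loss_ratio
            or c.funding_gap > max_funding_gap
            or not c.has_guarantee]
```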

Specifically, the target risk early warning system may evaluate each object according to the information source document of each object acquired by the credit granting service system to determine the service risk alarm list. The credit business system may be a system for acquiring information source documents (authorized credit file related documents) of each object.

S320, acquiring a corresponding information source document according to the list of the objects to be evaluated, wherein the information source document is an unstructured document.

Optionally, the obtaining of the corresponding information source document according to the list of the objects to be evaluated includes: requesting and acquiring a credit granting document of an object to be evaluated from a preset credit granting service system; the credit granting document comprises at least one information source document of a survey report, a preset project evaluation rating report, a preset service application book, an invitation document and a loan amount application book.

That is, in a credit assessment scenario of the financial industry, the information source document may include at least one of a survey report, a preset project assessment rating report, a preset service application, an invitation document, and a loan amount application. The survey report may include at least one of a company-type new credit customer survey report, a company-type stock customer survey report, and a business-type new credit and stock survey report. The preset project assessment rating report may include at least one of a real estate development loan evaluation rating report and a fixed property loan evaluation rating report. The invitation document may contain an opinion of the business management of the reporting entity. The loan amount application may include at least one of a company-type or cause-type new credit client declaration and a company-type or cause-type stock client declaration.

Exemplarily, fig. 3B shows an interaction diagram among a document information integration system, a target risk early warning system, and a preset credit granting business system. The document information integration system extracts and integrates information from multi-source unstructured documents based on the document information integration method provided by the embodiments of the invention, and generates a structured document. The document information integration system can be connected to the target risk early warning system and to the preset credit granting business system, respectively. The document information integration system includes an outbound component which, through a configured timing task, packages the credit archive related documents in the preset credit granting business system and transmits them end to end. The document information integration system also supports batch outbound and incremental outbound of customer files by time interval, so that system data is obtained efficiently and accurately and acquisition can be configured to run automatically. The document information integration system can also obtain, at regular intervals (for example, daily), the business risk alarm list determined by the target risk early warning system, and obtain from the preset credit granting business system, through the outbound component, a full compressed package of the credit archive related documents of each customer in the business risk alarm list.

Illustratively, fig. 3C shows an information extraction flow of the document information integration system. The system configures document extraction rules in advance for the chapter (paragraph) contents and chart contents that need to be extracted, and starts a timing task in a job scheduling system. By periodically monitoring the transmission result of the upstream outbound component, it decompresses the full compressed packages of credit archives that were triggered by system alarms or imported manually, extracts information from the decompressed documents, and stores the extracted information in a database. The system supports screening information source documents of specified types, including customer survey reports, project evaluation rating reports, service declarations, limit declarations and the like, from the decompressed files, parses the content that the business focuses on into structured data through the content extraction rules, stores the structured data in the database, and supports subsequent data display and the export of a unified customer risk check report. Of course, the document information integration system can also calculate indexes that meet business requirements based on analysis of the extracted structured data.
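The decompress-and-screen step of this flow can be sketched with the standard `zipfile` module. This is an illustration under several assumptions: the archive is taken to be a ZIP file, document types are assumed to be recognizable from a file-name prefix, and the names `WANTED_PREFIXES` and `screen_archive` are hypothetical. The original text does not state the archive format or how document types are identified inside the package.

```python
import io
import zipfile
from typing import Dict

# Hypothetical file-name prefixes standing in for the specified document types.
WANTED_PREFIXES = ("survey_report", "rating_report",
                   "service_declaration", "limit_declaration")


def screen_archive(archive_bytes: bytes) -> Dict[str, bytes]:
    """Decompress an archive in memory and keep only documents of the specified types."""
    kept: Dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            if name.startswith(WANTED_PREFIXES):
                kept[name] = zf.read(name)
    return kept
```

The returned mapping of file name to content would then be handed to the extraction step and the results stored in the database.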

S330, for each object to be evaluated in the list of the objects to be evaluated, matching a content extraction rule corresponding to each document according to the document number of each document in the information source document corresponding to the object to be evaluated, and extracting target key information according to the content extraction rule.

S340, integrating the target key information according to a preset information integration template to generate a target structured document.

Specifically, after acquiring the target key information, the document information integration system may integrate the target key information using the Apache POI library to generate a customer risk check report, i.e., the target structured document. In this embodiment, the customer risk check report may include customer basic information, financial information, single credit granting business information, project evaluation rating condition, branch feedback opinion, and other contents.
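The integration step can be pictured as filling a title hierarchy with the extracted values. The sketch below uses plain Python dictionaries instead of Apache POI (which is a Java library), so it only illustrates the matching of extracted items to template titles, not the generation of an actual report file. The template contents and the names `TEMPLATE` and `integrate` are assumptions of this example.

```python
from typing import Dict, List

# Hypothetical two-level template: primary title -> secondary titles, in report order.
TEMPLATE: Dict[str, List[str]] = {
    "customer basic information": ["customer name", "registered capital", "main product condition"],
    "project evaluation rating": ["rating result"],
}


def integrate(template: Dict[str, List[str]],
              extracted: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Place each extracted item under its template title; missing items stay empty."""
    report: Dict[str, Dict[str, str]] = {}
    for primary, secondaries in template.items():
        report[primary] = {title: extracted.get(title, "") for title in secondaries}
    return report
```

Extracted items that match no template title are simply left out of the report, and template titles with no extracted value appear with an empty string.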

According to the technical scheme of this embodiment, a connection is established with the target risk early warning system to request and acquire the business risk alarm list, the business risk alarm list is used as the list of the objects to be evaluated, and structured documents are generated according to the information source documents corresponding to that list. Information extraction and integration of bank authorization documents is thereby realized, as is the extraction and integration of information related to the alarm objects.

Example four

Fig. 4 is a flowchart of a document information integration method according to a fourth embodiment of the present invention, which is further optimized based on the foregoing embodiments. It illustrates a process of obtaining a user's operation instructions and completing information integration according to the user's customized settings of the content extraction rule and the preset information integration template. Referring to fig. 4, the document information integration method provided by the present embodiment includes the following steps:

S410, obtaining a list of objects to be evaluated, and obtaining a corresponding information source document according to the list of the objects to be evaluated, wherein the information source document is an unstructured document.

S420, for each object to be evaluated in the list of the objects to be evaluated, matching a content extraction rule corresponding to each document according to the document number of each document in the information source document corresponding to the object to be evaluated, and extracting target key information according to the content extraction rule.

S430, integrating the target key information according to a preset information integration template to generate a target structured document.

S440, acquiring the modification of the user on the target key information extraction item in the content extraction rule, and extracting the target key information according to the modified content extraction rule.

In this embodiment, the user may modify the target key information extraction item in the content extraction rule (including adding an extraction item, deleting an extraction item, or updating an extraction item), and extract information according to the content extraction rule modified by the user, so as to obtain a structured document meeting the personalized requirements of the user.

Specifically, in this embodiment, each extraction item in the content extraction rule may be displayed on a first preset interface together with modify, add, and delete controls; control trigger events on the first preset interface are monitored, and the user's modification of the target key information extraction items is determined based on the monitored control trigger events.
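How a monitored control event might be turned into a change of the extraction items can be sketched as follows. The function `modify_items`, the action names, and the representation of extraction items as a mapping from item name to keyword string are assumptions of this example; the original text does not describe the event payload or the storage of extraction items.

```python
from typing import Dict, Optional


def modify_items(items: Dict[str, str], action: str, name: str,
                 keywords: Optional[str] = None) -> Dict[str, str]:
    """Apply one add, update, or delete action and return the new set of extraction items."""
    updated = dict(items)  # leave the caller's mapping untouched
    if action == "delete":
        updated.pop(name, None)
    elif action in ("add", "update"):
        if keywords is None:
            raise ValueError("keywords are required for add and update")
        updated[name] = keywords
    else:
        raise ValueError("unknown action: " + action)
    return updated
```

Extraction would then be re-run with the returned items so that the structured document reflects the user's changes.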

Of course, the user may also adjust the preset information integration template. That is, optionally, the method may further include: and acquiring the modification of the user on the document content hierarchical structure in the preset information integration template, and integrating the target key information according to the modified document content hierarchical structure.

The user can modify the document content hierarchical structure in the preset information integration template, for example, modify the hierarchy of the title, delete a certain hierarchy of the title, and the like. And integrating the extracted target key information according to the preset information integration template modified by the user to obtain a structured document meeting the personalized requirements of the user.

Specifically, the embodiment may display the preset information integration template on the second preset interface, display an adjustment control (such as moving, adding, deleting, and the like) for a document content hierarchy structure in the preset information integration template, monitor a control trigger event on the second preset interface, and determine modification of the preset information integration template by the user based on the monitored control trigger event.

According to the technical scheme of the embodiment, the modification of the user on the target key information extraction item in the content extraction rule is obtained, and the target key information is extracted according to the modified content extraction rule, so that the user can customize the concerned content in the structured document according to the requirement, and the use experience of the user is improved.

Example five

Fig. 5 is a schematic structural diagram of a document information integration apparatus according to a fifth embodiment of the present invention, which is applicable to a situation where target key information is extracted from a plurality of unstructured information source documents, and a target structured document is generated according to the extracted target key information.

As shown in fig. 5, the document information integrating apparatus includes: a source document acquisition module 510, an information extraction module 520, and an information integration module 530.

The source document obtaining module 510 is configured to obtain a list of objects to be evaluated, and obtain a corresponding information source document according to the list of objects to be evaluated, where the information source document is an unstructured document;

the information extraction module 520 is configured to match, for each to-be-evaluated object in the to-be-evaluated object list, a content extraction rule corresponding to each document according to a document number of each document in an information source document corresponding to the to-be-evaluated object, and extract target key information according to the content extraction rule;

and the information integration module 530 is configured to integrate the target key information according to a preset information integration template to generate a target structured document.

According to the technical scheme of this embodiment, a list of objects to be evaluated is obtained, and the corresponding unstructured information source documents are obtained according to that list. For each object to be evaluated in the list, the content extraction rule corresponding to each document is matched according to the document number of each document in the information source documents of that object, and target key information is extracted from the unstructured information source documents based on the content extraction rules. A target structured document is then obtained according to the preset information integration template and the extracted target key information. This achieves information extraction and integration for multi-source unstructured documents without configuring a separate integration template for each type of source document, solves the technical problem in the prior art that information extraction depends on complex template configuration, and improves the efficiency of information extraction and integration for multi-source unstructured documents. In addition, because a structured document is generated from the extracted information, the prior-art problem that the unstructured document generated after extraction is difficult to display flexibly is solved; the generated structured document facilitates analysis of the extracted information and improves the efficiency with which a user obtains information.

Optionally, the information extraction module 520 includes a rule determination unit and an information extraction unit, where the rule determination unit is configured to determine a document type corresponding to an information source document according to the document number, and match a corresponding content extraction rule according to the document type; and the information extraction unit is used for extracting target key information from the information source documents of the corresponding types based on the preset keyword information in the content extraction rule.

Optionally, the information extracting unit is specifically configured to:

acquiring a first preset keyword phrase in the preset keyword information; and matching and extracting a first paragraph taking the first start keyword as a start position and the first end keyword as an end position according to the first start keyword and the first end keyword in the first preset keyword phrase.

Optionally, the information extraction unit is further configured to obtain a second preset keyword phrase in the preset keyword information; according to a second start keyword and a second end keyword in the second preset keyword phrase, match a second paragraph that takes the second start keyword as its start position and the second end keyword as its end position; and, in the second paragraph, extract the sentences containing a first preset keyword.

Optionally, the information extraction unit is further configured to extract a table containing a second preset keyword or cell content in the table.

Optionally, the source document obtaining module 510 includes an alarm list obtaining unit, where the alarm list obtaining unit is configured to establish a connection with a target risk early warning system, request and obtain a business risk alarm list; and taking the business risk alarm list as the list of the objects to be evaluated.

Optionally, the source document obtaining module 510 further includes a credit granting document acquiring unit, where the credit granting document acquiring unit is configured to request and acquire, from a preset credit granting service system, a credit granting document of an object to be evaluated; the credit granting document comprises at least one information source document of a survey report, a preset project evaluation rating report, a preset service application book, an invitation document and a loan amount application book.

Optionally, the information integrating module 530 is specifically configured to:

matching the target key information content with the document content hierarchical structure of the preset information integration template; and integrating the target key information according to the matching result of the document content hierarchical structure to generate a target structured document.

Optionally, the apparatus further includes a rule modification module, where the rule modification module is configured to obtain a modification of a target key information extraction item in the content extraction rule by a user, and extract target key information according to the modified content extraction rule.

Optionally, the apparatus further includes a template modification module, where the template modification module is configured to obtain a modification of a document content hierarchical structure in the preset information integration template by a user, and integrate the target key information according to the modified document content hierarchical structure.

Optionally, the apparatus further includes a source document downloading module, where the source document downloading module is configured to display information source document download prompt information and a download link to a user when the user downloads the target structured document, and to download the information source document associated with the target structured document in response to the user's triggering operation on the download link.

The document information integration device provided by the embodiment of the invention can execute the document information integration method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example six

Fig. 6 is a schematic structural diagram of a computer device according to a sixth embodiment of the present invention. FIG. 6 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in FIG. 6 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention. The computer device 12 may be any terminal device with computing capability, such as an intelligent controller, a server, or a mobile phone.

As shown in FIG. 6, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, and commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the embodiments of the invention as described.

Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, computer device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in FIG. 6, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processing unit 16 executes various functional applications and data processing by running a program stored in the system memory 28, for example, to implement the document information integration method provided by the embodiment, the method including:
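As summarized in the abstract, the method acquires a list of objects to be evaluated, obtains the corresponding unstructured information source documents, matches a content extraction rule to each document by its document number, extracts target key information, and integrates it according to a preset template. The following is a minimal sketch of that flow, not the implementation of the embodiment. Every name in it (ExtractionRule, RULES, match_rule, extract_key_info, integrate), the prefix-based matching of document numbers, the "keyword: value" text layout, and the sample data are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical rule format: the patent does not specify one.
@dataclass
class ExtractionRule:
    prefix: str      # assumed: a rule is selected by document-number prefix
    keywords: list   # preset keywords whose following text is extracted

RULES = [
    ExtractionRule(prefix="FIN", keywords=["Revenue", "Net profit"]),
    ExtractionRule(prefix="LEG", keywords=["Case number"]),
]

def match_rule(doc_number):
    """Return the first rule whose prefix matches the document number."""
    for rule in RULES:
        if doc_number.startswith(rule.prefix):
            return rule
    return None

def extract_key_info(text, rule):
    """Capture the text after 'keyword:' up to the end of that line."""
    info = {}
    for kw in rule.keywords:
        m = re.search(re.escape(kw) + r"\s*:\s*(.+)", text)
        if m:
            info[kw] = m.group(1).strip()
    return info

def integrate(objects, documents, template):
    """Build one structured record per object to be evaluated.

    documents maps an object name to a list of (doc_number, text) pairs;
    template is the list of field names of the target structured document.
    """
    result = []
    for obj in objects:
        record = {field: None for field in template}
        record["object"] = obj
        for doc_number, text in documents.get(obj, []):
            rule = match_rule(doc_number)
            if rule is None:
                continue  # no rule for this document number: skip it
            for key, value in extract_key_info(text, rule).items():
                if key in record:
                    record[key] = value
        result.append(record)
    return result

# Sample data invented for this sketch.
docs = {
    "Company A": [
        ("FIN-001", "Revenue: 120\nNet profit: 15"),
        ("LEG-007", "Case number: X-42"),
    ]
}
template = ["object", "Revenue", "Net profit", "Case number"]
structured = integrate(["Company A"], docs, template)
```

In this sketch, fields for which no document supplies a value remain None, so the structured record always has the shape defined by the template.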

Example seven

The seventh embodiment provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the document information integration method provided in any embodiment of the present invention, the method including:

Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer-readable storage medium may be, for example but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It will be understood by those skilled in the art that the modules or steps of the invention described above may be implemented by a general purpose computing device, and may be centralized on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a memory device and executed by the computing device; alternatively, they may each be fabricated as a separate integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

It is to be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in some detail by the above embodiments, it is not limited to the above embodiments and may include other equivalent embodiments without departing from the spirit of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A document information integration method is characterized by comprising the following steps:

acquiring a list of objects to be evaluated, and acquiring corresponding information source documents according to the list of the objects to be evaluated, wherein the information source documents are unstructured documents;

2. The method according to claim 1, wherein the matching of content extraction rules corresponding to documents according to document numbers of the documents in the information source document corresponding to the object to be evaluated and the extraction of target key information according to the content extraction rules comprises:

3. The method according to claim 2, wherein extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule comprises:

acquiring a first preset keyword phrase in the preset keyword information;

4. The method according to claim 2, wherein the extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule further comprises:

acquiring a second preset keyword phrase in the preset keyword information;

5. The method according to claim 2, wherein the extracting target key information from the information source document of the corresponding type based on the preset keyword information in the content extraction rule further comprises:

6. The method of claim 1, wherein the obtaining the list of objects to be evaluated comprises:

7. The method according to claim 6, wherein the obtaining of the corresponding information source document according to the list of objects to be evaluated includes:

8. The method according to any one of claims 1 to 7, wherein the integrating the target key information according to a preset information integration template to generate a target structured document comprises:

9. The method of claim 1, further comprising:

10. The method of claim 9, further comprising:

11. The method of claim 1, further comprising:

and in response to a triggering operation by the user on the download link, downloading the information source document associated with the target structured document.

12. A document information integration apparatus, comprising:

13. A computer device, characterized in that the computer device comprises:

one or more processors;

a memory for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the document information integration method of any one of claims 1-11.

14. A computer-readable storage medium on which a computer program is stored, the program, when being executed by a processor, implementing the document information integration method according to any one of claims 1 to 11.

15. A computer program product comprising a computer program which, when executed by a processor, implements the document information integration method of any one of claims 1-11.