CN117649674A - Key information extraction method and device - Google Patents

Key information extraction method and device

Info

Publication number: CN117649674A
Application number: CN202311757294.8A
Authority: CN (China)
Prior art keywords: text, image, item, ocr, determining
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 殷晓婷, 杜宇宁, 刘毅, 赵乔, 胡晓光, 于佃海, 马艳军
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311757294.8A
Publication of CN117649674A


Abstract

The disclosure provides a key information extraction method and apparatus, relating to fields of artificial intelligence such as image recognition and optical character recognition (OCR). The specific implementation scheme is as follows: determining an OCR text corresponding to a first image, the first image being an image from which key information is to be extracted; determining a first text based on the OCR text and at least one query word (Query); and inputting the first text into a large model to obtain key information, the key information being the query result corresponding to the query word. The method and apparatus can improve the efficiency and accuracy of key information extraction.

Description

Key information extraction method and device
Technical Field
The present disclosure relates to the field of image processing, in particular to fields of artificial intelligence such as image recognition and optical character recognition (Optical Character Recognition, OCR) text recognition, and more particularly to a key information extraction method and apparatus.
Background
Key information extraction (Key Information Extraction, KIE) refers to extracting key information from a text or image. It is generally used as a downstream task of OCR recognition: OCR recognition is performed on the text or image to obtain its text information, and the key information is then extracted from that text information. KIE can be applied to various application scenarios, for example form recognition, ticket information extraction, and ID card information extraction, but current key information extraction methods have low efficiency, poor universality, and poor effect.
Disclosure of Invention
The disclosure provides a key information extraction method, a key information extraction device, electronic equipment and a storage medium.
According to a first aspect of the present disclosure, there is provided a key information extraction method, including:
determining an OCR text corresponding to the first image; the first image is an image of key information to be extracted;
determining a first text based on the OCR text and at least one Query word Query;
inputting the first text into a large model to obtain key information; and the key information is a query result corresponding to the query word.
According to a second aspect of the present disclosure, there is provided a key information extraction apparatus including:
the first determining module is used for determining the optical character recognition OCR text corresponding to the first image; the first image is an image of key information to be extracted;
a second determining module for determining a first text based on the OCR text and at least one Query word Query;
the input module is used for inputting the first text into the large model to obtain key information; and the key information is a query result corresponding to the query word.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the preceding first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of the preceding first aspect.
The technology of the present disclosure solves problems in the related art such as the poor effect of extracting slot value information.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of a key information extraction method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a key information extraction method according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a key information extraction method according to an embodiment of the present disclosure;
fig. 4 is a block diagram of a key information extraction apparatus according to an embodiment of the present disclosure;
fig. 5 is a block diagram of a key information extraction apparatus according to an embodiment of the present disclosure;
fig. 6 is a block diagram of a key information extraction apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a key information extraction apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing a key information extraction method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of users' personal information all comply with the relevant laws and regulations and do not violate public order and good morals.
In the related art, key information is extracted mainly in the following two ways:
first, a template-matching-based method is used: for a text image of a specific scene, the position, format, and so on of the key information are fixed, so the key information can be extracted from the fixed positions of a fixed template;
second, a deep learning model (e.g., LayoutLM, StrucTexT, etc.) is used to extract the key information.
However, in the first method, since text images in different application scenes correspond to different templates, a great deal of manual effort is required to adjust the template adapted to each text image and to customize many extraction rules, so the migration cost is high, the process is time-consuming and labor-intensive, and the efficiency is low. In the second method, a large amount of high-quality training data needs to be prepared to train the model, and the visual annotations and natural language annotations need to be aligned, so training is complex and not universal enough. Meanwhile, in practical applications, data of the specific scene often needs to be re-annotated for fine-tuning; the operation is complex, and the extraction effect cannot be guaranteed.
Based on this, the embodiments of the present disclosure provide a key information extraction method to solve the above technical problems.
Fig. 1 is a flowchart of a key information extraction method according to an embodiment of the present disclosure. As shown in fig. 1, the key information extraction method may include, but is not limited to, the following steps.
In step 101, OCR text corresponding to the first image is determined.
In some embodiments, the first image may be an image of key information to be extracted.
Optionally, the first image may include at least one item and item information corresponding to the item. Optionally, the item and item information may include, for example, at least one of:
contract number: (19731196);
sign-on location: market A;
signing date: 2021-06-23;
the seller: b Limited;
wherein the items may include the above "contract number", "sign-on location", "signing date", and "seller"; the item information corresponding to the item "contract number" may be: (19731196); the item information corresponding to the item "sign-on location" may be: market A; the item information corresponding to the item "signing date" may be: 2021-06-23; and the item information corresponding to the item "seller" may be: B Limited;
It should be understood that the above-described items and item information are merely examples of embodiments of the present disclosure, and practical applications are not limited thereto.
Alternatively, the above-described items and item information may exist in the first image in the form of text or in the form of a table. And, in some embodiments, the first image may include a text region and a table region.
In some embodiments, when determining the OCR text corresponding to the first image, the first image may be input to a first model to obtain a first output result; the first model may be used for performing layout analysis on the first image to identify the text region and the table region of the first image, and may be a layout analysis model, for example a PicoDet layout analysis model. Optionally, the first output result of the first model may be: the first image with the text region and the table region identified. After the first output result is obtained, it may be input into a second model to obtain a second output result. Optionally, the second model may be used for performing table recognition on the table region of the first image and may be a table recognition model, for example a SLANet model. Performing table recognition on the first image with the second model may include: performing structure sequence recognition on the table in the table region of the first image and performing text recognition on the text in the table; the second model may recognize the text in the table by means of OCR recognition technology. Optionally, the second output result output by the second model may include: the table structure sequence recognition result of the table region and the text recognition result of the text in the table, and the second output result may be in hypertext markup language (Hyper Text Markup Language, HTML).
And in some embodiments, after the first model outputs the first output result, the table region may be deleted from the first output result and the remainder input into a third model to obtain a third output result. The third model may be used for performing OCR recognition and may be, for example, an OCR recognition model such as a PP-OCRv4 model; the third output result output by the third model may include: the text recognition result of the text region of the first image. After the third model outputs the third output result, the third output result is combined with the second output result to obtain the OCR text corresponding to the first image. In some embodiments, when the second output result and the third output result are combined, the table structure sequence recognition result may first be deleted from the second output result, and the rest combined with the third output result to obtain the OCR text corresponding to the first image.
For example, assuming that a table is included in the first image, the text region of the first image may be, for example, the header description portion of the table, and the table region of the first image may be the region where the table is located. The text recognition result of the text region (i.e., the third output result) may be, for example: "Table 1: basic information of the certificate-of-origin application unit"; the table recognition result output by the second model (i.e., the aforementioned second output result) may be, for example: <tr><td rowspan="2">application unit</td><td>Chinese name</td><td colspan="3">B Limited Liability Company</td></tr><tr><td>English name</td><td colspan="3">B Co., Ltd</td></tr><tr><td colspan="2">legal representative</td>…. Tags such as "</td><td>" are the table structure sequence recognition result; words such as "application unit", "Chinese name", and "English name" are the text recognition result within the table; "application unit" is an item, and "B Limited Liability Company" may be the item information corresponding to "application unit".
Further, for example, the OCR text corresponding to the first image obtained by combining the second output result and the third output result may be: contract number: (19731196); B Limited Liability Company coal electronic purchase contract; buying and selling coal; contract buyer: B Limited Liability Company; sign-on location: market A; the seller: C Limited; signing date: 2021-06-23; 1. consignee name, variety specification, quantity, settlement standard; consignee name (delivery place): Plot D Phase IV.
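To make the pipeline above concrete, a minimal sketch is given below. It assumes three hypothetical model callables (`layout_model`, `table_model`, `ocr_model`) standing in for the PicoDet layout-analysis, SLANet table-recognition, and PP-OCRv4 OCR models, and the region and result formats shown are assumptions rather than interfaces defined by the disclosure.

```python
# Sketch of the OCR-text construction: layout analysis -> table recognition ->
# OCR on the non-table regions -> combine. The three model callables are
# hypothetical stand-ins; their input/output formats are assumed.
import re
import numpy as np


def build_ocr_text(first_image: np.ndarray, layout_model, table_model, ocr_model) -> str:
    # First model: layout analysis marks text regions and table regions,
    # e.g. [{"type": "table", "bbox": (x1, y1, x2, y2)}, {"type": "text", ...}].
    regions = layout_model(first_image)                        # first output result
    table_boxes = [r["bbox"] for r in regions if r["type"] == "table"]

    # Second model: table recognition on each table region, returning an HTML
    # structure sequence that also carries the recognized in-table text.
    table_htmls = [table_model(first_image, box) for box in table_boxes]   # second output result

    # Third model: delete the table regions, then run OCR on what remains.
    image_wo_tables = first_image.copy()
    for x1, y1, x2, y2 in table_boxes:
        image_wo_tables[y1:y2, x1:x2] = 255                    # blank out table region
    text_result = ocr_model(image_wo_tables)                   # third output result

    # Combine: drop the table structure tags, keep only the in-table text,
    # and concatenate it with the plain-text recognition result.
    table_texts = [re.sub(r"<[^>]+>", " ", html).strip() for html in table_htmls]
    return "; ".join([text_result] + table_texts)
```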
Optionally, in other embodiments, when determining the OCR text corresponding to the first image, a first processing may be performed on the first image to obtain a second image, where the first processing may include: deleting the table structure of the table in the first image while retaining the characters in the table; the second image is then input into the third model to obtain the OCR text corresponding to the first image.
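One conventional way to realize this first processing, offered here only as an illustration under the assumption that "table structure" means the ruling lines of the table, is to detect those lines with morphological filtering and paint them out while leaving the characters untouched:

```python
# Sketch of the "first processing": erase the table ruling lines from the first
# image while retaining the characters, producing the second image. Standard
# OpenCV morphology; the kernel sizes are illustrative assumptions.
import cv2
import numpy as np


def delete_table_structure(first_image: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(first_image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    # Long thin kernels respond only to horizontal / vertical ruling lines,
    # so character strokes are filtered out of the line masks.
    horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    horiz_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horiz_kernel)
    vert_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vert_kernel)
    line_mask = cv2.bitwise_or(horiz_lines, vert_lines)

    # Second image: the original image with the line pixels painted white,
    # i.e. the table structure deleted and the characters retained.
    second_image = first_image.copy()
    second_image[line_mask > 0] = 255
    return second_image
```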
Optionally, in some embodiments, since the first image includes at least one item and item information corresponding to the item, the OCR text determined after the first image is identified should also include at least one item and item information corresponding to the item.
In step 102, a first text is determined based on the OCR text and at least one Query term (Query).
In some embodiments, query word (Query) processing can be understood as a core module of a search service: by parsing the query words entered by the user, the user's intent is fully understood, so that downstream modules can find the results the user wants from among a large number of candidates.
Alternatively, in some embodiments, the first text may be constructed directly based on Query and OCR text.
Alternatively, in other embodiments, a second text may be determined from the OCR text based on the at least one Query, and the second text may be text related to the Query in the OCR text; and determining the second text and the at least one Query as the first text.
Specifically, in some embodiments, the OCR text may first be divided into at least one sub-text, for example in groups of N characters, where N is a positive integer and may be, for example, 100. Because the OCR text includes at least one item and the item information corresponding to the item, the divided sub-texts also include at least one item and the corresponding item information. Then, embedding vectorization is performed on the at least one sub-text to obtain a first embedding vector corresponding to each sub-text; specifically, the item and the item information in the sub-text may each be subjected to embedding vectorization to obtain at least one first embedding vector corresponding to the sub-text, and the at least one first embedding vector may be built into a vector library, for example with FAISS, an AI similarity search library for vector retrieval. Multiple index types can be defined according to requirements such as precision, search time, and memory size, for example: an index type for high-precision search, an index type balancing search precision and memory size, an index type oriented to search efficiency, and an index type chosen according to the size of the vector library to be searched. FAISS can meet the requirement of large-scale vector similarity retrieval and can effectively improve the efficiency of similarity computation between vectors.
Further, after the vector library is built, embedding vectorization may be performed on the at least one Query to obtain at least one second embedding vector; a third embedding vector is then determined from all the first embedding vectors (i.e., from the vector library), the third embedding vector being the one with the highest similarity to the second embedding vector. Optionally, in some embodiments, the third embedding vector may be the embedding vector corresponding to a first item, the first item being the item most similar to the Query content. After the third embedding vector is determined, the sub-text corresponding to it is determined as the second text, the second text including the first item, and the first text is then constructed based on the second text and the at least one Query.
By way of example, assume that the Query entered by the user includes: contract number, buyer, sign-on location, signing date; and the OCR text determined in step 101 is, for example: contract number: (19731196); B Limited Liability Company coal electronic purchase contract; buying and selling coal; contract buyer: B Limited Liability Company; sign-on location: market A; the seller: C Limited; signing date: 2021-06-23; 1. consignee name, variety specification, quantity, settlement standard; consignee name (delivery place): Plot D Phase IV. The second text determined in this step 102 may then be: contract number: (19731196); the buyer: B Limited Liability Company; sign-on location: market A; the seller: C Limited; signing date: 2021-06-23, and the first text may be a text comprising the second text and the at least one Query.
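A minimal sketch of this retrieval step follows. It assumes a hypothetical `embed` callable mapping a string to a fixed-size float32 vector (the disclosure does not name a specific embedding model), uses FAISS inner-product search over L2-normalized vectors as the similarity, and reads the selection rule as taking the union of the top-1 sub-text per Query.

```python
# Sketch of step 102: split the OCR text into N-character sub-texts, build a
# FAISS vector library from their embedding vectors, embed each Query, and keep
# the most similar sub-text(s) as the second text. `embed` is a hypothetical
# embedding callable, not one named in the disclosure.
import numpy as np
import faiss


def select_second_text(ocr_text: str, queries: list[str], embed, n: int = 100) -> str:
    # Divide the OCR text into groups of N characters (the sub-texts).
    sub_texts = [ocr_text[i:i + n] for i in range(0, len(ocr_text), n)]

    # First embedding vectors -> FAISS vector library (cosine similarity via
    # inner product over L2-normalized vectors).
    first_vecs = np.stack([embed(t) for t in sub_texts]).astype("float32")
    faiss.normalize_L2(first_vecs)
    index = faiss.IndexFlatIP(first_vecs.shape[1])
    index.add(first_vecs)

    # Second embedding vectors for the Query words; for each Query the "third"
    # embedding vector is the indexed vector with the highest similarity.
    query_vecs = np.stack([embed(q) for q in queries]).astype("float32")
    faiss.normalize_L2(query_vecs)
    _, ids = index.search(query_vecs, 1)          # top-1 sub-text per Query

    hit_ids = sorted({int(i) for i in ids[:, 0]})
    return "; ".join(sub_texts[i] for i in hit_ids)   # second text
```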
In step 103, the first text is entered into the large model, resulting in key information.
Alternatively, in some embodiments, the key information may be a query result corresponding to the query term.
In some embodiments, based on semantic analysis, the large model may determine, from the second text, the first item whose semantics are closest to the Query, and determine the item information corresponding to the first item as the key information.
For example, assume that Query includes a contract number, buyer, sign-on location, sign-on date; the second text is: contract number: (19731196); the buyer: b Limited liability company; sign-on location: market A; the seller: c Limited; signing date: 2021-06-23; the key information may be, for example: contract number: (19731196); the buyer: b Limited liability company; sign-on location: market A; the seller: c Limited; signing date: 2021-06-23.
Alternatively, in some embodiments, the large model may be, for example: large language models (Large Language Model, LLM).
It should be understood that the solution of the embodiments of the present disclosure is applicable to all key information extraction fields, for example: business licenses, motor vehicle registration certificates, driver's licenses, vehicle inspection documents, value-added tax invoices, highway toll invoices, market receipts, train tickets, air travel itineraries (flight receipts), express delivery slip numbers, express/taxi trip slips, identity cards, social security cards, bank cards, business cards, household registration booklets, marriage certificates, birth certificates, and the like.
In summary, by implementing the embodiments of the disclosure, when extracting key information, the OCR text corresponding to the first image is determined first, the first text is determined based on the Query and the OCR text, and the first text is then input into the large model to obtain the key information. No manual operation is needed in the key information extraction method of the embodiments of the disclosure, so labor cost can be greatly reduced and extraction efficiency improved. In addition, in the embodiments of the disclosure, the key information is extracted by combining OCR recognition technology with a large model, so the high precision and universality of OCR recognition are exploited and, combined with the information extraction capability of the large model, the key information in the first image can be extracted quickly and accurately. Furthermore, the second text related to the Query is screened out from the OCR text corresponding to the first image, the first text is formed based on the second text and the at least one Query, and the key information is then extracted directly from the first text rather than from the much larger OCR text, so extraction efficiency can be greatly improved. Meanwhile, the capability of the large model can be borrowed without secondary training, which effectively improves efficiency.
Fig. 2 is a flowchart of a key information extraction method according to an embodiment of the present disclosure. As shown in fig. 2, the key information extraction method may include, but is not limited to, the following steps.
In step 201, OCR text corresponding to the first image is determined.
Alternative implementations of step 201 may refer to alternative implementations of step 101 of fig. 1, and other relevant parts in the embodiment related to fig. 1, which are not described herein.
In step 202, a first text is determined based on the OCR text and at least one Query.
Alternative implementations of step 202 may refer to alternative implementations of step 102 of fig. 1, and other relevant parts of the embodiment related to fig. 1, which are not described herein.
In step 203, a text template requirement corresponding to the text content of the first text is determined.
In some embodiments, an application scene corresponding to the first text may be determined based on the text content of the first text, and then a text template requirement adapted to the application scene may be determined based on the application scene. The text template requirements corresponding to each application scene can be prestored.
Alternatively, in some embodiments, the text template requirements may include a task description and task rules to be followed when completing the task. The task description can be understood, for example, as: the description of the "key information extraction" task; and the task rules can be understood, for example, as: the rules to be observed when completing the key information extraction task in the corresponding application scenario.
As a further example, assume that the first text is: contract number: (19731196); the buyer: B Limited Liability Company; sign-on location: market A; the seller: C Limited; signing date: 2021-06-23. The application scenario corresponding to this first text is: extracting contract information. Optionally, the text template requirement corresponding to the application scenario "extracting contract information" may include, for example:
task description: the current task is to extract, from the first text, the key information corresponding to each item in a specified keyword list;
task rules: the first text is surrounded by """ symbols and contains the recognized words in order from left to right and top to bottom in the original picture. The specified keyword list is surrounded by [ ] symbols. Note that the first text may have problems such as long sentences being cut off by line breaks, unreasonable word segmentation, and text being merged erroneously, so it needs to be pieced together in combination with the context semantics to extract accurate key information. The result is returned in JSON (JavaScript Object Notation) format and contains a plurality of key-value pairs, where each key is a specified keyword and each value is the extracted result. If the value corresponding to a keyword (key) is not found in the first text, the value is assigned as "unknown". Please output only the JSON-format result and return it after JSON format verification; the output does not need to contain any other redundant characters! The following formally starts;
It should be noted that the keyword (key) mentioned in the above text template requirement may be understood as the query word mentioned in the embodiments of the disclosure, and the value may be understood as the query result corresponding to the query word.
In step 204, typesetting the first text based on the text template requirements generates typeset first text.
Alternatively, taking the text template requirement exemplified in step 203 as an example, assume that the Query input by the user includes: contract number, buyer, sign-on location, signing date; and the first text is: contract number: (19731196); the buyer: B Limited Liability Company; sign-on location: market A; the seller: C Limited; signing date: 2021-06-23. The typeset first text generated according to the text template requirement exemplified in step 203 may then be, for example: """contract number: (19731196); the buyer: B Limited Liability Company; sign-on location: market A; the seller: C Limited; signing date: 2021-06-23"""; keyword list: [contract number, buyer, sign-on location, signing date].
Alternatively, in some embodiments, the text template requirement may be determined by a Prompt generator, and the at least one query word and the first text may be typeset based on the text template requirement.
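A minimal Prompt-generator sketch for the "extracting contract information" scenario is shown below; the template wording is an illustrative paraphrase of the task description and task rules from step 203, and the scenario-to-template lookup is an assumed prestored mapping rather than the disclosure's exact implementation.

```python
# Sketch of a Prompt generator: typeset the first text according to a prestored
# text template requirement (task description + task rules) for the matched
# application scenario. The template wording here is an illustrative paraphrase.
CONTRACT_TEMPLATE = (
    "Task description: extract from the first text the key information "
    "corresponding to each item in the keyword list.\n"
    "Task rules: the first text is surrounded by \"\"\" symbols; the keyword list "
    "is surrounded by [ ] symbols; return the result in JSON format as key-value "
    "pairs, where each key is a keyword and each value is the extracted result, "
    "or \"unknown\" if it cannot be found. Output only the JSON result.\n"
    '"""{first_text}"""\n'
    "keyword list: [{keywords}]"
)

TEMPLATES_BY_SCENARIO = {"extracting contract information": CONTRACT_TEMPLATE}


def typeset_first_text(scenario: str, second_text: str, queries: list[str]) -> str:
    template = TEMPLATES_BY_SCENARIO[scenario]
    return template.format(first_text=second_text, keywords=", ".join(queries))
```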
In step 205, the typeset first text is input to the large model to obtain key information.
Alternative implementations of step 205 may refer to alternative implementations of step 103 of fig. 1, and other relevant parts of the embodiment related to fig. 1, which are not described herein.
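The final step can then be sketched as follows; `call_large_model` is a hypothetical stand-in for whichever LLM interface is used, and the fallback to "unknown" mirrors the task rules in the text template requirement.

```python
# Sketch of step 205: send the typeset first text to the large model and parse
# the returned JSON into key information. `call_large_model` is a hypothetical
# LLM interface; the disclosure does not specify a particular one.
import json


def extract_key_information(typeset_text: str, queries: list[str], call_large_model) -> dict:
    raw = call_large_model(typeset_text)          # LLM is asked to return a JSON string
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    # One query result per query word; missing keys fall back to "unknown".
    return {q: parsed.get(q, "unknown") for q in queries}
```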
It should be noted that the foregoing embodiments of fig. 1 and fig. 2 are merely exemplary illustrations of the disclosure, and other implementations are also possible. For example, in some embodiments, after the OCR text corresponding to the first image is determined, the first text may first be constructed based on the OCR text and the at least one Query; steps 203 and 204 may then be performed to typeset the first text to obtain a typeset first text; then, based on the at least one Query, a second text is determined from the typeset first text, the second text being a text related to the Query, and the second text and the at least one Query are input into the large model to obtain the key information.
In summary, by implementing the embodiments of the disclosure, before the first text is input into the large model to obtain the key information, the text template requirement corresponding to the text content of the first text is determined, the first text is typeset based on the text template requirement to generate the typeset first text, and the typeset first text is then input into the large model to obtain the key information. In this way, based on the application scenario corresponding to the text content of the first text, the first text can be automatically typeset into a text template adapted to that scenario, so different application scenarios can be accommodated; expansibility and universality are high, no manual template adjustment is needed, and efficiency and accuracy are greatly improved. Meanwhile, because the adaptive typesetting is performed before the first text is input into the large model, the large model does not need to typeset the first text itself, and there is no need to train the large model with high-quality annotation data for this purpose, which effectively avoids the problem that high-quality annotation data is difficult to obtain; extraction can be performed quickly without secondary training, greatly reducing the complexity of key information extraction. Meanwhile, since the first text can be automatically matched with a template, there is no need to train multiple models for different application scenarios, which greatly saves cost.
In addition, no manual operation is needed in the key information extraction method of the embodiments of the disclosure, so labor cost can be greatly reduced and extraction efficiency improved. The key information is extracted by combining OCR recognition technology with a large model, so the high precision and universality of OCR recognition are exploited and, combined with the information extraction capability of the large model, the key information in the first image can be extracted quickly and accurately. Furthermore, the second text related to the Query is screened out from the OCR text corresponding to the first image, the first text is formed based on the second text and the at least one Query, and the key information is then extracted directly from the first text rather than from the much larger OCR text, so extraction efficiency can be greatly improved. Meanwhile, the capability of the large model can be borrowed without secondary training, which effectively improves efficiency.
To help those skilled in the art better understand the present disclosure, a detailed description is given below with reference to fig. 3.
Fig. 3 is a flowchart of a key information extraction method provided by an embodiment of the present disclosure. As shown in fig. 3, a first image and at least one query word may be input first; layout analysis is then performed on the first image to identify its text region and table region, and text-region recognition and table-region recognition are performed on the first image to obtain the OCR text corresponding to the first image. Then, vector retrieval is performed on the OCR text and the query words to screen a second text out of the OCR text, where the second text may be the text related to the Query in the OCR text, and a first text is determined based on the second text and the at least one query word. The determined first text is then adaptively typeset according to the corresponding text template requirement to obtain a typeset first text, and the typeset first text is input into the large model to obtain the key information.
Fig. 4 is a block diagram of a key information extraction apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the key information extraction apparatus may include: a first determination module 410, a second determination module 420, and an input module 430.
Wherein, the first determining module 410 is configured to determine an optical character recognition OCR text corresponding to the first image; the first image is an image of key information to be extracted;
a second determination module 420 for determining a first text based on the OCR text and at least one Query word Query;
an input module 430, configured to input the first text to a large model to obtain key information; and the key information is a query result corresponding to the query word.
Optionally, in some embodiments, as shown in fig. 5, the second determining module 420 includes:
a first determining unit 421, configured to determine, based on at least one Query, a second text from the OCR texts, where the second text is a text related to the Query in the OCR texts;
a second determining unit 422, configured to determine the second text and at least one Query as the first text.
Optionally, in some embodiments, the first determining unit 421 is specifically configured to:
dividing the OCR text into at least one sub-text;
carrying out embedding vectorization processing on the at least one sub-text to obtain a first embedding vector corresponding to the sub-text;
carrying out embedding vectorization processing on the at least one Query to obtain at least one second embedding vector;
determining a third embedding vector from all the first embedding vectors, wherein the similarity between the third embedding vector and the second embedding vector is highest;
and determining the sub-text corresponding to the third embedding vector as the second text.
Optionally, in some embodiments, the first determining unit 421 is specifically configured to:
dividing the OCR text into at least one sub-text by taking N characters as a group, wherein N is a positive integer.
Optionally, in some embodiments, as shown in fig. 6, the apparatus may further include: a third determination module 440 and a generation module 450. Wherein,
a third determining module 440, configured to determine a text template requirement corresponding to a text content of the first text;
a generating module 450, configured to generate a typeset first text by typesetting the first text based on the text template requirement;
in some embodiments, as shown in fig. 7, the third determining module 440 may include a first determining unit 441 and a second determining unit 442; wherein,
a third determining unit 441, configured to determine an application scenario corresponding to the first text based on text content of the first text;
A fourth determining unit 442, configured to determine, based on the application scenario, a text template requirement adapted to the application scenario.
In some embodiments, the input module 430 is specifically configured to:
and inputting the typeset first text into the large model.
In some embodiments, the first determining module 410 is specifically configured to:
inputting the first image into a first model to obtain a first output result; the first model is used for carrying out layout analysis on the first image to identify a text area and a table area of the first image, and the first output result is: the first image with the text area and the table area identified;
inputting the first output result to a second model to obtain a second output result; the second model is used for carrying out table recognition on the table area of the first image, and the second output result comprises: a table structure sequence recognition result in the table area and a character recognition result in the table;
deleting the table area from the first output result and then inputting the result into a third model to obtain a third output result; wherein the third model is used for performing OCR recognition, and the third output result comprises: a text recognition result of the text region;
And combining the second output result and the third output result to obtain the OCR text corresponding to the first image.
In some embodiments, the first determining module 410 is specifically configured to:
deleting the table structure sequence identification result in the second output result, and combining with the third output result.
In some embodiments, the first determining module 410 is specifically configured to:
performing a first process on the first image to obtain a second image, wherein the first process comprises: deleting the table structure of the table in the first image, and reserving the characters in the table;
inputting the second image into a third model to obtain OCR text corresponding to the first image; wherein the third model is used for OCR recognition.
In some embodiments, the first image includes at least one item and item information corresponding to the item;
the OCR text corresponding to the first image also comprises at least one item and item information corresponding to the item;
the sub-text also includes at least one item and item information corresponding to the item.
In some embodiments, the first determining unit 421 is specifically configured to:
and carrying out embedding vectorization processing on the items and the item information in the sub-text respectively to obtain at least one first embedding vector corresponding to the sub-text.
In some embodiments, the third embedding vector is the embedding vector corresponding to a first item, the first item being the item most similar to the Query content;
the second text is a sub-text including the first item.
In some embodiments, the large model is configured to determine the first item from the second text based on the Query, and determine the item information corresponding to the first item as the key information.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
As shown in fig. 8, is a block diagram of an electronic device according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device includes: one or more processors 801, a memory 802, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories and types of memory. Likewise, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 801 is taken as an example in fig. 8.
The memory 802 is a non-transitory computer-readable storage medium provided by the present disclosure. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the key information extraction method provided by the present disclosure. The non-transitory computer-readable storage medium of the present disclosure stores computer instructions for causing a computer to perform the key information extraction method provided by the present disclosure.
The memory 802 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the key information extraction method in the embodiments of the present disclosure. The processor 801 executes various functional applications of the server and data processing, i.e., implements the key information extraction method in the above-described method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 802.
The memory 802 may include a program storage area and a data storage area; the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the electronic device for key information extraction, and the like. In addition, the memory 802 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 802 may optionally include memory located remotely from the processor 801, which may be connected to the key information extraction electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for key information extraction may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for key information extraction, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, etc. The output device 804 may include a display apparatus, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present application may be performed in parallel or sequentially or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (31)

1. A key information extraction method, comprising:
determining an OCR text corresponding to the first image; the first image is an image of key information to be extracted;
determining a first text based on the OCR text and at least one Query word Query;
inputting the first text into a large model to obtain key information; and the key information is a query result corresponding to the query word.
2. The method of claim 1, wherein the determining the first text based on the OCR text and at least one Query comprises:
Determining a second text from the OCR text based on at least one Query, wherein the second text is a text related to the Query in the OCR text;
and determining the second text and at least one Query as the first text.
3. The method of claim 2, wherein the determining a second text from the OCR text based on at least one of the Query comprises:
dividing the OCR text into at least one sub-text;
carrying out embedding vectorization processing on the at least one sub-text to obtain a first embedding vector corresponding to the sub-text;
carrying out embedding vectorization processing on the at least one Query to obtain at least one second embedding vector;
determining a third embedding vector from all the first embedding vectors, wherein the similarity between the third embedding vector and the second embedding vector is highest;
and determining the sub-text corresponding to the third embedding vector as the second text.
4. A method as recited in claim 3, wherein said dividing the OCR text into at least one sub-text comprises:
dividing the OCR text into at least one sub-text by taking N characters as a group, wherein N is a positive integer.
5. The method of any of claims 1-4, wherein prior to said entering the first text into the large model, further comprising:
determining a text template requirement corresponding to the text content of the first text;
and typesetting the first text based on the text template requirement to generate typeset first text.
6. The method of claim 5, wherein the determining a text template requirement corresponding to text content of the first text comprises:
determining an application scene corresponding to the first text based on the text content of the first text;
and determining text template requirements adapted to the application scene based on the application scene.
7. The method of claim 5, wherein the inputting the first text into a large model comprises:
and inputting the typeset first text into the large model.
8. The method of any of claims 1-4, wherein the determining OCR text corresponding to the first image comprises:
inputting the first image into a first model to obtain a first output result; the first model is used for carrying out layout analysis on the first image to identify a text area and a table area of the first image, and the first output result is: the first image with the text area and the table area identified;
Inputting the first output result to a second model to obtain a second output result; the second model is used for carrying out table recognition on the table area of the first image, and the second output result comprises: a table structure sequence recognition result in the table area and a character recognition result in the table;
deleting the table area from the first output result and then inputting the result into a third model to obtain a third output result; wherein the third model is used for performing OCR recognition, and the third output result comprises: a text recognition result of the text region;
and combining the second output result and the third output result to obtain the OCR text corresponding to the first image.
9. The method of claim 8, wherein the combining the second output result with the third output result comprises:
deleting the table structure sequence identification result in the second output result, and combining with the third output result.
10. The method of any of claims 1-4, wherein the determining OCR text corresponding to the first image comprises:
performing a first process on the first image to obtain a second image, wherein the first process comprises: deleting the table structure of the table in the first image, and reserving the characters in the table;
Inputting the second image into a third model to obtain OCR text corresponding to the first image; wherein the third model is used for OCR recognition.
11. The method of claim 3, wherein the first image includes at least one item and item information corresponding to the item;
the OCR text corresponding to the first image also comprises at least one item and item information corresponding to the item;
the sub-text also includes at least one item and item information corresponding to the item.
12. The method of claim 11, wherein the performing the embedding vectorization processing on the at least one sub-text to obtain the first embedding vector corresponding to the sub-text includes:
and carrying out embedding vectorization processing on the items and the item information in the sub-text respectively to obtain at least one first embedding vector corresponding to the sub-text.
13. The method of claim 12, wherein the third embedding vector is the embedding vector corresponding to a first item, the first item being the item most similar to the Query content;
the second text is a sub-text including the first item.
14. The method of claim 13, wherein the large model is used to determine the first item from the second text based on the Query, and determine item information corresponding to the first item as the key information.
15. A key information extraction apparatus comprising:
the first determining module is used for determining the optical character recognition OCR text corresponding to the first image; the first image is an image of key information to be extracted;
a second determining module for determining a first text based on the OCR text and at least one Query word Query;
the input module is used for inputting the first text into the large model to obtain key information; and the key information is a query result corresponding to the query word.
16. The apparatus of claim 15, wherein the second determination module comprises:
the first determining unit is used for determining a second text from the OCR texts based on at least one Query, wherein the second text is a text related to the Query in the OCR texts;
and the second determining unit is used for determining the second text and at least one Query as the first text.
17. The apparatus of claim 16, wherein the first determining unit is specifically configured to:
dividing the OCR text into at least one sub-text;
carrying out embedding vectorization processing on the at least one sub-text to obtain a first embedding vector corresponding to the sub-text;
carrying out embedding vectorization processing on the at least one Query to obtain at least one second embedding vector;
determining a third embedding vector from all the first embedding vectors, wherein the similarity between the third embedding vector and the second embedding vector is highest;
and determining the sub-text corresponding to the third embedding vector as the second text.
18. The apparatus of claim 17, wherein the first determining unit is specifically configured to:
dividing the OCR text into at least one sub-text by taking N characters as a group, wherein N is a positive integer.
19. The apparatus of any of claims 15-18, further comprising:
a third determining module, configured to determine a text template requirement corresponding to a text content of the first text;
and the generating module is used for typesetting the first text based on the text template requirement to generate typeset first text.
20. The apparatus of claim 19, wherein the third determination module comprises:
A third determining unit, configured to determine an application scenario corresponding to the first text based on text content of the first text;
and the fourth determining unit is used for determining the text template requirement which is adapted to the application scene based on the application scene.
21. The apparatus of claim 19, wherein the input module is specifically configured to:
and inputting the typeset first text into the large model.
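To make the typesetting flow of claims 19-21 concrete, the sketch below pairs a naive keyword-based scenario detector with a hypothetical template table; both the keywords and the templates are assumptions made for illustration, and a trained classifier could replace the keyword rule.

```python
# Hypothetical templates adapted to different application scenarios.
TEMPLATES = {
    "invoice": "Invoice fields:\n{body}\nAnswer the query using only the fields above.",
    "id_card": "ID card fields:\n{body}\nAnswer the query using only the fields above.",
    "default": "{body}",
}

def detect_scenario(first_text: str) -> str:
    """Guess the application scenario from the text content (claim 20)."""
    lowered = first_text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "id card" in lowered:
        return "id_card"
    return "default"

def typeset_first_text(first_text: str) -> str:
    """Typeset the first text with the template adapted to its scenario (claims 19 and 21)."""
    return TEMPLATES[detect_scenario(first_text)].format(body=first_text)
```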
22. The apparatus of any one of claims 15-18, wherein the first determining module is specifically configured to:
inputting the first image into a first model to obtain a first output result; the first model is used for carrying out layout analysis on the first image to identify a text region and a table region of the first image, and the first output result is: the first image with the text region and the table region identified;
inputting the first output result to a second model to obtain a second output result; the second model is used for carrying out table recognition on the table region of the first image, and the second output result comprises: a table structure sequence recognition result of the table region and a character recognition result of the table;
deleting the table region from the first output result and then inputting the remaining first output result into a third model to obtain a third output result; wherein the third model is used for performing OCR recognition, and the third output result comprises: a text recognition result of the text region;
and combining the second output result and the third output result to obtain the OCR text corresponding to the first image.
23. The apparatus of claim 22, wherein the first determining module is specifically configured to:
deleting the table structure sequence recognition result from the second output result, and combining the remainder with the third output result.
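The three-model pipeline of claims 22-23 can be outlined as below. `layout_model`, `table_model` and `ocr_model` stand in for whatever layout-analysis, table-recognition and OCR models are actually used; their call signatures and the box format are assumptions made for the sketch.

```python
import numpy as np

def crop(image: np.ndarray, box) -> np.ndarray:
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2].copy()

def erase(image: np.ndarray, box, fill: int = 255) -> np.ndarray:
    out = image.copy()
    x1, y1, x2, y2 = box
    out[y1:y2, x1:x2] = fill          # blank out the table region
    return out

def extract_ocr_text(first_image, layout_model, table_model, ocr_model) -> str:
    """First model: layout analysis -> text and table regions (first output result).
    Second model: table recognition on the table region (second output result).
    Third model: OCR on the image with the table region deleted (third output result)."""
    regions = layout_model(first_image)                                # {"text": box, "table": box}
    _structure_seq, table_text = table_model(crop(first_image, regions["table"]))
    body_text = ocr_model(erase(first_image, regions["table"]))
    # claim 23: drop the table structure sequence, keep only the recognized characters
    return body_text + "\n" + table_text
```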
24. The apparatus of any one of claims 15-18, wherein the first determining module is specifically configured to:
performing a first process on the first image to obtain a second image, wherein the first process comprises: deleting the table structure of the table in the first image, and retaining the characters in the table;
inputting the second image into a third model to obtain OCR text corresponding to the first image; wherein the third model is used for OCR recognition.
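One plausible realization of the "first process" in claim 24, removing the table structure while retaining the characters in the cells, is classic ruling-line removal with OpenCV; the sketch below is an implementation assumption, not something the disclosure prescribes, and the kernel sizes are arbitrary.

```python
import cv2
import numpy as np

def remove_table_structure(gray: np.ndarray) -> np.ndarray:
    """Erase horizontal and vertical table lines from a grayscale image and
    return the 'second image', which still contains all in-table characters."""
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 15, 10)
    horiz = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                             cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vert = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))
    lines = cv2.bitwise_or(horiz, vert)                 # mask of the table ruling lines
    return cv2.inpaint(gray, lines, 3, cv2.INPAINT_TELEA)
```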
25. The apparatus of claim 17, wherein the first image comprises at least one item and item information corresponding to the item;
The OCR text corresponding to the first image also comprises at least one item and item information corresponding to the item;
the sub-text also includes at least one item and item information corresponding to the item.
26. The apparatus of claim 25, wherein the first determining unit is specifically configured to:
and carrying out the embedding vectorization processing on the items and the item information in the sub-text respectively to obtain at least one first embedding vector corresponding to the sub-text.
27. The apparatus of claim 26, wherein the third embedding vector is an embedding vector corresponding to a first item, the first item being the item most similar to the Query content;
the second text is a sub-text including the first item.
28. The apparatus of claim 27, wherein the large model is configured to determine the first item from the second text based on the Query, and determine item information corresponding to the first item as the key information.
29. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.
30. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-14.
31. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method of any of claims 1-14.
CN202311757294.8A 2023-12-19 2023-12-19 Key information extraction method and device Pending CN117649674A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311757294.8A CN117649674A (en) 2023-12-19 2023-12-19 Key information extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311757294.8A CN117649674A (en) 2023-12-19 2023-12-19 Key information extraction method and device

Publications (1)

Publication Number Publication Date
CN117649674A (en) 2024-03-05

Family

ID=90043295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311757294.8A Pending CN117649674A (en) 2023-12-19 2023-12-19 Key information extraction method and device

Country Status (1)

Country Link
CN (1) CN117649674A (en)

Similar Documents

Publication Publication Date Title
CN112437917B (en) Natural language interface for databases using autonomous agents and thesaurus
JP6714024B2 (en) Automatic generation of N-grams and conceptual relationships from language input data
US10176370B2 (en) Field verification of documents
CN111428507A (en) Entity chain finger method, device, equipment and storage medium
CN110929125B (en) Search recall method, device, equipment and storage medium thereof
US10546054B1 (en) System and method for synthetic form image generation
US11361002B2 (en) Method and apparatus for recognizing entity word, and storage medium
US9483740B1 (en) Automated data classification
US11797593B2 (en) Mapping of topics within a domain based on terms associated with the topics
CN112507068A (en) Document query method and device, electronic equipment and storage medium
JP6462970B1 (en) Classification device, classification method, generation method, classification program, and generation program
CN112541359A (en) Document content identification method and device, electronic equipment and medium
US20190171872A1 (en) Semantic normalization in document digitization
US9516089B1 (en) Identifying and processing a number of features identified in a document to determine a type of the document
CN112988784B (en) Data query method, query statement generation method and device
CN113761377B (en) False information detection method and device based on attention mechanism multi-feature fusion, electronic equipment and storage medium
US20240012809A1 (en) Artificial intelligence system for translation-less similarity analysis in multi-language contexts
CN111382243A (en) Text category matching method, text category matching device and terminal
US11086600B2 (en) Back-end application code stub generation from a front-end application wireframe
CN112559718A (en) Dialogue processing method and device, electronic equipment and storage medium
CN111125445A (en) Community theme generation method and device, electronic equipment and storage medium
EP4174795A1 (en) Multiple input machine learning framework for anomaly detection
CN112989011B (en) Data query method, data query device and electronic equipment
CN117649674A (en) Key information extraction method and device
CN111708819B (en) Method, apparatus, electronic device, and storage medium for information processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination