CN115527230A - Information extraction method and device, electronic equipment and storage medium - Google Patents


Publication number
CN115527230A
Authority
CN
China
Prior art keywords
area
region
character
text
contract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110708085.9A
Other languages
Chinese (zh)
Inventor
王亚东
林茂华
高睿
苏能武
阮琳琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Caizhu eComerce Co Ltd
Original Assignee
Zhuhai Caizhu eComerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/258Heading extraction; Automatic titling; Numbering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The application relates to the technical field of information extraction, and particularly discloses an information extraction method, an information extraction device, electronic equipment and a storage medium, wherein the information extraction method comprises the following steps: determining a contract type of a contract text in an image to be extracted; acquiring a standard contract text corresponding to the contract type according to the contract type; according to the standard contract text, performing region screening on the image to be extracted to obtain at least one first region; acquiring identity information of a user; selecting at least one second area from the at least one first area, wherein the association degree of each second area in the at least one second area with the identity information is greater than a threshold value; and in the image to be extracted, information extraction is respectively carried out on the image corresponding to each second area in the at least one second area, so as to obtain at least one text message.

Description

Information extraction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of information extraction technologies, and in particular, to an information extraction method and apparatus, an electronic device, and a storage medium.
Background
At present, contracts are usually audited by recognizing the full content of the contract to obtain the contract text. However, full-text recognition takes a long time, and the recognition result contains a large amount of standard boilerplate text that is of no use to the auditor, which results in low auditing efficiency.
Disclosure of Invention
In order to solve the above problems in the prior art, embodiments of the present application provide an information extraction method, an information extraction device, an electronic device, and a storage medium, which can purposefully extract contract information with high association degree with a user for the user to audit, so that while the audit efficiency is improved, the recognition time is reduced, and the user experience is improved.
In a first aspect, an embodiment of the present application provides an information extraction method, including:
determining a contract type of a contract text in an image to be extracted;
acquiring a standard contract text corresponding to the contract type according to the contract type;
according to the standard contract text, performing region screening on an image to be extracted to obtain at least one first region;
acquiring identity information of a user;
selecting at least one second area from the at least one first area, wherein the association degree of each second area in the at least one second area with the identity information is greater than a threshold value;
and in the image to be extracted, information extraction is respectively carried out on the image corresponding to each second area in the at least one second area, so as to obtain at least one text message.
In a second aspect, an embodiment of the present application provides an information extraction apparatus, including:
the information acquisition module is used for determining the contract type of the contract text in the image to be extracted;
the matching module is used for acquiring a standard contract text corresponding to the contract type according to the contract type;
the screening module is used for carrying out region screening on the image to be extracted according to the standard contract text to obtain at least one first region;
the information acquisition module is also used for acquiring the identity information of the user;
the screening module is further used for selecting at least one second area from the at least one first area, wherein the association degree of each second area in the at least one second area with the identity information is greater than a threshold value;
and the extraction module is used for respectively extracting information of the image corresponding to each second area in the at least one second area in the image to be extracted to obtain at least one text message.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor coupled to the memory, the memory for storing a computer program, the processor for executing the computer program stored in the memory to cause the electronic device to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program causing a computer to perform the method of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method according to the first aspect.
The implementation of the embodiment of the application has the following beneficial effects:
In the embodiment of the application, the contract type of the contract text in the image to be extracted is determined, and the standard contract text corresponding to the contract type is then obtained. Next, the regions where key information is located in contract texts of this type are determined according to the standard contract text, and region screening is performed on the image to be extracted to obtain at least one first region containing key information. Finally, at least one second area with a high association degree with the user is selected from the at least one first area according to the identity information of the user, and information is extracted from the image corresponding to the at least one second area to obtain at least one text message for the user to review. Thus, before information is extracted from the contract text, the positions of the key information in the contract are determined first, the positions of the information closely related to the user are then determined from among them, and information is extracted only at those positions, which reduces the amount of information to be extracted and improves the extraction speed. Meanwhile, the extracted information is closely related to the user; in other words, it is the information that a user with this identity most needs to audit, which improves auditing efficiency and user experience.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a hardware structure of an information extraction apparatus according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of an information extraction method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of a method for determining a contract type of a contract text in an image to be extracted according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for determining an inverse document frequency for each candidate word according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a method for extracting information from an image corresponding to each of at least one second region in an image to be extracted to obtain at least one text message according to an embodiment of the present application;
FIG. 6 is a block diagram illustrating functional modules of an information extraction apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different elements and not for describing a particular sequential order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, result, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Referring to fig. 1, fig. 1 is a schematic diagram of a hardware structure of an information extraction apparatus according to an embodiment of the present disclosure. The information extraction device 100 includes at least one processor 101, a communication link 102, a memory 103, and at least one communication interface 104.
In this embodiment, the processor 101 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present application.
The communication link 102, which may include a path, carries information between the aforementioned components.
The communication interface 104 may be any transceiver-type device (e.g., an antenna) for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 103 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
In this embodiment, the memory 103 may be independent and connected to the processor 101 through the communication link 102. The memory 103 may also be integrated with the processor 101. The memory 103 provided in the embodiments of the present application is generally non-volatile. The memory 103 is used for storing computer-executable instructions for carrying out the solutions of the present application, and their execution is controlled by the processor 101. The processor 101 is configured to execute the computer-executable instructions stored in the memory 103, thereby implementing the methods provided in the embodiments of the present application described below.
In alternative embodiments, computer-executable instructions may also be referred to as application code, which is not specifically limited in this application.
In alternative embodiments, processor 101 may include one or more CPUs, such as CPU0 and CPU1 of FIG. 1.
In alternative embodiments, information extraction device 100 may include multiple processors, such as processor 101 and processor 107 in FIG. 1. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In an alternative embodiment, if the information extraction apparatus 100 is a server, the information extraction apparatus 100 may further include an output device 105 and an input device 106. The output device 105 is in communication with the processor 101 and may display information in a variety of ways. For example, the output device 105 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like. The input device 106 is in communication with the processor 101 and may receive user input in a variety of ways. For example, the input device 106 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.
The information extraction apparatus 100 may be a general-purpose device or a special-purpose device. The present embodiment does not limit the type of the information extraction device 100.
Hereinafter, the information extraction method disclosed in the present application will be explained:
referring to fig. 2, fig. 2 is a schematic flow chart of an information extraction method according to an embodiment of the present disclosure. The information extraction method comprises the following steps:
201: Determining the contract type of the contract text in the image to be extracted.
In the present embodiment, there is provided a method for determining a contract type of a contract text in an image to be extracted, as shown in fig. 3, the method including:
301: the area of the title bar of the contract text is determined in the image to be extracted.
Typically, the title bar of a contract is positioned at the top of the first page, or the first page is a cover page on which only the title appears. Therefore, in this embodiment, features of the image of the first page of the contract text can be extracted from the image to be extracted, and feature comparison can then determine whether the first page is a cover page or a body page headed by the title, which in turn determines the area of the title bar of the contract text.
302: Extracting information from the area of the title bar to obtain the title of the contract text.
In this embodiment, the title of a contract is generally printed in a large size with regular characters. Therefore, the title of the contract text can be obtained by recognizing the text in this area using a technique such as OCR (Optical Character Recognition).
303: keywords in the title of the contract text are extracted.
Typically, the title of the contract text carries key terms indicating the type of contract, such as "XXXXX labor contract", "XXXXX bidding contract", "XXXXX security contract", etc. Meanwhile, the types of keywords in the title of the contract text are generally fixed. Thus, the contract type of the contract text can be determined by extracting the keywords in the title of the contract text.
In this embodiment, the title of the contract text may first be segmented by N-gram segmentation with N = 2, 3, and 4, respectively. Specifically, N-gram segmentation splits a sentence into a sequence of overlapping segments, each composed of N adjacent characters; each segment is called an N-gram. The segmentation is called uni-gram when N = 1, bi-gram when N = 2, and tri-gram when N = 3. Illustratively, if a bi-gram is used to segment the title "construction engineering bidding contract", every pair of adjacent characters in the Chinese title becomes one segment, so the result contains real words such as "construction", "engineering", "bidding", and "contract" together with fragments that straddle two neighbouring words.
In this embodiment, the segmentation results are then filtered and cleaned. Meaningless segments, namely the fragments that straddle two words, are discarded, and segments that carry meaning, such as "construction", "engineering", "bidding", and "contract", are retained as candidate words.
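The segmentation and filtering described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names are hypothetical, the filter is assumed to be a simple lookup in a lexicon of known words, and a placeholder title of letters stands in for the Chinese characters.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n adjacent characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def candidate_words(title: str, lexicon: set[str], sizes=(2, 3, 4)) -> list[str]:
    """Segment the title with each n in sizes and keep segments found in the lexicon.

    The lexicon lookup is an assumed stand-in for the "filter and clean" step.
    """
    kept: list[str] = []
    for n in sizes:
        for segment in char_ngrams(title, n):
            if segment in lexicon and segment not in kept:
                kept.append(segment)
    return kept


# Placeholder title: each letter stands for one character of the real title.
title = "ABCDEFGHI"
lexicon = {"AB", "CD", "EF", "GH"}  # hypothetical set of meaningful words
print(char_ngrams(title, 2))
print(candidate_words(title, lexicon))
```

Segments such as "BC" or "HI" straddle two words and are dropped because they are not in the lexicon.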
Finally, the inverse document frequency of each candidate word is determined, and the candidate words whose inverse document frequency is greater than a threshold are taken as the keywords in the title of the contract text.
Generally, a keyword is a word that is important to a sentence in the current semantic environment. For contract texts, a word as common as "contract" can readily be ruled out as a keyword. Therefore, in this embodiment, the importance of each candidate word is determined by calculating its inverse document frequency, and candidate words that are common in the current semantic environment are eliminated.
Illustratively, the present embodiment provides a method for determining an inverse document frequency of each candidate word, as shown in fig. 4, the method includes:
401: Determining the number of corpora in the database that contain each candidate word to obtain a first number.
In this embodiment, the database stores analysis data for historical contract texts.
402: Determining the quotient of the total number of corpora in the database and the first number to obtain a first quotient.
403: Taking the logarithm of the first quotient as the inverse document frequency of each candidate word.
Specifically, the inverse document frequency can be represented by formula (1):
$$\mathrm{idf}(t) = \log \frac{|D|}{|\{j : t \in d_j\}|} \tag{1}$$
where $|D|$ represents the total number of corpora in the database, and $|\{j : t \in d_j\}|$ represents the number of corpora in the database that contain the candidate word $t$, namely the first number.
Meanwhile, if no corpus in the database contains the candidate word $t$, then $|\{j : t \in d_j\}|$ is 0 and the quotient in formula (1) is undefined. To avoid this, the inverse document frequency can also be represented by formula (2):
$$\mathrm{idf}(t) = \log \frac{|D|}{c + |\{j : t \in d_j\}|} \tag{2}$$
where $c$ is a constant that can be adjusted according to actual conditions. Illustratively, $c$ may be 1.
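Steps 401 to 403 and the threshold test can be sketched as below. The sketch assumes that the constant c is added to the first number in the denominator, which is the usual way to avoid division by zero. The function names, the representation of each corpus as a set of words, and the sample data are all hypothetical.

```python
import math


def inverse_document_frequency(word, corpora, c=1.0):
    """log(total corpora / (c + number of corpora containing word)).

    Placing c in the denominator is an assumption made for this sketch.
    """
    first_number = sum(1 for corpus in corpora if word in corpus)
    return math.log(len(corpora) / (c + first_number))


def select_keywords(candidates, corpora, threshold, c=1.0):
    """Keep the candidate words whose inverse document frequency exceeds the threshold."""
    return [w for w in candidates
            if inverse_document_frequency(w, corpora, c) > threshold]


# Hypothetical database of historical contract texts, one word set per corpus.
corpora = [
    {"contract", "bidding"},
    {"contract", "labor"},
    {"contract", "lease"},
    {"contract", "bidding", "construction"},
]
print(select_keywords(["contract", "bidding", "construction"], corpora, threshold=0.0))
```

Because "contract" appears in every corpus, its score falls below the threshold and it is excluded, which matches the observation that common words are not keywords.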
304: Determining the contract type of the contract text according to the keywords.
In this embodiment, after the keywords are obtained, the contract type may be determined according to the nature of the keywords. For example, continuing the "construction engineering bidding contract" example above, the keywords obtained by analysis are "construction", "engineering", and "bidding". Property analysis shows that "construction" and "engineering" denote fields, and "bidding" denotes a type. Thus, the contract can be determined to be a "bidding" type contract in the fields of "construction" and "engineering".
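One simple way to realize the property analysis is a lookup table from keyword to its nature. The table and function below are hypothetical and only illustrate the idea; the application does not specify how the nature of a keyword is determined.

```python
# Hypothetical table: each known keyword is tagged as a field or a contract type.
KEYWORD_NATURE = {
    "construction": "field",
    "engineering": "field",
    "bidding": "type",
    "labor": "type",
}


def classify_contract(keywords):
    """Split keywords into fields and contract types; unknown keywords are ignored."""
    fields = [k for k in keywords if KEYWORD_NATURE.get(k) == "field"]
    types = [k for k in keywords if KEYWORD_NATURE.get(k) == "type"]
    return {"fields": fields, "types": types}


print(classify_contract(["construction", "engineering", "bidding"]))
```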
202: Acquiring a standard contract text corresponding to the contract type according to the contract type.
In this embodiment, a standard contract text may be prepared for each contract type in advance, and the standard contract text may be stored in the database after being associated with the contract type information.
203: Performing region screening on the image to be extracted according to the standard contract text to obtain at least one first region.
In this embodiment, the standard contract text includes at least one preset region. The preset regions are divided by a professional, based on experience, when the standard contract text is collated, and each divided region is marked with a region label, for example "title region", "conversation region", "Party A rights and obligations region", "Party B rights and obligations region", or "dispute resolution region". Several regions of the same category may exist at the same time.
Thus, in this embodiment, at least one third region in the standard contract text may be determined according to a screening rule, and each of the at least one third region is a region in which key information in the standard contract text is located. For example, the "Party A rights and obligations region" records the rights that Party A has and the obligations that Party A must fulfil in this type of contract. This is key information, so the region is classified as a third region. The "conversation region" contains non-critical information such as the transition sentences and guiding sentences commonly used in such contracts, so it should not be classified as a third region.
Then, according to the selected at least one third region, the region distribution of the third regions in the standard contract text is determined. Since the standard contract text is the standard structure of the contract text in the image to be extracted, the two texts have the same spatial distribution of text structure. Therefore, after the region distribution of the at least one third region in the standard contract text is determined, regions with the same distribution can be determined, according to that distribution, as the at least one first region in the contract text in the image to be extracted. In other words, the at least one first region determined in the contract text in the image to be extracted corresponds one to one with the at least one third region determined in the standard contract text, and the two have the same region distribution.
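The mapping from third regions to first regions can be sketched as follows. The sketch assumes, purely for illustration, that each preset region of the standard contract text is stored as a box in page-relative coordinates together with a flag marking key information; the `Region` type and `first_regions` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    label: str
    # (left, top, right, bottom) as fractions of the page width and height
    box: tuple[float, float, float, float]
    is_key: bool  # True if the region holds key information (a third region)


def first_regions(template, width, height):
    """Project each key region of the template onto an image of the given pixel size."""
    result = []
    for region in template:
        if not region.is_key:
            continue  # non-key regions are screened out
        left, top, right, bottom = region.box
        pixel_box = (round(left * width), round(top * height),
                     round(right * width), round(bottom * height))
        result.append((region.label, pixel_box))
    return result


template = [
    Region("title region", (0.0, 0.0, 1.0, 0.1), False),
    Region("Party A rights and obligations region", (0.1, 0.2, 0.9, 0.5), True),
]
print(first_regions(template, width=1000, height=2000))
```

Because the boxes are relative, the same template applies to scans of any resolution, and each first region keeps the label of its third region.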
204: Acquiring identity information of the user.
In the embodiment, the camera can be called to obtain the facial image of the user, and then the identity information of the user can be determined through face recognition. In addition, the identity information of the user can also be determined through the login information of the user on the system. This is not limited by the present application.
205: at least one second area is selected among the at least one first area.
In this embodiment, the association degree of each of the at least one second area with the identity information is greater than a threshold value.
In this embodiment, the at least one second area is selected from the at least one first area, the at least one first area corresponds one to one with the at least one third area, and the at least one third area is selected from the at least one preset area. Following these selection and correspondence relationships, the area label of each third area can be determined from the area label of the corresponding preset area, and the area label of each first area can then be determined from the area label of the corresponding third area. In this way, one area label is obtained for each first area.
Based on this, in this embodiment, the association degree between the area label of each first area and the identity information of the user may be calculated, and at least one second area may then be selected from the at least one first area according to the association degree, where the association degree corresponding to each of the at least one second area is greater than the threshold.
Illustratively, if the area label carries characters identical to the user identity information, the association degree between the area label and the user identity information is 2; if the area label, after analysis, has semantics related to the user identity information, the association degree is 1; otherwise, the association degree is 0. For example, if the identity information of the user is "Party B", the area label "Party B rights and obligations area" directly carries the characters "Party B", so its association degree with the identity information of the user is 2. The label "dispute resolution area" does not carry the characters "Party B", but analysis shows that its meaning, "the area describing how disputes between Party A and Party B are resolved", is related to Party B, so its association degree is 1. The label "Party A rights and obligations area" does not carry the characters "Party B" and its meaning is unrelated to Party B, so its association degree is 0. Finally, the areas whose association degree is greater than 0 are selected as second areas.
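The scoring rule in this example can be sketched as below. The semantic analysis is not specified in the application, so it is replaced here by an assumed set of labels already known to be related to the user's identity. All names are hypothetical.

```python
def association_degree(label, identity, related_labels):
    """Score a label: 2 if it contains the identity text, 1 if semantically related, else 0.

    related_labels is an assumed stand-in for the result of semantic analysis.
    """
    if identity.lower() in label.lower():
        return 2
    if label in related_labels:
        return 1
    return 0


def select_second_areas(labels, identity, related_labels, threshold=0):
    """Keep the labels whose association degree is greater than the threshold."""
    return [label for label in labels
            if association_degree(label, identity, related_labels) > threshold]


labels = [
    "Party A rights and obligations area",
    "Party B rights and obligations area",
    "dispute resolution area",
]
print(select_second_areas(labels, "Party B", {"dispute resolution area"}))
```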
206: In the image to be extracted, performing information extraction on the image corresponding to each of the at least one second area to obtain at least one text message.
In this embodiment, a method for extracting information from an image corresponding to each of at least one second region in an image to be extracted to obtain at least one text message is provided, as shown in fig. 5, where the method includes:
501: Performing character recognition on the image corresponding to each second area to obtain a first character string.
502: Performing character segmentation on the first character string to obtain at least one first character.
In this embodiment, the image corresponding to the second area can be processed in the sequence of binarization, morphological closing, connected-component calculation, area screening, and region sorting, thereby segmenting the Chinese characters in the contract text image. In addition, a character segmentation algorithm based on a YOLO model can be adopted so that the Chinese character segmentation has higher accuracy and robustness.
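The classical part of this pipeline can be sketched in pure Python on a small grayscale grid. This is a simplified illustration under stated assumptions: the morphological closing step is omitted (real Chinese characters with disconnected strokes would need it to merge the strokes of one character), the threshold and minimum area are arbitrary, and the function name is hypothetical.

```python
from collections import deque


def segment_characters(gray, threshold=128, min_area=2):
    """Return bounding boxes (left, top, right, bottom) of dark blobs, left to right."""
    height, width = len(gray), len(gray[0])
    # Binarization: pixels darker than the threshold count as ink.
    ink = [[pixel < threshold for pixel in row] for row in gray]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not ink[y][x] or seen[y][x]:
                continue
            # Connected-component calculation: flood fill over the 8 neighbours.
            queue = deque([(y, x)])
            seen[y][x] = True
            area, left, top, right, bottom = 0, x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                area += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if ink[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Area screening: drop specks smaller than min_area pixels.
            if area >= min_area:
                boxes.append((left, top, right, bottom))
    # Region sorting: left-to-right order for a single horizontal line of text.
    return sorted(boxes)


image = [
    [255, 0, 0, 255, 255, 0, 255, 255],
    [255, 0, 0, 255, 255, 255, 255, 0],
    [255, 255, 255, 255, 255, 255, 255, 0],
]
print(segment_characters(image))
```

In the sample grid, the isolated dark pixel in the top row is treated as noise and removed by the area screening step.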
503: Inputting each of the at least one first character into a preset recognition model to obtain at least one second character in one-to-one correspondence with the at least one first character.
In this embodiment, a deep-learning framework based on a convolutional neural network (CNN) may be used to recognize each single Chinese character image, or a recognition model such as ResNet50 or VGG19 may be used, which is not limited in the present application.
504: Sorting the at least one second character according to the position, in the first character string, of the first character corresponding to each second character to obtain a second character string.
In an optional embodiment, after step 504 is performed, lexical analysis and syntactic analysis may be performed on the second character string to ensure that its wording and syntax are valid and that it is a clear, unambiguous sentence. The integrity of the second character string can also be checked to ensure that it is a complete sentence. If any check fails, the second character string may be highlighted to indicate to the user that the sentence is questionable and should be read with caution.
In an optional embodiment, if the wording or syntax of the second character string is invalid, or the sentence is incomplete, the process may further return to step 502 to perform segmentation and recognition again, so as to obtain a new second character string.
In this embodiment, the methods of lexical analysis, syntactic analysis, and integrity analysis are not limited. In other words, any method that can implement lexical analysis, syntactic analysis, and integrity analysis may be applied to the present application.
505: Taking the second character string as one text message in the at least one text message.
Therefore, after the character string of the initial recognition is subjected to character segmentation, each single character is accurately recognized again, and then a new recognition result is formed by the result of the accurate recognition, so that the accuracy of character recognition is improved, and the condition of difficulty or ambiguity in contract understanding caused by inaccurate recognition of character contents in the contract is avoided.
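Steps 503 and 504 can be sketched as follows. The recognition model is passed in as a callable, so a trained CNN could be substituted; here a dictionary lookup stands in for it. The function name and the sample data are hypothetical.

```python
def rebuild_string(segments, recognize):
    """Recognize each segmented character and join the results in original order.

    segments: list of (position in the first character string, character image)
    recognize: callable mapping one character image to one recognized character
    """
    recognized = [(position, recognize(image)) for position, image in segments]
    recognized.sort(key=lambda item: item[0])  # restore the order of the first string
    return "".join(character for _, character in recognized)


# Stand-in recognizer: a real system would call the preset recognition model here.
fake_model = {"img_a": "A", "img_b": "B", "img_c": "C"}
segments = [(2, "img_c"), (0, "img_a"), (1, "img_b")]
print(rebuild_string(segments, fake_model.__getitem__))
```

Carrying the position alongside each character image means the segments can be recognized in any order, or in parallel, without losing the reading order.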
In summary, in the information extraction method provided by the present invention, the contract type of the contract text in the image to be extracted is determined, and the standard contract text corresponding to the contract type is then obtained. Next, the regions where key information is located in contract texts of this type are determined according to the standard contract text, and region screening is performed on the image to be extracted to obtain at least one first region containing key information. Finally, at least one second area with a high association degree with the user is selected from the at least one first area according to the identity information of the user, and information is extracted from the image corresponding to the at least one second area to obtain at least one text message for the user to check. Thus, before information is extracted from the contract text, the positions of the key information in the contract are determined first, the positions of the information closely related to the user are then determined from among them, and information is extracted only at those positions, which reduces the amount of information to be extracted and improves the extraction speed. Meanwhile, the extracted information is closely related to the user; in other words, it is the information that a user with this identity most needs to audit, which improves auditing efficiency and user experience.
Referring to fig. 6, fig. 6 is a block diagram illustrating functional modules of an information extraction apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the information extraction apparatus 600 includes:
the information acquisition module 601 is configured to determine a contract type of a contract text in an image to be extracted;
the matching module 602 is configured to obtain a standard contract text corresponding to the contract type according to the contract type;
the screening module 603 is configured to perform region screening on the image to be extracted according to the standard contract text to obtain at least one first region;
the information acquisition module 601 is further configured to obtain identity information of the user;
the screening module 603 is further configured to select at least one second area from the at least one first area, where a degree of association between each second area in the at least one second area and the identity information is greater than a threshold;
the extracting module 604 is configured to perform information extraction on an image corresponding to each of the at least one second region in the image to be extracted, respectively, to obtain at least one text message.
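By way of illustration only, the cooperation of modules 601 to 604 might be sketched as follows. The class name, the `Region` type, the pixel-box format, and the four injected callables and tables are all assumptions introduced for this sketch; the specification does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (left, top, right, bottom) pixel box


@dataclass(frozen=True)
class Region:
    label: str  # region label, e.g. "rent" (hypothetical)
    box: Box    # where the region sits in the image


class InformationExtractionApparatus:
    """Illustrative wiring of modules 601-604; every collaborator is a stand-in."""

    def __init__(
        self,
        classify_contract: Callable[[Any], str],
        standard_texts: Dict[str, List[Region]],
        association: Callable[[str, str], float],
        recognize: Callable[[Any, Box], str],
        threshold: float = 0.5,
    ) -> None:
        self.classify_contract = classify_contract
        self.standard_texts = standard_texts
        self.association = association
        self.recognize = recognize
        self.threshold = threshold

    def extract(self, image: Any, identity: str) -> List[str]:
        # Module 601: determine the contract type of the contract text.
        contract_type = self.classify_contract(image)
        # Module 602: obtain the standard contract text for that type
        # (reduced here to its list of key regions).
        key_regions = self.standard_texts[contract_type]
        # Module 603: region screening yields the first regions.
        first_regions = list(key_regions)
        # Module 603 again: keep regions whose association degree with the
        # user's identity is strictly greater than the threshold.
        second_areas = [
            region
            for region in first_regions
            if self.association(region.label, identity) > self.threshold
        ]
        # Module 604: extract one text message per second area.
        return [self.recognize(image, region.box) for region in second_areas]
```

Passing the collaborators in as callables keeps the sketch independent of any particular classifier, template store, or recognition engine.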
In an embodiment of the present invention, in terms of performing region screening on an image to be extracted according to a standard contract text to obtain at least one first region, the screening module 603 is specifically configured to:
determining at least one third area in the standard contract text according to the screening rule, wherein each third area in the at least one third area is an area where key information in the standard contract text is located;
and determining at least one first region in the image to be extracted according to the region distribution of the at least one third region, wherein the region distribution of the at least one first region is the same as that of the at least one third region, and the at least one first region corresponds to the at least one third region one to one.
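One possible reading of "the same region distribution" is that each third area is stored as a fraction of the standard contract page and is scaled to the size of the image to be extracted. The following minimal sketch assumes exactly that; the fractional box format and the function name `project_regions` are hypothetical.

```python
from typing import List, Tuple

RelBox = Tuple[float, float, float, float]  # fractions of page width/height (assumed)
PixelBox = Tuple[int, int, int, int]        # (left, top, right, bottom) in pixels


def project_regions(
    third_regions: List[RelBox], width: int, height: int
) -> List[PixelBox]:
    """Map each third area onto the image, preserving order (one to one)."""
    first_regions = []
    for left, top, right, bottom in third_regions:
        first_regions.append(
            (
                round(left * width),
                round(top * height),
                round(right * width),
                round(bottom * height),
            )
        )
    return first_regions
```

Because the output list keeps the input order, the i-th first region always corresponds to the i-th third area.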
In an embodiment of the present invention, the standard contract text includes at least one preset region;
based on this, in terms of determining at least one third region in the standard contract text according to the screening rule, the screening module 603 is specifically configured to:
for each preset area in at least one preset area in the standard contract text, respectively acquiring an area label of each preset area to obtain at least one area label, wherein the at least one area label is in one-to-one correspondence with the at least one preset area;
and determining at least one third area in the at least one preset area according to the text type of the contract text and the at least one area label.
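As a hedged sketch, the screening rule could be a table that lists, for each text type, the area labels regarded as key information. The table contents, the label names, and the function `select_third_regions` below are assumptions made for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical screening rule: text type -> labels that mark key information.
SCREENING_RULES: Dict[str, Set[str]] = {
    "lease": {"parties", "rent", "term"},
    "purchase": {"parties", "price", "delivery"},
}


def select_third_regions(preset_labels: Dict[str, str], text_type: str) -> List[str]:
    """Return the ids of preset areas whose label is key for this text type.

    preset_labels maps a preset-area id to its area label (one to one).
    """
    key_labels = SCREENING_RULES.get(text_type, set())
    return [
        region_id
        for region_id, label in preset_labels.items()
        if label in key_labels
    ]
```

An unknown text type yields no third areas in this sketch; a real system might instead fall back to a default rule.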
In an embodiment of the present invention, in terms of selecting at least one second area from the at least one first area, the screening module 603 is specifically configured to:
determining the area label of each third area according to the area label of each preset area;
determining an area label of each first area according to the area label of each third area;
respectively calculating the association degree between the area label of each first area and the identity information of the user;
and selecting at least one second area from the at least one first area according to the association degree, wherein the association degree corresponding to each second area in the second areas is greater than a threshold value.
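This passage does not state how the association degree is calculated. Purely as an assumption, the sketch below uses the Jaccard overlap between a set of terms attached to each area label and a set of terms describing the user's identity; the function names and the term sets are hypothetical.

```python
from typing import Dict, List, Set


def association_degree(label_terms: Set[str], identity_terms: Set[str]) -> float:
    """Jaccard overlap, used here only as a stand-in association measure."""
    if not label_terms or not identity_terms:
        return 0.0
    shared = label_terms & identity_terms
    combined = label_terms | identity_terms
    return len(shared) / len(combined)


def select_second_regions(
    first_regions: Dict[str, Set[str]],
    identity_terms: Set[str],
    threshold: float,
) -> List[str]:
    """Keep first regions whose association degree is strictly above the threshold."""
    return [
        region_id
        for region_id, label_terms in first_regions.items()
        if association_degree(label_terms, identity_terms) > threshold
    ]
```

The comparison is strict, matching the wording "greater than a threshold".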
In an embodiment of the present invention, in terms of performing information extraction on the image corresponding to each of the at least one second region in the image to be extracted to obtain at least one text message, the extracting module 604 is specifically configured to:
performing character recognition on the image corresponding to each second area to obtain a first character string;
performing character segmentation on the first character string to obtain at least one first character;
for each first character in the at least one first character, inputting each first character into a preset recognition model respectively to obtain at least one second character, wherein the at least one second character corresponds to the at least one first character one to one;
sequencing at least one second character according to the position of a first character corresponding to each second character in the first character string to obtain a second character string;
and taking the second character string as one text message in the at least one text message.
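The segmentation and re-recognition steps above can be sketched as follows. The preset recognition model is represented by a hypothetical callable `recognize_char`, and each first character is treated as a single text character for simplicity; an actual implementation might operate on character images instead.

```python
from typing import Callable, List, Tuple


def refine_string(first_string: str, recognize_char: Callable[[str], str]) -> str:
    """Re-recognize each character and rebuild the string in the original order."""
    # Character segmentation: one first character per position.
    first_chars = list(first_string)
    # Recognize each first character individually, remembering its position.
    indexed: List[Tuple[int, str]] = [
        (position, recognize_char(char))
        for position, char in enumerate(first_chars)
    ]
    # Sort the second characters by the position of their first character.
    indexed.sort(key=lambda pair: pair[0])
    # The second character string is one text message.
    return "".join(char for _, char in indexed)
```

Keeping the position next to each result means the characters could be recognized in any order, or in parallel, without changing the second character string.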
In an embodiment of the present invention, after the at least one second character is sorted according to the position, in the first character string, of the first character corresponding to each second character to obtain the second character string, the extracting module 604 is further configured to:
determining whether the grammar of the second character string is legal;
determining whether the syntax of the second character string is legal;
determining whether the sentence of the second character string is complete;
and if the grammar of the second character string is illegal, the syntax is illegal, or the sentence is incomplete, highlighting the second character string.
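The checks themselves are not specified here, so the following sketch treats them as pluggable callables. The two example checks (balanced brackets as a syntax check, a closing punctuation mark as a completeness check) and the `>>`/`<<` highlight markers are assumptions chosen only to make the control flow concrete.

```python
from typing import Callable, Dict, List, Tuple


def brackets_balanced(text: str) -> bool:
    """Toy syntax check: every bracket is closed in the right order."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack: List[str] = []
    for char in text:
        if char in closers.values():
            stack.append(char)
        elif char in closers:
            if not stack or stack.pop() != closers[char]:
                return False
    return not stack


def ends_with_terminator(text: str) -> bool:
    """Toy completeness check: the sentence ends with closing punctuation."""
    return text.rstrip().endswith((".", "!", "?", ";"))


def review_string(
    second_string: str, checks: Dict[str, Callable[[str], bool]]
) -> Tuple[str, List[str]]:
    """Run every check; highlight the string if at least one check fails."""
    failed = [name for name, check in checks.items() if not check(second_string)]
    if failed:
        return f">>{second_string}<<", failed
    return second_string, failed
```

Returning the names of the failed checks alongside the highlighted string lets a reviewer see why a passage was flagged.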
In an embodiment of the present invention, in terms of determining a text type of a contract text in an image to be extracted, the information acquisition module 601 is specifically configured to:
determining the area of a title bar of a contract text in an image to be extracted;
extracting information from the area of the title bar to obtain a title of the contract text;
extracting key words in the title of the contract text;
and determining the text type of the contract text according to the keywords.
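As a minimal sketch of the last two steps, the title obtained from the title bar can be matched against a keyword table. The table `TYPE_KEYWORDS`, its entries, and the function name are hypothetical; locating the title bar and recognizing its text are assumed to have been done already.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keyword table: text type -> keywords that may appear in a title.
TYPE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "lease": ("lease", "rental", "tenancy"),
    "purchase": ("purchase", "sale"),
    "employment": ("employment", "labor"),
}


def text_type_from_title(title: str) -> Optional[str]:
    """Return the first text type whose keyword occurs in the title, if any."""
    lowered = title.lower()
    for text_type, keywords in TYPE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return text_type
    return None
```

Substring matching is deliberately simple; a production system would more likely tokenize the title before matching.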
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where the electronic device 700 is disposed in a first tenant system. As shown in fig. 7, the electronic device 700 includes a transceiver 701, a processor 702, and a memory 703, which are connected to each other by a bus 704. The memory 703 is used to store computer programs and data, and may transfer the stored data to the processor 702.
The processor 702 is configured to read the computer program in the memory 703 to perform the following operations:
determining a contract type of a contract text in an image to be extracted;
acquiring a standard contract text corresponding to the contract type according to the contract type;
according to the standard contract text, performing region screening on an image to be extracted to obtain at least one first region;
acquiring identity information of a user;
selecting at least one second area from the at least one first area, wherein the association degree of each second area in the at least one second area with the identity information is greater than a threshold value;
and in the image to be extracted, information extraction is respectively carried out on the image corresponding to each second area in the at least one second area, so as to obtain at least one text message.
In an embodiment of the present invention, in terms of performing region screening on an image to be extracted according to a standard contract text to obtain at least one first region, the processor 702 is specifically configured to perform the following operations:
determining at least one third area in the standard contract text according to the screening rule, wherein each third area in the at least one third area is an area where key information in the standard contract text is located;
and determining at least one first region in the image to be extracted according to the region distribution of the at least one third region, wherein the region distribution of the at least one first region is the same as that of the at least one third region, and the at least one first region corresponds to the at least one third region one to one.
In an embodiment of the present invention, the standard contract text includes at least one preset region;
based on this, in terms of determining at least one third region in the standard contract text according to the screening rule, the processor 702 is specifically configured to perform the following operations:
for each preset area in at least one preset area in the standard contract text, respectively acquiring an area label of each preset area to obtain at least one area label, wherein the at least one area label is in one-to-one correspondence with the at least one preset area;
and determining at least one third area in the at least one preset area according to the text type of the contract text and the at least one area label.
In an embodiment of the present invention, in terms of selecting at least one second area from the at least one first area, the processor 702 is specifically configured to:
determining the area label of each third area according to the area label of each preset area;
determining an area label of each first area according to the area label of each third area;
respectively calculating the association degree between the area label of each first area and the identity information of the user;
and selecting at least one second area from the at least one first area according to the association degree, wherein the association degree corresponding to each second area in the second areas is greater than a threshold value.
In an embodiment of the present invention, in terms of performing information extraction on the image corresponding to each of the at least one second region in the image to be extracted to obtain at least one text message, the processor 702 is specifically configured to perform the following operations:
performing character recognition on the image corresponding to each second area to obtain a first character string;
performing character segmentation on the first character string to obtain at least one first character;
for each first character in the at least one first character, inputting each first character into a preset recognition model respectively to obtain at least one second character, wherein the at least one second character corresponds to the at least one first character one to one;
sequencing at least one second character according to the position of a first character corresponding to each second character in the first character string to obtain a second character string;
and taking the second character string as one text message in the at least one text message.
In an embodiment of the present invention, after at least one second character is sorted according to a position of a first character corresponding to each second character in the first character string, so as to obtain a second character string, the processor 702 is further configured to perform the following operations:
determining whether the grammar of the second character string is legal;
determining whether the syntax of the second character string is legal;
determining whether the sentence of the second character string is complete;
and if the grammar of the second character string is illegal, the syntax is illegal, or the sentence is incomplete, highlighting the second character string.
In an embodiment of the present invention, in determining a text type of a contract text in an image to be extracted, the processor 702 is specifically configured to:
determining the area of a title bar of a contract text in an image to be extracted;
extracting information from the area of the title bar to obtain a title of the contract text;
extracting key words in the title of the contract text;
and determining the text type of the contract text according to the keywords.
It should be understood that the information extraction apparatus in the present application may include a smart phone (such as an Android phone, an iOS phone, or a Windows Phone), a tablet computer, a palmtop computer, a notebook computer, a Mobile Internet Device (MID), a robot, or a wearable device. The above devices are merely examples and are not exhaustive; the information extraction apparatus includes but is not limited to them. In practical applications, the information extraction apparatus may further include an intelligent vehicle-mounted terminal, computer equipment, and the like.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by combining software with a hardware platform. Based on such understanding, all or part of the technical solution of the present invention that contributes over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present invention.
Therefore, the present application embodiment also provides a computer readable storage medium, which stores a computer program, wherein the computer program is executed by a processor to implement part or all of the steps of any one of the information extraction methods as described in the above method embodiments. For example, the storage medium may include a hard disk, a floppy disk, an optical disk, a magnetic tape, a magnetic disk, a flash memory, and the like.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the information extraction methods as described in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all alternative embodiments and that the acts and modules referred to are not necessarily required by the application.
In the above embodiments, the description of each embodiment has its own emphasis, and for parts not described in detail in a certain embodiment, reference may be made to the description of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection shown or discussed between components may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical or take another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable memory, and the memory may include: a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The embodiments of the present application have been described in detail above to illustrate its principles and implementations; the description of the embodiments is provided only to help in understanding the methods of the present application and their core ideas. Meanwhile, a person skilled in the art may, following the ideas of the present application, vary the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An information extraction method, characterized in that the method comprises:
determining a contract type of a contract text in an image to be extracted;
acquiring a standard contract text corresponding to the contract type according to the contract type;
according to the standard contract text, performing region screening on the image to be extracted to obtain at least one first region;
acquiring identity information of a user;
selecting at least one second area from the at least one first area, wherein the association degree of each second area in the at least one second area with the identity information is greater than a threshold value;
and in the image to be extracted, information extraction is respectively carried out on the image corresponding to each second area in the at least one second area, so as to obtain at least one text message.
2. The method as claimed in claim 1, wherein the performing region screening on the image to be extracted according to the standard contract text to obtain at least one first region comprises:
determining at least one third area in the standard contract text according to a screening rule, wherein each third area in the at least one third area is an area where key information in the standard contract text is located;
and determining the at least one first region in the image to be extracted according to the region distribution of the at least one third region, wherein the region distribution of the at least one first region is the same as that of the at least one third region, and the at least one first region and the at least one third region are in one-to-one correspondence.
3. The method of claim 2, wherein the standard contract text includes at least one preset region, and wherein determining at least one third region in the standard contract text according to the screening rule comprises:
for each preset area in at least one preset area in the standard contract text, respectively acquiring an area label of each preset area to obtain at least one area label, wherein the at least one area label is in one-to-one correspondence with the at least one preset area;
and determining the at least one third area in the at least one preset area according to the text type of the contract text and the at least one area label.
4. The method of claim 3, wherein said selecting at least one second region among said at least one first region comprises:
determining the area label of each third area according to the area label of each preset area;
determining the area label of each first area according to the area label of each third area;
respectively calculating the association degree between the area label of each first area and the identity information of the user;
and selecting at least one second area from the at least one first area according to the association degree, wherein the association degree corresponding to each second area in the second areas is greater than a threshold value.
5. The method according to any one of claims 1 to 4, wherein the extracting information of the image corresponding to each of the at least one second region in the image to be extracted to obtain at least one text message comprises:
performing character recognition on the image corresponding to each second area to obtain a first character string;
performing character segmentation on the first character string to obtain at least one first character;
for each first character in the at least one first character, inputting each first character into a preset recognition model respectively to obtain at least one second character, wherein the at least one second character corresponds to the at least one first character one to one;
sequencing the at least one second character according to the position of the first character corresponding to each second character in the first character string to obtain a second character string;
and taking the second character string as one text message in the at least one text message.
6. The method of claim 5, wherein after said sorting said at least one second character by the position in said first string of the first character to which said each second character corresponds, resulting in a second string, said method further comprises:
determining whether the grammar of the second character string is legal;
determining whether the syntax of the second character string is legal;
determining whether the sentence of the second character string is complete;
and if the grammar of the second character string is illegal, or the syntax is illegal, or the sentence is incomplete, highlighting the second character string.
7. The method of any one of claims 1-6, wherein the determining the text type of the contract text in the image to be extracted comprises:
determining the area of a title bar of the contract text in the image to be extracted;
extracting information from the area of the title bar to obtain the title of the contract text;
extracting key words in the title of the contract text;
and determining the text type of the contract text according to the keywords.
8. An information extraction apparatus, characterized in that the apparatus comprises:
the information acquisition module is used for determining the contract type of the contract text in the image to be extracted;
the matching module is used for acquiring a standard contract text corresponding to the contract type according to the contract type;
the screening module is used for carrying out region screening on the image to be extracted according to the standard contract text to obtain at least one first region;
the information acquisition module is also used for acquiring the identity information of the user;
the screening module is further configured to select at least one second area from the at least one first area, where a degree of association between each of the at least one second area and the identity information is greater than a threshold;
and the extraction module is used for respectively extracting information of the image corresponding to each second area in the at least one second area in the image to be extracted to obtain at least one text message.
9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, the one or more programs including instructions for performing the steps in the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method according to any one of claims 1-7.
CN202110708085.9A 2021-06-24 2021-06-24 Information extraction method and device, electronic equipment and storage medium Pending CN115527230A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110708085.9A CN115527230A (en) 2021-06-24 2021-06-24 Information extraction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110708085.9A CN115527230A (en) 2021-06-24 2021-06-24 Information extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115527230A true CN115527230A (en) 2022-12-27

Family

ID=84693950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110708085.9A Pending CN115527230A (en) 2021-06-24 2021-06-24 Information extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115527230A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination