CN114495145A - Policy document number extraction method, device, equipment and storage medium - Google Patents

Policy document number extraction method, device, equipment and storage medium

Info

Publication number
CN114495145A
Authority
CN
China
Prior art keywords
policy
text
information
target
document number
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210143541.4A
Other languages
Chinese (zh)
Inventor
郑梓昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202210143541.4A priority Critical patent/CN114495145A/en
Publication of CN114495145A publication Critical patent/CN114495145A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing

Abstract

The embodiment of the application relates to the field of artificial intelligence and discloses a method, a device, equipment and a storage medium for extracting a policy document number. The method comprises: receiving a policy document number extraction instruction, determining a target policy file, and identifying a policy information area in the target policy file to obtain a policy information image; extracting a target policy information text from the policy information image; extracting policy document number information from the target policy information text to obtain at least two different policy document numbers; acquiring the text position of each policy document number in the target policy information text; extracting keywords from the text preceding each policy document number to obtain a first keyword, and extracting keywords from the text following each policy document number to obtain a second keyword; and performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position, so as to select the target policy document number.

Description

Policy document number extraction method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for extracting a policy document number.
Background
In government affairs work, the correct policy document number of a policy must be extracted for use in display, search or document filing, yet the policy document number may appear anywhere in the policy document.
At present, the policy document number is extracted manually, which is error-prone and consumes a large amount of time, manpower and material resources. How to extract the policy document number from a policy document quickly and accurately is therefore a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
The embodiment of the application mainly aims to provide a method, a device, equipment and a storage medium for extracting a policy document number, and aims to realize the quick and accurate extraction of the policy document number in a policy document.
In a first aspect, an embodiment of the present application provides a policy document number extraction method, which is applied to an electronic device, and includes:
receiving a policy document number extraction instruction, acquiring a target policy document to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy document to obtain a policy information image;
extracting policy information in the policy information image to obtain a corresponding target policy information text;
extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
acquiring the text position of each policy document number in the target policy information text;
extracting keywords from the text preceding each policy document number to obtain a first keyword, and extracting keywords from the text following each policy document number to obtain a second keyword;
performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the criticality of each policy document number;
and selecting the policy document number with the highest criticality as the target policy document number of the target policy document.
In a second aspect, an embodiment of the present application further provides a policy document number extracting apparatus, including:
the image acquisition module is used for receiving a policy document number extraction instruction, acquiring a target policy document to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy document to obtain a policy information image;
the text extraction module is used for extracting the policy information in the policy information image to obtain a corresponding target policy information text;
the document number extraction module is used for extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
the position acquisition module is used for acquiring the text position of each policy document number in the target policy information text;
the keyword module is used for extracting keywords from the text preceding each policy document number to obtain a first keyword, and extracting keywords from the text following each policy document number to obtain a second keyword;
the document number screening module is used for performing a weighted summation for each policy document number based on a first weighting coefficient of each first keyword, a second weighting coefficient of each second keyword and a third weighting coefficient of each text position to obtain the criticality of each policy document number;
and the target document number module is used for selecting the policy document number with the highest criticality as the target policy document number of the target policy document.
In a third aspect, embodiments of the present application further provide an electronic device, which includes a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for implementing connection communication between the processor and the memory, wherein when the computer program is executed by the processor, the steps of any one of the policy document number extraction methods provided in this specification are implemented.
In a fourth aspect, the present application further provides a storage medium for a computer-readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any one of the policy document number extraction methods provided in the present specification.
The embodiment of the application provides a policy document number extraction method, a device, equipment and a storage medium. The method comprises: receiving a policy document number extraction instruction, obtaining a target policy file to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy file to obtain a policy information image; extracting policy information from the policy information image to obtain a corresponding target policy information text; extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers; acquiring the text position of each policy document number in the target policy information text; extracting keywords from the text preceding each policy document number to obtain a first keyword, and extracting keywords from the text following each policy document number to obtain a second keyword; performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the criticality of each policy document number; and selecting the policy document number with the highest criticality as the target policy document number of the target policy file.
A document number extraction model is used to extract the policy document numbers in the policy file, so that all policy document numbers in the target policy file are obtained. Because the target policy document number and the non-target policy document numbers differ in their positions and in their preceding and following keywords, the text position of each policy document number in the file, the keywords preceding it and the keywords following it are obtained, and each policy document number is evaluated along these dimensions, so that the target policy document number corresponding to the target policy file can be obtained quickly and accurately from all the policy document numbers.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a policy document number extraction method according to an embodiment of the present disclosure;
Fig. 2 is a schematic view of an application scenario of a policy document number extraction method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a region layout structure of a target policy file according to an embodiment of the present application;
Fig. 4 is a schematic block diagram of a policy document number extracting apparatus according to an embodiment of the present disclosure;
Fig. 5 is a block diagram schematically illustrating a structure of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
In government affairs work, the policy document number must be extracted correctly for use in display, search or document filing, yet the policy document number may appear anywhere in the policy document.
Manual extraction is error-prone, raises operating costs, and consumes a large amount of time, manpower and material resources. How to extract the policy document number from a policy document quickly and accurately is therefore a technical problem that those skilled in the art urgently need to solve.
In order to solve the above problems, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for extracting a policy document number, where the method for extracting a policy document number is applied to an electronic device, and the electronic device may be a terminal device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, a wearable device, or a server, where the server may be an independent server or a server cluster.
Specifically, the method comprises: receiving a policy document number extraction instruction, obtaining a target policy file to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy file to obtain a policy information image; extracting policy information from the policy information image to obtain a corresponding target policy information text; extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers; acquiring the text position of each policy document number in the target policy information text; extracting keywords from the text preceding each policy document number to obtain a first keyword, and extracting keywords from the text following each policy document number to obtain a second keyword; performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the criticality of each policy document number; and selecting the policy document number with the highest criticality as the target policy document number of the target policy file.
A document number extraction model is used to extract the policy document numbers in the policy file, so that all policy document numbers in the target policy file are obtained. Because the target policy document number and the non-target policy document numbers differ in their positions and in their preceding and following keywords, the text position of each policy document number in the file, the keywords preceding it and the keywords following it are obtained, and each policy document number is evaluated along these dimensions, so that the target policy document number corresponding to the target policy file can be obtained quickly and accurately from all the policy document numbers.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flow chart of a policy document number extraction method according to an embodiment of the present application.
As shown in fig. 1, the method for extracting a policy number includes steps S1 to S7.
Step S1: receiving a policy document number extraction instruction, acquiring a target policy document to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy document to obtain a policy information image.
As shown in Figs. 2-3, the file acquiring terminal 101 has an image capturing device, such as a camera, for capturing an image of a target policy file. When a user needs the policy document number of a policy file to be identified so that the file can be archived, a to-be-identified tag is set on that policy file, for example by placing the words "to be identified" in its upper left or upper right corner.
Meanwhile, the policy information area in which the policy document number is located is marked on the policy file. The electronic device 300 then controls the file acquiring terminal 101, according to the to-be-identified tag, to acquire the corresponding policy file, and extracts policy information from the marked policy information area, such as area A shown in Fig. 3, to obtain the policy information text in that area. Policy document numbers are extracted from the policy information text, so that all policy document numbers in the policy information area are obtained.
In some embodiments, the target policy file is provided with an area tag, and the identifying a policy information area in the target policy file obtains a policy information image, including:
acquiring a policy file image of the target policy file;
identifying a policy information area in the policy file image according to the area tag, and segmenting the policy information area from the policy file image;
and preprocessing the image of the policy information area to obtain a policy information image.
Illustratively, when the target policy file is a non-image file, for example a paper file or a file in PDF format, a policy file image of the target policy file must first be obtained through image conversion or image capture. A corresponding area tag is set on the target policy file so that the electronic device can identify the policy information area in the policy file image based on that tag. For example, the user frame-selects the policy information area on the target policy file with a pen. When the electronic device 300 receives the policy document number extraction instruction sent by the terminal device 102, it controls the file acquiring terminal 101 to capture the policy file image of the target policy file. The policy information area in the policy file image, that is, the image corresponding to area A in Fig. 3, is then obtained by identifying the frame-selected area in the policy file image. Image preprocessing is performed on the acquired policy information area to obtain the policy information image, where the image preprocessing comprises at least one of image noise reduction, image brightness enhancement, image color enhancement, and the like.
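Illustratively, the segmentation of the policy information area can be sketched as a simple crop. The sketch below is a minimal illustration under stated assumptions: the image is modeled as nested lists of gray values, and the bounding box of the frame-selected area is assumed to be already known, since the embodiment does not specify how the frame is detected. The names Image and crop_region are hypothetical.

```python
from typing import List, Tuple

# Hypothetical image model: rows of gray values, 0 (black) to 255 (white).
Image = List[List[int]]


def crop_region(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Cut the rectangle (top, left, bottom, right) out of the image.

    bottom and right are exclusive. The box is assumed to come from an
    earlier step that located the frame-selected area.
    """
    top, left, bottom, right = box
    if not (0 <= top < bottom <= len(image)):
        raise ValueError("row bounds fall outside the image")
    if not (0 <= left < right <= len(image[0])):
        raise ValueError("column bounds fall outside the image")
    return [row[left:right] for row in image[top:bottom]]


# A 5x6 white page with one dark pixel inside the marked area.
page = [[255] * 6 for _ in range(5)]
page[1][2] = 0
region_a = crop_region(page, (1, 1, 3, 5))
print(region_a)
```

In practice an image library would perform the same slicing on an array; nested lists are used here only to keep the sketch self-contained.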
In some embodiments, the policy information area includes a text area and a non-text area, and the image preprocessing is performed on the policy information area to obtain a policy information image, including:
acquiring the gray value of each pixel point in the policy information area;
determining target brightness levels of the text area and the non-text area according to the gray value of each pixel point;
and carrying out color enhancement processing on the text area and the non-text area according to the target brightness level to obtain the policy information image.
Illustratively, the gray value measures how light or dark a pixel appears in a black-and-white image. A grayscale image is displayed in shades from the darkest black to the brightest white, and the range between black and white is divided logarithmically into a number of levels called gray levels. Gray values typically range from 0 to 255, where 0 is black and 255 is white. The gray value of each pixel in the policy information area is obtained, and the maximum gray value and the average gray value are determined from the obtained values.
Generally, the policy information area comprises a text area and a non-text area, where the text area is the area containing the text to be recognized. The maximum gray value and the average gray value together characterize the current brightness levels of the text area and the non-text area from different dimensions. The target brightness levels of the text area and the non-text area are then determined from their current brightness levels, and the brightness of the two areas in the policy information area is adjusted to those target levels. This avoids the loss of text recognition accuracy caused by overexposure of the text area, or by the exposure of the non-text area interfering with the text area.
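Illustratively, the gray-value statistics and the brightness adjustment can be sketched as follows. The embodiment does not state how the target brightness level is derived from the maximum and average gray values, so the sketch simply takes a target average as a parameter and scales the area linearly toward it; that scaling rule and the function names are assumptions made for illustration only.

```python
from statistics import mean
from typing import List, Tuple

Image = List[List[int]]  # rows of gray values, 0 (black) to 255 (white)


def gray_stats(area: Image) -> Tuple[int, float]:
    """Return the maximum and the average gray value of an area."""
    pixels = [p for row in area for p in row]
    return max(pixels), mean(pixels)


def adjust_brightness(area: Image, target_mean: float) -> Image:
    """Scale gray values so that the average moves toward target_mean.

    Linear gain with clipping at 255 is an assumed rule, not one taken
    from the embodiment.
    """
    _, current_mean = gray_stats(area)
    if current_mean == 0:
        return [row[:] for row in area]  # an all-black area is left unchanged
    gain = target_mean / current_mean
    return [[min(255, round(p * gain)) for p in row] for row in area]


text_area = [[100, 200], [50, 50]]
print(gray_stats(text_area))
print(adjust_brightness(text_area, 150))
```

The text area and the non-text area would each be passed through adjust_brightness with their own target level.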
It is understood that in some embodiments, the file acquisition terminal 101 may be integrated within the electronic device 300. In some embodiments, the user may issue the policy document number extraction instruction to the electronic device 300 through an input device such as a mouse, a keyboard, a touch panel, etc. communicatively connected to the electronic device 300, without issuing the policy document number extraction instruction through the terminal device 102.
Step S2: extracting the policy information in the policy information image to obtain a corresponding target policy information text.
The policy information is written in at least two kinds of characters, such as Chinese, English and Arabic numerals. An OCR recognition model is used to recognize the policy information in the policy information image, so that the target policy information text corresponding to the policy information in the policy information image is obtained.
In some embodiments, the policy information includes first language information and second language information, and the extracting policy information in the policy information image to obtain a corresponding target policy information text includes:
inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
and sequencing the first policy information text and the second policy information text according to the first text position and the second text position to obtain a target policy information text.
Illustratively, policy information typically includes Chinese, English, numeric, etc., and thus, non-Chinese text includes at least English and numeric. The Character Recognition model is an OCR (Optical Character Recognition) model, and policy information in the policy information image is obtained by the OCR model.
The OCR model comprises a first OCR model and a second OCR model, wherein the first OCR model is a Chinese information recognition model and is used for recognizing Chinese texts in the policy information image and first text positions of the Chinese texts in the policy information image, and the second OCR model is a non-Chinese information recognition model and is used for recognizing non-Chinese texts in the policy information image and second text positions of the non-Chinese texts in the policy information image. And sequencing the first policy information text and the second policy information text by using the first text position and the second text position so as to obtain a target policy information text. Text recognition accuracy can be improved by performing text recognition on the policy information image through two different OCR models.
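Illustratively, the ordering of the two recognition results by text position can be sketched as below. The OCR models themselves are not shown; their outputs are assumed to arrive as spans carrying a row and a column coordinate, and the TextSpan type and merge_by_position function are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass(frozen=True)
class TextSpan:
    """One recognized fragment and its assumed position in the image."""
    text: str
    row: int  # line index of the fragment
    col: int  # horizontal offset of the fragment within its line


def merge_by_position(first: List[TextSpan], second: List[TextSpan]) -> str:
    """Interleave both result lists in reading order, one output line per row."""
    ordered = sorted([*first, *second], key=lambda s: (s.row, s.col))
    lines = [
        "".join(span.text for span in spans)
        for _, spans in groupby(ordered, key=lambda s: s.row)
    ]
    return "\n".join(lines)


# Made-up outputs of the Chinese and the non-Chinese recognition models.
chinese = [TextSpan("国发", 0, 0), TextSpan("号", 0, 8), TextSpan("关于印发规划的通知", 1, 0)]
non_chinese = [TextSpan("〔2021〕12", 0, 2)]
print(merge_by_position(chinese, non_chinese))
```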
Step S3: extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers.
The target policy file contains a target policy document number and non-target policy document numbers, where a non-target policy document number is the number of another document cited by the target policy file. Extracting the target policy document number from the target policy file is therefore the key task.
Illustratively, the document number extraction model is described here as an NLP (natural language processing) document number extraction model. Policy document number information is extracted from the target policy information text by the NLP document number extraction model to obtain all policy document numbers in the target policy information text, which form the policy document number set of the target policy file. The set comprises at least two different policy document numbers, that is, the target policy document number and at least one non-target policy document number. The NLP document number extraction model is obtained by training on policy document number data.
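Illustratively, the input and output of this step can be sketched with a regular expression standing in for the trained NLP document number extraction model. This is a deliberately simplified substitute, not the model of the embodiment: the pattern assumes document numbers of the form issuing-body abbreviation, bracketed year, serial number and "号", and it can absorb unrelated Chinese characters that directly precede a number. The sample text is made up.

```python
import re
from typing import List

# Assumed shape: 2-8 Chinese characters, a bracketed 4-digit year, a serial number, then "号".
DOC_NUMBER = re.compile(r"[\u4e00-\u9fff]{2,8}[〔\[（(]\d{4}[〕\]）)]\d+号")


def extract_document_numbers(text: str) -> List[str]:
    """Return the distinct document numbers in order of first appearance."""
    found: List[str] = []
    for match in DOC_NUMBER.finditer(text):
        if match.group() not in found:
            found.append(match.group())
    return found


sample = "国发〔2021〕12号\n关于印发规划的通知\n根据《指导意见》（国办发〔2020〕5号）制定本规划。"
print(extract_document_numbers(sample))
```

Here the first number plays the role of the target policy document number and the second is a cited, non-target number.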
Step S4: acquiring the text position of each policy document number in the target policy information text.
Different policy document numbers may appear at different positions in the text. The position of each policy document number in the target policy information text is therefore identified and used to help judge whether the current policy document number is the target policy document number or a non-target policy document number, which improves the accuracy of that judgment.
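Illustratively, a text position can be expressed as a line index and a column offset. The embodiment does not fix the representation of the text position, so this pair and the function name locate are assumptions made for the sketch.

```python
from typing import Tuple


def locate(text: str, number: str) -> Tuple[int, int]:
    """Return (line index, column) of the first occurrence of number in text."""
    offset = text.find(number)
    if offset < 0:
        raise ValueError("document number not found in text")
    line = text.count("\n", 0, offset)           # newlines before the match
    line_start = text.rfind("\n", 0, offset) + 1  # 0 when the match is on the first line
    return line, offset - line_start


sample = "国发〔2021〕12号\n关于印发规划的通知\n根据《指导意见》（国办发〔2020〕5号）制定本规划。"
print(locate(sample, "国发〔2021〕12号"))
print(locate(sample, "国办发〔2020〕5号"))
```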
Step S5: extracting keywords from the text preceding each policy document number to obtain a first keyword, and extracting keywords from the text following each policy document number to obtain a second keyword.
The keywords in the text preceding and following different policy document numbers may differ, and they differ markedly between the target policy document number and a non-target policy document number, because a non-target policy document number is usually a document cited by the target policy file. The keywords in the preceding and following text of each policy document number are therefore extracted and identified to help judge whether the current policy document number is the target policy document number or a non-target policy document number.
In some embodiments, the extracting keywords from the text preceding each policy document number to obtain a first keyword includes:
confirming the row coordinate of each policy document number in the policy information image;
acquiring, according to the row coordinate, first character information consisting of a first preset number of characters in the text preceding each policy document number;
and comparing the first character information with a preset first word bank to obtain a first keyword matched with words in the first word bank.
In some embodiments, the extracting keywords from the text following each policy document number to obtain a second keyword includes:
confirming the row coordinate of each policy document number in the policy information image;
acquiring, according to the row coordinate, second character information consisting of a second preset number of characters in the text following each policy document number;
and comparing the second character information with a preset second word bank to obtain a second keyword matched with words in the second word bank.
Illustratively, after all policy document numbers in the target policy file are acquired, the target policy document number must be identified among them. To do so, keywords are extracted from the text preceding and following each policy document number, and these keywords help judge whether the current policy document number is the target policy document number.
Specifically, the row coordinate of each policy document number in the policy information image is confirmed, a first preset number of characters in the text preceding the policy document number is obtained according to the row coordinate, and this character information is compared with a preset first word bank to obtain a first keyword matched with a word in the first word bank. In this way the first keyword corresponding to each policy document number is obtained.
Likewise, the row coordinate of each policy document number in the policy information image is confirmed, a second preset number of characters in the text following the policy document number is obtained according to the row coordinate, and this character information is compared with a preset second word bank to obtain a second keyword matched with a word in the second word bank. In this way the second keyword corresponding to each policy document number is obtained. The first preset number and the second preset number may be set as required; for example, both may be a string length of 10 to 15 Chinese characters, or a string length of 20 to 25 characters for combinations of English letters and digits.
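Illustratively, the window extraction and the comparison with a word bank can be sketched as below for a single line of text. The word banks shown are hypothetical examples, not taken from the embodiment, and a window of 12 characters is chosen from within the 10 to 15 range mentioned above.

```python
from typing import List


def context_keywords(line: str, number: str, word_bank: List[str],
                     window: int, preceding: bool) -> List[str]:
    """Return the word-bank entries found in the window before or after number."""
    start = line.index(number)  # raises ValueError if number is not in line
    if preceding:
        context = line[max(0, start - window):start]
    else:
        end = start + len(number)
        context = line[end:end + window]
    return [word for word in word_bank if word in context]


# Hypothetical word banks of citation cues.
FIRST_WORD_BANK = ["根据", "依据", "按照"]
SECOND_WORD_BANK = ["制定", "精神", "要求"]

line = "根据《指导意见》（国办发〔2020〕5号）制定本规划。"
number = "国办发〔2020〕5号"
print(context_keywords(line, number, FIRST_WORD_BANK, 12, preceding=True))
print(context_keywords(line, number, SECOND_WORD_BANK, 12, preceding=False))
```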
Step S6: performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the criticality of each policy document number.
Illustratively, keywords in the preceding text and in the following text contribute differently to judging whether a policy document number is the target policy document number, and so do the different text positions at which a policy document number can appear in the policy information area.
Therefore, the correspondence between first keywords and first weighting coefficients, between second keywords and second weighting coefficients, and between text positions and third weighting coefficients is preset. When the target policy document number is selected from all policy document numbers in the target policy file, the first weighting coefficient of the first keyword, the second weighting coefficient of the second keyword and the third weighting coefficient of the text position corresponding to each policy document number are obtained and summed to give the criticality of that policy document number. The criticality indicates how likely the current policy document number is to be the target policy document number.
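Illustratively, the criticality can be sketched as the sum of the coefficients looked up in the three preset correspondences. The coefficient values below are invented for illustration: citation cues are given negative coefficients and the first line a positive one, which is one plausible choice and not a value stated in the embodiment. Keywords and positions without a preset coefficient are assumed to contribute zero.

```python
from typing import Dict, List


def criticality(first_keywords: List[str], second_keywords: List[str], line_index: int,
                first_coeff: Dict[str, float], second_coeff: Dict[str, float],
                position_coeff: Dict[int, float]) -> float:
    """Sum the preset coefficients of the keywords and of the text position."""
    score = sum(first_coeff.get(k, 0.0) for k in first_keywords)
    score += sum(second_coeff.get(k, 0.0) for k in second_keywords)
    score += position_coeff.get(line_index, 0.0)
    return score


# Hypothetical preset correspondences.
FIRST_COEFF = {"根据": -1.0}
SECOND_COEFF = {"制定": -0.5}
POSITION_COEFF = {0: 2.0}

title_score = criticality([], [], 0, FIRST_COEFF, SECOND_COEFF, POSITION_COEFF)
cited_score = criticality(["根据"], ["制定"], 2, FIRST_COEFF, SECOND_COEFF, POSITION_COEFF)
print(title_score, cited_score)
```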
Step S7: selecting the policy document number with the highest criticality as the target policy document number of the target policy file.
After the criticality of every policy document number in the target policy file is obtained, the policy document numbers are sorted in ascending or descending order of criticality, and the one with the highest criticality is selected as the target policy document number of the target policy file.
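The selection step can be sketched as below, assuming the criticality values are held in a dictionary keyed by policy document number (an assumption made for illustration). Sorting gives the full ranking; taking the maximum directly yields the same winner.

```python
def rank_by_criticality(scores):
    """Return (document number, criticality) pairs in descending order."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_target(scores):
    """Return the document number with the highest criticality."""
    if not scores:
        raise ValueError("no policy document numbers to choose from")
    return max(scores, key=scores.get)


# Hypothetical criticality values for two candidate document numbers.
scores = {"No. 12 [2021]": 0.375, "No. 3 [2022]": 1.0}
ranking = rank_by_criticality(scores)
target = select_target(scores)
```

When two candidates tie, `max` keeps the one inserted first; the text does not say how ties should be broken.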
In some embodiments, after step S7, the method further comprises:
sending the target policy file and the target policy document number to a user terminal;
and receiving response information of the user terminal, and filing and storing the target policy file by taking the target policy document number as a filing number according to the response information.
Illustratively, the response message is a message which is triggered to be sent by the user terminal after the user confirms that the target policy file is matched with the target policy file number.
After the target policy document number corresponding to the target policy file is obtained, the target policy file and the target policy document number are sent to a preset user terminal. Through the user terminal, the user receives the target policy document number and the target policy file sent by the electronic device and confirms whether the extracted target policy document number is the policy document number corresponding to the target policy file. When the user confirms that it is, the user triggers the user terminal, through an input device such as a mouse, a keyboard, or a touch panel, to send response information, so that the electronic device files and stores the target policy file; when filing, the target policy document number can be used as the filing number. The file may be archived on a hard disk in the background of the electronic device, or sent to a corresponding cloud server for storage, which is not limited herein.
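The confirm-then-file flow can be sketched as follows for the local hard disk case. The function name, the `.pdf` suffix, and the rule for turning a document number into a safe file name are assumptions made for illustration; the text only states that the document number serves as the filing number.

```python
import re
from pathlib import Path


def archive_if_confirmed(file_bytes, doc_number, confirmed, archive_dir, suffix=".pdf"):
    """Store the policy file under its document number once the user confirms.

    Returns the path written, or None when no confirmation was received.
    """
    if not confirmed:
        return None
    # Replace whitespace and characters that are unsafe in file names.
    safe_name = re.sub(r'[\\/:*?"<>|\s]+', "_", doc_number).strip("_")
    target = Path(archive_dir) / f"{safe_name}{suffix}"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(file_bytes)
    return target
```

Sending the file to a cloud server instead would replace the `write_bytes` call with an upload; that variant is not shown here.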
Referring to fig. 4, fig. 4 is a schematic block diagram illustrating a structure of a policy document number extracting apparatus according to an embodiment of the present application.
As shown in fig. 4, the policy document number extracting apparatus 200 may be applied to an electronic device, and the policy document number extracting apparatus 200 includes an image capturing module 201, a text extracting module 202, a document number extracting module 203, a location obtaining module 204, a keyword module 205, a document number screening module 206, and a target document number module 207.
The image capturing module 201 is configured to receive a policy document number extraction instruction, acquire a target policy file to be subjected to policy document number extraction according to the policy document number extraction instruction, and identify a policy information area in the target policy file to obtain a policy information image;
the text extraction module 202 is configured to extract policy information in the policy information image to obtain a corresponding target policy information text;
the document number extracting module 203 is configured to extract policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
the location obtaining module 204 is configured to obtain a text position of each policy document number in the target policy information text;
the keyword module 205 is configured to perform keyword extraction on the preamble of each policy document number to obtain a first keyword, and perform keyword extraction on the following text of each policy document number to obtain a second keyword;
the document number screening module 206 is configured to perform weighted summation for each policy document number based on a first weighting coefficient of each first keyword, a second weighting coefficient of each second keyword, and a third weighting coefficient of each text position, so as to obtain the criticality of each policy document number;
and the target document number module 207 is configured to select the policy document number with the highest criticality as the target policy document number of the target policy file.
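One way to picture how the seven modules cooperate is a small pipeline object whose fields stand in for modules 201 to 206, with the final selection playing the role of module 207. This is only a structural sketch: the class name, field names, and callable signatures are hypothetical and each stage is left as a pluggable function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DocNumberExtractor:
    acquire_image: Callable     # stands in for module 201
    extract_text: Callable      # stands in for module 202
    extract_numbers: Callable   # stands in for module 203
    locate: Callable            # stands in for module 204
    extract_keywords: Callable  # stands in for module 205
    score: Callable             # stands in for module 206

    def run(self, policy_file):
        image = self.acquire_image(policy_file)
        text = self.extract_text(image)
        numbers = self.extract_numbers(text)
        if not numbers:
            raise ValueError("no policy document numbers found")
        scores = {}
        for number in numbers:
            position = self.locate(text, number)
            first, second = self.extract_keywords(text, number)
            scores[number] = self.score(first, second, position)
        # Role of module 207: keep the number with the highest criticality.
        return max(scores, key=scores.get)
```

Each stage can then be swapped independently, for example replacing the text extraction callable without touching the scoring logic.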
In some embodiments, the policy document number extracting apparatus 200 further includes a document filing module for sending the target policy document and the target policy document number to the user terminal; and receiving response information of the user terminal, and filing and storing the target policy file by taking the target policy document number as a label according to the response information.
In some embodiments, the response message is a message triggered to be sent by the user terminal after the user confirms that the target policy document matches the target policy document number.
In some embodiments, the target policy file is provided with an area tag, and the image capturing module 201, when identifying the policy information area in the target policy file and obtaining the policy information image, includes:
acquiring a policy file image of the target policy file;
identifying a policy information area in the policy file image according to the area tag, and segmenting the policy information area from the policy file image;
and preprocessing the image of the policy information area to obtain a policy information image.
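As a rough sketch of the segmentation step, assume the area tag resolves to a bounding box `(top, left, bottom, right)` in pixel coordinates and that the image is a list of pixel rows; both are assumptions, since the text does not specify how the tag encodes the region.

```python
def crop_region(image, box):
    """Cut the rectangle `box` = (top, left, bottom, right) out of `image`.

    `image` is a list of rows of pixel values; bottom and right are exclusive.
    """
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Hypothetical 3x4 image whose pixel value encodes its row and column.
page = [[r * 10 + c for c in range(4)] for r in range(3)]
region = crop_region(page, (1, 1, 3, 3))
```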
In some embodiments, the policy information area includes a text area and a non-text area, and the image capturing module 201, when performing image preprocessing on the policy information area to obtain a policy information image, includes:
acquiring the gray value of each pixel point in the policy information area;
determining target brightness levels of the text area and the non-text area according to the gray value of each pixel point;
and carrying out color enhancement processing on the text area and the non-text area according to the target brightness level to obtain the policy information image.
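The preprocessing steps above leave the concrete algorithm open. One simple possibility, shown purely as an assumption, is to split pixels at the mean gray value, treat the darker group as text and the lighter group as background, and push each group to a single target brightness level (the darkest and brightest values found). It presumes dark text on a light background.

```python
def enhance(gray):
    """Map text pixels and non-text pixels to their target brightness levels.

    `gray` is a list of rows of gray values (0 = black, 255 = white).
    """
    pixels = [value for row in gray for value in row]
    if not pixels:
        return []
    threshold = sum(pixels) / len(pixels)  # mean gray value separates the two areas
    text_level = min(pixels)               # target level for the text area
    background_level = max(pixels)         # target level for the non-text area
    return [
        [text_level if value < threshold else background_level for value in row]
        for row in gray
    ]


# Hypothetical gray values: low numbers are text strokes, high numbers are paper.
sample = [[200, 90, 210], [80, 220, 100]]
enhanced = enhance(sample)
```

A production system would more likely use an adaptive threshold per region; the fixed mean is used here only to keep the sketch short.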
In some embodiments, the policy information includes first language information and second language information, and the text extraction module 202, when extracting the policy information in the policy information image to obtain a corresponding target policy information text, includes:
inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
and sequencing the first policy information text and the second policy information text according to the first text position and the second text position to obtain a target policy information text.
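Merging the outputs of the two character recognition models can be sketched as a sort on text position. The segment layout `(row, column, text)` and the function name are assumptions introduced for illustration; real recognition models typically return bounding boxes that would first be reduced to such coordinates.

```python
def merge_by_position(first_segments, second_segments):
    """Interleave two lists of (row, column, text) segments in reading order."""
    ordered = sorted(
        list(first_segments) + list(second_segments),
        key=lambda segment: (segment[0], segment[1]),
    )
    return "".join(text for _, _, text in ordered)


# Hypothetical segments from the first and second recognition models.
first_language = [(0, 0, "AB"), (1, 0, "EF")]
second_language = [(0, 2, "cd"), (1, 2, "gh")]
merged = merge_by_position(first_language, second_language)
```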
In some embodiments, the keyword module 205, in extracting keywords from the preamble of each policy document number to obtain the first keyword, includes:
confirming the line coordinate of each policy document number in the policy information image;
acquiring first character information of a first preset number of characters from the text preceding each policy document number according to the line coordinate;
and comparing the first character information with a preset first word bank to obtain a first keyword matched with words in the first word bank.
In some embodiments, the keyword module 205, in extracting keywords from the following text of each policy document number to obtain the second keyword, includes:
confirming the line coordinate of each policy document number in the policy information image;
acquiring second character information of a second preset number of characters from the text following each policy document number according to the line coordinate;
and comparing the second character information with a preset second word bank to obtain a second keyword matched with the words in the second word bank.
Referring to fig. 5, fig. 5 is a schematic block diagram of a structure of an electronic device according to an embodiment of the present disclosure.
As shown in fig. 5, the electronic device 300 comprises a processor 301 and a memory 302, the processor 301 and the memory 302 being connected by a bus 303, such as an I2C (Inter-integrated Circuit) bus.
In particular, the processor 301 is used to provide computing and control capabilities, supporting the operation of the entire electronic device. The Processor 301 may be a Central Processing Unit (CPU), and the Processor 301 may also be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Specifically, the memory 302 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB flash drive, or a removable hard disk.
Those skilled in the art will appreciate that the structure shown in fig. 5 is a block diagram of only the part of the structure related to the embodiments of the present application, and does not limit the electronic device to which the embodiments of the present application are applied; in particular, the electronic device may include more or fewer components than those shown in the drawings, combine some components, or have a different arrangement of components.
The processor 301 is configured to run a computer program stored in the memory, and when executing the computer program, implement any one of the policy document number extraction methods provided in the embodiments of the present application.
In some embodiments, the processor 301 is configured to run a computer program stored in the memory and to implement the following steps when executing the computer program:
receiving a policy document number extraction instruction, acquiring a target policy document to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy document to obtain a policy information image;
extracting policy information in the policy information image to obtain a corresponding target policy information text;
extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
acquiring the text position of each policy document number in the policy information text;
extracting keywords in the front part of each policy document number to obtain a first keyword, and extracting keywords in the rear part of each policy document number to obtain a second keyword;
weighting and summing the policy document numbers based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the key degree of each policy document number;
and selecting the policy document number with the highest criticality as the target policy document number of the target policy document.
In some embodiments, the target policy file is provided with a region tag, and the processor 301, when identifying the policy information region in the target policy file and obtaining the policy information image, includes:
acquiring a policy file image of the target policy file;
identifying a policy information area in the policy file image according to the area tag, and segmenting the policy information area from the policy file image;
and preprocessing the image of the policy information area to obtain a policy information image.
In some embodiments, the policy information area includes a text area and a non-text area, and when the processor 301 performs image preprocessing on the policy information area to obtain a policy information image, the method includes:
acquiring the gray value of each pixel point in the policy information area;
determining target brightness levels of the text area and the non-text area according to the gray value of each pixel point;
and carrying out color enhancement processing on the text area and the non-text area according to the target brightness level to obtain the policy information image.
In some embodiments, the policy information includes first language information and second language information, and when extracting the policy information in the policy information image to obtain a corresponding target policy information text, the processor 301 includes:
inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
and sequencing the first policy information text and the second policy information text according to the first text position and the second text position to obtain a target policy information text.
In some embodiments, the processor 301, when performing keyword extraction on the preamble of each policy document number to obtain the first keyword, includes:
confirming the line coordinate of each policy document number in the policy information image;
acquiring first character information of a first preset number of characters from the text preceding each policy document number according to the line coordinate;
and comparing the first character information with a preset first word bank to obtain a first keyword matched with words in the first word bank.
In some embodiments, the processor 301, in extracting the second keyword from the following text of each policy document number, includes:
confirming the line coordinate of each policy document number in the policy information image;
acquiring second character information of a second preset number of characters from the text following each policy document number according to the line coordinate;
and comparing the second character information with a preset second word bank to obtain a second keyword matched with the words in the second word bank.
In some embodiments, the processor 301 is further configured to:
sending the target policy file and the target policy document number to a user terminal;
and receiving response information of the user terminal, and filing and storing the target policy file by taking the target policy document number as a filing number according to the response information.
In some embodiments, the response message is a message that the user terminal is triggered to send after the user confirms that the target policy file matches the target policy document number.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing embodiment of the policy document number extraction method, and is not described herein again.
Embodiments of the present application also provide a storage medium for computer-readable storage, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any one of the policy document number extraction methods provided in the embodiments of the present application.
The storage medium may be an internal storage unit of the electronic device of the foregoing embodiment, for example, a hard disk or a memory of the electronic device. The storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic device.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware embodiment, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
It should be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A policy document number extraction method is applied to electronic equipment and is characterized by comprising the following steps:
receiving a policy document number extraction instruction, acquiring a target policy document to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy document to obtain a policy information image;
extracting policy information in the policy information image to obtain a corresponding target policy information text;
extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
acquiring the text position of each policy document number in the target policy information text;
extracting keywords in the front part of each policy document number to obtain a first keyword, and extracting keywords in the rear part of each policy document number to obtain a second keyword;
weighting and summing the policy document numbers based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the key degree of each policy document number;
and selecting the policy document number with the highest criticality as the target policy document number of the target policy document.
2. The method of claim 1, wherein the target policy file is provided with an area tag, and wherein identifying a policy information area in the target policy file results in a policy information image comprising:
acquiring a policy file image of the target policy file;
identifying a policy information area in the policy file image according to the area tag, and segmenting the policy information area from the policy file image;
and preprocessing the image of the policy information area to obtain a policy information image.
3. The method of claim 2, wherein the policy information area includes a text area and a non-text area, and the image preprocessing of the policy information area to obtain a policy information image includes:
acquiring the gray value of each pixel point in the policy information area;
determining target brightness levels of the text area and the non-text area according to the gray value of each pixel point;
and carrying out color enhancement processing on the text area and the non-text area according to the target brightness level to obtain the policy information image.
4. The method of claim 1, wherein the policy information includes first language information and second language information, and wherein extracting the policy information in the policy information image to obtain a corresponding target policy information text comprises:
inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
and sequencing the first policy information text and the second policy information text according to the first text position and the second text position to obtain a target policy information text.
5. The method of claim 1, wherein the extracting keywords from the preamble of each policy document number to obtain a first keyword comprises:
confirming the line coordinate of each policy document number in the policy information image;
acquiring first character information of a first preset number of characters from the text preceding each policy document number according to the line coordinate;
and comparing the first character information with a preset first word bank to obtain a first keyword matched with words in the first word bank.
6. The method of claim 5, wherein said keyword extracting the following text of each policy number to obtain a second keyword comprises:
confirming the line coordinate of each policy document number in the policy information image;
acquiring second character information of a second preset number of characters from the text following each policy document number according to the line coordinate;
and comparing the second character information with a preset second word bank to obtain a second keyword matched with the words in the second word bank.
7. The method of any one of claims 1-6, further comprising:
sending the target policy file and the target policy document number to a user terminal;
and receiving response information of the user terminal, and filing and storing the target policy file by taking the target policy document number as a filing number according to the response information.
8. A policy document number extraction device, comprising:
the image acquisition module is used for receiving a policy document number extraction instruction, acquiring a target policy file to be subjected to policy document number extraction according to the policy document number extraction instruction, and identifying a policy information area in the target policy file to obtain a policy information image;
the text extraction module is used for extracting the policy information in the policy information image to obtain a corresponding target policy information text;
the document number extraction module is used for extracting the policy document number information of the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
the position acquisition module is used for acquiring the text position of each policy document number in the target policy information text;
the keyword module is used for extracting keywords from the front part of each policy document number to obtain first keywords and extracting keywords from the rear part of each policy document number to obtain second keywords;
the document number screening module is used for weighting and summing the policy document numbers based on a first weighting coefficient of each first keyword, a second weighting coefficient of each second keyword and a third weighting coefficient of each text position to obtain the key degree of each policy document number;
and the target document number module is used for selecting the policy document number with the highest key degree as the target policy document number of the target policy document.
9. An electronic device comprising a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for enabling connection communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of the policy document number extraction method as recited in any one of claims 1 to 7.
10. A storage medium for computer readable storage, the storage medium storing one or more programs, the one or more programs being executable by one or more processors to perform the steps of the policy document number extraction method of any of claims 1-7.
Application CN202210143541.4A (Pending): Policy document number extraction method, device, equipment and storage medium
Priority date: 2022-02-16
Filing date: 2022-02-16
Publication: CN114495145A (en), published 2022-05-13
Family ID: 81481435
Country: CN


Similar Documents

Publication Publication Date Title
US10706320B2 (en) Determining a document type of a digital document
CN111476227B (en) Target field identification method and device based on OCR and storage medium
US10846553B2 (en) Recognizing typewritten and handwritten characters using end-to-end deep learning
US10489672B2 (en) Video capture in data capture scenario
US20190294921A1 (en) Field identification in an image using artificial intelligence
US11321559B2 (en) Document structure identification using post-processing error correction
RU2760471C1 (en) Methods and systems for identifying fields in a document
CN112036295B (en) Bill image processing method and device, storage medium and electronic equipment
CN110569856A (en) Sample labeling method and device, and damage category identification method and device
JP5634972B2 (en) Method, computer program product and system for text segmentation
CN112464927B (en) Information extraction method, device and system
CN114495145A (en) Policy document number extraction method, device, equipment and storage medium
US11335108B2 (en) System and method to recognise characters from an image
CN114049686A (en) Signature recognition model training method and device and electronic equipment
CN113128496B (en) Method, device and equipment for extracting structured data from image
CN116563869B (en) Page image word processing method and device, terminal equipment and readable storage medium
CN114998906B (en) Text detection method, training method and device of model, electronic equipment and medium
CN112861841B (en) Training method and device for bill confidence value model, electronic equipment and storage medium
CN113536169B (en) Method, device, equipment and storage medium for typesetting characters of webpage
CN114637877A (en) Labeling method, electronic device and storage medium
CN106446902A (en) Non-character image recognition method and device
CN110163203B (en) Character recognition method, device, storage medium and computer equipment
CN116681058A (en) Text processing method, device and storage medium
CN117058697A (en) Extraction sequence prediction method, device, equipment and medium for case information
CN117935294A (en) Document identification method, document translation method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination