CN114495145B - Policy and document extraction method, device, equipment and storage medium - Google Patents
- Publication number
- CN114495145B
- Authority
- CN
- China
- Prior art keywords
- policy
- document
- text
- information
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/113—Details of archiving
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the application relate to the field of artificial intelligence and disclose a policy document number extraction method, apparatus, device, and storage medium. The method comprises: receiving a policy document number extraction instruction, determining a target policy file, and identifying a policy information area in the target policy file to obtain a policy information image; extracting a target policy information text from the policy information image; extracting policy document number information from the target policy information text to obtain at least two different policy document numbers; acquiring the text position of each policy document number in the target policy information text; extracting keywords from the text preceding each policy document number to obtain first keywords, and from the text following each policy document number to obtain second keywords; and performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword, and the third weighting coefficient of each text position, and selecting the target policy document number accordingly.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a policy document number extraction method, apparatus, device, and storage medium.
Background
The correct document number of a policy must be extracted before the policy can be presented, searched, or filed, but document numbers may appear anywhere in a policy file. Conventional extraction generally requires a staff member to read the entire policy file and extract the document number manually, which may reduce the timeliness of the policy.
Manual extraction is error-prone and consumes a great deal of time, manpower, and material resources. The technical problem to be solved by those skilled in the art is therefore how to extract the policy document number from a policy file quickly and accurately.
Disclosure of Invention
The main purpose of the embodiments of the application is to provide a policy document number extraction method, apparatus, device, and storage medium that extract the policy document number from a policy file quickly and accurately.
In a first aspect, an embodiment of the present application provides a policy document number extraction method, which is applied to an electronic device and includes:
receiving a policy document number extraction instruction, acquiring, according to the instruction, a target policy file from which a policy document number is to be extracted, and identifying a policy information area in the target policy file to obtain a policy information image;
extracting the policy information in the policy information image to obtain a corresponding target policy information text;
extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
acquiring the text position of each policy document number in the target policy information text;
extracting keywords from the text preceding each policy document number to obtain first keywords, and extracting keywords from the text following each policy document number to obtain second keywords;
performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword, and the third weighting coefficient of each text position to obtain the key degree of each policy document number; and
selecting the policy document number with the highest key degree as the target policy document number of the target policy file.
In a second aspect, an embodiment of the present application further provides a policy document number extraction apparatus, including:
an image acquisition module, used for receiving a policy document number extraction instruction, acquiring, according to the instruction, a target policy file from which a policy document number is to be extracted, and identifying a policy information area in the target policy file to obtain a policy information image;
a text extraction module, used for extracting the policy information in the policy information image to obtain a corresponding target policy information text;
a document number extraction module, used for extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers;
a position acquisition module, used for acquiring the text position of each policy document number in the target policy information text;
a keyword module, used for extracting keywords from the text preceding each policy document number to obtain first keywords, and extracting keywords from the text following each policy document number to obtain second keywords;
a document number screening module, used for performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword, and the third weighting coefficient of each text position to obtain the key degree of each policy document number; and
a target document number module, used for selecting the policy document number with the highest key degree as the target policy document number of the target policy file.
In a third aspect, embodiments of the present application also provide an electronic device comprising a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for enabling communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of any of the policy document number extraction methods provided in the present specification.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of any of the policy document number extraction methods provided in the present specification.
The embodiments of the application provide a policy document number extraction method, apparatus, device, and storage medium. The method comprises: receiving a policy document number extraction instruction, acquiring, according to the instruction, a target policy file from which a policy document number is to be extracted, and identifying a policy information area in the target policy file to obtain a policy information image; extracting the policy information in the policy information image to obtain a corresponding target policy information text; extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers; acquiring the text position of each policy document number in the target policy information text; extracting keywords from the text preceding each policy document number to obtain first keywords, and from the text following each policy document number to obtain second keywords; performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword, and the third weighting coefficient of each text position to obtain the key degree of each policy document number; and selecting the policy document number with the highest key degree as the target policy document number of the target policy file.
A document number extraction model is used to obtain all the policy document numbers in the target policy file. Because the target policy document number differs from non-target document numbers in its position in the file and in the keywords that precede and follow it, the text position of every policy document number and the keywords before and after it are obtained, and each document number is evaluated along these dimensions. The target policy document number of the target policy file can thus be identified among all the extracted document numbers more quickly and accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a policy document number extraction method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an application scenario of a policy document number extraction method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a region layout structure of a target policy file according to an embodiment of the present application;
Fig. 4 is a schematic block diagram of a policy document number extraction apparatus according to an embodiment of the present application;
Fig. 5 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
When policy files are inventoried for government affairs projects, the correct document number of each policy must be extracted for presentation, searching, or filing, yet document numbers may appear anywhere in a policy file. Conventional extraction usually requires a staff member to read the entire policy file and extract the document number manually, which may reduce the timeliness of the policy.
Manual extraction is error-prone, increases operating cost, and consumes a great deal of time, manpower, and material resources. The technical problem to be solved by those skilled in the art is therefore how to extract the policy document number from a policy file quickly and accurately.
In order to solve the above problems, embodiments of the present application provide a policy document number extraction method, apparatus, device, and storage medium. The method is applied to an electronic device, which may be a terminal device such as a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, or a wearable device, or may be a server, where the server may be an independent server or a server cluster.
Specifically, the method comprises: receiving a policy document number extraction instruction, acquiring, according to the instruction, a target policy file from which a policy document number is to be extracted, and identifying a policy information area in the target policy file to obtain a policy information image; extracting the policy information in the policy information image to obtain a corresponding target policy information text; extracting policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers; acquiring the text position of each policy document number in the target policy information text; extracting keywords from the text preceding each policy document number to obtain first keywords, and from the text following each policy document number to obtain second keywords; performing a weighted summation for each policy document number based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword, and the third weighting coefficient of each text position to obtain the key degree of each policy document number; and selecting the policy document number with the highest key degree as the target policy document number of the target policy file.
The document number extraction model yields all the policy document numbers in the target policy file. Because the target policy document number differs from non-target document numbers in its position in the file and in the keywords that precede and follow it, each extracted document number is evaluated by its text position, its preceding keywords, and its following keywords, so that the target policy document number can be identified among all the extracted document numbers more quickly and accurately.
Some embodiments of the application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to Fig. 1, Fig. 1 is a flow chart of a policy document number extraction method according to an embodiment of the application.
As shown in Fig. 1, the policy document number extraction method includes steps S1 to S7.
Step S1: receive a policy document number extraction instruction, acquire, according to the instruction, a target policy file from which a policy document number is to be extracted, and identify a policy information area in the target policy file to obtain a policy information image.
As shown in Figs. 2 and 3, the file acquisition terminal 101 is provided with an image acquisition device, such as a camera, which can capture an image of the target policy file. When a user needs the document number of a policy file to be identified so that the file can be archived, the user attaches a to-be-identified tag to that policy file, for example by placing the words "to be identified" in its upper left or upper right corner.
The user also marks the policy information area in which the policy document numbers are located. The electronic device 300 then controls the file acquisition terminal 101 to acquire the policy file carrying the to-be-identified tag, extracts the policy information from the marked policy information area (area A in Fig. 3) to obtain the policy information text of that area, and extracts document numbers from the policy information text to obtain all the policy document numbers in the policy information area.
In some embodiments, the target policy file is provided with a region tag, and identifying the policy information area in the target policy file to obtain the policy information image includes:
collecting a policy file image of the target policy file;
identifying the policy information area in the policy file image according to the region tag, and segmenting the policy information area from the policy file image;
performing image preprocessing on the policy information area to obtain the policy information image.
For example, if the target policy file is a non-image file, such as a paper file or a PDF file, its policy file image must first be obtained through image conversion or image acquisition. The electronic device can then identify the policy information area in the policy file image based on the corresponding region tag. For instance, the user may frame the policy information area in the target policy file with a colored pen. On receiving the policy document number extraction instruction sent by the terminal device 102, the electronic device 300 controls the file acquisition terminal 101 to capture the policy file image of the target policy file, and then obtains the policy information area, that is, the image corresponding to area A in Fig. 3, by identifying the framed area in the policy file image. Image preprocessing is then performed on the policy information area to obtain the policy information image. The image preprocessing includes at least one of image noise reduction, image brightness enhancement, image color enhancement, and the like.
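The segmentation of the framed area can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that an earlier step (for example, detecting the color of the pen stroke) has already produced a boolean mask of the pixels that belong to the region tag, and the names `find_marked_box` and `crop` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of 0-255 values


def find_marked_box(mask: List[List[bool]]) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of the smallest box enclosing all True cells.

    `mask` is assumed to flag the pixels of the user's frame (the region tag).
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, marked in enumerate(row) if marked]
    if not rows:
        raise ValueError("no region tag found in the image")
    return min(rows), min(cols), max(rows), max(cols)


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Cut the policy information area (inclusive box) out of the file image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

A production system would more likely use an image library for both steps; plain lists are used here only to keep the sketch self-contained.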
In some embodiments, the policy information area includes a text area and a non-text area, and performing image preprocessing on the policy information area to obtain the policy information image includes:
acquiring the gray value of each pixel in the policy information area;
determining the target brightness levels of the text area and the non-text area according to the gray values of the pixels;
performing color enhancement processing on the text area and the non-text area according to the target brightness levels to obtain the policy information image.
Illustratively, the gray value measures how gray a pixel appears in a black-and-white image. A grayscale image generally displays shades from the darkest black to the brightest white. The range between white and black is divided logarithmically into several levels, called "gray levels", typically from 0 to 255, where white is 255 and black is 0. The gray value of each pixel in the policy information area is acquired so that the maximum gray value and the average gray value can be determined.
Generally, the policy information area includes a text area and a non-text area, where the text area is the area containing the text to be recognized. The maximum gray value and the average gray value together characterize the current brightness levels of the text area and the non-text area from different dimensions. The target brightness levels of the two areas are then determined from their current brightness levels, and the brightness of each area is adjusted accordingly. This avoids a loss of text recognition accuracy caused by overexposure of the text area, or by interference from an exposed non-text area.
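The brightness adjustment can be illustrated with a short sketch. The patent does not state how the target brightness level is derived from the maximum and average gray values, so the rule below (scale toward a hypothetical `target_mean`, but never push the brightest pixel past 255) is an assumption chosen only for illustration; it would be applied separately to the text area and the non-text area.

```python
from typing import List, Tuple

Region = List[List[int]]  # gray values 0 (black) to 255 (white)


def gray_stats(region: Region) -> Tuple[int, float]:
    """Return the maximum gray value and the average gray value of a region."""
    flat = [p for row in region for p in row]
    if not flat:
        raise ValueError("region is empty")
    return max(flat), sum(flat) / len(flat)


def brightness_gain(max_gray: int, mean_gray: float,
                    target_mean: float = 160.0) -> float:
    """Hypothetical rule: move the mean toward target_mean without clipping.

    The cap 255 / max_gray keeps the brightest pixel from being overexposed.
    """
    if mean_gray == 0 or max_gray == 0:
        return 1.0  # an all-black region is left unchanged
    return min(target_mean / mean_gray, 255 / max_gray)


def adjust(region: Region, gain: float) -> Region:
    """Apply the gain to every pixel, keeping values within 0-255."""
    return [[min(255, round(p * gain)) for p in row] for row in region]
```

Using both statistics matters here: the mean alone would brighten a dim region that already contains a few near-white pixels and wash them out.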
It will be appreciated that in some embodiments, the file acquisition terminal 101 may be integrated within the electronic device 300. In some embodiments, the user may issue the policy document number extraction instruction to the electronic device 300 through an input device such as a mouse, a keyboard, or a touch panel communicatively connected to the electronic device 300, instead of issuing it through the terminal device 102.
Step S2: extract the policy information in the policy information image to obtain a corresponding target policy information text.
The policy information includes at least two kinds of characters, such as Chinese, English, and Arabic numerals. The policy information in the policy information image is recognized by an OCR model to obtain the corresponding target policy information text.
In some embodiments, the policy information includes first language information and second language information, and extracting the policy information in the policy information image to obtain the corresponding target policy information text includes:
inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
ordering the first policy information text and the second policy information text according to the first text position and the second text position to obtain the target policy information text.
Illustratively, policy information generally includes Chinese, English, digits, and so on, so the non-Chinese text includes at least English and digits. The character recognition model is an OCR (Optical Character Recognition) model, by which the policy information in the policy information image can be obtained.
The OCR model comprises a first OCR model and a second OCR model. The first OCR model is a Chinese recognition model that recognizes the Chinese text in the policy information image and its first text position in the image. The second OCR model is a non-Chinese recognition model that recognizes the non-Chinese text in the policy information image and its second text position in the image. The first policy information text and the second policy information text are ordered by the first text position and the second text position to obtain the target policy information text. Recognizing the policy information image with two different OCR models can improve text recognition accuracy.
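The ordering of the two recognition results by position can be sketched as below. The `TextBox` structure, the pixel coordinates, and the `row_tolerance` parameter are assumptions made for illustration; the outputs of the two OCR models are represented simply as lists of boxes rather than by calls to any particular OCR library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    text: str
    row: int  # top pixel coordinate of the recognized fragment
    col: int  # left pixel coordinate of the recognized fragment


def merge_by_position(first: List[TextBox], second: List[TextBox],
                      row_tolerance: int = 5) -> str:
    """Interleave Chinese and non-Chinese fragments into reading order.

    Fragments whose rows differ from the first fragment of a line by at most
    row_tolerance pixels are treated as the same line, then ordered left to right.
    """
    boxes = sorted(first + second, key=lambda b: (b.row, b.col))
    lines: List[List[TextBox]] = []
    for box in boxes:
        if lines and abs(box.row - lines[-1][0].row) <= row_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return "\n".join(
        "".join(b.text for b in sorted(line, key=lambda b: b.col))
        for line in lines
    )
```

The tolerance is needed because two models rarely report exactly the same row coordinate for fragments that sit on one printed line.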
Step S3: extract policy document number information from the target policy information text according to a preset document number extraction model to obtain at least two different policy document numbers.
The target policy file contains a target policy document number and non-target policy document numbers, where a non-target policy document number is a document number cited by the target policy file. Picking the target policy document number out of the target policy file is therefore the key task.
Illustratively, taking an NLP document number extraction model as an example, the model extracts policy document number information from the target policy information text to obtain all the policy document numbers in that text. These form the policy document number set of the target policy file, which contains at least two different policy document numbers, namely the target policy document number and at least one non-target policy document number. The NLP document number extraction model is trained on policy document number data.
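As a rough stand-in for the trained extraction model, the sketch below uses a regular expression. This is a deliberate substitution, not the NLP model the text describes: it assumes document numbers follow the common pattern of an issuer abbreviation, a bracketed four-digit year, a serial number, and the character 号, and the name `extract_document_numbers` is hypothetical.

```python
import re
from typing import List

# Assumed pattern: issuer abbreviation (1-8 CJK characters), bracketed year,
# serial number, then the suffix "号". A trained model would handle variants
# that this pattern misses.
DOC_NUMBER = re.compile(r"[\u4e00-\u9fff]{1,8}[〔\[]\d{4}[〕\]]\d+号")


def extract_document_numbers(text: str) -> List[str]:
    """Return the distinct document numbers in order of first appearance."""
    found: List[str] = []
    for match in DOC_NUMBER.finditer(text):
        if match.group() not in found:
            found.append(match.group())
    return found
```

One known limitation of this pattern is that Chinese characters immediately preceding the issuer abbreviation are absorbed into the match, which is one reason a trained model is preferable in practice.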
Step S4: acquire the text position of each policy document number in the target policy information text.
Different policy document numbers may occupy different positions in the text. The position of each policy document number in the target policy information text is therefore identified and used to help judge whether that document number is the target policy document number, which improves the accuracy of the judgment.
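A minimal sketch of the position lookup follows. It assumes the text position can be represented as a character offset plus a relative position between 0 and 1; the patent does not fix a representation, and `text_positions` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def text_positions(text: str, numbers: List[str]) -> Dict[str, Tuple[int, float]]:
    """Map each document number to (character offset, relative position).

    Only the first occurrence of each number is considered; numbers that do
    not occur in the text are skipped.
    """
    positions: Dict[str, Tuple[int, float]] = {}
    for number in numbers:
        offset = text.find(number)
        if offset == -1:
            continue
        positions[number] = (offset, offset / len(text))
    return positions
```

The relative position makes files of different lengths comparable when the position is later turned into a weight.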
Step S5: extract keywords from the text preceding each policy document number to obtain first keywords, and extract keywords from the text following each policy document number to obtain second keywords.
The keywords in the context of different policy document numbers may differ, and the context keywords of the target policy document number differ markedly from those of non-target policy document numbers, because a non-target policy document number is usually one that the target policy file cites. Extracting and identifying the context keywords of each policy document number therefore helps judge whether it is the target policy document number or a non-target one.
In some embodiments, extracting keywords from the text preceding each policy document number to obtain the first keywords includes:
confirming the row coordinates of each policy document number in the policy information image;
acquiring, according to the row coordinates, first character information consisting of a first preset number of characters preceding each policy document number;
comparing the first character information with a preset first lexicon to obtain the first keywords that match words in the first lexicon.
In some embodiments, extracting keywords from the text following each policy document number to obtain the second keywords includes:
confirming the row coordinates of each policy document number in the policy information image;
acquiring, according to the row coordinates, second character information consisting of a second preset number of characters following each policy document number;
comparing the second character information with a preset second lexicon to obtain the second keywords that match words in the second lexicon.
For example, after all the policy document numbers in the target policy file are acquired, the target policy document number must be confirmed among them. To do so, the keywords before and after each policy document number are extracted and used to help judge whether that document number is the target policy document number.
Specifically, the row coordinates of each policy document number in the policy information image are confirmed, the first preset number of characters preceding the policy document number are obtained according to the row coordinates, and this character information is compared with the preset first lexicon to obtain the first keywords that match words in the first lexicon. The first keywords corresponding to each policy document number are obtained in this way.
Likewise, the row coordinates of each policy document number in the policy information image are confirmed, the second preset number of characters following the policy document number are obtained according to the row coordinates, and this character information is compared with the preset second lexicon to obtain the second keywords that match words in the second lexicon. The second keywords corresponding to each policy document number are obtained in this way. The first preset number and the second preset number can be set as needed. For example, both may be the string length of 10 to 15 Chinese characters, or the string length of 20 to 25 English letters and digits combined.
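The windowed lexicon match can be sketched as follows. As a simplification, the sketch locates each document number by its character offset in the recognized text rather than by row coordinates in the image; the lexicon contents, window lengths, and the name `context_keywords` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def context_keywords(text: str, number: str,
                     before_len: int, after_len: int,
                     first_lexicon: Sequence[str],
                     second_lexicon: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (first keywords, second keywords) for one document number.

    The preceding window holds before_len characters and is matched against
    first_lexicon; the following window holds after_len characters and is
    matched against second_lexicon. Matching is case-insensitive.
    """
    start = text.find(number)
    if start == -1:
        raise ValueError(f"{number!r} does not occur in the text")
    end = start + len(number)
    before = text[max(0, start - before_len):start].lower()
    after = text[end:end + after_len].lower()
    first = [w for w in first_lexicon if w.lower() in before]
    second = [w for w in second_lexicon if w.lower() in after]
    return first, second
```

For instance, with a first lexicon containing citation phrases such as "according to" and a second lexicon containing issuance phrases such as "is hereby issued", a cited number tends to collect first keywords while the file's own number tends to collect second keywords.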
Step S6: performing a weighted summation for each policy document based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword, and the third weighting coefficient of each text position, to obtain the key degree of each policy document.
Illustratively, keywords at different positions contribute differently to determining whether a policy document number is the target policy document number; likewise, the position of a policy document number within the policy information area also contributes differently to that determination.
Therefore, when the target policy document number is selected from all the policy document numbers in the target policy file, the first weighting coefficient of the first keyword, the second weighting coefficient of the second keyword, and the third weighting coefficient of the text position are obtained and combined by weighted summation to give the key degree of each policy document number. The probability that the current policy document number is the target policy document number can then be judged from its key degree.
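As one possible reading of the weighted summation in step S6, the key degree may be computed as the sum of the coefficients of the matched first keywords, the coefficients of the matched second keywords, and the coefficient of the text position. The Python sketch below follows that reading; the `Candidate` type, the coefficient tables, and the position labels "header" and "body" are assumptions made for illustration, since the application does not give concrete values.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical coefficient tables; real values would be tuned per deployment.
FIRST_WEIGHTS: Dict[str, float] = {"issued by": 0.5}
SECOND_WEIGHTS: Dict[str, float] = {"notice": 0.3}
POSITION_WEIGHTS: Dict[str, float] = {"header": 0.4, "body": 0.1}


@dataclass
class Candidate:
    """One extracted policy document number and its context."""
    number: str
    first_keywords: List[str] = field(default_factory=list)
    second_keywords: List[str] = field(default_factory=list)
    position: str = "body"  # hypothetical label for the text position


def key_degree(candidate: Candidate) -> float:
    """Sum the coefficients of matched keywords and of the text position."""
    first_part = sum(FIRST_WEIGHTS.get(k, 0.0) for k in candidate.first_keywords)
    second_part = sum(SECOND_WEIGHTS.get(k, 0.0) for k in candidate.second_keywords)
    position_part = POSITION_WEIGHTS.get(candidate.position, 0.0)
    return first_part + second_part + position_part
```

Keywords or positions missing from the tables contribute zero, so an unrecognized context lowers the key degree rather than raising an error.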
Step S7: selecting the policy document with the highest key degree as the target policy document of the target policy file.
After the key degrees of all the policy document numbers in the target policy file are obtained, the policy document numbers are sorted in ascending or descending order of key degree, and the one with the highest key degree is selected as the target policy document number of the target policy file.
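The selection in step S7 can be illustrated with a short Python sketch that sorts the candidates in descending order of key degree and takes the first one. The function name `select_target` and the representation of candidates as (number, key degree) pairs are hypothetical; how ties are broken is not stated in the application, and the sketch simply keeps the candidate that appears first in the input.

```python
from typing import List, Tuple


def select_target(scored: List[Tuple[str, float]]) -> str:
    """Return the number with the highest key degree from (number, degree) pairs."""
    if not scored:
        raise ValueError("at least one candidate is required")
    # Descending sort by key degree; Python's sort is stable, so candidates
    # with equal key degree keep their original relative order.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[0][0]
```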
In some embodiments, after step S7, the method further comprises:
transmitting the target policy file and the target policy document to a user terminal;
And receiving response information of the user terminal, and archiving and storing the target policy file according to the response information by taking the target policy file number as an archiving number.
Illustratively, the response information is information that the user triggers the user terminal to send after confirming that the target policy file and the target policy document match.
After the target policy document number corresponding to the target policy file is obtained, it is sent to a preset user terminal together with the target policy file. Through the user terminal, the user receives the target policy document number and the target policy file sent by the electronic device and checks whether the extracted number is indeed the policy document number of the target policy file. When the user confirms that it is, the user triggers the user terminal, through an input device such as a mouse, a keyboard, or a touch panel, to send response information, so that the electronic device archives and stores the target policy file; when the file is archived, the target policy document number can be used as the file number for naming. The archive may be stored on a hard disk in the background of the electronic device, or sent to a corresponding cloud server for storage, which is not limited herein.
Referring to fig. 4, fig. 4 is a schematic block diagram illustrating a policy document extraction device according to an embodiment of the application.
As shown in fig. 4, the policy document extraction device 200 is applicable to an electronic apparatus, and the policy document extraction device 200 includes an image acquisition module 201, a text extraction module 202, a document extraction module 203, a location acquisition module 204, a keyword module 205, a document screening module 206, and a target document module 207.
The image acquisition module 201 is configured to receive a policy document extraction instruction, obtain a target policy document to be subjected to policy document extraction according to the policy document extraction instruction, and identify a policy information area in the target policy document to obtain a policy information image;
The text extraction module 202 is configured to extract policy information in the policy information image to obtain a corresponding target policy information text;
The document extraction module 203 is configured to extract the policy document information from the target policy information text according to a preset document extraction model, so as to obtain at least two different policy documents;
The location acquisition module 204 is configured to obtain the text position of each policy document in the target policy information text;
The keyword module 205 is configured to extract keywords from the pre-text of each policy document to obtain a first keyword, and to extract keywords from the post-text of each policy document to obtain a second keyword;
The document screening module 206 is configured to perform a weighted summation for each policy document based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword, and the third weighting coefficient of each text position, so as to obtain the key degree of each policy document;
The target document module 207 is configured to select the policy document with the highest key degree as the target policy document of the target policy file.
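To illustrate how the modules listed above may cooperate, the following Python sketch wires them together as injected callables. The class name `ExtractionPipeline`, the field names, and the callable signatures are hypothetical; the sketch shows only the order of the data flow between modules 201 to 207 and not how any individual module is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ExtractionPipeline:
    """Chains stand-ins for modules 201 to 207 (all signatures are assumed)."""
    acquire_image: Callable[[str], object]        # image acquisition module 201
    extract_text: Callable[[object], str]         # text extraction module 202
    extract_numbers: Callable[[str], List[str]]   # document extraction module 203
    locate: Callable[[str, str], int]             # location acquisition module 204
    keywords: Callable[[str, str, int], Tuple[Sequence[str], Sequence[str]]]  # 205
    score: Callable[[Sequence[str], Sequence[str], int], float]  # screening module 206

    def run(self, file_path: str) -> str:
        image = self.acquire_image(file_path)
        text = self.extract_text(image)
        scored = []
        for number in self.extract_numbers(text):
            position = self.locate(text, number)
            first, second = self.keywords(text, number, position)
            scored.append((self.score(first, second, position), number))
        # Target document module 207: keep the candidate with the highest key degree.
        return max(scored, key=lambda pair: pair[0])[1]
```

Passing the modules in as callables keeps each one replaceable, which matches the statement that the division into functional modules need not correspond to the division into physical components.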
In some embodiments, the policy and document extraction device 200 further includes a document filing module, configured to send the target policy document and the target policy and document to a user terminal; and receiving response information of the user terminal, and archiving and storing the target policy file according to the response information by taking the target policy file number as a label.
In some embodiments, the response information is information that the user triggers the user terminal to send after confirming that the target policy file and the target policy document match.
In some embodiments, the target policy file is provided with an area tag, and the image acquisition module 201, when identifying a policy information area in the target policy file and obtaining a policy information image, includes:
acquiring a policy document image of the target policy document;
Identifying a policy information area in the policy document image according to the area tag, and dividing the policy information area from the policy document image;
and carrying out image preprocessing on the policy information area to obtain a policy information image.
In some embodiments, the policy information area includes a text area and a non-text area, and when the image acquisition module 201 performs image preprocessing on the policy information area to obtain a policy information image, the method includes:
acquiring gray values of all pixel points in the policy information area;
Determining the target brightness level of the text region and the non-text region according to the gray value of each pixel point;
And carrying out color enhancement processing on the text region and the non-text region according to the target brightness level to obtain the policy information image.
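The application does not specify how the target brightness levels are derived from the gray values. As a stand-in, the Python sketch below uses a plain mean-threshold contrast stretch: pixels darker than the mean gray value are treated as the text region and set to one level, and all other pixels are treated as the non-text region and set to another. The function name `enhance_contrast` and the default levels 0 and 255 are assumptions.

```python
from typing import List

Gray = List[List[int]]  # rows of gray values in the range 0..255


def enhance_contrast(
    region: Gray,
    text_level: int = 0,
    background_level: int = 255,
) -> Gray:
    """Push text pixels and non-text pixels toward separate brightness levels."""
    pixels = [value for row in region for value in row]
    if not pixels:
        return []
    # Mean gray value used as the text / non-text boundary (an assumption).
    threshold = sum(pixels) / len(pixels)
    return [
        [text_level if value < threshold else background_level for value in row]
        for row in region
    ]
```

A production system would more likely use an adaptive or histogram-based threshold, since a global mean is sensitive to uneven lighting in scanned documents.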
In some embodiments, the policy information includes first language information and second language information, and the text extraction module 202, when extracting the policy information in the policy information image to obtain the corresponding target policy information text, includes:
Inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
Inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
And sequencing the first policy information text and the second policy information text according to the first text position and the second text position to obtain a target policy information text.
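The ordering of the two recognition results by text position can be sketched as follows, assuming that each character recognition model returns text spans tagged with a row index and a column offset. The `TextSpan` type and the function name `merge_recognized_text` are hypothetical, and real recognition models typically return bounding boxes that would first have to be grouped into rows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TextSpan:
    """A recognized piece of text and its position in the image (assumed form)."""
    text: str
    row: int
    col: int


def merge_recognized_text(first: List[TextSpan], second: List[TextSpan]) -> str:
    """Interleave spans from both models in reading order (row, then column)."""
    spans = sorted(first + second, key=lambda span: (span.row, span.col))
    lines: Dict[int, List[str]] = {}
    for span in spans:
        lines.setdefault(span.row, []).append(span.text)
    # One output line per image row, spans separated by a single space.
    return "\n".join(" ".join(parts) for _, parts in sorted(lines.items()))
```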
In some embodiments, the keyword module 205, when extracting the keywords from the pre-text of each policy document to obtain the first keywords, includes:
Confirming row coordinates of each policy document in the policy information image;
acquiring first character information of a first preset number in the front of each policy document according to the row coordinates;
and comparing the first character information with a preset first word stock to obtain a first keyword matched with the word in the first word stock.
In some embodiments, the keyword module 205, when extracting the keywords from the post-text of each policy document to obtain the second keywords, includes:
Confirming row coordinates of each policy document in the policy information image;
acquiring second character information of a second preset number in the post of each policy document according to the row coordinates;
And comparing the second character information with a preset second word stock to obtain a second keyword matched with the words in the second word stock.
Referring to fig. 5, fig. 5 is a schematic block diagram of an electronic device according to an embodiment of the present application.
As shown in FIG. 5, the electronic device 300 includes a processor 301 and a memory 302, the processor 301 and the memory 302 being connected by a bus 303, such as an I2C (Inter-Integrated Circuit) bus.
In particular, the processor 301 is used to provide computing and control capabilities to support the operation of the overall electronic device. The processor 301 may be a central processing unit (Central Processing Unit, CPU); the processor 301 may also be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor or any conventional processor or the like.
Specifically, the Memory 302 may be a Flash chip, a Read-Only Memory (ROM) disk, an optical disk, a U-disk, a removable hard disk, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure related to the embodiment of the present application, and does not constitute a limitation of the electronic device to which the embodiment of the present application is applied, and in particular, the electronic device may include more or less components than those shown in the drawings, or may combine some components, or have different arrangements of components.
The processor 301 is configured to run a computer program stored in the memory, and implement any one of the policy and document extraction methods provided by the embodiments of the present application when the computer program is executed.
In some embodiments, the processor 301 is configured to run a computer program stored in a memory and when executing the computer program implement the steps of:
receiving a policy and document extraction instruction, acquiring a target policy file to be subjected to policy and document extraction according to the policy and document extraction instruction, and identifying a policy information area in the target policy file to obtain a policy information image;
extracting policy information in the policy information image to obtain a corresponding target policy information text;
extracting the policy and document information from the target policy information text according to a preset document extraction model to obtain at least two different policy and document;
acquiring the text position of each policy document number in the target policy information text;
extracting keywords from the front of each policy document to obtain a first keyword, and extracting keywords from the rear of each policy document to obtain a second keyword;
performing a weighted summation for each policy document based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the key degree of each policy document;
and selecting the policy document with the highest key degree as the target policy document of the target policy file.
In some embodiments, the target policy file is provided with a region tag, and the processor 301, when identifying a policy information area in the target policy file, obtains a policy information image, includes:
acquiring a policy document image of the target policy document;
Identifying a policy information area in the policy document image according to the area tag, and dividing the policy information area from the policy document image;
and carrying out image preprocessing on the policy information area to obtain a policy information image.
In some embodiments, the policy information area includes a text area and a non-text area, and when the processor 301 performs image preprocessing on the policy information area to obtain a policy information image, the method includes:
acquiring gray values of all pixel points in the policy information area;
Determining the target brightness level of the text region and the non-text region according to the gray value of each pixel point;
And carrying out color enhancement processing on the text region and the non-text region according to the target brightness level to obtain the policy information image.
In some embodiments, the policy information includes first language information and second language information, and when extracting policy information in the policy information image to obtain a corresponding target policy information text, the processor 301 includes:
Inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
Inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
And sequencing the first policy information text and the second policy information text according to the first text position and the second text position to obtain a target policy information text.
In some embodiments, when the processor 301 performs keyword extraction on the preamble of each of the policy documents to obtain the first keyword, the method includes:
Confirming row coordinates of each policy document in the policy information image;
acquiring first character information of a first preset number in the front of each policy document according to the row coordinates;
and comparing the first character information with a preset first word stock to obtain a first keyword matched with the word in the first word stock.
In some embodiments, when the processor 301 performs keyword extraction on the post-text of each of the policy documents to obtain the second keyword, the method includes:
Confirming row coordinates of each policy document in the policy information image;
acquiring second character information of a second preset number in the post of each policy document according to the row coordinates;
And comparing the second character information with a preset second word stock to obtain a second keyword matched with the words in the second word stock.
In some implementations, the processor 301 is further configured to:
transmitting the target policy file and the target policy document to a user terminal;
And receiving response information of the user terminal, and archiving and storing the target policy file according to the response information by taking the target policy file number as an archiving number.
In some embodiments, the response information is information that the user triggers the user terminal to send after confirming that the target policy file and the target policy document match.
It should be noted that, for convenience and brevity of description, specific working processes of the above-described electronic device may refer to corresponding processes in the foregoing embodiments of the policy document extraction method, and are not described herein again.
The embodiment of the application also provides a storage medium for computer readable storage, the storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of any policy document extraction method as provided in the embodiments of the present application.
The storage medium may be an internal storage unit of the electronic device of the foregoing embodiment, for example, a hard disk or a memory of the electronic device. The storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk provided on the electronic device, a smart media card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), or the like.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware embodiment, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
It should be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. The present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the scope of the present application, and these modifications and substitutions are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (10)
1. A method for extracting a policy document applied to an electronic device, comprising:
receiving a policy and document extraction instruction, acquiring a target policy file to be subjected to policy and document extraction according to the policy and document extraction instruction, and identifying a policy information area in the target policy file to obtain a policy information image;
extracting policy information in the policy information image to obtain a corresponding target policy information text;
extracting the policy and document information from the target policy information text according to a preset document extraction model to obtain at least two different policy and document;
acquiring the text position of each policy and document number in the target policy information text;
extracting keywords from the front of each policy document to obtain a first keyword, and extracting keywords from the rear of each policy document to obtain a second keyword;
performing a weighted summation for each policy document based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the key degree of each policy document;
and selecting the policy document with the highest key degree as the target policy document of the target policy file.
2. The method of claim 1, wherein the target policy document is provided with a region tag, and the identifying the policy information area in the target policy document to obtain the policy information image comprises:
acquiring a policy document image of the target policy document;
Identifying a policy information area in the policy document image according to the area tag, and dividing the policy information area from the policy document image;
and carrying out image preprocessing on the policy information area to obtain a policy information image.
3. The method of claim 2, wherein the policy information area includes a text area and a non-text area, the performing image preprocessing on the policy information area to obtain a policy information image includes:
acquiring gray values of all pixel points in the policy information area;
Determining the target brightness level of the text region and the non-text region according to the gray value of each pixel point;
And carrying out color enhancement processing on the text region and the non-text region according to the target brightness level to obtain the policy information image.
4. The method of claim 1, wherein the policy information includes first language information and second language information, and the extracting the policy information in the policy information image to obtain the corresponding target policy information text includes:
Inputting the policy information image into a first character recognition model to obtain a first policy information text corresponding to the first language information and a first text position of the first policy information text in the policy information image;
Inputting the policy information image into a second character recognition model to obtain a second policy information text corresponding to the second language information and a second text position of the second policy information text in the policy information image;
And sequencing the first policy information text and the second policy information text according to the first text position and the second text position to obtain a target policy information text.
5. The method of claim 1, wherein the extracting the keywords from the pre-text of each of the policy documents to obtain the first keywords comprises:
Confirming row coordinates of each policy document in the policy information image;
acquiring first character information of a first preset number in the front of each policy document according to the row coordinates;
and comparing the first character information with a preset first word stock to obtain a first keyword matched with the word in the first word stock.
6. The method of claim 5, wherein the extracting the keywords from the post-text of each of the policy documents to obtain the second keywords comprises:
Confirming row coordinates of each policy document in the policy information image;
acquiring second character information of a second preset number in the post of each policy document according to the row coordinates;
And comparing the second character information with a preset second word stock to obtain a second keyword matched with the words in the second word stock.
7. The method of any one of claims 1-6, wherein the method further comprises:
transmitting the target policy file and the target policy document to a user terminal;
And receiving response information of the user terminal, and archiving and storing the target policy file according to the response information by taking the target policy file number as an archiving number.
8. A policy document extraction device, comprising:
The image acquisition module is used for receiving a policy and document extraction instruction, acquiring a target policy file to be subjected to policy and document extraction according to the policy and document extraction instruction, and identifying a policy information area in the target policy file to obtain a policy information image;
The text extraction module is used for extracting the policy information in the policy information image to obtain a corresponding target policy information text;
The document extraction module is used for extracting the policy document information from the target policy information text according to a preset document extraction model to obtain at least two different policy documents;
The position acquisition module is used for acquiring the text position of each policy document in the target policy information text;
The keyword module is used for extracting keywords from the front part of each policy document to obtain a first keyword, and extracting keywords from the rear part of each policy document to obtain a second keyword;
The document screening module is used for carrying out weighted summation on the policy documents based on the first weighting coefficient of each first keyword, the second weighting coefficient of each second keyword and the third weighting coefficient of each text position to obtain the key degree of each policy document;
And the target document module is used for selecting the policy document with the highest key degree as the target policy document of the target policy file.
9. An electronic device comprising a processor, a memory, a computer program stored on the memory and executable by the processor, and a data bus for enabling a connection communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of policy document extraction according to any of claims 1 to 7.
10. A storage medium for computer readable storage, wherein the storage medium stores one or more programs executable by one or more processors to implement the steps of policy document extraction of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210143541.4A CN114495145B (en) | 2022-02-16 | 2022-02-16 | Policy and document extraction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114495145A (en) | 2022-05-13 |
CN114495145B (en) | 2024-05-28 |
Family
ID=81481435
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210143541.4A Active CN114495145B (en) | 2022-02-16 | 2022-02-16 | Policy and document extraction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114495145B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20130103249A (en) * | 2012-03-09 | 2013-09-23 | 가톨릭대학교 산학협력단 | Method of classifying emotion from multi sentence using context information |
CN109635082A (en) * | 2018-11-26 | 2019-04-16 | 平安科技(深圳)有限公司 | Policy implication analysis method, device, computer equipment and storage medium |
CN110457696A (en) * | 2019-07-31 | 2019-11-15 | 福州数据技术研究院有限公司 | A kind of talent towards file data and policy intelligent Matching system and method |
CN110532451A (en) * | 2019-06-26 | 2019-12-03 | 平安科技(深圳)有限公司 | Search method and device for policy text, storage medium, electronic device |
CN110866116A (en) * | 2019-10-25 | 2020-03-06 | 远光软件股份有限公司 | Policy document processing method and device, storage medium and electronic equipment |
CN110968757A (en) * | 2018-09-30 | 2020-04-07 | 北京国双科技有限公司 | Policy file processing method and device |
CN111782772A (en) * | 2020-07-24 | 2020-10-16 | 平安银行股份有限公司 | Text automatic generation method, device, equipment and medium based on OCR technology |
CN113033333A (en) * | 2021-03-05 | 2021-06-25 | 北京百度网讯科技有限公司 | Entity word recognition method and device, electronic equipment and storage medium |
CN113822067A (en) * | 2021-08-17 | 2021-12-21 | 深圳市东信时代信息技术有限公司 | Key information extraction method and device, computer equipment and storage medium |
CN113870083A (en) * | 2021-09-27 | 2021-12-31 | 中关村意谷(北京)科技服务有限公司 | Policy matching method, device and system, electronic equipment and readable storage medium |
CN113961666A (en) * | 2021-09-18 | 2022-01-21 | 腾讯科技(深圳)有限公司 | Keyword recognition method, apparatus, device, medium, and computer program product |
-
2022
- 2022-02-16 CN CN202210143541.4A patent/CN114495145B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20130103249A (en) * | 2012-03-09 | 2013-09-23 | 가톨릭대학교 산학협력단 | Method of classifying emotion from multi sentence using context information |
CN110968757A (en) * | 2018-09-30 | 2020-04-07 | 北京国双科技有限公司 | Policy file processing method and device |
CN109635082A (en) * | 2018-11-26 | 2019-04-16 | 平安科技(深圳)有限公司 | Policy implication analysis method, device, computer equipment and storage medium |
CN110532451A (en) * | 2019-06-26 | 2019-12-03 | 平安科技(深圳)有限公司 | Search method and device for policy text, storage medium, electronic device |
CN110457696A (en) * | 2019-07-31 | 2019-11-15 | 福州数据技术研究院有限公司 | A kind of talent towards file data and policy intelligent Matching system and method |
CN110866116A (en) * | 2019-10-25 | 2020-03-06 | 远光软件股份有限公司 | Policy document processing method and device, storage medium and electronic equipment |
CN111782772A (en) * | 2020-07-24 | 2020-10-16 | 平安银行股份有限公司 | Text automatic generation method, device, equipment and medium based on OCR technology |
CN113033333A (en) * | 2021-03-05 | 2021-06-25 | 北京百度网讯科技有限公司 | Entity word recognition method and device, electronic equipment and storage medium |
CN113822067A (en) * | 2021-08-17 | 2021-12-21 | 深圳市东信时代信息技术有限公司 | Key information extraction method and device, computer equipment and storage medium |
CN113961666A (en) * | 2021-09-18 | 2022-01-21 | 腾讯科技(深圳)有限公司 | Keyword recognition method, apparatus, device, medium, and computer program product |
CN113870083A (en) * | 2021-09-27 | 2021-12-31 | 中关村意谷(北京)科技服务有限公司 | Policy matching method, device and system, electronic equipment and readable storage medium |
Non-Patent Citations (3)
Title |
---|
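马玉新; 吴爱萍; 李华; 王方. 中国企业技术创新政策演变过程――基于扎根理论与加权共词分析法 [The evolution of China's enterprise technological innovation policy: based on grounded theory and weighted co-word analysis]. 科学学与科学技术管理. 2018-09-10, (09), 61-72. * |
和志强; 王丽鹏; 张鹏云. 基于词共现的关键词提取算法研究与改进 [Research on and improvement of a keyword extraction algorithm based on word co-occurrence]. 电子技术与软件工程. 2018, (01), 144-146. * |
赵一方; 裴雷; 康乐乐. 基于段落信息增益的政策文本主题识别研究 [Research on topic identification in policy texts based on paragraph information gain]. 数字图书馆论坛. 2018-11-25, (11), 2-10. * |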
Also Published As
Publication number | Publication date |
---|---|
CN114495145A (en) | 2022-05-13 |
Similar Documents
Publication | Title |
---|---|
CN110135411B (en) | Business card recognition method and device |
CN110442744B (en) | Method and device for extracting target information in image, electronic equipment and readable medium |
US10366123B1 (en) | Template-free extraction of data from documents |
EP3712812A1 (en) | Recognizing typewritten and handwritten characters using end-to-end deep learning |
US8965126B2 (en) | Character recognition device, character recognition method, character recognition system, and character recognition program |
RU2760471C1 (en) | Methods and systems for identifying fields in a document |
US11232300B2 (en) | System and method for automatic detection and verification of optical character recognition data |
US10489645B2 (en) | System and method for automatic detection and verification of optical character recognition data |
CN112508011A (en) | OCR (optical character recognition) method and device based on neural network |
US9286526B1 (en) | Cohort-based learning from user edits |
CN110956739A (en) | Bill identification method and device |
JP2008276766A (en) | Form automatic filling method and device |
CN110705952A (en) | Contract auditing method and device |
CN111046879A (en) | Certificate image classification method and device, computer equipment and readable storage medium |
CN111612081B (en) | Training method, device, equipment and storage medium for recognition model |
CN112036295B (en) | Bill image processing method and device, storage medium and electronic equipment |
CN113221918B (en) | Target detection method, training method and device of target detection model |
RU2656573C2 (en) | Methods of detecting the user-integrated check marks |
CN114637877A (en) | Labeling method, electronic device and storage medium |
CN111008624A (en) | Optical character recognition method and method for generating training sample for optical character recognition |
CN114495145B (en) | Policy and document extraction method, device, equipment and storage medium |
CN116030469A (en) | Processing method, processing device, processing equipment and computer readable storage medium |
US11335108B2 (en) | System and method to recognise characters from an image |
CN113255674A (en) | Character recognition method, character recognition device, electronic equipment and computer-readable storage medium |
US12106593B2 (en) | Multi-layer neural network and convolutional neural network for context sensitive optical character recognition |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |