CN114694154A - File analysis method, system and storage medium - Google Patents

File analysis method, system and storage medium

Info

Publication number
CN114694154A
Authority
CN
China
Prior art keywords
policy
image
file
red
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210372198.0A
Other languages
Chinese (zh)
Inventor
杨婉琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202210372198.0A priority Critical patent/CN114694154A/en
Publication of CN114694154A publication Critical patent/CN114694154A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and discloses a file analysis method, a file analysis system and a storage medium. The method comprises: removing the watermark image from a file image to be detected through a SIFT model based on OpenCV; removing the seal from the file image to be detected by using OpenCV; determining the policy issuing department of the red-headed characters by using a pre-trained policy department identification model; and uploading the policy text image to the corresponding position of a policy text database according to the determined policy issuing department. The method saves the time otherwise spent manually reading a large number of policy texts to revise the analysis result, and avoids misjudgments and omissions caused by human subjective factors; key attribute information of a policy document, such as its policy issuing department, can be acquired accurately, and labor costs can be greatly reduced; as a result, policy files are analyzed accurately and efficiently.

Description

File analysis method, system and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a file parsing method, a file parsing system and a computer readable storage medium.
Background
NLP (Natural Language Processing) is an important research direction in the fields of computer science and artificial intelligence. It studies how computers can process, understand and use human languages (such as Chinese and English) to achieve effective communication between humans and computers; its tasks include text analysis, information extraction and the like. However, a policy file consists of red-headed characters (the red letterhead text), a preamble text and an attachment text, and the attachment text cannot be parsed with NLP.
In the prior art, OCR (Optical Character Recognition) is usually used to recognize the attachment text of a policy file, key attributes are then extracted manually, and finally the policy file is edited and classified manually according to the key attributes. The disadvantages are as follows:
A policy document contains interference such as watermarks and red-headed characters, so it cannot be recognized directly by OCR. If the key attributes are extracted manually after the watermark area and the red-headed characters have been corrected manually, the workload is large and subjective errors and omissions are frequent.
Therefore, a file parsing method suited to policy documents is needed.
Disclosure of Invention
The invention provides a file parsing method, a file parsing system, an electronic device and a storage medium, and mainly aims to solve at least one of the problems in the prior art.
In order to achieve the above object, the file parsing method provided by the present invention is applied to an electronic device and includes:
identifying the watermark image of the file image to be detected through a SIFT model based on OpenCV, identifying key points of the watermark image, and performing gradient transformation on the neighborhoods around the key points until the watermark image is removed;
performing color channel separation with OpenCV on the file image to be detected from which the watermark image has been removed, to obtain a red channel image and a policy text image; judging whether a seal exists in the red channel image; if a seal exists, extracting the seal area and replacing the pixels in the seal area with white to complete seal removal for the file to be detected;
acquiring the red-headed characters from the red channel image, and binarizing the acquired red-headed characters to obtain a black-and-white character image of the red-headed characters; extracting the characters from the black-and-white character image by using OCR; recognizing the extracted text content by using a pre-trained policy department identification model, and determining the policy issuing department of the red-headed characters;
and uploading the policy text image to the corresponding position of a policy text database according to the determined policy issuing department.
Preferably, before the step of uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department, text cleaning preprocessing is further performed on the policy text image.
Further, preferably, before the step of uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department, the method further includes identifying the policy document number of the policy text image:
acquiring the text content of the policy text image by using OCR;
extracting the policy document number from the acquired text content by using a pre-trained policy document number identification model;
and uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department and the policy document number.
Further, preferably, before the policy document number of the policy text image is identified, the method further includes acquiring the policy validity period of the policy text image:
acquiring the text content of the policy text image by using OCR;
acquiring the validity period start time and the validity period end time from the acquired text content by using a pre-trained validity period identification model;
formatting the acquired validity period start time and end time to obtain a standardized time interval as the policy validity period;
and uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department, the policy document number and the policy validity period.
Further, preferably, the training method of the policy department identification model includes:
acquiring a training data set labeled with policy department names;
training a named entity recognition model based on BiLSTM-CRF by using the training data set, wherein the first layer of the model is a low-dimensional word vector layer, the second layer is a bidirectional LSTM layer, and the third layer is a CRF layer.
Further, preferably, the method for identifying the watermark image of the file image to be detected through the SIFT model based on OpenCV, identifying the key points of the watermark image, and performing gradient transformation on the neighborhoods around the key points until the watermark image is removed includes:
searching all scale spaces corresponding to the file image to be detected through a SIFT model based on OpenCV;
screening interest points that are invariant to scale and rotation in each scale space by using a difference-of-Gaussians function, as candidate key points;
selecting candidate key points whose stability meets the set requirement as key points;
and performing gradient transformation on the neighborhoods around the key points until the watermark image is removed.
Further, preferably, the method for selecting candidate key points whose stability meets the set requirement as key points includes:
refining the scale of the candidate key points by fitting a 3-D quadratic function;
screening out candidate key points with low contrast and candidate key points with poor stability in scale and rotation;
and keeping the candidate key points whose offset from the interpolation center is smaller than 0.5 as key points.
In order to solve the above problem, the present invention further provides a file parsing system, which includes:
the image processing unit is used for identifying the watermark image of the file image to be detected through a SIFT model based on OpenCV, identifying key points of the watermark image, and performing gradient transformation on the neighborhoods around the key points until the watermark image is removed;
and for performing color channel separation with OpenCV on the file image to be detected from which the watermark image has been removed, to obtain a red channel image and a policy text image; judging whether a seal exists in the red channel image; if a seal exists, extracting the seal area and replacing the pixels in the seal area with white to complete seal removal for the file to be detected;
the policy information acquisition unit is used for acquiring the red-headed characters from the red channel image, and binarizing the acquired red-headed characters to obtain a black-and-white character image of the red-headed characters; extracting the characters from the black-and-white character image by using OCR; recognizing the extracted text content by using a pre-trained policy department identification model, and determining the policy issuing department of the red-headed characters;
and the file uploading unit is used for uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department.
In order to solve the above problem, the present invention also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the aforementioned file parsing method.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is executed by a processor in an electronic device to implement the file parsing method described above.
According to the file analysis method, the file analysis system and the storage medium of the invention, the watermark image of the file image to be detected is identified through a SIFT model based on OpenCV, key points of the watermark image are identified, and gradient transformation is performed on the neighborhoods around the key points until the watermark image is removed; color channel separation is performed with OpenCV on the file image to be detected from which the watermark image has been removed, to obtain a red channel image and a policy text image; whether a seal exists in the red channel image is judged; if a seal exists, the seal area is extracted and its pixels are replaced with white to complete seal removal for the file to be detected; the red-headed characters are acquired from the red channel image and binarized to obtain a black-and-white character image of the red-headed characters; the characters are extracted from the black-and-white character image by using OCR; the extracted text content is recognized by a pre-trained policy department identification model to determine the policy issuing department of the red-headed characters; and the policy text image is uploaded to the corresponding position of a policy text database according to the determined policy issuing department. The invention has the following beneficial effects:
1) the method not only removes the watermark and the seal, but also performs targeted analysis and information extraction on the red-headed characters, which saves the time otherwise spent manually reading a large number of policy texts to revise the analysis result and avoids misjudgments and omissions caused by human subjective factors;
2) key attribute information of a policy document, such as its policy issuing department, can be acquired accurately, and labor costs can be greatly reduced; as a result, policy files are analyzed accurately and efficiently.
Drawings
FIG. 1 is a flowchart illustrating a file parsing method according to an embodiment of the invention;
FIG. 2 is a block diagram of a logical structure of a file parsing system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an internal structure of an electronic device implementing a file parsing method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology in the application is a machine learning technology based on a convolutional neural network. Convolutional neural network based applications can be used in many different fields, such as speech recognition, medical diagnostics, testing of applications, etc.
To address the low accuracy of policy file analysis in the prior art, the invention provides a file analysis method that not only removes watermarks and seals but also performs targeted analysis and information extraction on the red-headed characters. This saves the time otherwise spent manually reading a large number of policy texts to revise the analysis result, and avoids misjudgments and omissions caused by human subjective factors; key attribute information of a policy document, such as its policy issuing department, can be acquired accurately, and labor costs can be greatly reduced; as a result, policy files are analyzed accurately and efficiently.
Explanation of terms:
OCR (Optical Character Recognition) is the process in which an electronic device (e.g., a scanner or a digital camera) examines characters printed on paper and translates their shapes into computer text by using a character recognition method; that is, the text material is scanned, and the image file is then analyzed and processed to obtain the character and layout information.
OpenCV (Open Source Computer Vision Library) is an open-source computer vision library that provides many functions implementing computer vision algorithms efficiently, from basic filtering to advanced object detection. The OpenCV library is written in C and C++ and runs on Windows, Linux, Mac OS X and other systems. It also offers interfaces for Python, Java, Matlab and other languages, and has been ported to Android and iOS for developing mobile applications. OpenCV is applied widely, including image stitching, image noise reduction, product quality inspection, human-computer interaction, face recognition, motion tracking and autonomous driving. OpenCV also provides a machine learning module with algorithms such as normal Bayes, K-nearest neighbors, support vector machines, decision trees, random forests and artificial neural networks.
SIFT (Scale-Invariant Feature Transform) is a computer vision algorithm for detecting and describing local features in an image. It searches for extreme points in scale space and extracts their position, scale and rotation invariants. SIFT features are based on local appearance interest points on an object and are independent of image size and rotation; the detection rate for partially occluded objects described with SIFT features is quite high, and as few as 3 SIFT object features are enough to calculate position and orientation.
NER (Named Entity Recognition), also called "proper name recognition", refers to recognizing entities with specific meaning in text, mainly including names of people, places and organizations, proper nouns, and the like. Put simply, it identifies the boundaries and categories of entity names in natural text, and it is a sequence labeling problem.
Specifically, as an example, fig. 1 is a schematic flow chart of a file parsing method according to an embodiment of the present invention. Referring to fig. 1, the present invention provides a file parsing method, which may be performed by a device, which may be implemented by software and/or hardware.
In this embodiment, the file parsing method includes: steps S110 to S150.
S110, identifying the watermark image of the file image to be detected through a SIFT model based on OpenCV, identifying key points of the watermark image, and performing gradient transformation on the neighborhoods around the key points until the watermark image is removed.
The image of the file to be detected, namely the policy file, can be acquired by means of Computer Vision (CV) technology. Computer vision is the science of how to make machines "see": cameras and computers replace human eyes to identify, track and measure targets, and the images are further processed so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, and the like.
Specifically, all file images of the policy file, including the red-headed characters, the preamble text and the attachment text, are acquired by a camera, a computer, a PAD, or the like.
In a specific embodiment, the method for identifying the watermark image of the file image to be detected through a SIFT model based on OpenCV, identifying key points of the watermark image, and performing gradient transformation on the neighborhoods around the key points until the watermark image is removed includes:
S111, searching all scale spaces corresponding to the file image to be detected through a SIFT model based on OpenCV.
S112, screening interest points that are invariant to scale and rotation in each scale space by using a difference-of-Gaussians function, as candidate key points.
S113, selecting candidate key points whose stability meets the set requirement as key points. This selection includes: refining the candidate key points by fitting a 3-D quadratic function; screening out candidate key points with low contrast and candidate key points with poor stability in scale and rotation; and keeping the candidate key points whose offset from the interpolation center is smaller than 0.5 as key points.
S114, performing gradient transformation on the neighborhoods around the key points until the watermark image is removed.
It should be noted that the scale space L(x, y, σ) of an image is defined as the convolution of the original image I(x, y) with a variable-scale 2-D Gaussian function G(x, y, σ), that is, L(x, y, σ) = G(x, y, σ) * I(x, y). The scale is controlled by the parameter σ, and the set of L(x, y, σ) for different σ forms the scale space. In practice, the continuous Gaussian function is discretized into a (generally odd-sized) (2k+1) × (2k+1) matrix that is convolved with the digital image.
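The discretization described above can be sketched as follows. This is a minimal pure-Python illustration, not the patent's implementation; the function names and the border handling (edge replication) are assumptions. In practice a library routine such as OpenCV's `cv2.GaussianBlur` would normally be used instead.

```python
import math


def gaussian_kernel(sigma, k):
    """Sample a 2-D Gaussian on a (2k+1) x (2k+1) grid and normalize it to sum to 1."""
    size = 2 * k + 1
    kernel = [
        [math.exp(-((x - k) ** 2 + (y - k) ** 2) / (2.0 * sigma ** 2)) for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def smooth(image, kernel):
    """Apply the kernel to a grayscale image (list of rows).

    The kernel is symmetric, so correlation and convolution coincide.
    Border pixels are replicated (an assumption made for this sketch).
    """
    k = len(kernel) // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    acc += image[yy][xx] * kernel[dy + k][dx + k]
            out[y][x] = acc
    return out
```

Calling `smooth` with kernels built from several values of `sigma` gives the individual levels L(x, y, σ) of the scale space.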
Specifically, image positions are first searched over all scales, and potential interest points that are invariant to scale and rotation are identified through a difference-of-Gaussians function and taken as candidate key points. The position and scale of each candidate key point are then determined accurately by fitting a fine model (a 3-D quadratic function) at its location; key points are selected according to their degree of stability. Next, one or more orientations are assigned to each key point location based on the local gradient directions of the image, and all subsequent operations on the image data are performed relative to the orientation, scale and position of the key points, which provides invariance to these transformations. Among the detected features, low-contrast feature points and unstable edge response points are removed. When low-contrast points are removed, $\hat{X} = (x, y, \sigma)^{T}$ denotes the offset from the interpolation center; when the offset exceeds 0.5 in any dimension (x, y or σ), the interpolation center has shifted to a neighboring point, so the position of the current key point must be changed. Finally, the local gradients of the image are measured in the neighborhood around each key point at the selected scale. These gradients are transformed into a representation that tolerates relatively large local shape distortions and illumination changes; that is, the resulting features are invariant to image scale (feature size) and rotation and, to a degree, to illumination changes, which enables the watermark to be removed from the policy text image.
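The key point screening can be sketched as below, under stated assumptions. The offset is obtained from the fitted quadratic as the solution of H · offset = −g, where g and H are the gradient and Hessian of the difference-of-Gaussians response at the candidate. The 0.5 offset limit comes from the text; the contrast threshold `min_contrast=0.03` and all function names are assumptions for illustration only.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a * x = b with Cramer's rule; None if singular."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(a)
    if abs(d) < 1e-12:
        return None
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det(m) / d)
    return solution


def refine_keypoint(value, gradient, hessian, max_offset=0.5, min_contrast=0.03):
    """Return (offset, refined_value) for a stable candidate, or None to reject it.

    value    -- response at the candidate sample
    gradient -- first derivatives with respect to (x, y, sigma)
    hessian  -- 3x3 matrix of second derivatives
    """
    offset = solve3(hessian, [-g for g in gradient])
    if offset is None:
        return None
    # Reject candidates whose interpolated extremum lies closer to a neighboring sample.
    if any(abs(o) >= max_offset for o in offset):
        return None
    # Response at the interpolated extremum; reject low-contrast candidates.
    refined = value + 0.5 * sum(g * o for g, o in zip(gradient, offset))
    if abs(refined) < min_contrast:
        return None
    return offset, refined
```

A full SIFT implementation would move a rejected candidate to the neighboring sample and refit rather than discard it immediately; the sketch keeps only the acceptance test.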
S120, performing color channel separation with OpenCV on the file image to be detected from which the watermark image has been removed, to obtain a red channel image and a policy text image; judging whether a seal exists in the red channel image; and if a seal exists, extracting the seal area and replacing the pixels in the seal area with white to complete seal removal for the file to be detected.
That is to say, color channel separation is performed with OpenCV on the policy file image to be detected, the red channel image is taken, and threshold segmentation is applied to it to separate the red content. This yields the red seal, which is then removed, the red-headed characters of the red-headed document, and the complete policy text image.
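A minimal sketch of the red-content segmentation and seal whitening, written in pure Python so that it runs without OpenCV. Pixels are `(b, g, r)` tuples, following OpenCV's channel order. The thresholds `r_min` and `gb_max`, the `min_pixels` criterion for deciding that a seal exists, and the function names are assumptions; the patent does not state how the seal is told apart from the red-headed characters, so the sketch simply treats every red pixel in the given image (or crop) as seal. With OpenCV the same steps would typically rely on `cv2.split` and `cv2.threshold`.

```python
WHITE = (255, 255, 255)


def red_mask(image, r_min=150, gb_max=100):
    """Mark pixels whose red value is high while green and blue are low."""
    return [
        [(r >= r_min and g <= gb_max and b <= gb_max) for (b, g, r) in row]
        for row in image
    ]


def find_seal_region(mask, min_pixels=4):
    """Return the bounding box (top, left, bottom, right) of the red pixels,
    or None when too few red pixels are present to count as a seal."""
    hits = [(y, x) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if len(hits) < min_pixels:
        return None
    ys = [y for y, _ in hits]
    xs = [x for _, x in hits]
    return min(ys), min(xs), max(ys), max(xs)


def whiten_region(image, mask, box):
    """Replace the masked pixels inside the box with white; other pixels are kept."""
    top, left, bottom, right = box
    out = [list(row) for row in image]
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            if mask[y][x]:
                out[y][x] = WHITE
    return out
```

Whitening only the masked pixels, rather than the whole bounding box, keeps any black body text that the seal overlaps.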
S130, acquiring the red-headed characters from the red channel image, and binarizing the acquired red-headed characters to obtain a black-and-white character image of the red-headed characters; extracting the characters from the black-and-white character image by using OCR; and recognizing the extracted text content by using a pre-trained policy department identification model to determine the policy issuing department of the red-headed characters.
The training method of the policy department identification model includes:
S1301, acquiring a training data set labeled with policy department names;
S1302, training a named entity recognition model based on BiLSTM-CRF (a bidirectional LSTM combined with a conditional random field) by using the training data set. The first layer of the model is a low-dimensional word vector layer, the second layer is a bidirectional LSTM layer, and the third layer is a CRF layer.
That is to say, the hidden layer obtained by processing the feature layer with the bidirectional LSTM layer is placed in front of the CRF layer, so that deeper semantics are obtained. The named entity recognition model learns entity boundaries, which makes it suitable for recognizing proper nouns. The vector of each character of a sentence is taken as input; the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are concatenated into a complete hidden state sequence; the CRF layer then performs sentence-level sequence labeling on this hidden state sequence, which gives the named entity recognition model for department names.
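The sentence-level labeling performed by the CRF layer can be illustrated with Viterbi decoding. The sketch below assumes that the BiLSTM has already produced per-character emission scores and that the CRF has learned a tag transition matrix; the B/I/O tag scheme, the function names and the scores used in any example are assumptions for illustration, not trained values from the patent.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   -- score of tag j for the character at position t
    transitions[i][j] -- score of moving from tag i to tag j
    """
    n = len(tags)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(range(n), key=lambda i: score[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[i] for i in path]


def extract_entities(chars, labels):
    """Join the characters of each B/I span into one entity string."""
    entities, current = [], ""
    for ch, label in zip(chars, labels):
        if label == "B":
            if current:
                entities.append(current)
            current = ch
        elif label == "I" and current:
            current += ch
        else:
            if current:
                entities.append(current)
            current = ""
    if current:
        entities.append(current)
    return entities
```

The transition matrix is what lets the CRF layer rule out invalid sequences, for example by giving the step from O directly to I a strongly negative score.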
That is, the red-headed characters obtained in step S120 are binarized into a black-and-white character image, OCR is performed on it to obtain the text of the red-headed characters, and the policy issuing department is obtained from this text by using the trained policy department identification model.
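The binarization step can be sketched as follows: red pixels become black characters on a white background, which is the form most OCR engines expect. Pixels are `(b, g, r)` tuples; the thresholds and the function name are assumptions, and the OCR call itself is omitted because the patent does not name an OCR engine.

```python
def binarize_red_text(image, r_min=150, gb_max=100):
    """Map red pixels to black (0) and every other pixel to white (255)."""
    return [
        [0 if (r >= r_min and g <= gb_max and b <= gb_max) else 255 for (b, g, r) in row]
        for row in image
    ]
```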
In a specific implementation, the validity period and the document number of the policy are identified in addition to the name of the policy department. Therefore, before the step of uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department, the method includes S131, performing text cleaning preprocessing on the policy text image. Specifically, the text cleaning preprocessing includes space deletion and recognition of misplaced characters.
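A minimal sketch of text cleaning for Chinese OCR output. Space deletion follows the text; the Unicode NFKC normalization, which folds full-width digits, letters and punctuation into their standard forms, is an assumption added for illustration and is not described in the patent. The function name is hypothetical.

```python
import re
import unicodedata


def clean_text(text):
    """Normalize full-width characters, then delete all whitespace."""
    # NFKC turns full-width digits and letters into ASCII and the
    # ideographic space into a regular space.
    text = unicodedata.normalize("NFKC", text)
    # Chinese text has no word-separating spaces, so every whitespace
    # character (including line breaks) is treated as OCR noise.
    return re.sub(r"\s+", "", text)
```

Removing all whitespace is only appropriate for languages written without word spaces; for mixed Chinese and English text the rule would need to be narrowed.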
Before the step of uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department, the method further includes S132, identifying the policy document number of the policy text image. The specific steps are:
S1321, acquiring the text content of the policy text image by using OCR;
S1322, extracting the policy document number from the acquired text content by using the pre-trained policy document number identification model.
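The patent extracts the document number with a trained model. As a rule-based stand-in (explicitly not the patent's method), the sketch below uses a regular expression. It assumes the common layout "issuer code, bracketed four-digit year, serial number, 号", for example 财建〔2022〕15号; the pattern and the function name are assumptions and would miss numbers written in other layouts.

```python
import re

# Issuer code (1 to 10 Chinese characters), bracketed year, serial number, "号".
DOC_NUMBER = re.compile(r"[\u4e00-\u9fff]{1,10}[〔\[(（]\d{4}[〕\])）]\d+号")


def find_document_number(text):
    """Return the first substring that looks like a policy document number, or None."""
    match = DOC_NUMBER.search(text)
    return match.group(0) if match else None
```

Such a rule can also serve as a sanity check on the output of a learned model.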
Before the policy document number of the policy text image is identified, the method further includes S133, acquiring the policy validity period of the policy text image:
S1331, acquiring the text content of the policy text image by using OCR;
S1332, acquiring the validity period start time and the validity period end time from the acquired text content by using a pre-trained validity period identification model;
S1333, formatting the acquired validity period start time and end time to obtain a standardized time interval as the policy validity period.
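The format processing in S1333 can be sketched as below. The accepted input formats (Chinese-style dates such as 2022年4月1日 and dates separated by `-`, `/` or `.`) and the ISO 8601 interval notation `start/end` used as the standardized form are assumptions; the patent does not specify the target format.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"(\d{4})\s*[年\-/.]\s*(\d{1,2})\s*[月\-/.]\s*(\d{1,2})\s*日?")


def parse_date(text):
    """Parse the first date found in the text; None if no date is present."""
    match = DATE_PATTERN.search(text)
    if not match:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)  # raises ValueError for impossible dates


def normalize_validity(start_text, end_text):
    """Return the validity period as 'YYYY-MM-DD/YYYY-MM-DD', or None if inconsistent."""
    start, end = parse_date(start_text), parse_date(end_text)
    if start is None or end is None or end < start:
        return None
    return f"{start.isoformat()}/{end.isoformat()}"
```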
It should be noted that the algorithms used for feature point extraction by the policy document number identification model and the validity period identification model include, but are not limited to, the SIFT (Scale-Invariant Feature Transform) algorithm, the SURF (Speeded-Up Robust Features) algorithm and the ORB (Oriented FAST and Rotated BRIEF) algorithm. For the network structure and the training method of these two models, refer to the policy department identification model.
S140, uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department.
When the policy document number and the policy validity period have also been identified, the policy text image is uploaded to the corresponding position of the policy text database according to the determined policy issuing department, the policy document number and the policy validity period.
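How the "corresponding position" might be derived from the extracted attributes can be sketched as follows. The patent does not describe the database layout, so the nested dictionary standing in for the policy text database, the `PolicyRecord` type, the key format and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRecord:
    department: str       # policy issuing department from the red-headed characters
    document_number: str  # policy document number
    validity: str         # standardized validity period
    image_name: str       # file name of the policy text image


def storage_key(record):
    """Build a position key of the form department/document number/image name."""
    return "/".join([record.department, record.document_number, record.image_name])


def upload(database, record):
    """Store the record under its department and document number; return its key."""
    by_number = database.setdefault(record.department, {})
    by_number.setdefault(record.document_number, []).append(record)
    return storage_key(record)
```

Grouping first by department and then by document number mirrors the order in which the attributes are added in the method.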
In summary, the file analysis method of the invention identifies the watermark image of the file image to be detected through the OpenCV-based SIFT model, identifies the key points of the watermark image, and performs gradient transformation on the neighborhood around the key points until the watermark image is removed; performs color channel separation on the file image from which the watermark image has been removed by using OpenCV to obtain a red channel image and a policy text image; judges whether a seal exists in the red channel image; if a seal exists, extracts the seal area of the seal and replaces the pixels in the seal area with white to complete seal removal for the file to be detected; acquires the red-headed characters from the red channel image and binarizes them to obtain a black-and-white character image of the red-headed characters; extracts text from the black-and-white character image by using OCR; recognizes the extracted text content by using a pre-trained policy department recognition model to determine the policy issuing department to which the red-headed characters belong; and uploads the policy text image to the corresponding position of the policy text database according to the determined policy issuing department.
The method not only removes the watermark and the seal, but also carries out targeted analysis and information extraction on the red-headed characters, saves the time for manually and carefully reading a large number of policy texts to revise the analysis result, and avoids the occurrence of wrong and missed judgment phenomena caused by artificial subjective factors; the key attribute information of the policy documents such as a policy issuing department of the policy documents and the like can be accurately acquired, and the labor cost of manpower can be greatly reduced; finally, the technical effect of accurately and efficiently analyzing the policy files is achieved.
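The seal check and seal removal summarized above can be sketched in a few lines. The description performs these steps with OpenCV; the version below uses plain Python lists of RGB tuples so that it runs without third-party packages. The redness thresholds, the `min_pixels` cutoff, and the choice to whiten only red-dominant pixels (rather than a whole region) are assumptions of this sketch, not values taken from the description.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255
Image = List[List[Pixel]]
WHITE: Pixel = (255, 255, 255)


def is_red(pixel: Pixel) -> bool:
    """Treat a pixel as seal-colored when red clearly dominates (assumed thresholds)."""
    r, g, b = pixel
    return r > 150 and g < 100 and b < 100


def remove_seal(image: Image, min_pixels: int = 3) -> Tuple[Image, bool]:
    """Return (cleaned image, seal_found).

    A seal is assumed present when at least `min_pixels` red pixels exist;
    those pixels are then replaced with white, other pixels are kept.
    """
    red_count = sum(is_red(p) for row in image for p in row)
    if red_count < min_pixels:
        return [list(row) for row in image], False
    cleaned = [[WHITE if is_red(p) else p for p in row] for row in image]
    return cleaned, True


if __name__ == "__main__":
    red, black = (220, 30, 30), (0, 0, 0)
    page = [[red, red, black], [WHITE, red, WHITE]]
    cleaned, found = remove_seal(page)
    print(found, cleaned)
```

Because red-headed characters are also red, a real pipeline would need to restrict the replacement to the extracted seal area, as the description states; this sketch omits that localization step.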
Corresponding to the file analysis method, the invention also provides a file analysis system. FIG. 3 illustrates the functional modules of a file parsing system according to an embodiment of the invention.
As shown in fig. 3, the file parsing system 200 provided by the present invention can be installed in an electronic device. According to the implemented functions, the file parsing system 200 may include an image processing unit 210, a policy information obtaining unit 220, and a file uploading unit 230. The units of the invention, which may also be referred to as modules, refer to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a certain fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions of the respective modules/units are as follows:
the image processing unit 210 is configured to identify the watermark image of the file image to be detected through the OpenCV-based SIFT model, perform key point identification on the watermark image, and perform gradient transformation on the neighborhood around the key points until the watermark image is removed;
to perform color channel separation on the file image from which the watermark image has been removed by using OpenCV to obtain a red channel image and a policy text image; to judge whether a seal exists in the red channel image; and, if a seal exists, to extract the seal area of the seal and replace the pixels in the seal area with white to complete seal removal for the file to be detected;
a policy information obtaining unit 220, configured to perform red-headed character obtaining on the red channel map, perform binarization processing on the obtained red-headed characters, and obtain black-and-white character images of the red-headed characters; performing character extraction on the black-white character image of the red head characters by using OCR; identifying the extracted text content by using a pre-trained policy department identification model, and determining a policy issuing department of the red-headed characters;
and the file uploading unit 230 is used for uploading the policy text image to a corresponding position of the policy text database according to the determined policy issuing department.
The policy information obtaining unit 220 may further include a policy department identification module 221, a policy document number identification module 222, and a policy validity period identification module 223. The policy department identification module 221 is configured to acquire the red-headed characters from the red channel image, binarize the acquired red-headed characters to obtain a black-and-white character image of the red-headed characters, extract text from the black-and-white character image by using OCR, and recognize the extracted text content by using a pre-trained policy department recognition model to determine the policy issuing department to which the red-headed characters belong. The policy document number identification module 222 is configured to acquire the text content of the policy text image by using OCR and to extract the policy document number from the acquired text content by using a pre-trained policy document number identification model. The policy validity period identification module 223 is configured to acquire the text content of the policy text image by using OCR, to acquire the validity period start time and the validity period end time from the acquired text content by using the pre-trained validity period identification model, and to perform format processing on the acquired validity period start time and validity period end time to obtain a standardized time interval as the policy validity period.
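The division into units can be pictured as three callables chained in order. The sketch below is only a structural illustration: the class name `FileParsingSystem`, the dictionary keys, and the stub behaviors are hypothetical, and each stub stands in for the much richer processing that units 210, 220, and 230 perform.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FileParsingSystem:
    # Unit 210: raw file image -> named intermediate images.
    image_processing_unit: Callable[[bytes], Dict[str, bytes]]
    # Unit 220: intermediate images -> extracted policy attributes.
    policy_information_unit: Callable[[Dict[str, bytes]], Dict[str, str]]
    # Unit 230: policy text image + attributes -> storage location.
    file_uploading_unit: Callable[[bytes, Dict[str, str]], str]

    def parse(self, file_image: bytes) -> str:
        """Run the three units in the order given in the description."""
        images = self.image_processing_unit(file_image)
        info = self.policy_information_unit(images)
        return self.file_uploading_unit(images["policy_text"], info)


if __name__ == "__main__":
    # Stub units used only to show the data flow.
    system = FileParsingSystem(
        image_processing_unit=lambda img: {"red_channel": img, "policy_text": img},
        policy_information_unit=lambda images: {"department": "example-department"},
        file_uploading_unit=lambda img, info: info["department"] + "/document.png",
    )
    print(system.parse(b"raw image bytes"))
```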
For more specific implementations of the file parsing system provided by the present invention, reference can be made to the above embodiments of the file parsing method; they are not repeated here.
According to the embodiment, the file analysis system provided by the invention identifies the watermark image of the file image to be detected through the OpenCV-based SIFT model, identifies the key points of the watermark image, and performs gradient transformation on the neighborhood around the key points until the watermark image is removed; performs color channel separation on the file image from which the watermark image has been removed by using OpenCV to obtain a red channel image and a policy text image; judges whether a seal exists in the red channel image; if a seal exists, extracts the seal area of the seal and replaces the pixels in the seal area with white to complete seal removal for the file to be detected; acquires the red-headed characters from the red channel image and binarizes them to obtain a black-and-white character image of the red-headed characters; extracts text from the black-and-white character image by using OCR; recognizes the extracted text content by using a pre-trained policy department recognition model to determine the policy issuing department to which the red-headed characters belong; and uploads the policy text image to the corresponding position of the policy text database according to the determined policy issuing department.
The method not only removes the watermark and the seal, but also carries out targeted analysis and information extraction on the red-headed characters, saves the time for manually and carefully reading a large number of policy texts to revise the analysis result, and avoids the occurrence of wrong and missed judgment phenomena caused by artificial subjective factors; the key attribute information of the policy documents such as a policy issuing department of the policy documents and the like can be accurately acquired, and the labor cost of manpower can be greatly reduced; finally, the technical effect of accurately and efficiently analyzing the policy files is achieved.
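The binarization applied to the red-headed characters before OCR can be illustrated with a fixed-threshold sketch. The description does not state which thresholding rule is used, so the fixed threshold of 128 and the input format (a grid of grayscale intensities in which darker values are character strokes) are assumptions.

```python
from typing import List

GrayImage = List[List[int]]  # intensities in 0..255


def binarize(gray: GrayImage, threshold: int = 128) -> GrayImage:
    """Map each pixel to 0 (black) if below the threshold, else 255 (white)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


if __name__ == "__main__":
    strokes = [[10, 200], [128, 127]]
    print(binarize(strokes))
```

An adaptive rule could replace the fixed threshold when scan brightness varies between documents; that choice is left open here.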
As shown in fig. 3, the present invention further provides an electronic device 3 for implementing the file parsing method.
The electronic device 3 may comprise a processor 30, a memory 31 and a bus, and may further comprise a computer program, such as a file parser 32, stored in the memory 31 and operable on said processor 30.
The memory 31 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 31 may in some embodiments be an internal storage unit of the electronic device 3, for example a removable hard disk of the electronic device 3. The memory 31 may also be an external storage device of the electronic device 3 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the electronic device 3. The memory 31 may be used not only to store application software installed in the electronic device 3 and various types of data, such as codes of a file parser, but also to temporarily store data that has been output or is to be output.
The processor 30 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 30 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 3 by running or executing programs or modules (e.g., file parsing programs, etc.) stored in the memory 31 and calling data stored in the memory 31.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 31 and at least one processor 30 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 3, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 3 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 30 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 3 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 3 and other electronic devices.
Optionally, the electronic device 3 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), or optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 3 and for displaying a visualized user interface.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The file parser 32 stored in the memory 31 of the electronic device 3 is a combination of instructions that, when executed by the processor 30, may implement: identifying the watermark image of the file image to be detected through the OpenCV-based SIFT model, identifying the key points of the watermark image, and performing gradient transformation on the neighborhood around the key points until the watermark image is removed; performing color channel separation on the file image from which the watermark image has been removed by using OpenCV to obtain a red channel image and a policy text image; judging whether a seal exists in the red channel image; if a seal exists, extracting the seal area of the seal and replacing the pixels in the seal area with white to complete seal removal for the file to be detected; acquiring the red-headed characters from the red channel image and binarizing them to obtain a black-and-white character image of the red-headed characters; extracting text from the black-and-white character image by using OCR; recognizing the extracted text content by using a pre-trained policy department recognition model to determine the policy issuing department to which the red-headed characters belong; and uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department.
Specifically, the processor 30 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the instruction, which is not described herein again. It should be emphasized that, in order to further ensure the privacy and security of the file parser, the file parser is stored in a node of a blockchain where the server cluster is located.
Further, the integrated modules/units of the electronic device 3, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
An embodiment of the present invention further provides a computer-readable storage medium, which may be nonvolatile or volatile. The storage medium stores a computer program which, when executed by a processor, implements: identifying the watermark image of the file image to be detected through the OpenCV-based SIFT model, identifying the key points of the watermark image, and performing gradient transformation on the neighborhood around the key points until the watermark image is removed; performing color channel separation on the file image from which the watermark image has been removed by using OpenCV to obtain a red channel image and a policy text image; judging whether a seal exists in the red channel image; if a seal exists, extracting the seal area of the seal and replacing the pixels in the seal area with white to complete seal removal for the file to be detected; acquiring the red-headed characters from the red channel image and binarizing them to obtain a black-and-white character image of the red-headed characters; extracting text from the black-and-white character image by using OCR; recognizing the extracted text content by using a pre-trained policy department recognition model to determine the policy issuing department to which the red-headed characters belong; and uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department.
Specifically, the specific implementation method of the computer program when being executed by the processor may refer to the description of the relevant steps in the embodiment file parsing method, which is not repeated herein.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, and an application service layer, which may store medical data, such as personal health profiles, kitchens, examination reports, and the like.
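The idea of data blocks linked by a cryptographic method can be illustrated with a minimal hash chain. This is a toy sketch, not a description of the blockchain platform referred to above: it has no consensus mechanism or peer-to-peer layer, and the block layout and the function names `make_block`, `append_block`, and `verify_chain` are assumptions.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first block


def make_block(payload: Dict, prev_hash: str) -> Dict:
    """Create a block whose hash covers its payload and its predecessor's hash."""
    body = json.dumps({"payload": payload, "prev_hash": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"payload": payload, "prev_hash": prev_hash, "hash": digest}


def append_block(chain: List[Dict], payload: Dict) -> None:
    """Link a new block to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(make_block(payload, prev_hash))


def verify_chain(chain: List[Dict]) -> bool:
    """Check every link and recompute every hash; any edit breaks verification."""
    prev_hash = GENESIS_HASH
    for block in chain:
        expected = make_block(block["payload"], block["prev_hash"])["hash"]
        if block["prev_hash"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


if __name__ == "__main__":
    chain: List[Dict] = []
    append_block(chain, {"doc": "file parser v1"})
    append_block(chain, {"doc": "file parser v2"})
    print(verify_chain(chain))
```

Changing the payload of any earlier block changes its recomputed hash, so `verify_chain` reports the tampering; this is the validity check the paragraph alludes to.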
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A file parsing method is applied to an electronic device, and is characterized by comprising the following steps:
identifying a watermark image of a file image to be detected through an OpenCV-based SIFT model, identifying key points of the watermark image, and performing gradient transformation on neighborhoods around the key points until the watermark image is removed;
performing color channel separation on the file image from which the watermark image has been removed by using OpenCV to obtain a red channel image and a policy text image; judging whether a seal exists in the red channel image; if a seal exists, extracting a seal area of the seal, and replacing pixels in the seal area with white to complete seal removal for the file to be detected;
acquiring red-headed characters from the red channel image, and performing binarization processing on the acquired red-headed characters to acquire black-and-white character images of the red-headed characters; performing character extraction on the black-white character image of the red head characters by using OCR; recognizing the extracted text content by using a pre-trained policy department recognition model, and determining a policy issuing department to which the red-headed characters belong;
and uploading the policy text image to a corresponding position of a policy text database according to the determined policy issuing department.
2. The file parsing method according to claim 1,
characterized in that, before the step of uploading the policy text image to the corresponding position of the policy text database according to the determined policy issuing department, text cleaning preprocessing is performed on the policy text image.
3. The document parsing method as claimed in claim 1, wherein, prior to the step of uploading the policy text image to a corresponding location of a policy text database according to the determined policy issuing authority, the method further comprises performing policy document number recognition on the policy text image;
acquiring text content of the policy text image by using OCR (optical character recognition);
extracting the policy document number of the acquired text content by using a pre-trained policy document number identification model;
and uploading the policy text image to a corresponding position of a policy text database according to the determined policy issuing department and the policy document number.
4. The file parsing method according to claim 3,
before policy document number identification is performed on the policy text image, policy validity period acquisition is further performed on the policy text image;
acquiring text content of the policy text image by using OCR (optical character recognition);
acquiring the validity period starting time and the validity period ending time of the acquired text content by using a pre-trained validity period identification model;
format processing is carried out on the obtained valid period starting time and the valid period ending time, and a standardized time interval is obtained to be used as a policy valid period;
and uploading the policy text image to a corresponding position of a policy text database according to the determined policy issuing department, the policy document number and the policy validity period.
5. The file parsing method according to claim 1,
the training method of the policy department recognition model comprises the following steps,
acquiring a training data set marked with a policy department name;
training a BiLSTM-CRF-based named entity recognition model by using the training data set; the BiLSTM-CRF-based named entity recognition model comprises a first layer, a second layer and a third layer, wherein the first layer is a low-dimensional word vector layer, the second layer is a bidirectional LSTM layer, and the third layer is a CRF layer.
6. The file parsing method according to claim 1,
the step of identifying the watermark image of the file image to be detected through an OpenCV-based SIFT model, identifying key points of the watermark image, and performing gradient transformation on neighborhoods around the key points until the watermark image is removed comprises:
searching all scale spaces corresponding to the file image to be detected through the OpenCV-based SIFT model;
screening interest points with invariable scale and rotation in each scale space by using a Gaussian differential function as candidate key points;
selecting candidate key points with stability meeting set requirements as key points;
and performing gradient transformation on the neighborhood around the key point until the watermark image is removed.
7. The file parsing method of claim 6,
a method for selecting candidate key points with stability meeting set requirements as key points comprises the following steps,
the scale of the candidate key point is refined through fitting a 3-D quadratic function;
screening out candidate key points with low contrast and candidate key points with poor stability of scale and rotation degree;
and screening candidate key points with offset of interpolation center smaller than 0.5 as key points.
8. A file parsing system, characterized by comprising:
an image processing unit, used for identifying the watermark image of the file image to be detected through an OpenCV-based SIFT model, identifying key points of the watermark image, and performing gradient transformation on neighborhoods around the key points until the watermark image is removed;
performing color channel separation on the file image from which the watermark image has been removed by using OpenCV to obtain a red channel image and a policy text image; judging whether a seal exists in the red channel image; if a seal exists, extracting the seal area of the seal, and replacing pixels in the seal area with white to complete seal removal for the file to be detected;
the policy information acquisition unit is used for acquiring red-headed characters from the red channel map, and performing binarization processing on the acquired red-headed characters to acquire black-and-white character images of the red-headed characters; performing character extraction on the black-white character image of the red head characters by using OCR; recognizing the extracted text content by using a pre-trained policy department recognition model, and determining the policy issuing department of the red-headed characters;
and the file uploading unit is used for uploading the policy text image to a corresponding position of a policy text database according to the determined policy issuing department.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the file parsing method of any of claims 1-7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the file parsing method according to any one of claims 1 to 7.
CN202210372198.0A 2022-04-11 2022-04-11 File analysis method, system and storage medium Pending CN114694154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210372198.0A CN114694154A (en) 2022-04-11 2022-04-11 File analysis method, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210372198.0A CN114694154A (en) 2022-04-11 2022-04-11 File analysis method, system and storage medium

Publications (1)

Publication Number Publication Date
CN114694154A true CN114694154A (en) 2022-07-01

Family

ID=82143006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210372198.0A Pending CN114694154A (en) 2022-04-11 2022-04-11 File analysis method, system and storage medium

Country Status (1)

Country Link
CN (1) CN114694154A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102171708A (en) * 2008-12-26 2011-08-31 日立系统解决方案有限公司 Business document processor
CN108319945A (en) * 2018-01-09 2018-07-24 佛山科学技术学院 A kind of separate type OCR recognition methods and its system
CN110728453A (en) * 2019-10-14 2020-01-24 山东嘉熙信息科技有限公司 Big data based policy automatic matching analysis system and method
CN110827189A (en) * 2019-11-01 2020-02-21 山东浪潮人工智能研究院有限公司 Method and system for removing watermark of digital image or video
CN111985464A (en) * 2020-08-13 2020-11-24 山东大学 Multi-scale learning character recognition method and system for court judgment documents
US20210192262A1 (en) * 2019-12-23 2021-06-24 Canon Kabushiki Kaisha Apparatus for processing image, storage medium, and image processing method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination