CN114241485A - Information identification method, device, equipment and storage medium of property certificate - Google Patents


Publication number
CN114241485A
Authority
CN
China
Prior art keywords
property certificate
property
certificate
subgraph
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210168516.1A
Other languages
Chinese (zh)
Inventor
杨志
陈耀麟
刘昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dadaoyun Technology Co ltd
Original Assignee
Shenzhen Dadaoyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dadaoyun Technology Co ltd filed Critical Shenzhen Dadaoyun Technology Co ltd
Priority to CN202210168516.1A priority Critical patent/CN114241485A/en
Publication of CN114241485A publication Critical patent/CN114241485A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

The invention relates to the field of image recognition and discloses a method, an apparatus, a device and a storage medium for recognizing information on a property certificate. The method comprises: acquiring a property certificate picture to be recognized, wherein the property certificate picture contains a property certificate subgraph; performing recognition and extraction on the property certificate picture according to a preset recognition and extraction algorithm to obtain the property certificate subgraph; performing character recognition on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set, wherein the property certificate character set comprises item names and the item content corresponding to each item name; judging, based on the property certificate character set, whether the property certificate subgraph belongs to a preset text type; if it belongs to a preset text type, performing character replacement on the property certificate character set to obtain a character-replaced property certificate character set; and extracting the item names and item contents from the character-replaced property certificate character set to generate recognition result data.

Description

Information identification method, device, equipment and storage medium of property certificate
Technical Field
The invention relates to the field of image recognition, in particular to a method, a device, equipment and a storage medium for recognizing information of a house property certificate.
Background
The property certificate is a certificate proving that an individual owns a house, and it chiefly serves as evidence of registration. After a house is registered with the relevant government department, a property certificate is issued to the owner, who can then exercise rights such as residence, sale and transfer on the strength of the certificate. With the sustained, rapid marketization and commercialization of real estate in China, real estate and its attachments have become an important component of household assets, and property certificates are used increasingly often. Beyond traditional uses such as proving property ownership and registering a household, they are a common proof of personal creditworthiness in financial service scenarios.
However, now that office computerization is fully widespread in enterprises and big data is widely collected and applied, property certificates are used far more broadly than through the traditional modes of checking originals, providing paper copies or photographing originals. More and more uses involve transmitting, processing and analyzing electronic property certificate data.
Property certificates come in many types and styles, and their information elements are complex. Traditional manual entry of property certificate information is inefficient and time-consuming, which greatly degrades the user experience and hinders enterprise data exchange and data analysis. A new technique is therefore needed to solve the technical problems that images of the many different kinds of property certificate are difficult to recognize and that manual entry is inefficient.
Disclosure of Invention
The main purpose of the invention is to solve the technical problems that images of the many different kinds of property certificate are difficult to recognize and that manual entry is inefficient.
The first aspect of the invention provides a method for identifying information of a house property certificate, which comprises the following steps:
acquiring a house property certificate picture to be identified, wherein the house property certificate picture comprises: a property certificate subgraph;
according to a preset identification extraction algorithm, carrying out identification extraction processing on the property certificate subgraph of the property certificate picture to obtain a property certificate subgraph;
according to a preset character recognition algorithm, carrying out character recognition processing on the property certificate subgraph to obtain a property certificate character set, wherein the property certificate character set comprises: the item name and the item content corresponding to the item name;
judging whether the house property certificate subgraph belongs to a preset text type or not based on the house property certificate character set;
if the property certificate belongs to the preset text type, performing character replacement processing on the property certificate character set to obtain a property certificate character set with replaced characters;
and extracting the item names and the item contents in the character set of the property certificate with the replaced characters to generate identification result data.
Optionally, in a first implementation manner of the first aspect of the present invention, the identifying, extracting, and processing the property certificate sub-graph of the property certificate picture according to a preset identifying and extracting algorithm to obtain the property certificate sub-graph includes:
carrying out binarization processing on the house property certificate picture to obtain a binarization picture;
according to a preset kernel similarity algorithm, carrying out similarity calculation on the binary image to obtain a property certificate edge line, and according to a preset linear analysis algorithm, carrying out regression calculation on the property certificate edge line to obtain a slope of the property certificate edge line relative to the property certificate image;
based on the slope, carrying out correction rotation processing on the house property certificate picture to obtain a corrected house property certificate picture;
and cutting the corrected house property card picture according to the house property card edge line to obtain a house property card subgraph.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing character recognition processing on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set includes:
analyzing character coordinate data of the property certificate subgraph according to a preset Craft algorithm;
according to a preset labelme marking component and the character coordinate data, marking the character string of the property certificate subgraph to obtain an item name label and an item content label of the property certificate subgraph;
and according to a preset identification algorithm, identifying the property certificate subgraph to obtain a property certificate character set with the project name label and the project content label.
Optionally, in a third implementation manner of the first aspect of the present invention, the identifying, according to a preset identification algorithm, the property certificate sub-graph, and obtaining the property certificate character set with the item name tag and the item content tag includes:
according to a preset CNN algorithm, carrying out image extraction processing on the property certificate subgraph to obtain image character string characteristics;
according to a preset RNN algorithm, carrying out sequence recognition processing on the house property certificate subgraph to obtain character sequence characteristics;
and sequencing the character string features of the image according to the character sequence features to obtain a property certificate character set.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the determining, based on the property certificate character set, whether the property certificate subgraph belongs to a preset text category includes:
setting key labels for the names of the items in the property certificate character set, and setting value labels for the contents of the items in the property certificate character set;
converting the property certificate character set into JSON data based on the key tag and the value tag;
and analyzing the JSON data, and judging whether the property certificate subgraph belongs to a preset text type.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the analyzing the JSON data and determining whether the house property certificate subgraph belongs to a preset text category includes:
reading a preset name hit analysis table, and matching data corresponding to the key tag in the JSON data with the name hit analysis table to obtain a matching name set;
calculating the matching rates of different preset text types according to the matching name set to obtain a matching rate set;
judging whether the matching rate set has a matching rate exceeding a preset matching threshold value;
if the matching rate exceeding a preset matching threshold exists, determining the text type corresponding to the highest matching rate in the matching rate set as the text type of the house property certificate subgraph;
and if the matching rate exceeding a preset matching threshold does not exist, determining the property certification subgraph as an undefined class.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the determining, based on the property certificate character set, whether the property certificate subgraph belongs to a preset text category, the method further includes:
and if the data does not belong to the preset text type, extracting the item name and the item content in the property certificate character set to generate identification result data.
A second aspect of the present invention provides an information identification apparatus of a property certificate, including:
the acquisition module is used for acquiring a property certificate picture to be identified, wherein the property certificate picture comprises: a property certificate subgraph;
the image extraction module is used for identifying and extracting the property certificate subgraph of the property certificate image according to a preset identification extraction algorithm to obtain the property certificate subgraph;
the character recognition module is used for carrying out character recognition processing on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set, wherein the property certificate character set comprises: the project name and the project content corresponding to the project name;
the judging module is used for judging whether the property certificate subgraph belongs to a preset text type or not based on the property certificate character set;
the character replacement module is used for performing character replacement processing on the property certificate character set if the property certificate character set belongs to the preset text type to obtain a property certificate character set with replaced characters;
and the identification extraction module is used for extracting the item names and the item contents in the property certificate character set replaced by the characters to generate identification result data.
A third aspect of the present invention provides an information identifying apparatus of a property certificate, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line; the at least one processor calls the instructions in the memory to enable the information identification device of the property certificate to execute the information identification method of the property certificate.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-described information identification method of a property certificate.
In the embodiment of the invention, the data processing covers the various kinds of property certificate used nationwide. The resulting property certificate OCR (optical character recognition) technique is fast, easy to use, accurate and usable on different terminals; it can substantially improve efficiency for the related industries and users, reduce users' data entry cost and improve the user experience.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a method for identifying information of a property certificate according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an embodiment of an information identification apparatus of a property certificate according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another embodiment of the information identification apparatus of a property certificate according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of an information identification device of a property certificate according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a method, a device, equipment and a storage medium for identifying information of a house property certificate.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a detailed flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a method for identifying information of a property certificate in an embodiment of the present invention includes:
101. acquiring a house property certificate picture to be identified, wherein the house property certificate picture comprises: a property certificate subgraph;
In this embodiment, the property certificate picture to be recognized is photographed by the client. The photograph may be taken at an offset angle and may include background content that is not part of the property certificate, such as a desktop or books; the certificate may be tilted within the picture, and the picture may contain large blank areas.
102. According to a preset identification extraction algorithm, carrying out identification extraction processing on a property certificate subgraph of a property certificate picture to obtain a property certificate subgraph;
In this embodiment, the property certificate picture contains a lot of unwanted information that may interfere with text parsing. The skew of the picture is therefore corrected first, so that the left and right edges of the property certificate are vertical (at 90 degrees), and the area outside the certificate edges is cropped away, leaving only the property certificate subgraph.
Further, at 102, the following steps may be performed:
1021. carrying out binarization processing on the house property certificate picture to obtain a binarization picture;
1022. according to a preset kernel similarity algorithm, carrying out similarity calculation on the binary image to obtain a property certificate edge line, and according to a preset linear analysis algorithm, carrying out regression calculation on the property certificate edge line to obtain a slope of the property certificate edge line relative to the property certificate image;
1023. based on the slope, carrying out correction rotation processing on the house property certificate picture to obtain a corrected house property certificate picture;
1024. and according to the edge line of the property card, cutting the corrected property card picture to obtain a property card sub-picture.
In steps 1021-1024, the picture is binarized, the two-dimensional image is reduced to one-dimensional line segments, and the boundary of the property certificate is determined with the kernel similarity algorithm. Because a property certificate uploaded by a user may be tilted, the tilt is corrected as follows: after binarization, the black pixels on the left and right sides of the picture are collected, the slope of a straight line through them is obtained by linear regression, and the picture is rotated according to that slope. Finally, the blank areas on both sides of the property certificate are cropped away along the edge lines obtained earlier, so that later steps can focus on the text content of the certificate.
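The tilt estimation described above can be illustrated with a minimal sketch. It assumes the picture is a grayscale image held as rows of 0-255 integers, takes the first dark pixel of each row as a point on the left edge line, and regresses column on row because the edge is close to vertical. The function names, the threshold of 128 and the use of only the left edge are assumptions made for illustration; the kernel similarity algorithm and the actual rotation are not shown.

```python
import math

def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 for dark pixels, 0 otherwise."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def left_edge_points(binary):
    """Collect (row, column) of the first dark pixel in every row that has one."""
    return [(y, row.index(1)) for y, row in enumerate(binary) if 1 in row]

def fit_slope(points):
    """Least-squares slope of column as a function of row (x = slope * y + b)."""
    n = len(points)
    mean_y = sum(y for y, _ in points) / n
    mean_x = sum(x for _, x in points) / n
    var_y = sum((y - mean_y) ** 2 for y, _ in points)
    if var_y == 0:
        return 0.0  # a single row carries no tilt information
    cov = sum((y - mean_y) * (x - mean_x) for y, x in points)
    return cov / var_y

def skew_angle_degrees(slope):
    """Angle between the fitted edge line and the vertical axis, in degrees."""
    return math.degrees(math.atan(slope))

def crop_columns(binary):
    """Drop the blank columns to the left and right of the dark content."""
    cols = [x for row in binary for x, px in enumerate(row) if px]
    lo, hi = min(cols), max(cols)
    return [row[lo:hi + 1] for row in binary]
```

Rotating the picture by the negative of the returned angle would bring the edge back to vertical; in practice that rotation would be delegated to an imaging library.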
103. According to a preset character recognition algorithm, character recognition processing is carried out on the property certificate subgraph to obtain a property certificate character set, wherein the property certificate character set comprises: the item name and the item content corresponding to the item name;
In this embodiment, the item names and item contents in the property certificate character set are information about the property itself. Item names, such as the texts "property owner" and "house location", and item contents, such as the texts "Zhang San" and "No. 001, First Street, XX District", are labeled. Labeling assigns the corresponding class to each text box and is performed with the open-source tool labelme.
Further, at 103, the following steps may be performed:
1031. analyzing character coordinate data of the house property certificate subgraph according to a preset Craft algorithm;
1032. according to the preset labelme marking component and the character coordinate data, marking the character string of the property certificate subgraph to obtain an item name label and an item content label of the property certificate subgraph;
1033. and according to a preset recognition algorithm, recognizing the house property certificate subgraph to obtain a house property certificate character set with a project name label and a project content label.
In steps 1031-1033, the item names and item contents are labeled using the labelme labeling component and the character coordinate data obtained by the CRAFT algorithm. Sample features are extracted from the labeled text through a series of convolution operations, and a region score feature map and a pixel affinity score feature map are output. The probability that a pixel is a character center is then encoded as a Gaussian heat map, which turns the classification problem into a regression problem. Text recognition uses the CRNN algorithm (an end-to-end recognition network): a CNN extracts image features and an RNN extracts the sequence features of the characters, finally yielding a property certificate character set with item name labels and item content labels.
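The Gaussian heat map encoding mentioned above can be sketched as follows. This is a simplified illustration that assumes an isotropic two-dimensional Gaussian centered on a known character center; the function name, grid size and sigma are hypothetical, and fitting the Gaussian to the shape of each character box is omitted.

```python
import math

def gaussian_heatmap(height, width, center, sigma):
    """Score each pixel by its closeness to the character center.

    The center pixel scores 1.0 and the score decays smoothly with distance,
    so a network can regress a continuous map instead of classifying pixels.
    """
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
```

Because the target is a continuous score rather than a hard 0/1 label, the training objective becomes a regression loss over the map, which is the sense in which the classification problem is converted into a regression problem.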
Further, at 1033, the following steps may be performed:
10331. according to a preset CNN algorithm, carrying out image extraction processing on the property certificate subgraph to obtain image character string characteristics;
10332. according to a preset RNN algorithm, carrying out sequence recognition processing on the house property certificate subgraph to obtain character sequence characteristics;
10333. and sequencing the character string characteristics of the image according to the character sequence characteristics to obtain a property certificate character set.
In steps 10331-10333, the CNN algorithm extracts the image character string features and the RNN algorithm extracts the sequence features of the characters, so that the model captures both character order and character recognition features. Finally, the image character recognition features are ordered according to the character sequence and the existing labels are attached, generating the property certificate character set.
104. Judging whether the house property certificate subgraph belongs to a preset text type or not based on the house property certificate character set;
In this embodiment, the property certificate character set contains many item names; the hits on the relevant item names can be counted and used to assign the certificate to one of several property certificate text types. By certificate type, the classes may be: single-page immovable property right certificate, double-page immovable property right certificate, single-page land right certificate, double-page land right certificate, single-page real estate property right certificate, double-page real estate property right certificate, and an undefined type.
Further, the following steps may also be performed at 104:
1041. setting key labels for the names of items in the property certificate character set, and setting value labels for the contents of the items in the property certificate character set;
1042. converting the property certificate character set into JSON data based on the key tag and the value tag;
1043. and analyzing the JSON data, and judging whether the house property certificate subgraph belongs to the preset text type.
In steps 1041-1043, a JSON data structure consists of keys and values, each key being paired with a value. The key labels and value labels follow the correspondence between item names and item contents, and this key-value correspondence is used to convert the property certificate character set into JSON data.
For example, with the item name "property owner" as the key and the item content "Zhang San" as the value, the corresponding JSON data structure is {"property owner": "Zhang San"}.
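The key-value conversion can be sketched as below. The sketch assumes the character set arrives as a list of (label, text) pairs in reading order, where each value follows the key it belongs to; this input shape and the function name are assumptions for illustration, not taken from the embodiment.

```python
import json

def character_set_to_json(labeled_strings):
    """Pair each "key" string with the next "value" string and dump as JSON."""
    record = {}
    pending_key = None
    for label, text in labeled_strings:
        if label == "key":
            pending_key = text
            record.setdefault(text, "")  # keep keys even if no value follows
        elif label == "value" and pending_key is not None:
            record[pending_key] = text
            pending_key = None
    # ensure_ascii=False keeps Chinese item names readable in the output
    return json.dumps(record, ensure_ascii=False)

example = [
    ("key", "property owner"),
    ("value", "Zhang San"),
    ("key", "house location"),
    ("value", "No. 001, First Street, XX District"),
]
print(character_set_to_json(example))
```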
Further, at 1043, the following steps may be performed:
10431. reading a preset name hit analysis table, and matching data corresponding to key tags in JSON data with the name hit analysis table to obtain a matching name set;
10432. calculating the matching rates of different preset text types according to the matching name set to obtain a matching rate set;
10433. judging whether the matching rate set has a matching rate exceeding a preset matching threshold value;
10434. if the matching rate exceeds the preset matching threshold, determining the text type corresponding to the highest matching rate in the matching rate set as the text type of the house property certificate subgraph;
10435. and if the matching rate exceeding the preset matching threshold does not exist, determining the property certificate subgraph as an undefined type.
In 10431-10435, the hit analysis table may have the following item names: real estate unit number, floor number, registration price, registration time, place number, land type (use), exclusive area, real estate name, real estate owner, ground location, house number, house structure, house condition, house ownership procurement, house owner, house property, house condition, house location, allocated area, shared area (m), shared area, planned use, building structure, building type, building area (m), building and its attachments, completion date, area, approved house use, approved use period, approved land use, others, where, price is obtained, claim type, claim other condition, claimant, property right, identification number, identity number, use area, use limit, use period, use right type, Right of use area, right of use acquisition means, house number or part, district in which it is located, abstract and attached notes of other rights, set of building area (m), set of area, figure number, land condition, land age, land right acquisition means, land right person, land location, land property, land use, land condition, use, certificate number, expiration date, number of buildings, type, structure, area of use (m), number of land parcel, area of land parcel (dune), area of land parcel, total number of layers, area of land parcel, seating, land grade, nature of ownership, remarks, property right of property, name of identity, number of identity, acquisition means, area of use (m), area of sole use (m), allocated area (m), The shared use weight area (m).
The item names in the hit analysis table are obtained by collecting the item names of the different property certificate types. The matching name set is then checked for hits against the table, and the type is determined from the hits.
The fields of single-page immovable property right certificates, double-page immovable property right certificates, single-page land right certificates, double-page land right certificates, single-page real estate property right certificates and double-page real estate property right certificates differ from one another. From a summary of the fields contained in the different types of property certificate, the following common fields are extracted: name of the right holder, nature of the right holder, share of the right holder, identity number of the right holder, property certificate number, house location, land use, house use, land area, building area of the house, completion date and co-ownership status.
For example, if the matching rate of the item names in the matching name set against the double-page immovable property right certificate type is 96%, the matching rate set contains a rate that exceeds the preset matching threshold of 90%.
Suppose the matching rate set contains single-page immovable property right certificate 97%, double-page immovable property right certificate 96%, single-page land right certificate 56%, double-page land right certificate 76%, single-page real estate property right certificate 51% and double-page real estate property right certificate 66%. The type with the highest rate, the single-page immovable property right certificate, is then determined as the text type of the property certificate subgraph.
If no matching rate in the matching rate set exceeds 90%, the property certificate subgraph is treated as the undefined class.
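The threshold decision can be sketched as follows. The embodiment does not define how a matching rate is computed, so the sketch assumes it is the fraction of a type's expected item names that appear among the recognized keys. The function names and the shorthand type names are hypothetical; the six rates mirror the example above.

```python
def match_rates(keys, hit_table):
    """Fraction of each type's expected item names found among the recognized keys."""
    key_set = set(keys)
    return {
        text_type: (sum(name in key_set for name in names) / len(names)) if names else 0.0
        for text_type, names in hit_table.items()
    }

def pick_type(rates, threshold=0.90):
    """Return the type with the highest rate if any rate exceeds the threshold."""
    if not rates or max(rates.values()) <= threshold:
        return "undefined"
    return max(rates, key=rates.get)

example_rates = {
    "single_page_immovable": 0.97,
    "double_page_immovable": 0.96,
    "single_page_land": 0.56,
    "double_page_land": 0.76,
    "single_page_real_estate": 0.51,
    "double_page_real_estate": 0.66,
}
print(pick_type(example_rates))
```

Two types exceed the threshold in this example, and the higher of the two is chosen, which matches steps 10433 and 10434.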
105. If the property certificate belongs to the preset text type, performing character replacement processing on the property certificate character set to obtain a property certificate character set with replaced characters;
In this embodiment, the extracted data content is checked for spaces, line feeds and special characters, and any that are found are replaced. In addition, where digits in the extracted data content have been recognized as letters, the letters are converted back to digits.
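A minimal sketch of this clean-up step is given below. The set of special characters, the letter-to-digit table and the idea of naming which fields are numeric are all assumptions made for illustration; the embodiment does not list them.

```python
import re

# Hypothetical set of stray symbols an OCR pass might emit.
SPECIAL_CHARS = set("*#@~^|`")

# Hypothetical table of letters commonly confused with digits.
LETTER_TO_DIGIT = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "Z": "2", "S": "5", "B": "8"}
)

def clean_text(value):
    """Remove spaces, line feeds and the special characters listed above."""
    value = re.sub(r"\s+", "", value)
    return "".join(ch for ch in value if ch not in SPECIAL_CHARS)

def replace_characters(record, numeric_fields):
    """Clean every value; in numeric fields, also map look-alike letters to digits."""
    cleaned = {}
    for key, value in record.items():
        text = clean_text(value)
        if key in numeric_fields:
            text = text.translate(LETTER_TO_DIGIT)
        cleaned[key] = text
    return cleaned
```

Restricting the letter-to-digit conversion to fields known to be numeric avoids corrupting names and addresses, which is one reason the replacement is only applied once the text type, and hence the expected fields, is known.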
106. And extracting the project name and the project content in the character set of the property certificate with the replaced characters to generate recognition result data.
In this embodiment, the content of the property certificate character set is processed, and the determined text type is combined with the recognized data to obtain the recognition result data.
Further, after 106, the following steps may also be performed:
107. and if the data does not belong to the preset text type, extracting the item name and the item content in the property certificate character set to generate identification result data.
In this embodiment, when the text type of the property certificate character set cannot be identified, it is unclear whether the text is a standard text, so replacement and repair processing cannot be performed; the item names and item contents are therefore extracted directly to obtain the recognition result data.
In the embodiment of the invention, the data processing covers the various kinds of property certificate used nationwide. The resulting property certificate OCR (optical character recognition) technique is fast, easy to use, accurate and usable on different terminals; it can substantially improve efficiency for the related industries and users, reduce users' data entry cost and improve the user experience.
The information identification method of a property certificate in the embodiment of the present invention is described above. Referring to FIG. 2, an embodiment of the information identification apparatus of a property certificate in the embodiment of the present invention includes:
the acquiring module 201 is configured to acquire a property certificate picture to be identified, where the property certificate picture includes: a property certificate subgraph;
the picture extraction module 202 is configured to perform identification extraction processing on the property certificate subgraph of the property certificate picture according to a preset identification extraction algorithm to obtain the property certificate subgraph;
the character recognition module 203 is configured to perform character recognition processing on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set, where the property certificate character set includes: an item name and item content corresponding to the item name;
the judging module 204 is configured to judge, based on the property certificate character set, whether the property certificate subgraph belongs to a preset text type;
the character replacement module 205 is configured to, if the property certificate subgraph belongs to a preset text type, perform character replacement processing on the property certificate character set to obtain a character-replaced property certificate character set;
and the identification extraction module 206 is configured to extract the item name and the item content from the character-replaced property certificate character set and generate identification result data.
Referring to fig. 3, another embodiment of the information identification apparatus for a property certificate according to the embodiment of the present invention includes:
the acquiring module 201 is configured to acquire a property certificate picture to be identified, where the property certificate picture includes: a property certificate subgraph;
the picture extraction module 202 is configured to perform identification extraction processing on the property certificate subgraph of the property certificate picture according to a preset identification extraction algorithm to obtain the property certificate subgraph;
the character recognition module 203 is configured to perform character recognition processing on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set, where the property certificate character set includes: an item name and item content corresponding to the item name;
the judging module 204 is configured to judge, based on the property certificate character set, whether the property certificate subgraph belongs to a preset text type;
the character replacement module 205 is configured to, if the property certificate subgraph belongs to a preset text type, perform character replacement processing on the property certificate character set to obtain a character-replaced property certificate character set;
and the identification extraction module 206 is configured to extract the item name and the item content from the character-replaced property certificate character set and generate identification result data.
The picture extraction module 202 is specifically configured to:
performing binarization processing on the property certificate picture to obtain a binarized picture;
performing similarity calculation on the binarized picture according to a preset kernel similarity algorithm to obtain a property certificate edge line, and performing regression calculation on the property certificate edge line according to a preset linear analysis algorithm to obtain the slope of the property certificate edge line relative to the property certificate picture;
performing correction rotation processing on the property certificate picture based on the slope to obtain a corrected property certificate picture;
and cropping the corrected property certificate picture along the property certificate edge line to obtain the property certificate subgraph.
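The binarization and slope-regression steps above can be illustrated with a short sketch. It makes several assumptions not stated in the embodiment: the picture is a grayscale grid of 0-255 values, the certificate is brighter than the background, and a simple per-column scan for the first foreground pixel stands in for the preset kernel similarity algorithm, whose details are not given. The rotation and cropping steps are omitted.

```python
import math


def binarize(gray, threshold=128):
    """Map a grayscale picture (list of rows of 0-255 ints) to 0/1 pixels."""
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


def top_edge_points(binary):
    """For each column, record the first row that contains a foreground pixel.

    This column scan is a stand-in for the edge-line detection step.
    """
    points = []
    for x in range(len(binary[0])):
        for y, row in enumerate(binary):
            if row[x]:
                points.append((x, y))
                break
    return points


def fit_slope(points):
    """Ordinary least-squares slope of y against x over the edge points."""
    if len(points) < 2:
        raise ValueError("need at least two edge points to fit a line")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def skew_angle_degrees(gray, threshold=128):
    """Angle of the fitted top edge relative to the picture's horizontal axis."""
    slope = fit_slope(top_edge_points(binarize(gray, threshold)))
    return math.degrees(math.atan(slope))
```

The returned angle is the amount by which the picture would need to be rotated back before cropping along the edge line.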
Wherein the character recognition module 203 comprises:
the coordinate recognition unit 2031 is configured to obtain character coordinate data of the property certificate subgraph according to a preset CRAFT algorithm;
the marking unit 2032 is configured to mark the character strings of the property certificate subgraph according to a preset labelme annotation component and the character coordinate data, to obtain an item name tag and an item content tag of the property certificate subgraph;
the identifying unit 2033 is configured to identify the property certificate subgraph according to a preset identification algorithm, so as to obtain a property certificate character set with the item name tag and the item content tag.
The identifying unit 2033 is specifically configured to:
performing image extraction processing on the property certificate subgraph according to a preset CNN algorithm to obtain image character string features;
performing sequence recognition processing on the property certificate subgraph according to a preset RNN algorithm to obtain character sequence features;
and ordering the image character string features according to the character sequence features to obtain the property certificate character set.
The determining module 204 is specifically configured to:
setting a key tag for each item name in the property certificate character set, and setting a value tag for each item content in the property certificate character set;
converting the property certificate character set into JSON data based on the key tags and the value tags;
and analyzing the JSON data to judge whether the property certificate subgraph belongs to a preset text type.
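The key/value tagging and JSON conversion can be sketched as below. The JSON layout (a list of objects with "key" and "value" fields) and the function names are assumptions made for illustration; the embodiment does not specify the schema.

```python
import json


def to_json(pairs):
    """Tag each item name as a key and each item content as a value, then serialize.

    `pairs` is a list of (item name, item content) tuples; the "key"/"value"
    field names are a hypothetical layout.
    """
    tagged = [{"key": name, "value": content} for name, content in pairs]
    # ensure_ascii=False keeps Chinese item names readable in the output.
    return json.dumps(tagged, ensure_ascii=False)


def keys_of(json_text):
    """Return the key-tagged item names from the JSON data, in order."""
    return [entry["key"] for entry in json.loads(json_text)]
```

Keeping the pairs in a list rather than a single JSON object preserves their order and tolerates repeated item names, which can occur when OCR splits one field across lines.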
Wherein, the character replacing module 205 is further specifically configured to:
reading a preset name hit analysis table, and matching data corresponding to the key tag in the JSON data with the name hit analysis table to obtain a matching name set;
calculating the matching rates of different preset text types according to the matching name set to obtain a matching rate set;
judging whether the matching rate set has a matching rate exceeding a preset matching threshold value;
if a matching rate exceeding the preset matching threshold exists, determining the text type corresponding to the highest matching rate in the matching rate set as the text type of the property certificate subgraph;
and if no matching rate exceeds the preset matching threshold, determining the property certificate subgraph as an undefined class.
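The matching-rate decision above can be sketched as follows. The definition of the matching rate used here (the fraction of a text type's table names that appear among the key-tagged item names), the table contents, the type names, and the threshold value are all assumptions; the embodiment only states that a rate is computed per preset text type and compared with a preset threshold.

```python
def matching_rates(keys, hit_table):
    """Compute one matching rate per preset text type.

    `hit_table` maps a text type to the set of item names expected for it
    (a hypothetical shape for the name hit analysis table).
    """
    found = set(keys)
    return {
        text_type: len(names & found) / len(names)
        for text_type, names in hit_table.items()
        if names
    }


def classify(keys, hit_table, threshold=0.6):
    """Return the best-matching text type, or "undefined" if none exceeds the threshold."""
    rates = matching_rates(keys, hit_table)
    if not rates:
        return "undefined"
    best = max(rates, key=rates.get)
    return best if rates[best] > threshold else "undefined"
```

For example, with a table of two hypothetical types, a character set that contains three of the four names expected for one type yields a rate of 0.75 for that type and is assigned to it, while a character set with no expected names falls into the undefined class.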
The information identification apparatus of the property certificate further includes an undefined text extracting module 207, where the undefined text extracting module 207 is specifically configured to:
if the property certificate subgraph does not belong to the preset text type, extracting the item name and the item content from the property certificate character set to generate identification result data.
In this embodiment, the data processing covers all types of property certificates used nationwide. The property certificate OCR (optical character recognition) technique is fast, easy to use, accurate in recognition, and supports different terminals, so it can improve efficiency for related industries and users, reduce users' data entry cost, and improve the user experience.
Figs. 2 and 3 above describe the information identification apparatus of the property certificate in the embodiment of the present invention in detail from the perspective of modular functional entities; the information identification device of the property certificate in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 4 is a schematic structural diagram of an information identification device of a property certificate according to an embodiment of the present invention. The information identification device 400 of the property certificate may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 410 (e.g., one or more processors), a memory 420, and one or more storage media 430 (e.g., one or more mass storage devices) storing an application 433 or data 432. The memory 420 and the storage medium 430 may be transient or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown), each of which may include a series of instruction operations for the information identification device 400 of the property certificate. Further, the processor 410 may be configured to communicate with the storage medium 430 and to execute, on the information identification device 400 of the property certificate, the series of instruction operations in the storage medium 430.
The information identification device 400 of the property certificate may also include one or more power supplies 440, one or more wired or wireless network interfaces 450, one or more input-output interfaces 560, and/or one or more operating systems 431, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. It will be understood by those skilled in the art that the structure shown in fig. 4 does not limit the information identification device of the property certificate, which may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the information identification method of a property certificate.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses, and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An information identification method of a property certificate, characterized by comprising the following steps:
acquiring a property certificate picture to be identified, wherein the property certificate picture comprises: a property certificate subgraph;
performing identification extraction processing on the property certificate subgraph of the property certificate picture according to a preset identification extraction algorithm to obtain the property certificate subgraph;
performing character recognition processing on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set, wherein the property certificate character set comprises: an item name and item content corresponding to the item name;
judging, based on the property certificate character set, whether the property certificate subgraph belongs to a preset text type;
if the property certificate subgraph belongs to the preset text type, performing character replacement processing on the property certificate character set to obtain a character-replaced property certificate character set;
and extracting the item name and the item content from the character-replaced property certificate character set to generate identification result data.
2. The method for identifying information of a property certificate as claimed in claim 1, wherein said performing identification extraction processing on the property certificate subgraph of the property certificate picture according to a preset identification extraction algorithm to obtain the property certificate subgraph comprises:
performing binarization processing on the property certificate picture to obtain a binarized picture;
performing similarity calculation on the binarized picture according to a preset kernel similarity algorithm to obtain a property certificate edge line, and performing regression calculation on the property certificate edge line according to a preset linear analysis algorithm to obtain the slope of the property certificate edge line relative to the property certificate picture;
performing correction rotation processing on the property certificate picture based on the slope to obtain a corrected property certificate picture;
and cropping the corrected property certificate picture along the property certificate edge line to obtain the property certificate subgraph.
3. The method for identifying information of a property certificate as claimed in claim 1, wherein said performing character recognition processing on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set comprises:
obtaining character coordinate data of the property certificate subgraph according to a preset CRAFT algorithm;
marking the character strings of the property certificate subgraph according to a preset labelme annotation component and the character coordinate data to obtain an item name tag and an item content tag of the property certificate subgraph;
and identifying the property certificate subgraph according to a preset identification algorithm to obtain a property certificate character set with the item name tag and the item content tag.
4. The method for identifying information of a property certificate as claimed in claim 3, wherein said identifying the property certificate subgraph according to a preset identification algorithm to obtain a property certificate character set with the item name tag and the item content tag comprises:
performing image extraction processing on the property certificate subgraph according to a preset CNN algorithm to obtain image character string features;
performing sequence recognition processing on the property certificate subgraph according to a preset RNN algorithm to obtain character sequence features;
and ordering the image character string features according to the character sequence features to obtain the property certificate character set.
5. The method for identifying information of a property certificate as claimed in claim 1, wherein said judging whether the property certificate subgraph belongs to a preset text type based on the property certificate character set comprises:
setting a key tag for each item name in the property certificate character set, and setting a value tag for each item content in the property certificate character set;
converting the property certificate character set into JSON data based on the key tags and the value tags;
and analyzing the JSON data to judge whether the property certificate subgraph belongs to a preset text type.
6. The method for identifying information of a property certificate as claimed in claim 5, wherein said analyzing the JSON data to judge whether the property certificate subgraph belongs to a preset text type comprises:
reading a preset name hit analysis table, and matching the data corresponding to the key tags in the JSON data against the name hit analysis table to obtain a matching name set;
calculating the matching rates of the different preset text types according to the matching name set to obtain a matching rate set;
judging whether the matching rate set contains a matching rate exceeding a preset matching threshold;
if a matching rate exceeding the preset matching threshold exists, determining the text type corresponding to the highest matching rate in the matching rate set as the text type of the property certificate subgraph;
and if no matching rate exceeds the preset matching threshold, determining the property certificate subgraph as an undefined class.
7. The method for identifying information of a property certificate as claimed in claim 1, further comprising, after said judging whether the property certificate subgraph belongs to a preset text type based on the property certificate character set:
if the property certificate subgraph does not belong to the preset text type, extracting the item name and the item content from the property certificate character set to generate identification result data.
8. An information identification apparatus of a property certificate, characterized in that the information identification apparatus of a property certificate includes:
the acquisition module is used for acquiring a property certificate picture to be identified, wherein the property certificate picture comprises: a property certificate subgraph;
the picture extraction module is used for performing identification extraction processing on the property certificate subgraph of the property certificate picture according to a preset identification extraction algorithm to obtain the property certificate subgraph;
the character recognition module is used for performing character recognition processing on the property certificate subgraph according to a preset character recognition algorithm to obtain a property certificate character set, wherein the property certificate character set comprises: an item name and item content corresponding to the item name;
the judging module is used for judging, based on the property certificate character set, whether the property certificate subgraph belongs to a preset text type;
the character replacement module is used for performing character replacement processing on the property certificate character set if the property certificate subgraph belongs to the preset text type, to obtain a character-replaced property certificate character set;
and the identification extraction module is used for extracting the item name and the item content from the character-replaced property certificate character set to generate identification result data.
9. An information recognition apparatus of a property certificate, characterized in that the information recognition apparatus of a property certificate includes: a memory having instructions stored therein and at least one processor, the memory and the at least one processor interconnected by a line;
the at least one processor invokes the instructions in the memory to cause the information identification device of the property certificate to perform the information identification method of the property certificate of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the information identification method of a property certificate according to any one of claims 1 to 7.
CN202210168516.1A 2022-02-24 2022-02-24 Information identification method, device, equipment and storage medium of property certificate Pending CN114241485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210168516.1A CN114241485A (en) 2022-02-24 2022-02-24 Information identification method, device, equipment and storage medium of property certificate

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210168516.1A CN114241485A (en) 2022-02-24 2022-02-24 Information identification method, device, equipment and storage medium of property certificate

Publications (1)

Publication Number Publication Date
CN114241485A true CN114241485A (en) 2022-03-25

Family

ID=80748013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210168516.1A Pending CN114241485A (en) 2022-02-24 2022-02-24 Information identification method, device, equipment and storage medium of property certificate

Country Status (1)

Country Link
CN (1) CN114241485A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038504A (en) * 2017-12-11 2018-05-15 深圳房讯通信息技术有限公司 A kind of method for parsing property ownership certificate photo content
CN109871770A (en) * 2019-01-17 2019-06-11 平安城市建设科技(深圳)有限公司 Property ownership certificate recognition methods, device, equipment and storage medium
CN111464716A (en) * 2020-04-09 2020-07-28 腾讯科技(深圳)有限公司 Certificate scanning method, device, equipment and storage medium
CN113989806A (en) * 2021-10-11 2022-01-28 浙江康旭科技有限公司 Extensible CRNN bank card number identification method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663903A (en) * 2022-05-25 2022-06-24 深圳大道云科技有限公司 Text data classification method, device, equipment and storage medium
CN114663903B (en) * 2022-05-25 2022-08-19 深圳大道云科技有限公司 Text data classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220325