WO2022126986A1 - Method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition - Google Patents

Method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition

Info

Publication number
WO2022126986A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
sample
word
data
target
Prior art date
Application number
PCT/CN2021/091716
Other languages
English (en)
French (fr)
Inventor
舒俊杰
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022126986A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • The present application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition.
  • The main purpose of this application is to provide such a method, apparatus, device, and medium, in order to solve the technical problem that prior-art OCR technology has low recognition accuracy for real estate certificates, which leads to more customer complaints.
  • The present application proposes a method for determining real estate certificate information based on OCR recognition, the method comprising:
  • performing wrong-word replacement on the preprocessed text data by using the data set corresponding to the target abnormal relationship and the unsuccessfully matched word set, to obtain target text data corresponding to the target real estate certificate.
  • the present application also proposes a device for determining real estate certificate information based on OCR identification, the device comprising:
  • an OCR text recognition module, configured to obtain the certificate image to be recognized of the target real estate certificate and to perform text recognition on it using OCR technology, obtaining the text data to be corrected;
  • a preprocessing module, configured to preprocess the text data to be corrected to obtain preprocessed text data;
  • a word segmentation module, configured to segment the preprocessed text data to obtain a set of words to be corrected;
  • an unsuccessfully matched word search module, configured to obtain a knowledge base dictionary and to look up each word to be corrected in the set of words to be corrected in the knowledge base dictionary, obtaining a set of unsuccessfully matched words;
  • an abnormal relationship matching module, configured to obtain a data set corresponding to the abnormal relationship and to match each unsuccessfully matched word in the set of unsuccessfully matched words against it, obtaining a data set corresponding to the target abnormal relationship;
  • a wrong word replacement module, configured to perform wrong-word replacement on the preprocessed text data by using the data set corresponding to the target abnormal relationship and the unsuccessfully matched word set, obtaining target text data corresponding to the target real estate certificate.
  • The present application also proposes a computer device, including a memory and a processor; the memory stores a computer program, and the processor implements the following method steps when executing the computer program:
  • performing wrong-word replacement on the preprocessed text data by using the data set corresponding to the target abnormal relationship and the unsuccessfully matched word set, to obtain target text data corresponding to the target real estate certificate.
  • The present application also proposes a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the following method steps are implemented:
  • performing wrong-word replacement on the preprocessed text data by using the data set corresponding to the target abnormal relationship and the unsuccessfully matched word set, to obtain target text data corresponding to the target real estate certificate.
  • In the method, apparatus, device, and medium of the present application, OCR technology is used to perform text recognition on the certificate image to be recognized, yielding the text data to be corrected; the text data to be corrected is preprocessed to obtain preprocessed text data; the preprocessed text data is segmented to obtain a set of words to be corrected; and each word to be corrected in that set is looked up in the knowledge base dictionary to obtain the set of unsuccessfully matched words.
  • FIG. 1 is a schematic flowchart of a method for determining real estate certificate information based on OCR identification according to an embodiment of the application;
  • FIG. 2 is a schematic block diagram of the structure of an apparatus for determining real estate certificate information based on OCR identification according to an embodiment of the present application;
  • FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • The present application proposes a method for determining real estate certificate information based on OCR recognition, which is applied in the field of artificial intelligence technology.
  • The method first recognizes the image of the real estate certificate using OCR technology and then corrects the result using the knowledge base dictionary and the data set corresponding to the abnormal relationship. This automatically corrects the OCR recognition result, improves the accuracy of recognizing the text data of the real estate certificate, and improves user satisfaction.
  • An embodiment of the present application provides a method for determining real estate certificate information based on OCR recognition, the method including:
  • S1: Obtain the certificate image to be recognized of the target real estate certificate, and use OCR technology to perform text recognition on the certificate image to be recognized to obtain text data to be corrected;
  • S4: Obtain a knowledge base dictionary, and look up each word to be corrected in the set of words to be corrected in the knowledge base dictionary to obtain a set of unsuccessfully matched words;
  • S5: Obtain a data set corresponding to the abnormal relationship, and match each unsuccessfully matched word in the set of unsuccessfully matched words against the data set corresponding to the abnormal relationship to obtain a data set corresponding to the target abnormal relationship;
  • In this embodiment, the preprocessed text data is segmented to obtain a set of words to be corrected; each word to be corrected is looked up in the knowledge base dictionary to obtain a set of unsuccessfully matched words; each unsuccessfully matched word is matched against the data set corresponding to the abnormal relationship to obtain the data set corresponding to the target abnormal relationship; and the data set corresponding to the target abnormal relationship and the unsuccessfully matched word set are used to perform wrong-word replacement on the preprocessed text data, yielding the target text data corresponding to the target real estate certificate. This automatically corrects the OCR recognition result, improves the accuracy of recognizing the text data of the real estate certificate, and improves user satisfaction.
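  • The correction flow summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `correct_ocr_text` is hypothetical, the OCR step is assumed to have already produced the text items, preprocessing is reduced to a character filter, and whitespace splitting stands in for a real word segmenter.

```python
import re

def correct_ocr_text(items, dictionary, abnormal_pairs):
    """Correct OCR text items (hypothetical helper).

    items: list of OCR text items (output of the recognition step)
    dictionary: set of known-correct words (knowledge base dictionary)
    abnormal_pairs: dict mapping wrong word -> correct word
    """
    corrected = []
    for item in items:
        # Preprocessing (assumed): drop characters outside the allowed set.
        clean = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff\- ]", "", item)
        # Segmentation (placeholder): whitespace split.
        words = clean.split()
        # Dictionary lookup: indices of words not found in the knowledge base.
        unmatched = [i for i, w in enumerate(words) if w not in dictionary]
        # Abnormal-relationship matching and replacement at those positions.
        for i in unmatched:
            if words[i] in abnormal_pairs:
                words[i] = abnormal_pairs[words[i]]
        corrected.append(" ".join(words))
    return corrected

result = correct_ocr_text(
    ["residentia1 apartment#"],
    {"residential", "apartment"},
    {"residentia1": "residential"},
)
```

  • Only words that fail the dictionary lookup are candidates for replacement, so a word already present in the knowledge base dictionary is never rewritten.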
  • The certificate image to be recognized of the target real estate certificate may be one entered by the user, one obtained by directly scanning or photographing the target real estate certificate with an electronic device (such as a scanner or a digital camera), or one sent by a third-party application system.
  • The certificate image to be recognized is the digital image of the target real estate certificate on which text recognition is to be performed.
  • OCR technology is used to perform text recognition on each text area in the certificate image to be recognized, obtaining at least one item of text data; each text area corresponds to one item of text data.
  • OCR technology refers to optical character recognition technology.
  • The target real estate certificate may be of any real estate type, any year, and any region.
  • The text data to be corrected is the text data obtained by using OCR technology to recognize the certificate image to be recognized.
  • The special character processing model is a model obtained by training a neural network.
  • Special characters are characters that cannot appear in the text information on a real estate certificate, that is, characters other than letters, numbers, Chinese characters, dashes, and spaces.
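  • The definition of special characters can be illustrated with a rule-based filter. Note that this is a stand-in: the application describes a trained neural-network model for this step, whereas the sketch below uses a regular expression. The function name and the choice of the CJK Unified Ideographs range (U+4E00 to U+9FFF) for "Chinese characters" are assumptions.

```python
import re

# Allowed: ASCII letters, digits, CJK unified ideographs, dashes, spaces.
# Everything else is treated as a special character (assumed rule).
_SPECIAL = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff\- ]")

def strip_special_characters(text: str) -> str:
    """Remove every character that falls outside the allowed set."""
    return _SPECIAL.sub("", text)

cleaned = strip_special_characters("Room 3-2#@ 深圳*")
```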
  • word segmentation is performed on each item of text data in the preprocessed text data, the words obtained from the segmentation are used as words to be corrected, and all words to be corrected are used as a set of words to be corrected. That is to say, each item of text data is segmented independently.
  • the knowledge base dictionary can be obtained from the database.
  • A word to be corrected is taken from the set of words to be corrected as the target word to be corrected and looked up in the knowledge base dictionary. If it is found, the target word to be corrected is a successfully matched word; otherwise it is an unsuccessfully matched word. This step is repeated until every word to be corrected in the set has been classified as either successfully or unsuccessfully matched, and all unsuccessfully matched words together form the unsuccessfully matched word set.
  • The knowledge base dictionary includes, but is not limited to: a real estate type sub-dictionary, an administrative region sub-dictionary, a real estate sub-dictionary, and a surname sub-dictionary.
  • The knowledge base dictionary is constructed from information commonly found on real estate certificates, so it is suited to error correction in the real estate certificate domain, which helps improve the accuracy of correcting real estate certificate information.
  • The real estate type sub-dictionary includes real estate type names.
  • Real estate type names include, but are not limited to: self-owned, commercial housing, residential, apartment.
  • The administrative region sub-dictionary includes, but is not limited to: province names, city names, district names, and street names.
  • The real estate sub-dictionary includes, but is not limited to: real estate names.
  • The surname sub-dictionary includes, but is not limited to: surnames.
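  • One possible in-memory layout for such a knowledge base dictionary is a mapping from sub-dictionary name to a set of entries. The sketch below is an assumption for illustration: the key names, the sample entries other than the four real estate type names, and both helper functions are hypothetical.

```python
# Hypothetical sub-dictionaries; all entries are illustrative only.
KNOWLEDGE_BASE = {
    "real_estate_type": {"self-owned", "commercial housing", "residential", "apartment"},
    "administrative_region": {"Guangdong", "Shenzhen", "Futian"},
    "real_estate_name": {"Example Garden"},
    "surname": {"Shu", "Wang"},
}

def build_lookup(knowledge_base):
    """Merge all sub-dictionaries into one set for fast membership tests."""
    lookup = set()
    for entries in knowledge_base.values():
        lookup |= entries
    return lookup

def matched_sub_dictionaries(word, knowledge_base):
    """Return the names of the sub-dictionaries that contain `word`."""
    return [name for name, entries in knowledge_base.items() if word in entries]

lookup = build_lookup(KNOWLEDGE_BASE)
```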
  • The unsuccessfully matched word set includes unsuccessfully matched words and position data; each unsuccessfully matched word corresponds to one piece of position data.
  • the multiple unsuccessfully matched words in the unsuccessfully matched word set may be the same or different, which is not specifically limited herein.
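  • The dictionary lookup that produces the unsuccessfully matched word set can be sketched as below. The representation of position data as an (item index, word index) pair and the names `PositionedWord` and `find_unmatched` are assumptions; the application does not specify the format.

```python
from typing import NamedTuple

class PositionedWord(NamedTuple):
    word: str
    item_index: int   # which text item the word came from (assumed format)
    word_index: int   # index of the word inside that item (assumed format)

def find_unmatched(words_to_correct, dictionary_words):
    """Return the words that are absent from the knowledge base dictionary."""
    unmatched = []
    for entry in words_to_correct:
        if entry.word not in dictionary_words:
            unmatched.append(entry)  # the position travels with the word
    return unmatched

words = [
    PositionedWord("residential", 0, 0),
    PositionedWord("apartmemt", 0, 1),
    PositionedWord("apartmemt", 1, 0),
]
unmatched = find_unmatched(words, {"residential", "apartment"})
```

  • The same misrecognized word appears twice in the result with different positions, which is why the set keeps one piece of position data per word rather than deduplicating.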
  • The data set corresponding to the abnormal relationship can be obtained from the database. Each unsuccessfully matched word in the unsuccessfully matched word set is matched against the wrong words of the data corresponding to the abnormal relationship in that data set; each wrong word that is matched is taken as a target wrong word, and the data corresponding to the abnormal relationship for all target wrong words forms the data set corresponding to the target abnormal relationship.
  • The data set corresponding to the abnormal relationship includes data corresponding to the abnormal relationship.
  • The data corresponding to the abnormal relationship includes wrong words and correct words; each wrong word corresponds to one correct word.
  • The data set corresponding to the target abnormal relationship is used to replace the words at the positions in the preprocessed text data indicated by the unsuccessfully matched word set, and the preprocessed text data after replacement is used as the target text data corresponding to the target real estate certificate.
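  • Position-based replacement can be sketched as follows, assuming that each text item is held as a list of words and that position data is an (item index, word index) pair. Both assumptions, and the function name `replace_wrong_words`, are illustrative and not taken from the application.

```python
def replace_wrong_words(items, unmatched, target_pairs):
    """Replace wrong words at their recorded positions.

    items: list of word lists, one per text item
    unmatched: (word, item_index, word_index) tuples
    target_pairs: dict mapping wrong word -> correct word
    """
    corrected = [list(words) for words in items]  # leave the input untouched
    for word, item_index, word_index in unmatched:
        if word in target_pairs:
            corrected[item_index][word_index] = target_pairs[word]
    return corrected

items = [["residential", "apartmemt"], ["unknownword"]]
unmatched = [("apartmemt", 0, 1), ("unknownword", 1, 0)]
fixed = replace_wrong_words(items, unmatched, {"apartmemt": "apartment"})
```

  • An unsuccessfully matched word with no entry among the wrong words is left as recognized, since no correct word is known for it.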
  • The above step of performing word segmentation on the preprocessed text data to obtain a set of words to be corrected includes:
  • S31: Perform word segmentation on each item of text data in the preprocessed text data, to obtain a plurality of words to be corrected and the position data corresponding to each of the words to be corrected;
  • S32: Determine the set of words to be corrected according to the plurality of words to be corrected and the position data corresponding to each of them.
  • This embodiment implements word segmentation on the preprocessed text data, which provides a basis for subsequent correction using the knowledge base dictionary and the data set corresponding to the abnormal relationship.
  • An item of text data is taken from the preprocessed text data as the target text data item. The target text data item is segmented, each word obtained is used as a word to be corrected, and the position of that word in the preprocessed text data is used as the position data corresponding to it. This is repeated for each item of text data until all words to be corrected and their corresponding position data have been determined.
  • The position data refers to the position of the word to be corrected in the preprocessed text data.
  • The plurality of words to be corrected and the position data corresponding to each of them are used as the set of words to be corrected.
  • The set of words to be corrected includes words to be corrected and position data; each word to be corrected corresponds to one piece of position data.
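  • Segmentation with position data (S31 and S32) can be sketched as below. Whitespace splitting is used here as a placeholder for a real word segmenter, and position data is assumed to be an (item index, start offset, end offset) triple; neither choice comes from the application.

```python
import re

def segment_with_positions(items):
    """Segment each text item independently and record where each word sits.

    Returns a list of (word, (item_index, start, end)) pairs.
    """
    result = []
    for item_index, text in enumerate(items):
        for match in re.finditer(r"\S+", text):
            position = (item_index, match.start(), match.end())
            result.append((match.group(), position))
    return result

segments = segment_with_positions(["Room 3-2", "apartment"])
```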
  • The above step of looking up each word to be corrected in the set of words to be corrected in the knowledge base dictionary to obtain the set of unsuccessfully matched words includes:
  • This embodiment uses the knowledge base dictionary to rule out the words to be corrected that are already correct, which reduces the amount of data to be corrected with the data set corresponding to the abnormal relationship and avoids erroneously correcting correct words.
  • A word to be corrected is taken from the set of words to be corrected as the target word to be corrected and matched against the knowledge base dictionary. If it is found, the knowledge base matching result corresponding to the target word to be corrected is determined to be success; otherwise it is determined to be failure. This step is repeated until the knowledge base matching results corresponding to all words to be corrected in the set have been determined.
  • Before the above step of obtaining the data set corresponding to the abnormal relationship, the method further includes:
  • S51: Obtain a plurality of real estate certificate sample images, each carrying an image identifier;
  • S52: Use OCR technology to perform text recognition on each real estate certificate sample image, to obtain the OCR recognition sample data corresponding to each of the sample images;
  • S53: Send the plurality of real estate certificate sample images and their corresponding OCR recognition sample data to the labeling terminal for error correction;
  • This embodiment builds the data set corresponding to the abnormal relationship from manual annotation and OCR recognition sample data, which improves the accuracy of the correspondences and provides a basis for subsequent correction using that data set. Because the abnormal relationships are determined from real estate certificate sample images, the data set is suited to error correction in the real estate certificate domain, which helps improve the accuracy of correcting real estate certificate information.
  • The plurality of real estate certificate sample images may be input by the user or sent by a third-party application system.
  • A real estate certificate sample image is a digital image of a real estate certificate.
  • The image identifier may be any identifier that uniquely identifies a real estate certificate sample image, such as an image name or an image ID.
  • A real estate certificate sample image is taken from the plurality of real estate certificate sample images as the target real estate certificate sample image, and OCR technology is used to perform text recognition on it to obtain its OCR recognition sample data. This is repeated until the OCR recognition sample data corresponding to each of the real estate certificate sample images has been determined.
  • The annotator performs error correction according to the real estate certificate sample images and their corresponding OCR recognition sample data, and then sends the labeled sample data corresponding to each of the real estate certificate sample images through the labeling terminal.
  • The above step of performing word segmentation and abnormal relationship correspondence determination according to the image identifiers, the OCR recognition sample data, and the labeled sample data corresponding to the plurality of real estate certificate sample images, to obtain the data set corresponding to the abnormal relationship, includes:
  • S551: Perform word segmentation on each item of text data of the OCR recognition sample data corresponding to each real estate certificate sample image, to obtain the OCR recognition sample word set corresponding to each of the sample images;
  • S552: Perform word segmentation on each item of text data of the labeled sample data corresponding to each real estate certificate sample image, to obtain the labeled sample word set corresponding to each of the sample images;
  • S553: Perform an abnormal relationship search, based on the image identifier, on the OCR recognition sample word set and the labeled sample word set corresponding to each real estate certificate sample image, to obtain the abnormal relationship data set to be counted corresponding to the plurality of sample images; the abnormal relationship data set to be counted includes the OCR recognition sample words and the labeled sample words;
  • S554: For the abnormal relationship data set to be counted, compute the probability of each labeled sample word corresponding to each OCR recognition sample word, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted;
  • S555: From the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted, take the maximum labeled sample word probability value, to obtain the target labeled sample word probability value corresponding to each OCR recognition sample word;
  • S556: Determine the target labeled sample word corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted according to the target labeled sample word probability value corresponding to that OCR recognition sample word;
  • S557: Obtain the data set corresponding to the abnormal relationship according to all the OCR recognition sample words in the abnormal relationship data set to be counted and their corresponding target labeled sample words.
  • This embodiment performs word segmentation and abnormal relationship correspondence determination according to the image identifiers, the OCR recognition sample data, and the labeled sample data corresponding to the plurality of real estate certificate sample images, which provides a basis for subsequent correction using the data set corresponding to the abnormal relationship.
  • An abnormal relationship search is performed on the OCR recognition sample word set and the labeled sample word set that share the same image identifier, and the abnormal relationship data set to be counted corresponding to the plurality of real estate certificate sample images is determined from the abnormal relationships found.
  • One OCR recognition sample word and one labeled sample word form an abnormal relationship.
  • For example, the OCR recognition sample word at position data W1 in the OCR recognition sample word set of image identifier T1 is "tax has property rights", and the labeled sample word at position data W1 in the labeled sample word set of image identifier T1 is "private property rights". Because the two differ, it is determined that position data W1 of image identifier T1 has an abnormal relationship, and "tax has property rights" (the OCR recognition sample word) together with "private property rights" (the labeled sample word) is used as abnormal relationship data to be counted for the real estate certificate sample image corresponding to image identifier T1. This example is not limiting.
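  • The abnormal relationship search can be sketched as a position-by-position comparison within each image identifier. The nested-dictionary layout (image identifier, then position data, then word) and the function name are assumptions made for illustration.

```python
def find_abnormal_relationships(ocr_words, labeled_words):
    """Collect (OCR word, labeled word) pairs that differ at the same position.

    Both arguments map image_id -> {position: word} (assumed layout).
    """
    pairs = []
    for image_id, ocr_by_position in ocr_words.items():
        labeled_by_position = labeled_words.get(image_id, {})
        for position, ocr_word in ocr_by_position.items():
            labeled_word = labeled_by_position.get(position)
            # Only positions present in both sets and with differing words count.
            if labeled_word is not None and labeled_word != ocr_word:
                pairs.append((ocr_word, labeled_word))
    return pairs

ocr = {"T1": {"W1": "tax has property rights", "W2": "apartment"}}
labeled = {"T1": {"W1": "private property rights", "W2": "apartment"}}
to_count = find_abnormal_relationships(ocr, labeled)
```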
  • For example, the OCR recognition sample word P1 has the corresponding labeled sample words B1, B2, B3, and B4 in the abnormal relationship data set to be counted, where P1-B1 appears 3 times, P1-B2 appears 4 times, P1-B3 appears 3 times, and P1-B4 appears 5 times, so the total number of correspondences of P1 is 15. The probability of P1-B1 is 3/15, the probability of P1-B2 is 4/15, the probability of P1-B3 is 3/15, and the probability of P1-B4 is 5/15. The probability of P1-B1 (the probability value of labeled sample word B1), the probability of P1-B2 (the probability value of labeled sample word B2), the probability of P1-B3 (the probability value of labeled sample word B3), and the probability of P1-B4 (the probability value of labeled sample word B4) together form the labeled sample word probability value set corresponding to P1.
  • A labeled sample word probability value set is taken from the labeled sample word probability value sets corresponding to all OCR recognition sample words in the abnormal relationship data set to be counted, as the labeled sample word probability value set to be analyzed. The largest labeled sample word probability value in that set is found and used as the target labeled sample word probability value corresponding to the OCR recognition sample word of that set. This step is repeated until the target labeled sample word probability values corresponding to all OCR recognition sample words in the abnormal relationship data set to be counted have been determined.
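  • The arithmetic of the P1 example can be reproduced with exact fractions. This is only a worked illustration of the counts quoted in the text; the function name and the list-of-pairs representation of the abnormal relationship data set to be counted are assumptions.

```python
from collections import Counter
from fractions import Fraction

def label_probabilities(pairs, ocr_word):
    """Probability of each labeled word among the pairs of one OCR word."""
    counts = Counter(label for word, label in pairs if word == ocr_word)
    total = sum(counts.values())
    return {label: Fraction(n, total) for label, n in counts.items()}

# P1-B1 x3, P1-B2 x4, P1-B3 x3, P1-B4 x5: 15 correspondences in total.
pairs = ([("P1", "B1")] * 3 + [("P1", "B2")] * 4
         + [("P1", "B3")] * 3 + [("P1", "B4")] * 5)
probs = label_probabilities(pairs, "P1")
# The label with the largest probability value is the target labeled word.
target = max(probs, key=probs.get)
```

  • With these counts the largest value is 5/15, so B4 is selected for P1.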
  • A target labeled sample word probability value is taken from the target labeled sample word probability values corresponding to all OCR recognition sample words in the abnormal relationship data set to be counted, as the target labeled sample word probability value to be analyzed, and the labeled sample word it belongs to is used as the target labeled sample word of the corresponding OCR recognition sample word. This step is repeated until the target labeled sample words corresponding to all OCR recognition sample words in the abnormal relationship data set to be counted have been determined.
  • All OCR recognition sample words of the abnormal relationship data set to be counted are used as the wrong words of the data corresponding to the abnormal relationship in the data set corresponding to the abnormal relationship, and the target labeled sample word corresponding to each OCR recognition sample word is used as the corresponding correct word.
  • For example, if the target labeled sample word corresponding to OCR recognition sample word P1 in the abnormal relationship data set to be counted is B4, then P1 is used as the wrong word and B4 as the correct word of one item of data corresponding to the abnormal relationship. This example is not limiting.
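  • Building the wrong-word to correct-word mapping from the counted pairs can be sketched as below. The function name and the dictionary output format are assumptions. The application does not say how ties between equally frequent labeled words are broken; this sketch keeps the first one encountered.

```python
from collections import Counter, defaultdict

def build_abnormal_relationship_dataset(pairs):
    """Map each OCR (wrong) word to its most probable labeled (correct) word.

    pairs: iterable of (ocr_word, labeled_word) tuples.
    """
    counts = defaultdict(Counter)
    for ocr_word, labeled_word in pairs:
        counts[ocr_word][labeled_word] += 1
    # All labels of one OCR word share the same denominator, so the most
    # frequent label is also the one with the largest probability value.
    return {ocr: labels.most_common(1)[0][0] for ocr, labels in counts.items()}

pairs = ([("P1", "B1")] * 3 + [("P1", "B2")] * 4
         + [("P1", "B3")] * 3 + [("P1", "B4")] * 5 + [("P2", "C1")])
dataset = build_abnormal_relationship_dataset(pairs)
```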
  • The above step of computing, for the abnormal relationship data set to be counted, the probability of each labeled sample word corresponding to each OCR recognition sample word, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word, includes:
  • S5542: Deduplicate the OCR recognition sample words in the OCR recognition sample word set to be deduplicated, to obtain a plurality of target OCR recognition sample words;
  • S5543: Count the number of times each target OCR recognition sample word appears in the abnormal relationship data set to be counted, to obtain the total number of correspondences for each target OCR recognition sample word;
  • S5544: Count the number of times each labeled sample word corresponding to each target OCR recognition sample word appears in the abnormal relationship data set to be counted, to obtain the target number of occurrences of each labeled sample word corresponding to each target OCR recognition sample word;
  • S5545: For each labeled sample word corresponding to the same target OCR recognition sample word, divide its target number of occurrences by the total number of correspondences, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted.
  • This embodiment computes the probability of each labeled sample word corresponding to each OCR recognition sample word, which provides the data basis for determining the target labeled sample word corresponding to each OCR recognition sample word.
  • The OCR recognition sample word of each abnormal relationship in the abnormal relationship data set to be counted is obtained, and all the obtained OCR recognition sample words form the OCR recognition sample word set to be deduplicated.
  • Each target OCR recognition sample word among the plurality of target OCR recognition sample words is unique.
  • the number of occurrences of the target OCR recognition sample word P1 in the number of occurrences of the abnormal relationship data set to be counted is 4, then the total number of correspondences corresponding to the target OCR recognition sample word P1 is 4, and This example is not specifically limited.
  • the target OCR recognition sample word P1 corresponds to B1, B2, B3, and B4 in the abnormal relationship data set to be counted, wherein P1-B1 appears 3 times, P1-B2 appears 4 times, and P1- B3 appears 3 times, P1-B4 appear 5 times, and P1 appears 15 times in total.
  • the target OCR identifies the target appearance of B1 corresponding to the sample word P1 is 3 times, and the target OCR identifies the target of B2 corresponding to the sample word P1.
  • the number of occurrences is 4 times
  • the target number of occurrences of B3 corresponding to the target OCR recognition sample word P1 is 3 times
  • the target number of occurrences of B4 corresponding to the target OCR recognition sample word P1 is 5 times, and the example is not specific here. limited.
  • the target OCR recognition sample word P1 corresponds to B1, B2, B3, and B4 in the abnormal relationship data set to be counted, wherein P1-B1 appears 3 times, P1-B2 appears 4 times, and P1- B3 appears 3 times, P1-B4 appears 5 times, and the total number of correspondences of P1 is 15 times.
  • the target OCR recognizes the target appearance of B1 corresponding to the sample word P1 is 3 times, and the target OCR recognizes the B2 corresponding to the sample word P1.
  • the target number of occurrences of the target OCR recognition sample word P1 is 4 times, the target number of occurrences of the B3 corresponding to the target OCR recognition sample word P1 is 3 times, and the target OCR recognition sample word P1 The target number of occurrences of B4 corresponding to the sample word P1 is 5 times.
  • the number of target occurrences of B1 corresponding to the OCR recognition sample word P1 is 3 times divided by the total number of P1 correspondences 15 times to obtain the marked sample word probability value of the marked sample word B1, and the target OCR identifies the target occurrence of B2 corresponding to the sample word P1
  • the number of times is 4 times divided by the total number of P1 correspondences 15 times to obtain the labeled sample word probability value of the labeled sample word B2
  • the target OCR recognition sample word P1 The target number of occurrences of B3 corresponding to the sample word P1 is 3 times divided by the total number of P1 corresponding relationships 15
  • the probability value of the labeled sample word B3 is obtained twice, and the target OCR recognition sample word B4 corresponding to the sample word P1 is divided into 5 times by the total number of P1 correspondences 15 times to obtain the labeled sample word probability of the labeled sample word B4. value, which is not specifically limited in this example.
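The counting in S5541 to S5545 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the abnormal relationship data set to be counted is available as a list of `(ocr_word, labeled_word)` pairs, and the function name `labeled_word_probabilities` is hypothetical. Exact fractions are used so the values match the P1 example above.

```python
from collections import Counter
from fractions import Fraction


def labeled_word_probabilities(pairs):
    """pairs: list of (ocr_word, labeled_word) abnormal relationships (assumed layout)."""
    pairs = list(pairs)
    # S5541/S5542: collect the OCR words and deduplicate them, keeping first-seen order.
    target_ocr_words = list(dict.fromkeys(ocr for ocr, _ in pairs))
    # S5543: total number of correspondences for each target OCR word.
    totals = Counter(ocr for ocr, _ in pairs)
    # S5544: target number of occurrences of each (OCR word, labeled word) pair.
    pair_counts = Counter(pairs)
    # S5545: divide each pair count by the total for its OCR word.
    probabilities = {ocr: {} for ocr in target_ocr_words}
    for (ocr, labeled), count in pair_counts.items():
        probabilities[ocr][labeled] = Fraction(count, totals[ocr])
    return probabilities


# The P1 example from the text: 3 + 4 + 3 + 5 = 15 correspondences.
pairs = [("P1", "B1")] * 3 + [("P1", "B2")] * 4 + [("P1", "B3")] * 3 + [("P1", "B4")] * 5
probs = labeled_word_probabilities(pairs)
```

Here `probs["P1"]["B2"]` is the fraction 4/15, and the four values for P1 sum to 1.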
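  • In one embodiment, the above step of using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, to obtain the target text data corresponding to the target real estate certificate, includes:
  • S61: Determine a wrong word replacement data set according to the target abnormal relationship correspondence data set and the unsuccessfully matched word set, the wrong word replacement data set including target position data and correct words;
  • S62: Perform wrong word replacement on the preprocessed text data according to the target position data and the correct words in the wrong word replacement data set, to obtain the target text data corresponding to the target real estate certificate.
  • This embodiment performs wrong word replacement on the preprocessed text data, thereby improving the accuracy of the target text data obtained for the target real estate certificate and improving user satisfaction.
  • For S61, one piece of abnormal relationship correspondence data is extracted from the target abnormal relationship correspondence data set as the abnormal relationship correspondence data whose position is to be determined; its wrong word is matched against the unsuccessfully matched word set, and the position data of the unsuccessfully matched word that is matched is used as the target position data of the corresponding wrong word replacement data; its correct word is used as the correct word of that wrong word replacement data; the extraction step is repeated until the wrong word replacement data for all abnormal relationship correspondence data in the target abnormal relationship correspondence data set has been determined; and all wrong word replacement data are used as the wrong word replacement data set.
  • Each target position data in the wrong word replacement data corresponds to one correct word.
  • For S62, one target position data is extracted from the wrong word replacement data set as the target position data to be replaced; the word at that position in the preprocessed text data is replaced with the correct word corresponding to that target position data; the extraction step is repeated until the wrong word replacement has been completed for all target position data in the wrong word replacement data set; and the preprocessed text data after replacement is used as the target text data corresponding to the target real estate certificate.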
  • The present application also proposes an apparatus for determining real estate certificate information based on OCR recognition, the apparatus including:
  • an OCR text recognition module 100, configured to acquire a certificate image to be recognized of a target real estate certificate, and to perform text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
  • a preprocessing module 200, configured to preprocess the text data to be corrected to obtain preprocessed text data;
  • a word segmentation module 300, configured to segment the preprocessed text data into words to obtain a set of words to be corrected;
  • an unsuccessfully matched word search module 400, configured to acquire a knowledge base dictionary, and to look up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
  • an abnormal relationship matching module 500, configured to acquire an abnormal relationship correspondence data set, and to match each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
  • a wrong word replacement module 600, configured to use the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
  • In this embodiment, the preprocessed text data is segmented to obtain a set of words to be corrected; each word to be corrected is looked up in the knowledge base dictionary to obtain an unsuccessfully matched word set; each unsuccessfully matched word is matched against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and the target abnormal relationship correspondence data set and the unsuccessfully matched word set are used to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.
  • An embodiment of the present application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, a computer program, and a database.
  • The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • The database of the computer device is used to store data for the method for determining real estate certificate information based on OCR recognition, among other data.
  • The network interface of the computer device is used to communicate with an external terminal through a network connection.
  • When executed by the processor, the computer program implements a method for determining real estate certificate information based on OCR recognition. The method includes: acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected; preprocessing the text data to be corrected to obtain preprocessed text data; segmenting the preprocessed text data into words to obtain a set of words to be corrected; acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set; acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
  • In this embodiment, the preprocessed text data is segmented to obtain a set of words to be corrected; each word to be corrected is looked up in the knowledge base dictionary to obtain an unsuccessfully matched word set; each unsuccessfully matched word is matched against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and the target abnormal relationship correspondence data set and the unsuccessfully matched word set are used to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • When executed by a processor, the computer program implements a method for determining real estate certificate information based on OCR recognition, including the steps of: acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected; preprocessing the text data to be corrected to obtain preprocessed text data;
  • segmenting the preprocessed text data into words to obtain a set of words to be corrected; acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set; acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
  • and using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
  • The method executed above segments the preprocessed text data to obtain a set of words to be corrected; looks up each word to be corrected in the knowledge base dictionary to obtain an unsuccessfully matched word set; matches each unsuccessfully matched word against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and uses the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.
  • The computer storage medium may be non-volatile or volatile.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)

Abstract

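This application relates to the field of artificial intelligence and discloses a method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition. The method includes: performing text recognition on a certificate image to be recognized using OCR technology to obtain text data to be corrected; obtaining preprocessed text data from the text data to be corrected; segmenting the preprocessed text data into words to obtain a set of words to be corrected; looking up each word in the set of words to be corrected in a knowledge base dictionary to obtain an unsuccessfully matched word set; matching each unsuccessfully matched word in the unsuccessfully matched word set against an abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate. This improves the accuracy of recognizing real estate certificate text data.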

Description

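Method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition
This application claims priority to Chinese patent application No. 2020114826258, filed with the China Patent Office on December 15, 2020 and entitled "基于OCR识别房产证信息确定方法、装置、设备及介质" (Method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition), the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition.
Background
Real estate certificates differ by region and by date of issue, so many versions exist. The inventor has realized that text recognition on images is currently generally performed with OCR technology, but OCR recognizes the text in images of the many versions of real estate certificates with low accuracy, which leads to numerous customer complaints.
Technical Problem
This application aims to solve the technical problem that prior-art OCR technology recognizes real estate certificates with low accuracy, leading to numerous customer complaints.
Technical Solution
The main purpose of this application is to provide a method, apparatus, device, and medium for determining real estate certificate information based on OCR recognition, aiming to solve the technical problem that prior-art OCR technology recognizes real estate certificates with low accuracy, leading to numerous customer complaints.
To achieve the above purpose, this application proposes a method for determining real estate certificate information based on OCR recognition, the method including: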
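acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
preprocessing the text data to be corrected to obtain preprocessed text data;
segmenting the preprocessed text data into words to obtain a set of words to be corrected;
acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
This application also proposes an apparatus for determining real estate certificate information based on OCR recognition, the apparatus including:
an OCR text recognition module, configured to acquire a certificate image to be recognized of a target real estate certificate, and to perform text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
a preprocessing module, configured to preprocess the text data to be corrected to obtain preprocessed text data;
a word segmentation module, configured to segment the preprocessed text data into words to obtain a set of words to be corrected;
an unsuccessfully matched word search module, configured to acquire a knowledge base dictionary, and to look up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
an abnormal relationship matching module, configured to acquire an abnormal relationship correspondence data set, and to match each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
a wrong word replacement module, configured to use the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.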
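This application also proposes a computer device, including a memory and a processor, the memory storing a computer program, where the processor implements the following method steps when executing the computer program:
acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
preprocessing the text data to be corrected to obtain preprocessed text data;
segmenting the preprocessed text data into words to obtain a set of words to be corrected;
acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
This application also proposes a computer-readable storage medium on which a computer program is stored, where the computer program implements the following method steps when executed by a processor:
acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
preprocessing the text data to be corrected to obtain preprocessed text data;
segmenting the preprocessed text data into words to obtain a set of words to be corrected;
acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.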
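Beneficial Effects
The method, apparatus, device, and medium of this application for determining real estate certificate information based on OCR recognition perform text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected; preprocess the text data to be corrected to obtain preprocessed text data; segment the preprocessed text data to obtain a set of words to be corrected; look up each word to be corrected in the knowledge base dictionary to obtain an unsuccessfully matched word set; match each unsuccessfully matched word against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and use the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for determining real estate certificate information based on OCR recognition according to an embodiment of this application;
FIG. 2 is a schematic structural block diagram of an apparatus for determining real estate certificate information based on OCR recognition according to an embodiment of this application;
FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
The realization of the purpose, the functional features, and the advantages of this application are further described with reference to the embodiments and the accompanying drawings.
Embodiments of the Invention
To make the purpose, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.
To solve the technical problem that prior-art OCR technology recognizes real estate certificates with low accuracy, leading to numerous customer complaints, this application proposes a method for determining real estate certificate information based on OCR recognition, which is applied in the field of artificial intelligence. The method first recognizes the image of the real estate certificate using OCR technology and then corrects the result using a knowledge base dictionary and an abnormal relationship correspondence data set. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.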
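Referring to FIG. 1, an embodiment of this application provides a method for determining real estate certificate information based on OCR recognition, the method including:
S1: Acquire a certificate image to be recognized of a target real estate certificate, and perform text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
S2: Preprocess the text data to be corrected to obtain preprocessed text data;
S3: Segment the preprocessed text data into words to obtain a set of words to be corrected;
S4: Acquire a knowledge base dictionary, and look up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
S5: Acquire an abnormal relationship correspondence data set, and match each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
S6: Use the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
In this embodiment, the preprocessed text data is segmented to obtain a set of words to be corrected; each word to be corrected is looked up in the knowledge base dictionary to obtain an unsuccessfully matched word set; each unsuccessfully matched word is matched against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and the target abnormal relationship correspondence data set and the unsuccessfully matched word set are used to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.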
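For S1, the certificate image to be recognized of the target real estate certificate may be input by a user, may be obtained by an electronic device (such as a scanner or digital camera) directly scanning or photographing the target real estate certificate, or may be sent by a third-party application system.
The certificate image to be recognized is a digital image of the target real estate certificate on which text recognition is to be performed.
OCR technology is used to perform text recognition on each text region in the certificate image to be recognized, yielding at least one item of text data, where each text region corresponds to one item of text data.
OCR technology refers to optical character recognition technology.
The target real estate certificate may be a real estate certificate of any property type, any period, and any region.
The text data to be corrected is the text data obtained by recognizing the certificate image to be recognized using OCR technology.
For S2, the text data to be corrected is input into a special character processing model, which identifies and deletes special characters; the text data to be corrected with the special characters deleted is used as the preprocessed text data.
The special character processing model is a model obtained by training a neural network.
Special characters are characters that cannot appear in the text information on a real estate certificate, that is, characters other than letters, digits, Chinese characters, hyphens, and spaces.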
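The source performs this step with a trained neural-network model whose details are not given. As a stand-in only, the definition of special characters above can be approximated with a rule-based filter. The sketch below swaps in a regular expression for the model; the function name and the choice of the CJK Unified Ideographs range U+4E00 to U+9FFF for "Chinese characters" are assumptions.

```python
import re

# Keep ASCII letters, digits, CJK unified ideographs, hyphens, and spaces;
# everything else is treated as a special character (assumed character ranges).
_SPECIAL_CHARACTERS = re.compile(r"[^A-Za-z0-9\u4e00-\u9fff\- ]")


def strip_special_characters(text: str) -> str:
    """Rule-based approximation of the special character deletion in S2."""
    return _SPECIAL_CHARACTERS.sub("", text)


# Hypothetical OCR output containing a full-width colon, a star, and a hash sign.
cleaned = strip_special_characters("房屋所有权人:张三★ 粤A-1024#")
```

With this input, `cleaned` is `"房屋所有权人张三 粤A-1024"`. A learned model could make context-dependent decisions that a fixed character class cannot.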
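For S3, each item of text data in the preprocessed text data is segmented separately; the words obtained by segmentation are used as words to be corrected, and all words to be corrected form the set of words to be corrected. In other words, each item of text data is segmented independently.
For S4, the knowledge base dictionary may be obtained from a database. One word to be corrected is extracted from the set of words to be corrected as the target word to be corrected; the target word to be corrected is looked up in the knowledge base dictionary; if the word is found in the knowledge base dictionary, the target word to be corrected is determined to be a successfully matched word, and otherwise it is determined to be an unsuccessfully matched word; the extraction step is repeated until every word in the set of words to be corrected has been determined to be either an unsuccessfully matched word or a successfully matched word; and all unsuccessfully matched words are used as the unsuccessfully matched word set.
The knowledge base dictionary includes, but is not limited to, a property type sub-dictionary, an administrative region sub-dictionary, a real estate development sub-dictionary, and a surname sub-dictionary. The knowledge base dictionary is built from information commonly found on real estate certificates, so that it is suited to error correction in the real estate certificate domain, which helps improve the accuracy of correcting real estate certificate information.
The property type sub-dictionary includes property type names. Property type names include, but are not limited to: 自有 (self-owned), 商品房 (commercial housing), 住宅 (residential), and 公寓 (apartment).
The administrative region sub-dictionary includes, but is not limited to: province names, city names, district names, and street names.
The real estate development sub-dictionary includes, but is not limited to: names of real estate developments.
The surname sub-dictionary includes, but is not limited to: surnames.
The unsuccessfully matched word set includes unsuccessfully matched words and position data, with each unsuccessfully matched word corresponding to one position data.
Multiple unsuccessfully matched words in the unsuccessfully matched word set may be identical or different; no specific limitation is imposed here.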
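As a rough illustration of S4, the sketch below looks up each word in a knowledge base dictionary made of sub-dictionaries and keeps the words that are not found, together with their position data. The dictionary layout, the integer positions, the sample entries other than the property type names, and the misrecognized word 深训市 are all assumptions made for the example.

```python
# Hypothetical knowledge base dictionary: sub-dictionary name -> set of known words.
KNOWLEDGE_BASE = {
    "property_type": {"自有", "商品房", "住宅", "公寓"},
    "administrative_region": {"广东省", "深圳市", "福田区"},
    "real_estate_development": set(),
    "surname": {"张", "李"},
}


def find_unsuccessfully_matched(words_with_positions, knowledge_base):
    """Return the (word, position) entries whose word is in no sub-dictionary."""
    vocabulary = set().union(*knowledge_base.values())
    return [(word, pos) for word, pos in words_with_positions if word not in vocabulary]


# Set of words to be corrected, each with assumed position data (character offset).
words = [("广东省", 0), ("深训市", 3), ("福田区", 6), ("住宅", 9), ("深训市", 20)]
unmatched = find_unsuccessfully_matched(words, KNOWLEDGE_BASE)
```

Here `unmatched` contains the same word twice, at positions 3 and 20, which matches the statement that entries in the unsuccessfully matched word set may be identical.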
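For S5, the abnormal relationship correspondence data set may be obtained from a database. Each unsuccessfully matched word in the unsuccessfully matched word set is matched against the wrong words of the abnormal relationship correspondence data in the abnormal relationship correspondence data set; each wrong word that is matched is used as a target wrong word; and the abnormal relationship correspondence data corresponding to all target wrong words are used as the target abnormal relationship correspondence data set.
The abnormal relationship correspondence data set includes abnormal relationship correspondence data. Each piece of abnormal relationship correspondence data includes a wrong word and a correct word, with each wrong word corresponding to one correct word.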
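Because each wrong word corresponds to one correct word, the matching in S5 amounts to a keyed lookup. The following sketch assumes the abnormal relationship correspondence data set is held as a plain mapping from wrong word to correct word; the function name and the sample entries 深训市 and 某某苑 are hypothetical, while the pair 税有产权 / 私有产权 comes from the description.

```python
def match_abnormal_relationships(unmatched, abnormal_map):
    """Select the abnormal relationship entries whose wrong word was not matched in S4.

    unmatched: list of (word, position) pairs from S4 (assumed layout).
    abnormal_map: dict mapping wrong word -> correct word (assumed layout).
    """
    return {word: abnormal_map[word] for word, _pos in unmatched if word in abnormal_map}


abnormal_map = {"税有产权": "私有产权", "深训市": "深圳市"}
unmatched = [("深训市", 3), ("某某苑", 12)]
target_abnormal = match_abnormal_relationships(unmatched, abnormal_map)
```

In this example `target_abnormal` is `{"深训市": "深圳市"}`. The word 某某苑 has no entry and is left for no replacement, since the source only replaces words with a known correspondence.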
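For S6, the target abnormal relationship correspondence data set is used to replace the words at the positions in the preprocessed text data that correspond to the unsuccessfully matched word set, and the preprocessed text data after replacement is used as the target text data corresponding to the target real estate certificate.
In one embodiment, the above step of segmenting the preprocessed text data into words to obtain a set of words to be corrected includes:
S31: Segment each item of text data in the preprocessed text data separately, to obtain a plurality of words to be corrected and the position data corresponding to each of them;
S32: Determine the set of words to be corrected according to the plurality of words to be corrected and their respective position data.
This embodiment segments the preprocessed text data, providing a basis for subsequent correction using the knowledge base dictionary and the abnormal relationship correspondence data set.
For S31, one item of text data is obtained from the preprocessed text data as the target text data item; the target text data item is segmented, the words obtained are used as words to be corrected, and the position of each word to be corrected in the preprocessed text data is used as its position data; the step of obtaining one item of text data is repeated until the plurality of words to be corrected and their respective position data have been determined.
Position data refers to the position of a word to be corrected in the preprocessed text data.
For S32, the plurality of words to be corrected and their respective position data are used as the set of words to be corrected.
That is, the set of words to be corrected includes words to be corrected and position data, with each word to be corrected corresponding to one position data.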
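The source does not name a segmentation algorithm or a position format. The sketch below is therefore only one possible reading: it uses forward maximum matching over a small vocabulary as a stand-in segmenter, and represents position data as an `(item_index, character_offset)` pair. Both choices, and all function names, are assumptions.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Toy segmenter: greedily take the longest vocabulary word at each offset.

    Characters that start no vocabulary word are emitted as single-character words.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def segment_with_positions(items, vocabulary):
    """S31/S32: segment each text item independently and record each word's position."""
    result = []
    for item_index, text in enumerate(items):
        offset = 0
        for word in forward_max_match(text, vocabulary):
            result.append((word, (item_index, offset)))
            offset += len(word)
    return result


vocabulary = {"私有产权", "住宅", "深圳市"}
words = segment_with_positions(["深圳市住宅", "私有产权"], vocabulary)
```

Each text item is segmented on its own, as the description requires, and every word carries the position needed later for replacement. A production system would more likely use a statistical or dictionary-plus-model segmenter, since pure maximum matching splits unknown (for example misrecognized) words into single characters.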
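In one embodiment, the above step of looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set, includes:
S41: Match each word to be corrected in the set of words to be corrected against the knowledge base dictionary, to obtain a plurality of knowledge base matching results;
S42: When a knowledge base matching result is a failure, determine the unsuccessfully matched word set according to the failed knowledge base matching results.
This embodiment uses the knowledge base dictionary to eliminate words to be corrected that are already correct, which reduces the amount of data to be corrected using the abnormal relationship correspondence data set and also prevents correct words from being wrongly altered.
For S41, one word to be corrected is obtained from the set of words to be corrected as the target word to be corrected; the target word to be corrected is matched against the knowledge base dictionary; if the word is found in the knowledge base dictionary, the knowledge base matching result for the target word to be corrected is determined to be a success, and otherwise it is determined to be a failure; the step of obtaining one word to be corrected is repeated until the knowledge base matching results for all words to be corrected in the set have been determined.
For S42, a failed knowledge base matching result means that the word to be corrected is not in the knowledge base dictionary; the word can then be treated as an unsuccessfully matched word, and all unsuccessfully matched words are used as the unsuccessfully matched word set.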
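In one embodiment, before the above step of acquiring the abnormal relationship correspondence data set, the method further includes:
S51: Acquire a plurality of real estate certificate sample images, each carrying an image identifier;
S52: Perform text recognition on each real estate certificate sample image using the OCR technology, to obtain the OCR recognition sample data corresponding to each of the plurality of real estate certificate sample images;
S53: Send the plurality of real estate certificate sample images and their corresponding OCR recognition sample data to a labeling terminal for error correction;
S54: Acquire the labeled sample data, sent by the labeling terminal, corresponding to each of the plurality of real estate certificate sample images;
S55: Perform word segmentation and abnormal relationship correspondence judgment according to the image identifier, the OCR recognition sample data, and the labeled sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set;
S56: Store the abnormal relationship correspondence data set in a database.
This embodiment builds the abnormal relationship correspondence data set from manual labeling and OCR recognition sample data, which improves the accuracy of the correspondences and provides a basis for subsequent correction using the abnormal relationship correspondence data set. Moreover, because the data set is determined from real estate certificate sample images, it is suited to error correction in the real estate certificate domain, which helps improve the accuracy of correcting real estate certificate information.
For S51, the plurality of real estate certificate sample images may be input by a user or sent by a third-party application system.
A real estate certificate sample image is a digital image of a real estate certificate.
The image identifier may be an image name, an image ID, or any other identifier that uniquely identifies a real estate certificate sample image.
For S52, one real estate certificate sample image is obtained from the plurality of real estate certificate sample images as the target real estate certificate sample image; text recognition is performed on it using the OCR technology, and the text data obtained is used as the OCR recognition sample data corresponding to the target real estate certificate sample image; the step of obtaining one sample image is repeated until the OCR recognition sample data corresponding to each of the plurality of real estate certificate sample images has been determined.
For S53, the plurality of real estate certificate sample images and their respective OCR recognition sample data are sent to the labeling terminal for error correction.
For S54, after correcting errors according to the plurality of real estate certificate sample images and their respective OCR recognition sample data, the labeling personnel send the labeled sample data corresponding to each of the plurality of real estate certificate sample images through the labeling terminal.
For S55, the OCR recognition sample data and the labeled sample data corresponding to each of the plurality of real estate certificate sample images are segmented, and the image identifier is used to perform abnormal relationship correspondence judgment on the segmentation results, to obtain the abnormal relationship correspondence data set.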
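In one embodiment, the above step of performing word segmentation and abnormal relationship correspondence judgment according to the image identifier, the OCR recognition sample data, and the labeled sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set, includes:
S551: Segment each item of text data of each OCR recognition sample data corresponding to the plurality of real estate certificate sample images, to obtain the OCR recognition sample word set corresponding to each of the plurality of real estate certificate sample images;
S552: Segment each item of text data of each labeled sample data corresponding to the plurality of real estate certificate sample images, to obtain the labeled sample word set corresponding to each of the plurality of real estate certificate sample images;
S553: Based on the image identifier, perform an abnormal relationship search on the OCR recognition sample word set and the labeled sample word set corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship data set to be counted corresponding to the plurality of real estate certificate sample images, the abnormal relationship data set to be counted including OCR recognition sample words and labeled sample words;
S554: Perform, on the abnormal relationship data set to be counted, probability statistics of each labeled sample word corresponding to each OCR recognition sample word, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted;
S555: From the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted, obtain the largest labeled sample word probability value, to obtain the target labeled sample word probability value corresponding to each OCR recognition sample word;
S556: According to the target labeled sample word probability value corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted, determine the target labeled sample word corresponding to each OCR recognition sample word;
S557: Obtain the abnormal relationship correspondence data set according to all OCR recognition sample words in the abnormal relationship data set to be counted and their corresponding target labeled sample words.
This embodiment performs word segmentation and abnormal relationship correspondence judgment according to the image identifier, the OCR recognition sample data, and the labeled sample data corresponding to each of the plurality of real estate certificate sample images, providing a basis for subsequent correction using the abnormal relationship correspondence data set.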
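For S551, one OCR recognition sample data is obtained from the OCR recognition sample data corresponding to the plurality of real estate certificate sample images as the OCR recognition sample data to be segmented; each text data item of the OCR recognition sample data to be segmented is segmented, and the words obtained are used as the OCR recognition sample word set of the corresponding real estate certificate sample image; the step of obtaining one OCR recognition sample data is repeated until the OCR recognition sample word set corresponding to each of the plurality of real estate certificate sample images has been determined.
For S552, one labeled sample data is obtained from the labeled sample data corresponding to the plurality of real estate certificate sample images as the labeled sample data to be segmented; each text data item of the labeled sample data to be segmented is segmented, and the words obtained are used as the labeled sample word set of the corresponding real estate certificate sample image; the step of obtaining one labeled sample data is repeated until the labeled sample word set corresponding to each of the plurality of real estate certificate sample images has been determined.
For S553, an abnormal relationship search is performed on the OCR recognition sample word set and the labeled sample word set that correspond to the same image identifier, and the abnormal relationship data set to be counted corresponding to the plurality of real estate certificate sample images is determined from the abnormal relationships found.
In the abnormal relationship data set to be counted, one OCR recognition sample word and one labeled sample word together form one abnormal relationship.
For example, suppose that in the OCR recognition sample word set of image identifier T1 the OCR recognition sample word at position data W1 is 税有产权, and in the labeled sample word set of image identifier T1 the labeled sample word at position data W1 is 私有产权 (private property rights). Because 税有产权 differs from 私有产权, an abnormal relationship is determined to exist at position data W1 of image identifier T1, and 税有产权 (the OCR recognition sample word) and 私有产权 (the labeled sample word) are used as abnormal relationship data to be counted for the real estate certificate sample image corresponding to image identifier T1. This example is not limiting.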
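The search in S553 can be pictured as a position-by-position comparison of the two word sets that share an image identifier. The sketch below assumes each word set is stored as a mapping from position data to word, and that OCR output and labeled data are segmented at the same positions, as the T1/W1 example implies; the function name and data layout are hypothetical.

```python
def find_abnormal_relationships(ocr_word_sets, labeled_word_sets):
    """Compare OCR and labeled words per image identifier and position.

    Both arguments map image_id -> {position: word} (assumed layout).
    Returns a list of (image_id, position, ocr_word, labeled_word) tuples.
    """
    abnormal = []
    for image_id, ocr_words in ocr_word_sets.items():
        labeled_words = labeled_word_sets.get(image_id, {})
        for position, ocr_word in ocr_words.items():
            labeled_word = labeled_words.get(position)
            # Only positions present in both sets and holding different words count.
            if labeled_word is not None and labeled_word != ocr_word:
                abnormal.append((image_id, position, ocr_word, labeled_word))
    return abnormal


ocr_word_sets = {"T1": {"W1": "税有产权", "W2": "住宅"}}
labeled_word_sets = {"T1": {"W1": "私有产权", "W2": "住宅"}}
abnormal = find_abnormal_relationships(ocr_word_sets, labeled_word_sets)
```

For the example in the text, the only abnormal relationship found is 税有产权 paired with 私有产权 at position W1 of image T1; the identical words at W2 produce nothing.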
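For S554, for example, the OCR recognition sample word P1 corresponds to the labeled sample words B1, B2, B3, and B4 in the abnormal relationship data set to be counted, where P1-B1 appears 3 times, P1-B2 appears 4 times, P1-B3 appears 3 times, and P1-B4 appears 5 times, so the total number of correspondences for P1 is 15. The probability of P1-B1 is 3 divided by 15, that of P1-B2 is 4 divided by 15, that of P1-B3 is 3 divided by 15, and that of P1-B4 is 5 divided by 15. These four probabilities (the probability values of labeled sample words B1, B2, B3, and B4) are used as the labeled sample word probability value set corresponding to the OCR recognition sample word P1. This example is not limiting.
For S555, the labeled sample word probability value set of one OCR recognition sample word is obtained from the labeled sample word probability value sets of all OCR recognition sample words in the abnormal relationship data set to be counted, as the labeled sample word probability value set to be analyzed; the largest labeled sample word probability value in that set is found and used as the target labeled sample word probability value of the corresponding OCR recognition sample word; the step of obtaining one labeled sample word probability value set is repeated until the target labeled sample word probability value of every OCR recognition sample word in the abnormal relationship data set to be counted has been determined.
For S556, the target labeled sample word probability value of one OCR recognition sample word is obtained from the target labeled sample word probability values of all OCR recognition sample words in the abnormal relationship data set to be counted, as the target labeled sample word probability value to be analyzed; the labeled sample word corresponding to that probability value is used as the target labeled sample word of the corresponding OCR recognition sample word; the step of obtaining one target labeled sample word probability value is repeated until the target labeled sample word of every OCR recognition sample word in the abnormal relationship data set to be counted has been determined.
For S557, the abnormal relationship correspondence data set is obtained from all OCR recognition sample words in the abnormal relationship data set to be counted and their corresponding target labeled sample words.
Each OCR recognition sample word in the abnormal relationship data set to be counted is used as the wrong word of a piece of abnormal relationship correspondence data in the abnormal relationship correspondence data set, and the target labeled sample word corresponding to that OCR recognition sample word is used as the correct word of the same piece of abnormal relationship correspondence data.
For example, if the target labeled sample word corresponding to the OCR recognition sample word P1 in the abnormal relationship data set to be counted is B4, then P1 is used as the wrong word and B4 as the correct word of a piece of abnormal relationship correspondence data in the abnormal relationship correspondence data set. This example is not limiting.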
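Steps S555 to S557 reduce to picking, for each OCR recognition sample word, the labeled sample word with the largest probability value. A minimal sketch, assuming the probability value sets are held as nested dictionaries (the layout and the function name are hypothetical):

```python
def build_abnormal_relationship_map(probability_sets):
    """Map each OCR word (wrong word) to its most probable labeled word (correct word).

    probability_sets: dict of ocr_word -> {labeled_word: probability} (assumed layout).
    """
    return {
        ocr_word: max(probabilities, key=probabilities.get)  # S555 and S556
        for ocr_word, probabilities in probability_sets.items()
    }


# The P1 example: B4 has the largest probability, 5/15.
probability_sets = {"P1": {"B1": 3 / 15, "B2": 4 / 15, "B3": 3 / 15, "B4": 5 / 15}}
abnormal_relationship_map = build_abnormal_relationship_map(probability_sets)  # S557
```

This yields `{"P1": "B4"}`, in line with the example above. The source does not say how ties are broken; `max` here simply returns the first labeled word that reaches the maximum.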
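In one embodiment, the above step of performing, on the abnormal relationship data set to be counted, probability statistics of each labeled sample word corresponding to each OCR recognition sample word, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in that data set, includes:
S5541: Obtain all OCR recognition sample words from the abnormal relationship data set to be counted, to obtain a set of OCR recognition sample words to be deduplicated;
S5542: Deduplicate the OCR recognition sample words in the set to be deduplicated, to obtain a plurality of target OCR recognition sample words;
S5543: Count the number of times each target OCR recognition sample word appears in the abnormal relationship data set to be counted, to obtain the total number of correspondences for each target OCR recognition sample word;
S5544: Count, for each target OCR recognition sample word, the number of occurrences of each labeled sample word that corresponds to it in the abnormal relationship data set to be counted, to obtain the target number of occurrences of each labeled sample word for each target OCR recognition sample word;
S5545: For each target OCR recognition sample word, divide the target number of occurrences of each of its labeled sample words by its total number of correspondences, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted.
This embodiment performs probability statistics on each labeled sample word corresponding to each OCR recognition sample word, and provides a data basis for determining the target labeled sample word corresponding to each OCR recognition sample word.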
For S5541, the OCR recognition sample word of each abnormal relationship in the abnormal relationship data set to be counted is obtained, and all obtained OCR recognition sample words are used as the set of OCR recognition sample words to be deduplicated.
For S5542, each target OCR recognition sample word among the plurality of target OCR recognition sample words is unique.
For S5543, for example, if the target OCR recognition sample word P1 appears 4 times in the abnormal relationship data set to be counted, then the total number of correspondences for P1 is 4. This example is not limiting.
For S5544, for example, the target OCR recognition sample word P1 corresponds to the labeled sample words B1, B2, B3, and B4 in the abnormal relationship data set to be counted, where P1-B1 appears 3 times, P1-B2 appears 4 times, P1-B3 appears 3 times, and P1-B4 appears 5 times, so P1 appears 15 times in total. The target numbers of occurrences of B1, B2, B3, and B4 for P1 are therefore 3, 4, 3, and 5, respectively. This example is not limiting.
For S5545, for example, with the same counts (P1-B1 3 times, P1-B2 4 times, P1-B3 3 times, P1-B4 5 times, and a total of 15 correspondences for P1), the labeled sample word probability value of B1 is 3 divided by 15, that of B2 is 4 divided by 15, that of B3 is 3 divided by 15, and that of B4 is 5 divided by 15. This example is not limiting.
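In one embodiment, the above step of using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, to obtain the target text data corresponding to the target real estate certificate, includes:
S61: Determine a wrong word replacement data set according to the target abnormal relationship correspondence data set and the unsuccessfully matched word set, the wrong word replacement data set including target position data and correct words;
S62: Perform wrong word replacement on the preprocessed text data according to the target position data and the correct words in the wrong word replacement data set, to obtain the target text data corresponding to the target real estate certificate.
This embodiment performs wrong word replacement on the preprocessed text data, thereby improving the accuracy of the target text data obtained for the target real estate certificate and improving user satisfaction.
For S61, one piece of abnormal relationship correspondence data is extracted from the target abnormal relationship correspondence data set as the abnormal relationship correspondence data whose position is to be determined; its wrong word is matched against the unsuccessfully matched word set, and the position data of the unsuccessfully matched word that is matched is used as the target position data of the corresponding wrong word replacement data; its correct word is used as the correct word of that wrong word replacement data; the extraction step is repeated until the wrong word replacement data for all abnormal relationship correspondence data in the target abnormal relationship correspondence data set has been determined; and all wrong word replacement data are used as the wrong word replacement data set.
Each target position data in the wrong word replacement data corresponds to one correct word.
For S62, one target position data is extracted from the wrong word replacement data set as the target position data to be replaced; the word at that position in the preprocessed text data is replaced with the correct word corresponding to that target position data; the extraction step is repeated until the wrong word replacement has been completed for all target position data in the wrong word replacement data set; and the preprocessed text data after replacement is used as the target text data corresponding to the target real estate certificate.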
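S61 and S62 can be sketched as building a list of position-based replacements and applying it to the text. The example assumes position data is a character offset into a single preprocessed string and that the target abnormal relationship data is a mapping from wrong word to correct word; it also carries the wrong word along with each replacement as a sanity check, which the source does not require. All names and sample words are hypothetical.

```python
def build_replacements(target_abnormal, unmatched):
    """S61: pair each unsuccessfully matched word's position with its correct word."""
    return [
        (position, word, target_abnormal[word])
        for word, position in unmatched
        if word in target_abnormal
    ]


def apply_replacements(text, replacements):
    """S62: replace the word at each target position with its correct word.

    Replacements are applied from right to left so that earlier offsets stay
    valid even when a correct word differs in length from the wrong word.
    """
    for position, wrong, correct in sorted(replacements, reverse=True):
        if text[position:position + len(wrong)] == wrong:  # sanity check
            text = text[:position] + correct + text[position + len(wrong):]
    return text


preprocessed = "深训市福田区税有产权"
unmatched = [("深训市", 0), ("税有产权", 6)]
target_abnormal = {"深训市": "深圳市", "税有产权": "私有产权"}
target_text = apply_replacements(preprocessed, build_replacements(target_abnormal, unmatched))
```

Here `target_text` is `"深圳市福田区私有产权"`. Replacing by position rather than by global string substitution means a word that appears correctly elsewhere in the text is not touched.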
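Referring to FIG. 2, this application also proposes an apparatus for determining real estate certificate information based on OCR recognition, the apparatus including:
an OCR text recognition module 100, configured to acquire a certificate image to be recognized of a target real estate certificate, and to perform text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
a preprocessing module 200, configured to preprocess the text data to be corrected to obtain preprocessed text data;
a word segmentation module 300, configured to segment the preprocessed text data into words to obtain a set of words to be corrected;
an unsuccessfully matched word search module 400, configured to acquire a knowledge base dictionary, and to look up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
an abnormal relationship matching module 500, configured to acquire an abnormal relationship correspondence data set, and to match each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
a wrong word replacement module 600, configured to use the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
In this embodiment, the preprocessed text data is segmented to obtain a set of words to be corrected; each word to be corrected is looked up in the knowledge base dictionary to obtain an unsuccessfully matched word set; each unsuccessfully matched word is matched against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and the target abnormal relationship correspondence data set and the unsuccessfully matched word set are used to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.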
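Referring to FIG. 3, an embodiment of this application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data for the method for determining real estate certificate information based on OCR recognition, among other data. The network interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer program implements a method for determining real estate certificate information based on OCR recognition. The method includes: acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected; preprocessing the text data to be corrected to obtain preprocessed text data; segmenting the preprocessed text data into words to obtain a set of words to be corrected; acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set; acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
In this embodiment, the preprocessed text data is segmented to obtain a set of words to be corrected; each word to be corrected is looked up in the knowledge base dictionary to obtain an unsuccessfully matched word set; each unsuccessfully matched word is matched against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and the target abnormal relationship correspondence data set and the unsuccessfully matched word set are used to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.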
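An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements a method for determining real estate certificate information based on OCR recognition, including the steps of: acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected; preprocessing the text data to be corrected to obtain preprocessed text data; segmenting the preprocessed text data into words to obtain a set of words to be corrected; acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set; acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.
The method executed above segments the preprocessed text data to obtain a set of words to be corrected; looks up each word to be corrected in the knowledge base dictionary to obtain an unsuccessfully matched word set; matches each unsuccessfully matched word against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set; and uses the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining the target text data corresponding to the target real estate certificate. The OCR recognition result is thereby corrected automatically, which improves the accuracy of the recognized real estate certificate text data and improves user satisfaction.
The computer storage medium may be non-volatile or volatile.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, apparatus, article, or method that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, apparatus, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, apparatus, article, or method that includes that element.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.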

Claims (20)

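  1. A method for determining real estate certificate information based on OCR recognition, wherein the method comprises:
    acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
    preprocessing the text data to be corrected to obtain preprocessed text data;
    segmenting the preprocessed text data into words to obtain a set of words to be corrected;
    acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
    acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
    using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.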
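  2. The method for determining real estate certificate information based on OCR recognition according to claim 1, wherein the step of segmenting the preprocessed text data into words to obtain a set of words to be corrected comprises:
    segmenting each item of text data in the preprocessed text data separately, to obtain a plurality of words to be corrected and the position data corresponding to each of them;
    determining the set of words to be corrected according to the plurality of words to be corrected and their respective position data.
  3. The method for determining real estate certificate information based on OCR recognition according to claim 1, wherein the step of looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set, comprises:
    matching each word to be corrected in the set of words to be corrected against the knowledge base dictionary, to obtain a plurality of knowledge base matching results;
    when a knowledge base matching result is a failure, determining the unsuccessfully matched word set according to the failed knowledge base matching results.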
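  4. The method for determining real estate certificate information based on OCR recognition according to claim 1, wherein, before the step of acquiring the abnormal relationship correspondence data set, the method further comprises:
    acquiring a plurality of real estate certificate sample images, each carrying an image identifier;
    performing text recognition on each real estate certificate sample image using the OCR technology, to obtain the OCR recognition sample data corresponding to each of the plurality of real estate certificate sample images;
    sending the plurality of real estate certificate sample images and their corresponding OCR recognition sample data to a labeling terminal for error correction;
    acquiring the labeled sample data, sent by the labeling terminal, corresponding to each of the plurality of real estate certificate sample images;
    performing word segmentation and abnormal relationship correspondence judgment according to the image identifier, the OCR recognition sample data, and the labeled sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set;
    storing the abnormal relationship correspondence data set in a database.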
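  5. The method for determining real estate certificate information based on OCR recognition according to claim 4, wherein the step of performing word segmentation and abnormal relationship correspondence judgment according to the image identifier, the OCR recognition sample data, and the labeled sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set, comprises:
    segmenting each item of text data of each OCR recognition sample data corresponding to the plurality of real estate certificate sample images, to obtain the OCR recognition sample word set corresponding to each of the plurality of real estate certificate sample images;
    segmenting each item of text data of each labeled sample data corresponding to the plurality of real estate certificate sample images, to obtain the labeled sample word set corresponding to each of the plurality of real estate certificate sample images;
    based on the image identifier, performing an abnormal relationship search on the OCR recognition sample word set and the labeled sample word set corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship data set to be counted corresponding to the plurality of real estate certificate sample images, the abnormal relationship data set to be counted comprising OCR recognition sample words and labeled sample words;
    performing, on the abnormal relationship data set to be counted, probability statistics of each labeled sample word corresponding to each OCR recognition sample word, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted;
    obtaining, from the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted, the largest labeled sample word probability value, to obtain the target labeled sample word probability value corresponding to each OCR recognition sample word;
    determining, according to the target labeled sample word probability value corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted, the target labeled sample word corresponding to each OCR recognition sample word;
    obtaining the abnormal relationship correspondence data set according to all OCR recognition sample words in the abnormal relationship data set to be counted and their corresponding target labeled sample words.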
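  6. The method for determining real estate certificate information based on OCR recognition according to claim 5, wherein the step of performing, on the abnormal relationship data set to be counted, probability statistics of each labeled sample word corresponding to each OCR recognition sample word, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted, comprises:
    obtaining all OCR recognition sample words from the abnormal relationship data set to be counted, to obtain a set of OCR recognition sample words to be deduplicated;
    deduplicating the OCR recognition sample words in the set to be deduplicated, to obtain a plurality of target OCR recognition sample words;
    counting the number of times each target OCR recognition sample word appears in the abnormal relationship data set to be counted, to obtain the total number of correspondences for each target OCR recognition sample word;
    counting, for each target OCR recognition sample word, the number of occurrences of each labeled sample word that corresponds to it in the abnormal relationship data set to be counted, to obtain the target number of occurrences of each labeled sample word for each target OCR recognition sample word;
    dividing, for each target OCR recognition sample word, the target number of occurrences of each of its labeled sample words by its total number of correspondences, to obtain the labeled sample word probability value set corresponding to each OCR recognition sample word in the abnormal relationship data set to be counted.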
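  7. The method for determining real estate certificate information based on OCR recognition according to claim 1, wherein the step of using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate, comprises:
    determining a wrong word replacement data set according to the target abnormal relationship correspondence data set and the unsuccessfully matched word set, the wrong word replacement data set comprising target position data and correct words;
    performing wrong word replacement on the preprocessed text data according to the target position data and the correct words in the wrong word replacement data set, to obtain the target text data corresponding to the target real estate certificate.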
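  8. An apparatus for determining real estate certificate information based on OCR recognition, wherein the apparatus comprises:
    an OCR text recognition module, configured to acquire a certificate image to be recognized of a target real estate certificate, and to perform text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
    a preprocessing module, configured to preprocess the text data to be corrected to obtain preprocessed text data;
    a word segmentation module, configured to segment the preprocessed text data into words to obtain a set of words to be corrected;
    an unsuccessfully matched word search module, configured to acquire a knowledge base dictionary, and to look up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
    an abnormal relationship matching module, configured to acquire an abnormal relationship correspondence data set, and to match each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
    a wrong word replacement module, configured to use the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.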
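  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the following method steps when executing the computer program:
    acquiring a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized using OCR technology to obtain text data to be corrected;
    preprocessing the text data to be corrected to obtain preprocessed text data;
    segmenting the preprocessed text data into words to obtain a set of words to be corrected;
    acquiring a knowledge base dictionary, and looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set;
    acquiring an abnormal relationship correspondence data set, and matching each unsuccessfully matched word in the unsuccessfully matched word set against the abnormal relationship correspondence data set to obtain a target abnormal relationship correspondence data set;
    using the target abnormal relationship correspondence data set and the unsuccessfully matched word set to perform wrong word replacement on the preprocessed text data, obtaining target text data corresponding to the target real estate certificate.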
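  10. The computer device according to claim 9, wherein the step of segmenting the preprocessed text data into words to obtain a set of words to be corrected comprises:
    segmenting each item of text data in the preprocessed text data separately, to obtain a plurality of words to be corrected and the position data corresponding to each of them;
    determining the set of words to be corrected according to the plurality of words to be corrected and their respective position data.
  11. The computer device according to claim 9, wherein the step of looking up each word in the set of words to be corrected in the knowledge base dictionary to find unsuccessfully matched words, obtaining an unsuccessfully matched word set, comprises:
    matching each word to be corrected in the set of words to be corrected against the knowledge base dictionary, to obtain a plurality of knowledge base matching results;
    when a knowledge base matching result is a failure, determining the unsuccessfully matched word set according to the failed knowledge base matching results.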
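  12. The computer device according to claim 9, wherein, before the step of acquiring the abnormal relationship correspondence data set, the method further comprises:
    acquiring a plurality of real estate certificate sample images, each carrying an image identifier;
    performing text recognition on each real estate certificate sample image using the OCR technology, to obtain the OCR recognition sample data corresponding to each of the plurality of real estate certificate sample images;
    sending the plurality of real estate certificate sample images and their corresponding OCR recognition sample data to a labeling terminal for error correction;
    acquiring the labeled sample data, sent by the labeling terminal, corresponding to each of the plurality of real estate certificate sample images;
    performing word segmentation and abnormal relationship correspondence judgment according to the image identifier, the OCR recognition sample data, and the labeled sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set;
    storing the abnormal relationship correspondence data set in a database.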
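  13. The computer device according to claim 12, wherein the step of performing word segmentation and abnormal relationship correspondence determination according to the image identifier, the OCR-recognized sample data and the annotated sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set, comprises:
    separately performing word segmentation on each item of text data of each piece of OCR-recognized sample data corresponding to the plurality of real estate certificate sample images, to obtain an OCR-recognized sample word set corresponding to each of the plurality of real estate certificate sample images;
    separately performing word segmentation on each item of text data of each piece of annotated sample data corresponding to the plurality of real estate certificate sample images, to obtain an annotated sample word set corresponding to each of the plurality of real estate certificate sample images;
    performing, based on the image identifier, abnormal relationship lookup on the OCR-recognized sample word set and the annotated sample word set corresponding to each of the plurality of real estate certificate sample images, to obtain an abnormal relationship data set to be counted corresponding to the plurality of real estate certificate sample images, the abnormal relationship data set to be counted comprising: OCR-recognized sample words and annotated sample words;
    separately performing, on the abnormal relationship data set to be counted, probability statistics of each annotated sample word corresponding to each OCR-recognized sample word, to obtain an annotated sample word probability value set corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted;
    separately obtaining the largest annotated sample word probability value from the annotated sample word probability value set corresponding to each OCR-recognized sample word in the abnormal relationship data set to be counted, to obtain a target annotated sample word probability value corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted;
    separately determining, according to the target annotated sample word probability value corresponding to each OCR-recognized sample word in the abnormal relationship data set to be counted, a target annotated sample word corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted;
    obtaining the abnormal relationship correspondence data set according to all the OCR-recognized sample words in the abnormal relationship data set to be counted and their corresponding target annotated sample words.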
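The construction of the abnormal relationship correspondence data set can be sketched in Python as below. The sketch assumes that segmentation has already produced one word list per image identifier for both the OCR output and the annotated data, and that words at the same index correspond to each other; this positional alignment is a simplification chosen for illustration and is not stated in the claim. The name `build_abnormal_mapping` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_abnormal_mapping(ocr_words_by_image, annotated_words_by_image):
    """Return {ocr_word: (most_likely_annotated_word, probability)}."""
    pair_counts = defaultdict(Counter)
    for image_id, ocr_words in ocr_words_by_image.items():
        # The image identifier links OCR output to its annotated counterpart.
        annotated_words = annotated_words_by_image.get(image_id, [])
        # Assumption: words at the same index correspond to each other.
        for ocr_word, annotated_word in zip(ocr_words, annotated_words):
            if ocr_word != annotated_word:  # an abnormal relationship
                pair_counts[ocr_word][annotated_word] += 1

    mapping = {}
    for ocr_word, counts in pair_counts.items():
        total = sum(counts.values())
        # Keep the annotated word with the largest probability value.
        best_word, best_count = counts.most_common(1)[0]
        mapping[ocr_word] = (best_word, best_count / total)
    return mapping


# Hypothetical sample data keyed by image identifier.
ocr_samples = {"img1": ["房尾", "用途"], "img2": ["房尾", "任宅"], "img3": ["房尾"]}
annotated_samples = {"img1": ["房屋", "用途"], "img2": ["房屋", "住宅"], "img3": ["房产"]}
abnormal_mapping = build_abnormal_mapping(ocr_samples, annotated_samples)
```

Words that were recognized correctly ("用途" here) never enter the mapping, so the stored data set contains only observed error-to-correction relationships.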
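  14. The computer device according to claim 13, wherein the step of separately performing, on the abnormal relationship data set to be counted, probability statistics of each annotated sample word corresponding to each OCR-recognized sample word, to obtain the annotated sample word probability value set corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted, comprises:
    obtaining all the OCR-recognized sample words from the abnormal relationship data set to be counted, to obtain a set of OCR-recognized sample words to be deduplicated;
    deduplicating the OCR-recognized sample words in the set of OCR-recognized sample words to be deduplicated, to obtain a plurality of target OCR-recognized sample words;
    separately counting the number of times each target OCR-recognized sample word appears in the abnormal relationship data set to be counted, to obtain a total number of correspondences for each of the target OCR-recognized sample words;
    separately counting, for each target OCR-recognized sample word, the number of occurrences of each annotated sample word that corresponds to it in the abnormal relationship data set to be counted, to obtain a target occurrence count for each annotated sample word corresponding to each of the target OCR-recognized sample words;
    dividing the target occurrence count of each annotated sample word corresponding to a same target OCR-recognized sample word by the total number of correspondences of that target OCR-recognized sample word, to obtain the annotated sample word probability value set corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted.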
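  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the following method steps:
    obtaining a certificate image to be recognized of a target real estate certificate, and performing text recognition on the certificate image to be recognized by using OCR technology, to obtain text data to be corrected;
    preprocessing the text data to be corrected, to obtain preprocessed text data;
    performing word segmentation on the preprocessed text data, to obtain a set of words to be corrected;
    obtaining a knowledge base dictionary, and separately performing unmatched-word lookup in the knowledge base dictionary for each word to be corrected in the set of words to be corrected, to obtain an unmatched word set;
    obtaining an abnormal relationship correspondence data set, and separately matching each unmatched word in the unmatched word set against the abnormal relationship correspondence data set, to obtain a target abnormal relationship correspondence data set;
    performing erroneous word replacement on the preprocessed text data by using the target abnormal relationship correspondence data set and the unmatched word set, to obtain target text data corresponding to the target real estate certificate.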
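  16. The computer-readable storage medium according to claim 15, wherein the step of performing word segmentation on the preprocessed text data, to obtain the set of words to be corrected, comprises:
    separately performing word segmentation on each item of text data in the preprocessed text data, to obtain a plurality of words to be corrected and position data corresponding to each of the plurality of words to be corrected;
    determining the set of words to be corrected according to the plurality of words to be corrected and the position data corresponding to each of the plurality of words to be corrected.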
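  17. The computer-readable storage medium according to claim 15, wherein the step of separately performing unmatched-word lookup in the knowledge base dictionary for each word to be corrected in the set of words to be corrected, to obtain the unmatched word set, comprises:
    separately matching each word to be corrected in the set of words to be corrected against the knowledge base dictionary, to obtain a plurality of knowledge base matching results;
    when a knowledge base matching result is a failure, determining the unmatched word set according to the failed knowledge base matching results.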
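  18. The computer-readable storage medium according to claim 15, wherein, before the step of obtaining the abnormal relationship correspondence data set, the method steps further comprise:
    obtaining a plurality of real estate certificate sample images, each real estate certificate sample image carrying an image identifier;
    separately performing text recognition on each real estate certificate sample image by using the OCR technology, to obtain OCR-recognized sample data corresponding to each of the plurality of real estate certificate sample images;
    sending the plurality of real estate certificate sample images and their corresponding OCR-recognized sample data to an annotation terminal for error correction;
    obtaining annotated sample data, sent by the annotation terminal, corresponding to each of the plurality of real estate certificate sample images;
    performing word segmentation and abnormal relationship correspondence determination according to the image identifier, the OCR-recognized sample data and the annotated sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set;
    storing the abnormal relationship correspondence data set in a database.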
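  19. The computer-readable storage medium according to claim 18, wherein the step of performing word segmentation and abnormal relationship correspondence determination according to the image identifier, the OCR-recognized sample data and the annotated sample data corresponding to each of the plurality of real estate certificate sample images, to obtain the abnormal relationship correspondence data set, comprises:
    separately performing word segmentation on each item of text data of each piece of OCR-recognized sample data corresponding to the plurality of real estate certificate sample images, to obtain an OCR-recognized sample word set corresponding to each of the plurality of real estate certificate sample images;
    separately performing word segmentation on each item of text data of each piece of annotated sample data corresponding to the plurality of real estate certificate sample images, to obtain an annotated sample word set corresponding to each of the plurality of real estate certificate sample images;
    performing, based on the image identifier, abnormal relationship lookup on the OCR-recognized sample word set and the annotated sample word set corresponding to each of the plurality of real estate certificate sample images, to obtain an abnormal relationship data set to be counted corresponding to the plurality of real estate certificate sample images, the abnormal relationship data set to be counted comprising: OCR-recognized sample words and annotated sample words;
    separately performing, on the abnormal relationship data set to be counted, probability statistics of each annotated sample word corresponding to each OCR-recognized sample word, to obtain an annotated sample word probability value set corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted;
    separately obtaining the largest annotated sample word probability value from the annotated sample word probability value set corresponding to each OCR-recognized sample word in the abnormal relationship data set to be counted, to obtain a target annotated sample word probability value corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted;
    separately determining, according to the target annotated sample word probability value corresponding to each OCR-recognized sample word in the abnormal relationship data set to be counted, a target annotated sample word corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted;
    obtaining the abnormal relationship correspondence data set according to all the OCR-recognized sample words in the abnormal relationship data set to be counted and their corresponding target annotated sample words.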
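  20. The computer-readable storage medium according to claim 19, wherein the step of separately performing, on the abnormal relationship data set to be counted, probability statistics of each annotated sample word corresponding to each OCR-recognized sample word, to obtain the annotated sample word probability value set corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted, comprises:
    obtaining all the OCR-recognized sample words from the abnormal relationship data set to be counted, to obtain a set of OCR-recognized sample words to be deduplicated;
    deduplicating the OCR-recognized sample words in the set of OCR-recognized sample words to be deduplicated, to obtain a plurality of target OCR-recognized sample words;
    separately counting the number of times each target OCR-recognized sample word appears in the abnormal relationship data set to be counted, to obtain a total number of correspondences for each of the target OCR-recognized sample words;
    separately counting, for each target OCR-recognized sample word, the number of occurrences of each annotated sample word that corresponds to it in the abnormal relationship data set to be counted, to obtain a target occurrence count for each annotated sample word corresponding to each of the target OCR-recognized sample words;
    dividing the target occurrence count of each annotated sample word corresponding to a same target OCR-recognized sample word by the total number of correspondences of that target OCR-recognized sample word, to obtain the annotated sample word probability value set corresponding to each of the OCR-recognized sample words in the abnormal relationship data set to be counted.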
PCT/CN2021/091716 2020-12-15 2021-04-30 Method, apparatus, device and medium for determining real estate certificate information based on OCR recognition WO2022126986A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011482625.8A CN112528882B (zh) 2020-12-15 2020-12-15 Method, apparatus, device and medium for determining real estate certificate information based on OCR recognition
CN202011482625.8 2020-12-15

Publications (1)

Publication Number Publication Date
WO2022126986A1 (zh) 2022-06-23

Family

ID=75000367

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091716 WO2022126986A1 (zh) 2020-12-15 2021-04-30 Method, apparatus, device and medium for determining real estate certificate information based on OCR recognition

Country Status (2)

Country Link
CN (1) CN112528882B (zh)
WO (1) WO2022126986A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
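CN112528882B (zh) * 2020-12-15 2024-05-10 平安科技(深圳)有限公司 Method, apparatus, device and medium for determining real estate certificate information based on OCR recognition
CN113837118B (zh) * 2021-09-28 2024-04-26 支付宝(杭州)信息技术有限公司 Method and apparatus for obtaining text variation relationships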

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9292737B2 (en) * 2008-01-18 2016-03-22 Mitek Systems, Inc. Systems and methods for classifying payment documents during mobile image processing
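CN108376129A (zh) * 2018-01-24 2018-08-07 北京奇艺世纪科技有限公司 Error correction method and apparatus
CN110909725A (zh) * 2019-10-18 2020-03-24 平安科技(深圳)有限公司 Method, apparatus, device and storage medium for recognizing text
CN112016304A (zh) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method and apparatus, electronic device, and storage medium
CN112528882A (zh) * 2020-12-15 2021-03-19 平安科技(深圳)有限公司 Method, apparatus, device and medium for determining real estate certificate information based on OCR recognition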

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107133622B (zh) * 2016-02-29 2022-08-26 阿里巴巴集团控股有限公司 Word segmentation method and apparatus
US10740602B2 (en) * 2018-04-18 2020-08-11 Google Llc System and methods for assigning word fragments to text lines in optical character recognition-extracted data
CN110765996B (zh) * 2019-10-21 2022-07-29 北京百度网讯科技有限公司 Text information processing method and apparatus
CN111753531B (zh) * 2020-06-28 2024-03-12 平安科技(深圳)有限公司 Artificial intelligence-based text error correction method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN112528882A (zh) 2021-03-19
CN112528882B (zh) 2024-05-10

Similar Documents

Publication Publication Date Title
US8391614B2 (en) Determining near duplicate “noisy” data objects
US8468167B2 (en) Automatic data validation and correction
US8014604B2 (en) OCR of books by word recognition
US9152859B2 (en) Property record document data verification systems and methods
US10713306B2 (en) Content pattern based automatic document classification
WO2022126986A1 (zh) Method, apparatus, device and medium for determining real estate certificate information based on OCR recognition
US20070217715A1 (en) Property record document data validation systems and methods
WO2021143088A1 (zh) Method and apparatus for synchronous detection of multiple certificate types, computer device, and storage medium
US9836520B2 (en) System and method for automatically validating classified data objects
US20140369606A1 (en) Automated field position linking of indexed data to digital images
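CN109325042B (zh) Processing template acquisition method, table processing method, apparatus, device and medium
CN111783460A (zh) Enterprise abbreviation extraction method and apparatus, computer device, and storage medium
CN111783710B (zh) Information extraction method and system for medical photocopies
CN114357174B (zh) Code classification system and method based on OCR and machine learning
CN112559526A (zh) Data table export method and apparatus, computer device, and storage medium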
Riad et al. Classification and information extraction for complex and nested tabular structures in images
US20070217691A1 (en) Property record document title determination systems and methods
US11593417B2 (en) Assigning documents to entities of a database
WO2021250600A1 (en) Methods and systems for matching and optimizing technology solutions to requested enterprise products
US20240054587A1 (en) Systems and methods for electronic document notarization
US20240062572A1 (en) Text data structuring method and apparatus using line information
CN117034864B (zh) Visual annotation method and apparatus, computer device, and storage medium
US20230140546A1 (en) Randomizing character corrections in a machine learning classification system
US20240176949A1 (en) Systems and methods for generating document templates from a mixed set of document types
Sarker et al. A programming based handwritten text identification

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21904925

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21904925

Country of ref document: EP

Kind code of ref document: A1