WO2021072876A1 - Credential image classification method and apparatus, computer device, and readable storage medium - Google Patents

Credential image classification method and apparatus, computer device, and readable storage medium

Info

Publication number
WO2021072876A1
WO2021072876A1 PCT/CN2019/118392 CN2019118392W WO2021072876A1 WO 2021072876 A1 WO2021072876 A1 WO 2021072876A1 CN 2019118392 W CN2019118392 W CN 2019118392W WO 2021072876 A1 WO2021072876 A1 WO 2021072876A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
credential
image
field
fields
Prior art date
Application number
PCT/CN2019/118392
Other languages
English (en)
French (fr)
Inventor
黄文韬
刘鹏
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021072876A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/153: Segmentation of character regions using recognition of characters or words
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the technical field of certificate image classification, and in particular to a method, device, computer equipment, and computer-readable storage medium for certificate image classification.
  • OCR-based text recognition has been widely used to extract information.
  • A general-purpose OCR (Optical Character Recognition) model performs undifferentiated full-text recognition directly on input text images.
  • In many cases, what is needed is not the entire content of a credential; different credentials carry different customization requirements, and preset information must be extracted selectively from the target file corresponding to each credential.
  • The processing logic corresponding to the credential type is then called to process the credential. Therefore, before text is extracted with the OCR model, the documents must be classified, and the classification result determines which logic to call to meet the customized extraction requirements of that document type. This is especially true for document types whose features are not distinctive and that are hard to distinguish by appearance alone, such as form-style documents printed on A4 paper; many different documents share similar appearance characteristics.
  • Using a general object recognition model to distinguish such document types is therefore difficult: even after training, the model struggles to separate the documents to be classified and to identify the exact document type, so accurate classification cannot be achieved with a general object recognition model alone.
  • the embodiments of the present application provide a certificate image classification method, device, computer equipment, and computer readable storage medium, which can solve the problem of low classification accuracy in the traditional technology when the certificate image is classified by the general object recognition model.
  • an embodiment of the present application provides a method for classifying a credential image.
  • The method includes: obtaining a credential image to be classified; extracting all fields contained in the credential image based on an OCR model; generating a vector of the credential image from the fields in a first preset manner; determining whether a preset vector set contains a vector that matches the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; and, if the vector set contains a vector matching the vector of the credential image, taking that vector as the target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector.
  • an embodiment of the present application also provides a certificate image classification device.
  • The device includes: an acquisition unit for acquiring a credential image to be classified; an extraction unit for extracting, based on the OCR model, all fields contained in the credential image; a first generating unit for generating a vector of the credential image from the fields in a first preset manner; a first judging unit for determining whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; and a first classification unit for, if the vector set contains a vector matching the vector of the credential image, taking that vector as the target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector.
  • an embodiment of the present application also provides a computer device, which includes a memory and a processor, the memory stores a computer program, and the processor implements the certificate image classification method when the computer program is executed.
  • The embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the credential image classification method.
  • FIG. 1 is a schematic diagram of an application scenario of the certificate image classification method provided by an embodiment of the application
  • FIG. 2 is a schematic flowchart of a method for classifying a credential image provided by an embodiment of the application
  • FIG. 3 is a schematic diagram of a sub-process of the certificate image classification method provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of the process of generating a feature field set as a second field set through the recognition result of the OCR model in the certificate image classification method provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of another process of a method for classifying a credential image according to an embodiment of the application
  • Fig. 6 is a schematic block diagram of the credential image classification device provided by an embodiment of the application.
  • FIG. 7 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic diagram of an application scenario of a method for classifying a credential image provided by an embodiment of this application.
  • the application scenarios include: (1) Terminal. The terminal is used to photograph the electronic version of the certificate to obtain the certificate image.
  • the terminal in Figure 1 is connected to the server.
  • (2) Server. The server receives the credential image sent by the terminal and executes the steps of the credential image classification method.
  • the server is connected to the terminal through a wired network or a wireless network.
  • the working process of each subject in Figure 1 is as follows: the terminal takes the image of the credential to obtain the credential image, and sends the credential image to the server; the server obtains the credential image, and extracts all the fields contained in the credential image based on the OCR model.
  • The vector of the credential image is generated in the first preset manner, and it is judged whether the preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the preset manner and corresponding to credential images of different credential types; if the vector set contains a vector matching the vector of the credential image, the matching vector is taken as the target vector, and the credential type of the credential image is determined according to the credential type corresponding to the target vector.
  • FIG. 1 only shows a mobile phone as a terminal.
  • the type of the terminal is not limited to that shown in FIG. 1.
  • the terminal may also be a smart watch, a notebook computer, or a tablet computer.
  • the application scenario of the above-mentioned certificate image classification method is only used to illustrate the technical solution of the present application, and is not used to limit the technical solution of the present application, and the foregoing connection relationship may also have other forms.
  • FIG. 2 is a schematic flowchart of a method for classifying a credential image provided by an embodiment of the application.
  • the certificate image classification method is applied to the server in FIG. 1 to complete all or part of the functions of the certificate image classification method.
  • Referring to FIG. 2, as shown in FIG. 2, the method includes the following steps S201-S206:
  • When a service must adapt to different credential types, the credential type to which a credential belongs must be classified before the subsequent logic proceeds, so that the processing logic for that credential type can be retrieved according to the credential's type.
  • For example, a service may process documents such as ID cards, driving licenses, and resumes.
  • After the credential image is obtained, it must first be determined whether the image shows an ID card, a driving license, or a resume, so that the ID-card, driving-license, or resume processing logic can be retrieved accordingly to process the credential image.
  • The electronic-version image of the credential can first be captured by the terminal's camera to obtain the credential image; the terminal then sends the credential image to the server, which obtains the credential image to be classified and classifies it further.
  • the server extracts all the fields included in the credential image based on the OCR model, that is, the server detects and recognizes all the characters included in the credential image based on the OCR model to extract all the fields included in the credential image.
  • OCR (Optical Character Recognition) refers to the process of analyzing and recognizing image files of textual material to obtain text and layout information; that is, the text in the image is recognized and returned in text form.
  • Text extraction with the OCR model includes the following steps: 1) The OCR model receives the credential image. 2) The OCR model preprocesses the credential image; preprocessing typically corrects imaging problems, and common steps include binarization, noise removal, and skew correction, for example geometric transformation (perspective, warping, rotation, etc.), distortion correction, deblurring, image enhancement, and illumination correction. 3) The OCR model performs text detection on the credential image, i.e. it detects the location, extent, and layout of the text, usually including layout analysis and text-line detection; commonly used detection methods include text detection models such as Faster R-CNN, FCN, and RRPN (Rotation Region Proposal Networks). 4) The OCR model performs character recognition on the credential image: on top of text detection, the text content is recognized, converting the image-form text contained in the image into editable text; text recognition network structures include the CRNN model and structures that introduce an attention mechanism. 5) The OCR model outputs the recognized text.
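  • As an illustration of the pipeline just described, the following is a minimal sketch assuming OpenCV and pytesseract are installed; the patent does not prescribe a particular OCR engine, detector (Faster R-CNN, FCN, RRPN), or recognizer (CRNN), so this simply stands in for the receive-preprocess-detect-recognize-output steps.

```python
# Minimal OCR field-extraction sketch (assumed tooling: OpenCV + pytesseract;
# the patent itself does not mandate a specific OCR implementation).
import cv2
import pytesseract

def extract_fields(image_path):
    # 1) Receive the credential image.
    img = cv2.imread(image_path)

    # 2) Preprocess: grayscale + Otsu binarization (deskew, denoise, etc. omitted).
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 3)-5) Detect and recognize the text, then return it as candidate field tokens.
    text = pytesseract.image_to_string(binary, lang="chi_sim+eng")
    return [token for token in text.split() if token.strip()]

fields = extract_fields("credential.jpg")
print(fields)
```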
  • The vector of the credential image is generated from the fields in the first preset manner as follows: 1) First, a total field set containing, without repetition, all fixed fields of all credential types is generated: the OCR model is applied to multiple credential images of each credential type to identify the fixed fields shared by those images, and the fixed fields of all credential types are combined into a total field set with no repeated fields. 2) Then, based on all fields contained in the credential image, the number of times each field of the total field set appears in the credential image is counted, with fields that do not appear recorded as 0, yielding a number sequence for the credential image in the same field order as the total field set; this sequence is the vector of the credential image, generated in the same way as the pre-generated vector of each credential type in the vector set.
  • For example, if the fields extracted from the credential image are A, C, F, and G, where A appears 2 times, C 5 times, F once, and G 6 times, and the pre-generated total field set is ordered A, B, C, D, E, F, G, H, then the generated vector of the credential image is {2, 0, 5, 0, 0, 1, 6, 0}.
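  • The counting scheme can be made concrete with the short sketch below, which rebuilds the {2, 0, 5, 0, 0, 1, 6, 0} example; the single-letter field names stand in for real fixed fields and are purely illustrative.

```python
from collections import Counter

# Ordered total field set (union of the fixed fields of all credential types).
TOTAL_FIELD_SET = ["A", "B", "C", "D", "E", "F", "G", "H"]

def image_vector(extracted_fields):
    """Count, for each field of the total set, how often it occurs in the image;
    absent fields are recorded as 0, in the fixed order of the total set."""
    counts = Counter(extracted_fields)
    return [counts.get(field, 0) for field in TOTAL_FIELD_SET]

# Fields extracted by OCR: A x2, C x5, F x1, G x6.
ocr_fields = ["A"] * 2 + ["C"] * 5 + ["F"] * 1 + ["G"] * 6
print(image_vector(ocr_fields))   # -> [2, 0, 5, 0, 0, 1, 6, 0]
```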
  • S204: Determine whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types.
  • The judgment of similarity between two images is converted into a judgment of how close the two vectors corresponding to the two images are.
  • For example, the cosine similarity between the two vectors can be calculated to determine how similar the two images are and thereby classify the image: the greater the cosine similarity, the closer the two vectors and the more similar the two images.
  • Alternatively, the distance between the two points corresponding to the two vectors can be calculated: the shorter the distance, the closer the two vectors and the more similar the images, and the credential image is classified accordingly.
  • If the closeness of the two vectors meets the requirement, it is determined that the vector set contains a vector that satisfies the preset condition with respect to the vector of the credential image, i.e. a vector matching the vector of the credential image.
  • The matching vector is taken as the target vector, and the credential type of the credential image is determined according to the credential type corresponding to the target vector, thereby classifying the credential image.
  • If the vector set contains no vector matching the vector of the credential image, the credential image cannot be classified into any credential type in the vector set and classification fails; in that case, a classification-failure prompt can be given so that the user handles the image manually.
  • Because the text content of a text-type file is more representative than the whole image itself, classification of the image is converted into a judgment on conditions derived from the text content: the comparison of similarity between credential images becomes a comparison of similarity between the vectors derived from the text they contain, which is then used to judge the credential type. Compared with judging the credential type directly from the image, classification based on text content performs better than direct classification with an ordinary object recognition model and improves the accuracy of credential-type identification.
  • In particular, for text-image types whose image features are not distinctive and are easily confused with other file types, as long as the type has a fixed format and a representative combination of fields, classifying the credential image by the OCR field recognition results improves classification accuracy.
  • At the same time, the classification process becomes more automated: especially when large batches of unclassified text files are processed, accurate credential-type classification results can be obtained quickly, improving the simplicity, efficiency, and accuracy of credential image classification. This considerably reduces the labor and time cost of manual classification, so that customized services corresponding to each classification can be configured more quickly from the classification results, improving overall document-processing efficiency.
  • In one embodiment, before the step of determining whether the preset vector set contains a vector matching the vector of the credential image, the method further includes: generating the vector set.
  • Since this embodiment classifies credential images using the fields shared by multiple credential images of the same credential type as the basis for judging the credential type, and these shared fields are the fixed fields of the credential, the fields contained in a credential image must be extracted with the OCR model.
  • To classify files by credential type using OCR recognition results, a category library, i.e. the vector set, must be established before classifying credential images; it tells the service which conditions an OCR recognition result must satisfy to be recognized as belonging to a given credential type.
  • The vector of the credential image corresponding to each credential type is generated in the first preset manner, and the vector set is composed of the respective vectors of all credential types.
  • FIG. 3 is a schematic diagram of a sub-process of the certificate image classification method provided by the embodiment of this application.
  • Steps S301 to S307 generate, in the first preset manner, a vector for each credential image of the different credential types, and the respective vectors of all credential types are then combined into a set to generate the vector set; that is, the step of generating the vector set includes the following steps S301-S308:
  • For each credential image, the text recognition result is obtained with the OCR model to extract all fields contained in the image, and the number of occurrences of each field is counted to generate the first field set corresponding to that credential image.
  • For example, if a credential image contains the 5 fields A, B, C, D, and E, where A appears 2 times, B 4 times, C 6 times, D once, and E 2 times, the first field set {(A, 2), (B, 4), (C, 6), (D, 1), (E, 2)} can be formed.
  • The above process of generating the first field set is repeated for each of the acquired credential images, yielding the first field set of each of the multiple credential images.
  • Because the multiple credential images belong to the same credential type and share a common format or template, they have shared fields, and the shared fields have shared attributes such as the number of occurrences.
  • For example, from eight credential images that are all ID cards, eight first field sets are obtained; comparing them, the shared fields include the four fields A, B, C, and D, where A appears 2 times, B 4 times, C 6 times, and D once, forming the set {(A, 2), (B, 4), (C, 6), (D, 1)}.
  • A preset number of the shared fields may be extracted in a second preset manner to form the second field set, or all shared fields may be used as the second field set; that is, the text extracted from each sample is compared, the fields shared by all samples are extracted to form the second field set, and the second field set and the corresponding credential type are stored in the classification category library for subsequent use in classifying credential images.
  • The second preset manner extracts a preset number of shared fields according to the frequency with which each field appears, either from the highest frequency downward or from the lowest frequency upward.
  • The preset number is chosen so that the credential type can be identified from that number of shared fields: no two credential types may yield the same number of the same fields with the same occurrence counts as their second field sets. Further, it can be checked, for every pair of credential types, whether the preset number of extracted shared fields and the occurrence count of each field are identical; if they are, the shared fields of at least one of the two credential types must be re-extracted to form a new second field set, so that for every pair of credential types the extracted shared fields and their occurrence counts differ.
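  • A minimal sketch of this category-library construction (first field sets, shared fields, second field set) is given below; it assumes each sample has already been reduced to its OCR field list, and the sample data are invented for illustration.

```python
from collections import Counter

def first_field_set(fields):
    """S302: per-image field counts, e.g. Counter({'A': 2, 'B': 1, ...})."""
    return Counter(fields)

def shared_fields(field_sets):
    """S303: fields present in every first field set, with a representative count
    (the most common per-image count, echoing the threshold rule in the text)."""
    common = set(field_sets[0])
    for fs in field_sets[1:]:
        common &= set(fs)
    return {f: Counter(fs[f] for fs in field_sets).most_common(1)[0][0] for f in common}

def second_field_set(shared, top_n=None):
    """S304: keep all shared fields, or only the top_n most frequent ones."""
    if top_n is None:
        return dict(shared)
    ranked = sorted(shared.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])

samples = [["A", "A", "B", "C", "D"], ["A", "A", "B", "C", "C"], ["A", "A", "B", "C", "E"]]
fsets = [first_field_set(s) for s in samples]
print(second_field_set(shared_fields(fsets)))   # e.g. {'A': 2, 'B': 1, 'C': 1}
```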
  • Figure 4 is a schematic diagram of the process of generating a feature field set as the second field set from the recognition result of the OCR model in the document image classification method provided by the embodiment of the application.
  • Taking the case where all shared fields are used as the second field set as an example: multiple samples of the same type are uploaded to the OCR model, which extracts the text content of each sample image to obtain field recognition results; the recognized results are compared, and the fields shared by all results, together with their occurrence counts, are combined into a feature field set. The feature field set and the corresponding credential type are stored in the classification library formed from the field sets used for classification. Using all shared fields as the second field set improves the identification accuracy of the credential type and avoids confusing different credential types as the same type, which could happen if only some shared fields were extracted as the second field set.
  • Further, with the help of the OCR recognition results, the characteristic identifying fields of the credential type are extracted: after recognizing multiple samples of the same type, the recognition results are analyzed to find fields that appear in all samples. When the proportion of samples in which a field appears exceeds a certain threshold, the field is judged to be a feature field; when the proportion of samples in which the field appears the same number of times exceeds a certain threshold, that count is recorded as the field's number of occurrences. For example, if "name" appears three or more times in 90% of the samples but four or more times in only 10%, the occurrence count of "name" is recorded as 3: the feature field of the type is identified from the high-probability case, and the low-probability case may be a special situation.
  • Once the field sets for judging each credential-type category have been formed, all fields appearing in any category are combined into a total set with no repeated fields; that is, all fields appearing in all of the second field sets constitute a total field set without repeated fields.
  • For example, suppose there are 4 credential types whose second field sets are: type 1, ABC; type 2, ACD; type 3, BCDE; type 4, CDEFG. Taking the union of all fields appearing in types 1-4 gives the total set ABCDEFG without repeated fields, forming the seven dimensions A, B, C, D, E, F, G.
  • For each second field set, according to the number of times each field in the second field set appears in the credential images of the corresponding credential type, the number of times each field of the total field set appears in that second field set is counted; fields of the total field set that do not appear in the second field set are recorded as 0. This yields the number sequence belonging to the credential type corresponding to the second field set.
  • In other words, for the field set of each category, the number of times each field of the total set appears in that category's field set is computed, with 0 recorded for absent fields. For example, for type 1 (ABC), the fields D, E, F, and G are each recorded as 0; for type 2 (ACD), the fields B, E, F, and G are each recorded as 0. A field does not necessarily appear only once: in a contract, for instance, "Party A" may appear 4 times and "Party B" twice.
  • The same field may also appear a different number of times in different categories; for example, A may appear 3 times in category 1 and 5 times in category 2.
  • For example, the occurrence counts of the fields A, B, and C of the total set in category 1 are 1, 2, and 3 respectively, and the occurrence counts of the fields A, C, and D of the total set in category 2 are 3, 5, and 6 respectively, with absent fields recorded as 0.
  • S307 Sort the number sequence according to the preset order of the fields, so as to obtain the vector of the certificate type corresponding to the second field set.
  • For each category, the number sequence computed in the previous step is arranged in a fixed order based on the fields of the total set to form a vector for that category.
  • There is no requirement on the order of the fields in the total set itself; it is only required that the field order of the total set and the field order of each individual category be the same, so that comparable vectors are formed. For example, if the field order of the total set is A, C, D, F, E, the fields of each individual category must also be arranged in the order A, C, D, F, E to form the corresponding vector.
  • Continuing the earlier example (total set ABCDEFG, type 1 counts 1, 2, 3 for A, B, C and type 2 counts 3, 5, 6 for A, C, D), the vector formed for type 1 is (1, 2, 3, 0, 0, 0, 0) and the vector formed for type 2 is (3, 0, 5, 6, 0, 0, 0).
  • The above process of generating, for each second field set, the vector of the corresponding credential type is repeated for the second field sets of all the credential types, yielding the respective vectors of the multiple credential types; these vectors are combined into a set to generate the vector set.
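  • The total field set and per-type vectors can be sketched as follows, reproducing the ABCDEFG example above; the counts used for types 3 and 4 are invented, since the text names only their fields.

```python
def build_total_field_set(second_field_sets):
    """S305: union of all fields over all categories, in a fixed (here sorted) order."""
    fields = set()
    for counts in second_field_sets.values():
        fields |= set(counts)
    return sorted(fields)

def type_vector(counts, total):
    """S306/S307: occurrence counts in the fixed order of the total set, 0 if absent."""
    return [counts.get(field, 0) for field in total]

second_sets = {
    "type1": {"A": 1, "B": 2, "C": 3},
    "type2": {"A": 3, "C": 5, "D": 6},
    "type3": {"B": 1, "C": 2, "D": 3, "E": 4},      # illustrative counts
    "type4": {"C": 1, "D": 2, "E": 3, "F": 4, "G": 5},
}
total = build_total_field_set(second_sets)
vector_set = {name: type_vector(c, total) for name, c in second_sets.items()}   # S308
print(total)                 # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
print(vector_set["type1"])   # [1, 2, 3, 0, 0, 0, 0]
print(vector_set["type2"])   # [3, 0, 5, 6, 0, 0, 0]
```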
  • After the step of obtaining the number sequence belonging to the credential type corresponding to the second field set, the method further includes: according to the number of occurrences of each field in the second field set, configuring a weight, in a third preset manner, for the number corresponding to each field in the number sequence corresponding to the second field set.
  • Specifically, the generated feature fields of all categories are extracted and all fields are combined into a non-repeating, ordered total field set. For each category, the number of times each field of the total set appears in that category is computed, with 0 recorded for absent fields; at the same time, the number of categories in which each field appears is computed, and the fewer categories a field occupies, the higher the weight assigned to it. The occurrence count and the weight are multiplied to form the value of that field for the category, and combining these values in the order of the total set generates a vector serving as the feature vector of the type.
  • Here, frequency (also called the count) is the number of times a value representing a certain characteristic (a mark value) occurs among the variable values. For example, if the field "name" appears in 8 of 10 categories, it is not very distinctive as a basis for classification and is given a lower weight; if "name" appears in only 1 of 10 categories, it is highly distinctive, can serve as a stronger classification basis, and is given a higher weight. For example, consider the categories in Table 1:
  • Suppose there are three classifications, type 1, type 2, and type 3, where type 1 contains the feature fields B and C, type 2 contains the feature fields A and C, and type 3 contains the feature fields B, C, and D; the feature fields contained in the target credential are A, C, and D. The occurrence counts shown in the table are: type 1: B appears 2 times, C once; type 2: A once, C 2 times; type 3: B, C, and D once each; target credential: A 2 times, C once, D 3 times.
  • The respective weights of A, B, C, and D can be computed from the number of categories among type 1, type 2, and type 3 in which each appears. One way is to take n/m as the weight, where n is the number of those categories in which the field appears and m is the sum of these counts over all fields: A appears in 1 category, B in 2, C in 3, and D in 1, so m = 1 + 2 + 3 + 1 = 7.
  • The weights are therefore: (1) A appears only in type 2, so its frequency is 1 and its weight is 1/7; (2) B appears in types 1 and 3, so its frequency is 2 and its weight is 2/7; (3) C appears in types 1, 2, and 3, so its frequency is 3 and its weight is 3/7; (4) D appears only in type 3, so its weight is 1/7. In this way of expressing the weights, the weights of A, B, C, and D sum to 1.
  • The value of each component of a credential's vector is computed as: number of occurrences of the field × weight. With the components ordered A, B, C, D: for type 1, B = 2 × 2/7 = 4/7 and C = 1 × 3/7 = 3/7, so its vector is (0, 4/7, 3/7, 0); for type 2, A = 1 × 1/7 = 1/7 and C = 2 × 3/7 = 6/7, so its vector is (1/7, 0, 6/7, 0); for type 3, the vector is (0, 2/7, 3/7, 1/7); for the target credential, A = 2 × 1/7 = 2/7, C = 1 × 3/7 = 3/7, and D = 3 × 1/7 = 3/7, so its vector is (2/7, 0, 3/7, 3/7).
  • In one embodiment, the weight configured for the number corresponding to a field is inversely proportional to the number of times the field appears.
  • In the second field set corresponding to a credential type's feature field set, the aim is that the fewer credential types a field occupies, i.e. the less frequently the field appears across credential types, the higher the weight given to it; the field's occurrence count is multiplied by the weight to form the value of that field for the category. Weights describe the relative importance of factors or indicators, i.e. their contribution or importance. In one weighting scheme the weights of A, B, C, and D sum to 1; in another they need not sum to 1. For example, the weight may instead be 1/m, where m is the field's own frequency of appearance across all categories.
  • Under this alternative scheme the weights are allocated as follows: (1) among types 1, 2, and 3, A appears only in type 2, so its frequency is 1 and its weight is 1; (2) B appears in types 1 and 3, so its frequency is 2 and its weight is 0.5; (3) C appears in types 1, 2, and 3, so its frequency is 3 and its weight is 0.3; (4) D appears only in type 3, so its weight is 1.
  • Here, frequency (also called the count) means that if f trials are carried out under the same conditions, the number of occurrences m of event A in those f trials is called the frequency of event A. The larger the weight value, the stronger the expressive power of the feature item, and the smaller the weight, the weaker its expressive power; what matters is that a single consistent standard is used to assign the weights.
  • In the same way, by configuring weights for the fields in the second field set of each individual credential type, a feature vector over the total set can also be generated for each individually input credential sample; configuring weights for the feature fields of each credential type reflects the importance of different fields in judging the credential type and can improve the accuracy and efficiency of credential image classification.
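  • The sketch below reproduces the n/m worked example (weights 1/7, 2/7, 3/7, 1/7 and the type 2 vector (1/7, 0, 6/7, 0)); the occurrence counts are taken from the example above and everything else is illustrative. The alternative per-field 1/m scheme would only change the weight formula, as noted in the comment.

```python
from fractions import Fraction

FIELDS = ["A", "B", "C", "D"]

def field_weights(categories):
    """Weight of a field = n / m, where n is the number of categories containing it
    and m is the sum of n over all fields (the alternative scheme in the text would
    instead use Fraction(1, n[f]) per field)."""
    n = {f: sum(1 for counts in categories.values() if f in counts) for f in FIELDS}
    m = sum(n.values())
    return {f: Fraction(n[f], m) for f in FIELDS}

def weighted_vector(counts, weights):
    """Component value = occurrence count of the field * its weight."""
    return [counts.get(f, 0) * weights[f] for f in FIELDS]

categories = {                      # occurrence counts from the Table 1 example
    "type1": {"B": 2, "C": 1},
    "type2": {"A": 1, "C": 2},
    "type3": {"B": 1, "C": 1, "D": 1},
}
w = field_weights(categories)
print(w)                                             # weights 1/7, 2/7, 3/7, 1/7
print(weighted_vector(categories["type2"], w))       # values 1/7, 0, 6/7, 0
print(weighted_vector({"A": 2, "C": 1, "D": 3}, w))  # target: 2/7, 0, 3/7, 3/7
```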
  • In one embodiment, the step of determining whether the preset vector set contains a vector matching the vector of the credential image includes: calculating the cosine similarity between the vector of the credential image and each vector contained in the vector set; judging whether there is a vector whose cosine similarity is not less than a preset cosine similarity threshold; if such a vector exists, determining that the vector set contains a vector matching the vector of the credential image; and if all cosine similarities are less than the preset threshold, determining that the vector set contains no vector matching the vector of the credential image.
  • the cosine similarity which can also be called the cosine distance, uses the cosine value of the angle between two vectors in the vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle is to 0 degrees, that is, the more similar the two vectors are, which is also called "cosine similarity".
  • the cosine similarity between the vector of the input certificate image and the vector of each type of certificate is calculated.
  • The cosine similarity can be calculated as follows: the cosine of the angle between two vectors A and B is obtained from the Euclidean dot product formula, cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²)), where Aᵢ and Bᵢ are the components of the vectors A and B respectively.
  • the similarity given ranges from -1 to 1.
  • -1 means that the two vectors point in exactly opposite directions
  • 1 means that their directions are exactly the same
  • 0 usually means that they are independent of each other.
  • the value between represents the similarity or dissimilarity in the middle.
  • the attribute vectors A and B are usually word frequency vectors in the document. Cosine similarity can be seen as a way to normalize the file length in the comparison process.
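  • A direct implementation of the dot-product formula, assuming numpy; returning 0 for a zero vector is a guard added here and is not discussed in the patent.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||); ranges from -1 to 1."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0.0:                      # guard: undefined for zero vectors
        return 0.0
    return float(np.dot(a, b) / norm)

print(cosine_similarity([2, 0, 5, 0, 0, 1, 6, 0], [1, 0, 4, 0, 0, 1, 5, 0]))  # ~0.99
```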
  • the similarity between the input document image and the various types of document types in the classification library is judged.
  • If a vector's cosine similarity with the credential image's vector is the largest and exceeds the preset cosine similarity threshold, the credential image can be considered to belong to the credential type corresponding to that vector, which makes the classification process more automated and improves the efficiency of credential classification.
  • In one embodiment, the step of taking the vector that matches the vector of the credential image as the target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector includes: if the vector set contains a vector matching the vector of the credential image, taking the matching vector as the target vector; if there is one target vector, determining the credential type corresponding to that target vector as the credential type of the credential image; and if there are multiple target vectors, determining the credential type corresponding to the target vector closest to the vector of the credential image as the credential type of the credential image.
  • That is, the matching vector is taken as the target vector; since a credential image can correspond to only one credential type, if there is a single target vector its credential type is taken as the credential type of the credential image, and if there are multiple target vectors, the credential type of the target vector closest to the credential image's vector is taken, thereby classifying the credential image.
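  • The matching rule just described can be sketched as follows: every category whose cosine similarity reaches the threshold becomes a candidate target vector, and the closest candidate wins. The threshold value and category names are illustrative, not prescribed by the patent.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / n) if n else 0.0

def classify(image_vec, vector_set, threshold=0.9):
    """Return the credential type of the best-matching vector, or None on failure."""
    similarities = {t: cosine_similarity(image_vec, v) for t, v in vector_set.items()}
    # Target vectors: all categories whose similarity is not less than the threshold.
    targets = {t: s for t, s in similarities.items() if s >= threshold}
    if not targets:
        return None                       # classification failed -> manual handling
    # If several target vectors remain, keep the one closest to the image vector.
    return max(targets, key=targets.get)

vector_set = {"id_card": [1, 2, 3, 0], "resume": [0, 1, 0, 4]}
print(classify([1, 2, 3, 0], vector_set))   # 'id_card'
print(classify([9, 0, 0, 1], vector_set))   # None (all similarities below threshold)
```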
  • In one embodiment, before the step of extracting all fields contained in the credential image based on the OCR model, the method further includes: recognizing the credential image with a preset credential recognition model; judging whether the preset credential recognition model can determine the credential type to which the credential image belongs; if it can, classifying the credential image into the credential type determined by the preset credential recognition model; and if the preset credential recognition model cannot determine the credential type to which the credential image belongs, executing the step of extracting all fields contained in the credential image based on the OCR model.
  • The preset credential recognition model refers to an object recognition model for credential types for which mature recognition models already exist, such as ID cards, marriage certificates, and driver's licenses.
  • FIG. 5 is a schematic diagram of another process of a method for classifying a credential image according to an embodiment of the application.
  • The process of classifying a credential includes: inputting the credential image to be classified and first classifying it with the preset credential recognition model (i.e. an object recognition model). If the model classifies the credential image successfully, the classification result is obtained directly. If the specific credential type cannot be identified with the object recognition model, classification at this stage fails and the image is passed to the OCR model, which extracts the text information in the input image to obtain a text recognition result.
  • From the text recognition result, the field-set vector of the input image is obtained; the field sets obtained in advance from the category library, together with the vector computed for each category's credential type, form the vector set of the categories. The similarity between the input image's vector and each vector in the vector set is calculated, and the credential image is classified according to these similarities. If classification succeeds, the classification result of the credential image is obtained; if classification fails, it can be determined that the credential image belongs to some other category, and a classification-failure prompt can be issued so the image is handled manually.
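  • The two-stage flow of FIG. 5 can be sketched as below; object_model_predict, ocr_extract_fields, and match_fn are hypothetical placeholders for the components described above rather than APIs defined by the patent.

```python
from collections import Counter

def classify_credential(image, object_model_predict, ocr_extract_fields,
                        match_fn, total_field_set):
    """Two-stage flow of FIG. 5: object-recognition model first, OCR matching second."""
    # Stage 1: existing mature recognizers (ID card, driver's licence, ...).
    cred_type = object_model_predict(image)      # assumed: returns a type name or None
    if cred_type is not None:
        return cred_type
    # Stage 2: OCR the image, build its field-count vector, match the category library.
    counts = Counter(ocr_extract_fields(image))
    image_vec = [counts.get(f, 0) for f in total_field_set]
    cred_type = match_fn(image_vec)              # e.g. the cosine matcher sketched above
    if cred_type is None:
        print("Classification failed; please handle this credential manually.")
    return cred_type

# Hypothetical wiring, assuming the helpers from the earlier sketches exist:
# result = classify_credential(img, my_object_model, extract_fields,
#                              lambda v: classify(v, vector_set, 0.9), TOTAL_FIELD_SET)
```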
  • FIG. 6 is a schematic block diagram of a credential image classification device provided by an embodiment of this application.
  • an embodiment of the present application also provides a certificate image classification device.
  • the certificate image classification device includes a unit for executing the above-mentioned certificate image classification method, and the device can be configured in a computer device such as a server.
  • the certificate image classification device 600 includes an acquisition unit 601, an extraction unit 602, a first generation unit 603, a first judgment unit 604, and a first classification unit 605.
  • The acquiring unit 601 is configured to acquire the credential image to be classified; the extracting unit 602 is configured to extract, based on the OCR model, all fields contained in the credential image; the first generating unit 603 is configured to generate, from the fields, the vector of the credential image in the first preset manner; the first judging unit 604 is configured to determine whether the preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; and the first classification unit 605 is configured to, if the vector set contains a vector matching the vector of the credential image, take the matching vector as the target vector and determine the credential type of the credential image according to the credential type corresponding to the target vector.
  • In one embodiment, the credential image classification device 600 further includes a second generating unit configured to generate the vector set. The second generating unit includes: an acquiring subunit for acquiring multiple credential images belonging to the same credential type; a first extraction subunit for, for each credential image, extracting all fields contained in the image based on the OCR model and counting the occurrences of each field to generate the first field set corresponding to each credential image; a comparison subunit for comparing the fields contained in each first field set and filtering out the fields shared by all first field sets; a second extraction subunit for extracting, in a second preset manner, a preset number of shared fields to form the second field set, the second field set serving as the basis for identifying the credential type; a forming subunit for combining all fields appearing in the second field sets into a total field set without repeated fields; and a first obtaining subunit for, for each second field set, counting, according to the occurrences of each field of the second field set in the corresponding credential images, the number of times each field of the total field set appears in the second field set, so as to obtain the number sequence of the credential type corresponding to the second field set.
  • In one embodiment, the second extraction subunit is configured to use all the shared fields to form the second field set.
  • In one embodiment, the second generating unit further includes a configuration subunit configured to configure, in a third preset manner and according to the number of occurrences of each field in the second field set, a weight for the number corresponding to each field in the number sequence corresponding to the second field set.
  • In one embodiment, the weight configured for the number corresponding to a field is inversely proportional to the number of times the field appears.
  • In one embodiment, the first judging unit 604 includes: a calculation subunit for calculating the cosine similarity between the vector of the credential image and each vector contained in the vector set; a first judging subunit for judging whether there is a vector whose cosine similarity is not less than the preset cosine similarity threshold; and a determination subunit for determining, if such a vector exists, that the vector set contains a vector matching the vector of the credential image.
  • In one embodiment, the first classification unit 605 includes: a second judgment subunit configured to, if the vector set contains a vector matching the vector of the credential image, take the matching vector as the target vector; a first classification subunit configured to, if there is one target vector, determine the credential type corresponding to the target vector as the credential type of the credential image; and a second classification subunit configured to, if there are multiple target vectors, determine the credential type corresponding to the target vector closest to the vector of the credential image as the credential type of the credential image.
  • In one embodiment, the credential image classification device 600 further includes: an identification unit configured to recognize the credential image with the preset credential recognition model; and a second judging unit configured to judge whether the preset credential recognition model can determine the credential type to which the credential image belongs. The extracting unit 602 is configured to, if the credential type cannot be determined with the preset credential recognition model, execute the step of extracting, based on the OCR model, all fields contained in the credential image.
  • the above-mentioned credential image classification device can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 700 may be a computer device such as a desktop computer or a server, or may be a component or component in other devices.
  • the computer device 700 includes a processor 702, a memory, and a network interface 705 connected through a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704.
  • the non-volatile storage medium 703 can store an operating system 7031 and a computer program 7032.
  • the processor 702 can execute one of the above-mentioned certificate image classification methods.
  • the processor 702 is used to provide calculation and control capabilities to support the operation of the entire computer device 700.
  • the internal memory 704 provides an environment for the operation of the computer program 7032 in the non-volatile storage medium 703.
  • the processor 702 can execute the above-mentioned method for classifying a certificate image.
  • the network interface 705 is used for network communication with other devices.
  • the specific computer device 700 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 7 and will not be repeated here.
  • the processor 702 is configured to run a computer program 7032 stored in a memory to implement the credential image classification method in the foregoing embodiment of the present application.
  • the processor 702 may be a central processing unit (Central Processing Unit, CPU), and the processor 702 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium, and the computer-readable storage medium stores a computer program.
  • When the computer program is executed by a processor, the processor executes the steps of the credential image classification method described in the above embodiments.
  • The storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk, or any other physical storage medium that can store a computer program.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Input (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application provide a credential image classification method and apparatus, a computer device, and a readable storage medium, belonging to the technical field of credential image classification. A credential image to be classified is obtained; all fields contained in the credential image are extracted based on an OCR model; a vector of the credential image is generated from the fields in a first preset manner; it is determined whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; and if the vector set contains a vector matching the vector of the credential image, that vector is taken as the target vector and the credential type of the credential image is determined according to the credential type corresponding to the target vector.

Description

Credential image classification method and apparatus, computer device, and readable storage medium
This application claims priority to the Chinese patent application No. 201910979547.3, entitled "Credential image classification method and apparatus, computer device, and readable storage medium", filed with the Chinese Patent Office on October 15, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of credential image classification, and in particular to a credential image classification method and apparatus, a computer device, and a computer-readable storage medium.
Background
OCR-based text recognition has been widely used for information extraction. In general, a general-purpose OCR (Optical Character Recognition) model performs undifferentiated full-text recognition directly on an input text image. With broader applications and more refined requirements, however, what is needed in many cases is not the entire content of a credential: different credentials carry different customization requirements, and preset information must be extracted selectively from the target file corresponding to the credential. Recognition by a general-purpose OCR model alone can hardly meet such targeted extraction needs; the processing logic corresponding to each credential type must be invoked selectively, which involves classifying the credentials, i.e. invoking, according to a credential's type, the processing logic for that type. Therefore, before text is extracted with the OCR model, the credentials must be classified, and the classification result determines which logic is invoked to meet the customized extraction requirements of the current credential type. This is especially true for credential types whose features are not distinctive and that are hard to tell apart by appearance alone, such as form-style credentials printed on A4 paper: many different credentials share similar appearance characteristics, distinguishing their types with a general object recognition model is difficult, and even a trained general object recognition model can hardly separate the credentials to be classified and identify the exact credential type. In such cases it is difficult to classify credentials accurately with a general object recognition model alone.
Summary
The embodiments of this application provide a credential image classification method and apparatus, a computer device, and a computer-readable storage medium, which can solve the problem of low classification accuracy when credential images are classified with a general object recognition model in the conventional technology.
In a first aspect, an embodiment of this application provides a credential image classification method, including: obtaining a credential image to be classified; extracting all fields contained in the credential image based on an OCR model; generating a vector of the credential image from the fields in a first preset manner; determining whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; and if the vector set contains a vector matching the vector of the credential image, taking the matching vector as a target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector.
In a second aspect, an embodiment of this application further provides a credential image classification apparatus, including: an acquisition unit for obtaining a credential image to be classified; an extraction unit for extracting, based on an OCR model, all fields contained in the credential image; a first generating unit for generating a vector of the credential image from the fields in a first preset manner; a first judging unit for determining whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; and a first classification unit for, if the vector set contains a vector matching the vector of the credential image, taking the matching vector as a target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector.
In a third aspect, an embodiment of this application further provides a computer device, including a memory and a processor, where the memory stores a computer program and the processor implements the credential image classification method when executing the computer program.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the credential image classification method.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the credential image classification method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the credential image classification method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of a sub-flow of the credential image classification method provided by an embodiment of this application;
FIG. 4 is a schematic flowchart, in the credential image classification method provided by an embodiment of this application, of generating a feature field set as the second field set from the recognition results of the OCR model;
FIG. 5 is another schematic flowchart of the credential image classification method provided by an embodiment of this application;
FIG. 6 is a schematic block diagram of the credential image classification apparatus provided by an embodiment of this application; and
FIG. 7 is a schematic block diagram of the computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings of the embodiments. The described embodiments are obviously only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an application scenario of the credential image classification method provided by an embodiment of this application. The application scenario includes: (1) a terminal, used to photograph an electronic-version image of a credential to obtain the credential image, the terminal in FIG. 1 being connected to a server; and (2) a server, which receives the credential image sent by the terminal and executes the steps of the credential image classification method, and which is connected to the terminal through a wired or wireless network. The entities in FIG. 1 work as follows: the terminal photographs the credential to obtain the credential image and sends it to the server; the server obtains the credential image, extracts all fields contained in it based on the OCR model, generates a vector of the credential image from the fields in a first preset manner, and determines whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; if the vector set contains a matching vector, the server takes it as the target vector and determines the credential type of the credential image according to the credential type corresponding to the target vector.
It should be noted that FIG. 1 shows only a mobile phone as the terminal. In practice the terminal type is not limited to that shown in FIG. 1; the terminal may also be a computer device such as a smart watch, a notebook computer, or a tablet computer. The above application scenario is used only to illustrate, not to limit, the technical solution of this application, and the above connection relationship may also take other forms.
FIG. 2 is a schematic flowchart of the credential image classification method provided by an embodiment of this application. The method is applied to the server in FIG. 1 to perform all or part of the functions of the method. Referring to FIG. 2, as shown in FIG. 2, the method includes the following steps S201-S206:
S201: Obtain a credential image to be classified.
Specifically, when a service must accommodate different credential types, the credential type to which a credential belongs must be classified before the subsequent logic proceeds, so that the processing logic for that credential type can be invoked according to the credential's type. For example, a service may process credentials such as ID cards, driving licenses, and resumes; after the credential image is obtained it must first be determined whether the image is an ID card, a driving license, or a resume, so that the ID-card, driving-license, or resume processing logic can be invoked accordingly to process the image. The electronic-version image of the credential may first be captured with the terminal's camera to obtain the credential image; the terminal then sends the credential image to the server, and the server obtains the credential image to be classified and classifies it further.
S202: Extract, based on an OCR model, all fields contained in the credential image.
Specifically, the server extracts all fields contained in the credential image based on the OCR model, i.e. the server detects and recognizes, based on the OCR model, all characters contained in the credential image so as to extract all fields contained in it. OCR (Optical Character Recognition) refers to the process of analyzing and recognizing image files of textual material to obtain text and layout information, i.e. the text in the image is recognized and returned in text form.
Further, text extraction with the OCR model includes the following steps: 1) the OCR model receives the credential image; 2) the OCR model preprocesses the credential image, where preprocessing typically corrects imaging problems and common steps include binarization, noise removal, and skew correction, for example geometric transformation (perspective, warping, rotation, etc.), distortion correction, deblurring, image enhancement, and illumination correction; 3) the OCR model performs text detection on the credential image, i.e. detects the location, extent, and layout of the text, usually including layout analysis and text-line detection, with commonly used detection methods including text detection models such as Faster R-CNN, FCN, and RRPN (Rotation Region Proposal Networks); 4) the OCR model performs character recognition on the credential image, i.e. on the basis of text detection the text content is recognized and the image-form text contained in the image is converted into editable text, where text recognition network structures include the CRNN model and structures that introduce an attention mechanism; and 5) the OCR model outputs the recognized text.
S203: Generate, from the fields, a vector of the credential image in a first preset manner.
Specifically, the vector of the credential image is generated from the fields in the first preset manner as follows. 1) First, a total field set containing, without repetition, all fixed fields of all credential types is generated: the OCR model is applied to multiple credential images of each credential type to identify the fixed fields shared by the images of that type, and the fixed fields of all credential types are combined into a total field set with no repeated fields. 2) Then, based on all fields contained in the credential image, the number of times each field of the total field set appears in the credential image is counted, with fields that do not appear recorded as 0, yielding a number sequence for the credential image in the same field order as the total field set; the order of this sequence must match the field order used when generating the vector of each credential type. The vector of the credential image is thus generated in the same way as the already generated vector of each credential type, and each credential type's vector belongs to the vector set formed from the vectors of all credential types. For example, if the fields extracted from the credential image based on the OCR model are A, C, F, and G, where A appears 2 times, C 5 times, F once, and G 6 times, and the pre-generated total field set is ordered A, B, C, D, E, F, G, H, then the generated vector of the credential image is {2, 0, 5, 0, 0, 1, 6, 0}.
S204: Determine whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types.
Specifically, the judgment of similarity between two images is converted into a judgment of how close the two corresponding vectors are; for example, the cosine similarity between the two vectors may be calculated so that images are classified according to their similarity. The greater the cosine similarity, the closer the two vectors and the more similar the two images; alternatively, the distance between the two points corresponding to the two vectors may be calculated, and the shorter the distance, the closer the vectors and the more similar the images, whereby the credential image is classified.
S205: If the vector set contains a vector matching the vector of the credential image, take the matching vector as a target vector and determine the credential type of the credential image according to the credential type corresponding to the target vector. S206: If the vector set contains no vector matching the vector of the credential image, classification of the credential image fails.
Specifically, if the closeness of the two vectors meets the requirement, it is determined that the vector set contains a vector satisfying the preset condition with respect to the vector of the credential image, i.e. a matching vector; the matching vector is taken as the target vector and the credential type of the credential image is determined according to the credential type corresponding to the target vector, thereby classifying the credential image. If no vector in the vector set satisfies the preset condition with respect to the vector of the credential image, it is determined that the vector set contains no matching vector, the credential image cannot be classified into any credential type in the vector set, and classification fails; in that case a classification-failure prompt can be issued so that the user handles the image manually.
In the credential image classification method provided by this embodiment, because the text content of a text-type file is more representative than the whole image itself, classification of the image is converted into a judgment on conditions derived from the text content: the comparison of similarity between credential images becomes a comparison of similarity between the vectors derived from the text contained in the images, which is then used to judge the credential type. Compared with judging the credential type directly from the credential image, classification based on text content performs better than direct classification with an ordinary object recognition model and improves the accuracy of credential-type identification; in particular, for text-image types whose image features are not distinctive and are easily confused with other file types, as long as the type has a fixed format and a representative combination of fields, classifying the credential image by the OCR field recognition results improves classification accuracy. At the same time, the classification process becomes more automated: especially when large batches of unclassified text files are processed, accurate classification results can be obtained quickly, improving the simplicity, efficiency, and accuracy of credential image classification, considerably reducing the labor and time cost of manual classification, and allowing customized services corresponding to each classification to be configured more quickly from the classification results, thereby improving overall credential-processing efficiency.
In one embodiment, before the step of determining whether the preset vector set contains a vector matching the vector of the credential image, the method further includes: generating the vector set.
Specifically, since this embodiment classifies credential images using the fields shared by multiple credential images of the same credential type as the basis for judging the credential type, and these shared fields are the fixed fields of the credential, the fields contained in a credential image must be extracted with the OCR model. To classify files by credential type with the help of OCR recognition results, a category library, i.e. the vector set, must be established before credential images are classified; it tells the service which conditions an OCR recognition result must satisfy to be recognized as belonging to which credential type. The vector of the credential image corresponding to each credential type is generated in the first preset manner, and the vector set is formed from the respective vectors of all credential types.
Further, referring to FIG. 3, FIG. 3 is a schematic diagram of a sub-flow of the credential image classification method provided by an embodiment of this application. As shown in FIG. 3, steps S301 to S307 generate, in the first preset manner, a vector for each credential image of the different credential types, and the respective vectors of all credential types are then combined into a set to generate the vector set; that is, the step of generating the vector set includes the following steps S301-S308:
S301: Obtain multiple credential images belonging to the same credential type.
Specifically, to obtain the shared fixed fields of a credential type as the basis for identifying that type, all fields contained in each of multiple credential images of the same type must be compared and analyzed to filter out the fields shared by the multiple images. For example, comparing five credential images A-E of the ID-card type shows that all five contain fields such as "name", "sex", "ethnicity", "address", "citizen ID number", "issuing authority", and "validity period"; these fields are obtained by filtering all fields contained in the multiple ID-card images. In general, the multiple credential images of one credential type serve as samples for filtering that type's shared fixed fields, and the larger the sample size, the more accurate the filtering.
S302: For each credential image, extract, based on the OCR model, all fields contained in the credential image, and count the number of occurrences of each field to generate the first field set corresponding to each credential image.
Specifically, for each credential image, the text recognition result is obtained with the OCR model to extract all fields contained in the image, and the occurrences of each field are counted to generate the first field set corresponding to that image. For example, if a credential image contains the five fields A, B, C, D, and E, where A appears 2 times, B 4 times, C 6 times, D once, and E 2 times, the first field set {(A, 2), (B, 4), (C, 6), (D, 1), (E, 2)} can be formed. Repeating this process for each of the acquired credential images yields the first field set of each of the multiple credential images.
S303: Compare the fields contained in each first field set and filter out the fields shared by all first field sets.
Specifically, since the multiple credential images belong to the same credential type and share a common format or template, they have shared fields, and the shared fields have shared attributes such as the number of occurrences. For example, from eight credential images that are all ID cards, eight first field sets are obtained; comparing them, the shared fields include the four fields A, B, C, and D, where A appears 2 times, B 4 times, C 6 times, and D once, forming the set {(A, 2), (B, 4), (C, 6), (D, 1)}.
S304: Extract, in a second preset manner, a preset number of shared fields from the shared fields to form a second field set, the second field set serving as the basis for identifying the credential type.
Specifically, a preset number of the shared fields may be extracted in the second preset manner to form the second field set, or all shared fields may be used as the second field set, i.e. the text information extracted from the samples is compared, the fields shared by the samples are extracted to form the second field set, and the second field set together with the corresponding credential type is stored in the classification category library for later use in classifying credential images. The second preset manner extracts a preset number of shared fields according to the frequency with which each field appears, either from the highest frequency downward or from the lowest frequency upward; the preset number is chosen so that the credential type can be identified from that number of shared fields, and no two credential types may have, as their second field sets, the same number of the same fields with the same occurrence counts. Further, for every pair of credential types it can be checked whether the extracted preset number of shared fields and their occurrence counts are identical; if they are, the shared fields of at least one of the two types must be re-extracted to form a new second field set, so that for every pair of credential types the extracted shared fields and their occurrence counts differ.
Take the case where all shared fields are used as the second field set as an example. Referring to FIG. 4, FIG. 4 is a schematic flowchart of generating a feature field set as the second field set from the recognition results of the OCR model in the credential image classification method provided by an embodiment of this application. Multiple sample images of the same type are uploaded to the OCR model, which extracts the text content contained in the sample images to obtain field recognition results; the recognized results are compared, and the fields shared by the results, together with their occurrence counts, form a feature field set. The feature field set and the credential type corresponding to it are stored in the classification library formed from the field sets used for classification. Using all shared fields as the second field set improves the identification accuracy of the credential type and avoids confusing different credential types as the same type, which could happen if only some shared fields were extracted. For example, if the recognition results for all samples of one type are AX1BY1DZ1, AX2BY2CZ2, AX3BY3EZ3, and AX4BY4FZ4, the shared character set AB is taken as the basis for judging that type and is stored as the basis for classifying new input samples: if a newly input sample image also contains AB, it is judged to belong to the classification corresponding to AB, and if it does not contain AB, it is judged not to belong to that classification.
更进一步地,借助OCR模型的识别结果,将该证件类型的具备特征性的标识性字段提取,对多张同类型样本识别之后,得到的多个识别结果中,分析出有些字段在所有样本中都会出现的,当其出现大于一定阈值时,判断这是一个 特征字段,当同种字段的同种次数出现大于一定阈值时,将该字段的出现次数记为相应次数。比如90%样本中“姓名”出现了三次以上,而四次以上的只有10%,则记录姓名的出现次数为3次,采用大概率识别该种类的特征字段,小概率的可能是特殊情况,从而使通过特征字段判断该证件类型更准确。如上述举例,若90%样本中“姓名”出现了三次以上,而四次以上的只有10%,则记录姓名的出现次数为3次作为该种类的特征字段,姓名出现四次以上的只有10%,可能是其他情况导致的。通过这样的方式,可生成关于该种类证件类型的特征标识性字段集。对不同种类的证件类型重复上述过程生成多个证件类型各自的第二字段集作为类别字段集,进而通过各自的类别字段集作为判断对应证件类型的依据。
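A possible sketch of this threshold-based selection of feature fields and their representative occurrence counts; the 0.9 thresholds are illustrative values matching the 90% example and are not prescribed by the method:

```python
from typing import Dict, List

def feature_fields(samples: List[Dict[str, int]],
                   presence_threshold: float = 0.9,
                   count_threshold: float = 0.9) -> Dict[str, int]:
    """Pick feature fields and a representative occurrence count per field.

    A field qualifies when it appears in at least `presence_threshold` of the
    samples; its recorded count is the largest count reached by at least
    `count_threshold` of the samples (so "3+ times in 90%, 4+ times in 10%"
    records 3, as in the example above).
    """
    n = len(samples)
    result: Dict[str, int] = {}
    all_fields = set().union(*(s.keys() for s in samples))
    for field in all_fields:
        counts = [s.get(field, 0) for s in samples]
        if sum(c > 0 for c in counts) / n < presence_threshold:
            continue  # not shared widely enough to be a feature field
        k = 1
        while sum(c >= k + 1 for c in counts) / n >= count_threshold:
            k += 1
        result[field] = k
    return result
```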
S305: Combine all fields appearing in all the second field sets into a master field set containing no duplicate fields.
Specifically, after the second field sets of the multiple credential types are obtained, the field sets for judging the credential type categories have been formed; all fields appearing in all categories are then combined into a master set with no duplicate fields, yielding the master field set composed of all fields appearing in all the second field sets. For example, suppose there are 4 credential types: type 1 with fields ABC, type 2 with ACD, type 3 with BCDE and type 4 with CDEFG. Taking the union of all fields appearing in types 1 to 4 gives the duplicate-free master set ABCDEFG, forming seven dimensions.
S306: For each second field set, according to the number of times each field of the second field set appears in the credential images of the corresponding credential type, count the number of times each field of the master field set appears in the second field set, thereby obtaining the numeric sequence belonging to the credential type corresponding to the second field set.
Specifically, for each second field set, the number of times each field of the master field set appears in that second field set is counted according to the number of times each field of the second field set appears in the credential images of the corresponding credential type, and fields of the master field set that do not appear in the second field set are recorded as 0, yielding the numeric sequence of the credential type corresponding to that second field set. In other words, for each category's field set, the occurrence count of every master-set field within that category's field set is computed, with absent fields recorded as 0. For example, for type 1 (ABC), the fields D, E, F and G are each recorded as 0; for type 2 (ACD), the fields B, E, F and G are each recorded as 0. Note that a field does not necessarily appear only once: in a contract, for instance, "甲方" (Party A) may appear 4 times and "乙方" (Party B) 2 times. The same field may also appear a different number of times in different categories; A may appear 3 times in type 1 but 5 times in type 2. As an example, suppose the master-set fields A, B and C appear 1, 2 and 3 times respectively in type 1, the master-set fields A, C and D appear 3, 5 and 6 times respectively in type 2, and absent fields are recorded as 0.
S307: Sort the numeric sequence according to the preset order of the fields, thereby obtaining the vector of the credential type corresponding to the second field set.
Specifically, for each category, the numeric sequence computed in the previous step is arranged in a fixed order based on the master-set fields to form a vector for that category. There is no requirement on the order of the fields within the master set itself; it is only required that the field order of the master set and the field order used for each individual category are consistent, so that comparable vectors are formed. For example, if the field order of the master set is ACDFE, an individual category should also form its corresponding vector in the field order ACDFE. For the example in step S306, type 1 forms the vector (1,2,3,0,0,0,0) and type 2 forms the vector (3,0,5,6,0,0,0). The above process of generating the vector of the credential type corresponding to each second field set is repeated for the second field sets of the multiple credential types, yielding the respective vectors of the multiple credential types.
S308: Compose the respective vectors of the multiple credential types into a set to generate the vector set.
Specifically, the above process of generating the vector of the credential type corresponding to each second field set is repeated for the second field sets of the multiple credential types to obtain the respective vectors of the multiple credential types, and those vectors are composed into a set to generate the vector set.
In one embodiment, after the step of obtaining the numeric sequence belonging to the credential type corresponding to the second field set, the method further includes: according to the number of occurrences of each field in the second field set, configuring, in a third preset manner, a weight for the number corresponding to each field in the numeric sequence corresponding to the second field set.
Specifically, the generated feature fields of all categories are extracted and combined into an ordered, duplicate-free master field set; for each category, the number of times every field of the master set appears in that category is computed, with absent fields recorded as 0. At the same time, the frequency with which each field appears across the different categories is computed: the fewer categories a field occurs in, the higher the weight assigned to it. The occurrence count is multiplied by the weight to form the value of that field for that category, and combining these values in the order of the master set generates a vector serving as the feature vector of that category. Here, frequency (also called the number of occurrences) refers to the number of times a value representing a certain characteristic appears. For example, if the field "姓名" (name) appears in 8 of 10 categories, it has low discriminative power as a basis for dividing categories and is given a lower weight; if it appears in only 1 of the 10 categories, it is highly discriminative, can serve as a stronger classification basis, and is given a higher weight. For instance, consider the categories in Table 1:
Table 1
Field counts    A    B    C    D
Type 1          0    2    1    0
Type 2          1    0    2    0
Type 3          0    1    1    1
Target          2    0    1    3
Suppose there are three classes of credentials, types 1, 2 and 3, where type 1 contains the feature fields B and C, type 2 contains the feature fields A and C, and type 3 contains the feature fields B, C and D. The target credential contains the feature fields A, C and D, with the occurrence counts of each field as shown in the table. The weights of A, B, C and D can be computed from the number of classes among types 1, 2 and 3 in which each of them occurs, taking n/m as the weight, where n is the frequency with which the field occurs among types 1, 2 and 3 (that is, whether it occurs in each of them), and m is the sum of these frequencies over all fields: A occurs 1 time in total among types 1, 2 and 3, B occurs 2 times, C occurs 3 times and D occurs 1 time, so m = 1 + 2 + 3 + 1 = 7. For the example in the table above, the weights of A, B, C and D are therefore: (1) A occurs only in type 2, its frequency is 1, and its weight is 1/7; (2) B occurs in types 1 and 3, its frequency is 2, and its weight is 2/7; (3) C occurs in all of types 1, 2 and 3, its frequency is 3, and its weight is 3/7; (4) D occurs only in type 3, so its weight is 1/7. In this way of expressing weights, the weights of A, B, C and D sum to 1.
The value of each component of a credential's vector is computed as: occurrence count of the field × weight. For example, for type 1 above, over the components A, B, C, D the composition is 0, B, C, 0, where B = 2 × 2/7 = 4/7 and C = 1 × 3/7 = 3/7, so the vector of type 1 is (0, 4/7, 3/7, 0); for type 2 the composition is A, 0, C, 0, where A = 1 × 1/7 = 1/7 and C = 2 × 3/7 = 6/7, so the vector of type 2 is (1/7, 0, 6/7, 0); for type 3 the composition is 0, B, C, D, where B = 1 × 2/7 = 2/7, C = 1 × 3/7 = 3/7 and D = 1 × 1/7 = 1/7, so the vector of type 3 is (0, 2/7, 3/7, 1/7). For the target credential the composition is A, 0, C, D, where A = 2 × 1/7 = 2/7, C = 1 × 3/7 = 3/7 and D = 3 × 1/7 = 3/7, so the vector of the target credential is (2/7, 0, 3/7, 3/7). The cosine similarity between the target credential's vector and the vectors of types 1, 2 and 3 is then computed; if the cosine similarity satisfies the preset condition, the target credential is assigned to type 1, 2 or 3, otherwise classification fails.
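The weighting and comparison in this example can be reproduced with the following sketch, which follows the n/m weighting and the field order A, B, C, D of Table 1 and reuses the cosine_similarity helper sketched earlier; the function names are illustrative, not a definitive implementation:

```python
from typing import Dict, List

FIELDS = ["A", "B", "C", "D"]  # field order of the master set in this example

def field_weights(type_counts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Weight n/m per field: n = number of types in which the field occurs,
    m = sum of n over all fields, so the weights sum to 1 (as in Table 1)."""
    n = {f: sum(1 for counts in type_counts.values() if counts.get(f, 0) > 0)
         for f in FIELDS}
    m = sum(n.values())
    return {f: n[f] / m for f in FIELDS}

def weighted_vector(counts: Dict[str, int], weights: Dict[str, float]) -> List[float]:
    """Each component is the field's occurrence count multiplied by its weight."""
    return [counts.get(f, 0) * weights[f] for f in FIELDS]

types = {
    "type1": {"B": 2, "C": 1},
    "type2": {"A": 1, "C": 2},
    "type3": {"B": 1, "C": 1, "D": 1},
}
target = {"A": 2, "C": 1, "D": 3}

w = field_weights(types)                 # {'A': 1/7, 'B': 2/7, 'C': 3/7, 'D': 1/7}
target_vec = weighted_vector(target, w)  # [2/7, 0, 3/7, 3/7]
similarities = {t: cosine_similarity(target_vec, weighted_vector(c, w))
                for t, c in types.items()}
```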
In one embodiment, the weight configured for the number corresponding to a field is inversely proportional to the number of occurrences of that field.
Specifically, within the second field set corresponding to a credential type's feature field set, in order to give a field a higher weight the fewer classes it occurs in, that is, the lower its frequency across credential types, the field's occurrence count is multiplied by its weight to form the value of that field for that class. Since a weight describes the relative importance of a factor or indicator and tends to express its contribution or importance, in one way of expressing weights the weights of A, B, C and D sum to 1, while in another way they need not sum to 1; for example, the weight may also be 1/m, where m is the total frequency of the field across all categories, allocated as follows: (1) A occurs only in type 2 among types 1, 2 and 3, its frequency is 1, and its weight is 1; (2) B occurs in types 1 and 3, its frequency is 2, and its weight is 0.5; (3) C occurs in all of types 1, 2 and 3, 3 times in total, and its weight is about 0.3; (4) D occurs only in type 3, so its weight is 1. Here frequency, also called the number of occurrences, means that when an experiment is performed f times under the same conditions, the number of times m that event A occurs is called the frequency of event A. The larger the weight, the stronger the representational power of the feature item, and the smaller the weight, the weaker that power; what matters is that the same standard is used to allocate all the weights. In the same way, by configuring weights for the fields of each individual credential type's second field set, a feature vector over the master set can also be generated for every individual input credential sample of a given type. Configuring weights for the feature fields of each credential type reflects the importance of different fields in the process of judging the credential type, and improves the accuracy and efficiency of credential image classification.
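A brief sketch of this alternative 1/m weighting, under the same mapping-based data layout assumed earlier; the helper name is illustrative:

```python
from typing import Dict

def inverse_frequency_weights(type_counts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Alternative weighting: weight = 1/m, where m is the total frequency of the
    field across all categories, so rarer fields get larger weights (the weights
    no longer need to sum to 1)."""
    totals: Dict[str, int] = {}
    for counts in type_counts.values():
        for field in counts:
            totals[field] = totals.get(field, 0) + 1
    return {field: 1.0 / m for field, m in totals.items()}
```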
In one embodiment, the step of determining whether the preset vector set contains a vector matching the vector of the credential image includes: computing the cosine similarity between the vector of the credential image and each vector contained in the vector set; determining whether there is a vector whose cosine similarity is not less than a preset cosine similarity threshold; if there is a vector whose cosine similarity is not less than the preset cosine similarity threshold, determining that the vector set contains a vector matching the vector of the credential image; if all the cosine similarities are less than the preset cosine similarity threshold, determining that the vector set contains no vector matching the vector of the credential image. Cosine similarity, also called cosine distance, uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors; this is also known as "cosine similarity".
Specifically, the cosine similarity between the vector of the input credential image and the vectors of the credential types of each category is computed. The cosine similarity can be computed as follows:
The cosine of the angle between two vectors can be obtained from the Euclidean dot product formula:
a · b = ||a|| ||b|| cos θ.        Formula (1)
Given two attribute vectors A and B, their cosine similarity cos θ is given by the dot product and the vector lengths, as follows:
cos θ = (A · B) / (||A|| ||B||) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) × √(Σᵢ Bᵢ²) ).        Formula (2)
where Aᵢ and Bᵢ are the components of vectors A and B respectively. The resulting similarity ranges from -1 to 1: -1 means the two vectors point in exactly opposite directions, 1 means they point in exactly the same direction, 0 usually indicates that they are independent, and values in between indicate intermediate similarity or dissimilarity. For text matching, the attribute vectors A and B are usually the term-frequency vectors of documents; cosine similarity can thus be regarded as a method of normalizing document length during comparison.
By computing the cosine similarity between the vector of the input credential image and the vectors contained in the vector set, the similarity between the input credential image and the various credential type categories in the classification library is judged; when a cosine similarity is the largest and greater than the preset cosine similarity threshold, the credential image can be considered to belong to the credential type corresponding to that vector. This makes the image classification process more automated and improves the efficiency of credential classification.
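Matching an input image's vector against the vector set with a cosine similarity threshold might look like the following sketch, reusing the cosine_similarity helper sketched earlier; the 0.9 threshold is an illustrative value, not one prescribed by the method:

```python
from typing import Dict, List, Optional

def match_credential_type(image_vector: List[float],
                          type_vectors: Dict[str, List[float]],
                          threshold: float = 0.9) -> Optional[str]:
    """Return the credential type whose vector has the largest cosine similarity
    to the image vector, provided that similarity reaches the threshold;
    return None when classification fails."""
    sims = {t: cosine_similarity(image_vector, v) for t, v in type_vectors.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None
```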
In one embodiment, the step of, if the vector set contains a vector matching the vector of the credential image, taking the matching vector as the target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector includes: if the vector set contains a vector matching the vector of the credential image, taking the matching vector as the target vector; if there is one target vector, determining the credential type corresponding to the target vector as the credential type of the credential image; if there are multiple target vectors, determining the credential type corresponding to the target vector closest to the vector of the credential image among the multiple target vectors as the credential type of the credential image.
Specifically, if the vector set contains a vector matching the vector of the credential image, the matching vector is taken as the target vector. Since one credential image can correspond to only one credential type, if there is a single target vector, the credential type corresponding to it is determined as the credential type of the credential image; if there are multiple target vectors, the credential type corresponding to the target vector closest to the vector of the credential image among them is determined as the credential type of the credential image, thereby classifying the credential image.
In one embodiment, before the step of extracting all fields contained in the credential image based on the OCR model, the method further includes: recognizing the credential image with a preset credential recognition model; determining whether the credential type to which the credential image belongs can be determined by the preset credential recognition model; if the credential type to which the credential image belongs can be determined, classifying the credential image into the credential type to which the preset credential recognition model belongs; if the credential type to which the credential image belongs cannot be determined by the preset credential recognition model, executing the step of extracting all fields contained in the credential image based on the OCR model.
The preset credential recognition model refers to an object recognition model corresponding to an existing mature recognition model for ID cards, marriage certificates, driving licenses and the like.
Specifically, among the documents to be classified by credential type, some credentials with distinctive features, such as ID cards and driving licenses, can be classified by training an object recognition model, and mature object recognition models already exist for ID cards, marriage certificates, driving licenses and the like; using those object recognition models to recognize their corresponding credentials can accurately identify the corresponding credential images. However, some credentials with similar features, such as form-type credentials printed on A4 paper, are difficult to distinguish with ordinary object recognition models of the above kind for ID cards, marriage certificates or driving licenses. To improve classification accuracy, the embodiments of this application use a multi-layer structure to classify credential images. First, an object recognition model trained on distinctive credentials such as ID cards, marriage certificates and driving licenses recognizes the credential image; if it can be recognized as an ID card, marriage certificate, driving license or the like, that is, if the object recognition model produces a classification result with a confidence greater than a preset threshold, the input sample is considered to be a credential of that kind and is processed directly by the model of that credential's subsequent processing logic, which improves the efficiency of processing credential images. If it is not such a credential and the object recognition model cannot identify a specific credential type, the steps of the credential image classification method described above in the embodiments of this application are then used to classify the credential type. Confidence, also called the confidence level, refers to the degree to which a particular individual believes a particular proposition to be true, that is, probability as a measure of the reasonableness of a personal belief. For example, please refer to FIG. 5, which is another schematic flow diagram of the credential image classification method provided by an embodiment of this application. As shown in FIG. 5, in this embodiment the credential classification process includes: the credential image to be classified is input and first classified by the preset credential recognition model (i.e. the object recognition model); if the object recognition model classifies the credential image successfully, the classification result of the credential image is obtained directly. If the object recognition model cannot identify the specific credential type and classification of the credential image fails, the OCR model extracts the text information from the input image to obtain the text recognition result, from which the field-set vector of the input image is obtained; the master field set is obtained in advance from the category library, the vectors of the credential types of each category in the category library are computed from the master field set to form the vector set of categories, the similarity between the vector of the input image's field set and the vectors in the vector set is computed, and the credential image is classified by the similarity between vectors. If classification of the credential image succeeds, its classification result is obtained; if classification of the credential image fails, the credential image can be judged to belong to another category, and a classification-failure prompt can be issued for manual processing.
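The multi-layer structure described here can be outlined as the following sketch, which chains an object recognition model with the OCR-based vector matching; the model interfaces, the confidence threshold and the reuse of the helpers sketched earlier (build_count_vector, match_credential_type) are assumptions made for illustration:

```python
from typing import Callable, Dict, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.8  # illustrative value, not prescribed by the method

def classify_credential(image,
                        object_model: Callable[[object], Tuple[Optional[str], float]],
                        ocr_model: Callable[[object], List[str]],
                        master_fields: List[str],
                        type_vectors: Dict[str, List[float]]) -> Optional[str]:
    """First try the object recognition model for distinctive credentials; if it
    cannot decide with enough confidence, fall back to OCR field extraction and
    vector matching as described above. Returns None when both stages fail."""
    cred_type, confidence = object_model(image)
    if cred_type is not None and confidence > CONFIDENCE_THRESHOLD:
        return cred_type                      # distinctive credential, done
    fields = ocr_model(image)                 # fall back to OCR-based classification
    image_vector = build_count_vector(fields, master_fields)
    return match_credential_type(image_vector, type_vectors)
```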
It should be noted that, in the credential image classification methods described in the above embodiments, the technical features contained in different embodiments can be recombined as needed to obtain combined implementations, all of which fall within the scope of protection claimed by this application.
Please refer to FIG. 6, which is a schematic block diagram of a credential image classification apparatus provided by an embodiment of this application. Corresponding to the above credential image classification method, an embodiment of this application also provides a credential image classification apparatus. As shown in FIG. 6, the credential image classification apparatus includes units for executing the above credential image classification method, and the apparatus can be configured in a computer device such as a server. Specifically, the credential image classification apparatus 600 includes an acquisition unit 601, an extraction unit 602, a first generation unit 603, a first judgment unit 604 and a first classification unit 605. The acquisition unit 601 is configured to acquire a credential image to be classified; the extraction unit 602 is configured to extract all fields contained in the credential image based on an OCR model; the first generation unit 603 is configured to generate a vector of the credential image from the fields in a first preset manner; the first judgment unit 604 is configured to determine whether a preset vector set contains a vector matching the vector of the credential image, where the vector set includes a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types; the first classification unit 605 is configured to, if the vector set contains a vector matching the vector of the credential image, take the matching vector as the target vector and determine the credential type of the credential image according to the credential type corresponding to the target vector.
In one embodiment, the credential image classification apparatus 600 further includes: a second generation unit configured to generate the vector set. The second generation unit includes: an acquisition subunit configured to acquire multiple credential images belonging to the same credential type; a first extraction subunit configured to, for each credential image, extract all fields contained in the credential image based on the OCR model and count the occurrences of each field to generate the first field set corresponding to each credential image; a comparison subunit configured to compare the fields contained in each first field set and filter out the fields shared by all the first field sets; a second extraction subunit configured to extract, in a second preset manner, a preset number of shared fields from the shared fields to form a second field set, the second field set serving as the basis for recognizing the credential type; a composition subunit configured to combine all fields appearing in all the second field sets into a master field set containing no duplicate fields; a first obtaining subunit configured to, for each second field set, according to the number of times each field of the second field set appears in the credential images of the corresponding credential type, count the number of times each field of the master field set appears in the second field set, thereby obtaining the numeric sequence belonging to the credential type corresponding to the second field set; a sorting subunit configured to sort the numeric sequence according to the preset order of the fields, thereby obtaining the vector of the credential type corresponding to the second field set; and a generation subunit configured to compose the respective vectors of the multiple credential types into a set to generate the vector set.
In one embodiment, the second extraction subunit is configured to extract all of the shared fields to form the second field set.
In one embodiment, the second generation unit further includes: a configuration subunit configured to, according to the number of occurrences of each field in the second field set, configure, in a third preset manner, a weight for the number corresponding to each field in the numeric sequence corresponding to the second field set.
In one embodiment, the weight configured for the number corresponding to a field is inversely proportional to the number of occurrences of that field.
In one embodiment, the first judgment unit 604 includes: a computation subunit configured to compute the cosine similarity between the vector of the credential image and each vector contained in the vector set; a first judgment subunit configured to determine whether there is a vector whose cosine similarity is not less than a preset cosine similarity threshold; and a determination subunit configured to, if there is a vector whose cosine similarity is not less than the preset cosine similarity threshold, determine that the vector set contains a vector matching the vector of the credential image.
In one embodiment, the first classification unit 605 includes: a second judgment subunit configured to, if the vector set contains a vector matching the vector of the credential image, take the matching vector as the target vector; a first classification subunit configured to, if there is one target vector, determine the credential type corresponding to the target vector as the credential type of the credential image; and a second classification subunit configured to, if there are multiple target vectors, determine the credential type corresponding to the target vector closest to the vector of the credential image among the multiple target vectors as the credential type of the credential image.
In one embodiment, the credential image classification apparatus 600 further includes: a recognition unit configured to recognize the credential image with a preset credential recognition model; and a second judgment unit configured to determine whether the credential type to which the credential image belongs can be determined by the preset credential recognition model. The extraction unit 602 is configured to, if the credential type to which the credential image belongs cannot be determined by the preset credential recognition model, execute the step of extracting all fields contained in the credential image based on the OCR model.
It should be noted that those skilled in the art can clearly understand that, for the specific implementation process of the above credential image classification apparatus and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, it is not repeated here. Moreover, the division and connection of the units in the above credential image classification apparatus are only for illustration; in other embodiments, the credential image classification apparatus may be divided into different units as needed, and the units may adopt different connection orders and manners, to complete all or part of the functions of the above credential image classification apparatus.
The above credential image classification apparatus can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 7. Please refer to FIG. 7, which is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 700 may be a computer device such as a desktop computer or a server, or may be a component or part of another device.
Referring to FIG. 7, the computer device 700 includes a processor 702, a memory and a network interface 705 connected through a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704. The non-volatile storage medium 703 can store an operating system 7031 and a computer program 7032. When executed, the computer program 7032 can cause the processor 702 to perform the above credential image classification method. The processor 702 is used to provide computing and control capabilities to support the operation of the entire computer device 700. The internal memory 704 provides an environment for running the computer program 7032 in the non-volatile storage medium 703; when the computer program 7032 is executed by the processor 702, it can cause the processor 702 to perform the above credential image classification method. The network interface 705 is used for network communication with other devices. Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device 700 to which the solution of this application is applied; a specific computer device 700 may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 7 and are not repeated here.
The processor 702 is configured to run the computer program 7032 stored in the memory, so as to implement the credential image classification method of the above embodiments of this application.
It should be understood that, in the embodiments of this application, the processor 702 may be a central processing unit (CPU), and the processor 702 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program, which can be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the steps of the embodiments of the above credential image classification method.
Therefore, an embodiment of this application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the credential image classification method described in the above embodiments.
The storage medium is a physical, non-transitory storage medium, for example a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk or any other physical storage medium that can store a computer program.
Those of ordinary skill in the art can appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of this application.
The above are only specific implementations of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions shall all be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. A credential image classification method, comprising:
    acquiring a credential image to be classified;
    extracting all fields contained in the credential image based on an OCR model;
    generating a vector of the credential image from the fields in a first preset manner;
    determining whether a preset vector set contains a vector matching the vector of the credential image, wherein the vector set comprises a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types;
    if the vector set contains a vector matching the vector of the credential image, taking the vector matching the vector of the credential image as a target vector, and determining the credential type of the credential image according to the credential type corresponding to the target vector.
  2. The credential image classification method according to claim 1, wherein before the step of determining whether the preset vector set contains a vector matching the vector of the credential image, the method further comprises: generating the vector set; wherein the step of generating the vector set comprises:
    acquiring a plurality of credential images belonging to a same credential type;
    for each of the credential images, extracting all fields contained in the credential image based on the OCR model, and counting the number of occurrences of each field to generate a first field set corresponding to each credential image;
    comparing the fields contained in each first field set, and filtering out the fields shared by all the first field sets;
    extracting, in a second preset manner, a preset number of shared fields from the shared fields to form a second field set, the second field set serving as a basis for recognizing the credential type;
    combining all fields appearing in all the second field sets into a master field set containing no duplicate fields;
    for each second field set, according to the number of times each field of the second field set appears in the credential images of the corresponding credential type, counting the number of times each field of the master field set appears in the second field set, thereby obtaining a numeric sequence belonging to the credential type corresponding to the second field set;
    sorting the numeric sequence according to a preset order of the fields, thereby obtaining the vector of the credential type corresponding to the second field set;
    composing the respective vectors of the plurality of credential types into a set to generate the vector set.
  3. The credential image classification method according to claim 2, wherein the step of extracting, in the second preset manner, the preset number of shared fields from the shared fields to form the second field set comprises:
    extracting all of the shared fields to form the second field set.
  4. The credential image classification method according to claim 2, wherein after the step of obtaining the numeric sequence belonging to the credential type corresponding to the second field set, the method further comprises:
    according to the number of occurrences of each field in the second field set, configuring, in a third preset manner, a weight for the number corresponding to each field in the numeric sequence corresponding to the second field set.
  5. The credential image classification method according to claim 4, wherein the weight configured for the number corresponding to a field is inversely proportional to the number of occurrences of each field.
  6. The credential image classification method according to claim 1, wherein the step of determining whether the preset vector set contains a vector matching the vector of the credential image comprises:
    computing a cosine similarity between the vector of the credential image and each vector contained in the vector set;
    determining whether there is a vector whose cosine similarity is not less than a preset cosine similarity threshold;
    if there is a vector whose cosine similarity is not less than the preset cosine similarity threshold, determining that the vector set contains a vector matching the vector of the credential image.
  7. The credential image classification method according to claim 1, wherein the step of, if the vector set contains a vector matching the vector of the credential image, taking the vector matching the vector of the credential image as the target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector comprises:
    if the vector set contains a vector matching the vector of the credential image, taking the vector matching the vector of the credential image as the target vector;
    if the number of target vectors is one, determining the credential type corresponding to the target vector as the credential type of the credential image;
    if the number of target vectors is more than one, determining the credential type corresponding to the target vector closest to the vector of the credential image among the plurality of target vectors as the credential type of the credential image.
  8. The credential image classification method according to claim 1, wherein before the step of extracting all fields contained in the credential image based on the OCR model, the method further comprises:
    recognizing the credential image with a preset credential recognition model;
    determining whether the credential type to which the credential image belongs can be determined by the preset credential recognition model;
    if the credential type to which the credential image belongs cannot be determined by the preset credential recognition model, executing the step of extracting all fields contained in the credential image based on the OCR model.
  9. A credential image classification apparatus, comprising:
    an acquisition unit configured to acquire a credential image to be classified;
    an extraction unit configured to extract all fields contained in the credential image based on an OCR model;
    a first generation unit configured to generate a vector of the credential image from the fields in a first preset manner;
    a first judgment unit configured to determine whether a preset vector set contains a vector matching the vector of the credential image, wherein the vector set comprises a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types;
    a first classification unit configured to, if the vector set contains a vector matching the vector of the credential image, take the vector matching the vector of the credential image as a target vector, and determine the credential type of the credential image according to the credential type corresponding to the target vector.
  10. The credential image classification apparatus according to claim 9, wherein the apparatus further comprises:
    a second generation unit configured to generate the vector set; wherein the second generation unit comprises:
    an acquisition subunit configured to acquire a plurality of credential images belonging to a same credential type;
    a first extraction subunit configured to, for each of the credential images, extract all fields contained in the credential image based on the OCR model, and count the number of occurrences of each field to generate a first field set corresponding to each credential image;
    a comparison subunit configured to compare the fields contained in each first field set and filter out the fields shared by all the first field sets;
    a second extraction subunit configured to extract, in a second preset manner, a preset number of shared fields from the shared fields to form a second field set, the second field set serving as a basis for recognizing the credential type;
    a composition subunit configured to combine all fields appearing in all the second field sets into a master field set containing no duplicate fields;
    a first obtaining subunit configured to, for each second field set, according to the number of times each field of the second field set appears in the credential images of the corresponding credential type, count the number of times each field of the master field set appears in the second field set, thereby obtaining a numeric sequence belonging to the credential type corresponding to the second field set;
    a sorting subunit configured to sort the numeric sequence according to a preset order of the fields, thereby obtaining the vector of the credential type corresponding to the second field set;
    a generation subunit configured to compose the respective vectors of the plurality of credential types into a set to generate the vector set.
  11. A computer device, wherein the computer device comprises a memory and a processor connected to the memory; the memory is configured to store a computer program; the processor is configured to run the computer program stored in the memory to perform the following steps:
    acquiring a credential image to be classified;
    extracting all fields contained in the credential image based on an OCR model;
    generating a vector of the credential image from the fields in a first preset manner;
    determining whether a preset vector set contains a vector matching the vector of the credential image, wherein the vector set comprises a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types;
    if the vector set contains a vector matching the vector of the credential image, taking the vector matching the vector of the credential image as a target vector, and determining the credential type of the credential image according to the credential type corresponding to the target vector.
  12. The computer device according to claim 11, wherein before the step of determining whether the preset vector set contains a vector matching the vector of the credential image, the steps further comprise: generating the vector set; wherein the step of generating the vector set comprises:
    acquiring a plurality of credential images belonging to a same credential type;
    for each of the credential images, extracting all fields contained in the credential image based on the OCR model, and counting the number of occurrences of each field to generate a first field set corresponding to each credential image;
    comparing the fields contained in each first field set, and filtering out the fields shared by all the first field sets;
    extracting, in a second preset manner, a preset number of shared fields from the shared fields to form a second field set, the second field set serving as a basis for recognizing the credential type;
    combining all fields appearing in all the second field sets into a master field set containing no duplicate fields;
    for each second field set, according to the number of times each field of the second field set appears in the credential images of the corresponding credential type, counting the number of times each field of the master field set appears in the second field set, thereby obtaining a numeric sequence belonging to the credential type corresponding to the second field set;
    sorting the numeric sequence according to a preset order of the fields, thereby obtaining the vector of the credential type corresponding to the second field set;
    composing the respective vectors of the plurality of credential types into a set to generate the vector set.
  13. The computer device according to claim 12, wherein the step of extracting, in the second preset manner, the preset number of shared fields from the shared fields to form the second field set comprises:
    extracting all of the shared fields to form the second field set.
  14. The computer device according to claim 12, wherein after the step of obtaining the numeric sequence belonging to the credential type corresponding to the second field set, the steps further comprise:
    according to the number of occurrences of each field in the second field set, configuring, in a third preset manner, a weight for the number corresponding to each field in the numeric sequence corresponding to the second field set.
  15. The computer device according to claim 14, wherein the weight configured for the number corresponding to a field is inversely proportional to the number of occurrences of each field.
  16. The computer device according to claim 11, wherein the step of determining whether the preset vector set contains a vector matching the vector of the credential image comprises:
    computing a cosine similarity between the vector of the credential image and each vector contained in the vector set;
    determining whether there is a vector whose cosine similarity is not less than a preset cosine similarity threshold;
    if there is a vector whose cosine similarity is not less than the preset cosine similarity threshold, determining that the vector set contains a vector matching the vector of the credential image.
  17. The computer device according to claim 11, wherein the step of, if the vector set contains a vector matching the vector of the credential image, taking the vector matching the vector of the credential image as the target vector and determining the credential type of the credential image according to the credential type corresponding to the target vector comprises:
    if the vector set contains a vector matching the vector of the credential image, taking the vector matching the vector of the credential image as the target vector;
    if the number of target vectors is one, determining the credential type corresponding to the target vector as the credential type of the credential image;
    if the number of target vectors is more than one, determining the credential type corresponding to the target vector closest to the vector of the credential image among the plurality of target vectors as the credential type of the credential image.
  18. The computer device according to claim 11, wherein before the step of extracting all fields contained in the credential image based on the OCR model, the steps further comprise:
    recognizing the credential image with a preset credential recognition model;
    determining whether the credential type to which the credential image belongs can be determined by the preset credential recognition model;
    if the credential type to which the credential image belongs cannot be determined by the preset credential recognition model, executing the step of extracting all fields contained in the credential image based on the OCR model.
  19. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the following steps:
    acquiring a credential image to be classified;
    extracting all fields contained in the credential image based on an OCR model;
    generating a vector of the credential image from the fields in a first preset manner;
    determining whether a preset vector set contains a vector matching the vector of the credential image, wherein the vector set comprises a plurality of vectors generated in the first preset manner and corresponding to credential images of different credential types;
    if the vector set contains a vector matching the vector of the credential image, taking the vector matching the vector of the credential image as a target vector, and determining the credential type of the credential image according to the credential type corresponding to the target vector.
  20. The computer-readable storage medium according to claim 19, wherein before the step of determining whether the preset vector set contains a vector matching the vector of the credential image, the steps further comprise: generating the vector set; wherein the step of generating the vector set comprises:
    acquiring a plurality of credential images belonging to a same credential type;
    for each of the credential images, extracting all fields contained in the credential image based on the OCR model, and counting the number of occurrences of each field to generate a first field set corresponding to each credential image;
    comparing the fields contained in each first field set, and filtering out the fields shared by all the first field sets;
    extracting, in a second preset manner, a preset number of shared fields from the shared fields to form a second field set, the second field set serving as a basis for recognizing the credential type;
    combining all fields appearing in all the second field sets into a master field set containing no duplicate fields;
    for each second field set, according to the number of times each field of the second field set appears in the credential images of the corresponding credential type, counting the number of times each field of the master field set appears in the second field set, thereby obtaining a numeric sequence belonging to the credential type corresponding to the second field set;
    sorting the numeric sequence according to a preset order of the fields, thereby obtaining the vector of the credential type corresponding to the second field set;
    composing the respective vectors of the plurality of credential types into a set to generate the vector set.
PCT/CN2019/118392 2019-10-15 2019-11-14 Credential image classification method and apparatus, computer device and readable storage medium WO2021072876A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910979547.3 2019-10-15
CN201910979547.3A CN111046879B (zh) 2019-10-15 2019-10-15 Credential image classification method and apparatus, computer device and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021072876A1 true WO2021072876A1 (zh) 2021-04-22

Family

ID=70231789

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118392 WO2021072876A1 (zh) 2019-10-15 2019-11-14 证件图像分类方法、装置、计算机设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN111046879B (zh)
WO (1) WO2021072876A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516597A (zh) * 2021-05-19 2021-10-19 中国工商银行股份有限公司 Image correction method, apparatus and server

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688998A (zh) * 2019-09-27 2020-01-14 中国银行股份有限公司 Bill recognition method and apparatus
CN111563501A (zh) * 2020-04-26 2020-08-21 北京立禾物联科技有限公司 Certificate-of-conformity recognition apparatus and recognition method
CN111881943A (zh) * 2020-07-08 2020-11-03 泰康保险集团股份有限公司 Image classification method, apparatus, device and computer-readable medium
CN111860657A (zh) * 2020-07-23 2020-10-30 中国建设银行股份有限公司 Image classification method and apparatus, electronic device and storage medium
TWI845837B (zh) * 2021-04-21 2024-06-21 國立中央大學 Handwritten Chinese character recognition method and apparatus
CN113627542A (zh) * 2021-08-13 2021-11-09 青岛海信网络科技股份有限公司 Event information processing method, server and storage medium
CN114005131A (zh) * 2021-11-02 2022-02-01 京东科技信息技术有限公司 Credential text recognition method and apparatus
CN114677701A (zh) * 2022-03-11 2022-06-28 联宝(合肥)电子科技有限公司 Data recognition method, apparatus, device and storage medium
CN114780172B (zh) * 2022-04-15 2024-02-27 深圳优美创新科技有限公司 Recognition method and apparatus for an external camera, smart display screen and storage medium
CN115379061B (zh) * 2022-08-10 2024-04-30 珠海金山办公软件有限公司 Photo scanning method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6370269B1 (en) * 1997-01-21 2002-04-09 International Business Machines Corporation Optical character recognition of handwritten or cursive text in multiple languages
US7421126B2 (en) * 2000-03-23 2008-09-02 Cardiff Software, Inc. Method and system for searching form features for form identification
CN102831405A (zh) * 2012-08-16 2012-12-19 北京理工大学 Outdoor large-scale object recognition method and system based on distributed and brute-force matching
CN109919076A (zh) * 2019-03-04 2019-06-21 厦门商集网络科技有限责任公司 Deep-learning-based method and medium for confirming the reliability of OCR recognition results
CN110991456A (zh) * 2019-12-05 2020-04-10 北京百度网讯科技有限公司 Bill recognition method and apparatus

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5668897A (en) * 1994-03-15 1997-09-16 Stolfo; Salvatore J. Method and apparatus for imaging, image processing and data compression merge/purge techniques for document image databases
US10685223B2 (en) * 2008-01-18 2020-06-16 Mitek Systems, Inc. Systems and methods for mobile image capture and content processing of driver's licenses
US8995774B1 (en) * 2013-09-19 2015-03-31 IDChecker, Inc. Automated document recognition, identification, and data extraction
US9984471B2 (en) * 2016-07-26 2018-05-29 Intuit Inc. Label and field identification without optical character recognition (OCR)
CN109492643B (zh) * 2018-10-11 2023-12-19 平安科技(深圳)有限公司 OCR-based credential recognition method and apparatus, computer device and storage medium
CN110287971B (zh) * 2019-05-22 2023-11-14 平安银行股份有限公司 Data verification method and apparatus, computer device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6370269B1 (en) * 1997-01-21 2002-04-09 International Business Machines Corporation Optical character recognition of handwritten or cursive text in multiple languages
US7421126B2 (en) * 2000-03-23 2008-09-02 Cardiff Software, Inc. Method and system for searching form features for form identification
CN102831405A (zh) * 2012-08-16 2012-12-19 北京理工大学 Outdoor large-scale object recognition method and system based on distributed and brute-force matching
CN109919076A (zh) * 2019-03-04 2019-06-21 厦门商集网络科技有限责任公司 Deep-learning-based method and medium for confirming the reliability of OCR recognition results
CN110991456A (zh) * 2019-12-05 2020-04-10 北京百度网讯科技有限公司 Bill recognition method and apparatus

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516597A (zh) * 2021-05-19 2021-10-19 中国工商银行股份有限公司 Image correction method, apparatus and server
CN113516597B (zh) * 2021-05-19 2024-05-28 中国工商银行股份有限公司 Image correction method, apparatus and server

Also Published As

Publication number Publication date
CN111046879A (zh) 2020-04-21
CN111046879B (zh) 2023-09-29

Similar Documents

Publication Publication Date Title
WO2021072876A1 (zh) Credential image classification method and apparatus, computer device and readable storage medium
US20220012487A1 (en) Systems and methods for classifying payment documents during mobile image processing
WO2021164232A1 (zh) User identification method, apparatus, device and storage medium
WO2021072885A1 (zh) Text recognition method, apparatus, device and storage medium
US20160260014A1 (en) Learning method and recording medium
US9659213B2 (en) System and method for efficient recognition of handwritten characters in documents
WO2022166532A1 (zh) Face recognition method and apparatus, electronic device and storage medium
US20190347472A1 (en) Method and system for image identification
CN111242124A (zh) Credential classification method, apparatus and device
US10423817B2 (en) Latent fingerprint ridge flow map improvement
CN111209827A (zh) Method and system for OCR bill recognition based on feature detection
CN112418167A (zh) Image clustering method, apparatus, device and storage medium
JP5755046B2 (ja) Image recognition apparatus, image recognition method and program
CN110647895A (zh) Phishing page recognition method based on login-box images and related device
CN109635796B (zh) Questionnaire recognition method, apparatus and device
CN111444362A (zh) Malicious image interception method, apparatus, device and storage medium
CN117493645B (zh) Big-data-based electronic archive recommendation system
CN112288045B (zh) Seal authenticity discrimination method
Hung et al. Automatic vietnamese passport recognition on android phones
CN116311276A (zh) Document image rectification method and apparatus, electronic device and readable medium
CN114513341B (zh) Malicious traffic detection method and apparatus, terminal and computer-readable storage medium
US11482027B2 (en) Automated extraction of performance segments and metadata values associated with the performance segments from contract documents
KR20140112869A (ko) Character recognition apparatus and method
CN113569839A (zh) Credential recognition method, system, device and medium
CN109933969B (zh) Verification code recognition method and apparatus, electronic device and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19948964

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19948964

Country of ref document: EP

Kind code of ref document: A1