CN112801099B - Image processing method, device, terminal equipment and medium
- Publication number: CN112801099B (application number CN202010490243.3A)
- Authority: CN (China)
- Prior art keywords: key, value, field, fields, text sequence
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06V30/153 — Character recognition; image acquisition; segmentation of character regions using recognition of characters or words
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The embodiments of the application disclose an image processing method, an image processing apparatus, a terminal device and a medium. The method comprises: converting an image to be processed into a text sequence; determining the key fields and value fields included in the text sequence; combining the key fields and the value fields in pairs to obtain at least one key value text sequence; acquiring feature information of the key field and the value field in each key value text sequence; pairing the key field and the value field in each key value text sequence according to the feature information; and outputting a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence. By converting image data into structured data, more valuable reference data can be provided to the user, improving the practicality and intelligence of the image processing scheme.
Description
Technical Field
The present application relates to the field of internet technologies, in particular to the field of computer technologies, and more particularly to an image processing method, an image processing apparatus, a terminal device and a computer storage medium.
Background
With the rapid development of the mobile internet, image and text recognition technology is increasingly widely applied. A typical image character recognition technology is OCR (Optical Character Recognition), which electronically scans an input image and extracts the character information in it. This reduces the user's burden of typing in the corresponding character information, makes it convenient to store and edit that information, and saves a large amount of human resources. However, what OCR recognition produces is only a string of editable characters, which is of limited value to the user, so such image-text recognition technology is of limited practicality.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, terminal equipment and a medium, which can provide more valuable reference data for users by converting image data into structured data, thereby improving the practicability and the intelligence of an image processing scheme.
In one aspect, an embodiment of the present application provides an image processing method, including:
converting the image to be processed into a text sequence;
performing key value classification on the text sequence, and determining key fields and value fields included in the text sequence based on a key value classification result;
combining the key fields and the value fields in pairs to obtain at least one key value text sequence, wherein each key value text sequence comprises one key field and one value field;
acquiring feature information of the key field and the value field in each key value text sequence;
pairing the key field and the value field in each key value text sequence according to the feature information; and
outputting a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
In another aspect, an embodiment of the present application provides an image processing apparatus including:
a conversion unit, configured to convert the image to be processed into a text sequence;
a processing unit, configured to perform key value classification on the text sequence, determine the key fields and value fields included in the text sequence based on the key value classification result, combine the key fields and the value fields in pairs to obtain at least one key value text sequence, acquire feature information of the key field and the value field in each key value text sequence, and pair the key field and the value field in each key value text sequence according to the feature information; and
an output unit, configured to output the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
Correspondingly, an embodiment of the application further provides a terminal device comprising an output device, a processor and a storage apparatus, where the storage apparatus is configured to store program instructions, and the processor is configured to call the program instructions and execute the above image processing method.
Accordingly, an embodiment of the present application also provides a computer storage medium having stored therein program instructions for implementing the above-mentioned image processing method when executed.
The embodiment of the application can convert the image to be processed into the text sequence, classify the text sequence by key value, and determine the key field and the value field included in the text sequence based on the key value classification result. Further, the key fields and the value fields can be combined in pairs to obtain at least one key value text sequence, feature information of the key fields and the value fields in each key value text sequence is obtained, pairing processing is carried out on the key fields and the value fields in each key value text sequence according to the feature information, and further structured texts corresponding to the images to be processed are output based on the pairing result of the key fields and the value fields in each key value text sequence. By converting the image data into the structured data, more valuable reference data can be provided for the user, and the practicability and the intelligence of the image processing scheme are improved.
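Purely as an illustration, the overall flow can be summarized in the following minimal Python sketch; the four pluggable components and all function names are hypothetical stand-ins for the models described in this application, not part of it.

```python
# A minimal sketch of the disclosed flow, assuming four pluggable components:
# OCR conversion, key value classification, feature extraction and a matching
# model. All names here are illustrative.

def process_image(image, to_text_sequence, classify_kv, extract_features, match):
    text_sequence = to_text_sequence(image)          # image -> text sequence
    key_fields, value_fields = classify_kv(text_sequence)
    structured = {}
    for key in key_fields:                           # combine key fields and
        for value in value_fields:                   # value fields in pairs
            features = extract_features(image, key, value)
            if match(features):                      # relation-pair decision
                structured[key] = value
    return structured                                # structured text output
```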
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an application scene diagram of an image processing method according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of an image processing method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a named entity model according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a scenario in which key value classification is performed by using a named entity model according to an embodiment of the present application;
Fig. 5 is a schematic flow chart of key value pairing according to an embodiment of the present application;
Fig. 6 is a flowchart of another image processing method according to an embodiment of the present application;
Fig. 7a is an application scenario diagram of another image processing method according to an embodiment of the present application;
Fig. 7b is an application scenario diagram of yet another image processing method according to an embodiment of the present application;
Fig. 7c is an application scenario diagram of yet another image processing method according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a feature extraction and pairing process according to an embodiment of the present application;
Fig. 9 is a schematic diagram of a scenario for determining the aspect ratios of key fields and value fields in an image to be processed according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
AI (Artificial Intelligence) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence thus studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with technology at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly covers computer vision, speech processing, natural language processing and machine learning/deep learning.
Computer vision is a science that studies how to make machines "see": it uses cameras and computers, instead of human eyes, to recognize and measure targets, and further performs graphics processing so that the result becomes an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies the theory and technology needed to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition.
NLP (Natural Language Processing) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language, and is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques and the like.
OCR is an image character recognition technology in computer vision: it electronically scans an input image and extracts the characters in it, which reduces the user's burden of typing in the corresponding character information, makes it convenient to store and edit that information, and helps save a large amount of human resources. However, the result of OCR recognition is simply a string of editable characters with no structured information, which is of limited value; what the user actually needs is structured data. For example, in business license recognition, what the user needs is the recognition result of important fields such as the enterprise name and the legal person, rather than a plain text recognition result. Therefore, how to convert image data into structured data is an important research direction. Structured data can be understood as the result of structuring text into key-value pairs.
For example, referring to fig. 1, assuming that the image to be processed is the business license image shown in the left diagram of fig. 1, the structured data corresponding to the business license image may be as shown in the right diagram of fig. 1, and the key value pairs included in the structured data are listed in table 1.
TABLE 1
Currently, image data may be converted to structured data by OCR structuring methods, which may generally include a template registration method based on image features and a custom field detection method based on text features.
The template registration method based on the image features can map the image to be structured to the template image according to anchor point features (such as fixed characters, field distribution and the like) in the template image, and extract the structuring result of the corresponding field according to the position information, so that the conversion from the image data to the structuring data is realized. This method has the following disadvantages:
1. It places high requirements on image quality and on the character recognition result, and has difficulty coping with rotation, perspective, distortion and similar problems;
2. Structured data cannot be extracted from images whose format differs from the template image, so the method is only suitable for OCR structuring scenarios with fixed-format images. For example, if the template image used by the template registration method is a resident identification card image, the method can accurately process only images to be processed with the same format as a resident identification card image. Identity document images in other formats, such as passports or driver's licenses, or other business-type images, such as social security cards, business licenses or value-added tax invoices, either cannot be detected or are detected with very low accuracy.
The custom field detection method based on text features detects the positions of the fields to be structured with a dedicated text field detector and then obtains the field recognition results with a text recognizer, thereby converting image data into structured data. This method has the following disadvantage: it is only suitable for OCR structuring scenarios with fixed-format images. In scenarios with non-fixed-format images, the positions of the fields to be structured differ between formats, so a dedicated text field detector cannot accurately detect them, which affects the accuracy of structured data extraction.
Therefore, the existing OCR structuring method cannot be applied to the structuring scene of the non-fixed format image, and the application range is limited. Based on this, the embodiment of the application provides an image processing method, which can be executed by a terminal device or a server, wherein the terminal device can access an image processing platform or run an application corresponding to the image processing platform, and the server can be a server corresponding to the image processing platform. The terminal device herein may be any of the following: smart phones, tablet computers, laptop computers, and other portable devices, as well as desktop computers, and the like. Accordingly, the server may refer to a server that provides a corresponding service for the image processing platform, where the server may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers.
In the embodiment of the application, a user can acquire an image to be processed (for example, a camera device is started to shoot the image) or upload the image to be processed through the image processing platform, and trigger text recognition of the image to be processed. In this case, the terminal device or the server may convert the image to be processed into a text sequence, perform key value classification on the text sequence, and determine key fields and value fields included in the text sequence based on the key value classification result. Further, the key fields and the value fields can be combined in pairs to obtain at least one key value text sequence, feature information of the key fields and the value fields in each key value text sequence is obtained, pairing processing is carried out on the key fields and the value fields in each key value text sequence according to the feature information, and further structured texts corresponding to the images to be processed are output based on pairing results of the key fields and the value fields in each key value text sequence, wherein each key value text sequence comprises one key field and one value field. By converting the image data into the structured data, more valuable reference data can be provided for the user, and the practicability and the intelligence of the image processing scheme are improved.
It can be seen that in the process of converting the image data into the structured data, the embodiment of the application does not depend on the template image or a special text field detector, the accuracy of the output result is not influenced by the format change of the image to be processed, the corresponding structured data can be extracted from the image with the non-fixed format more accurately, the method is suitable for the structured scenes of various images with the non-fixed format, and compared with the existing OCR structured method, the application range is wider.
In addition, during the pairing of the key fields and value fields in the text sequence, each key field in the text sequence is combined with all value fields in pairs to obtain at least one key value text sequence, and each key value text sequence is subsequently used as the unit of processing. Each round of feature information acquisition and pairing therefore concerns only the key field and value field within one key value text sequence, with no interference from other fields, which improves the accuracy of the pairing result and, in turn, the accuracy of the structured text output based on the pairing result, i.e. the accuracy of extracting the corresponding structured data from the image to be processed.
It can be understood that a fixed-format image means a single format, i.e. an image in one particular format, whereas a non-fixed-format image covers images in multiple formats. For example, the image processing method provided by the embodiment of the application can accurately extract the corresponding structured text both from image data corresponding to a resident identification card image and from image data corresponding to a passport image, and is therefore suitable for OCR structuring scenarios with images in various formats. The image to be processed in the embodiment of the application may include any one of the following: a business license image, a value-added tax invoice image, an identification card image or a social security card image, which is not specifically limited.
In one embodiment, the above process of converting the image to be processed into the text sequence may be implemented based on the OCR method, the above key value classification and pairing process may be implemented based on the NLP method, and based on this, the embodiment of the present application proposes another image processing method, which may be performed by the above-mentioned terminal device or server, please refer to fig. 2, and the image processing method may include the following steps S201-S204:
S201: image input, OCR recognition and typesetting processing. The user can collect the image to be processed or upload the image to be processed through the image processing platform and trigger text recognition of the image to be processed. In this case, the terminal device or the server may recognize the image to be processed input to the image processing platform through OCR, and perform typesetting processing on the recognition result to obtain the text sequence. The specific implementation mode of typesetting treatment can be as follows: and splicing discrete characters included in the recognition result to form paragraph text.
S202: and (5) classifying key values. In a specific implementation, a text sequence may be subjected to key value classification to obtain a key value classification result, where the key value classification result includes classification labels of each character in the text sequence, each classification label is used to indicate a character type of a corresponding character and a position of the character in a field to which the corresponding character belongs, and the position includes any one or more of the following: a start position, a middle position, and an end position, the character type including any one or more of: key characters, value characters, and other characters.
The key value classification may be performed by calling a named entity model or a position-based single word classification model. In a specific implementation, before the named entity model or the position-based single word classification model is called, it is trained in advance on a large number of text sequences annotated with classification labels; subsequently, the text sequence obtained after step S201 is input into the trained model, and the model outputs a key value classification result including the classification label of each character in the text sequence.
The position-based single word classification model may be a model composed of CUTIE (Convolutional Universal Text Information Extractor) and a classifier: CUTIE extracts the feature information of each character in the text sequence and inputs it into the classifier, and the classifier classifies each character according to its feature information and determines its classification label. The named entity model may be a model combining a Bi-LSTM (Bi-directional Long Short-Term Memory recurrent neural network) and a CRF (Conditional Random Field); its network structure, shown in fig. 3, comprises a text input module, a feature extraction module, a semantic model and a key value classification module. The text input module inputs the text sequence; the feature extraction module extracts the vector features of each character in the text sequence and inputs them into the semantic model; the semantic model outputs the probabilities [p1, p2, ..., pi] that each character belongs to each classification label; and the key value classification module determines, based on these probabilities, the classification label with the highest probability as the target classification label of each character, where i is an integer greater than 0. For example, if in the text sequence the character "成" has the highest probability of belonging to classification label 1 and the character "立" has the highest probability of belonging to classification label 2, classification label 1 may be determined as the target classification label of "成" and classification label 2 as the target classification label of "立". The classification labels may be as shown in table 2. It can be seen that the named entity model provided by the embodiment of the application differs from traditional named entity recognition methods: on one hand, the labels it outputs fall into only the two major categories Key and Value, so it converges better and achieves a better algorithmic effect; on the other hand, it does not depend on layout information or image features of the text to be recognized, so it is more widely applicable.
TABLE 2

Classification label | Character type | Position in field
B-Key | key character | start position
I-Key | key character | middle position
E-Key | key character | end position
B-Value | value character | start position
I-Value | value character | middle position
E-Value | value character | end position
For example, referring to fig. 4, when the content of the text sequence is "成立日期2019年" ("date of establishment: 2019"), the named entity model shown in fig. 3 is called for key value classification, and the named entity model may output the classification labels of the 9 characters "成", "立", "日", "期", "2", "0", "1", "9" and "年" in the text sequence as: "B-Key", "I-Key", "I-Key", "E-Key", "B-Value", "I-Value", "I-Value", "I-Value" and "E-Value", respectively.
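For illustration, a simplified PyTorch sketch of such a Bi-LSTM tagger is given below. The CRF transition layer is omitted for brevity, so the highest-probability label per character is taken directly (matching the target-label selection described above); the label set, including a catch-all label for other characters, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

LABELS = ["B-Key", "I-Key", "E-Key", "B-Value", "I-Value", "E-Value", "Other"]

class KeyValueTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden_dim, len(LABELS))

    def forward(self, char_ids):                  # (batch, seq_len)
        h, _ = self.bilstm(self.embed(char_ids))  # (batch, seq_len, hidden)
        return self.emit(h)                       # per-character label scores

model = KeyValueTagger(vocab_size=6000)
scores = model(torch.randint(0, 6000, (1, 9)))    # e.g. the 9 characters above
tags = [LABELS[i] for i in scores.argmax(-1)[0].tolist()]
```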
Further, after a key value classification result including the classification labels of the characters in the text sequence is obtained, the key fields and value fields included in the text sequence may be determined based on the key value classification result. Specifically, according to the indication of each character's classification label, the characters in the text sequence whose character type is key character and which belong to the same field are integrated into a key field, and the characters whose character type is value character and which belong to the same field are integrated into a value field. For example, in conjunction with fig. 4 and table 2, the classification labels of the characters in the text sequence "成立日期2019年" indicate that "成", "立", "日" and "期" are all key characters belonging to the same field, so they can be integrated into the key field "成立日期" (date of establishment). Correspondingly, the characters "2", "0", "1", "9" and "年" are all value characters belonging to the same field, and can be integrated into the value field "2019年".
S203: feature extraction and key pairing. As a possible implementation manner, referring to fig. 5, after determining the key field and the value field in the text sequence, the terminal device or the server may extract feature information of each key field and the value field in the text sequence, pair each key field and each value field based on the feature information of each key field and the value field, and determine a relationship pair category to which each key field and each value field belong, where the relationship pair category includes a key value pair category or other categories. Further, based on the relation pair categories of the key fields and the value fields, a pairing result for the key fields and the value fields in the text sequence is output, wherein the pairing result indicates the relation pair categories of the key fields and the value fields in the text sequence.
The specific way of performing pairing processing on each key field and value field based on the characteristic information of each key field and value field may be: and calling a matching model to analyze the characteristic information of each key field and each value field, and determining the pairing result of each key field and each value field in the text sequence. The characteristic information herein may include any one or more of the following: semantic information, location information, and image information. The location information may be location information (for example, location coordinates or rank information) of each key field and value field in the image to be processed, and the image information may be image information of an image area where each key field and value field are located in the image to be processed, for example, an image RGB value, a gray value, a pixel value, and the like.
In a specific implementation, the semantic information of each key field and value field can be extracted by an NLP model (for example, a semantic representation model such as BERT or a Transformer); the position information of each key field and value field in the image to be processed is determined by a position information extraction model; and the image information of the image region where each key field and value field is located is determined by an image information extraction model. Both the position information extraction model and the image information extraction model may be CNNs (Convolutional Neural Networks), trained on different training samples to obtain the two models. Specifically, the training samples for the position information extraction model comprise sample fields and sample images annotated with the position information of the sample fields, and the training samples for the image information extraction model comprise sample fields and sample images annotated with the image information of the image regions where the sample fields are located.
In one embodiment, after the key fields and value fields included in the text sequence are determined, each key field and value field may be input into the NLP model, which extracts its semantic information. The image to be processed, together with the key fields and value fields in the text sequence, is input into the trained position information extraction model and image information extraction model: the position information extraction model determines the position information of each key field and value field in the image to be processed, and the image information extraction model determines the image information of the image region where each key field and value field is located in the image to be processed.
It is understood that when the above feature information includes semantic information, location information and image information, the extracting processes of the semantic information, the location information and the image information are three independent processes, and the execution has no sequence, and may be parallel, which is not particularly limited in the present application.
The matching model may be a classification model (such as a random forest, linear regression, logistic regression, a decision tree, an SVM (Support Vector Machine) or a neural network) or a graph model (such as a GCN (Graph Convolutional Network)).
Taking the classification model as an example, after the feature information of all the key fields and the value fields included in the text sequence is determined, the feature information of each key field and the feature information of all the value fields can be combined one by one and input into the classification model, the classification model can determine whether each key field and each value field are a relation pair or not, and a pairing result is output, wherein the pairing result indicates the category of the relation pair to which the key field and the value field belong.
For example, assuming that all the key fields and the value fields included in the text sequence are key field 1, key field 2, key field 3, value field 1, value field 2, and value field 3, respectively, in this case, after feature information of all the key fields and the value fields included in the text sequence is determined, the feature information of the key field 1 and the feature information of all the value fields may be first combined and input into a classification model, whether the feature information of the key field 1 and each of the value fields are a relationship pair is determined through the classification model, and if it is determined that the key field 1 and the value field 2 are a relationship pair with each other, a pairing result indicating that the relationship pair category to which the key field 1 and the value field 2 belong is a key value pair category may be output. And so on, the characteristic information of the key field 2 and the characteristic information of all the value fields can be sequentially combined and input into the classification model, the characteristic information of the key field 3 and the characteristic information of all the value fields are combined and input into the classification model, and the corresponding pairing result is output.
It can be understood that if the relationship between the key field and the value field in the text sequence is a one-to-one correspondence, in the process of pairing each key field and each value field through the classification model, if it is determined that the target key field and the target value field are in a relationship pair (i.e., the relationship pair category to which the target key field and the target value field belong is a key value pair category), then when the value field of the relationship pair with other key fields is subsequently determined, it is unnecessary to input the feature information of all the value fields into the classification model, and only the feature information of other value fields except the target value field is input, which is favorable for reducing the calculation amount of the classification model and improving the pairing efficiency of the key field and the value field. For example, all the key fields and the value fields included in the text sequence are the key field 1, the key field 2, the key field 3, the value field 1, the value field 2 and the value field 3, respectively, and before that, the key field 1 and the value field 2 have been determined to be the relationship pair by the classification model, then when the value field to be the relationship pair with the key field 2 is determined later, the feature information of the key field 2, the value field 1 and the value field 3 may be combined and input into the classification model without inputting the feature information of all the value fields.
Or, when determining the value field that forms a relation pair with the last of all the key fields, the target value field that has not been paired with any key field may be directly determined as that value field, without going through the classification model. For example, if all the key fields and value fields included in the text sequence are key field 1, key field 2, key field 3, value field 1, value field 2 and value field 3, and it has already been determined that key field 1 pairs with value field 2 and key field 2 pairs with value field 3, then value field 1, the only unpaired value field, may be directly determined as the value field that forms a relation pair with key field 3.
As another possible embodiment, after determining the key field and the value field in the text sequence, the terminal device or the server may combine the key field and the value field two by two to obtain at least one key-value text sequence. Further, feature information of the key field and the value field in each key text sequence can be obtained, pairing processing is carried out on the key field and the value field in each key text sequence according to the feature information, and a pairing result of the key field and the value field in each key text sequence is obtained, wherein each key text sequence comprises a key field and a value field, and the pairing result indicates the relation pair category of the key field and the value field in each key text sequence.
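For illustration, the pairing loop with the one-to-one optimisation described above might look like the following sketch; extract_features and the sklearn-style matching_model are hypothetical stand-ins for the feature extractors and the trained classifier.

```python
def pair_one_to_one(key_fields, value_fields, extract_features, matching_model):
    """Pair each key field with one value field, excluding paired values."""
    remaining = list(value_fields)
    result = {}
    for i, key in enumerate(key_fields):
        # Last key field: take the only unpaired value field directly,
        # without consulting the classification model.
        if i == len(key_fields) - 1 and len(remaining) == 1:
            result[key] = remaining[0]
            break
        for value in remaining:
            features = extract_features(key, value)
            if matching_model.predict([features])[0] == 1:  # key value pair
                result[key] = value
                remaining.remove(value)
                break
    return result
```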
S204: and (5) structuring output. In a specific implementation, the terminal device or the server may determine, based on the indication of the pairing result, a target value field paired with each key field in the text sequence, and display each key field and a target value field corresponding to each key field in association on a page of the video processing platform. The association display may be, for example, displaying each key field and the target value field corresponding to each key field in the same line (for example, as shown in the right diagram of fig. 1).
Based on the above description, an embodiment of the present application proposes yet another image processing method, which may be performed by the above-mentioned terminal device or server, please refer to fig. 6, and may include the following steps S601-S605:
S601, converting the image to be processed into a text sequence. The image to be processed may be any of a variety of business images, for example any of the following: a business license image, a value-added tax invoice image, an identification card image or a social security card image. In one embodiment, a user may capture an image to be processed (e.g., turn on the camera to shoot one) or upload it through the image processing platform and trigger text recognition on it. For example, referring to fig. 7a, where the image to be processed is a business license image: the user starts the camera through the image processing platform to shoot the business license image, the platform displays the captured image in a page (as shown in the right diagram of fig. 7a), and the user can trigger the "text recognition" function button in the page by clicking, pressing or voice, thereby triggering text recognition of the business license image.
Further, after detecting that the user has triggered text recognition of the image to be processed, the terminal device or the server may call a text detection model to perform text recognition on the acquired image and perform typesetting processing on the text recognition result to obtain the text sequence corresponding to the image to be processed. The text detection model may be, for example, an OCR text detection model (e.g., EAST (An Efficient and Accurate Scene Text Detector)) or another neural network model for text recognition.
A specific implementation of obtaining the text sequence corresponding to the image to be processed by typesetting the text recognition result is: typesetting the text recognition result and splicing the discrete characters into paragraph text (i.e. the text sequence). This facilitates the quick extraction of effective information from the text sequence during subsequent processing.
S602, performing key value classification on the text sequence, and determining key fields and value fields included in the text sequence based on a key value classification result. Wherein the value field is a named entity in the text sequence, the key field is a text item corresponding to the named entity, and the named entity is a person name, an organization name, a place name and other entities identified by the name, and the more extensive entities can also comprise numbers, dates, currencies, addresses and the like. For example, assume that the image to be processed is a business license image shown in the right diagram of fig. 1, where "XX service limited company" in the business license image is a named entity, and "name" is a text item corresponding to "XX service limited company".
In one embodiment, a named entity model or a position-based single-word classification model can be trained in advance through a large number of text sequences marked with classification labels, after an image to be processed is converted into the text sequences, the trained named entity model or the trained position-based single-word classification model can be called to carry out key value classification on the text sequences, and a key value classification result of the classification labels comprising all characters in the text sequences is output. For example, the named entity model may be a model combining Bi-LSTM and CRF as shown in fig. 3, and the specific implementation manner of calling the named entity model to perform key value classification on the text sequence may refer to the description related to step S202 in the above embodiment, which is not repeated herein.
Wherein the text sequence may include a plurality of fields, each field including one or more characters, each class label (as shown in table 2 above) for indicating a character type of the character and a position of the character in the field to which the character belongs, the position including any one or more of: a start position, a middle position, and an end position, the character type including any one or more of: key characters, value characters, and other characters. In this case, after determining the key value classification result including the classification label of each character in the text sequence through the named entity model or the position-based single word classification model, the characters of the text sequence, of which the character types are key characters and belong to the same field, may be integrated into a key field, and the characters of the text sequence, of which the character types are value characters and belong to the same field, may be integrated into a value field according to the indication of the classification label of each character.
Illustratively, assume that the content of the text sequence is "成立日期2019年" and the classification labels of the 9 characters "成", "立", "日", "期", "2", "0", "1", "9" and "年" in the text sequence are "B-Key", "I-Key", "I-Key", "E-Key", "B-Value", "I-Value", "I-Value", "I-Value" and "E-Value", respectively. In this case, the classification labels indicate that the characters "成", "立", "日" and "期" are all key characters belonging to the same field, so they can be integrated into the key field "成立日期". Correspondingly, the characters "2", "0", "1", "9" and "年" are all value characters belonging to the same field, and can be integrated into the value field "2019年".
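A sketch of this integration rule, assuming the labels follow the B/I/E-Key/Value scheme shown above and that any character without such a label terminates an open field:

```python
def integrate_fields(chars, labels):
    """Merge consecutive B-* ... I-* ... E-* characters into fields."""
    key_fields, value_fields, buf, buf_type = [], [], "", None
    for ch, label in zip(chars, labels):
        if "-" not in label:              # other characters close any field
            buf, buf_type = "", None
            continue
        pos, ftype = label.split("-")     # e.g. "B", "Key"
        if pos == "B":                    # start position opens a new field
            buf, buf_type = ch, ftype
        elif buf_type == ftype:
            buf += ch
            if pos == "E":                # end position closes the field
                (key_fields if ftype == "Key" else value_fields).append(buf)
                buf, buf_type = "", None
    return key_fields, value_fields

chars  = list("成立日期2019年")
labels = ["B-Key", "I-Key", "I-Key", "E-Key",
          "B-Value", "I-Value", "I-Value", "I-Value", "E-Value"]
print(integrate_fields(chars, labels))    # (['成立日期'], ['2019年'])
```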
S603, combining the key fields and the value fields in pairs to obtain at least one key value text sequence, wherein each key value text sequence comprises a key field and a value field.
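A minimal sketch of this combination step, assuming the key fields and value fields are plain strings taken from this description's examples:

```python
from itertools import product

key_fields   = ["name", "established date"]            # key field 1, key field 2
value_fields = ["XX service limited company", "2019"]  # value field 1, value field 2

# Each key field is combined with each value field, yielding 4 key value
# text sequences, each containing exactly one key field and one value field.
kv_text_sequences = list(product(key_fields, value_fields))
print(kv_text_sequences)
```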
S604, obtaining the characteristic information of the key field and the value field in each key-value text sequence, and carrying out pairing processing on the key field and the value field in each key-value text sequence according to the characteristic information.
In one embodiment, after the feature information of the key field and the value field in each key value text sequence is obtained, it may be input into a matching model, which parses the feature information and pairs the key field and the value field in each key value text sequence to obtain a pairing result for them. The pairing result indicates the relationship pair category to which the key field and the value field in each key value text sequence belong, where the relationship pair categories include the key value pair category and other categories. The matching model may be a classification model (such as a random forest, linear regression, logistic regression, a decision tree, an SVM (Support Vector Machine) or a neural network) or a graph model (such as a GCN (Graph Convolutional Network)).
Taking a classification model as an example, for any key value text sequence, after determining the characteristic information of a key field and a value field in the any key value text sequence, combining the characteristic information of the key field and the value field in the any key value text sequence and inputting the combined characteristic information into the classification model, wherein the classification model can judge whether the key field and the value field in the any key value text sequence are relation pairs or not based on the characteristic information of the key field and the value field in the any key value text sequence, if so, determining that the relation pair category to which the key field and the value field in the any key value text sequence belong is a key value pair category; if not, the relation pair category to which the key field and the value field belong in any key value text sequence can be determined to be other categories. Further, based on the determined relationship pair category to which the key field and the value field in the arbitrary key-value text sequence belong, a pairing result for the key field and the value field in the arbitrary key-value text sequence may be output, where the pairing result indicates the relationship pair category to which the key field and the value field in the arbitrary key-value text sequence belong.
Illustratively, assuming that all key fields and value fields included in the text sequence are key field 1, key field 2, value field 1 and value field 2, respectively, combining each key field and all value fields two by two results in 4 key-value text sequences, each key-value text sequence including key fields and value fields as shown in table 3. After determining the characteristic information of the key field and the value field in each key text sequence, the characteristic information of the key field 1 and the value field 1 in the key text sequence 1 may be combined and input into a classification model, whether the key field 1 and the value field 1 in the key text sequence 1 are a relation pair or not is determined through the classification model, if yes, the relation pair category to which the key field 1 and the value field 1 belong in the key text sequence 1 is determined to be the key value pair category, if not, the relation pair category to which the key field 1 and the value field 1 belong in the key text sequence 1 is determined to be other categories, and based on the determined relation pair category to which the key field 1 and the value field 1 belong, a pairing result for the key field 1 and the value field 1 in the key text sequence 1 is output, wherein the pairing result indicates the relation pair category to which the key field 1 and the value field 1 belong in the key text sequence 1. And so on, the characteristic information of the key field 1 and the value field 2 in the key value text sequence 2 can be sequentially merged and input into the classification model, the characteristic information of the key field 2 and the value field 1 in the key value text sequence 3 is merged and input into the classification model, and the characteristic information of the key field 2 and the value field 2 in the key value text sequence 4 is merged and input into the classification model, so that the pairing result of the key field 1 and the value field 2 in the key value text sequence 2, the key field 2 and the value field 1 in the key value text sequence 3 and the key field 2 and the value field 2 in the key value text sequence 4 is obtained.
TABLE 3
Key value text sequence | Included key field and value field |
Key value text sequence 1 | Key field 1 and value field 1 |
Key value text sequence 2 | Key field 1 and value field 2 |
Key value text sequence 3 | Key field 2 and value field 1 |
Key value text sequence 4 | Key field 2 and value field 2 |
Wherein, the characteristic information may include any one or more of the following: semantic information, location information and attribute information for the key and value fields in each key-value text sequence, the attribute information being used to characterize a field type of the key and value fields in each key-value text sequence, the field type comprising a key field type or a value field type, the location information being used to characterize a relative location of the key and value fields in each key-value text sequence in the image to be processed, the location information comprising a location coordinate of the key and value fields in each key-value text sequence in the image to be processed or an aspect ratio relative to the image to be processed.
It will be appreciated that key fields and value fields typically have a strong positional correlation, which refers mainly to their display positions in the image to be processed. For example, in Chinese-format documents the key field is typically displayed on the left with the value field on the right (as shown in the left diagram of fig. 1), or the key field above with the value field below. In the embodiment of the application, the key fields and value fields in the text sequence can therefore be paired by combining the semantic information, the position information and the attribute information of the key field and the value field in each key value text sequence, which further improves the accuracy of the pairing result.
In one embodiment, the terminal device or the server may obtain the semantic information of the key field and the value field in each key value text sequence as follows: segment each key value text sequence according to the positions of its key field and value field, and perform feature extraction on each segmented key value text sequence through a semantic representation model to obtain the semantic information of its key field and value field.
Segmenting each key value text sequence according to the positions of its key field and value field may include: adding an input start flag bit, an input end flag bit, a start flag bit and an end flag bit of the key field, and a start flag bit and an end flag bit of the value field to each key value text sequence according to the positions of the key field and the value field. During the subsequent feature extraction over the segmented key value text sequence, the semantic representation model can then pay more attention to the semantic information of the key field and the value field in each key value text sequence, without being influenced by other fields, which improves the accuracy of the extracted semantic information.
For example, referring to fig. 8, assume that a key-value text sequence includes key field 1 and value field 2, the characteristic information includes semantic information, and the semantic representation model is Bert; the input start flag bit, input end flag bit, start and end flag bits of the key field, and start and end flag bits of the value field are shown in Table 4. In this case, the input start flag bit "[Beg]" and input end flag bit "[End]" may be added around the sequence, the start flag bit "[E1]" and end flag bit "[/E1]" around key field 1, and the start flag bit "[E2]" and end flag bit "[/E2]" around value field 2, based on the positions of key field 1 and value field 2. The key-value text sequence with the flag bits added is then input into Bert, and the semantic information of key field 1 and of value field 2 is extracted by Bert; a sketch of this tagging step follows Table 4. Further, the semantic information of key field 1 and value field 2 may be input into the classification model, which analyzes it and performs pairing processing on key field 1 and value field 2 to obtain their pairing result, the pairing result indicating the relationship pair category to which key field 1 and value field 2 belong; the relationship pair categories include the key-value pair category (the KV pair in fig. 8) and other categories (e.g. the KK pair and others in fig. 8).
Table 4
Input of the start flag bit | [Beg] |
Start flag bit of key field | [E1] |
End flag bit of key field | [/E1] |
Start flag bit of value field | [E2] |
End flag bit of value field | [/E2] |
Input end flag bit | [End] |
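A minimal sketch of the segmentation step, inserting the flag bits of Table 4 around the key field and value field of one key-value text sequence before it is fed to the semantic representation model. The str.replace lookup is an assumption made for brevity; the embodiment locates the fields by their recognized positions.

```python
def add_flag_bits(sequence: str, key: str, value: str) -> str:
    """Wrap the key field and value field with the flag bits of Table 4,
    then wrap the whole input with [Beg]/[End]. Assumes each field occurs
    once in the sequence; a real pipeline would use character positions."""
    tagged = sequence.replace(key, f"[E1]{key}[/E1]", 1)
    tagged = tagged.replace(value, f"[E2]{value}[/E2]", 1)
    return f"[Beg]{tagged}[End]"

print(add_flag_bits("Date of establishment 2017-06-26",
                    "Date of establishment", "2017-06-26"))
# [Beg][E1]Date of establishment[/E1] [E2]2017-06-26[/E2][End]
```

In a typical setup of this kind, the encoder states that Bert produces at the [E1] and [E2] positions would serve as the semantic information of the two fields, though the embodiment does not prescribe this detail.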
In one embodiment, the image to be processed may be placed in a plane rectangular coordinate system for analysis. Here the characteristic information includes position information, which characterizes the relative positions of the key field and the value field in each key-value text sequence in the image to be processed; the position information includes their position coordinates in the image, each position coordinate consisting of an abscissa (the x-axis coordinate of the plane rectangular coordinate system) and an ordinate (the y-axis coordinate). In this case, when the terminal device or the server invokes the text detection model to perform text recognition on the acquired image to be processed, the text recognition result includes not only the character string extracted from the image but also the position coordinates of each character of that string in the image. Further, after all key fields and value fields in the text sequence have been determined, the position coordinates of each character in each key field and in each value field are obtained from the text recognition result, the position coordinates of each key field in the image are determined from the position coordinates of its characters, and the position coordinates of each value field are determined likewise.
The position coordinates of each key field in the image to be processed may be determined from the position coordinates of its characters in any one or more of the following ways: the position coordinates of the first character in the key field are taken as the key field's position coordinates, the position coordinates of the last character are taken as the key field's position coordinates, or the position coordinates of the center point of the key field are taken as its position coordinates. For example, if a key field "date of establishment" contains 4 characters whose position coordinates are (m1, n), (m2, n), (m3, n) and (m4, n), the position coordinates of the center point of the key field may be ((m1+m2+m3+m4)/4, n). Similarly, the position coordinates of each value field in the image to be processed may be determined in the same manner as for key fields, which is not repeated here; a sketch of the three reductions follows.
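The three candidate reductions from per-character coordinates to a field-level coordinate can be sketched as follows, with the center point taken as the mean of the character coordinates:

```python
def field_position(char_coords, mode="center"):
    """Reduce the per-character position coordinates of a field to one
    field-level coordinate; char_coords is a list of (x, y) per character."""
    if mode == "first":
        return char_coords[0]
    if mode == "last":
        return char_coords[-1]
    # "center": mean of the character coordinates
    xs = [x for x, _ in char_coords]
    ys = [y for _, y in char_coords]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Four characters of one key field laid out on a single line:
print(field_position([(1, 2), (2, 2), (3, 2), (4, 2)]))  # (2.5, 2.0)
```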
Further, after the position coordinates of all key fields and value fields in the text sequence have been determined, they may be stored in a designated storage area (e.g. a local storage area of the terminal device or server, a blockchain, or a cloud storage area). The terminal device or the server may subsequently obtain, from the designated storage area, the position coordinates of the key field and the value field in each key-value text sequence as their position information.
Alternatively, the position information may include the aspect ratio of the key field and the value field in each key-value text sequence relative to the image to be processed. In another embodiment, after obtaining from the text recognition result the position coordinates of each character in the key field and in the value field, the terminal device or the server may further obtain the width w (w > 0) and height h (h > 0) of the image to be processed, determine the width x0 and height y0 of the key field from the position coordinates of its characters, and determine the width x1 and height y1 of the value field from the position coordinates of its characters. The aspect ratio of the key field relative to the image to be processed is then x0/w and y0/h, and that of the value field is x1/w and y1/h. The computed ratios for all key fields and value fields in the text sequence may be stored in the designated storage area described above, from which the terminal device or the server may subsequently obtain them as the position information of the key field and the value field in each key-value text sequence.
Illustratively, referring to fig. 9, assume the width of the image to be processed is w and its height is h. In fig. 9, the K in "Kxy" denotes a key field, with the subscript "xy" denoting the width and height of that key field in the image to be processed; likewise, the V in "Vxy" denotes a value field, with the subscript "xy" denoting its width and height. The terminal device or the server can thus determine the aspect ratios of all key fields and value fields in the text sequence relative to the image to be processed from the widths and heights of those fields and the width w and height h of the image.
The width x0 and height y0 of a key field may be determined from the position coordinates of its characters as follows: the difference between the abscissas of the last character and the first character in the key field is taken as the width x0, and the ordinate of any character in the key field is taken as the height y0. The width x1 and height y1 of a value field may be determined from the position coordinates of its characters in the same way, which is not repeated here.
For example, assuming the unit of width and height is cm, for a key field "name" whose first character has position coordinates (4, 2) and whose second character has position coordinates (6, 2), the width of the key field may be determined to be 2 cm and its height to be 2 cm.
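Putting the width/height rule and the ratio computation together, a minimal sketch with assumed example values:

```python
def relative_size(char_coords, image_w, image_h):
    """Width/height of a field as ratios of the image, following the rule
    above: width = abscissa of the last character minus that of the first,
    height = ordinate of any character."""
    width = char_coords[-1][0] - char_coords[0][0]
    height = char_coords[0][1]
    return width / image_w, height / image_h

# Key field "name": characters at (4, 2) and (6, 2) in a 20 cm x 10 cm image.
print(relative_size([(4, 2), (6, 2)], image_w=20, image_h=10))  # (0.1, 0.2)
```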
S605, outputting the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key-value text sequence. The structured text refers to the key fields, and the target value fields paired with them, displayed according to a certain display rule or display manner.
In one embodiment, the pairing result indicates the relationship pair category to which the key field and the value field in each key-value text sequence belong, the relationship pair category being either the key-value pair category or one of the other categories. The terminal device or the server may determine, as indicated by the pairing results, the target value field paired with each key field in the text sequence; the target value field is the value field whose relationship pair category with the corresponding key field is the key-value pair category. Each key field and the target value field paired with it may then be displayed according to a display rule.
For example, assume that the text sequence contains key field 1, key field 2, value field 1 and value field 2, that combining each key field with every value field yields the 4 key-value text sequences of Table 3, and that the pairing results of the 4 key-value text sequences indicate: key field 1 and value field 1 belong to the key-value pair category; key field 1 and value field 2 belong to another category; key field 2 and value field 1 belong to another category; and key field 2 and value field 2 belong to the key-value pair category. The terminal device or the server may then determine that the target value field paired with key field 1 is value field 1, and the target value field paired with key field 2 is value field 2, as in the sketch below.
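A minimal sketch of this selection step; the category strings are assumptions for illustration:

```python
def target_value_fields(pairing_results):
    """Keep, for each key field, the value field whose pairing result is
    the key-value pair category, as produced by the matching model."""
    targets = {}
    for (key, value), category in pairing_results.items():
        if category == "key-value pair":
            targets[key] = value
    return targets

results = {
    ("key field 1", "value field 1"): "key-value pair",
    ("key field 1", "value field 2"): "other",
    ("key field 2", "value field 1"): "other",
    ("key field 2", "value field 2"): "key-value pair",
}
print(target_value_fields(results))
# {'key field 1': 'value field 1', 'key field 2': 'value field 2'}
```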
Here the target key field is any one of the key fields, and the target value field is the value field paired with it. The display rule may specify that the target key field and the target value field are displayed on the same line. For example, assuming the pairing of key fields and value fields in the text sequence is as shown in Table 5, the effect of displaying each key field and its paired target value field according to this rule may be as shown in the right diagram of fig. 1.
Table 5
Key field | Paired value fields |
Unified social credit code | 91440300MA3EL54E2H |
Legal representative | Li X |
Name | XX service Co Ltd |
Residence | Shenzhen Futian district XXX |
Principal type | Limited liability company (wholly owned by a natural person) |
Date of establishment | June 26, 2017 |
Alternatively, the display rule may specify that the target key field and the target value field are displayed on adjacent lines, with the target key field's line immediately before the target value field's line. For example, assuming the pairing of key fields and value fields in the text sequence is as shown in Table 5, the effect of displaying each key field and its paired target value field according to this rule may be as shown in fig. 7b. A sketch of both display rules follows.
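Both display rules can be sketched as follows; the rule names and the separator are illustrative assumptions:

```python
def render_structured_text(targets, rule="same_line"):
    """Render key fields and their paired target value fields under one of
    the two display rules described above."""
    lines = []
    for key, value in targets.items():
        if rule == "same_line":
            lines.append(f"{key}: {value}")  # key and value on one line
        else:  # "adjacent_lines": key line immediately before value line
            lines.append(key)
            lines.append(value)
    return "\n".join(lines)

print(render_structured_text({"Legal representative": "Li X",
                              "Date of establishment": "June 26, 2017"}))
```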
In another embodiment, after the target value field paired with each key field has been determined from the pairing results, the display manner of each key field and value field in the image to be processed may be determined from their position information, and each key field and its paired target value field displayed accordingly. For example, referring to fig. 7c, the image to be processed is a business license image; each key field and its paired target value field may be displayed in a page of the image processing platform following their display manner in the image, with the effect shown in the right diagram of fig. 7c. The key fields and value fields are thus displayed in the page in a manner consistent with their display in the business license image. In this way, the user can quickly locate the required target information in the output structured text, improving the efficiency of obtaining it.
In the embodiment of the application, the image to be processed can be converted into a text sequence, key-value classification performed on the text sequence, and the key fields and value fields it contains determined from the classification result. The key fields and value fields can then be combined in pairs to obtain at least one key-value text sequence, the characteristic information of the key field and the value field in each key-value text sequence obtained, the key field and the value field paired according to that information, and the structured text corresponding to the image output based on the pairing results, thus achieving the conversion from image data to structured data. On the one hand, the method depends on neither template images nor dedicated text field detectors, so the accuracy of the output is unaffected by format changes in the image to be processed; corresponding structured data can be extracted more accurately from images with non-fixed formats, making the method suitable for structuring scenarios across such images and broadening its range of application. On the other hand, each round of feature acquisition and pairing targets only the key field and the value field of a single key-value text sequence, free of interference from other fields, which improves the accuracy of the pairing result between each key field and value field and thus the accuracy of the structured data extracted from the image to be processed.
The embodiment of the present application also provides a computer storage medium in which program instructions are stored; when executed, the program instructions implement the corresponding methods described in the above embodiments.
Referring to fig. 10, which is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, the apparatus may be provided in the terminal device, or may be a computer program (including program code) running in the terminal device.
In one implementation, the apparatus of the embodiment of the present application includes the following structure.
A conversion unit 80 for converting an image to be processed into a text sequence;
The processing unit 81 is configured to perform key value classification on the text sequences, determine key fields and value fields included in the text sequences based on the key value classification result, combine the key fields and the value fields two by two to obtain at least one key value text sequence, obtain feature information of the key fields and the value fields in each key value text sequence, and perform pairing processing on the key fields and the value fields in each key value text sequence according to the feature information;
and an output unit 82, configured to output a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key-value text sequence.
In one embodiment, the characteristic information includes any one or more of the following: semantic information, location information and attribute information of key fields and value fields in each key-value text sequence, wherein the attribute information is used for representing field types of the key fields and the value fields in each key-value text sequence, the field types comprise key field types or value field types, the location information is used for representing relative positions of the key fields and the value fields in each key-value text sequence in an image to be processed, and the location information comprises position coordinates of the key fields and the value fields in each key-value text sequence in the image to be processed or aspect ratio relative to the image to be processed.
In one embodiment, the feature information includes semantic information, and the processing unit 81 is specifically configured to perform segmentation processing on each key-value text sequence according to the positions of the key field and the value field in each key-value text sequence, and to perform feature extraction on each segmented key-value text sequence through a semantic representation model to obtain the semantic information of the key field and the value field in each key-value text sequence.
In one embodiment, the processing unit 81 is further specifically configured to add an input start flag bit, an input end flag bit, a start flag bit of the key field, an end flag bit of the key field, a start flag bit of the value field, and an end flag bit of the value field in each key-value text sequence according to the positions of the key field and the value field in each key-value text sequence.
In one embodiment, the pairing process is performed by calling a matching model, the pairing result indicates the relationship pair category to which the key field and the value field in each key-value text sequence belong, and the relationship pair category includes a key-value pair category or other categories. The output unit 82 is specifically configured to determine, according to the indication of the pairing result of the key field and the value field in each key-value text sequence, a target value field paired with each key field in the text sequence, where the target value field is a value field in the text sequence whose relationship pair category with the corresponding key field is the key-value pair category, and to display each key field and the target value field paired with it according to the display rule.
In one embodiment, the text sequence contains a plurality of fields, each field comprising one or more characters; the key value classification result comprises classification labels of all characters in the text sequence, wherein the classification labels are used for indicating the character types of the characters and the positions of the characters in the belonging fields; the location includes any one or more of the following: a start position, a middle position, and an end position; the character type includes any one or more of the following: key characters, value characters, and other characters.
In one embodiment, the processing unit 81 is specifically configured to integrate, as indicated by the classification label of each character, characters in the text sequence whose character type is a key character and which belong to the same field into a key field, and integrate characters in the text sequence whose character type is a value character and which belong to the same field into a value field.
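As an illustration of this integration, the sketch below assumes a BIE-style (begin/inside/end) position tag inside each classification label; the embodiment names the positions but does not fix an encoding, so the label format here is an assumption:

```python
def integrate_fields(chars, labels):
    """Merge consecutive characters that share a character type into fields.
    Each label pairs a character type ("key"/"value"/"other") with a
    position tag ("B" begin, "I" inside, "E" end), an assumed BIE scheme."""
    fields, current, current_type = [], "", None
    for ch, (ctype, pos) in zip(chars, labels):
        if ctype == "other":
            current, current_type = "", None
            continue
        if pos == "B" or ctype != current_type:
            current, current_type = ch, ctype
        else:
            current += ch
        if pos == "E":
            fields.append((current_type, current))
            current, current_type = "", None
    return fields

chars = list("name:Li")
labels = [("key", "B"), ("key", "I"), ("key", "I"), ("key", "E"),
          ("other", "-"), ("value", "B"), ("value", "E")]
print(integrate_fields(chars, labels))  # [('key', 'name'), ('value', 'Li')]
```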
In one embodiment, the key value classification is performed by calling a named entity model or a position-based single word classification model, the value field is a named entity in the text sequence, and the key field is a text item corresponding to the named entity.
In one embodiment, the converting unit 80 is specifically configured to invoke the text detection model to perform text recognition on the acquired image to be processed, and typeset the text recognition result to obtain a text sequence corresponding to the image to be processed.
In one embodiment, the image to be processed includes any one of the following: business license image, value added invoice image, identity card image or social security card image.
In the embodiments of the present application, the specific implementation of each unit may refer to the description of the related content in the embodiments corresponding to the foregoing drawings.
The image processing device in the embodiment of the application can convert the image to be processed into a text sequence, perform key-value classification on the text sequence, and determine the key fields and value fields it contains from the classification result. The key fields and value fields can then be combined in pairs to obtain at least one key-value text sequence, the characteristic information of the key field and the value field in each key-value text sequence obtained, the key field and the value field paired according to that information, and the structured text corresponding to the image output based on the pairing results, achieving the conversion from image data to structured data. Since the device depends on neither template images nor dedicated text field detectors, the accuracy of the output is unaffected by format changes in the image to be processed; corresponding structured data can be extracted more accurately from images with non-fixed formats, making the device suitable for structuring scenarios across such images and broadening its range of application.
Referring to fig. 11, which is a schematic structural diagram of a terminal device according to an embodiment of the present application, the terminal device includes, besides a power supply module and other structures, a processor 90, a storage device 91, an input device 92 and an output device 93. Data may be exchanged among the processor 90, the storage device 91, the input device 92 and the output device 93, with the processor 90 implementing the corresponding image processing functions.
The storage device 91 may include a volatile memory, such as a random-access memory (RAM); it may also include a non-volatile memory, such as a flash memory or a solid-state drive (SSD); it may also include a combination of the above kinds of memory.
The processor 90 may be a central processing unit (CPU). In one embodiment, the processor 90 may also be a graphics processing unit (GPU), or a combination of a CPU and a GPU. The terminal device may include a plurality of CPUs and GPUs as needed to perform the corresponding image processing.
The input device 92 may include a touch pad, a fingerprint sensor, a microphone, etc.; the output device 93 may include a display (e.g. an LCD), a speaker, etc.
In one embodiment, the storage 91 is used to store program instructions. Processor 90 may invoke program instructions to implement the various methods as referred to above in embodiments of the present application.
In a first possible implementation manner, the processor 90 of the terminal device invokes the program instructions stored in the storage device 91, and is configured to convert the image to be processed into a text sequence, perform key value classification on the text sequence, determine a key field and a value field included in the text sequence based on the key value classification result, combine the key field and the value field two by two to obtain at least one key value text sequence, obtain feature information of the key field and the value field in each key value text sequence, perform pairing processing on the key field and the value field in each key value text sequence according to the feature information, and output a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence.
In one embodiment, the characteristic information includes any one or more of the following: semantic information, location information and attribute information of key fields and value fields in each key-value text sequence, wherein the attribute information is used for representing field types of the key fields and the value fields in each key-value text sequence, the field types comprise key field types or value field types, the location information is used for representing relative positions of the key fields and the value fields in each key-value text sequence in an image to be processed, and the location information comprises position coordinates of the key fields and the value fields in each key-value text sequence in the image to be processed or aspect ratio relative to the image to be processed.
In one embodiment, the feature information includes semantic information, and the processor 90 is specifically configured to perform segmentation processing on each key-value text sequence according to the positions of the key field and the value field in each key-value text sequence, and to perform feature extraction on each segmented key-value text sequence through a semantic representation model to obtain the semantic information of the key field and the value field in each key-value text sequence.
In one embodiment, the processor 90 is further specifically configured to add an input start flag bit, an input end flag bit, a start flag bit of the key field, an end flag bit of the key field, a start flag bit of the value field, and an end flag bit of the value field in each key-value text sequence according to the positions of the key field and the value field in each key-value text sequence.
In one embodiment, the pairing process is performed by calling a matching model, the pairing result indicates the relationship pair category to which the key field and the value field in each key-value text sequence belong, and the relationship pair category includes a key-value pair category or other categories. The processor 90 is further specifically configured to determine, according to the indication of the pairing result of the key field and the value field in each key-value text sequence, a target value field paired with each key field in the text sequence, where the target value field is a value field in the text sequence whose relationship pair category with the corresponding key field is the key-value pair category, and to display, through the output device 93, each key field and the target value field paired with it according to the display rule.
In one embodiment, the text sequence contains a plurality of fields, each field comprising one or more characters; the key value classification result comprises classification labels of all characters in the text sequence, wherein the classification labels are used for indicating the character types of the characters and the positions of the characters in the belonging fields; the location includes any one or more of the following: a start position, a middle position, and an end position; the character type includes any one or more of the following: key characters, value characters, and other characters.
In one embodiment, the processor 90 is specifically configured to integrate the characters in the text sequence, which are of the character type as key characters and belong to the same field, as key fields, and integrate the characters in the text sequence, which are of the character type as value characters and belong to the same field, as value fields, as indicated by the class labels of the respective characters.
In one embodiment, the key value classification is performed by calling a named entity model or a position-based single word classification model, the value field is a named entity in the text sequence, and the key field is a text item corresponding to the named entity.
In one embodiment, the processor 90 is further specifically configured to invoke the text detection model to perform text recognition on the obtained image to be processed, and typeset the text recognition result to obtain a text sequence corresponding to the image to be processed.
In one embodiment, the image to be processed includes any one of the following: business license image, value added invoice image, identity card image or social security card image.
In the embodiments of the present application, the specific implementation of the processor 90 may refer to the descriptions of the related content in the embodiments corresponding to the foregoing drawings.
The terminal device in the embodiment of the application can convert the image to be processed into a text sequence, perform key-value classification on the text sequence, and determine the key fields and value fields it contains from the classification result. The key fields and value fields can then be combined in pairs to obtain at least one key-value text sequence, the characteristic information of the key field and the value field in each key-value text sequence obtained, the key field and the value field paired according to that information, and the structured text corresponding to the image output based on the pairing results, achieving the conversion from image data to structured data. Since the method depends on neither template images nor dedicated text field detectors, the accuracy of the output is unaffected by format changes in the image to be processed; corresponding structured data can be extracted more accurately from images with non-fixed formats, making the method suitable for structuring scenarios across such images and broadening its range of application.
Those skilled in the art will appreciate that all or part of the methods in the above embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the steps of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random-access memory (RAM), or the like.
The above disclosure presents only some examples of the present application and is not intended to limit its scope; those skilled in the art will understand that equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.
Claims (11)
1. An image processing method, the method comprising:
Converting an image to be processed into a text sequence, the text sequence comprising a plurality of fields, each field comprising one or more characters;
Performing key value classification on the text sequence, and determining a key field and a value field included in the text sequence based on a key value classification result, wherein the key value classification result comprises classification labels of all characters in the text sequence, and the classification labels are used for indicating character types of the characters and positions of the characters in the belonging fields; the location includes any one or more of the following: a start position, a middle position, and an end position; the character type includes any one or more of the following: key characters, value characters, and other characters;
combining the key fields and the value fields in pairs to obtain at least one key value text sequence, wherein each key value text sequence comprises a key field and a value field;
acquiring characteristic information of key fields and value fields in each key value text sequence;
Pairing the key field and the value field in each key value text sequence according to the characteristic information;
outputting a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence; and determining the display modes of the key fields and the value fields in the image to be processed based on the position information of the key fields and the value fields in the image to be processed, and displaying the key fields and the target value fields paired with the key fields according to the display modes.
2. The method of claim 1, wherein the characteristic information includes any one or more of: the method comprises the steps of determining semantic information, position information and attribute information of key fields and value fields in each key-value text sequence, wherein the attribute information is used for representing field types of the key fields and the value fields in each key-value text sequence, the field types comprise key field types or value field types, the position information is used for representing relative positions of the key fields and the value fields in each key-value text sequence in the image to be processed, and the position information comprises position coordinates of the key fields and the value fields in each key-value text sequence in the image to be processed or aspect ratios relative to the image to be processed.
3. The method of claim 1, wherein the feature information includes semantic information, and the obtaining feature information of the key field and the value field in each key-value text sequence includes:
Performing segmentation processing on each key value text sequence according to the positions of key fields and value fields in each key value text sequence;
and extracting features of each segmented key value text sequence through a semantic representation model to obtain semantic information of key fields and value fields in each key value text sequence.
4. The method of claim 3, wherein said performing a segmentation process on each of said key-value text sequences according to the locations of the key fields and the value fields in said each of said key-value text sequences comprises:
And adding an input start flag bit, an input end flag bit, a start flag bit of the key field, an end flag bit of the key field, a start flag bit of the value field and an end flag bit of the value field into each key value text sequence according to the positions of the key field and the value field in each key value text sequence.
5. The method of claim 1, wherein the pairing process is performed by calling a matching model, the pairing result indicates a relationship pair category to which the key field and the value field in the each key-value text sequence belong, the relationship pair category includes a key-value pair category or other categories, and the outputting the structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in the each key-value text sequence includes:
Determining target value fields paired with the key fields in the text sequence according to the indication of the pairing result of the key field and the value field in each key-value text sequence, wherein the target value field is a value field in the text sequence whose relationship pair category with the corresponding key field is the key-value pair category;
and displaying the key fields and target value fields paired with the key fields according to a display rule.
6. The method of claim 2, wherein the determining the key field and the value field included in the text sequence based on the key-value classification result comprises:
and integrating, as indicated by the classification labels of the characters, the characters in the text sequence whose character type is the key character and which belong to the same field into a key field, and integrating the characters in the text sequence whose character type is the value character and which belong to the same field into a value field.
7. The method of claim 1, wherein the key-value classification is performed by invoking a named entity model or a location-based single word classification model, the value field is a named entity in the text sequence, and the key field is a text item corresponding to the named entity.
8. The method of claim 1, wherein the converting the image to be processed into a text sequence comprises:
Calling a text detection model to carry out text recognition on the acquired image to be processed;
Typesetting a text recognition result to obtain a text sequence corresponding to the image to be processed.
9. An image processing apparatus, characterized in that the apparatus comprises:
A conversion unit for converting an image to be processed into a text sequence, the text sequence comprising a plurality of fields, each field comprising one or more characters;
The processing unit is used for carrying out key value classification on the text sequence and determining a key field and a value field included in the text sequence based on a key value classification result, wherein the key value classification result comprises classification labels of all characters in the text sequence, and the classification labels are used for indicating the character types of the characters and the positions of the characters in the belonging fields; the location includes any one or more of the following: a start position, a middle position, and an end position; the character type includes any one or more of the following: key characters, value characters, and other characters; combining the key fields and the value fields in pairs to obtain at least one key value text sequence, obtaining characteristic information of the key fields and the value fields in each key value text sequence, and carrying out pairing treatment on the key fields and the value fields in each key value text sequence according to the characteristic information;
The output unit is used for outputting a structured text corresponding to the image to be processed based on the pairing result of the key field and the value field in each key value text sequence; and determining the display modes of the key fields and the value fields in the image to be processed based on the position information of the key fields and the value fields in the image to be processed, and displaying the key fields and the target value fields paired with the key fields according to the display modes.
10. A terminal device, characterized in that the terminal device comprises a processor and a storage means, which are connected to each other, wherein the storage means are adapted to store a computer program, which computer program comprises program instructions, which processor is configured to invoke the program instructions to perform the method according to any of claims 1-8.
11. A computer storage medium having stored therein program instructions which, when executed, are adapted to carry out the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010490243.3A CN112801099B (en) | 2020-06-02 | 2020-06-02 | Image processing method, device, terminal equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112801099A CN112801099A (en) | 2021-05-14 |
CN112801099B true CN112801099B (en) | 2024-05-24 |
Family
ID=75806463
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591657B (en) * | 2021-07-23 | 2024-04-09 | 京东科技控股股份有限公司 | OCR layout recognition method and device, electronic equipment and medium |
CN113936283A (en) * | 2021-09-29 | 2022-01-14 | 科大讯飞股份有限公司 | Image element extraction method, device, electronic equipment and storage medium |
CN114328679A (en) * | 2021-10-22 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Image processing method, apparatus, computer equipment, and storage medium |
CN114187603A (en) * | 2021-11-09 | 2022-03-15 | 北京百度网讯科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN114359535A (en) * | 2021-11-23 | 2022-04-15 | 中科曙光南京研究院有限公司 | OCR recognition result structuring method, system and storage medium |
CN114511864B (en) * | 2022-04-19 | 2023-01-13 | 腾讯科技(深圳)有限公司 | Text information extraction method, target model acquisition method, device and equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334346A (en) * | 2019-06-26 | 2019-10-15 | 京东数字科技控股有限公司 | A kind of information extraction method and device of pdf document |
CN110569846A (en) * | 2019-09-16 | 2019-12-13 | 北京百度网讯科技有限公司 | Image character recognition method, device, equipment and storage medium |
CN111177387A (en) * | 2019-12-25 | 2020-05-19 | 深圳壹账通智能科技有限公司 | User list information processing method, electronic device and computer-readable storage medium |
CN111191715A (en) * | 2019-12-27 | 2020-05-22 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40044244 |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |