WO2021051553A1 - Certificate information classification and positioning method and apparatus - Google Patents

Certificate information classification and positioning method and apparatus

Info

Publication number
WO2021051553A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
detection
boxes
text line
classification
Prior art date
Application number
PCT/CN2019/117550
Other languages
French (fr)
Chinese (zh)
Inventor
黄泽浩
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051553A1 publication Critical patent/WO2021051553A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Definitions

  • This application relates to the field of computer technology, and in particular to a method and apparatus for classifying and positioning certificate information.
  • Classification and positioning of the card-face information of ID cards, bank cards, and similar certificates usually relies on fixed-position extraction of text lines or on general-purpose text detection methods.
  • The former has a limited scope of application and depends heavily on contour extraction and image rectification of the certificate.
  • The latter is slow, and the extracted text must additionally be classified by content, which further reduces accuracy.
  • The embodiments of the present application provide a method and apparatus for classifying and positioning certificate information that broaden the scope of application and increase detection speed.
  • An embodiment of the application provides a method for classifying and positioning certificate information, comprising the following steps: a server detects A pieces of feature information in a first target image using a classification and positioning model based on the YOLO network, extracts A detection boxes, and obtains first border information and first classification labels of the A detection boxes, where the first target image contains a first certificate and A is a positive integer greater than 0; the server then adjusts the border information and classification labels of the A detection boxes according to structured information features of the first certificate, generating second border information and second classification labels of the A detection boxes.
  • An embodiment of the present application also provides an apparatus for classifying and positioning certificate information that achieves the beneficial effects of the above method.
  • The functions of the apparatus can be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above functions.
  • Optionally, the apparatus includes a first extraction unit and an adjustment unit.
  • The first extraction unit is configured to detect the A pieces of feature information in the first target image using the YOLO-based classification and positioning model, extract the A detection boxes, and obtain the first border information and first classification labels of the A detection boxes, where the first target image contains the first certificate and A is a positive integer greater than 0.
  • The adjustment unit is configured to adjust the border information and classification labels of the A detection boxes according to the structured information features of the first certificate, generating the second border information and second classification labels of the A detection boxes.
  • An embodiment of the present application also provides a server that achieves the beneficial effects of the above method for classifying and positioning certificate information.
  • The functions of the server can be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above functions.
  • The server includes a memory, a processor, and a transceiver. The memory stores a computer program supporting the server in executing the above method, the computer program comprising program instructions; the processor controls and manages the actions of the server according to the program instructions; and the transceiver supports communication between the server and other communication devices.
  • An embodiment of the present application also provides a computer-readable storage medium storing instructions which, when run on a processor, cause the processor to execute the above method for classifying and positioning certificate information.
  • In the embodiments of this application, the server detects the A pieces of feature information in the first target image using the YOLO-based classification and positioning model, extracts the A detection boxes, and obtains their first border information and first classification labels, the first target image containing the first certificate and A being a positive integer greater than 0; the server then adjusts the border information and classification labels of the A detection boxes according to the structured information features of the first certificate, generating their second border information and second classification labels.
  • The proposed solution does not depend on contour extraction or image rectification of the certificate, which broadens its scope of application, and the YOLO-based classification and positioning model effectively increases the detection speed of certificate information classification and positioning.
  • FIG. 1 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a method for classifying and positioning certificate information provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an apparatus for classifying and positioning certificate information provided by an embodiment of the present application.
  • The server in the embodiments of the present application may be a conventional server capable of undertaking services and guaranteeing service capability, or a terminal device with a processor, hard disk, memory, and system bus structure that can likewise undertake services and guarantee service capability; the embodiments of this application place no specific limitation on this.
  • The YOLO network is a deep residual network. The advantage of a deep residual network over a plain deep network is that shortcut (highway) connections mitigate the vanishing-gradient problem in networks with many layers.
  • In a deep neural network with many layers, some deeper layers may effectively need to simulate an identity mapping, which is hard for a single layer to learn. A residual network therefore uses a shortcut connection to recast the target mapping F(x) = x as F(x) = g(x) + x, i.e. g(x) = F(x) - x; a layer only needs to drive the residual g(x) to 0 to realize the identity mapping, which is much easier to learn.
  • Using a deep residual network thus effectively solves the vanishing-gradient problem when the network is very deep, keeps the error from growing as layers are added, and improves training efficiency.
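As a concrete illustration of the shortcut connection just described, here is a minimal residual block sketch in PyTorch; the channel count, activations, and layer choices are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = g(x) + x, so the layers only have to learn the residual g(x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.g = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shortcut carries x forward unchanged; if g(x) learns to be 0,
        # the block reduces to the identity mapping.
        return self.act(self.g(x) + x)
```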
  • Please refer to FIG. 1, a schematic diagram of the hardware structure of a server 100 according to an embodiment of the application. The server 100 includes a memory 101, a transceiver 102, and a processor 103 coupled to the memory 101 and the transceiver 102.
  • The memory 101 is configured to store a computer program comprising program instructions; the processor 103 is configured to execute the program instructions stored in the memory 101; and the transceiver 102 is configured to communicate with other devices under the control of the processor 103.
  • When executing the instructions, the processor 103 can carry out the method for classifying and positioning certificate information according to the program instructions.
  • The processor 103 may be a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, transistor logic device, hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of the embodiments of the present application.
  • The processor may also be a combination that implements computing functions, for example one or more microprocessors, or a DSP combined with a microprocessor.
  • The transceiver 102 may be a communication interface, a transceiver circuit, or the like; here, communication interface is a general term that may cover one or more interfaces, such as the interface between the server and a terminal.
  • Optionally, the server 100 may further include a bus 104, through which the memory 101, the transceiver 102, and the processor 103 may be connected to one another.
  • The bus 104 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is drawn in FIG. 1, but this does not mean that there is only one bus or one type of bus.
  • Besides the memory 101, the transceiver 102, the processor 103, and the bus 104 shown in FIG. 1, the server 100 in this embodiment may include other hardware according to the server's actual functions, which will not be detailed here.
  • Under the above operating environment, an embodiment of the present application provides the method for classifying and positioning certificate information shown in FIG. 2. Referring to FIG. 2, the method includes the following steps.
  • S201: The server detects the A pieces of feature information in the first target image using the YOLO-based classification and positioning model, extracts the A detection boxes, and obtains the first border information and first classification labels of the A detection boxes; the first target image contains the first certificate, and A is a positive integer greater than 0.
  • Optionally, before the server detects the A pieces of feature information in the first target image, the method further includes: binarizing a second target image, the resulting binarized image being the first target image.
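A minimal sketch of this preprocessing step using OpenCV; Otsu thresholding and grayscale loading are assumptions, since the patent does not name a specific binarization method.

```python
import cv2

def binarize(second_target_path: str):
    """Binarize the second target image; the result serves as the first target image."""
    gray = cv2.imread(second_target_path, cv2.IMREAD_GRAYSCALE)
    # Otsu's method chooses the threshold automatically (an assumed choice here)
    _, first_target = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return first_target
```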
  • Optionally, the A detection boxes include N text line detection boxes and M non-text line detection boxes, and extracting the A detection boxes comprises: the server detects the feature information in the first target image using the YOLO-based classification and positioning model and extracts the N text line detection boxes; the server likewise detects the feature information in the first target image using the model and extracts the M non-text line detection boxes.
  • For example, the front of an ID card includes 8 pieces of feature information: name, gender, ethnicity, address, date of birth, residential address, ID number, and the ID photo.
  • These 8 pieces of feature information on the front of the ID card comprise one piece of non-text line information and 7 pieces of text line information.
  • The personal data page of a passport includes 12 pieces of feature information: type, country code, passport number, surname, given names, sex, place of birth, date of birth, place of issue, date of issue, issuing authority, and the passport photo.
  • These 12 pieces of feature information on the passport's inside page comprise one piece of non-text line information and 11 pieces of text line information.
  • In the embodiments of the present application, a text line may be a run of p consecutive symbols containing no sentence-ending punctuation, where sentence-ending punctuation includes commas, periods, and exclamation marks, and p is an integer greater than or equal to 0.
  • The distance between any two characters in a text line does not exceed a first distance threshold; this threshold is determined by the actual application and is not specifically limited in the embodiments of the present application.
  • The symbols in a text line may include Chinese characters, English letters, digits, and non-sentence-ending punctuation marks such as plus signs, minus signs, and semicolons.
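To make this definition concrete, here is a small sketch that groups detected characters into text lines using the gap threshold described above; the helper and its data layout are hypothetical, not part of the patent.

```python
SENTENCE_BREAKERS = {",", ".", "!", "，", "。", "！"}

def group_into_text_lines(chars, first_distance_threshold: float):
    """chars: list of (symbol, x_center) pairs sorted left to right on one baseline.
    A new text line starts at sentence-ending punctuation or when the gap between
    neighbouring characters exceeds the first distance threshold."""
    lines, current = [], []
    prev_x = None
    for symbol, x in chars:
        gap_too_large = prev_x is not None and (x - prev_x) > first_distance_threshold
        if symbol in SENTENCE_BREAKERS or gap_too_large:
            if current:
                lines.append("".join(current))
            current = [] if symbol in SENTENCE_BREAKERS else [symbol]
        else:
            current.append(symbol)
        prev_x = x
    if current:
        lines.append("".join(current))
    return lines
```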
  • Optionally, the server's use of the YOLO-based classification and positioning model to detect the feature information in the first target image and extract the N text line detection boxes includes the following steps.
  • S1: The server uses the classification and positioning model to extract n text head detection boxes and n text tail detection boxes from the first target image.
  • The first text head detection box among the n text head detection boxes covers the first B characters of the first text line in the first target image; the length of those B characters is L1, and the text head detection box additionally covers a non-text image region of length t*L1 preceding them.
  • The first text tail detection box among the n text tail detection boxes covers the last C characters of the first text line; the length of those C characters is L2, and the text tail detection box additionally covers a non-text image region of length t*L2 following them.
  • B and C are positive integers, and t is greater than zero and less than or equal to 1.
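A sketch of how such padded head and tail boxes might be laid out; the coordinate convention and helper names are assumptions for illustration only.

```python
def head_box(x_left: float, y_top: float, l1: float, height: float, t: float):
    """Box over the first B characters (width l1) plus a non-text region of
    length t*l1 before them; returned as (x1, y1, x2, y2)."""
    return (x_left - t * l1, y_top, x_left + l1, y_top + height)

def tail_box(x_right: float, y_top: float, l2: float, height: float, t: float):
    """Box over the last C characters (width l2) plus a non-text region of
    length t*l2 after them; returned as (x1, y1, x2, y2)."""
    return (x_right - l2, y_top, x_right + t * l2, y_top + height)
```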
  • The server matches the n text head detection boxes with the n text tail detection boxes based on text line slope consistency and the principle of proximity, obtaining initial detection boxes for the n text lines.
  • The server corrects the initial detection boxes of the n text lines, removing the non-text image regions from them, and obtains n prediction boxes.
  • The server uses the K-means clustering algorithm to obtain the confidence that the n prediction boxes contain text line feature information and the confidence of the category to which the text line feature information in the n prediction boxes belongs.
  • The server uses a non-maximum suppression algorithm to filter the n prediction boxes, obtaining the N text line detection boxes, the target detection scores of the N text line detection boxes, and the first classification labels of the N text line detection boxes.
  • Text line information in certificates such as ID cards, bank cards, and social security cards mostly satisfies text line slope consistency; that is, the slope of the line connecting any two characters within a text line is the same, and/or the slopes of any two text lines are the same.
  • Optionally, matching the n text head detection boxes with the n text tail detection boxes based on slope consistency and proximity to obtain the initial detection boxes of the n text lines proceeds as follows. With the horizontal as reference, the server computes the slopes of the n text head detection boxes, the slopes of the n text tail detection boxes, and the connection slope between the i-th text head detection box and the j-th text tail detection box. Then, where the slope consistency condition is satisfied, the n text head detection boxes are matched one-to-one with the n text tail detection boxes based on proximity and sequential consistency. Sequential consistency means that, for every text line in the first target image, the text head detection box lies to the left (or right) of that line's text tail detection box.
  • The connection slope between the i-th text head detection box and the j-th text tail detection box is the slope of the line joining the center coordinates of the i-th text head detection box and the center coordinates of the j-th text tail detection box.
  • Let the connection slope between the i-th text head detection box and the g-th text tail detection box be the second slope. The i-th text head detection box and the g-th text tail detection box satisfy the slope consistency condition when the difference between the slope of the g-th text tail detection box and a first slope is smaller than a first preset threshold, and the difference between the second slope and the first slope is smaller than a second preset threshold.
  • The first slope may be the slope of the i-th text head detection box, or the average of the slopes of the n text head detection boxes and the n text tail detection boxes.
  • The settings of the first and second preset thresholds are related to this average slope and are determined by the actual situation; the embodiments of the present application do not specifically limit them.
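A compact sketch of this matching rule; greedy nearest-tail pairing is an assumed strategy, since the patent states the conditions but not a concrete matching order.

```python
import math

def connection_slope(head_center, tail_center) -> float:
    """Slope, relative to the horizontal, of the line joining two box centers."""
    dx = tail_center[0] - head_center[0]
    dy = tail_center[1] - head_center[1]
    return dy / dx if dx != 0 else math.inf

def match_heads_to_tails(heads, tails, first_threshold: float, second_threshold: float):
    """heads, tails: lists of dicts with 'center' (x, y) and 'slope'.
    Pairs each head box with the nearest tail box to its right that satisfies
    the slope consistency condition; returns (head_index, tail_index) pairs."""
    boxes = heads + tails
    first_slope = sum(b["slope"] for b in boxes) / len(boxes)  # average slope
    pairs, used = [], set()
    for i, head in enumerate(heads):
        best, best_dist = None, math.inf
        for g, tail in enumerate(tails):
            # sequential consistency: the head must lie left of the tail
            if g in used or tail["center"][0] <= head["center"][0]:
                continue
            second_slope = connection_slope(head["center"], tail["center"])
            if (abs(tail["slope"] - first_slope) < first_threshold
                    and abs(second_slope - first_slope) < second_threshold):
                dist = tail["center"][0] - head["center"][0]  # principle of proximity
                if dist < best_dist:
                    best, best_dist = g, dist
        if best is not None:
            used.add(best)
            pairs.append((i, best))
    return pairs
```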
  • Because the initial detection box of each text line contains a non-text image region of length t*L1 at the text head and a non-text image region of length t*L2 at the text tail, the server corrects the initial detection box of each text line to remove these non-text image regions and obtain the N text line detection boxes.
  • Optionally, the server's use of the classification and positioning model to detect the feature information in the first target image and extract the M non-text line detection boxes includes the following steps.
  • The server performs feature extraction on the first target image using the classification and positioning model to obtain m feature maps of size a*a, a feature map being an image containing non-text line feature information.
  • The server divides each of the m feature maps into a*a grid cells and predicts the center coordinates of the non-text line feature information in the m feature maps; based on these center coordinates, the K-means clustering algorithm yields the lengths and widths of m prediction boxes, the confidence that the m prediction boxes contain non-text line feature information, and the confidence of the category of the non-text line feature information in the m prediction boxes.
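In YOLO-style detectors, K-means clustering over box dimensions is the usual way prior box sizes are obtained; a minimal sketch under that reading follows, with the function name and data layout as illustrative assumptions.

```python
import numpy as np

def kmeans_box_dims(widths_heights: np.ndarray, k: int, iters: int = 100) -> np.ndarray:
    """widths_heights: (num_boxes, 2) array of box widths and heights.
    Plain K-means on box dimensions, returning k (width, height) priors."""
    rng = np.random.default_rng(0)
    centers = widths_heights[rng.choice(len(widths_heights), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every box to its nearest center (Euclidean distance in (w, h) space)
        d = np.linalg.norm(widths_heights[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = widths_heights[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers
```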
  • The server uses a non-maximum suppression algorithm to filter the m prediction boxes, obtaining the M non-text line detection boxes, the target detection scores of the M non-text line detection boxes, and the first classification labels of the M non-text line detection boxes.
  • Optionally, the sigmoid function is used to predict the center coordinates of the non-text line feature information.
  • Optionally, filtering the m prediction boxes with the non-maximum suppression algorithm to obtain the M non-text line detection boxes proceeds as follows: generate target detection scores for the m prediction boxes, sort the scores, and select the highest score and its corresponding prediction box. Traverse the remaining prediction boxes, and delete any prediction box whose overlap area with the current highest-scoring prediction box exceeds a third threshold. Then select the highest-scoring box among the unprocessed prediction boxes and repeat the process until M prediction boxes have been selected as the M non-text line detection boxes.
  • In other words, the non-maximum suppression algorithm keeps the prediction box with the highest target detection score, suppresses other prediction boxes that overlap it significantly, and recursively applies this process to the remaining prediction boxes.
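A self-contained sketch of this suppression procedure; using intersection-over-union as the overlap measure is an assumption, since the patent speaks only of "overlap area".

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, third_threshold: float):
    """boxes: (m, 4) array of (x1, y1, x2, y2); returns indices of kept boxes."""
    order = scores.argsort()[::-1]  # highest target detection score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # intersection of the best box with every remaining box
        xx1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= third_threshold]  # suppress boxes that overlap too much
    return keep
```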
  • Optionally, the border information of the A detection boxes includes the center coordinates of each detection box, the length of the detection box, and the width of the detection box.
  • S202: The server adjusts the border information of the A detection boxes and the classification labels of the A detection boxes according to the structured information features of the first certificate, generating the second border information of the A detection boxes and the second classification labels of the A detection boxes.
  • The structured information features of the first certificate are the relative positional relationship and the relative proportion between any two of the A pieces of feature information of the first certificate.
  • Optionally, this adjustment comprises steps S7 to S14; it is not limited to those steps, and the embodiments of the present application may also include other steps.
  • Optionally, adjusting the first border information of a detection box to the second border information according to the border information of the j-th reference prediction box includes the following.
  • If the center coordinates of the detection box are (x1, y1) and the difference between the center coordinates of the detection box and those of the j-th reference prediction box is (x2, y2), the center coordinates of the detection box are adjusted to (x1 + a*x2, y1 + a*y2).
  • If the length of the detection box is L1 and the difference between the length of the detection box and that of the j-th reference prediction box is L2, the length of the detection box is adjusted to L1 + b*L2.
  • If the width of the detection box is K1 and the difference between the width of the detection box and that of the j-th reference prediction box is K2, the width of the detection box is adjusted to K1 + c*K2.
  • a, b, and c are each greater than or equal to zero and less than or equal to 1; for example, a, b, and c may all be 0.5.
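These update rules translate directly into code; a minimal sketch, with the box representation and the sign convention of the differences (reference minus detection) taken as assumptions.

```python
def adjust_box(box, reference, a: float = 0.5, b: float = 0.5, c: float = 0.5):
    """box, reference: dicts with 'center' (x, y), 'length', and 'width'.
    Moves each border attribute a fraction of the way toward the reference box."""
    x1, y1 = box["center"]
    rx, ry = reference["center"]
    x2, y2 = rx - x1, ry - y1  # assumed sign: reference minus detection
    return {
        "center": (x1 + a * x2, y1 + a * y2),
        "length": box["length"] + b * (reference["length"] - box["length"]),
        "width": box["width"] + c * (reference["width"] - box["width"]),
    }
```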
  • Optionally, before the server uses the YOLO-based classification and positioning model to detect the feature information in the first target image, the method further includes pre-training the YOLO network.
  • Pre-training the YOLO network includes: establishing a sample database containing the image samples used to train the YOLO network; initializing the training parameters of the YOLO network; randomly selecting image samples from the sample database as training samples; feeding the training samples into the YOLO network as input vectors; obtaining the output vector of the YOLO network, i.e. the feature map of each training sample; and optimizing the training parameters according to the output vector, establishing the residual mapping between an image sample and its feature map.
  • Optionally, a transfer learning strategy is adopted, with network parameters trained on the ImageNet dataset used as the initial training parameters of the YOLO network.
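A hedged sketch of such a transfer learning initialization in PyTorch; using ResNet-50 as the ImageNet-pretrained backbone is an assumption, since the patent does not name a specific backbone.

```python
import torch.nn as nn
from torchvision import models

def build_pretrained_backbone() -> nn.Module:
    """Start from ImageNet-trained parameters and drop the classification head,
    keeping the convolutional feature extractor for detection training."""
    resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc
    return backbone
```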
  • In the embodiments of the present application, the server thus detects the A pieces of feature information in the first target image using the YOLO-based classification and positioning model, extracts the A detection boxes, and obtains their first border information and first classification labels, where the first target image contains the first certificate and A is a positive integer greater than 0; the server then adjusts the border information and classification labels of the A detection boxes according to the structured information features of the first certificate, generating their second border information and second classification labels.
  • The proposed solution does not depend on contour extraction or image rectification of the certificate, which broadens its scope of application; by adopting the YOLO-based classification and positioning model and exploiting text line slope consistency, it effectively increases the detection speed of certificate information classification and positioning.
  • An embodiment of the present application also provides an apparatus for classifying and positioning certificate information that achieves the beneficial effects of the above method; its functions can be implemented by hardware, or by hardware executing corresponding software, the hardware or software including at least one module corresponding to the above functions.
  • Please refer to FIG. 3, a structural block diagram of an apparatus 300 for classifying and positioning certificate information according to an embodiment of the present application. The apparatus includes a first extraction unit 301 and an adjustment unit 302.
  • The first extraction unit 301 is configured to detect the A pieces of feature information in the first target image using the YOLO-based classification and positioning model, extract the A detection boxes, and obtain the first border information and first classification labels of the A detection boxes, where the first target image contains the first certificate and A is a positive integer greater than 0.
  • The adjustment unit 302 is configured to adjust the border information and classification labels of the A detection boxes according to the structured information features of the first certificate, generating the second border information and second classification labels of the A detection boxes.
  • Optionally, the A detection boxes include N text line detection boxes and M non-text line detection boxes, and the first extraction unit 301 includes: a text extraction unit, configured to detect the feature information in the first target image using the YOLO-based classification and positioning model and extract the N text line detection boxes; and a non-text extraction unit, configured to detect the feature information in the first target image using the YOLO-based classification and positioning model and extract the M non-text line detection boxes.
  • Optionally, the text extraction unit includes a detection box extraction unit, a matching unit, a correction unit, and a first filtering unit.
  • The detection box extraction unit is configured to extract the n text head detection boxes and the n text tail detection boxes from the first target image using the classification and positioning model. The first text head detection box among the n text head detection boxes covers the first B characters of the first text line in the first target image, whose length is L1, plus a non-text image region of length t*L1 before those characters; the first text tail detection box among the n text tail detection boxes covers the last C characters of the first text line, whose length is L2, plus a non-text image region of length t*L2 after them. B and C are positive integers, and t is greater than zero and less than or equal to 1.
  • The matching unit is configured to match the n text head detection boxes with the n text tail detection boxes based on text line slope consistency and the principle of proximity, obtaining the n text line detection boxes.
  • The correction unit is configured to correct the n text line detection boxes, removing the non-text image regions to obtain the n prediction boxes.
  • The first filtering unit is configured to filter the n prediction boxes with the non-maximum suppression algorithm, obtaining the N text line detection boxes, their target detection scores, and their first classification labels.
  • Optionally, the non-text extraction unit includes: a first acquisition unit, configured to perform feature extraction on the first target image using the classification and positioning model to obtain the m feature maps of size a*a containing non-text line feature information; a second acquisition unit, configured to predict the center coordinates of the non-text line information in the m feature maps and, based on those center coordinates, use the K-means clustering algorithm to obtain the lengths and widths of the m prediction boxes, the confidence that the m prediction boxes contain non-text line feature information, and the confidence of the category of the non-text line feature information in the m prediction boxes; and a second filtering unit, configured to filter the m prediction boxes with the non-maximum suppression algorithm, obtaining the M non-text line detection boxes, their target detection scores, and their first classification labels.
  • Optionally, before the first extraction unit uses the YOLO-based classification and positioning model to detect the feature information in the first target image, the apparatus further includes a pre-training unit configured to pre-train the YOLO network.
  • Optionally, the pre-training unit includes: an establishing unit, configured to establish the sample database containing image samples for training the YOLO network; an initialization unit, configured to initialize the training parameters of the YOLO network; a selection unit, configured to randomly select image samples from the sample database as training samples; an input unit, configured to feed the training samples into the YOLO network as input vectors; a third acquisition unit, configured to obtain the output vector of the YOLO network, i.e. the feature map of each training sample; and a processing unit, configured to optimize the training parameters according to the output vector and establish the residual mapping between an image sample and its feature map.
  • The steps of the methods or algorithms described in connection with the disclosure of the embodiments of the present application may be implemented in hardware, or by a processor executing software instructions.
  • Software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; the storage medium may also be an integral part of the processor.
  • The processor and the storage medium may be located in an ASIC, and the ASIC may be located in a network device; alternatively, the processor and the storage medium may exist as discrete components in the network device.
  • The functions described in the embodiments of the present application may be implemented by hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium.
  • Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another; the storage medium may be any available medium accessible to a general-purpose or special-purpose computer.
  • The computer-readable medium described in this application may be a non-volatile computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A certificate information classification and positioning method and apparatus. The method comprises: a server detecting A pieces of feature information in a first target image by means of a classification and positioning model based on a YOLO network, extracting A detection boxes, and acquiring first border information of the A detection boxes and first classification labels of the A detection boxes, wherein the first target image includes a first certificate, and A is a positive integer greater than 0 (S201); and the server adjusting, according to structured information features of the first certificate, the border information of the A detection boxes and the classification labels of the A detection boxes, and generating second border information of the A detection boxes and second classification labels of the A detection boxes (S202). The application range of the method can be enlarged, and the detection speed is improved.

Description

Method and apparatus for classifying and positioning certificate information

This application claims priority to Chinese patent application No. 201910880737X, filed with the Chinese Patent Office on September 18, 2019 and entitled "Method and apparatus for classifying and positioning certificate information", the entire contents of which are incorporated herein by reference.
技术领域Technical field
本申请涉及计算机技术领域,尤其涉及一种证件信息的分类定位方法及装置。This application relates to the field of computer technology, and in particular to a method and device for classifying and positioning credential information.
背景技术Background technique
身份证、银行卡等证件的卡面信息的分类定位,通常使用文本行的固定位置提取或者通用文本检测方法。前者适用范围受限,过度依赖证件的轮廓提取以及图像矫正,后者检测速度慢,同时对提取文本还需按照内容进行分类,进一步降低了准确性。The classification and positioning of the card surface information of ID cards, bank cards, etc. usually use fixed position extraction of text lines or general text detection methods. The former has a limited scope of application and overly relies on the contour extraction and image correction of the certificate. The latter has a slow detection speed. At the same time, the extracted text needs to be classified according to the content, which further reduces the accuracy.
综上所述,现有的证件信息的分类定位方法在实际应用场景下,适用范围受限,检测速度慢。In summary, the existing classification and positioning methods for document information have limited application scope and slow detection speed in actual application scenarios.
发明内容Summary of the invention
本申请实施例提供了一种证件信息的分类定位方法及装置,能够扩大适用范围,提升检测速度。The embodiments of the present application provide a classification and positioning method and device for credential information, which can expand the scope of application and increase the detection speed.
本申请实施例提供了一种证件信息的分类定位方法,该方法包括以下步骤:服务器利用基于YOLO网络的分类定位模型对第一目标图像中的A个特征信息进行检测,提取A个检测框,并获取上述A个检测框的第一边框信息和上述A个检测框的第一次分类标签,第一目标图像包含第一证件,A为大于0的正整数;服务器根据第一证件的结构化信息特征调整上述A个检测框的边框信息和上述A个检测框的分类标签,生成上述A个检测框的第二边框信息和上述A个检测框的第二次分类标签。The embodiment of the application provides a classification and positioning method for credential information, which includes the following steps: a server uses a classification and positioning model based on the YOLO network to detect A feature information in a first target image, and extract A detection frames, And obtain the first frame information of the above A detection frames and the first classification label of the above A detection frames. The first target image contains the first document, and A is a positive integer greater than 0; the server is structured according to the first document The information feature adjusts the frame information of the A detection frames and the classification labels of the A detection frames to generate the second frame information of the A detection frames and the second classification labels of the A detection frames.
本申请实施例还提供了一种证件信息的分类定位的装置,该装置能实现上述证件信息的分类定位方法所具备的有益效果。其中,该装置的功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括至少一个与上述功能相对应的模块。The embodiment of the present application also provides a device for classification and positioning of credential information, which can realize the beneficial effects of the above-mentioned classification and positioning method for credential information. Among them, the function of the device can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes at least one module corresponding to the above-mentioned functions.
可选的,该装置包括第一提取单元和调整单元。Optionally, the device includes a first extraction unit and an adjustment unit.
第一提取单元,用于利用基于YOLO网络的分类定位模型对第一目标图像中的A个特征信息进行检测,提取A个检测框,并获取上述A个检测框的第一边框信息和上述A个检测框的第一次分类标签,第一目标图像包含第一证件,A为大于0的正整数;The first extraction unit is used to detect A feature information in the first target image using a classification and positioning model based on the YOLO network, extract A detection frames, and obtain the first frame information of the A detection frames and the A The first classification label of a detection frame, the first target image contains the first document, and A is a positive integer greater than 0;
调整单元,用于根据第一证件的结构化信息特征调整上述A个检测框的边框信息和上述A个检测框的分类标签,生成上述A个检测框的第二边框信息和上述A个检测框的第二次分类标签。The adjustment unit is configured to adjust the frame information of the A detection frames and the classification labels of the A detection frames according to the structured information characteristics of the first document, and generate the second frame information of the A detection frames and the A detection frames The second classification label.
本申请实施例还提供了一种服务器,该服务器能实现上述证件信息的分类定位方法所具备的有益效果。其中,该服务器的功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括至少一个与上述功能相对应的模块。该服务器包括存储器、处理器和收发器,存储器用于存储支持服务器执行上述方法的计算机程序,所述计算机程序包括程序指令,处理器用于根据程序指令对服务器的动作进行控制管理,收发器用于支持服务器与其它通信设备的通信。The embodiment of the present application also provides a server, which can realize the beneficial effects of the above-mentioned classification and positioning method for credential information. Among them, the function of the server can be realized by hardware, and can also be realized by hardware executing corresponding software. The hardware or software includes at least one module corresponding to the above-mentioned functions. The server includes a memory, a processor, and a transceiver. The memory is used to store a computer program that supports the server to execute the above method. The computer program includes program instructions. The processor is used to control and manage the actions of the server according to the program instructions. The transceiver is used to support The server communicates with other communication devices.
本申请实施例还提供了一种计算机可读存储介质,可读存储介质上存储有指令,当其在处理器上运行时,使得处理器执行上述证件信息的分类定位方法。The embodiment of the present application also provides a computer-readable storage medium with instructions stored on the readable storage medium, which when run on a processor, cause the processor to execute the above-mentioned classification and positioning method of credential information.
本申请实施例中,服务器利用基于YOLO网络的分类定位模型对第一目标图像中的A个特征信息进行检测,提取A个检测框,并获取上述A个检测框的第一边框信息和上述A 个检测框的第一次分类标签,第一目标图像包含第一证件,A为大于0的正整数;服务器根据第一证件的结构化信息特征调整上述A个检测框的边框信息和上述A个检测框的分类标签,生成上述A个检测框的第二边框信息和上述A个检测框的第二次分类标签。本申请实施例所提方案,不依赖证件的轮廓提取以及图像矫正,能够扩大适用范围,本申请实施例采用基于YOLO网络的分类定位模型,有效提升了证件信息的分类定位的检测速度。In the embodiment of this application, the server uses the classification and positioning model based on the YOLO network to detect A feature information in the first target image, extracts A detection frames, and obtains the first frame information of the A detection frames and the A The first classification label of each detection frame, the first target image contains the first document, and A is a positive integer greater than 0; the server adjusts the border information of the above A detection frames and the above A according to the structured information characteristics of the first document The classification label of the detection frame, the second frame information of the above A detection frames and the second classification label of the above A detection frames are generated. The solution proposed in the embodiment of this application does not rely on the contour extraction and image correction of the certificate, and can expand the scope of application. The embodiment of this application adopts the classification and positioning model based on the YOLO network, which effectively improves the detection speed of the classification and positioning of the certificate information.
本申请附加的方面和优点将在下面的描述中部分给出,这些将从下面的描述中变得明显,或通过本申请的实践了解到。The additional aspects and advantages of the present application will be partly given in the following description, which will become obvious from the following description, or be understood through the practice of the present application.
附图说明Description of the drawings
本申请上述的和/或附加的方面和优点从下面结合附图对实施例的描述中将变得明显和容易理解,其中:The above and/or additional aspects and advantages of the present application will become obvious and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
图1是本申请实施例提供的一种服务器的结构示意图;FIG. 1 is a schematic structural diagram of a server provided by an embodiment of the present application;
图2是本申请实施例提供的一种证件信息的分类定位方法的流程示意图;2 is a schematic flowchart of a method for classifying and positioning credential information provided by an embodiment of the present application;
图3是本申请实施例提供的一种证件信息的分类定位装置的结构示意图。Fig. 3 is a schematic structural diagram of an apparatus for classifying and positioning credential information provided by an embodiment of the present application.
具体实施方式detailed description
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。应当理解,当在本说明书和所附权利要求书中使用时,术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。此外,术语“第一”、“第二”和“第三”等是用于区别不同的对象,而并非用于描述特定的顺序。The technical solutions in the embodiments of the present application will be described below in conjunction with the drawings in the embodiments of the present application. It should be understood that when used in this specification and appended claims, the terms "including" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude one or The existence or addition of multiple other features, wholes, steps, operations, elements, components, and/or collections thereof. In addition, the terms "first", "second", "third", etc. are used to distinguish different objects, but not to describe a specific sequence.
需要说明的是,在本申请实施例中使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请实施例和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。It should be noted that the terms used in the embodiments of the present application are only for the purpose of describing specific embodiments, and are not intended to limit the present application. The singular forms of "a", "said" and "the" used in the embodiments of the present application and the appended claims are also intended to include plural forms, unless the context clearly indicates other meanings. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more associated listed items.
需要说明的是,本申请实施例中的服务器可以是能够承担服务并保障服务能力的常规服务器,也可以是具有处理器、硬盘、内存和系统总线结构的能够承担服务并保障服务能力的终端设备。本申请实施例不作具体限定。It should be noted that the server in the embodiment of the present application can be a conventional server that can undertake services and guarantee service capabilities, or it can be a terminal device that has a processor, hard disk, memory, and system bus structure that can undertake services and guarantee service capabilities. . The embodiments of this application do not make specific limitations.
YOLO网络是深度残差网络,深度残差网络相对于一般深度网络的优势在于使用高速网络解决层数较高的深度网络中的梯度消失问题。在深度神经网络中,如果层数较高,其较深的某些层很可能需要模拟一个恒等映射,而这个恒等映射对于某一层是较难学习的。因此,深度残差网络利用捷径连接把原本的恒等映射F(x)=x设计为F(x)=g(x)+x,也即g(x)=F(x)-x,只要学习使残差g(x)=0,就能学习到一个恒等映射,降低了学习恒等映射的难度。利用深度残差网络,可以有效解决在深度网络层数较多时产生的梯度消失问题,使得在层数较大时深度网络的误差也不会增大,提高训练效率。The YOLO network is a deep residual network. The advantage of the deep residual network over the general deep network is to use a high-speed network to solve the problem of gradient disappearance in a deep network with a higher number of layers. In a deep neural network, if the number of layers is high, some of its deeper layers may need to simulate an identity map, and this identity map is more difficult to learn for a certain layer. Therefore, the deep residual network uses shortcut connections to design the original identity mapping F(x)=x as F(x)=g(x)+x, that is, g(x)=F(x)-x, as long as By learning to make the residual g(x)=0, an identity mapping can be learned, which reduces the difficulty of learning identity mapping. Using the deep residual network can effectively solve the problem of gradient disappearance when the number of layers of the deep network is large, so that the error of the deep network will not increase when the number of layers is large, and the training efficiency can be improved.
请参见图1,图1为本申请实施例提供的一种服务器100的硬件结构示意图,服务器100包括:存储器101、收发器102及与所述存储器101和收发器102耦合的处理器103。存储器101用于存储计算机程序,所述计算机程序包括程序指令,处理器103用于执行存储器101存储的程序指令,收发器102用于在处理器103的控制下与其他设备进行通信。当处理器103在执行指令时可根据程序指令执行证件信息的分类定位方法。Please refer to FIG. 1, which is a schematic diagram of the hardware structure of a server 100 according to an embodiment of the application. The server 100 includes a memory 101, a transceiver 102, and a processor 103 coupled to the memory 101 and the transceiver 102. The memory 101 is configured to store a computer program, the computer program includes program instructions, the processor 103 is configured to execute the program instructions stored in the memory 101, and the transceiver 102 is configured to communicate with other devices under the control of the processor 103. When the processor 103 is executing instructions, it can execute the classification and positioning method of the credential information according to the program instructions.
其中,处理器103可以是中央处理器(英文:central processing unit,简称:CPU),通用处理器,数字信号处理器(英文:digital signal processor,简称:DSP),专用集成电路(英文:application-specific integrated circuit,简称:ASIC),现场可编程门阵列(英文:field programmable gate array,简称:FPGA)或者其他可编程逻辑器件、晶体管逻辑器件、硬件 部件或者其任意组合。其可以实现或执行结合本申请实施例公开内容所描述的各种示例性的逻辑方框,模块和电路。处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。收发器102可以是通信接口、收发电路等,其中,通信接口是统称,可以包括一个或多个接口,例如服务器与终端之间的接口。Among them, the processor 103 may be a central processing unit (English: central processing unit, abbreviated as: CPU), a general-purpose processor, a digital signal processor (English: digital signal processor, abbreviated as: DSP), an application specific integrated circuit (English: application- Specific integrated circuit, abbreviation: ASIC), field programmable gate array (English: field programmable gate array, abbreviation: FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof. It can implement or execute various exemplary logical blocks, modules, and circuits described in conjunction with the disclosure of the embodiments of the present application. The processor may also be a combination of computing functions, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, and so on. The transceiver 102 may be a communication interface, a transceiver circuit, etc., where the communication interface is a general term and may include one or more interfaces, such as an interface between a server and a terminal.
可选地,服务器100还可以包括总线104。其中,存储器101、收发器102以及处理器103可以通过总线104相互连接;总线104可以是外设部件互连标准(英文:peripheral component interconnect,简称:PCI)总线或扩展工业标准结构(英文:extended industry standard architecture,简称:EISA)总线等。总线104可以分为地址总线、数据总线、控制总线等。为便于表示,图1中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。Optionally, the server 100 may further include a bus 104. Among them, the memory 101, the transceiver 102, and the processor 103 may be connected to each other through a bus 104; the bus 104 may be a peripheral component interconnection standard (English: peripheral component interconnect, abbreviated as: PCI) bus or an extended industry standard structure (English: extended industry standard architecture, referred to as EISA) bus, etc. The bus 104 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 1, but it does not mean that there is only one bus or one type of bus.
除了图1所示的存储器101、收发器102、处理器103以及上述总线104之外,实施例中服务器100通常根据该服务器的实际功能,还可以包括其他硬件,对此不再赘述。In addition to the memory 101, the transceiver 102, the processor 103, and the aforementioned bus 104 shown in FIG. 1, the server 100 in the embodiment may also include other hardware generally according to the actual function of the server, which will not be repeated here.
在上述运行环境下,本申请实施例提供了如图2所示的证件信息的分类定位方法。请参阅图2,所述证件信息的分类定位方法包括:In the foregoing operating environment, the embodiment of the present application provides a classification and positioning method for credential information as shown in FIG. 2. Please refer to Figure 2. The classification and positioning method of the certificate information includes:
S201、服务器利用基于YOLO网络的分类定位模型对第一目标图像中的A个特征信息进行检测,提取A个检测框,获取上述A个检测框的第一边框信息和上述A个检测框的第一次分类标签,第一目标图像包含第一证件,A为大于0的正整数。S201. The server uses the classification and positioning model based on the YOLO network to detect A feature information in the first target image, extracts A detection frames, and obtains the first frame information of the A detection frame and the first frame information of the A detection frame. Once the label is classified, the first target image contains the first certificate, and A is a positive integer greater than 0.
可选的,上述服务器利用基于YOLO网络的分类定位模型对第一目标图像中的A个特征信息进行检测之前,上述方法还包括:对第二目标图像进行二值化处理,获取第二目标图像的二值化图像,即第一目标图像。Optionally, before the server uses the classification and positioning model based on the YOLO network to detect the A feature information in the first target image, the method further includes: binarizing the second target image to obtain the second target image The binarized image is the first target image.
Optionally, the A detection boxes include N text line detection boxes and M non-text line detection boxes, and detecting the feature information in the first target image by using the YOLO-based classification and positioning model and extracting the A detection boxes includes: the server detects the feature information in the first target image by using the model and extracts the N text line detection boxes; and the server detects the feature information in the first target image by using the model and extracts the M non-text line detection boxes.
For example, the front of an ID card includes eight pieces of feature information: name, gender, ethnicity, date of birth, address, ID number, and the ID photo, of which one is non-text line information and seven are text line information (the address may occupy more than one text line). The personal data page of a passport includes twelve pieces of feature information: Type, Country Code, Passport No., Surname, Given names, Sex, Place of birth, Date of birth, Place of issue, Date of issue, Authority, and the passport photo. These twelve pieces of feature information comprise one piece of non-text line information and eleven pieces of text line information.
In the embodiments of this application, a text line may be a run of p consecutive symbols that contains no sentence-breaking punctuation, where sentence-breaking punctuation includes commas, periods, exclamation marks, and the like, and p is a positive integer. The distance between any two characters in a text line does not exceed a first distance threshold, which is determined by the actual application and is not specifically limited in the embodiments of this application. The symbols in a text line may include Chinese characters, English letters, digits, and non-sentence-breaking punctuation such as plus signs, minus signs, and semicolons.
Optionally, detecting the feature information in the first target image by using the YOLO-based classification and positioning model and extracting the N text line detection boxes includes:
S1. The server extracts n text head detection boxes and n text tail detection boxes from the first target image by using the classification and positioning model. A first text head detection box among the n text head detection boxes includes the first B characters of a first text line in the first target image, the length of these B characters being L1, and further includes a non-text image area of length t*L1 preceding the B characters. A first text tail detection box among the n text tail detection boxes includes the last C characters of the first text line, the length of these C characters being L2, and further includes a non-text image area of length t*L2 following the C characters. B and C are positive integers, and t is greater than 0 and less than or equal to 1.
S2. The server matches the n text head detection boxes with the n text tail detection boxes based on text-line slope consistency and the proximity principle, obtaining initial detection boxes of the n text lines.
S3. The server corrects the initial detection boxes of the n text lines, removing the non-text image areas from them to obtain n prediction boxes.
S4. The server uses the K-means clustering algorithm to obtain the confidence that each of the n prediction boxes contains text line feature information and the confidence of the category to which the text line feature information in each of the n prediction boxes belongs.
S5. The server filters the n prediction boxes with a non-maximum suppression algorithm, obtaining the N text line detection boxes, their target detection scores, and their first classification labels.
It should be noted that text line information in certificates such as ID cards, bank cards, and social security cards mostly satisfies text-line slope consistency, that is, the connecting slope between any two characters within a text line is the same, and/or the slopes of any two text lines are the same.
Optionally, matching the n text head detection boxes with the n text tail detection boxes based on text-line slope consistency and the proximity principle to obtain the initial detection boxes of the n text lines includes: relative to a suitable reference horizontal line, the server computes the slope of each of the n text head detection boxes, the slope of each of the n text tail detection boxes, and the connection slope between the i-th text head detection box and the j-th text tail detection box. Then, where the slope consistency condition is satisfied, the n text head detection boxes are matched one-to-one with the n text tail detection boxes based on the proximity principle and order consistency. Order consistency means that, for every text line in the first target image, the text head detection box lies to the left (or, respectively, to the right) of that text line's text tail detection box.
Optionally, the connection slope between the i-th text head detection box and the j-th text tail detection box is the slope of the line joining the center coordinates of the i-th text head detection box and the center coordinates of the j-th text tail detection box.
Optionally, the connection slope between the i-th text head detection box and the g-th text tail detection box is a second slope, and the i-th text head detection box and the g-th text tail detection box satisfy the slope consistency condition when: the difference between the slope of the g-th text tail detection box and a first slope is less than a first preset threshold, and the difference between the second slope and the first slope is less than a second preset threshold. The first slope may be the slope of the i-th text head detection box, or the average of the slopes of the n text head detection boxes and the n text tail detection boxes.
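A compact sketch of this matching step (plain Python; the box representation, the thresholds th1 and th2, and the greedy strategy are illustrative assumptions, with the first slope taken here as the average slope):

```python
import math

def match_head_tail(heads, tails, th1=0.05, th2=0.05):
    """Greedily match text head detection boxes to text tail detection boxes.

    heads/tails: lists of dicts with 'cx', 'cy' (center coordinates) and
    'slope'. Implements slope consistency, order consistency (head left of
    tail), and the proximity principle described above.
    """
    slopes = [b['slope'] for b in heads] + [b['slope'] for b in tails]
    first_slope = sum(slopes) / len(slopes)   # average slope as the first slope
    pairs, used = [], set()
    for i, h in enumerate(heads):
        best, best_dist = None, float('inf')
        for g, t in enumerate(tails):
            if g in used or t['cx'] <= h['cx']:        # order consistency
                continue
            if abs(t['slope'] - first_slope) >= th1:   # tail slope vs first slope
                continue
            link = (t['cy'] - h['cy']) / (t['cx'] - h['cx'])  # connection slope
            if abs(link - first_slope) >= th2:
                continue
            d = math.hypot(t['cx'] - h['cx'], t['cy'] - h['cy'])
            if d < best_dist:                          # proximity principle
                best, best_dist = g, d
        if best is not None:
            used.add(best)
            pairs.append((i, best))
    return pairs
```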
It should be noted that the settings of the first preset threshold and the second preset threshold are related to the above slope average and depend on the actual situation; the embodiments of this application do not specifically limit them.
It can be understood that the head of each text line's initial detection box contains a non-text image area of length t*L1 and its tail contains a non-text image area of length t*L2, so the server needs to correct the initial detection boxes to remove these non-text image areas and obtain the N text line detection boxes.
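A one-function sketch of this correction, assuming an axis-aligned initial box spanning [x_left, x_right] horizontally (the names are illustrative):

```python
def trim_initial_box(x_left: float, x_right: float,
                     L1: float, L2: float, t: float):
    """Remove the non-text margins from an initial text line detection box:
    t*L1 before the first B characters and t*L2 after the last C characters."""
    return x_left + t * L1, x_right - t * L2
```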
Optionally, detecting the feature information in the first target image by using the classification and positioning model and extracting the M non-text line detection boxes includes:
S4. The server performs feature extraction on the first target image by using the classification and positioning model, obtaining m feature maps of size a*a, where a feature map is an image containing non-text line feature information.
S5. The server divides each of the m feature maps into a*a grid cells and predicts center coordinates for the non-text line feature information in the m feature maps; based on the center coordinates, it uses the K-means clustering algorithm to obtain the length and width of m prediction boxes, the confidence that each of the m prediction boxes contains non-text line feature information, and the confidence of the category to which the non-text line feature information in each of the m prediction boxes belongs.
S6. The server filters the m prediction boxes with a non-maximum suppression algorithm, obtaining the M non-text line detection boxes, their target detection scores, and their first classification labels.
Optionally, the sigmoid function is used to predict the center coordinates of the non-text line feature information.
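The two numerical pieces of this step can be sketched as follows (NumPy; a plain K-means over box dimensions stands in for the clustering described above, and the grid-cell sigmoid decoding follows the usual YOLO formulation — both are illustrative):

```python
import numpy as np

def kmeans_box_dims(box_dims: np.ndarray, k: int, iters: int = 20):
    """Cluster (width, height) pairs into k prior sizes with plain K-means.
    box_dims: float array of shape (N, 2)."""
    rng = np.random.default_rng(0)
    centers = box_dims[rng.choice(len(box_dims), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(box_dims[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = box_dims[labels == j].mean(axis=0)
    return centers

def decode_center(tx: float, ty: float, cx: int, cy: int):
    """Sigmoid center prediction: the predicted offset stays inside the grid
    cell whose top-left corner is (cx, cy), measured in cells."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    return cx + sigmoid(tx), cy + sigmoid(ty)
```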
Optionally, filtering the m prediction boxes with the non-maximum suppression algorithm to obtain the M non-text line detection boxes includes: generating target detection scores for the m prediction boxes, sorting the m prediction boxes by score, and selecting the highest-scoring prediction box. The remaining prediction boxes are traversed, and any prediction box whose overlap area with the current highest-scoring prediction box is greater than a third threshold is deleted. The highest-scoring box among the unprocessed prediction boxes is then selected, and the above process is repeated until M prediction boxes have been selected as the M non-text line detection boxes.
It can be understood that the non-maximum suppression algorithm produces detection boxes based on the target detection scores: the highest-scoring prediction box is selected, and other prediction boxes that overlap it significantly are suppressed. This process is applied recursively to the remaining prediction boxes.
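A plain-Python sketch of this filtering pass (boxes as (x1, y1, x2, y2) corners; IoU is used here as the overlap measure, whereas the description above speaks of overlap area — an illustrative simplification):

```python
def iou(a, b):
    """Intersection-over-union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, overlap_thresh):
    """Keep the highest-scoring box, suppress boxes overlapping it beyond the
    threshold, and repeat on the remainder, as described above."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [j for j in order if iou(boxes[best], boxes[j]) <= overlap_thresh]
    return keep
```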
In the embodiments of this application, the border information of each of the A detection boxes includes the center coordinates, the length, and the width of the detection box.
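Collected into one record, the border information plus the score and label used in step S202 below might look like this (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class DetectionBox:
    cx: float      # center x coordinate
    cy: float      # center y coordinate
    length: float  # box length
    width: float   # box width
    score: float   # target detection score
    label: str     # first classification label
```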
S202. The server adjusts the border information and the classification labels of the A detection boxes according to structured information features of the first certificate, generating second border information and second classification labels of the A detection boxes.
In the embodiments of this application, the structured information features of the first certificate refer to the relative positional relationship and the relative ratio between any two of the A pieces of feature information of the first certificate.
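One illustrative way to encode these structured information features between two fields, reusing the DetectionBox sketch above (the application does not fix a concrete formula):

```python
def relative_features(a: DetectionBox, b: DetectionBox) -> dict:
    """Relative positional relationship and relative ratio of field b with
    respect to field a."""
    return {
        "dx": b.cx - a.cx,                    # relative position
        "dy": b.cy - a.cy,
        "length_ratio": b.length / a.length,  # relative ratio
        "width_ratio": b.width / a.width,
    }
```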
Optionally, adjusting the border information and the classification labels of the A detection boxes according to the structured information features of the first certificate to generate the second border information and the second classification labels of the A detection boxes includes steps S7 to S15. The embodiments of this application are not limited to these steps and may include other steps.
S7. With i = 0, select from the A-i detection boxes the first detection box, i.e., the one with the highest target detection score; the first classification label of the first detection box is first feature information.
S8. Taking the first detection box as a reference, obtain, from the relative positional relationships and relative ratios between the first feature information and the remaining A-1 pieces of feature information, the reference prediction boxes corresponding to the remaining A-1 pieces of feature information and the border information of those reference prediction boxes.
S9. With j = 1, select from the remaining A-1 detection boxes the one with the largest overlap area with the j-th of the A-1 reference prediction boxes. If that overlap area is greater than a third preset threshold and the selected detection box's first classification label matches the feature information corresponding to the j-th reference prediction box, increase the detection box's target detection score by Δt; if the first classification label does not match the feature information corresponding to the j-th reference prediction box, decrease the detection box's target detection score by Δt. (A condensed code sketch of this scoring pass is given after step S15 below.)
S10. Set j = j + 1, with j less than or equal to A-1.
Repeat steps S9 and S10.
S11. Set i = i + 1, with i less than or equal to A-1.
Repeat steps S7 to S11 until all A detection boxes have been traversed.
S12. From the traversed A detection boxes, select the third detection box, i.e., the one with the highest target detection score; the first classification label of the third detection box is third feature information.
S13. Taking the third detection box as a reference, obtain, from the relative positional relationships and relative ratios between the third feature information and the remaining A-1 pieces of feature information, the reference prediction boxes corresponding to the remaining A-1 pieces of feature information and the border information of those reference prediction boxes.
S14. With j = 1, select from the remaining A-1 detection boxes the one with the largest overlap area with the j-th of the A-1 reference prediction boxes. If that overlap area is greater than a fourth preset threshold, set the detection box's second classification label to the feature information corresponding to the j-th reference prediction box, and adjust the detection box's first border information to second border information according to the border information of the j-th reference prediction box.
S15. Set j = j + 1, with j less than or equal to A-1.
Repeat steps S14 and S15 until the second border information and the second classification labels of the A detection boxes have been generated.
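A heavily condensed sketch of the score-refinement pass (steps S7 to S11); `make_reference_boxes` is an assumed helper that projects reference prediction boxes for the remaining fields from the anchor box via the certificate layout, and `DetectionBox` is the sketch above:

```python
def overlap_area(a: DetectionBox, b: DetectionBox) -> float:
    """Overlap area of two boxes; length is taken as the horizontal extent
    and width as the vertical extent (an assumption)."""
    w = min(a.cx + a.length / 2, b.cx + b.length / 2) - \
        max(a.cx - a.length / 2, b.cx - b.length / 2)
    h = min(a.cy + a.width / 2, b.cy + b.width / 2) - \
        max(a.cy - a.width / 2, b.cy - b.width / 2)
    return max(0.0, w) * max(0.0, h)

def refine_scores(dets, make_reference_boxes, third_threshold, delta_t):
    """One anchor iteration: reward boxes that sit where the layout expects
    their label, penalise boxes whose label disagrees."""
    anchor = max(dets, key=lambda d: d.score)
    for field, ref in make_reference_boxes(anchor).items():
        cand = max((d for d in dets if d is not anchor),
                   key=lambda d: overlap_area(d, ref))
        if overlap_area(cand, ref) > third_threshold and cand.label == field:
            cand.score += delta_t
        elif cand.label != field:
            cand.score -= delta_t
```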
Optionally, adjusting the detection box's first border information to the second border information according to the border information of the j-th reference prediction box includes:
Let the center coordinates of the detection box be (x1, y1) and the difference between the center coordinates of the j-th reference prediction box and those of the detection box be (x2, y2); the center coordinates of the detection box are then adjusted to (x1 + a*x2, y1 + a*y2). Let the length of the detection box be L1 and the length difference between the j-th reference prediction box and the detection box be L2; the length of the detection box is adjusted to L1 + b*L2. Let the width of the detection box be K1 and the width difference between the j-th reference prediction box and the detection box be K2; the width of the detection box is adjusted to K1 + c*K2. Here a, b, and c are each greater than or equal to 0 and less than or equal to 1; for example, a, b, and c may all take the value 0.5.
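In code, with the differences taken as (reference minus detection) so that the box moves toward the reference prediction box (DetectionBox as sketched earlier; this sign convention is an interpretation of the formulas above):

```python
def adjust_border(det: DetectionBox, ref: DetectionBox,
                  a: float = 0.5, b: float = 0.5, c: float = 0.5) -> DetectionBox:
    """Adjust the first border information toward the j-th reference
    prediction box; a, b, c lie in [0, 1]."""
    x2, y2 = ref.cx - det.cx, ref.cy - det.cy   # center-coordinate difference
    L2 = ref.length - det.length                # length difference
    K2 = ref.width - det.width                  # width difference
    det.cx, det.cy = det.cx + a * x2, det.cy + a * y2
    det.length += b * L2
    det.width += c * K2
    return det
```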
Optionally, before the server detects the feature information in the first target image by using the YOLO-based classification and positioning model and extracts the A detection boxes, the method further includes pre-training the YOLO network, which comprises: establishing a sample database containing image samples for training the YOLO network; initializing the training parameters of the YOLO network; randomly selecting image samples from the sample database as training samples; feeding a training sample into the YOLO network as an input vector; obtaining the YOLO network's output vector, i.e., the feature map of the training sample; and optimizing the training parameters according to the output vector, establishing a residual network between the image samples and their feature maps.
Optionally, a transfer learning strategy is adopted, using network parameters trained on the ImageNet dataset as the training parameters of the YOLO network.
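A sketch of this transfer-learning initialization (PyTorch assumed; the checkpoint path is a placeholder, and name-and-shape matching is an illustrative policy, not the application's prescription):

```python
import torch

def init_from_imagenet(backbone: torch.nn.Module, checkpoint_path: str):
    """Copy ImageNet-pretrained parameters into the YOLO backbone wherever
    parameter names and tensor shapes match; everything else keeps its
    freshly initialized values."""
    pretrained = torch.load(checkpoint_path, map_location="cpu")
    own = backbone.state_dict()
    matched = {name: tensor for name, tensor in pretrained.items()
               if name in own and tensor.shape == own[name].shape}
    own.update(matched)
    backbone.load_state_dict(own)
    return backbone
```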
In the embodiments of this application, the A pieces of feature information in the first target image are detected by using the classification and positioning model based on the YOLO network; A detection boxes are extracted; and the first border information and the first classification labels of the A detection boxes are obtained, where the first target image contains the first certificate and A is a positive integer greater than 0. The server then adjusts the border information and the classification labels of the A detection boxes according to the structured information features of the first certificate, generating the second border information and the second classification labels of the A detection boxes. The solution proposed in the embodiments of this application does not depend on contour extraction or image rectification of the certificate and therefore has a wider range of application; by adopting a YOLO-based classification and positioning model and exploiting the slope consistency of text lines, it effectively improves the detection speed of certificate information classification and positioning.
An embodiment of this application further provides an apparatus for classifying and positioning certificate information, which can achieve the beneficial effects of the above classification and positioning method. The functions of the apparatus may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes at least one module corresponding to the above functions.
Referring to FIG. 3, FIG. 3 is a structural block diagram of an apparatus 300 for classifying and positioning certificate information provided by an embodiment of this application. The apparatus includes a first extraction unit 301 and an adjustment unit 302.
The first extraction unit 301 is configured to detect A pieces of feature information in a first target image by using a classification and positioning model based on the YOLO network, extract A detection boxes, and obtain first border information and first classification labels of the A detection boxes, where the first target image contains a first certificate and A is a positive integer greater than 0.
The adjustment unit 302 is configured to adjust the border information and the classification labels of the A detection boxes according to structured information features of the first certificate, generating second border information and second classification labels of the A detection boxes.
Optionally, the A detection boxes include N text line detection boxes and M non-text line detection boxes, and the first extraction unit 301 includes: a text extraction unit configured to detect the feature information in the first target image by using the YOLO-based classification and positioning model and extract the N text line detection boxes; and a non-text extraction unit configured to detect the feature information in the first target image by using the YOLO-based classification and positioning model and extract the M non-text line detection boxes.
Optionally, the text extraction unit includes a detection box extraction unit, a matching unit, a correction unit, and a first filtering unit.
The detection box extraction unit is configured to extract n text head detection boxes and n text tail detection boxes from the first target image by using the classification and positioning model, where a first text head detection box among the n text head detection boxes includes the first B characters of a first text line in the first target image, the length of these B characters being L1, and further includes a non-text image area of length t*L1 preceding the B characters; a first text tail detection box among the n text tail detection boxes includes the last C characters of the first text line, the length of these C characters being L2, and further includes a non-text image area of length t*L2 following the C characters; B and C are positive integers, and t is greater than 0 and less than or equal to 1.
The matching unit is configured to match the n text head detection boxes with the n text tail detection boxes based on text-line slope consistency and the proximity principle, obtaining n text line detection boxes.
The correction unit is configured to correct the n text line detection boxes, removing the non-text image areas from them to obtain n prediction boxes.
The first filtering unit is configured to filter the n prediction boxes with a non-maximum suppression algorithm, obtaining the N text line detection boxes, their target detection scores, and their first classification labels.
Optionally, the non-text extraction unit includes: a first obtaining unit configured to perform feature extraction on the first target image by using the classification and positioning model, obtaining m feature maps of size a*a, where a feature map is an image containing non-text line information; a second obtaining unit configured to predict center coordinates for the non-text line information in the m feature maps and, based on the center coordinates, use the K-means clustering algorithm to obtain the length and width of m prediction boxes, the confidence that each of the m prediction boxes contains non-text line feature information, and the confidence of the category to which the non-text line feature information in each of the m prediction boxes belongs; and a second filtering unit configured to filter the m prediction boxes with a non-maximum suppression algorithm, obtaining the M non-text line detection boxes, their target detection scores, and their first classification labels.
Before the first extraction unit detects the feature information in the first target image by using the YOLO-based classification and positioning model and extracts the A detection boxes, the apparatus further includes a pre-training unit configured to pre-train the YOLO network.
The pre-training unit includes: an establishing unit configured to establish a sample database containing image samples for training the YOLO network; an initialization unit configured to initialize the training parameters of the YOLO network; a selection unit configured to randomly select image samples from the sample database as training samples; an input unit configured to feed a training sample into the YOLO network as an input vector; a third obtaining unit configured to obtain the YOLO network's output vector, i.e., the feature map of the training sample; and a processing unit configured to optimize the training parameters according to the output vector, establishing a residual network between the image samples and their feature maps.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC, which in turn may reside in a network device; of course, the processor and the storage medium may also exist in a network device as discrete components.
Those skilled in the art should be aware that, in one or more of the above examples, the functions described in the embodiments of this application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in, or transmitted as one or more instructions or code on, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium accessible by a general-purpose or special-purpose computer.
The computer-readable medium described in this application may be a non-volatile computer-readable medium.
The specific implementations described above further explain in detail the purpose, technical solutions, and beneficial effects of the embodiments of this application. It should be understood that the above descriptions are merely specific implementations of the embodiments of this application and are not intended to limit their scope of protection; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solutions of the embodiments of this application shall fall within that scope of protection.

Claims (20)

  1. A training method for classifying and positioning certificate information, characterized in that the method comprises:
    a server detecting A pieces of feature information in a first target image by using a classification and positioning model based on the YOLO network, extracting A detection boxes, and obtaining first border information of the A detection boxes and first classification labels of the A detection boxes, wherein the first target image contains a first certificate and A is a positive integer greater than 0;
    the server adjusting the border information of the A detection boxes and the classification labels of the A detection boxes according to structured information features of the first certificate, and generating second border information of the A detection boxes and second classification labels of the A detection boxes.
  2. The method according to claim 1, characterized in that the A detection boxes comprise N text line detection boxes and M non-text line detection boxes, and that the server detecting the feature information in the first target image by using the classification and positioning model based on the YOLO network and extracting the A detection boxes comprises:
    the server detecting the feature information in the first target image by using the classification and positioning model based on the YOLO network, and extracting the N text line detection boxes;
    the server detecting the feature information in the first target image by using the classification and positioning model based on the YOLO network, and extracting the M non-text line detection boxes.
  3. The method according to claim 2, characterized in that the server detecting the feature information in the first target image by using the classification and positioning model based on the YOLO network and extracting the N text line detection boxes comprises:
    the server extracting n text head detection boxes and n text tail detection boxes from the first target image by using the classification and positioning model, wherein a first text head detection box among the n text head detection boxes comprises the first B characters of a first text line in the first target image, the length of the first B characters of the first text line is L1, and the text head detection box further comprises a non-text image area of length t*L1 preceding the B characters; a first text tail detection box among the n text tail detection boxes comprises the last C characters of the first text line, the length of the last C characters of the first text line is L2, and the text tail detection box further comprises a non-text image area of length t*L2 following the C characters; B and C are positive integers, and t is greater than 0 and less than or equal to 1;
    the server matching the n text head detection boxes with the n text tail detection boxes based on text-line slope consistency and the proximity principle, and obtaining n text line detection boxes;
    the server correcting the n text line detection boxes, removing non-text image areas from the text line detection boxes, and obtaining n prediction boxes;
    the server filtering the n prediction boxes by using a non-maximum suppression algorithm, and obtaining the N text line detection boxes, target detection scores of the N text line detection boxes, and first classification labels of the N text line detection boxes.
  4. The method according to claim 2, characterized in that the server detecting the feature information in the first target image by using the classification and positioning model and extracting the M non-text line detection boxes comprises:
    the server performing feature extraction on the first target image by using the classification and positioning model, and obtaining m feature maps of size a*a, wherein a feature map is an image containing non-text line information;
    the server predicting center coordinates for the non-text line information in the m feature maps and, based on the center coordinates, using the K-means clustering algorithm to obtain the length and width of m prediction boxes, the confidence that the m prediction boxes contain non-text line feature information, and the confidence of the category to which the non-text line feature information in the m prediction boxes belongs;
    the server filtering the m prediction boxes by using a non-maximum suppression algorithm, and obtaining the M non-text line detection boxes, target detection scores of the M non-text line detection boxes, and first classification labels of the M non-text line detection boxes.
  5. The method according to claim 3, characterized in that the connection slope between the i-th text head detection box among the n text head detection boxes and the g-th text tail detection box among the n text tail detection boxes is a second slope, and the condition for the i-th text head detection box and the g-th text tail detection box to satisfy slope consistency is that: the difference between the slope of the g-th text tail detection box and a first slope is less than a first preset threshold, and the difference between the second slope and the first slope is less than a second preset threshold; the first slope is the slope of the i-th text head detection box, or the first slope is the average of the slopes of the n text head detection boxes and the n text tail detection boxes.
  6. The method according to any one of claims 1 to 5, characterized in that, before the server detects the A pieces of feature information in the first target image by using the classification and positioning model based on the YOLO network and extracts the A detection boxes, the method further comprises:
    binarizing a second target image to obtain a binarized image of the second target image, wherein the binarized image of the second target image is the first target image, and the second target image contains the first certificate.
  7. The method according to any one of claims 1 to 6, characterized in that, before the server detects the feature information in the first target image by using the classification and positioning model based on the YOLO network and extracts the A detection boxes, the method further comprises: pre-training the YOLO network;
    the pre-training of the YOLO network comprising:
    establishing a sample database, the sample database containing image samples for training the YOLO network;
    initializing training parameters of the YOLO network;
    randomly selecting image samples from the sample database as training samples;
    inputting the training samples into the YOLO network as input vectors;
    obtaining an output vector of the YOLO network, namely a feature map of the training sample;
    optimizing the training parameters according to the output vector, and establishing a residual network between the image samples and the feature maps of the image samples.
  8. An apparatus for training classification and positioning of certificate information, characterized in that the apparatus comprises:
    a first extraction unit, configured to detect A pieces of feature information in a first target image by using a classification and positioning model based on the YOLO network, extract A detection boxes, and obtain first border information of the A detection boxes and first classification labels of the A detection boxes, wherein the first target image contains a first certificate and A is a positive integer greater than 0;
    an adjustment unit, configured to adjust the border information of the A detection boxes and the classification labels of the A detection boxes according to structured information features of the first certificate, and generate second border information of the A detection boxes and second classification labels of the A detection boxes.
  9. The apparatus according to claim 8, characterized in that the A detection boxes comprise N text line detection boxes and M non-text line detection boxes, and the first extraction unit comprises:
    a text extraction unit, configured to detect the feature information in the first target image by using the classification and positioning model based on the YOLO network and extract the N text line detection boxes;
    a non-text extraction unit, configured to detect the feature information in the first target image by using the classification and positioning model based on the YOLO network and extract the M non-text line detection boxes.
  10. The apparatus according to claim 9, characterized in that the text extraction unit comprises:
    a detection box extraction unit, configured to extract n text head detection boxes and n text tail detection boxes from the first target image by using the classification and positioning model, wherein a first text head detection box among the n text head detection boxes comprises the first B characters of a first text line in the first target image, the length of the first B characters of the first text line is L1, and the text head detection box further comprises a non-text image area of length t*L1 preceding the B characters; a first text tail detection box among the n text tail detection boxes comprises the last C characters of the first text line, the length of the last C characters of the first text line is L2, and the text tail detection box further comprises a non-text image area of length t*L2 following the C characters; B and C are positive integers, and t is greater than 0 and less than or equal to 1;
    a matching unit, configured to match the n text head detection boxes with the n text tail detection boxes based on text-line slope consistency and the proximity principle, and obtain n text line detection boxes;
    a correction unit, configured to correct the n text line detection boxes, remove non-text image areas from the text line detection boxes, and obtain n prediction boxes;
    a first filtering unit, configured to filter the n prediction boxes by using a non-maximum suppression algorithm, and obtain the N text line detection boxes, target detection scores of the N text line detection boxes, and first classification labels of the N text line detection boxes.
  11. The apparatus according to claim 9, characterized in that the non-text extraction unit comprises:
    a first obtaining unit, configured to perform feature extraction on the first target image by using the classification and positioning model, and obtain m feature maps of size a*a, wherein a feature map is an image containing non-text line information;
    a second obtaining unit, configured to predict center coordinates for the non-text line information in the m feature maps and, based on the center coordinates, use the K-means clustering algorithm to obtain the length and width of m prediction boxes, the confidence that the m prediction boxes contain non-text line feature information, and the confidence of the category to which the non-text line feature information in the m prediction boxes belongs;
    a second filtering unit, configured to filter the m prediction boxes by using a non-maximum suppression algorithm, and obtain the M non-text line detection boxes, target detection scores of the M non-text line detection boxes, and first classification labels of the M non-text line detection boxes.
  12. The apparatus according to claim 10, characterized in that the connection slope between the i-th text head detection box among the n text head detection boxes and the g-th text tail detection box among the n text tail detection boxes is a second slope, and the condition for the i-th text head detection box and the g-th text tail detection box to satisfy slope consistency is that: the difference between the slope of the g-th text tail detection box and a first slope is less than a first preset threshold, and the difference between the second slope and the first slope is less than a second preset threshold; the first slope is the slope of the i-th text head detection box, or the first slope is the average of the slopes of the n text head detection boxes and the n text tail detection boxes.
  13. The apparatus according to any one of claims 8 to 12, characterized in that, before the A pieces of feature information in the first target image are detected by using the classification and positioning model based on the YOLO network and the A detection boxes are extracted, the apparatus further comprises:
    a binarization unit, configured to binarize a second target image and obtain a binarized image of the second target image, wherein the binarized image of the second target image is the first target image, and the second target image contains the first certificate.
  14. The apparatus according to any one of claims 8 to 13, characterized in that, before the first extraction unit detects the feature information in the first target image by using the classification and positioning model based on the YOLO network and extracts the A detection boxes, the apparatus further comprises: a pre-training unit, configured to pre-train the YOLO network;
    the pre-training unit comprising:
    an establishing unit, configured to establish a sample database, the sample database containing image samples for training the YOLO network;
    an initialization unit, configured to initialize training parameters of the YOLO network;
    a selection unit, configured to randomly select image samples from the sample database as training samples;
    an input unit, configured to input the training samples into the YOLO network as input vectors;
    a third obtaining unit, configured to obtain an output vector of the YOLO network, namely a feature map of the training sample;
    a processing unit, configured to optimize the training parameters according to the output vector and establish a residual network between the image samples and the feature maps of the image samples.
  15. A server, characterized by comprising:
    one or more processors;
    a memory;
    one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more application programs being configured to perform the following steps:
    detecting A pieces of feature information in a first target image by using a classification and positioning model based on the YOLO network, extracting A detection boxes, and obtaining first border information of the A detection boxes and first classification labels of the A detection boxes, wherein the first target image contains a first certificate and A is a positive integer greater than 0;
    adjusting the border information of the A detection boxes and the classification labels of the A detection boxes according to structured information features of the first certificate, and generating second border information of the A detection boxes and second classification labels of the A detection boxes.
  16. The server according to claim 15, characterized in that the A detection boxes comprise N text line detection boxes and M non-text line detection boxes, and that, when the feature information in the first target image is detected by using the YOLO-based classification and positioning model and the A detection boxes are extracted, the one or more application programs are configured to perform the following steps:
    detecting the feature information in the first target image by using the classification and positioning model based on the YOLO network, and extracting the N text line detection boxes;
    detecting the feature information in the first target image by using the classification and positioning model based on the YOLO network, and extracting the M non-text line detection boxes.
  17. The server according to claim 16, characterized in that, when the feature information in the first target image is detected by using the classification and positioning model based on the YOLO network and the N text line detection boxes are extracted, the one or more application programs are configured to perform the following steps:
    extracting n text head detection boxes and n text tail detection boxes from the first target image by using the classification and positioning model, wherein a first text head detection box among the n text head detection boxes comprises the first B characters of a first text line in the first target image, the length of the first B characters of the first text line is L1, and the text head detection box further comprises a non-text image area of length t*L1 preceding the B characters; a first text tail detection box among the n text tail detection boxes comprises the last C characters of the first text line, the length of the last C characters of the first text line is L2, and the text tail detection box further comprises a non-text image area of length t*L2 following the C characters; B and C are positive integers, and t is greater than 0 and less than or equal to 1;
    matching the n text head detection boxes with the n text tail detection boxes based on text-line slope consistency and the proximity principle, and obtaining n text line detection boxes;
    correcting the n text line detection boxes, removing non-text image areas from the text line detection boxes, and obtaining n prediction boxes;
    filtering the n prediction boxes by using a non-maximum suppression algorithm, and obtaining the N text line detection boxes, target detection scores of the N text line detection boxes, and first classification labels of the N text line detection boxes.
  18. The server according to claim 15, characterized in that, when the feature information in the first target image is detected by using the classification and positioning model and the M non-text line detection boxes are extracted, the one or more application programs are further configured to perform the following steps:
    performing feature extraction on the first target image by using the classification and positioning model, and obtaining m feature maps of size a*a, wherein a feature map is an image containing non-text line information;
    对所述m张特征图中的非文本行信息进行中心坐标预测,基于所述中心坐标采用K-means聚类算法获取m个预测框的长和宽、所述m个预测框包含非文本行特征信息的置信度和所述m个预测框内非文本行特征信息所属类别的置信度;Perform center coordinate prediction on the non-text line information in the m feature maps, and use the K-means clustering algorithm to obtain the length and width of m prediction boxes based on the center coordinates, and the m prediction boxes contain non-text lines The confidence of the feature information and the confidence of the category to which the feature information of the non-text lines in the m prediction boxes belongs;
    利用非极大值抑制算法对所述m个预测框进行过滤,获得所述M个非文本行检测框、所述M个非文本行检测框的目标检测分数和所述M个非文本行检测框的第一次分类标签。Use a non-maximum suppression algorithm to filter the m prediction boxes to obtain the M non-text line detection boxes, the target detection scores of the M non-text line detection boxes, and the M non-text line detections The first classification label of the box.
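As a sketch of the K-means element of claim 18, the following shows the widely used YOLO-style anchor clustering, which groups candidate (width, height) pairs under a 1 - IoU distance to obtain representative prediction-box dimensions; the choice of k, the distance metric, and the input layout are assumptions rather than values taken from the patent.

    import numpy as np

    def iou_wh(wh, centroids):
        """IoU between one (w, h) shape and each centroid, all anchored at the origin."""
        inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
        union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
        return inter / union

    def kmeans_box_dims(whs, k=5, iters=100, seed=0):
        """K-means over (width, height) pairs with distance = 1 - IoU."""
        whs = np.asarray(whs, dtype=float)
        rng = np.random.default_rng(seed)
        centroids = whs[rng.choice(len(whs), size=k, replace=False)]
        for _ in range(iters):
            # assign each shape to the centroid with the highest IoU (lowest 1 - IoU)
            assign = np.array([np.argmax(iou_wh(wh, centroids)) for wh in whs])
            new = np.array([whs[assign == j].mean(axis=0) if np.any(assign == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return centroids  # k representative (length, width) pairs

For example, kmeans_box_dims([[120, 32], [64, 30], [200, 36], [40, 40]], k=2) returns two representative shapes that could seed the non-text-line prediction boxes. The NMS filtering in the final step of claim 18 can reuse the same score-sorted suppression sketched after claim 17.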
  19. The server according to any one of claims 15 to 18, wherein the slope of the line connecting the i-th text head detection box among the n text head detection boxes and the g-th text tail detection box among the n text tail detection boxes is a second slope, and the i-th text head detection box and the g-th text tail detection box satisfy the slope consistency condition when the difference between the slope of the g-th text tail detection box and a first slope is less than a first preset threshold and the difference between the second slope and the first slope is less than a second preset threshold, where the first slope is the slope of the i-th text head detection box, or the first slope is the average of the slopes of the n text head detection boxes and the n text tail detection boxes.
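Claim 19's test transcribes directly into code. In this sketch the threshold values are placeholders, and first_slope is supplied by the caller as either the i-th head box's own slope or the average slope over all the head and tail boxes, as the claim allows:

    def slopes_consistent(tail_slope, connect_slope, first_slope,
                          thresh1=0.05, thresh2=0.05):
        """True when the g-th tail box is slope-consistent with the i-th head
        box: |tail_slope - first_slope| < thresh1 (first preset threshold) and
        |connect_slope - first_slope| < thresh2 (second preset threshold)."""
        return (abs(tail_slope - first_slope) < thresh1
                and abs(connect_slope - first_slope) < thresh2)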
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2019/117550 2019-09-18 2019-11-12 Certificate information classification and positioning method and apparatus WO2021051553A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910880737.X 2019-09-18
CN201910880737.XA CN110738238B (en) 2019-09-18 2019-09-18 Classification positioning method and device for certificate information

Publications (1)

Publication Number Publication Date
WO2021051553A1 true WO2021051553A1 (en) 2021-03-25

Family

ID=69268040

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117550 WO2021051553A1 (en) 2019-09-18 2019-11-12 Certificate information classification and positioning method and apparatus

Country Status (2)

Country Link
CN (1) CN110738238B (en)
WO (1) WO2021051553A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476113A (en) * 2020-03-20 2020-07-31 中保车服科技服务股份有限公司 Card identification method, device and equipment based on transfer learning and readable medium
CN111898520A (en) * 2020-07-28 2020-11-06 腾讯科技(深圳)有限公司 Certificate authenticity identification method and device, computer readable medium and electronic equipment


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8965127B2 (en) * 2013-03-14 2015-02-24 Konica Minolta Laboratory U.S.A., Inc. Method for segmenting text words in document images
CN107742093B (en) * 2017-09-01 2020-05-05 国网山东省电力公司电力科学研究院 Real-time detection method, server and system for infrared image power equipment components
CN109271970A * 2018-10-30 2019-01-25 北京旷视科技有限公司 Face detection model training method and device
CN109977949B (en) * 2019-03-20 2024-01-26 深圳华付技术股份有限公司 Frame fine adjustment text positioning method and device, computer equipment and storage medium
CN109961040B (en) * 2019-03-20 2023-03-21 深圳市华付信息技术有限公司 Identity card area positioning method and device, computer equipment and storage medium
CN110008882B (en) * 2019-03-28 2021-06-08 华南理工大学 Vehicle detection method based on similarity loss of mask and frame
CN110084173B (en) * 2019-04-23 2021-06-15 精伦电子股份有限公司 Human head detection method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015089115A1 (en) * 2013-12-09 2015-06-18 Nant Holdings Ip, Llc Feature density object classification, systems and methods
CN106295629A (en) * 2016-07-15 2017-01-04 北京市商汤科技开发有限公司 Structured text detection method and system
CN108537146A * 2018-03-22 2018-09-14 五邑大学 A text line extraction system for mixed printed and handwritten text
CN109697440A * 2018-12-10 2019-04-30 浙江工业大学 An ID card information extraction method
CN109670495A * 2018-12-13 2019-04-23 深源恒际科技有限公司 A method and system for long text detection based on a deep neural network
CN110046616A (en) * 2019-03-04 2019-07-23 北京奇艺世纪科技有限公司 Image processing model generation, image processing method, device, terminal device and storage medium
CN110188755A * 2019-05-30 2019-08-30 北京百度网讯科技有限公司 An image recognition method, apparatus, and computer-readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627439A (en) * 2021-08-11 2021-11-09 北京百度网讯科技有限公司 Text structuring method, processing device, electronic device and storage medium
CN113486881A (en) * 2021-09-03 2021-10-08 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN113486881B (en) * 2021-09-03 2021-12-07 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN114037985A (en) * 2021-11-04 2022-02-11 北京有竹居网络技术有限公司 Information extraction method, device, equipment, medium and product

Also Published As

Publication number Publication date
CN110738238A (en) 2020-01-31
CN110738238B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
WO2021051553A1 (en) Certificate information classification and positioning method and apparatus
CN110569832B Real-time text positioning and recognition method based on a deep learning attention mechanism
US10817741B2 (en) Word segmentation system, method and device
Hamad et al. A detailed analysis of optical character recognition technology
AU2017302248B2 (en) Label and field identification without optical character recognition (OCR)
US8494273B2 (en) Adaptive optical character recognition on a document with distorted characters
Antonacopoulos et al. ICDAR2015 competition on recognition of documents with complex layouts-RDCL2015
JP2022532177A Forged face recognition methods, devices, and non-transitory computer-readable storage media
CN107729865A An offline recognition method and system for handwritten mathematical formulas
WO2023284502A1 (en) Image processing method and apparatus, device, and storage medium
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
Makhmudov et al. Improvement of the end-to-end scene text recognition method for “text-to-speech” conversion
CN114724133B (en) Text detection and model training method, device, equipment and storage medium
Igorevna et al. Document image analysis and recognition: a survey
CN114581928A (en) Form identification method and system
Karthik et al. Segmentation and recognition of handwritten kannada text using relevance feedback and histogram of oriented gradients–a novel approach
Gharde et al. Identification of handwritten simple mathematical equation based on SVM and projection histogram
Chaturvedi et al. Automatic license plate recognition system using surf features and rbf neural network
WO2023011606A1 Training method of live body detection network, method and apparatus of live body detection
Chaki et al. Fragmented handwritten digit recognition using grading scheme and fuzzy rules
Cai et al. Bank card and ID card number recognition in Android financial APP
CN111488870A (en) Character recognition method and character recognition device
Rajithkumar et al. Template matching method for recognition of stone inscripted Kannada characters of different time frames based on correlation analysis
Jameel et al. A REVIEW ON RECOGNITION OF HANDWRITTEN URDU CHARACTERS USING NEURAL NETWORKS.
Singh et al. Line parameter based word-level Indic script identification system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19946043

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19946043

Country of ref document: EP

Kind code of ref document: A1