Disclosure of Invention
In order to solve the technical problems, the invention provides a method and a device for collecting marking data, and provides an on-line automatic marking data collecting mode, which can save labor cost and time cost and promote the optimization of a character recognition system.
In addition, the invention also provides a certificate identification system, which can realize identification of the certificate and automatic collection of marking data.
The invention provides a marking data collecting method in a first aspect, which comprises the following steps:
recognizing text information on the picture by using an image character recognition system;
verifying whether the identified text information is consistent with information of a trusted data source;
if consistent, cutting the picture to obtain at least one unit picture containing a single character;
for the unit picture, marking the unit picture by using text information corresponding to characters contained in the unit picture to obtain marking data of the unit picture, wherein the marking data comprises: the unit picture and the markup text information of the unit picture.
Optionally, the method further includes:
if not, calculating the similarity between the text information and the information of the credible data source, and judging whether the similarity falls into a preset confidence interval;
if yes, proceeding to the step of cutting the picture to obtain at least one unit picture containing a single character.
Optionally, the method further includes:
and saving the marking data, wherein the marking data is used as a training sample of a machine recognition model of the image character recognition system.
Optionally, before saving the marking data, the method further includes:
and carrying out desensitization treatment on the marking data.
Optionally, the cutting the picture to obtain at least one unit picture containing a single character includes:
adopting a text line positioning algorithm to position a text region in the picture;
and cutting the text region into at least one unit picture containing single characters by adopting a word cutting algorithm.
A second aspect of the present invention provides a marking data collecting apparatus, comprising:
the identification unit is used for identifying the text information on the picture;
the verification unit is used for verifying whether the identified text information is consistent with the information of the credible data source; if the two are consistent, triggering a segmentation unit;
the segmentation unit is used for segmenting the picture to obtain at least one unit picture containing a single character;
the marking unit is used for marking the unit picture by using text information corresponding to characters contained in the unit picture to obtain marking data of the unit picture, and the marking data comprises: the unit picture and the markup text information of the unit picture.
Optionally, the apparatus further comprises:
a calculation unit; the verification unit triggers the calculation unit when the verification results are inconsistent;
the calculation unit is used for calculating the similarity between the text information and the information of the credible data source and judging whether the similarity falls into a preset confidence interval or not; if so, triggering the segmentation unit.
Optionally, the apparatus further comprises:
and the storage unit is used for storing the marking data, and the marking data is used as a training sample of a machine identification model of the identification unit.
Optionally, the apparatus further comprises:
the desensitizing unit is used for desensitizing the marking data;
the storage unit is specifically configured to store the marking data after the desensitization processing of the desensitization unit.
Optionally, the dividing unit includes:
the positioning subunit is used for positioning the text area in the picture by adopting a text line positioning algorithm;
and the cutting subunit is used for cutting the text area into at least one unit picture containing a single character by adopting a word cutting algorithm.
A third aspect of the invention provides a certificate identification system, the system comprising:
the image character recognition unit is used for recognizing text information in the picture to be recognized;
the information verification unit is used for verifying whether the text information identified by the image character recognition unit is consistent with the information in the certificate database, and if so, verifying that the picture to be identified is a real picture;
the segmentation unit is used for cutting the real picture verified by the information verification unit to obtain at least one unit picture containing a single character;
and the marking unit is used for marking the unit picture by using the text information corresponding to the characters contained in the unit picture to obtain marking data, wherein the marking data comprises the unit picture and the marking text information of the unit picture, and the marking data is used as a training sample of a machine recognition model of the image character recognition unit.
Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:
according to the technical scheme provided by the invention, the text information on the picture is recognized by using an image character recognition system, and whether the recognized text information is consistent with information of a trusted data source is verified; if the two are consistent, the text information on the picture is true, and the picture is cut to obtain at least one unit picture containing a single character. In this way, the image character recognition system can provide a reliable data basis for the subsequent collection of marking data in real time and without interruption. Then, each unit picture is marked with the text information corresponding to the character it contains to obtain the marking data of the unit picture, wherein the marking data comprises: the unit picture and the markup text information of the unit picture. Therefore, the technical scheme provided by the invention verifies the authenticity of pictures based on the image character recognition system and the trusted data source, and obtains marking data by cutting and marking the real pictures; the whole process requires no manual participation, which can save labor cost and time cost and promote the improvement and optimization of system performance.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the present application are described in detail below.
Method embodiment
Referring to fig. 1, fig. 1 is a flowchart of an embodiment 1 of a marking data collection method according to the present invention, where the method may be executed by a user machine, such as a personal computer, or by various types of servers, such as a Web server or an APP server. The method as shown in fig. 1 comprises:
Step 101: recognizing the text information on the picture by using an image character recognition system.
Step 102: verifying whether the identified text information is consistent with information of a trusted data source; if so, steps 103 and 104 are performed.
In the embodiment of the present invention, the image character recognition system refers to a system for recognizing text information on an image, such as an identification card character recognition system, a license character recognition system, a passport character recognition system, and the like.
In the embodiment of the present invention, the trusted data source refers to a database storing real data information or verified data information, such as official data of a public security network; for example, a database storing user identification card information within a public security system network, a database storing user passport information, or a database storing enterprise license information.
For convenience of explanation, the following description only exemplifies the embodiment of the present invention by using the identification card character recognition system as an example.
For example: the identification card character recognition system can recognize different text information in the user identification card picture according to different business requirements, such as the name, identification card number, date of birth, ethnicity, gender, address, issuing authority, validity period and the like on the identification card picture.
Generally, an identity card character recognition system receives a picture uploaded by a user and verifies the picture type, for example whether the picture is of the designated identity card type and whether it shows the front of the identity card; after the verification is passed, the identity card character recognition system extracts the text information on the picture based on the character recognition model adopted by its character recognition algorithm. After the text information is obtained, whether the recognized text information is consistent with the information of the trusted data source is verified.
For example: verifying whether the name and the identity card number on the identity card picture are consistent with the information of the public security network specifically comprises: verifying whether the name and the identity card number recognized from the identity card picture have the same number of characters as the corresponding information of the public security network, and whether the individual characters are the same. If so, the text information on the identity card picture is determined to be consistent with the information of the public security network and the verification result is consistent, which indicates that the identity card picture is a real picture and that the text information on the picture is real and reliable; if not, the text information on the identity card picture is determined to be inconsistent with the information of the public security network and the verification result is inconsistent, which indicates that the identity card picture is a false picture and that the text information on the picture is neither true nor credible.
Whether the text information on the picture is real and reliable can thus be verified through the image character recognition system and the trusted data source, which provides a reliable data basis for the subsequent collection of marking data; the real picture is then processed.
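The consistency check of step 102 can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `is_consistent`, the field names `name` and `id_number`, and the representation of both the recognized text and the trusted record as dictionaries are hypothetical and are not specified by the invention.

```python
from typing import Dict


def is_consistent(recognized: Dict[str, str], trusted: Dict[str, str]) -> bool:
    """Return True only if every recognized field matches the trusted record.

    Hypothetical sketch: a field matches when it has the same number of
    characters as the trusted value and every character is identical.
    """
    for field, text in recognized.items():
        reference = trusted.get(field)
        if reference is None:
            return False  # the trusted data source has no such field
        if len(text) != len(reference):
            return False  # the number of characters differs
        if any(a != b for a, b in zip(text, reference)):
            return False  # at least one character differs
    return True


# Example values; "张三" is the placeholder name Zhang San.
trusted_record = {"name": "张三", "id_number": "110123201510100334"}
print(is_consistent({"name": "张三", "id_number": "110123201510100334"}, trusted_record))
print(is_consistent({"name": "张三", "id_number": "110123201510100384"}, trusted_record))
```

In this sketch a single differing digit in the identification number is enough to make the verification result inconsistent.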
Step 103: the picture is cut to obtain at least one unit picture containing a single character.
In a specific implementation, step 103 may include:
adopting a text line positioning algorithm to position a text region in the picture;
and cutting the text region into at least one unit picture containing single characters by adopting a word cutting algorithm.
The following describes a specific implementation process of step 103, taking the identification of the name and the identification number in the identification card image by the identification card character recognition system as an example.
Firstly, adopting a text line positioning algorithm to position text areas of names and ID card numbers in the ID card picture; and then cutting the text regions into at least one unit picture containing single characters by adopting a word cutting algorithm.
For example, if the name in the identification card picture uploaded by the user is "zhang san", the name text region is located first, and then the "zhang san" text region is cut into two unit pictures each containing a single character, i.e., "zhang" and "san".
For another example, if the identification number in the identification card picture uploaded by the user is "110123201510100334", the identification number text region is first located, and then the "110123201510100334" text region is cut into 18 unit pictures each containing a single digit.
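The invention does not name a particular word cutting algorithm. As one possible illustration only, the sketch below splits an already located and binarized text region at blank columns (a simple vertical projection approach). The `Image` type, the function name `cut_into_unit_pictures`, and the convention that 1 means ink and 0 means background are assumptions made for the example.

```python
from typing import List, Optional

Image = List[List[int]]  # rows of pixels; 1 = ink, 0 = background (assumed)


def cut_into_unit_pictures(region: Image) -> List[Image]:
    """Split a binarized text-line region into unit pictures at blank columns."""
    if not region:
        return []
    width = len(region[0])
    # A column counts as inked if any row has ink in that column.
    inked = [any(row[x] for row in region) for x in range(width)]
    units: List[Image] = []
    start: Optional[int] = None
    # The trailing False acts as a sentinel that closes the last run of ink.
    for x, has_ink in enumerate(inked + [False]):
        if has_ink and start is None:
            start = x  # a new character begins here
        elif not has_ink and start is not None:
            units.append([row[start:x] for row in region])
            start = None
    return units


region = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
units = cut_into_unit_pictures(region)
print(len(units))  # 2
```

Splitting at blank columns tends to work for well separated digits, but a character that itself contains a vertical gap may be cut into several pieces, so a practical word cutting algorithm would likely need additional rules.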
Step 104: for the unit picture, marking the unit picture by using text information corresponding to characters contained in the unit picture to obtain marking data of the unit picture, wherein the marking data comprises: the unit picture and the markup text information of the unit picture.
Step 104 is exemplified by the unit pictures "zhang" and "san" in the above example.
Referring to fig. 2, fig. 2 shows a unit picture 1 containing "zhang" on a gray background and a unit picture 2 containing "san" on a gray background; the characters to the right of unit picture 1 and unit picture 2 are the information in the trusted data source; unit picture 1 and unit picture 2 are marked with the characters "zhang" and "san" respectively to obtain the marking data.
As can be seen from method embodiment 1, the authenticity of the text information on the picture is verified through the image character recognition system and the trusted data source; if it is true, the picture is cut to obtain at least one unit picture containing a single character; each unit picture is then marked with the text information corresponding to the character it contains to obtain the marking data of the unit picture, wherein the marking data comprises: the unit picture and the markup text information of the unit picture. Therefore, the technical scheme provided by the invention verifies the authenticity of pictures based on the image character recognition system and the trusted data source, and obtains marking data by cutting and marking the real pictures; the whole process requires no manual participation, which can save labor cost and time cost and promote the improvement and optimization of system performance.
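The marking of step 104 can be sketched as pairing each unit picture with the corresponding character of the verified text. The `MarkingData` class, the function name `mark_unit_pictures`, and the check that the number of unit pictures equals the number of characters are assumptions added for illustration; the invention only requires that the marking data comprise the unit picture and its marking text information.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass(frozen=True)
class MarkingData:
    unit_picture: Any  # the cropped single-character picture
    label: str         # the marking text information for that picture


def mark_unit_pictures(unit_pictures: Sequence[Any], verified_text: str) -> List[MarkingData]:
    """Pair the i-th unit picture with the i-th character of the verified text."""
    if len(unit_pictures) != len(verified_text):
        # Assumed safeguard: a count mismatch suggests the cutting step failed.
        raise ValueError("number of unit pictures does not match number of characters")
    return [MarkingData(pic, ch) for pic, ch in zip(unit_pictures, verified_text)]


# Placeholder strings stand in for real image objects.
marks = mark_unit_pictures(["unit_picture_1", "unit_picture_2"], "张三")
print([m.label for m in marks])  # ['张', '三']
```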
In order to further improve the collection efficiency of the marking data, the invention also provides another collection method. The method is explained below with reference to fig. 3.
Referring to fig. 3, fig. 3 is a flowchart of an embodiment 2 of a method for collecting marking data according to the present invention, the method includes:
Step 301: recognizing text information on the picture by using an image character recognition system;
step 302: verifying whether the identified text information is consistent with information of a trusted data source; if so, performing steps 303 and 304; if not, step 305 is performed.
Step 303: cutting the picture to obtain at least one unit picture containing a single character;
Step 304: for the unit picture, marking the unit picture by using text information corresponding to characters contained in the unit picture to obtain marking data of the unit picture, wherein the marking data comprises: the unit picture and the markup text information of the unit picture.
Step 305: calculating the similarity between the text information and the information of the credible data source, and judging whether the similarity falls into a preset confidence interval; if so, steps 303 and 304 are performed.
Wherein, the steps 301-304 are the same as the steps 101-104 in the above embodiment, and reference may be made to the above description, which is not repeated herein.
In step 301, the text information recognized by the image character recognition system may be inconsistent with the information of the trusted data source for reasons such as the format of the uploaded picture being unrecognizable or the picture being unclear, even though the picture itself is likely to be true and reliable. Based on this, the present invention uses step 305 to further measure the authenticity and credibility of the picture.
In step 305, the credibility of the picture is measured by the similarity between the text information and the information of the trusted data source; if the similarity falls within a preset confidence interval, the picture is considered credible. The picture can then serve as a data basis for subsequent marking data, and the marking data is collected through steps 303 and 304.
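The invention does not specify how the similarity is calculated or what the preset confidence interval is. As a hedged illustration, the sketch below uses the ratio of `difflib.SequenceMatcher` from the Python standard library as the similarity measure and a hypothetical interval of [0.9, 1.0]; both choices, as well as the function names, are assumptions.

```python
from difflib import SequenceMatcher


def similarity(recognized: str, trusted: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed, swapped-in measure."""
    return SequenceMatcher(None, recognized, trusted).ratio()


def falls_in_confidence_interval(recognized: str, trusted: str,
                                 low: float = 0.9, high: float = 1.0) -> bool:
    """Step 305 sketch: the bounds low and high are hypothetical preset values."""
    return low <= similarity(recognized, trusted) <= high


trusted_id = "110123201510100334"
# One misrecognized digit: still inside the assumed confidence interval.
print(falls_in_confidence_interval("110123201510100384", trusted_id))
# Entirely different text: outside the interval.
print(falls_in_confidence_interval("999999999999999999", trusted_id))
```

With such a check, a picture in which only one character was misrecognized can still be used as a data basis, while a picture whose text bears no resemblance to the trusted record is rejected.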
Compared with method embodiment 1, method embodiment 2 additionally uses the similarity to assess the credibility of pictures whose verification result is inconsistent, and pictures whose credibility meets the requirement are also used as a data basis for the marking data, thereby expanding the data source of the marking data and further improving the collection efficiency and quality of the marking data.
In addition, on the basis of method embodiment 1 or method embodiment 2, the following step can be added:
saving the marking data, wherein the marking data is used as a training sample of a machine recognition model of the image character recognition system.
In addition, the pictures recognized by the image character recognition system may involve users' private information; for example, a user identification card picture carries private information such as the user name and the identity card number. To ensure the safety of the private information during the marking data collection process and prevent it from being leaked, the marking data can be desensitized before it is saved, and the desensitized marking data is then saved. Desensitization may be achieved by randomly naming the marking data.
Two realizable approaches are given below for the "desensitization process on the marked data" step.
One implementation is to randomly sort a plurality of the unit pictures and the markup text information of the unit pictures in the marking data.
Generally, the marking data collected for a picture is arranged in order. For example, the marking data collected for an identification card picture consists of the unit pictures of each character in the name and the identification card number, together with the marking text information of those unit pictures, and the marking data for the name characters "zhang" and "san" is arranged in order. Therefore, after an unauthorized party steals the marking data, the private information of a specific user, such as the user name "zhang san", can be directly recovered. To prevent private information from being leaked during the marking data collection process, the unit pictures and their marking text information in the marking data are randomly ordered; especially when the amount of marking data is large, it is then difficult to recover the private information of a specific user from the marking data, which ensures the safety of the user's private information.
Another implementation is to encrypt the marking data.
In this way, the marking data is desensitized and the ciphertext of the marking data is finally stored, which prevents unauthorized parties from directly stealing users' private information from the database and increases the difficulty of cracking it. Of course, the desensitization processing in the present invention is not limited to the above two methods, and other desensitization methods may be used.
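The random ordering described above can be sketched as shuffling the marking data while keeping each unit picture paired with its own marking text, so that the samples remain usable for training but no longer appear in reading order. The function name `desensitize_by_shuffling` and the tuple representation of marking data are assumptions made for illustration.

```python
import random
from typing import Any, List, Optional, Sequence, Tuple

Pair = Tuple[Any, str]  # (unit picture, marking text information)


def desensitize_by_shuffling(marking_data: Sequence[Pair],
                             rng: Optional[random.Random] = None) -> List[Pair]:
    """Return a randomly ordered copy; each picture stays with its own label."""
    shuffled = list(marking_data)
    (rng or random).shuffle(shuffled)
    return shuffled


# Placeholder strings stand in for real unit pictures.
collected = [("pic_0", "张"), ("pic_1", "三"), ("pic_2", "1"), ("pic_3", "1"), ("pic_4", "0")]
stored = desensitize_by_shuffling(collected, random.Random(0))
print(len(stored) == len(collected))  # True: nothing is lost, only the order changes
```

In practice the shuffle would likely be applied over the marking data pooled from many pictures, since the source notes that recovery becomes difficult especially when the amount of marking data is large.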
Device embodiment
Corresponding to the method for collecting the marking data, the embodiment of the application also provides a device for collecting the marking data.
Referring to fig. 4, fig. 4 is a structural diagram of a marking data collecting device according to embodiment 1 of the present invention. The internal structure and connection relationship of the device will be further described below in conjunction with the working principle of the device. The device includes:
an identifying unit 401 for identifying text information on a picture;
wherein the identifying unit may identify the text information on the picture by using an image character recognition system.
A verification unit 402, configured to verify whether the identified text information is consistent with information of a trusted data source; if the two are consistent, the segmentation unit 403 is triggered;
the segmentation unit 403 is configured to segment the picture to obtain at least one unit picture including a single character;
a marking unit 404, configured to mark, for the unit picture, text information corresponding to characters included in the unit picture, to obtain marking data of the unit picture, where the marking data includes: the unit picture and the markup text information of the unit picture.
In addition, the present invention also provides another marking data collecting device, specifically please refer to the structure diagram of embodiment 2 of the marking data collecting device shown in fig. 5. The internal structure and connection relationship of the device will be further described below in conjunction with the working principle of the device. The device includes:
an identifying unit 501, configured to identify text information on a picture;
a verification unit 502, configured to verify whether the identified text information is consistent with information of a trusted data source; if the two are consistent, the segmentation unit 503 is triggered; if not, triggering the calculation unit 505;
the segmentation unit 503 is configured to segment the picture to obtain at least one unit picture including a single character;
a marking unit 504, configured to mark, for the unit picture, text information corresponding to characters included in the unit picture, to obtain marking data of the unit picture, where the marking data includes: the unit picture and the markup text information of the unit picture.
A calculating unit 505, configured to calculate a similarity between the text information and information of the trusted data source, and determine whether the similarity falls into a preset confidence interval; if so, the segmentation unit 503 and the marking unit 504 are triggered.
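The triggering relationship between units 501 to 505 can be sketched as a small class whose units are supplied as functions. This is a structural illustration only: the class name `MarkingDataCollector`, the method `collect`, the function signatures, and the choice to mark with the trusted text are assumptions and not part of the claimed device.

```python
from typing import Any, Callable, List, Optional, Sequence


class MarkingDataCollector:
    """Hypothetical wiring of the units in fig. 5; each unit is an injected callable."""

    def __init__(self,
                 recognize: Callable[[Any], str],                  # identifying unit 501
                 verify: Callable[[str, str], bool],               # verification unit 502
                 segment: Callable[[Any], Sequence[Any]],          # segmentation unit 503
                 mark: Callable[[Sequence[Any], str], List[Any]],  # marking unit 504
                 is_credible: Optional[Callable[[str, str], bool]] = None):  # unit 505
        self.recognize = recognize
        self.verify = verify
        self.segment = segment
        self.mark = mark
        self.is_credible = is_credible

    def collect(self, picture: Any, trusted_text: str) -> List[Any]:
        recognized = self.recognize(picture)
        consistent = self.verify(recognized, trusted_text)
        if not consistent:
            # Inconsistent result: fall back to the calculation unit if present.
            if self.is_credible is None or not self.is_credible(recognized, trusted_text):
                return []  # the picture is not used as a data basis
        unit_pictures = self.segment(picture)
        return self.mark(unit_pictures, trusted_text)


# Toy stand-ins: the "picture" is a string and each character is a "unit picture".
collector = MarkingDataCollector(
    recognize=lambda p: p,
    verify=lambda recognized, trusted: recognized == trusted,
    segment=list,
    mark=lambda units, text: list(zip(units, text)),
)
print(collector.collect("ab", "ab"))  # [('a', 'a'), ('b', 'b')]
print(collector.collect("ab", "xy"))  # []
```

Omitting `is_credible` gives the behavior of the device in fig. 4, while supplying it corresponds to the device in fig. 5.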
On the basis of the device shown in fig. 4 or fig. 5, the device may further include:
and the storage unit is used for storing the marking data, and the marking data is used as a training sample of the machine identification model of the identification unit.
Further, on the basis of the apparatus shown in fig. 4 or fig. 5, the apparatus may further include:
the desensitizing unit is used for desensitizing the marking data;
the storage unit is specifically configured to store the marking data after the desensitization processing.
Optionally, the desensitization unit is specifically configured to: and randomly sequencing a plurality of unit pictures in the marking data and the marking text information of the unit pictures.
In the apparatus shown in fig. 4 or fig. 5, the dividing unit may include:
the positioning subunit is used for positioning the text area in the picture by adopting a text line positioning algorithm;
and the cutting subunit is used for cutting the text area into at least one unit picture containing a single character by adopting a word cutting algorithm.
The device provided by the invention verifies the authenticity of the pictures based on the image character recognition system and the trusted data source, and the marking data is obtained by cutting and marking the real pictures, so that the whole process does not need manual participation, the labor cost and the time cost can be saved, and the improvement and optimization of the system performance can be promoted.
The invention also provides a certificate identification system, which is explained below with reference to fig. 6.
Referring to fig. 6, fig. 6 is a structural diagram of a certificate recognition system according to the present invention, and as shown in fig. 6, the system may include:
an image character recognition unit 601, configured to recognize text information in a picture to be recognized;
an information verification unit 602, configured to verify whether the text information identified by the image character recognition unit is consistent with information in a certificate database, and if so, verify that the picture to be identified is a real picture;
a dividing unit 603, configured to divide the real picture verified by the information verification unit to obtain at least one unit picture including a single character;
a marking unit 604, configured to mark the unit picture by using text information corresponding to characters included in the unit picture to obtain marking data, where the marking data includes the unit picture and the marking text information of the unit picture, and the marking data is used as a training sample of a machine recognition model of the image character recognition unit.
On the basis of the system shown in fig. 6, the system may further include:
the calculation unit is used for calculating the similarity between the text information and the information of the credible data source and judging whether the similarity falls into a preset confidence interval or not; if yes, triggering the segmentation unit and the marking unit. Therefore, some pictures with high reliability can be further used as the basis of the marking data, and the collection efficiency of the marking data is improved.
According to the certificate identification system provided by the invention, on one hand, the image character recognition unit and the information verification unit are used to verify a picture; on the other hand, the verified real picture is cut and marked by the segmentation unit and the marking unit to obtain marking data, and the marking data can be used as a training sample of the machine recognition model of the image character recognition unit to further optimize the image character recognition unit. Therefore, the certificate identification system can verify the authenticity of a picture and automatically collect marking data, which lays a good foundation for self-optimization of the system.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative: the division of the units is only one logical division, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional unit.
It should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The marking data collection method, the marking data collection device and the certificate identification system provided by the application are described in detail above, and specific embodiments are applied in the description to explain the principles and implementations of the application, and the description of the embodiments is only used to help understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.