WO2020186785A1 - Sample set construction method, apparatus, computer device and storage medium - Google Patents

Sample set construction method, apparatus, computer device and storage medium

Info

Publication number
WO2020186785A1
Authority
WO
WIPO (PCT)
Prior art keywords
certificate
virtual
map
image
credential
Prior art date
Application number
PCT/CN2019/117857
Other languages
English (en)
French (fr)
Inventor
高梁梁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020186785A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Definitions

  • This application relates to a sample set construction method, device, computer equipment and storage medium.
  • a sample set construction method, device, computer equipment, and storage medium are provided.
  • a sample set construction method includes:
  • Image acquisition is performed on the physical certificate corresponding to the electronic certificate map to obtain a certificate collection map
  • a picture sample set is constructed according to the electronic certificate map and the certificate collection map, and the picture sample set is used to train a character recognition model.
  • A sample set construction device includes:
  • a certificate template map acquisition module, used to obtain a certificate template map that is generated based on the certificate map and does not include certificate information;
  • a virtual certificate information generation module, configured to generate multiple sets of virtual certificate information according to the patterns of the various types of certificate information in the certificate map;
  • an electronic certificate map generation module, configured to write the virtual certificate information into the certificate template map according to the positions of the various types of certificate information in the certificate map, to generate an electronic certificate map;
  • a collection module, used to perform image collection on the physical certificate corresponding to the electronic certificate map to obtain a certificate collection map; and
  • a construction module, used to construct a picture sample set according to the electronic certificate map and the certificate collection map, where the picture sample set is used to train a character recognition model.
  • A computer device including a memory and one or more processors, where the memory stores computer-readable instructions and, when the computer-readable instructions are executed by the one or more processors, the one or more processors execute the following steps:
  • a picture sample set is constructed according to the electronic certificate map and the certificate collection map, and the picture sample set is used to train a character recognition model.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • a picture sample set is constructed according to the electronic certificate map and the certificate collection map, and the picture sample set is used to train a character recognition model.
  • Fig. 1 is an application scenario diagram of a method for constructing a sample set according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a method for constructing a sample set according to one or more embodiments.
  • Fig. 3 is a block diagram of an apparatus for constructing a sample set according to one or more embodiments.
  • Figure 4 is a block diagram of a computer device according to one or more embodiments.
  • the sample set construction method provided in this application can be applied to the application environment as shown in FIG. 1.
  • The terminal 102 communicates with the server 104 through the network.
  • The server 104 can obtain, through the network, a certificate template map that is generated based on the certificate map and does not include certificate information; generate multiple sets of virtual certificate information according to the patterns of the various types of certificate information in the certificate map; and write the virtual certificate information into the certificate template map according to the positions of the various types of certificate information in the certificate map to generate an electronic certificate map. The terminal 102 performs image collection on the physical certificate corresponding to the electronic certificate map to obtain a certificate collection map, and the server 104 can obtain the certificate collection map sent by the terminal 102 and construct a picture sample set according to the electronic certificate map and the certificate collection map. The picture sample set is used to train a character recognition model.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for constructing a sample set is provided. Taking the method applied to the server 104 in FIG. 1 as an example for description, the method includes the following steps:
  • Step S202: obtain a certificate template map that is generated based on the certificate map and does not include certificate information.
  • the credential template image is a credential image with blank credential information.
  • the certificate can be a resident ID card, passport, driving license or graduation certificate, etc.
  • the certificate information includes at least the certificate number and name of the certificate holder, and may also include the date of birth, photo, residential address, certificate validity period, etc.
  • the certificate number may be the ID number of the certificate holder, and the name may include at least one of a Chinese name and an English name.
  • the server can obtain a complete certificate image from the web page of the browser, and perform image processing on the obtained certificate image to obtain a certificate template image.
  • the certificate template image obtained after processing does not include the certificate information.
  • The certificate information in the obtained certificate map can be erased as required. For example, if the character recognition model only needs to recognize the certificate number, only the certificate number in the certificate map needs to be erased; if the name must also be recognized, the name in the certificate map must be erased as well.
  • Step S204: generate multiple sets of virtual certificate information according to the patterns of the various types of certificate information in the certificate map.
  • The pattern of a type of certificate information refers to the presentation format of that certificate information in the certificate map.
  • For example, the certificate number is represented by 9 Arabic numerals, and the name is composed of 2 to 4 simplified Chinese characters.
  • Virtual certificate information refers to certificate information that has no authenticity or validity and is fabricated in order to generate the electronic certificate maps used as samples.
  • The server can generate constraint conditions that represent the patterns of the various types of certificate information in the certificate map, generate each type of virtual certificate information according to those constraint conditions, and randomly combine the different types of virtual certificate information to obtain multiple sets of virtual certificate information.
  • the constraint conditions may be, for example, generating a 9-digit virtual certificate number, a name in 2 to 4 Chinese characters, a virtual birth date generated in the format of YYYY-MM-DD (year-month-day), etc.
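The constraint conditions described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `NUMBER_LENGTH`, `NAME_LENGTH_RANGE`, `CHAR_POOL`, the date range, and all function names are assumptions introduced here, and the tiny character pool stands in for a full Chinese character library.

```python
import random
from datetime import date, timedelta

# Hypothetical constraints mirroring the examples in the text.
NUMBER_LENGTH = 9            # certificate number: 9 Arabic digits
NAME_LENGTH_RANGE = (2, 4)   # name: 2 to 4 simplified Chinese characters
CHAR_POOL = "安平科技深圳明华国强丽伟芳静"  # stand-in for a real character library


def make_number(rng):
    """Return a random digit string of the constrained length."""
    return "".join(rng.choice("0123456789") for _ in range(NUMBER_LENGTH))


def make_name(rng):
    """Return a random name of 2 to 4 characters from the pool."""
    length = rng.randint(*NAME_LENGTH_RANGE)
    return "".join(rng.choice(CHAR_POOL) for _ in range(length))


def make_birth_date(rng):
    """Return a random date formatted as YYYY-MM-DD (range is an assumption)."""
    start, end = date(1950, 1, 1), date(2010, 12, 31)
    offset = rng.randrange((end - start).days + 1)
    return (start + timedelta(days=offset)).isoformat()


def make_record(rng):
    """Combine one value of each type into a set of virtual certificate information."""
    return {
        "number": make_number(rng),
        "name": make_name(rng),
        "birth_date": make_birth_date(rng),
    }
```

Passing a seeded `random.Random` instance keeps a generated sample set reproducible, which can be convenient when a training run needs to be repeated.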
  • Generating multiple sets of virtual certificate information according to the patterns of the various types of certificate information in the certificate map includes: obtaining multiple digits according to the length of the digit string of the certificate number to generate a virtual certificate number; performing de-duplication processing on all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers; repeatedly executing the steps of obtaining unmarked Chinese characters from the Chinese character library to generate a virtual name, counting, according to all virtual names generated so far, the number of uses of each Chinese character included in the currently generated virtual name, and marking the corresponding Chinese character when its number of uses reaches a preset upper limit, until a preset number of virtual names is obtained; and randomly combining the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
  • the virtual ID number is a fake ID number, the numbers in the virtual ID number have no specific meaning, and the virtual name is a fake name.
  • When the certificate information in the certificate map includes a certificate number and a name, a corresponding number of digits can be randomly obtained according to the digit-string length of the certificate number to generate a virtual certificate number. For example, if the certificate number includes 9 digits, then each time a virtual certificate number is generated, 9 digits are randomly obtained and arranged in order to form the number; all the obtained virtual certificate numbers are then de-duplicated to remove repeated virtual certificate numbers.
  • In this way, the generated virtual names cover as many of the Chinese characters in the Chinese character library as possible, and the degree of repetition among the virtual names is kept low, which improves the diversity of the electronic certificate maps that carry the virtual certificate information.
  • The server can also randomly combine all the generated virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information, and then de-duplicate these sets to remove repeated virtual certificate information.
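The number generation, the usage-capped name generation, and the random combination described above might look roughly like the sketch below. All function and parameter names are hypothetical. Two simplifications are assumptions of this sketch rather than statements in the text: a name never repeats a character, and a set is used for de-duplication.

```python
import random
from collections import Counter

DIGITS = "0123456789"


def generate_numbers(count, length, rng):
    """Draw random digit strings until `count` distinct ones exist."""
    if count > 10 ** length:
        raise ValueError("not enough distinct numbers of this length")
    numbers = set()  # the set performs the de-duplication
    while len(numbers) < count:
        numbers.add("".join(rng.choice(DIGITS) for _ in range(length)))
    return sorted(numbers)


def generate_names(count, char_library, max_uses, rng, min_len=2, max_len=4):
    """Build names from unmarked characters; mark a character at the usage cap."""
    library = list(dict.fromkeys(char_library))  # drop repeated characters
    usage = Counter()   # uses of each character across all generated names
    marked = set()      # characters that reached the preset upper limit
    names = []
    while len(names) < count:
        unmarked = [ch for ch in library if ch not in marked]
        if len(unmarked) < min_len:
            raise ValueError("character library exhausted before reaching count")
        length = min(rng.randint(min_len, max_len), len(unmarked))
        chars = rng.sample(unmarked, length)
        names.append("".join(chars))
        for ch in chars:
            usage[ch] += 1
            if usage[ch] >= max_uses:
                marked.add(ch)
    return names, usage


def combine(numbers, names, group_count, rng):
    """Randomly pair numbers with names, keeping only distinct pairs."""
    if group_count > len(set(numbers)) * len(set(names)):
        raise ValueError("not enough distinct combinations")
    groups = set()
    while len(groups) < group_count:
        groups.add((rng.choice(numbers), rng.choice(names)))
    return sorted(groups)
```

Capping the uses of each character spreads the generated names over the library instead of letting a few characters dominate, which is the coverage effect the text describes.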
  • Step S206: write the virtual certificate information into the certificate template map according to the positions of the various types of certificate information in the certificate map to generate an electronic certificate map.
  • After the server obtains the certificate template map and multiple sets of virtual certificate information, it can write each generated set of virtual certificate information into the corresponding positions in the certificate template map, according to the positions of the various types of certificate information in the certificate map, and obtain a large number of electronic certificate maps.
  • Writing the virtual certificate information into the certificate template map at the corresponding positions to generate the electronic certificate map includes: obtaining the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information; determining the positions of the various types of certificate information in the certificate map; and, according to those positions, writing each set of virtual certificate information into the certificate template map in the corresponding character format to obtain the electronic certificate map.
  • Character format refers to the character style of various types of certificate information in the certificate map, including the printing font used in the character, character size, character color, character spacing, simplified and traditional format, etc.
  • the server may determine the relative positions of various credential information in the credential map according to the layout of the characters in the credential information in the credential map.
  • the server may determine various types of document information included in the document map, and determine the position of the first character of the various document information in the document map, and use this position as the position of the corresponding document information in the document map.
  • The server can determine the character formats of the certificate number and the name from the certificate map, use them as the character formats of the virtual certificate number and virtual name to be written, and write each obtained set of virtual certificate information into the corresponding positions in the certificate template map in the respective character format.
  • The corresponding positions can be the positions determined above from the first character of the certificate number and of the name in the certificate map.
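One way to picture the positioning logic is as a layout table keyed by field, where each entry holds the first-character position and the character format. The sketch below only plans the draw operations and does not render pixels. The class names, the coordinates, the font names, and the fixed per-character advance are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldStyle:
    font: str      # printing font of the field
    size: int      # character size in pixels
    color: str     # character color
    spacing: int   # gap between consecutive characters in pixels


@dataclass(frozen=True)
class FieldSlot:
    first_char_xy: tuple  # position of the field's first character
    style: FieldStyle


# Hypothetical layout, as if measured from one certificate map.
LAYOUT = {
    "name": FieldSlot((120, 60), FieldStyle("SimHei", 24, "#000000", 2)),
    "number": FieldSlot((120, 200), FieldStyle("OCR-B", 28, "#000000", 4)),
}


def plan_draw_ops(record, layout):
    """Return one draw operation per character of each field in `record`."""
    ops = []
    for field, text in record.items():
        slot = layout[field]
        x, y = slot.first_char_xy
        for ch in text:
            ops.append({
                "char": ch, "x": x, "y": y,
                "font": slot.style.font,
                "size": slot.style.size,
                "color": slot.style.color,
            })
            # Fixed advance per character: a simplifying assumption.
            x += slot.style.size + slot.style.spacing
    return ops
```

A rendering library could then consume these operations to draw each set of virtual certificate information onto a copy of the template.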
  • Step S208: perform image collection on the physical certificate corresponding to the electronic certificate map to obtain a certificate collection map.
  • the physical certificate is a physical certificate made according to the generated electronic certificate map.
  • a part of the generated electronic certificate map can be used to make a physical certificate.
  • The certificate collection map refers to the picture obtained by performing image collection on the produced physical certificate. The collection can be performed by the terminal and the result sent to the server, so that the server obtains the certificate collection maps of these physical certificates. Different image acquisition parameters and different devices can be used when collecting images of the physical certificate, which yields certificate pictures closer to real scenes, so that the pictures in the picture sample set to be constructed better match certificate pictures obtained in real scenes.
  • Performing image collection on the physical certificate corresponding to the electronic certificate map to obtain the certificate collection map includes: determining image acquisition parameters, where the image acquisition parameters include at least one of light intensity, focal length, acquisition angle, and acquisition background; and, with each image acquisition parameter set to different parameter values, performing image collection on the physical certificate corresponding to the electronic certificate map to obtain a preset number of certificate collection maps.
  • Under different parameter values, image collection can be performed on the physical certificates corresponding to the electronic certificate maps to obtain a large number of certificate collection maps. The resulting certificate collection maps are more varied and better meet the requirement that the samples be balanced.
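Varying each image acquisition parameter over several values amounts to enumerating combinations of settings and choosing a preset number of them. The sketch below shows one way to plan such captures; the parameter names, units, and values are hypothetical, and the actual capture is performed by a device and is outside the scope of the code.

```python
import itertools
import random

# Hypothetical candidate values for each image acquisition parameter.
PARAMETER_VALUES = {
    "light_intensity_lux": [200, 500, 1000],
    "focal_length_mm": [24, 35],
    "angle_deg": [0, 15, 30],
    "background": ["white desk", "dark cloth"],
}


def capture_plan(parameter_values, preset_count, rng):
    """Pick `preset_count` distinct combinations of parameter values."""
    names = list(parameter_values)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*parameter_values.values())
    ]
    if preset_count > len(combos):
        raise ValueError("preset count exceeds the number of combinations")
    return rng.sample(combos, preset_count)
```

Sampling from the full grid, rather than always taking its first entries, keeps the planned captures spread across lighting, focus, angle, and background.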
  • The certificate collection maps can include standard images and interference images.
  • Performing image collection on the physical certificate corresponding to the electronic certificate map to obtain a preset number of certificate collection maps includes: determining the standard parameter value corresponding to each image acquisition parameter; performing image collection on the physical certificate corresponding to the electronic certificate map according to the standard parameter values to obtain a standard image; adding the standard image and the electronic certificate map to the image path used to store positive samples; when an interference instruction is received, adjusting the parameter value corresponding to each image acquisition parameter from the standard parameter value to different interference values; performing image collection on the physical certificate according to the adjusted interference values to obtain interference images; and adding the interference images to the image path used to store negative samples.
  • the image acquisition parameters include at least one of light intensity, focal length, acquisition angle, and acquisition background.
  • The standard parameter value corresponding to each image acquisition parameter can be set in advance. An image obtained by collecting the physical certificate under the standard parameter values is a standard image, and the standard images and the generated electronic certificate maps can be used as positive samples of the picture sample set to be constructed.
  • The interference instruction is an instruction used to add random interference factors to the image collection process, so that the obtained pictures better match the real process of photographing certificates. The interference instruction can be triggered by the user. After the interference instruction is received, the parameter value corresponding to at least one image acquisition parameter is adjusted from the standard parameter value to different interference values, so that multiple interference images of the same physical certificate are collected under different combinations of parameter values; image collection can also be performed on different physical certificates under different parameter values. The obtained interference images are used as negative samples of the picture sample set to be constructed. This not only greatly increases the number of sample pictures, but also enriches the collection conditions and improves the diversity of the sample pictures.
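The split between standard captures (positive samples) and interference captures (negative samples) can be sketched as follows. The standard values, interference values, directory names, and file naming scheme are all hypothetical. The code only produces capture descriptions with their destination paths and does not drive a camera.

```python
import random
from pathlib import PurePosixPath

# Hypothetical standard value for each image acquisition parameter.
STANDARD = {
    "light_intensity_lux": 500,
    "focal_length_mm": 35,
    "angle_deg": 0,
    "background": "white desk",
}
# Hypothetical interference values; none of them equals the standard value.
INTERFERENCE_VALUES = {
    "light_intensity_lux": [50, 2000],
    "focal_length_mm": [18, 70],
    "angle_deg": [20, 45],
    "background": ["patterned cloth", "cluttered desk"],
}
POSITIVE_DIR = PurePosixPath("samples/positive")  # assumed image path
NEGATIVE_DIR = PurePosixPath("samples/negative")  # assumed image path


def standard_capture(cert_id):
    """Describe a capture under the standard values, stored as a positive sample."""
    return {
        "cert_id": cert_id,
        "settings": dict(STANDARD),
        "path": POSITIVE_DIR / f"{cert_id}_standard.jpg",
    }


def interference_captures(cert_id, count, rng):
    """Describe captures with at least one parameter moved off its standard value."""
    captures = []
    for i in range(count):
        settings = dict(STANDARD)
        how_many = rng.randint(1, len(INTERFERENCE_VALUES))
        for name in rng.sample(list(INTERFERENCE_VALUES), how_many):
            settings[name] = rng.choice(INTERFERENCE_VALUES[name])
        captures.append({
            "cert_id": cert_id,
            "settings": settings,
            "path": NEGATIVE_DIR / f"{cert_id}_interference_{i}.jpg",
        })
    return captures
```

Calling `interference_captures` would correspond to handling an interference instruction; several calls for the same `cert_id` give multiple interference images of one physical certificate.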
  • Step S210: construct a picture sample set based on the electronic certificate map and the certificate collection map, where the picture sample set is used to train a character recognition model.
  • The server obtains not only electronic certificate maps whose background is very similar to that of the certificate map, but also the various certificate collection maps obtained by image collection of the physical certificates corresponding to the electronic certificate maps.
  • A picture sample set constructed from these maps can improve the accuracy of character recognition when it is used to train the character recognition model.
  • Constructing a picture sample set based on the electronic certificate map and the certificate collection map includes: determining the number of positive samples and the number of negative samples respectively; when the difference between the number of positive samples and the number of negative samples is greater than a preset threshold, determining, in the class with more samples, the samples ranked highest by similarity of virtual certificate information; and removing those top-ranked samples from the class with more samples, so that the difference between the number of samples remaining after removal and the number of samples in the class with fewer samples is less than the preset threshold, yielding a picture sample set with balanced numbers of positive and negative samples.
  • After obtaining all the sample pictures, the server determines the number of positive samples and the number of negative samples respectively.
  • When the difference between the number of positive samples and the number of negative samples is greater than a preset threshold, the numbers of samples need to be adjusted.
  • After adjustment, the difference between the numbers of positive and negative samples is smaller than the preset threshold, so the two numbers are balanced, which helps avoid overfitting or underfitting of the character recognition model obtained by training. For example, if the ratio between the number of positive samples and the number of negative samples is greater than a preset threshold, the virtual certificate information corresponding to each positive sample can be determined, and samples whose virtual certificate information is highly repetitive can be removed from the positive samples to reduce the number of positive samples.
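The balancing step can be sketched as repeatedly dropping the most redundant sample from the larger class. The text does not define the similarity measure, so this sketch assumes one: a sample's score is its highest `difflib.SequenceMatcher` ratio against any other sample in the same class. The function names and the `(sample_id, info_text)` representation are likewise hypothetical.

```python
from difflib import SequenceMatcher


def redundancy(index, infos):
    """Highest text similarity between infos[index] and any other entry."""
    return max(
        (SequenceMatcher(None, infos[index], other).ratio()
         for j, other in enumerate(infos) if j != index),
        default=0.0,
    )


def balance(positives, negatives, threshold):
    """Trim the larger class until the count difference is below `threshold`.

    Each sample is a (sample_id, virtual_info_text) pair.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    positives, negatives = list(positives), list(negatives)
    if len(positives) >= len(negatives):
        larger, smaller = positives, negatives
    else:
        larger, smaller = negatives, positives
    if len(larger) - len(smaller) > threshold:
        while len(larger) - len(smaller) >= threshold:
            infos = [info for _, info in larger]
            worst = max(range(len(larger)), key=lambda i: redundancy(i, infos))
            del larger[worst]  # remove the top-ranked (most similar) sample
    return positives, negatives
```

Removing the most similar samples first means the class shrinks while losing as little variety as possible, which matches the stated goal of reducing highly repetitive samples.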
  • the constructed picture sample set includes not only a large number of electronic document maps corresponding to different virtual document information, but also document collection maps obtained by simulating the image acquisition process of real documents.
  • The pictures in the picture sample set are not only rich in information but also come from more realistic sources, which improves the diversity of the samples and makes the samples more balanced.
  • When the character recognition model is trained using the picture sample set, a character recognition model with higher recognition accuracy can be obtained.
  • The above sample set construction method further includes: acquiring a picture processing operation, where the picture processing operation includes at least one of a flip operation, a stretching operation, a rotation operation, a noise adding operation, and a blur operation; processing the electronic certificate map and the certificate collection map according to the picture processing operation to obtain a variety of different derivative images; using the derivative images as negative samples in the picture sample set to be constructed; and using the electronic certificate map and the certificate collection map as positive samples in the picture sample set.
  • image processing can be further performed on the electronic document image and the document acquisition image, including flip processing, stretching processing, rotation processing, noise adding processing, and blur processing, and a variety of different derivative images can be obtained after processing.
  • At least one of the above-mentioned image processing operations can be sampled to process the electronic certificate image and the certificate collection image.
  • The certificate collection maps here can include standard images and interference images; that is, standard images can be processed to obtain derivative images used as negative samples, and interference images can also be processed to obtain derivative images used as negative samples.
  • The positive samples of the constructed picture sample set include electronic certificate maps and standard images; the negative samples include interference images, derivative images obtained by processing electronic certificate maps, derivative images obtained by processing standard images, and derivative images obtained by processing interference images.
  • the electronic document image and the document collection image are processed through a variety of image processing operations, which can increase the number of samples and improve the balance of samples used for training the character recognition model.
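The labeling rule above (originals and standard images as positive samples; interference images and every derivative image as negative samples) can be illustrated with a toy sketch. Images are represented as nested lists of grayscale values so that the example stays self-contained; a real pipeline would use an image library. The operation implementations, the noise rate, and all names are assumptions, and the blur operation is omitted for brevity.

```python
import random


def flip_horizontal(img, rng=None):
    """Mirror each row."""
    return [row[::-1] for row in img]


def rotate_90(img, rng=None):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def stretch_horizontal(img, rng=None):
    """Double the width by repeating each pixel."""
    return [[px for px in row for _ in range(2)] for row in img]


def add_noise(img, rng):
    """Replace roughly 10% of pixels with random values (rate is an assumption)."""
    return [
        [rng.randint(0, 255) if rng.random() < 0.1 else px for px in row]
        for row in img
    ]


OPERATIONS = {
    "flip": flip_horizontal,
    "rotate": rotate_90,
    "stretch": stretch_horizontal,
    "noise": add_noise,
}


def build_samples(electronic, standard, interference, rng, ops_per_image=2):
    """Label originals and derivatives following the rule in the text."""
    positives = list(electronic) + list(standard)
    negatives = list(interference)
    for img in positives + list(interference):
        for name in rng.sample(sorted(OPERATIONS), ops_per_image):
            negatives.append(OPERATIONS[name](img, rng))
    return positives, negatives
```

Because every source image contributes `ops_per_image` derivatives, this step grows the negative class quickly, so it would typically be followed by the count-balancing step described earlier.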
  • the method for constructing a sample set specifically includes the following steps:
  • the image acquisition parameters include at least one of light intensity, focal length, acquisition angle, and acquisition background;
  • according to the standard parameter values, perform image collection on the physical certificate corresponding to the electronic certificate map to obtain a standard image; add the standard image and the electronic certificate map to the image path used to store positive samples;
  • the parameter value corresponding to each image acquisition parameter is adjusted from the standard parameter value to a different interference value, and the interference image is obtained by image acquisition on the physical certificate according to the adjusted interference value, and the interference image is added to the user
  • the picture processing operation includes at least one of a flip operation, a stretch operation, a rotation operation, a noise adding operation, and a blur operation;
  • according to the preset threshold, determine, in the class with more samples, the samples ranked highest by similarity of virtual certificate information; remove those top-ranked samples from the class with more samples, so that the difference between the number of samples remaining after removal and the number of samples in the class with fewer samples is less than the preset threshold, and a picture sample set with balanced numbers of positive and negative samples is obtained.
  • the picture sample set is used to train a character recognition model.
  • Although the steps in the flowchart of FIG. 2 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time and can be executed at different times; nor are they necessarily executed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
  • A sample set construction device 300 is provided, which includes: a certificate template map acquisition module 302, a virtual certificate information generation module 304, an electronic certificate map generation module 306, a collection module 308, and a construction module 310, where:
  • the credential template image acquisition module 302 is configured to obtain a credential template image that does not include credential information generated according to the credential image.
  • the virtual credential information generating module 304 is used to generate multiple sets of virtual credential information according to various types of credential information in the credential map.
  • the electronic certificate map generation module 306 is configured to write the virtual certificate information into the certificate template map according to the positions of various kinds of certificate information in the certificate map to generate an electronic certificate map.
  • the collection module 308 is used for image collection of the physical certificate corresponding to the electronic certificate map to obtain the certificate collection map.
  • the constructing module 310 is used to construct a picture sample set based on the electronic certificate map and the certificate collection map, and the picture sample set is used to train a character recognition model.
  • The virtual certificate information generation module 304 is further configured to: obtain multiple digits according to the length of the digit string of the certificate number to generate a virtual certificate number; perform de-duplication processing on all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers; repeatedly execute the steps of obtaining unmarked Chinese characters from the Chinese character library to generate a virtual name, counting, according to all virtual names generated so far, the number of uses of each Chinese character included in the currently generated virtual name, and marking the corresponding Chinese character when its number of uses reaches a preset upper limit, until a preset number of virtual names is obtained; and randomly combine the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
  • The electronic certificate map generation module 306 is further configured to: obtain the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information; determine the positions of the various types of certificate information in the certificate map; and, according to those positions, write each set of virtual certificate information into the certificate template map in the corresponding character format to obtain the electronic certificate map.
  • The collection module 308 is further configured to: determine image acquisition parameters, where the image acquisition parameters include at least one of light intensity, focal length, acquisition angle, and acquisition background; and, with each image acquisition parameter set to different parameter values, perform image collection on the physical certificate corresponding to the electronic certificate map to obtain a preset number of certificate collection maps.
  • The collection module 308 is further configured to: determine the standard parameter value corresponding to each image acquisition parameter; perform image collection on the physical certificate corresponding to the electronic certificate map according to the standard parameter values to obtain a standard image; add the standard image and the electronic certificate map to the image path used to store positive samples; when an interference instruction is received, adjust the parameter value corresponding to each image acquisition parameter from the standard parameter value to different interference values; perform image collection on the physical certificate according to the adjusted interference values to obtain interference images; and add the interference images to the image path used to store negative samples.
  • The sample set construction device 300 further includes a picture processing module, which is used to: acquire a picture processing operation, where the picture processing operation includes at least one of a flip operation, a stretching operation, a rotation operation, a noise adding operation, and a blur operation; process the electronic certificate map and the certificate collection map according to the picture processing operation to obtain a variety of different derivative images; use the derivative images as negative samples in the picture sample set to be constructed; and use the electronic certificate map and the certificate collection map as positive samples in the picture sample set.
  • The construction module 310 is further configured to: determine the number of positive samples and the number of negative samples respectively; when the difference between the number of positive samples and the number of negative samples is greater than a preset threshold, determine, in the class with more samples, the samples ranked highest by similarity of virtual certificate information; and remove those top-ranked samples from the class with more samples, so that the difference between the number of samples remaining after removal and the number of samples in the class with fewer samples is less than the preset threshold, yielding a picture sample set with balanced numbers of positive and negative samples.
  • After acquiring the certificate template map that is generated according to the certificate map and does not include certificate information, the above sample set construction device 300 can generate multiple sets of virtual certificate information according to the patterns of the various types of certificate information in the certificate map and, according to the positions of the various types of certificate information, write the generated virtual certificate information into the certificate template map to obtain a large number of electronic certificate maps carrying various virtual certificate information. Further, once physical certificates have been made from the electronic certificate maps, image collection can be performed on the physical certificates to obtain certificate collection maps, and a picture sample set can be constructed based on the generated electronic certificate maps and the certificate collection maps obtained by image collection.
  • the constructed image sample set includes not only a large number of electronic certificate images corresponding to different virtual certificate information, but also captured certificate images obtained by simulating the image capture process for real certificates.
  • the pictures in the image sample set are therefore not only rich in information but also more realistic in origin, which improves sample diversity and makes the samples more balanced.
  • when the character recognition model is trained using the image sample set, a character recognition model with higher recognition accuracy can be obtained.
  • Each module in the apparatus for constructing a sample set can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 4.
  • the computer equipment includes a processor, a memory, and a network interface connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer readable instructions.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a method for constructing a sample set.
  • FIG. 4 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • the specific computer device may Including more or fewer parts than shown in the figure, or combining some parts, or having a different arrangement of parts.
  • the sample set construction apparatus 300 provided in the present application may be implemented in a form of computer-readable instructions, and the computer-readable instructions may run on the computer device as shown in FIG. 4.
  • the memory of the computer device can store the program modules that make up the sample set construction apparatus 300, for example, the certificate template image acquisition module 302, the virtual certificate information generation module 304, the electronic certificate image generation module 306, the capture module 308, and the construction module 310 shown in FIG. 3.
  • the computer-readable instructions formed by each program module cause the processor to execute the steps in the sample set construction method of each embodiment of the application described in this specification.
  • the computer device shown in FIG. 4 may execute step S202 through the certificate template image acquisition module 302 in the sample set construction apparatus 300 shown in FIG. 3.
  • the computer device may execute step S204 through the virtual certificate information generation module 304.
  • the computer device can execute step S206 through the electronic certificate image generation module 306.
  • the computer device can execute step S208 through the capture module 308.
  • the computer device may execute step S210 through the construction module 310.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the processor, the one or more processors execute the steps of the sample set construction method of the various embodiments of the present application.
  • One or more non-volatile computer-readable storage media store computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the sample set construction method of the various embodiments of the present application.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

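A sample set construction method, including: obtaining a certificate template image that is generated from a certificate image and does not include certificate information; generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image; writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image; performing image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and constructing an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.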

Description

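Sample set construction method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 201910208401.9, filed with the Chinese Patent Office on March 19, 2019 and entitled "Sample set construction method, apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to a sample set construction method, apparatus, computer device, and storage medium.
Background
In the field of automatic certificate information recognition, a large number of certificate pictures are needed to train a character recognition model in order to improve the accuracy with which the model recognizes certificate information. However, training a character recognition model requires a very large number of pictures, and it is usually not possible to obtain a large number of real certificate pictures.
However, the inventor has realized that, at present, most certificate pictures used to train character recognition models are electronic certificate images generated in batches from templates. Although such electronic certificate images are relatively clear, they lack randomness, so the certificate pictures are unevenly distributed when used as sample pictures. When a character recognition model is trained directly on a large number of unbalanced certificate pictures, the resulting model parameters are not accurate enough, and the results obtained when the trained model is used to recognize certificate information are correspondingly inaccurate.
Summary
According to various embodiments disclosed in this application, a sample set construction method, apparatus, computer device, and storage medium are provided.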
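A sample set construction method includes:
obtaining a certificate template image that is generated from a certificate image and does not include certificate information;
generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
performing image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
constructing an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.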
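A sample set construction apparatus includes:
a certificate template image acquisition module, configured to obtain a certificate template image that is generated from a certificate image and does not include certificate information;
a virtual certificate information generation module, configured to generate multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
an electronic certificate image generation module, configured to write the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
a capture module, configured to perform image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
a construction module, configured to construct an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.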
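A computer device includes a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the processors, cause the one or more processors to perform the following steps:
obtaining a certificate template image that is generated from a certificate image and does not include certificate information;
generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
performing image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
constructing an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.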
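One or more non-volatile computer-readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a certificate template image that is generated from a certificate image and does not include certificate information;
generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
performing image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
constructing an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.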
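Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings needed for the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of an application scenario of the sample set construction method according to one or more embodiments.
FIG. 2 is a schematic flowchart of the sample set construction method according to one or more embodiments.
FIG. 3 is a block diagram of the sample set construction apparatus according to one or more embodiments.
FIG. 4 is a block diagram of the computer device according to one or more embodiments.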
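Detailed Description
To make the technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.
The sample set construction method provided in this application can be applied in the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. The server 104 may obtain, over the network, a certificate template image that is generated from a certificate image and does not include certificate information; generate multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image; and write the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate electronic certificate images. The terminal 102 performs image capture on the physical certificates corresponding to the electronic certificate images to obtain captured certificate images, and the server 104 may obtain the captured certificate images sent by the terminal 102 and construct an image sample set from the electronic certificate images and the captured certificate images, the image sample set being used to train a character recognition model. The terminal 102 may be, but is not limited to, any of various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 104 may be implemented as a standalone server or as a server cluster made up of multiple servers.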
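In one embodiment, as shown in FIG. 2, a sample set construction method is provided. The method is described using its application to the server 104 in FIG. 1 as an example, and includes the following steps:
Step S202: obtain a certificate template image that is generated from a certificate image and does not include certificate information.
A certificate template image is a certificate image in which the certificate information is blank. The certificate may be a resident identity card, a passport, a driver's license, a graduation certificate, or the like. The certificate information includes at least the holder's certificate number and name, and may further include the date of birth, photo, residential address, validity period of the certificate, and so on. The certificate number may be the holder's identity card number, and the name may include at least one of a Chinese name and an English name. The server may obtain a complete certificate image from a web page in a browser and perform image processing on it to obtain the certificate template image; the resulting certificate template image does not include certificate information.
It should be noted that the certificate information in the obtained certificate image may be erased as required. For example, if the character recognition model only needs to recognize the certificate number, only the certificate number in the certificate image needs to be erased; if the name also needs to be recognized, the name in the certificate image must be erased as well.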
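Step S204: generate multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image.
The style of certificate information refers to the format in which the certificate information is presented in the certificate image; for example, the certificate number is expressed as 9 Arabic digits, the name consists of 2~4 simplified Chinese characters, and so on. Virtual certificate information is fabricated certificate information, with no authenticity or validity, created in order to generate electronic certificate images for use as samples.
Specifically, the server may generate constraints that express the styles of the various types of certificate information in the certificate image, generate each type of virtual certificate information according to those constraints, and randomly combine the different types of virtual certificate information to obtain multiple sets of virtual certificate information. The constraints may be, for example, generating a 9-digit virtual certificate number, a name of 2~4 Chinese characters, a virtual date of birth in YYYY-MM-DD (year-month-day) format, and so on.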
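The style constraints described above can be illustrated with a minimal Python sketch. The names `CONSTRAINTS` and `satisfies` and the three regular expressions are assumptions made for illustration; the application only gives a 9-digit number, a 2~4 character name, and a YYYY-MM-DD date as examples and does not specify how the constraints are represented.

```python
import re

# Hypothetical constraints mirroring the examples given in the text.
CONSTRAINTS = {
    "id_number": re.compile(r"[0-9]{9}"),                  # 9 Arabic digits
    "name": re.compile(r"[\u4e00-\u9fff]{2,4}"),           # 2 to 4 Chinese characters
    "birth_date": re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),  # YYYY-MM-DD
}


def satisfies(field, value):
    """Return True if `value` matches the style constraint for `field`."""
    return CONSTRAINTS[field].fullmatch(value) is not None
```

A check of this kind could be run on every generated field before it is combined into a set of virtual certificate information. Note that the date pattern only checks the format, not whether the date exists.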
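In one of the embodiments, generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image includes: obtaining multiple digits according to the digit-string length of the certificate number to generate virtual certificate numbers; deduplicating all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers; repeatedly performing the steps of obtaining unmarked Chinese characters from a Chinese character library to generate a virtual name, counting, based on all virtual names generated so far, the number of times each Chinese character in the currently generated virtual name has been used, and marking the corresponding Chinese character when its usage count reaches a preset upper limit, until a preset number of virtual names is obtained; and randomly combining the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
A virtual certificate number is a fabricated certificate number whose digits have no particular meaning, and a virtual name is a fabricated name. Specifically, if the certificate information in the certificate image includes a certificate number and a name, then when virtual certificate information is generated, a corresponding number of digits can be obtained at random according to the digit-string length of the certificate number to generate a virtual certificate number. For example, if the certificate number has 9 digits, 9 digits are obtained at random each time a virtual certificate number is generated and are arranged in order to form the number; all resulting virtual certificate numbers are then deduplicated to remove duplicates.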
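As a rough sketch of the number generation just described, the Python below draws random digits and deduplicates them. The function name `generate_id_numbers` and its parameters are hypothetical, and the sketch deduplicates incrementally with a set rather than in a separate pass, which is an assumption about one possible implementation.

```python
import random


def generate_id_numbers(count, length=9, seed=None):
    """Generate `count` distinct digit strings, each `length` digits long."""
    if count > 10 ** length:
        raise ValueError("not enough distinct numbers of this length")
    rng = random.Random(seed)
    unique = set()
    while len(unique) < count:
        # Draw `length` random digits and join them in the order drawn.
        number = "".join(rng.choice("0123456789") for _ in range(length))
        unique.add(number)  # adding to a set discards duplicates
    return sorted(unique)
```

Calling `generate_id_numbers(1000, length=9, seed=1)` would return 1000 distinct 9-digit strings; the `seed` argument is only there to make runs repeatable.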
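When a virtual name is generated, 2~4 unmarked Chinese characters are obtained at random from the Chinese character library to form the name. The usage count of each Chinese character in the currently generated virtual name is then tallied, that is, how often those characters have been used across the virtual names generated so far. If a character's usage count reaches the upper limit, for example 5, the character is marked, and subsequent virtual names are generated only from unmarked characters in the library. This ensures that as many characters in the library as possible are used, so that the generated virtual names cover most of the characters in the library; it also keeps repetition among the generated virtual names low, which increases the diversity of the resulting electronic certificate images that carry virtual certificate information.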
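The marking scheme for name characters might look like the following Python sketch. `generate_names`, `charset`, and `max_uses` are hypothetical names, and drawing distinct characters within a single name (via `random.sample`) is a simplifying assumption that the application does not state.

```python
import random
from collections import Counter


def generate_names(charset, count, max_uses=5, seed=None):
    """Generate `count` names of 2 to 4 characters, retiring overused characters."""
    rng = random.Random(seed)
    usage = Counter()   # how often each character has been used so far
    marked = set()      # characters whose usage reached the cap
    names = []
    while len(names) < count:
        available = [ch for ch in charset if ch not in marked]
        length = rng.randint(2, 4)
        if len(available) < length:
            raise ValueError("character library exhausted before reaching count")
        name = "".join(rng.sample(available, length))
        names.append(name)
        for ch in name:
            usage[ch] += 1
            if usage[ch] >= max_uses:
                marked.add(ch)  # marked characters are never drawn again
    return names
```

Because marked characters drop out of the pool, later names are forced onto less-used characters, which is the coverage effect the paragraph above describes.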
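Further, the server may also randomly combine all generated virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information, and deduplicate the resulting sets to remove duplicate virtual certificate information.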
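Random combination followed by deduplication can be sketched as below. The function `combine_fields` is a hypothetical helper, and representing one set of virtual certificate information as a `(number, name)` tuple is an assumption for illustration.

```python
import random


def combine_fields(id_numbers, names, count, seed=None):
    """Randomly pair numbers with names until `count` distinct pairs exist."""
    if count > len(set(id_numbers)) * len(set(names)):
        raise ValueError("not enough distinct combinations")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < count:
        # A repeated pair is silently dropped by the set.
        seen.add((rng.choice(id_numbers), rng.choice(names)))
    return sorted(seen)
```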
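Step S206: write the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate electronic certificate images.
Specifically, after obtaining the certificate template image and the multiple sets of virtual certificate information, the server may write each generated set of virtual certificate information into the corresponding positions in the certificate template image, according to the positions of the various types of certificate information in the certificate image, to obtain a large number of electronic certificate images.
In one of the embodiments, writing the virtual certificate information into the certificate template image at the corresponding positions to generate electronic certificate images includes: obtaining the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information; determining the positions of the various types of certificate information in the certificate image; and writing each set of virtual certificate information into the certificate template image in the respective character formats according to those positions, to obtain electronic certificate images.
The character format refers to the character style of each type of certificate information in the certificate image, including the printed typeface, character size, character color, character spacing, simplified or traditional form, and so on. Specifically, after obtaining the certificate image, the server may determine the relative position of each type of certificate information in the certificate image from the layout of the characters of the certificate information. In one of the embodiments, the server may determine the types of certificate information included in the certificate image, determine the position of the first character of each type in the certificate image, and take that position as the position of the corresponding certificate information in the certificate image. The server may determine from the certificate image the character formats corresponding to the certificate number and the name, use them as the character formats of the virtual certificate number and virtual name to be generated, and write each resulting set of virtual certificate information into the corresponding position of the certificate template image in the respective character formats. The corresponding position may be the position, mentioned above, determined by the first character of the certificate number or name in the certificate image.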
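To make the position and character-format bookkeeping concrete, here is a small Python sketch that turns one set of virtual certificate information into per-character draw instructions. `FieldSpec`, `plan_render`, and every value shown are hypothetical; the sketch assumes a fixed horizontal advance of font size plus spacing and stops short of actual rendering, which would need an imaging library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Position and character format of one field (all values hypothetical)."""
    x: int            # column of the field's first character, in pixels
    y: int            # row of the field's first character, in pixels
    font: str
    size: int
    color: str
    spacing: int = 0  # extra pixels between adjacent characters


def plan_render(info, specs):
    """Return one draw instruction per character of each field in `info`."""
    commands = []
    for field, text in info.items():
        spec = specs[field]
        for i, ch in enumerate(text):
            # Advance horizontally from the first-character anchor.
            x = spec.x + i * (spec.size + spec.spacing)
            commands.append((ch, x, spec.y, spec.font, spec.size, spec.color))
    return commands
```

Anchoring each field at its first character mirrors the embodiment in which the first character's position is taken as the position of the field.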
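Step S208: perform image capture on the physical certificates corresponding to the electronic certificate images to obtain captured certificate images.
A physical certificate is a physical certificate produced from a generated electronic certificate image; physical certificates may be produced for a portion of the generated electronic certificate images. A captured certificate image is a picture obtained by capturing an image of a produced physical certificate. It may be captured by the terminal and then sent to the server, so that the server obtains the captured certificate images of these physical certificates. Images of the physical certificates may be captured with different image capture parameters and different devices, so that the resulting pictures are closer to certificate pictures obtained in real scenarios and the pictures in the image sample set to be constructed better match real-world certificate pictures.
In one of the embodiments, performing image capture on the physical certificates corresponding to the electronic certificate images to obtain captured certificate images includes: determining image capture parameters, the image capture parameters including at least one of light intensity, focal length, capture angle, and capture background; and performing image capture on the physical certificates corresponding to the electronic certificate images while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images.
Specifically, image capture can be performed on the physical certificates corresponding to the electronic certificate images while the various image capture parameters take different values, yielding a large number of captured certificate images. Captured certificate images obtained in this way are more random and better match the balance that samples should have. The captured certificate images may include standard images and interference images.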
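In one of the embodiments, performing image capture on the physical certificates corresponding to the electronic certificate images while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images, includes: determining the standard parameter value corresponding to each image capture parameter; performing image capture on the physical certificate corresponding to the electronic certificate image at the standard parameter values to obtain a standard image; adding the standard image and the electronic certificate image to the picture path used to store positive samples; and, when an interference instruction is received, adjusting the parameter value of each image capture parameter from the standard parameter value to different interference values, performing image capture on the physical certificate at the adjusted interference values to obtain interference images, and adding the interference images to the picture path used to store negative samples.
Specifically, the image capture parameters include at least one of light intensity, focal length, capture angle, and capture background. A standard parameter value may be preset for each image capture parameter, and an image of the physical certificate captured at the standard parameter values is a standard image. Standard images and the generated electronic certificate images can serve as positive samples of the image sample set to be constructed. An interference instruction is an instruction used to add highly random interference factors to the image capture process, so that the resulting pictures better match the actual process of photographing certificates. A user-triggered interference instruction may be received; after it is received, the parameter value of at least one image capture parameter is adjusted from the standard parameter value to different interference values. Under the resulting combinations of parameter values, images of the same physical certificate can be captured to obtain multiple interference images, or images of different physical certificates can be captured at different parameter values; the resulting interference images serve as negative samples of the image sample set to be constructed. This not only greatly increases the number of sample pictures but also enriches the conditions under which they are captured, improving the diversity of the sample pictures.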
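The split between standard and interference captures can be sketched as an enumeration of parameter combinations. Everything named here (`STANDARD`, `INTERFERENCE`, `capture_plan`, and the example values) is hypothetical; the application names the four parameters but does not give concrete values or say that all combinations are used.

```python
from itertools import product

# Hypothetical standard values and interference values for each parameter.
STANDARD = {"light": "normal", "focus": "sharp", "angle": 0, "background": "plain"}
INTERFERENCE = {
    "light": ["dim", "glare"],
    "focus": ["soft"],
    "angle": [15, 30],
    "background": ["cluttered"],
}


def capture_plan(standard, interference):
    """Yield (settings, label) pairs: one standard shot, then every combination
    in which at least one parameter is moved off its standard value."""
    yield dict(standard), "positive"
    keys = list(standard)
    options = [[standard[k]] + list(interference.get(k, [])) for k in keys]
    for values in product(*options):
        settings = dict(zip(keys, values))
        if settings != standard:
            yield settings, "negative"
```

With the example values above, the plan contains one positive (standard) setting and 35 negative (interference) settings for each physical certificate.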
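Step S210: construct an image sample set from the electronic certificate images and the captured certificate images, the image sample set being used to train a character recognition model.
Specifically, through the above steps the server obtains not only electronic certificate images whose backgrounds closely resemble that of the certificate image, but also the various captured certificate images obtained by capturing images of the physical certificates corresponding to the electronic certificate images. When the image sample set constructed from the electronic certificate images and the captured certificate images is used to train a character recognition model, it can improve the accuracy with which the model recognizes characters.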
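In one of the embodiments, constructing the image sample set from the electronic certificate images and the captured certificate images includes: determining the number of positive samples and the number of negative samples; when the difference between the number of positive samples and the number of negative samples is greater than a preset threshold, determining, among the more numerous samples, the samples whose virtual certificate information ranks highest in similarity; and removing those top-ranked samples from the more numerous samples, so that the difference between the number of remaining samples and the number of the less numerous samples is less than the preset threshold, to obtain an image sample set with balanced numbers of positive and negative samples.
Specifically, after obtaining all sample pictures, the server determines the numbers of positive and negative samples. When the difference between the number of positive samples and the number of negative samples is greater than the preset threshold, the sample counts need to be adjusted so that the difference between positive and negative samples after adjustment is less than the preset threshold. Balanced numbers of positive and negative samples help avoid overfitting or underfitting of the trained character recognition model. For example, if the ratio of the number of positive samples to the number of negative samples is greater than the preset threshold, the virtual certificate information corresponding to each positive sample can be determined, and samples whose virtual certificate information is highly repetitive can be removed from the positive samples to reduce their number.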
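A minimal Python sketch of this balancing step follows. The function `balance` is hypothetical, each sample is assumed to be an `(image_path, virtual_info)` pair, and "similarity ranking" is approximated by how often the same virtual certificate information repeats within the larger class, since the application does not define the similarity measure.

```python
from collections import Counter


def balance(positives, negatives, threshold):
    """Drop the most duplicated samples from the larger class until the class
    sizes differ by less than `threshold` (assumed to be at least 1)."""
    if len(positives) >= len(negatives):
        larger, smaller = positives, negatives
    else:
        larger, smaller = negatives, positives
    if len(larger) - len(smaller) <= threshold:
        return list(positives), list(negatives)  # no adjustment is triggered
    kept = list(larger)
    while kept and len(kept) - len(smaller) >= threshold:
        counts = Counter(info for _, info in kept)
        # The sample whose virtual info repeats most often ranks first.
        idx = max(range(len(kept)), key=lambda i: counts[kept[i][1]])
        del kept[idx]
    if larger is positives:
        return kept, list(negatives)
    return list(positives), kept
```

Removing repeated information first keeps the variety of the larger class as high as possible while its size shrinks.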
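With the above sample set construction method, once the certificate template image that is generated from a certificate image and does not include certificate information has been obtained, multiple sets of virtual certificate information can be generated according to the styles of the various types of certificate information in the certificate image, and the generated virtual certificate information can be written into the certificate template image according to the positions of the various types of certificate information in the certificate image, yielding a large number of electronic certificate images carrying different virtual certificate information. Further, after physical certificates are produced from the electronic certificate images, images of the physical certificates can be captured to obtain captured certificate images, and an image sample set can be constructed from the generated electronic certificate images and the captured certificate images. The constructed image sample set includes not only a large number of electronic certificate images corresponding to different virtual certificate information, but also captured certificate images obtained by simulating the image capture process for real certificates. In other words, the pictures in the image sample set are not only rich in information but also more realistic in origin, which improves sample diversity and makes the samples more balanced; when the image sample set is used to train a character recognition model, a character recognition model with higher recognition accuracy can be obtained.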
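In one of the embodiments, the above sample set construction method further includes: obtaining image processing operations, the image processing operations including at least one of a flip operation, a stretch operation, a rotation operation, a noise-adding operation, and a blur operation; processing the electronic certificate images and the captured certificate images according to the image processing operations to obtain a variety of derived images; using the derived images as negative samples in the image sample set to be constructed; and using the electronic certificate images and the captured certificate images as positive samples in the image sample set.
Specifically, the electronic certificate images and captured certificate images may be further processed, including flipping, stretching, rotating, adding noise, and blurring, to obtain a variety of derived images. At least one of the above image processing operations may be applied to the electronic certificate images and captured certificate images. The captured certificate images here may include standard images and interference images; that is, standard images may be processed to obtain derived images that serve as negative samples, and interference images may likewise be processed to obtain derived images that serve as negative samples. The positive samples used to construct the image sample set thus include the electronic certificate images and the standard images, and the negative samples include the interference images and the derived images obtained by processing the electronic certificate images, the standard images, and the interference images.
In this embodiment, processing the electronic certificate images and captured certificate images with multiple image processing operations increases the number of samples and improves the balance of the samples used to train the character recognition model.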
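Three of the listed operations can be sketched on a grayscale image stored as a nested list of pixel values. The helper names and the noise amount are assumptions for illustration; stretching and blurring are omitted for brevity, and a real pipeline would more likely rely on an imaging library.

```python
import random


def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def add_noise(img, amount, seed=None):
    """Add uniform noise in [-amount, amount] to each pixel, clamped to 0-255."""
    rng = random.Random(seed)
    return [[min(255, max(0, p + rng.randint(-amount, amount))) for p in row]
            for row in img]


def derive(img, seed=None):
    """Return the derived images, each labelled as a negative sample."""
    ops = {
        "flip": flip_horizontal(img),
        "rotate": rotate_90(img),
        "noise": add_noise(img, 20, seed),
    }
    return [(name, out, "negative") for name, out in ops.items()]
```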
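In a specific embodiment, the sample set construction method includes the following steps:
obtaining a certificate template image that is generated from a certificate image and does not include certificate information;
obtaining multiple digits according to the digit-string length of the certificate number in the certificate image to generate virtual certificate numbers; deduplicating all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers;
repeatedly performing the steps of obtaining unmarked Chinese characters from a Chinese character library to generate a virtual name, counting, based on all virtual names generated so far, the number of times each Chinese character in the currently generated virtual name has been used, and marking the corresponding Chinese character when its usage count reaches a preset upper limit, until a preset number of virtual names is obtained;
randomly combining the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information;
obtaining the character formats corresponding to the virtual certificate number and the virtual name;
determining the positions of the various types of certificate information in the certificate image;
writing each set of virtual certificate information into the certificate template image in the respective character formats according to the positions of the various types of certificate information, to obtain electronic certificate images;
determining image capture parameters, the image capture parameters including at least one of light intensity, focal length, capture angle, and capture background;
determining the standard parameter value corresponding to each image capture parameter;
performing image capture on the physical certificate corresponding to the electronic certificate image at the standard parameter values to obtain a standard image; adding the standard image and the electronic certificate image to the picture path used to store positive samples;
when an interference instruction is received, adjusting the parameter value of each image capture parameter from the standard parameter value to different interference values, performing image capture on the physical certificate at the adjusted interference values to obtain interference images, and adding the interference images to the picture path used to store negative samples;
obtaining image processing operations, the image processing operations including at least one of a flip operation, a stretch operation, a rotation operation, a noise-adding operation, and a blur operation;
processing the electronic certificate images and the captured certificate images according to the image processing operations to obtain a variety of derived images;
using the derived images as negative samples in the image sample set to be constructed; using the electronic certificate images and the captured certificate images as positive samples in the image sample set;
determining the number of positive samples and the number of negative samples;
when the difference between the number of positive samples and the number of negative samples is greater than a preset threshold, determining, among the more numerous samples, the samples whose virtual certificate information ranks highest in similarity; and removing those top-ranked samples from the more numerous samples, so that the difference between the number of remaining samples and the number of the less numerous samples is less than the preset threshold, to obtain an image sample set with balanced numbers of positive and negative samples, the image sample set being used to train a character recognition model.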
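It should be understood that, although the steps in the flowchart of FIG. 2 are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.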
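In one embodiment, as shown in FIG. 3, a sample set construction apparatus 300 is provided, including a certificate template image acquisition module 302, a virtual certificate information generation module 304, an electronic certificate image generation module 306, a capture module 308, and a construction module 310, where:
the certificate template image acquisition module 302 is configured to obtain a certificate template image that is generated from a certificate image and does not include certificate information;
the virtual certificate information generation module 304 is configured to generate multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
the electronic certificate image generation module 306 is configured to write the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate electronic certificate images;
the capture module 308 is configured to perform image capture on the physical certificates corresponding to the electronic certificate images to obtain captured certificate images; and
the construction module 310 is configured to construct an image sample set from the electronic certificate images and the captured certificate images, the image sample set being used to train a character recognition model.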
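In one of the embodiments, the virtual certificate information generation module 304 is further configured to obtain multiple digits according to the digit-string length of the certificate number to generate virtual certificate numbers; deduplicate all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers; repeatedly perform the steps of obtaining unmarked Chinese characters from a Chinese character library to generate a virtual name, counting, based on all virtual names generated so far, the number of times each Chinese character in the currently generated virtual name has been used, and marking the corresponding Chinese character when its usage count reaches a preset upper limit, until a preset number of virtual names is obtained; and randomly combine the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
In one of the embodiments, the electronic certificate image generation module 306 is further configured to obtain the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information; determine the positions of the various types of certificate information in the certificate image; and write each set of virtual certificate information into the certificate template image in the respective character formats according to those positions, to obtain electronic certificate images.
In one of the embodiments, the capture module 308 is further configured to determine image capture parameters, the image capture parameters including at least one of light intensity, focal length, capture angle, and capture background; and perform image capture on the physical certificates corresponding to the electronic certificate images while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images.
In one of the embodiments, the capture module 308 is further configured to determine the standard parameter value corresponding to each image capture parameter; perform image capture on the physical certificate corresponding to the electronic certificate image at the standard parameter values to obtain a standard image; add the standard image and the electronic certificate image to the picture path used to store positive samples; and, when an interference instruction is received, adjust the parameter value of each image capture parameter from the standard parameter value to different interference values, perform image capture on the physical certificate at the adjusted interference values to obtain interference images, and add the interference images to the picture path used to store negative samples.
In one of the embodiments, the sample set construction apparatus 300 further includes an image processing module, which is configured to obtain image processing operations, the image processing operations including at least one of a flip operation, a stretch operation, a rotation operation, a noise-adding operation, and a blur operation; process the electronic certificate images and the captured certificate images according to the image processing operations to obtain a variety of derived images; use the derived images as negative samples in the image sample set to be constructed; and use the electronic certificate images and the captured certificate images as positive samples in the image sample set.
In one of the embodiments, the construction module 310 is further configured to determine the number of positive samples and the number of negative samples; when the difference between the number of positive samples and the number of negative samples is greater than a preset threshold, determine, among the more numerous samples, the samples whose virtual certificate information ranks highest in similarity; and remove those top-ranked samples from the more numerous samples, so that the difference between the number of remaining samples and the number of the less numerous samples is less than the preset threshold, to obtain an image sample set with balanced numbers of positive and negative samples.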
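With the above sample set construction apparatus 300, once the certificate template image that is generated from a certificate image and does not include certificate information has been obtained, multiple sets of virtual certificate information can be generated according to the styles of the various types of certificate information in the certificate image, and the generated virtual certificate information can be written into the certificate template image according to the positions of the various types of certificate information in the certificate image, yielding a large number of electronic certificate images carrying different virtual certificate information. Further, after physical certificates are produced from the electronic certificate images, images of the physical certificates can be captured to obtain captured certificate images, and an image sample set can be constructed from the generated electronic certificate images and the captured certificate images. The constructed image sample set includes not only a large number of electronic certificate images corresponding to different virtual certificate information, but also captured certificate images obtained by simulating the image capture process for real certificates. In other words, the pictures in the image sample set are not only rich in information but also more realistic in origin, which improves sample diversity and makes the samples more balanced; when the image sample set is used to train a character recognition model, a character recognition model with higher recognition accuracy can be obtained.
For the specific limitations of the sample set construction apparatus 300, refer to the limitations of the sample set construction method above; they are not repeated here. Each module in the above sample set construction apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.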
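In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, and a network interface connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. When executed by the processor, the computer-readable instructions implement a sample set construction method.
A person skilled in the art will understand that the structure shown in FIG. 4 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.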
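In one embodiment, the sample set construction apparatus 300 provided in this application may be implemented in the form of computer-readable instructions, and the computer-readable instructions may run on the computer device shown in FIG. 4. The memory of the computer device may store the program modules that make up the sample set construction apparatus 300, for example, the certificate template image acquisition module 302, the virtual certificate information generation module 304, the electronic certificate image generation module 306, the capture module 308, and the construction module 310 shown in FIG. 3. The computer-readable instructions formed by the program modules cause the processor to execute the steps of the sample set construction method of the embodiments of this application described in this specification.
For example, the computer device shown in FIG. 4 may execute step S202 through the certificate template image acquisition module 302 in the sample set construction apparatus 300 shown in FIG. 3, step S204 through the virtual certificate information generation module 304, step S206 through the electronic certificate image generation module 306, step S208 through the capture module 308, and step S210 through the construction module 310.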
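A computer device includes a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the processors, cause the one or more processors to execute the steps of the sample set construction method of the embodiments of this application.
One or more non-volatile computer-readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to execute the steps of the sample set construction method of the embodiments of this application.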
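A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).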
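The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.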

Claims (20)

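  1. A sample set construction method, comprising:
    obtaining a certificate template image that is generated from a certificate image and does not include certificate information;
    generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
    writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
    performing image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
    constructing an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.
  2. The method according to claim 1, wherein generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image comprises:
    obtaining multiple digits according to the digit-string length of a certificate number to generate virtual certificate numbers; deduplicating all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers;
    repeatedly performing the steps of obtaining unmarked Chinese characters from a Chinese character library to generate a virtual name, counting, based on all virtual names generated so far, the number of times each Chinese character in the currently generated virtual name has been used, and marking the corresponding Chinese character when the usage count reaches a preset upper limit, until a preset number of virtual names is obtained; and
    randomly combining the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
  3. The method according to claim 1, wherein writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image, comprises:
    obtaining the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information;
    determining the positions of the various types of certificate information in the certificate image; and
    writing each set of virtual certificate information into the certificate template image in the respective character formats according to the positions of the various types of certificate information, to obtain an electronic certificate image.
  4. The method according to claim 1, wherein performing image capture on the physical certificate corresponding to the electronic certificate image to obtain a captured certificate image comprises:
    determining image capture parameters, the image capture parameters including at least one of light intensity, focal length, capture angle, and capture background; and
    performing image capture on the physical certificate corresponding to the electronic certificate image while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images.
  5. The method according to claim 4, wherein performing image capture on the physical certificate corresponding to the electronic certificate image while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images, comprises:
    determining the standard parameter value corresponding to each image capture parameter;
    performing image capture on the physical certificate corresponding to the electronic certificate image at the standard parameter values to obtain a standard image; adding the standard image and the electronic certificate image to the picture path used to store positive samples; and
    when an interference instruction is received,
    adjusting the parameter value of each image capture parameter from the standard parameter value to different interference values, performing image capture on the physical certificate at the adjusted interference values to obtain interference images, and adding the interference images to the picture path used to store negative samples.
  6. The method according to claim 1, wherein the method further comprises:
    obtaining image processing operations, the image processing operations including at least one of a flip operation, a stretch operation, a rotation operation, a noise-adding operation, and a blur operation;
    processing the electronic certificate image and the captured certificate image according to the image processing operations to obtain a variety of derived images;
    using the derived images as negative samples in the image sample set to be constructed; and
    using the electronic certificate image and the captured certificate image as positive samples in the image sample set.
  7. The method according to claim 5 or 6, wherein constructing an image sample set from the electronic certificate image and the captured certificate image comprises:
    determining the number of the positive samples and the number of the negative samples;
    when the difference between the number of the positive samples and the number of the negative samples is greater than a preset threshold,
    determining, among the more numerous samples, the samples whose virtual certificate information ranks highest in similarity; and
    removing the samples ranked highest in similarity from the more numerous samples, so that the difference between the number of remaining samples and the number of the less numerous samples is less than the preset threshold, to obtain an image sample set with balanced numbers of positive and negative samples.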
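  8. A sample set construction apparatus, the apparatus comprising:
    a certificate template image acquisition module, configured to obtain a certificate template image that is generated from a certificate image and does not include certificate information;
    a virtual certificate information generation module, configured to generate multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
    an electronic certificate image generation module, configured to write the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
    a capture module, configured to perform image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
    a construction module, configured to construct an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.
  9. The apparatus according to claim 8, wherein the virtual certificate information generation module is further configured to obtain multiple digits according to the digit-string length of a certificate number to generate virtual certificate numbers; deduplicate all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers; repeatedly perform the steps of obtaining unmarked Chinese characters from a Chinese character library to generate a virtual name, counting, based on all virtual names generated so far, the number of times each Chinese character in the currently generated virtual name has been used, and marking the corresponding Chinese character when the usage count reaches a preset upper limit, until a preset number of virtual names is obtained; and randomly combine the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
  10. The apparatus according to claim 8, wherein the electronic certificate image generation module is further configured to obtain the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information; determine the positions of the various types of certificate information in the certificate image; and write each set of virtual certificate information into the certificate template image in the respective character formats according to the positions of the various types of certificate information, to obtain an electronic certificate image.
  11. The apparatus according to claim 8, wherein the capture module is further configured to determine image capture parameters, the image capture parameters including at least one of light intensity, focal length, capture angle, and capture background; and perform image capture on the physical certificate corresponding to the electronic certificate image while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images.
  12. The apparatus according to any one of claims 8 to 11, wherein the apparatus further comprises an image processing module;
    the image processing module is configured to obtain image processing operations, the image processing operations including at least one of a flip operation, a stretch operation, a rotation operation, a noise-adding operation, and a blur operation; process the electronic certificate image and the captured certificate image according to the image processing operations to obtain a variety of derived images; use the derived images as negative samples in the image sample set to be constructed; and use the electronic certificate image and the captured certificate image as positive samples in the image sample set.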
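  13. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the processors, cause the one or more processors to perform the following steps:
    obtaining a certificate template image that is generated from a certificate image and does not include certificate information;
    generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
    writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
    performing image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
    constructing an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.
  14. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    obtaining multiple digits according to the digit-string length of a certificate number to generate virtual certificate numbers; deduplicating all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers;
    repeatedly performing the steps of obtaining unmarked Chinese characters from a Chinese character library to generate a virtual name, counting, based on all virtual names generated so far, the number of times each Chinese character in the currently generated virtual name has been used, and marking the corresponding Chinese character when the usage count reaches a preset upper limit, until a preset number of virtual names is obtained; and
    randomly combining the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
  15. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    obtaining the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information;
    determining the positions of the various types of certificate information in the certificate image; and
    writing each set of virtual certificate information into the certificate template image in the respective character formats according to the positions of the various types of certificate information, to obtain an electronic certificate image.
  16. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    determining image capture parameters, the image capture parameters including at least one of light intensity, focal length, capture angle, and capture background; and
    performing image capture on the physical certificate corresponding to the electronic certificate image while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images.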
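  17. One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a certificate template image that is generated from a certificate image and does not include certificate information;
    generating multiple sets of virtual certificate information according to the styles of the various types of certificate information in the certificate image;
    writing the virtual certificate information into the certificate template image according to the positions of the various types of certificate information in the certificate image, to generate an electronic certificate image;
    performing image capture on a physical certificate corresponding to the electronic certificate image to obtain a captured certificate image; and
    constructing an image sample set from the electronic certificate image and the captured certificate image, the image sample set being used to train a character recognition model.
  18. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    obtaining multiple digits according to the digit-string length of a certificate number to generate virtual certificate numbers; deduplicating all generated virtual certificate numbers to obtain a preset number of virtual certificate numbers;
    repeatedly performing the steps of obtaining unmarked Chinese characters from a Chinese character library to generate a virtual name, counting, based on all virtual names generated so far, the number of times each Chinese character in the currently generated virtual name has been used, and marking the corresponding Chinese character when the usage count reaches a preset upper limit, until a preset number of virtual names is obtained; and
    randomly combining the obtained virtual certificate numbers and virtual names to obtain multiple sets of virtual certificate information.
  19. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    obtaining the character formats corresponding to the virtual certificate number and the virtual name in the virtual certificate information;
    determining the positions of the various types of certificate information in the certificate image; and
    writing each set of virtual certificate information into the certificate template image in the respective character formats according to the positions of the various types of certificate information, to obtain an electronic certificate image.
  20. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    determining image capture parameters, the image capture parameters including at least one of light intensity, focal length, capture angle, and capture background; and
    performing image capture on the physical certificate corresponding to the electronic certificate image while the image capture parameters take different parameter values, to obtain a preset number of captured certificate images.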
PCT/CN2019/117857 2019-03-19 2019-11-13 Sample set construction method, apparatus, computer device, and storage medium WO2020186785A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910208401.9 2019-03-19
CN201910208401.9A CN110059689B (zh) 2019-03-19 2019-03-19 Sample set construction method, apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020186785A1 true WO2020186785A1 (zh) 2020-09-24

Family

ID=67317215

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117857 WO2020186785A1 (zh) 2019-03-19 2019-11-13 Sample set construction method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110059689B (zh)
WO (1) WO2020186785A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
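CN110059689B (zh) * 2019-03-19 2024-05-03 平安科技(深圳)有限公司 Sample set construction method, apparatus, computer device, and storage medium
CN110689063B (zh) * 2019-09-18 2023-07-25 平安科技(深圳)有限公司 Training method and apparatus for neural-network-based certificate recognition
CN113313120A (zh) * 2020-02-27 2021-08-27 顺丰科技有限公司 Method and apparatus for establishing a smart card image recognition model
CN112669412A (zh) * 2020-12-23 2021-04-16 深圳壹账通智能科技有限公司 Certificate picture generation method, apparatus, device, and storage medium
CN112528998B (zh) * 2021-02-18 2021-06-01 成都新希望金融信息有限公司 Certificate image processing method and apparatus, electronic device, and readable storage medium
CN113239339A (zh) * 2021-02-26 2021-08-10 平安普惠企业管理有限公司 Certificate photographing method and apparatus, computer device, and storage medium
CN113313114B (zh) * 2021-06-11 2023-06-30 北京百度网讯科技有限公司 Certificate information acquisition method, apparatus, device, and storage medium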

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110078191A1 (en) * 2009-09-28 2011-03-31 Xerox Corporation Handwritten document categorizer and method of training
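CN106682629A (zh) * 2016-12-30 2017-05-17 佳都新太科技股份有限公司 Algorithm for recognizing identity card numbers against complex backgrounds
CN108154148A (zh) * 2018-01-22 2018-06-12 厦门美亚商鼎信息科技有限公司 Method for artificially synthesizing training samples and CAPTCHA recognition method based on those samples
CN108460414A (zh) * 2018-02-27 2018-08-28 北京三快在线科技有限公司 Method and apparatus for generating training sample images, and electronic device
CN110059689A (zh) * 2019-03-19 2019-07-26 平安科技(深圳)有限公司 Sample set construction method, apparatus, computer device, and storage medium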

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
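CN107292823A (zh) * 2017-08-20 2017-10-24 平安科技(深圳)有限公司 Electronic apparatus, invoice classification method, and computer-readable storage medium
CN108549881A (zh) * 2018-05-02 2018-09-18 杭州创匠信息科技有限公司 Method and apparatus for recognizing text on certificates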

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
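CN114862863A (zh) * 2022-07-11 2022-08-05 四川大学 Crankshaft surface defect detection method and detection system with sample balancing
CN114862863B (zh) * 2022-07-11 2022-09-20 四川大学 Crankshaft surface defect detection method and detection system with sample balancing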

Also Published As

Publication number Publication date
CN110059689A (zh) 2019-07-26
CN110059689B (zh) 2024-05-03

Similar Documents

Publication Publication Date Title
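WO2020186785A1 (zh) Sample set construction method, apparatus, computer device, and storage medium
WO2021135499A1 (zh) Damage detection model training and vehicle damage detection methods, apparatus, device, and medium
US9373030B2 (en) Automated document recognition, identification, and data extraction
CN109034159A (zh) Image information extraction method and apparatus
US10929597B2 (en) Techniques and systems for storing and protecting signatures and images in electronic documents
WO2021012382A1 (zh) Method and apparatus for configuring a chatbot, computer device, and storage medium
DE112019000334T5 (de) Validating the identity of a remote user by comparison based on thresholds
CN110147787A (zh) Deep-learning-based method and system for automatic bank card number recognition
CN113111880B (zh) Certificate image correction method and apparatus, electronic device, and storage medium
US9081801B2 (en) Metadata supersets for matching images
WO2017143973A1 (zh) Method and apparatus for building a text recognition model
CN111881904A (zh) Blackboard-writing recording method and system
WO2022126978A1 (zh) Invoice information extraction method and apparatus, computer device, and storage medium
CN114332883A (zh) Invoice information recognition method and apparatus, computer device, and storage medium
CN112396047B (zh) Training sample generation method and apparatus, computer device, and storage medium
CN110689063B (zh) Training method and apparatus for neural-network-based certificate recognition
CN111930976A (zh) Presentation generation method, apparatus, device, and storage medium
CN112348008A (zh) Certificate information recognition method and apparatus, terminal device, and storage medium
CN110909733A (zh) Template positioning method and apparatus based on OCR image recognition, and computer device
CN111950542B (zh) Learning scanning pen based on an OCR recognition algorithm
CN112395834B (zh) Mind map generation method, apparatus, device, and storage medium based on image input
CN115223183A (zh) Information extraction method and apparatus, and electronic device
WO2022252641A1 (zh) Forgery detection method, apparatus, device, and storage medium based on differences among multiple images
CN113936187A (zh) Text image synthesis method and apparatus, storage medium, and electronic device
CN111753108A (zh) Presentation generation method, apparatus, device, and medium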

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 19920113; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 EP: PCT application non-entry in European phase
Ref document number: 19920113; Country of ref document: EP; Kind code of ref document: A1