WO2023015922A1 - Training method, apparatus, device and storage medium for image recognition model - Google Patents

Training method, apparatus, device and storage medium for image recognition model

Info

Publication number
WO2023015922A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
recognition model
image
target
text content
Application number
PCT/CN2022/085915
Other languages
English (en)
French (fr)
Inventor
乔美娜
刘珊珊
钦夏孟
章成全
姚锟
Original Assignee
北京百度网讯科技有限公司
Application filed by 北京百度网讯科技有限公司
Priority to US 17/905,965 (published as US20230401828A1)
Publication of WO2023015922A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields

Definitions

  • The present disclosure relates to the field of computer technology, specifically to the field of artificial intelligence technology such as computer vision and deep learning, and in particular to a training method, apparatus, device, storage medium and computer program product for an image recognition model.
  • Optical character recognition (OCR) technology can be used to extract text information in many scenes such as documents, books, and scanned copies, which greatly facilitates the collection and processing of information. However, for specific vertical categories in particular scenes such as certificates and bills, the amount of training data that can be obtained is limited, so the recognition accuracy of the trained OCR model is not high. How to improve the recognition accuracy for different vertical categories in a specific scene is therefore of great significance.
  • The present disclosure provides a training method, apparatus, device, storage medium and computer program product for an image recognition model.
  • According to a first aspect of the present disclosure, a method for training an image recognition model is provided, including: acquiring a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; training an initial recognition model with the first text images to obtain a basic recognition model; and performing correction training on the basic recognition model with the second text images to obtain an image recognition model corresponding to the target scene.
  • According to a second aspect of the present disclosure, a training apparatus for an image recognition model is provided, including: a first acquisition module configured to acquire a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; a second acquisition module configured to train an initial recognition model with the first text images to obtain a basic recognition model; and a third acquisition module configured to perform correction training on the basic recognition model with the second text images to obtain an image recognition model corresponding to the target scene.
  • An embodiment of a third aspect of the present disclosure provides an electronic device, including: at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method proposed in the embodiment of the first aspect of the present disclosure.
  • An embodiment of a fourth aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause the computer to execute the method proposed in the embodiment of the first aspect of the present disclosure.
  • An embodiment of a fifth aspect of the present disclosure provides a computer program product, including a computer program; when the computer program is executed by a processor, the method proposed in the embodiment of the first aspect of the present disclosure is implemented.
  • The training method, apparatus, device, storage medium and computer program product for an image recognition model provided by the present disclosure have at least the following beneficial effect: the generated image recognition model can have higher recognition accuracy and stronger applicability.
  • FIG. 1 is a schematic flowchart of a method for training an image recognition model according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic flowchart of a method for training an image recognition model according to another embodiment of the present disclosure;
  • FIG. 3 is a schematic structural diagram of a training apparatus for an image recognition model according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic structural diagram of a training apparatus for an image recognition model according to another embodiment of the present disclosure;
  • FIG. 5 is a block diagram of an electronic device for implementing the method for training an image recognition model according to an embodiment of the present disclosure.
  • Artificial intelligence is a discipline that studies how to use computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include several major directions such as computer vision technology, speech recognition technology, natural language processing technology, machine learning, deep learning, big data processing technology, and knowledge graph technology.
  • Deep learning learns the intrinsic regularities and representation levels of sample data, and the information obtained during this learning process is of great help in interpreting data such as text, images, and sounds. Its ultimate goal is to enable machines to have the same analytical learning ability as humans and to recognize data such as text, images, and sounds. Deep learning is a complex machine learning algorithm that has achieved results in speech and image recognition that far exceed those of earlier related techniques.
  • Computer vision is an interdisciplinary scientific field that studies how to enable computers to obtain high-level understanding from digital images or videos. From an engineering standpoint, it seeks to automate tasks that the human visual system can accomplish. Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, and methods for extracting high-dimensional data from the real world to produce numerical or symbolic information, for example, in the form of decisions.
  • The present disclosure provides a training method for an image recognition model, which can be executed by the training apparatus for an image recognition model provided by the present disclosure, or by the electronic device provided by the present disclosure, where the electronic device may include, but is not limited to, terminal devices such as mobile phones, desktop computers, and tablet computers, and may also be a server. In the following, the training method provided by the present disclosure is described as being executed by the training apparatus provided by the present disclosure, hereinafter referred to simply as the "apparatus", which does not constitute a limitation of the present disclosure.
  • FIG. 1 is a schematic flowchart of a method for training an image recognition model according to an embodiment of the present disclosure.
  • As shown in FIG. 1, the training method for the image recognition model may include the following steps:
  • Step S101: acquire a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images.
  • The target scene may be any specified scene. It can be understood that the target scene may have certain attributes or characteristics, and each kind of text image to be recognized in the target scene may be called a vertical category.
  • For example, the target scene may be a traffic scene, in which case the text images of each vertical category in this scene may be text images of driving licenses, driver's licenses, vehicle certificates, and the like, which is not limited here.
  • Alternatively, the target scene may be a financial scene, in which case the text images of each vertical category in this scene may be text images of value-added tax invoices, machine-printed invoices, itineraries, bank checks, bank receipts, and the like, which is not limited here.
  • The non-target scene may be a scene that is similar to the target scene or has a certain intrinsic relationship with the target scene. For example, the text images of each vertical category in the target scene and the text images of each vertical category in the non-target scene contain the same types of text content.
  • For example, if the current target scene is a traffic scene, the non-target scene may be a certificate scene. It should be noted that, in the certificate scene, the text images to be recognized are usually ID cards, passports, and the like. Text images such as ID cards and passports, and text images such as driver's licenses, driving licenses, and vehicle certificates, all contain text types such as characters, dates, and certificate numbers, so the text images in the certificate scene can be used as the first text images, that is, the text images corresponding to the non-target scene, which is not limited here.
  • The first text images and the second text images included in the training data set may be images acquired by an image sensor, such as a webcam or a camera, and they may be color images or gray images, which is not limited here. In addition, data synthesis and data augmentation can also be performed on the text data in the training data set so as to enhance the diversity of the training data, which is not limited here; a sketch of such augmentation follows.
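The augmentation mentioned above could look like the following minimal sketch. The specific perturbations (brightness, contrast, blur, small rotations) are illustrative choices, not transforms prescribed by the disclosure.

```python
# Hedged sketch: simple perturbations to diversify text-image training data.
import random
from PIL import Image, ImageEnhance, ImageFilter

def augment_text_image(img: Image.Image) -> Image.Image:
    """Randomly perturb a text image to enhance training-data diversity."""
    if random.random() < 0.5:   # brightness jitter
        img = ImageEnhance.Brightness(img).enhance(random.uniform(0.7, 1.3))
    if random.random() < 0.5:   # contrast jitter
        img = ImageEnhance.Contrast(img).enhance(random.uniform(0.7, 1.3))
    if random.random() < 0.3:   # mild blur, imitating low-quality captures
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1.5)))
    if random.random() < 0.3:   # small rotation, imitating skewed scans
        img = img.rotate(random.uniform(-3.0, 3.0), expand=True, fillcolor="white")
    return img
```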
  • Step S102: train the initial recognition model with the first text images to obtain a basic recognition model.
  • The initial recognition model may be an initial deep learning network model that has not undergone any training, and the basic recognition model may be the network model generated in the process of training the initial recognition model with the first text images, that is, the training data.
  • In some examples, the first text images, that is, the training data, can be input into the initial recognition model in batches according to preset parameters; then, according to the error function of the initial recognition model, the error between the text data extracted by the initial recognition model from a text image and the real text data corresponding to that text image can be determined; and then, based on the error, back-propagation training is performed on the initial recognition model to obtain the basic recognition model.
  • It should be noted that the number of first text images used for training the initial recognition model may be, for example, eight thousand or ten thousand, which is not limited here.
  • Optionally, in some implementations, the initial recognition model may be a network model such as a convolutional recurrent neural network (CRNN) or an attention-mechanism-based model, which is not limited here. A minimal training sketch is given below.
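As a concrete illustration of step S102, here is a minimal PyTorch-style sketch of the first training stage. It is not the disclosure's reference implementation: the tiny CRNN architecture, the CTC loss, the data loader `first_stage_loader` (assumed to yield image batches with encoded labels and label lengths), and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of stage 1 (step S102): train an initial CRNN-style model on
# batches of first text images and backpropagate the recognition error.
# TinyCRNN, first_stage_loader, and all hyperparameters are assumptions.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Stand-in CRNN: conv features -> BiLSTM -> per-timestep character logits."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(input_size=128 * 8, hidden_size=256, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, 1, 32, W) grayscale
        f = self.conv(x)                        # (N, 128, 8, W/4)
        f = f.permute(3, 0, 1, 2).flatten(2)    # (T=W/4, N, 1024) time-major sequence
        out, _ = self.rnn(f)                    # (T, N, 512)
        return self.fc(out)                     # (T, N, num_classes)

vocab_size = 6625                               # assumed character set size incl. CTC blank
model = TinyCRNN(num_classes=vocab_size)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for images, targets, target_lengths in first_stage_loader:  # assumed DataLoader
    log_probs = model(images).log_softmax(2)
    input_lengths = torch.full((images.size(0),), log_probs.size(0), dtype=torch.long)
    loss = criterion(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()                             # backpropagate the recognition error
    optimizer.step()

torch.save(model.state_dict(), "basic_recognition_model.pt")  # the basic model
```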
  • Step S103: perform correction training on the basic recognition model with the second text images to obtain an image recognition model corresponding to the target scene.
  • It should be noted that, after the basic recognition model is determined, the second text images corresponding to the target scene can be used as training data to perform correction training on the basic recognition model, so as to obtain the image recognition model corresponding to the target scene.
  • In some examples, the second text images, that is, the training data, can be input into the basic recognition model in batches according to preset parameters; then, according to the error function of the basic recognition model, the error between the text data extracted by the basic recognition model from a text image and the real text data corresponding to that text image can be determined; and then, based on the error, back-propagation training is performed on the basic recognition model to obtain the image recognition model corresponding to the target scene, as sketched below.
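Continuing the stage-1 sketch above, step S103 can be read as fine-tuning: the basic model's weights are reloaded and further trained on second text images from the target scene. The smaller learning rate and the loader `second_stage_loader` are again assumptions, not details fixed by the disclosure.

```python
# Sketch of stage 2 (step S103): correction training of the basic recognition
# model on second text images from the target scene; reuses model, criterion,
# and the batch layout from the stage-1 sketch. The lower LR is an assumption.
model.load_state_dict(torch.load("basic_recognition_model.pt"))
finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-4)   # gentler updates

for images, targets, target_lengths in second_stage_loader:    # target-scene data
    log_probs = model(images).log_softmax(2)
    input_lengths = torch.full((images.size(0),), log_probs.size(0), dtype=torch.long)
    loss = criterion(log_probs, targets, input_lengths, target_lengths)
    finetune_opt.zero_grad()
    loss.backward()
    finetune_opt.step()

torch.save(model.state_dict(), "target_scene_model.pt")  # target-scene recognition model
```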
  • Optionally, the training data set may also include text images from any scene, for example text images of documents, books, scanned copies, and the like, which is not limited here. When the basic recognition model is obtained through training, the text images from any scene and the first text images can be jointly used as training data. Correspondingly, when the image recognition model corresponding to the target scene is obtained through training, the text images from any scene and the second text images can be jointly used as training data.
  • In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is then performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene. Therefore, when the image recognition model for the target scene is trained, by using text images of different vertical categories in scenes similar to the target scene together with text images of different vertical categories in the target scene, a single recognition model applicable to the different vertical categories of the target scene is obtained through training, which improves the recognition accuracy and versatility of the model, reduces the memory occupied by the model, and saves manpower and material resources.
  • FIG. 2 is a schematic flowchart of a method for training an image recognition model according to another embodiment of the present disclosure.
  • As shown in FIG. 2, the training method for the image recognition model may include the following steps:
  • Step S201: acquire a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images.
  • It should be noted that, for the specific implementation process of step S201, reference may be made to the foregoing embodiment, and details are not repeated here.
  • Optionally, the training data set may include the first labeled text content corresponding to the first text images, the position information of the first text boxes, and the first annotation type labels corresponding to the first labeled text content.
  • It should be noted that, for each collected first text image, each piece of text content can be annotated first, the position information of each text box can be determined at the same time, and a corresponding type label can be determined for the first labeled text content; the first text image is then added to the training data set. The first labeled text content may be each piece of text contained in the first text image.
  • For example, if the current first text image is a value-added tax invoice text image, the corresponding first labeled text content may be text information such as the buyer's name, the taxpayer identification number, the invoicing date, and the tax amount on the value-added tax invoice. The first text box may be the text box determined by each piece of first labeled text content. The first annotation type label may be the type annotated on each first text box; for example, the invoicing date can be labeled "date", the taxpayer identification number can be labeled "number", and the tax amount can be labeled "amount", which is not limited here.
  • Specifically, after the first text box is determined, the location of the text box can be determined, and the position information of the first text box can then be determined. For example, the coordinate information of the first text box may be used as the position information of the first text box, which is not limited here; one possible record layout is sketched below.
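One way to picture a record in such a training data set is the sketch below; the field names, the corner-coordinate convention for the text box position, and the sample values are all illustrative assumptions.

```python
# Hedged sketch of one annotated sample: labeled text content, text box
# position information, and the annotation type label for each box.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TextAnnotation:
    text: str                          # labeled text content
    box: Tuple[int, int, int, int]     # text box position: (left, top, right, bottom)
    type_label: str                    # annotation type label, e.g. "date"

@dataclass
class TrainingSample:
    image_path: str
    annotations: List[TextAnnotation]

sample = TrainingSample(
    image_path="vat_invoice_0001.png",             # illustrative file name
    annotations=[
        TextAnnotation("2021-08-12", (412, 88, 540, 112), "date"),
        TextAnnotation("9111000012345678XX", (120, 150, 380, 174), "number"),
        TextAnnotation("1,234.56", (620, 402, 720, 426), "amount"),
    ],
)
```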
  • Step S202: acquire, according to the position information of the first text box, a first target image to be recognized from the first text image. The location of the first target image to be recognized can be determined according to the position information of the first text box, and the image of the region to be recognized, that is, the first target image, can then be determined from the first text image according to that location, as in the sketch below.
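Reusing the `TrainingSample` record sketched above, cropping the to-be-recognized region out of the full text image might look like this; the grayscale conversion and the coordinate convention are assumptions.

```python
# Hedged sketch of step S202: crop the first target image (the region to be
# recognized) from the first text image using the text box position information.
from PIL import Image

def crop_target_image(image_path: str, box: tuple) -> Image.Image:
    """Return the to-be-recognized region given a (left, top, right, bottom) box."""
    page = Image.open(image_path).convert("L")  # gray; the disclosure allows color or gray
    return page.crop(box)                       # PIL crops with a 4-tuple box

first_target = crop_target_image(sample.image_path, sample.annotations[0].box)
```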
  • Step S203: input the first target image into the initial recognition model to obtain the predicted text content output by the initial recognition model. Optionally, the first target image may be input into the initial recognition model to obtain the predicted text content and the predicted type label output by the initial recognition model. During training, target images can also be continuously added for training.
  • Step S204: correct the initial recognition model according to the difference between the predicted text content and the first labeled text content, so as to obtain the basic recognition model.
  • The distance between each pixel in the predicted text content and the corresponding pixel in the first labeled text content can be determined first, and the difference between the predicted text content and the first labeled text content can then be characterized by the distances between the corresponding pixels.
  • For example, the Euclidean distance formula can be used to determine the distances between corresponding pixels of the predicted text content and the first labeled text content, or the Manhattan distance formula can be used to calculate those distances; a correction gradient is then determined from them and used to correct the initial recognition model, which is not limited here.
  • Optionally, the initial recognition model may also be corrected according to both the difference between the predicted text content and the first labeled text content and the difference between the predicted type label and the first annotation type label, so as to obtain the basic recognition model. For example, the initial recognition model may first be corrected according to the difference between the predicted text content and the first labeled text content, and then corrected according to the difference between the predicted type label and the first annotation type label. Alternatively, the initial recognition model may first be corrected according to the difference between the predicted type label and the first annotation type label, and then corrected according to the difference between the predicted text content and the first labeled text content. Alternatively, the initial recognition model may be corrected according to both differences at the same time, so as to obtain the basic recognition model.
  • In the embodiment of the present disclosure, by training the recognition model to output the predicted text content and the predicted type label at the same time, the recognition model can automatically annotate the information type of the recognized text when it is used, thereby facilitating further processing of the information. A joint-loss sketch follows.
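A joint correction signal of the kind described above can be sketched by adding a type-label head and summing two loss terms; the extra linear head, the cross-entropy choice, and the equal weighting are assumptions layered on the stage-1 training sketch.

```python
# Hedged sketch: correct the model on both differences at once by combining a
# text-content loss (the CTC criterion from the stage-1 sketch) with a
# type-label loss over a small classification head.
import torch
import torch.nn as nn

num_type_labels = 8                           # assumed number of type labels
type_head = nn.Linear(512, num_type_labels)  # assumed head over pooled RNN features
type_criterion = nn.CrossEntropyLoss()

def joint_loss(log_probs, targets, input_lengths, target_lengths,
               rnn_features, type_labels):
    text_loss = criterion(log_probs, targets, input_lengths, target_lengths)
    type_logits = type_head(rnn_features.mean(dim=0))   # (N, num_type_labels)
    type_loss = type_criterion(type_logits, type_labels)
    return text_loss + type_loss                        # equal weighting assumed
```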
  • Optionally, the training data set may further include the second labeled text content corresponding to the second text images, the position information of the second text boxes, and the second annotation type labels corresponding to the second labeled text content.
  • Step S205: acquire, according to the position information of the second text box, a second target image to be recognized from the second text image. The location of the second target image to be recognized can be determined according to the position information of the second text box, and the image of the region to be recognized, that is, the second target image, can then be determined from the second text image according to that location.
  • Step S206: input the second target image into the basic recognition model to obtain the predicted text content and the predicted type label output by the basic recognition model.
  • Step S207: correct the basic recognition model according to the difference between the predicted text content and the second labeled text content and the difference between the predicted type label and the second annotation type label, so as to obtain the image recognition model corresponding to the target scene.
  • For the specific implementation of steps S205, S206, and S207, reference may be made to steps S202, S203, and S204 above, and details are not repeated here.
  • Step S208: acquire a target text image to be recognized. The target text image, that is, the specified image to be recognized, can be any text image, such as a certificate or a bill, which is not limited here.
  • The target text image may be an image acquired by any image sensor, such as a webcam or a camera, and it may be a color image or a gray image, which is not limited here.
  • Step S209: parse the target text image to determine the scene to which the target text image belongs. The obtained target text image can be parsed to determine the scene corresponding to the target text image. For example, if the current target text image is a driver's license text image, it can be determined that the current target text image belongs to a traffic scene; if the current target text image is a value-added tax invoice image, it can be determined that the target text image belongs to a financial scene, which is not limited here.
  • Step S210: input the target text image into the image recognition model corresponding to the scene to which it belongs, so as to obtain the text content contained in the target text image. After the scene to which the target text image belongs is determined, the image recognition model corresponding to that scene can be determined; the target text image can then be input into that image recognition model, so that the text content corresponding to the target text image can be output. For example, if the target text image is a driver's license, it can be input into the image recognition model for the traffic scene; if the target text image is a value-added tax invoice, it can be input into the image recognition model for the financial scene.
  • In the embodiment of the present disclosure, by determining the scene to which the target text image belongs and then using the image recognition model corresponding to that scene to recognize the target text image, the reliability and accuracy of image recognition are improved; the dispatch sketched below illustrates this flow.
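The inference flow of steps S208 to S210 can be pictured as a small dispatch table; `classify_scene` stands in for the unspecified scene-analysis step, and the file names and model class reuse names from the training sketches above, so everything here is an illustrative assumption.

```python
# Hedged sketch of steps S208-S210: determine the scene of the target text
# image, then route it to the image recognition model trained for that scene.
import torch

def load_scene_model(path: str) -> torch.nn.Module:
    m = TinyCRNN(num_classes=vocab_size)   # architecture from the stage-1 sketch
    m.load_state_dict(torch.load(path))
    m.eval()
    return m

scene_models = {
    "traffic": load_scene_model("traffic_scene_model.pt"),   # e.g. driver's licenses
    "finance": load_scene_model("finance_scene_model.pt"),   # e.g. VAT invoices
}

def recognize(image: torch.Tensor, classify_scene) -> torch.Tensor:
    scene = classify_scene(image)          # assumed scene-analysis step (step S209)
    with torch.no_grad():
        return scene_models[scene](image)  # per-timestep logits to decode into text
```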
  • In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene; the target text image to be recognized is then acquired and parsed to determine the scene to which it belongs, and the target text image is finally input into the image recognition model corresponding to that scene to obtain the text content it contains. When the basic recognition model is trained, the initial recognition model is corrected according to the difference between the predicted text content and the first labeled text content; when the image recognition model for the target scene is trained, the basic recognition model is corrected according to the difference between the predicted text content and the second labeled text content and the difference between the predicted type label and the second annotation type label, so that the generated image recognition model has higher accuracy and stronger applicability and can accurately generate the corresponding text content from the target text image.
  • According to the embodiments of the present disclosure, the present disclosure also provides a training apparatus for an image recognition model.
  • FIG. 3 is a schematic structural diagram of a training apparatus for an image recognition model according to an embodiment of the present disclosure. As shown in FIG. 3, the training apparatus 300 for the image recognition model may include: a first acquisition module 310, a second acquisition module 320, and a third acquisition module 330.
  • The first acquisition module 310 is configured to acquire the training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images.
  • The second acquisition module 320 is configured to train the initial recognition model with the first text images to obtain the basic recognition model.
  • The third acquisition module 330 is configured to perform correction training on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene.
  • the training data set further includes text images in any scene.
  • In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is then performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene. Therefore, when the image recognition model for the target scene is trained, by using text images of different vertical categories in scenes similar to the target scene together with text images of different vertical categories in the target scene, a single recognition model applicable to the different vertical categories of the target scene is obtained through training, which improves the recognition accuracy and versatility of the model, reduces the memory occupied by the model, and saves manpower and material resources.
  • FIG. 4 is a schematic structural diagram of a training apparatus for an image recognition model according to another embodiment of the present disclosure. As shown in FIG. 4, the training apparatus 400 for the image recognition model may include: a first acquisition module 410, a second acquisition module 420, and a third acquisition module 430.
  • In a possible implementation of the embodiment of the present disclosure, the training data set further includes the first labeled text content corresponding to the first text image and the position information of the first text box.
  • The second acquisition module 420 may include:
  • a first acquisition unit 421 configured to acquire, according to the position information of the first text box, the target image to be recognized from the first text image;
  • a second acquisition unit 422 configured to input the target image into the initial recognition model to obtain the predicted text content output by the initial recognition model; and
  • a third acquisition unit 423 configured to correct the initial recognition model according to the difference between the predicted text content and the first labeled text content, so as to obtain the basic recognition model.
  • In a possible implementation of the embodiment of the present disclosure, the training data set further includes the first annotation type label corresponding to the first labeled text content.
  • The second acquisition unit 422 is specifically configured to input the target image into the initial recognition model to obtain the predicted text content and the predicted type label output by the initial recognition model.
  • The third acquisition unit 423 is specifically configured to correct the initial recognition model according to the difference between the predicted text content and the first labeled text content and the difference between the predicted type label and the first annotation type label, so as to obtain the basic recognition model.
  • the training data set further includes the second annotation text content corresponding to the second text image, the position information of the second text box, and the second annotation type label corresponding to the second annotation text content .
  • the third acquisition module 430 may include:
  • the fourth obtaining unit 431 is configured to obtain the second target image to be recognized from the second text image according to the position information of the second text box.
  • the fifth obtaining unit 432 is configured to input the second target image into the basic recognition model, so as to obtain the predicted text content and the predicted type label output by the basic recognition model.
  • the sixth acquisition unit 433 is configured to correct the basic recognition model according to the difference between the predicted text content and the second labeled text content, and the difference between the predicted type label and the second labeled type label, so as to obtain the image recognition model corresponding to the target scene .
  • In a possible implementation of the embodiment of the present disclosure, the training apparatus may further include a fourth acquisition module 440, a first determination module 450, and a fifth acquisition module 460.
  • The fourth acquisition module 440 is configured to acquire the target text image to be recognized.
  • The first determination module 450 is configured to parse the target text image to determine the scene to which the target text image belongs.
  • The fifth acquisition module 460 is configured to input the target text image into the image recognition model corresponding to the scene to which it belongs, so as to obtain the text content contained in the target text image.
  • It can be understood that the training apparatus 400 for the image recognition model in FIG. 4 of the embodiment of the present disclosure and the training apparatus 300 in the above embodiment, the first acquisition module 410 and the first acquisition module 310 in the above embodiment, the second acquisition module 420 and the second acquisition module 320 in the above embodiment, and the third acquisition module 430 and the third acquisition module 330 in the above embodiment, may have the same functions and structures. It should be noted that the foregoing explanation of the embodiments of the training method also applies to the training apparatus of this embodiment; the implementation principles are similar and are not repeated here.
  • In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene; the target text image to be recognized is then acquired and parsed to determine the scene to which it belongs, and the target text image is finally input into the image recognition model corresponding to that scene to obtain the text content it contains. When the basic recognition model is trained, the initial recognition model is corrected according to the difference between the predicted text content and the first labeled text content; when the image recognition model for the target scene is trained, the basic recognition model is corrected according to the difference between the predicted text content and the second labeled text content and the difference between the predicted type label and the second annotation type label, so that the generated basic recognition model and image recognition model have higher accuracy and stronger applicability, and the corresponding text content can be accurately generated from the target text image.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 5 shows a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • As shown in FIG. 5, the device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or a computer program loaded from a storage unit 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the device 500 can also be stored.
  • the computing unit 501, ROM 502, and RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to the bus 504.
  • Multiple components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard or a mouse; an output unit 507 such as various types of displays and speakers; a storage unit 508 such as a magnetic disk or an optical disk; and a communication unit 509 such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, and the like.
  • The computing unit 501 executes the methods and processes described above, such as the training method for the image recognition model. For example, in some embodiments, the training method for the image recognition model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the training method for the image recognition model described above can be performed. Alternatively, in other embodiments, the computing unit 501 may be configured in any other appropriate way (for example, by means of firmware) to execute the training method for the image recognition model.
  • In the computer program product of the embodiment of the present disclosure, when the computer program in the product is executed by a processor, the training method for the image recognition model in the above embodiments is implemented. In some embodiments, when the instructions in the computer program product are executed by a processor, the methods described above are performed.
  • Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations may include: being implemented in one or more computer programs, which can be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor, and can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as a stand-alone software package, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system can be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), the Internet, and blockchain networks.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak business scalability existing in traditional physical hosts and virtual private server (VPS) services.
  • the server can also be a server of a distributed system, or a server combined with a blockchain.
  • In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is then performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene. Therefore, when the image recognition model for the target scene is trained, by using text images of different vertical categories in scenes similar to the target scene together with text images of different vertical categories in the target scene, a single recognition model applicable to the different vertical categories of the target scene is obtained through training, which improves the recognition accuracy and versatility of the model, reduces the memory occupied by the model, and saves manpower and material resources.
  • It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved; no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure discloses a training method, apparatus, device, storage medium and computer program product for an image recognition model, relating to the field of computer technology, and specifically to the field of artificial intelligence technology such as deep learning and computer vision. The specific implementation scheme is: acquiring a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; training an initial recognition model with the first text images to obtain a basic recognition model; and performing correction training on the basic recognition model with the second text images to obtain an image recognition model corresponding to the target scene.

Description

Training method, apparatus, device and storage medium for image recognition model

TECHNICAL FIELD

The present disclosure relates to the field of computer technology, specifically to the field of artificial intelligence technology such as computer vision and deep learning, and in particular to a training method, apparatus, device, storage medium and computer program product for an image recognition model.

BACKGROUND

With the continuous development and improvement of artificial intelligence technology, it has come to play an extremely important role in various fields related to human daily life. For example, optical character recognition (OCR) technology can be used to extract text information in many scenes such as documents, books, and scanned copies, which greatly facilitates the collection and processing of information. However, for specific vertical categories in particular scenes such as certificates and bills, the amount of training data that can be obtained is limited, so the recognition accuracy of the trained OCR model is not high. How to improve the recognition accuracy for different vertical categories in a specific scene is therefore of great significance.
SUMMARY

The present disclosure provides a training method, apparatus, device, storage medium and computer program product for an image recognition model.

According to a first aspect of the present disclosure, a method for training an image recognition model is provided, including:

acquiring a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images;

training an initial recognition model with the first text images to obtain a basic recognition model; and

performing correction training on the basic recognition model with the second text images to obtain an image recognition model corresponding to the target scene.

According to a second aspect of the present disclosure, a training apparatus for an image recognition model is provided, including:

a first acquisition module configured to acquire a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images;

a second acquisition module configured to train an initial recognition model with the first text images to obtain a basic recognition model; and

a third acquisition module configured to perform correction training on the basic recognition model with the second text images to obtain an image recognition model corresponding to the target scene.

An embodiment of a third aspect of the present disclosure provides an electronic device, including:

at least one processor; and a memory communicatively connected to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method proposed in the embodiment of the first aspect of the present disclosure.

An embodiment of a fourth aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause the computer to execute the method proposed in the embodiment of the first aspect of the present disclosure.

An embodiment of a fifth aspect of the present disclosure provides a computer program product, including a computer program, where the computer program, when executed by a processor, implements the method proposed in the embodiment of the first aspect of the present disclosure.

The training method, apparatus, device, storage medium and computer program product for an image recognition model provided by the present disclosure have at least the following beneficial effect:

The training data set is acquired first; the initial recognition model is then trained with the first text images of each vertical category in the non-target scene in the training data set to obtain the basic recognition model; and correction training is then performed on the basic recognition model with the second text images of each vertical category in the target scene in the training data set to obtain the image recognition model corresponding to the target scene. As a result, the generated image recognition model can have higher recognition accuracy and stronger applicability.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand through the following description.
BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are intended to provide a better understanding of the solution and do not constitute a limitation of the present disclosure, in which:

FIG. 1 is a schematic flowchart of a method for training an image recognition model according to an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of a method for training an image recognition model according to another embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a training apparatus for an image recognition model according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a training apparatus for an image recognition model according to another embodiment of the present disclosure;

FIG. 5 is a block diagram of an electronic device for implementing the method for training an image recognition model according to an embodiment of the present disclosure.
DETAILED DESCRIPTION

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of the present disclosure to facilitate understanding; they should be regarded as merely exemplary. Therefore, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

To facilitate understanding of the present disclosure, the technical fields involved in the present disclosure are first briefly explained below.

Artificial intelligence is a discipline that studies how to use computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include several major directions such as computer vision technology, speech recognition technology, natural language processing technology, machine learning, deep learning, big data processing technology, and knowledge graph technology.

Deep learning learns the intrinsic regularities and representation levels of sample data, and the information obtained during this learning process is of great help in interpreting data such as text, images, and sounds. Its ultimate goal is to enable machines to have the same analytical learning ability as humans and to recognize data such as text, images, and sounds. Deep learning is a complex machine learning algorithm that has achieved results in speech and image recognition that far exceed those of earlier related techniques.

Computer vision is an interdisciplinary scientific field that studies how to enable computers to obtain high-level understanding from digital images or videos. From an engineering standpoint, it seeks to automate tasks that the human visual system can accomplish. Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, as well as methods for extracting high-dimensional data from the real world in order to produce numerical or symbolic information, for example in the form of decisions.

The present disclosure provides a training method for an image recognition model, which can be executed by the training apparatus for an image recognition model provided by the present disclosure, or by the electronic device provided by the present disclosure, where the electronic device may include, but is not limited to, terminal devices such as mobile phones, desktop computers, and tablet computers, and may also be a server. In the following, the training method provided by the present disclosure is described as being executed by the training apparatus for an image recognition model provided by the present disclosure, which is hereinafter referred to simply as the "apparatus" and does not constitute a limitation of the present disclosure.
The training method, apparatus, electronic device, storage medium and computer program product for an image recognition model provided by the present disclosure are described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic flowchart of a method for training an image recognition model according to an embodiment of the present disclosure.

As shown in FIG. 1, the training method for the image recognition model may include the following steps:

Step S101: acquire a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images.

The target scene may be any specified scene. It can be understood that the target scene may have certain attributes or characteristics, and each kind of text image to be recognized in the target scene may be called a vertical category.

For example, the target scene may be a traffic scene, in which case the text images of each vertical category in this scene may be text images of driving licenses, driver's licenses, vehicle certificates, and the like, which is not limited here.

Alternatively, the target scene may be a financial scene, in which case the text images of each vertical category in this scene may be text images of value-added tax invoices, machine-printed invoices, itineraries, bank checks, bank receipts, and the like, which is not limited here.

The non-target scene may be a scene that is similar to the target scene or has a certain intrinsic relationship with the target scene. For example, the text images of each vertical category in the target scene and the text images of each vertical category in the non-target scene contain the same types of text content.

For example, if the current target scene is a traffic scene, the non-target scene may be a certificate scene. It should be noted that, in the certificate scene, the text images to be recognized are usually ID cards, passports, and the like. Text images such as ID cards and passports, and text images such as driver's licenses, driving licenses, and vehicle certificates, all contain text types such as characters, dates, and certificate numbers, so the text images in the certificate scene can be used as the first text images, that is, the text images corresponding to the non-target scene, which is not limited here.

It should be noted that the first text images and the second text images included in the training data set may be images acquired by an image sensor, such as a webcam or a camera, and they may be color images or gray images, which is not limited here. In addition, data synthesis and data augmentation can also be performed on the text data in the training data set so as to enhance the diversity of the training data, which is not limited here.
Step S102: train the initial recognition model with the first text images to obtain a basic recognition model.

The initial recognition model may be an initial deep learning network model that has not undergone any training, and the basic recognition model may be the network model generated in the process of training the initial recognition model with the first text images, that is, the training data.

In some examples, the first text images, that is, the training data, can be input into the initial recognition model in batches according to preset parameters; then, according to the error function of the initial recognition model, the error between the text data extracted by the initial recognition model from a text image and the real text data corresponding to that text image can be determined; and then, based on the error, back-propagation training is performed on the initial recognition model to obtain the basic recognition model.

It should be noted that the number of first text images used for training the initial recognition model may be, for example, eight thousand or ten thousand, which is not limited here.

Optionally, in some implementations, the initial recognition model may be a network model such as a convolutional recurrent neural network (CRNN) or an attention-mechanism-based model, which is not limited here.

Step S103: perform correction training on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene.

It should be noted that, after the basic recognition model is determined, the second text images corresponding to the target scene can be used as training data to perform correction training on the basic recognition model, so as to obtain the image recognition model corresponding to the target scene.

In some examples, the second text images, that is, the training data, can be input into the basic recognition model in batches according to preset parameters; then, according to the error function of the basic recognition model, the error between the text data extracted by the basic recognition model from a text image and the real text data corresponding to that text image can be determined; and then, based on the error, back-propagation training is performed on the basic recognition model to obtain the image recognition model corresponding to the target scene.

Optionally, the training data set may also include text images from any scene, for example text images of documents, books, scanned copies, and the like, which is not limited here. When the basic recognition model is obtained through training, the text images from any scene and the first text images can be jointly used as training data. Correspondingly, when the image recognition model corresponding to the target scene is obtained through training, the text images from any scene and the second text images can be jointly used as training data.

It can be understood that, since text images in specific scenes are usually private, it is difficult to collect a sufficient amount of data for training. Text images from any scene contain a large amount of text information and can make up for the insufficient number of text images of different vertical categories in the target scene and the non-target scene. Therefore, adding text images from any scene to the training data set can increase the amount of training data and improve the basic recognition ability of the image recognition model.

In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is then performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene. Therefore, when the image recognition model for the target scene is trained, by using text images of different vertical categories in scenes similar to the target scene together with text images of different vertical categories in the target scene, a single recognition model applicable to the different vertical categories of the target scene is obtained through training, which improves the recognition accuracy and versatility of the model, reduces the memory occupied by the model, and saves manpower and material resources.
FIG. 2 is a schematic flowchart of a method for training an image recognition model according to another embodiment of the present disclosure.

As shown in FIG. 2, the training method for the image recognition model may include the following steps:

Step S201: acquire a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images.

It should be noted that, for the specific implementation process of step S201, reference may be made to the above embodiment, and details are not repeated here.

Optionally, the training data set may include the first labeled text content corresponding to the first text images, the position information of the first text boxes, and the first annotation type labels corresponding to the first labeled text content.

It should be noted that, for each collected first text image, each piece of text content can be annotated first, the position information of each text box can be determined at the same time, and a corresponding type label can be determined for the first labeled text content; the first text image is then added to the training data set. The first labeled text content may be each piece of text contained in the first text image.

For example, if the current first text image is a value-added tax invoice text image, the corresponding first labeled text content may be text information such as the buyer's name, the taxpayer identification number, the invoicing date, and the tax amount on the value-added tax invoice. The first text box may be the text box determined by each piece of first labeled text content. The first annotation type label may be the type annotated on each first text box; for example, the invoicing date can be labeled "date", the taxpayer identification number can be labeled "number", and the tax amount can be labeled "amount", which is not limited here.

Specifically, after the first text box is determined, the location of the text box can be determined, and the position information of the first text box can then be determined. For example, the coordinate information of the first text box may be used as the position information of the first text box, which is not limited here.

Step S202: acquire, according to the position information of the first text box, a first target image to be recognized from the first text image.

It should be noted that the location of the first target image to be recognized can be determined according to the position information of the first text box, and the image of the region to be recognized, that is, the first target image, can then be determined from the first text image according to that location.

In the embodiment of the present disclosure, by determining the position information of the text box and then determining the target image to be recognized from the text image according to the position information, recognition of blank regions can be avoided and the training efficiency of the recognition model can be improved.

Step S203: input the first target image into the initial recognition model to obtain the predicted text content output by the initial recognition model.

Optionally, the first target image may be input into the initial recognition model to obtain the predicted text content and the predicted type label output by the initial recognition model. During training, target images can also be continuously added for training.

Step S204: correct the initial recognition model according to the difference between the predicted text content and the first labeled text content, so as to obtain the basic recognition model.

The distance between each pixel in the predicted text content and the corresponding pixel in the first labeled text content can be determined first, and the difference between the predicted text content and the first labeled text content can then be characterized by the distances between the corresponding pixels.

For example, the Euclidean distance formula can be used to determine the distances between corresponding pixels of the predicted text content and the first labeled text content, or the Manhattan distance formula can be used to calculate those distances; a correction gradient is then determined from them and used to correct the initial recognition model, which is not limited here. A sketch of the two distance measures follows.
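The two distance measures named above can be written down directly; treating the predicted and labeled content as aligned pixel arrays is an assumption about how the contents are compared, since the disclosure does not fix a rasterization.

```python
# Hedged sketch: Euclidean and Manhattan distances between corresponding pixels
# of the predicted text content and the labeled text content.
import numpy as np

def euclidean_distance(pred: np.ndarray, label: np.ndarray) -> float:
    diff = pred.astype(float) - label.astype(float)
    return float(np.sqrt(np.sum(diff ** 2)))

def manhattan_distance(pred: np.ndarray, label: np.ndarray) -> float:
    diff = pred.astype(float) - label.astype(float)
    return float(np.sum(np.abs(diff)))
```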
Optionally, the initial recognition model may also be corrected according to the difference between the predicted text content and the first labeled text content and the difference between the predicted type label and the first annotation type label, so as to obtain the basic recognition model.

For example, the initial recognition model may first be corrected according to the difference between the predicted text content and the first labeled text content, and then corrected according to the difference between the predicted type label and the first annotation type label.

Alternatively, the initial recognition model may first be corrected according to the difference between the predicted type label and the first annotation type label, and then corrected according to the difference between the predicted text content and the first labeled text content.

Alternatively, the initial recognition model may be corrected according to the difference between the predicted text content and the first labeled text content and the difference between the predicted type label and the first annotation type label at the same time, so as to obtain the basic recognition model.

In the embodiment of the present disclosure, by training the recognition model to output the predicted text content and the predicted type label at the same time, the recognition model can automatically annotate the information type of the recognized text when it is used, thereby facilitating further processing of the information.

Optionally, the training data set may further include the second labeled text content corresponding to the second text images, the position information of the second text boxes, and the second annotation type labels corresponding to the second labeled text content.

It should be noted that, for specific examples of the second labeled text content, the position information of the second text box, and the second annotation type label, reference may be made to the above first labeled text content, position information of the first text box, and first annotation type label corresponding to the first labeled text content, and details are not repeated here.

Step S205: acquire, according to the position information of the second text box, a second target image to be recognized from the second text image.

It should be noted that the location of the second target image to be recognized can be determined according to the position information of the second text box, and the image of the region to be recognized, that is, the second target image, can then be determined from the second text image according to that location.

Step S206: input the second target image into the basic recognition model to obtain the predicted text content and the predicted type label output by the basic recognition model.

Step S207: correct the basic recognition model according to the difference between the predicted text content and the second labeled text content and the difference between the predicted type label and the second annotation type label, so as to obtain the image recognition model corresponding to the target scene.

It should be noted that, for the specific implementation process of steps S205, S206, and S207, reference may be made to the above steps S202, S203, and S204, and details are not repeated here.

Step S208: acquire a target text image to be recognized.

It should be noted that the target text image, that is, the specified image to be recognized, can be any text image, such as a certificate or a bill, which is not limited here.

It should be noted that the target text image may be an image acquired by any image sensor, such as a webcam or a camera, and it may be a color image or a gray image, which is not limited here.

Step S209: parse the target text image to determine the scene to which the target text image belongs.

In the implementation of the present disclosure, the obtained target text image can be parsed to determine the scene corresponding to the target text image. For example, if the current target text image is a driver's license text image, it can be determined that the current target text image belongs to a traffic scene; if the current target text image is a value-added tax invoice image, it can be determined that the target text image belongs to a financial scene, which is not limited here.

Step S210: input the target text image into the image recognition model corresponding to the scene to which it belongs, so as to obtain the text content contained in the target text image.

After the scene to which the target text image belongs is determined, the image recognition model corresponding to that scene can be determined; the target text image can then be input into the image recognition model corresponding to that scene, so that the text content corresponding to the target text image can be output.

For example, if the target text image is a driver's license, it can be input into the image recognition model for the traffic scene.

Alternatively, if the target text image is a value-added tax invoice, it can be input into the image recognition model for the financial scene.

In the embodiment of the present disclosure, by determining the scene to which the target text image belongs and then using the image recognition model corresponding to that scene to recognize the target text image, the reliability and accuracy of image recognition are improved.

In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene; the target text image to be recognized is then acquired and parsed to determine the scene to which it belongs, and the target text image is finally input into the image recognition model corresponding to that scene to obtain the text content it contains. When the basic recognition model is trained, the initial recognition model is corrected according to the difference between the predicted text content and the first labeled text content; when the image recognition model for the target scene is trained, the basic recognition model is corrected according to the difference between the predicted text content and the second labeled text content and the difference between the predicted type label and the second annotation type label, so that the generated image recognition model has higher accuracy and stronger applicability and can accurately generate the corresponding text content from the target text image.
According to the embodiments of the present disclosure, the present disclosure also provides a training apparatus for an image recognition model.

FIG. 3 is a schematic structural diagram of a training apparatus for an image recognition model according to an embodiment of the present disclosure. As shown in FIG. 3, the training apparatus 300 for the image recognition model may include: a first acquisition module 310, a second acquisition module 320, and a third acquisition module 330.

The first acquisition module 310 is configured to acquire a training data set, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images.

The second acquisition module 320 is configured to train an initial recognition model with the first text images to obtain a basic recognition model.

The third acquisition module 330 is configured to perform correction training on the basic recognition model with the second text images to obtain an image recognition model corresponding to the target scene.

In a possible implementation of the embodiment of the present disclosure, the training data set further includes text images from any scene.

In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is then performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene. Therefore, when the image recognition model for the target scene is trained, by using text images of different vertical categories in scenes similar to the target scene together with text images of different vertical categories in the target scene, a single recognition model applicable to the different vertical categories of the target scene is obtained through training, which improves the recognition accuracy and versatility of the model, reduces the memory occupied by the model, and saves manpower and material resources.

FIG. 4 is a schematic structural diagram of a training apparatus for an image recognition model according to another embodiment of the present disclosure. As shown in FIG. 4, the training apparatus 400 for the image recognition model may include: a first acquisition module 410, a second acquisition module 420, and a third acquisition module 430.

In a possible implementation of the embodiment of the present disclosure, the training data set further includes the first labeled text content corresponding to the first text images and the position information of the first text boxes.

The second acquisition module 420 may include:

a first acquisition unit 421 configured to acquire, according to the position information of the first text box, a target image to be recognized from the first text image;

a second acquisition unit 422 configured to input the target image into the initial recognition model to obtain the predicted text content output by the initial recognition model; and

a third acquisition unit 423 configured to correct the initial recognition model according to the difference between the predicted text content and the first labeled text content, so as to obtain the basic recognition model.

In a possible implementation of the embodiment of the present disclosure, the training data set further includes the first annotation type label corresponding to the first labeled text content.

The second acquisition unit 422 is specifically configured to input the target image into the initial recognition model to obtain the predicted text content and the predicted type label output by the initial recognition model; and

the third acquisition unit 423 is specifically configured to correct the initial recognition model according to the difference between the predicted text content and the first labeled text content and the difference between the predicted type label and the first annotation type label, so as to obtain the basic recognition model.

In a possible implementation of the embodiment of the present disclosure, the training data set further includes the second labeled text content corresponding to the second text images, the position information of the second text boxes, and the second annotation type labels corresponding to the second labeled text content.

The third acquisition module 430 may include:

a fourth acquisition unit 431 configured to acquire, according to the position information of the second text box, a second target image to be recognized from the second text image;

a fifth acquisition unit 432 configured to input the second target image into the basic recognition model to obtain the predicted text content and the predicted type label output by the basic recognition model; and

a sixth acquisition unit 433 configured to correct the basic recognition model according to the difference between the predicted text content and the second labeled text content and the difference between the predicted type label and the second annotation type label, so as to obtain the image recognition model corresponding to the target scene.

In a possible implementation of the embodiment of the present disclosure, the training apparatus may further include a fourth acquisition module 440, a first determination module 450, and a fifth acquisition module 460.

The fourth acquisition module 440 is configured to acquire a target text image to be recognized.

The first determination module 450 is configured to parse the target text image to determine the scene to which the target text image belongs.

The fifth acquisition module 460 is configured to input the target text image into the image recognition model corresponding to the scene to which it belongs, so as to obtain the text content contained in the target text image.

It can be understood that the training apparatus 400 for the image recognition model in FIG. 4 of the embodiment of the present disclosure and the training apparatus 300 in the above embodiment, the first acquisition module 410 and the first acquisition module 310 in the above embodiment, the second acquisition module 420 and the second acquisition module 320 in the above embodiment, and the third acquisition module 430 and the third acquisition module 330 in the above embodiment, may have the same functions and structures.

It should be noted that the foregoing explanation of the embodiments of the method for training an image recognition model also applies to the training apparatus for an image recognition model of this embodiment; the implementation principles are similar and are not repeated here.

In the embodiment of the present disclosure, the training data set is acquired first, wherein the training data set includes first text images of each vertical category in a non-target scene and second text images of each vertical category in a target scene, and the type of the text content contained in the first text images is the same as the type of the text content contained in the second text images; the initial recognition model is then trained with the first text images to obtain the basic recognition model, and correction training is performed on the basic recognition model with the second text images to obtain the image recognition model corresponding to the target scene; the target text image to be recognized is then acquired and parsed to determine the scene to which it belongs, and the target text image is finally input into the image recognition model corresponding to that scene to obtain the text content it contains. When the basic recognition model is trained, the initial recognition model is corrected according to the difference between the predicted text content and the first labeled text content; when the image recognition model for the target scene is trained, the basic recognition model is corrected according to the difference between the predicted text content and the second labeled text content and the difference between the predicted type label and the second annotation type label, so that the generated basic recognition model and image recognition model have higher accuracy and stronger applicability, and the corresponding text content can be accurately generated from the target text image.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium and a computer program product.
FIG. 5 shows a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 5, the device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the device 500. The computing unit 501, the ROM 502 and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A plurality of components in the device 500 are connected to the I/O interface 505, including: an input unit 506, such as a keyboard or a mouse; an output unit 507, such as various types of displays and speakers; a storage unit 508, such as a magnetic disk or an optical disc; and a communication unit 509, such as a network card, a modem or a wireless communication transceiver. The communication unit 509 allows the device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any appropriate processors, controllers, microcontrollers and the like. The computing unit 501 performs the methods and processes described above, such as the method for training an image recognition model. For example, in some embodiments, the method for training an image recognition model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded into and/or installed on the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method for training an image recognition model described above can be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method for training an image recognition model in any other suitable manner (for example, by means of firmware).
In a computer program product according to embodiments of the present disclosure, the computer program in the product, when executed by a processor, implements the method for training an image recognition model of the foregoing embodiments. In some embodiments, the above method is performed when instructions in the computer program product are executed by a processor.
Various implementations of the systems and techniques described herein above can be realized in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input apparatus and at least one output apparatus.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display apparatus (for example, a CRT (cathode-ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses can also be used to provide interaction with a user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback), and input from the user can be received in any form, including acoustic input, speech input or tactile input.
The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system can be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include local area networks (LANs), wide area networks (WANs), the Internet and blockchain networks.
A computer system can include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, a host product in the cloud computing service system that overcomes the drawbacks of difficult management and weak business scalability found in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
In embodiments of the present disclosure, a training data set is first acquired, where the training data set includes first text images of respective vertical categories in non-target scenes and second text images of respective vertical categories in a target scene, and the type of text content contained in the first text images is the same as the type of text content contained in the second text images; an initial recognition model is then trained with the first text images to obtain a base recognition model, after which the base recognition model is subjected to correction training with the second text images to obtain the image recognition model corresponding to the target scene. Thus, when the image recognition model for the target scene is trained, text images of different vertical categories from scenes similar to the target scene, together with text images of different vertical categories in the target scene itself, are used to train a single recognition model applicable to the different vertical categories of the target scene, which improves the recognition accuracy and generality of the model, reduces the memory occupied by the model, and saves manpower and material costs.
It should be understood that steps can be reordered, added or removed using the various forms of flows shown above. For example, the steps described in the present disclosure can be performed in parallel, sequentially or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The specific implementations described above do not limit the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (15)

  1. A method for training an image recognition model, comprising:
    acquiring a training data set, wherein the training data set includes first text images of respective vertical categories in non-target scenes and second text images of respective vertical categories in a target scene, and the type of text content contained in the first text images is the same as the type of text content contained in the second text images;
    training an initial recognition model with the first text images to obtain a base recognition model; and
    performing correction training on the base recognition model with the second text images to obtain an image recognition model corresponding to the target scene.
  2. The method according to claim 1, wherein the training data set further includes text images from arbitrary scenes.
  3. The method according to claim 1 or 2, wherein the training data set further includes first labeled text content corresponding to the first text images and position information of a first text box, and said training an initial recognition model with the first text images to obtain a base recognition model comprises:
    acquiring, according to the position information of the first text box, a first target image to be recognized from the first text image;
    inputting the first target image into the initial recognition model to obtain predicted text content output by the initial recognition model; and
    correcting the initial recognition model according to a difference between the predicted text content and the first labeled text content, to obtain the base recognition model.
  4. The method according to claim 3, wherein the training data set further includes a first labeled type label corresponding to the first labeled text content, and said inputting the first target image into the initial recognition model to obtain predicted text content output by the initial recognition model comprises:
    inputting the first target image into the initial recognition model to obtain predicted text content and a predicted type label output by the initial recognition model;
    and said correcting the initial recognition model according to a difference between the predicted text content and the first labeled text content, to obtain the base recognition model, comprises:
    correcting the initial recognition model according to the difference between the predicted text content and the first labeled text content and a difference between the predicted type label and the first labeled type label, to obtain the base recognition model.
  5. The method according to any one of claims 1 to 4, wherein the training data set further includes second labeled text content corresponding to the second text images, position information of a second text box, and a second labeled type label corresponding to the second labeled text content, and said performing correction training on the base recognition model with the second text images to obtain an image recognition model corresponding to the target scene comprises:
    acquiring, according to the position information of the second text box, a second target image to be recognized from the second text image;
    inputting the second target image into the base recognition model to obtain predicted text content and a predicted type label output by the base recognition model; and
    correcting the base recognition model according to a difference between the predicted text content and the second labeled text content and a difference between the predicted type label and the second labeled type label, to obtain the image recognition model corresponding to the target scene.
  6. The method according to claim 5, further comprising:
    acquiring a target text image to be recognized;
    parsing the target text image to determine a scene to which the target text image belongs; and
    inputting the target text image into the image recognition model corresponding to the scene to which it belongs, to obtain text content contained in the target text image.
  7. An apparatus for training an image recognition model, comprising:
    a first acquisition module, configured to acquire a training data set, wherein the training data set includes first text images of respective vertical categories in non-target scenes and second text images of respective vertical categories in a target scene, and the type of text content contained in the first text images is the same as the type of text content contained in the second text images;
    a second acquisition module, configured to train an initial recognition model with the first text images to obtain a base recognition model; and
    a third acquisition module, configured to perform correction training on the base recognition model with the second text images to obtain an image recognition model corresponding to the target scene.
  8. The apparatus according to claim 7, wherein the training data set further includes text images from arbitrary scenes.
  9. The apparatus according to claim 7 or 8, wherein the training data set further includes first labeled text content corresponding to the first text images and position information of a first text box, and the second acquisition module comprises:
    a first acquisition unit, configured to acquire, according to the position information of the first text box, a target image to be recognized from the first text image;
    a second acquisition unit, configured to input the target image into the initial recognition model to obtain predicted text content output by the initial recognition model; and
    a third acquisition unit, configured to correct the initial recognition model according to a difference between the predicted text content and the first labeled text content, to obtain the base recognition model.
  10. The apparatus according to claim 9, wherein the training data set further includes a first labeled type label corresponding to the first labeled text content,
    the second acquisition unit is configured to input the target image into the initial recognition model to obtain predicted text content and a predicted type label output by the initial recognition model; and
    the third acquisition unit is configured to correct the initial recognition model according to the difference between the predicted text content and the first labeled text content and a difference between the predicted type label and the first labeled type label, to obtain the base recognition model.
  11. The apparatus according to any one of claims 7 to 10, wherein the training data set further includes second labeled text content corresponding to the second text images, position information of a second text box, and a second labeled type label corresponding to the second labeled text content, and the third acquisition module comprises:
    a fourth acquisition unit, configured to acquire, according to the position information of the second text box, a second target image to be recognized from the second text image;
    a fifth acquisition unit, configured to input the second target image into the base recognition model to obtain predicted text content and a predicted type label output by the base recognition model; and
    a sixth acquisition unit, configured to correct the base recognition model according to a difference between the predicted text content and the second labeled text content and a difference between the predicted type label and the second labeled type label, to obtain the image recognition model corresponding to the target scene.
  12. The apparatus according to claim 11, further comprising:
    a fourth acquisition module, configured to acquire a target text image to be recognized;
    a first determination module, configured to parse the target text image to determine a scene to which the target text image belongs; and
    a fifth acquisition module, configured to input the target text image into the image recognition model corresponding to the scene to which it belongs, to obtain text content contained in the target text image.
  13. An electronic device, comprising:
    at least one processor; and a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 6.
  14. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method according to any one of claims 1 to 6.
  15. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.
PCT/CN2022/085915 2021-08-13 2022-04-08 Method, apparatus, device and storage medium for training an image recognition model WO2023015922A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/905,965 US20230401828A1 (en) 2021-08-13 2022-04-08 Method for training image recognition model, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110934322.3A 2021-08-13 Method, apparatus, device and storage medium for training an image recognition model
CN202110934322.3 2021-08-13

Publications (1)

Publication Number Publication Date
WO2023015922A1 true WO2023015922A1 (zh) 2023-02-16

Family

ID=78652707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/085915 WO2023015922A1 (zh) Method, apparatus, device and storage medium for training an image recognition model 2021-08-13 2022-04-08

Country Status (3)

Country Link
US (1) US20230401828A1 (zh)
CN (1) CN113705554A (zh)
WO (1) WO2023015922A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705554A (zh) 2021-08-13 2021-11-26 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for training an image recognition model
CN114359903B (zh) * 2022-01-06 2023-04-07 北京百度网讯科技有限公司 Text recognition method, apparatus, device and storage medium
CN114428677B (zh) * 2022-01-28 2023-09-12 北京百度网讯科技有限公司 Task processing method, processing apparatus, electronic device and storage medium
CN114677691B (zh) * 2022-04-06 2023-10-03 北京百度网讯科技有限公司 Text recognition method, apparatus, electronic device and storage medium
CN114550143A (zh) * 2022-04-28 2022-05-27 新石器慧通(北京)科技有限公司 Scene recognition method and apparatus for a driving unmanned vehicle
CN114973279B (zh) * 2022-06-17 2023-02-17 北京百度网讯科技有限公司 Training method, apparatus and storage medium for a handwritten-text image generation model
CN115035510B (zh) * 2022-08-11 2022-11-15 深圳前海环融联易信息科技服务有限公司 Text recognition model training method, text recognition method, device and medium
CN116070711B (zh) * 2022-10-25 2023-11-10 北京百度网讯科技有限公司 Data processing method, apparatus, electronic device and storage medium
CN115658903B (zh) * 2022-11-01 2023-09-05 百度在线网络技术(北京)有限公司 Text classification method, model training method, related apparatus and electronic device
CN117132790B (zh) * 2023-10-23 2024-02-02 南方医科大学南方医院 Artificial-intelligence-based auxiliary diagnosis system for digestive tract tumors

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472296A (zh) * 2018-10-17 2019-03-15 阿里巴巴集团控股有限公司 Model training method and apparatus based on gradient boosting decision trees
US20200342339A1 (en) * 2019-04-24 2020-10-29 International Business Machines Corporation Cognitive Data Preparation for Deep Learning Model Training
CN112183307A (zh) * 2020-09-25 2021-01-05 上海眼控科技股份有限公司 Text recognition method, computer device and storage medium
CN113159212A (zh) * 2021-04-30 2021-07-23 上海云从企业发展有限公司 OCR recognition model training method, apparatus and computer-readable storage medium
CN113705554A (zh) * 2021-08-13 2021-11-26 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for training an image recognition model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275038A (zh) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method, apparatus, computer device and computer storage medium
CN111652232B (zh) * 2020-05-29 2023-08-22 泰康保险集团股份有限公司 Bill recognition method and apparatus, electronic device and computer-readable storage medium
CN112784751A (zh) * 2021-01-22 2021-05-11 北京百度网讯科技有限公司 Method, apparatus, device and medium for training an image recognition model
CN113239967A (zh) * 2021-04-14 2021-08-10 北京达佳互联信息技术有限公司 Character recognition model training method, recognition method, related device and storage medium

Also Published As

Publication number Publication date
CN113705554A (zh) 2021-11-26
US20230401828A1 (en) 2023-12-14

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase; Ref document number: 17905965; Country of ref document: US
NENP Non-entry into the national phase; Ref country code: DE