WO2023016163A1 - Training method for a text recognition model, method and apparatus for recognizing text - Google Patents

Training method for a text recognition model, method and apparatus for recognizing text

Info

Publication number
WO2023016163A1
WO2023016163A1 · PCT/CN2022/104891 · CN2022104891W
Authority
WO
WIPO (PCT)
Prior art keywords
language
content
sample
picture
label
Prior art date
Application number
PCT/CN2022/104891
Other languages
English (en)
French (fr)
Inventor
王晓燕
吕鹏原
范森
章成全
姚锟
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2023016163A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • The present disclosure relates to the technical field of artificial intelligence, specifically to the technical fields of computer vision and deep learning, and can be applied to scenarios such as optical character recognition (OCR).
  • In daily life, many files such as documents, pictures, and videos contain text in multiple languages.
  • For example, besides Chinese, such files may include English, Spanish, Portuguese, Russian, Polish, and other languages. Recognizing the multilingual text content in a file and outputting the corresponding language category are prerequisites for extracting and translating the text information of each language. This recognition process is of great significance for information review, cultural transmission, business communication, and so on.
  • The present disclosure provides a training method for a text recognition model, a method for recognizing text, an apparatus, a device, a storage medium, and a program product.
  • A method for training a text recognition model is provided, including: determining a plurality of first sample pictures, and content labels and language labels of the plurality of first sample pictures, according to a plurality of monolingual corpora; determining a plurality of second sample pictures, and content labels and language labels of the plurality of second sample pictures, according to a plurality of mixed-language corpora; and training the text recognition model according to the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
  • A method for recognizing text is provided, including: acquiring a picture to be recognized that contains text information; and inputting the picture to be recognized into a text recognition model to obtain a content recognition result and a language recognition result for the picture, where the content recognition result represents the text information contained in the picture to be recognized and the language recognition result represents the language corresponding to that text information. The text recognition model is trained according to the method of the embodiments of the present disclosure.
  • A training apparatus for a text recognition model is provided, including: a first determination module configured to determine a plurality of first sample pictures, and content labels and language labels of the plurality of first sample pictures, according to a plurality of monolingual corpora; a second determination module configured to determine a plurality of second sample pictures, and content labels and language labels of the plurality of second sample pictures, according to a plurality of mixed-language corpora; and a training module configured to train the text recognition model according to the plurality of first sample pictures, their content labels and language labels, the plurality of second sample pictures, and their content labels and language labels.
  • An apparatus for recognizing text is provided, including: an acquisition module configured to acquire a picture to be recognized that contains text information; and an input module configured to input the picture to be recognized into the text recognition model to obtain a content recognition result and a language recognition result for the picture, where the content recognition result represents the text information contained in the picture to be recognized and the language recognition result represents the language corresponding to that text information. The text recognition model is trained by the apparatus of the embodiments of the present disclosure.
  • Another aspect of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method shown in the embodiments of the present disclosure.
  • A non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to execute the method shown in the embodiments of the present disclosure.
  • A computer program product including a computer program is provided; the computer program, when executed by a processor, implements the method shown in the embodiments of the present disclosure.
  • FIG. 1 is a flowchart of a method for training a text recognition model according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of a text recognition model according to an embodiment of the present disclosure;
  • FIG. 3 is a flowchart of a method for training a text recognition model according to an embodiment of the present disclosure;
  • FIG. 4 is a schematic diagram of a method for training a text recognition model according to an embodiment of the present disclosure;
  • FIG. 5 is a flowchart of a method for recognizing text according to an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of a method for recognizing text according to an embodiment of the present disclosure;
  • FIG. 7 is a block diagram of a training apparatus for a text recognition model according to an embodiment of the present disclosure;
  • FIG. 8 is a block diagram of an apparatus for recognizing text according to an embodiment of the present disclosure;
  • FIG. 9 is a block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
  • FIG. 1 is a flowchart of a method for training a character recognition model according to an embodiment of the disclosure.
  • The method 100 includes: in operation S110, determining a plurality of first sample pictures, and content labels and language labels of the plurality of first sample pictures, according to a plurality of monolingual corpora.
  • A plurality of second sample pictures, and content labels and language labels of the plurality of second sample pictures, are determined according to the plurality of mixed-language corpora.
  • the character recognition model may be used to determine the content recognition result and the language recognition result of the input image.
  • the content recognition result may be used to represent the text information contained in the input picture
  • the language recognition result may be used to represent the language corresponding to the text information.
  • the trained character recognition model can automatically output the language corresponding to the characters while recognizing the characters contained in the picture.
  • Text corpora in various languages can be collected, and a large number of pictures containing text can be synthesized from these corpora for model training.
  • Characters of unwanted languages may be filtered out of the text according to a predetermined language character set (also called a dictionary).
  • Each line of the filtered text is then taken as one corpus.
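As an illustrative sketch of this cleaning step (the character set and function name are hypothetical, not from the patent), each raw line can be checked against the predetermined character set and kept only if all of its characters belong to it:

```python
# Hypothetical sketch of the corpus-cleaning step: lines whose characters all
# belong to the predetermined language character set ("dictionary") survive,
# and each surviving line becomes one corpus entry.
RUSSIAN_CHARSET = set("абвгдежзийклмнопрстуфхцчшщъыьэюя ")

def filter_corpus(raw_text: str, charset: set) -> list:
    """Keep only the lines whose characters are all in the given charset."""
    corpora = []
    for line in raw_text.lower().splitlines():
        line = line.strip()
        if line and all(ch in charset for ch in line):
            corpora.append(line)  # each clean line is taken as one corpus
    return corpora

print(filter_corpus("привет мир\nhello world", RUSSIAN_CHARSET))
```

Here the Latin-script line is dropped because its characters fall outside the Russian character set, leaving only the Cyrillic line as a corpus.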
  • For each monolingual corpus, a picture containing the monolingual corpus may be generated as a first sample picture. The content label of the first sample picture is then determined according to the text content of the monolingual corpus, and its language label is determined according to the language of the monolingual corpus.
  • Original corpora in multiple languages can be mixed and concatenated into one text to obtain multiple mixed-language corpora. Then, for each of the multiple mixed-language corpora, a picture containing the mixed-language corpus is generated as a second sample picture. The content label of the second sample picture is determined according to the text content of the mixed-language corpus, and its language label is determined according to the language of the mixed-language corpus.
  • The language label of a mixed-language corpus may be the language with the largest number of words in that corpus. When multiple languages are tied for the largest number of words, any one of them may be chosen as the language label.
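A minimal sketch of building a mixed-language corpus and picking its language label might look as follows; character counts stand in for the word counts described above, and all names are illustrative, not the patent's implementation:

```python
from collections import Counter

def make_mixed_corpus(parts: dict) -> tuple:
    """parts maps language -> text fragment.

    The fragments are spliced into one text; the label is the language that
    contributes the most characters (a stand-in for "largest number of
    words"), and ties may be broken arbitrarily.
    """
    text = " ".join(parts.values())
    counts = Counter({lang: len(frag) for lang, frag in parts.items()})
    label = counts.most_common(1)[0][0]  # majority language; ties arbitrary
    return text, label

text, label = make_mixed_corpus({"en": "hello", "ru": "привет мир"})
# "привет мир" contributes more characters than "hello", so label is "ru"
```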
  • The sizes of the pictures input into the text recognition model may vary, which can affect recognition accuracy.
  • the size of the picture can be adjusted to a preset range.
  • The vertical height of the picture may be adjusted to be between 32 pixels and 48 pixels, and correspondingly, the horizontal width of the picture may be scaled proportionally so that the picture's original aspect ratio is preserved.
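The resizing rule above can be sketched as a small helper; the function name and the rounding choice are assumptions:

```python
def normalize_size(width: int, height: int,
                   min_h: int = 32, max_h: int = 48) -> tuple:
    """Clamp the height into [min_h, max_h] and scale the width by the
    same factor so the original aspect ratio is preserved."""
    if min_h <= height <= max_h:
        return width, height  # already in the preset range, leave untouched
    target_h = min_h if height < min_h else max_h
    scale = target_h / height
    return round(width * scale), target_h

print(normalize_size(200, 64))  # (150, 48): height clamped to 48, width scaled
```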
  • FIG. 2 is a schematic diagram of a character recognition model according to an embodiment of the disclosure.
  • The text recognition model may include a first convolutional neural network (CNN) 210, a recurrent neural network (RNN) 220, a Connectionist Temporal Classification (CTC) network 230, and a second convolutional neural network 240.
  • the first convolutional neural network 210 may be used to perform feature extraction on the picture 21 input to the character recognition model to obtain the feature vector 22 of the picture.
  • the features in this feature vector 22 are ordered by time step.
  • The recurrent neural network 220 can be used to further extract sequence features from the feature vector 22 extracted by the first convolutional neural network 210.
  • The CTC network 230 can be used to determine the content recognition result 23 for the picture according to the sequence features extracted by the recurrent neural network.
  • An N-gram (multivariate) feature vector 24 can be determined from the feature vector 22.
  • The second convolutional neural network 240 can be used to determine a language recognition result 25 from the N-gram feature vector 24.
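The N-gram feature construction can be illustrated with plain Python lists standing in for tensors; a real model would slide this window over the CNN feature map, and the helper name is hypothetical:

```python
def ngram_features(features: list, n: int = 2) -> list:
    """Concatenate each window of n adjacent per-time-step feature vectors,
    exposing short-range context between neighboring characters."""
    return [sum(features[i:i + n], []) for i in range(len(features) - n + 1)]

# Three time steps with 2-dimensional features -> two bigram features.
print(ngram_features([[1, 2], [3, 4], [5, 6]]))  # [[1, 2, 3, 4], [3, 4, 5, 6]]
```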
  • The text recognition model according to the embodiments of the present disclosure uses a small number of sub-models, thereby reducing computing resources and simplifying the system pipeline.
  • Fig. 3 is a flowchart of a method for training a character recognition model according to an embodiment of the disclosure.
  • the method 330 includes, in operation S331 , acquiring a sample picture among a plurality of first sample pictures and a plurality of second sample pictures.
  • a text recognition model is used to determine a content recognition result and a language recognition result of the sample picture.
  • a first loss is determined according to the content recognition result and the content label of the sample picture, and a second loss is determined according to the language recognition result and the language label of the sample picture.
  • A loss between the content recognition result and the content label of the sample picture may be determined according to a first loss function; this is the first loss.
  • A loss between the language recognition result and the language label of the sample picture may be determined according to a second loss function; this is the second loss. It should be noted that the first loss function and the second loss function may be the same or different.
  • a total loss is determined based on the first loss and the second loss.
  • the first loss and the second loss may be weighted and added to obtain the total loss.
  • the weights of the first loss and the second loss may be determined according to actual needs.
  • the weight of the second loss may be lower than the weight of the first loss.
  • In operation S336, another sample picture among the plurality of first sample pictures and the plurality of second sample pictures is acquired, and the method jumps back to operation S332, so as to use the text recognition model to determine the content recognition result and the language recognition result for that sample picture.
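The training loop of operations S331-S336 can be sketched as follows. The model, loss function, and update routine are stand-in stubs, and the loss weights are illustrative assumptions (the text only states that the second loss's weight may be lower than the first's):

```python
W_CONTENT, W_LANGUAGE = 1.0, 0.5  # assumed weights; language loss weighted lower

def total_loss(first_loss: float, second_loss: float) -> float:
    """Weighted sum of the content (first) and language (second) losses."""
    return W_CONTENT * first_loss + W_LANGUAGE * second_loss

def train(samples, model, loss_fn, update):
    """One pass over the sample pictures (operations S331-S336)."""
    for picture, content_label, language_label in samples:
        content_pred, language_pred = model(picture)      # S332
        first = loss_fn(content_pred, content_label)      # first loss
        second = loss_fn(language_pred, language_label)   # second loss
        update(total_loss(first, second))                 # adjust parameters
```

With zero/one stub losses, for example, a sample whose content is wrong and whose language is also wrong contributes 1.0 + 0.5 = 1.5 to the parameter update, while a fully correct sample contributes 0.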
  • Fig. 4 schematically shows a schematic diagram of a method for training a character recognition model according to an embodiment of the present disclosure.
  • The first convolutional neural network 410 can be used to determine the feature vector 42 of the sample picture 41. Based on the feature vector 42, text recognition and language classification are then performed in two separate branches. In the text recognition branch, the recurrent neural network 420 determines sequence features from the feature vector 42, and the CTC network 430 determines the content recognition result 43 from the sequence features. In the language classification branch, the N-gram feature vector 44 is determined from the feature vector 42, and the second convolutional neural network 440 determines the language recognition result 45 from the N-gram feature vector.
  • the first loss 46 can be determined according to the content recognition result 43 and the content label of the sample picture 41
  • the second loss 47 can be determined according to the language recognition result 45 and the language label of the sample picture 41 .
  • A total loss 48 is determined from the first loss 46 and the second loss 47. The parameters of the text recognition model are adjusted according to the total loss 48; that is, error backpropagation is performed.
  • The two branches for multilingual text recognition and language classification share the underlying feature vector, and forward computation and error backpropagation are performed simultaneously for both.
  • The complementary learning of the two tasks can improve the generalization effect.
  • The language category helps to distinguish visually similar characters across scripts and thereby improves recognition accuracy, for example the English letter "n" and its near-identical Cyrillic counterpart; conversely, characters unique to a script also help classify the language category, for example letters that appear only in Russian, Ukrainian, and related languages.
  • By extracting N-gram feature vectors from the picture's convolutional feature vector, the text recognition model according to the embodiments of the present disclosure exploits the semantic correlation between adjacent characters, which can further improve language classification accuracy.
  • Fig. 5 schematically shows a flowchart of a method for recognizing characters according to an embodiment of the present disclosure.
  • the method includes, in operation S510, acquiring a picture to be recognized including text information.
  • the character recognition model can be obtained by training, for example, according to the training method of the character recognition model shown above.
  • the output of the character recognition model may include content recognition results and language recognition results.
  • the content recognition result may be used to represent the text information contained in the picture to be recognized, and the language recognition result may be used to represent the language corresponding to the text information.
  • Fig. 6 schematically shows a schematic diagram of a method for recognizing characters according to an embodiment of the present disclosure.
  • The text recognition model may include a first convolutional neural network (CNN), a recurrent neural network, a CTC network, and a second convolutional neural network.
  • the first convolutional neural network 610 can be used to determine the feature vector 62 of the picture to be recognized 61 .
  • The recurrent neural network 620 can be used to determine sequence features from the feature vector 62.
  • The CTC network 630 can be used to determine the content recognition result 63 for the picture to be recognized 61 according to the sequence features.
  • the N-gram feature vector 64 can be determined according to the feature vector 62
  • the language recognition result 65 for the picture to be recognized 61 can be determined according to the N-gram feature vector 64 using the second convolutional neural network 640 .
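The text above does not spell out how the CTC output becomes the final content recognition result, but the standard greedy CTC decoding step (take the best label per time step, collapse consecutive repeats, drop blanks) can be sketched as follows; the blank symbol and function name are illustrative:

```python
BLANK = "-"  # stand-in for the CTC blank symbol

def ctc_greedy_decode(step_labels: list) -> str:
    """Collapse consecutive repeated labels, then remove blanks."""
    out, prev = [], None
    for label in step_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_greedy_decode(list("hh-e-ll-llo-")))  # -> hello
```

Note that the blank between the two "ll" runs is what lets the decoder emit the double "l" in "hello" instead of collapsing all four into one.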
  • Fig. 7 schematically shows a block diagram of a training device for a character recognition model according to an embodiment of the present disclosure.
  • the text recognition model training device 700 may include a first determination module 710 , a second determination module 720 and a training module 730 .
  • the first determination module 710 may be configured to determine a plurality of first sample pictures and content tags and language tags of the plurality of first sample pictures according to a plurality of monolingual corpora.
  • the second determination module 720 may be configured to determine a plurality of second sample pictures and content tags and language tags of the plurality of second sample pictures according to a plurality of mixed-language corpora.
  • the training module 730 may be configured to, according to the multiple first sample pictures, the content labels and language labels of the multiple first sample pictures, the multiple second sample pictures, and the content labels and language labels of the multiple second sample pictures, Train the text recognition model.
  • the first determining module may include a first generating submodule, a first content label determining submodule, and a first language label determining submodule.
  • the first generation sub-module may be used for generating a picture containing a monolingual corpus as a first sample picture for each monolingual corpus in a plurality of monolingual corpora.
  • the first content label determining submodule can be used to determine the content label of the first sample picture according to the text content of the monolingual corpus.
  • the first language label determining submodule can be used to determine the language label of the first sample picture according to the language of the monolingual corpus.
  • The above-mentioned apparatus may further include a splicing module, which may be used to mix and splice original corpora in multiple languages to obtain multiple mixed-language corpora.
  • the second determining module may include a second generating submodule, a second content label determining submodule, and a second language label determining submodule.
  • the second generation sub-module may be used for generating a picture containing the mixed-language corpus as a second sample picture for each mixed-language corpus among the plurality of mixed-language corpora.
  • the second content label determining submodule can be used to determine the content label of the second sample picture according to the text content of the mixed-language corpus.
  • the second language label determining submodule can be used to determine the language label of the second sample picture according to the language of the mixed language corpus.
  • the training module may include an identification submodule, a first loss determination submodule, a second loss determination submodule, and an adjustment submodule.
  • the recognition sub-module can be used to determine the content recognition result and the language recognition result of one sample picture among the plurality of first sample pictures and the plurality of second sample pictures by using the text recognition model.
  • the first loss determining sub-module may be configured to determine the first loss according to the content recognition result and the content label of the sample picture, and determine the second loss according to the language recognition result and the language label of the sample picture.
  • the second loss determining submodule can be used to determine the total loss according to the first loss and the second loss.
  • the adjustment sub-module can be used to adjust the parameters of the text recognition model according to the total loss, and return to use the text recognition model to determine the content recognition result for another sample picture among the plurality of first sample pictures and the plurality of second sample pictures and language recognition results.
  • the character recognition model may include a first convolutional neural network, a recurrent neural network, a connection temporal classification network and a second convolutional neural network.
  • the recognition submodule includes a feature vector determination unit, a content recognition unit and a language recognition unit.
  • The feature vector determination unit may be configured to determine the feature vector of the sample picture by using the first convolutional neural network.
  • The content recognition unit can be used to determine the sequence feature from the feature vector by using the recurrent neural network, and to determine the content recognition result from the sequence feature by using the CTC network.
  • the language identification unit can be used to determine the multivariate feature vector according to the feature vector, and use the second convolutional neural network to determine the language recognition result according to the multivariate feature vector.
  • Fig. 8 schematically shows a block diagram of an apparatus for recognizing characters according to an embodiment of the present disclosure.
  • the text recognition device 800 may include an acquisition module 810 and an input module 820 .
  • The acquisition module 810 can be used to acquire a picture to be recognized that contains text information.
  • The input module 820 can be used to input the picture to be recognized into the text recognition model to obtain the content recognition result and the language recognition result of the picture, where the content recognition result represents the text information contained in the picture to be recognized, and the language recognition result represents the language corresponding to the text information.
  • the character recognition model is trained by the above-mentioned character recognition model training device.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 9 schematically shows a block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a random-access memory (RAM) 903. The RAM 903 can also store various programs and data necessary for the operation of the device 900.
  • the computing unit 901, ROM 902, and RAM 903 are connected to each other through a bus 904.
  • An input/output (I/O) interface 905 is also connected to the bus 904 .
  • Multiple components in the device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard or a mouse; an output unit 907, such as various types of displays and speakers; a storage unit 908, such as a magnetic disk or an optical disk; and a communication unit 909, such as a network card, a modem, or a wireless communication transceiver.
  • the communication unit 909 allows the device 900 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 901 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, microcontroller, etc.
  • the computing unit 901 executes various methods and processes described above, such as a method for training a character recognition model and a method for recognizing characters.
  • the method for training a character recognition model and the method for recognizing characters can be implemented as computer software programs, which are tangibly contained in a machine-readable medium, such as the storage unit 908 .
  • part or all of the computer program may be loaded and/or installed on the device 900 via the ROM 902 and/or the communication unit 909.
  • the computer program When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method for training the character recognition model and the method for recognizing characters described above can be performed.
  • the computing unit 901 may be configured in any other appropriate way (for example, by means of firmware) to execute the method for training a character recognition model and the method for recognizing characters.
  • Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing device, so that, when executed by the processor or controller, the program codes cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • Machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard drives, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • steps may be reordered, added or deleted using the various forms of flow shown above.
  • each step described in the present disclosure may be executed in parallel, sequentially, or in a different order; as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure provides a training method for a text recognition model, a method for recognizing text, an apparatus, a device, a storage medium, and a program product, relating to the technical field of artificial intelligence, specifically to the technical fields of computer vision and deep learning, and applicable to scenarios such as OCR (optical character recognition). A specific implementation scheme is: determining, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures; determining, from a plurality of mixed-language corpora, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures; and training a text recognition model based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.

Description

Training method for text recognition model, and method and apparatus for recognizing text
This application claims priority to Chinese patent application No. 202110934328.0, filed on August 13, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of artificial intelligence, specifically to the technical fields of computer vision and deep learning, and can be applied to scenarios such as OCR (optical character recognition).
Background
Many files in daily life, such as documents, pictures, and videos, contain text in multiple languages. For example, in addition to Chinese, such files may contain English, Spanish, Portuguese, Russian, Polish, and text in other languages. Recognizing the multilingual text content in a file and outputting the corresponding language category is a prerequisite for extracting and translating the text information of each language. This recognition process is of great significance to information review, cultural transmission, business communication, and more.
Summary
The present disclosure provides a training method for a text recognition model, a method for recognizing text, an apparatus, a device, a storage medium, and a program product.
According to one aspect of the present disclosure, a training method for a text recognition model is provided, including: determining, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures; determining, from a plurality of mixed-language corpora, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures; and training a text recognition model based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
According to another aspect of the present disclosure, a method for recognizing text is provided, including: acquiring a to-be-recognized picture containing text information; and inputting the to-be-recognized picture into a text recognition model to obtain a content recognition result and a language recognition result of the to-be-recognized picture, where the content recognition result represents the text information contained in the to-be-recognized picture, the language recognition result represents the language corresponding to the text information, and the text recognition model is trained according to the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a training apparatus for a text recognition model is provided, including: a first determination module configured to determine, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures; a second determination module configured to determine, from a plurality of mixed-language corpora, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures; and a training module configured to train a text recognition model based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
According to another aspect of the present disclosure, an apparatus for recognizing text is provided, including: an acquisition module configured to acquire a to-be-recognized picture containing text information; and an input module configured to input the to-be-recognized picture into a text recognition model to obtain a content recognition result and a language recognition result of the to-be-recognized picture, where the content recognition result represents the text information contained in the to-be-recognized picture, the language recognition result represents the language corresponding to the text information, and the text recognition model is trained by the apparatus of the embodiments of the present disclosure.
Another aspect of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method shown in the embodiments of the present disclosure.
According to another aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the method shown in the embodiments of the present disclosure.
According to another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method shown in the embodiments of the present disclosure.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
Brief Description of the Drawings
The drawings are provided for a better understanding of the solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 is a flowchart of a training method for a text recognition model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a text recognition model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for training a text recognition model according to an embodiment of the present disclosure;
FIG. 4 schematically shows a diagram of a method for training a text recognition model according to an embodiment of the present disclosure;
FIG. 5 schematically shows a flowchart of a method for recognizing text according to an embodiment of the present disclosure;
FIG. 6 schematically shows a diagram of a method for recognizing text according to an embodiment of the present disclosure;
FIG. 7 schematically shows a block diagram of a training apparatus for a text recognition model according to an embodiment of the present disclosure;
FIG. 8 schematically shows a block diagram of an apparatus for recognizing text according to an embodiment of the present disclosure; and
FIG. 9 schematically shows a block diagram of an example electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the drawings, including various details of the embodiments of the present disclosure to facilitate understanding, which should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and brevity.
FIG. 1 is a flowchart of a training method for a text recognition model according to an embodiment of the present disclosure.
As shown in FIG. 1, the method 100 includes, in operation S110, determining, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures.
Then, in operation S120, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures are determined from a plurality of mixed-language corpora.
In operation S130, the text recognition model is trained based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
According to embodiments of the present disclosure, the text recognition model can be used to determine a content recognition result and a language recognition result for an input picture, where the content recognition result can represent the text information contained in the input picture, and the language recognition result can represent the language corresponding to that text information.
According to embodiments of the present disclosure, the trained text recognition model can automatically output the language corresponding to the text in a picture while recognizing that text.
In the related art, pictures in different languages collected from real-world scenarios are annotated and used as sample pictures for model training. However, the number of pictures in different languages that can be collected from real-world scenarios is limited, and annotating them is difficult.
According to embodiments of the present disclosure, in addition to pictures in different languages collected from real-world scenarios, text corpora of each language can also be collected, and a large number of pictures containing text can be synthesized from these corpora for model training.
According to embodiments of the present disclosure, for a text containing mixed languages, for example, unwanted languages can be filtered out of the text using the character set (also called a dictionary) of a predetermined language. Each line of the filtered text is then taken as one corpus item.
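The corpus-cleaning step just described can be sketched in a few lines of Python. The disclosure only specifies filtering by a predetermined character set and splitting the result into lines; the concrete character set, the function name, and the handling of whitespace-only lines below are illustrative assumptions.

```python
# Minimal sketch of corpus cleaning. ALLOWED stands in for the predetermined
# language's character set ("dictionary"); here it keeps basic Latin text only.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,.")

def filter_text(text: str) -> list[str]:
    """Drop characters outside the predetermined character set, then take
    each remaining non-empty line of the filtered text as one corpus item."""
    kept_lines = []
    for line in text.splitlines():
        filtered = "".join(ch for ch in line if ch in ALLOWED)
        if filtered.strip():  # skip lines with no surviving characters
            kept_lines.append(filtered)
    return kept_lines
```

With this sketch, a line of Cyrillic or CJK text is reduced to whatever Latin characters it contains, and lines left empty are discarded.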
On this basis, according to embodiments of the present disclosure, for each monolingual corpus in the plurality of monolingual corpora, a picture containing the monolingual corpus can be generated as a first sample picture. A content label of the first sample picture is then determined from the text content of the monolingual corpus, and a language label of the first sample picture is determined from the language of the monolingual corpus.
According to embodiments of the present disclosure, original corpora of multiple languages can be mixed and concatenated, splicing corpora of multiple languages into a single text, to obtain a plurality of mixed corpora. Then, for each mixed-language corpus in the plurality of mixed-language corpora, a picture containing the mixed-language corpus is generated as a second sample picture. A content label of the second sample picture is determined from the text content of the mixed-language corpus, and a language label of the second sample picture is determined from the language of the mixed-language corpus. Illustratively, the language label of a mixed-language corpus may be the language with the largest number of characters in that corpus. When several languages are tied for the largest number of characters in a mixed-language corpus, any one of those languages may be chosen as its language label.
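The majority-character labeling rule above can be sketched as follows, assuming per-language character sets are available. The function name and the tie-breaking via `Counter` ordering are illustrative choices, consistent with the statement that any tied language may be chosen.

```python
from collections import Counter

def language_label(corpus: str, char_sets: dict[str, set]) -> str:
    """Label a mixed-language corpus with the language contributing the most
    characters. Assumes at least one character belongs to a known language;
    ties are broken arbitrarily (here: first language encountered)."""
    counts = Counter()
    for ch in corpus:
        for lang, charset in char_sets.items():
            if ch in charset:
                counts[lang] += 1
                break  # each character is attributed to one language
    return counts.most_common(1)[0][0]
```

For example, a corpus containing five Latin letters and three Cyrillic letters would be labeled with the Latin-script language.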
According to other embodiments of the present disclosure, pictures input to the text recognition model may vary in size, which affects recognition accuracy. To address this, the size of a picture can be adjusted to within a preset range before the picture is input to the text recognition model. Illustratively, in this embodiment, the vertical height of the picture can be adjusted to between 32 and 48 pixels, and correspondingly, the horizontal width of the picture is scaled by the same factor to preserve the picture's original aspect ratio. In addition, the width of the picture can be limited to at most 512 pixels.
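The resizing rule can be read as the following size computation. The exact clamping policy (e.g., what happens when the original height already lies within the range) is an illustrative assumption, not fixed by the embodiment.

```python
def target_size(width: int, height: int,
                min_h: int = 32, max_h: int = 48, max_w: int = 512) -> tuple:
    """Clamp the height into [min_h, max_h], scale the width by the same
    factor to keep the aspect ratio, and cap the width at max_w pixels."""
    new_h = min(max(height, min_h), max_h)
    scale = new_h / height
    new_w = min(round(width * scale), max_w)
    return new_w, new_h
```

A 100x16 picture would be scaled up to 200x32, while a very wide 3000x40 picture keeps its height but has its width capped at 512.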
The text recognition model shown above is further described below with reference to FIG. 2 and specific embodiments. Those skilled in the art will understand that the following example embodiments are only for understanding the present disclosure, and the present disclosure is not limited thereto.
FIG. 2 is a schematic diagram of a text recognition model according to an embodiment of the present disclosure.
As shown in FIG. 2, the text recognition model may include a first convolutional neural network (CNN) 210, a recurrent neural network (RNN) 220, a Connectionist Temporal Classification (CTC) network 230, and a second convolutional neural network 240.
According to embodiments of the present disclosure, the first convolutional neural network 210 can be used to perform feature extraction on a picture 21 input to the text recognition model, obtaining a feature vector 22 of the picture, where the features in the feature vector 22 are ordered by time step. The recurrent neural network 220 can be used to further extract sequence features from the feature vector 22 extracted by the first convolutional neural network 210. The Connectionist Temporal Classification network 230 can be used to determine a content recognition result 23 for the picture from the sequence features extracted by the recurrent neural network. In addition, an N-gram feature vector 24 can be determined from the feature vector 22, and the second convolutional neural network 240 can be used to determine a language recognition result 25 from the N-gram feature vector 24.
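How the N-gram feature vector 24 is formed from the time-step-ordered features is not spelled out at this level of detail. One plausible construction, shown purely for illustration, concatenates every n adjacent time-step feature vectors so that each output row carries local context spanning neighboring characters.

```python
def ngram_features(steps: list, n: int = 2) -> list:
    """Form N-gram features by concatenating every n adjacent time-step
    feature vectors with a stride of one step. This is an illustrative
    choice; the patent only states that N-gram features are derived
    from the convolutional feature sequence."""
    out = []
    for i in range(len(steps) - n + 1):
        merged = []
        for j in range(n):
            merged.extend(steps[i + j])  # concatenate adjacent steps
        out.append(merged)
    return out
```

With n = 2, a sequence of T time-step vectors yields T - 1 bigram rows, each pairing a step with its right neighbor.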
The text recognition model according to the embodiments of the present disclosure contains a relatively small number of component models, which reduces computing resources and simplifies the system pipeline.
The method for training the text recognition model shown above is further described below with reference to FIG. 3 and specific embodiments. Those skilled in the art will understand that the following example embodiments are only for understanding the present disclosure, and the present disclosure is not limited thereto.
FIG. 3 is a flowchart of a method for training a text recognition model according to an embodiment of the present disclosure.
As shown in FIG. 3, the method 330 includes, in operation S331, acquiring one sample picture from the plurality of first sample pictures and the plurality of second sample pictures.
In operation S332, the text recognition model is used to determine the content recognition result and the language recognition result of the sample picture.
In operation S333, a first loss is determined from the content recognition result and the content label of the sample picture, and a second loss is determined from the language recognition result and the language label of the sample picture.
According to embodiments of the present disclosure, for example, the loss between the content recognition result and the content label of the sample picture, i.e., the first loss, can be determined using a first loss function, and the loss between the language recognition result and the language label of the sample picture, i.e., the second loss, can be determined using a second loss function. It should be noted that the first loss function and the second loss function may be the same or different.
In operation S334, a total loss is determined from the first loss and the second loss.
According to embodiments of the present disclosure, the first loss and the second loss can be summed with weights to obtain the total loss, where the weights of the first loss and the second loss can be determined according to actual needs. Illustratively, in this embodiment, the weight of the second loss may be lower than the weight of the first loss.
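The weighted combination described above amounts to `L_total = w_content * L_content + w_language * L_language`. The sketch below uses an illustrative language-branch weight lower than the content-branch weight, matching the stated embodiment; the actual values would be tuned to need.

```python
def total_loss(content_loss: float, language_loss: float,
               w_content: float = 1.0, w_language: float = 0.5) -> float:
    """Weighted sum of the two branch losses. The default weights are
    illustrative only: the language-branch weight is set lower than the
    content-branch weight, as in the described embodiment."""
    return w_content * content_loss + w_language * language_loss
```

For example, with a content loss of 2.0 and a language loss of 1.0, the default weights give a total loss of 2.5.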
In operation S335, parameters of the text recognition model are adjusted according to the total loss.
In operation S336, another sample picture is acquired from the plurality of first sample pictures and the plurality of second sample pictures, and execution jumps back to operation S332 to use the text recognition model to determine the content recognition result and the language recognition result of that sample picture.
The method for training the text recognition model shown above is further described below with reference to FIG. 4 and specific embodiments. Those skilled in the art will understand that the following example embodiments are only for understanding the present disclosure, and the present disclosure is not limited thereto.
FIG. 4 schematically shows a diagram of a method for training a text recognition model according to an embodiment of the present disclosure.
As shown in FIG. 4, during training of the text recognition model, a first convolutional neural network 410 can be used to determine a feature vector 42 of a sample picture 41. Based on the feature vector 42, text recognition and language classification are then performed by two separate branches. In the branch for text recognition, a recurrent neural network 420 can be used to determine sequence features from the feature vector 42, and a Connectionist Temporal Classification network 430 can be used to determine a content recognition result 43 from the sequence features. In the branch for language classification, an N-gram feature vector 44 can be determined from the feature vector 42, and a second convolutional neural network 440 can be used to determine a language recognition result 45 from the N-gram feature vector.
Next, a first loss 46 can be determined from the content recognition result 43 and the content label of the sample picture 41, and a second loss 47 can be determined from the language recognition result 45 and the language label of the sample picture 41. A total loss 48 is then determined from the first loss 46 and the second loss 47, and the parameters of the text recognition model are adjusted according to the total loss 48, thereby implementing error backpropagation.
According to embodiments of the present disclosure, the two branches for multilingual text recognition and language classification share the underlying feature vector, and forward computation and error backpropagation are performed for both at the same time. The two tasks learn complementarily, which can improve generalization.
In addition, the language category helps to distinguish visually similar characters and improves character recognition accuracy per language, for example, the English character n versus the Russian character й; conversely, characters unique to a language's script also help to classify the language category, for example, й appears in Russian, Ukrainian, and other languages. By extracting N-gram feature vectors from the picture's convolutional feature vectors, the text recognition model according to the embodiments of the present disclosure exploits the semantic correlation between adjacent characters, which can further improve language classification accuracy.
FIG. 5 schematically shows a flowchart of a method for recognizing text according to an embodiment of the present disclosure.
As shown in FIG. 5, the method includes, in operation S510, acquiring a to-be-recognized picture containing text information.
Then, in operation S520, the to-be-recognized picture is input into the text recognition model to obtain a content recognition result and a language recognition result of the to-be-recognized picture.
According to embodiments of the present disclosure, the text recognition model can, for example, be obtained by training according to the training method for a text recognition model shown above. The output of the text recognition model can include a content recognition result and a language recognition result, where the content recognition result can represent the text information contained in the to-be-recognized picture, and the language recognition result can represent the language corresponding to that text information.
The method for recognizing text shown above is further described below with reference to FIG. 6 and specific embodiments. Those skilled in the art will understand that the following example embodiments are only for understanding the present disclosure, and the present disclosure is not limited thereto.
FIG. 6 schematically shows a diagram of a method for recognizing text according to an embodiment of the present disclosure.
As shown in FIG. 6, according to embodiments of the present disclosure, the text recognition model may include a first convolutional neural network (CNN), a recurrent neural network, a Connectionist Temporal Classification network, and a second convolutional neural network. On this basis, the first convolutional neural network 610 can be used to determine a feature vector 62 of a to-be-recognized picture 61. A recurrent neural network 620 can then be used to determine sequence features from the feature vector 62, and a Connectionist Temporal Classification network 630 can be used to determine a content recognition result 63 for the to-be-recognized picture 61 from the sequence features. On the other hand, an N-gram feature vector 64 can be determined from the feature vector 62, and a second convolutional neural network 640 can be used to determine a language recognition result 65 for the to-be-recognized picture 61 from the N-gram feature vector 64.
FIG. 7 schematically shows a block diagram of a training apparatus for a text recognition model according to an embodiment of the present disclosure.
As shown in FIG. 7, the training apparatus 700 for a text recognition model may include a first determination module 710, a second determination module 720, and a training module 730.
The first determination module 710 can be configured to determine, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures.
The second determination module 720 can be configured to determine, from a plurality of mixed-language corpora, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures.
The training module 730 can be configured to train the text recognition model based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
According to embodiments of the present disclosure, the first determination module may include a first generation submodule, a first content label determination submodule, and a first language label determination submodule. The first generation submodule can be configured to generate, for each monolingual corpus in the plurality of monolingual corpora, a picture containing the monolingual corpus as a first sample picture. The first content label determination submodule can be configured to determine the content label of the first sample picture from the text content of the monolingual corpus. The first language label determination submodule can be configured to determine the language label of the first sample picture from the language of the monolingual corpus.
According to embodiments of the present disclosure, the above apparatus may further include a concatenation module, which can be configured to mix and concatenate original corpora of multiple languages to obtain a plurality of mixed corpora.
According to embodiments of the present disclosure, the second determination module may include a second generation submodule, a second content label determination submodule, and a second language label determination submodule. The second generation submodule can be configured to generate, for each mixed-language corpus in the plurality of mixed-language corpora, a picture containing the mixed-language corpus as a second sample picture. The second content label determination submodule can be configured to determine the content label of the second sample picture from the text content of the mixed-language corpus. The second language label determination submodule can be configured to determine the language label of the second sample picture from the language of the mixed-language corpus.
According to embodiments of the present disclosure, the training module may include a recognition submodule, a first loss determination submodule, a second loss determination submodule, and an adjustment submodule. The recognition submodule can be configured to use the text recognition model to determine the content recognition result and the language recognition result of one sample picture from the plurality of first sample pictures and the plurality of second sample pictures. The first loss determination submodule can be configured to determine a first loss from the content recognition result and the content label of the sample picture, and determine a second loss from the language recognition result and the language label of the sample picture. The second loss determination submodule can be configured to determine a total loss from the first loss and the second loss. The adjustment submodule can be configured to adjust parameters of the text recognition model according to the total loss, and to return, for another sample picture from the plurality of first sample pictures and the plurality of second sample pictures, to the step of using the text recognition model to determine the content recognition result and the language recognition result.
According to embodiments of the present disclosure, the text recognition model may include a first convolutional neural network, a recurrent neural network, a Connectionist Temporal Classification network, and a second convolutional neural network.
According to embodiments of the present disclosure, the recognition submodule includes a feature vector determination unit, a content recognition unit, and a language recognition unit. The feature vector determination unit can be configured to determine the feature vector of the sample picture using the first convolutional neural network. The content recognition unit can be configured to determine sequence features from the feature vector using the recurrent neural network, and to determine the content recognition result from the sequence features using the Connectionist Temporal Classification network. The language recognition unit can be configured to determine an N-gram feature vector from the feature vector, and to determine the language recognition result from the N-gram feature vector using the second convolutional neural network.
FIG. 8 schematically shows a block diagram of an apparatus for recognizing text according to an embodiment of the present disclosure.
As shown in FIG. 8, the apparatus 800 for recognizing text may include an acquisition module 810 and an input module 820.
The acquisition module 810 can be configured to acquire a to-be-recognized picture containing text information.
The input module 820 can be configured to input the to-be-recognized picture into the text recognition model to obtain a content recognition result and a language recognition result of the to-be-recognized picture, where the content recognition result represents the text information contained in the to-be-recognized picture, and the language recognition result represents the language corresponding to that text information.
According to embodiments of the present disclosure, the text recognition model is trained by the training apparatus for a text recognition model described above.
It should be noted that in the technical solution of the present disclosure, the acquisition, storage, and application of users' personal information all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 9 schematically shows a block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. Various programs and data required for the operation of the device 900 can also be stored in the RAM 903. The computing unit 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Multiple components of the device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard or a mouse; an output unit 907, such as various types of displays or speakers; a storage unit 908, such as a magnetic disk or an optical disc; and a communication unit 909, such as a network card, a modem, or a wireless communication transceiver. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 901 can be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 901 performs the methods and processes described above, such as the training method for a text recognition model and the method for recognizing text. For example, in some embodiments, the training method for a text recognition model and the method for recognizing text can be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method for a text recognition model and the method for recognizing text described above can be performed. Alternatively, in other embodiments, the computing unit 901 can be configured in any other appropriate manner (for example, by means of firmware) to perform the training method for a text recognition model and the method for recognizing text.
Various implementations of the systems and techniques described herein above can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include being implemented in one or more computer programs, which can be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor, which may be special-purpose or general-purpose, can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure can be written in any combination of one or more programming languages. Such program code can be provided to the processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code can execute entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium can be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that steps can be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure can be executed in parallel, sequentially, or in a different order; as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.
The specific implementations above do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (19)

  1. A training method for a text recognition model, comprising:
    determining, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures;
    determining, from a plurality of mixed-language corpora, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures; and
    training a text recognition model based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
  2. The method according to claim 1, wherein the determining, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures comprises:
    for each monolingual corpus in the plurality of monolingual corpora,
    generating a picture containing the monolingual corpus as the first sample picture;
    determining a content label of the first sample picture from the text content of the monolingual corpus; and
    determining a language label of the first sample picture from the language of the monolingual corpus.
  3. The method according to claim 1, further comprising:
    mixing and concatenating original corpora of multiple languages to obtain the plurality of mixed corpora.
  4. The method according to claim 3, wherein the determining, from a plurality of mixed-language corpora, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures comprises:
    for each mixed-language corpus in the plurality of mixed-language corpora,
    generating a picture containing the mixed-language corpus as the second sample picture;
    determining a content label of the second sample picture from the text content of the mixed-language corpus; and
    determining a language label of the second sample picture from the language of the mixed-language corpus.
  5. The method according to claim 1, wherein the training the text recognition model based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures comprises:
    using the text recognition model to determine a content recognition result and a language recognition result of one sample picture from the plurality of first sample pictures and the plurality of second sample pictures;
    determining a first loss from the content recognition result and the content label of the sample picture, and determining a second loss from the language recognition result and the language label of the sample picture;
    determining a total loss from the first loss and the second loss; and
    adjusting parameters of the text recognition model according to the total loss, and returning, for another sample picture from the plurality of first sample pictures and the plurality of second sample pictures, to the step of using the text recognition model to determine a content recognition result and a language recognition result.
  6. The method according to claim 5, wherein the text recognition model comprises a first convolutional neural network, a recurrent neural network, a Connectionist Temporal Classification network, and a second convolutional neural network.
  7. The method according to claim 6, wherein the using the text recognition model to determine the content recognition result and the language recognition result of the sample picture comprises:
    determining a feature vector of the sample picture using the first convolutional neural network;
    determining sequence features from the feature vector using the recurrent neural network, and determining the content recognition result from the sequence features using the Connectionist Temporal Classification network; and
    determining an N-gram feature vector from the feature vector, and determining the language recognition result from the N-gram feature vector using the second convolutional neural network.
  8. A method for recognizing text, comprising:
    acquiring a to-be-recognized picture containing text information; and
    inputting the to-be-recognized picture into a text recognition model to obtain a content recognition result and a language recognition result of the to-be-recognized picture, wherein the content recognition result represents the text information contained in the to-be-recognized picture, and the language recognition result represents the language corresponding to the text information,
    wherein the text recognition model is trained according to the method of any one of claims 1-7.
  9. A training apparatus for a text recognition model, comprising:
    a first determination module configured to determine, from a plurality of monolingual corpora, a plurality of first sample pictures as well as content labels and language labels of the plurality of first sample pictures;
    a second determination module configured to determine, from a plurality of mixed-language corpora, a plurality of second sample pictures as well as content labels and language labels of the plurality of second sample pictures; and
    a training module configured to train the text recognition model based on the plurality of first sample pictures, the content labels and language labels of the plurality of first sample pictures, the plurality of second sample pictures, and the content labels and language labels of the plurality of second sample pictures.
  10. The apparatus according to claim 9, wherein the first determination module comprises:
    a first generation submodule configured to generate, for each monolingual corpus in the plurality of monolingual corpora, a picture containing the monolingual corpus as the first sample picture;
    a first content label determination submodule configured to determine a content label of the first sample picture from the text content of the monolingual corpus; and
    a first language label determination submodule configured to determine a language label of the first sample picture from the language of the monolingual corpus.
  11. The apparatus according to claim 9, further comprising:
    a concatenation module configured to mix and concatenate original corpora of multiple languages to obtain the plurality of mixed corpora.
  12. The apparatus according to claim 11, wherein the second determination module comprises:
    a second generation submodule configured to generate, for each mixed-language corpus in the plurality of mixed-language corpora, a picture containing the mixed-language corpus as the second sample picture;
    a second content label determination submodule configured to determine a content label of the second sample picture from the text content of the mixed-language corpus; and
    a second language label determination submodule configured to determine a language label of the second sample picture from the language of the mixed-language corpus.
  13. The apparatus according to claim 9, wherein the training module comprises:
    a recognition submodule configured to use the text recognition model to determine a content recognition result and a language recognition result of one sample picture from the plurality of first sample pictures and the plurality of second sample pictures;
    a first loss determination submodule configured to determine a first loss from the content recognition result and the content label of the sample picture, and determine a second loss from the language recognition result and the language label of the sample picture;
    a second loss determination submodule configured to determine a total loss from the first loss and the second loss; and
    an adjustment submodule configured to adjust parameters of the text recognition model according to the total loss, and to return, for another sample picture from the plurality of first sample pictures and the plurality of second sample pictures, to the step of using the text recognition model to determine a content recognition result and a language recognition result.
  14. The apparatus according to claim 13, wherein the text recognition model comprises a first convolutional neural network, a recurrent neural network, a Connectionist Temporal Classification network, and a second convolutional neural network.
  15. The apparatus according to claim 14, wherein the recognition submodule comprises:
    a feature vector determination unit configured to determine a feature vector of the sample picture using the first convolutional neural network;
    a content recognition unit configured to determine sequence features from the feature vector using the recurrent neural network, and determine the content recognition result from the sequence features using the Connectionist Temporal Classification network; and
    a language recognition unit configured to determine an N-gram feature vector from the feature vector, and determine the language recognition result from the N-gram feature vector using the second convolutional neural network.
  16. An apparatus for recognizing text, comprising:
    an acquisition module configured to acquire a to-be-recognized picture containing text information; and
    an input module configured to input the to-be-recognized picture into a text recognition model to obtain a content recognition result and a language recognition result of the to-be-recognized picture, wherein the content recognition result represents the text information contained in the to-be-recognized picture, and the language recognition result represents the language corresponding to the text information,
    wherein the text recognition model is trained by the apparatus of any one of claims 9-15.
  17. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
  18. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method of any one of claims 1-8.
  19. A computer program product, comprising a computer program that, when executed by a processor, implements the method of any one of claims 1-8.
PCT/CN2022/104891 2021-08-13 2022-07-11 Training method for text recognition model, and method and apparatus for recognizing text WO2023016163A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110934328.0A CN113657391A (zh) 2021-08-13 2021-08-13 Training method for text recognition model, and method and apparatus for recognizing text
CN202110934328.0 2021-08-13

Publications (1)

Publication Number Publication Date
WO2023016163A1 true WO2023016163A1 (zh) 2023-02-16

Family

ID=78480310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104891 WO2023016163A1 (zh) 2021-08-13 2022-07-11 Training method for text recognition model, and method and apparatus for recognizing text

Country Status (2)

Country Link
CN (1) CN113657391A (zh)
WO (1) WO2023016163A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657391A (zh) * 2021-08-13 2021-11-16 北京百度网讯科技有限公司 Training method for text recognition model, and method and apparatus for recognizing text

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110777A (zh) * 2019-04-28 2019-08-09 网易有道信息技术(北京)有限公司 Image processing method and training method, and apparatus, medium, and computing device
CN112288018A (zh) * 2020-10-30 2021-01-29 北京市商汤科技开发有限公司 Training method for text recognition network, and text recognition method and apparatus
WO2021081562A2 (en) * 2021-01-20 2021-04-29 Innopeak Technology, Inc. Multi-head text recognition model for multi-lingual optical character recognition
CN112883968A (zh) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, apparatus, medium, and electronic device
CN113033660A (zh) * 2021-03-24 2021-06-25 支付宝(杭州)信息技术有限公司 General method, apparatus, and device for detecting minority languages
CN113657391A (zh) * 2021-08-13 2021-11-16 北京百度网讯科技有限公司 Training method for text recognition model, and method and apparatus for recognizing text

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108648747B (zh) * 2018-03-21 2020-06-02 清华大学 Language identification system
CN109948696A (zh) * 2019-03-19 2019-06-28 上海七牛信息技术有限公司 Multilingual scene character recognition method and system
US11551053B2 (en) * 2019-08-15 2023-01-10 Sap Se Densely connected convolutional neural network for service ticket classification
CN111401374A (zh) * 2020-03-06 2020-07-10 湖南快乐阳光互动娱乐传媒有限公司 Multi-task-based model training method, and character recognition method and apparatus
CN112613324A (zh) * 2020-12-29 2021-04-06 北京中科闻歌科技股份有限公司 Semantic emotion recognition method, apparatus, device, and storage medium
CN112883149B (зh) * 2021-01-20 2024-03-26 华为技术有限公司 Natural language processing method and apparatus


Also Published As

Publication number Publication date
CN113657391A (zh) 2021-11-16

Similar Documents

Publication Publication Date Title
US20230106873A1 (en) Text extraction method, text extraction model training method, electronic device and storage medium
US9766868B2 (en) Dynamic source code generation
US9619209B1 (en) Dynamic source code generation
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
US20210326524A1 (en) Method, apparatus and device for quality control and storage medium
US20220139096A1 (en) Character recognition method, model training method, related apparatus and electronic device
US11651015B2 (en) Method and apparatus for presenting information
CN114595686B (zh) 知识抽取方法、知识抽取模型的训练方法及装置
US20220174369A1 (en) Method for processing video, device and storage medium
US20240013558A1 (en) Cross-modal feature extraction, retrieval, and model training method and apparatus, and medium
CN113360699A (zh) 模型训练方法和装置、图像问答方法和装置
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN115982376A (zh) 基于文本、多模数据和知识训练模型的方法和装置
CN107766498B (zh) 用于生成信息的方法和装置
WO2023093014A1 (zh) 一种票据识别方法、装置、设备以及存储介质
WO2023016163A1 (zh) 文字识别模型的训练方法、识别文字的方法和装置
EP3920074A2 (en) Method for industry text increment, related apparatus, and computer program product
US20230377225A1 (en) Method and apparatus for editing an image and method and apparatus for training an image editing model, device and medium
CN113761923A (zh) 命名实体识别方法、装置、电子设备及存储介质
US20230081015A1 (en) Method and apparatus for acquiring information, electronic device and storage medium
US20230086145A1 (en) Method of processing data, electronic device, and medium
US11929100B2 (en) Video generation method, apparatus, electronic device, storage medium and program product
US20210342379A1 (en) Method and device for processing sentence, and storage medium
CN115565186A (zh) 文字识别模型的训练方法、装置、电子设备和存储介质
US20210311985A1 (en) Method and apparatus for image processing, electronic device, and computer readable storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE