WO2021042527A1 - Character recognition method, device, and computer-readable storage medium - Google Patents

Character recognition method, device, and computer-readable storage medium

Info

Publication number
WO2021042527A1
Authority
WO
WIPO (PCT)
Prior art keywords
character
edit
target
string
structured form
Prior art date
Application number
PCT/CN2019/117287
Other languages
English (en)
French (fr)
Inventor
陈少琼
卢宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021042527A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a character recognition method, device and computer-readable storage medium based on deep learning.
  • Deep-learning-based OCR image recognition is prone to misrecognizing similar-looking characters, such as O and 0, or I and L.
  • If even one character in a field is misrecognized, recognition of the whole field fails, which greatly reduces accuracy.
  • This also makes later manual verification very inconvenient and reduces work efficiency.
  • the present application provides a character recognition method, device, and computer-readable storage medium, the main purpose of which is to present accurate recognition results to the user when the user is performing character recognition.
  • To achieve the above purpose, a character recognition method provided by this application includes: obtaining a structured form text set, and performing character extraction on the structured form text set by an optical character recognition method to obtain a character set; performing a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization; building a dictionary tree for the target text set to obtain a target string set; matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table; and receiving structured form text to be processed, matching the characters extracted from it against the similar character table, outputting the character with the highest degree of matching with the extracted characters, and thereby completing character recognition of the structured form text to be processed.
  • In addition, this application provides a character recognition device, which includes a memory and a processor.
  • The memory stores a character recognition program that can run on the processor; when executed by the processor, the character recognition program implements the following steps: obtaining a structured form text set, and performing character extraction on the structured form text set by an optical character recognition method to obtain a character set; performing a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization; building a dictionary tree for the target text set to obtain a target string set; matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table; and receiving structured form text to be processed, matching the characters extracted from it against the similar character table, outputting the character with the highest degree of matching with the extracted characters, and thereby completing character recognition of the structured form text to be processed.
  • In addition, this application provides a computer-readable storage medium storing a character recognition program, where the character recognition program can be executed by one or more processors to implement the steps of the character recognition method described above.
  • With the character recognition method, device, and computer-readable storage medium proposed in this application, when a user performs character recognition on structured form text, the characters extracted from the structured form text are looked up by traversing the established similar character table, and the character with the highest degree of matching with the extracted characters is output, so that accurate recognition results can be presented to the user.
  • FIG. 1 is a schematic flowchart of a character recognition method provided by an embodiment of this application
  • FIG. 2 is a schematic diagram of the internal structure of a character recognition device provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of modules of a character recognition program in a character recognition device provided by an embodiment of the application.
  • This application provides a character recognition method.
  • Referring to FIG. 1, a schematic flowchart of a character recognition method provided by an embodiment of this application is shown.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
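  • In this embodiment, the character recognition method includes:
  • S1. Obtain a structured form text set, and perform character extraction on the structured form text set to obtain a character set.
  • In a preferred embodiment, the structured form text set may be generated by business activity.
  • For example, the structured form text set can be obtained in two ways: method one, from data produced by enterprise employees in the course of business, such as the invoice text data issued each month by the finance department staff of Ping An; method two, by keyword search in a search engine.
  • Further, this application performs character extraction on the structured form text set by optical character recognition (OCR).
  • OCR refers to the technology of optically converting the text of a paper document into a black-and-white dot-matrix image file, and then using recognition software to convert the text in the image into a text format that word-processing software can further edit.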
  • S2. Perform a preprocessing operation on the structured form text set to obtain a target text set.
  • In a preferred embodiment, the preprocessing operation includes word segmentation, encoding, and normalization.
  • In detail, the preprocessing operation includes: using natural language processing technology to perform word segmentation on the structured form text set to obtain a string set of the structured form text set; converting the string set into numeric form by an encoding technique; and normalizing the encoded string set to obtain the target text set.
  • Normalization maps the encoded string set into the interval (0, 1), which makes the data easier to extract.
  • Preferably, this application uses the Natural Language Toolkit (nltk) to perform word segmentation, uses one-hot encoding (One Hot Encoder) to convert the string set into numeric values, and uses the feature normalization (Normalizer) algorithm to perform normalization.
  • S3. Build a dictionary tree for the target text set to obtain a target string set.
  • In a preferred embodiment, the specific steps for building the dictionary tree include: taking the text set obtained above as input, and presetting any string in the target text set as the root of the target text set; selecting the strings in the target text set whose distance from the root equals a preset distance length to obtain a node string set, and establishing the child nodes of the root; and, based on the root and its child nodes, iterating over the target text set and computing string distances to obtain each node of the dictionary tree, thereby obtaining the target string set.
  • For example, the preset root may be the string GAME, and the preset distance length may be 1 and/or 2.
  • Further, this application obtains the distance length by computing the edit distance between each string in the target text set and the root.
  • When the distance length is the preset 1 or 2 and occurs at the root node for the first time, a new child node is created.
  • When the distance length is not the preset 1 or 2, this application passes the string down recursively along the corresponding edge and iterates the distance-length computation over the strings of the target text set to obtain each node of the dictionary tree, thereby obtaining the target string set.
  • For example, for the string FAME in the target text set, the distance between FAME and the preset root GAME is computed to be 1, so a new child node is created under the root and connected by an edge labeled 1.
  • When the string GAIN is inserted, its distance to the root GAME is computed to be 2, so a new child node is created under the root and connected by an edge labeled 2.
  • When the string GATE is inserted, its distance to the root GAME is computed to be 1, so it is recursively inserted along the edge labeled 1 into the subtree containing FAME; the distance between GATE and FAME is 2, so GATE is placed under the FAME node with an edge labeled 2.
  • In the same way, the distance length is computed for each inserted string in turn, yielding the target string set.
  • S4. Match the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table.
  • In a preferred embodiment, the minimum edit distance is the minimum number of edits needed to convert one string into another.
  • The algorithm is built on three edit operations: inserting a character, deleting a character, and modifying a character.
  • For example, one way to convert 'house' into 'home' is to delete the two letters 'u' and 's' from 'house' and insert the letter 'm', which takes 3 edits (replacing 'u' with 'm' and deleting 's' achieves the same conversion in 2 edits, which is the minimum).
  • The fewer edits the conversion requires, the more similar the two strings are.
  • Further, this application presets an edit function edit[i][j], which represents the number of edits from a character string of length i in the character set to a string of length j in the target string set; the number of edits is the distance length value.
  • edit[0][0] means that both the character string and the target string are empty; 0 edits are needed, so the distance length is 0.
  • edit[0][j] means that the character string is empty and the target string has length j; j characters must be added, so j edits are needed and the distance length is j.
  • edit[i][0] means that the character string has length i and the target string has length 0; i characters must be deleted, so i edits are needed and the distance length is i.
  • This application computes the value of the edit function edit[i][j] according to a preset dynamic programming formula, and derives the degree of similarity between characters from that value.
  • Preferably, this application sorts the similar characters in descending order of similarity to build the similar character table.
  • Further, the preset dynamic programming formula computes the value of the edit function edit[i][j] as follows:
  • when i = 0 and j = 0, edit[i][j] = 0;
  • when i = 0 and j > 0, edit[i][j] = j;
  • when i > 0 and j = 0, edit[i][j] = i;
  • when i ≥ 1 and j ≥ 1, edit[i][j] = min{edit[i-1][j] + 1, edit[i][j-1] + 1, edit[i-1][j-1] + f[i][j]}, where f[i][j] = 1 if the i-th character of the character string differs from the j-th character of the target string, and f[i][j] = 0 if they are equal.
  • For example, with i = 1, j = 1, and differing characters, edit[1][1] = min(edit[0][1] + 1, edit[1][0] + 1, edit[0][0] + f[1][1]) = min(2, 2, 1) = 1.
  • Following the same principle, this application obtains in turn the minimum number of edits for edit[2][1], edit[3][1], ..., edit[i][1], ..., edit[i][j].
  • S5. Receive the structured form text to be processed, match the characters extracted from it against the similar character table, output the character with the highest degree of matching with the extracted characters, and thereby complete character recognition of the structured form text to be processed.
  • A preferred embodiment of this application uses the OCR described above to extract characters from the structured form text to be processed, and looks them up by traversing the similar character table established above, so as to output the character with the highest degree of matching and complete character recognition of the structured form text to be processed.
  • the structured form text to be processed in this application and the structured form text set obtained above belong to the same business category, for example, both belong to the business category of issuing invoices.
  • the traversal refers to visiting each node in the similar character table sequentially along a certain search route.
  • This application also provides a character recognition device.
  • Referring to FIG. 2, a schematic diagram of the internal structure of a character recognition device provided by an embodiment of this application is shown.
  • the character recognition device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server.
  • the character recognition device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like.
  • the memory 11 may be an internal storage unit of the character recognition device 1 in some embodiments, such as a hard disk of the character recognition device 1.
  • the memory 11 may also be an external storage device of the character recognition device 1, for example, a plug-in hard disk equipped on the character recognition device 1, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the memory 11 may also include both an internal storage unit of the character recognition apparatus 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the character recognition device 1, such as the code of the character recognition program 01, etc., but also to temporarily store data that has been output or will be output.
  • In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the character recognition program 01.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the apparatus 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • Optionally, the user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc.
  • The display may also be called a display screen or a display unit, and is used to display the information processed in the character recognition device 1 and to display a visualized user interface.
  • FIG. 2 shows only the character recognition device 1 with components 11-14 and the character recognition program 01.
  • Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the character recognition device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
  • the character recognition program 01 is stored in the memory 11; the processor 12 implements the following steps when executing the character recognition program 01 stored in the memory 11:
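  • Step 1. Obtain a structured form text set, and perform character extraction on the structured form text set to obtain a character set.
  • In a preferred embodiment, the structured form text set may be generated by business activity.
  • For example, the structured form text set can be obtained in two ways: method one, from data produced by enterprise employees in the course of business, such as the invoice text data issued each month by the finance department staff of Ping An; method two, by keyword search in a search engine.
  • Further, this application performs character extraction on the structured form text set by optical character recognition (OCR).
  • OCR refers to the technology of optically converting the text of a paper document into a black-and-white dot-matrix image file, and then using recognition software to convert the text in the image into a text format that word-processing software can further edit.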
  • Step 2. Perform a preprocessing operation on the structured form text set to obtain a target text set.
  • In a preferred embodiment, the preprocessing operation includes word segmentation, encoding, and normalization.
  • In detail, the preprocessing operation includes: using natural language processing technology to perform word segmentation on the structured form text set to obtain a string set of the structured form text set; converting the string set into numeric form by an encoding technique; and normalizing the encoded string set to obtain the target text set.
  • Normalization maps the encoded string set into the interval (0, 1), which makes the data easier to extract.
  • Preferably, this application uses the Natural Language Toolkit (nltk) to perform word segmentation, uses one-hot encoding (One Hot Encoder) to convert the string set into numeric values, and uses the feature normalization (Normalizer) algorithm to perform normalization.
  • Step 3. Build a dictionary tree for the target text set to obtain a target string set.
  • In a preferred embodiment, the specific steps for building the dictionary tree include: taking the text set obtained above as input, and presetting any string in the target text set as the root of the target text set; selecting the strings in the target text set whose distance from the root equals a preset distance length to obtain a node string set, and establishing the child nodes of the root; and, based on the root and its child nodes, iterating over the target text set and computing string distances to obtain each node of the dictionary tree, thereby obtaining the target string set.
  • For example, the preset root may be the string GAME, and the preset distance length may be 1 and/or 2.
  • Further, this application obtains the distance length by computing the edit distance between each string in the target text set and the root.
  • When the distance length is the preset 1 or 2 and occurs at the root node for the first time, a new child node is created.
  • When the distance length is not the preset 1 or 2, this application passes the string down recursively along the corresponding edge and iterates the distance-length computation over the strings of the target text set to obtain each node of the dictionary tree, thereby obtaining the target string set.
  • For example, for the string FAME in the target text set, the distance between FAME and the preset root GAME is computed to be 1, so a new child node is created under the root and connected by an edge labeled 1.
  • When the string GAIN is inserted, its distance to the root GAME is computed to be 2, so a new child node is created under the root and connected by an edge labeled 2.
  • When the string GATE is inserted, its distance to the root GAME is computed to be 1, so it is recursively inserted along the edge labeled 1 into the subtree containing FAME; the distance between GATE and FAME is 2, so GATE is placed under the FAME node with an edge labeled 2.
  • In the same way, the distance length is computed for each inserted string in turn, yielding the target string set.
  • Step 4. Match the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table.
  • In a preferred embodiment, the minimum edit distance is the minimum number of edits needed to convert one string into another.
  • The algorithm is built on three edit operations: inserting a character, deleting a character, and modifying a character.
  • For example, one way to convert 'house' into 'home' is to delete the two letters 'u' and 's' from 'house' and insert the letter 'm', which takes 3 edits (replacing 'u' with 'm' and deleting 's' achieves the same conversion in 2 edits, which is the minimum).
  • The fewer edits the conversion requires, the more similar the two strings are.
  • Further, this application presets an edit function edit[i][j], which represents the number of edits from a character string of length i in the character set to a string of length j in the target string set; the number of edits is the distance length value.
  • edit[0][0] means that both the character string and the target string are empty; 0 edits are needed, so the distance length is 0.
  • edit[0][j] means that the character string is empty and the target string has length j; j characters must be added, so j edits are needed and the distance length is j.
  • edit[i][0] means that the character string has length i and the target string has length 0; i characters must be deleted, so i edits are needed and the distance length is i.
  • This application computes the value of the edit function edit[i][j] according to a preset dynamic programming formula, and derives the degree of similarity between characters from that value.
  • Preferably, this application sorts the similar characters in descending order of similarity to build the similar character table.
  • Further, the preset dynamic programming formula computes the value of the edit function edit[i][j] as follows:
  • when i = 0 and j = 0, edit[i][j] = 0;
  • when i = 0 and j > 0, edit[i][j] = j;
  • when i > 0 and j = 0, edit[i][j] = i;
  • when i ≥ 1 and j ≥ 1, edit[i][j] = min{edit[i-1][j] + 1, edit[i][j-1] + 1, edit[i-1][j-1] + f[i][j]}, where f[i][j] = 1 if the i-th character of the character string differs from the j-th character of the target string, and f[i][j] = 0 if they are equal.
  • For example, with i = 1, j = 1, and differing characters, edit[1][1] = min(edit[0][1] + 1, edit[1][0] + 1, edit[0][0] + f[1][1]) = min(2, 2, 1) = 1.
  • Following the same principle, this application obtains in turn the minimum number of edits for edit[2][1], edit[3][1], ..., edit[i][1], ..., edit[i][j].
  • Step 5. Receive the structured form text to be processed, match the characters extracted from it against the similar character table, output the character with the highest degree of matching with the extracted characters, and thereby complete character recognition of the structured form text to be processed.
  • A preferred embodiment of this application uses the OCR described above to extract characters from the structured form text to be processed, and looks them up by traversing the similar character table established above, so as to output the character with the highest degree of matching and complete character recognition of the structured form text to be processed.
  • the structured form text to be processed in this application and the structured form text set obtained above belong to the same business category, for example, both belong to the business category of issuing invoices.
  • the traversal refers to visiting each node in the similar character table sequentially along a certain search route.
  • Optionally, in other embodiments, the character recognition program can also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to carry out this application.
  • the module referred to in this application refers to a series of computer program instruction segments that can complete specific functions, and is used to describe the execution process of the character recognition program in the character recognition device.
  • FIG. 3 is a schematic diagram of the program modules of the character recognition program in an embodiment of the character recognition apparatus of the present application.
  • In this embodiment, the character recognition program can be divided into a character extraction module 10, a string establishment module 20, a matching module 30, and a recognition module 40. Exemplarily:
  • the character extraction module 10 is used to obtain a structured form text set, and perform character extraction on the structured form text set by an optical character recognition method to obtain a character set.
  • The string establishment module 20 is used to perform a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization, and to build a dictionary tree for the target text set to obtain a target string set.
  • The matching module 30 is used to match the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table.
  • The recognition module 40 is used to receive the structured form text to be processed, match the characters extracted from it against the similar character table, output the character with the highest degree of matching with the extracted characters, and thereby complete character recognition of the structured form text to be processed.
  • In addition, an embodiment of this application also proposes a computer-readable storage medium storing a character recognition program, where the character recognition program can be executed by one or more processors to implement the following operations:
  • obtaining a structured form text set, and performing character extraction on the structured form text set by an optical character recognition method to obtain a character set;
  • performing a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization;
  • building a dictionary tree for the target text set to obtain a target string set;
  • matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table;
  • receiving the structured form text to be processed, matching the characters extracted from it against the similar character table, outputting the character with the highest degree of matching with the extracted characters, and thereby completing character recognition of the structured form text to be processed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Character Discrimination (AREA)

Abstract

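This application relates to artificial intelligence technology and discloses a character recognition method, including: obtaining a structured form text set, and performing character extraction on the structured form text set to obtain a character set; performing a preprocessing operation on the structured form text set to obtain a target text set; building a dictionary tree for the target text set to obtain a target string set; matching the character set against the target string set one by one to obtain a similar character table; and receiving structured form text to be processed, matching the characters extracted from it against the similar character table, outputting the character with the highest degree of matching with the extracted characters, and thereby completing character recognition of the structured form text to be processed. This application also proposes a character recognition device and a computer-readable storage medium. This application achieves accurate character recognition.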

Description

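Character recognition method, device, and computer-readable storage medium
This application claims priority to Chinese patent application No. 201910846707.7, entitled "Character recognition method, device, and computer-readable storage medium" and filed with the Chinese Patent Office on September 6, 2019, the entire contents of which are incorporated herein by reference.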
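Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a deep-learning-based character recognition method, device, and computer-readable storage medium.
Background
At present, deep-learning-based OCR image recognition is prone to misrecognizing similar-looking characters, such as O and 0, or I and L. If even one character in a field is misrecognized, recognition of the whole field fails, which greatly reduces accuracy. This also makes later manual verification very inconvenient and reduces work efficiency.
Summary
This application provides a character recognition method, device, and computer-readable storage medium, whose main purpose is to present accurate recognition results to the user when the user performs character recognition.
To achieve the above purpose, a character recognition method provided by this application includes: obtaining a structured form text set, and performing character extraction on the structured form text set by an optical character recognition method to obtain a character set; performing a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization; building a dictionary tree for the target text set to obtain a target string set; matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table; and receiving structured form text to be processed, matching the characters extracted from it against the similar character table, outputting the character with the highest degree of matching with the extracted characters, and thereby completing character recognition of the structured form text to be processed.
In addition, to achieve the above purpose, this application also provides a character recognition device. The device includes a memory and a processor; the memory stores a character recognition program that can run on the processor, and when the character recognition program is executed by the processor, the following steps are implemented: obtaining a structured form text set, and performing character extraction on the structured form text set by an optical character recognition method to obtain a character set; performing a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization; building a dictionary tree for the target text set to obtain a target string set; matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table; and receiving structured form text to be processed, matching the characters extracted from it against the similar character table, outputting the character with the highest degree of matching with the extracted characters, and thereby completing character recognition of the structured form text to be processed.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium storing a character recognition program, where the character recognition program can be executed by one or more processors to implement the steps of the character recognition method described above.
With the character recognition method, device, and computer-readable storage medium proposed in this application, when a user performs character recognition on structured form text, the characters extracted from the structured form text are looked up by traversing the established similar character table, and the character with the highest degree of matching with the extracted characters is output, so that accurate recognition results can be presented to the user.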
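Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a character recognition method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the internal structure of a character recognition device provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the modules of the character recognition program in a character recognition device provided by an embodiment of this application.
The realization of the purpose, the functional characteristics, and the advantages of this application are further described below in conjunction with the embodiments and with reference to the drawings.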
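Detailed Description
It should be understood that the specific embodiments described here are only intended to explain this application and do not limit it.
This application provides a character recognition method. Referring to FIG. 1, a schematic flowchart of a character recognition method provided by an embodiment of this application is shown. The method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the character recognition method includes:
S1. Obtain a structured form text set, and perform character extraction on the structured form text set to obtain a character set.
In a preferred embodiment of this application, the structured form text set may be generated by business activity. For example, the structured form text set can be obtained in two ways: method one, from data produced by enterprise employees in the course of business, such as the invoice text data issued each month by the finance department staff of Ping An; method two, by keyword search in a search engine.
Further, this application performs character extraction on the structured form text set by optical character recognition (OCR). OCR refers to the technology of optically converting the text of a paper document into a black-and-white dot-matrix image file, and then using recognition software to convert the text in the image into a text format that word-processing software can further edit.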
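S2. Perform a preprocessing operation on the structured form text set to obtain a target text set.
In a preferred embodiment of this application, the preprocessing operation includes word segmentation, encoding, and normalization.
In detail, the preprocessing operation includes: using natural language processing technology to perform word segmentation on the structured form text set to obtain a string set of the structured form text set; converting the string set into numeric form by an encoding technique; and normalizing the encoded string set to obtain the target text set. Normalization maps the encoded string set into the interval (0, 1), which makes the data easier to extract. Preferably, this application uses the Natural Language Toolkit (nltk) to perform word segmentation, uses one-hot encoding (One Hot Encoder) to convert the string set into numeric values, and uses the feature normalization (Normalizer) algorithm to perform normalization.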
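The three preprocessing stages can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the regular-expression tokenizer stands in for nltk word segmentation, the per-form vector is assumed to be the sum of its tokens' one-hot vectors, and L2 scaling stands in for the Normalizer step. The helper names (`tokenize`, `build_vocabulary`, `encode_document`) and the sample forms are hypothetical and do not come from the application.

```python
import math
import re


def tokenize(text):
    """Split form text into lowercase ASCII word tokens (stand-in for nltk)."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a column index."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def one_hot(token, vocab):
    """Return the one-hot vector for a single token."""
    vec = [0.0] * len(vocab)
    vec[vocab[token]] = 1.0
    return vec


def encode_document(text, vocab):
    """Sum the one-hot vectors of a form's tokens, then L2-normalize the sum."""
    counts = [0.0] * len(vocab)
    for tok in tokenize(text):
        for i, v in enumerate(one_hot(tok, vocab)):
            counts[i] += v
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


# Hypothetical sample forms, used only to exercise the sketch.
forms = ["Invoice total 100", "Invoice date 2019"]
vocab = build_vocabulary(forms)
vectors = [encode_document(f, vocab) for f in forms]
```

After L2 scaling, every component lies between 0 and 1 inclusive, which roughly corresponds to the (0, 1) mapping described above; a different normalization would be needed if the endpoints must be excluded.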
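S3. Build a dictionary tree for the target text set to obtain a target string set.
In a preferred embodiment of this application, the specific steps for building the dictionary tree include: taking the text set obtained above as input, and presetting any string in the target text set as the root of the target text set; selecting the strings in the target text set whose distance from the root equals a preset distance length to obtain a node string set, and establishing the child nodes of the root; and, based on the root and its child nodes, iterating over the target text set and computing string distances to obtain each node of the dictionary tree, thereby obtaining the target string set. For example, the preset root may be the string GAME, and the preset distance length may be 1 and/or 2.
Further, this application computes the edit distance between each string in the target text set and the root to obtain a distance length. When the distance length is the preset 1 or 2 and occurs at the root node for the first time, a new child node is created; when the distance length is not the preset 1 or 2, the string is passed down recursively along the corresponding edge, and the distance-length computation is iterated over the strings of the target text set to obtain each node of the dictionary tree, thereby obtaining the target string set. For example, for the string FAME in the target text set, the distance between FAME and the preset root GAME is computed to be 1, so a new child node is created under the root and connected by an edge labeled 1. When the string GAIN is inserted, its distance to the root GAME is computed to be 2, so a new child node is created under the root and connected by an edge labeled 2. When the string GATE is inserted, its distance to the root GAME is computed to be 1, so it is recursively inserted along the edge labeled 1 into the subtree containing FAME; the distance between GATE and FAME is 2, so GATE is placed under the FAME node with an edge labeled 2. In the same way, the distance length is computed for each inserted string in turn, yielding the target string set.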
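The GAME / FAME / GAIN / GATE example can be reproduced with a small sketch. The structure described here, with edges labeled by edit distance, resembles what is commonly called a BK-tree. The class and function names below are hypothetical, and the sketch assumes that a string descends whenever an edge with its distance label already exists; it does not restrict edge labels to 1 or 2 as the preset distance length in the text does.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class Node:
    """One tree node: a stored string plus children keyed by edge label."""

    def __init__(self, word):
        self.word = word
        self.children = {}  # edge label (edit distance) -> child Node


def insert(root, word):
    """Insert word, following the edge labeled with its distance to each node."""
    node = root
    while True:
        d = edit_distance(word, node.word)
        if d == 0:
            return  # the string is already stored
        child = node.children.get(d)
        if child is None:
            node.children[d] = Node(word)  # first string at this distance
            return
        node = child  # edge already taken: descend into that subtree


root = Node("GAME")
for w in ["FAME", "GAIN", "GATE"]:
    insert(root, w)
```

With these insertions, FAME hangs off the root on edge 1, GAIN on edge 2, and GATE ends up under FAME on edge 2, matching the walk-through above.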
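S4. Match the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar character table.
In a preferred embodiment of this application, the minimum edit distance is the minimum number of edits needed to convert one string into another. The algorithm is built on three edit operations: inserting a character, deleting a character, and modifying a character. For example, for the strings 'home' and 'house', one way to convert 'house' into 'home' is to delete the two letters 'u' and 's' from 'house' and then insert the letter 'm', which takes 3 edits (replacing 'u' with 'm' and deleting 's' achieves the same conversion in 2 edits, which is the minimum). The fewer edits the conversion requires, the more similar the two strings are.
Further, this application presets an edit function edit[i][j], which represents the number of edits from a character string of length i in the character set to a string of length j in the target string set; the number of edits is the distance length value. edit[0][0] means that both the character string and the target string are empty; 0 edits are needed, so the distance length is 0. edit[0][j] means that the character string is empty and the target string has length j; j characters must be added, so j edits are needed and the distance length is j. edit[i][0] means that the character string has length i and the target string has length 0; i characters must be deleted, so i edits are needed and the distance length is i. This application then computes the value of the edit function edit[i][j] according to a preset dynamic programming formula, and derives the degree of similarity between characters from that value. Preferably, this application sorts the similar characters in descending order of similarity to build the similar character table.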
进一步地,本申请中所述预设的动态规划公式计算出所述编辑函数edit[i][j]的值包括:
当i=0且j=0时,所述edit[i][j]=0;
当i=0且j>0,所述edit[i][j]=j;
当i>0且j=0,所述edit[i][j]=i;
当i≥1且j≥1,所述edit[i][j]==min{edit[i-1][j]+1,edit[i][j-1]+1,edit[i-1][j-1]+f[i][j]},其中,若所述i的字符不等于所述为j的字串时,f[i][j]=1,若所述i的字符等于所述j的字串,f[i][j]=0。
较佳地,本申请以i=1,j=1为实例,计算edit[1][1]的最小编辑次数:
已知:edit[1][1];
计算:edit[0][1]+1==2,edit[1][0]+1==2,edit[0][0]+f[1][1]==0+1==1;
得到:min(edit[0][1],edit[1][0],edit[0][0]+f[1][1])==1;
结果:edit[1][1]==1。
进一步地,本申请根据所述edit[1][1]的计算原理,同理依次得到edit[2][1]、edit[3][1]、edit[i][1]......edit[i][j]的最小编辑次数。
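The boundary conditions and the recurrence for edit[i][j] translate directly into a table-filling routine. The following is a minimal sketch; the function names `edit_table` and `edit_distance` are assumptions made for illustration, while the array name `edit` and the cost term `f` follow the notation of the text.

```python
def edit_table(source, target):
    """Build the full table edit[i][j] for the prefixes source[:i] and target[:j]."""
    rows, cols = len(source) + 1, len(target) + 1
    edit = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        edit[i][0] = i  # i > 0, j = 0: delete i characters
    for j in range(cols):
        edit[0][j] = j  # i = 0, j > 0: insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            # f is 0 when the i-th and j-th characters are equal, otherwise 1.
            f = 0 if source[i - 1] == target[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1,      # deletion
                             edit[i][j - 1] + 1,      # insertion
                             edit[i - 1][j - 1] + f)  # match or replacement
    return edit


def edit_distance(source, target):
    """Return the minimum number of edits, i.e. the bottom-right table entry."""
    return edit_table(source, target)[len(source)][len(target)]
```

With two one-character strings that differ, such as `edit_table("a", "b")`, the entry `edit[1][1]` is 1, which agrees with the worked example for i = 1, j = 1. The table costs O(i × j) time and memory; keeping only the previous row would reduce memory if the full table is not needed.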
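S5. Receive a structured form text to be processed, match the characters extracted from it against the similar-character table, and output the character with the highest degree of match to the extracted character, completing character recognition for the structured form text to be processed.
In a preferred embodiment, the present application uses the OCR described above to extract characters from the structured form text to be processed and traverses the similar-character table built above to look them up, thereby outputting the character with the highest degree of match and completing character recognition for the structured form text to be processed. The structured form text to be processed and the structured form text set obtained above belong to the same business category; for example, both belong to invoice issuance. Traversal means visiting every node of the similar-character table exactly once along a search path.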
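One way to picture the similar-character table and the final lookup is as a mapping from each OCR-extracted string to its candidate target strings, ordered from most to least similar. The sketch below rests on assumptions: the table layout, the function names `build_similarity_table` and `best_match`, the sample strings, and the fallback that ranks unseen input on the fly are all hypothetical and not taken from the text.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_similarity_table(extracted_strings, target_strings):
    """Map each extracted string to (target, distance) pairs, closest first."""
    table = {}
    for s in extracted_strings:
        # A smaller edit distance means a higher similarity, so sort ascending.
        ranked = sorted((edit_distance(s, t), t) for t in target_strings)
        table[s] = [(t, d) for d, t in ranked]
    return table


def best_match(extracted, table, target_strings):
    """Return the closest target string, ranking on the fly for unseen input."""
    candidates = table.get(extracted)
    if candidates is None:
        candidates = build_similarity_table([extracted], target_strings)[extracted]
    return candidates[0][0] if candidates else None


# Hypothetical sample data: OCR output with the letter 'o' misread as digit '0'.
targets = ["invoice", "amount", "total"]
table = build_similarity_table(["inv0ice", "t0tal"], targets)
```

Here `best_match("inv0ice", table, targets)` resolves the misread string to `invoice`. Ties between equally distant targets are broken alphabetically in this sketch, which is an arbitrary choice.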
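The present application also provides a character recognition device. Fig. 2 is a schematic diagram of the internal structure of a character recognition device according to an embodiment of the present application.
In this embodiment, the character recognition device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, tablet computer, or portable computer, or a server. The character recognition device 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments the memory 11 may be an internal storage unit of the character recognition device 1, such as its hard disk. In other embodiments the memory 11 may be an external storage device of the character recognition device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the device. Further, the memory 11 may include both an internal storage unit and an external storage device of the character recognition device 1. The memory 11 can be used not only to store application software installed on the character recognition device 1 and various kinds of data, such as the code of the character recognition program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run program code stored in the memory 11 or to process data, for example to execute the character recognition program 01.
The communication bus 13 is used to implement connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display information processed in the character recognition device 1 and to display a visual user interface.
Fig. 2 shows only the character recognition device 1 with components 11-14 and the character recognition program 01. Those skilled in the art will understand that the structure shown in Fig. 2 does not limit the character recognition device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the device 1 shown in Fig. 2, the memory 11 stores the character recognition program 01; when the processor 12 executes the character recognition program 01 stored in the memory 11, the following steps are implemented: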
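Step 1. Obtain a structured form text set and perform character extraction on the structured form text set to obtain a character set.
In a preferred embodiment of the present application, the structured form text set may be generated from business operations. For example, the structured form text set is obtained in one of two ways: (1) from data produced by enterprise employees in the course of business, such as the invoice text data issued each month by the finance staff of Ping An of China (中国平安); (2) by keyword search in a search engine.
Further, the present application performs character extraction on the structured form text set by optical character recognition (OCR). OCR refers to a technique that optically converts the text of a paper document into a black-and-white dot-matrix image file and then uses recognition software to convert the text in the image into a text format that word-processing software can further edit.
Step 2. Perform a preprocessing operation on the structured form text set to obtain a target text set.
In a preferred embodiment of the present application, the preprocessing operation includes word segmentation, encoding, and normalization.
In detail, the preprocessing operation includes: performing word segmentation on the structured form text set using natural language processing techniques to obtain a string set of the structured form text set; converting the string set into numerical form by an encoding technique; and normalizing the encoded string set to obtain the target text set. The normalization maps the encoded string set into the interval (0, 1), which facilitates data extraction. Preferably, the present application performs word segmentation with the Natural Language Toolkit (NLTK), converts the string set to numerical values with one-hot encoding (One Hot Encoder), and performs normalization with a feature normalization (Normalizer) algorithm.
Step 3. Build a dictionary tree for the target text set to obtain a target string set.
In a preferred embodiment of the present application, building the dictionary tree includes the following steps: taking the target text set obtained above as input and presetting any one string in the target text set as the root of the target text set; selecting the strings in the target text set whose distance from the root equals a preset distance length to obtain a node string set, and creating the child nodes of the root; and, based on the root and its child nodes, iterating over the target text set and computing the string distances to obtain every node of the dictionary tree, thereby obtaining the target string set. For example, the preset root may be the string GAME, and the preset distance length may be 1 and/or 2.
Further, the present application computes the edit distance between each string in the target text set and the root to obtain a distance length. When the distance length is the preset 1 or 2 and occurs at the root node for the first time, a new child node is created; otherwise, that is, when an edge with the same distance length already exists at the node, the string is passed recursively along the corresponding edge, and the distance computation is repeated for the strings of the target text set until every node of the dictionary tree is obtained, thereby obtaining the target string set. For example, for the string FAME in the target text set, the distance between FAME and the preset root GAME is computed as 1, so a new child node is created below the root and connected by an edge labeled 1. When the string GAIN is inserted, its distance from the root GAME is computed as 2, so a new child node is created below the root and connected by an edge labeled 2. When the string GATE is inserted, its distance from the root GAME is computed as 1, so it is inserted recursively along the edge labeled 1 into the subtree containing FAME; the distance between GATE and FAME is 2, so the present application places GATE under the FAME node with an edge labeled 2. In the same way, the distance length is computed for each inserted string in turn, which yields the target string set.
Step 4. Match the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar-character table.
In a preferred embodiment of the present application, the minimum edit distance algorithm determines the minimum number of edits required to convert one string into another. The edit operations on which the algorithm is based are: inserting a character, deleting a character, and replacing a character. For example, to convert the string 'house' into 'home', deleting the characters 'u' and 's' and then adding 'm' takes 3 edits, whereas replacing 'u' with 'm' and deleting 's' takes only 2 edits, which is the minimum edit distance. The fewer edits a conversion requires, the more similar the two strings are.
Further, the present application presets an edit function edit[i][j], which denotes the number of edits from a character string of length i in the character set to a string of length j in the target string set; this number of edits is the distance length. edit[0][0] means that both are empty, so 0 edits are needed and the distance length is 0. edit[0][j] means that the character string is empty and the target string has length j, so j characters must be added, j edits are needed, and the distance length is j. edit[i][0] means that the character string has length i and the target string has length 0, so i characters must be deleted, i edits are needed, and the distance length is i. The present application then computes the value of edit[i][j] according to a preset dynamic programming formula and derives the degree of similarity between characters from that value. Preferably, the present application sorts the similar characters in descending order of similarity to build the similar-character table.
Further, the preset dynamic programming formula for computing the value of the edit function edit[i][j] is as follows:
when i = 0 and j = 0, edit[i][j] = 0;
when i = 0 and j > 0, edit[i][j] = j;
when i > 0 and j = 0, edit[i][j] = i;
when i ≥ 1 and j ≥ 1, edit[i][j] = min{edit[i-1][j] + 1, edit[i][j-1] + 1, edit[i-1][j-1] + f[i][j]}, where f[i][j] = 1 if the i-th character of the character string differs from the j-th character of the target string, and f[i][j] = 0 if they are equal.
Preferably, the present application takes i = 1, j = 1 as an example and computes the minimum number of edits edit[1][1], assuming the two first characters differ so that f[1][1] = 1:
To compute: edit[1][1], given edit[0][0] = 0, edit[0][1] = 1, and edit[1][0] = 1;
Compute: edit[0][1] + 1 = 2, edit[1][0] + 1 = 2, edit[0][0] + f[1][1] = 0 + 1 = 1;
Obtain: min(edit[0][1] + 1, edit[1][0] + 1, edit[0][0] + f[1][1]) = 1;
Result: edit[1][1] = 1.
Further, following the same principle as for edit[1][1], the present application obtains in turn the minimum number of edits for edit[2][1], edit[3][1], ..., edit[i][1], ..., edit[i][j].
Step 5. Receive a structured form text to be processed, match the characters extracted from it against the similar-character table, and output the character with the highest degree of match to the extracted character, completing character recognition for the structured form text to be processed.
In a preferred embodiment, the present application uses the OCR described above to extract characters from the structured form text to be processed and traverses the similar-character table built above to look them up, thereby outputting the character with the highest degree of match and completing character recognition for the structured form text to be processed. The structured form text to be processed and the structured form text set obtained above belong to the same business category; for example, both belong to invoice issuance. Traversal means visiting every node of the similar-character table exactly once along a search path.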
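Optionally, in other embodiments, the character recognition program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present application. A module in the present application is a series of computer program instruction segments capable of performing a specific function, and is used to describe the execution of the character recognition program in the character recognition device.
For example, Fig. 3 is a schematic diagram of the program modules of the character recognition program in an embodiment of the character recognition device of the present application. In this embodiment, the character recognition program may be divided into a character extraction module 10, a string building module 20, a matching module 30, and a recognition module 40. By way of example:
The character extraction module 10 is configured to: obtain a structured form text set and perform character extraction on the structured form text set by optical character recognition to obtain a character set.
The string building module 20 is configured to: perform a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization; and build a dictionary tree for the target text set to obtain a target string set.
The matching module 30 is configured to: match the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar-character table.
The recognition module 40 is configured to: receive a structured form text to be processed, match the characters extracted from it against the similar-character table, and output the character with the highest degree of match to the extracted character, completing character recognition for the structured form text to be processed.
The functions or operation steps implemented when program modules such as the character extraction module 10, the string building module 20, the matching module 30, and the recognition module 40 are executed are substantially the same as in the above embodiments and are not repeated here.
In addition, an embodiment of the present application also provides a computer-readable storage medium on which a character recognition program is stored. The character recognition program can be executed by one or more processors to implement the following operations:
obtaining a structured form text set, and performing character extraction on the structured form text set by optical character recognition to obtain a character set;
performing a preprocessing operation on the structured form text set to obtain a target text set, where the preprocessing operation includes word segmentation, encoding, and normalization;
building a dictionary tree for the target text set to obtain a target string set;
matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain a similar-character table;
receiving a structured form text to be processed, matching the characters extracted from it against the similar-character table, and outputting the character with the highest degree of match to the extracted character, completing character recognition for the structured form text to be processed.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as the embodiments of the character recognition device and method described above and is not repeated here.
It should be noted that the numbering of the above embodiments of the present application is for description only and does not indicate the relative merits of the embodiments. The terms "include", "comprise", and any other variants used herein are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, device, article, or method that includes the element.
From the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present application.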

Claims (20)

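  1. A character recognition method, characterized in that the method comprises:
    obtaining a structured form text set, and performing character extraction on the structured form text set by optical character recognition to obtain a character set;
    performing a preprocessing operation on the structured form text set to obtain a target text set, wherein the preprocessing operation comprises word segmentation, encoding, and normalization;
    building a dictionary tree for the target text set to obtain a target string set;
    matching the character set against the target string set one by one using a minimum edit distance algorithm to obtain a similar-character table;
    receiving a structured form text to be processed, matching characters extracted from the structured form text to be processed against the similar-character table, and outputting the character with the highest degree of match to the extracted character, thereby completing character recognition for the structured form text to be processed.
  2. The character recognition method according to claim 1, characterized in that performing the preprocessing operation on the structured form text set to obtain the target text set comprises:
    performing word segmentation on the structured form text set using natural language processing techniques to obtain a string set of the structured form text set, converting the string set into numerical form by an encoding technique, and normalizing the encoded string set to obtain the target text set.
  3. The character recognition method according to claim 1, characterized in that building the dictionary tree for the target text set to obtain the target string set comprises:
    presetting any one string in the target text set as the root of the target text set;
    selecting the strings in the target text set whose distance from the root equals a preset distance length to obtain a node string set, and creating child nodes of the root;
    iterating, based on the root and the child nodes of the root, over the strings of the target text set and computing their distance lengths to obtain every node of the dictionary tree, thereby obtaining the target string set.
  4. The character recognition method according to claim 1, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  5. The character recognition method according to claim 2, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  6. The character recognition method according to claim 3, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  7. The character recognition method according to any one of claims 4-6, characterized in that computing the value of the edit function edit[i][j] using the preset dynamic programming formula comprises:
    when i = 0 and j = 0, edit[i][j] = 0;
    when i = 0 and j > 0, edit[i][j] = j;
    when i > 0 and j = 0, edit[i][j] = i;
    when i ≥ 1 and j ≥ 1, edit[i][j] = min{edit[i-1][j] + 1, edit[i][j-1] + 1, edit[i-1][j-1] + f[i][j]}, where f[i][j] = 1 if the i-th character of the character string differs from the j-th character of the target string, and f[i][j] = 0 if they are equal.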
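  8. A character recognition device, characterized in that the device comprises a memory and a processor, the memory stores a character recognition program executable on the processor, and the character recognition program, when executed by the processor, implements the following steps:
    obtaining a structured form text set, and performing character extraction on the structured form text set by optical character recognition to obtain a character set;
    performing a preprocessing operation on the structured form text set to obtain a target text set, wherein the preprocessing operation comprises word segmentation, encoding, and normalization;
    building a dictionary tree for the target text set to obtain a target string set;
    matching the character set against the target string set one by one using a minimum edit distance algorithm to obtain a similar-character table;
    receiving a structured form text to be processed, matching characters extracted from the structured form text to be processed against the similar-character table, and outputting the character with the highest degree of match to the extracted character, thereby completing character recognition for the structured form text to be processed.
  9. The character recognition device according to claim 8, characterized in that performing the preprocessing operation on the structured form text set to obtain the target text set comprises:
    performing word segmentation on the structured form text set using natural language processing techniques to obtain a string set of the structured form text set, converting the string set into numerical form by an encoding technique, and normalizing the encoded string set to obtain the target text set.
  10. The character recognition device according to claim 8, characterized in that building the dictionary tree for the target text set to obtain the target string set comprises:
    presetting any one string in the target text set as the root of the target text set;
    selecting the strings in the target text set whose distance from the root equals a preset distance length to obtain a node string set, and creating child nodes of the root;
    iterating, based on the root and the child nodes of the root, over the strings of the target text set and computing their distance lengths to obtain every node of the dictionary tree, thereby obtaining the target string set.
  11. The character recognition device according to claim 8, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  12. The character recognition device according to claim 9, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  13. The character recognition device according to claim 10, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  14. The character recognition device according to any one of claims 11-13, characterized in that computing the value of the edit function edit[i][j] using the preset dynamic programming formula comprises:
    when i = 0 and j = 0, edit[i][j] = 0;
    when i = 0 and j > 0, edit[i][j] = j;
    when i > 0 and j = 0, edit[i][j] = i;
    when i ≥ 1 and j ≥ 1, edit[i][j] = min{edit[i-1][j] + 1, edit[i][j-1] + 1, edit[i-1][j-1] + f[i][j]}, where f[i][j] = 1 if the i-th character of the character string differs from the j-th character of the target string, and f[i][j] = 0 if they are equal.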
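  15. A computer-readable storage medium, characterized in that a character recognition program is stored on the computer-readable storage medium, and the character recognition program can be executed by one or more processors to implement the following steps:
    obtaining a structured form text set, and performing character extraction on the structured form text set by optical character recognition to obtain a character set;
    performing a preprocessing operation on the structured form text set to obtain a target text set, wherein the preprocessing operation comprises word segmentation, encoding, and normalization;
    building a dictionary tree for the target text set to obtain a target string set;
    matching the character set against the target string set one by one using a minimum edit distance algorithm to obtain a similar-character table;
    receiving a structured form text to be processed, matching characters extracted from the structured form text to be processed against the similar-character table, and outputting the character with the highest degree of match to the extracted character, thereby completing character recognition for the structured form text to be processed.
  16. The computer-readable storage medium according to claim 15, characterized in that performing the preprocessing operation on the structured form text set to obtain the target text set comprises:
    performing word segmentation on the structured form text set using natural language processing techniques to obtain a string set of the structured form text set, converting the string set into numerical form by an encoding technique, and normalizing the encoded string set to obtain the target text set.
  17. The computer-readable storage medium according to claim 16, characterized in that building the dictionary tree for the target text set to obtain the target string set comprises:
    presetting any one string in the target text set as the root of the target text set;
    selecting the strings in the target text set whose distance from the root equals a preset distance length to obtain a node string set, and creating child nodes of the root;
    iterating, based on the root and the child nodes of the root, over the strings of the target text set and computing their distance lengths to obtain every node of the dictionary tree, thereby obtaining the target string set.
  18. The computer-readable storage medium according to claim 16, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  19. The computer-readable storage medium according to claim 17, characterized in that matching the character set against the target string set one by one using the minimum edit distance algorithm to obtain the similar-character table comprises:
    presetting an edit function edit[i][j], wherein the edit function edit[i][j] denotes the distance length from a character string of length i in the character set to a string of length j in the target string set;
    computing the value of the edit function edit[i][j] using a preset dynamic programming formula, and obtaining the similar-character table from the value of the edit function edit[i][j].
  20. The computer-readable storage medium according to any one of claims 18-19, characterized in that computing the value of the edit function edit[i][j] using the preset dynamic programming formula comprises:
    when i = 0 and j = 0, edit[i][j] = 0;
    when i = 0 and j > 0, edit[i][j] = j;
    when i > 0 and j = 0, edit[i][j] = i;
    when i ≥ 1 and j ≥ 1, edit[i][j] = min{edit[i-1][j] + 1, edit[i][j-1] + 1, edit[i-1][j-1] + f[i][j]}, where f[i][j] = 1 if the i-th character of the character string differs from the j-th character of the target string, and f[i][j] = 0 if they are equal.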
PCT/CN2019/117287 2019-09-06 2019-11-12 Character recognition method, device, and computer-readable storage medium WO2021042527A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910846707.7A CN110738202A (zh) 2019-09-06 2019-09-06 Character recognition method, device, and computer-readable storage medium
CN201910846707.7 2019-09-06

Publications (1)

Publication Number Publication Date
WO2021042527A1 true WO2021042527A1 (zh) 2021-03-11

Family

ID=69267538

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117287 WO2021042527A1 (zh) 2019-09-06 2019-11-12 Character recognition method, device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110738202A (zh)
WO (1) WO2021042527A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705167A (zh) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 字符校验方法、装置、设备及存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782892B (zh) * 2020-06-30 2023-09-19 中国平安人寿保险股份有限公司 基于前缀树的相似字符识别方法、设备、装置和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070172124A1 (en) * 2006-01-23 2007-07-26 Withum Timothy O Modified levenshtein distance algorithm for coding
US9659224B1 (en) * 2014-03-31 2017-05-23 Amazon Technologies, Inc. Merging optical character recognized text from frames of image data
CN107220639A (zh) * 2017-04-14 2017-09-29 北京捷通华声科技股份有限公司 Ocr识别结果的纠正方法和装置
CN108563685A (zh) * 2018-03-13 2018-09-21 阿里巴巴集团控股有限公司 一种银行标识代码的查询方法、装置及设备
CN109582972A (zh) * 2018-12-27 2019-04-05 信雅达系统工程股份有限公司 一种基于自然语言识别的光学字符识别纠错方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399907A (zh) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 一种基于编辑距离计算中文字符串相似度的方法及装置
CN108304378B (zh) * 2018-01-12 2019-09-24 深圳壹账通智能科技有限公司 文本相似度计算方法、装置、计算机设备和存储介质
CN109657738B (zh) * 2018-10-25 2024-04-30 平安科技(深圳)有限公司 字符识别方法、装置、设备及存储介质
CN110147433B (zh) * 2019-05-21 2021-01-29 北京鸿联九五信息产业有限公司 一种基于字典树的文本模板提取方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705167A (zh) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 字符校验方法、装置、设备及存储介质
CN113705167B (zh) * 2021-08-31 2024-04-19 中科软科技股份有限公司 字符校验方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN110738202A (zh) 2020-01-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19944325

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19944325

Country of ref document: EP

Kind code of ref document: A1