WO2015136692A1 - Digital image document editing system - Google Patents

Info

Publication number
WO2015136692A1
WO2015136692A1 PCT/JP2014/056927 JP2014056927W WO2015136692A1 WO 2015136692 A1 WO2015136692 A1 WO 2015136692A1 JP 2014056927 W JP2014056927 W JP 2014056927W WO 2015136692 A1 WO2015136692 A1 WO 2015136692A1
Authority
WO
WIPO (PCT)
Prior art keywords
character string
electronic image
image document
recognized
editing system
Prior art date
Application number
PCT/JP2014/056927
Other languages
French (fr)
Japanese (ja)
Inventor
久雄 間瀬
義行 小林
新庄 広
竜治 嶺
高橋 寿一
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所 filed Critical 株式会社日立製作所
Priority to PCT/JP2014/056927 priority Critical patent/WO2015136692A1/en
Priority to JP2016507228A priority patent/JPWO2015136692A1/en
Publication of WO2015136692A1 publication Critical patent/WO2015136692A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/53Processing of non-Latin text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/42Document-oriented image-based pattern recognition based on the type of document
    • G06V30/422Technical drawings; Geographical maps
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/28Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters

Definitions

  • the present invention relates to an electronic image document editing system.
  • Editing a document includes creating a new document, updating an existing document (addition, correction, deletion, etc.), proofreading character information in the document, and translating the character information.
  • design documents of past products often remain only in the form of electronic image documents.
  • the character string must first be recognized from the design document that is an electronic image document, and the recognized character string must then be edited.
  • In Patent Document 1, original image data read by the scanner unit 1 is input to the recognition processing unit 102 via the image processing unit 3 when recognition processing is set, and character recognition is performed.
  • a predetermined number of lines (the number of hits) of the translated words in the total number of recognized words are stored, and when the hit number of the target line and the preceding and following lines is equal to or less than the predetermined number, or the character code of the recognized word is
  • In order for a system for editing an electronic image document (hereinafter referred to as an electronic image document editing system) to edit character information in the electronic image document, it is necessary to first perform character recognition processing. That is, the electronic image document editing system needs to specify a region in which character information is described in the electronic image document, and to perform processing for specifying the character content described in the region.
  • the electronic image document editing system can accurately specify the description location and the description amount of the character information in the electronic image document.
  • the electronic image document editing system causes erroneous recognition due to the quality of the scanned paper, the resolution of the electronic image, the font type and size of the written characters, and the like in the character recognition process.
  • Patent Document 1 applies predetermined rules to a recognized character string obtained by character recognition processing and to its translation result (translated word search result), and when a rule matches, does not output the recognized character string and its translation result.
  • the rules mentioned in Patent Document 1 are the following two.
  • the first rule is: “the number of translated words among the total number of recognized words in one line (the hit number) is stored for a predetermined number of lines, and when the hit numbers of the target line and of the preceding and following lines are equal to or less than a predetermined number, drawing of the target line is stopped.”
  • the second rule is to determine whether or not the character code of each recognized word matches a certain pattern, such as a code other than a letter or the same code repeated a predetermined number of times. When the ratio of the number of pattern matches to the total number of recognized words in one line is equal to or greater than a predetermined value, drawing of the target line is stopped.
  • a recognized noise character string is, for example, a character string of one or more characters composed of kanji, hiragana, katakana, numbers, alphabetic letters, symbols, and the like.
  • the noise character string cannot be identified with high accuracy, and as a result, many noise character strings end up being drawn.
  • the technique described in Patent Document 1 uses a translation result (translated word search result) to identify whether or not the recognized character string is a drawing target character string. That is, in the technique described in Patent Document 1, when the first rule is used, it is necessary to perform translation processing even on a character string for which no translated word is output, and the processing load increases. For example, when the first rule is applied to an electronic image document editing system for editing work other than translation, a translation function that is not necessary for the original editing work must be installed in the electronic image document editing system, and the cost burden increases.
  • an object of the present invention is to identify and remove a noise character string with high accuracy from a recognized character string in an electronic image document. Another object of the present invention is to specify a noise character string from a recognized character string without performing a character string editing process.
  • An electronic image document editing system for editing a character string recognized from an electronic image document, including a processor and a storage device, wherein the storage device holds one or more character string determination criteria, which are criteria for determining whether or not a character string made up of one or more characters is a character string to be edited; the processor accepts an input of an electronic image document and recognizes a character string made up of one or more characters of a plurality of types in the input electronic image document; and when the recognized character string satisfies the character string determination criterion, the processor determines that the recognized character string is an edit target character string.
  • the character string determination criterion includes at least one of: a first determination criterion that the recognized character string consists of a first threshold or more characters (the first threshold is an integer of 2 or more); a second determination criterion that the recognized character string includes a partial character string composed of a second threshold or more characters of a first type group that is a part of the plurality of types (the second threshold is an integer of 2 or more); a third determination criterion that the recognized character string includes a character of a second type group that is a part of the plurality of types; and a fourth determination criterion that the recognized character string includes a content word.
  • a noise character string can be specified with high accuracy from a character string recognized by character recognition processing from an electronic image document.
  • a noise character string can be specified from a recognized character string with high accuracy without performing editing processing on the recognized character string.
  • An example of the system configuration of an electronic image document editing system is shown.
  • An example of a hardware configuration of an electronic image document editing system is shown.
  • An example of input electronic image document data is shown.
  • An example of the electronic image document data after translation processing is shown.
  • An example of the electronic image document data before character recognition processing is shown.
  • An example of the electronic image document data after character recognition processing is shown.
  • A first example of a character string determination criterion is shown.
  • A second example of a character string determination criterion is shown.
  • A third example of a character string determination criterion is shown.
  • An example of a character string information table is shown.
  • An example of a flowchart of the character string determination process performed by the character string determination unit when the value of Judge is the number of matching items is shown.
  • An example of a flowchart of the character string determination process performed by the character string determination unit when the value of Judge is the weight sum of matching items is shown.
  • An example of the list output screen of the character strings determined to be translation target character strings is shown.
  • An example of the screen for changing a character string determination criterion is shown.
  • An example of the output screen after the translation process is re-executed following a change to the character string determination criterion is shown.
  • the electronic image document editing system accepts an input of an electronic image document of a design drawing and performs an editing process on the input electronic image document.
  • the electronic image document editing system supports the work of translating a Japanese character string described in an electronic image document into an English character string as an example of editing processing.
  • the character string in this embodiment is composed of one or more characters.
  • the electronic image document editing system recognizes characters in an electronic image document such as a design drawing and extracts candidates for locations where character strings are described.
  • the electronic image document editing system identifies a part where a character string is actually described from among candidates for a part where the character string is described by a character string determination process described later.
  • the electronic image document editing system performs translation processing on the Japanese character string among the specified character strings, and presents translation candidates for the Japanese character string.
  • the electronic image document editing system generates a translation object for the translation selected by the user, corrects the layout of the translation object, and pastes it at an appropriate position on the document.
  • the design drawing is used as an example of the input electronic image document.
  • an electronic image diagram included in a text or a paper that is converted into an electronic image may be used as the input electronic image document.
  • this embodiment mainly describes the work of recognizing Japanese character strings and translating the recognized Japanese character strings into English character strings, but there are no particular restrictions on the source and target languages.
  • the work of translating a document is described, but the present invention can also be applied to other editing work such as document update and document proofreading.
  • FIG. 1 shows a configuration example of the electronic image document editing system of this embodiment.
  • the electronic image document editing system includes an input processing unit 1, an output processing unit 2, a character recognition processing unit 4, a character string determination unit 7, a translation processing unit 10, a translated word object generating unit 13, a translated word object editing unit 14, and a character string information management unit 16. Each unit described above is a program.
  • the electronic image document editing system also includes a translation target image document 3, a character recognition dictionary 5, a translation target image document 6 with character recognition results, a character string determination criterion 8, a character string information table 9, a translation dictionary 11, a translation word candidate table 12, an image document 15 with a character recognition result / translation result, and a word / character dictionary 17.
  • the input processing unit 1 accepts various data and operations designated or instructed by the user via input means such as a keyboard, a mouse, a touch panel, and a touch pen. As examples of specific data or operation instructions, the input processing unit 1 accepts selection of an electronic image document to be translated, an instruction to perform character recognition, changes to the contents of the character string determination criteria, designation of a character string to be translated, selection and input of a translated word, editing of a translated word object, and the like.
  • the output processing unit 2 outputs various data and processing results to the user via output means such as a display.
  • as examples of specific data or processing results, the output processing unit 2 outputs an image document to be translated, an image document to be translated with character recognition results, the character string determination criteria, a list of character string information to be translated, translation candidates, and an image document with character recognition results and translation results.
  • When using the electronic image document editing system of this embodiment, the user first selects an electronic image document to be translated from the electronic image documents input to the electronic image document editing system.
  • the content of the selected electronic image document is displayed to the user via a display or the like and is stored in the translation target image document 3.
  • the character recognition processing unit 4 extracts electronic image document data from the translation target image document 3 and performs character recognition with reference to a character recognition dictionary 5 that stores data relating to individual characters, rules for character recognition, and the like.
  • the character recognition process includes a character string area specifying process, a character cut-out process from the character string area, and a cut-out character recognition process. Since many character recognition algorithms used for character recognition processing are already widely known, description of character recognition processing is omitted. Note that the character recognition processing unit 4 may perform the character recognition process using any character recognition algorithm.
  • the character string recognized by the character recognition processing unit 4 is stored in the character string information table 9 together with the description location (coordinate position in the document image) of the character string. Further, the recognized character string is stored in the translation object image document 6 with the character recognition result in a form embedded in the document description portion of the translation object image document 3.
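  • The storage step above can be pictured with a minimal sketch. The record layout, the field names, and the rectangle form of the coordinates are assumptions made for illustration; the text only states that each recognized character string is stored together with its description location.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RecognizedString:
    """One hypothetical row of the character string information table 9."""
    text: str                         # character string produced by character recognition
    bbox: Tuple[int, int, int, int]   # assumed form: (left, top, right, bottom) in image coordinates
    is_target: bool = False           # filled in later by the character string determination unit 7
    done_flag: int = 0                # translation work completion flag (1 = completed)

def build_table(ocr_results) -> List[RecognizedString]:
    """Store every recognized string with the location where it was found."""
    return [RecognizedString(text=text, bbox=bbox) for text, bbox in ocr_results]

# Illustrative input: one real label and one noise string from a circuit symbol.
table = build_table([
    ("抵抗100Ω", (40, 20, 140, 38)),
    ("te W", (200, 90, 230, 110)),
])
```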
  • the character string determination unit 7 analyzes the character string recognized by the character recognition processing unit 4 and determines whether or not the recognized character string is a character string to be translated.
  • the character string determination unit 7 analyzes the character string with reference to a word / character dictionary 17, in which a list of characters and their attributes, and headwords of words and their attributes, are stored.
  • the character string determination unit 7 refers to a character string determination criterion item stored in the character string determination criterion 8 and determines whether the recognized character string is a character string to be translated. Details of the processing by the character string determination unit 7 and the character string determination reference 8 will be described later.
  • the determination result by the character string determination unit 7 is stored in the character string information table 9.
  • the user looks at the displayed translation target image document 6 with the character recognition result, designates a description portion corresponding to the character string via a mouse, a touch pen, etc., and instructs execution of translation.
  • any method may be used to designate the description location, for example clicking the description location, dragging over it, or selecting a rectangular range that includes it.
  • the translation processing unit 10 extracts a character string corresponding to the description location (coordinates) designated by the user from the character string information table 9.
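  • A minimal sketch of this lookup, assuming that each table row holds a rectangle and that the user designates a single point; the function name and the row layout are hypothetical.

```python
def find_string_at(table, x, y):
    """Return the recognized string whose rectangle contains the point (x, y), or None."""
    for text, (left, top, right, bottom) in table:
        if left <= x <= right and top <= y <= bottom:
            return text
    return None

# Illustrative rows: (text, (left, top, right, bottom)).
rows = [
    ("抵抗100Ω", (40, 20, 140, 38)),
    ("乾電池6V", (40, 120, 130, 138)),
]
clicked = find_string_at(rows, 50, 125)
```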
  • the translation processing unit 10 refers to the translation dictionary 11, extracts translation word candidates corresponding to the character string, and presents them to the user.
  • the translation processing unit 10 searches for the translated word by matching the character string against the translation dictionary 11.
  • Alternatively, the translation processing unit 10 may divide the character string into words by morphological analysis, search the translation dictionary 11 for each word, and present the translated words. Further, the translation processing unit 10 may pass the character string to a machine translation system and present a translation result by the machine translation system.
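  • The two dictionary-based lookups can be sketched as follows. This is only an illustration: the greedy longest-match segmentation stands in for a real morphological analyzer, and the dictionary contents and function names are assumptions.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (a stand-in for morphological analysis)."""
    words, i = [], 0
    max_len = max(map(len, dictionary), default=1)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary entry matches.
            if piece in dictionary or size == 1:
                words.append(piece)
                i += size
                break
    return words

def translation_candidates(text, dictionary):
    """Try a whole-string match first, then fall back to word-by-word lookup."""
    if text in dictionary:
        return {text: dictionary[text]}
    return {word: dictionary.get(word, []) for word in segment(text, dictionary)}

# Hypothetical translation dictionary entries.
sample_dictionary = {
    "乾電池": ["dry cell", "dry battery"],
    "抵抗": ["resistor", "resistance"],
}
whole = translation_candidates("乾電池", sample_dictionary)
by_word = translation_candidates("抵抗乾電池", sample_dictionary)
```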
  • the electronic image document editing system of the present embodiment may use any translation dictionary search algorithm and machine translation algorithm when performing translation processing.
  • the translation result is stored in the translated word candidate table 12.
  • the translation candidate table 12 temporarily stores the correspondence between Japanese character strings and translation candidates.
  • the translation object generation unit 13 transmits the translation word candidates stored in the translation word candidate table 12 to the output processing unit 2, and the output processing unit 2 presents the received translation word candidates to the user.
  • the user selects a correct translation from the presented translation candidates. If there is no correct translation in the presented translation candidates, the user inputs the correct translation directly from the keyboard or the like. If there is an error in the recognized character string, the user corrects the recognized character string and instructs re-execution of translation. The user selects a correct translation from the translation candidates presented again.
  • When the translated word is confirmed by the user inputting or selecting a correct translated word, the translated object generating unit 13 generates a translated object consisting of the translated text character string and displays it on the translation target image document 3. Further, the character string information management unit 16 stores the corrected character string and the confirmed translation result in the character string information table 9.
  • the translated object editing unit 14 adjusts the object size of the displayed translated object, the font size of the text, and the like, and performs an editing process that prompts the user to move the object to an appropriate position on the document and paste it.
  • the translated object editing unit 14 may automatically adjust the object size of the translated object, the font size of the text, and the like according to the character string length before and after translation.
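  • One possible form of such an automatic adjustment is sketched below. The proportional scaling rule, the minimum size, and the function name are assumptions; the text does not specify how the adjustment is computed.

```python
def scaled_font_size(original_size, source_length, translated_length, min_size=6.0):
    """Shrink the font so a longer translation fits roughly the same width.

    Assumed rule: scale by the ratio of the string lengths before and after
    translation, never enlarge beyond the original size, and never go below min_size.
    """
    if translated_length <= 0:
        return float(original_size)
    scaled = original_size * source_length / translated_length
    return max(min_size, min(float(original_size), scaled))

# A 3-character source string translated into 6 characters halves the font size.
example_size = scaled_font_size(12, 3, 6)
```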
  • the electronic image document data at that time is stored in the image document 15 with character recognition result / translation result.
  • the character string information management unit 16 manages the character string to be translated and the translation processing status of the character string. Specifically, the character string information management unit 16 analyzes the character string information table 9, calculates the number of character strings to be translated and the number of characters in the electronic image document, and holds them. In addition, the character string information management unit 16 manages the editing work status such as whether or not each translation target character string has been translated in cooperation with the translated word object generating unit 13 and the translated word object editing unit 14.
  • When the character string information management unit 16 receives information from the translated object editing unit 14 that the translated object has been pasted at predetermined coordinates, it regards the translation of the translation target character string corresponding to those coordinates as completed. At this time, the character string information management unit 16 stores 1 in a translation work completion flag (to be described later) of the character string information table 9.
  • the electronic image document editing system can manage to what extent the translation work is completed at a certain point of time by the translation work management by the character string information management unit 16 and can present the translation work status to the user.
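  • The bookkeeping described above can be sketched as a small summary function. The row layout and key names are hypothetical; the counts (number of translation target strings, number of characters, and completed strings via the completion flag) follow the description.

```python
def work_status(table):
    """Summarize translation progress from character string information rows."""
    targets = [row for row in table if row["is_target"]]
    return {
        "strings": len(targets),                                     # strings to be translated
        "characters": sum(len(row["text"]) for row in targets),      # characters to be translated
        "completed": sum(1 for row in targets if row["done"] == 1),  # completion flag set to 1
    }

# Illustrative rows: two translation targets (one finished) and one noise string.
status = work_status([
    {"text": "抵抗", "is_target": True, "done": 1},
    {"text": "豆電球", "is_target": True, "done": 0},
    {"text": "te W", "is_target": False, "done": 0},
])
```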
  • FIG. 2 shows a hardware configuration example of the electronic image document editing system of the present embodiment.
  • the electronic image document editing system includes a processing device 50, an input device 30, an output device 40, and a storage device 60, and is connected to a network 90.
  • the processing device 50 includes a processor and / or a logic circuit that operates according to a program, inputs / outputs data, reads / writes data, and executes each program shown in FIG. 1.
  • the program is executed by the processor to perform a predetermined process using a storage device and a communication port (communication device). Therefore, in the present embodiment and other embodiments, the description with the program as the subject may be the description with the processor as the subject. Alternatively, the process executed by the program is a process performed by a computer and a computer system on which the program operates.
  • the processor operates as a functional unit that realizes a predetermined function by operating according to a program.
  • the processor functions as the character recognition processing unit 4 by operating according to the character recognition processing program, and functions as the character string determination unit 7 by operating according to the character string determination program.
  • the processor also operates as a functional unit that realizes each of a plurality of processes executed by each program.
  • a computer and a computer system are an apparatus and a system including these functional units.
  • the input device 30 is a device that accepts an operation content or data input from a user.
  • the input device 30 includes a keyboard 31 and a mouse 32.
  • the input device 30 may include a touch pen, a touch panel, or the like instead of or in addition to the keyboard 31 and the mouse 32.
  • the output device 40 is a device that outputs calculation processing results and the like to the user.
  • the output device 40 includes an output monitor 41.
  • the electronic image document editing system transmits / receives input / output data via the network 90 when the input / output data is exchanged with another computer.
  • the storage device 60 stores the programs and data shown in FIG. 1.
  • the storage device 60 includes a working area 61 that temporarily stores processing data generated by the processing device 50 when the program is executed.
  • the storage device 60 includes areas for storing the data shown in FIG. 1: a translation target image document storage area 62, a character recognition dictionary storage area 64, a translation target image document storage area 65 with a character recognition result, a character string determination criterion storage area 67, a character string information table storage area 68, a translation dictionary storage area 70, a translation word candidate table storage area 71, an image document storage area 74 with character recognition results / translation results, and a word / character dictionary storage area 75.
  • the storage device 60 also includes areas for storing the units shown in FIG. 1: a character recognition processing unit storage area 63, a character string determination unit storage area 66, a translation processing unit storage area 69, a translated object generation unit storage area 72, and a translated object editing unit storage area 73.
  • the electronic image document editing system has a configuration in which all data and processing are aggregated in one computer, but the data and processing may be distributed and arranged in a plurality of computers.
  • a character recognition server which is another computer storing the character recognition processing unit 4 and the character recognition dictionary 5, and a computer having a function other than character recognition may exchange data with each other via the network 90.
  • a translation server which is another computer storing the translation processing unit 10 and the translation dictionary 11 and a computer having a function other than translation may exchange data with each other via the network 90.
  • FIG. 3A shows an example of an input electronic image document before translation.
  • a simple electric circuit diagram is used as an example of an electronic image document.
  • In practice, drawings containing a large amount of character information and non-character information are often input to the electronic image document editing system.
  • the electric circuit diagram in the electronic image document 301 before translation shows a circuit including a 6V dry cell, a miniature bulb, a transistor, and a resistor.
  • a character string representing the content and explanation of each symbol is described. That is, character information and non-character information such as a symbol representing a circuit and wiring are mixed in the electric circuit diagram.
  • FIG. 3B shows an example of the electronic image document of FIG. 3A translated by the electronic image document editing system.
  • the portion of the figure (non-character division) in the translated electronic image document 302 is not edited, the contents of FIG. 3A are displayed as they are, and only the Japanese character string is translated into English.
  • character strings having the same notation and meaning in Japanese and English, such as “100 Ω” and “6V”, are not translated, and the contents in the electronic image document 301 before translation are displayed as they are.
  • the user may perform editing processing on the translated words in the electronic image document, such as adjusting the character font, adding line breaks to spread a translated word over multiple lines, or adjusting the description position.
  • FIG. 4A shows an example of an electronic image document before character recognition by the electronic image document editing system.
  • the electronic image document 401 before character recognition is the same as the electronic image document 301 before translation described in FIG. 3A.
  • FIG. 4B shows an example of a character recognition result for the electronic image document of FIG. 4A by the electronic image document editing system.
  • the character string in the character recognition result 402 is associated with the description location (coordinates) of the character string.
  • the character recognition result is displayed over the document data for convenience of explanation. In actuality, however, the character string obtained as the character recognition result is arranged behind the document data and is not visible to the user.
  • the character strings “resistance 100 Ω”, “resistance 200 Ω”, “bean bulb”, and “dry battery 6V” in the character recognition result 402 are correctly recognized.
  • the character string “NPN transistor” is erroneously recognized with only one wrong character (“J” is recognized as “S”).
  • in addition, the partial character string “input” is erroneously recognized as “entered”, the partial character string “change” as “weird”, and the partial character string “)” as “ ⁇ ”.
  • a partial character string of a character string of length n is a continuous character string from the i-th character to the j-th character (1 ≤ i ≤ j ≤ n) of the character string.
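  • As a small illustration of this definition, assuming 1-based positions with 1 ≤ i ≤ j ≤ n, a partial character string corresponds to an ordinary contiguous slice. The function names are hypothetical.

```python
def partial_string(s, i, j):
    """Return the i-th through j-th characters of s (1-based, inclusive)."""
    n = len(s)
    if not (1 <= i <= j <= n):
        raise ValueError("positions must satisfy 1 <= i <= j <= n")
    return s[i - 1:j]

def all_partial_strings(s):
    """Enumerate every partial character string of s, in order of start position."""
    n = len(s)
    return [partial_string(s, i, j) for i in range(1, n + 1) for j in range(i, n + 1)]

# A string of length n has n * (n + 1) / 2 partial character strings.
parts = all_partial_strings("NPN")
```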
  • furthermore, the miniature bulb circuit symbol is recognized as the character string “te W”, the NPN transistor circuit symbol as the character string “six”, the resistor circuit symbol as the character string “ ⁇ VV-”, and the dry cell circuit symbol as the character string “state”.
  • These recognized character strings are all noise character strings in which non-character information is erroneously recognized as character information.
  • FIG. 5A shows a first example of the character string determination criterion 8.
  • the character string determination criterion 8 includes a plurality of determination criterion items. Each determination criterion item includes an ID 501 for identifying the determination criterion item, a determination criterion item content 502 describing the specific content of the determination criterion item, a weight value 503 representing the importance (reliability) of each determination criterion item, and An application flag 504 that indicates by 1/0 whether or not to apply the criterion item is included. Further, the character string determination criterion 8 includes a determination method 505 that defines a determination method using one or more determination criterion items.
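  • The structure of the character string determination criterion 8 can be sketched as below. The field names mirror the ID 501, content 502, weight value 503, and application flag 504; the two judging modes follow the "number of matching items" and "weight sum of matching items" values of Judge mentioned for the flowcharts. The thresholds, the weights, and the second item (Rule_X, "contains a kanji") are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CriterionItem:
    item_id: str                      # ID 501
    content: Callable[[str], bool]    # criterion item content 502, expressed as a predicate
    weight: float                     # weight value 503 (importance / reliability)
    apply: int                        # application flag 504: 1 = apply, 0 = do not apply

def judge(text: str, items: List[CriterionItem], method: str = "count", threshold: float = 1) -> bool:
    """Determination method 505: compare a score over the matching, applied items with a threshold."""
    matched = [item for item in items if item.apply == 1 and item.content(text)]
    if method == "count":
        score = len(matched)                          # number of matching items
    else:
        score = sum(item.weight for item in matched)  # weight sum of matching items
    return score >= threshold

items = [
    CriterionItem("Rule_1", lambda s: len(s) >= 2, weight=1.0, apply=1),
    # Hypothetical item, not taken from the text: the string contains at least one kanji.
    CriterionItem("Rule_X", lambda s: any("\u4e00" <= c <= "\u9fff" for c in s), weight=2.0, apply=1),
]
```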
  • the determination criterion item content 502 is described using a variable that can be recognized by the character string determination unit 7.
  • S_length represents the number of characters constituting the character string.
  • C_type represents the type of the characters constituting the character string. It takes a character type value recognized by the character recognition processing unit 4, for example kanji (kanji), hiragana (hiragana), katakana (katakana), symbol (symbol), number (number), number suffix (n_suffix), alphabet (alphabet), and non-Joyo kanji (non_j_kanji).
  • the character recognition processing unit 4 recognizes characters used for the original word and the translated word.
  • the number may indicate an Arabic numeral, or may indicate a numeral including a numeral other than an Arabic numeral (for example, a Roman numeral or a Greek numeral).
  • the number suffix is an example of a classifier and represents a word that is a suffix among the classifiers.
  • the classifier is a concept including a unit of measurement.
  • the character recognition processing unit 4 identifies one type of each recognized character.
  • the type of the letter “A” may be an alphabet or a numeric suffix (ampere) representing a unit of current.
  • the character recognition processing unit 4 identifies one type of the character “A” from the relationship with the characters before and after the character “A”.
  • C_type_seq represents the number of consecutive characters defined by C_type.
  • C_word represents the number of independent words included in the character string.
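  • as an illustration only, the character string determination criterion 8 might be held in memory as sketched below. The class names, the field names, the textual notation used for the item contents (for example `S_length >= 2`), and the weight values of 1.0 are all assumptions made for this sketch; only the reference numerals 501 to 505 and the variable names S_length, C_type_seq and C_word come from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CriterionItem:
    item_id: str    # ID 501
    content: str    # determination criterion item content 502 (notation assumed)
    weight: float   # weight value 503 (values below are assumed)
    apply: int      # application flag 504: 1 = apply, 0 = do not apply


@dataclass
class StringCriterion:
    items: List[CriterionItem]
    judge: str        # determination method 505, e.g. "Num_of_items"
    threshold: float  # threshold used by the determination method


# A criterion in the spirit of the first example: four items, count-based method.
criterion_example = StringCriterion(
    items=[
        CriterionItem("Rule_1", "S_length >= 2", 1.0, 1),
        CriterionItem("Rule_2", "C_type_seq(kanji|hiragana|katakana) >= 2", 1.0, 1),
        CriterionItem("Rule_3", "C_type not in excluded types, count >= 1", 1.0, 1),
        CriterionItem("Rule_4", "C_word >= 1", 1.0, 1),
    ],
    judge="Num_of_items",
    threshold=3,
)

applied = [item.item_id for item in criterion_example.items if item.apply == 1]
```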
  • the determination criterion item Rule_1 indicates that “the number of characters constituting the recognized character string is two or more characters”.
  • the numerical value 2 in Rule_1 may be an integer of 3 or more. Since a character string with a small number of characters is unlikely to constitute a word, it is highly likely that it is a noise character string.
  • the character string determination unit 7 can exclude such a noise character string from the edit target character string by applying Rule_1 to the recognized character string.
  • the character string determination unit 7 can perform determination processing using Rule_1 by counting the number of characters in the character string without performing morphological analysis and without referring to various dictionaries. Therefore, the character string determination unit 7 can perform determination processing using Rule_1 at high speed.
  • the character strings to be edited include many character strings of two characters. Therefore, by performing the determination using Rule_1 of the present embodiment, which treats character strings of two or more characters as editing targets, the character string determination unit 7 can remove a large number of noise character strings while keeping small the number of character strings that are wrongly excluded from editing. Note that the character string determination unit 7 can directly apply the above Rule_1 to the recognized character string even when the original language is other than Japanese.
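  • as an illustration only, Rule_1 reduces to a length comparison, which is why it needs neither morphological analysis nor a dictionary. The function name `rule_1` and the parameter name `min_length` are hypothetical; the sketch assumes that one recognized character corresponds to one Python character.

```python
def rule_1(recognized: str, min_length: int = 2) -> bool:
    """True when the recognized string has at least min_length characters."""
    return len(recognized) >= min_length


if __name__ == "__main__":
    for text in ["W", "six", "抵抗"]:
        print(repr(text), rule_1(text))
```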
  • the criterion item Rule_2 indicates that “the recognized character string includes a partial character string in which two or more kanji characters, hiragana characters, or katakana characters continue”.
  • the numerical value 2 in Rule_2 may be an integer of 3 or more.
  • a character string that does not include a partial character string in which a predetermined type of characters continues for a certain number of characters is highly likely to be a noise character string.
  • a character string that does not satisfy Rule_2, in which the predetermined types are kanji, hiragana, and katakana, is unlikely to include a partial character string that is a Japanese (original language) word, and is therefore particularly likely to be a noise character string.
  • the character string determination unit 7 can exclude such a noise character string from the editing target by applying Rule_2 to the recognized character string.
  • the character string determination unit 7 can perform determination processing using Rule_2 by determining the type of the characters constituting the character string and counting the number of characters in the character string, without performing morphological analysis and without referring to various dictionaries. Therefore, the character string determination unit 7 can perform determination processing using Rule_2 at high speed.
  • noise character strings are character strings that do not include a partial character string in which two or more kanji characters, hiragana characters, or katakana characters continue.
  • the character strings to be edited include many character strings that include a partial character string in which two kanji, hiragana, or katakana characters are continuous but do not include a partial character string in which three or more such characters are continuous. Therefore, by performing the determination using Rule_2 of the present embodiment, which requires a run of two or more such characters, the character string determination unit 7 can remove a large number of noise character strings while keeping small the number of character strings that are wrongly excluded from editing.
  • Rule_2 may be, for example, “the recognized character string includes a partial character string in which two or more characters in the first type group that is a part of the character type recognized by the character recognition processing unit 4 are continuous”.
  • the type group represents one or a plurality of types. Therefore, in Rule_2, for example, the first type group can be hiragana or katakana. Also in this case, the numerical value 2 in Rule_2 may be an integer of 3 or more.
  • Rule_2 can be, for example, “a recognized character string includes a partial character string in which two or more alphabets are continuous”. Further, when the original language is Chinese, Rule_2 can be, for example, “the recognized character string includes a partial character string in which two or more kanji characters are continuous”.
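  • as an illustration only, Rule_2 can be sketched with a regular expression that looks for a run of characters belonging to the first type group. The Unicode ranges below (Hiragana, Katakana, and CJK Unified Ideographs blocks) are an assumption standing in for the character types reported by the character recognition processing unit 4, and the names `rule_2`, `min_run`, and `type_group` are hypothetical.

```python
import re

# Assumed stand-in for "kanji, hiragana, or katakana":
# Hiragana U+3040-309F, Katakana U+30A0-30FF, CJK Unified Ideographs U+4E00-9FFF.
JAPANESE_CHARS = "\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF"


def rule_2(recognized: str, min_run: int = 2,
           type_group: str = JAPANESE_CHARS) -> bool:
    """True when the string contains min_run or more consecutive
    characters that all belong to the given type group."""
    pattern = "[" + type_group + "]{" + str(min_run) + ",}"
    return re.search(pattern, recognized) is not None


if __name__ == "__main__":
    for text in ["抵抗100Ω", "乾電池6V", "te W", "抵1抗"]:
        print(repr(text), rule_2(text))
```

  • passing a different `type_group` (for example `"A-Za-z"`) gives the alphabet variant of the rule mentioned above.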
  • Judgment criteria item Rule_3 indicates that “the recognized character string includes one or more characters other than symbols, numeral suffixes, numbers, alphabets, and emergency kanji”.
  • a recognized character string that does not include at least one character of a type other than the predetermined types is highly likely to be a noise character string.
  • a character string that does not satisfy Rule_3, in which the predetermined types are symbol, number suffix, number, alphabet, and emergency kanji, is highly unlikely to be a Japanese (original language) word, and is therefore particularly likely to be a noise character string.
  • the character string determination unit 7 can exclude such a noise character string from the editing target by applying Rule_3 to the recognized character string.
  • Rule_3 may be, for example, “the recognized character string includes one or more characters in the second type group that is a part of the character type recognized by the character recognition processing unit 4”.
  • Rule_3 can be, for example, “the recognized character string includes one or more characters other than symbols, classifiers, numbers, kanji, hiragana, katakana, and emergency kanji”.
  • Rule_3 can be, for example, “the recognized character string includes one or more characters other than symbols, classifiers, numbers, hiragana, katakana, alphabet, and emergency kanji”.
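  • as an illustration only, Rule_3 can be sketched as a check for at least one character whose type lies outside a set of excluded types. The classifier `char_type` below is a deliberately simplified assumption: it relies on Unicode ranges, treats everything unrecognized as a symbol, and does not detect number suffixes or emergency kanji, which in the described system would be identified by the character recognition processing unit 4.

```python
EXCLUDED_TYPES = {"symbol", "n_suffix", "number", "alphabet", "non_j_kanji"}


def char_type(ch: str) -> str:
    """Simplified, assumed character classifier based on Unicode ranges."""
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if "\u3040" <= ch <= "\u309F":
        return "hiragana"
    if "\u30A0" <= ch <= "\u30FF":
        return "katakana"
    if "\u4E00" <= ch <= "\u9FFF":
        return "kanji"
    return "symbol"  # spaces, punctuation, Greek letters, and so on


def rule_3(recognized: str, excluded=EXCLUDED_TYPES) -> bool:
    """True when at least one character has a type outside the excluded set."""
    return any(char_type(ch) not in excluded for ch in recognized)


if __name__ == "__main__":
    for text in ["抵抗100Ω", "100Ω", "te W"]:
        print(repr(text), rule_3(text))
```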
  • Judgment criteria item Rule_4 indicates that “the recognized character string includes one or more independent words”. Since a character string that does not include an independent word is a character string that does not have a word content, there is a high possibility that the character string is a noise character string.
  • the character string determination unit 7 can exclude the noise character string from editing targets by applying Rule_4 to the recognized character string.
  • Rule_4 may be, for example, “the recognized character string includes one or more content words”.
  • a content word is a word having a specific meaning content other than a grammatical role, such as a noun, a verb, and an adjective.
  • An independent word is an example of a content word in Japanese.
  • the electronic image document editing system does not use information related to the editing result when performing the determination using the determination criterion items Rule_1 to Rule_4. Therefore, it can determine whether the recognized character string is an edit target character string without performing the editing process on the recognized character string.
  • the electronic image document editing system provides two types of determination methods 505.
  • FIG. 5A shows that a determination method (Num_of_items) based on the number of matching determination criterion items is designated as the determination method 505. Also, the threshold value (Num_of_items_threshold) for the number of determination criterion items in FIG. 5A is 3.
  • that is, the determination method 505 indicates that, when the recognized character string satisfies three or more of the determination criterion items, the character string determination unit 7 determines that the recognized character string is a character string to be translated. The determination method 505 also indicates that, when the recognized character string satisfies only two or fewer of the determination criterion items, the character string determination unit 7 determines that the recognized character string is not a character string to be translated.
  • FIG. 5B shows a second example of the character string determination criterion 8.
  • the determination method 505 indicates that the threshold (Num_of_items_threshold) for the number of determination criterion items is 2. That is, the determination method 505 indicates an AND determination, in which the character string determination unit 7 determines the recognized character string to be a character string to be translated only when the recognized character string satisfies both of the two determination criterion items that are application targets (the application flag 504 is 1). Further, for example, if the threshold value for the number of determination criterion items is 1, the determination method 505 indicates an OR determination, in which the character string determination unit 7 determines the recognized character string to be a character string to be translated when the recognized character string satisfies at least one of the two determination criterion items.
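  • as an illustration only, the determination based on the number of matching items (Num_of_items) can be sketched as a count compared against a threshold. The function name `judge_by_count` and the dictionary layout are hypothetical, and which two items are applied in the AND/OR example is an assumption, since the description does not name them.

```python
def judge_by_count(matches: dict, apply_flags: dict, threshold: int) -> bool:
    """Count the satisfied items whose application flag is 1 and
    compare the count with the threshold (Num_of_items_threshold)."""
    count = sum(
        1
        for rule_id, matched in matches.items()
        if matched and apply_flags.get(rule_id, 0) == 1
    )
    return count >= threshold


# Four applied items with a threshold of 3.
flags_all = {"Rule_1": 1, "Rule_2": 1, "Rule_3": 1, "Rule_4": 1}
three_hits = {"Rule_1": True, "Rule_2": True, "Rule_3": True, "Rule_4": False}
two_hits = {"Rule_1": True, "Rule_2": False, "Rule_3": True, "Rule_4": False}

# Two applied items (assumed here to be Rule_1 and Rule_2):
# threshold 2 behaves as AND, threshold 1 behaves as OR.
flags_two = {"Rule_1": 1, "Rule_2": 1, "Rule_3": 0, "Rule_4": 0}
one_of_two = {"Rule_1": True, "Rule_2": False, "Rule_3": True, "Rule_4": True}

if __name__ == "__main__":
    print(judge_by_count(three_hits, flags_all, 3))
    print(judge_by_count(two_hits, flags_all, 3))
    print(judge_by_count(one_of_two, flags_two, 2))
    print(judge_by_count(one_of_two, flags_two, 1))
```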
  • FIG. 5C shows a third example of the character string determination criterion 8.
  • the determination method 505 indicates that a method (Sum_of_weights) for determining whether or not the recognized character string is a character string to be translated is specified based on the sum of the weight values 503 of the matching determination criterion items.
  • the weight sum threshold (Sum_of_weights_threshold) is 3.0.
  • the determination method 505 indicates that, when the sum of the weight values of the items satisfied by the recognized character string, among the four determination criterion items to be applied (the application flag 504 is 1), is 3.0 or more, the character string determination unit 7 determines that the recognized character string is a character string to be translated.
  • the determination method 505 indicates that, when the sum of the weight values is less than 3.0, the character string determination unit 7 determines that the recognized character string is not a character string to be translated.
  • the determination using FIG. 5C is the same as the determination based on the number of matched items shown in FIG. That is, the determination based on the number of matching items is an example of determination based on a weight value.
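  • as an illustration only, the determination based on the sum of weights (Sum_of_weights) can be sketched as follows. The function name `judge_by_weight_sum` and the individual weight values are assumptions; only the threshold of 3.0 is taken from the description. The last lines show that, if every weight is assumed to be 1.0, the weight sum equals the number of matching items, so the count-based determination is a special case of the weight-based one.

```python
def judge_by_weight_sum(matches: dict, weights: dict,
                        apply_flags: dict, threshold: float) -> bool:
    """Sum the weights of the satisfied, applied items and compare
    the sum with the threshold (Sum_of_weights_threshold)."""
    total = sum(
        weights[rule_id]
        for rule_id, matched in matches.items()
        if matched and apply_flags.get(rule_id, 0) == 1
    )
    return total >= threshold


flags = {"Rule_1": 1, "Rule_2": 1, "Rule_3": 1, "Rule_4": 1}
# Assumed weight values, chosen only for this example.
weights = {"Rule_1": 1.0, "Rule_2": 1.5, "Rule_3": 1.0, "Rule_4": 0.5}

hits_a = {"Rule_1": True, "Rule_2": True, "Rule_3": True, "Rule_4": False}   # 3.5
hits_b = {"Rule_1": True, "Rule_2": False, "Rule_3": True, "Rule_4": True}   # 2.5

# With unit weights the weight sum is simply the number of matches.
unit_weights = {rule_id: 1.0 for rule_id in weights}

if __name__ == "__main__":
    print(judge_by_weight_sum(hits_a, weights, flags, 3.0))
    print(judge_by_weight_sum(hits_b, weights, flags, 3.0))
    print(judge_by_weight_sum(hits_b, unit_weights, flags, 3.0))
```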
  • the user can change the contents of the character string judgment standard 8. That is, the user can select the determination criterion items to be applied, and can add, delete, or change parameters, threshold values, and the like. Details of the change of the contents of the character string determination criterion 8 will be described later.
  • FIG. 6 shows an example of the character string information table 9.
  • the character string information table 9 holds data relating to the determination result by the character string determination unit 7 and the translation result by the translation processing unit 10.
  • FIG. 6 shows the character string information table 9 that holds the results of the processing performed by the character recognition processing unit 4 and the character string determination unit 7 of the electronic image document editing system on the input document of FIG. 4A.
  • the character string information table 9 includes a recognized character string 601, a description position 602, a determination reference collation result 607, a translation target flag 610, a modified character string 611, a translated character string 612, and a translation status flag 613.
  • the recognized character string 601 holds a character string recognized by the character recognition processing unit 4.
  • the description position 602 holds information on a rectangular area where the recognized character string 601 is displayed.
  • the description position 602 includes an upper left X coordinate 603 and an upper left Y coordinate 604, which are the coordinates of the upper left vertex of the rectangular area where the recognized character string 601 is displayed, and a lower right X coordinate 605 and a lower right Y coordinate 606, which are the coordinates of the lower right vertex of the rectangular area.
  • the determination reference collation result 607 holds the determination reference collation result in the character string determination unit 7.
  • the determination reference collation result 607 includes a column that holds the collation result for each determination criterion item, a number of matches 608 that holds the number of matching determination criterion items, and a weight sum 609 that holds the sum of the weight values of the matching determination criterion items.
  • the translation target flag 610 is a flag for identifying whether or not the recognized character string 601 is a translation target character string.
  • when whether or not the recognized character string is a translation target is determined based on the number of matching items, the translation target flag 610 holds 1 if the number of matches 608 is equal to or greater than the threshold, and holds 0 if the number of matches 608 is less than the threshold.
  • when whether or not the recognized character string is a translation target is determined based on the sum of the weights of the matched items, as in the weight-based character string determination criterion 8, the translation target flag 610 holds 1 if the weight sum 609 is equal to or greater than the threshold, and holds 0 if the weight sum 609 is less than the threshold.
  • the corrected character string 611 holds a character string corrected by the user when there is an error in the recognized character string 601 whose translation target flag 610 is 1.
  • the translated character string 612 holds the translation result for the recognized character string 601 or the corrected character string 611.
  • the translation status flag 613 is a flag for identifying whether or not the translation work of the recognized character string 601 or the corrected character string 611 having the translation target flag 610 of 1 is completed.
  • the translation status flag 613 holds 1 when the translation result is held in the translated character string 612, 0 when it is not held, and NULL when the recognized character string 601 is not a translation target.
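  • as an illustration only, one row of the character string information table 9 might be modeled as below. The class name `StringRecord` and the field names are hypothetical; the comments map them to the reference numerals of the description. The translation status is derived here rather than stored, which is a design choice of this sketch: it returns 1 when a translation is held, 0 when it is not, and `None` (standing in for NULL) when the string is not a translation target.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class StringRecord:
    recognized: str                   # recognized character string 601
    upper_left: Tuple[int, int]       # upper left X 603, Y 604
    lower_right: Tuple[int, int]      # lower right X 605, Y 606
    translation_target: int = 0       # translation target flag 610
    corrected: Optional[str] = None   # corrected character string 611
    translated: Optional[str] = None  # translated character string 612

    @property
    def translation_status(self) -> Optional[int]:
        """Translation status flag 613: 1, 0, or None (NULL)."""
        if self.translation_target != 1:
            return None
        return 1 if self.translated else 0


# Coordinates taken from the example rows in the description.
pending = StringRecord("dry battery 6V", (335, 410), (460, 430),
                       translation_target=1)
# Treating this string as a non-target is an assumption of the sketch.
noise = StringRecord("Te W", (160, 240), (200, 260), translation_target=0)

if __name__ == "__main__":
    print(pending.translation_status, noise.translation_status)
```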
  • the recognized character string “resistance 100 ⁇ ” is described in a rectangular area having the coordinates (160, 30) as the upper left vertex and the coordinates (300, 50) as the lower right vertex.
  • the translation processing unit 10 translates the recognized character string “resistance 100 ⁇ ” into “Resistor 100 ⁇ ” and stores the translation result in the translated character string 612. Since the translation of the recognized character string “resistance 100 ⁇ ” by the translation processing unit 10 has been completed, the corresponding translation status flag 613 holds “1”.
  • the recognition character string “NPN transistor” is described in a rectangular area having coordinates (250, 250) as the upper left vertex and coordinates (390, 270) as the lower right vertex.
  • the recognition character string “NPN transistor” has an incorrect character recognition result.
  • the translation object generation unit 13 stores the correction result in the corrected character string 611.
  • the translation processing unit 10 translates the modified character string “NPN transistor” and stores the translation result in the translated character string 612. Since the translation of the corrected character string “NPN transistor” by the translation processing unit 10 has been completed, the corresponding translation status flag 613 holds “1”.
  • the recognition character string “Te W” is described in a rectangular area having coordinates (160, 240) as the upper left vertex and coordinates (200, 260) as the lower right vertex.
  • the recognition character string “dry battery 6V” is described in a rectangular area having coordinates (335, 410) as the upper left vertex and coordinates (460, 430) as the lower right vertex.
  • the character string determining unit 7 determines that the character string is to be translated. Accordingly, the corresponding translation target flag 610 holds “1”. However, since the translation work for the recognized character string “dry battery 6V” has not been performed yet, that is, the translation character string 612 does not hold the translation result, the corresponding translation status flag 613 holds 0.
  • the character string determination unit 7 checks the determination method defined in the character string determination standard 8. That is, the character string determination unit 7 checks whether the value of the variable Judge of the determination method 505 is the number of matching items (Num_of_items) or the weight sum of matching items (Sum_of_weights). When the value of Judge is the number of matching items, the character string determination unit 7 performs the process shown in FIG. 7A. If the Judge value is the weighted sum of the matching items, the character string determination unit 7 performs the process shown in FIG. 7B.
  • FIG. 7A shows an example of a translation target character string determination process performed by the character string determination unit 7 when the Judge value is the number of matching items.
  • the character string determination unit 7 acquires the value of the threshold value S1 defined in the determination method 505 of the character string determination criterion 8 (step 702). That is, the character string determination unit 7 holds the value of the variable Num_of_items_threshold as the threshold value S1.
  • the character string determination unit 7 determines whether or not there is an undetermined recognized character string in the character string information table 9 (step 703). If there is no undetermined recognized character string (step 703: No), the character string determination unit 7 ends the process.
  • if there is an undetermined recognized character string (step 703: Yes), the character string determining unit 7 analyzes the recognized character string (step 704).
  • the character string determination unit 7 refers to the word / character dictionary 17 and extracts information necessary for determination of the determination criterion item content 502, such as the number of characters constituting the recognized character string, the type of characters constituting the recognized character string, and the independent words included in the recognized character string.
  • the character string determination unit 7 determines whether or not there is a determination criterion item content 502 not applied to the recognized character string (step 705).
  • if there is an unapplied determination criterion item content 502 (step 705: Yes), the character string determination unit 7 collates the recognized character string against the unapplied determination criterion item content 502 (step 706), and determines whether or not the recognized character string matches the determination criterion item content 502 (step 707).
  • if the recognized character string does not match the determination criterion item (step 707: No), the character string determination unit 7 stores the value 0 in the corresponding determination criterion item of the determination criterion collation result 607 of the character string information table 9 (step 708), and the process returns to step 705.
  • if the recognized character string matches the determination criterion item (step 707: Yes), the character string determination unit 7 stores the value 1 in the corresponding determination criterion item of the determination criterion collation result 607 of the character string information table 9 (step 709), and the process returns to step 705.
  • if there is no unapplied determination criterion item (step 705: No), the character string determination unit 7 sums the values for each determination criterion item stored in the determination criterion collation result 607 of the character string information table 9, and stores the total value in the number of matches 608 (step 710). Next, the character string determination unit 7 determines whether or not the total value stored in the number of matches 608 is equal to or greater than the threshold value S1 acquired in step 702 (step 711).
  • if the total value is not equal to or greater than the threshold value S1 (step 711: No), the character string determination unit 7 stores the value 0 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 712), and the process returns to step 703. If the total value is equal to or greater than the threshold value S1 (step 711: Yes), the character string determination unit 7 stores the value 1 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 713), and the process returns to step 703.
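  • as an illustration only, the flow of FIG. 7A can be sketched as two nested loops. The function name `determine_by_count`, the dictionary keys, and the two placeholder rules are hypothetical; the placeholder rules merely stand in for the determination criterion items, and the analysis of step 704 is assumed to be folded into each rule.

```python
def determine_by_count(table, rules, threshold_s1):
    """Set the translation target flag of every undetermined row
    from the number of matching criterion items."""
    for row in table:                                   # step 703
        if row.get("translation_target") is not None:
            continue                                    # already determined
        collation = {}
        for rule_id, rule in rules.items():             # steps 705 to 709
            collation[rule_id] = 1 if rule(row["recognized"]) else 0
        row["collation"] = collation                    # collation result 607
        row["num_matches"] = sum(collation.values())    # step 710 (608)
        row["translation_target"] = (                   # steps 711 to 713 (610)
            1 if row["num_matches"] >= threshold_s1 else 0
        )
    return table


# Placeholder rules, assumed only for this example.
example_rules = {
    "Rule_1": lambda s: len(s) >= 2,
    "has_letter": lambda s: any(c.isalpha() for c in s),
}
example_table = [{"recognized": "dry battery 6V"}, {"recognized": "-"}]
determine_by_count(example_table, example_rules, threshold_s1=2)
```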
  • FIG. 7B shows an example of translation target character string determination processing by the character string determination unit 7 when the Judge value is the weighted sum of matching items.
  • the character string determination unit 7 acquires the value of the threshold value S2 defined in the determination method 505 of the character string determination criterion 8 (step 714). That is, the character string determination unit 7 holds the value of the variable Sum_of_weights_threshold as the threshold value S2.
  • the character string determination unit 7 determines whether or not there is an undetermined recognized character string in the character string information table 9 (step 715).
  • if there is no undetermined recognized character string (step 715: No), the character string determination unit 7 ends the process. If there is an undetermined recognized character string (step 715: Yes), the character string determining unit 7 analyzes the recognized character string (step 716). Since this analysis is the same as the analysis performed in step 704, description thereof is omitted.
  • the character string determination unit 7 determines whether or not there is an unapplied determination criterion item content 502 for the recognized character string (step 717). If there is an unapplied determination criterion item content 502 (step 717: Yes), the character string determination unit 7 collates the unapplied determination criterion item content 502 (step 718), and the recognized character string is not applied. It is determined whether or not the determination criterion item content 502 is met (step 719).
  • if the recognized character string does not match the determination criterion item (step 719: No), the character string determination unit 7 stores the value 0 in the corresponding determination criterion item of the determination criterion collation result 607 of the character string information table 9 (step 720), and the process returns to step 717.
  • if the recognized character string matches the determination criterion item (step 719: Yes), the character string determination unit 7 stores, in the corresponding determination criterion item of the determination criterion collation result 607 of the character string information table 9, the weight value 503 of the corresponding determination criterion item of the character string determination criterion 8 (step 721), and the process returns to step 717.
  • when there is no unapplied determination criterion item (step 717: No), the character string determination unit 7 sums the values for each determination criterion item stored in the determination criterion collation result 607 of the character string information table 9, and stores the total value in the weight sum 609 (step 722). Next, the character string determination unit 7 determines whether or not the total value stored in the weight sum 609 is equal to or greater than the threshold value S2 acquired in step 714 (step 723).
  • if the total value is not equal to or greater than the threshold value S2 (step 723: No), the character string determination unit 7 stores the value 0 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 724), and the process returns to step 715. If the total value is equal to or greater than the threshold value S2 (step 723: Yes), the character string determination unit 7 stores the value 1 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 725), and the process returns to step 715.
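  • as an illustration only, the flow of FIG. 7B differs from the count-based flow in that the weight value of a matching item, rather than the value 1, is stored and summed. The function name `determine_by_weight`, the dictionary keys, the placeholder rules, and their weights are hypothetical.

```python
def determine_by_weight(table, weighted_rules, threshold_s2):
    """Set the translation target flag of every undetermined row
    from the sum of the weights of the matching criterion items."""
    for row in table:                                       # step 715
        if row.get("translation_target") is not None:
            continue                                        # already determined
        collation = {}
        for rule_id, (rule, weight) in weighted_rules.items():  # steps 717 to 721
            collation[rule_id] = weight if rule(row["recognized"]) else 0
        row["collation"] = collation                        # collation result 607
        row["weight_sum"] = sum(collation.values())         # step 722 (609)
        row["translation_target"] = (                       # steps 723 to 725 (610)
            1 if row["weight_sum"] >= threshold_s2 else 0
        )
    return table


# Placeholder rules and weights, assumed only for this example.
weighted_example_rules = {
    "Rule_1": (lambda s: len(s) >= 2, 1.0),
    "has_letter": (lambda s: any(c.isalpha() for c in s), 2.0),
}
weighted_table = [{"recognized": "dry battery 6V"}, {"recognized": "12"}]
determine_by_weight(weighted_table, weighted_example_rules, threshold_s2=3.0)
```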
  • FIG. 8 shows an example of a list output screen of character strings determined as translation target character strings.
  • FIG. 8 illustrates a list output screen that is output and displayed based on the data in the character string information table 9 shown in FIG.
  • the character string list output screen 800 includes an output image sub-screen 801 that outputs an electronic image document and a translation result of a character string in the electronic image document, and a translation status sub-screen 802 that outputs a list of translation target character strings and translation results.
  • the translation status sub-screen 802 includes a status 803 that displays the translation work status of each character string, a translation target character string 804 that is a recognized character string to be translated, and a translation result 805 for the translation target character string 804.
  • the user can sort the values of the selected items in descending or ascending order by selecting any of the item headings of the status 803, the translation target character string 804, and the translation result 805. Thereby, for example, the user can easily grasp a character string that has not yet been translated, or can easily check whether the translation result of the same character string varies.
  • the translation target character string 804 is linked with the description position on the output image sub-screen 801.
  • the output image sub-screen 801 displays a description portion of the designated character string.
  • the character string information management unit 16 refers to the description position 602 of the character string information table 9 to obtain the description position of the character string of the translation target character string 804.
  • the user can refer to the list of character strings to be translated on the translation status sub-screen 802 and display it in conjunction with the output image sub-screen 801. Therefore, translation omissions for the character strings in the electronic image document can be reduced.
  • the translation status sub-screen 802 displays the number of character strings to be translated and the total number of characters at the top. In the example of FIG. 8, the translation status sub-screen 802 displays that the number of character strings to be translated is six and the total number of characters is 40 characters. These values are calculated by the character string information management unit 16 from the recognized character strings 601 whose translation target flag 610 is 1 in the character string information table 9. From the display of the translation status sub-screen 802, the user can grasp, and thus estimate, the amount of character strings to be translated in the electronic image document together with the list of character strings.
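  • as an illustration only, the two totals shown at the top of the translation status sub-screen 802 can be computed from the rows whose translation target flag is 1. The function name `summarize_targets` and the dictionary keys are hypothetical, and the sample rows use English renderings of the strings, so their character counts do not correspond to the figures in FIG. 8.

```python
def summarize_targets(table):
    """Return (number of translation target strings, total characters)."""
    targets = [row["recognized"] for row in table
               if row["translation_target"] == 1]
    return len(targets), sum(len(text) for text in targets)


sample_rows = [
    {"recognized": "dry battery 6V", "translation_target": 1},
    {"recognized": "NPN transistor", "translation_target": 1},
    {"recognized": "te W", "translation_target": 0},
]

if __name__ == "__main__":
    count, total_chars = summarize_targets(sample_rows)
    print(count, total_chars)
```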
  • FIG. 9 shows an example of a screen for changing the character string determination criterion 8.
  • a determination criterion change screen 900 for changing the character string determination criterion 8 is displayed.
  • the user changes the constituent elements and values (values enclosed in [] in the determination criterion change screen 900) for each determination criterion item constituting the character string determination criterion 8 on the determination criterion change screen 900.
  • FIG. 9 shows an example in which the threshold value (Num_of_items_threshold) 902 of the character string determination criterion 8 shown in FIG. 5A is changed from 3 to 4 (that is, changed so that the recognized character string is determined to be a character string to be translated only when it satisfies all four determination criterion items).
  • when the user presses the update button 903 after inputting the determination criterion change content, the content displayed on the determination criterion change screen 900 immediately before the pressing is updated and reflected in the character string determination criterion 8.
  • when the user presses the cancel button 904, the content is not updated and is not reflected in the character string determination criterion 8.
  • FIG. 10 shows an example of an output screen in which the translation process is re-executed after the character string determination standard 8 is changed.
  • the character string determination unit 7 re-executes the determination process using the updated character string determination standard 8.
  • the re-execution result is stored in the character string information table 9. Then, based on the information in the character string information table 9 storing the re-execution result, the translation status sub-screen 802 is re-displayed.
  • the recognized character string “NPN transistor” is not regarded as a character string to be translated based on the updated determination criteria, and is excluded from the character strings output and displayed on the translation status sub-screen 802.
  • since the recognized character string “NPN transistor” is no longer regarded as a translation target, the number of translation target character strings and the number of translation target characters are also reduced.
  • because the contents of the character string determination criterion 8 can be updated, the electronic image document editing system allows the user to set the character string determination criterion 8 corresponding to the electronic image document to be edited. As a result, the character string to be edited can be extracted with high accuracy.
  • the electronic image document editing system of the present embodiment can specify, with high accuracy, the character information to be edited among the character strings recognized by character recognition processing from an electronic image document in which figures (non-character information) and character information are mixed, such as a design drawing.
  • the user can thus easily and accurately grasp the description location and amount of the character information to be edited, which in turn improves the efficiency and quality of the editing work.
  • the present invention is not limited to the above-described embodiments, and includes various modifications.
  • the above-described embodiments are described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described.
  • a part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment.
  • each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit.
  • Each of the above-described configurations, functions, and the like may be realized by software by interpreting and executing a program that realizes each function by the processor.
  • Information such as programs, tables, and files for realizing each function can be stored in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)
  • Machine Translation (AREA)
  • Character Input (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Provided is a digital image document editing system, which: accepts an input of a digital image document; recognizes, within the inputted digital image document, a text character string formed from one or more text characters of a plurality of categories of text characters; and, if the recognized text character string satisfies a text character string determination rule, determines that the recognized text character string is a text character string to be edited. The text character string determination rules include at least one determination rule among: a first determination rule wherein the recognized text character string is formed from a number of text characters equal to or greater than a first threshold (which is an integer greater than one); a second determination rule wherein the recognized text character string includes a partial text character string having a number of text characters belonging to a first group of categories, which are a portion of the abovementioned plurality of categories, equal to or greater than a second threshold (which is an integer greater than one); a third determination rule wherein the recognized text character string includes a text character that belongs to a second group of categories, which are a portion of the plurality of categories; and a fourth determination rule wherein the recognized text character string includes a content word.

Description

Electronic image document editing system
The present invention relates to an electronic image document editing system.
For multiple people to cooperate and carry out a single task quickly and accurately, it is desirable that they share an edited document containing the information needed for the task. Editing a document includes creating a new document, updating an existing document (adding, correcting, deleting, and so on), proofreading the character information in a document, and translating that character information.
With the recent progress of the information society, anyone can easily edit an electronic document whose character information is encoded. However, there are many documents whose character information is not encoded, such as electronic image documents obtained by scanning paper documents or by saving electronic documents as image data.
For example, in the design and development of products with long life cycles, such as electric power and electrical products, the design documents of past products often survive only as electronic image documents. To change part of such a design document, or to translate its text so that design information can be shared with an overseas design department, character strings must be recognized from the design document, which is an electronic image document, and the recognized character strings must then be edited.
JP-A-9-223147 (Patent Document 1) describes background art in this technical field. The publication states: "When recognition processing is set, original image data read by the scanner unit 1 is input to the recognition processing unit 102 via the image processing unit 3, and character recognition is performed. The number of words in a line, out of the total number of recognized words, for which a translated word exists (the hit count) is stored for a predetermined number of lines. Drawing of the line of interest is stopped, and superfluous output is thereby suppressed, either when the hit counts of the line of interest and of the preceding and following lines are at or below a predetermined number, or when, after determining whether the character code of each recognized word matches a set pattern (such as a code other than a character, or the same code repeated a predetermined number of times), the ratio of the number of pattern matches in a line to the total number of words recognized in that line is at or above a predetermined value." (see the abstract).
JP-A-9-223147
For example, in an electronic image document whose text is written contiguously in one place, such as the body of a patent specification or a paper, the user can easily and accurately grasp where the text is and how much of it there is. However, when character information is scattered across the document, as in a design drawing, and figures (non-character information) and character information are mixed so that the locations of the character information are hard to identify visually, it becomes difficult for the user to grasp the location and amount of the character information easily and accurately.
For a system that edits electronic image documents (hereinafter, an electronic image document editing system) to edit character information in an electronic image document, it must first perform character recognition processing. That is, the electronic image document editing system must identify the regions of the electronic image document in which character information is written and then identify the characters written in those regions.
If the recognition accuracy of the character recognition processing were 100%, the electronic image document editing system could accurately identify the location and amount of character information in the electronic image document. In practice, however, character recognition processing produces misrecognitions caused by the quality of the scanned paper, the resolution of the electronic image, the typeface and size of the written characters, and similar factors.
Misrecognition takes three forms: character information recognized as non-character information (omission), non-character information recognized as character information (noise), and character information recognized as character information but with incorrectly recognized content (recognition error). When the electronic image document editing system performs character recognition processing on a document such as a design drawing, in which figures (non-character information) and character information are mixed, the recognition results contain many noise character strings produced by misrecognizing fragments of the drawing as characters.
The technique described in Patent Document 1 applies predetermined rules to the character strings recognized by character recognition processing and to their translation results (translated-word search results). When a rule matches, the recognized character string and its translation result are not output. Patent Document 1 mentions the following two rules. The first rule is: "The number of words in a line, out of the total number of recognized words, for which a translated word exists (the hit count) is stored for a predetermined number of lines, and drawing of the line of interest is stopped when the hit counts of the line of interest and of the preceding and following lines are at or below a predetermined number."
The second rule is: "It is determined whether the character code of each recognized word matches a set pattern, such as a code other than a character or the same code repeated a predetermined number of times, and drawing of the line of interest is stopped when the ratio of the number of pattern matches in a line to the total number of words recognized in that line is at or above a predetermined value."
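The second rule quoted above can be illustrated with a minimal sketch. This is not the implementation of Patent Document 1; the function names, the whitespace-based word split, the repeat count, and the ratio threshold are all assumptions made for illustration.

```python
import re


def matches_noise_pattern(word: str, repeat: int = 3) -> bool:
    """Assumed reading of the 'set pattern': the word has no letter or digit,
    or the same character occurs `repeat` or more times in a row."""
    no_letters = not any(ch.isalnum() for ch in word)
    repeated = re.search(r"(.)\1{%d,}" % (repeat - 1), word) is not None
    return no_letters or repeated


def should_suppress_line(line: str, ratio_threshold: float = 0.5) -> bool:
    """Suppress a line when the share of pattern-matching words reaches the threshold."""
    words = line.split()  # assumption: words are separated by whitespace
    if not words:
        return False
    hits = sum(matches_noise_pattern(w) for w in words)
    return hits / len(words) >= ratio_threshold
```

Under these assumptions, a line such as `--- ||| aaa` is suppressed because every word matches the pattern, while `rated voltage 100V` is kept.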
When the electronic image document editing system performs character recognition processing on a design drawing, the noise character strings it recognizes are strings of one or more characters composed of, for example, kanji, hiragana, katakana, digits, letters of the alphabet, and symbols. With the second rule, the technique described in Patent Document 1 cannot identify such noise character strings with high accuracy, and as a result many noise character strings become drawing targets.
With the first rule, the technique described in Patent Document 1 uses translation results (translated-word search results) to decide whether a recognized character string is a drawing target. This means that translation processing must be performed even on character strings whose translated words are never output, which increases the processing load. In addition, if the first rule is applied to an electronic image document editing system intended for editing work other than translation, the system must be equipped with a translation function that the editing work itself does not need, which increases cost.
An object of the present invention is therefore to identify and remove noise character strings, with high accuracy, from the character strings recognized in an electronic image document. A further object of the present invention is to identify noise character strings among the recognized character strings without performing editing processing on those character strings.
To solve the above problems, the present invention adopts, for example, the following configuration. An electronic image document editing system that edits character strings recognized from an electronic image document includes a processor and a storage device. The storage device holds one or more character string determination criteria, which are criteria for determining whether a character string consisting of one or more characters is a character string to be edited. The processor accepts input of an electronic image document, recognizes in the input electronic image document a character string consisting of one or more characters drawn from a plurality of character types, and, when the recognized character string satisfies the character string determination criteria, determines that the recognized character string is a character string to be edited. The character string determination criteria include at least one of the following: a first determination criterion that the recognized character string consists of a number of characters equal to or greater than a first threshold (an integer of 2 or more); a second determination criterion that the recognized character string contains a substring consisting of a number of characters, equal to or greater than a second threshold (an integer of 2 or more), that belong to a first type group, which is a subset of the plurality of types; a third determination criterion that the recognized character string contains a character belonging to a second type group, which is a subset of the plurality of types; and a fourth determination criterion that the recognized character string contains a content word.
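The four determination criteria can be sketched as follows. This is only an illustration under stated assumptions: the threshold values, the membership of the two type groups, the content-word set, and all function names are hypothetical, and counting the matched criteria against `min_matches` is just one possible way to combine them.

```python
import unicodedata

# Assumed values; the text only requires each threshold to be an integer of 2 or more.
FIRST_THRESHOLD = 3
SECOND_THRESHOLD = 2
# Assumed type groups; the text does not fix which character types belong to each.
FIRST_GROUP = {"kanji", "katakana"}
SECOND_GROUP = {"kanji"}
# Hypothetical stand-in for a content-word lookup in a word dictionary.
CONTENT_WORDS = {"電圧", "制御", "バルブ"}


def char_type(ch: str) -> str:
    """Classify one character into a coarse type using its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alphabet"
    return "symbol"


def longest_run(text: str, group: set) -> int:
    """Length of the longest substring whose characters all belong to `group`."""
    best = run = 0
    for ch in text:
        run = run + 1 if char_type(ch) in group else 0
        best = max(best, run)
    return best


def matched_criteria(text: str) -> set:
    """Return the numbers (1 to 4) of the determination criteria that `text` satisfies."""
    matched = set()
    if len(text) >= FIRST_THRESHOLD:
        matched.add(1)
    if longest_run(text, FIRST_GROUP) >= SECOND_THRESHOLD:
        matched.add(2)
    if any(char_type(ch) in SECOND_GROUP for ch in text):
        matched.add(3)
    if any(word in text for word in CONTENT_WORDS):  # substring match as a stand-in
        matched.add(4)
    return matched


def is_edit_target(text: str, min_matches: int = 1) -> bool:
    """Assumed combination rule: enough criteria must match."""
    return len(matched_criteria(text)) >= min_matches
```

With these assumed settings, a string such as `電圧制御` satisfies all four criteria, whereas a two-character fragment such as `l|` satisfies none and would be treated as noise.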
According to one aspect of the present invention, noise character strings can be identified with high accuracy among the character strings recognized by character recognition processing in an electronic image document. Furthermore, according to one aspect of the present invention, noise character strings can be identified with high accuracy among the recognized character strings without performing editing processing on them.
  • An example of the system configuration of the electronic image document editing system.
  • An example of the hardware configuration of the electronic image document editing system.
  • An example of input electronic image document data.
  • An example of electronic image document data after translation processing.
  • An example of electronic image document data before character recognition processing.
  • An example of electronic image document data after character recognition processing.
  • A first example of the character string determination criteria.
  • A second example of the character string determination criteria.
  • A third example of the character string determination criteria.
  • An example of the character string information table.
  • An example flowchart of the character string determination processing performed by the character string determination unit when the value of Judge is the number of matching items.
  • An example flowchart of the character string determination processing performed by the character string determination unit when the value of Judge is the weighted sum of the matching items.
  • An example of the screen listing the character strings determined to be translation target character strings.
  • An example of the screen for changing the character string determination criteria.
  • An example of the output screen after translation processing is re-executed following a change to the character string determination criteria.
Embodiments of the present invention are described below with reference to the accompanying drawings. Note that the embodiments are merely examples for realizing the present invention and do not limit its technical scope. Components common to the drawings are given the same reference numerals.
The electronic image document editing system of this embodiment accepts input of an electronic image document of a design drawing and performs editing processing on the input electronic image document. As one example of editing processing, the system supports the work of translating Japanese character strings written in the electronic image document into English character strings. A character string in this embodiment consists of one or more characters.
Specifically, the electronic image document editing system performs character recognition on an electronic image document such as a design drawing and extracts candidate locations where character strings are written. Using the character string determination processing described later, the system then identifies, from among those candidates, the locations where character strings are actually written.
The electronic image document editing system performs translation processing on the Japanese character strings among the identified character strings and presents translation candidates for each of them. The system generates a translated word object for the translated word selected by the user, adjusts the layout of the translated word object, and pastes it at an appropriate position on the document.
In this embodiment a design drawing is used as an example of the input electronic image document, but the input may also be, for example, text that has been converted into an electronic image, or an electronic image of a chart or table contained in a paper or similar document. This embodiment mainly describes recognizing Japanese character strings and translating them into English character strings, but there is no particular restriction on the source and target languages. Furthermore, although this embodiment describes the work of translating a document, the system is also applicable to other editing work such as document updating and proofreading.
FIG. 1 shows a configuration example of the electronic image document editing system of this embodiment. The electronic image document editing system includes an input processing unit 1, an output processing unit 2, a character recognition processing unit 4, a character string determination unit 7, a translation processing unit 10, a translated word object generation unit 13, a translated word object editing unit 14, and a character string information management unit 16. Each of these units is a program.
The electronic image document editing system also includes a translation target image document 3, a character recognition dictionary 5, a translation target image document with character recognition results 6, a character string determination criterion 8, a character string information table 9, a translation dictionary 11, a translation candidate table 12, an image document with character recognition and translation results 15, and a word/character dictionary 17.
The input processing unit 1 accepts various data and operations specified or instructed by the user via input means such as a keyboard, mouse, touch panel, or touch pen. Examples of such data and operation instructions include selection of the electronic image document to be translated, an instruction to execute character recognition, changes to the contents of the character string determination criteria, designation of the character strings to be translated, selection and entry of translated words, and editing of translated word objects.
The output processing unit 2 outputs various data and processing results to the user via output means such as a display. Examples of such data and processing results include the image document to be translated, the image document to be translated with character recognition results attached, the character string determination criteria, a list of information on the character strings to be translated, translation candidates, and the image document with character recognition results and translation results attached.
To use the electronic image document editing system of this embodiment, the user first selects the electronic image document to be translated from among the electronic image documents input to the system. The contents of the selected electronic image document are displayed to the user via a display or the like and are stored in the translation target image document 3.
Next, the user instructs the system to execute character recognition. The character recognition processing unit 4 retrieves the electronic image document data from the translation target image document 3 and recognizes the characters in the electronic image document by referring to the character recognition dictionary 5, which stores data on individual characters, rules for character recognition, and the like.
Character recognition processing includes identifying character string regions, segmenting characters from the character string regions, and recognizing the segmented characters. Because many character recognition algorithms are already widely known, a description of the character recognition processing is omitted. The character recognition processing unit 4 may use any character recognition algorithm.
A character string recognized by the character recognition processing unit 4 is stored in the character string information table 9 together with its location (its coordinate position in the document image). The recognized character string is also stored in the translation target image document with character recognition results 6, embedded at its location in the translation target image document 3.
The character string determination unit 7 analyzes each character string recognized by the character recognition processing unit 4 and determines whether it is a translation target character string. It analyzes the character string by referring to the word/character dictionary 17, which stores a list of characters and their attributes as well as word headings and their attributes. It then determines whether the recognized character string is a translation target character string by referring to the character string determination criterion items and other information stored in the character string determination criterion 8. The processing by the character string determination unit 7 and the character string determination criterion 8 are described in detail later. The determination result of the character string determination unit 7 is stored in the character string information table 9.
Next, the user looks at the displayed translation target image document with character recognition results 6, designates a location corresponding to a character string using a mouse, touch pen, or the like, and instructs the system to execute translation. The user may designate the location by, for example, clicking it, dragging over it, or selecting a rectangular range that contains it; any method may be used.
The translation processing unit 10 retrieves from the character string information table 9 the character string corresponding to the location (coordinates) designated by the user. Referring to the translation dictionary 11, it extracts translation candidates for that character string and presents them to the user. In this embodiment the translation processing unit 10 searches for translated words by matching the character string against the translation dictionary, but it may instead split the character string into words by morphological analysis, look up each word in the translation dictionary 11, and present the results. The translation processing unit 10 may also pass the character string to a machine translation system and present the translation result produced by that system.
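The lookup described above can be sketched as follows, assuming a small in-memory dictionary. The entries in `TRANSLATION_DICT`, the function names, and the greedy longest-match split (a simple stand-in for real morphological analysis) are all hypothetical.

```python
# Hypothetical dictionary entries, for illustration only.
TRANSLATION_DICT = {
    "電圧": ["voltage"],
    "制御": ["control"],
    "制御弁": ["control valve", "regulating valve"],
}


def segment(text: str, dictionary: dict) -> list:
    """Greedy longest-match split; a stand-in for real morphological analysis."""
    words = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


def translation_candidates(text: str, dictionary: dict = TRANSLATION_DICT) -> dict:
    """Try the whole string first, then fall back to a word-by-word lookup."""
    if text in dictionary:
        return {text: dictionary[text]}
    return {word: dictionary.get(word, []) for word in segment(text, dictionary)}
```

A string with its own dictionary entry returns that entry's candidates directly; a string without one is split, and any word with no entry is returned with an empty candidate list so that the user can type a translation.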
Because many translation dictionary search algorithms and machine translation algorithms are widely known, a description of translation processing using them is omitted. The electronic image document editing system of this embodiment may use any translation dictionary search algorithm or machine translation algorithm for translation processing. The translation results are stored in the translation candidate table 12, which temporarily holds the correspondence between Japanese character strings and translation candidates.
The translated word object generation unit 13 sends the translation candidates stored in the translation candidate table 12 to the output processing unit 2, and the output processing unit 2 presents the received candidates to the user. The user selects the correct translated word from the presented candidates. If none of the presented candidates is correct, the user enters the correct translated word directly from a keyboard or the like. If the recognized character string contains an error, the user corrects it and instructs the system to re-execute translation, and then selects the correct translated word from the newly presented candidates.
When the translated word is fixed by the user entering or selecting the correct one, the translated word object generation unit 13 generates a translated word object consisting of the translated text string and displays it on top of the translation target image document 3. The character string information management unit 16 stores the corrected character string and the fixed translation result in the character string information table 9.
The translated word object editing unit 14 performs editing processing that prompts the user to adjust the object size, text font size, and other properties of the displayed translated word object and to move it to an appropriate position on the document and paste it there. The translated word object editing unit 14 may also, for example, adjust the object size and text font size of the translated word object automatically according to the lengths of the character strings before and after translation. When the user instructs the system to save the editing result, the electronic image document data at that point is stored in the image document with character recognition and translation results 15.
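One conceivable way to adjust the font size from the string lengths before and after translation is sketched below. The formula is an assumption, not taken from the text: source characters are treated as roughly twice as wide as translated characters, and the font is only ever shrunk, never below a minimum size. The function and parameter names are hypothetical.

```python
def fitted_font_size(
    source_text: str,
    translated_text: str,
    source_font_pt: float,
    min_pt: float = 6.0,
    source_char_width: float = 2.0,
) -> float:
    """Shrink the font so the translated text fits roughly the source text's width.

    Assumption: each source character is `source_char_width` times as wide as
    a translated character at the same font size (full-width vs. half-width).
    """
    if not translated_text:
        return source_font_pt
    source_width = len(source_text) * source_char_width
    scale = min(1.0, source_width / len(translated_text))
    return max(min_pt, source_font_pt * scale)
```

Under these assumptions, a four-character Japanese label at 12 pt translated into a fifteen-character English string is reduced to about 6.4 pt, while a translation shorter than the available width keeps the original size.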
 文字列情報管理部16は、翻訳対象の文字列及び当該文字列の翻訳処理状況等を管理する。具体的には、文字列情報管理部16は、文字列情報テーブル9を解析して、電子イメージ文書内の翻訳対象文字列の数や文字数を算出し、保持する。また、文字列情報管理部16は、訳語オブジェクト生成部13及び訳語オブジェクト編集部14と連携して、個々の翻訳対象文字列の翻訳が完了しているか否か等の編集作業状況を管理する。 The character string information management unit 16 manages the character string to be translated and the translation processing status of the character string. Specifically, the character string information management unit 16 analyzes the character string information table 9, calculates the number of character strings to be translated and the number of characters in the electronic image document, and holds them. In addition, the character string information management unit 16 manages the editing work status such as whether or not each translation target character string has been translated in cooperation with the translated word object generating unit 13 and the translated word object editing unit 14.
When the character string information management unit 16 receives, from the translated-word object editing unit 14, information indicating that a translated-word object has been pasted at given coordinates, it regards the translation work for the translation target character string corresponding to those coordinates as finished. At this time, the character string information management unit 16 stores 1 in the translation work completion flag, described later, of the character string information table 9. Through this translation work management by the character string information management unit 16, the electronic image document editing system can track how far the translation work has progressed at any given time and can present the translation work status to the user.
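The completion handling described above can be sketched in Python as follows. This is a minimal sketch, not the embodiment's implementation: the `StringRecord` class, its field names, and the assumption that a paste is matched to a string by testing whether the paste point lies inside the string's rectangle (with Y increasing downward, as is usual for image coordinates) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StringRecord:
    text: str
    left: int
    top: int
    right: int
    bottom: int
    done: int = 0  # translation work completion flag (1 = finished)

def mark_pasted(records, x, y):
    """Set the completion flag of the record whose rectangle contains (x, y)."""
    for record in records:
        if record.left <= x <= record.right and record.top <= y <= record.bottom:
            record.done = 1
            return record
    return None  # the paste point does not correspond to any managed string

def completed_count(records):
    """Return (finished strings, total strings) for presenting the work status."""
    return sum(r.done for r in records), len(records)
```

For example, after `mark_pasted(records, 20, 20)` hits one of two records, `completed_count(records)` reports one of two strings finished.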
FIG. 2 shows an example hardware configuration of the electronic image document editing system of this embodiment. The electronic image document editing system includes a processing device 50, an input device 30, an output device 40, and a storage device 60, and is connected to a network 90.
The processing device 50 includes a processor and/or a logic circuit that operates according to programs. It inputs and outputs data, reads and writes data, and executes each of the programs shown in FIG. 1. A program, when executed by the processor, performs its defined processing using the storage device and a communication port (communication device). Accordingly, any description in this embodiment or other embodiments that has a program as its subject may be read as a description that has the processor as its subject. Alternatively, the processing that a program executes is processing performed by the computer and computer system on which the program runs.
By operating according to a program, the processor operates as a functional unit that realizes a predetermined function. For example, the processor functions as the character recognition processing unit 4 by operating according to the character recognition processing program, and functions as the character string determination unit 7 by operating according to the character string determination program. The same applies to the other programs. Furthermore, the processor also operates as functional units that realize each of the multiple processes executed by each program. The computer and the computer system are a device and a system that include these functional units.
The input device 30 is a device that accepts operations or data input from the user. The input device 30 includes a keyboard 31 and a mouse 32. The input device 30 may include a touch pen, a touch panel, or the like instead of, or in addition to, the keyboard 31 and the mouse 32.
The output device 40 is a device that outputs computation results and the like to the user. The output device 40 includes an output monitor 41. When the electronic image document editing system exchanges input/output data with another computer, it sends and receives that data via the network 90.
The storage device 60 stores the programs and data shown in FIG. 1. The storage device 60 includes a working area 61 that temporarily stores processing data generated by the processing device 50 while a program is executed.
The storage device 60 also includes areas that store the respective data shown in FIG. 1: a translation target image document storage area 62, a character recognition dictionary storage area 64, a storage area 65 for the translation target image document with character recognition results, a character string determination criteria storage area 67, a character string information table storage area 68, a translation dictionary storage area 70, a translated-word candidate table storage area 71, a storage area 74 for the image document with character recognition and translation results, and a word/character dictionary storage area 75.
The storage device 60 further includes areas that store the respective units shown in FIG. 1: a character recognition processing unit storage area 63, a character string determination unit storage area 66, a translation processing unit storage area 69, a translated-word object generation unit storage area 72, and a translated-word object editing unit storage area 73.
In FIG. 2, the electronic image document editing system is configured with all data and processing consolidated in a single computer, but the data and processing may instead be distributed across multiple computers. For example, a character recognition server, which is a separate computer storing the character recognition processing unit 4 and the character recognition dictionary 5, and a computer responsible for the functions other than character recognition may exchange data with each other via the network 90. Similarly, a translation server, which is a separate computer storing the translation processing unit 10 and the translation dictionary 11, and a computer responsible for the functions other than translation may exchange data with each other via the network 90.
FIG. 3A shows an example of an input electronic image document before translation. For convenience of explanation, a simple electric circuit diagram is used here as the example; in practice, the drawings input to the electronic image document editing system often contain large amounts of character information and of diagram information, which is non-character information. The electric circuit diagram in the pre-translation electronic image document 301 shows a circuit that includes a 6 V dry-cell battery, a miniature bulb, a transistor, and resistors. Near each symbol in the circuit diagram, a character string describing the content and meaning of that symbol is written. In other words, character information and non-character information, such as circuit symbols and wiring, are mixed within the circuit diagram.
FIG. 3B shows an example of the electronic image document of FIG. 3A after translation by the electronic image document editing system. In FIG. 3B, the diagram portion (the non-character information) of the translated electronic image document 302 is not edited; the content of FIG. 3A is displayed as it is, and only the Japanese character strings are translated into English. Character strings whose notation and meaning are common to Japanese and English, such as "100Ω" and "6V", are not translated, and the content of the pre-translation electronic image document 301 is displayed unchanged. If, for example, the length of a Japanese string before translation differs greatly from that of the English string after translation, the user may edit the translated electronic image document by adjusting the character font, inserting line breaks to split the text over multiple lines, adjusting the placement, and so on.
FIG. 4A shows an example of an electronic image document before character recognition by the electronic image document editing system. The electronic image document 401 before character recognition is identical to the pre-translation electronic image document 301 shown in FIG. 3A.
FIG. 4B shows an example of the result of character recognition performed by the electronic image document editing system on the electronic image document of FIG. 4A. Each character string in the character recognition result 402 is associated with the location (coordinates) at which it appears. For convenience of explanation, FIG. 4B displays the character recognition result overlaid on the document data. In practice, however, the character strings obtained as the character recognition result are placed behind the document data and are not visible to the user.
In the character recognition result 402, the character strings "抵抗100Ω" (resistor 100Ω), "抵抗200Ω" (resistor 200Ω), "豆電球" (miniature bulb), and "乾電池6V" (dry-cell battery 6V) are recognized correctly. The character string "NPNトランジスタ" (NPN transistor), however, is misrecognized as "NPNトランシスタ", with a single character wrong ("ジ" recognized as "シ"). In addition, within the character string "入力の変化を増幅" (amplifies changes in the input), the substring "入力" is misrecognized as "入刀", the substring "変化" as "変イヒ", and the substring ")" as "}". Here, a substring of a character string of length n is the contiguous character string from the i-th character to the j-th character of that string, where 1 ≦ i ≦ j ≦ n.
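The substring definition above can be made concrete with a short Python sketch. The function name `substrings` is hypothetical; the code simply enumerates every pair (i, j) with 1 ≦ i ≦ j ≦ n, so a string of length n has n(n+1)/2 substrings, counting repeats.

```python
def substrings(s: str) -> list[str]:
    """Enumerate all contiguous substrings s[i..j] for 1 <= i <= j <= n (1-indexed)."""
    n = len(s)
    # Python slices are 0-indexed and end-exclusive, so s[i-1:j] covers
    # the i-th through j-th characters of the 1-indexed definition.
    return [s[i - 1:j] for i in range(1, n + 1) for j in range(i, n + 1)]
```

Under this definition, "入力" and "変化" are both substrings of "入力の変化を増幅", whereas "入変" is not, because its characters are not adjacent in the original string.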
Furthermore, in the character recognition result 402, the circuit symbol for the miniature bulb is recognized as the character string "てW", the circuit symbol for the NPN transistor as "六", the circuit symbol for the resistor as "-VV-", and the circuit symbol for the dry-cell battery as "州". All of these recognized character strings are noise character strings, that is, non-character information misrecognized as character information.
FIG. 5A shows a first example of the character string determination criteria 8. The character string determination criteria 8 include multiple criterion items. Each criterion item includes an ID 501 that identifies the item, criterion item content 502 that describes the specific content of the item, a weight value 503 that represents the importance (reliability) of the item, and an application flag 504 that indicates, as 1 or 0, whether the item is applied. The character string determination criteria 8 also include a determination method 505 that specifies how a determination is made using one or more criterion items.
The criterion item content 502 is written using variables that the character string determination unit 7 can recognize. S_length represents the number of characters that make up the character string. C_type represents the type of a character in the string and takes as its value one of the character types recognized by the character recognition processing unit 4, for example kanji (kanji), hiragana (hiragana), katakana (katakana), symbol (symbol), numeral suffix (n_suffix), numeral (numeral), alphabetic character (alphabet), or non-joyo kanji (non_j_kanji), that is, kanji outside the joyo list of commonly used kanji. In this embodiment, the character recognition processing unit 4 recognizes the characters used in the source language and in the target language.
The numeral type (numeral) may denote Arabic numerals only, or may denote numerals that include non-Arabic numerals (for example, Roman numerals or Greek numerals). A numeral suffix is one kind of counter word, namely a counter word that is a suffix. Counter words here are a concept that includes units of measurement.
The character recognition processing unit 4 assigns exactly one type to each recognized character. For example, the character "A" could be either an alphabetic character or a numeral suffix representing the unit of electric current (ampere). The character recognition processing unit 4 settles on a single type for the character "A" based on, for example, its relationship to the characters before and after it. C_type_seq represents the number of consecutive characters of the types specified by C_type. C_word represents the number of independent words contained in the character string.
Criterion item Rule_1 states that "the recognized character string consists of two or more characters". The value 2 in Rule_1 may instead be an integer of 3 or more. A character string with few characters is unlikely to form a word and is therefore likely to be a noise character string. By applying Rule_1 to a recognized character string, the character string determination unit 7 can exclude such noise character strings from the editing target character strings.
The character string determination unit 7 can perform the determination using Rule_1 simply by counting the characters in the string, without performing morphological analysis and without consulting any dictionaries. The character string determination unit 7 can therefore perform the determination using Rule_1 at high speed.
Many noise character strings are single-character strings, whereas the character strings that should be edited include many two-character strings. By performing the determination using Rule_1 of this embodiment, which treats strings of two or more characters as editing targets, the character string determination unit 7 can therefore remove many noise character strings while also reducing the number of strings that should be edited but end up not being regarded as editing targets. The character string determination unit 7 can apply Rule_1 as described above to recognized character strings unchanged even when the source language is not Japanese.
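Rule_1 amounts to a length check, as the following Python sketch illustrates. The function name `rule_1` and the parameter `min_length` are hypothetical; the sample strings are taken from the recognition result discussed for FIG. 4B.

```python
def rule_1(recognized: str, min_length: int = 2) -> bool:
    """Rule_1: the recognized string consists of min_length or more characters."""
    return len(recognized) >= min_length

# Sample recognized strings: one genuine label and three noise strings.
recognized_strings = ["豆電球", "六", "州", "-VV-"]
kept = [s for s in recognized_strings if rule_1(s)]
```

In this sample the single-character noise strings "六" and "州" are dropped, while "-VV-" still passes Rule_1 on its own and has to be caught by the other criterion items.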
Criterion item Rule_2 states that "the recognized character string contains a substring in which two or more kanji, hiragana, or katakana characters appear consecutively". The value 2 in Rule_2 may instead be an integer of 3 or more. A character string that contains no substring in which characters of predetermined types continue for at least a certain number of characters is likely to be a noise character string. In particular, a character string that fails Rule_2, in which the predetermined types are kanji, hiragana, and katakana, is unlikely to contain a substring that is a Japanese (source-language) word and is therefore especially likely to be a noise character string. By applying Rule_2 to a recognized character string, the character string determination unit 7 can exclude such noise character strings from the editing targets.
The character string determination unit 7 can perform the determination using Rule_2 by identifying the types of the characters that make up the string and counting characters, without performing morphological analysis and without consulting any dictionaries. The character string determination unit 7 can therefore perform the determination using Rule_2 at high speed.
Many noise character strings contain no substring in which two or more kanji, hiragana, or katakana characters appear consecutively. The character strings that should be edited, on the other hand, include many strings that contain a substring of two consecutive kanji, hiragana, or katakana characters but no substring of three or more such consecutive characters. By performing the determination using Rule_2 of this embodiment, which sets the threshold at two consecutive characters, the character string determination unit 7 can therefore remove many noise character strings while also reducing the number of strings that should be edited but end up not being regarded as editing targets.
Rule_2 may also be stated more generally, for example as "the recognized character string contains a substring in which two or more characters belonging to a first type group appear consecutively, the first type group being a subset of the character types recognized by the character recognition processing unit 4". In this embodiment, a type group denotes one or more types. Accordingly, in Rule_2 the first type group may be, for example, hiragana or katakana. In this case as well, the value 2 in Rule_2 may be an integer of 3 or more.
When the source language is English, Rule_2 may be, for example, "the recognized character string contains a substring in which two or more alphabetic characters appear consecutively". When the source language is Chinese, Rule_2 may be, for example, "the recognized character string contains a substring in which two or more kanji appear consecutively".
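The following Python sketch shows one way Rule_2 could be evaluated with a configurable type group. It is a simplified illustration, not the embodiment's method: `char_type`, `longest_run`, and `rule_2` are hypothetical names, the character types are approximated from Unicode code-point ranges (Hiragana U+3041 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF) rather than taken from the character recognition processing unit 4, and a run is assumed to count any mix of types within the group.

```python
def char_type(ch: str) -> str:
    """Approximate a character's type from its code point (simplified assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    if ch.isascii() and ch.isdigit():
        return "numeral"
    return "symbol"

def longest_run(text: str, type_group: frozenset) -> int:
    """Length of the longest run of consecutive characters whose type is in type_group."""
    best = run = 0
    for ch in text:
        run = run + 1 if char_type(ch) in type_group else 0
        best = max(best, run)
    return best

JAPANESE_GROUP = frozenset({"kanji", "hiragana", "katakana"})

def rule_2(text: str, type_group: frozenset = JAPANESE_GROUP, min_run: int = 2) -> bool:
    """Rule_2: text contains min_run or more consecutive characters from type_group."""
    return longest_run(text, type_group) >= min_run
```

With the default group, "豆電球" satisfies the rule and "-VV-" does not; passing `frozenset({"alphabet"})` as the type group gives the English variant, under which "-VV-" would satisfy the rule.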
Criterion item Rule_3 states that "the recognized character string contains one or more characters other than symbols, numeral suffixes, numerals, alphabetic characters, and non-joyo kanji". A recognized character string that contains not even one character outside a set of predetermined types is likely to be a noise character string. In particular, a character string that fails Rule_3, in which the predetermined types are symbols, numeral suffixes, numerals, alphabetic characters, and non-joyo kanji, is unlikely to be a Japanese (source-language) word and is therefore especially likely to be a noise character string. By applying Rule_3 to a recognized character string, the character string determination unit 7 can exclude such noise character strings from the editing targets.
Rule_3 may also be stated more generally, for example as "the recognized character string contains one or more characters belonging to a second type group, the second type group being a subset of the character types recognized by the character recognition processing unit 4". When the source language is English, Rule_3 may be, for example, "the recognized character string contains one or more characters other than symbols, counter words, numerals, kanji, hiragana, katakana, and non-joyo kanji". When the source language is Chinese, Rule_3 may be, for example, "the recognized character string contains one or more characters other than symbols, counter words, numerals, hiragana, katakana, alphabetic characters, and non-joyo kanji".
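Rule_3 can be sketched in Python as a check over the per-character type labels. The sketch assumes the labels have already been assigned (one per character, as the character recognition processing unit 4 is described as doing) and are passed in as a list; the function name `rule_3`, the constant names, and the label "counter" used for counter words in the English variant are hypothetical.

```python
# Excluded types for a Japanese source, using the C_type values named in the text.
JAPANESE_EXCLUDED = frozenset({"symbol", "n_suffix", "numeral", "alphabet", "non_j_kanji"})

# Excluded types for an English source; "counter" is a hypothetical label for counter words.
ENGLISH_EXCLUDED = frozenset({"symbol", "counter", "numeral", "kanji",
                              "hiragana", "katakana", "non_j_kanji"})

def rule_3(char_types, excluded=JAPANESE_EXCLUDED, min_count=1):
    """Rule_3: at least min_count characters have a type outside the excluded set."""
    count = sum(1 for t in char_types if t not in excluded)
    return count >= min_count

# Type labels for "抵抗100Ω": two kanji, three numerals, one numeral suffix.
resistor_types = ["kanji", "kanji", "numeral", "numeral", "numeral", "n_suffix"]
# Type labels for the noise string "-VV-": symbol, two alphabetic characters, symbol.
noise_types = ["symbol", "alphabet", "alphabet", "symbol"]
```

Here `rule_3(resistor_types)` holds because of the two kanji, while `rule_3(noise_types)` does not, since every character of "-VV-" falls into an excluded type.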
Criterion item Rule_4 states that "the recognized character string contains one or more independent words". A character string that contains no independent word carries no content as language and is therefore likely to be a noise character string. By applying Rule_4 to a recognized character string, the character string determination unit 7 can exclude such noise character strings from the editing targets.
Rule_4 may also be, for example, "the recognized character string contains one or more content words". A content word is a word, such as a noun, verb, or adjective, that by itself carries specific semantic content beyond a grammatical role. Independent words are one example of content words in Japanese.
Because the electronic image document editing system of this embodiment does not use any information about editing results when making determinations with criterion items Rule_1 to Rule_4, it can determine whether a recognized character string is an editing target character string without first performing editing processing on that string.
The electronic image document editing system of this embodiment provides two kinds of determination method 505. FIG. 5A shows the case in which the method that decides by the number of matched criterion items (Num_of_items) is specified as the determination method 505. The threshold for the number of criterion items (Num_of_items_threshold) in FIG. 5A is 3.
That is, the determination method 505 indicates that the character string determination unit 7 determines a recognized character string to be a translation target character string when the string satisfies three or more of the four criterion items that are applied (those whose application flag 504 is 1). The determination method 505 likewise indicates that the character string determination unit 7 determines a recognized character string not to be a translation target character string when the string satisfies only two or fewer of those criterion items.
FIG. 5B shows a second example of the character string determination criteria 8. The determination method 505 indicates that the threshold for the number of criterion items (Num_of_items_threshold) is 2. That is, the determination method 505 specifies an AND determination: the character string determination unit 7 determines a recognized character string to be a translation target character string only when the string satisfies both of the two criterion items that are applied (those whose application flag 504 is 1). If the threshold for the number of criterion items were set to 1 instead, the determination method 505 would specify an OR determination: the character string determination unit 7 would determine a recognized character string to be a translation target character string when the string satisfies either of the two criterion items.
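The count-based determination (Num_of_items) described for FIG. 5A and FIG. 5B can be sketched in Python as follows. The function name, the dictionary layout, and in particular the choice of which two items are applied in the second configuration are assumptions for illustration; the text does not state which items FIG. 5B applies.

```python
def is_target_by_count(results: dict, apply_flags: dict, threshold: int) -> bool:
    """Count the satisfied criterion items whose application flag is 1 and
    compare the count with Num_of_items_threshold."""
    matched = sum(1 for rule_id, satisfied in results.items()
                  if apply_flags.get(rule_id, 0) == 1 and satisfied)
    return matched >= threshold

# FIG. 5A style: all four items applied, threshold 3.
all_applied = {"Rule_1": 1, "Rule_2": 1, "Rule_3": 1, "Rule_4": 1}

# FIG. 5B style: two items applied (which two is a hypothetical choice here).
two_applied = {"Rule_1": 1, "Rule_2": 1, "Rule_3": 0, "Rule_4": 0}

# Per-item matching results for one recognized string (illustrative values).
results = {"Rule_1": True, "Rule_2": False, "Rule_3": True, "Rule_4": True}
```

With `results` as above, the string is a translation target under the four-item configuration with threshold 3. Under the two-item configuration it fails with threshold 2 (AND) and passes with threshold 1 (OR).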
FIG. 5C shows a third example of the character string determination criteria 8. The determination method 505 indicates that the specified method (Sum_of_weights) decides whether a recognized character string is a translation target character string according to the sum of the weight values 503 of the matched criterion items.
The weight-sum threshold (Sum_of_weights_threshold) in the determination method 505 is 3.0. That is, the determination method 505 indicates that the character string determination unit 7 determines a recognized character string to be a translation target character string when, among the four criterion items that are applied (those whose application flag 504 is 1), the weight values of the items the string satisfies sum to 3.0 or more. The determination method 505 likewise indicates that the character string determination unit 7 determines the recognized character string not to be a translation target character string when that sum is less than 3.0.
If all the weight values in FIG. 5C are set to 1.0 and the weight-sum threshold is read as the threshold for the number of criterion items, the determination using FIG. 5C becomes the same as the determination by number of matched items shown in FIG. 5A. In other words, determination by number of matched items is a special case of determination by weight values.
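The weight-based determination (Sum_of_weights) and its relationship to the count-based determination can be sketched in Python as follows. The function name and dictionary layout are hypothetical, and the weight values shown are invented for illustration; the actual weights of FIG. 5C are not reproduced in the text.

```python
def is_target_by_weight(results: dict, weights: dict,
                        apply_flags: dict, threshold: float) -> bool:
    """Sum the weights of the satisfied, applied criterion items and
    compare the sum with Sum_of_weights_threshold."""
    total = sum(weights[rule_id] for rule_id, satisfied in results.items()
                if apply_flags.get(rule_id, 0) == 1 and satisfied)
    return total >= threshold

apply_flags = {"Rule_1": 1, "Rule_2": 1, "Rule_3": 1, "Rule_4": 1}

# Hypothetical weight values (not taken from FIG. 5C).
weights = {"Rule_1": 0.5, "Rule_2": 1.5, "Rule_3": 1.0, "Rule_4": 2.0}

# With every weight set to 1.0, the weight sum equals the number of matched items.
unit_weights = {rule_id: 1.0 for rule_id in weights}
```

Under the hypothetical weights, a string that satisfies only Rule_2 and Rule_4 reaches 3.5 and is a translation target at threshold 3.0, whereas one that satisfies only Rule_2 and Rule_3 reaches 2.5 and is not. With `unit_weights`, the same function reproduces the count-based determination.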
The user can change the content of the character string determination criteria 8. That is, the user can choose which criterion items to apply and can add, delete, or change parameters, thresholds, and the like. Details of changing the content of the character string determination criteria 8 are described later.
FIG. 6 shows an example of the character string information table 9. The character string information table 9 holds data such as the determination results from the character string determination unit 7 and the translation results from the translation processing unit 10. The table shown here holds the results obtained when the electronic image document editing system processed the input document of FIG. 4A with the character recognition processing unit 4 and the character string determination unit 7.
The character string information table 9 includes a recognized character string 601, a position 602, a criteria matching result 607, a translation target flag 610, a corrected character string 611, a translated character string 612, and a translation status flag 613.
The recognized character string 601 holds a character string recognized by the character recognition processing unit 4. The position 602 holds information on the rectangular region in which the recognized character string 601 is displayed. The position 602 includes an upper-left X coordinate 603 and an upper-left Y coordinate 604, which are the coordinates of the upper-left vertex of that rectangular region, and a lower-right X coordinate 605 and a lower-right Y coordinate 606, which are the coordinates of its lower-right vertex.
 The determination criterion collation result 607 holds the result of collating the determination criteria in the character string determination unit 7. It includes a column holding the collation result for each determination criterion item, a number of matches 608 holding the number of matched determination criterion items, and a weight sum 609 holding the sum of the weight values of the matched determination criterion items.
 The translation target flag 610 identifies whether the recognized character string 601 is a translation target character string. When translation targets are determined by the number of matching items, as with the character string determination criterion 8 shown in FIG. 5A or FIG. 5B, the translation target flag 610 holds 1 if the number of matches 608 is equal to or greater than the threshold and 0 if it is less than the threshold. When translation targets are determined by the weight sum of the matched items, as with the character string determination criterion 8 shown in FIG. 5C, the translation target flag 610 holds 1 if the weight sum 609 is equal to or greater than the threshold and 0 if it is less than the threshold.
 修正文字列611は、翻訳対象フラグ610が1である認識文字列601に誤りがある場合に、利用者によって修正された文字列を保持する。翻訳文字列612は、認識文字列601又は修正文字列611に対する翻訳結果を保持する。翻訳状況フラグ613は、翻訳対象フラグ610が1である、認識文字列601又は修正文字列611の翻訳作業が終了したか否かを識別するフラグである。翻訳状況フラグ613は、翻訳文字列612に翻訳結果が保持されている場合に1を、保持されていない場合に0を保持し、認識文字列601が翻訳対象外の場合にNULLとなる。 The corrected character string 611 holds a character string corrected by the user when there is an error in the recognized character string 601 whose translation target flag 610 is 1. The translated character string 612 holds the translation result for the recognized character string 601 or the corrected character string 611. The translation status flag 613 is a flag for identifying whether or not the translation work of the recognized character string 601 or the corrected character string 611 having the translation target flag 610 of 1 is completed. The translation status flag 613 holds 1 when the translation result is held in the translated character string 612, 0 when it is not held, and NULL when the recognized character string 601 is not a translation target.
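The row layout described above can be pictured with a small data structure. The following is a minimal sketch, not the patent's implementation: the class name `StringRecord`, its field names, and the two helper methods are assumptions chosen for illustration, and only the reference numerals in the comments come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class StringRecord:
    """One row shaped like the character string information table 9 (names are hypothetical)."""
    recognized: str                                  # recognized character string 601
    position: Tuple[int, int, int, int]              # description position 602: (left, top, right, bottom)
    criterion_results: Dict[str, float] = field(default_factory=dict)  # per-item results in 607
    match_count: int = 0                             # number of matches 608
    weight_sum: float = 0.0                          # weight sum 609
    is_target: int = 0                               # translation target flag 610
    corrected: Optional[str] = None                  # corrected character string 611
    translated: Optional[str] = None                 # translated character string 612

    def source_text(self) -> str:
        # The text to translate: the user's correction if present, else the OCR result.
        return self.corrected if self.corrected is not None else self.recognized

    def translation_status(self) -> Optional[int]:
        # Translation status flag 613: None (NULL) for non-targets,
        # 1 once a translation is stored, 0 while it is still missing.
        if self.is_target != 1:
            return None
        return 1 if self.translated else 0
```

Deriving the status flag from the other fields, rather than storing it, is a design choice of this sketch; the text describes flag 613 as a stored column.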
 In FIG. 6, the recognized character string “抵抗100Ω” (“resistor 100Ω”) is written in the rectangular area whose upper left vertex is (160, 30) and whose lower right vertex is (300, 50). Its number of matches 608 is 4, which is at or above the threshold (= 3) (alternatively, its weight sum 609 is 4.0, which is at or above the threshold (= 3.0)), so the character string determination unit 7 determines it to be a translation target character string. The translation target flag 610 for “抵抗100Ω” therefore holds 1.
 Because the character recognition result is also correct, the translation processing unit 10 translates the recognized character string “抵抗100Ω” as “Resistor 100Ω” and stores the result in the translated character string 612. Since this translation has been completed, the corresponding translation status flag 613 holds 1.
 The recognized character string “NPNトランシスタ” is written in the rectangular area whose upper left vertex is (250, 250) and whose lower right vertex is (390, 270). Its number of matches 608 is 3, which is at or above the threshold (= 3) (alternatively, its weight sum 609 is 3.5, which is at or above the threshold (= 3.0)), so the character string determination unit 7 determines it to be a translation target character string. The translation target flag 610 for “NPNトランシスタ” therefore holds 1.
 However, the character recognition result “NPNトランシスタ” is wrong: the intended string is “NPNトランジスタ” (“NPN transistor”), and the character ジ was recognized as シ. When the user enters a correction changing “NPNトランシスタ” to “NPNトランジスタ”, the translation object generation unit 13 stores the corrected string in the corrected character string 611. The translation processing unit 10 translates the corrected character string “NPNトランジスタ” and stores the result in the translated character string 612. Since this translation has been completed, the corresponding translation status flag 613 holds 1.
 The recognized character string “てW” is written in the rectangular area whose upper left vertex is (160, 240) and whose lower right vertex is (200, 260). However, its number of matches 608 is 2, which is below the threshold (= 3) (alternatively, its weight sum 609 is 2.5, which is below the threshold (= 3.0)), so the character string determination unit 7 determines that it is not a translation target character string. The corresponding translation target flag 610 therefore holds 0, and the translation processing unit 10 does not translate “てW”.
 The recognized character string “乾電池6V” (“dry battery 6V”) is written in the rectangular area whose upper left vertex is (335, 410) and whose lower right vertex is (460, 430). Its number of matches 608 is 4, which is at or above the threshold (= 3) (alternatively, its weight sum 609 is 4.0, which is at or above the threshold (= 3.0)), so the character string determination unit 7 determines it to be a translation target character string, and the corresponding translation target flag 610 holds 1. However, this string has not yet been translated, that is, the translated character string 612 holds no translation result, so the corresponding translation status flag 613 holds 0.
 以下、文字列判定部7による翻訳対象文字列判定処理について説明する。まず、文字列判定部7は、文字列判定基準8において規定されている判定方法をチェックする。即ち、文字列判定部7は、判定方法505の変数Judgeの値が合致項目数(Num_of_items)であるか、合致項目の重み和(Sum_of_weights)であるかをチェックする。Judgeの値が合致項目数である場合、文字列判定部7は、図7Aに示す処理を行う。Judgeの値が合致項目の重み和である場合、文字列判定部7は、図7Bに示す処理を行う。 Hereinafter, the translation target character string determination process by the character string determination unit 7 will be described. First, the character string determination unit 7 checks the determination method defined in the character string determination standard 8. That is, the character string determination unit 7 checks whether the value of the variable Judge of the determination method 505 is the number of matching items (Num_of_items) or the weight sum of matching items (Sum_of_weights). When the value of Judge is the number of matching items, the character string determination unit 7 performs the process shown in FIG. 7A. If the Judge value is the weighted sum of the matching items, the character string determination unit 7 performs the process shown in FIG. 7B.
 図7Aは、Judgeの値が合致項目数である場合における、文字列判定部7による翻訳対象文字列判定処理の一例を示す。文字列判定部7は、文字列判定基準8の判定方法505に規定された閾値S1の値を取得する(ステップ702)。即ち、文字列判定部7は、変数Num_of_items_thresholdの値を閾値S1として保持する。次に、文字列判定部7は、文字列情報テーブル9に未判定の認識文字列があるか否かを判定する(ステップ703)。未判定の認識文字列がない場合は(ステップ703:No)、文字列判定部7は処理を終了する。 FIG. 7A shows an example of a translation target character string determination process performed by the character string determination unit 7 when the Judge value is the number of matching items. The character string determination unit 7 acquires the value of the threshold value S1 defined in the determination method 505 of the character string determination criterion 8 (step 702). That is, the character string determination unit 7 holds the value of the variable Num_of_items_threshold as the threshold value S1. Next, the character string determination unit 7 determines whether or not there is an undetermined recognized character string in the character string information table 9 (step 703). If there is no undetermined recognized character string (step 703: No), the character string determination unit 7 ends the process.
 未判定の認識文字列がある場合(ステップ703:Yes)、文字列判定部7は、当該認識文字列を解析する(ステップ704)。文字列判定部7は、当該解析において、単語・文字辞書17を参照し、例えば、当該認識文字列を構成する文字数、当該認識文字列を構成する文字の種別、当該認識文字列に含まれる自立語等、判定基準項目内容502の判定に必要な情報を抽出する。 If there is an undetermined recognized character string (step 703: Yes), the character string determining unit 7 analyzes the recognized character string (step 704). In the analysis, the character string determination unit 7 refers to the word / character dictionary 17 and, for example, the number of characters constituting the recognized character string, the type of characters constituting the recognized character string, and the independent character included in the recognized character string. Information necessary for determination of the determination criterion item content 502 such as a word is extracted.
 次に、文字列判定部7は、認識文字列に対して未適用の判定基準項目内容502があるか否かを判定する(ステップ705)。未適用の判定基準項目内容502がある場合(ステップ705:Yes)、文字列判定部7は、未適用の判定基準項目内容502の照合を行い(ステップ706)、認識文字列が当該未適用の判定基準項目内容502に合致するか否かを判定する(ステップ707)。 Next, the character string determination unit 7 determines whether or not there is a determination criterion item content 502 not applied to the recognized character string (step 705). When there is an unapplied determination criterion item content 502 (step 705: Yes), the character string determination unit 7 collates the unapplied determination criterion item content 502 (step 706), and the recognized character string is not applied. It is determined whether or not the determination criterion item content 502 is met (step 707).
 When the recognized character string does not match the determination criterion item content 502 (step 707: No), the character string determination unit 7 stores the value 0 in the corresponding determination criterion item of the determination criterion collation result 607 in the character string information table 9 (step 708) and returns to step 705. When the recognized character string matches the determination criterion item content 502 (step 707: Yes), the character string determination unit 7 stores the value 1 in the corresponding determination criterion item of the determination criterion collation result 607 in the character string information table 9 (step 709) and returns to step 705.
 When there is no unapplied determination criterion item (step 705: No), the character string determination unit 7 sums the per-item values stored in the determination criterion collation result 607 of the character string information table 9 and stores the total in the number of matches 608 (step 710). It then determines whether the total stored in the number of matches 608 is equal to or greater than the threshold S1 acquired in step 702 (step 711).
 When the total is less than the threshold S1 (step 711: No), the character string determination unit 7 stores the value 0 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 712) and returns to step 703. When the total is equal to or greater than the threshold S1 (step 711: Yes), the character string determination unit 7 stores the value 1 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 713) and returns to step 703.
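The count-based flow of FIG. 7A can be sketched as a nested loop over recognized strings and criterion items. This is an illustrative sketch only: the three predicates in `CRITERIA` are hypothetical stand-ins rather than the items of the character string determination criterion 8, and the function and key names are assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in criterion items; the actual items are defined by the
# character string determination criterion 8 and are not reproduced here.
CRITERIA: Dict[str, Callable[[str], bool]] = {
    "min_length": lambda s: len(s) >= 2,
    "has_digit": lambda s: any(c.isdigit() for c in s),
    "has_kana_or_kanji": lambda s: any(
        "\u3040" <= c <= "\u30ff" or "\u4e00" <= c <= "\u9fff" for c in s
    ),
}


def judge_by_count(strings: List[str], threshold_s1: int) -> List[dict]:
    """Flag each string as a translation target when enough criterion items match."""
    rows = []
    for text in strings:                                      # loop driven by step 703
        results = {name: int(match(text))                     # steps 705-709: store 1 or 0 per item
                   for name, match in CRITERIA.items()}
        match_count = sum(results.values())                   # step 710: number of matches
        is_target = 1 if match_count >= threshold_s1 else 0   # steps 711-713: compare with S1
        rows.append({"recognized": text, "results": results,
                     "match_count": match_count, "is_target": is_target})
    return rows
```

Calling `judge_by_count(["抵抗100Ω", "てW"], 3)` flags only the first string under these stand-in predicates, because "てW" matches two of the three items.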
 図7Bは、Judgeの値が合致項目の重み和である場合における、文字列判定部7による翻訳対象文字列判定処理の一例を示す。文字列判定部7は、文字列判定基準8の判定方法505に規定された閾値S2の値を取得する(ステップ714)。即ち、文字列判定部7は、変数Sum_of_weights_thresholdの値を閾値S2として保持する。次に、文字列判定部7は、文字列情報テーブル9に未判定の認識文字列があるか否かを判定する(ステップ715)。 FIG. 7B shows an example of translation target character string determination processing by the character string determination unit 7 when the Judge value is the weighted sum of matching items. The character string determination unit 7 acquires the value of the threshold value S2 defined in the determination method 505 of the character string determination criterion 8 (step 714). That is, the character string determination unit 7 holds the value of the variable Sum_of_weights_threshold as the threshold value S2. Next, the character string determination unit 7 determines whether or not there is an undetermined recognized character string in the character string information table 9 (step 715).
 未判定の認識文字列がない場合(ステップ715:No)、文字列判定部7は処理を終了する。未判定の認識文字列がある場合は(ステップ715:Yes)、文字列判定部7は当該認識文字列を解析する(ステップ716)。当該解析は、ステップ704で行われた解析と同様であるため説明を省略する。 If there is no undetermined recognized character string (step 715: No), the character string determination unit 7 ends the process. If there is an undetermined recognized character string (step 715: Yes), the character string determining unit 7 analyzes the recognized character string (step 716). Since this analysis is the same as the analysis performed in step 704, description thereof is omitted.
 次に、文字列判定部7は、認識文字列に対して未適用の判定基準項目内容502があるか否かを判定する(ステップ717)。未適用の判定基準項目内容502がある場合(ステップ717:Yes)、文字列判定部7は、未適用の判定基準項目内容502の照合を行い(ステップ718)、認識文字列が当該未適用の判定基準項目内容502に合致するか否かを判定する(ステップ719)。 Next, the character string determination unit 7 determines whether or not there is an unapplied determination criterion item content 502 for the recognized character string (step 717). If there is an unapplied determination criterion item content 502 (step 717: Yes), the character string determination unit 7 collates the unapplied determination criterion item content 502 (step 718), and the recognized character string is not applied. It is determined whether or not the determination criterion item content 502 is met (step 719).
 When the recognized character string does not match the determination criterion item (step 719: No), the character string determination unit 7 stores the value 0 in the corresponding determination criterion item of the determination criterion collation result 607 in the character string information table 9 (step 720) and returns to step 717. When the recognized character string matches the determination criterion item (step 719: Yes), the character string determination unit 7 stores the weight value 503 of the corresponding determination criterion item of the character string determination criterion 8 in the corresponding determination criterion item of the determination criterion collation result 607 in the character string information table 9 (step 721) and returns to step 717.
 When there is no unapplied determination criterion item in step 717 (step 717: No), the character string determination unit 7 sums the per-item values stored in the determination criterion collation result 607 of the character string information table 9 and stores the total in the weight sum 609 (step 722). It then determines whether the total stored in the weight sum 609 is equal to or greater than the threshold S2 acquired in step 714 (step 723).
 When the total is less than the threshold S2 (step 723: No), the character string determination unit 7 stores the value 0 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 724) and returns to step 715. When the total is equal to or greater than the threshold S2 (step 723: Yes), the character string determination unit 7 stores the value 1 in the translation target flag 610 of the recognized character string in the character string information table 9 (step 725) and returns to step 715.
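The weighted variant of FIG. 7B differs from the count-based flow only in what is stored per matched item (the item's weight instead of 1) and in the threshold that is compared. A minimal sketch follows; the predicates, the weights 1.0, 0.5, and 1.5, and all names are hypothetical and are not taken from FIG. 5C.

```python
from typing import Callable, Dict, Tuple

# Hypothetical (predicate, weight) pairs standing in for the criterion items
# and their weight values 503.
WEIGHTED_CRITERIA: Dict[str, Tuple[Callable[[str], bool], float]] = {
    "min_length": (lambda s: len(s) >= 2, 1.0),
    "has_digit": (lambda s: any(c.isdigit() for c in s), 0.5),
    "has_kana_or_kanji": (
        lambda s: any("\u3040" <= c <= "\u30ff" or "\u4e00" <= c <= "\u9fff" for c in s),
        1.5,
    ),
}


def judge_by_weight(text: str, threshold_s2: float) -> dict:
    """Flag a string as a translation target when the matched weights reach S2."""
    results = {}
    for name, (match, weight) in WEIGHTED_CRITERIA.items():
        # Step 721 stores the item's weight on a match; step 720 stores 0 otherwise.
        results[name] = weight if match(text) else 0.0
    weight_sum = sum(results.values())                      # step 722: weight sum
    is_target = 1 if weight_sum >= threshold_s2 else 0      # steps 723-725: compare with S2
    return {"recognized": text, "results": results,
            "weight_sum": weight_sum, "is_target": is_target}
```

With weights, a single strongly indicative item can outweigh several weak ones, which a plain match count cannot express.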
 FIG. 8 shows an example of a list output screen for character strings determined to be translation target character strings. The screen shown is generated from the data in the character string information table 9 of FIG. 6. The character string list output screen 800 includes an output image sub-screen 801, which displays the electronic image document and the translation results for the character strings in it, and a translation status sub-screen 802, which displays the list of translation target character strings and their translation results. The translation status sub-screen 802 includes a status 803 showing the translation work status of each character string, a translation target character string 804 (a recognized character string to be translated), and a translation result 805 for the translation target character string 804.
 利用者は、状況803、翻訳対象文字列804、及び翻訳結果805の項目見出しのいずれかを選択することにより、選択した項目の値を降順又は昇順に並べ替えることができる。これにより、利用者は、例えば、まだ翻訳されていない文字列を容易に把握したり、同じ文字列の翻訳結果がばらついていないかを容易にチェックしたりすることができる。 The user can sort the values of the selected items in descending or ascending order by selecting any of the item headings of the status 803, the translation target character string 804, and the translation result 805. Thereby, for example, the user can easily grasp a character string that has not yet been translated, or can easily check whether the translation result of the same character string varies.
 The translation target character string 804 is also linked to the description position in the output image sub-screen 801. When the user selects any character string in the translation target character string 804, the output image sub-screen 801 displays the location where that string appears. The character string information management unit 16 obtains this location by referring to the description position 602 of the character string information table 9. Because the user can browse the list of character strings to be translated in the translation status sub-screen 802 and view it in conjunction with the output image sub-screen 801, omissions in translating the character strings in the electronic image document can be reduced.
 The translation status sub-screen 802 also displays, at the top, the number of character strings to be translated and their total character count. In the example of FIG. 8, it shows 6 translation target character strings with 40 characters in total. The character string information management unit 16 calculates these values from the recognized character strings 601 whose translation target flag is 1 in the character string information table 9. Because the user can see how much text in the electronic image document is subject to translation, together with the list of strings, the effort required for translation is easy to estimate.
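The two summary values shown at the top of the translation status sub-screen can be derived with a single pass over the table. The sketch below assumes rows are plain dictionaries with hypothetical keys `recognized` and `is_target`, and it counts characters as Unicode code points, which is an assumption about how the system counts.

```python
from typing import Iterable, Tuple


def summarize_targets(rows: Iterable[dict]) -> Tuple[int, int]:
    """Return (number of target strings, total characters in those strings)."""
    # Only rows whose translation target flag is 1 contribute to the totals.
    targets = [row["recognized"] for row in rows if row["is_target"] == 1]
    return len(targets), sum(len(text) for text in targets)


# The four rows discussed for FIG. 6; FIG. 8 itself lists more strings than these.
FIG6_ROWS = [
    {"recognized": "抵抗100Ω", "is_target": 1},
    {"recognized": "NPNトランシスタ", "is_target": 1},
    {"recognized": "てW", "is_target": 0},
    {"recognized": "乾電池6V", "is_target": 1},
]
```

Applied to these four rows, the function counts three target strings; the totals of 6 strings and 40 characters in FIG. 8 cover rows that are not described in the text.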
 FIG. 9 shows an example of a screen for changing the character string determination criterion 8. When the user presses the determination criterion change button 806 shown in FIG. 8, the determination criterion change screen 900 for changing the character string determination criterion 8 is displayed. On this screen, the user can change the components and values (the values enclosed in 〔 〕 on the determination criterion change screen 900) of each determination criterion item that makes up the character string determination criterion 8. FIG. 9 shows an example in which the threshold (Num_of_items_threshold) 902 of the character string determination criterion 8 shown in FIG. 5A is changed from 3 to 4, so that a recognized character string is determined to be a translation target character string only when it satisfies all four criterion items.
 利用者が判定基準変更内容入力後に更新ボタン903を押下すると、押下直前に判定基準変更画面900に表示されていた内容が、文字列判定基準8に更新反映される。利用者がキャンセルボタン904を押下した場合、当該内容は文字列判定基準8に更新反映されない。 When the user presses the update button 903 after inputting the determination criterion change content, the content displayed on the determination criterion change screen 900 immediately before the pressing is updated and reflected in the character string determination criterion 8. When the user presses the cancel button 904, the content is not updated and reflected in the character string determination standard 8.
 図10は、文字列判定基準8の変更後に翻訳処理を再実行した出力画面の一例を示す。利用者が、図9に例示した内容で文字列判定基準8を更新した後に、再表示ボタン807を押下すると、文字列判定部7は更新された文字列判定基準8を用いて判定処理を再実行し、再実行結果を文字列情報テーブル9に格納する。そして、再実行結果を格納した文字列情報テーブル9の情報に基づいて、翻訳状況サブ画面802が再表示される。 FIG. 10 shows an example of an output screen in which the translation process is re-executed after the character string determination standard 8 is changed. When the user presses the redisplay button 807 after updating the character string determination standard 8 with the contents illustrated in FIG. 9, the character string determination unit 7 re-executes the determination process using the updated character string determination standard 8. The re-execution result is stored in the character string information table 9. Then, based on the information in the character string information table 9 storing the re-execution result, the translation status sub-screen 802 is re-displayed.
 Comparing FIG. 10 with FIG. 8, under the updated determination criterion the recognized character string “NPNトランシスタ” is no longer regarded as a translation target character string and is excluded from the strings displayed in the translation status sub-screen 802. As a result, the number of translation target character strings and the translation target character count are also smaller.
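The effect of raising the threshold from 3 to 4 can be reproduced from the match counts given for FIG. 6. This sketch reuses the stored counts instead of re-running the collation, which is valid here only because the criterion items themselves are unchanged; the function and key names are hypothetical.

```python
from typing import List

# Number of matches 608 for the four strings discussed for FIG. 6.
FIG6_COUNTS = [
    {"recognized": "抵抗100Ω", "match_count": 4},
    {"recognized": "NPNトランシスタ", "match_count": 3},
    {"recognized": "てW", "match_count": 2},
    {"recognized": "乾電池6V", "match_count": 4},
]


def targets_for(rows: List[dict], threshold: int) -> List[str]:
    """List the strings whose match count reaches the given threshold."""
    return [row["recognized"] for row in rows if row["match_count"] >= threshold]


before = targets_for(FIG6_COUNTS, 3)  # threshold before the change (FIG. 8)
after = targets_for(FIG6_COUNTS, 4)   # threshold after the change (FIG. 10)
dropped = [text for text in before if text not in after]
```

Here `dropped` contains only "NPNトランシスタ", the string with exactly three matches, which agrees with the difference between FIG. 8 and FIG. 10 described above.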
 In this way, because the contents of the character string determination criterion 8 can be updated, the electronic image document editing system of this embodiment lets the user choose a character string determination criterion 8 suited to the electronic image document being edited, which in turn allows edit target character strings to be extracted with high accuracy.
 上述したように、本実施例の電子イメージ文書編集システムは、例えば、設計図面のように図(非文字情報)と文字情報が混在している電子イメージ文書の中から文字認識処理によって認識された文字情報のうち、編集対象となる文字情報を高精度に特定できる。その結果、利用者は、本実施例の電子イメージ文書編集システムを用いて、編集対象となる文字情報の記載箇所や記載量を容易かつ正確に把握することができ、ひいては編集作業の効率や品質を向上できる。 As described above, the electronic image document editing system of the present embodiment is recognized by character recognition processing from an electronic image document in which figure (non-character information) and character information are mixed, such as a design drawing. Among character information, character information to be edited can be specified with high accuracy. As a result, using the electronic image document editing system of the present embodiment, the user can easily and accurately grasp the description location and amount of text information to be edited, which in turn improves the efficiency and quality of editing work. Can be improved.
 なお、本発明は上記した実施例に限定されるものではなく、様々な変形例が含まれる。例えば、上記した実施例は本発明を分かりやすく説明するために詳細に記載したものであり、必ずしも説明した全ての構成を備えるものに限定されるものではない。また、ある実施例の構成の一部を他の実施例の構成に置き換えることが可能であり、また、ある実施例の構成に他の実施例の構成を加えることも可能である。また、各実施例の構成の一部について、他の構成の追加・削除・置換をすることが可能である。 In addition, this invention is not limited to the above-mentioned Example, Various modifications are included. For example, the above-described embodiments are described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described. Further, a part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. Further, it is possible to add, delete, and replace other configurations for a part of the configuration of each embodiment.
 また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現してもよい。また、上記の各構成、機能等は、プロセッサがそれぞれの機能を実現するプログラムを解釈し、実行することによりソフトウェアで実現してもよい。各機能を実現するプログラム、テーブル、ファイル等の情報は、メモリや、ハードディスク、SSD(Solid State Drive)等の記録装置、又は、ICカード、SDカード、DVD等の記録媒体に置くことができる。 In addition, each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit. Each of the above-described configurations, functions, and the like may be realized by software by interpreting and executing a program that realizes each function by the processor. Information such as programs, tables, and files for realizing each function can be stored in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.

Claims (13)

  1.  An electronic image document editing system for editing a character string recognized from an electronic image document, the system comprising:
     a processor and a storage device,
     wherein the storage device holds one or more character string determination criteria, which are criteria for determining whether a character string consisting of one or more characters is an edit target character string,
     wherein the processor
     accepts input of an electronic image document,
     recognizes, in the input electronic image document, a character string consisting of one or more characters of a plurality of character types, and
     determines that the recognized character string is an edit target character string when the recognized character string satisfies the character string determination criteria, and
     wherein the character string determination criteria include at least one of:
     a first determination criterion that the recognized character string consists of at least a first threshold number of characters (the first threshold being an integer of 2 or more);
     a second determination criterion that the recognized character string includes a substring consisting of at least a second threshold number of characters (the second threshold being an integer of 2 or more) belonging to a first type group, which is a part of the plurality of types;
     a third determination criterion that the recognized character string includes a character belonging to a second type group, which is a part of the plurality of types; and
     a fourth determination criterion that the recognized character string includes a content word.
  2.  The electronic image document editing system according to claim 1, wherein
    the storage device further holds a weight value corresponding to each of the character string determination criteria, and
    the processor:
    calculates the sum of the weight values corresponding to the character string determination criteria that the recognized character string satisfies; and
    determines that the recognized character string is an edit target character string when the sum of the weight values is equal to or greater than a third threshold.
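The weighted decision of claim 2 can be sketched as follows. The weight values, the third threshold value, and the function name are hypothetical; the claim only states that each criterion has a weight and that the sum over the satisfied criteria is compared with a third threshold.

```python
# Hypothetical weights per criterion number and a hypothetical third threshold.
WEIGHTS = {1: 1.0, 2: 2.0, 3: 1.0, 4: 3.0}
THIRD_THRESHOLD = 3.0


def is_edit_target(satisfied, weights=WEIGHTS, threshold=THIRD_THRESHOLD):
    """Decide from the set of satisfied criterion numbers.

    The string is treated as an edit target when the summed weights of the
    satisfied criteria reach the threshold.
    """
    total = sum(weights[criterion] for criterion in satisfied)
    return total >= threshold


if __name__ == "__main__":
    print(is_edit_target({1, 2}))  # weights 1.0 + 2.0
    print(is_edit_target({1, 3}))  # weights 1.0 + 1.0
```

With weights like these, a string that meets only weak criteria (for example length and type alone) is rejected, while a single strong criterion such as the content-word test is enough on its own.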
  3.  The electronic image document editing system according to claim 1, wherein
    the character string determination criteria include the first determination criterion and the second determination criterion, and
    the processor determines that the recognized character string is an edit target character string when the recognized character string satisfies both the first determination criterion and the second determination criterion.
  4.  The electronic image document editing system according to claim 1, wherein the first threshold is 2.
  5.  The electronic image document editing system according to claim 1, wherein the first type group consists of kanji, hiragana, and katakana.
  6.  The electronic image document editing system according to claim 1, wherein the second threshold is 2.
  7.  The electronic image document editing system according to claim 1, wherein the second type group consists of the types other than symbols, numeral suffixes, digits, alphabetic letters, and kanji not in common use (non-Jōyō kanji).
  8.  The electronic image document editing system according to claim 1, wherein the content word is an independent word.
  9.  The electronic image document editing system according to claim 1, wherein the processor outputs a list of the edit target character strings together with the input electronic image document.
  10.  The electronic image document editing system according to claim 1, wherein the processor outputs the total number of the edit target character strings and the total number of characters in the edit target character strings.
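The two totals of claim 10 amount to a count and a sum of lengths. A minimal sketch, assuming the edit target character strings are already available as a list and that one character corresponds to one Unicode code point (the function name is hypothetical):

```python
def summarize_targets(targets):
    """Return (number of edit target strings, total number of characters)."""
    return len(targets), sum(len(s) for s in targets)


if __name__ == "__main__":
    count, characters = summarize_targets(["文書編集", "画像"])
    print(count, characters)  # prints "2 6"
```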
  11.  The electronic image document editing system according to claim 1, wherein the processor:
    accepts input of a change to the character string determination criteria;
    stores the changed character string determination criteria in the storage device; and
    determines that the recognized character string is an edit target character string when the recognized character string satisfies the changed character string determination criteria.
  12.  A method by which an electronic image document editing system, which edits a character string recognized from an electronic image document, determines whether a character string consisting of one or more characters is an edit target character string, wherein
    the electronic image document editing system holds one or more character string determination criteria, each being a criterion for determining whether a character string consisting of one or more characters is an edit target character string,
    the method comprises:
    accepting input of an electronic image document;
    recognizing, in the input electronic image document, a character string consisting of one or more characters among characters of a plurality of types; and
    determining that the recognized character string is an edit target character string when the recognized character string satisfies the character string determination criteria, and
    the character string determination criteria include at least one of:
    a first determination criterion that the recognized character string consists of at least a first threshold number of characters (the first threshold being an integer of 2 or more);
    a second determination criterion that the recognized character string includes a substring consisting of at least a second threshold number of characters belonging to a first type group that is a part of the plurality of types (the second threshold being an integer of 2 or more);
    a third determination criterion that the recognized character string includes a character belonging to a second type group that is a part of the plurality of types; and
    a fourth determination criterion that the recognized character string includes a content word.
  13.  A program executed in an electronic image document editing system that edits a character string recognized from an electronic image document, wherein
    the electronic image document editing system includes a processor and a storage device,
    the storage device holds one or more character string determination criteria, each being a criterion for determining whether a character string consisting of one or more characters is an edit target character string,
    the program causes the processor to execute:
    a procedure of accepting input of an electronic image document;
    a procedure of recognizing, in the input electronic image document, a character string consisting of one or more characters among characters of a plurality of types; and
    a procedure of determining that the recognized character string is an edit target character string when the recognized character string satisfies the character string determination criteria, and
    the character string determination criteria include at least one of:
    a first determination criterion that the recognized character string consists of at least a first threshold number of characters (the first threshold being an integer of 2 or more);
    a second determination criterion that the recognized character string includes a substring consisting of at least a second threshold number of characters belonging to a first type group that is a part of the plurality of types (the second threshold being an integer of 2 or more);
    a third determination criterion that the recognized character string includes a character belonging to a second type group that is a part of the plurality of types; and
    a fourth determination criterion that the recognized character string includes a content word.
PCT/JP2014/056927 2014-03-14 2014-03-14 Digital image document editing system WO2015136692A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2014/056927 WO2015136692A1 (en) 2014-03-14 2014-03-14 Digital image document editing system
JP2016507228A JPWO2015136692A1 (en) 2014-03-14 2014-03-14 Electronic image document editing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/056927 WO2015136692A1 (en) 2014-03-14 2014-03-14 Digital image document editing system

Publications (1)

Publication Number Publication Date
WO2015136692A1 (en) 2015-09-17

Family

ID=54071169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/056927 WO2015136692A1 (en) 2014-03-14 2014-03-14 Digital image document editing system

Country Status (2)

Country Link
JP (1) JPWO2015136692A1 (en)
WO (1) WO2015136692A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241594A (en) * 2016-12-26 2018-07-03 卡西欧计算机株式会社 Word editing method, electronic equipment and recording medium
US11568659B2 (en) * 2019-03-29 2023-01-31 Fujifilm Business Innovation Corp. Character recognizing apparatus and non-transitory computer readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05266074A (en) * 1992-03-24 1993-10-15 Ricoh Co Ltd Translating image forming device
JP2009205209A (en) * 2008-02-26 2009-09-10 Fuji Xerox Co Ltd Document image processor and document image processing program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIROKI TAKAHASHI ET AL.: "Extraction of Hangul Text from Scenery Images by Using Hangul Structure", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. J88-D-II, no. 9, 1 September 2005 (2005-09-01), pages 1808 - 1816, XP008171336 *

Also Published As

Publication number Publication date
JPWO2015136692A1 (en) 2017-04-06

Similar Documents

Publication Publication Date Title
EP0686286B1 (en) Text input transliteration system
US5640587A (en) Object-oriented rule-based text transliteration system
US8843815B2 (en) System and method for automatically extracting metadata from unstructured electronic documents
US9384389B1 (en) Detecting errors in recognized text
US10671805B2 (en) Digital processing and completion of form documents
US20120136647A1 (en) Machine translation apparatus and non-transitory computer readable medium
US11520835B2 (en) Learning system, learning method, and program
JPWO2008146583A1 (en) Dictionary registration system, dictionary registration method, and dictionary registration program
WO2015136692A1 (en) Digital image document editing system
JP2019179470A (en) Information processing program, information processing method, and information processing device
JP2011238159A (en) Computer system
CN104794140B (en) text highlight display method and device
US10049107B2 (en) Non-transitory computer readable medium and information processing apparatus and method
JP2011039576A (en) Specific information detecting device, specific information detecting method, and specific information detecting program
JP2010026718A (en) Character input device and method
TWM491194U (en) Data checking platform server
JP4466241B2 (en) Document processing method and document processing apparatus
JP2012108893A (en) Hand-written entry method
JP2943791B2 (en) Language identification device, language identification method, and recording medium recording language identification program
JP2011090524A (en) System and program for detecting and displaying difference in document of book
Xiang et al. Recovering semantic relations from web pages based on visual cues
JP2019087233A (en) Information processing device, information processing method, and information processing program
JP6303508B2 (en) Document analysis apparatus, document analysis system, document analysis method, and program
Wu et al. Computer processing of Chinese characters: An overview of two decades' research and development
CN114417871B (en) Model training and named entity recognition method, device, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14885758; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2016507228; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14885758; Country of ref document: EP; Kind code of ref document: A1)