WO2017143973A1 - Text recognition model establishing method and apparatus (文本识别模型建立方法和装置) - Google Patents

Text recognition model establishing method and apparatus

Info

Publication number
WO2017143973A1
WO2017143973A1 (PCT/CN2017/074291; CN2017074291W)
Authority
WO
WIPO (PCT)
Prior art keywords: text, file, files, different, information
Application number: PCT/CN2017/074291
Other languages: English (en), French (fr)
Inventor: 李洁 (Li Jie)
Original Assignee: 中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2017143973A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/24 - Character recognition characterised by the processing or recognition method
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • Embodiments of the present invention relate to the field of communications, and in particular, to a method and apparatus for establishing a text recognition model.
  • Embodiments of the present invention provide a text recognition model establishing method and apparatus, to at least solve the problem in the related art that a text recognition model established from repeatedly acquired identical text files has low accuracy.
  • A text recognition model establishing method includes: acquiring a text file set; selecting text files that are different from each other from the text file set as feature text files; and establishing a text recognition model using the feature text files, wherein the text recognition model is used to identify text information in a text file to be recognized.
  • Selecting the mutually different text files from the text file set as the feature text files includes: selecting them according to a file identifier of each text file in the text file set and/or a storage location identifier of each text file in the text file set.
  • In an embodiment, selecting the feature text files includes: acquiring a first preset number of file identifiers from the text file set according to a preset algorithm to obtain a file identifier set, wherein text files corresponding to the same file identifier in the file identifier set have the same storage location identifier; obtaining the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set; filtering out a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers; and extracting the text files corresponding to the mutually different file identifiers from the text file set as the feature text files.
  • Acquiring the text file set includes: acquiring text information; copying the text information in batches to obtain multiple copies of the text information; and setting text parameters for the multiple copies respectively to obtain text files that are different from each other, wherein the text file set includes these mutually different text files.
  • Obtaining the text information includes: receiving an input first text string as the text information; or reading a second text string stored in the system, dividing the second text string according to a preset policy to obtain a text string set, and extracting a third text string from the text string set as the text information.
  • The text parameter includes at least one of the following: a font format parameter of the text in the text information, a font display size parameter of the text, a blank character size ratio parameter, a spacing size ratio parameter of the text, a rotation angle parameter of the text, a font color parameter of the text, a transparency parameter of the text, a boldness parameter of the text, a tilt degree parameter of the text, an underline drawing parameter of the text, a background picture, and a display position parameter of the text information in the background picture.
  • A text recognition model establishing apparatus includes: an obtaining module, configured to acquire a text file set; a selecting module, configured to select text files that are different from each other from the text file set as feature text files; and a building module, configured to establish a text recognition model using the feature text files, wherein the text recognition model is used to identify text information in a text file to be recognized.
  • The selecting module is configured to select, according to the file identifier of each text file in the text file set and/or the storage location identifier of each text file in the text file set, the mutually different text files from the text file set as the feature text files.
  • The selecting module includes: a first acquiring unit, configured to acquire a first preset number of file identifiers from the text file set according to a preset algorithm to obtain a file identifier set, wherein text files corresponding to the same file identifier in the file identifier set have the same storage location identifier; a second acquiring unit, configured to obtain the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set; a filtering unit, configured to filter out a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers; and an extracting unit, configured to extract the text files corresponding to the mutually different file identifiers from the text file set as the feature text files.
  • The obtaining module includes: a third acquiring unit, configured to acquire text information; a copying unit, configured to copy the text information in batches to obtain multiple copies of the text information; and a setting unit, configured to set text parameters for the multiple copies respectively to obtain text files that are different from each other, wherein the text file set includes these mutually different text files.
  • The third obtaining unit is configured to: receive an input first text string as the text information; or read a second text string stored in the system, divide the second text string according to a preset policy to obtain a text string set, and extract a third text string from the text string set as the text information.
  • a computer storage medium is further provided, and the computer storage medium may store an execution instruction for executing the text recognition model establishing method in the above embodiment.
  • In the embodiment of the present invention, after the text file set is acquired, text files that are different from each other are selected from it as feature text files, and a text recognition model is established using the feature text files, wherein the text recognition model is used to identify text information in a text file to be recognized. That is, by automatically selecting mutually different text files from the text file set as feature text files, the established text recognition model can cover mutually different text files, which ensures the accuracy of the established model and overcomes the problem in the related art that a model established from repeatedly acquired identical text files has low accuracy. Further, it is ensured that the text recognition model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
  • FIG. 1 is a flow chart of an alternative text recognition model establishing method according to an embodiment of the present invention.
  • FIG. 2 is a flow chart of a method for establishing a text recognition model in accordance with an alternative embodiment of the present invention.
  • FIG. 3 is a flow chart of a novel improved linear congruential random number generator in accordance with an alternative embodiment of the present invention.
  • FIG. 4 is a structural block diagram of an optional text recognition model establishing apparatus according to an embodiment of the present invention.
  • FIG. 5 is a structural block diagram of another optional text recognition model establishing apparatus according to an embodiment of the present invention.
  • FIG. 6 is a structural block diagram of another optional text recognition model establishing apparatus according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of an optional text recognition model establishing method according to an embodiment of the present invention. As shown in FIG. 1 , the process includes the following steps:
  • Step S102: acquire a text file set;
  • Step S104: select text files that are different from each other from the text file set as feature text files;
  • Step S106: establish the text recognition model using the feature text files, wherein the text recognition model is used to identify the text information in the text file to be recognized.
  • the embodiment may be, but is not limited to, applied to a scene in which a text recognition model is established.
  • a text recognition model for machine learning is established in the context of Optical Character Recognition (OCR).
  • it may be, but is not limited to, applied to a process of text localization, text detection, or text recognition.
  • A text recognition model for recognizing text information in text files is established by automatically selecting mutually different text files from the text file set as the feature text files, so that the established text recognition model can cover mutually different text files. This ensures the accuracy of the established model and overcomes the problem in the related art that a model established from repeatedly acquired identical text files has low accuracy. Further, it is ensured that the text recognition model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
  • the text recognition model can be used for the training of the OCR text recognition model.
  • OCR can be understood as enabling the computer to recognize the text in a picture. A computer cannot automatically read text embedded in an image, so the text in the picture is first recognized and converted into a text format that the computer can read.
  • To do this, an OCR model obtained through training needs to be built. Before training, OCR text files must be obtained to train the text recognition model.
  • The related-art method is to collect a large number of pictures containing text and label their contents one by one (that is, make the content computer-readable), then have the computer model learn these labeled text files. After training on a large number of text files, the OCR model can recognize the text on a new picture and output it in a computer-readable text format.
  • In this embodiment, since the text recognition model is generated from computer-readable text files, the problem of manually labeling text pictures does not arise.
  • In addition, a random algorithm is added, and the generated text files are randomly selected a second time for training use. Without a random algorithm, if, for example, 1000 picture variants are generated for the character "good" and 1000 for the character "bad", then every character the program takes in would contribute all 1000 output forms, which would instead reduce the accuracy of computer recognition.
  • For example, 1000 variants of the character "good" are generated and 500 are randomly selected; 1000 variants of the character "bad" are generated and 500 are randomly selected. This ensures that the samples are both rich and random.
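The generate-then-subsample idea described here can be sketched as follows (the file names and the 1000/500 figures are taken from the example above; `random.sample` draws without replacement, so the kept variants are mutually different):

```python
import random

def subsample_variants(variants, keep):
    """Randomly keep a fixed-size subset of the generated variants of one
    character, so every character contributes the same number of mutually
    different samples to the training set."""
    if keep >= len(variants):
        return list(variants)
    return random.sample(variants, keep)

# e.g. 1000 generated renderings of one character, keep 500 at random
good_variants = ["good_%04d.png" % i for i in range(1000)]
selected = subsample_variants(good_variants, 500)
```

The same call is repeated per character ("good", "bad", ...), so each contributes an equal, randomly chosen share of samples.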
  • Optionally, the mutually different text files may be selected from the text file set as the feature text files according to the file identifier of each text file in the text file set and/or the storage location identifier of each text file in the text file set.
  • Example 1 is a process of selecting text files that are different from each other as a feature text file from a set of text files according to the file identifier of the text file in the text file collection.
  • In this example, the file identifiers can be selected in batches by a preset algorithm; the duplicate file identifiers are then deleted and the mutually different file identifiers are retained. The corresponding text files are then extracted from the text file set according to the selected mutually different file identifiers as the feature text files, and the text recognition model is established.
  • In this way, the feature text files are obtained according to the characteristic that different text files carry different file identifiers, so that the established text recognition model can cover mutually different text files. This ensures the accuracy of the established model and overcomes the problem in the related art that a model established from repeatedly acquired identical text files has low accuracy. Further, it is ensured that the text recognition model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
  • the second example is a process of selecting text files that are different from each other as a feature text file from a set of text files according to the storage location identifier of the text file in the text file collection.
  • In this example, the storage location identifiers may be selected in batches by a preset algorithm; the duplicate storage location identifiers are deleted and the mutually different ones are retained. The corresponding text files are then extracted from the text file set according to the selected mutually different storage location identifiers as the feature text files, and the text recognition model is established.
  • In this way, the feature text files are obtained according to the characteristic that different text files carry different storage location identifiers, so that the established text recognition model can cover mutually different text files. This ensures the accuracy of the established model and overcomes the problem in the related art that a model established from repeatedly acquired identical text files has low accuracy. Further, it is ensured that the text recognition model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
  • the third example is a process of selecting text files that are different from each other as a feature text file from a set of text files according to the file identifier of the text file in the text file collection and the storage location identifier of the text file in the text file collection.
  • In the third example, text files may be selected in batches from the text file set according to their file identifiers, and the batch-selected file identifiers may contain duplicates. Text files with the same file identifier are stored in the same storage location, so that different file identifiers carry different storage location identifiers. Mutually different storage location identifiers are then selected in batches, mutually different file identifiers are obtained from them, and the corresponding mutually different text files are extracted from the text file set as the feature text files, so that the text recognition model is established.
  • That is, among the file identifiers obtained in batches (which may contain duplicates), identifiers stored in the same location are the same, and identifiers stored in different locations are mutually different. The feature text files are extracted from the text file set according to the mutually different file identifiers derived from the different storage location identifiers, so that the established text recognition model can cover mutually different text files, ensuring its accuracy and overcoming the problem in the related art that a model established from repeatedly acquired identical text files has low accuracy. Further, it is ensured that the text recognition model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
  • the manner of acquiring the text file set may be obtaining the related text file set, or generating the text file set according to the predetermined rule.
  • The manner of generating the text file set may be, but is not limited to, generating text files in batches and then selecting from them the text files that constitute the text file set, or selecting existing text files to form the text file set.
  • Optionally, image processing may be applied to the generated text files; the processing manner includes, but is not limited to, blur, noise, sharpening, illumination, and the like.
  • The obtained text information may be copied in batches to obtain many copies, and different text parameters are set for each copy, so that the resulting mutually different text files form the text file set. Setting different text parameters for a large number of identical copies of the text information ensures that the text file set stores text files with the same text information but different text parameters, and ensures that, when recognizing a text file, the text information can be recognized from text files of various forms.
  • the form of acquisition of the text information may be, but not limited to, receiving the input text string, or reading the text string stored in the system.
  • The read text string is divided into a plurality of different text strings according to a predetermined rule, and then one of them is extracted as the text information of a generated text file.
  • the division unit may be, but is not limited to, one line, multiple lines, one word, multiple words, one word, multiple words, and the like.
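The division units just mentioned (one line, one word, and so on) can be sketched as a simple segmentation helper; the unit names `"line"` and `"word"` below are illustrative, not the patent's terminology:

```python
def split_text(source_text, unit="line"):
    """Divide a stored text string into segmentation objects.
    unit='line' splits on line breaks; unit='word' on whitespace."""
    if unit == "line":
        parts = source_text.splitlines()
    elif unit == "word":
        parts = source_text.split()
    else:
        raise ValueError("unsupported unit: %s" % unit)
    return [p for p in parts if p]  # drop empty segments

sample = "first line of text\nsecond line\nthird"
lines = split_text(sample, "line")  # 3 segments
words = split_text(sample, "word")  # 7 segments
```

Each returned segment would then serve as the text information of one generated text file.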
  • The generated text files carry the same text information, but their text parameters are mutually different, so the conditions for establishing the text recognition model are met.
  • the text parameter may include, but is not limited to, at least one of the following: a font format, a font display size, a blank character size ratio, a space size ratio of the text, a rotation angle of the text, a font color of the text, and a transparency parameter of the text.
  • These text parameters may be set by calling, but are not limited to, an interface of the open-source computer vision library (OpenCV).
  • the background picture is taken as an example to illustrate the setting process of the text parameter.
  • The same text information may be added to different background images to generate different text files, and different text information may be added to the same background image to generate different text files, thereby obtaining a large number of text files.
  • Then, according to the file identifier of each text file in the text file set and/or the storage location identifier of each text file in the text file set, text files that are different from each other may be selected from the text file set as the feature text files.
  • In an embodiment, selecting the mutually different text files as the feature text files includes: acquiring a first preset number of file identifiers from the text file set according to a preset algorithm to obtain a file identifier set, wherein text files corresponding to the same file identifier in the file identifier set have the same storage location identifier; obtaining the storage location identifiers corresponding to the file identifiers in the file identifier set; filtering out a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers; and extracting the text files corresponding to the mutually different file identifiers from the text file set as the feature text files.
  • Example 1. Filtering a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers may be, but is not limited to, the following process. Repeat the following steps until the number of acquired mutually different file identifiers reaches the second preset number: determine whether the number of currently obtained mutually different file identifiers has reached the second preset number; if not, obtain a storage location identifier from the storage location identifier set (which stores the storage location identifiers that have not yet been used to generate a variable) and generate a current variable from it; obtain the random number corresponding to the current variable in a preset random array; obtain the file identifier corresponding to that random number from the file identifier set as a currently obtained mutually different file identifier; and update the set of currently obtained mutually different file identifiers.
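A rough sketch of the filtering loop just described, under stated assumptions: the patent does not specify how a storage location identifier is turned into the "current variable" or how a random number maps back to a file identifier, so the modulo mappings below (`sid % len(random_array)` and `rnd % len(file_ids)`) are placeholders for illustration only:

```python
def filter_file_ids(file_ids, storage_ids, random_array, n_wanted):
    """Repeatedly consume unused storage location identifiers; each one
    yields a variable that indexes a preset random array, and the
    resulting random number selects a file identifier. Only mutually
    different file identifiers are kept."""
    selected = []
    unused = list(storage_ids)  # identifiers not yet used to generate a variable
    while len(selected) < n_wanted and unused:
        sid = unused.pop(0)
        var = sid % len(random_array)      # assumed identifier-to-variable mapping
        rnd = random_array[var]            # random number from the preset array
        fid = file_ids[rnd % len(file_ids)]  # assumed number-to-identifier mapping
        if fid not in selected:            # keep mutually different identifiers only
            selected.append(fid)
    return selected

out = filter_file_ids([10, 11, 12, 13], [0, 1, 2, 3, 4], [3, 1, 2, 0, 1], 3)
```

The loop stops once the second preset number (`n_wanted`) of distinct identifiers has been collected, matching the termination condition in the text.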
  • Here, W is the index of a binary digit after converting the storage location identifier into binary (indices are taken from the lowest bit upward, starting from 0), and l denotes the number of the storage location identifier, taking integer values from 0 to L-1; I_(W+l) is the storage location identifier obtained from the storage location identifier set, which stores the identifiers that have not yet been used to generate n.
  • L may be, but is not limited to, preset, and the values of W and l increase in sequence. Because the storage location identifiers in the set do not repeat, I_(W+l) itself does not repeat; multiplying by 2^l disturbs the order of the storage location identifiers and further ensures the randomness of the obtained identifier. The larger L is, the more random the obtained storage location identifier, and the larger the random array V[N] obtained after shuffling the storage location identifiers. L may be chosen reasonably in implementation according to actual conditions.
  • Example 2 The process of obtaining the file identifier in the first preset number of text file sets according to the preset algorithm may be: acquiring a first preset quantity according to a preset random number generator (for example, a linear congruential random number generator) The file identifier.
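A linear congruential generator of the kind mentioned above follows the recurrence x_i = (a*x_(i-1) + c) mod M; a minimal sketch (the parameter values below are illustrative, not the patent's preset constants):

```python
def lcg(x0, a, c, m, count):
    """Linear congruential generator: x_i = (a*x_{i-1} + c) mod m.
    Returns `count` pseudo-random values, usable for picking the first
    preset number of file identifiers from the text file set."""
    x = x0
    out = []
    for _ in range(count):
        x = (a * x + c) % m
        out.append(x)
    return out

# small illustrative parameters; real ones would be the preset values
seq = lcg(x0=7, a=1103515245, c=12345, m=2**31, count=5)
```

Each generated value can be reduced modulo the number of files to index into the text file set.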
  • the process of obtaining the set of text files may be: acquiring text information; copying the text information in batches to obtain a plurality of text information; respectively setting text parameters for the plurality of text information, and obtaining text files different from each other, wherein The set of text files includes the text files that are different from each other.
  • Optionally, an input first text string may be received as the text information; or a second text string stored in the system may be read, divided according to a preset policy to obtain a text string set, and a third text string extracted from the text string set as the text information.
  • The text parameter may include, but is not limited to, at least one of the following: a font format parameter of the text in the text information, a font display size parameter of the text, a blank character size ratio parameter, a spacing size ratio parameter of the text, a rotation angle parameter of the text, a font color parameter of the text, a transparency parameter of the text, a boldness parameter of the text, a tilt degree parameter of the text, an underline drawing parameter of the text, a background picture, and a display position parameter of the text information in the background picture.
  • In the following optional embodiment, a text file is exemplified as a sample, the text file set as a batch sample set, and the feature text file as a feature sample.
  • This alternative embodiment proposes a batch sample generation method for text localization, detection and recognition.
  • This optional embodiment solves the problem that, when related machine-learning-based OCR is used to recognize text images with complex backgrounds, the same text file may be repeatedly obtained, resulting in low accuracy of the established text recognition model.
  • a text recognition model generating method for text localization, detection and recognition of the alternative embodiment comprises the following steps:
  • Step 1: load the text information; two loading modes may be provided: inputting a text string (in this mode, perform Step 3), or reading a stored text string (in this mode, perform Step 2);
  • Step 2: select a predetermined rule to divide the read text string into a plurality of objects, and save the segmented text strings to a specified path;
  • Step 3: select a background image to be loaded from the background image library;
  • Step 4: read the segmented text strings or the input string, and batch-set the text parameters, where the text parameters include at least one of the following: font format, font display size, blank character size ratio, spacing size ratio, rotation angle, display position, font color, transparency, boldness, degree of tilt, underline drawing, etc.;
  • Step 5: add the various pieces of text information, after batch-setting the text parameters, to the background picture to generate text files;
  • Step 6: decide, according to requirements, whether to perform image processing on the text files: if image processing is required, perform Step 7; otherwise, perform Step 8;
  • Step 7: perform image processing on the text files, where the image processing includes blur, noise, sharpening, illumination, etc.;
  • Step 8: use a novel improved linear congruential random number generator to ensure the randomness of the feature text files:
  • Step 8-1: set a random rule for the generated text files: x_i = (a*x_(i-1) + c) mod M, where x_0 is the initial text file, M is the modulus, a is a multiplier, c is an increment, and x_0, M, a, and c are preset values.
  • Step 8-2: generate x_i and a*x_(i-1) according to Step 8-1, where x_i and a*x_(i-1) correspond to text files in the text file set;
  • Here, W is the index of a binary digit after converting the storage location identifier into binary (indices are taken from the lowest bit upward, starting from 0), l denotes the number of the storage location identifier, taking integer values from 0 to L-1, and I_(W+l) is the storage location identifier indicated by the storage location of the integer a*x_(i-1) or x_i in the computer;
  • Step 8-5: extract the x_i corresponding to a preset number of random numbers y_i, and obtain the corresponding text files as the feature text files;
  • Step 9: re-save and rename the selected feature text files (for example, rename them with sequential numbers), and generate the text recognition model.
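Step 9's re-save-and-rename can be sketched as follows; the destination layout and the copy-based re-save are assumptions for illustration, since the patent only specifies renaming by sequential numbers:

```python
import os
import shutil

def save_feature_files(selected_paths, dest_dir):
    """Re-save the selected feature text files under sequential names
    (0.png, 1.png, ...) in the destination directory, as described in
    Step 9."""
    os.makedirs(dest_dir, exist_ok=True)
    renamed = []
    for i, src in enumerate(selected_paths):
        ext = os.path.splitext(src)[1]
        dst = os.path.join(dest_dir, "%d%s" % (i, ext))
        shutil.copy(src, dst)  # copy rather than move, keeping the originals
        renamed.append(dst)
    return renamed
```

The renamed set then serves as the training input for generating the text recognition model.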
  • FIG. 2 is a flowchart of a text recognition model establishing method according to an alternative embodiment of the present invention, wherein the text string is exemplified by a text document of the format *.txt.
  • the process includes the following steps:
  • Step S202: load the text information and determine whether to read a stored text string.
  • Loading the text information supports two modes: inputting a text string, or obtaining it from a pre-stored text string. If it is determined that a stored text string is to be read, perform Step S204-2; if not (that is, a text string needs to be input), perform Step S204-1.
  • Step S204-1: input a text string.
  • Step S204-2: select a predetermined rule to divide the read text string into a plurality of objects, choosing "line segmentation" or "word segmentation" as required; save the segmented text strings (in *.txt format) to a specified path, named Path_A; the segmented text file to be processed under Path_A is named source-text.txt.
  • Step S206: load a background image.
  • The supported image formats are: Windows bitmap files (BMP, DIB); JPEG files (JPEG, JPG, JPE); Portable Network Graphics (PNG); portable image formats (PBM, PGM, PPM); Sun raster images (SR, RAS); TIFF images (TIFF, TIF); OpenEXR HDR images (EXR); and JPEG 2000 images (JP2).
  • Step S208: batch operations, where Step S208 includes:
  • Step S208-1: batch-set the text parameters for the text string source-text.txt or the input text string:
  • Optional formats include, but are not limited to, fonts from the following font libraries: Type 1 fonts and collections; CID-keyed Type 1 fonts; CFF fonts; OpenType fonts (both TrueType and CFF variants); SFNT-based bitmap fonts; X11 PCF fonts; Windows FNT fonts; and BDF fonts (including anti-aliased ones);
  • Batch font position setting: set the position at which the text is displayed in the picture; this can be done, but is not limited to, by batch-setting the horizontal and vertical coordinates of the first pixel in the upper-left corner of the text;
  • Batch font color setting: in RGB format, preset arrays of different R\G\B value combinations are used to generate batches of fonts in different colors;
  • Batch font transparency setting: the setting range may be 0~100%;
  • Batch font rendering effect settings: bold (the degree of boldness, vertical bold or horizontal bold can be set separately), tilt (different tilt angles can be set), stroke drawing, shadow drawing, underline drawing, and so on.
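The batch settings above amount to enumerating combinations of text parameters, one generated text file per combination. A minimal sketch of that enumeration (the concrete parameter values below are hypothetical illustrations, not values fixed by this document):

```python
from itertools import product

# Hypothetical parameter values for illustration only.
fonts = ["TrueType-A", "Type1-B", "OpenType-C"]
colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]   # R\G\B combinations
positions = [(0, 0), (10, 20)]                     # upper-left pixel of the text
transparencies = [0.0, 0.5, 1.0]                   # 0~100% expressed as 0.0~1.0

# One parameter record per generated text file.
variants = [
    {"font": f, "color": c, "pos": p, "alpha": a}
    for f, c, p, a in product(fonts, colors, positions, transparencies)
]

print(len(variants))  # 3 * 3 * 2 * 3 = 54 distinct parameter combinations
```

Each record would then drive one rendering pass onto the background picture, which is how a small amount of text information yields a large, diverse batch of samples.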
  • Step S208-2: write the various different texts, after batch parameter adjustment, onto the background image to produce text files.
  • Step S208-3 determining whether to perform image processing according to requirements: if image processing is required, step S208-4 is performed, and if image processing is not required, step S208-5 is performed.
  • Step S208-4 the combination selection performs image processing on the series of text files obtained in step S208-2, and the image processing may include blurring, noise, sharpening, illumination, etc.; after the image processing, step S208-5 is continued.
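The image-processing operations in step S208-4 are standard augmentations. As a hedged illustration of one of them, a 3x3 box blur over a grayscale image (represented here as a plain 2D list, deliberately avoiding any imaging library) can be sketched as:

```python
def box_blur(img):
    """3x3 box blur with edge clamping on a 2D list of gray values."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, cnt = 0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        acc += img[ny][nx]
                        cnt += 1
            out[y][x] = acc // cnt   # average over the in-bounds neighborhood
    return out

blurred = box_blur([[0, 0, 0], [0, 90, 0], [0, 0, 0]])
print(blurred[1][1])  # the bright center pixel is averaged down to 10
```

Noise, sharpening and illumination changes would be applied analogously, each producing further distinct variants of the same text sample.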
  • Step S208-5 renaming the batch-generated text file (for example, renaming in order), storing the text file as a new format, and selecting its save path_B (Path_B).
  • Step S210 Generate a feature text file.
  • In step S210, a new improved linear congruential random number generator is provided to ensure the arbitrary randomness of the generated feature samples; the generation process of this generator is shown in FIG. 3. It can be seen that the randomness of the generated feature text files can be guaranteed by the improved linear congruential random number generator described above.
  • the flow of the above generation process is shown in FIG. 3, and includes the following steps:
  • Step S302: load the batch text files xi, and set a random rule for the generated batch text files:
  • xi = (a·xi-1 + c) mod (M)
  • where x0 is the initial text file, M is the modulus (M > 0), a is a multiplier (0 < a < M), c is an increment (0 ≤ c < M); x0, M, a and c are preset values.
  • Step S304: generate xi and a·xi-1 by step S302, where xi and a·xi-1 are taken from the text file set; a variable n is then computed from the storage location identifiers, where:
  • L is the number of binary bits after converting the storage location identifier into binary;
  • W is the index of the binary bit after the conversion (taking values in order from the low bit to the high bit, starting from 0);
  • l denotes the number of the storage location identifier, taking integers from 0 to L-1 in turn;
  • IW+l is the storage location identifier indicated by the storage location of the integer a·xi-1 or xi in the computer.
  • Then yi = V[n] is assigned, where V[n] is a random number in the auxiliary random array V[N].
  • Step S310: extract the xi corresponding to the obtained preset number of random numbers yi, and find the corresponding text files under the save path Path_B.
  • Step S312: re-store the selected text files under new names (renamed in sequential order), save them to a target path (Path_target), and generate the batch feature text files.
  • The present invention is not limited to the above method for obtaining a number of samples sufficient for training machine learning; other random methods can also be used to generate the feature text files.
  • Step S212 Select a save format and a path of the feature text file.
  • Step S214 saving the feature text file.
  • The embodiments and optional embodiments of the present invention can generate the large variety of required text files in bulk and on demand. The advantages are as follows. First, the input text can be entered through a "personalized" edit command, or a related text string can be read directly and split to obtain the desired text passages. Second, a large number of methods are added to achieve random generation across font format, font display size, blank character size ratio, spacing size ratio, rotation angle, display position, font color, transparency setting, boldness degree, tilt degree, underline drawing and so on; adding a series of image processing operations such as blur, noise, sharpening and illumination further expands the sample diversity.
  • Third, a new improved linear congruential random number generator method is provided to ensure the "randomness" of the generated samples, providing more complete and reasonable samples for subsequent machine-learning-based model training and ensuring that the trained model has higher accuracy.
  • The text recognition model establishment method significantly saves labor costs and greatly improves the training efficiency of machine learning.
  • The method according to the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • The technical solution of the present invention, in essence or in its contribution to the related art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or a CD-ROM).
  • The software product includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present invention.
  • In this embodiment, a text recognition model establishing apparatus is further provided, which is used to implement the above-mentioned embodiments and optional embodiments; what has already been described will not be repeated.
  • As used below, the term "module" may implement a combination of software and/or hardware of a predetermined function.
  • Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 4 is a structural block diagram of an optional text recognition model establishing apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes:
  • the obtaining module 42 is configured to obtain a set of text files
  • a selection module 44 coupled to the acquisition module 42, configured to select text files that are different from each other as a feature text file from the set of text files;
  • a setup module 46 coupled to the selection module 44, configured to establish a text recognition model using the feature text file, wherein the text recognition model is used to identify textual information in the text file to be recognized.
  • the embodiment may be, but is not limited to, applied to a scene in which a text recognition model is established.
  • a text recognition model for machine learning is established in an optical character recognition scenario.
  • In this scenario, the obtaining module 42 obtains a large number of text files to form a text file collection, and the selection module 44 automatically selects text files that are different from each other from the text file collection.
  • The creation module 46 then creates a text recognition model for identifying the text information in text files, so that the created text recognition model can cover different text files. This ensures the accuracy of the established model and overcomes the related-art problem that a text recognition model established from the same, repeatedly obtained text file has low accuracy; it further ensures that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
  • the selection module 44 may be, but is not limited to, being configured to select different text files from the set of text files according to the file identifier of the text file in the text file collection and/or the storage location identifier of the text file in the text file collection. As a feature text file.
  • the process by which the selection module 44 selects text files that are different from each other as the feature text file from the set of text files is explained below by three examples.
  • Example 1 is a process in which the selection module 44 selects text files that are different from each other as a feature text file from a set of text files according to the file identifier of the text file in the text file collection.
  • In Example 1, since different text files in the text file collection carry different file identifiers, the selection module 44 may batch-select file identifiers using a preset algorithm, delete duplicate file identifiers, and retain the distinct ones. Then, the corresponding text files are extracted from the text file collection according to the selected distinct file identifiers and used as the feature text files to establish the text recognition model.
  • Through the above steps, the feature text files are obtained based on the fact that different text files carry different identifiers, so that the established text recognition model can cover different text files. This ensures the accuracy of the established model, overcomes the related-art problem that a text recognition model created from the same, repeatedly obtained text file has low accuracy, and further ensures that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
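The selection just described boils down to dropping duplicate identifiers from a batch and keeping one file per distinct identifier. A minimal sketch (the identifier names are hypothetical):

```python
def dedup_by_id(batch_ids):
    """Keep the first occurrence of each file identifier, preserving order."""
    seen = set()
    distinct = []
    for fid in batch_ids:
        if fid not in seen:
            seen.add(fid)
            distinct.append(fid)
    return distinct

# Batch-selected identifiers may repeat; duplicates are removed.
batch = ["f3", "f1", "f3", "f7", "f1", "f2"]
print(dedup_by_id(batch))  # → ['f3', 'f1', 'f7', 'f2']
```

The surviving identifiers would then be used to look up the corresponding text files in the collection as the feature text files.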
  • the second example is a process in which the selection module 44 selects text files that are different from each other as the feature text file from the set of text files according to the storage location identifier of the text file in the text file collection.
  • In Example 2, since different text files in the collection are stored at different locations and therefore carry different storage location identifiers, the selecting module 44 may batch-select storage location identifiers using a preset algorithm, delete duplicate storage location identifiers, and keep the distinct ones. Then, the corresponding text files are extracted from the text file collection according to the selected distinct storage location identifiers and used as the feature text files to establish the text recognition model.
  • the feature text files are obtained by carrying different storage location identifiers, so that the established text recognition model can cover different text files to ensure the accuracy of the established text recognition model. And overcome the problem of low accuracy of the text recognition model established by using the same text file repeatedly obtained in the related art. Further, it is ensured that the text recognition model established by the text recognition model establishing method provided in the embodiment can accurately recognize the text information in the text picture.
  • the third example is a process in which the selection module 44 selects text files that are different from each other as the feature text file from the text file set according to the file identifier of the text file in the text file set and the storage location identifier of the text file in the text file set.
  • In Example 3, the selection module 44 may first batch-select file identifiers from the text file collection; at this time, the batch-selected file identifiers may contain duplicates. Different file identifiers are then stored at different storage locations, while identical file identifiers are stored at the same storage location, so that different file identifiers carry mutually different storage location identifiers. Next, mutually different storage location identifiers are selected in batches, and the mutually different file identifiers are obtained from them, so that the corresponding mutually different text files are obtained from the text file collection as the feature text files and the text recognition model is established.
  • Through the above steps, identical file identifiers among the possibly duplicated, batch-obtained file identifiers are stored at the same location, which ensures that distinct file identifiers correspond to distinct storage location identifiers. The distinct file identifiers screened out according to the distinct storage location identifiers are used to extract the feature text files from the text file collection, so that the established text recognition model can cover different text files. This ensures the accuracy of the established model, overcomes the related-art problem that a text recognition model created from the same, repeatedly obtained text file has low accuracy, and further ensures that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
  • the obtaining module 42 may obtain the set of text files by acquiring the related text file set, or may generate the text file set according to the predetermined rule.
  • the method for generating a collection of text files may be, but is not limited to, generating a text file in batches, and then selecting a text file that constitutes a collection of text files from the generated text file, or selecting an existing text file to form a collection of text files.
  • the obtaining module 42 may also determine whether to process the text file before generating the text file set, wherein the processing manner includes but is not limited to: blur, noise, sharpening, illumination, and the like.
  • In this embodiment, the obtaining module 42 may copy the obtained text information in batches to obtain a large amount of that text information, and set different text parameters for each copy to obtain a plurality of mutually different text files that make up the text file collection.
  • Through the above steps, different text parameters are set for a large amount of identical text information, and the resulting mutually different text files form the text file collection. This ensures that the collection stores text files with the same text information but different text parameters, so that in subsequent recognition the text information can be recognized from text files of various forms.
  • the form in which the acquisition module 42 obtains the text information may be, but is not limited to, receiving the input text string, or reading the stored text string in the system.
  • If the text information is obtained by reading a text string stored in the system, the obtaining module 42 divides the read text string into a plurality of different text strings according to a predetermined rule, and then extracts one of them as the text information of the generated text file. The division unit may be, but is not limited to, one line, multiple lines, one character, multiple characters, one word, multiple words, and the like.
  • Through the above steps, it is guaranteed that the generated text files carry the same text information while the text parameters of the text information differ from each other, which satisfies the conditions for establishing the text recognition model.
  • The text parameters may include, but are not limited to, at least one of the following: a font format, a font display size, a blank character size ratio, a spacing size ratio of the text, a rotation angle of the text, a font color of the text, a transparency parameter of the text, a boldness degree of the text, a tilt degree of the text, underline drawing of the text, a background picture, and a display position of the text information in the background picture. Optionally, these text parameters may be set by calling, but not limited to calling, an interface of the open-source computer vision library (OpenCV).
  • the background picture is taken as an example to illustrate the setting process of the text parameter.
  • After obtaining the text information, the obtaining module 42 sets different text parameters for the text information in batches, and adds the text information with different text parameters to one or more background images obtained from the background image library. The same text information can be added to different background images to generate different text files, and different text information can be added to the same background image to generate different text files, thus obtaining a large number of text files.
  • Optionally, the selection module 44 is configured to select text files that are different from each other from the text file collection as the feature text files according to the file identifiers of the text files in the collection and/or the storage location identifiers of the text files in the collection.
  • FIG. 5 is a structural block diagram of another optional text recognition model establishing apparatus according to an embodiment of the present invention.
  • the selecting module 44 includes:
  • the first obtaining unit 52 is configured to obtain, according to a preset algorithm, a first preset number of file identifiers from the text file collection to obtain a file identifier set, where text files corresponding to the same file identifier in the file identifier set have the same storage location identifier;
  • the second obtaining unit 54 coupled to the first obtaining unit 52, is configured to obtain different storage location identifiers corresponding to the file identifiers in the file identifier set;
  • the selection unit 56 is coupled to the second obtaining unit 54 and configured to select a second preset number of different file identifiers from the file identifier set according to different storage location identifiers;
  • the extracting unit 58 coupled to the selecting unit 56, is configured to extract a text file corresponding to the file identifiers different from each other as the feature text file from the set of text files.
  • FIG. 6 is a structural block diagram of another optional text recognition model establishing apparatus according to an embodiment of the present invention.
  • the obtaining module 42 includes:
  • the third obtaining unit 62 is configured to obtain text information
  • a copying unit 64 coupled to the third obtaining unit 62, configured to batch copy the text information to obtain a plurality of text information
  • the setting unit 66 is coupled to the copy unit 64, and is configured to respectively set text parameters for the plurality of text information to obtain text files that are different from each other, wherein the text file set includes text files that are different from each other.
  • Optionally, the third obtaining unit 62 is configured to: receive an input first text string as the text information; or read a second text string stored in the system, divide the second text string according to a preset policy to obtain a text string set, and extract a third text string from the text string set as the text information.
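Dividing a stored text string by a preset policy can be sketched as follows, using the line and word segmentation modes mentioned elsewhere in this document (the policy names themselves are illustrative):

```python
def split_text(text, policy="line"):
    """Split a text string into a list of candidate strings by policy."""
    if policy == "line":
        parts = text.splitlines()
    elif policy == "word":
        parts = text.split()
    else:
        raise ValueError(f"unknown policy: {policy}")
    return [p for p in parts if p]   # drop empty fragments

stored = "first line\nsecond line\nthird"
print(split_text(stored, "line"))   # → ['first line', 'second line', 'third']
print(split_text(stored, "word"))   # → ['first', 'line', 'second', 'line', 'third']
```

Any one element of the resulting set would then serve as the text information written into the generated text files.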
  • Optionally, the text parameter includes at least one of the following: a font format parameter of the text in the text information, a font display size parameter of the text, a blank character size ratio parameter, a spacing size ratio parameter of the text, a rotation angle parameter of the text, a font color parameter of the text, a transparency parameter of the text, a boldness degree parameter of the text, a tilt degree parameter of the text, an underline drawing parameter of the text, a background picture, and a display position parameter of the text information in the background picture.
  • each of the above modules may be implemented by software or hardware.
  • Optionally, the modules may be implemented by, but not limited to, the following: the above modules are all located in the same processor, or the modules are located in multiple processors.
  • Embodiments of the present invention also provide a storage medium.
  • the above storage medium may be configured to store program code for performing the following steps:
  • Step S1 acquiring a text file set
  • Step S2 selecting text files that are different from each other as a feature text file from the set of text files;
  • Step S3 the text recognition model is established using the feature text file, wherein the text recognition model is used to identify the text information in the text file to be recognized.
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), and a Random Access Memory (RAM).
  • The modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module.
  • In this way, the invention is not limited to any specific combination of hardware and software.
  • Through the embodiments of the present invention, after the text file collection is obtained, text files that are different from each other are selected from the collection as feature text files, and a text recognition model is established using the feature text files, where the text recognition model is used to identify the text information in a text file to be recognized.
  • That is to say, by automatically selecting mutually different text files from the text file collection as feature text files, a text recognition model for identifying the text information in text files is established, so that the established model can cover different text files. This ensures the accuracy of the established text recognition model, overcomes the related-art problem that a model established from the same, repeatedly obtained text file has low accuracy, and further ensures that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.


Abstract

A text recognition model establishment method and apparatus. The method includes: obtaining a text file collection (S102); selecting text files that are different from each other from the text file collection as feature text files (S104); and establishing a text recognition model using the feature text files (S106), where the text recognition model is used to identify the text information in a text file to be recognized. The method and apparatus solve the related-art problem that a text recognition model established from the same, repeatedly obtained text file has low accuracy, thereby achieving the effect of improving the accuracy of the established text recognition model.

Description

Text Recognition Model Establishment Method and Apparatus
Technical Field
The embodiments of the present invention relate to the field of communications, and in particular to a text recognition model establishment method and apparatus.
Background Art
With the development of the Internet and the popularization of mobile devices, a large number of network-synthesized text pictures containing complex noise or various deformations have been produced. In order to mine valuable information from the large amount of publicly available multimedia data, recognizing these complex network-synthesized text pictures is of great significance.
However, recognizing complex network-synthesized text pictures is quite challenging. On the one hand, such pictures are diverse: they may have different fonts, colors, sizes, orientations and arrangements. On the other hand, they suffer from noise, blur, illumination and occlusion, which brings great difficulties to text detection and recognition.
If the traditional optical character recognition (OCR) method is used to recognize these network-synthesized text pictures, it will be difficult to meet the predetermined requirements in terms of recognition speed and accuracy. With the emergence of machine learning methods, OCR of text pictures with complex backgrounds has made breakthrough progress. However, before machine learning can be used for text recognition, a large number of text files are required as training samples to establish a text recognition model. At present, in the related process of establishing a text recognition model, the same text file is often obtained repeatedly; a text recognition model established from such identical text files cannot cover all text content, so the model cannot perform accurate text recognition.
No effective solution has yet been proposed for the related-art problem that a text recognition model established from the same, repeatedly obtained text file has low accuracy.
Summary of the Invention
The embodiments of the present invention provide a text recognition model establishment method and apparatus, so as to at least solve the related-art problem that a text recognition model established from the same, repeatedly obtained text file has low accuracy.
According to one aspect of the embodiments of the present invention, a text recognition model establishment method is provided, including: obtaining a text file collection; selecting text files that are different from each other from the text file collection as feature text files; and establishing a text recognition model using the feature text files, where the text recognition model is used to identify the text information in a text file to be recognized.
Optionally, selecting the mutually different text files from the text file collection as the feature text files includes: selecting the mutually different text files from the text file collection as the feature text files according to the file identifiers of the text files in the collection and/or the storage location identifiers of the text files in the collection.
Optionally, selecting the mutually different text files from the text file collection as the feature text files according to the file identifiers and/or the storage location identifiers includes: obtaining, according to a preset algorithm, a first preset number of file identifiers from the text file collection to obtain a file identifier set, where text files corresponding to the same file identifier in the file identifier set have the same storage location identifier; obtaining the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set; screening out a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers; and extracting the text files corresponding to the mutually different file identifiers from the collection as the feature text files.
Optionally, obtaining the text file collection includes: obtaining text information; copying the text information in batches to obtain multiple copies of the text information; and setting text parameters for the multiple copies respectively to obtain mutually different text files, where the text file collection includes the mutually different text files.
Optionally, obtaining the text information includes: receiving an input first text string as the text information; or reading a second text string stored in the system, dividing the second text string according to a preset policy to obtain a text string set, and extracting a third text string from the text string set as the text information.
Optionally, the text parameters include at least one of the following: a font format parameter of the text in the text information, a font display size parameter of the text, a blank character size ratio parameter, a spacing size ratio parameter of the text, a rotation angle parameter of the text, a font color parameter of the text, a transparency parameter of the text, a boldness degree parameter of the text, a tilt degree parameter of the text, an underline drawing parameter of the text, a background picture, and a display position parameter of the text information in the background picture.
According to another aspect of the embodiments of the present invention, a text recognition model establishment apparatus is further provided, including: an obtaining module configured to obtain a text file collection; a selection module configured to select text files that are different from each other from the text file collection as feature text files; and an establishment module configured to establish a text recognition model using the feature text files, where the text recognition model is used to identify the text information in a text file to be recognized.
Optionally, the selection module is configured to select the mutually different text files from the text file collection as the feature text files according to the file identifiers of the text files in the collection and/or the storage location identifiers of the text files in the collection.
Optionally, the selection module includes: a first obtaining unit configured to obtain, according to a preset algorithm, a first preset number of file identifiers from the text file collection to obtain a file identifier set, where text files corresponding to the same file identifier in the file identifier set have the same storage location identifier; a second obtaining unit configured to obtain the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set; a selection unit configured to select a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers; and an extraction unit configured to extract the text files corresponding to the mutually different file identifiers from the collection as the feature text files.
Optionally, the obtaining module includes: a third obtaining unit configured to obtain text information; a copying unit configured to copy the text information in batches to obtain multiple copies of the text information; and a setting unit configured to set text parameters for the multiple copies respectively to obtain mutually different text files, where the text file collection includes the mutually different text files.
Optionally, the third obtaining unit is configured to: receive an input first text string as the text information; or read a second text string stored in the system, divide the second text string according to a preset policy to obtain a text string set, and extract a third text string from the text string set as the text information.
In an embodiment of the present invention, a computer storage medium is further provided; the computer storage medium may store execution instructions for performing the text recognition model establishment method in the above embodiments.
Through the embodiments of the present invention, after the text file collection is obtained, text files that are different from each other are selected from the collection as feature text files, and a text recognition model is established using the feature text files, where the text recognition model is used to identify the text information in a text file to be recognized. That is to say, by automatically selecting mutually different text files from the text file collection as feature text files, a text recognition model for identifying the text information in text files is established, so that the established model can cover different text files, the accuracy of the established model is ensured, and the related-art problem that a text recognition model established from the same, repeatedly obtained text file has low accuracy is overcome. This further ensures that the text recognition model established by the method provided in the embodiments can accurately recognize the text information in a text picture.
In addition, establishing the text recognition model by automatically selecting mutually different text files from the text file collection can also reduce the number of text files used as training samples for establishing the model, that is, reduce the number of repeatedly obtained text files, thereby improving the efficiency of establishing the text recognition model and avoiding the low efficiency caused by obtaining too many text files.
Brief Description of the Drawings
The drawings described here are used to provide a further understanding of the present invention and constitute a part of this application; the exemplary embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation on it. In the drawings:
FIG. 1 is a flowchart of an optional text recognition model establishment method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a text recognition model establishment method according to an optional embodiment of the present invention;
FIG. 3 is a flowchart of a new improved linear congruential random number generator according to an optional embodiment of the present invention;
FIG. 4 is a structural block diagram of an optional text recognition model establishment apparatus according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of another optional text recognition model establishment apparatus according to an embodiment of the present invention;
FIG. 6 is a structural block diagram of another optional text recognition model establishment apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with each other.
It should be noted that the terms "first", "second" and the like in the description, claims and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
Embodiment 1
This embodiment provides a text recognition model establishment method. FIG. 1 is a flowchart of an optional text recognition model establishment method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
Step S102: obtain a text file collection;
Step S104: select text files that are different from each other from the text file collection as feature text files;
Step S106: establish a text recognition model using the feature text files, where the text recognition model is used to identify the text information in a text file to be recognized.
Optionally, this embodiment can be, but is not limited to being, applied to scenarios in which a text recognition model is established, in particular to establishing a text recognition model for machine learning in an optical character recognition (OCR) scenario, for example, but not limited to, in the process of text localization, text detection or text recognition. The above scenarios are merely examples, and this embodiment imposes no limitation in this respect.
Through the above steps, by automatically selecting mutually different text files from the text file collection as feature text files, a text recognition model for identifying the text information in text files is established, so that the established model can cover different text files, the accuracy of the established model is ensured, and the related-art problem that a text recognition model established from the same, repeatedly obtained text file has low accuracy is overcome. This further ensures that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
In addition, establishing the text recognition model by automatically selecting mutually different text files from the text file collection can also reduce the number of text files used as training samples, that is, reduce the number of repeatedly obtained text files, thereby improving the efficiency of establishing the text recognition model and avoiding the low efficiency caused by obtaining too many text files.
In this embodiment, the text recognition model can be used for training an OCR text recognition model. OCR can be understood as making a computer recognize the text in a picture: a computer cannot automatically recognize the text inside a picture, so OCR technology first recognizes the text in the picture and converts it into a text format, allowing the computer to read its content. To achieve this, an OCR model needs to be established, and this model is obtained through training. Before training, the OCR text files used for training need to be obtained. The related-art approach is to collect a massive number of pictures containing text, annotate the content of each text picture one by one (that is, make the content readable by the computer), let the computer model learn these annotated text files, and use the massive text files to train the OCR model; the trained OCR model can then recognize the text in a new text picture and output a computer-readable text format.
However, in OCR model training, the samples must be extremely massive to guarantee that a usable OCR model can be trained, and this massiveness has two drawbacks. First, to collect and annotate so many samples, a person must look at each picture, understand the text content in it, and then annotate that content in text format (that is, make it computer-readable); every picture must be handled this way, which consumes enormous labor and cannot guarantee freedom from human error. Second, the samples must have very good diversity. For example, the character "好" appears in various colors, fonts and backgrounds, and even with shadows, tilt, different stroke weights, illumination from different angles, and so on. As far as possible, all these forms of "好" should be given to the OCR model as training samples so that the model will correctly recognize a newly encountered "好" in later use; however, collecting samples with such rich forms by manual search and screening is an enormous amount of work.
In this embodiment, first, since the text recognition model is generated from computer-readable text files, the problem of manually annotating text pictures no longer exists. Second, the same text information in the text files used to generate the model appears in a wide variety of forms. In addition, after the text files are generated, a random algorithm is introduced to randomly select text files a second time for use in training. Without the random algorithm, if, for example, 1000 forms of the character "好" and 1000 forms of the character "坏" are generated, the program would output all 1000 forms for each input every time, which would actually reduce the recognition accuracy. In this embodiment, 1000 forms of "好" are generated and 500 are randomly selected, and 1000 forms of "坏" are generated and 500 are randomly selected, which ensures that the samples are rich and random.
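The second-stage random selection described above (generating, say, 1000 variants of a character and keeping a random 500) can be sketched with the standard library; `random.sample` stands in here for the patent's improved linear congruential generator, and the file names are hypothetical:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# 1000 hypothetical variant file names for one character.
variants = [f"hao-variant-{k}.png" for k in range(1000)]

# Randomly keep 500 of them as training samples.
kept = random.sample(variants, 500)

print(len(kept))                      # 500
print(len(set(kept)) == len(kept))    # True: sampling without replacement
```

Repeating this per character keeps the training set both diverse and random rather than exhaustively enumerating every generated form.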
In this embodiment, mutually different text files may be selected from the text file collection as the feature text files according to, but not limited to, the file identifiers of the text files in the collection and/or the storage location identifiers of the text files in the collection.
The process of selecting mutually different text files from the text file collection as the feature text files is explained below through three examples.
Example 1 is a process of selecting mutually different text files from the text file collection as the feature text files according to the file identifiers of the text files in the collection.
In Example 1, since different text files in the collection carry different file identifiers, file identifiers can be batch-selected by a preset algorithm, duplicate file identifiers are then deleted, and the mutually different file identifiers are retained. Then, according to the screened-out mutually different file identifiers, the corresponding text files are extracted from the collection as the feature text files to establish the text recognition model. Through the above steps, the feature text files are obtained based on the fact that different text files carry different identifiers, so that the established model can cover different text files, the accuracy of the established model is ensured, and the related-art problem of low accuracy caused by repeatedly obtained identical text files is overcome; it is further ensured that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
Example 2 is a process of selecting mutually different text files from the text file collection as the feature text files according to the storage location identifiers of the text files in the collection.
In Example 2, since different text files in the collection are stored at different locations and therefore carry different storage location identifiers, storage location identifiers can be batch-selected by a preset algorithm, duplicate storage location identifiers are then deleted, and the mutually different ones are retained. Then, according to the screened-out mutually different storage location identifiers, the corresponding text files are extracted from the collection as the feature text files to establish the text recognition model. Through the above steps, the feature text files are obtained based on the fact that text files stored at different locations carry different storage location identifiers, so that the established model can cover different text files, the accuracy of the established model is ensured, and the related-art problem of low accuracy caused by repeatedly obtained identical text files is overcome; it is further ensured that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
Example 3 is a process of selecting mutually different text files from the text file collection as the feature text files according to both the file identifiers and the storage location identifiers of the text files in the collection.
In Example 3, file identifiers may first be batch-selected from the text file collection; at this time, the batch-selected identifiers may contain duplicates. Different file identifiers are then stored at different storage locations and identical file identifiers at the same storage location, so that different file identifiers carry mutually different storage location identifiers. Next, mutually different storage location identifiers are batch-selected, and the mutually different file identifiers are obtained from them, so that the corresponding mutually different text files are obtained from the collection as the feature text files and the text recognition model is established. Through the above steps, identical identifiers among the possibly duplicated, batch-obtained file identifiers are stored at the same location, which ensures that distinct file identifiers correspond to distinct storage location identifiers; the distinct file identifiers screened out according to the distinct storage location identifiers are used to extract the feature text files from the collection, so that the established model can cover different text files, the accuracy of the established model is ensured, the related-art problem of low accuracy caused by repeatedly obtained identical text files is overcome, and it is further ensured that the model established by the method provided in this embodiment can accurately recognize the text information in a text picture.
It should be noted that this embodiment only uses the file identifier and the storage location identifier as examples to explain how to obtain mutually different text files to establish the text recognition model; other identifiers, parameters or similar information that can distinguish mutually different text files can also be used to obtain mutually different text files, which falls within the protection scope of the present invention and will not be repeated here.
In the above step S102, the text file collection may be obtained by acquiring a related text file collection, or may be generated according to a predetermined rule. The collection may be generated by, but not limited to, generating text files in batches and then selecting from them the text files that make up the collection, or by selecting existing text files to make up the collection.
Before the text file collection is generated, it may also be determined whether to process the text files; the processing methods include, but are not limited to, blur, noise, sharpening, illumination, and the like.
In this embodiment, in order to obtain the text file collection, the obtained text information may be copied in batches to obtain a large amount of that text information, and different text parameters are set for each copy to obtain a large number of mutually different text files that make up the collection. Through the above steps, different text parameters are set for a large amount of identical text information, and the resulting mutually different text files form the text file collection; this ensures that the collection stores text files with the same text information but different text parameters, so that in subsequent recognition the text information can be recognized from text files of various forms.
In addition, in this embodiment, the text information may be obtained by, but not limited to, receiving an input text string, or reading a text string already stored in the system.
If the text information is obtained by reading a text string already stored in the system, the read text string is divided into several different text strings according to a predetermined rule, and one of them is extracted as the text information for generating a text file. The division unit may be, but is not limited to, one line, multiple lines, one character, multiple characters, one word, multiple words, and the like.
Through the above steps, it can be guaranteed that the generated text files carry the same text information while the text parameters of the text information differ from each other, which satisfies the conditions for establishing the text recognition model.
In this embodiment, the text parameters may include, but are not limited to, at least one of the following: font format, font display size, blank character size ratio, text spacing size ratio, text rotation angle, text font color, text transparency parameter, text boldness degree, text tilt degree, text underline drawing, background picture, and the display position of the text information in the background picture. Optionally, in this embodiment, an interface of the Open Source Computer Vision Library (OpenCV) may be called, but is not limited to being called, to set the above text parameters.
The setting process of the text parameters is illustrated below by taking the background picture as an example.
After the text information is obtained, different text parameters are set for the text information in batches, and the text information with mutually different text parameters is added to one or more background pictures obtained from the background picture library; the same text information can be added to different background pictures to generate different text files, and different text information can be added to the same background picture to generate different text files, thereby obtaining a large number of text files.
Optionally, in the above step S104, mutually different text files may be selected from the text file collection as the feature text files according to the file identifiers of the text files in the collection and/or the storage location identifiers of the text files in the collection.
Optionally, when selecting mutually different text files from the collection as the feature text files according to the file identifiers and/or the storage location identifiers, this may be done as follows: obtain, according to a preset algorithm, a first preset number of file identifiers from the text file collection to obtain a file identifier set, where text files corresponding to the same file identifier in the file identifier set have the same storage location identifier; obtain the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set; screen out a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers; and extract the text files corresponding to the mutually different file identifiers from the collection as the feature text files.
The above process is illustrated by the following examples.
Example 1: screening out a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers may be, but is not limited to, the following process. The following steps are repeated until the number of obtained mutually different file identifiers reaches the second preset number: determine whether the number of currently obtained mutually different file identifiers has reached the second preset number; if not, obtain a storage location identifier from the storage location identifier set and generate the current variable n from the obtained storage location identifiers, where the storage location identifier set stores the storage location identifiers not yet used to generate a variable; obtain, from a preset random array, the random number corresponding to the current variable; obtain, from the file identifier set, the file identifier corresponding to that random number as a currently obtained mutually different file identifier; and update the number of currently obtained mutually different file identifiers. The variable is generated as n = (Σ_{l=0}^{L-1} I_{W+l}·2^l) mod N, where L is the number of binary bits after the storage location identifier is converted into binary, W is the index of the binary bit after the conversion (taking values in order from the low bit to the high bit, starting from 0), l denotes the number of the storage location identifier and takes integers from 0 to L-1 in turn, and I_{W+l} is a storage location identifier obtained from the storage location identifier set, which stores the I_{W+l} not yet used to generate n. Then y_i = V[n] is assigned, where V[n] is the random number corresponding to n in the random array V[N], and the file identifier corresponding to y_i is obtained from the file identifier set as a currently obtained mutually different file identifier. In the above process, L may be, but is not limited to being, preset, and the indices W and l increase in turn; I_{W+l} corresponds to a storage location identifier in the storage location identifier set, and since storage location identifiers do not repeat, I_{W+l} itself does not repeat. Multiplying by 2^l shuffles the order of the storage location identifiers, further guaranteeing the randomness of the obtained identifiers: the larger L is, the more random the arrangement of the storage location identifiers, and the larger the random array V[N] obtained after shuffling. To balance the randomness of the storage location identifiers against the storage cost, L can be chosen reasonably during implementation according to the actual situation.
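On the reading above, the variable n mixes the not-yet-used storage location identifiers, each weighted by 2^l, and then indexes the random array V[N]. A sketch of that mixing under this assumption (the identifier values and N below are illustrative, not values from the patent):

```python
def mix_identifiers(ids, N):
    """Combine storage location identifiers I_{W+l}, l = 0..L-1,
    each weighted by 2**l, into an index n into the random array V[N]."""
    n = sum(i * (2 ** l) for l, i in enumerate(ids))
    return n % N

# Illustrative identifiers: I_W = 1, I_{W+1} = 0, I_{W+2} = 1, with N = 8.
print(mix_identifiers([1, 0, 1], 8))  # (1*1 + 0*2 + 1*4) mod 8 = 5
```

Because the identifiers never repeat, the weighted sum scrambles their order before the lookup y_i = V[n], which is what keeps the final selection of file identifiers random.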
Example 2: the process of obtaining a first preset number of file identifiers from the text file collection according to a preset algorithm may be: obtaining the first preset number of file identifiers according to a preset random number generator (for example, a linear congruential random number generator).
The first preset number of file identifiers may be obtained by a linear congruential random number generator according to the following formula: the random numbers x1, x2, …, xi-1, xi generated by xi = (a·xi-1 + c) mod (M) constitute the first preset number of file identifiers, where a, c, M and x0 are preset parameters, M > 0, 0 < a < M, and 0 ≤ c < M.
Optionally, the process of obtaining the text file collection may be: obtaining text information; copying the text information in batches to obtain multiple copies of the text information; and setting text parameters for the multiple copies respectively to obtain mutually different text files, where the text file collection includes the mutually different text files.
Optionally, the text information may be obtained by receiving an input first text string as the text information, or by reading a second text string stored in the system, dividing the second text string according to a preset policy to obtain a text string set, and extracting a third text string from the text string set as the text information.
Optionally, the text parameters may include, but are not limited to, at least one of the following: a font format parameter of the text in the text information, a font display size parameter of the text, a blank character size ratio parameter, a spacing size ratio parameter of the text, a rotation angle parameter of the text, a font color parameter of the text, a transparency parameter of the text, a boldness degree parameter of the text, a tilt degree parameter of the text, an underline drawing parameter of the text, a background picture, and a display position parameter of the text information in the background picture.
In the following examples and optional embodiments, a text file is exemplified by a sample, a text file collection by a batch sample collection, and a feature text file by a feature sample.
To make the description of the embodiments of the present invention clearer, the following description is given with reference to optional embodiments.
This optional embodiment proposes a batch sample generation method for text localization, detection and recognition.
This optional embodiment solves the related-art problem that, when machine-learning-based OCR is performed on text pictures with complex backgrounds, the same text file may be obtained repeatedly, resulting in low accuracy of the established text recognition model.
A text recognition model generation method for text localization, detection and recognition in this optional embodiment includes the following steps:
Step 1: load the text information; two loading modes can be provided: input a text string, and in this mode perform step 3; or read a related text string, and in this mode perform step 2;
Step 2: split the read text string into several objects according to a selected predetermined rule, and save the resulting text strings to a specified path;
Step 3: select the background picture to be loaded from the background picture library;
Step 4: read the segmented text strings or the input string and perform batch text parameter setting on them; the text parameters include at least one of the following: font format, font display size, blank character size ratio, spacing size ratio, rotation angle, display position, font color, transparency setting, boldness degree, tilt degree, underline drawing, and so on;
Step 5: add the various different pieces of text information, after batch parameter setting, to the picture background to generate text files;
Step 6: determine, as required, whether to perform image processing on the text files: if image processing is required, perform step 7; if not, perform step 8;
Step 7: perform image processing on the text files, where the image processing includes blur, noise, sharpening, illumination, and the like;
Step 8: a novel improved linear congruential random number generator is provided to guarantee arbitrary randomness in obtaining the feature text files:

Step 8-1: set a random rule for the generated text files:

x_i = (a·x_{i-1} + c) mod M

where x_0 is the initial text file, M is the modulus, M > 0, a is the multiplier, 0 < a < M, c is the increment, 0 ≤ c < M; x_0, M, a and c are preset values.

Step 8-2: generate x_i and a·x_{i-1} by step 8-1, where x_i and a·x_{i-1} are obtained from the text file set.

Step 8-3: generate the variable n = Σ_{l=0..L-1} I_{W+l}·2^l, where L is the number of binary bits after a storage location identifier is converted to binary, W is the index of a binary bit after the conversion (taking values in order from 0, from the low bit to the high bit), l denotes the index of a storage location identifier and takes integer values from 0 to L-1 in turn, and I_{W+l} is the storage location identifier indicated by the storage location of the integer a·x_{i-1} or x_i in the computer.

Step 8-4: make the assignment y_i = V[n], where V[n] is a random number in the auxiliary random array V[N].

Step 8-5: extract the x_i corresponding to the preset number of random numbers y_i obtained, and obtain the corresponding text files as the feature text files.

Step 9: re-save the selected feature text files and rename them (for example, with sequential numbers) to generate the text recognition model.
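Steps 8-1 through 8-5 can be sketched end to end as follows. Note the simplification: the storage-location shuffle through the auxiliary array V[n] is replaced here by a direct uniqueness check, so this illustrates the selection goal (a preset number of mutually different sample indices) rather than the exact generator of the embodiment.

```python
def select_feature_indices(a, c, M, x0, count):
    """Drive the random rule x_i = (a*x_{i-1} + c) mod M (step 8-1)
    and keep the first `count` distinct values as indices of
    feature text files (steps 8-2 .. 8-5, simplified)."""
    chosen, x = [], x0
    while len(chosen) < count:
        x = (a * x + c) % M
        if x not in chosen:  # keep the selected identifiers mutually different
            chosen.append(x)
    return chosen
```

With a = 5, c = 3, M = 16 and x_0 = 7, the first five distinct indices are 6, 1, 8, 11, 10; the selected files would then be re-saved and renamed as in step 9.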
The following example gives a concrete description with reference to Fig. 2, a flowchart of a text recognition model establishment method according to an optional embodiment of the present invention, in which the text string is exemplified by a text document in *.txt format. The flow includes the following steps:

Step S202: load text information and determine whether a text string is to be read. Loading text information includes two modes: inputting a text string, or obtaining one from the pre-stored text strings. When it is determined that a text string is to be read (i.e., it needs to be obtained from the pre-stored text strings), step S204-2 is performed; when it is determined that a text string is not to be read (i.e., a text string needs to be input), step S204-1 is performed.

Step S204-1: input a text string.

Step S204-2: select a predetermined rule to divide the read text string into several objects, choosing "line division" or "word division" as required; save the resulting text strings (in *.txt format) to a specified path, named Path_A; under the file path Path_A, find the divided text file to be processed, named source-text.txt.

Step S206: load a background picture.

The background picture to be loaded (named background) is selected from the related background picture library. The library is open, and new picture files can be added to it as needed. The supported picture formats are: Windows bitmap files BMP, DIB; JPEG files JPEG, JPG, JPE; portable network graphics PNG; portable images PBM, PGM, PPM; Sun rasters images SR, RAS; TIFF images TIFF, TIF; OpenEXR HDR images EXR; JPEG 2000 pictures jp2.
Step S208: batch operation, where step S208 includes:

Step S208-1: batch-set text parameters for the text string source-text.txt or the input text string:

Batch font format setting: the selectable formats include, but are not limited to, the fonts of the following font libraries: TrueType fonts (and collections), Type 1 fonts, CID-keyed Type 1 fonts, CFF fonts, OpenType fonts (both TrueType and CFF variants), SFNT-based bitmap fonts, X11 PCF fonts, Windows FNT fonts, BDF fonts (including anti-aliased ones);

Batch font size setting: by adjusting the font size parameters, size parameters such as font display size, blank-character size ratio, spacing ratio and rotation angle can be set in batch;

Batch font position setting: set where the text is displayed in the picture; batch text position setting can be, but is not limited to being, performed by setting in batch the horizontal and vertical coordinates of the first pixel at the top-left corner of the text;

Batch font color setting: using the RGB format, batches of fonts in different colors are generated from pre-defined arrays of different combinations of R, G and B values;

Batch font transparency setting: the setting range may be 0–100%;

Batch font rendering effect setting: bold (the degree of boldness, vertical bold or horizontal bold can be set separately), slant (different slant angles can be set), outline drawing, shadow drawing, underline drawing, and so on.
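The batch parameter setting of step S208-1 amounts to expanding one string over the Cartesian product of the chosen parameter values, one combination per prospective sample. A minimal sketch, in which the function and field names are illustrative assumptions:

```python
from itertools import product

def batch_text_parameters(text, fonts, sizes, colors, positions):
    """Expand one piece of text information into the Cartesian product
    of text-parameter settings, one dict per prospective text file."""
    return [
        {"text": text, "font": f, "size": s, "color": c, "position": p}
        for f, s, c, p in product(fonts, sizes, colors, positions)
    ]
```

Two sizes combined with one font, one color and one position already yield two mutually different parameter sets for the same text information.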
Step S208-2: write the various different text files obtained after the batch parameter adjustment into the background picture (background) respectively.

Step S208-3: determine, according to requirements, whether to perform image processing: if image processing is needed, perform step S208-4; if not, perform step S208-5.

Step S208-4: perform image processing, selected in combination, on the series of text files obtained in step S208-2; the image processing may include blurring, noise, sharpening, illumination, and so on; after the image processing, continue with step S208-5.
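One of the processing operations of step S208-4 can be illustrated with a minimal box blur over a grayscale image stored as a list of lists; the 3x3 kernel size and the edge handling (averaging only over neighbors that exist) are illustrative choices, not requirements of the embodiment.

```python
def box_blur(img):
    """3x3 box blur on a grayscale image (list of rows of ints);
    edge pixels are averaged over the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) // len(vals)
    return out
```

Noise injection, sharpening and illumination changes would be analogous per-pixel transforms; in practice an image library would be used instead of raw lists.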
Step S208-5: rename the text files generated in batch (for example, with sequential numbers), store the text files in a new format, and select their save path, Path_B.

Step S210: generate the feature text files.
In step S210, a novel improved linear congruential random number generator is provided to guarantee arbitrary randomness of the generated feature samples. The generation process of this generator is shown in Fig. 3; the improved generator guarantees arbitrary randomness of the generated feature text files. The flow of the generation process, as shown in Fig. 3, includes the following steps:

Step S302: load the batch text files x_i and set a random rule for the generated batch text files:

x_i = (a·x_{i-1} + c) mod M

where x_0 is the initial text file, M is the modulus, M > 0, a is the multiplier, 0 < a < M, c is the increment, 0 ≤ c < M; x_0, M, a and c are preset values.

Step S304: generate x_i and a·x_{i-1} by step S302, where x_i and a·x_{i-1} are obtained from the text file set.

Step S306: generate the variable n = Σ_{l=0..L-1} I_{W+l}·2^l, where L is the number of binary bits after a storage location identifier is converted to binary, W is the index of a binary bit after the conversion (taking values in order from 0, from the low bit to the high bit), l denotes the index of a storage location identifier and takes integer values from 0 to L-1 in turn, and I_{W+l} is the storage location identifier indicated by the storage location of the integer a·x_{i-1} or x_i in the computer.

Step S308: assign y_i = V[n], where V[n] is a random number in the auxiliary random array V[N].

Step S310: extract the x_i corresponding to the preset number of random numbers y_i obtained, and find the corresponding text files under the save path Path_B.

Step S312: re-save the selected text files, rename them (with sequential numbers), save them to the target path (Path_target), and generate the batch feature text files.
It should be noted that the present invention is not limited to using the above method to obtain a number of samples sufficient for machine learning training; other random methods may also be used to generate the feature text files.

Step S212: select the save format and path of the feature text files.

Step S214: save the feature text files.

In summary, the embodiments and optional embodiments of the present invention can generate the various required text files in large batches on demand, with the following advantages. First, the input text can either be input in a "personalized" manner through editing commands, or a related text string can be read directly and divided into the required text passages. Second, a large number of methods are added so that different text formats, including font format, font display size, blank-character size ratio, spacing ratio, rotation angle, display position, font color, transparency, boldness, slant and underline drawing, can be generated in batch at once, and a series of image processing operations such as blurring, noise, sharpening and illumination further extend the diversity of the samples. In addition, a novel improved linear congruential random number generator method is provided, guaranteeing the "randomness" of the generated samples and providing more complete and reasonable samples for subsequent machine-learning-based model training, ensuring that the trained model has higher accuracy. At the same time, the text recognition model establishment method significantly saves labor cost and greatly improves the training efficiency of machine learning.
From the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the related art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
Embodiment 2

This embodiment further provides a text recognition model establishment apparatus, which is used to implement the above embodiments and optional implementations; what has already been described will not be repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and conceived.

Fig. 4 is a structural block diagram of an optional text recognition model establishment apparatus according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes:

1) an obtaining module 42, configured to obtain a text file set;

2) a selection module 44, coupled to the obtaining module 42 and configured to select mutually different text files from the text file set as feature text files;

3) an establishment module 46, coupled to the selection module 44 and configured to establish a text recognition model using the feature text files, wherein the text recognition model is used to recognize the text information in a text file to be recognized.
Optionally, this embodiment may be applied, but is not limited, to scenarios in which a text recognition model is established, in particular establishing a text recognition model for machine learning in an optical character recognition scenario.

With the above apparatus, the obtaining module 42 first obtains a large number of text files to form a text file set; the selection module 44 then automatically selects mutually different text files from the text file set, from which the establishment module 46 establishes a text recognition model for recognizing the text information in text files. The established text recognition model can thus cover different text files, guaranteeing its accuracy and overcoming the problem in the related art that a text recognition model established from the same, repeatedly obtained text files has lower accuracy. This in turn ensures that a text recognition model established using the text recognition model establishment method provided in this embodiment can accurately recognize the text information in text pictures.

In addition, by automatically selecting mutually different text files from the text file set to establish the text recognition model, the number of text files used as training samples for establishing the model can also be reduced, that is, the number of repeatedly obtained text files is reduced, thereby improving the efficiency of establishing the text recognition model and avoiding the lower efficiency caused by obtaining too many text files.
In this embodiment, the selection module 44 may be, but is not limited to being, configured to select the mutually different text files from the text file set as the feature text files according to the file identifiers of the text files in the text file set and/or the storage location identifiers of the text files in the text file set.

Three examples below illustrate the process by which the selection module 44 selects mutually different text files from the text file set as the feature text files.

Example 1 is the process in which the selection module 44 selects mutually different text files from the text file set as the feature text files according to the file identifiers of the text files in the set.

In Example 1, since different text files in the text file set carry different file identifiers, the selection module 44 can select file identifiers in batch through a preset algorithm, delete the identical file identifiers among them, and retain the mutually different ones. Then, according to the mutually different file identifiers screened out, the corresponding text files are extracted from the text file set as the feature text files to establish the text recognition model. With the above apparatus, the feature text files are obtained based on the characteristic that different text files carry different text identifiers, so that the established text recognition model can cover different text files, guaranteeing its accuracy and overcoming the problem in the related art that a text recognition model established from the same, repeatedly obtained text files has lower accuracy. This in turn ensures that a text recognition model established using the method provided in this embodiment can accurately recognize the text information in text pictures.

Example 2 is the process in which the selection module 44 selects mutually different text files from the text file set as the feature text files according to the storage location identifiers of the text files in the set.

In Example 2, since different text files in the text file set are stored in different locations and therefore carry different storage location identifiers, the selection module 44 can select storage location identifiers in batch through a preset algorithm, delete the identical storage location identifiers among them, and retain the mutually different ones. Then, according to the mutually different storage location identifiers screened out, the corresponding text files are extracted from the text file set as the feature text files to establish the text recognition model. With the above apparatus, the feature text files are obtained based on the characteristic that different text files, being stored in different locations, carry different storage location identifiers, so that the established text recognition model can cover different text files, guaranteeing its accuracy and overcoming the problem in the related art that a text recognition model established from the same, repeatedly obtained text files has lower accuracy. This in turn ensures that a text recognition model established using the method provided in this embodiment can accurately recognize the text information in text pictures.

Example 3 is the process in which the selection module 44 selects mutually different text files from the text file set as the feature text files according to both the file identifiers and the storage location identifiers of the text files in the set.

In Example 3, the selection module 44 may first select text identifiers in batch from the text file set; at this point, the batch-selected text identifiers may be identical. Different text identifiers are then stored in different storage locations and identical text identifiers in the same storage location, so that different text identifiers carry mutually different storage location identifiers. Mutually different storage location identifiers are then selected in batch, mutually different file identifiers are obtained from them, and the corresponding mutually different text files are obtained from the text file set as the feature text files to establish the text recognition model. With the above apparatus, identical file identifiers among the batch-obtained, possibly repeated file identifiers are stored in the same location, ensuring that mutually different file identifiers correspond to mutually different storage location identifiers; the different file identifiers screened out from the different storage location identifiers are used to extract the feature text files from the text file set, so that the established text recognition model can cover different text files, guaranteeing its accuracy and overcoming the problem in the related art that a text recognition model established from the same, repeatedly obtained text files has lower accuracy. This in turn ensures that a text recognition model established using the method provided in this embodiment can accurately recognize the text information in text pictures.

It should be noted that this embodiment uses only the text identifiers and storage location identifiers as examples of how to obtain mutually different text files for establishing the text recognition model; other identifiers, parameters or similar information that can distinguish mutually different text files can also be used to obtain mutually different text files, fall within the protection scope of the present invention, and are not described again here.
The obtaining module 42 may obtain the text file set either by obtaining a related text file set or by generating one according to a predetermined rule. A text file set may be generated by, but not limited to, generating text files in batch and then selecting from them the text files that make up the set, or by selecting existing text files to make up the set.

Before generating the text file set, the obtaining module 42 may also determine whether to process the text files, the processing including, but not limited to, blurring, noise, sharpening, illumination, and so on.

In this embodiment, to obtain the text file set, the obtaining module 42 may copy the obtained text information in batch to obtain a large number of copies of that text information, and set different text parameters for each copy to obtain a large number of mutually different text files making up the set. With the above apparatus, different text parameters are set for a large number of copies of the same text information to obtain mutually different text files forming the text file set; this guarantees that the set stores text files whose text information is the same but whose text parameters differ from one another, ensuring that in the subsequent recognition process the text information can be recognized from text files of all kinds of forms.

Moreover, in this embodiment, the obtaining module 42 may obtain the text information by, but not limited to, receiving an input text string or reading a text string already stored in the system.

If the text information is obtained by reading a text string already stored in the system, the obtaining module 42 divides the read text string into several different text strings according to a predetermined rule, and then extracts one of them as the text information for generating the text files. The division unit may be, but is not limited to, one line, multiple lines, one character, multiple characters, one word, multiple words, and so on.
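The division by a predetermined rule can be sketched as follows, with "line" and "word" as the two policies named in step S204-2; the function name, argument names and error handling are illustrative assumptions.

```python
def split_text(text, unit="line"):
    """Divide a stored text string into a set of text strings by a
    preset policy: line division or word division."""
    if unit == "line":
        return [ln for ln in text.splitlines() if ln]
    if unit == "word":
        return text.split()
    raise ValueError(f"unsupported division unit: {unit}")
```

Any one of the returned strings can then be taken as the text information from which the text files are generated.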
With the above apparatus, the generated text files are guaranteed to carry the same text information while the text parameters of that text information differ from one another, satisfying the conditions for establishing the text recognition model.

In this embodiment, the text parameters may include, but are not limited to, at least one of: font format, font display size, blank-character size ratio, character spacing ratio, character rotation angle, font color, character transparency, character boldness, character slant, character underline drawing, a background picture, and the display position of the text information in the background picture. Optionally, in this embodiment, the interface of OpenCV may be, but is not limited to being, called to set the above text parameters of the text information.

The following takes the background picture as an example to describe the process of setting a text parameter.

After obtaining the text information, the obtaining module 42 sets different text parameters for the text information in batch and adds the copies of the text information with mutually different text parameters to one or more background pictures obtained from the background picture library. The same text information can be added to different background pictures to generate different text files, and different pieces of text information can be added to the same background picture to generate different text files, thereby obtaining a large number of text files.
Optionally, the selection module 44 is configured to select the mutually different text files from the text file set as the feature text files according to the file identifiers of the text files in the text file set and/or the storage location identifiers of the text files in the text file set.

Fig. 5 is a structural block diagram of another optional text recognition model establishment apparatus according to an embodiment of the present invention. As shown in Fig. 5, the selection module 44 optionally includes:

1) a first obtaining unit 52, configured to obtain a first preset number of file identifiers in the text file set according to a preset algorithm to obtain a file identifier set, wherein in the file identifier set the text files corresponding to identical text file identifiers have the same storage location identifier;

2) a second obtaining unit 54, coupled to the first obtaining unit 52 and configured to obtain the mutually different storage location identifiers corresponding to the file identifiers in the file identifier set;

3) a selection unit 56, coupled to the second obtaining unit 54 and configured to select a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers;

4) an extraction unit 58, coupled to the selection unit 56 and configured to extract the text files corresponding to the mutually different file identifiers from the text file set as the feature text files.

Fig. 6 is a structural block diagram of another optional text recognition model establishment apparatus according to an embodiment of the present invention. As shown in Fig. 6, the obtaining module 42 optionally includes:

1) a third obtaining unit 62, configured to obtain text information;

2) a copying unit 64, coupled to the third obtaining unit 62 and configured to copy the text information in batch to obtain multiple copies of the text information;

3) a setting unit 66, coupled to the copying unit 64 and configured to set text parameters for the multiple copies of the text information respectively to obtain mutually different text files, wherein the text file set includes the mutually different text files.

Optionally, the third obtaining unit 62 is configured to receive an input first text string as the text information; or to read a second text string stored in the system, divide the second text string according to a preset policy to obtain a text string set, and extract one third text string from the text string set as the text information.

Optionally, the text parameters include at least one of: a font format parameter of the characters in the text information, a font display size parameter of the characters, a blank-character size ratio parameter, a character spacing ratio parameter, a character rotation angle parameter, a font color parameter, a character transparency parameter, a character boldness parameter, a character slant parameter, a character underline drawing parameter, a background picture, and a display position parameter of the text information in the background picture.

It should be noted that each of the above modules can be implemented by software or hardware; in the latter case, this can be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor, or the above modules are located in multiple processors respectively.
Embodiment 3

An embodiment of the present invention further provides a storage medium. In this embodiment, the storage medium may be configured to store program code for executing the following steps:

Step S1: obtain a text file set;

Step S2: select mutually different text files from the text file set as feature text files;

Step S3: establish a text recognition model using the feature text files, wherein the text recognition model is used to recognize the text information in a text file to be recognized.

Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.

Obviously, those skilled in the art should understand that each of the above modules or steps of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be executed in an order different from that here; alternatively, they can be made into individual integrated circuit modules, or multiple modules or steps among them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above are only optional embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Industrial Applicability

In the embodiments of the present invention, after a text file set is obtained, mutually different text files are selected from the text file set as feature text files, so that a text recognition model is established using the feature text files, wherein the text recognition model is used to recognize the text information in a text file to be recognized. That is, by automatically selecting mutually different text files from the text file set as feature text files to establish a text recognition model for recognizing the text information in text files, the established text recognition model can cover different text files, guaranteeing its accuracy and overcoming the problem in the related art that a text recognition model established from the same, repeatedly obtained text files has lower accuracy. This in turn ensures that a text recognition model established using the text recognition model establishment method provided in the embodiments can accurately recognize the text information in text pictures. In addition, automatically selecting mutually different text files from the text file set to establish the text recognition model also reduces the number of text files used as training samples for establishing the model, that is, the number of repeatedly obtained text files, thereby improving the efficiency of establishing the text recognition model and avoiding the lower efficiency caused by obtaining too many text files.

Claims (11)

  1. A text recognition model establishment method, comprising:
    obtaining a text file set;
    selecting mutually different text files from the text file set as feature text files;
    establishing a text recognition model using the feature text files, wherein the text recognition model is used to recognize text information in a text file to be recognized.
  2. The method according to claim 1, wherein selecting the mutually different text files from the text file set as the feature text files comprises:
    selecting the mutually different text files from the text file set as the feature text files according to file identifiers of the text files in the text file set and/or storage location identifiers of the text files in the text file set.
  3. The method according to claim 2, wherein selecting the mutually different text files from the text file set as the feature text files according to the file identifiers of the text files in the text file set and/or the storage location identifiers of the text files in the text file set comprises:
    obtaining a first preset number of the file identifiers in the text file set according to a preset algorithm to obtain a file identifier set, wherein in the file identifier set the text files corresponding to identical text file identifiers have the same storage location identifier;
    obtaining mutually different storage location identifiers corresponding to the file identifiers in the file identifier set;
    screening out a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers;
    extracting the text files corresponding to the mutually different file identifiers from the text file set as the feature text files.
  4. The method according to claim 1, wherein obtaining the text file set comprises:
    obtaining text information;
    copying the text information in batch to obtain multiple copies of the text information;
    setting text parameters for the multiple copies of the text information respectively to obtain mutually different text files, wherein the text file set comprises the mutually different text files.
  5. The method according to claim 4, wherein obtaining the text information comprises:
    receiving an input first text string as the text information; or
    reading a second text string stored in a system; dividing the second text string according to a preset policy to obtain a text string set; extracting one third text string from the text string set as the text information.
  6. The method according to claim 4 or 5, wherein the text parameters comprise at least one of: a font format parameter of characters in the text information, a font display size parameter of the characters in the text information, a blank-character size ratio parameter in the text information, a character spacing ratio parameter in the text information, a character rotation angle parameter in the text information, a font color parameter of the characters in the text information, a transparency parameter of the characters in the text information, a boldness parameter of the characters in the text information, a slant parameter of the characters in the text information, an underline drawing parameter of the characters in the text information, a background picture, and a display position parameter of the text information in the background picture.
  7. A text recognition model establishment apparatus, comprising:
    an obtaining module, configured to obtain a text file set;
    a selection module, configured to select mutually different text files from the text file set as feature text files;
    an establishment module, configured to establish a text recognition model using the feature text files, wherein the text recognition model is used to recognize text information in a text file to be recognized.
  8. The apparatus according to claim 7, wherein the selection module is configured to:
    select the mutually different text files from the text file set as the feature text files according to file identifiers of the text files in the text file set and/or storage location identifiers of the text files in the text file set.
  9. The apparatus according to claim 8, wherein the selection module comprises:
    a first obtaining unit, configured to obtain a first preset number of the file identifiers in the text file set according to a preset algorithm to obtain a file identifier set, wherein in the file identifier set the text files corresponding to identical text file identifiers have the same storage location identifier;
    a second obtaining unit, configured to obtain mutually different storage location identifiers corresponding to the file identifiers in the file identifier set;
    a selection unit, configured to select a second preset number of mutually different file identifiers from the file identifier set according to the mutually different storage location identifiers;
    an extraction unit, configured to extract the text files corresponding to the mutually different file identifiers from the text file set as the feature text files.
  10. The apparatus according to claim 7, wherein the obtaining module comprises:
    a third obtaining unit, configured to obtain text information;
    a copying unit, configured to copy the text information in batch to obtain multiple copies of the text information;
    a setting unit, configured to set text parameters for the multiple copies of the text information respectively to obtain mutually different text files, wherein the text file set comprises the mutually different text files.
  11. The apparatus according to claim 10, wherein the third obtaining unit is configured to:
    receive an input first text string as the text information; or
    read a second text string stored in a system; divide the second text string according to a preset policy to obtain a text string set; extract one third text string from the text string set as the text information.
PCT/CN2017/074291 2016-02-25 2017-02-21 Text recognition model establishment method and apparatus WO2017143973A1 (zh)

Applications Claiming Priority (2)

CN201610105478.XA (CN107122785B), priority date 2016-02-25, filing date 2016-02-25: Text recognition model establishment method and apparatus
CN201610105478.X, priority date 2016-02-25

Publications (1)

WO2017143973A1, publication date 2017-08-31



Also Published As

CN107122785A (zh), published 2017-09-01
CN107122785B (zh), granted 2022-09-27

