
CN111079708B - Information identification method and device, computer equipment and storage medium

Info

Publication number: CN111079708B
Application number: CN201911413133.0A
Authority: CN
Inventors: 高宇明; 田兴林; 郭健; 甄智; 李科勇; 郑捷
Original assignee: Guangzhou Hoolinks Technologies Corp ltd
Current assignee: Guangzhou Hoolinks Technologies Corp ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-12-29
Anticipated expiration: 2039-12-31
Also published as: CN111079708A

Abstract

The embodiment of the invention discloses an information identification method, an information identification device, computer equipment and a storage medium. The method comprises the following steps: receiving an original file; performing optical character recognition on the original file to obtain a target file, wherein the target file has text information; performing binarization processing on the target file according to the text information to obtain a dot matrix file; searching for a lattice model matched with the original file; identifying, from the lattice model, a target model similar to the dot matrix file; and determining target information belonging to a specified category from the target file using the target model. The method and the device automatically identify the relationship between the category and the target information, and greatly reduce the operations of a user manually browsing the text, screening the required information and copying it into an editable document, thereby simplifying information entry and reducing the time it takes.

Description

Information identification method and device, computer equipment and storage medium

Technical Field

The embodiment of the invention relates to a natural language processing technology, in particular to an information identification method, an information identification device, computer equipment and a storage medium.

Background

In settings such as customs clearance, trade exhibitions, technical exchange conferences and shopping, many manufacturers print new information, such as customs declarations and invoices, onto existing paper documents.

At present, to meet the requirements of paperless office work, data archiving, data analysis and the like, OCR (Optical Character Recognition) is performed on a paper file to recognize its text, and the necessary information is then entered from the recognized text.

However, these files come in various formats, and the information may be shifted when printed. Information is usually entered by a user manually browsing the text, screening the required information and copying it into an editable document, which is cumbersome and time-consuming.

Disclosure of Invention

The embodiment of the invention provides an information identification method, an information identification device, computer equipment and a storage medium, and aims to solve the problem that entering information from a paper file that has been printed on multiple times is cumbersome and time-consuming.

In a first aspect, an embodiment of the present invention provides an information identification method, including:

receiving an original file;

carrying out optical character recognition on the original file to obtain a target file, wherein the target file has text information;

carrying out binarization processing on the target file according to the text information to obtain a dot matrix file;

searching a lattice model matched with the original file;

identifying, from the lattice model, a target model similar to the dot matrix file;

determining, from the target file, target information belonging to a specified category using the target model.

Optionally, the binarizing the target file according to the text information to obtain a dot matrix file includes:

determining pixel points in the target file;

setting a first element to a first value, the first element being a pixel point representing a single text message;

setting a second element as a second value, wherein the second element is other pixel points except the first element;

merging grouped first elements into a target area.

Optionally, the merging grouped first elements into a target area includes:

counting the number of the second elements spaced between every two adjacent first elements as a single distance;

calculating the average value of all the single distances as a distance threshold value;

if the single distance between two adjacent first elements is smaller than the distance threshold, combining the two adjacent first elements in the same group;

setting the minimum bounding rectangle of all the first elements in the group as a target area.

Optionally, the searching for the lattice model matching with the original file includes:

determining the dimension of the original file, wherein the dimension comprises the type of the original file and the enterprise to which the original file belongs;

and searching a lattice model set for the dimension.

Optionally, the lattice model has a first element and a second element, the first element constitutes a reference region associated with a category, the dot matrix file has a first element and a second element, and the first element in the dot matrix file constitutes a target region representing text information;

the identifying, from the lattice model, a target model similar to the dot matrix file includes:

determining the number of first elements and/or second elements contained in a non-overlapping region as a single area for each lattice model, wherein the non-overlapping region is a region which is not overlapped with the target region in the lattice model and the reference region;

calculating the sum of all the single areas as a total area;

counting the number of first elements contained in all the reference regions as an original area;

calculating a difference value of a non-overlapping occupation ratio as the similarity of the lattice model and the lattice file, wherein the non-overlapping occupation ratio is the ratio of the total area to the original area;

and setting the lattice model with the highest similarity as a target model similar to the lattice file.

Optionally, the determining, for each of the lattice models, the number of first elements and/or second elements contained in the non-overlapping region as a single area includes:

for each dot matrix model, searching a target area at least partially overlapped with the reference area;

if the reference area and the target area are found, generating a minimum circumscribed rectangle containing the reference area and the target area;

removing the overlapped area of the reference area and the target area in the minimum bounding rectangle to obtain a non-overlapped area;

counting the number of first elements and/or second elements contained in the non-overlapping area as a single area;

and if no such target area is found, counting the number of the first elements contained in the reference region as a single area.
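As a non-limiting illustration, the single-area and similarity calculation described above could be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: regions are axis-aligned rectangles given as inclusive pixel coordinates, the number of elements in a region is taken to be its pixel area, when several target regions overlap one reference region the one with the largest overlap is used, and the "difference value" of the non-overlapping ratio is read as one minus that ratio. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (left, top, right, bottom), inclusive pixel coordinates (assumed layout)
Box = Tuple[int, int, int, int]


def area(box: Box) -> int:
    """Number of elements (pixels) covered by an inclusive rectangle."""
    left, top, right, bottom = box
    return (right - left + 1) * (bottom - top + 1)


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Overlapping rectangle of a and b, or None if they do not overlap."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if left > right or top > bottom:
        return None
    return (left, top, right, bottom)


def single_area(reference: Box, targets: List[Box]) -> int:
    """Single area contributed by one reference region."""
    best_overlap, best_target = 0, None
    for target in targets:
        common = intersection(reference, target)
        # Assumption: keep the target region with the largest overlap.
        if common is not None and area(common) > best_overlap:
            best_overlap, best_target = area(common), target
    if best_target is None:
        # No overlapping target region: the whole reference region counts.
        return area(reference)
    # Minimum bounding rectangle of both regions, minus the overlapped part.
    mbr = (
        min(reference[0], best_target[0]),
        min(reference[1], best_target[1]),
        max(reference[2], best_target[2]),
        max(reference[3], best_target[3]),
    )
    return area(mbr) - best_overlap


def similarity(model: Dict[str, Box], targets: List[Box]) -> float:
    """Assumed reading: similarity = 1 - total area / original area."""
    original = sum(area(ref) for ref in model.values())
    total = sum(single_area(ref, targets) for ref in model.values())
    return 1 - total / original


def most_similar(models: List[Dict[str, Box]], targets: List[Box]) -> Dict[str, Box]:
    """Pick the lattice model with the highest similarity as the target model."""
    return max(models, key=lambda m: similarity(m, targets))
```

Under these assumptions the similarity is 1 when every reference region coincides exactly with a target region and 0 when no reference region is overlapped at all; it can become negative when the bounding rectangle is much larger than the reference region.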

Optionally, the determining, from the target file, target information belonging to a specified category using the target model includes:

determining, in the target model, coordinates indicated by the reference region;

and extracting text information positioned in the coordinates in the target file to serve as target information belonging to the reference area association category.
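By way of a hedged example only, the extraction step above could look like the following Python sketch. It assumes that the target file is available as a list of recognized text items with bounding boxes, that a text item is "positioned in the coordinates" when the center of its box lies inside the reference region, and that reading order is top to bottom, then left to right. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom), inclusive pixel coordinates (assumed layout)
Box = Tuple[int, int, int, int]


def extract_target_information(
    target_model: Dict[str, Box],
    text_items: List[Tuple[str, Box]],
) -> Dict[str, str]:
    """Map each category to the text found inside its reference region."""
    result: Dict[str, str] = {}
    for category, (left, top, right, bottom) in target_model.items():
        inside = []
        for text, box in text_items:
            # Assumption: an item belongs to the region if its center does.
            center_x = (box[0] + box[2]) / 2
            center_y = (box[1] + box[3]) / 2
            if left <= center_x <= right and top <= center_y <= bottom:
                inside.append((box[1], box[0], text))
        # Assumed reading order: by top edge, then by left edge.
        inside.sort()
        result[category] = "".join(text for _, _, text in inside)
    return result
```

For instance, with a model whose "sender" region spans (0, 0, 20, 5), the items "A" at (0, 0, 4, 4) and "B" at (6, 0, 10, 4) would be returned together as the sender information.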

Optionally, the method further comprises:

receiving a correction operation;

correcting, according to the correction operation, target information belonging to a certain category;

and updating the lattice model according to the correction operation.

Optionally, the correcting, according to the correction operation, target information belonging to a certain category includes:

determining a category indicated by the correction operation and a correction area indicated in the target file;

extracting text information positioned in the correction area from the target file;

and setting the text information as the target information of the category.

Optionally, the updating the lattice model according to the correcting operation includes:

determining the similarity between the dot matrix model and the dot matrix file;

if the similarity is smaller than or equal to a preset threshold value, setting the dot matrix file as a new dot matrix model, wherein a target area where the target information is located in the dot matrix file is a reference area in the new dot matrix model;

if the similarity is larger than the preset threshold value, determining the category indicated by the correction operation and the correction area indicated in the target file;

updating a reference area representing the category based on the correction area.
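The branching described above could be sketched as follows. This is a simplified illustration under stated assumptions: a lattice model is represented as a mapping from category to reference area, the similarity has already been computed elsewhere, and in the high-similarity branch the reference area is simply replaced by the correction area rather than merged or trimmed. The function and parameter names are hypothetical.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom), inclusive pixel coordinates (assumed layout)
Box = Tuple[int, int, int, int]
LatticeModel = Dict[str, Box]  # category -> reference area (assumed layout)


def apply_correction(
    models: List[LatticeModel],
    matched_model: LatticeModel,
    target_areas: Dict[str, Box],
    similarity: float,
    threshold: float,
    category: str,
    correction_area: Box,
) -> LatticeModel:
    """Update the model set after a user correction and return the affected model."""
    if similarity <= threshold:
        # Low similarity: the dot matrix file becomes a new lattice model whose
        # reference areas are the target areas holding the target information.
        new_model = dict(target_areas)
        new_model[category] = correction_area
        models.append(new_model)
        return new_model
    # High similarity: update the reference area of the corrected category.
    # Simplification: the reference area is replaced outright.
    matched_model[category] = correction_area
    return matched_model
```

A caller would pass the threshold as a configuration value; no particular value is prescribed here.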

Optionally, the updating the reference area represented by the category based on the correction area includes:

if the text information in the correction area contains the text information in the reference area, combining the correction area and the reference area to serve as the reference area represented by the node;

or,

if the text information in the reference area contains the text information in the correction area, subtracting the text information in the correction area from the text information in the reference area to obtain difference information;

removing, from the reference area, the area where the difference information is located, the remaining area serving as the reference area represented by the node;

or,

if the text information in the reference area is partially the same as the text information in the correction area, subtracting the text information in the correction area from the text information in the reference area to obtain difference information;

removing, from the reference area, the area where the difference information is located, as a difference area;

and combining the correction area and the reference area to be used as the reference area represented by the node.

Optionally, the method further comprises:

determining the minimum circumscribed rectangle of all the text information in the target file;

ignoring regions other than the minimum bounding rectangle in the target file.

In a second aspect, an embodiment of the present invention further provides an information identification apparatus, including:

the original file receiving module is used for receiving an original file;

the optical character recognition module is used for carrying out optical character recognition on the original file to obtain a target file, and the target file has text information;

a binarization processing module, configured to perform binarization processing on the target file according to the text information to obtain a dot matrix file;

the dot matrix model searching module is used for searching a dot matrix model matched with the original file;

the target model identification module is used for identifying a target model similar to the dot matrix file from the dot matrix model;

and the target information determining module is used for determining target information belonging to a specified class from the target file by using the target model.

Optionally, the binarization processing module includes:

the pixel point determining submodule is used for determining pixel points in the target file;

a first element setting submodule configured to set a first element as a first value, where the first element is a pixel point representing a single text message;

the second element setting submodule is used for setting a second element as a second value, wherein the second element is other pixel points except the first element;

and the target area merging submodule is used for merging the group of first elements into a target area.

Optionally, the target region merging sub-module includes:

the single distance counting unit is used for counting the number of the second elements spaced between every two adjacent first elements as a single distance;

a distance threshold calculation unit for calculating an average value of all the single distances as a distance threshold;

an adjacent element merging unit, configured to merge two adjacent first elements into a same group if a single distance between the two adjacent first elements is smaller than the distance threshold;

and the target area setting unit is used for setting the minimum bounding rectangle of all the first elements in the group as a target area.

Optionally, the lattice model searching module includes:

the dimension determining submodule is used for determining the dimension of the original file, and the dimension comprises an enterprise to which the original file belongs and the type of the original file;

and the dimension searching submodule is used for searching the dot matrix model set for the dimension.

Optionally, the target model identification module includes:

a single area calculation submodule, configured to determine, for each dot matrix model, the number of first elements and/or second elements included in a non-overlapping region, as a single area, where the non-overlapping region is a region, which is not overlapped with the target region, in the dot matrix model and in the reference region;

the total area calculation submodule is used for calculating the sum of all the single areas to serve as the total area;

the original area counting submodule is used for counting the number of first elements contained in all the reference areas and taking the first elements as original areas;

the similarity calculation submodule is used for calculating a difference value of a non-overlapping occupation ratio as the similarity of the dot matrix model and the dot matrix file, wherein the non-overlapping occupation ratio is the ratio of the total area to the original area;

and the target model setting submodule is used for setting the lattice model with the highest similarity as a target model similar to the dot matrix file.

In one embodiment of the present invention, the single area calculation submodule includes:

an overlap search unit, configured to search, for each dot matrix model, a target region that at least partially overlaps with the reference region;

the circumscribed rectangle generating unit is used for generating a minimum circumscribed rectangle containing the reference region and the target region if the reference region and the target region are found;

an overlap region removing unit, configured to remove a region where the reference region overlaps with the target region in the minimum bounding rectangle, to obtain a non-overlap region;

an element counting unit, configured to count the number of first elements and/or second elements included in the non-overlapping region as a single area;

and the quantity counting unit is used for counting the quantity of the first elements contained in the reference region as a single area if no such target region is found.

Optionally, the target information determining module includes:

a coordinate determination sub-module for determining coordinates indicated by the reference region in the target model;

and the text information extraction submodule is used for extracting the text information positioned in the coordinates in the target file to be used as the target information belonging to the reference area association category.

Optionally, the apparatus further includes:

the correction operation receiving module is used for receiving correction operation;

the target information correction module is used for correcting target information belonging to a certain class according to the correction operation;

and the dot matrix model updating module is used for updating the dot matrix model according to the correcting operation.

Optionally, the target information correcting module includes:

a correction instruction determining sub-module for determining a category indicated by the correction operation and a correction area indicated in the target file;

the correction text extraction sub-module is used for extracting text information in the correction area from the target file;

and the target information setting submodule is used for setting the text information as the target information of the category.

Optionally, the lattice model updating module includes:

the similarity determining submodule is used for determining the similarity between the dot matrix model and the dot matrix file;

a new model setting submodule, configured to set the lattice file as a new lattice model if the similarity is smaller than or equal to a preset threshold, where a target area where the target information is located in the lattice file is a reference area in the new lattice model;

a correction information determination submodule, configured to determine a category indicated by the correction operation and a correction area indicated in the target file if the similarity is greater than the preset threshold;

a reference area update submodule for updating a reference area representing the class based on the correction area.

Optionally, the reference area updating sub-module includes:

a first merging unit, configured to, if text information in the correction region includes text information in the reference region, merge the correction region and the reference region to serve as a reference region represented by the node;

or,

a first difference determining unit, configured to, if the text information in the reference region includes the text information in the correction region, subtract the text information in the correction region from the text information in the reference region to obtain difference information;

a first removing unit, configured to remove, from the reference region, the region where the difference information is located, the remaining region serving as the reference region represented by the node;

or,

a second difference determining unit, configured to, if the text information in the reference region is partially the same as the text information in the correction region, subtract the text information in the correction region from the text information in the reference region to obtain difference information;

a second removing unit, configured to remove, from the reference region, the region where the difference information is located, as a difference region;

and the second merging unit is used for merging the correction area and the reference area as the reference area represented by the node.

Optionally, the apparatus further includes:

a minimum circumscribed rectangle determining module, configured to determine a minimum circumscribed rectangle of all the text information in the target file;

and the region ignoring module is used for ignoring regions except the minimum bounding rectangle in the target file.

In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:

one or more processors;

a memory for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the information identification method according to any implementation of the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the information identification method according to any implementation of the first aspect.

In this embodiment, an original file is received; optical character recognition is performed on the original file to obtain a target file having text information; binarization processing is performed on the target file according to the text information to obtain a dot matrix file; a dot matrix model matched with the original file is searched for; a target model similar to the dot matrix file is identified from the dot matrix model; and target information belonging to a specified category is determined from the target file using the target model. Because the position of information in the original file is relatively fixed, the dot matrix model can locate the positions where target information of a given category may appear, which ensures the accuracy of the relationship between the category and the target information. The relationship between the category and the target information is identified automatically, which greatly reduces the operations of a user manually browsing the text, screening the required information and copying it into an editable document, thereby making information entry more convenient and less time-consuming.

Drawings

Fig. 1 is a flowchart of an information identification method according to an embodiment of the present invention;

Fig. 2A to Fig. 2J are exemplary diagrams of identifying target information according to an embodiment of the present invention;

Fig. 3 is a flowchart of an information identification method according to a second embodiment of the present invention;

Fig. 4A to Fig. 4C are exemplary diagrams of updating a reference area according to a second embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an information identification apparatus according to a third embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of an information identification method according to an embodiment of the present invention. The method is applicable to the case where information of a specified category is automatically identified according to a binarized dot matrix model. The method may be executed by an information identification device, which may be implemented in software and/or hardware and may be configured in a computer device, for example, a personal computer, a mobile terminal (e.g., a mobile phone or a tablet computer) or a wearable device (e.g., a smart watch). The method specifically includes the following steps:

S101, receiving an original file.

In this embodiment, paper documents of manufacturers, such as customs declarations, invoices and shopping receipts, can be collected at customs clearance, trade exhibitions, technical exchange conferences and the like, and original files are generated from them by scanning, photographing and the like.

The original file is a file whose text information is not editable. It is generally image data, although the image data may also be embedded in files of other formats, such as a PDF (Portable Document Format) file, a PPT (PowerPoint) file or a Word (a word processing application) file.

S102, carrying out optical character recognition on the original file to obtain a target file.

In this embodiment, OCR processing is performed on the original file to obtain a target file, wherein the target file has editable text information, and the position of the text information in the target file corresponds to its position in the original file.

OCR is a process of examining characters, determining their shapes by detecting dark and light patterns, and then translating the shapes into computer-readable text by a character recognition method. It generally includes the following processes:

Image preprocessing: mainly algorithms such as image binarization, noise removal and skew correction.

Layout analysis: the document image is divided into sections and lines; the algorithm that does this is called a layout analysis algorithm.

Character segmentation: the character segmentation algorithm mainly addresses characters that are difficult to separate because they touch one another or have broken strokes.

Character feature extraction: multidimensional features are extracted from the character image for the subsequent feature-matching pattern recognition algorithm.

Character recognition: the feature vector extracted from the current character is compared against a feature template library by coarse template classification followed by fine template matching, and the character is thereby recognized.

Layout recovery: the typesetting of the original document is identified, and the recognition result is output to a document in a format such as Word or PDF according to the original typesetting; the algorithm that does this is called a layout recovery algorithm.

Post-processing correction: the recognition result is corrected according to the context of the specific language.

In one embodiment of the invention, after S102, the target file may be normalized to improve the accuracy of identifying target information belonging to the specified category.

In a specific implementation, a Minimum Bounding Rectangle (MBR) of all text information is determined in the target file, and areas other than the minimum bounding rectangle are ignored in the target file.

The minimum bounding rectangle is the maximum extent, expressed in two-dimensional coordinates, of a number of two-dimensional shapes (here, the irregular figures formed by the text information); that is, it is the rectangle whose boundary is defined by the maximum abscissa, the minimum abscissa, the maximum ordinate and the minimum ordinate among the vertices of the given two-dimensional shapes.

In this embodiment, when converting a paper-based document into an original document, there may be an offset, and in order to maintain the accuracy of the relative position between text messages, the area outside the minimum bounding rectangle may be ignored.

Ignoring may mean cropping away the area outside the minimum bounding rectangle. Alternatively, the area outside the minimum bounding rectangle is retained and a coordinate system is established with one corner of the minimum bounding rectangle as the origin, the range of the coordinate system being the range of the minimum bounding rectangle; the positions of the areas referred to in this embodiment (such as the reference area and the correction area) are then all expressed in this coordinate system. This embodiment does not limit this.

For example, a customs declaration is scanned to obtain image data (the original file) as shown in Fig. 2A, and OCR processing is performed on the image data to obtain an editable target file as shown in Fig. 2B. Blank areas are present above, below, to the left of and to the right of the text in the target file. In this case, a minimum bounding rectangle 200 of all the text information may be generated, and the areas to the left of, to the right of, above and below the minimum bounding rectangle 200 may be cropped away; alternatively, a coordinate system may be established with point O as the origin, in which the length of the X axis is m and the length of the Y axis is n.
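As a minimal sketch of this normalization, assuming that OCR yields one axis-aligned bounding box per piece of text information, the minimum bounding rectangle and the shift to a coordinate system with origin O could be computed in Python as follows. The function names and the box layout are assumptions made for illustration.

```python
from typing import List, Tuple

# (left, top, right, bottom), inclusive pixel coordinates (assumed layout)
Box = Tuple[int, int, int, int]


def minimum_bounding_rectangle(boxes: List[Box]) -> Box:
    """Smallest axis-aligned rectangle that contains every text box."""
    if not boxes:
        raise ValueError("at least one text box is required")
    return (
        min(b[0] for b in boxes),  # minimum abscissa
        min(b[1] for b in boxes),  # minimum ordinate
        max(b[2] for b in boxes),  # maximum abscissa
        max(b[3] for b in boxes),  # maximum ordinate
    )


def to_mbr_coordinates(boxes: List[Box]) -> List[Box]:
    """Shift every box so that the top-left corner of the MBR is the origin O."""
    left, top, _, _ = minimum_bounding_rectangle(boxes)
    return [(l - left, t - top, r - left, b - top) for (l, t, r, b) in boxes]
```

Expressing every region relative to the rectangle's corner, rather than to the scanned page, is what makes the positions comparable between a shifted scan and a stored model.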

S103, carrying out binarization processing on the target file according to the text information to obtain a dot matrix file.

In this embodiment, a target file is subjected to binarization processing based on text information, that is, text information is represented by one numerical value and non-text information is represented by another numerical value, so that the target file is converted into a dot matrix file.

In a specific implementation, the pixel points in the target file can be determined and binarization processing performed pixel by pixel. The pixel points can then be represented by an array R[m][n], where m is the coordinate of a pixel point on the X axis and n is its coordinate on the Y axis.

In one aspect, a first element is set to a first value, such as 1, where the first element is an element corresponding to a pixel point representing a single text message.

On the other hand, the second element is set to a second value, such as 0, and the second element is other pixel points except the first element.

Further, to simplify the representation, only the pixel point located at the midpoint of a single piece of text information may be set to the first value, and the pixel points at other positions set to the second value. The single piece of text information nevertheless still occupies its original area, and when target areas are merged, the calculation is still performed on the original area.

For example, as shown in Fig. 2C, the area occupied by the character "ship" in Fig. 2B (its minimum bounding rectangle) has its upper left corner at (2, 1) and its lower right corner at (6, 5). As shown in Fig. 2D, the pixel points in this area may be set to 1 and the remaining pixel points set to 0. For simplification, as shown in Fig. 2E, only the pixel point (4, 3) located at the midpoint of the area may be set to 1, and the other pixel points set to 0.
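The binarization in this example could be sketched as follows, with the array indexed as R[x][y]. The sketch assumes that each piece of text information is given by an inclusive bounding box; the function name and the `midpoint_only` flag are hypothetical.

```python
from typing import List, Tuple

# (left, top, right, bottom), inclusive pixel coordinates (assumed layout)
Box = Tuple[int, int, int, int]


def binarize(
    boxes: List[Box], width: int, height: int, midpoint_only: bool = False
) -> List[List[int]]:
    """Build R[x][y]: 1 (first value) for text pixels, 0 (second value) elsewhere."""
    R = [[0] * height for _ in range(width)]
    for left, top, right, bottom in boxes:
        if midpoint_only:
            # Simplified representation: mark only the midpoint of the box.
            R[(left + right) // 2][(top + bottom) // 2] = 1
        else:
            for x in range(left, right + 1):
                for y in range(top, bottom + 1):
                    R[x][y] = 1
    return R
```

With the box from the example, (2, 1) to (6, 5), the full form marks 25 pixels and the simplified form marks only R[4][3].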

Thereafter, the grouped first elements are merged into a target area, and the remaining areas can represent second elements, thereby obtaining a dot matrix file.

For example, the original file shown in fig. 2A is converted into the dot matrix file 201 shown in fig. 2F, wherein the black area in the dot matrix file 201 represents the target area, and the white area represents the area where the second element is located.

In one merging manner, the number of second elements spaced between every two adjacent first elements may be counted as a single distance.

The average of all the single distances is calculated as the distance threshold.

If the single distance between two adjacent first elements is smaller than the distance threshold, the two adjacent first elements are merged into the same group.

Here, "adjacent" means that the line connecting two consecutive first elements is horizontal or vertical, and that there may be second elements, but no other first element, between the two first elements.

After all the first elements are traversed, the minimum bounding rectangle of all the first elements in the group may be set as the target area.

In this embodiment, the single distances are counted and their average is used as the distance threshold, so that the distance threshold adapts to each target file when the first elements are merged. Characters that are printed sparsely, such as "number of pieces and name of goods" in Fig. 2A, can thus be merged into the same target area, which avoids target information being missed in subsequent identification and improves the accuracy of identifying target information.
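The merging manner described above could be sketched as follows. For brevity the sketch handles only horizontal adjacency among character boxes that lie on one text row; the vertical direction would be handled analogously. The function name and the box layout are assumptions.

```python
from typing import List, Tuple

# (left, top, right, bottom), inclusive pixel coordinates (assumed layout)
Box = Tuple[int, int, int, int]


def merge_row(boxes: List[Box]) -> List[Box]:
    """Merge character boxes on one row into target areas using the mean gap."""
    boxes = sorted(boxes, key=lambda b: b[0])
    if len(boxes) < 2:
        return list(boxes)
    # Single distance: number of second elements between horizontal neighbours.
    gaps = [max(0, boxes[i + 1][0] - boxes[i][2] - 1) for i in range(len(boxes) - 1)]
    # Distance threshold: average of all single distances.
    threshold = sum(gaps) / len(gaps)
    groups = [[boxes[0]]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap < threshold:
            groups[-1].append(box)  # close enough: same group
        else:
            groups.append([box])  # too far apart: start a new group
    # Target area: minimum bounding rectangle of each group.
    return [
        (
            min(b[0] for b in g),
            min(b[1] for b in g),
            max(b[2] for b in g),
            max(b[3] for b in g),
        )
        for g in groups
    ]
```

Note that with a strict "smaller than" comparison, a row whose gaps are all equal is never merged, because no gap is below the average; in the sketch this follows directly from the rule as stated.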

And S104, searching a lattice model matched with the original file.

In this embodiment, since the original file formats are numerous, as shown in fig. 2G, a plurality of lattice models 202 can be preset according to the business requirements.

Further, the lattice model has a first element and a second element, wherein the first element constitutes a reference region of the associated category, i.e. the information belonging to the category is in the reference region.

As shown in fig. 2G, the black areas in the lattice model 202 represent reference areas, and the information in each black area represents its associated category, such as "sender", "receiver", "shipment name", "number of pieces", etc.; that is, the reference area labeled "sender" holds the name of the sender, the reference area labeled "receiver" holds the name of the receiver, and so on. The white area represents the area where the second elements are located.

In a specific implementation, a dimension may be selected within which the positional relationship of information in the original file is relatively fixed, for example the enterprise to which the original file belongs or the type of the original file (such as a customs declaration or an invoice), and lattice models may be set for that dimension.

At this time, the dimension of the original file is determined, and the lattice models set for that dimension are searched.

Of course, besides the enterprise to which the original file belongs and the type of the original file, other dimensions, such as time, may also be set, which is not limited in this embodiment.
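As a minimal sketch of the lookup by dimension, the lattice models could be kept in a registry keyed by the dimension values. The class name `LatticeModelRegistry`, the choice of the pair (enterprise, file type) as the key, and the example values are assumptions for illustration; the embodiment does not prescribe a storage structure.

```python
from collections import defaultdict


class LatticeModelRegistry:
    """Hypothetical store of lattice models keyed by (enterprise, file type)."""

    def __init__(self):
        self._models = defaultdict(list)

    def register(self, enterprise, file_type, model):
        # Several models may be preset for the same dimension.
        self._models[(enterprise, file_type)].append(model)

    def find(self, enterprise, file_type):
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self._models.get((enterprise, file_type), []))
```

Only the models returned by `find` would then be compared with the dot matrix file in S105.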

S105, identifying a target model similar to the lattice file from the lattice models.

The lattice models under the determined dimension are compared with the lattice file in turn, and the lattice model most similar to the lattice file is selected from the lattice models as the target model.

In one embodiment of the invention, the lattice model has a first element and a second element, the first element constitutes a reference area of the associated category, the lattice file has the first element and the second element, and the first element constitutes a target area representing the text information.

Then, in this embodiment, S105 includes the following steps:

S1051, for each lattice model, determining the number of first elements and/or second elements contained in the non-overlapping area as a single area.

The non-overlapping area is the area in a reference area of the lattice model that does not overlap with a target area.

In a specific implementation, for each lattice model, a target area that at least partially overlaps with a reference area is searched for.

If such a target area is found, the minimum bounding rectangle containing the reference area and the target area is generated.

The area where the reference area and the target area overlap is removed from the minimum bounding rectangle to obtain the non-overlapping area.

The number of first elements and/or second elements contained in the non-overlapping area is counted as the single area.

For example, as shown in fig. 2H and 2I, the reference region 203 partially overlaps the target region 204, a minimum bounding rectangle is determined for the reference region 203 and the target region 204, and the overlapping region is removed to obtain the non-overlapping region 205.
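The computation of a single area can be sketched as follows, under the assumption that the reference area and the target area are axis-aligned rectangles given as `(left, top, right, bottom)` with exclusive right and bottom edges, so that the element count of a rectangle equals its width times its height. The function names, and returning `None` when no overlap exists, are illustrative choices; the embodiment does not state how a reference area without an overlapping target area is handled.

```python
from typing import Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def area(r: Rect) -> int:
    """Number of elements covered by a rectangle (0 if it is empty)."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def intersection(a: Rect, b: Rect) -> Rect:
    """Overlapping rectangle of a and b (may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def bounding(a: Rect, b: Rect) -> Rect:
    """Minimum bounding rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def single_area(reference: Rect, target: Rect) -> Optional[int]:
    """Element count of the bounding rectangle minus the overlapping area."""
    overlap = area(intersection(reference, target))
    if overlap == 0:
        return None  # assumption: no overlapping target area was found
    return area(bounding(reference, target)) - overlap
```

For a reference area `(0, 0, 4, 2)` and a target area `(2, 0, 6, 2)`, the bounding rectangle covers 12 elements and the overlap covers 4, giving a single area of 8.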

S1052, calculating the sum of all the single areas as the total area.

All the reference areas associated with categories in the lattice model are traversed, and the sum of the single areas of all the reference areas is calculated as the total area.

S1053, counting the number of the first elements contained in all the reference regions, and taking the number as the original area.

The number of first elements contained in all the reference areas of the lattice model, namely the original area, can be recorded in a database as a parameter of the lattice model after it is first counted, and can be read directly from the database when the similarity between the lattice model and other lattice files is subsequently calculated.

S1054, calculating the difference between one and the non-overlapping proportion as the similarity between the lattice model and the lattice file.

The non-overlapping proportion is the ratio of the total area to the original area.

In this embodiment, the similarity between the lattice model and the lattice file is expressed by the following formula:

$$\mathrm{dist}(x, y) = 1 - \frac{\Delta Q}{Q} = 1 - \frac{\sum_{i} \Delta S_i}{Q}$$

where dist(x, y) represents the similarity between the lattice file x and the lattice model y, Q represents the original area, ΔQ represents the total area, i denotes the i-th reference area in the lattice model y, and ΔS_i represents the area of the non-overlapping area between the i-th reference area and the target area.

S1055, setting the lattice model with the highest similarity as the target model similar to the lattice file.

In a specific implementation, the similarities are sorted in descending order, and the lattice model with the top-ranked similarity is set as the target model similar to the lattice file.
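Steps S1052 to S1055 can be sketched as follows, taking the similarity as one minus the ratio of the total area to the original area. The function names and the dictionary layout of the candidates are assumptions for illustration, and the single areas are taken as already computed.

```python
def similarity(single_areas, original_area):
    """Similarity = 1 - (total area / original area)."""
    total_area = sum(single_areas)       # S1052: sum of all single areas
    return 1 - total_area / original_area


def pick_target_model(candidates):
    """Return the name of the lattice model with the highest similarity.

    `candidates` maps a model name to a pair (single_areas, original_area);
    this layout is hypothetical.
    """
    return max(candidates, key=lambda name: similarity(*candidates[name]))
```

With an original area of 100 and single areas of 8 and 2, the total area is 10 and the similarity is 0.9. Selecting the maximum directly is equivalent to sorting in descending order and taking the first entry.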

S106, determining target information belonging to a specified class from the target file by using the target model.

In specific implementation, the coordinates of the designated class can be calibrated through the target model, so that the text information of the corresponding coordinates is extracted from the target file and is used as the target information belonging to the class.

It should be noted that the categories differ between services; for example, for a customs declaration, the categories include the transportation mode, the transport name, the voyage number, the carrying number, and the like.

In a specific implementation, the target model has reference areas associated with categories. The coordinates indicated by a reference area are determined in the target model, and the text information located at those coordinates in the target file is extracted as the target information belonging to the category associated with the reference area.

For example, as shown in fig. 2J, after the target model is selected, the category of one reference area of the target model is "transport name" and its coordinates represent the region 206; "pitch 228" is then extracted from the region 206 in the target file and combined with "transport name" to construct the key-value pair "transport name: pitch 228". The category of another reference area of the target model is "voyage number" and its coordinates represent the region 207; "520201712240" is extracted from the region 207 in the target file and combined with "voyage number" to construct the key-value pair "voyage number: 520201712240".
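The extraction of target information by coordinates can be sketched as follows. The sketch assumes that the target file is available as a list of recognized words, each with a text and the coordinates of its centre, and that each reference area is a rectangle `(left, top, right, bottom)`; these data layouts, the coordinate values and the function name `extract_by_regions` are hypothetical.

```python
def extract_by_regions(words, regions):
    """Build key-value pairs {category: text} from words inside each region.

    `words`   : list of dicts {"text": str, "x": int, "y": int} (assumed OCR output)
    `regions` : dict mapping a category to (left, top, right, bottom)
    """
    result = {}
    for category, (left, top, right, bottom) in regions.items():
        hits = [w for w in words
                if left <= w["x"] < right and top <= w["y"] < bottom]
        # Read the hits in natural order: top to bottom, then left to right.
        hits.sort(key=lambda w: (w["y"], w["x"]))
        result[category] = " ".join(w["text"] for w in hits)
    return result
```

Each entry of the returned dictionary corresponds to one key-value pair of category and target information.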

Example two

Fig. 3 is a flowchart of an information identification method according to a second embodiment of the present invention, where the present embodiment further adds a correction operation based on the foregoing embodiment, and the method specifically includes the following steps:

S301, receiving an original file;

S302, carrying out optical character recognition on the original file to obtain a target file, wherein the target file has text information;

S303, carrying out binarization processing on the target file according to the text information to obtain a dot matrix file;

S304, searching for a lattice model matching the original file;

S305, identifying a target model similar to the dot matrix file from the lattice models;

S306, determining target information belonging to a specified class from the target file by using the target model.

S307, receiving a correction operation.

Because the positional relationship of the text information in original files is not fixed, target information belonging to a specified category is easily misidentified, especially in the initial period of operation, when the lattice models have accumulated little data.

Upon identifying an error in the target information belonging to the specified category, the user may trigger an operation to correct it, which may be referred to as a correction operation.

In one example, a user may determine a category to be corrected, delete target information attributed to the category, and select an area in a target file as a correction area, thereby triggering a correction operation intended to set text information in the correction area as target information attributed to the category.

S308, correcting the target information belonging to a certain category according to the correction operation.

After the correction operation is received, the target information belonging to a certain category may be corrected in response to the correction operation.

In a specific implementation, the category indicated by the correction operation and the correction area indicated in the target file may be determined.

The text information located in the correction area is extracted from the target file and set as the target information belonging to the category.

For example, as shown in fig. 2J, after the target model is selected, the category represented by a certain reference area of the target model is "voyage number". Assuming that its coordinates represent the region 208, "destination port" is extracted from the region 208 in the target file and combined with "voyage number" to construct the key-value pair "voyage number: destination port". If the user then finds the error, selects "voyage number", deletes "destination port", and re-selects the region 207, "520201712240" in the region 207 is extracted and combined with "voyage number" to construct the key-value pair "voyage number: 520201712240".
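The effect of a correction operation on the key-value pairs can be sketched as follows. The word list layout, the coordinate values and the function names are assumptions for illustration; the sketch only replaces the target information of the indicated category with the text found in the correction area.

```python
def text_in_area(words, area):
    """Join the words whose centre lies inside area = (left, top, right, bottom)."""
    left, top, right, bottom = area
    hits = [w for w in words
            if left <= w["x"] < right and top <= w["y"] < bottom]
    hits.sort(key=lambda w: (w["y"], w["x"]))
    return " ".join(w["text"] for w in hits)


def apply_correction(result, words, category, correction_area):
    """Return a copy of `result` with `category` re-extracted from the correction area."""
    corrected = dict(result)  # leave the original result untouched
    corrected[category] = text_in_area(words, correction_area)
    return corrected
```

The original dictionary is left unchanged so that the erroneous and corrected values can both be used when the lattice model is subsequently updated.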

S309, updating the lattice model according to the correction operation.

After the target information belonging to a certain category is corrected, the lattice model is updated accordingly, so that the precision of the lattice model is improved.

In a specific implementation, the similarity between the lattice model and the lattice file may be determined.

If the similarity is smaller than or equal to a preset threshold, such as 0.9, the lattice file is set as a new lattice model, wherein a target area where target information is located in the lattice file becomes a reference area in the new lattice model, and a target area where non-target information is located, together with the area where the second elements are located in the lattice file, becomes the area where the second elements are located in the new lattice model.

If the similarity is larger than the preset threshold, such as 0.9, the category indicated by the correction operation and the correction area indicated in the target file are determined, and the reference area representing the category is updated based on the correction area.
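The branch on the preset threshold can be sketched as follows. The function names, the string labels returned, and the representation of a lattice model as a dictionary from category to rectangle are hypothetical; the value 0.9 is the example threshold given in this embodiment.

```python
def plan_model_update(similarity, threshold=0.9):
    """Decide how the lattice model is updated after a correction operation."""
    if similarity <= threshold:
        return "create_new_model"       # the lattice file becomes a new model
    return "update_reference_area"      # adjust the existing reference area


def new_model_from_file(target_areas, category_of):
    """Keep only target areas that hold target information as reference areas.

    `target_areas` : list of rectangles found in the lattice file
    `category_of`  : dict mapping the index of a target area to its category;
                     areas without an entry hold non-target information and
                     are treated like second elements (i.e. dropped).
    """
    return {category_of[i]: rect
            for i, rect in enumerate(target_areas) if i in category_of}
```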

In one case, if the text information in the correction area contains the text information in the reference area, that is, some correct text information is missing from the text information in the reference area, the correction area and the reference area are merged to serve as the reference area represented by the node.

Further, most of the regions after the merge operation are irregular figures, and for simplification, in a case where the minimum bounding rectangle of the region after the merge operation does not overlap with other reference regions, the region after the merge operation may be simplified to the minimum bounding rectangle of the region after the merge operation.

Of course, the region after the merging operation may also be directly used as the reference region, which is not limited in this embodiment.

For example, as shown in fig. 4A, when the target information belonging to the "voyage number" is identified, the text information in the reference area 401 is "0201712240", from which "52" is missing. The user then triggers a correction operation for the "voyage number", defines the correction area 402, and selects "520201712240"; the reference area 401 and the correction area 402 may then be merged.
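The merging operation with the optional simplification to a minimum bounding rectangle can be sketched as follows, assuming rectangles `(left, top, right, bottom)` with exclusive right and bottom edges. Representing the irregular merged region as a list of its two original rectangles is an illustrative choice, as are the function names.

```python
def bounding(a, b):
    """Minimum bounding rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def overlaps(a, b):
    """True if rectangles a and b share at least one element."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_regions(reference, correction, other_references):
    """Merge two areas; simplify to the bounding rectangle when that is safe."""
    box = bounding(reference, correction)
    if any(overlaps(box, other) for other in other_references):
        # The bounding rectangle would intrude on another reference area,
        # so the irregular union is kept as its two parts.
        return [reference, correction]
    return [box]
```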

In another case, if the text information in the reference area contains the text information in the correction area, that is, the text information in the reference area includes superfluous text information, the text information in the correction area is subtracted from the text information in the reference area to obtain difference information.

The area where the difference information is located is removed from the reference area, and the remaining area serves as the reference area represented by the node.

For example, as shown in fig. 4B, when the target information belonging to the "voyage number" is identified, the text information in the reference area 401 is "520201712240 way (3)", in which "way (3)" is superfluous. The user then triggers a correction operation for the "voyage number", defines the correction area 402, and selects "520201712240"; the area where "way (3)" is located (i.e., the area to the right of the line segment 403) may then be removed from the reference area 401.

In yet another case, if the text information in the reference area is partially identical to the text information in the correction area, that is, the text information in the reference area both lacks some correct text information and includes some superfluous text information, the text information in the correction area is subtracted from the text information in the reference area to obtain difference information.

The area where the difference information is located is removed from the reference area to obtain a difference area, which is then merged with the correction area.

For example, as shown in fig. 4C, when the target information belonging to the "voyage number" is identified, the text information in the reference area 401 is "0201712240 THREE (3)", in which "52" is missing and "THREE (3)" is superfluous. The user then triggers a correction operation for the "voyage number", defines the correction area 402, and selects "520201712240"; the area where "THREE (3)" is located (i.e., the area to the right of the line segment 403) may then be removed from the reference area 401, and the remaining area of the reference area 401 (the area to the left of the line segment 403) is merged with the correction area 402.
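The three cases above can be distinguished from the two text strings alone, as in the following sketch. It uses `difflib.SequenceMatcher` from the Python standard library to find the longest common substring in the partially identical case; this choice, the function name `classify_correction` and the returned labels are assumptions for illustration and are not prescribed by this embodiment.

```python
from difflib import SequenceMatcher


def classify_correction(reference_text, correction_text):
    """Return (action, difference_information) for a correction operation."""
    if reference_text == correction_text:
        return ("unchanged", "")
    if reference_text in correction_text:
        # Reference text lacks some correct text: merge the two areas.
        return ("merge", "")
    if correction_text in reference_text:
        # Reference text has superfluous text: remove the difference.
        diff = reference_text.replace(correction_text, "", 1).strip()
        return ("remove", diff)
    matcher = SequenceMatcher(None, reference_text, correction_text, autojunk=False)
    m = matcher.find_longest_match(0, len(reference_text), 0, len(correction_text))
    if m.size == 0:
        return ("no_overlap", reference_text)
    # Partially identical: remove the difference, then merge.
    common = reference_text[m.a:m.a + m.size]
    diff = reference_text.replace(common, "", 1).strip()
    return ("remove_then_merge", diff)
```

The difference information returned in the second and third cases identifies the text whose area is to be removed from the reference area.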

For the layer represented by the category, if the correction area does not overlap with the reference area represented by any node in the layer, or if the correction area partially overlaps with the reference areas represented by two or more nodes in the layer, a node is newly added to the layer, and the correction area is set as the reference area represented by that node.

In this embodiment, a correction operation is received, target information belonging to a certain category is corrected according to the correction operation, the lattice model is updated according to the correction operation, and the lattice model is continuously optimized by continuously accumulating data and self-learning the position characteristics of different original files, so that the accuracy of identifying the target information is improved to more than 95%.

Example three

Fig. 5 is a schematic structural diagram of an information identification apparatus according to a third embodiment of the present invention, where the apparatus may specifically include the following modules:

an original file receiving module 501, configured to receive an original file;

an optical character recognition module 502, configured to perform optical character recognition on the original file to obtain a target file, where the target file has text information;

a binarization processing module 503, configured to perform binarization processing on the target file according to the text information to obtain a dot matrix file;

a lattice model searching module 504, configured to search a lattice model matching the original file;

a target model identification module 505, configured to identify a target model similar to the lattice file from the lattice models;

a target information determination module 506, configured to determine target information belonging to a specified class from the target file using the target model.

In one embodiment of the present invention, the binarization processing module 503 includes:

In one embodiment of the present invention, the target region merging sub-module includes:

In one embodiment of the present invention, the lattice model lookup module 504 includes:

In one embodiment of the invention, the lattice model has a first element and a second element, the first element constitutes a reference area of an associated category, the lattice file has the first element and the second element, and the first element constitutes a target area representing text information;

the object model identification module 505 comprises:

In one embodiment of the present invention, the target information determining module 506 includes:

In one embodiment of the present invention, further comprising:

In one embodiment of the present invention, the target information correcting module includes:

In one embodiment of the present invention, the lattice model update module comprises:

In one embodiment of the present invention, the reference area update sub-module includes:

alternatively,

In one embodiment of the present invention, further comprising:

The information identification device provided by the embodiment of the invention can execute the information identification method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Example four

Fig. 6 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. As shown in fig. 6, the computer apparatus includes a processor 600, a memory 601, a communication module 602, an input device 603, and an output device 604; the number of processors 600 in the computer device may be one or more, and one processor 600 is taken as an example in fig. 6; the processor 600, the memory 601, the communication module 602, the input device 603 and the output device 604 in the computer apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 6.

The memory 601, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as modules corresponding to the information identification method in the present embodiment (for example, an original file receiving module 501, an optical character recognition module 502, a binarization processing module 503, a lattice model search module 504, an object model recognition module 505, and an object information determination module 506 in the information identification apparatus shown in fig. 5). The processor 600 executes various functional applications and data processing of the computer device by executing software programs, instructions and modules stored in the memory 601, that is, implements the information identification method described above.

The memory 601 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 601 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 601 may further include memory located remotely from processor 600, which may be connected to a computer device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The communication module 602 is configured to establish a connection with the display screen and implement data interaction with the display screen.

The input device 603 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus, and may also be a camera for acquiring images and a sound pickup apparatus for acquiring audio data.

The output device 604 may include an audio device such as a speaker.

It should be noted that the specific composition of the input device 603 and the output device 604 can be set according to actual situations.

The processor 600 executes various functional applications and data processing of the device by running the software programs, instructions and modules stored in the memory 601, that is, implements the information identification method described above.

The computer device provided in this embodiment can execute the information identification method provided by any embodiment of the present invention, and has the corresponding functions and beneficial effects.

Example five

An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements an information identification method, and the method includes:

receiving an original file;

searching a lattice model matched with the original file;

identifying a target model similar to the lattice file from the lattice model;

Of course, the computer program of the computer-readable storage medium provided in the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in the information identification method provided in any embodiments of the present invention.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the information identification apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. An information identification method, comprising:

receiving an original file;

searching a lattice model matched with the original file;

identifying a target model similar to the lattice file from the lattice model;

determining target information belonging to a specified class from the target file using the target model;

the lattice model is provided with a first element and a second element, the first element forms a reference area of an associated category, the lattice file is provided with the first element and the second element, and the first element forms a target area representing text information;

calculating the sum of all the single areas as a total area;

2. The method according to claim 1, wherein the binarizing the target file according to the text information to obtain a dot matrix file comprises:

determining pixel points in the target file;

merging the first elements in the group into a target area.

3. The method of claim 2, wherein said merging the set of first elements into a target region comprises:

4. The method of claim 1, wherein said using the object model to determine object information belonging to a specified class from the object file comprises:

5. The method according to any one of claims 1-4, further comprising:

receiving a correction operation;

updating the lattice model according to the correction operation;

the correcting target information belonging to a certain class according to the correcting operation comprises:

and setting the text information as the target information of the category.

6. The method of claim 5, wherein said updating the lattice model in accordance with the corrective action comprises:

7. An information identifying apparatus, comprising:

the original file receiving module is used for receiving an original file;

the target information determining module is used for determining target information belonging to a specified class from the target file by using the target model;

the object model identification module comprises:

8. A computer device, characterized in that the computer device comprises:

one or more processors;

a memory for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the information identification method of any of claims 1-6.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the information identification method according to any one of claims 1 to 6.