CN111144114B - Text recognition method and device - Google Patents

Text recognition method and device

Info

Publication number
CN111144114B
CN111144114B (application CN201911315736.7A)
Authority
CN
China
Prior art keywords
text
entities
entity
subset
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911315736.7A
Other languages
Chinese (zh)
Other versions
CN111144114A (en)
Inventor
赵晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Glodon Co Ltd
Original Assignee
Glodon Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Glodon Co Ltd filed Critical Glodon Co Ltd
Priority to CN201911315736.7A
Publication of CN111144114A
Application granted
Publication of CN111144114B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the invention relate to a text recognition method and a text recognition apparatus, comprising the following steps: converting all text primitives in the target graphic data into a plurality of text entities; aggregating the text entities into a plurality of text groups based on the distance relations among the text entities and their text arrangement, wherein each text group includes at least one text entity; for each text group, sliding a preset sliding window along the text entities in the group to obtain a plurality of text combinations; and inputting the text combinations into a preset text classification model, obtaining the semantic type and probability the model outputs for each combination, and selecting the combination with the highest probability, together with its semantic type, as the text recognition result. In this way, automatic recognition of drawing text can be realized.

Description

Text recognition method and device
Technical Field
The embodiment of the invention relates to the field of pattern recognition, in particular to a text recognition method and device.
Background
In CAD drawings, designers label structural members with annotation text. The semantic types expressed by annotation text are rich (including elevations, member names, rebar specifications, and other types), and the text is often split across multiple primitives in the drawing. How to recover the annotation text and its semantic type by recognizing the primitives in a CAD drawing has therefore been an ongoing topic of discussion in the industry.
Disclosure of Invention
In view of this, in order to solve the above technical problems or part of the technical problems, embodiments of the present invention provide a text recognition method and apparatus.
In a first aspect, an embodiment of the present invention provides a text recognition method, including:
converting all text graphic primitives in the target graphic data into a plurality of text entities;
based on the distance relations among the text entities and the text arrangement of the text entities, aggregating the plurality of text entities to obtain a plurality of text groups; wherein each text group includes at least one text entity;
aiming at each text group, sliding along text entities in the text group by utilizing a preset sliding window to obtain a plurality of text combinations;
respectively inputting a plurality of text combinations into a preset text classification model, acquiring semantic types and probabilities thereof expressed by the text combinations output by the text classification model, and selecting the text combination with the highest probability and the semantic type thereof as a text recognition result.
In one possible implementation manner, the converting all text primitives into a plurality of text entities includes:
for each text graphic primitive, if the text graphic primitive consists of characters and graphics, identifying the characters and the graphics to obtain a text entity;
and/or,
if the text graphic primitive is composed of a plurality of characters and comprises preset characters, splitting the characters based on the preset characters to obtain a plurality of text entities; each text entity includes at least one character.
In one possible implementation manner, the aggregating the plurality of text entities based on the distance relation between the text entities and the text arrangement of the text entities to obtain a plurality of text groups includes:
determining a two-dimensional bounding box and a text direction of each text entity;
according to the two-dimensional bounding box and the text direction of each text entity, aggregating a plurality of text entities into a plurality of text sets; each text set comprises at least one text entity, and the two-dimensional bounding boxes of the text entities in each text set are overlapped and have the same text direction;
for each text set, projecting all text entities in the text set in a preset direction, and dividing the text set into at least one text subset based on the projection of each text entity; each text subset comprises at least one text entity, and projections of the text entities contained in each text subset in a preset direction are overlapped;
and sequencing the text entities in each text subset according to each text subset to generate a text group corresponding to the text subset.
In one possible implementation manner, the sorting of the text entities in the text subset to generate the text group corresponding to the text subset includes:
establishing a local coordinate system of the text subset by taking the text direction of the text entity of the text subset as an X axis and taking the direction of clockwise rotation of the text direction by 90 degrees as a Y axis;
the text entities in the subset of text are ordered based on their coordinates in the local coordinate system.
In one possible implementation manner, the inputting the plurality of text combinations into the preset text classification model respectively, and obtaining the semantic type and the probability thereof expressed by each text combination output by the text classification model includes:
for each text combination, replacing a designated text entity in the text combination with a preset text entity corresponding to the designated text entity type according to a preset rule;
word segmentation processing is carried out on the text entities in the replaced text combination to obtain a plurality of segmented words, word embedding processing is carried out on each segmented word to obtain a word vector of each segmented word, and the text vector of the text combination is determined based on the word vector of each segmented word;
and inputting the text vectors corresponding to the text combinations into a preset text classification model, classifying the text combinations based on the text vectors corresponding to the text combinations by the text classification model, and outputting semantic types and probabilities of the text combinations.
In a second aspect, an embodiment of the present invention provides a text recognition apparatus, including:
the conversion unit is used for converting all text graphic primitives in the target graphic data into a plurality of text entities;
the aggregation unit is used for aggregating a plurality of text entities based on the distance relations among the text entities and the text arrangement of the text entities to obtain a plurality of text groups; wherein each text group includes at least one text entity;
the combination unit is used for sliding along text entities in each text group by utilizing a preset sliding window to obtain a plurality of text combinations;
the classifying unit is used for respectively inputting a plurality of text combinations into a preset text classifying model, acquiring semantic types and probabilities thereof expressed by the text combinations output by the text classifying model, and selecting the text combination with the highest probability and the semantic type thereof as a text recognition result.
In a possible implementation manner, the conversion unit is specifically configured to identify, for each text primitive, a character and a graphic if the text primitive is composed of the character and the graphic, so as to obtain a text entity; and/or if the text graphic primitive is composed of a plurality of characters and the text graphic primitive comprises preset characters, splitting the characters based on the preset characters to obtain a plurality of text entities; each text entity includes at least one character.
In a possible implementation manner, the aggregation unit is specifically configured to determine a two-dimensional bounding box and a text direction of each text entity; according to the two-dimensional bounding box and the text direction of each text entity, aggregating a plurality of text entities into a plurality of text sets; each text set comprises at least one text entity, and the two-dimensional bounding boxes of the text entities in each text set are overlapped and have the same text direction; for each text set, projecting all text entities in the text set in a preset direction, and dividing the text set into at least one text subset based on the projection of each text entity; each text subset comprises at least one text entity, and projections of the text entities contained in each text subset in a preset direction are overlapped; and sequencing the text entities in each text subset according to each text subset to generate a text group corresponding to the text subset.
In a possible implementation manner, when the text entities in the text subset are ordered to generate text groups corresponding to the text subset, the aggregation unit is configured to establish a local coordinate system of the text subset by taking a text direction of the text entities in the text subset as an X axis and a direction of clockwise rotation of the text direction by 90 ° as a Y axis; the text entities in the subset of text are ordered based on their coordinates in the local coordinate system.
In a possible implementation manner, the classifying unit is specifically configured to replace, for each text combination, a specified text entity in the text combination with a preset text entity corresponding to the specified text entity type according to a preset rule; perform word segmentation on the text entities in the replaced text combination to obtain a plurality of segmented words, perform word embedding on each segmented word to obtain its word vector, and determine the text vector of the text combination based on the word vectors; and input the text vector corresponding to each text combination into the preset text classification model, so that the model classifies each text combination based on its text vector and outputs its semantic type and probability.
In the text recognition method provided by the embodiment of the invention, the electronic device can obtain the text primitives in a drawing imported into the device and generate a plurality of text entities. The electronic device can then perform density clustering of the text entities, text arrangement recognition, and so on to generate a plurality of text groups, generate all text combinations using a sliding-window method, classify the combinations with a trained text classification model, and select the appropriate text combination according to probability as the text recognition result.
On the one hand, text in drawings can be recognized automatically, which greatly improves recognition efficiency and enables batch recognition of drawing text.
On the other hand, because the text combinations are classified by a trained machine-learning model, the classification results are more accurate, and there is no need to hand-write a regular expression for each semantic type, which saves labor cost.
Drawings
FIG. 1 is a flow chart of a text recognition method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a text collection shown in an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of another text collection shown in an exemplary embodiment of the present application;
FIG. 4a is a schematic diagram of another text collection shown in an exemplary embodiment of the present application;
FIG. 4b is a schematic diagram of another text collection shown in an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of another text collection shown in an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of generating text combinations as shown in an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of generating text combinations as shown in an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a text group shown in an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a text classification model training method according to an exemplary embodiment of the present application;
fig. 10 is a block diagram of a text recognition device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
For the purpose of facilitating an understanding of the embodiments of the present invention, reference will now be made to the following description of specific embodiments, taken in conjunction with the accompanying drawings, which are not intended to limit the embodiments of the invention.
In conventional drawing text recognition technology, electronic devices typically perform text recognition in two steps.
The first step is: the electronic device aggregates text drawn separately in the drawing according to features such as the geometric distance between the annotation texts.
The second step is: the electronic device recognizes the aggregated text through regular expressions pre-written by a developer.
However, on the one hand, drawing texts are generally semantically associated; conventional drawing text recognition aggregates only by distance and ignores the semantic association between texts, so aggregation easily produces many errors.
On the other hand, there are many types of annotation text, and each type needs its own regular expression, which greatly increases labor cost.
In a third aspect, annotation text takes diverse forms of expression, and it is difficult for one regular expression to cover all expression forms of a given type, so the conventional drawing text recognition technology has poor practicability.
The purpose of the application is to provide a text recognition method in which an electronic device can obtain the text primitives in a drawing imported into the device and generate a plurality of text entities. The electronic device can then perform density clustering of the text entities, text arrangement recognition, and so on to generate a plurality of text groups, generate all text combinations using a sliding-window method, classify the combinations with a trained text classification model, and select the appropriate text combination according to probability as the text recognition result.
On the one hand, the electronic device can recognize text in drawings automatically, which greatly improves recognition efficiency and enables batch recognition of drawing text.
On the other hand, in the application the electronic device classifies text combinations with a trained machine-learning model, so the classification results are more accurate, and no regular expression needs to be hand-written for each semantic type, which saves labor cost.
The text recognition method provided in the present application is described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a text recognition method according to an exemplary embodiment of the present application, where the method may be applied to an electronic device, and may include the following steps.
Step 101: the electronic device converts all text primitives in the target graphic data into a plurality of text entities.
It should be noted that the graphics described in the present application are graphics carrying text annotations; for example, they may be the graphic data stored in CAD drawings. The CAD drawing may be of the building type, or of the mechanical, electrical, or other types; the present application does not specifically limit this.
In the embodiment of the application, the electronic device can obtain all text primitives in the graphic data and convert them into a plurality of text entities stored in a predefined text data structure. Each text entity includes the coordinates, direction, text height, text content, and so on of the text primitive in the drawing.
Several ways of implementing step 101 are described below.
Mode one: for each text primitive, if the text primitive is composed of characters and graphics, the electronic device identifies the characters and graphics to obtain a text entity.
For example, if the text primitive is "(1)" (the character "1" enclosed in a circle graphic), the character (i.e., 1) and the graphic (i.e., the circle) are recognized together, and the resulting text entity is "circle 1".
Mode two: if the text graphic primitive is composed of a plurality of characters and the text graphic primitive comprises preset characters, the electronic equipment can split the characters based on the preset characters to obtain a plurality of text entities; each text entity includes at least one character.
For example, assume the preset character is a space and the text primitive is "KZ1 500×500". Based on the space character, the primitive is split into two text entities: "KZ1" and "500×500".
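Mode Two can be sketched in a few lines. This is a minimal illustration only; in the patent, each resulting entity is additionally stored in a predefined structure with its coordinates, direction, and text height.

```python
def split_primitive(text, preset_char=" "):
    # Split a text primitive into text entities on a preset character
    # (Mode Two). Empty fragments from repeated separators are dropped;
    # each surviving fragment becomes one text entity's content.
    return [part for part in text.split(preset_char) if part]
```

For instance, `split_primitive("KZ1 500x500")` yields the two entities `"KZ1"` and `"500x500"`, matching the example above.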
It should be noted that the electronic device may implement step 101 using the first mode or the second mode alone, or, of course, using both modes together.
Of course, the electronic device may also convert all text primitives in the target graphic into a plurality of text entities in other manners, which are not specifically limited herein.
Step 102: the electronic equipment aggregates a plurality of text entities based on the distance relation among the text entities and the text arrangement of the text entities to obtain a plurality of text groups.
Step 102 consists of three sub-steps: density clustering, splitting according to text arrangement, and sorting in reading order.
These sub-steps are described in detail below as steps 1021 through 1024.
1. Density clustering:
step 1021: the electronic device determines a two-dimensional bounding box and a text direction for each text entity.
Step 1022: the electronic equipment aggregates a plurality of text entities into a plurality of text sets according to the two-dimensional bounding boxes and the text directions of the text entities; each text set includes at least one text entity, and the two-dimensional bounding boxes of the text entities in each text set overlap and have the same text direction.
In implementation, the electronic device may extract the two-dimensional bounding box of each text entity. If the bounding boxes of two text entities overlap and their text directions are the same, the two entities are aggregated into one group. The device then finds all other text entities whose text direction is the same and whose bounding box overlaps a bounding box in the group, and adds them to the group. This step is repeated until no new text entity is found, at which point the group forms a text set. For example, the outlined box in FIG. 2 forms such a text set.
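The aggregation loop of steps 1021 and 1022 can be sketched as follows. `TextEntity` and its fields are hypothetical stand-ins for the patent's predefined text data structure, and direction matching is simplified to exact equality:

```python
from dataclasses import dataclass

@dataclass
class TextEntity:
    text: str
    bbox: tuple       # two-dimensional bounding box (xmin, ymin, xmax, ymax)
    direction: float  # text direction, e.g. an angle in degrees

def boxes_overlap(a, b):
    # Axis-aligned rectangles overlap iff their intervals overlap on both axes.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def density_cluster(entities):
    # Grow each text set by repeatedly absorbing entities whose bounding box
    # overlaps one already in the set and whose text direction matches,
    # until no new entity is found.
    remaining = list(entities)
    text_sets = []
    while remaining:
        group = [remaining.pop(0)]
        grew = True
        while grew:
            grew = False
            for e in list(remaining):
                if any(e.direction == g.direction and boxes_overlap(e.bbox, g.bbox)
                       for g in group):
                    group.append(e)
                    remaining.remove(e)
                    grew = True
        text_sets.append(group)
    return text_sets
```

Two overlapping, same-direction labels end up in one text set, while a distant label forms its own set.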
It should be noted that computing a text entity's two-dimensional bounding box must consider the influence of lines in the drawing: if a parallel straight line lies below a text, the corresponding bounding box needs to be enlarged appropriately, by stretching it in the direction perpendicular to the line (in the figures, dashed boxes denote text bounding boxes).
For example, as shown in fig. 3.
As shown in the left diagram of fig. 3, there is no solid line below the text entity c8@200, and the two-dimensional bounding box of the text entity (i.e. the dashed line in the left diagram of fig. 3) is shown in the left diagram of fig. 3.
As shown in the right diagram of fig. 3, there is a solid line below the text entity c8@200, and the two-dimensional bounding box of the text entity (i.e. the dashed line in the right diagram of fig. 3) is shown in the right diagram of fig. 3.
As can be seen from the left and right diagrams of FIG. 3, the two-dimensional bounding box of a text entity with a solid line below it is larger than that of a text entity without a solid line below it.
2. Splitting according to text arrangement:
step 1023: the electronic equipment projects all text entities in each text set in a preset direction, and divides the text set into at least one text subset based on the projection of each text entity; each text subset includes at least one text entity, and the projections of the text entities included in each text subset in a preset direction overlap.
As shown in FIG. 4a, the two columns of text in the box of FIG. 4a form a single text set after steps 1021 and 1022. At this point, the text arrangement within the set needs to be identified, and the text entities that are visually arranged in two columns must be split into two text subsets.
In implementation, for each text set, the electronic device identifies the text arrangement by projecting the text entities in the set in a preset direction relative to the text direction, obtaining a one-dimensional projection interval for each entity.
The electronic device may then divide the text set into at least one text subset based on the projection intervals; each text subset includes at least one text entity, and the projections of the text entities within a subset overlap.
The preset projection direction may be horizontal, vertical, or another direction; it is not specifically limited here.
For example, as shown in fig. 4b, it is assumed that all text entities in fig. 4b belong to one text set. The electronic device may project the set of text in a vertical direction, and thus, the text entities "c8@200" and "4C16" are grouped into a subset of text, as the text entities "c8@200" and "4C16" are projected to overlap. Similarly, "KZ1" overlaps the projection of "one-layer-two-layer" in the vertical direction, so that "KZ1" and "one-layer-two-layer" also form a text subset.
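Step 1023 amounts to grouping entities whose projection intervals overlap. The sketch below uses hypothetical `(text, bbox)` pairs; `axis=0` keeps each box's x-interval, i.e. the projection obtained by collapsing along the vertical direction, so entities stacked in one column land in one subset:

```python
def split_by_projection(entities, axis=0):
    # entities: list of (text, (xmin, ymin, xmax, ymax)) pairs.
    # Sweep the 1-D projection intervals in sorted order and start a new
    # subset whenever the next interval does not overlap the current run.
    lo, hi = axis, axis + 2
    ordered = sorted(entities, key=lambda e: e[1][lo])
    subsets, current, end = [], [], None
    for text, box in ordered:
        if current and box[lo] <= end:   # intervals overlap: same subset
            current.append(text)
            end = max(end, box[hi])
        else:
            if current:
                subsets.append(current)
            current, end = [text], box[hi]
    if current:
        subsets.append(current)
    return subsets
```

With two visually separate columns, as in FIG. 4a, the sweep produces two text subsets.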
3. Sorting in reading order:
step 1024: and the electronic equipment sorts the text entities in the text subsets aiming at each text subset to generate text groups corresponding to the text subsets.
In implementation, the electronic device establishes a local coordinate system for the text subset, taking the text direction of the subset's text entities as the X axis and the direction obtained by rotating the text direction 90° clockwise as the Y axis;
the electronic device may sort the text entities in the subset of text based on the coordinates of each text entity in the local coordinate system and treat the sorted subset of text as a clique of text.
For example, as shown in fig. 5, assume that the text subset is the area enclosed by the dashed box of fig. 5, and the text entities included in the text subset are "KZ3, 500×500, 12a20, a8@100/200, up to 0.400".
In the local coordinate system of this text subset, the x and y axes are as shown in FIG. 5. Ordering by y coordinate from large to small and then by x coordinate from small to large, the texts in the text group are sequenced as: "KZ3", "500×500", "12A20", "A8@100/200", "up to", "0.400".
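Step 1024 can be sketched as below, with each entity reduced to a hypothetical `(text, x, y)` tuple in world coordinates and the text direction given as an angle. Taking the clockwise normal as the local Y axis means that sorting by (local Y, local X) ascending gives top-to-bottom, left-to-right reading order:

```python
import math

def reading_order(entities, angle_deg=0.0):
    # Local X axis = text direction; local Y axis = text direction rotated
    # 90 degrees clockwise. Sort keys are the entity coordinates expressed
    # in that local frame.
    rad = math.radians(angle_deg)
    c, s = math.cos(rad), math.sin(rad)
    def key(entity):
        _, x, y = entity
        local_x = x * c + y * s    # projection onto the text direction
        local_y = x * s - y * c    # projection onto the clockwise normal
        return (round(local_y, 6), local_x)
    return [e[0] for e in sorted(entities, key=key)]
```

For horizontal text (angle 0) this reproduces the FIG. 5 ordering: rows from top to bottom, and within the last row "up to" before "0.400".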
Through steps 1021 to 1024, all text groups clustered according to the positional relations of the text entities are obtained, with the text entities in each group arranged in reading order.
Step 103: and the electronic equipment slides along the text entities in each text group by utilizing a preset sliding window to obtain a plurality of text combinations.
In implementation, the electronic device generates all text combinations within a text group using a sliding-window approach.
For example, the text group {A, B, C} shown in the upper-left corner of FIG. 6 contains three text entities arranged in order. Sliding windows spanning 1, 2, and 3 texts are constructed (upper-right, lower-left, and lower-right of FIG. 6, respectively); pushing a window one position to the right produces a new merged text.
The gray cells in FIG. 7 correspond to all possible merged texts. Combining them yields four possible combinations: {A, B, C}, {AB, C}, {A, BC}, and {ABC}.
As another example, consider the text group within the box of FIG. 8, which contains the two text entities "4" and "C25". Two text combinations are generated: the first contains the two text entities "4" and "C25"; the second contains the single text entity "4C25".
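The sliding-window enumeration of step 103 produces every way of merging adjacent text entities in a group, i.e. every partition of the ordered group into contiguous runs. A recursive sketch:

```python
def text_combinations(group):
    # Each combination is a partition of the group into contiguous runs,
    # with each run merged into one candidate text. Choosing the length of
    # the first run corresponds to one sliding-window position.
    if not group:
        return [[]]
    results = []
    for i in range(1, len(group) + 1):
        head = "".join(group[:i])            # first run merged into one text
        for tail in text_combinations(group[i:]):
            results.append([head] + tail)
    return results
```

For the group {A, B, C} this yields the four combinations of FIG. 7; for the group of FIG. 8 it yields ["4", "C25"] and ["4C25"].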
Step 104: the electronic equipment inputs the text combinations into a preset text classification model respectively, acquires the semantic types and the probabilities thereof expressed by the text combinations output by the text classification model, and selects the text combination with the highest probability and the semantic type thereof as a text recognition result.
Wherein, the semantic types include: component name, component rebar specification, component rebar quantity, etc.
The following describes both the application of the text classification model and the training of the text classification model.
1. Application of a text classification model.
The electronic equipment inputs the text combinations into a preset text classification model respectively, acquires the semantic types and the probabilities thereof expressed by the text combinations output by the text classification model, and selects the text combination with the highest probability and the semantic type thereof as the recognition result of the text group.
For example, when the electronic device inputs "A8@200" into the classification model, the model outputs a probability of 95% that the semantic type of "A8@200" is component rebar quantity, and a probability of 5% that it is component rebar specification.
Thus, the electronic device can determine that "A8@200" expresses component rebar quantity.
Optionally, before inputting the text combinations into the text classification model, the electronic device may, for each text combination, replace a specified text entity in the combination with a preset text entity corresponding to its type according to a preset rule. In other words, the electronic device may perform word conversion, such as replacing integers with SINGLENUM or DOUBLENUM tokens according to their digit count.
Then, the electronic device may perform word segmentation processing on the text entity in the replaced text combination to obtain a plurality of segmented words, perform word embedding processing on each segmented word to obtain a word vector of each segmented word, and determine a text vector of the text combination based on the word vector of each segmented word.
Then, the electronic equipment inputs the text vectors corresponding to the text combinations into a preset text classification model, so that the text classification model classifies the text combinations based on the text vectors corresponding to the text combinations, and semantic types and probabilities thereof, which are output by the text classification model, of the text combinations are obtained.
Finally, the electronic equipment selects the text combination with highest probability and the semantic type of the text combination as a text recognition result.
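The selection in step 104 reduces to an arg-max over all (combination, semantic type) probabilities. In this sketch, `classify` is a stand-in for the trained text classification model: any callable returning a type-to-probability mapping will do, so the names and scores below are purely illustrative.

```python
def recognize(combinations, classify):
    # Score every candidate combination with the classifier and keep the
    # single (combination, semantic type, probability) triple whose
    # probability is highest; this is the text recognition result.
    best = None
    for combo in combinations:
        for sem_type, prob in classify(combo).items():
            if best is None or prob > best[2]:
                best = (combo, sem_type, prob)
    return best
```

With a toy classifier that scores "A8@200" at 0.95, `recognize` returns that combination and its type as the recognition result.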
2. Classification model training
The step of classification model training is shown in fig. 9.
Step one, data collection
First, annotation text is exported from a large number of CAD building structure drawings, meaningful text data is screened out through manual inspection and labeling, and every piece of text data is labeled with a semantic type.
For example, text combination samples and their text labels are shown in Table 1.

Text combination sample            Text label
A8@200                             3
GBZ4                               0
12B14                              2
A6@200 (not shown in the figure)   3
75.600~81.300                      1

TABLE 1
The semantic types need to be defined by the user, and each semantic type is represented by a unique integer. A dictionary data structure may be used to establish the correspondence between integers and semantic types, for example {0: component elevation, 1: component name, 2: component bar specification, 3: number of component bars, ...}.
Step two, data conversion
Because the annotation text in building structure drawings follows certain expression patterns, it can be appropriately converted to extract important features, which facilitates subsequent training.
For example, annotation text of the elevation type often contains integers, decimal numbers, or Chinese numerals. An integer can be replaced with SINGLENUM, DOUBLENUM, TRIPLENUM, or QUADRUPLENUM according to its number of digits, a decimal number with DECIMAL, and a Chinese numeral (two, three, five, six, seven, ninety, hundred, thousand, and so on) followed by "layer" (floor) with CHINESENUM.
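The conversion rule above can be sketched in Python. The token names follow the text; the regex and the MULTINUM fallback for integers longer than four digits are assumptions added for illustration.

```python
import re

# Sketch of the word-conversion step: integers are replaced by a placeholder
# token chosen by digit count, decimals by DECIMAL. Token names follow the
# description above; MULTINUM is an assumed fallback not in the original.

DIGIT_TOKENS = {1: "SINGLENUM", 2: "DOUBLENUM", 3: "TRIPLENUM", 4: "QUADRUPLENUM"}

def convert(text):
    def repl(match):
        token = match.group(0)
        if "." in token:
            return "DECIMAL"
        return DIGIT_TOKENS.get(len(token), "MULTINUM")
    # Match decimals first so "75.600" is not split into two integers.
    return re.sub(r"\d+\.\d+|\d+", repl, text)

print(convert("75.600~81.300"))   # DECIMAL~DECIMAL
print(convert("A8@200"))          # ASINGLENUM@TRIPLENUM
```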
Step three, word segmentation and word embedding
The text in each text combination is segmented using jieba (a word segmentation tool), and word embedding is performed on the segmented text with a word2vec word embedding model to obtain a vector representation of each word.
When using jieba for word segmentation, a lexicon of common building terms should be loaded to improve the segmentation results. Common professional building terms, such as "first-level seismic grade" and "reinforcement ratio", need to be added to the segmentation lexicon separately so that they remain whole words in the segmentation result.
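As a library-free illustration of why the custom lexicon matters (in practice jieba loads it with `jieba.load_userdict`), a greedy longest-match segmenter over a small building-term lexicon keeps multi-character professional terms whole:

```python
# Illustration (not jieba itself) of lexicon-driven segmentation: greedy
# longest match over a fixed lexicon. With multi-character building terms
# like "一级抗震等级" in the lexicon, they survive as single tokens.

def segment(text, lexicon):
    """Greedy longest-match word segmentation over a fixed lexicon."""
    longest = max(map(len, lexicon))
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + longest), i, -1):
            # Fall back to a single character when no lexicon entry matches.
            if text[i:j] in lexicon or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

lex = {"一级抗震等级", "配筋率", "为"}
print(segment("配筋率为一级抗震等级", lex))
# ['配筋率', '为', '一级抗震等级']
```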
When setting the parameters of the word embedding model, the characteristics of the annotation text should be considered and a suitable window size chosen; the average number of words in a segmented annotation text is generally used. The word embedding dimension can be kept within 100 to ensure computational efficiency.
After a text is segmented, the word embedding model is applied to each word to obtain its word vector. The vectors of all words are then summed and normalized to obtain the vectorized representation of the text combination sample, i.e., its feature vector.
Because summing and normalizing the vectors ignores the number of words in the text, the word count is appended to the feature vector as an additional feature. This enlarges the feature vector by one dimension and preserves the length information of the text.
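The feature-vector construction (sum, normalize, append word count) can be sketched as follows. The word vectors here are hand-made stand-ins for word2vec output, and the 2-dimensional toy size is for illustration only.

```python
import math

# Sketch of the feature-vector construction described above: sum the word
# vectors of a text combination, normalize the sum, then append the word
# count as one extra dimension so length information is preserved.

def text_feature(word_vectors):
    dim = len(word_vectors[0])
    summed = [sum(vec[k] for vec in word_vectors) for k in range(dim)]
    norm = math.sqrt(sum(x * x for x in summed)) or 1.0
    normalized = [x / norm for x in summed]
    return normalized + [len(word_vectors)]  # extra dimension: word count

# Toy 2-D "word vectors" standing in for word2vec output.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(text_feature(vecs))
```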
Step four, model training and evaluation
Vector representations of all text combination samples and their corresponding semantic type labels are obtained through the foregoing steps. These are divided into a training set and a test set at a ratio of 8:2; a random forest machine learning algorithm is trained on the training set, and its performance is tested on the test set.
When training the random forest classification model, cross-validation is used to evaluate the model.
When tuning the model parameters, a grid search algorithm is used. The principle is to first enumerate all possible combinations of model parameters, then train a model for each combination and compare the results. Finally, the parameter combination with the best accuracy and generalization is selected.
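The grid search principle described above can be sketched as follows. The scoring function is a toy stand-in for cross-validated random forest accuracy, and the parameter names are illustrative assumptions.

```python
import itertools

# Sketch of grid search: enumerate every parameter combination, evaluate a
# model for each, and keep the best. `evaluate` is a toy stand-in for
# cross-validated accuracy of a random forest classifier.

param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}

def evaluate(params):
    """Toy score; in practice this is cross-validated accuracy."""
    return params["n_estimators"] * 0.001 + params["max_depth"] * 0.01

def grid_search(grid, score):
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

print(grid_search(param_grid, evaluate))
```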
Through the steps, a model for classifying the labeling text in the CAD building structure drawing can be obtained.
As can be seen from the above description, the electronic device can obtain the text primitives in a drawing imported into the device and generate a plurality of text entities. The electronic device can then perform density clustering and text arrangement recognition on the text entities to generate a plurality of text groups, generate all text combinations using the sliding window method, classify the text combinations with the trained text classification model, and select the appropriate text combination according to probability as the text recognition result.
On the one hand, the electronic device can recognize the text in a drawing automatically, which greatly improves text recognition efficiency and enables batch recognition of drawing text.
On the other hand, in the present application the electronic device classifies text combinations with a machine-trained model, so the classification results are more accurate, and there is no need to manually write a regular expression for each semantic type, which saves labor cost.
Referring to fig. 10, fig. 10 is a block diagram of a text recognition device according to an exemplary embodiment of the present application.
The apparatus may include:
a conversion unit 1001, configured to convert all text primitives in a target graphic into a plurality of text entities;
an aggregation unit 1002, configured to aggregate a plurality of text entities based on the distance relationship between the text entities and the text arrangement of the text entities, to obtain a plurality of text groups; wherein each text group includes at least one text entity;
a combining unit 1003, configured to slide, for each text group, along text entities in the text group by using a preset sliding window, to obtain a plurality of text combinations;
the classifying unit 1004 is configured to input a plurality of text combinations into a preset text classification model, obtain a semantic type and a probability thereof expressed by each text combination output by the text classification model, and select a text combination with the highest probability and a semantic type thereof as a text recognition result.
In a possible implementation manner, the conversion unit 1001 is specifically configured to identify, for each text primitive, a character and a graphic if the text primitive is composed of the character and the graphic, so as to obtain a text entity; and/or if the text graphic primitive is composed of a plurality of characters and the text graphic primitive comprises preset characters, splitting the characters based on the preset characters to obtain a plurality of text entities; each text entity includes at least one character.
In a possible implementation manner, the aggregation unit 1002 is specifically configured to determine a two-dimensional bounding box and a text direction of each text entity; according to the two-dimensional bounding box and the text direction of each text entity, aggregating a plurality of text entities into a plurality of text sets; each text set comprises at least one text entity, and the two-dimensional bounding boxes of the text entities in each text set are overlapped and have the same text direction; for each text set, projecting all text entities in the text set in a preset direction, and dividing the text set into at least one text subset based on the projection of each text entity; each text subset comprises at least one text entity, and projections of the text entities contained in each text subset in a preset direction are overlapped; and sequencing the text entities in each text subset according to each text subset to generate a text group corresponding to the text subset.
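The projection-and-split step can be illustrated with 1-D intervals: each entity's bounding box projects onto the preset direction as an interval, and entities whose intervals overlap (transitively) fall into the same subset. A minimal sketch, with assumed interval inputs:

```python
# Sketch of splitting a text set into subsets by 1-D projection overlap.
# Each entity's bounding box projects onto the preset direction as an
# interval (lo, hi); entities whose intervals overlap (transitively)
# are grouped into the same subset.

def split_by_projection(intervals):
    """intervals: list of (lo, hi); returns groups of indices whose
    projections overlap."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    groups, current, cur_hi = [], [], None
    for i in order:
        lo, hi = intervals[i]
        if current and lo <= cur_hi:
            current.append(i)
            cur_hi = max(cur_hi, hi)
        else:
            if current:
                groups.append(current)
            current, cur_hi = [i], hi
    if current:
        groups.append(current)
    return groups

print(split_by_projection([(0, 2), (1, 3), (10, 12)]))
# [[0, 1], [2]]
```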
In a possible implementation manner, when the aggregation unit 1002 sorts the text entities in the text subset to generate a text group corresponding to the text subset, the aggregation unit is configured to establish a local coordinate system of the text subset with a text direction of the text entities of the text subset as an X axis and a direction in which the text direction is rotated clockwise by 90 ° as a Y axis; the text entities in the subset of text are ordered based on their coordinates in the local coordinate system.
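The local-coordinate ordering can be sketched as follows, assuming each entity carries a representative world position (an assumption for illustration): the Y axis is the text direction rotated 90° clockwise, and entities are sorted by local Y then local X.

```python
# Sketch of ordering text entities in a subset: build a local frame whose
# X axis is the text direction and whose Y axis is that direction rotated
# 90 degrees clockwise, then sort entities by their local coordinates.

def local_coords(point, origin, text_dir):
    """Project a world point into the subset's local (X, Y) frame."""
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    ux, uy = text_dir            # unit vector along the text direction
    vx, vy = uy, -ux             # text direction rotated 90 deg clockwise
    return (dx * ux + dy * uy, dx * vx + dy * vy)

def order_entities(entities, origin, text_dir):
    """Sort by local Y (line), then local X (position within the line)."""
    key = lambda e: local_coords(e["pos"], origin, text_dir)[::-1]
    return sorted(entities, key=key)

ents = [{"name": "B", "pos": (5.0, 0.0)}, {"name": "A", "pos": (1.0, 0.0)}]
print([e["name"] for e in order_entities(ents, (0.0, 0.0), (1.0, 0.0))])
# ['A', 'B']
```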
In a possible implementation manner, the classifying unit 1004 is specifically configured to replace, for each text combination, a specified text entity in the text combination with a preset text entity corresponding to the specified text entity type according to a preset rule; word segmentation processing is carried out on the text entities in the replaced text combination to obtain a plurality of segmented words, word embedding processing is carried out on each segmented word to obtain a word vector of each segmented word, and the text vector of the text combination is determined based on the word vector of each segmented word; and inputting the text vectors corresponding to the text combinations into a preset text classification model, classifying the text combinations based on the text vectors corresponding to the text combinations by the text classification model, and outputting semantic types and probabilities of the text combinations.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not intended to limit its scope; any modifications, equivalents, improvements, and the like that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (8)

1. A method of text recognition, comprising:
converting all text graphic primitives in the target graphic data into a plurality of text entities;
based on the distance relationship among the text entities and the text arrangement of the text entities, aggregating a plurality of text entities to obtain a plurality of text groups; wherein each text group includes at least one text entity;
aiming at each text group, sliding along text entities in the text group by utilizing a preset sliding window to obtain a plurality of text combinations;
respectively inputting a plurality of text combinations into a preset text classification model, acquiring semantic types and probabilities thereof expressed by the text combinations output by the text classification model, and selecting the text combination with the highest probability and the semantic type thereof as a text recognition result;
the aggregation of the text entities based on the distance relation among the text entities and the text arrangement of the text entities to obtain a plurality of text groups comprises the following steps:
determining a two-dimensional bounding box and a text direction of each text entity;
according to the two-dimensional bounding box and the text direction of each text entity, aggregating a plurality of text entities into a plurality of text sets; each text set comprises at least one text entity, and the two-dimensional bounding boxes of the text entities in each text set are overlapped and have the same text direction;
for each text set, projecting all text entities in the text set in a preset direction, and dividing the text set into at least one text subset based on the projection of each text entity; each text subset comprises at least one text entity, and projections of the text entities contained in each text subset in a preset direction are overlapped;
and sequencing the text entities in each text subset according to each text subset to generate a text group corresponding to the text subset.
2. The method of claim 1, wherein converting all text primitives to a plurality of text entities comprises:
for each text graphic primitive, if the text graphic primitive consists of characters and graphics, identifying the characters and the graphics to obtain a text entity;
and/or the number of the groups of groups,
if the text graphic primitive is composed of a plurality of characters and comprises preset characters, splitting the characters based on the preset characters to obtain a plurality of text entities; each text entity includes at least one character.
3. The method of claim 1, wherein the sorting the text entities in the subset of text to generate the text group corresponding to the subset of text comprises:
establishing a local coordinate system of the text subset by taking the text direction of the text entity of the text subset as an X axis and taking the direction of clockwise rotation of the text direction by 90 degrees as a Y axis;
the text entities in the subset of text are ordered based on their coordinates in the local coordinate system.
4. The method according to claim 1, wherein the inputting the plurality of text combinations into the preset text classification model and obtaining the semantic type and the probability thereof expressed by each text combination output by the text classification model respectively includes:
for each text combination, replacing a designated text entity in the text combination with a preset text entity corresponding to the designated text entity type according to a preset rule;
word segmentation processing is carried out on the text entities in the replaced text combination to obtain a plurality of segmented words, word embedding processing is carried out on each segmented word to obtain a word vector of each segmented word, and the text vector of the text combination is determined based on the word vector of each segmented word;
and inputting the text vectors corresponding to the text combinations into a preset text classification model, classifying the text combinations based on the text vectors corresponding to the text combinations by the text classification model, and outputting semantic types and probabilities of the text combinations.
5. A text recognition device, comprising:
the conversion unit is used for converting all text primitives in the target graph into a plurality of text entities;
the aggregation unit is used for aggregating a plurality of text entities based on the distance relationship among the text entities and the text arrangement of the text entities to obtain a plurality of text groups; wherein each text group includes at least one text entity;
the combination unit is used for sliding along text entities in each text group by utilizing a preset sliding window to obtain a plurality of text combinations;
the classifying unit is used for respectively inputting a plurality of text combinations into a preset text classifying model, acquiring semantic types and probabilities thereof expressed by the text combinations output by the text classifying model, and selecting the text combination with the highest probability and the semantic type thereof as a text recognition result;
the aggregation unit is specifically used for determining a two-dimensional bounding box and a text direction of each text entity; according to the two-dimensional bounding box and the text direction of each text entity, aggregating a plurality of text entities into a plurality of text sets; each text set comprises at least one text entity, and the two-dimensional bounding boxes of the text entities in each text set are overlapped and have the same text direction; for each text set, projecting all text entities in the text set in a preset direction, and dividing the text set into at least one text subset based on the projection of each text entity; each text subset comprises at least one text entity, and projections of the text entities contained in each text subset in a preset direction are overlapped; and sequencing the text entities in each text subset according to each text subset to generate a text group corresponding to the text subset.
6. The apparatus according to claim 5, wherein the converting unit is specifically configured to identify, for each text primitive, a character and a graphic if the text primitive is composed of the character and the graphic, and obtain a text entity; and/or if the text graphic primitive is composed of a plurality of characters and the text graphic primitive comprises preset characters, splitting the characters based on the preset characters to obtain a plurality of text entities; each text entity includes at least one character.
7. The apparatus according to claim 5, wherein the aggregation unit is configured to, when sorting the text entities in the text subset to generate the text groups corresponding to the text subset, establish a local coordinate system of the text subset with a text direction of the text entities of the text subset as an X-axis and a direction in which the text direction is rotated clockwise by 90 ° as a Y-axis; the text entities in the subset of text are ordered based on their coordinates in the local coordinate system.
8. The apparatus of claim 5, wherein the classifying unit is specifically configured to replace, for each text combination, a specified text entity in the text combination with a preset text entity corresponding to the specified text entity type according to a preset rule; word segmentation processing is carried out on the text entities in the replaced text combination to obtain a plurality of segmented words, word embedding processing is carried out on each segmented word to obtain a word vector of each segmented word, and the text vector of the text combination is determined based on the word vector of each segmented word; and inputting the text vectors corresponding to the text combinations into a preset text classification model, classifying the text combinations based on the text vectors corresponding to the text combinations by the text classification model, and outputting semantic types and probabilities of the text combinations.
CN201911315736.7A 2019-12-19 2019-12-19 Text recognition method and device Active CN111144114B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911315736.7A CN111144114B (en) 2019-12-19 2019-12-19 Text recognition method and device


Publications (2)

Publication Number Publication Date
CN111144114A CN111144114A (en) 2020-05-12
CN111144114B true CN111144114B (en) 2023-07-18

Family

ID=70518878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911315736.7A Active CN111144114B (en) 2019-12-19 2019-12-19 Text recognition method and device

Country Status (1)

Country Link
CN (1) CN111144114B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192195B (en) * 2021-04-27 2022-05-17 长江勘测规划设计研究有限责任公司 Method for repairing damaged terrain coordinate data

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5841900A (en) * 1996-01-11 1998-11-24 Xerox Corporation Method for graph-based table recognition
US5845288A (en) * 1995-12-11 1998-12-01 Xerox Corporation Automated system for indexing graphical documents having associated text labels
CN102609687A (en) * 2012-01-31 2012-07-25 华中科技大学 Subway construction drawing and engineering parameter automatic identification method
CN103400127A (en) * 2013-08-05 2013-11-20 苏州鼎富软件科技有限公司 Picture and text identifying method
CN106910501A (en) * 2017-02-27 2017-06-30 腾讯科技(深圳)有限公司 Text entities extracting method and device
CN107315817A (en) * 2017-06-30 2017-11-03 华自科技股份有限公司 Electronic drawing text matching technique, device, storage medium and computer equipment
CN109255041A (en) * 2018-08-22 2019-01-22 国网山西省电力公司 A kind of intelligent identification Method of electric installation drawing
CN109389124A (en) * 2018-10-29 2019-02-26 苏州派维斯信息科技有限公司 Receipt categories of information recognition methods
CN109446885A (en) * 2018-09-07 2019-03-08 广州算易软件科技有限公司 A kind of text based Identify chip method, system, device and storage medium
CN110209630A (en) * 2019-04-25 2019-09-06 广东联城住工装备信息科技有限公司 DXF file information processing method, device, computer equipment and storage medium
CN110222695A (en) * 2019-06-19 2019-09-10 拉扎斯网络科技(上海)有限公司 A kind of certificate image processing method and device, medium, electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605970A (en) * 2013-12-02 2014-02-26 华中师范大学 Drawing architectural element identification method and system based on machine learning
RU2619193C1 (en) * 2016-06-17 2017-05-12 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Multi stage recognition of the represent essentials in texts on the natural language on the basis of morphological and semantic signs
CN106951636A (en) * 2017-03-20 2017-07-14 国网福建省电力有限公司 A kind of power communication static resource automatic identifying method based on AutoCAD
RU2672395C1 (en) * 2017-09-29 2018-11-14 Акционерное общество "Лаборатория Касперского" Method for training a classifier designed for determining the category of a document




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant