CN117935285A - Text merging method, text recognition device, electronic equipment and storage medium

Text merging method, text recognition device, electronic equipment and storage medium

Info

Publication number
CN117935285A
CN117935285A (application CN202311868838.8A)
Authority
CN
China
Prior art keywords: text, features, text line, image, line
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311868838.8A
Other languages
Chinese (zh)
Inventor
高宏宇
张建树
刘丹
魏思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202311868838.8A
Publication of CN117935285A
Legal status: Pending

Landscapes

  • Character Discrimination (AREA)

Abstract

The application discloses a text merging method, a text recognition device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring at least two pieces of text box information in an image to be identified, wherein each piece of text box information comprises a line of text content; acquiring the text line feature corresponding to each piece of text box information to obtain at least two text line features; and merging the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified. According to this scheme, the text line contents in the text boxes of the image can be merged quickly to obtain text content with continuous semantic information.

Description

Text merging method, text recognition device, electronic equipment and storage medium
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a text merging method, a text recognition device, an electronic device, and a storage medium.
Background
Text detection, text recognition, and text block grouping are techniques for merging the text boxes in an input image that carry continuous semantic information.
When existing methods detect the text boxes in an image, the output text boxes often break the semantic continuity of the text, which brings great inconvenience to practical use; moreover, the subsequent merging of the text boxes is slow.
Disclosure of Invention
The text merging method, the text recognition device, the electronic device and the storage medium provided by the application can quickly merge the text line contents in the text boxes of an image to obtain text content with continuous semantic information.
In order to solve the above technical problem, a first aspect of the present application provides a text merging method, which includes: acquiring at least two pieces of text box information in an image to be identified, wherein each piece of text box information includes a line of text content; acquiring the text line feature corresponding to each piece of text box information to obtain at least two text line features; and merging the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified.
In order to solve the above technical problem, a second aspect of the present application provides a text recognition device, including: a first acquisition module, configured to acquire at least two pieces of text box information in an image to be identified, wherein each piece of text box information includes a line of text content; a second acquisition module, configured to acquire the text line features corresponding to each piece of text box information to obtain at least two text line features; and a merging module, configured to merge the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, which includes a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the method provided in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer readable storage medium storing program instructions executable by a processor for implementing the method provided in the first aspect.
According to the text merging method, the text recognition device, the electronic device and the storage medium, through parallel similarity computation between text line features, the text line contents in the text boxes of an image can be merged quickly to obtain text content with continuous semantic information. The text lines do not need to be sorted before merging, which avoids merge errors caused by sorting errors and improves merging accuracy.
Drawings
FIG. 1 is a flow chart of an embodiment of a text merging method of the present application;
FIGS. 2 and 3 are schematic views of a text recognition application scenario of the present application;
FIG. 4 is a flow chart of another embodiment of the text merging method of the present application;
FIG. 5 is a flow chart of step 43 according to an embodiment of the present application;
FIG. 6 is a flow chart of another embodiment of the text merging method of the present application;
FIGS. 7 and 8 are schematic diagrams of an application scenario of the text merging method of the present application;
FIG. 9 is a schematic diagram of a framework of an embodiment of a text recognition device of the present application;
FIG. 10 is a schematic diagram of a frame of an embodiment of an electronic device of the present application;
FIG. 11 is a schematic diagram of a framework of an embodiment of a computer readable storage medium of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship. Further, "a plurality" herein means two or more.
Text detection, text recognition, and text block grouping are techniques for merging the text boxes in an input picture that carry continuous semantic information.
When existing methods detect the text boxes in a picture, the output text boxes often break the semantic continuity of the text, which brings great inconvenience to practical use; moreover, the subsequent merging of the text boxes is slow.
In the related art, the text information in an image is detected and recognized by OCR technology; the text boxes are then used as inputs to a merging model, which typically includes an Encoder for encoding text line features and a Decoder for decoding the relations between text boxes. In general, the number of text boxes in an image is large, and directly using the Decoder to predict the relation between every pair of text boxes has high time complexity. The related method therefore sorts the text boxes first and then predicts their relations, with the following main steps:
1. Acquire the features of each text box through the Encoder;
2. Sort the text boxes in human reading order through the Decoder;
3. Use the Decoder to predict, for each text box, its relation to the previous text box;
4. Finally, merge the text boxes according to the relations between them.
The inventors have found that the above scheme first sorts the text boxes and then merges them according to the predicted relation between each text box and its previous box. Because this method usually requires an extra Decoder model, model inference becomes slow when the number of text boxes in the picture is large; meanwhile, because natural scenes are complex, the sorting step is error-prone, which in turn affects the result of text box merging.
Based on this, the present application provides a method, a device, and a system in which, through parallel similarity computation between text line features, the text line contents in the text boxes of an image can be merged quickly to obtain text content with continuous semantic information. The text lines do not need to be sorted before merging, which avoids merge errors caused by sorting errors and improves merging accuracy. See any of the embodiments below for details.
Referring to fig. 1, fig. 1 is a flow chart illustrating an embodiment of a text merging method according to the present application.
Specifically, the method may include the steps of:
Step 11: acquiring at least two pieces of text box information in an image to be identified; wherein each piece of text box information includes a line of text content.
In this embodiment, the image to be identified is required to contain at least two lines of text content, so that the text contents of different lines can subsequently be merged.
In some embodiments, the content of each text line in the image to be identified is obtained using optical character recognition (OCR) technology. Semantic extraction is then performed on the content of each text line using a multilingual language model to obtain semantic features.
The description is given in conjunction with fig. 2 and 3:
The text content in fig. 2 can be recognized line by line using optical character recognition, resulting in the text line contents shown in fig. 3. Further, semantic extraction can be performed on the content of each text line using a multilingual language model to obtain semantic features. With the multilingual language model, the content of each text line is turned into a fixed-length column vector, which is convenient for subsequent model processing. It can be understood that optical character recognition alone can only locate, extract, and recognize the characters in an image and cannot capture their semantics; a multilingual language model is therefore combined with it to extract the semantic features of each text line content, which facilitates the subsequent merging of text line contents.
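As a non-limiting sketch of this step, the text line contents may be embedded as follows; the choice of the XLM-R model from the HuggingFace Transformers library and of mean pooling over the last hidden layer are illustrative assumptions, not fixed by the present application:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed_text_lines(lines: list[str]) -> torch.Tensor:
    """Map each text line to a fixed-length vector (mean-pooled hidden states)."""
    batch = tokenizer(lines, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (N, T, D)
    mask = batch["attention_mask"].unsqueeze(-1)         # (N, T, 1): ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (N, D) semantic features
```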
Step 12: acquiring the text line features corresponding to each piece of text box information to obtain at least two text line features.
In some embodiments, an image recognition network may be used to perform image recognition on the image to be identified to obtain the image features corresponding to the image, which are then combined with the text line semantic features mentioned above to obtain the text line features.
In other embodiments, the image to be identified may be convolved with a convolutional neural network to extract a feature map of the image; the position information of each text line content identified by optical character recognition is then used to take the features at the corresponding positions as the image features of that text line content, and the image features of each text line content are combined with the above-mentioned text line semantic features to obtain the text line features.
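As an illustrative sketch of extracting the "features at the corresponding positions," the line image features may be pooled from the feature map with an ROI operation over the OCR box coordinates; the ResNet-18 backbone and torchvision's roi_align are assumptions, since the embodiment only requires a convolutional neural network:

```python
import torch
from torchvision.models import resnet18
from torchvision.ops import roi_align

# Drop the classifier head; what remains outputs a stride-32 feature map.
backbone = torch.nn.Sequential(*list(resnet18(weights=None).children())[:-2])

def line_image_features(image: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W); boxes: (N, 4) pixel coords (x1, y1, x2, y2) -> (N, C)."""
    fmap = backbone(image)                                                # (1, C, H/32, W/32)
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes.float()], dim=1)  # prepend batch index 0
    pooled = roi_align(fmap, rois, output_size=(1, 1), spatial_scale=1 / 32)
    return pooled.flatten(1)  # one image feature per text line
```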
Step 13: merging the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified.
After the text line contents in the image to be identified and the text line feature corresponding to each text line content are obtained, the corresponding text line contents are merged according to the text line features to obtain text content with continuous semantic information in the image to be identified.
In this embodiment, computing the similarity between text line features increases the overall running speed: the similarity is computed as dot products between text line feature vectors, which is a parallel process, so it is more efficient than sequentially predicting the relation between adjacent text boxes as mentioned above.
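The parallelism of this step can be made concrete with a minimal sketch: a single matrix product scores every pair of text lines at once (the cosine normalization below is an illustrative assumption; this embodiment only specifies dot products between features):

```python
import torch
import torch.nn.functional as F

def similarity_matrix(line_feats: torch.Tensor) -> torch.Tensor:
    """line_feats: (N, D) text line features -> (N, N) pairwise similarity."""
    f = F.normalize(line_feats, dim=-1)  # unit-normalize so dot products are cosine scores
    return f @ f.t()                     # all N*N dot products in one parallel pass
```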
In this embodiment, through parallel similarity computation between text line features, the text line contents in the text boxes of an image can be merged quickly to obtain text content with continuous semantic information. The text lines do not need to be sorted before merging, which avoids merge errors caused by sorting errors and improves merging accuracy.
Referring to fig. 4, fig. 4 is a flowchart illustrating another embodiment of the text merging method according to the present application.
Specifically, the method may include the steps of:
Step 41: acquiring at least two pieces of text box information in an image to be identified; wherein each piece of text box information includes a line of text content.
Step 42: extracting image features of the image to be identified using a convolutional neural network.
In some embodiments, the convolutional neural network may be constructed based on a CNN base model.
In some embodiments, the convolutional neural network may be constructed based on ResNet.
The convolutional neural network can extract all image features in the image to be identified.
Step 43: splicing the image features with the at least two pieces of text box information to obtain fusion features.
In some embodiments, the text box information carries the line position information and the semantic features of the text content. Thus, referring to fig. 5, step 43 may include the following process:
Step 431: extracting the corresponding line image features from the image features according to each piece of line position information.
In some embodiments, since the image features include all features of the image to be identified, they cover not only the features corresponding to the text line contents but also the features of the remaining non-text content. Based on this, the line position information is used to extract the line image features of the text line contents from the image features.
Step 432: splicing the line image features with the semantic features to obtain the fusion features.
In this embodiment, the splicing may be performed by adding the features; in other embodiments, the splicing may take the form of feature multiplication.
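A minimal sketch of the two splicing variants mentioned above (the shapes are assumed to match, e.g. after a linear projection):

```python
import torch

def fuse(line_img_feats: torch.Tensor, sem_feats: torch.Tensor, mode: str = "add") -> torch.Tensor:
    """Splice (N, D) line image features with (N, D) semantic features."""
    if mode == "add":
        return line_img_feats + sem_feats  # element-wise addition (this embodiment)
    return line_img_feats * sem_feats      # element-wise multiplication (other embodiments)
```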
Step 44: acquiring text line features from the fusion features to obtain at least two text line features.
In some embodiments, the fusion features may be input to a text line merging model, and the text line features are obtained from the fusion features by the text line merging model, yielding at least two text line features. The text line merging model is built on a transformer network. The text line features at this point carry the corresponding semantic information.
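As an illustrative stand-in for the text line merging model, a standard transformer encoder can be run over the per-line fusion features; the layer sizes below are assumptions, as the application only states that the model is built on a transformer network:

```python
import torch

# Hypothetical sizes: 768-dim features, 8 attention heads, 4 layers.
merge_model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=4,
)

def text_line_features(fused: torch.Tensor) -> torch.Tensor:
    """fused: (N, D) fusion features -> (N, D) text line features."""
    # Treat the N lines as one sequence so every line attends to every other line.
    return merge_model(fused.unsqueeze(0)).squeeze(0)
```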
Step 45: merging the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified.
In this embodiment, through parallel similarity computation between text line features, the text line contents in the text boxes of an image can be merged quickly to obtain text content with continuous semantic information. The text lines do not need to be sorted before merging, which avoids merge errors caused by sorting errors and improves merging accuracy.
Further, the fusion features carry not only the image features of the text lines but also their semantic features, so the subsequent text line features contain both image information and semantic information, which facilitates the subsequent similarity computation and improves its accuracy.
Referring to fig. 6, fig. 6 is a flowchart illustrating another embodiment of the text merging method according to the present application.
Specifically, the method may include the steps of:
Step 61: acquiring at least two pieces of text box information in an image to be identified; wherein each piece of text box information includes a line of text content.
Step 62: acquiring the text line features corresponding to each piece of text box information to obtain at least two text line features.
Steps 61 to 62 use the same or similar technical means as any of the above embodiments of the present application and are not described here again.
Step 63: obtaining a similarity matrix between the at least two text line features.
It can be understood that obtaining the similarity matrix is a parallel computing process, which speeds up the overall merging.
Step 64: performing non-maximum suppression on the similarity matrix to obtain the similarity between the at least two text line features.
In this embodiment, the image to be identified is expected to contain several lines of text content, and it can easily happen that the same text line has a high similarity to multiple text lines from different text blocks. Based on this, non-maximum suppression is performed on the similarity matrix to ensure that each text line belongs to a single text block, where a text block corresponds to a sentence or a piece of text content with continuous semantic information.
Step 65: merging the corresponding text line contents according to the similarity to obtain text content with continuous semantic information in the image to be identified.
In this embodiment, through parallel similarity computation between text line features, the text line contents in the text boxes of an image can be merged quickly to obtain text content with continuous semantic information. The text lines do not need to be sorted before merging, which avoids merge errors caused by sorting errors and improves merging accuracy.
Further, non-maximum suppression is performed on the similarity matrix to prevent the same text line from being assigned to multiple text blocks and to improve the accuracy of text line merging.
In an application scenario, referring to fig. 7 and 8, the text merging method provided by the present application may proceed as follows: the image to be detected is first input into a multilingual OCR engine to obtain the coordinates of each text line in the image; for example, the text line coordinates may include the upper-left corner coordinates (x1, y1) and the lower-right corner coordinates (x2, y2) of the text line. The text content in each text line is recognized at the same time.
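For concreteness, the per-line output of the OCR engine can be pictured as the following container; this is a non-limiting sketch and the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TextLineBox:
    x1: float  # upper-left corner x
    y1: float  # upper-left corner y
    x2: float  # lower-right corner x
    y2: float  # lower-right corner y
    content: str  # recognized text of the line
```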
After the text contents of the text lines are obtained, a pre-trained language model is used to extract the semantic information (semantic features) of each text line content. Considering that the language model must understand semantic information in multiple languages in multilingual scenarios, the method proposed by the present invention uses a language model that supports multiple languages (such as XLM-R or mBART). With the multilingual language model, the content of each text line is turned into a fixed-length column vector, which is convenient for subsequent model processing.
Further, a feature map of the input image is extracted with a convolutional neural network, and the features at the corresponding positions are then extracted from the feature map, according to the extracted text line position information, as the image features of the text lines. The image feature of each text line is added to the semantic feature extracted from its text content, and the sum serves as the input of the text line merging model of the next stage.
The text line merging model consists of a transformer, whose main purpose is to fuse the image features and semantic features of each text line. It can be understood that, to ensure the merging effect of the text line merging model during training, the model parameters of the text line merging model need to be initialized with a layout pre-training model (e.g., GraphDoc).
The output of the text line merging model is then used to directly compute, in an attention-like manner, the similarity matrix between each text line and all other text lines. In some embodiments, during training, the similarity between text lines with continuous semantics can be increased, and the similarity between text lines without continuous semantics decreased, by minimizing the cross-entropy loss between the similarity matrix and the input labels.
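A sketch of this training objective under stated assumptions: scaled dot-product (attention-style) scores between line features are pushed toward a 0/1 "same text block" label matrix; the binary form of the cross-entropy is an assumption beyond what is stated above:

```python
import torch
import torch.nn.functional as F

def merge_loss(line_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """line_feats: (N, D); labels: (N, N), 1 where two lines share continuous semantics."""
    scores = line_feats @ line_feats.t() / line_feats.size(-1) ** 0.5  # attention-style scores
    # Minimizing this pulls same-block similarities up and cross-block similarities down.
    return F.binary_cross_entropy_with_logits(scores, labels.float())
```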
Further, because of training errors, the same text line may have a high similarity to multiple text lines from different text blocks; therefore, non-maximum suppression needs to be applied to the similarity matrix between text lines to ensure that each text line belongs to a single text block.
Specifically, non-maximum suppression may be performed as follows:
In the similarity matrix, each row represents the similarity of the corresponding text line feature to all text line features. For each row, the indices of the elements whose similarity is greater than 0 are collected, and the average of those similarities is computed. This average similarity is taken as the similarity of the corresponding row.
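This row-averaging step can be transcribed directly (shapes assumed):

```python
import torch

def row_confidence(sim: torch.Tensor) -> torch.Tensor:
    """sim: (N, N) similarity matrix -> (N,) average of the positive entries per row."""
    positive_sum = sim.clamp(min=0).sum(dim=1)
    positive_cnt = (sim > 0).sum(dim=1).clamp(min=1)  # avoid division by zero
    return positive_sum / positive_cnt
```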
If a first target row and a second target row are candidates to be merged, the larger of their similarities is selected as the confidence of that candidate merge. That is, each row has a merge confidence with each of the other rows.
These merge confidences are then ranked from high to low and traversed in that order; any candidate merge corresponding to a lower confidence that overlaps a candidate merge corresponding to a higher confidence is discarded.
For example, suppose confidence A corresponds to a candidate merge of the first, second, and third lines, while confidence B corresponds to a candidate merge of the first and third lines. Since confidence A is greater than confidence B, and the text lines of confidence B partially overlap those of confidence A, the candidate merge of confidence B is discarded; in the subsequent actual merging, the text lines are merged only according to the candidate merge of confidence A. This operation prevents the same text lines from appearing in multiple merges and thus yields a more accurate merging result.
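A minimal sketch of this suppression pass, assuming each candidate merge is represented as a (confidence, set of line indices) pair:

```python
def suppress_merges(candidates: list[tuple[float, set[int]]]) -> list[set[int]]:
    """Keep high-confidence candidate merges; drop lower ones that overlap them."""
    kept: list[set[int]] = []
    for _, lines in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(not (lines & k) for k in kept):  # overlaps nothing already kept
            kept.append(lines)
    return kept

# E.g. confidence A = 0.9 over lines {1, 2, 3}, confidence B = 0.7 over {1, 3}:
# suppress_merges([(0.9, {1, 2, 3}), (0.7, {1, 3})]) returns [{1, 2, 3}].
```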
Referring to fig. 9, fig. 9 is a schematic diagram of a text recognition device according to an embodiment of the application. The text recognition device 90 includes: a first acquisition module 91, a second acquisition module 92 and a merging module 93.
The first acquisition module 91 is configured to acquire at least two pieces of text box information in an image to be identified; wherein each piece of text box information includes a line of text content.
The second acquisition module 92 is configured to acquire the text line features corresponding to each piece of text box information to obtain at least two text line features.
The merging module 93 is configured to merge the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified.
In some embodiments, the second acquisition module 92 is further configured to extract image features of the image to be identified using a convolutional neural network; splice the image features with the at least two pieces of text box information to obtain fusion features; and acquire text line features from the fusion features to obtain at least two text line features.
In some embodiments, the second acquisition module 92 is further configured to extract the corresponding line image features from the image features according to each piece of line position information, and splice the line image features with the semantic features to obtain the fusion features.
In some embodiments, the second acquisition module 92 is further configured to acquire text line features from the fusion features using a text line merging model to obtain at least two text line features.
The text line merging model is built on a transformer network.
In some embodiments, the merging module 93 is further configured to obtain a similarity matrix between the at least two text line features; perform non-maximum suppression on the similarity matrix to obtain the similarity between the at least two text line features; and merge the corresponding text line contents according to the similarity to obtain text content with continuous semantic information in the image to be identified.
In some embodiments, each piece of text box information has corresponding text content and semantic features, and the first acquisition module 91 is further configured to acquire the content of each text line in the image to be identified using optical character recognition technology, and to perform semantic extraction on the content of each text line using a multilingual language model to obtain the semantic features.
It will be appreciated that the first acquisition module 91, the second acquisition module 92 and the merging module 93 cooperate with each other to implement the method of any of the above embodiments.
Referring to fig. 10, fig. 10 is a schematic diagram of a frame of an electronic device according to an embodiment of the application. The electronic device 100 includes a memory 101 and a processor 102 coupled to each other; the memory 101 stores program instructions, and the processor 102 is configured to execute the program instructions to implement the steps of any of the text merging method embodiments described above. Specifically, the electronic device 100 may include, but is not limited to: a server, a desktop computer, a notebook computer, a tablet computer, a smart phone, etc., which is not limited here.
In particular, the processor 102 is configured to control itself and the memory 101 to implement the steps of any of the text merging method embodiments described above. The processor 102 may also be referred to as a CPU (Central Processing Unit). The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 102 may be implemented jointly by integrated circuit chips.
With the above scheme, through parallel similarity computation between text line features, the text line contents in the text boxes of an image can be merged quickly to obtain text content with continuous semantic information. The text lines do not need to be sorted before merging, which avoids merge errors caused by sorting errors and improves merging accuracy.
Referring to fig. 11, fig. 11 is a schematic diagram illustrating a frame of an embodiment of a computer readable storage medium according to the present application. The computer readable storage medium 110 stores program instructions 111 executable by a processor, and the program instructions 111 are configured to implement the steps of any of the text merging method embodiments described above.
With the above scheme, through parallel similarity computation between text line features, the text line contents in the text boxes of an image can be merged quickly to obtain text content with continuous semantic information. The text lines do not need to be sorted before merging, which avoids merge errors caused by sorting errors and improves merging accuracy.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of the various embodiments is intended to highlight the differences between them; for their identical or similar parts, the embodiments may be referred to one another, and the details are not repeated here for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
If the technical solution of the present application involves personal information, a product applying the technical solution clearly informs the user of the personal information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical solution involves sensitive personal information, a product applying it obtains the individual's separate consent before processing the sensitive personal information, and also satisfies the requirement of "explicit consent". For example, a clear and prominent sign is set up at a personal information collection device such as a camera to inform people that they are entering the personal information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, this is regarded as consent to the collection of his or her personal information. Alternatively, on a device that processes personal information, with clear identification or notices informing of the personal information processing rules, personal authorization is obtained through a pop-up message or by asking the individual to upload personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information to be processed.

Claims (10)

1. A method of text recognition, the method comprising:
acquiring at least two pieces of text box information in an image to be identified; wherein each piece of text box information comprises a line of text content;
acquiring the text line feature corresponding to each piece of text box information to obtain at least two text line features;
and merging the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified.
2. The method according to claim 1, wherein the acquiring the text line feature corresponding to each piece of text box information to obtain at least two text line features comprises:
extracting image features of the image to be identified by using a convolutional neural network;
splicing the image features and the at least two text box information to obtain fusion features;
and acquiring the text line features from the fusion features to obtain the at least two text line features.
3. The method according to claim 2, wherein each piece of text box information has the line position information and semantic features of the corresponding text content; and the splicing the image features and the at least two pieces of text box information to obtain fusion features comprises:
extracting the corresponding line image features from the image features according to each piece of line position information;
and splicing the line image features and the semantic features to obtain the fusion features.
4. The method according to claim 2, wherein the acquiring the text line features from the fusion features to obtain the at least two text line features comprises:
acquiring the text line features from the fusion features by using a text line merging model to obtain the at least two text line features.
5. The method according to claim 4, wherein the text line merging model is built based on a transformer network.
6. The method according to claim 1, wherein the merging the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified comprises:
obtaining a similarity matrix between the at least two text line features;
performing non-maximum suppression on the similarity matrix to obtain the similarity between the at least two text line features;
and merging the corresponding text line contents according to the similarity to obtain text content with continuous semantic information in the image to be identified.
7. The method according to claim 1, wherein each piece of text box information has corresponding text content and semantic features, and the acquiring at least two pieces of text box information in the image to be identified comprises:
acquiring the content of each text line in the image to be identified by using optical character recognition technology;
and performing semantic extraction on the content of each text line by using a multilingual language model to obtain the semantic features.
8. A text recognition device, the text recognition device comprising:
a first acquisition module, used for acquiring at least two pieces of text box information in an image to be identified; wherein each piece of text box information comprises a line of text content;
a second acquisition module, used for acquiring the text line features corresponding to each piece of text box information to obtain at least two text line features;
and a merging module, used for merging the corresponding text line contents according to the similarity between the at least two text line features to obtain text content with continuous semantic information in the image to be identified.
9. An electronic device comprising a memory and a processor coupled to each other, the memory having program instructions stored therein, the processor being configured to execute the program instructions to implement the text recognition method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that program instructions executable by a processor for implementing the text recognition method according to any one of claims 1 to 7 are stored.
CN202311868838.8A 2023-12-28 2023-12-28 Text merging method, text recognition device, electronic equipment and storage medium Pending CN117935285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311868838.8A CN117935285A (en) 2023-12-28 2023-12-28 Text merging method, text recognition device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311868838.8A CN117935285A (en) 2023-12-28 2023-12-28 Text merging method, text recognition device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117935285A 2024-04-26

Family

ID=90762279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311868838.8A Pending CN117935285A (en) 2023-12-28 2023-12-28 Text merging method, text recognition device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117935285A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination