CN113627439A

CN113627439A - Text structuring method, processing device, electronic device and storage medium

Info

Publication number: CN113627439A
Application number: CN202110921811.5A
Authority: CN
Inventors: 于海鹏; 梁思远; 李煜林; 钦夏孟; 姚锟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-08-11
Filing date: 2021-08-11
Publication date: 2021-11-09

Abstract

The disclosure provides a text structuring processing method, a processing device, an electronic device and a storage medium. It relates to the technical field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to scenarios such as optical character recognition (OCR). The specific implementation scheme is as follows: performing text detection on a text image to obtain category information of at least one text detection box corresponding to the text image, wherein the category information comprises a keyword category or a numerical category; determining a text image corresponding to a target text detection box in the at least one text detection box; performing text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box; performing text classification on the text recognition result to obtain a semantic category result corresponding to the text recognition result; and generating a text structured result, wherein the text structured result comprises a value corresponding to the keyword category and a value corresponding to the numerical category.

Description

Text structuring method, processing device, electronic device and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and more particularly to the fields of computer vision and deep learning, and can be applied to scenarios such as optical character recognition (OCR). In particular, the disclosure relates to a text structuring processing method, a processing device, an electronic device and a storage medium.

Background

With the continuous development and popularization of information technology, various industries widely use it to improve efficiency, and a large amount of text data is generated as a result. This text data may contain a large amount of structured information, and acquiring that structured information supports deeper applications built on the text data.

Disclosure of Invention

The disclosure provides a text structuring processing method, a processing device, an electronic device and a storage medium.

According to an aspect of the present disclosure, there is provided a text structuring processing method, including: performing text detection on a text image to obtain category information of at least one text detection box corresponding to the text image, wherein the category information includes a keyword category or a numerical category; determining a text image corresponding to a target text detection box in the at least one text detection box, wherein the target text detection box is a text detection box whose category information is the numerical category; performing text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box; performing text classification on the text recognition result to obtain a semantic category result corresponding to the text recognition result; and generating a text structured result, wherein the text structured result includes a value corresponding to the keyword category and a value corresponding to the numerical category, the value corresponding to the keyword category includes the semantic category result, and the value corresponding to the numerical category includes the text recognition result.

According to another aspect of the present disclosure, there is provided a text structuring processing device, including: a text detection module, configured to perform text detection on a text image to obtain category information of at least one text detection box corresponding to the text image, wherein the category information includes a keyword category or a numerical category; a determining module, configured to determine a text image corresponding to a target text detection box in the at least one text detection box, wherein the target text detection box is a text detection box whose category information is the numerical category; a text recognition module, configured to perform text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box; a text classification module, configured to perform text classification on the text recognition result to obtain a semantic category result corresponding to the text recognition result; and a generating module, configured to generate a text structured result, wherein the text structured result includes a value corresponding to the keyword category and a value corresponding to the numerical category, the value corresponding to the keyword category includes the semantic category result, and the value corresponding to the numerical category includes the text recognition result.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described above.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as above.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 schematically shows an exemplary system architecture to which a text structuring processing method and processing device may be applied, according to an embodiment of the present disclosure;

FIG. 2 schematically shows a flow chart of a text structuring processing method according to an embodiment of the present disclosure;

FIG. 3 schematically shows a diagram of a text structuring process according to an embodiment of the present disclosure;

FIG. 4 schematically shows a block diagram of a text structuring processing device according to an embodiment of the present disclosure; and

FIG. 5 shows a block diagram of an electronic device suitable for a text structuring processing method according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Text structuring may be understood as processing text content into a form that includes a value corresponding to a keyword category and a value corresponding to a numerical category. The keyword category (i.e., Key) and the numerical category (i.e., Value) can be understood as a key-value pair. Text data may be presented in the form of an image, that is, as a text image, and text structuring of a text image may be achieved in the following two ways.

In the first way, the text image is processed using a text detection algorithm to obtain a category recognition result for a target text included in the text image, the target text is processed using a text recognition algorithm to obtain a text recognition result, and a text structured result for the target text is obtained according to the category recognition result and the text recognition result. The text structured result includes a value corresponding to the keyword category and a value corresponding to the numerical category, the value corresponding to the keyword category includes the category recognition result, and the value corresponding to the numerical category includes the text recognition result.

In the second way, the text image is processed using a text detection algorithm to obtain position information of a target text included in the text image, the target text is processed using a text recognition algorithm to obtain a text recognition result, a category recognition result corresponding to the target text is determined according to the position information and a preset positional relationship, and a text structured result for the target text is obtained according to the category recognition result and the text recognition result. The preset positional relationship may be understood as the positional relationship between a keyword category and the numerical category corresponding to that keyword category.

In the process of implementing the concept of the present disclosure, it was found that when the format of text images varies widely, the first way has difficulty identifying categories and may produce category recognition errors, because the visual difference between different texts is relatively small. The second way has poor robustness, because the relative position between a keyword category and its corresponding numerical category in the text image is not fixed. Therefore, text structuring that relies only on a text detection algorithm and a text recognition algorithm has low accuracy.

In the process of implementing the concept of the present disclosure, it was also found that, because the target text contains semantic information, semantic extraction can be performed on the target text to obtain a semantic category recognition result (i.e., a category recognition result) corresponding to the target text. Text classification can therefore be combined with text detection and text recognition to realize text structuring of the text image.

Therefore, the embodiment of the disclosure provides a text structuring processing scheme combining text detection, text recognition and text classification, that is, determining a text recognition result of a text image by using the text detection and the text recognition, determining a semantic category recognition result of the text image by using the text classification, and obtaining a text structuring result of the text image according to the text recognition result and the semantic category recognition result. Because the category identification is realized by utilizing the semantic information included in the characters, the accuracy of text structuring is improved.

Based on the foregoing, the embodiments of the present disclosure provide a text structuring processing method, a processing device, an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product. The text structuring processing method may include: performing text detection on a text image to obtain category information of at least one text detection box corresponding to the text image, wherein the category information includes a keyword category or a numerical category; determining a text image corresponding to a target text detection box in the at least one text detection box, wherein the target text detection box is a text detection box whose category information is the numerical category; performing text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box; performing text classification on the text recognition result to obtain a semantic category result corresponding to the text recognition result; and generating a text structured result, wherein the text structured result includes a value corresponding to the keyword category and a value corresponding to the numerical category, the value corresponding to the keyword category includes the semantic category result, and the value corresponding to the numerical category includes the text recognition result.

Fig. 1 schematically shows an exemplary system architecture to which a text structuring processing method and processing apparatus may be applied according to an embodiment of the present disclosure.

It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, the exemplary system architecture 100 to which the text structuring method and processing apparatus may be applied may include a terminal device, but the terminal device may implement the text structuring method and processing apparatus provided in the embodiments of the present disclosure without interacting with a server.

As shown in FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software (by way of example only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (for example only) providing support for content browsed by the user using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as a user request, and feed back a processing result (e.g., a web page, information, or data obtained or generated according to the user request) to the terminal device.

The server 105 may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of conventional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be an edge server. The server 105 may also be a server of a distributed system or a server that incorporates a blockchain.

It should be noted that the text structuring processing method provided by the embodiments of the present disclosure may generally be executed by the terminal device 101, 102, or 103. Accordingly, the text structuring processing device provided by the embodiments of the present disclosure may also be disposed in the terminal device 101, 102, or 103.

Alternatively, the text structuring processing method provided by the embodiments of the present disclosure may also generally be executed by the server 105. Accordingly, the text structuring processing device provided by the embodiments of the present disclosure may generally be disposed in the server 105. The text structuring processing method provided by the embodiments of the present disclosure may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the text structuring processing device provided by the embodiments of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

For example, the server 105 performs text detection on a text image to obtain category information of at least one text detection box corresponding to the text image, determines a text image corresponding to a target text detection box, performs text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box, performs text classification on the text recognition result to obtain a semantic category result corresponding to the text recognition result, and generates a text structured result. Alternatively, a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 performs text detection on the text image and finally generates the text structured result.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 schematically shows a flow chart of a text structuring processing method according to an embodiment of the present disclosure.

As shown in FIG. 2, the method 200 includes operations S210-S250.

In operation S210, text detection is performed on the text image to obtain category information of at least one text detection box corresponding to the text image, where the category information includes a keyword category or a numerical value category.

In operation S220, a text image corresponding to a target text detection box among the at least one text detection box is determined, wherein the target text detection box is a text detection box whose category information is a numerical category.

In operation S230, text recognition is performed on the text image corresponding to the target text detection box, and a text recognition result of the text image corresponding to the target text detection box is obtained.

In operation S240, text classification is performed on the text recognition result to obtain a semantic category result corresponding to the text recognition result.

In operation S250, a text structured result is generated, wherein the text structured result includes a value corresponding to a keyword category and a value corresponding to a numerical category, the value corresponding to the keyword category includes a semantic category result, and the value corresponding to the numerical category includes a text recognition result.
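The flow of operations S210 to S250 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the names `DetectionBox`, `structure_text`, and the four callables `detect`, `crop`, `recognize` and `classify` are hypothetical stand-ins for the text detection model, the region extraction step, the text recognition model and the text classification model, and the toy "image" is simply a dictionary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class DetectionBox:
    box_id: str    # hypothetical handle used by the toy crop function below
    category: str  # "key" (keyword category) or "value" (numerical category)


def structure_text(
    image: Any,
    detect: Callable[[Any], List[DetectionBox]],
    crop: Callable[[Any, DetectionBox], Any],
    recognize: Callable[[Any], str],
    classify: Callable[[str], str],
) -> List[Dict[str, str]]:
    structured = []
    for box in detect(image):               # S210: boxes with category information
        if box.category != "value":         # S220: keep only target boxes
            continue
        region = crop(image, box)           # S220: sub-image of the target box
        text = recognize(region)            # S230: text recognition result
        semantic_category = classify(text)  # S240: semantic category result
        # S250: the key holds the semantic category, the value holds the text
        structured.append({"key": semantic_category, "value": text})
    return structured


# Toy stand-ins (assumptions for illustration only).
toy_image = {"b1": "name", "b2": "Zhang San"}
toy_labels = {"Zhang San": "name"}

result = structure_text(
    toy_image,
    detect=lambda img: [DetectionBox("b1", "key"), DetectionBox("b2", "value")],
    crop=lambda img, box: img[box.box_id],
    recognize=lambda region: region,
    classify=lambda text: toy_labels[text],
)
print(result)
```

Note that the box detected as the keyword category ("name") is skipped: in this flow the key of the structured result comes from classifying the recognized value text, not from recognizing the keyword box.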

According to an embodiment of the present disclosure, a text image may refer to an image including text content. The text image may be of various types; for example, the text image may include a medical text image, a goods list text image, or a financial text image. The text detection box may be a four-corner box, that is, it may be characterized by the coordinates of four points. The category information of the text detection box may include a keyword category or a numerical category: the keyword category may characterize a category attribute of the text content included in the text detection box, and the numerical category may characterize a content attribute of that text content. The text recognition result of the text image corresponding to the target text detection box may be used to characterize the value of the numerical category corresponding to the text image. The semantic category result may be used to characterize the value of the keyword category corresponding to the text image.

For example, if the text content included in a text detection box is "x₁x₂ City Center Hospital", the category information of the text detection box is the numerical category. If the text content included in a text detection box is "name", the category information of the text detection box is the keyword category. If the text content included in a text detection box is "Zhang San", the category information of the text detection box is the numerical category.

According to the embodiment of the disclosure, the text image can be processed by using the text detection model, and the category information of at least one text detection box corresponding to the text image is obtained. The text detection model may be obtained by training a first preset model using a first training sample set and a first label set. The first preset model may include a deep learning model or a conventional model. The deep learning model may include a candidate box based text detection model, a segmentation based text detection model, or a hybrid of both. The conventional model may include a text detection model based on SWT (Stroke Width Transform), a text detection model based on EdgeBox (i.e., edge box), or the like.

According to an embodiment of the present disclosure, after the text detection boxes corresponding to the text image are obtained, a text detection box whose category information is the numerical category may be selected from the at least one text detection box according to the category information and determined as the target text detection box. After the target text detection box is determined, the text image corresponding to the target text detection box may be extracted from the text image, and text recognition may then be performed on the text image corresponding to the target text detection box. The text image corresponding to the target text detection box may be processed using a text recognition model. The text recognition model may be obtained by training a second preset model using a second training sample set and a second label set. The second preset model may include a pattern matching model, a machine learning model, or a deep learning model. The deep learning model may include a text recognition model based on single character recognition or a text recognition model based on whole-text recognition.

For example, the text recognition model may be a text recognition model based on single character recognition. Text recognition is performed on the text image that corresponds to the target text detection box and includes "x₁x₂ City Center Hospital", to obtain a text recognition identifier for each character of that text: the text recognition identifier corresponding to "x₁" is 2 and the text recognition identifier corresponding to "x₂" is 3; for the remaining individual characters, rendered here as "city", "medium", "heart", "doctor" and "hospital", the text recognition identifiers are 4, 5, 7, 8 and 6, respectively. According to the mapping relationship between characters and text recognition identifiers, the text recognition result of the text image that corresponds to the target text detection box and includes "x₁x₂ City Center Hospital" is determined to be "x₁x₂ City Center Hospital".
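The mapping from per-character identifiers back to text can be sketched as a simple lookup. The identifier values below are the ones given in the example above; the table name `ID_TO_CHAR`, the function `decode_identifiers`, and the use of English glosses in place of the original characters are assumptions made for illustration.

```python
# Hypothetical lookup table built from the identifiers in the example above.
# Each entry stands for one recognized character (shown as an English gloss).
ID_TO_CHAR = {
    2: "x1",
    3: "x2",
    4: "city",
    5: "medium",
    7: "heart",
    8: "doctor",
    6: "hospital",
}


def decode_identifiers(identifiers, table=ID_TO_CHAR, sep=" "):
    """Map per-character recognition identifiers back to characters, in order."""
    return sep.join(table[i] for i in identifiers)


decoded = decode_identifiers([2, 3, 4, 5, 7, 8, 6])
print(decoded)
```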

According to an embodiment of the present disclosure, after the text recognition result corresponding to the target text detection box is obtained, the text recognition result may be processed using a text classification model; that is, semantic features included in the text recognition result are extracted using the text classification model, and a semantic category result corresponding to the text recognition result is determined according to the semantic features. The text classification model may be obtained by training a third preset model using a third training sample set and a third label set. The third preset model may include a machine learning model or a deep learning model. The machine learning model may include a text classification model based on a naive Bayes algorithm or a text classification model based on a decision tree.
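As one concrete instance of the naive Bayes option mentioned above, the following is a minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing over whitespace tokens. The class name, the tokenization, the tiny training set and the label names are all assumptions for illustration; the disclosure does not specify these details.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over lowercase whitespace tokens (illustrative)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.lower().split())
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)           # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training texts (recognized value strings) and semantic labels.
clf = NaiveBayesTextClassifier().fit(
    ["zhang san", "li si", "city center hospital", "county hospital"],
    ["name", "name", "hospital", "hospital"],
)
print(clf.predict("Zhang San"), clf.predict("central hospital"))
```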

For example, if the text recognition result is "Zhang San", text classification is performed on the text recognition result to obtain the semantic category result "name".

According to an embodiment of the present disclosure, the semantic category result obtained in operation S240 and the text recognition result obtained in operation S230 may be combined into a text structured result in which a value corresponding to the keyword category is a semantic category result and a value corresponding to the numeric category is a text recognition result.

It should be noted that, in the technical solution of the embodiments of the present disclosure, the text image, the category information of the text image, the text recognition result, the semantic category result, and the text structured result involved all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.

According to an embodiment of the present disclosure, text detection is performed on the text image to obtain category information of at least one text detection box corresponding to the text image, wherein the category information includes a keyword category or a numerical category; the text image corresponding to the target text detection box is determined; text recognition is performed on the text image corresponding to the target text detection box to obtain a text recognition result; text classification is performed on the text recognition result to obtain a semantic category result corresponding to the text recognition result; and a text structured result is generated. Computer vision and a language model are thereby combined, so that category recognition is realized using the semantic information contained in the text, which improves the accuracy of text structuring.

According to an embodiment of the present disclosure, operation S240 may include the following operation.

The text recognition result is processed using the text classification model to obtain a semantic category result corresponding to the text recognition result.

According to embodiments of the present disclosure, the text classification model may include a deep learning model or a machine learning model. The third training sample set may include a plurality of training texts, and the third label set may include a third label corresponding to each training text.

According to an embodiment of the present disclosure, training the third preset model using the third training sample set and the third label set to obtain the text classification model may include the following. Each of the plurality of training texts is input into the third preset model to obtain a semantic category result corresponding to that training text. The semantic category result and the third label corresponding to each training text are input into a first loss function to obtain a first output value. The model parameters of the third preset model are adjusted according to the first output value until the first output value converges. The third preset model obtained when the first output value converges is determined as the text classification model.
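The training procedure above (forward pass, loss computation, parameter adjustment, stop on convergence) can be sketched with a small softmax classifier. This is a sketch under assumptions: the disclosure does not name the first loss function or the model, so cross-entropy, a linear softmax model, plain gradient descent, the convergence tolerance `tol`, and the function names are all illustrative choices.

```python
import math


def train_until_converged(features, labels, n_classes, lr=0.5, tol=1e-6, max_steps=5000):
    """Gradient descent on mean cross-entropy; stops when the loss change < tol."""
    n_features = len(features[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    prev_loss = math.inf
    loss = math.inf
    for _ in range(max_steps):
        loss = 0.0
        grads = [[0.0] * n_features for _ in range(n_classes)]
        for x, y in zip(features, labels):
            # Forward pass: class probabilities for one training text.
            logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
            top = max(logits)
            exps = [math.exp(z - top) for z in logits]
            total = sum(exps)
            probs = [e / total for e in exps]
            # Loss: compare the predicted category with the label.
            loss -= math.log(probs[y])
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                for j, x_j in enumerate(x):
                    grads[c][j] += err * x_j
        loss /= len(features)
        if abs(prev_loss - loss) < tol:  # the loss value has converged
            break
        prev_loss = loss
        # Adjust model parameters according to the loss gradient.
        for c in range(n_classes):
            for j in range(n_features):
                weights[c][j] -= lr * grads[c][j] / len(features)
    return weights, loss


def predict(weights, x):
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    return scores.index(max(scores))


# Hypothetical one-hot features for two training texts with labels 0 and 1.
weights, final_loss = train_until_converged([[1.0, 0.0], [0.0, 1.0]], [0, 1], n_classes=2)
print(final_loss, predict(weights, [1.0, 0.0]), predict(weights, [0.0, 1.0]))
```

In practice the feature vectors would come from the training texts (for example, bag-of-words or learned embeddings), and a deep learning framework would replace the hand-written gradient step.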

According to the embodiment of the disclosure, the text recognition result is processed by using the text classification model, and the semantic classification result corresponding to the text recognition result is obtained, so that the semantic information contained in the text recognition result is fully utilized, and the accuracy and the practicability of text structured extraction are further improved.

According to embodiments of the present disclosure, the text classification model may include a deep learning model.

According to embodiments of the present disclosure, the deep learning model may include a FastText-based text classification model, a text convolutional neural network (TextCNN) based text classification model, a text recurrent neural network (TextRNN) based text classification model, or a dilated gated convolutional neural network (DGCNN) based text classification model.

According to an embodiment of the present disclosure, operation S210 may include the following operations.

Text detection is performed on the text image to obtain the category information and the position information of at least one text detection box corresponding to the text image.

According to an embodiment of the present disclosure, the position information corresponding to the text detection box may be used to characterize the position of the text detection box on the text image. The position information may be characterized by coordinate information of a four-corner box.

According to an embodiment of the present disclosure, operation S220 may include the following operations.

The text image corresponding to the target text detection box is extracted from the text image according to the position information corresponding to the target text detection box in the at least one text detection box.

According to an embodiment of the present disclosure, the position information may be used as a basis for extracting a text image corresponding to the target text detection box from the text image.

According to an embodiment of the present disclosure, extracting the text image corresponding to the target text detection box from the text image according to the position information corresponding to the target text detection box in the at least one text detection box may include the following operations.

Position information corresponding to a target text detection box of the at least one text detection box is converted into target position information using affine transformation. And extracting the text image corresponding to the target text detection box from the text image according to the target position information.

According to an embodiment of the present disclosure, the affine transformation is a linear transformation from two-dimensional coordinates to two-dimensional coordinates that maintains the "straightness" and "parallelism" of a two-dimensional figure. Straightness can be understood to mean that a straight line is still a straight line after the transformation, without bending, and an arc is still an arc. Parallelism can be understood to mean that the relative positional relationship between two-dimensional figures is kept unchanged: parallel lines are still parallel lines, and the included angle of intersecting straight lines is unchanged. The affine transformation may be achieved by at least one of translation, scaling, flipping, rotation, shearing, and the like.

According to an embodiment of the present disclosure, converting the position information corresponding to the target text detection box into the target position information using affine transformation may include: converting the text detection box in the form of a four-corner box into a text detection box in the form of a rectangular box by affine transformation, and determining the position information corresponding to the text detection box in the form of the rectangular box as the target position information, so that the text image corresponding to the target text detection box can be extracted from the text image according to the target position information.

For example, the target text detection box is a four-corner box, which may be characterized by {P₁, P₂, P₃, P₄}, where P₁ represents the upper-left corner point of the four-corner box, P₂ the upper-right corner point, P₃ the lower-left corner point, and P₄ the lower-right corner point. P₁ can be characterized as {x₁, y₁}, P₂ as {x₂, y₂}, P₃ as {x₃, y₃}, and P₄ as {x₄, y₄}. Applying the affine transformation P₁→P′₁, P₂→P′₂, P₃→P′₃, P₄→P′₄ yields the rectangular box {P′₁, P′₂, P′₃, P′₄}. P′₁ can be characterized as {x′₁, y′₁}, P′₂ as {x′₂, y′₂}, P′₃ as {x′₃, y′₃}, and P′₄ as {x′₄, y′₄}.
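
The corner mapping above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the four-corner box is a parallelogram (for example, a rotated rectangle), so that the affine matrix fixed by P₁, P₂ and P₃ also maps P₄ onto the fourth rectangle corner. A general quadrilateral would require a perspective transformation instead. The function names `affine_from_three_points` and `rectify_box` are hypothetical.

```python
import math


def affine_from_three_points(src, dst):
    """Solve a*x + b*y + c = u for each output coordinate via Cramer's rule.

    src and dst are lists of three (x, y) points; returns two rows (a, b, c).
    """
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("source points are collinear")
    rows = []
    for k in range(2):  # k = 0 gives the x' row, k = 1 gives the y' row
        u1, u2, u3 = (p[k] for p in dst)
        a = (u1 * (y2 - y3) - y1 * (u2 - u3) + (u2 * y3 - u3 * y2)) / det
        b = (x1 * (u2 - u3) - u1 * (x2 - x3) + (x2 * u3 - x3 * u2)) / det
        c = (x1 * (y2 * u3 - y3 * u2)
             - y1 * (x2 * u3 - x3 * u2)
             + u1 * (x2 * y3 - x3 * y2)) / det
        rows.append((a, b, c))
    return rows


def rectify_box(p1, p2, p3, p4):
    """Map corners (upper-left, upper-right, lower-left, lower-right) to a rectangle.

    The rectangle has its upper-left corner at the origin, width |P1P2| and
    height |P1P3| (an assumption made for this sketch).
    """
    width = math.hypot(p2[0] - p1[0], p2[1] - p1[1])
    height = math.hypot(p3[0] - p1[0], p3[1] - p1[1])
    matrix = affine_from_three_points(
        [p1, p2, p3], [(0.0, 0.0), (width, 0.0), (0.0, height)]
    )
    return [
        tuple(a * x + b * y + c for a, b, c in matrix)
        for x, y in (p1, p2, p3, p4)
    ]
```

The returned corners play the role of the target position information; the pixels inside the original box would then be resampled into this rectangle to produce the extracted text image.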

The text image is processed using a text detection model to obtain the category information of the at least one text detection box corresponding to the text image.

According to an embodiment of the present disclosure, the text detection model may include a deep learning model, which may include a candidate box-based text detection model, a segmentation-based text detection model, a hybrid text detection model based on both, and the like. The basic idea of the candidate box-based text detection model is to generate a plurality of candidate text detection boxes in advance, and then obtain the category information and position information corresponding to the text detection boxes by using non-maximum suppression. The basic idea of the segmentation-based text detection model is to segment the text image at the pixel level by using a segmentation network, and then process the segmented text image to obtain the category information and position information corresponding to the text detection boxes.
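
As a rough illustration of the non-maximum suppression step used by candidate box-based detectors, the sketch below keeps the highest-scoring candidate and discards lower-scoring candidates that overlap it too strongly. The axis-aligned box layout `(x1, y1, x2, y2)`, the dictionary keys, and the IoU threshold of 0.5 are assumptions made for this example and are not specified by the disclosure.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(candidates, iou_threshold=0.5):
    """Greedy NMS over candidates shaped like
    {"box": (x1, y1, x2, y2), "score": float, "category": str}."""
    kept = []
    # Visit candidates from highest to lowest confidence score.
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        # Keep a candidate only if it does not overlap a kept box too much.
        if all(iou(cand["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


candidates = [
    {"box": (0, 0, 10, 10), "score": 0.9, "category": "keyword"},
    {"box": (1, 1, 11, 11), "score": 0.8, "category": "keyword"},
    {"box": (20, 0, 30, 10), "score": 0.7, "category": "numerical"},
]
kept_boxes = non_max_suppression(candidates)
```

Each surviving entry carries both a box and a category, matching the position information and category information described above.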

According to the embodiment of the disclosure, a first preset model may be trained by using a first training sample set and a first label set to obtain a text detection model, where the first training sample set includes a plurality of training text images, the first label set includes a first label corresponding to each training text image, and the first label represents real position information and real category information corresponding to at least one text detection box included in the training text images.

According to an embodiment of the present disclosure, training the first preset model by using the first training sample set and the first label set to obtain the text detection model may include the following. Each training text image in the plurality of training text images is input into the first preset model to obtain category information and position information corresponding to at least one text detection box included in each training text image. The category information, the position information, and the first label corresponding to each text detection box are input into a second loss function to obtain a second output value. The model parameters of the first preset model are adjusted according to the second output value until the second output value converges. The first preset model obtained when the second output value converges is determined as the text detection model.
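
The loop of computing a loss output value, adjusting the model parameters, and stopping once the output value converges can be sketched generically. In the example below a one-variable linear model stands in for the first preset model and mean squared error stands in for the second loss function; both substitutions, the learning rate, and the convergence test (change in loss below a tolerance) are assumptions chosen only to keep the sketch self-contained.

```python
def train_until_converged(samples, labels, lr=0.05, tol=1e-9, max_steps=10_000):
    """Gradient descent on a toy model y = w * x + b until the loss stops changing."""
    w, b = 0.0, 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_steps):
        # Forward pass: model predictions for every training sample.
        errors = [(w * x + b) - y for x, y in zip(samples, labels)]
        # Loss output value (mean squared error stands in for the loss function).
        loss = sum(e * e for e in errors) / len(errors)
        # Convergence check: stop when the output value no longer changes.
        if abs(prev_loss - loss) < tol:
            break
        # Adjust the model parameters according to the loss gradient.
        grad_w = 2 * sum(e * x for e, x in zip(errors, samples)) / len(errors)
        grad_b = 2 * sum(errors) / len(errors)
        w -= lr * grad_w
        b -= lr * grad_b
        prev_loss = loss
    return w, b, loss


w, b, final_loss = train_until_converged([0, 1, 2, 3], [1, 3, 5, 7])
```

The parameters returned at convergence correspond to the trained model in the description above; a real detection model would have many more parameters and a loss covering both category and position.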

According to an embodiment of the present disclosure, operation S230 may include the following operations.

The text image corresponding to the target text detection box is processed using a text recognition model to obtain a text recognition result of the text image corresponding to the target text detection box.

According to the embodiment of the disclosure, a second preset model may be trained by using a second training sample set and a second label set to obtain a text recognition model, where the second training sample set may include a plurality of training text image slices, and the second label set includes a second label corresponding to each training text image slice.

According to an embodiment of the present disclosure, training the second preset model by using the second training sample set and the second label set to obtain the text recognition model may include the following. Each training text image slice in the plurality of training text image slices is input into the second preset model to obtain a text recognition result corresponding to each training text image slice. The text recognition result and the second label corresponding to each training text image slice are input into a third loss function to obtain a third output value. The model parameters of the second preset model are adjusted according to the third output value until the third output value converges. The second preset model obtained when the third output value converges is determined as the text recognition model.

According to the embodiment of the disclosure, the trained text detection model, the trained text recognition model and the trained text classification model can be determined as the text structured model.

According to an embodiment of the present disclosure, the text structuring processing method may further include the following operations.

The text image is obtained by data preprocessing.

According to embodiments of the present disclosure, the data preprocessing may include at least one of: noise reduction processing, tilt correction processing, and sharpening processing. For example, before text detection is performed, a text image that was shot at an angle may first be corrected by a tilt correction algorithm and then input into the text detection model for text detection.
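
As one possible form of noise reduction processing, the sketch below applies a 3x3 median filter to a grayscale image. The disclosure does not name a specific algorithm, so the choice of a median filter, the nested-list image representation, and the function name `median_filter_3x3` are assumptions for illustration.

```python
from statistics import median_low


def median_filter_3x3(image):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    image is a list of rows of integer gray values; border pixels use the
    part of the window that lies inside the image.
    """
    height, width = len(image), len(image[0])
    filtered = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - 1), min(height, y + 2))
                for xx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low keeps the result an existing integer pixel value.
            filtered[y][x] = median_low(window)
    return filtered


# A flat gray patch with one isolated bright noise pixel in the centre.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
denoised = median_filter_3x3(noisy)
```

An isolated outlier is removed while flat regions are left unchanged, which is why median filtering is a common choice before text detection on scanned or photographed documents.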

According to the embodiment of the disclosure, the quality of the text image can be improved by performing data preprocessing on the text image, so that the text structuring result is more accurate and practical.

According to an embodiment of the present disclosure, the text image may include a medical text image.

According to the embodiment of the disclosure, medical text is an important means of recording information in medical scenarios and contains a large amount of structured information about a user. Acquiring this structured information helps in understanding the user's health condition so that targeted analysis and processing can follow. It also makes it possible to establish a complete database and user profile. Medical text may exist in image form, and extracting the required structured information from a medical text image is a technical difficulty in medical scenarios, which can be addressed by the text structuring scheme provided by the embodiment of the disclosure.

The text structuring processing method according to the embodiment of the present disclosure is further described with reference to fig. 3.

Fig. 3 schematically shows a text structuring process according to an embodiment of the present disclosure.

As shown in fig. 3, in the text structuring process 300, the text detection model 302 performs text detection on the text image 301 to obtain category information and position information of at least one text detection box corresponding to the text image 301, where the category information may include a keyword category or a numerical value category, and the at least one text detection box may include at least one of a text detection box 303, a text detection box 304, a text detection box 305, a text detection box 306, and a text detection box 307. The category information of the text detection box 304 and the text detection box 306 may be a keyword category, and the category information of the text detection box 303, the text detection box 305, and the text detection box 307 may be a numeric category.

A text image corresponding to the target text detection box is then determined, where the target text detection box may be a text detection box whose category information is the numerical value category. The target text detection box may include at least one of the text detection box 303, the text detection box 305, and the text detection box 307. The operations of text recognition, text classification, and text structured result generation are described below, taking the text detection box 303 as the target text detection box.

The text recognition model 308 performs text recognition on the text image 303-1 corresponding to the text detection box 303 to obtain a text recognition result 303-2 of the text image 303-1 corresponding to the text detection box 303.

The text classification model 309 performs text classification on the text recognition result 303-2 to obtain a semantic classification result 310 corresponding to the text recognition result 303-2.

The semantic classification result 310 and the text recognition result 303-2 are combined into a structured result 311. The structured result 311 includes a value corresponding to the keyword category and a value corresponding to the numerical value category, where the value corresponding to the keyword category includes the semantic classification result 310 and the value corresponding to the numerical value category includes the text recognition result 303-2.
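
The flow of Fig. 3 can be sketched end to end with stand-in models. Everything named `toy_*` below is a hypothetical placeholder for the trained text detection, text recognition, and text classification models, and the "image" is a list of character rows so that the example runs without any imaging library. The sample readings are invented for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), end-exclusive


@dataclass
class Detection:
    box: Box
    category: str  # "keyword" or "numerical"


def crop(image: List[str], box: Box) -> List[str]:
    """Extract the region of the image covered by a detection box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def structure_text(
    image: List[str],
    detect: Callable[[List[str]], List[Detection]],
    recognize: Callable[[List[str]], str],
    classify: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Detect boxes, recognize only numerical boxes, classify the recognized text."""
    results = []
    for det in detect(image):
        if det.category != "numerical":
            continue  # keyword boxes are never passed to text recognition
        text = recognize(crop(image, det.box))
        results.append({"keyword": classify(text), "value": text})
    return results


def toy_detect(image: List[str]) -> List[Detection]:
    return [
        Detection((0, 0, 2, 1), "keyword"),
        Detection((3, 0, 9, 1), "numerical"),
        Detection((0, 1, 2, 2), "keyword"),
        Detection((3, 1, 9, 2), "numerical"),
    ]


def toy_recognize(patch: List[str]) -> str:
    return "".join(patch).strip()


def toy_classify(text: str) -> str:
    return "blood pressure" if "/" in text else "heart rate"


toy_image = ["BP 120/80", "HR 72    "]
structured = structure_text(toy_image, toy_detect, toy_recognize, toy_classify)
```

Note that the keyword in each structured entry comes from classifying the recognized numerical text, not from recognizing the keyword box itself, which mirrors the way the structured result 311 is assembled.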

Fig. 4 schematically shows a block diagram of a text structuring processing device according to an embodiment of the present disclosure.

As shown in fig. 4, the text structuring processing device 400 may include a text detecting module 410, a determining module 420, a text recognizing module 430, a text classifying module 440, and a generating module 450.

The text detection module 410 is configured to perform text detection on the text image to obtain category information of at least one text detection box corresponding to the text image, where the category information includes a keyword category or a numerical value category.

The determining module 420 is configured to determine a text image corresponding to a target text detection box in the at least one text detection box, where the target text detection box is a text detection box whose category information is the numerical value category.

The text recognition module 430 is configured to perform text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box.

The text classification module 440 is configured to perform text classification on the text recognition result to obtain a semantic classification result corresponding to the text recognition result.

The generating module 450 is configured to generate a text structured result, where the text structured result includes a value corresponding to the keyword category and a value corresponding to the numerical category, the value corresponding to the keyword category includes a semantic category result, and the value corresponding to the numerical category includes a text recognition result.

According to an embodiment of the present disclosure, the text classification module 440 may include a first obtaining sub-module.

The first obtaining sub-module is configured to process the text recognition result by using a text classification model to obtain a semantic classification result corresponding to the text recognition result.

According to an embodiment of the present disclosure, the text detection module 410 may include a second obtaining sub-module.

The second obtaining sub-module is configured to perform text detection on the text image to obtain the category information and the position information of the at least one text detection box corresponding to the text image.

According to an embodiment of the present disclosure, the determination module 420 may include an extraction sub-module.

The extraction sub-module is configured to extract the text image corresponding to the target text detection box from the text image according to the position information corresponding to the target text detection box in the at least one text detection box.

According to an embodiment of the present disclosure, the extraction sub-module may include a conversion unit and an extraction unit.

The conversion unit is configured to convert the position information corresponding to the target text detection box in the at least one text detection box into target position information using affine transformation.

The extraction unit is configured to extract the text image corresponding to the target text detection box from the text image according to the target position information.

According to an embodiment of the present disclosure, the text detection module 410 may include a third obtaining sub-module.

The third obtaining sub-module is configured to process the text image by using a text detection model to obtain the category information of the at least one text detection box corresponding to the text image.

According to an embodiment of the present disclosure, the text recognition module 430 may include a fourth obtaining sub-module.

The fourth obtaining sub-module is configured to process the text image corresponding to the target text detection box in the at least one text detection box by using a text recognition model to obtain a text recognition result of the text image corresponding to the target text detection box.

According to an embodiment of the present disclosure, the text structuring processing device 400 may further include an obtaining module.

The obtaining module is configured to obtain the text image by using data preprocessing, where the data preprocessing includes at least one of: noise reduction processing, tilt correction processing, and sharpening processing.

According to an embodiment of the present disclosure, the text image includes a medical text image.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.

According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the method described above.

According to an embodiment of the disclosure, a computer program product includes a computer program which, when executed by a processor, implements the method described above.

Fig. 5 shows a block diagram of an electronic device suitable for a text structuring processing method according to an embodiment of the present disclosure. The electronic device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 executes the respective methods and processes described above, such as the text structuring processing method. For example, in some embodiments, the text structuring processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the text structuring processing method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the text structuring processing method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A text structuring processing method, comprising:

performing text detection on a text image to obtain category information of at least one text detection box corresponding to the text image, wherein the category information comprises a keyword category or a numerical value category;

determining a text image corresponding to a target text detection box in the at least one text detection box, wherein the target text detection box is a text detection box of which the category information is the numerical category;

performing text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box;

performing text classification on the text recognition result to obtain a semantic classification result corresponding to the text recognition result; and

generating a text structured result, wherein the text structured result comprises a value corresponding to the keyword category and a value corresponding to the numerical category, the value corresponding to the keyword category comprises the semantic category result, and the value corresponding to the numerical category comprises the text recognition result.

2. The method of claim 1, wherein the text classifying the text recognition result to obtain a semantic category result corresponding to the text recognition result comprises:

processing the text recognition result by using a text classification model to obtain a semantic classification result corresponding to the text recognition result.

3. The method of claim 2, wherein the text classification model comprises a deep learning model.

4. The method according to any one of claims 1 to 3, wherein the text detection on the text image to obtain the category information of at least one text detection box corresponding to the text image comprises:

performing text detection on the text image to obtain category information and position information of at least one text detection box corresponding to the text image;

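wherein the determining a text image corresponding to a target text detection box of the at least one text detection box comprises:

extracting a text image corresponding to the target text detection box from the text image according to the position information corresponding to the target text detection box in the at least one text detection box.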

5. The method of claim 4, wherein the extracting a text image corresponding to the target text detection box from the text image according to the position information corresponding to the target text detection box comprises:

converting position information corresponding to a target text detection box in the at least one text detection box into target position information by affine transformation; and

extracting a text image corresponding to the target text detection box from the text image according to the target position information.

6. The method according to any one of claims 1 to 5, wherein the text detection on the text image to obtain the category information of at least one text detection box corresponding to the text image comprises:

processing the text image by using a text detection model to obtain the category information of at least one text detection box corresponding to the text image.

7. The method according to any one of claims 1 to 6, wherein the performing text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box comprises:

processing the text image corresponding to the target text detection box by using a text recognition model to obtain a text recognition result of the text image corresponding to the target text detection box.

8. The method of any of claims 1-7, further comprising:

obtaining the text image by using data preprocessing, wherein the data preprocessing comprises at least one of the following steps: noise reduction processing, tilt correction processing, and sharpening processing.

9. The method of any of claims 1-8, wherein the text image comprises a medical text image.

10. A text structuring processing device comprising:

the text detection module is used for performing text detection on a text image to obtain category information of at least one text detection box corresponding to the text image, wherein the category information comprises a keyword category or a numerical value category;

a determining module, configured to determine a text image corresponding to a target text detection box in the at least one text detection box, where the target text detection box is a text detection box of which the category information is the numerical category;

the text recognition module is used for performing text recognition on the text image corresponding to the target text detection box to obtain a text recognition result of the text image corresponding to the target text detection box;

the text classification module is used for performing text classification on the text recognition result to obtain a semantic classification result corresponding to the text recognition result; and

a generating module, configured to generate a text structured result, where the text structured result includes a value corresponding to the keyword category and a value corresponding to the numerical category, the value corresponding to the keyword category includes the semantic category result, and the value corresponding to the numerical category includes the text recognition result.

11. The apparatus of claim 10, wherein the text classification module comprises:

a first obtaining submodule, configured to process the text recognition result by using a text classification model to obtain a semantic classification result corresponding to the text recognition result.

12. The apparatus of claim 11, wherein the text classification model comprises a deep learning model.

13. The apparatus of any of claims 10-12, wherein the text detection module comprises:

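a second obtaining submodule, configured to perform text detection on the text image to obtain the category information and the position information of at least one text detection box corresponding to the text image;

wherein the determining module comprises:

an extraction submodule, configured to extract a text image corresponding to the target text detection box from the text image according to the position information corresponding to the target text detection box in the at least one text detection box.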

14. The apparatus of claim 13, wherein the extraction submodule comprises:

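a conversion unit configured to convert, by affine transformation, position information corresponding to a target text detection box of the at least one text detection box into target position information; and

an extraction unit configured to extract the text image corresponding to the target text detection box from the text image according to the target position information.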

15. The apparatus of any of claims 10-14, wherein the text detection module comprises:

a third obtaining submodule, configured to process the text image by using a text detection model to obtain the category information of at least one text detection box corresponding to the text image.

16. The apparatus of any of claims 10-15, wherein the text recognition module comprises:

a fourth obtaining submodule, configured to process the text image corresponding to the target text detection box by using a text recognition model to obtain a text recognition result of the text image corresponding to the target text detection box.

17. The apparatus of any of claims 10-16, further comprising:

an obtaining module, configured to obtain the text image by using data preprocessing, where the data preprocessing includes at least one of: noise reduction processing, tilt correction processing, and sharpening processing.

18. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of claims 1-9.

20. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 9.