CN114140810B

CN114140810B - Method, apparatus and medium for structured recognition of documents

Info

Publication number: CN114140810B
Application number: CN202210113484.5A
Authority: CN
Inventors: 张国强; 何佳彬; 郑伟; 连增; 张红达; 马祯; 陈云锡; 张静; 张天维
Original assignee: Beijing Ouying Information Technology Co ltd
Current assignee: Beijing Ouying Information Technology Co ltd
Priority date: 2022-01-30
Filing date: 2022-01-30
Publication date: 2022-04-22
Anticipated expiration: 2042-01-30
Also published as: CN114140810A

Abstract

Embodiments of the present disclosure relate to a method, apparatus, and medium for structured recognition of documents. According to the method, one or more documents to be recognized are acquired; the document to be recognized is recognized based on a plurality of preconfigured matching patterns to obtain a plurality of recognition results for the document; it is determined whether an associated database includes a structured record file of the user to whom the document to be recognized relates; and, in response to determining that the database includes the structured record file of the user, the recognition results are populated into corresponding locations in the structured record file. Text recognition of one or more documents to be recognized can thus be carried out automatically, individually or in batches, which can improve the related working efficiency.

Description

Method, apparatus and medium for structured recognition of documents

Technical Field

Embodiments of the present disclosure relate generally to the field of image text recognition, and more particularly, to a method, apparatus, and medium for structured recognition of documents.

Background

In the medical field, medical record data is the original record of the whole process of a patient's diagnosis and treatment in a hospital. It can contain the patient's basic information, disease course records, examination and test results, medical orders, operation records, admission records, discharge records, nursing records and the like. Medical record data is therefore an important component of medical data and is of great significance for data sharing, medical research, doctor consultation, follow-up visits and the like.

Currently, medical record data is usually recorded manually by medical staff (e.g., doctors) on paper documents. Paper medical record documents, however, are inconvenient for medical staff to process and study, and medical record data recorded on paper is easily lost through improper storage. It is therefore necessary to recognize and extract the medical record data recorded in paper documents so that it can be stored electronically. Medical record documents adopted by different medical institutions usually differ, and each medical record often contains named entities with the same meaning but different names (for example, blood pressure is sometimes recorded as its English abbreviation BP), which makes screening and classifying medical record data complicated; at present, the data is usually imported manually into an electronic database for storage. Manual import, however, is time-consuming, labor-intensive and inefficient. With the development of big data and artificial intelligence technology, storing various data (including medical data) electronically for later use has become a trend.

Disclosure of Invention

In view of the above problems, the present disclosure provides a method and device for document structured recognition, which enable text recognition of one or more documents to be recognized individually or in batches to be performed automatically, thereby improving the related work efficiency.

According to a first aspect of the present disclosure, there is provided a method for structured recognition of a document, comprising: acquiring one or more documents to be recognized; recognizing the document to be recognized based on a plurality of preconfigured matching patterns to obtain a plurality of recognition results for the document to be recognized; determining whether a structured record file of a user associated with the document to be recognized is included in an associated database; and in response to determining that the structured record file of the user is included in the database, populating the recognition results into corresponding locations in the structured record file, wherein the structured record file is generated based on a pre-established structured template.

According to a second aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the disclosure.

In a third aspect of the present disclosure, a non-transitory computer readable storage medium is provided having stored thereon computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.

In some embodiments, each matching pattern of the plurality of matching patterns specifies an extraction location and an attribute of a respective text message to be extracted, and each recognition result of the plurality of recognition results includes a text message extracted from the document to be recognized based on the respective matching pattern and an associated attribute.

In some embodiments, recognizing the document to be recognized based on a plurality of preconfigured matching patterns comprises: executing the matching pattern to extract a corresponding text message from the document to be recognized based on the extraction position specified by the matching pattern; in response to failing to extract a corresponding text message from the document to be recognized based on the extraction position specified by the matching pattern, adjusting the matching pattern to extract the corresponding text message from the document to be recognized based on the adjusted matching pattern, wherein adjusting the matching pattern comprises replacing at least one named entity included in the matching pattern with another named entity having the same meaning; and in response to extracting the corresponding text message from the document to be recognized based on the extraction position specified by the matching pattern, combining the attribute indicated by the matching pattern and the extracted text message, in the form of a key-value pair, into a corresponding recognition result.

In some embodiments, the structured template defines which components, sub-portions and attributes the structured record file includes and associations between the components, sub-portions and attributes, and the structured template further defines whether each component or sub-portion can include a plurality of different groupings of records, each for recording text messages for a plurality of attributes associated with the respective component or sub-portion.

In some embodiments, the method further comprises: in response to determining that the structured record file for the user is not included in the database, generating a structured record file to be populated for the user based on the structured template.

In some embodiments, each recognition result further comprises a respective confidence, and prior to populating the recognition result into the respective location in the structured record file, the method further comprises: determining whether a plurality of first recognition results associated with the same attribute are included in the plurality of recognition results; in response to determining that the plurality of first recognition results are included in the plurality of recognition results, determining, based on the structured template, whether the component part or sub-part associated with the attribute can include a plurality of different record groupings; in response to determining that the component part or sub-part associated with the attribute cannot include a plurality of different record groupings, retaining the first recognition result having the highest confidence among the plurality of first recognition results and ignoring the other first recognition results; in response to determining that the component part or sub-part associated with the attribute can include a plurality of different record groupings, determining a degree of overlap between the text messages respectively included in the plurality of first recognition results; in response to determining that the degree of overlap exceeds a predetermined threshold, retaining the first recognition result having the highest confidence among the plurality of first recognition results and ignoring the other first recognition results; and in response to determining that the degree of overlap is less than or equal to the predetermined threshold, retaining the plurality of first recognition results.

In some embodiments, populating the recognition results into respective locations in the structured record file comprises: determining, based on the structured template, whether the component part or sub-part associated with the attribute included in the recognition result can include a plurality of different record groupings; in response to determining that the component part or sub-part associated with the attribute can include a plurality of different record groupings, determining whether that component part or sub-part of the structured record file already includes a record grouping associated with the recognition result; in response to determining that the component part or sub-part of the structured record file associated with the attribute already includes a record grouping associated with the recognition result, populating the recognition result into that record grouping; and in response to determining that the structured record file does not include a record grouping associated with the recognition result, adding a corresponding record grouping to be filled to the component part or sub-part of the structured record file associated with the attribute, and populating the recognition result into the added record grouping.

In some embodiments, populating the recognition results to respective locations in the structured record file further comprises: in response to determining that a plurality of different groupings of records cannot be included in the component or sub-portion associated with the attribute included in the recognition result, determining whether a record for the attribute has been populated in the structured record file; in response to determining that a record for the attribute has not been populated in the structured record file, populating the identification result into a corresponding component or sub-portion of the structured record file; and in response to determining that a record for the attribute has been populated in the structured record file, ignoring the recognition result.
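The two populating embodiments above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the disclosure: the function name `populate`, the nested-dictionary layout of the record, and in particular the `group_key` argument used to decide which record grouping a recognition result belongs to are all hypothetical, since the text above does not say how that association is determined.

```python
from typing import Dict, Optional


def populate(record: Dict, part: str, attribute: str, text: str,
             multiple_groupings: bool, group_key: Optional[str] = None) -> bool:
    """Fill one recognition result into a record; return False if it is ignored.

    `record` is a plain nested dict standing in for the structured record file.
    `group_key` is a hypothetical identifier of the record grouping the result
    belongs to (for example, one particular surgery).
    """
    if multiple_groupings:
        groupings = record.setdefault(part, {})
        # Reuse the associated grouping if it exists, otherwise add a new one.
        grouping = groupings.setdefault(group_key, {})
        grouping[attribute] = text
        return True

    fields = record.setdefault(part, {})
    if fields.get(attribute) is not None:
        # A record for this attribute is already populated: ignore the result.
        return False
    fields[attribute] = text
    return True


record: Dict = {}
populate(record, "physical examination", "systolic pressure", "94", False)
populate(record, "physical examination", "systolic pressure", "95", False)  # ignored
populate(record, "surgical record", "surgery name", "osteotomy", True, "surgery-1")
populate(record, "surgical record", "surgery date", "2020-01-01", True, "surgery-1")
populate(record, "surgical record", "surgery name", "revision", True, "surgery-2")
```

After these calls the single-valued attribute keeps its first value, while the "surgical record" part holds two separate groupings.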

In some embodiments, the matching pattern is implemented using a regular expression.

In some embodiments, the extraction position of the text message to be extracted, as indicated by each matching pattern of the plurality of matching patterns, comprises: the part of the document to be recognized in which the text message to be extracted is located, or the matching interval, within a given part of the document to be recognized, in which the text message to be extracted is located.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements.

FIG. 1 shows a schematic diagram of a system 100 for implementing a method for structured recognition of documents according to an embodiment of the invention.

FIG. 2 shows a flow diagram of a method 200 for structured recognition of documents, according to an embodiment of the present disclosure.

FIG. 3 shows a flowchart of a method 300 for identifying the document to be identified based on a plurality of pre-configured matching patterns according to an embodiment of the present disclosure.

FIG. 4A illustrates a flow diagram of an example embodiment of an aspect of a method 400 for populating recognition results to corresponding locations in a structured record file according to an embodiment of the present disclosure.

FIG. 4B illustrates a flow diagram of an example embodiment of another aspect of a method 400 for populating recognition results to corresponding locations in a structured record file in accordance with embodiments of the present disclosure.

FIG. 5 shows a block diagram of an electronic device 500 according to an embodiment of the disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.

As described above, medical record data is generally recorded manually by medical staff (e.g., doctors) on paper documents, but paper medical record documents are inconvenient for medical staff to process and study, and medical record data recorded on paper is easily lost through improper storage. Moreover, medical record documents adopted by different medical institutions usually differ, and each medical record often contains named entities with the same meaning but different names, so that screening and classifying medical record data is complicated and the data is at present usually imported manually into an electronic database for storage. Manual import, however, is time-consuming, labor-intensive and inefficient.

To address at least in part one or more of the above issues and other potential issues, an example embodiment of the present disclosure proposes a method for structured recognition of documents, comprising: acquiring one or more documents to be recognized; recognizing the document to be recognized based on a plurality of preconfigured matching patterns to obtain a plurality of recognition results for the document to be recognized; determining whether a structured record file of a user associated with the document to be recognized is included in an associated database; and in response to determining that the structured record file of the user is included in the database, populating the recognition results into corresponding locations in the structured record file, wherein the structured record file is generated based on a pre-established structured template. In this way, text recognition of one or more documents to be recognized can be performed automatically, individually or in batches, and the related working efficiency can therefore be improved.

FIG. 1 shows a schematic diagram of a system 100 for implementing a method for structured recognition of documents according to an embodiment of the present disclosure. As shown in FIG. 1, the system 100 includes a computing device 110, a network 120, and a scanner 130 (or, alternatively, an imaging apparatus or a server storing documents to be recognized). The computing device 110 and the scanner 130 may exchange data, for example, over the network 120 (e.g., the internet). In the present disclosure, the scanner 130 may be used to provide a document to be recognized, such as a picture document of a paper medical record document. The computing device 110 may communicate with the scanner 130 via the network 120 to obtain the scanned document to be recognized from the scanner 130. The computing device 110 may include at least one processor 112 and at least one memory 114 coupled to the at least one processor 112, the memory 114 having stored therein instructions 116 executable by the at least one processor 112, the instructions 116, when executed by the at least one processor 112, performing the method 200 as described below. The specific structure of the computing device 110 may be, for example, as described below in connection with FIG. 5.

FIG. 2 shows a flow diagram of a method 200 for structured recognition of documents, according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 500 shown in FIG. 5. It should be understood that method 200 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the present disclosure is not limited in this respect.

At step 202, computing device 110 obtains one or more documents to be identified.

The document to be recognized is usually a document in a picture format, for example, a picture document obtained by scanning a paper document (e.g., a paper medical record document) with a scanner, or a picture document obtained by shooting a paper document with an imaging device such as a camera.

At step 204, the document to be recognized is recognized (e.g., by optical character recognition (OCR)) based on a plurality of preconfigured matching patterns, so as to obtain a plurality of recognition results for the document to be recognized.

In the present disclosure, each matching pattern of the plurality of matching patterns specifies an extraction position and an attribute of a corresponding text message to be extracted, and each recognition result of the plurality of recognition results includes a text message extracted from a document to be recognized based on the corresponding matching pattern and an associated attribute.

In some embodiments, the respective matching patterns are implemented using regular expressions. The extraction position of the text message to be extracted, as indicated by each matching pattern of the plurality of matching patterns, includes: the part of the document to be recognized in which the text message to be extracted is located, or the matching interval, within a given part of the document to be recognized, in which the text message to be extracted is located.

For example, "blood pressure: [W:1-6]/[W:1-5] mmHg physical examination SECOND_LEVEL_POINT systolic pressure [W:1-6]" is one example of a matching pattern implemented using a regular expression. The matching pattern indicates that, in the part of the document to be recognized whose second-level title (SECOND_LEVEL_POINT) is "physical examination", the 1 to 6 characters following "blood pressure:" and preceding "/" are extracted (or recognized) from the matching interval "blood pressure: [W:1-6]/[W:1-5] mmHg", and that the attribute of the extracted characters (i.e., the extracted text message) is "systolic pressure". Thus, if the sub-part "physical examination" of the document to be recognized includes "blood pressure: 94/58mmHg", the text message "94" can be extracted based on the matching pattern, and the extracted text message "94" belongs to the attribute "systolic pressure". In the present disclosure, the part indicated by a second-level title is a sub-part of the document to be recognized.

As another example, "main [W:0-5] complain: -admission record FIRST_LEVEL_POINT" is another matching pattern implemented using a regular expression, where "main" and "complain" render the two words that make up the term "chief complaint". The matching pattern indicates that the character string following "main [W:0-5] complain:" is extracted from that matching interval in the part of the document to be recognized whose first-level title (FIRST_LEVEL_POINT) is "admission record", and that 0 to 5 spaces may appear between the two words of "chief complaint". Therefore, if the component "admission record" of the document to be recognized includes "chief complaint: 2 years after the osteotomy around the right hip joint", the text message "2 years after the osteotomy around the right hip joint" can be extracted based on the matching pattern, and the extracted text message belongs to the attribute "chief complaint". In the present disclosure, the part indicated by a first-level title is a component of the document to be recognized.
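As a rough illustration of the first example above, the following Python sketch compiles a matching interval into a standard regular expression and restricts the search to the named part of the document. It rests on assumptions that the disclosure does not confirm: the placeholder "[W:m-n]" is taken to mean "any m to n characters", spaces in the literal text are treated as optional whitespace, and the names `MatchingPattern`, `interval_to_regex`, `extract` and the `sections` dictionary are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumption: "[W:m-n]" stands for "any m to n characters".
_PLACEHOLDER = re.compile(r"\[\s*W:(\d+)-(\d+)\s*\]")


def _literal(text: str) -> str:
    """Escape literal text, letting each space match optional whitespace."""
    return r"\s*".join(re.escape(token) for token in text.split(" "))


def interval_to_regex(interval: str, capture_index: int = 0) -> "re.Pattern":
    """Compile a matching interval; the placeholder at capture_index is captured."""
    parts, pos = [], 0
    for index, found in enumerate(_PLACEHOLDER.finditer(interval)):
        parts.append(_literal(interval[pos:found.start()]))
        body = ".{%s,%s}?" % (found.group(1), found.group(2))
        parts.append("(" + body + ")" if index == capture_index else body)
        pos = found.end()
    parts.append(_literal(interval[pos:]))
    return re.compile("".join(parts))


@dataclass(frozen=True)
class MatchingPattern:
    interval: str           # matching interval containing the placeholders
    section: str            # title of the part to search
    level: str              # "FIRST_LEVEL_POINT" or "SECOND_LEVEL_POINT"
    attribute: str          # attribute of the extracted text message
    capture_index: int = 0  # which placeholder holds the text to extract


def extract(pattern: MatchingPattern,
            sections: Dict[Tuple[str, str], str]) -> Optional[Tuple[str, str]]:
    """Return an (attribute, text) key-value pair, or None if nothing matches."""
    text = sections.get((pattern.level, pattern.section))
    if text is None:
        return None
    found = interval_to_regex(pattern.interval, pattern.capture_index).search(text)
    if found is None:
        return None
    return pattern.attribute, found.group(1).strip()


SYSTOLIC = MatchingPattern(
    interval="blood pressure: [W:1-6]/[W:1-5] mmHg",
    section="physical examination",
    level="SECOND_LEVEL_POINT",
    attribute="systolic pressure",
)
sections = {
    ("SECOND_LEVEL_POINT", "physical examination"):
        "pulse: 70/min, blood pressure: 94/58mmHg",
}
print(extract(SYSTOLIC, sections))  # ('systolic pressure', '94')
```

Keying the sections by title level and title keeps the extraction position (which part, which interval) separate from the attribute, mirroring the two things a matching pattern specifies.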

In the present disclosure, each recognition result recognized from the document to be recognized based on the plurality of preconfigured matching patterns may include a respective confidence. The confidence may be calculated using any known or yet to be developed algorithm and is not described further herein. In addition, to ensure the accuracy of the extracted text message, a plurality of different matching patterns can be configured in advance for each attribute, so that a plurality of recognition results for the attribute can be extracted from the document to be recognized and the final recognition result can then be selected based on the corresponding confidences, which helps improve the recall and accuracy of document recognition. In the present disclosure, different paragraphs or segments of the document to be recognized may be assigned different weights, and the calculation of the confidence may be based at least in part on these weights, such that a recognition result obtained from a paragraph or segment with a higher weight has a higher confidence than one obtained from a paragraph or segment with a lower weight, which further improves recall and accuracy.
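The disclosure leaves the confidence algorithm open. Purely as an illustration of how a segment weight could enter the calculation, the sketch below multiplies a hypothetical raw score by the weight of the segment the text came from and then selects the candidate with the highest result. The multiplication, the `Candidate` class and its field names are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    attribute: str
    text: str
    base_score: float      # hypothetical raw score of the match, in [0, 1]
    segment_weight: float  # weight assigned to the paragraph or segment

    @property
    def confidence(self) -> float:
        # Assumed combination rule: higher-weighted segments score higher.
        return self.base_score * self.segment_weight


def best_candidate(candidates: Iterable[Candidate]) -> Candidate:
    """Select the final recognition result for one attribute."""
    return max(candidates, key=lambda candidate: candidate.confidence)


candidates = [
    Candidate("systolic pressure", "94", base_score=0.8, segment_weight=1.0),
    Candidate("systolic pressure", "90", base_score=0.9, segment_weight=0.5),
]
chosen = best_candidate(candidates)
```

Here the second candidate has the higher raw score but comes from a lower-weighted segment, so the first candidate is selected.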

Step 204 is described in further detail below in conjunction with fig. 3.

At step 206, it is determined whether a structured record file for the user of the document to be identified is included in the associated database.

In the present disclosure, a structured record file refers to a structured file into which text messages extracted from a document to be recognized are filled. Before any text message has been filled in, the structured record file is a standardized file that records the titles of the components (and, if a component includes sub-parts, the titles of those sub-parts) together with the names of the attributes of the content to be filled in. Thus, after a text message is extracted from the document to be recognized, it can be filled into the corresponding area of the structured record file according to its attribute.

When the document to be identified is a medical record document, the user of the identified document refers to the patient, and correspondingly, the structured record file of the user refers to the structured medical record file of the patient.

At step 208, in response to determining that the structured record file for the user is included in the database, the recognition results are populated into corresponding locations in the structured record file. In the present disclosure, the structured record file is generated based on a pre-established structured template. In addition, in the present disclosure, the attribute indicated by the matching pattern and the extracted text message are combined, in the form of a key-value pair, into the corresponding recognition result, so that the position in the structured record file into which the extracted text message needs to be filled can be determined from the corresponding attribute.

The structured template defines which components, sub-parts and attributes the structured record file includes and the associations among them, and further defines whether each component or sub-part can include a plurality of different record groupings, each of which records text messages for the plurality of attributes associated with the respective component or sub-part. Here, the associations among components, sub-parts and attributes refer to which sub-parts belong to each component and which attributes each component or sub-part includes (or corresponds to). For example, the structured template can define a structured medical record file as including components such as patient basic information, admission records, discharge records and surgical records, where the admission record component can include sub-parts such as main admission record information and physical examination, and the sub-part physical examination can include attributes such as body temperature, pulse and respiration. The structured template may also define which components or sub-parts may include a plurality of different record groupings. For example, because a patient may undergo several surgeries and the information for each surgery may be recorded in a different record grouping, a component such as "surgical record" may be defined as including a plurality of different record groupings (e.g., a first surgical record grouping, a second surgical record grouping, etc.), each of which may include the text messages for the plurality of attributes associated with that component.

For example, if the structured template defines that the "surgical record" component is associated with a first, a second, a third and a fourth attribute and can include a plurality of different record groupings, a structured record file generated based on the structured template can include a first surgical record grouping for the text messages of those four attributes relating to the patient's first surgery, a second surgical record grouping for the text messages of those four attributes relating to the patient's second surgery, and so on. By contrast, the component "patient basic information" may not include multiple different record groupings, because a patient's basic information (such as name and gender) is typically fixed. In some embodiments, components or sub-parts that may have different record groupings are marked in the structured template with corresponding markers to indicate that they may include different record groupings.
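One possible in-memory representation of such a structured template is sketched below. It is a minimal sketch only: the `Part` class, its field names and the `find_part` helper are hypothetical, and the attribute names "surgery name" and "surgery date" are invented stand-ins for the unnamed first and second attributes in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Part:
    """A component or sub-part of the structured template."""
    title: str
    attributes: List[str] = field(default_factory=list)
    sub_parts: List["Part"] = field(default_factory=list)
    # Marker indicating whether several record groupings are allowed.
    multiple_groupings: bool = False


TEMPLATE = [
    Part("patient basic information", ["name", "gender"]),
    Part("admission record", ["chief complaint"], sub_parts=[
        Part("physical examination", ["body temperature", "pulse", "respiration"]),
    ]),
    # "surgery name" and "surgery date" are hypothetical attribute names.
    Part("surgical record", ["surgery name", "surgery date"], multiple_groupings=True),
]


def find_part(parts: List[Part], attribute: str) -> Optional[Part]:
    """Return the component or sub-part that owns the given attribute."""
    for part in parts:
        if attribute in part.attributes:
            return part
        found = find_part(part.sub_parts, attribute)
        if found is not None:
            return found
    return None
```

With this layout, looking up the attribute of a recognition result yields both the location to fill and the marker that says whether several record groupings are allowed there.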

In some embodiments, the structured template may be built as follows. First, each of a plurality of documents to be recognized is analyzed to determine which parts the document includes and the attributes of the text messages that need to be recognized from each part. A structured template is then created based on the determined parts and the attributes of the text messages that need to be recognized from each part. Finally, each component or sub-part in the structured template is assigned a flag indicating whether it can include different record groupings.

Step 208 is described in further detail below in conjunction with fig. 4A and 4B.

In the present disclosure, the method 200 may further include generating a structured record file to be populated for the user based on the structured template in response to determining that the structured record file for the user is not included in the database.
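Steps 206 and 208, together with the fallback just described, amount to a get-or-create lookup. The sketch below uses a plain dictionary as a stand-in for the associated database and a small nested dictionary as a stand-in for the structured template; both, along with the function name `get_or_create_record` and the user identifier, are assumptions for illustration.

```python
import copy
from typing import Dict

# Stand-in for a structured template: empty fields for single-valued parts,
# an empty dict for a part that may hold several record groupings.
TEMPLATE = {
    "patient basic information": {"name": None, "gender": None},
    "surgical record": {},
}


def get_or_create_record(database: Dict[str, dict], user_id: str) -> dict:
    """Return the user's structured record file, generating it if missing."""
    if user_id not in database:
        # Deep copy so that filling one record never alters the template.
        database[user_id] = copy.deepcopy(TEMPLATE)
    return database[user_id]


database: Dict[str, dict] = {}
record = get_or_create_record(database, "patient-001")
record["patient basic information"]["name"] = "example name"
same_record = get_or_create_record(database, "patient-001")
```

The deep copy matters: a shallow copy would share the nested dictionaries, so populating one patient's record would silently modify the template and every later record.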

In some embodiments, the method 200 may also perform the following operations before populating the recognition results into the corresponding locations in the structured record file. First, it is determined whether the plurality of recognition results includes a plurality of first recognition results associated with the same attribute. If so, it is determined, based on the structured template, whether the component or sub-part associated with the attribute can include a plurality of different record groupings. In response to determining that it cannot, the first recognition result with the highest confidence is retained and the other first recognition results are ignored. This step excludes erroneous recognition results obtained for the same attribute and thereby helps improve recognition accuracy.

In response to determining that the component or sub-part associated with the attribute can include a plurality of different record groupings, a degree of overlap between the text messages respectively included in the plurality of first recognition results is determined. If the degree of overlap exceeds a predetermined threshold, the first recognition result with the highest confidence is retained and the other first recognition results are ignored. If the degree of overlap is less than or equal to the predetermined threshold, the plurality of first recognition results are all retained. These steps prevent multiple duplicate record groupings from being recorded in the corresponding component or sub-part.
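The filtering described above can be sketched as follows. Several choices here are assumptions rather than part of the disclosure: the degree of overlap is measured with `difflib.SequenceMatcher.ratio()` from the Python standard library, the threshold value 0.8 is arbitrary, and for more than two candidates the rule is generalized greedily (a result is kept only if it does not overlap too much with any higher-confidence result already kept). The names `Result` and `resolve` are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Result:
    attribute: str
    text: str
    confidence: float


def overlap(a: str, b: str) -> float:
    """Assumed overlap measure: similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def resolve(results: Iterable[Result], multi_group_attributes: Set[str],
            threshold: float = 0.8) -> List[Result]:
    """Drop conflicting or duplicate results before populating the record."""
    by_attribute: Dict[str, List[Result]] = {}
    for result in results:
        by_attribute.setdefault(result.attribute, []).append(result)

    kept: List[Result] = []
    for attribute, group in by_attribute.items():
        group = sorted(group, key=lambda r: r.confidence, reverse=True)
        if attribute not in multi_group_attributes:
            kept.append(group[0])  # single-valued: highest confidence only
            continue
        survivors: List[Result] = []
        for result in group:
            # Keep a result only if it is not a near-duplicate of a kept one.
            if all(overlap(result.text, s.text) <= threshold for s in survivors):
                survivors.append(result)
        kept.extend(survivors)
    return kept


results = [
    Result("systolic pressure", "94", 0.9),
    Result("systolic pressure", "9", 0.4),
    Result("surgery name", "hip osteotomy", 0.9),
    Result("surgery name", "hip osteotomy.", 0.7),
    Result("surgery name", "appendectomy", 0.6),
]
kept = resolve(results, multi_group_attributes={"surgery name"})
```

In this example the lower-confidence blood pressure reading and the near-duplicate surgery name are dropped, while the two distinct surgery names both survive as candidates for separate record groupings.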

FIG. 3 shows a flowchart of a method 300 for identifying the document to be identified based on a plurality of pre-configured matching patterns according to an embodiment of the present disclosure. The method 300 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 500 shown in FIG. 5. It should be understood that method 300 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.

In step 302, a matching pattern is executed to extract a corresponding text message from the document to be recognized based on the extraction position specified by the matching pattern.

For example, for the aforementioned matching pattern "blood pressure: [W:1-6]/[W:1-5] mmHg physical examination SECOND_LEVEL_POINT systolic pressure [W:1-6]", the corresponding text message is extracted from the extraction position specified by the matching pattern, i.e., the matching interval "blood pressure: [W:1-6]/[W:1-5] mmHg" of the sub-part "physical examination" of the document to be recognized.

As another example, for the aforementioned matching pattern "main [W:0-5] is complained of the admission record FIRST_LEVEL_POINT", the corresponding text message is extracted from the extraction position specified by the matching pattern, i.e., the matching interval "main [W:0-5]" of the component part "admission record" of the document to be recognized.

At step 304, it is determined whether the corresponding text message was extracted from the document to be recognized based on the extraction location specified by the matching pattern.

In step 306, in response to failing to extract the corresponding text message from the document to be recognized based on the extraction position specified by the matching pattern, the matching pattern is adjusted to extract the corresponding text message from the document to be recognized based on the adjusted matching pattern, wherein adjusting the matching pattern includes replacing at least one named entity included in the matching pattern with another named entity having the same meaning.

For example, if the sub-part "physical examination" of the document to be recognized includes only "BP: 97/61mmHg", a corresponding recognition result cannot be obtained from that sub-part based on the aforementioned matching pattern "blood pressure: [W:1-6]/[W:1-5] mmHg physical examination SECOND_LEVEL_POINT systolic pressure [W:1-6]". The matching pattern can therefore be adjusted by replacing the named entity "blood pressure" included in the matching pattern with the named entity "BP" having the same meaning, whereby "97" can be extracted from the above matching interval. In some embodiments, in order to improve recognition efficiency, one or more alternative matching patterns may be configured for each pre-configured matching pattern. Each alternative matching pattern replaces a named entity in the original matching pattern with a different named entity having the same meaning, so that when the corresponding recognition result cannot be obtained from the document to be recognized using the original matching pattern, recognition may be performed using the corresponding alternative matching pattern instead. In this way, the matching pattern can be adjusted automatically during structured recognition. When a plurality of different documents are processed in batches, the corresponding text messages can therefore be extracted automatically even if the text messages to be extracted are labeled with different named entities having the same meaning, without manual adjustment, which improves working efficiency.
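The fallback from an original matching pattern to its alternatives can be sketched with regular expressions. This is an illustrative sketch under stated assumptions: the `SYNONYMS` table, `build_regex`, and `extract_systolic` are hypothetical names, and the placeholders `[W:1-6]` and `[W:1-5]` are read here as one to six and one to five word characters, which the disclosure does not confirm.

```python
import re

# Hypothetical synonym table: named entity -> other named entities with the same meaning.
SYNONYMS = {"blood pressure": ["BP"]}


def build_regex(entity: str) -> re.Pattern:
    # "<entity>: <1-6 word chars>/<1-5 word chars> mmHg"; group 1 is the systolic value.
    return re.compile(re.escape(entity) + r"\s*:\s*(\w{1,6})/(\w{1,5})\s*mmHg")


def extract_systolic(section_text: str, entity: str = "blood pressure"):
    """Try the original named entity first, then each alternative in turn."""
    for name in [entity, *SYNONYMS.get(entity, [])]:
        match = build_regex(name).search(section_text)
        if match:
            return match.group(1)
    return None  # nothing extracted with the original or any alternative pattern


print(extract_systolic("BP: 97/61mmHg"))  # prints 97
```

Keeping the alternatives in a table next to the original pattern means that a batch run never has to stop for manual adjustment when a document uses a different label for the same quantity.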

In step 308, in response to the corresponding text message being extracted from the document to be recognized based on the extraction position specified by the matching pattern, the attribute indicated by the matching pattern and the extracted text message are combined, in the form of a key-value pair, into the corresponding recognition result. In this disclosure, the attribute indicated by the matching pattern is the key and the extracted text message is the value.

In the present disclosure, since the recognition result combines the attribute indicated by the corresponding matching pattern and the extracted text message in the form of a key-value pair, the position in the structured record file into which the extracted text message is to be filled can be determined from the attribute when the recognition result is filled.
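A minimal sketch of this key-value form, and of how the attribute alone can select the target location, might look as follows. The `ATTRIBUTE_LOCATION` table and both function names are hypothetical, and placing the sub-part "physical examination" under the component part "admission record" is an assumption made only for illustration.

```python
# Hypothetical template fragment: attribute -> (component part, sub-part).
ATTRIBUTE_LOCATION = {
    "systolic pressure": ("admission record", "physical examination"),
}


def to_recognition_result(attribute: str, text: str) -> dict:
    # The attribute indicated by the matching pattern is the key,
    # and the extracted text message is the value.
    return {attribute: text}


def locate(result: dict) -> tuple:
    # The attribute (the single key) is enough to find the fill position.
    attribute = next(iter(result))
    return ATTRIBUTE_LOCATION[attribute]
```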

In the present disclosure, in order to improve recall and accuracy, the plurality of recognition results finally obtained may be further checked to determine whether they include an erroneous recognition result, and in response to determining that an erroneous recognition result is included, the corresponding matching pattern is adjusted according to the error.

Fig. 4A and 4B illustrate a flow diagram of an example embodiment of a method 400 for populating recognition results to corresponding locations in a structured record file according to an embodiment of the present disclosure. The method 400 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 500 shown in FIG. 5. It should be understood that method 400 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.

At step 402, based on the structured template, it is determined whether the component parts or sub-parts associated with the attributes included in the recognition result can include a plurality of different groupings of records.

As previously mentioned, the structured template defines whether each component part or sub-part can include a plurality of different groupings of records, each for recording text messages relating to a plurality of attributes associated with the respective component part or sub-part. Thus, based on the structured template, it can be determined whether the component part or sub-part associated with the attribute included in the recognition result can include a plurality of different groupings of records.

In one aspect, at step 404, in response to determining in step 402 that a plurality of different groupings of records can be included in the component part or sub-part associated with the attribute, a determination is made as to whether the grouping of records associated with the recognition result has been included in the component part or sub-part associated with the attribute in the structured record file.

At step 406, the recognition result is populated into the record group in response to determining that the record group associated with the recognition result is already included in the component or sub-component of the structured record file associated with the attribute.

At step 408, in response to determining that the record grouping associated with the recognition result is not included in the structured record file, a corresponding record grouping to be filled is added to the component part or sub-part of the structured record file associated with the attribute, and the recognition result is filled into that record grouping.

In particular, the structured record file to be populated that is generated for the user based on the structured template may include only one record grouping to be populated for each component part or sub-part. Therefore, if the structured template defines that a component part may include a plurality of different record groupings, a new record grouping to be populated may be inserted into the corresponding structured record file when a text message that should belong to a different record grouping is extracted for that component part.

Through the technical means in steps 404 to 408 above, text messages associated with attributes whose content may have a plurality of different versions in the document to be identified can be extracted and recorded. For example, the medical record document to be identified may include text messages of a plurality of procedures that occurred at different times, and the text message of each of the procedures may be automatically extracted from the medical record document and recorded through steps 404 to 408.

In another aspect, at step 410, in response to determining in step 402 that a plurality of different groupings of records cannot be included in the component part or sub-part associated with the attribute included in the recognition result, a determination is made as to whether the structured record file has been populated with records relating to the attribute.

In step 412, in response to determining that the structured record file has not been populated with records for the attribute, the recognition result is populated into the corresponding component or sub-portion of the structured record file.

At step 414, the recognition result is ignored in response to determining that the structured record file has been populated with records for the attribute.

Through the technical solution in steps 410 to 414 above, text messages in the document to be identified that are associated with attributes whose content can have only one version can be extracted and recorded.
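The branching of method 400 (steps 402 to 414) can be sketched as a single function. This is an illustrative sketch only: the in-memory layout of the structured record file, the `populate` function, and the `group_key` argument (for example, a procedure date used to decide which record grouping a result belongs to) are assumptions, since the disclosure does not state how a recognition result is associated with a record grouping.

```python
def populate(record, multi_group_parts, part, attribute, text, group_key=None):
    """Fill one recognition result into `record`; return False if it is ignored.

    record: {part name: [{"key": ..., "fields": {attribute: text}}, ...]}
    multi_group_parts: parts the template allows to hold several groupings.
    """
    # A freshly generated file holds one blank grouping per part.
    groups = record.setdefault(part, [{"key": None, "fields": {}}])

    if part in multi_group_parts:                       # step 402: several allowed
        # Step 404: is the grouping for this result already present?
        group = next((g for g in groups if g["key"] == group_key), None)
        if group is None:
            # Step 408: reuse the blank grouping if it is still unused, else add one.
            group = next(
                (g for g in groups if g["key"] is None and not g["fields"]), None
            )
            if group is None:
                group = {"key": None, "fields": {}}
                groups.append(group)
            group["key"] = group_key
        group["fields"][attribute] = text               # step 406: fill the grouping
        return True

    fields = groups[0]["fields"]                        # only one version allowed
    if attribute in fields:                             # step 410: already populated?
        return False                                    # step 414: ignore the result
    fields[attribute] = text                            # step 412: fill it in
    return True
```

Returning a boolean lets a caller log which results were ignored at step 414, which may be useful when checking recognition results afterwards.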

FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. For example, the computing device 110 as shown in fig. 1 may be implemented by the electronic device 500. As shown, electronic device 500 includes a Central Processing Unit (CPU) 501 that may perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the random access memory 503, various programs and data necessary for the operation of the electronic apparatus 500 can also be stored. The central processing unit 501, the read only memory 502 and the random access memory 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

A plurality of components in the electronic device 500 are connected to the input/output interface 505, including: an input unit 506 such as a keyboard, a mouse, a microphone, and the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The various processes described above, such as methods 200 to 400, may be performed by the central processing unit 501. For example, in some embodiments, methods 200 to 400 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the read only memory 502 and/or the communication unit 509. When the computer program is loaded into the random access memory 503 and executed by the central processing unit 501, one or more of the actions of methods 200 to 400 described above may be performed.

The present disclosure relates to methods, apparatuses, systems, electronic devices, computer-readable storage media and/or computer program products. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge computing devices. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for structured recognition of documents, comprising:

acquiring one or more documents to be identified;

identifying the document to be identified based on a plurality of pre-configured matching patterns to obtain a plurality of identification results about the document to be identified, wherein each matching pattern in the plurality of matching patterns specifies an extraction position and an attribute of a corresponding text message to be extracted, and each identification result in the plurality of identification results comprises the text message extracted from the document to be identified based on the corresponding matching pattern and the associated attribute;

determining whether a structured record file of a user about the document to be identified is included in an associated database; and

in response to determining that a structured record file about the user is included in the database, populating corresponding locations in the structured record file with the recognition results, wherein the structured record file is generated based on a pre-established structured template;

the identification of the document to be identified based on a plurality of pre-configured matching patterns comprises:

executing the matching pattern to extract a corresponding text message from the document to be recognized based on the extraction position specified by the matching pattern;

in response to failing to extract a corresponding text message from the document to be recognized based on the extraction position specified by the matching pattern, adjusting the matching pattern to extract the corresponding text message from the document to be recognized based on the adjusted matching pattern, wherein adjusting the matching pattern comprises replacing at least one named entity included in the matching pattern with another named entity having the same meaning.

2. The method of claim 1, wherein identifying the document to be identified based on a preconfigured plurality of matching patterns further comprises:

in response to the corresponding text message being extracted from the document to be recognized based on the extraction position specified by the matching pattern, combining the attribute indicated by the matching pattern and the extracted text message, in the form of a key-value pair, into the corresponding recognition result.

3. The method of claim 1 wherein the structured template defines which component parts, sub-parts and attributes a structured record file includes and associations between the component parts, sub-parts and attributes, and the structured template further defines whether each component part or sub-part can include a plurality of different groupings of records, each grouping of records for recording text messages pertaining to a plurality of attributes associated with the respective component part or sub-part.

4. The method of claim 1, further comprising:

in response to determining that the structured record file for the user is not included in the database, generating a structured record file to be populated for the user based on the structured template.

5. The method of claim 3, wherein each recognition result further includes a respective confidence level, and prior to populating the recognition result into the respective location in the structured record file, the method further comprises:

determining whether a plurality of first recognition results associated with the same attribute are included in the plurality of recognition results;

in response to determining that the plurality of first recognition results are included in the plurality of recognition results, determining whether a plurality of different groupings of records can be included in the component part or sub-part associated with the attribute based on the structured template;

in response to determining that the component part or sub-part associated with the attribute cannot include a plurality of different groupings of records, retaining a first recognition result of the plurality of first recognition results having a highest confidence and ignoring other first recognition results;

in response to determining that the component part or sub-part associated with the attribute can include a plurality of different groupings of records, determining a degree of overlap between text messages respectively included by the plurality of first recognition results;

in response to determining that the degree of overlap between text messages respectively included in the plurality of first recognition results exceeds a predetermined threshold, retaining a first recognition result having a highest confidence among the plurality of first recognition results and ignoring the other first recognition results; and

in response to determining that the degree of overlap between text messages respectively included in the plurality of first recognition results is less than or equal to the predetermined threshold, retaining the plurality of first recognition results.

6. The method of claim 3, wherein populating the recognition results to respective locations in the structured record file comprises:

determining, based on the structured template, whether a plurality of different groupings of records can be included in a component part or sub-part associated with an attribute included in the recognition result;

in response to determining that a plurality of different groupings of records can be included in the component part or sub-part associated with the attribute, determining whether a grouping of records associated with the identification result has been included in the component part or sub-part associated with the attribute in the structured record file;

in response to determining that the component or sub-component of the structured record file associated with the attribute already includes a record grouping associated with the recognition result, populating the recognition result into the record grouping; and

in response to determining that the record grouping associated with the identification result is not included in the structured record, adding the corresponding record grouping to be filled in the component part or sub-part associated with the attribute in the structured record file to fill the identification result into the record grouping.

7. The method of claim 6, wherein populating the recognition results into respective locations in the structured record file further comprises:

in response to determining that a plurality of different groupings of records cannot be included in the component or sub-portion associated with the attribute included in the recognition result, determining whether a record for the attribute has been populated in the structured record file;

in response to determining that a record for the attribute has not been populated in the structured record file, populating the identification result into a corresponding component or sub-portion of the structured record file; and

in response to determining that the structured record file has been populated with records for the attribute, ignoring the recognition result.

8. The method of claim 1, wherein the matching pattern is implemented using a regular expression.

9. The method of claim 3, wherein an extraction location of a text message to be extracted indicated by each matching pattern of the plurality of matching patterns comprises:

the part of the document to be recognized in which the text message to be extracted is located, or the matching interval, within a part of the document to be recognized, in which the text message to be extracted is located.

10. A computing device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor;

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.