CN114170605A - Information extraction method and device for a clinical trial protocol - Google Patents


Info

Publication number
CN114170605A
CN114170605A (application CN202111500948.XA)
Authority
CN
China
Prior art keywords
information
picture
recognition
identifying
ocr
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111500948.XA
Other languages
Chinese (zh)
Inventor
赵洪杰 (Zhao Hongjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Miaoyi Biotechnology Co ltd
Original Assignee
Shanghai Miaoyi Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Miaoyi Biotechnology Co ltd filed Critical Shanghai Miaoyi Biotechnology Co ltd
Priority to CN202111500948.XA
Publication of CN114170605A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to an information extraction method and device for clinical trial protocols. The method comprises: performing layout analysis on the clinical trial protocol document to identify the positions of headers, footers, body text, tables and pictures; discarding the header and footer regions according to the position information, identifying the frame structure of the table regions, and performing OCR (optical character recognition) on the text content of the body and table regions; integrating the information obtained from OCR recognition with the pictures; and extracting information from the integrated content and outputting the resulting key information. The method and device solve the problem of low accuracy and efficiency of clinical trial protocol interpretation in the related art and achieve the technical effect of improving both.

Description

Information extraction method and device for a clinical trial protocol
Technical Field
The invention relates to the technical field of clinical trial protocol research, and in particular to an information extraction method, an information extraction device, a computer device and a computer-readable storage medium for clinical trial protocols.
Background
At present, a clinical trial protocol must be read and understood by experienced medical professionals, submitted to the national regulatory authority for approval, and only then started at the trial centers. Checking the information in a protocol therefore demands considerable expertise and effort, which makes interpretation of clinical trial protocols slow and error-prone.
No effective solution has yet been proposed for the problem of low accuracy and efficiency of clinical trial protocol interpretation in the related art.
Disclosure of Invention
The present application aims to overcome the above defects in the prior art by providing an information extraction method, an information extraction device, a computer device and a computer-readable storage medium for clinical trial protocols, so as to solve at least the problems of low accuracy and efficiency of clinical trial protocol interpretation in the related art.
To this end, the technical solution adopted by the application is as follows:
In a first aspect, an embodiment of the present application provides an information extraction method for a clinical trial protocol, comprising:
performing layout analysis on the clinical trial protocol document to identify the positions of headers, footers, body text, tables and pictures;
discarding the header and footer regions according to the position information, identifying the frame structure of the table regions, and performing OCR recognition on the text content of the body and table regions;
integrating the information obtained from OCR recognition with the pictures;
extracting information from the integrated content and outputting the resulting key information.
In some embodiments, performing layout analysis on the clinical trial protocol document and identifying the positions of headers, footers, body text, tables and pictures comprises:
predicting on the clinical trial protocol document with a trained Faster R-CNN network to identify the positions of headers, footers, body text, tables and pictures.
In some of these embodiments, identifying the frame structure of the table regions comprises:
segmenting table lines in the table regions with a Unet-based image segmentation model;
extracting connected regions from the table-line segmentation result to obtain the row and column information of the table.
In some embodiments, performing OCR recognition on the text content of the body and table regions comprises:
feeding the recognition results obtained by OCR on the body and table text into a BERT language model to obtain several candidate true values;
computing the glyph similarity between each candidate true value and the recognition result;
taking the candidate true value with the highest glyph similarity to the recognition result as the final OCR result.
In some embodiments, extracting information from the integrated content comprises:
using a trigger-word extraction model to identify, according to keywords, the positions of trigger words and the corresponding event types from the integrated content;
using an argument recognition model to identify the arguments in each event and the corresponding argument roles.
In some embodiments, outputting the resulting key information comprises:
judging, according to the information extraction result, whether a picture should be added to the key information before output;
if the information extraction result indicates that a picture needs to be added to the key information, adding the picture to the key information and outputting it.
In some of these embodiments, performing layout analysis on the clinical trial protocol document comprises:
splitting the clinical trial protocol document into pages and saving each page as a picture;
performing layout analysis on the pictures.
In a second aspect, an embodiment of the present application provides an information extraction apparatus for a clinical trial protocol, comprising:
a layout analysis unit, configured to perform layout analysis on the clinical trial protocol document and identify the positions of headers, footers, body text, tables and pictures;
a processing unit, configured to discard the header and footer regions according to the position information, identify the frame structure of the table regions, and perform OCR recognition on the text content of the body and table regions;
an integration unit, configured to integrate the information obtained from OCR recognition with the pictures;
an output unit, configured to extract information from the integrated content and output the resulting key information.
In a third aspect, an embodiment of the present application provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method of the first aspect.
Compared with the prior art, the information extraction method for clinical trial protocols provided by the embodiments of the present application performs layout analysis on the clinical trial protocol document to identify the positions of headers, footers, body text, tables and pictures; discards the header and footer regions according to the position information, identifies the frame structure of the table regions, and performs OCR recognition on the text content of the body and table regions; integrates the OCR results with the pictures; and extracts information from the integrated content and outputs the resulting key information. This solves the problems of low accuracy and efficiency of clinical trial protocol interpretation in the related art and achieves the technical effect of improving both.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a block diagram of a mobile terminal according to an embodiment of the present application;
FIG. 2 is a flowchart of an information extraction method for a clinical trial protocol according to an embodiment of the present application;
FIG. 3 is a schematic illustration of the intelligent clinical trial protocol interpretation flow according to a preferred embodiment of the present application;
FIG. 4 is a block diagram of an information extraction apparatus for a clinical trial protocol according to an embodiment of the present application;
fig. 5 is a hardware structure diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a," "an," "the," and similar words in this application do not denote a limitation of quantity and may refer to the singular or the plural. The terms "including," "comprising," "having," and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The words "connected," "coupled," and the like in this application are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects. The terms "first," "second," "third," and the like merely distinguish similar objects and do not denote a particular ordering.
The embodiment provides a mobile terminal. Fig. 1 is a block diagram of a mobile terminal according to an embodiment of the present application. As shown in fig. 1, the mobile terminal includes: a Radio Frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (WiFi) module 170, a processor 180, and a power supply 190. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 1 is not intended to be limiting of mobile terminals and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each constituent element of the mobile terminal in detail with reference to fig. 1:
the RF circuit 110 may be used for receiving and transmitting signals during information transmission and reception or during a call, and in particular, receives downlink information of a base station and then processes the received downlink information to the processor 180; in addition, the data for designing uplink is transmitted to the base station. In general, RF circuits include, but are not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 110 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 120 may be used to store software programs and modules, and the processor 180 executes the various functional applications and data processing of the mobile terminal by running the software programs and modules stored in the memory 120. The memory 120 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the mobile terminal (such as audio data, a phonebook, etc.), and the like. Further, the memory 120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 130 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile terminal. Specifically, the input unit 130 may include a touch panel 131 and other input devices 132. The touch panel 131, also referred to as a touch screen, may collect touch operations of a user on or near the touch panel 131 (e.g., operations of the user on or near the touch panel 131 using any suitable object or accessory such as a finger or a stylus pen), and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 131 may include two parts, i.e., a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 180, and can receive and execute commands sent by the processor 180. In addition, the touch panel 131 may be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 130 may include other input devices 132 in addition to the touch panel 131. In particular, other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 140 may be used to display information input by a user or information provided to the user and various menus of the mobile terminal. The Display unit 140 may include a Display panel 141, and optionally, the Display panel 141 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 131 can cover the display panel 141, and when the touch panel 131 detects a touch operation on or near the touch panel 131, the touch operation is transmitted to the processor 180 to determine the type of the touch event, and then the processor 180 provides a corresponding visual output on the display panel 141 according to the type of the touch event. Although the touch panel 131 and the display panel 141 are shown in fig. 1 as two separate components to implement the input and output functions of the mobile terminal, in some embodiments, the touch panel 131 and the display panel 141 may be integrated to implement the input and output functions of the mobile terminal.
The mobile terminal may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 141 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 141 and/or the backlight when the mobile terminal is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the mobile terminal, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile terminal, further description is omitted here.
A speaker 161 and a microphone 162 in the audio circuit 160 may provide an audio interface between the user and the mobile terminal. The audio circuit 160 may transmit the electrical signal converted from the received audio data to the speaker 161, and convert the electrical signal into a sound signal for output by the speaker 161; on the other hand, the microphone 162 converts the collected sound signal into an electric signal, converts the electric signal into audio data after being received by the audio circuit 160, and then outputs the audio data to the processor 180 for processing, and then transmits the audio data to, for example, another mobile terminal via the RF circuit 110, or outputs the audio data to the memory 120 for further processing.
WiFi is a short-range wireless transmission technology; through the WiFi module 170, the mobile terminal can help a user send and receive e-mails, browse webpages, access streaming media and the like, providing wireless broadband internet access. Although fig. 1 shows the WiFi module 170, it is not an essential component of the mobile terminal and may be omitted or replaced with another short-range wireless transmission module, such as a Zigbee module or a WAPI module, as required, without changing the essence of the invention.
The processor 180 is a control center of the mobile terminal, connects various parts of the entire mobile terminal using various interfaces and lines, and performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 120 and calling data stored in the memory 120, thereby performing overall monitoring of the mobile terminal. Alternatively, processor 180 may include one or more processing units; preferably, the processor 180 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 180.
The mobile terminal also includes a power supply 190 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 180 via a power management system that may be configured to manage charging, discharging, and power consumption.
Although not shown, the mobile terminal may further include a camera, a bluetooth module, and the like, which will not be described herein.
In this embodiment, the processor 180 is configured to:
perform layout analysis on the clinical trial protocol document to identify the positions of headers, footers, body text, tables and pictures;
discard the header and footer regions according to the position information, identify the frame structure of the table regions, and perform OCR recognition on the text content of the body and table regions;
integrate the information obtained from OCR recognition with the pictures;
extract information from the integrated content and output the resulting key information.
In some of these embodiments, the processor 180 is further configured to:
predict on the clinical trial protocol document with a trained Faster R-CNN network to identify the positions of headers, footers, body text, tables and pictures.
In some of these embodiments, the processor 180 is further configured to:
segment table lines in the table regions with a Unet-based image segmentation model;
extract connected regions from the table-line segmentation result to obtain the row and column information of the table.
In some of these embodiments, the processor 180 is further configured to:
feed the recognition results obtained by OCR on the body and table text into a BERT language model to obtain several candidate true values;
compute the glyph similarity between each candidate true value and the recognition result;
take the candidate true value with the highest glyph similarity to the recognition result as the final OCR result.
In some of these embodiments, the processor 180 is further configured to:
use a trigger-word extraction model to identify, according to keywords, the positions of trigger words and the corresponding event types from the integrated content;
use an argument recognition model to identify the arguments in each event and the corresponding argument roles.
In some of these embodiments, the processor 180 is further configured to:
judge, according to the information extraction result, whether a picture should be added to the key information before output;
if the information extraction result indicates that a picture needs to be added to the key information, add the picture to the key information and output it.
In some of these embodiments, the processor 180 is further configured to:
split the clinical trial protocol document into pages and save each page as a picture;
perform layout analysis on the pictures.
This embodiment provides an information extraction method for a clinical trial protocol. Fig. 2 is a flowchart of an information extraction method for a clinical trial protocol according to an embodiment of the present application. As shown in fig. 2, the flow comprises the following steps:
Step S201: perform layout analysis on the clinical trial protocol document and identify the positions of headers, footers, body text, tables and pictures;
Step S202: discard the header and footer regions according to the position information, identify the frame structure of the table regions, and perform OCR recognition on the text content of the body and table regions;
Step S203: integrate the information obtained from OCR recognition with the pictures;
Step S204: extract information from the integrated content and output the resulting key information.
Through these steps, the problems of low accuracy and efficiency of clinical trial protocol interpretation in the related art are solved, and the technical effect of improving both the accuracy and the efficiency of clinical trial protocol interpretation is achieved.
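Steps S201 to S204 can be sketched as a small pipeline. The patent gives no code, so this is a minimal illustration only: the region-dict format and the injected callables (`layout_model`, `table_model`, `ocr`, `extractor`) are assumptions standing in for the trained models described below.

```python
def extract_protocol_info(page_images, layout_model, table_model, ocr, extractor):
    """Sketch of steps S201-S204. Each callable stands in for one of the
    trained models in the embodiment; the region format is an assumption."""
    integrated = []
    for img in page_images:
        for region in layout_model(img):              # S201: layout analysis
            if region["type"] in ("header", "footer"):
                continue                              # S202: discard headers/footers
            if region["type"] == "picture":
                integrated.append(region)             # pictures kept for integration
            elif region["type"] == "table":
                integrated.append({"type": "table",
                                   "structure": table_model(region),  # frame structure
                                   "text": ocr(region)})              # S202: OCR
            else:
                integrated.append({"type": "text", "text": ocr(region)})  # S203
    return extractor(integrated)                      # S204: key-information extraction
```

With trivial stand-in callables, headers are dropped while text and picture regions flow through to the extractor.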
In some of these embodiments, the layout analysis of the clinical trial protocol document may comprise: splitting the clinical trial protocol document into pages, saving each page as a picture, and performing layout analysis on the pictures.
In some embodiments, performing layout analysis on the clinical trial protocol document and identifying the positions of headers, footers, body text, tables and pictures comprises: predicting on the clinical trial protocol document with a trained Faster R-CNN network to identify the positions of headers, footers, body text, tables and pictures.
For clinical trial protocol documents, pure OCR alone is not sufficient: invalid header and footer information in the document must be filtered out, tables in the document must be reconstructed, and some images must be preserved, all of which require layout analysis. This embodiment therefore trains a Faster R-CNN network to perform layout analysis on the clinical trial protocol document and detect the body text, table, header, footer, picture and other regions.
The Faster R-CNN network is trained with resnet50 as its backbone feature-extraction network, and the training objective is to regress the regions of body text, tables, headers, footers and so on. The trained Faster R-CNN network is then used to predict on the input document, yielding the positions of body text, tables, headers, footers and pictures.
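The detections produced by such a network must still be filtered and ordered before OCR. The patent does not specify this step; the sketch below assumes a `(label, box, score)` detection format and a score threshold, both of which are illustrative choices rather than part of the disclosure.

```python
def filter_layout(detections, score_thresh=0.5):
    """Keep confident body-text, table and picture detections and order them
    top-to-bottom, left-to-right; header and footer regions are dropped as in
    step S202.  Each detection is assumed to be (label, (x0, y0, x1, y1), score)."""
    kept = [(label, box) for label, box, score in detections
            if score >= score_thresh and label not in ("header", "footer")]
    # Reading order: primarily by top edge, then by left edge.
    kept.sort(key=lambda d: (d[1][1], d[1][0]))
    return kept
```

A low-confidence detection and the header are discarded, and the remaining regions come back in reading order.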
In some of these embodiments, identifying the frame structure of the table regions comprises:
segmenting table lines in the table regions with a Unet-based image segmentation model;
extracting connected regions from the table-line segmentation result to obtain the row and column information of the table.
The embodiment of the present application provides a scheme based on deep image segmentation: image segmentation is applied to the picture of the table in order to label the table lines, the segmented picture is then analysed geometrically and connected regions are extracted, thereby restoring the frame structure of the table.
In the Unet-based image segmentation model of this embodiment, table lines are divided into four types: ruled (visible) horizontal lines, unruled (invisible) horizontal lines, ruled vertical lines and unruled vertical lines. The Unet model segments the table lines in the table image, connectivity analysis is performed on the segmentation result, and the row and column information of the table is finally obtained.
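The connectivity analysis on the segmentation result can be illustrated with a plain flood fill: given a binary mask in which 1 marks a segmented table-line pixel, every connected 0-region enclosed by lines is one cell, and the distinct cell origins give the row/column counts. This is a stdlib sketch of the idea, not the patent's implementation.

```python
from collections import deque

def connected_cells(mask):
    """Extract connected 0-regions (cell interiors) from a binary table-line
    mask; each region's bounding box is returned as (x0, y0, x1, y1)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    cells = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] == 0 and not seen[sy][sx]:
                q = deque([(sy, sx)])
                seen[sy][sx] = True
                x0 = x1 = sx
                y0 = y1 = sy
                while q:  # 4-connected flood fill over cell-interior pixels
                    y, x = q.popleft()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 0 \
                                and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                cells.append((x0, y0, x1, y1))
    return cells

def grid_shape(cells):
    """Infer the table's (rows, cols) from the cells' top-left corners."""
    return len({y0 for _, y0, _, _ in cells}), len({x0 for x0, _, _, _ in cells})
```

For a mask with lines on rows 0, 3, 6 and columns 0, 3, 6, four interior regions are found and the grid is recovered as 2 rows by 2 columns.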
In some embodiments, OCR recognizing the text content of the text portion and the table portion includes:
putting recognition results obtained by performing OCR recognition on the text contents of the text part and the table part into a bert language model to obtain a plurality of true values;
calculating the font similarity between the real values and the recognition result;
and taking the real value with the highest font similarity with the recognition result as a final OCR recognition result.
In the intelligent reading task of the clinical test scheme, the method has higher requirements on the recognition accuracy of characters, different parameters need to be adopted according to the result of layout analysis, and the accuracy of recognition of the forms and the normal text content in the document can be improved. The OCR algorithm in the embodiment of the application adopts a two-stage algorithm of DB + CRNN, the DB algorithm is responsible for text detection, and the CRNN algorithm is responsible for character content recognition. The OCR algorithm in the embodiment of the application adopts a DB + CRNN two-stage algorithm, the improvement point is that in the detection and recognition of the scheme document, longer characters need to be detected and recognized, and in the selection of model parameters, larger character recognition length limitation is selected.
In OCR (Optical Character Recognition), unclear characters often cause recognition errors. The embodiment of the present application designs a bert-based character correction method that combines a language model with a glyph-similarity measure applied to the OCR result. The language model provides the most likely true values of the current recognition result, and the glyph-similarity measure provides the probability that each true value was recognized as the current result; on this basis, the accuracy of OCR recognition is further improved. In the embodiment of the present application, the recognition result is first input into the bert language model to obtain several most likely true values; the glyph similarity is then calculated to give the likelihood that each true value was recognized as the current result, and the most likely result is selected as the corrected result.
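The correction step can be sketched as a re-ranking loop. Since the patent does not specify the implementations of the bert language model or the glyph-similarity measure, both are passed in as functions and stubbed with toy stand-ins below:

```python
def correct_ocr(recognized, lm_candidates, glyph_sim):
    """Pick the candidate true value with the best combined score.

    lm_candidates(recognized) -> [(candidate, lm_probability), ...]
    glyph_sim(a, b)           -> likelihood that `a` was misread as `b`
    """
    best, best_score = recognized, 0.0
    for cand, lm_prob in lm_candidates(recognized):
        score = lm_prob * glyph_sim(cand, recognized)   # LM prior x misread likelihood
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy stand-ins for the bert model and the glyph metric (illustrative only):
toy_lm = lambda w: [("clinical", 0.6), ("chemical", 0.4)]
toy_sim = lambda a, b: 0.9 if a[0] == b[0] else 0.1     # crude shape proxy
corrected = correct_ocr("cl1nical", toy_lm, toy_sim)    # -> "clinical"
```

In the patent's setting `lm_candidates` would be a masked-language-model query and `glyph_sim` a similarity over character shapes; the re-ranking logic itself stays the same.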
In some embodiments, the extracting information from the integrated information comprises:
identifying the position of the trigger word and the corresponding event type from the integrated information according to the keyword by using a trigger word extraction model;
and identifying, by using an argument recognition model, the arguments in the event and the corresponding argument roles.
In the intelligent interpretation of clinical trial protocols, the important information that deserves attention needs to be extracted. The embodiment of the present application provides an information-extraction scheme for clinical trial protocols: given natural-language sentences and paragraphs together with a predefined set of required information, all sentences satisfying the constraints are extracted and a final result is generated.
The embodiment of the present application provides an information-extraction scheme for clinical trial protocols; the data set used is a sentence-level extraction data set. The scheme comprises a trigger-word extraction model and an argument recognition model. First, the position of the trigger word and the corresponding event type are identified according to keywords; the argument recognition model then identifies the arguments in the event and their corresponding argument roles. For example, with the trigger word "study drug" and the event type "study drug type", the argument roles are: subject, sample size, and primary research objective, and the arguments are the corresponding specific information, such as: study subject: penicillin; sample size: at least 50 evaluable cases; primary research objective: activity against gram-positive cocci.
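The two-stage extraction can be illustrated with the example above. The keyword matcher below stands in for the learned trigger-word and argument models; the trigger and role names come from the example in the text, while the matching logic is a hypothetical sketch:

```python
TRIGGER_TYPES = {"study drug": "study drug type"}       # trigger word -> event type
ROLE_MARKERS = {"study subject": "subject",
                "sample size": "sample size",
                "primary research objective": "primary objective"}

def extract_event(text):
    """Stage 1: find the trigger and event type; stage 2: fill argument roles."""
    event, low = None, text.lower()
    for trig, etype in TRIGGER_TYPES.items():
        pos = low.find(trig)
        if pos >= 0:
            event = {"trigger": trig, "type": etype, "arguments": {}}
            break
    if event:
        for marker, role in ROLE_MARKERS.items():
            pos = low.find(marker + ":")
            if pos >= 0:
                tail = text[pos + len(marker) + 1:]
                event["arguments"][role] = tail.split(".")[0].strip()
    return event

e = extract_event("The study drug is given. Study subject: penicillin. "
                  "Sample size: at least 50 evaluable cases.")
```

A learned model would replace the dictionary lookups with sequence labeling, but the two-stage output structure (event type first, then role-to-argument mapping) is the same.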
In some embodiments, outputting the obtained key information includes:
judging, according to an information extraction result, whether the picture is to be added to the key information for output;
and if the information extraction result indicates that the picture needs to be added to the key information, adding the picture to the key information and outputting it.
According to the embodiment of the present application, the clinical trial protocol is intelligently interpreted by using methods such as OCR recognition and information extraction; the content of the clinical trial protocol can be effectively extracted, the subsequent center-initiation process can be docked and started directly and automatically, and the labor cost is greatly reduced.
Intelligent interpretation of a clinical trial protocol requires solving multiple processing steps after the document is input, connecting multiple learning tasks upstream and downstream.
(1) OCR recognition
OCR technology analyzes and recognizes image files of text data to acquire character information, recognizing the characters in a picture and returning them in text form. OCR is widely used in various industries; in the intelligent interpretation task of clinical trial protocols, OCR technology is first required to recognize the corresponding text information.
(2) Document layout analysis
Document layout analysis refers to outputting the position information of pictures, tables, titles and body text in a document picture. Combining the layout-analysis result with the document OCR yields a better formatting result and provides more standardized input for downstream natural-language tasks.
(3) Information extraction
Information extraction aims at extracting structured knowledge, such as entities, relationships, events, etc., from unstructured natural language text. The goal of event extraction is to identify events of a target event type in a sentence for a given natural language sentence, based on a pre-specified event type and role.
For interpretation of a clinical trial protocol, an embodiment of the present application provides an intelligent interpretation scheme whose specific process, as shown in fig. 3, may include: inputting the PDF clinical trial document and splitting it into per-page pictures; performing layout analysis to identify the positions of headers, footers, body text, tables and pictures; discarding the headers and footers; performing table recognition on the tables; performing OCR recognition and OCR text correction on the tables and body text; integrating the pictures with the OCR recognition and correction results; and finally performing information extraction and outputting the key information. The embodiment of the present application improves the reading efficiency of clinical trial protocols, reduces labor cost, can automatically dock the center-initiation process according to the information-extraction result, and realizes automation of the workflow.
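The process of fig. 3 can be sketched as a plain pipeline. Every stage below is a stub standing in for the corresponding model (layout analysis, table recognition, OCR, correction, extraction), and all function names are illustrative:

```python
def interpret_protocol(pdf_pages):
    """End-to-end flow: layout analysis -> per-region handling -> extraction."""
    key_info = []
    for page_picture in pdf_pages:                       # one picture per page
        regions = layout_analysis(page_picture)          # FasterRCNN stand-in
        for region in regions:
            if region["kind"] in ("header", "footer"):
                continue                                 # discarded outright
            if region["kind"] == "picture":
                key_info.append(region)                  # integrated as-is
            elif region["kind"] == "table":
                region["grid"] = table_recognition(region)
                region["text"] = correct(ocr(region))
                key_info.append(region)
            else:                                        # body text
                region["text"] = correct(ocr(region))
                key_info.append(region)
    return extract_information(key_info)                 # trigger + argument models

# Trivial stubs so the sketch runs end to end:
layout_analysis = lambda pic: [{"kind": "header"}, {"kind": "text", "raw": pic}]
table_recognition = lambda r: (0, 0)
ocr = lambda r: r.get("raw", "")
correct = lambda t: t
extract_information = lambda info: info

result = interpret_protocol(["page-1"])
```

The control flow mirrors the description: headers and footers are dropped before any OCR is run, tables get the extra frame-structure step, and pictures bypass OCR entirely and are merged back in at the integration stage.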
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The present embodiment provides an information extraction apparatus for clinical trial protocols, which is used to implement the foregoing embodiments and preferred implementations; what has already been described will not be repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a structure of an information extraction apparatus of a clinical trial scenario according to an embodiment of the present application, which includes, as shown in fig. 4:
a layout analysis unit 41, configured to perform layout analysis on the clinical test scenario document, and identify position information of headers, footers, texts, tables, and pictures;
a processing unit 42, configured to discard header and footer portions according to the position information, identify a frame body structure of the table portion, and perform OCR identification on text portions and character contents of the table portion;
an integrating unit 43, configured to perform information integration on the information obtained after the OCR recognition and the picture;
and the output unit 44 is used for extracting information from the integrated information and outputting the obtained key information.
In some of the embodiments, the layout analysis unit 41 includes:
and the first identification module is used for predicting the clinical test scheme document through the trained FasterRCNN network and identifying the position information of headers, footers, texts, tables and pictures.
In some of these embodiments, the processing unit 42 includes:
the segmentation module is used for carrying out table line segmentation on the table part through an image segmentation model of Unet;
and the extraction module is used for extracting the connected region according to the table line segmentation result to obtain the row and column information of the table.
In some of these embodiments, the processing unit 42 includes:
the OCR recognition module is used for putting recognition results obtained by performing OCR recognition on the text contents of the text part and the table part into a bert language model to obtain a plurality of true values;
the calculation module is used for calculating the glyph similarity between the true values and the recognition result;
and the determining module is used for taking the true value with the highest glyph similarity to the recognition result as the final OCR recognition result.
In some of these embodiments, the output unit 44 includes:
the second identification module is used for identifying the position of the trigger word and the corresponding event type from the integrated information according to the keyword by using a trigger word extraction model;
and the third identification module is used for identifying the argument and the corresponding argument role in the event by using the argument identification model.
In some of these embodiments, the output unit 44 includes:
the judging module is used for judging, according to an information extraction result, whether the picture is to be added to the key information for output;
and the adding module is used for adding the picture to the key information and outputting it if the information extraction result indicates that the picture needs to be added to the key information.
In some of the embodiments, the layout analysis unit 41 includes:
the paging module is used for paging the clinical test scheme documents and storing each page as a picture;
and the analysis module is used for carrying out layout analysis on the picture.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
An embodiment provides a computer device. The information extraction method for clinical trial protocols according to the embodiments of the present application can be implemented by a computer device. Fig. 5 is a hardware structure diagram of a computer device according to an embodiment of the present application.
The computer device may comprise a processor 51 and a memory 52 in which computer program instructions are stored.
Specifically, the processor 51 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 52 may include, among other things, mass storage for data or instructions. By way of example and not limitation, memory 52 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 52 may include removable or non-removable (or fixed) media, where appropriate. The memory 52 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 52 is a non-volatile memory. In particular embodiments, memory 52 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Out DRAM (EDO DRAM), Synchronous DRAM (SDRAM), and the like.
The memory 52 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions executed by the processor 51.
The processor 51 may be configured to implement the information extraction method of any one of the clinical trial scenarios described in the above embodiments by reading and executing computer program instructions stored in the memory 52.
In some of these embodiments, the computer device may also include a communication interface 53 and a bus 50. As shown in fig. 5, the processor 51, the memory 52, and the communication interface 53 are connected via the bus 50 to complete mutual communication.
The communication interface 53 is used for implementing communication between modules, apparatuses, units and/or devices in the embodiments of the present application. The communication interface 53 may also enable communication with other components such as: the data communication is carried out among external equipment, image/data acquisition equipment, a database, external storage, an image/data processing workstation and the like.
Bus 50 comprises hardware, software, or both coupling the components of the computer device to each other. Bus 50 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example and not limitation, bus 50 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or another suitable bus, or a combination of two or more of these. Bus 50 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the present application, any suitable buses or interconnects are contemplated.
In addition, in combination with the information extraction method of the clinical trial protocol in the above embodiments, the embodiments of the present application may be implemented by providing a computer-readable storage medium. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement the information extraction method of any one of the clinical trial scenarios described in the embodiments above.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An information extraction method for a clinical trial protocol, comprising:
performing layout analysis on the clinical test scheme document to identify position information of headers, footers, texts, tables and pictures;
discarding the header part and the footer part according to the position information, identifying the frame main body structure of the table part, and performing OCR (optical character recognition) on the text part and the character content of the table part;
integrating information obtained after OCR recognition with the picture;
and extracting information of the integrated information, and outputting the obtained key information.
2. The method of claim 1, wherein performing layout analysis on the clinical trial protocol document and identifying the location information of headers, footers, text, tables, and pictures comprises:
and predicting the clinical test scheme document through a trained FasterRCNN network, and identifying position information of headers, footers, texts, tables and pictures.
3. The method of claim 1, wherein identifying the frame body structure for the table portion comprises:
performing table line segmentation on the table part through an image segmentation model of Unet;
and extracting the connected region according to the table line segmentation result to obtain the row and column information of the table.
4. The method of claim 1, wherein OCR recognizing the text content of the text portion and the table portion comprises:
putting recognition results obtained by performing OCR recognition on the text contents of the text part and the table part into a bert language model to obtain a plurality of true values;
calculating the glyph similarity between the true values and the recognition result;
and taking the true value with the highest glyph similarity to the recognition result as the final OCR recognition result.
5. The method of claim 1, wherein extracting the information from the integrated information comprises:
identifying the position of the trigger word and the corresponding event type from the integrated information according to the keyword by using a trigger word extraction model;
and identifying, by using an argument recognition model, the arguments in the event and the corresponding argument roles.
6. The method of claim 1, wherein outputting the obtained key information comprises:
judging, according to an information extraction result, whether the picture is to be added to the key information for output;
and if the information extraction result indicates that the picture needs to be added to the key information, adding the picture to the key information and outputting it.
7. The method of any one of claims 1 to 6, wherein the layout analysis of the clinical trial protocol document includes:
paging the clinical test scheme documents, and storing each page as a picture;
and performing layout analysis on the picture.
8. An information extraction device of a clinical trial scenario, comprising:
the layout analysis unit is used for carrying out layout analysis on the clinical test scheme document and identifying position information of headers, footers, texts, tables and pictures;
the processing unit is used for discarding the header part and the footer part according to the position information, identifying the frame main body structure of the table part and OCR identifying the text part and the character content of the table part;
the integration unit is used for integrating the information obtained after OCR recognition with the image;
and the output unit is used for extracting the information of the integrated information and outputting the obtained key information.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111500948.XA 2021-12-09 2021-12-09 Information extraction method and device for clinical test scheme Pending CN114170605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111500948.XA CN114170605A (en) 2021-12-09 2021-12-09 Information extraction method and device for clinical test scheme

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111500948.XA CN114170605A (en) 2021-12-09 2021-12-09 Information extraction method and device for clinical test scheme

Publications (1)

Publication Number Publication Date
CN114170605A true CN114170605A (en) 2022-03-11

Family

ID=80484979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111500948.XA Pending CN114170605A (en) 2021-12-09 2021-12-09 Information extraction method and device for clinical test scheme

Country Status (1)

Country Link
CN (1) CN114170605A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821612A (en) * 2022-05-30 2022-07-29 浙商期货有限公司 Method and system for extracting information of PDF document in securities future scene


Similar Documents

Publication Publication Date Title
CN107944380B (en) Identity recognition method and device and storage equipment
CN110472251B (en) Translation model training method, sentence translation equipment and storage medium
CN108885614B (en) Text and voice information processing method and terminal
WO2019047971A1 (en) Image recognition method, terminal and storage medium
CN108156508B (en) Barrage information processing method and device, mobile terminal, server and system
CN104217717A (en) Language model constructing method and device
CN106203235B (en) Living body identification method and apparatus
CN105630846B (en) Head portrait updating method and device
US20150234799A1 (en) Method of performing text related operation and electronic device supporting same
CN109543014B (en) Man-machine conversation method, device, terminal and server
CN110276010B (en) Weight model training method and related device
CN107885718B (en) Semantic determination method and device
CN110597957B (en) Text information retrieval method and related device
CN112214605A (en) Text classification method and related device
CN107992615B (en) Website recommendation method, server and terminal
CN110069407B (en) Function test method and device for application program
CN109063076B (en) Picture generation method and mobile terminal
CN108549681B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN114170605A (en) Information extraction method and device for clinical test scheme
CN106934003B (en) File processing method and mobile terminal
CN112182461A (en) Method and device for calculating webpage sensitivity
CN107632985B (en) Webpage preloading method and device
CN112395524A (en) Method, device and storage medium for displaying word annotation and paraphrase
CN108804615B (en) Sharing method and server
CN109450853B (en) Malicious website determination method and device, terminal and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination