CN115527195A

CN115527195A - Medical equipment nameplate information identification and extraction algorithm

Info

Publication number: CN115527195A
Application number: CN202210614370.9A
Authority: CN
Inventors: 王玥; 李引
Original assignee: Suzhou Archimedes Network Technology Co ltd
Current assignee: Suzhou Archimedes Network Technology Co ltd
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-12-27

Abstract

The invention relates to a medical equipment nameplate information identification and extraction algorithm, characterized in that the basic information of a medical device is determined by cross-matching the character recognition result against the information in a medical equipment standard library. By recognizing a picture of the device's nameplate, the method extracts the name, model, brand, registration certificate number, serial number, medical equipment classification, and other device information; if an extracted item is missing or wrong, it can be corrected from the standard library. For Chinese nameplates, the accuracy of identifying device information (device name, model, brand, registration certificate number, medical equipment classification, and so on) reaches 96%; for English nameplates, it reaches 80%. A further advantage is that the complete information for a device can be returned even if the nameplate contains only part of it.

Description

Medical equipment nameplate information identification and extraction algorithm

Technical Field

The invention relates to the field of image recognition processing, in particular to a medical equipment nameplate information recognition and extraction algorithm.

Background

Picture-based character recognition is by now a relatively mature technology, and it achieves high accuracy on ordinary text. When key information must be extracted from the recognized text, however, its applications are limited to standardized texts such as identity cards, driving licenses, license plates, business licenses, and machine-printed invoices. Standardized texts have a highly regular layout and wording, which makes information extraction easy, and recognition of such texts is mature on commercial algorithm platforms such as Alibaba Cloud, iFlytek, and Baidu AI. No mature product yet exists for non-standardized texts. A medical equipment nameplate is a non-standardized text: because of the variety of devices and manufacturers, there are hundreds of thousands of kinds of nameplates, which makes information extraction difficult.

Disclosure of Invention

To address the above technical problem, the invention provides a medical equipment nameplate information identification and extraction algorithm that determines the basic information of a medical device by cross-matching the character recognition result against the information in a medical equipment standard library.

The medical equipment standard library is a database built by manually processing data that the company obtained from official sources and field investigation. It holds accurate information for most medical devices, including device name, device model, device brand, device registration number, and device medical classification.

A medical equipment nameplate information identification and extraction algorithm specifically comprises the following identification steps:

step one, acquiring a digital image of a medical equipment nameplate;

step two, recognizing the characters in the image by OCR. Because the text on a nameplate is not continuous, characters that are not on the same line, or that are separated by a large gap, are split into different text segments; each text segment comprises a character string and the coordinates of the region it occupies;

step three, segmenting each text segment obtained in step two into words and matching the words against a medical equipment name keyword library; if the matching rate is high, the text segment is taken to be the device name. Word segmentation uses the widely used jieba tool to split a text segment into a group of words; for example, "multi-parameter patient monitor" is split into "multi-parameter patient monitor", "patient monitor", and "monitor". The device name keyword library consists of representative device-name keywords selected manually from the collected device names after word segmentation; it contains words strongly related to medical equipment, such as "Doppler", "ultrasound", "monitor", "X-ray", "disinfection", "nuclear magnetic resonance", and "dialysis". The words obtained from a text segment are matched in the medical equipment name library, and each successful match adds 1 to the segment's matching score, finally giving the segment's name matching score Sname and its list of matched medical equipment name keywords. The keyword list of the segment with the highest Sname is then matched against the device name library to obtain the device name Equ_name with the highest similarity;

step four, searching all text segments for label words such as "model", "specification", or "type"; if one is found, the strings of consecutive English letters and digits in that segment and in the text to its right are taken as candidate device-model segments. If no model segment is found this way, every continuous string of non-Chinese characters (digits, English letters, hyphens, spaces, and the like) is extracted from all text segments as a model candidate. Each candidate string is then matched against the models in the standard equipment library. The library holds tens of thousands of models, so a prefix tree is introduced to assist the query and make matching more efficient. Each node of the prefix tree is a key made of two characters, and its value is the list of all device models containing those two characters. A sliding window of width 2 is placed on the candidate string, starting at the first character; the two characters in the window are used as the key to look up the corresponding device model list in the prefix tree, and each model in that list is checked for a match. If the matching rate is above a threshold, the model is added to the candidate model list. The window is then moved one character to the right and the process repeats until the window reaches the last character of the candidate string. The result is the candidate model list.

Step five, extracting from the text obtained in step two the text segment that conforms to the national medical device registration certificate number specification and taking it as the registration certificate number. Because national registration certificate numbers follow a fixed format, a regular expression written to that format is tried against the nameplate text; a segment that matches is taken as the device registration certificate number. It is then matched against the registration certificate number library, and registration certificate numbers whose matching rate is above a threshold are placed in the candidate registration certificate number list;

step six, forming a device list L_name (ID: Score_name_ID) based on the device name, a device list L_model (ID: Score_model_ID) based on the device model, and a device list L_reg (ID: Score_reg_ID) based on the device registration number, where each list contains the IDs of devices in the standard library and their matching scores. The devices contained in the three lists are intersected to give a device list [Equ_1, Equ_2, Equ_3, ..., Equ_N], and for each Equ_i the matching scores from the three lists are summed: Score_Equ_i = SUM(Score_name_i, Score_model_i, Score_reg_i);

the device ID with the largest Score_Equ_i is the most likely match. The basic information of the device is then queried from the standard library by this device ID and returned as the final result.

The basic information of the equipment returned in the sixth step comprises the name, model, brand, registration number and equipment classification of the equipment.

The invention has the following beneficial effects: by recognizing a picture of a medical device's nameplate, the method extracts the name, model, brand, registration certificate number, serial number, medical equipment classification, and other device information; if an extracted item is missing or wrong, it can be corrected from the standard library.

For Chinese nameplates, the accuracy of identifying device information (device name, model, brand, registration certificate number, medical equipment classification, and so on) reaches 96%; for English nameplates, it reaches 80%. A further advantage is that the complete information for a device can be returned even if the nameplate contains only part of it. In general, the method returns the complete device information as long as either the registration number, or the name together with the model, is recognized on the nameplate.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

Example 1

The invention is further illustrated by reference to the following examples.

The medical equipment standard library against which device information is matched is a medical equipment database built by manually verifying and organizing data that the company obtained from the national medical device database and from its field investigations at hundreds of hospitals. Each record in the database corresponds to one medical device and contains: [device ID, device name, device model, device brand, device registration number, device medical classification].
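A record of this kind can be sketched as a small data structure. The following is only an illustration: the class name `DeviceRecord`, the field names, the in-memory dictionary, and every sample value are assumptions made for the example, not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DeviceRecord:
    """One row of the standard library (field names are hypothetical)."""
    device_id: int
    name: str
    model: str
    brand: str
    registration_number: str
    classification: str


# Hypothetical sample rows; a real library would hold tens of thousands.
_ROWS = [
    DeviceRecord(1, "multi-parameter monitor", "CL1000", "BrandA", "REG-0001", "Class II"),
    DeviceRecord(2, "Doppler ultrasound diagnostic device", "CLL000", "BrandB", "REG-0002", "Class III"),
]

# Index the rows by device ID so the final step can fetch a full record.
STANDARD_LIBRARY: Dict[int, DeviceRecord] = {row.device_id: row for row in _ROWS}


def lookup(device_id: int) -> DeviceRecord:
    """Return the full record for a matched device ID."""
    return STANDARD_LIBRARY[device_id]
```

Keying the library by device ID mirrors the last step of the method, where the winning ID is used to return every field of the record at once.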

the specific identification method comprises the following steps:

Step one, photographing the medical equipment nameplate with a mobile phone, camera, or similar device to obtain a digital image;

Step two, recognizing the characters in the image with Alibaba Cloud OCR. Because the text on a nameplate is separated by line breaks and blank spaces, the OCR service recognizes the text in each area as a separate text segment according to those line breaks and spaces. Each text segment contains a character string and the coordinates of its four corners.
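As a minimal sketch of what such a text segment might look like in code, the example below stores the string and four corner points and adds a helper that finds the segment immediately to the right on the same row, which is the relation the model-extraction step relies on. The names `TextSegment` and `right_neighbor`, the half-height row tolerance, and the sample coordinates are assumptions for illustration; the actual OCR response format is not described in the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class TextSegment:
    text: str
    corners: Tuple[Point, Point, Point, Point]  # four (x, y) corner points

    @property
    def left(self) -> float:
        return min(x for x, _ in self.corners)

    @property
    def right(self) -> float:
        return max(x for x, _ in self.corners)

    @property
    def top(self) -> float:
        return min(y for _, y in self.corners)

    @property
    def bottom(self) -> float:
        return max(y for _, y in self.corners)

    @property
    def center_y(self) -> float:
        return (self.top + self.bottom) / 2


def right_neighbor(seg: TextSegment, segments: List[TextSegment]) -> Optional[TextSegment]:
    """Nearest segment to the right of `seg` on roughly the same row.

    Assumption: two segments share a row when their vertical centers
    differ by at most half the height of `seg`.
    """
    half_height = (seg.bottom - seg.top) / 2
    same_row = [
        s for s in segments
        if s is not seg
        and s.left >= seg.right
        and abs(s.center_y - seg.center_y) <= half_height
    ]
    return min(same_row, key=lambda s: s.left, default=None)


# Hypothetical OCR output for a small nameplate.
label = TextSegment("Model", ((10, 10), (60, 10), (60, 30), (10, 30)))
value = TextSegment("EV1000M", ((80, 11), (180, 11), (180, 31), (80, 31)))
title = TextSegment("Patient Monitor", ((10, 50), (200, 50), (200, 70), (10, 70)))
segments = [label, value, title]
```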

Step three, segmenting each text segment obtained in step two into words and matching the words against a medical equipment name keyword library; if the matching rate is high, the text segment is taken to be the device name. Word segmentation uses the widely used jieba tool to split a text segment into a group of words; for example, "multi-parameter patient monitor" is split into "multi-parameter patient monitor", "patient monitor", and "monitor". The device name keyword library consists of representative device-name keywords selected from previously collected device names after word segmentation; it contains words strongly related to medical equipment, such as "Doppler", "ultrasound", "monitor", "X-ray", "disinfection", "nuclear magnetic resonance", and "dialysis". The words obtained from a text segment are matched in the medical equipment name library, and the names whose matching rate is above a threshold are collected into a device name candidate list. For example, the device names matched for "patient monitor" on a nameplate are ["clinical monitor", "multi-parameter monitor", "electrocardiograph monitor", ...].
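The keyword-counting part of the name-matching step can be sketched as follows. This is a simplified illustration, not the patented implementation: the function names, the lowercase English keyword set, and the pre-tokenized input are assumptions. In practice the tokens for Chinese text would come from a segmenter such as jieba (for instance `jieba.lcut(text)`), which is left out here so the example runs with the standard library alone.

```python
from typing import Dict, List, Set, Tuple

# Keywords taken from the examples in the text; a real library is much larger.
NAME_KEYWORDS: Set[str] = {
    "doppler", "ultrasound", "monitor", "x-ray",
    "disinfection", "nuclear magnetic resonance", "dialysis",
}


def name_score(words: List[str], keywords: Set[str] = NAME_KEYWORDS) -> Tuple[int, List[str]]:
    """Add 1 to the score for every word that is found in the keyword library."""
    matched = [w for w in words if w.lower() in keywords]
    return len(matched), matched


def best_name_segment(tokenized: Dict[str, List[str]]) -> Tuple[str, int, List[str]]:
    """Pick the text segment with the highest name matching score.

    `tokenized` maps each segment's text to its already segmented words.
    """
    best_text = max(tokenized, key=lambda text: name_score(tokenized[text])[0])
    score, matched = name_score(tokenized[best_text])
    return best_text, score, matched


# Hypothetical nameplate segments with their word lists.
example = {
    "Doppler ultrasound diagnostic device": ["Doppler", "ultrasound", "diagnostic", "device"],
    "Model CL1000": ["Model", "CL1000"],
}
```

The matched keyword list returned here is what would then be compared against the device name library to produce the name candidates.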

Step four, searching all text segments for label words such as "model", "specification", or "type"; if one is found, the strings of consecutive English letters and digits in that segment and in the text to its right are taken as candidate device-model segments. If no model segment is found this way, every continuous string of non-Chinese characters (digits, English letters, hyphens, spaces, and the like) is extracted from all text segments as a model candidate. Each candidate string is then matched against the models in the standard equipment library. The library holds tens of thousands of models, so a prefix tree is introduced to assist the query and make matching more efficient. Each node of the prefix tree is a key made of two characters, and its value is the list of all device models containing those two characters. A sliding window of width 2 is placed on the candidate string, starting at the first character; the two characters in the window are used as the key to look up the corresponding device model list in the prefix tree, and each model in that list is checked for a match. If the matching rate is above a threshold, the model is added to the candidate model list. The window is then moved one character to the right and the process repeats until the window reaches the last character of the candidate string. The result is the candidate model list.

For example, suppose a candidate model segment is "Model EVL000M". The window contents "mo", "od", "de", "el", "le", "ev", "vL", "L0", "00", "0M" are used in turn as keys to query the prefix tree. The device model with the highest matching rate is found to be "EV1000M" (here the digit 1 was recognized as the letter L because of an OCR error, but matching against the device models in the standard library corrects it to the right model).
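The two-character lookup can be sketched with a plain dictionary that maps each character pair to the models containing it. Several points are assumptions made for this illustration: the dictionary stands in for the prefix tree, the window slides over the raw upper-cased string (including the space), the "matching rate" is approximated by `difflib.SequenceMatcher.ratio()` against each whitespace-separated token, the threshold is 0.8, and the three library models are invented. The source does not specify how the matching rate is computed.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, Iterable, List, Set, Tuple


def build_bigram_index(models: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every two-character key to the set of models that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for model in models:
        upper = model.upper()
        for i in range(len(upper) - 1):
            index[upper[i:i + 2]].add(model)
    return index


def candidate_models(text: str, index: Dict[str, Set[str]],
                     threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Slide a width-2 window over `text` and score the models it retrieves."""
    upper = text.upper()
    tokens = upper.split()
    scores: Dict[str, float] = {}
    for i in range(len(upper) - 1):
        for model in index.get(upper[i:i + 2], ()):
            if model not in scores:
                # Assumed similarity measure: best ratio over the tokens.
                scores[model] = max(
                    (SequenceMatcher(None, tok, model.upper()).ratio() for tok in tokens),
                    default=0.0,
                )
    kept = [(m, s) for m, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


# Hypothetical library models; "EVL000M" is the OCR misreading from the text.
index = build_bigram_index(["EV1000M", "EV300", "CL1000"])
result = candidate_models("Model EVL000M", index)
```

With these sample models, only "EV1000M" clears the threshold for the misread string, because it shares the blocks "EV" and "000M" with "EVL000M" while the other models share much less.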

Step five, extracting from the text obtained in step two the text segment that conforms to the national medical device registration certificate number specification and taking it as the registration certificate number. The extraction follows the national naming rule for medical device registration certificate numbers, which is ×1械注×2××××3×4××5××××6, where ×1 is the abbreviation of the location of the registration approval department; ×2 is the registration form; ××××3 is the year of first registration; ×4 is the product management category; ××5 is the product classification code; and ××××6 is the serial number of the first registration. The registration certificate number can therefore be extracted by regular expression matching.

For example, the registration certificate number "national food and drug supervision (in) character 2013 No. 8211526" can be extracted from a nameplate.
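A regular expression for the naming rule might look like the sketch below. It is written under stated assumptions rather than taken from the patent: the region is taken to be any single Chinese character, the registration form one of 准, 进, or 许, the year four digits, the management category one digit from 1 to 3, the classification code two digits, and the serial number four digits. The function name and the sample number are hypothetical, and numbers printed in other layouts would need their own patterns.

```python
import re
from typing import Dict, Optional

# Assumed layout: <region>械注<form><year:4><category:1><code:2><serial:4>
REG_NO = re.compile(
    r"(?P<region>[\u4e00-\u9fff])械注(?P<form>[准进许])"
    r"(?P<year>\d{4})(?P<category>[123])(?P<code>\d{2})(?P<serial>\d{4})"
)


def extract_registration_number(text: str) -> Optional[Dict[str, str]]:
    """Return the parts of the first registration number found, or None."""
    compact = re.sub(r"\s+", "", text)  # OCR often inserts stray spaces
    match = REG_NO.search(compact)
    if match is None:
        return None
    parts = match.groupdict()
    parts["number"] = match.group(0)
    return parts


# Hypothetical nameplate line; the number itself is made up.
found = extract_registration_number("注册证号:国械注准 20153211526")
```

The extracted string would then be compared with the registration certificate number library, so small OCR errors in the digits can still be tolerated at that later stage.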

In step six, because the device is matched along three dimensions and the most probable result is taken, errors caused by OCR misrecognition can be corrected, and information missing from the nameplate can be filled in. For example, suppose two devices have similar models: one is the multi-parameter monitor CL1000 and the other is the Doppler ultrasound diagnostic device CLL000. If the OCR result is "multi-parameter monitor CLL000", the model matching scores of the two devices are close, but the name matching score of the multi-parameter monitor is clearly higher, so the device is identified as the multi-parameter monitor CL1000, which corrects the OCR error.
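The score fusion in step six can be illustrated with a short sketch. The function name `fuse` and all score values are invented for the example. One further assumption is labeled in the code: a list that is empty (for instance because no registration number was found on the nameplate) is skipped rather than intersected, since a strict intersection with an empty list would leave no device.

```python
from typing import Dict, Optional, Tuple

Scores = Dict[int, float]  # device ID -> matching score


def fuse(l_name: Scores, l_model: Scores, l_reg: Scores) -> Optional[Tuple[int, float]]:
    """Intersect the candidate lists and return the ID with the largest summed score."""
    # Assumption: dimensions with no candidates at all are ignored.
    lists = [scores for scores in (l_name, l_model, l_reg) if scores]
    if not lists:
        return None
    common = set(lists[0])
    for scores in lists[1:]:
        common &= set(scores)
    if not common:
        return None
    totals = {dev: sum(scores[dev] for scores in lists) for dev in common}
    best = max(totals, key=totals.get)
    return best, totals[best]


# Hypothetical scores for the CL1000 / CLL000 example:
# device 1 = multi-parameter monitor CL1000,
# device 2 = Doppler ultrasound diagnostic device CLL000.
l_name = {1: 1.0, 2: 0.2}    # OCR name "multi-parameter monitor" fits device 1
l_model = {1: 0.83, 2: 1.0}  # OCR model "CLL000" fits device 2 slightly better
l_reg: Scores = {}           # no registration number on this nameplate

best = fuse(l_name, l_model, l_reg)
```

With these numbers device 1 totals about 1.83 against 1.2 for device 2, so the name evidence outweighs the small model mismatch, which is the correction the example in the text describes.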

The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the above embodiments; the embodiments and the description only illustrate the principles of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A medical equipment nameplate information identification and extraction algorithm is characterized in that basic information of medical equipment is finally determined according to cross matching of medical equipment standard library information and character identification results.

2. The identification and extraction algorithm for nameplate information of medical equipment as claimed in claim 1, wherein the specific identification steps are as follows:

step one, acquiring a digital image of a medical equipment nameplate;

step two, character recognition

Recognizing characters in the image through an OCR technology;

step three, identifying the equipment name

Matching each character segment obtained in the step two with a medical equipment name keyword library after word segmentation, and if the matching rate is high, confirming that the character segment is the equipment name;

step four, identifying the model of the equipment

Extracting models from all the character segments to obtain a candidate model list;

step five, extracting the text segments which are inquired from the text obtained in the step two and meet the national medical equipment registration certificate number specification to serve as registration certificate numbers, matching the equipment registration certificate numbers with a registration certificate number library, and listing the registration certificate numbers with the matching rate higher than a threshold value into a candidate registration certificate number list;

and step six, respectively forming a device list L_name (ID: Score_name_ID) based on the device name, a device list L_model (ID: Score_model_ID) based on the device model and a device list L_reg (ID: Score_reg_ID) based on the device registration certificate number by using the device name candidate list, the device model candidate list and the registration certificate number candidate list obtained in the step three, the step four and the step five, comparing and matching to obtain a final result, and then returning the final result.

3. The medical equipment nameplate information identification extraction algorithm of claim 2, wherein the word segmentation employs a jieba word segmentation tool commonly used at present to segment a text segment into a group of words.

4. The identification and extraction algorithm for nameplate information of medical equipment as claimed in claim 2, wherein the keyword library of names of medical equipment is a representative keyword of names of equipment selected by word segmentation based on the names of equipment collected in the past, and the word library includes words such as "doppler", "ultrasound", "monitor", "X-ray", "disinfection", "nuclear magnetic resonance", "dialysis", etc. strongly related to medical equipment.

5. The medical equipment nameplate information identification and extraction algorithm of claim 2, wherein the words obtained by word segmentation of the text segment are matched with a medical equipment name library, 1 is added to the matching score of the text segment every time matching is successful, and finally the name matching score Sname of the text segment and a matched medical equipment name keyword list are obtained.

6. The medical equipment nameplate information identification and extraction algorithm as claimed in claim 2, wherein the candidate model list is obtained by a specific method comprising the following steps:

first, model extraction is carried out: all text segments are searched for label words such as "model", "specification", or "type"; if one is found, the strings of consecutive English letters and digits in that text segment and in the text to its right are candidate device-model segments; if no device-model segment is extracted in step four in this way, every continuous string of non-Chinese characters is extracted from all text segments as a model candidate; the obtained model candidate strings are matched against the models in the standard equipment library, which holds tens of thousands of models, and a prefix tree is introduced to assist the query and make matching more efficient; each node in the prefix tree is a key consisting of two characters, and the corresponding value is all device models containing those two characters; a sliding window of width 2 is set on the candidate string and, starting from the first character, the two characters in the window are used as the key to look up the corresponding device model list in the prefix tree and to judge whether there is a match; if the matching rate is higher than a threshold, the device model is put into the candidate model list; the window is then moved backward by one character and the process is repeated until the window reaches the last character of the candidate string, finally giving the candidate model list.

7. The medical device nameplate information identification extraction algorithm of claim 6 wherein the non-chinese characters are characters with numbers, english letters, dashes, spaces, etc.

8. The medical equipment nameplate information identification and extraction algorithm as defined in claim 2, wherein the national medical equipment registration certificate number is standardized, whether the national medical equipment registration certificate number can be matched with the nameplate text segment is tried in a regular expression matching mode according to the national standard, and if the national medical equipment registration certificate number can be matched with the nameplate text segment, the national medical equipment registration certificate number can be used as the equipment registration certificate number.

9. The medical equipment nameplate information identification and extraction algorithm of claim 2, wherein the comparison and matching method is as follows: each of the device list based on the device name, the device list based on the device model, and the device list based on the device registration number contains the IDs of devices in the standard library and their matching scores; the devices contained in the three lists are intersected to obtain a device list [Equ_1, Equ_2, Equ_3, ..., Equ_N], and for each Equ_i the matching scores in the three lists are summed as Score_Equ_i = SUM(Score_name_i, Score_model_i, Score_reg_i); the device ID corresponding to the largest Score_Equ_i is the most likely matching device ID, and the basic information of the device is queried from the standard library according to this device ID and returned as the final result.

10. The medical equipment nameplate information identification and extraction algorithm as set forth in claim 2, wherein the basic information of the equipment returned in the sixth step includes the name, model, brand, registration number and classification of the equipment.