CN117523570B - Correction method, device, equipment and storage medium for medicine title - Google Patents

Correction method, device, equipment and storage medium for medicine title

Info

Publication number
CN117523570B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311497100.5A
Other languages
Chinese (zh)
Other versions
CN117523570A (en)
Inventor
谢方敏
周峰
郭陟
李志权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Fangzhou Information Technology Co ltd
Original Assignee
Guangzhou Fangzhou Information Technology Co ltd
Application filed by Guangzhou Fangzhou Information Technology Co ltd
Priority to CN202311497100.5A
Publication of CN117523570A
Application granted
Publication of CN117523570B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/12 Detection or correction of errors, e.g. by rescanning the pattern
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/18086 Extraction of features or characteristics of the image by performing operations within image blocks or by using histograms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19187 Graphical models, e.g. Bayesian networks or Markov models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a correction method, device, equipment and storage medium for a medicine title. The method comprises the following steps: detecting a plurality of text blocks in image data when a medicine is put in storage, wherein the image data contains a list of medicines ordered from a supplier; searching for the text blocks that belong to the title of the medicine as candidate title information; comparing the candidate title information with each piece of general title information; if the candidate title information differs from every piece of general title information, calculating, according to the edit distance, the probability that the candidate title information belongs to each piece of general title information; extracting part of the general title information according to the probability as target title information; and, if a confirmation operation triggered on the target title information is received, correcting the candidate title information to the target title information. The embodiment can simultaneously overcome detection errors of the optical character recognition model, non-standard supplier formats and writing errors, and reduces the manual work of checking medicine titles, thereby improving the efficiency of entering medicine titles.

Description

Correction method, device, equipment and storage medium for medicine title
Technical Field
The present invention relates to the field of natural language processing, and in particular, to a method, an apparatus, a device, and a storage medium for correcting a drug title.
Background
An e-commerce platform purchases medicines from suppliers, and the suppliers ship the medicines together with a list. When checking and accepting the medicines, staff of the e-commerce platform scan the list, recognize the text information in it using OCR (Optical Character Recognition) technology, and enter the titles of the medicines in the list into the system.
On the one hand, the title formats used in supplier lists vary and do not necessarily meet the specification of the e-commerce platform, and the titles themselves may contain writing errors. On the other hand, factors such as font differences (for example, Song and regular-script typefaces), folds in the list, ink stains and stamps covering the text cause OCR recognition errors, so the recognized titles contain errors.
When checking the titles, staff correct any errors they find by hand. This checking work is tedious and prone to mistakes and omissions, so the efficiency of entering medicine titles is low.
Disclosure of Invention
The invention provides a correction method, device, equipment and storage medium of a medicine title, which are used for solving the problem of how to improve the efficiency of entering the title of the medicine by using an OCR technology.
According to an aspect of the present invention, there is provided a correction method of a medicine title, including:
detecting a plurality of text blocks in image data when a medicine is put in storage, the image data containing a list of medicines ordered from a supplier;
searching for the text blocks that belong to the title of the medicine as candidate title information;
comparing the candidate title information with each piece of general title information, wherein the general title information represents the title of a medicine that has already been put in storage;
if the candidate title information differs from every piece of general title information, calculating, according to the edit distance, the probability that the candidate title information belongs to each piece of general title information;
extracting part of the general title information according to the probability as target title information;
and if a confirmation operation triggered on the target title information is received, correcting the candidate title information to the target title information.
According to another aspect of the present invention, there is provided a correction device for a drug title, comprising:
a text block detection module, used for detecting a plurality of text blocks in image data when a medicine is put in storage, the image data containing a list of medicines ordered from a supplier;
a candidate title information searching module, used for searching for the text blocks that belong to the title of the medicine as candidate title information;
a general title information comparison module, used for comparing the candidate title information with each piece of general title information, wherein the general title information represents the title of a medicine that has already been put in storage;
a probability calculation module, used for calculating, according to the edit distance, the probability that the candidate title information belongs to each piece of general title information if the candidate title information differs from every piece of general title information;
a target title information extraction module, used for extracting part of the general title information according to the probability as target title information;
and a candidate title information correction module, used for correcting the candidate title information to the target title information if a confirmation operation triggered on the target title information is received.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of correcting a drug title according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing a computer program for causing a processor to execute the method for correcting a drug title according to any one of the embodiments of the present invention.
In this embodiment, a plurality of text blocks are detected in image data when a medicine is put in storage, the image data containing a list of medicines ordered from a supplier; the text blocks that belong to the title of the medicine are searched for as candidate title information; the candidate title information is compared with each piece of general title information, wherein the general title information represents the title of a medicine that has already been put in storage; if the candidate title information differs from every piece of general title information, the probability that the candidate title information belongs to each piece of general title information is calculated according to the edit distance; part of the general title information is extracted according to the probability as target title information; and, if a confirmation operation triggered on the target title information is received, the candidate title information is corrected to the target title information. Correcting the candidate title information recognized from the list against the general title information addresses detection errors of the optical character recognition model, non-standard supplier formats and writing errors at the same time. Because titles are short, the edit distance is used to calculate the probability that the candidate title information belongs to each piece of general title information; the calculation is simple and the probability remains accurate. The approach avoids training the optical character recognition model, so the cost is low and the performance of the model in other services is preserved. In addition, errors are effectively reduced and less manual checking of medicine titles is needed, which makes entering medicine titles simpler and therefore more efficient.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for correcting a medicine title according to a first embodiment of the present invention;
Fig. 2 is an exemplary diagram of a list to be entered according to the first embodiment of the present invention;
Fig. 3 is an exemplary diagram of candidate title information provided according to an embodiment of the present invention;
Fig. 4 is an exemplary diagram of target title information provided according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a medicine title correction device according to a second embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a method for correcting a medicine title according to an embodiment of the present invention, where the method may be performed by a medicine title correction device, and the medicine title correction device may be implemented in hardware and/or software, and the medicine title correction device may be configured in an electronic device. As shown in fig. 1, the method includes:
Step 101, detecting a plurality of text blocks in image data during drug warehousing.
In practical applications, the e-commerce platform purchases medicines from a plurality of suppliers. The suppliers deliver the medicines, together with lists of the medicines, to an address designated by the e-commerce platform through logistics, and staff of the e-commerce platform check and accept the medicines so as to put them in storage.
In general, a list of medicines records the various pieces of information about the medicines in the form of a table.
As shown in fig. 2, when checking and accepting the medicines, staff may capture image data of the list using an imaging device such as a high-speed camera; that is, the image data contains a list of medicines ordered from a supplier.
In this embodiment, an optical character recognition model, that is, a model for performing optical character recognition on image data, may be constructed and trained in advance based on OCR technology.
The structure of the optical character recognition model is not limited to a manually designed neural network; it may also be a neural network optimized by model quantization, a neural network searched for the characteristics of medicine lists by a NAS (Neural Architecture Search) method, and so on, which is not limited in this embodiment.
Because the background of a medicine list is clean and regular, the characters in the list can be regarded as a simple scene. A lightweight optical character recognition model (such as PaddleOCR) may be used, and in such a simple scene characters may also be detected with image morphological operations from computer vision, such as dilation and erosion, while still ensuring high accuracy.
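As a minimal sketch of the morphological operations mentioned above, the following pure-Python functions apply binary dilation and erosion to a small grid of 0/1 pixels. The function names, the square structuring element and the border handling (only in-bounds neighbours are considered) are assumptions made for illustration; the source does not specify how these operations are configured.

```python
def dilate(image, radius=1):
    """Binary dilation with a (2*radius+1) square structuring element."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x]:
                # Spread every foreground pixel over its neighbourhood.
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def erode(image, radius=1):
    """Binary erosion: a pixel survives only if all in-bounds neighbours are set."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            out[y][x] = int(all(
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ))
    return out


# Two strokes separated by a one-pixel gap become one connected run.
row = [[1, 0, 1, 0, 0, 0]]
dilated = dilate(row)
closed = erode(dilated)
```

Dilation followed by erosion (a closing) bridges small gaps between strokes, which is one way neighbouring characters can be grouped into a single text region before a bounding box is drawn.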
To reduce the development effort, the optical character recognition model in this embodiment (for example, PaddleOCR) may be a pre-trained model; image data whose content is a list of medicines purchased from a supplier is later collected as samples and used to fine-tune the model.
The image data is input into the optical character recognition model, which performs optical character recognition to detect a plurality of independent text blocks in the image data. The model marks each independent text block with a detection frame. One text block may contain one or more characters, including text (such as Chinese characters, English letters and Arabic numerals) and punctuation marks (such as dots, periods and brackets).
In practice, as shown in fig. 2, the list of medicines carries information with various semantics, such as the title of the medicine, the specification, the name of the manufacturer, the unit, the quantity, the unit price, the amount, the lot number or approval number, the production date and validity period, and the retail price.
Step 102, searching text blocks belonging to the title of the medicine as candidate title information.
In general, as shown in fig. 2, typesetting of information with multiple semantics in a list provided by a provider has a relatively stable rule, so that when a title of a medicine is input, text blocks with semantics being titles of the medicine can be searched in all text blocks according to the typesetting rule and marked as candidate title information.
In practical applications, the title of a drug typically contains key information about the drug.
Illustratively, the title of the drug (as in the generic name column of FIG. 3) may include at least one of:
The name of the medicine, the dosage of the medicine, the quantity of the medicine, the brand of the supplier, and symbols (e.g., brackets, multiplication signs, etc.).
Further, the optical character recognition model generally recognizes text that wraps across lines as at least two independent text blocks. Titles differ in length and may wrap, so the title of a medicine may consist of one text block or of several text blocks, which is not limited in this embodiment.
In one way of searching for the title of the medicine, the layout of each supplier's list is highly stable, so the layout rules of the lists of different suppliers can be analyzed to create templates. Each template marks the detection frame that belongs to the title, and a mapping relation between the identification information of the supplier (such as the name of the supplier) and the template is created.
In this approach, the identification information of the supplier is found in the text blocks by means such as keyword matching (the keyword being the name of the supplier), and the template created for that supplier is looked up in the mapping relation according to the identification information.
The template is mapped onto the image data so that, in the same coordinate system, the template overlaps the text blocks.
If the degree of overlap between the detection frame in the template and one or more text blocks, measured for example by IoU (Intersection over Union), is greater than a preset first overlap threshold, the overlap is considered high, and the one or more text blocks are determined to belong to the title of the medicine and are taken as candidate title information.
Further, when a plurality of text blocks belong to the title of the medicine, a symbol representing a line feed (e.g., "#") may be added between the text blocks during optical character recognition, so that the title is either a single text block or a plurality of text blocks joined by at least one symbol representing a line feed.
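The template matching described above can be sketched as follows. This is an illustrative reading, not the patented implementation: the `Box` type, the function names, the 0.5 threshold and the choice to compare the template frame against the bounding box of all intersecting text blocks are assumptions introduced here.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def area(box):
    # Degenerate (non-overlapping) boxes get area 0.
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def intersection(a, b):
    return Box(max(a.x1, b.x1), max(a.y1, b.y1), min(a.x2, b.x2), min(a.y2, b.y2))


def iou(a, b):
    inter = area(intersection(a, b))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def title_from_template(frame, blocks, threshold=0.5, newline="#"):
    """blocks is a list of (text, Box); returns the joined title or None."""
    hits = [(text, box) for text, box in blocks if area(intersection(frame, box)) > 0]
    if not hits:
        return None
    # Assumption: several blocks are compared to the frame through their bounding box.
    merged = Box(
        min(box.x1 for _, box in hits),
        min(box.y1 for _, box in hits),
        max(box.x2 for _, box in hits),
        max(box.y2 for _, box in hits),
    )
    if iou(frame, merged) <= threshold:
        return None
    hits.sort(key=lambda item: (item[1].y1, item[1].x1))  # top to bottom, left to right
    return newline.join(text for text, _ in hits)


frame = Box(0, 0, 100, 40)
blocks = [
    ("Ditong rhinitis water", Box(2, 2, 98, 19)),
    ("AA", Box(2, 21, 40, 38)),
    ("10", Box(120, 2, 140, 19)),
]
title = title_from_template(frame, blocks)
```

Here the two blocks inside the frame are joined with the line feed symbol, while the block outside the frame is ignored.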
In another way of searching for the title of the medicine, as shown in fig. 2, the identification information of the supplier may be found in the text blocks by means such as keyword matching (the keyword being the name of the supplier).
The image data is input into a table recognition model, such as table_recognition or Cycle-CenterNet in ModelScope, which identifies cells in the image data, each cell having a plurality of vertices.
If the degree of overlap between two cells (measured by IoU or the like) is greater than a preset second overlap threshold (e.g., 90%), the two cells overlap heavily and constitute a nesting anomaly, and the cell with the smaller area of the two may be deleted.
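A minimal sketch of the nested-cell cleanup, assuming cells are axis-aligned rectangles given as `(x1, y1, x2, y2)` tuples. The function names are hypothetical; the 0.9 default mirrors the 90% example threshold in the text.

```python
def cell_area(cell):
    x1, y1, x2, y2 = cell
    return max(0, x2 - x1) * max(0, y2 - y1)


def cell_iou(a, b):
    inter = cell_area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = cell_area(a) + cell_area(b) - inter
    return inter / union if union > 0 else 0.0


def remove_nested_cells(cells, threshold=0.9):
    """Drop the smaller cell of every pair whose IoU exceeds the threshold."""
    removed = set()
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            if i in removed or j in removed:
                continue
            if cell_iou(cells[i], cells[j]) > threshold:
                smaller = i if cell_area(cells[i]) <= cell_area(cells[j]) else j
                removed.add(smaller)
    return [cell for k, cell in enumerate(cells) if k not in removed]


cells = [(0, 0, 100, 50), (1, 1, 99, 49), (100, 0, 200, 50)]
cleaned = remove_nested_cells(cells)
```

The second cell sits almost entirely inside the first (IoU of about 0.94), so it is treated as a nested duplicate and removed; the adjacent third cell is untouched.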
After deletion is completed, the cells (including their vertices) may be sorted in row and column order so as to align them. Once aligned, adjacent vertices, that is, vertices whose distance (such as the Euclidean distance) is smaller than a preset distance threshold, are merged. Such adjacent vertices are mostly detection errors of the table recognition model, and merging them allows split cells to be joined into the same table.
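One simple way to merge near-duplicate vertices is a greedy pass that groups each vertex with the first existing cluster whose centroid lies within the distance threshold. The sketch below assumes this greedy strategy and replaces each group by its centroid; the source does not state how the merged position is chosen.

```python
import math


def centroid(points):
    count = len(points)
    return (sum(p[0] for p in points) / count, sum(p[1] for p in points) / count)


def merge_vertices(vertices, threshold):
    """Greedily cluster vertices closer than `threshold` (Euclidean distance)."""
    clusters = []
    for point in vertices:
        for cluster in clusters:
            if math.dist(point, centroid(cluster)) < threshold:
                cluster.append(point)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([point])
    return [centroid(cluster) for cluster in clusters]


merged = merge_vertices([(0, 0), (1, 1), (100, 0), (101, 0), (50, 50)], threshold=3.0)
```

With a threshold of 3 pixels, the two pairs of nearly coincident corners collapse into one vertex each, and the distant vertex is left alone.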
After merging is completed, and considering that a normal cell has four vertices, missing vertices may be completed in row and column order (e.g., top to bottom, left to right) by interpolation or the like.
After completion, isolated vertices may be removed in row and column order. Isolated vertices cannot form normal cells; that is, they are noise with respect to the cells.
After removal, the set of cells forming the largest connected region is extracted as the table.
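Extracting the largest connected region can be sketched as a breadth-first search over cells. The example assumes, purely for illustration, that each cell is addressed by a `(row, column)` grid position, that two cells are connected when they share an edge, and that "largest" means the component with the most cells.

```python
from collections import deque


def largest_connected_cells(cells):
    """Return the biggest edge-connected group of (row, col) cells."""
    remaining = set(cells)
    best = set()
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:
            row, col = queue.popleft()
            for neighbour in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


table_cells = largest_connected_cells({(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)})
```

The stray cell at `(5, 5)` is not connected to the 2x2 block and is therefore left out of the table.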
The table is mapped onto the image data so that, in the same coordinate system, the table overlaps the text blocks and one or more text blocks fall into each cell of the table.
According to the identification information, the text blocks in the cells of a designated row or column of the table are determined to belong to the title of the medicine and are taken as candidate title information.
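The mapping of text blocks into table cells might look like the sketch below. It assumes a text block belongs to the cell that contains the block's center point, which is one plausible rule and not something the source specifies; the data layout and function names are likewise hypothetical.

```python
def assign_blocks_to_cells(cells, blocks):
    """cells: {(row, col): (x1, y1, x2, y2)}; blocks: list of (text, (x1, y1, x2, y2))."""
    contents = {key: [] for key in cells}
    # Visit blocks top to bottom, left to right so wrapped lines keep their order.
    for text, (x1, y1, x2, y2) in sorted(blocks, key=lambda b: (b[1][1], b[1][0])):
        center_x, center_y = (x1 + x2) / 2, (y1 + y2) / 2
        for key, (cx1, cy1, cx2, cy2) in cells.items():
            if cx1 <= center_x < cx2 and cy1 <= center_y < cy2:
                contents[key].append(text)
                break
    return contents


def column_texts(contents, col, newline="#"):
    """Join the blocks of each non-empty cell in one column with the line feed symbol."""
    rows = sorted({r for r, c in contents if c == col})
    return [newline.join(contents[(r, col)]) for r in rows if contents[(r, col)]]


cells = {
    (0, 0): (0, 0, 100, 40), (0, 1): (100, 0, 160, 40),
    (1, 0): (0, 40, 100, 80), (1, 1): (100, 40, 160, 80),
}
blocks = [
    ("AA", (5, 22, 40, 38)),
    ("Ditong rhinitis water", (5, 2, 95, 18)),
    ("10", (110, 10, 130, 30)),
    ("Capsule of six ingredients with rehmannia", (5, 50, 95, 70)),
]
titles = column_texts(assign_blocks_to_cells(cells, blocks), col=0)
```

Column index 0 here stands for the designated title column; the two blocks that share the first cell are joined with "#".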
Further, when a plurality of text blocks belong to the title of the medicine, a symbol representing a line feed (e.g., "#") may be added between the text blocks during optical character recognition, so that the title is either a single text block or a plurality of text blocks joined by at least one symbol representing a line feed.
In addition, suppliers customarily arrange the titles of the medicines in a particular row or column of the list (for example, column 1). The row or column differs somewhat between suppliers, but most suppliers place the titles in a default row or column (for example, column 1) to follow common layout conventions. It is therefore possible to query whether the row or column containing the titles has been marked for the supplier represented by the identification information.
If the row or column containing the titles has been marked for the supplier, the text blocks in the cells of that row or column of the table are determined to belong to the title of the medicine and are taken as candidate title information.
If staff later find on checking that the row or column does not record the titles and designate another row or column, the information (such as the row number or column number) of the row or column designated by the staff is marked as the location of the titles, indexed by the identification of the supplier.
If no row or column has been marked for the supplier, the text blocks in the cells of the default row or default column (such as column 1) of the table are determined to belong to the title of the medicine and are taken as candidate title information.
If staff later verify that the cells in that row or column do record the titles, the information (such as the row number or column number) of the row or column is marked, indexed by the identification of the supplier.
If staff later find on checking that the row or column does not record the titles and designate another row or column, the information (such as the row number or column number) of the row or column designated by the staff is marked as the location of the titles, indexed by the identification of the supplier.
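The marking logic in the preceding paragraphs amounts to a per-supplier lookup with a default and a feedback step. A minimal sketch, assuming an in-memory dictionary and zero-based column indices (both are illustrative choices, as are all names):

```python
DEFAULT_TITLE_COLUMN = 0  # "column 1" in the text, zero-based here (assumption)

# Supplier identification -> column index marked as holding the titles.
title_columns = {}


def lookup_title_column(supplier):
    """Use the marked column if one exists, otherwise fall back to the default."""
    return title_columns.get(supplier, DEFAULT_TITLE_COLUMN)


def record_staff_check(supplier, used_column, designated_column=None):
    """Store the outcome of a staff check.

    designated_column is given only when staff reject used_column and
    point to a different column; otherwise used_column is confirmed.
    """
    title_columns[supplier] = used_column if designated_column is None else designated_column


first_guess = lookup_title_column("Supplier A")          # nothing marked yet
record_staff_check("Supplier A", first_guess, designated_column=2)
second_guess = lookup_title_column("Supplier A")         # staff correction is reused
record_staff_check("Supplier B", lookup_title_column("Supplier B"))
```

After one correction the supplier's layout is remembered, so later lists from the same supplier start from the corrected column.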
Extracting the candidate title information with a template is more accurate, but templates are costly to produce, and a template must be added whenever a new supplier's list appears or an existing supplier updates its list, so templates are also costly to maintain.
Extracting the candidate title information with a table is somewhat less accurate, but it suits most suppliers, copes to some extent with new suppliers and with existing suppliers updating their lists, and keeps the cost under control.
The template approach and the table approach may be used alone or in combination (for example, chosen according to the identification information of the supplier), so the template and/or the table may be selected to extract the candidate title information according to actual business requirements (such as accuracy and cost).
In one example, as shown in fig. 4, OCR is performed on the titles of the medicines in fig. 3, and the identified candidate title information is:
Shang Tong Biyanshui # AA
Granule for clearing away heat
Capsule of six ingredients with rehmannia
Fragrant and healthy qi water phase
Tomato flavor stomach clearing tablet # EE
Lemon-eliminating tablet # FFF
Here "AA", "EE" and "FFF" represent trademarks of manufacturers, and "#" is the line feed symbol.
In this example, the list is contaminated (the "expiration date" is written on it), so some noise is introduced during OCR (the "expiration date" is identified as the "expiration date"). The titles are also not uniform in format: some consist of the name of the medicine plus the trademark of the manufacturer, while others consist of the name only. In addition, because the list is printed somewhat blurrily, part of the text is recognized incorrectly ("herba epimedii" is identified as "photophobia", "tomato" and the "amaurosis" is identified as "lemon").
Step 103, comparing the candidate title information with each general title information.
In this embodiment, the titles of the medicines that have been put in storage may be collected at intervals and recorded as general title information; that is, the general title information represents the titles of medicines that have already been put in storage. These titles have been corrected by staff of the e-commerce platform, so the general title information is free of OCR recognition errors, non-standard supplier formats and writing errors.
The general title information is the title of a medicine in a format that is unified at least within the e-commerce platform. The format may follow a naming standard for medicines or be customized for the business of the e-commerce platform, and the formats for different medicines may be the same or different. For example, the general title information of one medicine may take the form "name of the medicine", that of another "name of the medicine (trademark of the manufacturer)", and that of another "name of the medicine (dose)", which is not limited in this embodiment.
In one example, the system database of the e-commerce platform holds the following records of medicines that were successfully put in storage:
Id 1-Ditong rhinitis water (AA)
Id 2-Ditong rhinitis water (AA)
Id 3-Ditong rhinitis water (GGG)
Id 4-drop rhinitis water spray (10 ml) 3 bottles
Id 5-licorzinc capsule
……
Wherein, "AA" and "GGG" represent trademarks of manufacturers.
In this example, each medicine record is divided into fields by "-": the first field is the Id and the second field is the general title information.
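Parsing records of this shape can be sketched as below. Splitting on the first "-" only is an assumption made here so that a title which itself contains a hyphen stays intact; the de-duplication step is also illustrative.

```python
def parse_record(line):
    """Split 'Id-title' on the first '-' into (id, general title information)."""
    record_id, _, title = line.partition("-")
    return record_id.strip(), title.strip()


records = [
    "Id 1-Ditong rhinitis water (AA)",
    "Id 2-Ditong rhinitis water (AA)",
    "Id 3-Ditong rhinitis water (GGG)",
    "Id 4-drop rhinitis water spray (10 ml) 3 bottles",
    "Id 5-licorzinc capsule",
]

# Several records may share one title; keep each title once, in first-seen order.
general_titles = list(dict.fromkeys(title for _, title in map(parse_record, records)))
```

The five sample records yield four distinct pieces of general title information, because the first two records carry the same title.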
And traversing each candidate title information in the list, comparing the candidate title information with the general title information of each drug in storage, and judging whether the candidate title information is the same as the general title information of each drug in storage.
If the candidate title information is the same as the general title information of a certain drug in storage, the candidate title information can be directly provided to staff as a reference for entering the title of the drug.
When the staff enters the title of the medicine, the staff can further check whether the candidate title information is wrong according to the image data or the list.
If the staff checks the candidate title information, the candidate title information is input into a system of the e-commerce platform as a title of the medicine.
In one example, the candidate title information "merchant rhinitis water #aa" is compared with the general title information "Ditong rhinitis water (AA)" and found to be different; it is then compared with the general title information "Ditong rhinitis water (GGG)", and so on until every piece of general title information has been traversed.
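The exact-match comparison above can be sketched as follows. This is a minimal illustration, assuming the records are available as in-memory strings in the "Id-title" form shown earlier; the names `RECORDS`, `general_titles` and `has_exact_match` are hypothetical and do not come from the source.

```python
# Hypothetical in-memory stand-in for the platform's drug records.
# The real storage layer is not described in the source.
RECORDS = [
    "Id 1-Ditong rhinitis water (AA)",
    "Id 2-Ditong rhinitis water (AA)",
    "Id 3-Ditong rhinitis water (GGG)",
    "Id 5-licorzinc capsule",
]


def general_titles(records):
    """Split each record on the first "-" and collect the distinct titles."""
    return {record.split("-", 1)[1] for record in records}


def has_exact_match(candidate, titles):
    """Return True if the candidate equals some general title exactly."""
    return candidate in titles


titles = general_titles(RECORDS)
# An exact match can be offered to the staff directly.
print(has_exact_match("Ditong rhinitis water (AA)", titles))
# No match: the candidate would go on to the probability calculation.
print(has_exact_match("merchant rhinitis water #aa", titles))
```

Collecting the titles into a set makes each lookup constant time, which matters only if the number of stored titles is large; a plain loop over the records would behave the same way.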
Step 104, if the candidate title information is different from each piece of general title information, calculating the probability that the candidate title information belongs to each piece of general title information according to the edit distance.
If the candidate title information differs from the general title information of every drug in storage, it does not meet the storage requirement; possible causes include OCR recognition errors and supplier formats that do not meet the specification.
In general, even when such problems exist, their influence is limited, so the candidate title information remains similar to the correct title of the medicine (i.e., the general title information). Natural language processing can therefore be performed on the candidate title information and each piece of general title information to analyze the edit distance between them and thereby calculate the probability that the candidate title information belongs to each piece of general title information.
The edit distance is a quantitative measure of the degree of difference between the candidate title information and the general title information, and reflects to some extent the probability that the general title information was edited into the candidate title information.
In general, the edit distance between the candidate title information and the general title information is negatively correlated with the probability that the candidate title information belongs to the general title information. The smaller the edit distance, the greater the probability that the general title information was edited into the candidate title information, that is, the greater the probability that the candidate title information belongs to the general title information; conversely, the larger the edit distance, the smaller that probability.
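To make the negative correlation concrete, the following sketch uses the classic character-level Levenshtein distance and maps it to a score with 1 / (1 + d). Both choices are assumptions made purely for illustration: the embodiment quantifies the edit distance through shared word segments rather than through character-level edits, and the function names here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative score that shrinks as the edit distance grows."""
    return 1.0 / (1.0 + levenshtein(a, b))


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("Ditong rhinitis water (AA)", "Ditong rhinitis water (AA)"))  # 1.0
```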
In one embodiment of the present invention, step 104 may include the steps of:
Step 1041, segmenting the candidate title information into a plurality of first candidate word segments.
In this embodiment, word segmentation may be performed on the candidate title information in real time, thereby segmenting the candidate title information into a plurality of first candidate word segments.
In a specific implementation, a general-purpose word segmentation tool (such as jieba, HanLP or StanfordNLP) can be used to segment the candidate title information, which is convenient to operate and easy to extend.
Considering that the title of a medicine contains professional medical vocabulary, the segmentation of the candidate title information can be assisted, on top of the general-purpose tool, by loading a medical-domain dictionary, a custom dictionary, regular-expression matching and the like, so that segmentation accuracy is improved.
In one example, word segmentation is performed on the candidate title information "merchant rhinitis water #aa", yielding a plurality of first candidate word segments.
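As a rough illustration of combining a generic splitter with a domain-specific pattern, the sketch below uses only regular expressions on the English-rendered titles. It is a stand-in, not the segmentation used in the embodiment: a real system would more likely load a medical dictionary into a tool such as jieba or HanLP, and the `DOSE` pattern and `segment` function are hypothetical.

```python
import re

# Hypothetical domain pattern tried before the generic ones, so that a dose
# such as "10 ml" stays in one piece.
DOSE = r"\d+(?:\.\d+)?\s*(?:ml|mg|g)\b"
# Order matters: dose first, then runs of word characters, then single symbols.
TOKEN = re.compile(rf"{DOSE}|\w+|[^\w\s]", re.IGNORECASE)


def segment(title: str) -> list[str]:
    """Split a title into word segments, dropping whitespace inside a segment."""
    return [re.sub(r"\s+", "", token) for token in TOKEN.findall(title)]


print(segment("Ditong rhinitis water (AA)"))
# ['Ditong', 'rhinitis', 'water', '(', 'AA', ')']
print(segment("merchant rhinitis water #aa"))
# ['merchant', 'rhinitis', 'water', '#', 'aa']
```

Using the same `segment` function for candidate titles and general titles keeps the two sets of word segments comparable.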
Step 1042, segmenting the general title information into a plurality of second candidate word segments.
In this embodiment, word segmentation may be performed on each piece of general title information in real time or offline, thereby segmenting it into a plurality of second candidate word segments.
In general, the candidate title information and each piece of general title information are segmented in the same manner.
In a specific implementation, a general-purpose word segmentation tool (such as jieba, HanLP or StanfordNLP) can be used to segment each piece of general title information, which is convenient to operate and easy to extend.
Considering that the titles of medicines contain professional medical vocabulary, the segmentation of each piece of general title information can be assisted, on top of the general-purpose tool, by loading a medical-domain dictionary, a custom dictionary, regular-expression matching and the like, so that segmentation accuracy is improved.
Further, statistics on the general title information may be generated at intervals (e.g., 30 days), including the number of times each piece of general title information has been put in storage and its word segments (i.e., the second candidate word segments).
In one example, the statistics for some of the drugs successfully put in storage in the system database of the e-commerce platform are as follows:
22-Ditong rhinitis water (AA)-['Ditong', 'rhinitis', 'water', '(', 'AA', ')']
11-Ditong rhinitis water (GGG)-['Ditong', 'rhinitis', 'water', '(', 'GGG', ')']
33-Ditong rhinitis water spray (10 ml) ×3 bottles-['Ditong', 'rhinitis', 'water', 'spray', '(', '10ml', ')', '3', 'bottle']
25-Liquiritigenin Capsule-['Liquiritigenin', 'Zinc', 'Capsule']
In this example, each statistics record is divided into fields by "-": the first field is the number of times the general title information has been put in storage (i.e., "22", "11", "33", "25", etc.), the second field is the general title information, and the third field is the word segments of the general title information.
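A statistics record in this three-field layout could be parsed along the following lines. This is a sketch under two assumptions that the source does not state: the count never contains "-", and the bracketed segment list never contains "-" either, so the first and last "-" delimit the fields. `TitleStat` and `parse_stat` are hypothetical names.

```python
import ast
from typing import NamedTuple


class TitleStat(NamedTuple):
    stored_count: int   # number of times the title was put in storage
    title: str          # general title information
    segments: list      # second candidate word segments


def parse_stat(line: str) -> TitleStat:
    """Parse "count-title-[segments]" using the first and last "-" as separators."""
    count, rest = line.split("-", 1)
    title, segments = rest.rsplit("-", 1)
    # literal_eval only accepts Python literals, so it is safe for this list syntax.
    return TitleStat(int(count), title.strip(), ast.literal_eval(segments.strip()))


stat = parse_stat(
    "22-Ditong rhinitis water (AA)-['Ditong', 'rhinitis', 'water', '(', 'AA', ')']"
)
print(stat.stored_count, stat.title, len(stat.segments))
```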
Step 1043, if a first candidate word segment is the same as a second candidate word segment, marking the first candidate word segment or the second candidate word segment as a target word segment.
For given candidate title information and general title information, each first candidate word segment may be compared with each second candidate word segment.
If a first candidate word segment is identical to a second candidate word segment, the two are co-occurring words, and the first candidate word segment or the second candidate word segment may be marked as a target word segment.
In one example, the first candidate word segments of the candidate title information "merchant rhinitis water #aa" (i.e., "Ditong", "rhinitis", "water", "AA") are compared with the second candidate word segments of the general title information "Ditong rhinitis water (AA)" (i.e., "Ditong", "rhinitis", "water", "(", "AA", ")") to obtain the target word segments "Ditong", "rhinitis", "water" and "AA".
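Marking co-occurring word segments amounts to an ordered intersection of the two segment lists. The sketch below uses the segment lists from the example above; counting a repeated segment only once is an assumption of this sketch, since the source does not say how duplicates are handled, and `target_segments` is a hypothetical name.

```python
def target_segments(first, second):
    """Return segments of `first` that also occur in `second`, in order, once each."""
    second_set = set(second)
    seen = set()
    targets = []
    for seg in first:
        if seg in second_set and seg not in seen:
            seen.add(seg)
            targets.append(seg)
    return targets


first = ["Ditong", "rhinitis", "water", "AA"]
second = ["Ditong", "rhinitis", "water", "(", "AA", ")"]
print(target_segments(first, second))
# ['Ditong', 'rhinitis', 'water', 'AA']
```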
Step 1044, calculating the probability that the candidate title information belongs to each piece of general title information according to the target word segments.
In general, for titles of the same medicine, the candidate title information and the general title information are similar and share a number of co-occurring words (i.e., target word segments), so the co-occurring words can be used as a basis for quantifying the edit distance and thereby evaluating the probability that the candidate title information belongs to the general title information.
In general, the probability that the candidate title information belongs to the general title information is positively correlated with the number of target word segments: the fewer the target word segments, the lower the probability, and the more the target word segments, the higher the probability.
In one embodiment of the present invention, step 1044 may further include the steps of:
Step 10441, configuring a sub-weight for each target word segment.
For given candidate title information and general title information, an appropriate sub-weight can be configured for each target word segment according to factors such as its type (such as text, symbols and the like), position and length.
In a specific implementation, a confidence interval having an upper limit value and a lower limit value may be preset. The confidence interval is the range of lengths within which the word segments obtained by segmenting medicine titles are keywords that have a greater influence on the title of the medicine.
In this implementation, the length of each target word segment may be counted and compared with the confidence interval.
In one case, if the length is below the preset confidence interval (i.e., the length is less than the lower limit value of the confidence interval), a preset first value is set as the sub-weight of the target word segment.
Further, the first value may be a default empirical value, and different kinds of medicines may uniformly use one empirical value as the first value.
In addition, the names of medicines contain technical terms of the medical field whose lengths vary greatly, so the lengths of the word segments of these terms also vary greatly. To adapt to different kinds of medicines, the first value may be adjusted adaptively based on the kind of medicine, which improves the accuracy of the probability.
For a dynamic first value, the identification information of the medicine (such as its name) can be queried in the candidate title information, for example by fuzzy keyword matching.
The first value configured for the current kind of medicine is then queried according to the identification information and set as the sub-weight of the target word segment.
In another case, if the length is within the preset confidence interval (i.e., the length is greater than or equal to the lower limit value of the confidence interval and less than or equal to the upper limit value of the confidence interval), the second value configured for the length is set as the sub-weight of the target word segment.
The second value is positively correlated with the length: the longer the target word segment, the larger the sub-weight configured for it, and the shorter the target word segment, the smaller the sub-weight configured for it.
Further, the second value may be a default empirical value, and different kinds of medicines may uniformly use one empirical value as the second value.
In addition, considering that the names of the medicines contain the technical terms in the medicine field, the lengths of the technical terms are greatly different, the lengths of the word segmentation of the technical terms are also greatly different, and in order to adapt to different kinds of medicines, the second numerical value can be a numerical value which is adaptively adjusted based on the kinds of medicines, so that the accuracy of probability is improved.
For a dynamic second value, the identification information of the medicine (such as its name) can be queried in the candidate title information, for example by fuzzy keyword matching.
A plurality of mapping relations configured for the current kind of medicine are then queried according to the identification information, each mapping relation recording the relation between a length and a second value.
The second value corresponding to the current length is looked up in the mapping relations and used as the sub-weight of the target word segment.
In yet another case, if the length is above the preset confidence interval (i.e., the length is greater than the upper limit value of the confidence interval), a preset third value is set as the sub-weight of the target word segment.
Further, the third value may be a default empirical value, and different kinds of medicines may uniformly use one empirical value as the third value.
In addition, the names of medicines contain technical terms of the medical field whose lengths vary greatly, so the lengths of the word segments of these terms also vary greatly. To adapt to different kinds of medicines, the third value may be adjusted adaptively based on the kind of medicine, which improves the accuracy of the probability.
For a dynamic third value, the identification information of the medicine (such as its name) can be queried in the candidate title information, for example by fuzzy keyword matching.
The third value configured for the medicine is then queried according to the identification information and set as the sub-weight of the target word segment.
Wherein the first value is less than the third value, and the third value is less than the second value.
In one example, the confidence interval set for a certain kind of medicine is [2, 5]. When the length of the target word segment is 1, the segment is likely to be a line-feed symbol, a bracket or the like, so 0 (first value) may be set as the sub-weight; when the length is 2, 0.2 (second value) may be set as the sub-weight; when the length is 3, 0.4 (second value) may be set as the sub-weight; when the length is 4, 0.6 (second value) may be set as the sub-weight; and when the length is greater than or equal to 6, 0.1 (third value) may be set as the sub-weight.
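The three cases can be sketched as a small lookup function using the numbers from this example. The function name, parameter names and default arguments are hypothetical. The example gives no second value for length 5, so the sketch deliberately raises an error for it rather than guessing one.

```python
def sub_weight(length, lower=2, upper=5, first_value=0.0, third_value=0.1,
               second_values=None):
    """Pick a sub-weight from the length of a target word segment."""
    if second_values is None:
        # Length-to-second-value mapping taken from the example; length 5 is
        # not given there and is intentionally left out.
        second_values = {2: 0.2, 3: 0.4, 4: 0.6}
    if length < lower:        # below the confidence interval
        return first_value
    if length > upper:        # above the confidence interval
        return third_value
    return second_values[length]  # inside the interval; KeyError if unmapped


for n in (1, 2, 3, 4, 6):
    print(n, sub_weight(n))
```

Passing a different `second_values` mapping per kind of medicine corresponds to the dynamically configured values described above.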
Step 10442, calculating the sum of the sub-weights of all the target word segments as the total weight.
For given candidate title information and general title information, the sub-weights of all target word segments may be added to obtain the total weight.
Step 10443, counting a first number of times the general title information has historically been entered.
In the system of the e-commerce platform, the first number of times is read, namely the number of times the general title information was entered into the database as the title of a medicine put in storage during the most recent period (such as 30 days).
Step 10444, adding one to the first number of times to obtain a second number of times.
The current medicine adds one more entry into storage, so one is added to the first number of times to obtain the second number of times.
Step 10445, counting a first number of the first candidate word segments and a second number of the second candidate word segments respectively.
For the given candidate title information, the first number of all its first candidate word segments is counted; for the given general title information, the second number of all its second candidate word segments is counted.
Step 10446, adding the first number to the second number to obtain a total number.
The first number and the second number are summed to obtain the total number of all word segments.
Step 10447, dividing the product of the total weight and the first number of times by the product of the total number and the second number of times to obtain the probability that the candidate title information belongs to the general title information.
That is, the product of the total weight and the first number of times is divided by the product of the total number and the second number of times, and the quotient is the probability that the candidate title information belongs to the general title information.
Then, the probability that the candidate title information belongs to the general title information can be expressed as:
$$P = \frac{\left(\sum_{i=1}^{n} w_i\right) \times s_1}{(t_1 + t_2) \times s_2}$$
Wherein, $P$ is the probability that the candidate title information belongs to the general title information, $w_i$ is the sub-weight of the $i$-th target word segment, $n$ is the total number of target word segments, $s_1$ is the first number of times the general title information has historically been entered, $s_2$ is the second number of times obtained by adding the title of the medicine currently being put in storage to the first number of times (i.e., $s_2 = s_1 + 1$), $t_1$ is the first number of all first candidate word segments, and $t_2$ is the second number of all second candidate word segments.
In one example, the probability that the candidate title information "merchant rhinitis water #aa" belongs to the general title information "Ditong rhinitis water (AA)" is calculated as follows:
The sub-weights of the target word segments "Ditong", "rhinitis" and "AA" are all 0.2 and the sub-weight of the target word segment "water" is 0.1; the first number of times the general title information "Ditong rhinitis water (AA)" has been put in storage is 22, so the second number of times, which includes the current entry, is 23; the first number of all first candidate word segments in the candidate title information "merchant rhinitis water #aa" is 5, and the second number of all second candidate word segments in the general title information "Ditong rhinitis water (AA)" is 6.
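Plugging these numbers into the ratio described in step 10447 (total weight times first number of times, divided by total number of word segments times second number of times) can be sketched as follows. The function and parameter names are hypothetical; only the input values come from the example.

```python
def title_probability(sub_weights, s1, t1, t2):
    """Probability that a candidate title belongs to one general title.

    sub_weights: sub-weights of the target word segments
    s1: first number of times the general title was historically entered
    t1, t2: numbers of first and second candidate word segments
    """
    total_weight = sum(sub_weights)
    s2 = s1 + 1               # second number of times: current entry included
    total_number = t1 + t2    # all word segments of both titles
    return (total_weight * s1) / (total_number * s2)


p = title_probability([0.2, 0.2, 0.2, 0.1], s1=22, t1=5, t2=6)
print(round(p, 4))  # 0.0609
```

With these inputs the total weight is 0.7, so the ratio is (0.7 × 22) / (11 × 23) = 15.4 / 253, roughly 0.0609. The absolute value is small, so what matters for the next step is how it compares with the values obtained for the other general titles.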
Step 105, extracting part of the general title information as target title information according to the probability.
In this embodiment, the probabilities that the candidate title information belongs to each piece of general title information may be compared, and the pieces of general title information with higher probabilities may be screened out and recorded as the target title information.
Illustratively, the general title information is sorted in descending order of probability to obtain a title sequence, in which a piece of general title information with a higher probability is ranked earlier and one with a lower probability is ranked later.
The top-ranked pieces of general title information are extracted as the target title information, that is, the pieces of general title information with the highest probabilities are screened out as the target title information.
In one example, the probability that the candidate title information "merchant rhinitis water #aa" belongs to each piece of general title information is calculated; if the probability that it belongs to the general title information "Ditong rhinitis water (AA)" is the highest, the general title information "Ditong rhinitis water (AA)" may be set as the target title information.
Of course, the above manner of selecting the target title information is merely an example. In implementing this embodiment, other manners may be adopted according to the actual situation, for example, extracting as the target title information the pieces of general title information whose probabilities are greater than a certain threshold and are the highest; this embodiment is not limited thereto. Those skilled in the art may also adopt other manners of selecting the target title information according to actual needs, which is likewise not limited in this embodiment.
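Both selection manners mentioned above (top-ranked titles, optionally restricted to those above a threshold) can be sketched in one function. The probabilities in the usage example are made-up values for illustration only, and `select_targets`, `top_k` and `threshold` are hypothetical names.

```python
def select_targets(probabilities, top_k=3, threshold=None):
    """Return up to top_k titles in descending order of probability.

    probabilities: mapping from general title to its probability
    threshold: if given, only titles with a strictly greater probability are kept
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(title, p) for title, p in ranked if p > threshold]
    return [title for title, _ in ranked[:top_k]]


# Hypothetical probabilities, not taken from the source.
probs = {
    "Ditong rhinitis water (AA)": 0.06,
    "Ditong rhinitis water (GGG)": 0.03,
    "licorzinc capsule": 0.01,
}
print(select_targets(probs, top_k=2))
print(select_targets(probs, top_k=3, threshold=0.02))
```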
Step 106, if a confirmation operation triggered on the target title information is received, correcting the candidate title information to the target title information.
In this embodiment, the pieces of general title information with higher probabilities (i.e., the target title information) are fed back to the staff of the e-commerce platform as a reference for entering the title of the medicine.
For example, as shown in fig. 3, in the system of the e-commerce platform, optional target title information is provided for each drug, where the target title information is respectively:
Business rhinitis-treating liquid (AA)
Yiqing granule (BBB)
Six ingredients with rehmannia capsule (CC)
Huoxiang Zhengqi liquid (DDDD)
Huoxiang qingwei tablet (EE)
Xiaomeng tablet (FFF)
Wherein, "AA", "BBB", "CC", "DDDD", "EE", "FFF" represent manufacturer's brands.
In this example, the correction removes the symbol "#" that OCR inserted to mark a line feed, removes noise in the list (i.e., "lay-out") caused by contamination of the list (i.e., "expiration") during OCR, corrects OCR recognition errors (recognizing "herba" as "photophobia" and "tomato", and recognizing "amax" as "lemon"), and unifies the titles into the format "name of the medicine (trademark of manufacturer)".
When entering the title of the medicine, the staff can further check the target title information against the image data or the list.
If the staff confirms that the target title information is correct, the target title information is entered into the system of the e-commerce platform as the title of the medicine.
If the staff finds an error in the target title information, the target title information is corrected according to the image data or the list, and the corrected target title information is entered into the system of the e-commerce platform as the title of the medicine.
In general, the detection errors of an optical character recognition model follow a long-tail distribution. If a large number of samples were collected to train the model in order to overcome its errors in recognizing medicine titles, the cost of collecting and labeling the samples would be high and overfitting would be likely, which would degrade the performance of the model in other businesses.
In addition, suppliers come from a wide range of sources and their formats are diverse and hard to unify, so the writing errors and non-standard formats of suppliers are difficult to correct at the source.
In this embodiment, a plurality of text blocks are detected in image data when a medicine is put in storage, the image data containing a list of medicines ordered from a supplier; text blocks belonging to the title of the medicine are searched for as candidate title information; the candidate title information is compared with each piece of general title information, where the general title information represents the title of a drug already in storage; if the candidate title information is different from each piece of general title information, the probability that the candidate title information belongs to each piece of general title information is calculated according to the edit distance; part of the general title information is extracted according to the probability as target title information; and if a confirmation operation triggered on the target title information is received, the candidate title information is corrected to the target title information. Correcting the candidate title information recognized from the list by using the general title information simultaneously addresses detection errors of the optical character recognition model, non-standard supplier formats, writing errors and similar problems. Because titles are short, the edit distance is used to calculate the probability that the candidate title information belongs to each piece of general title information, which keeps the calculation simple while ensuring the accuracy of the probability. Training the optical character recognition model is avoided, so the cost is low and the performance of the model in other businesses is preserved. In addition, errors are effectively reduced and the work of manually checking medicine titles is reduced, which greatly improves the simplicity and, in turn, the efficiency of entering medicine titles.
Example two
Fig. 5 is a schematic structural diagram of a drug title correction device according to a second embodiment of the present invention. As shown in fig. 5, the apparatus includes:
a text block detection module 501 for detecting a plurality of text blocks in image data containing a list of orders for the medicines to a supplier at the time of medicine warehousing;
A candidate title information searching module 502, configured to search the text block belonging to the title of the drug as candidate title information;
A universal title information comparing module 503, configured to compare the candidate title information with each universal title information, where the universal title information indicates a title of the drug in storage;
a probability calculation module 504, configured to calculate, if the candidate title information is different from each piece of general title information, a probability that the candidate title information belongs to each piece of general title information according to the edit distance;
A target title information extraction module 505, configured to extract part of the generic title information according to the probability, as target title information;
And a candidate title information correction module 506, configured to correct the candidate title information to the target title information if a confirmation operation triggered by the target title information is received.
In one embodiment of the present invention, the candidate title information search module 502 includes:
the identification information searching module is used for searching the identification information of the provider in the text block;
The template inquiry module is used for inquiring a template created for the provider according to the identification information, wherein a detection frame belonging to a title is marked in the template;
a template mapping module for mapping the template onto the image data;
the overlapping determining module is used for determining that the text block belongs to the title of the medicine and is used as candidate title information if the overlapping degree between the detection frame and the target object is larger than a preset first overlapping threshold value;
wherein the target object is a single text block or comprises a plurality of text blocks and at least one symbol representing line feed;
And/or,
The identification information searching module is used for searching the identification information of the provider in the text block;
a cell identification module for identifying a cell in the image data, the cell having a plurality of vertices therein;
The cell deleting module is used for deleting the cell with the smallest area from the two cells if the overlapping degree of the two cells is larger than a preset second overlapping threshold value;
The vertex merging module is used for merging adjacent vertexes if deletion is completed;
The vertex completion module is used for completing the vertex according to the sequence of the rows and the columns if the combination is completed;
The vertex removing module is used for removing the isolated vertexes according to the sequence of the rows and the columns if the completion is completed;
The table extraction module is used for extracting a plurality of cells with the largest connected areas as tables if the removal is completed;
a table mapping module for mapping the table onto the image data;
A row and column determining module, configured to determine, according to the identification information, that a target object in a cell located in a specified row or column of the table belongs to the title of the drug, and to use it as candidate title information;
Wherein the target object is a single text block or the target object comprises a plurality of text blocks and at least one symbol representing line feed.
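The behaviour of the cell deleting module (when two cells overlap by more than the second overlap threshold, the one with the smaller area is deleted) can be sketched as below. The source does not define how the overlap degree is measured; intersection over union on axis-aligned boxes `(x1, y1, x2, y2)` is an assumption of this sketch, as are the function names and the threshold of 0.8.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if degenerate."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def iou(a, b):
    """Intersection over union, used here as the assumed overlap degree."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def drop_overlapping_cells(cells, threshold=0.8):
    """Keep larger cells first; drop a cell that overlaps a kept one too much."""
    kept = []
    for cell in sorted(cells, key=area, reverse=True):
        if all(iou(cell, other) <= threshold for other in kept):
            kept.append(cell)
    return kept


cells = [(0, 0, 100, 40), (1, 1, 99, 39), (100, 0, 200, 40)]
print(drop_overlapping_cells(cells))
```

Visiting the cells from largest to smallest means that, within any overlapping pair, the smaller cell is the one that gets discarded.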
In one embodiment of the present invention, the probability calculation module 504 includes:
the first word segmentation module is used for segmenting the candidate title information into a plurality of first candidate word segments;
The second word segmentation module is used for segmenting the general title information into a plurality of second candidate word segments;
the target word segmentation marking module is used for marking the first candidate word segmentation or the second candidate word segmentation as target word segmentation if the first candidate word segmentation is the same as the second candidate word segmentation;
And the target word segmentation calculation module is used for calculating the probability that the candidate title information belongs to each general title information according to the target word segmentation.
In one embodiment of the present invention, the target word segmentation calculation module includes:
The sub-weight configuration module is used for configuring sub-weights for the target word segmentation;
The total weight calculation module is used for calculating the sum of the sub weights of all the target word segments to be used as the total weight;
the first time counting module is used for counting the first times of recording the general title information historically;
The second time calculation module is used for adding one on the basis of the first time to obtain a second time;
The word segmentation quantity counting module is used for counting the first quantity of the first candidate words and the second quantity of the second candidate words respectively;
A total number calculation module for adding the first number and the second number to obtain a total number;
And the ratio calculating module is used for dividing the product of the total weight and the first number of times by the product of the total number and the second number of times to obtain the probability that the candidate title information belongs to the general title information.
In one embodiment of the present invention, the sub-weight configuration module includes:
The length statistics module is used for counting the length of each target word;
The first numerical value setting module is used for setting a preset first numerical value as the sub-weight of the target word segmentation if the length is smaller than a preset confidence interval;
the second value setting module is used for setting a second value configured for the length as the sub-weight of the target word segment if the length is in a preset confidence interval, and the second value is positively related to the length;
the third numerical value setting module is used for setting a preset third numerical value as the sub-weight of the target word segmentation if the length is larger than a preset confidence interval;
wherein the first value is less than the third value, and the third value is less than the second value.
In one embodiment of the present invention, the sub-weight configuration module further includes:
the medicine identification querying module is used for querying the identification information of the medicine in the candidate title information;
the first value setting module is further configured to:
query a first value configured for the medicine according to the identification information;
set the first value as the sub-weight of the target word segment;
the second value setting module is further configured to:
query a plurality of mapping relations configured for the medicine according to the identification information, wherein each mapping relation records the relation between a length and a second value;
query the second value corresponding to the length in the mapping relations as the sub-weight of the target word segment;
the third value setting module is further configured to:
query a third value configured for the medicine according to the identification information;
and set the third value as the sub-weight of the target word segment.
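One way to picture the per-medicine configuration is a lookup table keyed by the identification information. The sketch below is an assumption-laden illustration: the identifier "drug-001", the dictionary layout, the numeric values, and the idea that the confidence interval is also stored per medicine are all hypothetical and are not stated in the text.

```python
# Hypothetical configuration keyed by the medicine's identification information.
DRUG_CONFIG = {
    "drug-001": {
        "interval": (2, 4),                         # assumed per-medicine interval
        "first_value": 0.1,                         # used below the interval
        "second_values": {2: 0.6, 3: 0.8, 4: 1.0},  # mapping: length -> second value
        "third_value": 0.4,                         # used above the interval
    },
}


def sub_weight_for_drug(drug_id: str, length: int, config=DRUG_CONFIG) -> float:
    """Look up the sub-weight of a target word segment for a given medicine."""
    entry = config[drug_id]
    low, high = entry["interval"]
    if length < low:
        return entry["first_value"]
    if length > high:
        return entry["third_value"]
    # Inside the interval, the mapping relation supplies the second value.
    return entry["second_values"][length]
```

Keeping the values per medicine lets titles with characteristically long or short word segments be weighted differently without changing the calculation itself.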
In one example of an embodiment of the present invention, the title of the drug includes at least one of:
the name of the drug, the dosage of the drug, the number of the drugs, the brand of the vendor, the symbol.
In one embodiment of the present invention, the target title information extraction module 505 includes:
the title ordering module is used for sorting the general title information in descending order of probability to obtain a title sequence;
and the sequence extraction module is used for extracting the several pieces of general title information ranked first in the title sequence as target title information.
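The ordering and extraction steps amount to a descending sort followed by taking the leading entries. In the sketch below, the function name, the dictionary input, and the default of three results are hypothetical; the text does not state how many titles are extracted.

```python
from typing import Dict, List


def extract_target_titles(probabilities: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k general titles with the highest probability.

    probabilities maps each general title to the probability that the
    candidate title belongs to it.
    """
    # Sort in descending order of probability to obtain the title sequence.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    # Keep only the titles at the front of the sequence.
    return [title for title, _ in ranked[:k]]
```

The returned titles are the ones that would be offered for confirmation before the candidate title is corrected.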
The medicine title correction device provided by the embodiment of the invention can execute the medicine title correction method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the medicine title correction method.
Example III
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11 and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12 and a Random Access Memory (RAM) 13. The memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the respective methods and processes described above, such as the correction method of the medicine title.
In some embodiments, the correction method of the medicine title may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the above-described correction method of the medicine title may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the correction method of the medicine title in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS services.
Example IV
The embodiments of the present invention also provide a computer program product comprising a computer program which, when executed by a processor, implements the correction method of the medicine title provided by any of the embodiments of the present invention.
In implementing the computer program product, the computer program code for carrying out operations of the present invention may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (6)

1. A method for correcting a drug title, comprising:
detecting a plurality of text blocks in image data at the time of warehousing of a drug, the image data including a list of orders for the drug from a supplier;
searching for the text blocks belonging to the title of the medicine as candidate title information;
comparing the candidate title information with each piece of general title information, wherein the general title information represents the title of a drug which has been put in storage;
if the candidate title information is different from each piece of general title information, calculating the probability that the candidate title information belongs to each piece of general title information according to the edit distance;
extracting part of the general title information according to the probability as target title information;
if a confirmation operation triggered on the target title information is received, correcting the candidate title information into the target title information;
wherein the calculating the probability that the candidate title information belongs to each piece of general title information according to the edit distance includes:
dividing the candidate title information into a plurality of first candidate word segments;
dividing the general title information into a plurality of second candidate word segments;
if a first candidate word segment is the same as a second candidate word segment, marking the first candidate word segment or the second candidate word segment as a target word segment;
calculating the probability that the candidate title information belongs to each piece of general title information according to the target word segments;
the calculating the probability that the candidate title information belongs to each piece of general title information according to the target word segments includes:
configuring a sub-weight for each target word segment;
calculating the sum of the sub-weights of all the target word segments as the total weight;
counting the first times for which the general title information has been recorded historically;
adding one to the first times to obtain the second times;
counting a first number of the first candidate word segments and a second number of the second candidate word segments, respectively;
adding the first number to the second number to obtain a total number;
dividing the product of the total weight and the first times by the product of the total number and the second times to obtain the probability that the candidate title information belongs to the general title information;
the configuring a sub-weight for each target word segment includes:
counting the length of each target word segment;
if the length is below a preset confidence interval, setting a preset first value as the sub-weight of the target word segment;
if the length is within the preset confidence interval, setting a second value configured for the length as the sub-weight of the target word segment, wherein the second value is positively correlated with the length;
if the length is above the preset confidence interval, setting a preset third value as the sub-weight of the target word segment;
wherein the first value is less than the third value, and the third value is less than the second value;
the configuring a sub-weight for each target word segment further includes:
querying identification information of the medicine in the candidate title information;
the setting a preset first value as the sub-weight of the target word segment includes:
querying a first value configured for the medicine according to the identification information;
setting the first value as the sub-weight of the target word segment;
the setting a second value configured for the length as the sub-weight of the target word segment includes:
querying a plurality of mapping relations configured for the medicine according to the identification information, wherein each mapping relation records the relation between a length and a second value;
querying the second value corresponding to the length in the mapping relations as the sub-weight of the target word segment;
the setting a preset third value as the sub-weight of the target word segment includes:
querying a third value configured for the medicine according to the identification information;
and setting the third value as the sub-weight of the target word segment.
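The marking of target word segments recited in claim 1 can be illustrated with a short sketch. Splitting on whitespace is purely an assumption made to keep the example self-contained; the claim does not specify a word segmentation method, and the function name and sample titles are hypothetical.

```python
from typing import List, Tuple


def target_word_segments(candidate_title: str,
                         general_title: str) -> Tuple[List[str], List[str], List[str]]:
    """Split both titles and mark the word segments they share."""
    first = candidate_title.split()    # first candidate word segments
    second = general_title.split()     # second candidate word segments
    second_set = set(second)
    # A first candidate word segment that also occurs among the second
    # candidate word segments is marked as a target word segment.
    targets = [segment for segment in first if segment in second_set]
    return first, second, targets
```

For a misrecognized title such as "Amoxicilin Capsules 0.25g" compared with "Amoxicillin Capsules 0.25g", the target word segments are "Capsules" and "0.25g", and the first and second numbers are both 3.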
2. The method according to claim 1, wherein said searching for said text block belonging to a title of said medicine as candidate title information comprises:
Searching the text block for the identification information of the provider;
Inquiring a template created for the provider according to the identification information, wherein a detection frame belonging to a title is marked in the template;
Mapping the template onto the image data;
If the degree of overlapping between the detection frame and the target object is larger than a preset first overlapping threshold value, determining that the text block belongs to the title of the medicine as candidate title information;
wherein the target object is a single text block or comprises a plurality of text blocks and at least one symbol representing line feed;
and/or,
Searching the text block for the identification information of the provider;
Identifying a cell in the image data, the cell having a plurality of vertices therein;
if the overlapping degree between two cells is larger than a preset second overlapping threshold value, deleting the cell with the smaller area of the two cells;
if the deletion is completed, merging the adjacent vertexes;
If the merging is completed, the vertexes are complemented according to the sequence of the rows and the columns;
if the completion is completed, removing the isolated vertexes according to the sequence of the rows and the columns;
if the removal is completed, extracting a plurality of cells with the largest connected areas as a table;
mapping the table onto the image data;
determining, according to the identification information, that the target object in a cell located in the specified row or column of the table belongs to the title of the medicine, and taking it as candidate title information;
Wherein the target object is a single text block or the target object comprises a plurality of text blocks and at least one symbol representing line feed.
3. The method of any one of claims 1-2, wherein the title of the pharmaceutical product comprises at least one of:
the name of the drug, the dose of the drug, the number of the drugs, the brand of the vendor, the symbol;
the extracting part of the general title information according to the probability as target title information includes:
sorting the general title information in descending order of probability to obtain a title sequence;
and extracting the several pieces of general title information ranked first in the title sequence as target title information.
4. A medicine title correction device, comprising:
A text block detection module for detecting a plurality of text blocks in image data containing a list of orders for the medicines from a supplier when the medicines are put in storage;
The candidate title information searching module is used for searching the text blocks belonging to the title of the medicine as candidate title information;
the general title information comparison module is used for comparing the candidate title information with each general title information, and the general title information represents the title of the drug which is put in storage;
the probability calculation module is used for calculating the probability that the candidate title information belongs to each piece of general title information according to the edit distance if the candidate title information is different from each piece of general title information;
the target title information extraction module is used for extracting part of the general title information according to the probability to serve as target title information;
The candidate title information correction module is used for correcting the candidate title information into the target title information if a confirmation operation triggered by the target title information is received;
wherein the probability calculation module includes:
the first word segmentation module is used for dividing the candidate title information into a plurality of first candidate word segments;
the second word segmentation module is used for dividing the general title information into a plurality of second candidate word segments;
the target word segment marking module is used for marking the first candidate word segment or the second candidate word segment as a target word segment if the first candidate word segment is the same as the second candidate word segment;
the target word segment calculation module is used for calculating the probability that the candidate title information belongs to each piece of general title information according to the target word segments;
the target word segment calculation module includes:
the sub-weight configuration module is used for configuring a sub-weight for each target word segment;
the total weight calculation module is used for calculating the sum of the sub-weights of all the target word segments as the total weight;
the first times counting module is used for counting the first times for which the general title information has been recorded historically;
the second times calculation module is used for adding one to the first times to obtain the second times;
the word segment number counting module is used for counting a first number of the first candidate word segments and a second number of the second candidate word segments, respectively;
the total number calculation module is used for adding the first number and the second number to obtain a total number;
the ratio calculating module is used for dividing the product of the total weight and the first times by the product of the total number and the second times to obtain the probability that the candidate title information belongs to the general title information;
the sub-weight configuration module includes:
the length statistics module is used for counting the length of each target word segment;
the first value setting module is used for setting a preset first value as the sub-weight of the target word segment if the length is below a preset confidence interval;
the second value setting module is used for setting a second value configured for the length as the sub-weight of the target word segment if the length is within the preset confidence interval, the second value being positively correlated with the length;
the third value setting module is used for setting a preset third value as the sub-weight of the target word segment if the length is above the preset confidence interval;
wherein the first value is less than the third value, and the third value is less than the second value;
the sub-weight configuration module further includes:
the medicine identification querying module is used for querying the identification information of the medicine in the candidate title information;
the first value setting module is further configured to:
query a first value configured for the medicine according to the identification information;
set the first value as the sub-weight of the target word segment;
the second value setting module is further configured to:
query a plurality of mapping relations configured for the medicine according to the identification information, wherein each mapping relation records the relation between a length and a second value;
query the second value corresponding to the length in the mapping relations as the sub-weight of the target word segment;
the third value setting module is further configured to:
query a third value configured for the medicine according to the identification information;
and set the third value as the sub-weight of the target word segment.
5. An electronic device, the electronic device comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of correcting a drug title as claimed in any one of claims 1-3.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for causing a processor to execute the correction method of a drug title according to any one of claims 1-3.
CN202311497100.5A 2023-11-10 2023-11-10 Correction method, device, equipment and storage medium for medicine title Active CN117523570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311497100.5A CN117523570B (en) 2023-11-10 2023-11-10 Correction method, device, equipment and storage medium for medicine title

Publications (2)

Publication Number Publication Date
CN117523570A CN117523570A (en) 2024-02-06
CN117523570B true CN117523570B (en) 2024-05-14

Family

ID=89750761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311497100.5A Active CN117523570B (en) 2023-11-10 2023-11-10 Correction method, device, equipment and storage medium for medicine title

Country Status (1)

Country Link
CN (1) CN117523570B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590447A (en) * 2017-08-29 2018-01-16 北京奇艺世纪科技有限公司 A kind of caption recognition methods and device
CN111582169A (en) * 2020-05-08 2020-08-25 腾讯科技(深圳)有限公司 Image recognition data error correction method, device, computer equipment and storage medium
CN112446351A (en) * 2020-12-09 2021-03-05 杭州米数科技有限公司 Medical bill intelligent recognition system solution
CN115223188A (en) * 2022-07-29 2022-10-21 盐城金堤科技有限公司 Bill information processing method, device, electronic equipment and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2651144C2 (en) * 2014-03-31 2018-04-18 Общество с ограниченной ответственностью "Аби Девелопмент" Data input from images of the documents with fixed structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant