CN105184052A

CN105184052A - Automatic coding method and system for medicine information

Info

Publication number: CN105184052A
Application number: CN201510496433.5A
Authority: CN
Inventors: 金以东; 朱华玲; 陈志永
Original assignee: Ebaotech Internet Medical Information Technology (beijing) Co Ltd
Current assignee: Ebaotech Internet Medical Information Technology (beijing) Co Ltd
Priority date: 2015-08-13
Filing date: 2015-08-13
Publication date: 2015-12-23
Anticipated expiration: 2035-08-13
Also published as: CN105184052B

Abstract

Embodiments of the present invention provide an automatic coding method and system for medicine information. The automatic coding method for medicine information comprises: inputting a medicine information string and performing preprocessing; performing segmenting to obtain a specification string and a package specification string; based on a pre-established dictionary set, performing segmenting to obtain sub strings; when the obtained sub strings are all first-type sub strings that directly match entries in the dictionary set and the entries do not contain entries with a blocking entry attribute, searching for a target sub-entry by using a medicine joint information dictionary; and finally, searching for a directly matching joint entry according to the target sub-entry, and assigning a code of the joint entry to the medicine information string. According to the automatic coding method and system for the medicine information provided by the present invention, features that the medicine information strings belong to a natural language, have complex and various formats and do not have a universal standard and the like are fully taken into consideration; accurate identification and precise coding of the medicine information strings are implemented; an identification result and a coding result have relatively high accuracy; and convenience is provided for effective use of medicine information.

Description

Automatic coding method and system for medicine information

Technical Field

The embodiment of the invention relates to the field of medical informatization, in particular to a method and a system for automatically encoding medicine information.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

With the rapid development of information technology, the medical industry in China is accelerating the construction of medical informatization. The medical information construction is beneficial to improving the medical treatment efficiency, provides good experience for patients and provides great help for improving the medical service quality.

Drug information management is an important basis for medical insurance settlement and also an important component of medical informatization construction. Different medicines are coded according to a certain rule, so that the medicine information can be standardized and formatted, the efficiency of utilizing and managing the medicine information is greatly improved, and the method has important significance for developing medical informatization construction.

Disclosure of Invention

In actual clinical practice, a large amount of medical record information is generated every day. This information includes many pieces of medicine information that medical practitioners input when treating patients, and the research and utilization of this medicine information is of great significance for the development of medical informatization. Given the massive amount of medicine information generated every day, having a computer identify and code the medicine information is one of the effective ways of improving its utilization and management.

However, the medicine information character strings input by medical practitioners are natural language, with complex and varied formats and no unified standard. For example, multiple languages are mixed, irregular grammar is used, incorrect information is input, abbreviations or common names replace standard terms, and irrelevant symbols and other noise are mixed into the text. It is therefore quite difficult for a computer to code the medicines, and even when the medicines can be coded according to preset rules, the error rate is high.

For this reason, there is a great need for an automatic coding method for drug information, so as to code drugs according to drug information quickly, efficiently and accurately.

In this context, embodiments of the present invention are intended to provide a method and system for automatically encoding drug information.

In a first aspect of embodiments of the present invention, there is provided a method for automatically encoding drug information, including:

step 1, inputting a medicine information character string;

step 2, preprocessing the medicine information character string to obtain a preprocessed medicine information character string;

step 3, based on a preset specification dictionary and a preset packaging specification dictionary, cutting a specification character string and a packaging specification character string from the preprocessed medicine information character string; wherein the specification dictionary comprises a plurality of entries representing specification units of the medicine; the packaging specification dictionary comprises a plurality of entries which represent packaging specification units of the medicines; the specification character string represents specification information of the medicine; the packaging specification character string represents packaging specification information of the medicine;

step 4, based on a preset dictionary set, cutting the residual characters of the preprocessed medicine information character string into a plurality of sub character strings, wherein the sub character strings are first type sub character strings or second type sub character strings; the dictionary set consists of a plurality of dictionaries, wherein the dictionaries comprise a plurality of entries for representing the universal names, commodity names, product names, administration routes, dosage forms, manufacturers and packaging materials of medicines, and a plurality of entries for representing two or more combined universal names and names of medicines self-made by hospitals; the first type of substring is capable of directly matching an entry in the dictionary set, the second type of substring is not capable of directly matching an entry in the dictionary set;

step 5, judging whether all sub character strings cut from the residual characters of the preprocessed medicine information character string are first type sub character strings; if any of the cut sub character strings is a second type sub character string, ending the processing; if all the cut sub character strings are first type sub character strings, determining the entry attributes of the entries matched with the sub character strings, and continuing to execute step 6; wherein the entry attributes correspond one to one to the dictionaries to which the entries belong, each dictionary having a preset entry attribute;

step 6, judging whether any of the entries matched with the sub character strings corresponds to a shielding type entry attribute; if such an entry exists, ending the processing; if no such entry exists, continuing with step 7; wherein an entry corresponding to a shielding type entry attribute indicates that the medicine information character string represents information of a plurality of medicines, or that the medicine represented by the medicine information character string is a medicine self-made by a hospital;

step 7, judging whether each entry matched with the substring has an entry corresponding to the target entry attribute; if no entry corresponding to the target entry attribute exists, ending the processing; if the entries corresponding to the target entry attributes exist, combining the entries corresponding to the target entry attributes into entry combination groups, and matching the entry combination groups with the combined entries in the drug combined information dictionary; if the directly matched combined entry exists, assigning the combined code of the directly matched combined entry to the medicine information character string; the target entry attribute is the entry attribute of each sub-entry in the drug combined information dictionary; the drug combined information dictionary comprises a plurality of combined entries, each combined entry is provided with a one-to-one corresponding combined code, each combined entry consists of a plurality of sub-entries, and the sub-entries are entries which represent the general names, commodity names, product names, administration routes, dosage forms, manufacturers or packing materials of drugs in the dictionary;

step 8, outputting the combined code of the medicine information character string.
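The decision flow of steps 4 to 7 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the greedy longest-match segmentation, the dictionary contents, the attribute names, the choice of target attributes and the code "C0001" are all hypothetical, and the cutting of specification strings in step 3 is omitted.

```python
from typing import Optional

# Hypothetical dictionary set: entry -> entry attribute (one attribute per dictionary).
DICTIONARY_SET = {
    "amoxicillin": "generic_name",
    "capsule": "dosage_form",
    "oral": "administration_route",
    "hospital-made": "hospital_made_name",  # a shielding type attribute
}
SHIELDING_ATTRIBUTES = {"combined_generic_name", "hospital_made_name"}
TARGET_ATTRIBUTES = {"generic_name", "dosage_form"}  # assumed for this example

# Hypothetical drug combined information dictionary: sub-entries -> combined code.
COMBINED_DICTIONARY = {
    frozenset({("generic_name", "amoxicillin"), ("dosage_form", "capsule")}): "C0001",
}


def segment(text: str) -> list[tuple[str, Optional[str]]]:
    """Step 4: cut text into (substring, attribute) pairs; attribute None = second type."""
    entries = sorted(DICTIONARY_SET, key=len, reverse=True)  # try longest entries first
    pieces, unmatched, i = [], "", 0
    while i < len(text):
        hit = next((e for e in entries if text.startswith(e, i)), None)
        if hit is None:
            unmatched += text[i]  # collect characters that match no entry
            i += 1
            continue
        if unmatched:
            pieces.append((unmatched, None))
            unmatched = ""
        pieces.append((hit, DICTIONARY_SET[hit]))
        i += len(hit)
    if unmatched:
        pieces.append((unmatched, None))
    return pieces


def encode(drug_string: str) -> Optional[str]:
    """Return the combined code, or None when processing ends without a code."""
    text = drug_string.replace(" ", "").lower()  # stand-in for the step 2 preprocessing
    pieces = segment(text)
    if any(attr is None for _, attr in pieces):  # step 5: a second type substring exists
        return None
    if any(attr in SHIELDING_ATTRIBUTES for _, attr in pieces):  # step 6
        return None
    group = frozenset((attr, sub) for sub, attr in pieces if attr in TARGET_ATTRIBUTES)
    if not group:  # step 7: no entry with a target attribute
        return None
    return COMBINED_DICTIONARY.get(group)  # direct match only, otherwise None
```

In this sketch, `encode("Amoxicillin capsule")` yields the hypothetical code, while a string containing an unrecognized fragment or a shielding type entry yields `None`, mirroring the early exits of steps 5 and 6.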

In a second aspect of embodiments of the present invention, there is provided a system for automatically encoding drug information, comprising:

the dictionary database provides a preset specification dictionary, a packaging specification dictionary, a dictionary set and a medicine combined information dictionary;

the input module is used for inputting a medicine information character string;

the preprocessing module is used for preprocessing the medicine information character string to obtain a preprocessed medicine information character string;

a first segmentation module for segmenting a specification character string and a packaging specification character string from the preprocessed medicine information character string based on the specification dictionary and the packaging specification dictionary; wherein the specification dictionary comprises a plurality of entries representing specification units of the medicine; the packaging specification dictionary comprises a plurality of entries which represent packaging specification units of the medicines; the specification character string represents specification information of the medicine; the packaging specification character string represents packaging specification information of the medicine;

the second segmentation module is used for segmenting the residual characters of the preprocessed medicine information character strings into a plurality of sub character strings based on the dictionary set, wherein the sub character strings are first type sub character strings and/or second type sub character strings; the dictionary set consists of a plurality of dictionaries, wherein the dictionaries comprise a plurality of entries for representing the universal names, commodity names, product names, administration routes, dosage forms, manufacturers and packaging materials of medicines, and a plurality of entries for representing two or more combined universal names and names of medicines self-made by hospitals; the first type of substring is capable of directly matching an entry in the dictionary set, the second type of substring is not capable of directly matching an entry in the dictionary set;

the first judgment processing module is used for judging whether all sub character strings cut from the residual characters of the preprocessed medicine information character string are first type sub character strings; if any of the cut sub character strings is a second type sub character string, ending the processing; if all the cut sub character strings are first type sub character strings, determining the entry attributes of the entries matched with the sub character strings, and triggering a second judgment processing module; wherein the entry attributes correspond one to one to the dictionaries to which the entries belong, each dictionary having a preset entry attribute;

the second judgment processing module is used for judging whether each entry matched with the sub-character string has an entry corresponding to the shielding type entry attribute; if the entry corresponding to the shielding type entry attribute exists, ending the processing; if no entry corresponding to the shielding type entry attribute exists, triggering a third judgment processing module; wherein the entry corresponding to the shielding type entry attribute indicates that the medicine information character string represents information of a plurality of medicines or indicates that the medicine represented by the medicine information character string is a medicine self-made by a hospital;

the third judgment processing module is used for judging whether each entry matched with the sub-character string has an entry corresponding to the target entry attribute; if no entry corresponding to the target entry attribute exists, ending the processing; if the entries corresponding to the target entry attributes exist, combining the entries corresponding to the target entry attributes into entry combination groups, and matching the entry combination groups with the combined entries in the drug combined information dictionary; if the directly matched combined entry exists, assigning the combined code of the directly matched combined entry to the medicine information character string; the target entry attribute is the entry attribute of each sub-entry in the drug combined information dictionary; the drug combined information dictionary comprises a plurality of combined entries, each combined entry is provided with a one-to-one corresponding combined code, each combined entry consists of a plurality of sub-entries, and the sub-entries are entries which represent the general names, commodity names, product names, administration routes, dosage forms, manufacturers or packing materials of drugs in the dictionary;

and the output module is used for outputting the joint codes of the medicine information character strings.

By means of this technical scheme, the invention takes full account of the fact that the medicine information character strings input by medical practitioners are natural language, have complex and varied formats, and follow no unified standard. The medicine information character strings are segmented and matched using various dictionaries established in advance according to general standards in the medical field, so that the medicine information is classified and identified, and is then coded according to the identification result. The invention strictly follows this principle: a sub character string cut from the medicine information character string is accepted as a classification and identification result only when it directly matches an entry in the dictionary set, and automatic coding is performed only when the classification and identification result directly matches a combined entry in the drug combined information dictionary; otherwise, no automatic coding is performed.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 schematically illustrates an application scenario in which embodiments of the present invention may be implemented;

FIG. 2 schematically illustrates an exemplary method of automatically encoding drug information in accordance with the present invention;

FIG. 3 schematically illustrates a process of segmenting substrings according to an exemplary method of the present invention;

FIG. 4 schematically illustrates a process of matching an entry combination group with a combined entry in accordance with an exemplary method of the present invention;

FIG. 5 schematically illustrates a process of finding a target sub-entry of a reference entry in an exemplary method of the invention;

FIG. 6 schematically illustrates a process of finding a combined entry that fuzzily matches a drug information character string in accordance with an exemplary method of the present invention;

FIG. 7 schematically illustrates another process of finding a combined entry that fuzzily matches a drug information character string in accordance with an exemplary method of the present invention;

FIG. 8 schematically illustrates a block diagram of an exemplary automatic drug information encoding system of the present invention;

FIG. 9 schematically illustrates another block diagram of an exemplary system for automatically encoding drug information in accordance with the present invention;

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to the embodiment of the invention, the invention provides an automatic coding method and system of medicine information.

In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.

The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.

Application scene overview

Reference is first made to fig. 1, which illustrates an application scenario in which embodiments of the present invention may be implemented.

The scenario shown in fig. 1 includes a medical informatization platform 100 and a drug information automatic encoding system 200. The medical informatization platform 100 may be software installed on a desktop computer, a notebook computer, a tablet computer, a personal digital assistant, or the like used by a doctor. The drug information automatic encoding system 200 may be, for example, software running on a hospital information server. The medical informatization platform 100 and the drug information automatic encoding system 200 may communicate with each other via, for example, a hospital LAN.

After a medical practitioner (e.g., a doctor) inputs drug information in the medical informatization platform 100, the drug information is transmitted to the drug information automatic encoding system 200, which performs natural language processing and automatic coding on it and finally outputs a coding result.

Exemplary method

The present exemplary method introduces an exemplary method of automatically encoding drug information of the present invention. The exemplary method is used to identify drug information entered by a healthcare practitioner and ultimately output the coded results of the drug.

Before introducing the exemplary method, various dictionaries that the exemplary embodiment needs to call are introduced through tables 1 to 23.

(1) Specification dictionary

The specification dictionary comprises a plurality of entries which represent specification units of the medicines, and in the invention, the specification dictionary is used for cutting out specification character strings from the medicine information character strings, and the specification character strings represent the specification information of the medicines.

An exemplary specification dictionary is as follows:

the specification dictionary includes a standard specification table and a specification synonym table.

The standard specification table includes a number of standard loading specification units and standard ingredient specification units.

The standard loading specification unit represents the weight or filling amount of the minimum preparation unit of the medicine, for example the weight of one tablet or the number of milliliters of liquid contained in one bottle of injection.

The standard component specification unit represents the dosage or titer of the effective component contained in the minimum preparation unit of the medicine.

The standard loading specification units and the standard component specification units are both taken from the [specification] information published by the China Food and Drug Administration (CFDA) for various drugs.

Table 1 shows a part of standard loading specification units and standard component specification units included in the standard specification table.

TABLE 1

Standard loading specification unit	Standard component specification unit
Gram	Gram
Milligram	Milligram
Milliliter	Microgram

The specification synonym table comprises a plurality of loading specification unit synonyms and component specification unit synonyms.

Synonyms of the loading specification units are aliases, common names, English abbreviations, wrongly written characters and the like of the standard loading specification units.

The synonym of the component specification unit is the alias, common name, English abbreviation, wrongly written character and the like of the standard component specification unit.

The specification synonym table records the correspondence between each loading specification unit synonym and its standard loading specification unit, and the correspondence between each component specification unit synonym and its standard component specification unit.

Table 2 shows some of the loading specification unit synonyms and component specification unit synonyms included in the specification synonym table, together with the corresponding standard loading specification units and standard component specification units.

TABLE 2

It should be noted that, in implementing the present invention, a specification dictionary containing other types of entries may be adopted according to the actual situation in order to cut out the specification character strings. The present invention does not specifically limit the types or sources of the entries contained in the specification dictionary; the above description is only a specific embodiment and is not intended to limit the protection scope of the present invention. A specification dictionary containing entries of other types or sources falls within the protection scope of the present invention, provided it is within the spirit and principle of the present invention.
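As an illustration of how a specification dictionary with a synonym table might be used to cut out a specification character string, consider the minimal sketch below. The synonyms ("g", "mg", "ml", "ug"), the number-plus-unit pattern and the function name `cut_specification` are assumptions made for this example; the actual entries of Table 2 are not reproduced here.

```python
import re

# Standard units as listed in Table 1.
STANDARD_UNITS = ("gram", "milligram", "milliliter", "microgram")
# Hypothetical specification synonym table: synonym -> standard unit.
SPEC_SYNONYMS = {"g": "gram", "mg": "milligram", "ml": "milliliter", "ug": "microgram"}

# Longest units first, so that "milligram" is tried before "gram" or "g".
_UNITS = sorted(list(STANDARD_UNITS) + list(SPEC_SYNONYMS), key=len, reverse=True)
_SPEC_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(" + "|".join(re.escape(u) for u in _UNITS) + r")\b",
    re.IGNORECASE,
)


def cut_specification(text):
    """Return (normalized specification string, remaining text), or (None, text)."""
    match = _SPEC_PATTERN.search(text)
    if match is None:
        return None, text
    unit = match.group(2).lower()
    standard_unit = SPEC_SYNONYMS.get(unit, unit)  # map a synonym to its standard unit
    spec = f"{match.group(1)} {standard_unit}"
    rest = " ".join((text[:match.start()] + " " + text[match.end():]).split())
    return spec, rest
```

For example, `cut_specification("amoxicillin capsule 0.25g")` separates the specification from the residual characters, which would then be passed on for further segmentation.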

(2) Packaging specification dictionary

The packing specification dictionary comprises a plurality of entries which represent packing specification units of the medicines, and the packing specification dictionary is used for segmenting packing specification character strings from medicine information character strings, and the packing specification character strings represent packing specification information of the medicines.

An exemplary packaging specification dictionary is as follows:

the package specification dictionary includes a standard package specification table and a package specification synonym table.

The standard package size table includes a number of standard formulation minimum units and standard package size units.

The standard preparation minimum unit means the minimum preparation unit of the medicine, such as tablet and granule.

The standard packing specification unit represents the minimum packing unit of the medicine, such as a box and a bottle.

The standard packaging specification units are taken from the [packaging specification] information published by the China Food and Drug Administration (CFDA) for various medicines, and from the packaging specification information on the official websites of pharmaceutical manufacturers and in drug instructions.

Table 3 shows a part of the standard formulation minimum units and standard package size units included in the standard package size table.

TABLE 3

Standard formulation minimum unit	Standard packaging specification unit
Tablet	Box
Granule	Bottle
Branch stand	Bag

The package specification synonym table includes a number of formulation minimum unit synonyms and package specification unit synonyms.

Synonyms of minimum units of the preparation are alias, common name, English abbreviation or wrongly written characters of the minimum units of the standard preparation.

The synonym of the packing specification unit is the alias, common name, English abbreviation or wrongly written or mispronounced character of the standard packing specification unit.

The package specification synonym table details the correspondence between the formulation minimum unit synonym and the standard formulation minimum unit, and the correspondence between the package specification unit synonym and the standard package specification unit.

Table 4 shows the part of formulation minimum unit synonyms and packaging specification unit synonyms included in the packaging specification synonym table, and the corresponding standard formulation minimum units and standard packaging specification units.

TABLE 4

In the invention, the packaging specification dictionary is used for cutting out the packaging specification character strings. It should be noted that, in implementing the present invention, a packaging specification dictionary containing other types of entries may be used according to the actual situation for this purpose. The present invention does not specifically limit the types or sources of the entries contained in the packaging specification dictionary; the above description is only a specific example and is not intended to limit the protection scope of the present invention. A packaging specification dictionary containing entries of other types or sources falls within the protection scope of the present invention, provided it is within the spirit and principle of the present invention.
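A packaging specification string could be cut out in a similar way by combining the two unit tables. The sketch below assumes, purely for illustration, that a packaging specification is written as "count, minimum unit, slash, packaging unit"; the synonyms ("tablets", "tab", "btl") and the function name are hypothetical and are not taken from Table 4.

```python
import re

# Synonym or standard name -> standard formulation minimum unit (synonyms are hypothetical).
MIN_UNITS = {"tablet": "tablet", "granule": "granule", "tablets": "tablet", "tab": "tablet"}
# Synonym or standard name -> standard packaging specification unit.
PACK_UNITS = {"box": "box", "bottle": "bottle", "bag": "bag", "btl": "bottle"}


def _alternation(table):
    """Build a regex alternation with the longest keys first."""
    return "|".join(re.escape(key) for key in sorted(table, key=len, reverse=True))


_PACK_PATTERN = re.compile(
    rf"(\d+)\s*({_alternation(MIN_UNITS)})\s*/\s*({_alternation(PACK_UNITS)})\b",
    re.IGNORECASE,
)


def cut_package_specification(text):
    """Return (normalized packaging specification, remaining text), or (None, text)."""
    match = _PACK_PATTERN.search(text)
    if match is None:
        return None, text
    count = match.group(1)
    min_unit = MIN_UNITS[match.group(2).lower()]
    pack_unit = PACK_UNITS[match.group(3).lower()]
    rest = " ".join((text[:match.start()] + " " + text[match.end():]).split())
    return f"{count} {min_unit}/{pack_unit}", rest
```

Note that a unit word that is not preceded by a count, such as the dosage form in "ibuprofen tablet", is left in the residual characters under this assumed pattern.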

(3) Universal name dictionary

The universal name dictionary includes a plurality of entries representing universal names of medicines. In the present invention, the universal name dictionary is used for cutting out entries whose entry attribute is "universal name".

An exemplary universal name dictionary is as follows:

the universal name dictionary comprises a standard universal name table and a universal name synonym table.

The standard universal name table comprises a plurality of standard universal names, which are China Approved Drug Names (CADN) formulated on the basis of the International Nonproprietary Names for drugs in combination with specific circumstances.

Table 5 shows a part of the standard common names included in the standard common name table.

TABLE 5

Standard generic name
Anisodamine
Adenosine triphosphate
Sodium hyaluronate
Mebromobenzidine

The common name synonym table includes a number of common name synonyms, which are aliases, colloquial names, English abbreviations, or wrongly written forms of standard common names.

The common name synonym table details the correspondence between each common name synonym and a standard common name.

Table 6 shows partial common name synonyms, standard common names, and synonymy relationships between the two included in the common name synonym table.

TABLE 6

It should be noted that, in implementing the present invention, a universal name dictionary containing other types of entries may be adopted according to the actual situation in order to cut out entries whose entry attribute is "universal name". The present invention does not specifically limit the types or sources of the entries contained in the universal name dictionary; the above description is only a specific example and is not intended to limit the scope of the present invention. A universal name dictionary containing entries of other types or sources falls within the scope of the present invention, provided it is within the spirit and principle of the present invention.
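The role of the synonym table can be illustrated with a small lookup that maps a cut sub character string to its standard name and attaches the entry attribute of the dictionary. The sketch below is hypothetical: the standard names come from Table 5, while the two synonyms and the function name `match_generic_name` are assumptions chosen for the example.

```python
from typing import Optional

# Standard names taken from Table 5 (lower-cased for matching).
STANDARD_GENERIC_NAMES = {"anisodamine", "adenosine triphosphate", "sodium hyaluronate"}
# Hypothetical synonym table: synonym -> standard name.
GENERIC_NAME_SYNONYMS = {"atp": "adenosine triphosphate", "654-2": "anisodamine"}


def match_generic_name(substring: str) -> Optional[dict]:
    """Directly match a substring against the dictionary; return None if there is no entry."""
    key = " ".join(substring.lower().split())  # normalize case and whitespace
    if key in STANDARD_GENERIC_NAMES:
        standard = key
    else:
        standard = GENERIC_NAME_SYNONYMS.get(key)
    if standard is None:
        return None
    # Every entry of this dictionary carries the same preset entry attribute.
    return {"attribute": "universal name", "matched": key, "standard": standard}
```

A substring that matches neither table is reported as unmatched, which corresponds to a second type sub character string in the method described above.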

(4) Commodity name dictionary

The commodity name dictionary comprises a plurality of entries for representing commodity names of medicines, and in the invention, the commodity name dictionary is used for segmenting the entries with the attribute of 'commodity name'.

An exemplary commodity name dictionary is as follows:

the commodity name dictionary comprises a standard commodity name table and a commodity name synonym table.

The standard commodity name table includes a number of standard commodity names, which are taken from the [commodity name] information published by the CFDA for drugs, and from the commodity name information on the official websites of manufacturers and in drug instructions.

The standard commodity name table records the correspondence between each standard commodity name and its standard common name.

Table 7 shows some of the standard commodity names and standard common names included in the standard commodity name table, and the correspondence between the two.

TABLE 7

Standard commodity name	Standard generic name
Zuoke	Levofloxacin hydrochloride
Yundesu	Recombinant human interferon alpha 1b
Memory cake	Simvastatin

The commodity name synonym table comprises a plurality of commodity name synonyms which are alias names, common names, English abbreviations or wrongly-written characters of standard commodity names.

The commodity name synonym table records the correspondence between each commodity name synonym, its standard commodity name, and its standard common name.

Table 8 shows a part of the commodity name synonyms, the standard commodity names, the standard common names, and the corresponding relationship among the three, which are included in the commodity name synonym table.

TABLE 8

It should be noted that, in implementing the present invention, a commodity name dictionary containing other types of entries may be used according to the actual situation in order to cut out entries whose entry attribute is "commodity name". The present invention does not specifically limit the types or sources of the entries contained in the commodity name dictionary; the above description is only a specific example and is not intended to limit the protection scope of the present invention. A commodity name dictionary containing entries of other types or sources falls within the protection scope of the present invention, provided it is within the spirit and principle of the present invention.
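Because a commodity name synonym corresponds to both a standard commodity name and a standard common name, resolving one is a two-level lookup. The following sketch uses the first two rows of Table 7; the synonyms ("zuo ke", "yun de su"), the `CommodityMatch` type and the function name are hypothetical and serve only to illustrate the idea.

```python
from typing import NamedTuple, Optional


class CommodityMatch(NamedTuple):
    standard_commodity_name: str
    standard_generic_name: str


# Standard commodity name -> standard common name (two rows from Table 7).
STANDARD_COMMODITY_NAMES = {
    "Zuoke": "Levofloxacin hydrochloride",
    "Yundesu": "Recombinant human interferon alpha 1b",
}
# Hypothetical commodity name synonym table: synonym -> standard commodity name.
COMMODITY_SYNONYMS = {"zuo ke": "Zuoke", "yun de su": "Yundesu"}

# Case-insensitive index over the standard names.
_STANDARD_INDEX = {name.lower(): name for name in STANDARD_COMMODITY_NAMES}


def resolve_commodity_name(substring: str) -> Optional[CommodityMatch]:
    """Resolve a substring to (standard commodity name, standard common name)."""
    key = substring.strip().lower()
    standard = _STANDARD_INDEX.get(key) or COMMODITY_SYNONYMS.get(key)
    if standard is None:
        return None  # no direct match in either table
    return CommodityMatch(standard, STANDARD_COMMODITY_NAMES[standard])
```

Keeping the synonym table separate from the standard table means a new alias can be added without touching the mapping to common names.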

(5) Product name dictionary

The product name dictionary includes a plurality of entries representing product names of medicines. In the present invention, the product name dictionary is used for cutting out entries whose entry attribute is "product name".

An exemplary product name dictionary is as follows:

The product name dictionary includes a standard product name table and a product name synonym table.

The standard product name table includes several standard product names taken from the [product name] information published by the CFDA for various drugs.

The standard product name table records in detail the correspondence between each standard product name and its standard common name.

Table 9 shows a part of the standard product names, the standard common names, and the correspondence between the two included in the standard product name table.

TABLE 9

Standard product name	Standard common name
Albendazole tablets	Albendazole
Albendazole chewable tablet	Albendazole
Amoxicillin capsule	Amoxicillin
Amoxicillin granules	Amoxicillin
Ibuprofen suspension	Ibuprofen
Ibuprofen sustained release suspension	Ibuprofen
Ibuprofen tablet	Ibuprofen

The product name synonym table comprises a plurality of product name synonyms which are aliases, common names, English abbreviations or wrongly-written characters and the like of standard product names.

The product name synonym table records in detail the correspondence between each product name synonym, the standard product name and the standard common name.
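The two-table lookup described above can be sketched as follows. This is only an illustration: the standard names are taken from Table 9, while the synonym entry, the dictionary layout and the function name `resolve_product_name` are hypothetical and not part of the original disclosure.

```python
# Standard product name table: standard product name -> standard common name
# (rows taken from Table 9).
STANDARD_PRODUCT_NAMES = {
    "Albendazole tablets": "Albendazole",
    "Amoxicillin capsule": "Amoxicillin",
    "Ibuprofen tablet": "Ibuprofen",
}

# Product name synonym table: synonym -> standard product name.
# The entry below is a made-up example, not data from the patent.
PRODUCT_NAME_SYNONYMS = {
    "Amoxicillin caps": "Amoxicillin capsule",
}


def resolve_product_name(term):
    """Return (standard product name, standard common name), or None if unknown."""
    # A synonym is first mapped to its standard product name.
    standard = PRODUCT_NAME_SYNONYMS.get(term, term)
    common = STANDARD_PRODUCT_NAMES.get(standard)
    if common is None:
        return None
    return standard, common
```

A synonym and its standard product name resolve to the same pair, so later matching steps only need to work with standard names.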

Table 10 shows a part of the product name synonyms, the standard product names, the standard common names, and the corresponding relationship among the three included in the product name synonym table.

TABLE 10

It should be noted that, in implementing the present invention, a product name dictionary containing other types of entries may be adopted according to actual situations to achieve the purpose of separating entries with the entry attribute of "product name", and the present invention does not specifically limit the types or sources of the entries contained in the product name dictionary, i.e., the above description is only a specific example of the present invention, and is not intended to limit the scope of the present invention, and a product name dictionary containing entries of other types or sources is included in the scope of the present invention within the spirit and principle of the present invention.

(6) Administration route dictionary

The administration route dictionary includes a plurality of entries representing administration routes of the medicines; in the present invention, the administration route dictionary is used for segmenting entries whose entry attribute is "administration route".

An exemplary administration route dictionary is as follows:

The administration route dictionary comprises a standard administration route terminology table and an administration route synonym table.

The standard administration route terminology table includes several standard administration route terms, which are established based on the Anatomical Therapeutic Chemical (ATC) classification system in combination with actual drug use.

Table 11 shows a part of the standard administration route terms included in the standard administration route terminology table.

TABLE 11

Standard administration route term
Oral administration
Buccal administration
Mucosal administration
Sublingual administration
Administration by injection
Intramuscular injection
Subcutaneous injection
Local infiltration
Topical administration
Administration to urethra
Administration by inhalation
Dental administration
Ophthalmic administration

The administration route synonym table includes several administration route synonyms, which are aliases, common names, English abbreviations, wrongly-written characters and the like of standard administration route terms.

The administration route synonym table records in detail the correspondence between the administration route synonyms and the standard administration route terms.

Table 12 shows a part of the administration route synonyms included in the administration route synonym table, the standard administration route terms, and the synonym relationship between them.

TABLE 12

It should be noted that, in implementing the present invention, an administration route dictionary including other types of terms may be used according to actual situations to achieve the purpose of separating terms with "administration route" as the term attribute, and the present invention does not specifically limit the types or sources of the terms included in the administration route dictionary, i.e., the above description is only a specific example of the present invention, and is not intended to limit the scope of the present invention, and an administration route dictionary including terms of other types or sources is included in the scope of the present invention within the spirit and principle of the present invention.

(7) Formulation dictionary

The formulation dictionary includes a plurality of entries representing dosage forms of medicines; in the present invention, the formulation dictionary is used for segmenting entries whose entry attribute is "dosage form".

An exemplary formulation dictionary is as follows:

The formulation dictionary comprises a standard dosage form terminology table and a dosage form synonym table.

The standard dosage form terminology table includes several standard dosage form terms.

Standard dosage form terms come from two sources. First, drug dosage forms obtained by standardizing the drug dosage forms registered with the CFDA according to the rules and definitions of the "General Rules for Preparations" in the Chinese Pharmacopoeia (2010 edition). Second, for medical insurance dosage forms in the national medical insurance catalogue whose registration information cannot be found in the CFDA data, standard medical insurance dosage forms determined according to the dosage forms listed in the national medical insurance catalogue.

Table 13 shows a part of the standard dosage form terms included in the standard dosage form terminology table.

TABLE 13

Standard dosage form term
Tablet
Powder
Granules
Spray
Ointment
Suppository
Oral sustained release dosage form
Gargle

The dosage form synonym table includes a number of dosage form synonyms.

Dosage form synonyms are aliases, common names, English abbreviations, wrongly-written characters or subtypes of standard dosage form terms.

The dosage form synonym table records in detail the correspondence between dosage form synonyms and standard dosage form terms.

Table 14 shows a part of the dosage form synonyms, the standard dosage form terms, and the correspondence between the two included in the dosage form synonym table.

TABLE 14

It should be noted that, in the implementation of the present invention, a dosage form dictionary containing other types of entries may be used according to actual situations to achieve the purpose of separating the entry with the entry attribute of "dosage form", and the present invention does not specifically limit the types or sources of the entries contained in the dosage form dictionary, i.e., the above description is only a specific example of the present invention and is not intended to limit the protection scope of the present invention, and the dosage form dictionary containing other types or sources of entries should be included in the protection scope of the present invention within the spirit and principle of the present invention.

(8) Manufacturer dictionary

The manufacturer dictionary includes a plurality of entries indicating manufacturers of the pharmaceutical products, and in the present invention, the manufacturer dictionary is used for segmenting entries having an attribute of "manufacturer".

An exemplary manufacturer dictionary is as follows:

The manufacturer dictionary comprises a standard manufacturer table and a manufacturer synonym table.

The standard manufacturer table includes a plurality of standard manufacturer names, which come from the drug manufacturer information published by the CFDA (Chinese names) or from manufacturer information (English names).

Table 15 shows part of the standard manufacturer names included in the standard manufacturer table.

TABLE 15

Standard manufacturer name
Shanghai Changzhe Fumin pharmaceutical Tongling Co Ltd
NANJING HENCER PHARMACY Co.,Ltd.
Heilongjiang Haxing pharmaceutical industry group Co., Ltd
GUANGZHOU JIULIANSHAN PHARMACEUTICAL Co.,Ltd.
Sichuan kang special medicine industry
Dr. Reddy's Laboratories Ltd.

The manufacturer synonym table includes a plurality of manufacturer name synonyms.

A manufacturer name synonym is an abbreviation, an English name or the like of a standard manufacturer name.

The manufacturer synonym table records in detail the correspondence between each manufacturer name synonym and the standard manufacturer name.

Table 16 shows part of the manufacturer name synonyms, standard manufacturer names, and their corresponding relationships included in the manufacturer synonym table.

TABLE 16

It should be noted that, in implementing the present invention, a manufacturer dictionary containing other types of entries may be used according to actual situations to achieve the purpose of separating the entries with the attribute of "manufacturer", and the present invention does not specifically limit the types or sources of the entries contained in the manufacturer dictionary, i.e., the above description is only a specific example of the present invention, and is not intended to limit the protection scope of the present invention, and a manufacturer dictionary containing other types or sources of entries should be included in the protection scope of the present invention within the spirit and principle of the present invention.

(9) Packing material dictionary

The packing material dictionary comprises a plurality of entries representing the packing materials of the medicines, and is used for segmenting entries whose entry attribute is "packing material".

An exemplary packing material dictionary is as follows:

The packing material dictionary comprises a standard packing material table and a packing material synonym table.

The standard packing material table comprises a plurality of standard packing material names which are from the medicine packing materials published by CFDA or the information related to the packing materials in the medicine specification.

Table 17 shows part of the standard packing names included in the standard packing table.

TABLE 17

Standard packing material name
	non-PVC soft bag
Glass bottle
	Plastic bottle

The packing material synonym table comprises a plurality of packing material name synonyms.

A packing material name synonym is an alias, common name, English abbreviation or the like of a standard packing material name.

The packing material synonym table records in detail the correspondence between each packing material name synonym and the standard packing material name.

Table 18 shows a part of the packing material name synonyms, the standard packing material names, and the synonym relationship between the two included in the packing material synonym table.

TABLE 18

Packing material name synonym	Standard packing material name
Glass bottle	Glass bottle
Plastic bottle	Plastic bottle

It should be noted that, in implementing the present invention, a packing material dictionary containing other types of entries may be used according to actual situations to achieve the purpose of separating entries whose entry attribute is "packing material", and the present invention does not specifically limit the types or sources of the entries contained in the packing material dictionary, i.e., the above description is only a specific example of the present invention, and is not intended to limit the scope of the present invention.

(10) Common name large name dictionary

The common name large name dictionary comprises a plurality of entries, each formed by combining the standard common names of two or more medicines.

An exemplary common name large name dictionary is as follows:

The common name large name dictionary includes entries of the following type: common name large name terms. These terms come from the drug names in the 2009 edition of the national catalogue of drugs for basic medical insurance, work-related injury insurance and maternity insurance, and from common name large names collected from customer data.

The common name large name dictionary records in detail the correspondence between each common name large name term and the standard common names that constitute it.

Table 19 shows the correspondence between a part of the common name large name terms included in the common name large name dictionary and the standard common names.

TABLE 19

It should be noted that, in implementing the present invention, a common name large name dictionary including other types of entries may be used according to actual situations to achieve the purpose of separating entries whose entry attribute is "common name large name", and the present invention does not specifically limit the types or sources of the entries included in the common name large name dictionary, i.e., the above description is only a specific example of the present invention and is not intended to limit the scope of the present invention.

(11) Nosocomial preparation dictionary

The nosocomial preparation dictionary includes a plurality of entries representing the names of medicines made (developed) by hospitals themselves; in the present invention, the nosocomial preparation dictionary is used for segmenting entries whose entry attribute is "nosocomial preparation".

An exemplary nosocomial preparation dictionary is as follows:

The nosocomial preparation dictionary includes entries of the following type: nosocomial preparation names. The nosocomial preparation names come from the nosocomial preparation varieties approved by provincial and municipal food and drug administrations and from nosocomial preparation names collected from customer data. The nosocomial preparation dictionary also records in detail the correspondence between each nosocomial preparation name and its development unit.

Table 20 shows the correspondence between the names of some of the nosocomial preparations included in the nosocomial preparation dictionary and the units of development thereof.

TABLE 20

Nosocomial preparation name	Development unit
Burn treating liquid	Yiyang Central Hospital
Burner splendid achnatherum oil for burn	The First People's Hospital of Changde City
Tinea tincture	The First People's Hospital of Changde City
Zinc boron paste	Zhangjiajie People's Hospital
Eczema ointment	Sichuan province hospital skin research institute preparation room
White lotion	Sichuan province hospital skin research institute preparation room
Simiao powder	The eighth Hospital of Changsha City
Xiaozhaoling pill for eliminating obstruction	Cili county traditional Chinese medicine hospital
Intestines and stomach pill	Jiahe county traditional Chinese medicine hospital
Gynecological V-size powder	Chinese medicine-saving auxiliary two

It should be noted that, in implementing the present invention, a nosocomial preparation dictionary containing other types of entries may be used according to actual situations to achieve the purpose of separating entries whose entry attribute is "nosocomial preparation", and the present invention does not specifically limit the types or sources of the entries contained in the nosocomial preparation dictionary, i.e., the above description is only a specific example of the present invention and is not intended to limit the scope of the present invention, and a nosocomial preparation dictionary containing entries of other types or sources shall be included in the scope of the present invention within the spirit and principle of the present invention.

(12) Medicine joint information dictionary

The medicine joint information dictionary comprises a plurality of joint entries; each joint entry comprises a plurality of sub-entries and has a joint code in one-to-one correspondence with it. In the present invention, the medicine joint information dictionary is used for matching a joint entry to a medicine information character string and encoding that string.

A sub-entry in the medicine joint information dictionary may be: a common name sub-entry, a dosage form sub-entry, a specification sub-entry, a packing material name sub-entry, a manufacturer name sub-entry, an administration route sub-entry, a commodity name sub-entry, or a product name sub-entry; each sub-entry has an entry attribute in one-to-one correspondence with it. Table 21 shows the source and entry attribute of each sub-entry.

TABLE 21

In the medicine joint information dictionary, the joint code is formed by combining the codes of the sub-entries that make up the joint entry. Each type of sub-entry has its own encoding rule, and each sub-entry is encoded according to the corresponding rule. The encoding rules and encoding examples of some sub-entries are as follows:

(1) Common name sub-entry

Encoding rule: a 6-character code consisting of 1 capital letter followed by 5 Arabic digits. Independent coding scheme.

Encoding example: X12345

(2) Dosage form

Encoding rule: a 5-digit sequential code. Independent coding scheme.

Encoding example: 10041

(3) Specification

Encoding example: 10001

(4) Packaging specification

Encoding rule: a 2-digit sequential code, assigned in sequence on the basis of the common name, dosage form and specification. Non-independent coding scheme.

Encoding example: 01

(5) Packing material

Encoding rule: a 2-digit sequential code. Independent coding scheme.

Encoding example: 01

(6) Manufacturer

Encoding rule: a 9-character combination code. The first 3 characters are the Chinese area code or, for a manufacturer outside China, the English abbreviation of the country where the manufacturer is located (for example, USA); if the English abbreviation has fewer than 3 characters, it is padded with 0. Independent coding scheme.

Encoding example: USA123456
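The encoding rules above can be illustrated with a small format check and a joint code builder. This is a minimal sketch under several assumptions that the text does not state: the joint code is taken to be a plain concatenation in a caller-supplied order, the specification code is assumed to be 5 digits (inferred only from the example 10001), the last 6 characters of the manufacturer code are assumed to be digits, and padding with 0 is assumed to go on the right. All names are hypothetical.

```python
import re

# One pattern per sub-entry type; patterns marked "assumed" are inferred
# from the examples only, not from a stated rule.
SUB_ENTRY_PATTERNS = {
    "common name": r"[A-Z]\d{5}",
    "dosage form": r"\d{5}",
    "specification": r"\d{5}",              # assumed
    "packaging specification": r"\d{2}",
    "packing material": r"\d{2}",
    "manufacturer": r"[A-Z0-9]{3}\d{6}",    # assumed
}


def country_prefix(abbreviation):
    """Pad a country abbreviation to 3 characters with '0' (right padding assumed)."""
    return abbreviation.upper().ljust(3, "0")


def build_joint_code(codes, order):
    """Validate each sub-entry code and concatenate them in the given order."""
    parts = []
    for attribute in order:
        code = codes[attribute]
        if not re.fullmatch(SUB_ENTRY_PATTERNS[attribute], code):
            raise ValueError(f"invalid {attribute} code: {code!r}")
        parts.append(code)
    return "".join(parts)
```

For instance, `build_joint_code` could be called with the five sub-entry types of a full-version joint entry; which sub-entry types appear, and in what order, is a choice for the implementer.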

It should be noted that, in implementing the present invention, various sub-entries may be encoded by using an appropriate encoding rule according to actual situations, and the encoding rule of the sub-entries is not specifically limited in the present invention, that is, the above description is only a specific example of the present invention and is not intended to limit the protection scope of the present invention, and any form of encoding rule used for encoding the sub-entries within the spirit and principle of the present invention should be included in the protection scope of the present invention.

The following is an exemplary medicine joint information dictionary, which will be referred to as the "full-version medicine joint information dictionary" hereinafter:

In the full-version medicine joint information dictionary, each joint entry comprises the following sub-entries: a common name sub-entry, a dosage form sub-entry, a specification sub-entry, a packing material name sub-entry and a manufacturer name sub-entry.

Table 22 shows a part of the joint entries included in the full-version medicine joint information dictionary.

TABLE 22

The following is another exemplary medicine joint information dictionary, which will be referred to as the "simplified medicine joint information dictionary" hereinafter:

In the simplified medicine joint information dictionary, each joint entry comprises the following sub-entries: a common name sub-entry and an administration route sub-entry.

Table 23 shows a part of the joint entries included in the simplified medicine joint information dictionary.

TABLE 23

It should be noted that, in implementing the present invention, a medicine joint information dictionary containing other types of sub-entries may be adopted according to actual situations to achieve the purpose of matching a joint entry to the medicine information character string and encoding it. The present invention does not specifically limit the types or sources of the sub-entries contained in the medicine joint information dictionary; that is, the above description is only a specific example of the present invention and is not intended to limit the protection scope of the present invention, and any medicine joint information dictionary containing sub-entries of other types or sources shall be included in the protection scope of the present invention within the spirit and principle of the present invention.

In the following, in conjunction with the application scenario of fig. 1, a method for automatically encoding medicine information according to an exemplary embodiment of the present invention is described with reference to tables 1 to 23 and fig. 2. It should be noted that the application scenario of fig. 1 is only illustrated for the convenience of understanding the spirit and principle of the present invention, and the embodiments of the present invention are not limited in any way in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.

Referring to fig. 2, a method for automatically encoding medicine information according to a first exemplary method of the present invention includes:

Step S11, inputting a medicine information character string.

Step S12, preprocessing the medicine information character string to obtain a preprocessed medicine information character string.

The purpose of this step is to convert the characters in the drug information string into a uniform coding format for subsequent processing.

Optionally, this step may be implemented as follows: format normalization is performed on the non-Chinese characters in the medicine information character string (for example, all symbols are converted into half-width format or all into full-width format, and all English letters are converted into uppercase or all into lowercase); and extraneous characters in the medicine information character string are deleted according to a pre-established extraneous character dictionary, for example characters such as □ Δ o ◢ ■ a ● ·.
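The preprocessing of step S12 could look roughly like the sketch below. It assumes the half-width and uppercase options are chosen, and it uses a small hypothetical extraneous character set (only the unambiguous symbols from the example above); the function names are illustrative.

```python
# Hypothetical extraneous character dictionary (symbols only).
EXTRANEOUS_CHARACTERS = set("□Δ◢■●·")


def to_half_width(text):
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                  # ideographic (full-width) space
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:      # full-width forms of '!' .. '~'
            converted.append(chr(code - 0xFEE0))
        else:
            converted.append(ch)
    return "".join(converted)


def preprocess(raw):
    """Normalize width, delete extraneous characters, then uppercase letters."""
    text = to_half_width(raw)
    text = "".join(ch for ch in text if ch not in EXTRANEOUS_CHARACTERS)
    return text.upper()
```

Chinese characters pass through unchanged, since they are outside the converted range and have no uppercase form.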

Step S13, cutting out a specification character string and a packaging specification character string from the preprocessed medicine information character string based on the specification dictionary and the packaging specification dictionary.

Specifically, this step may be performed as follows:

Step S131, judging whether a number exists in the preprocessed medicine information character string; if a number exists, executing step S132; if no number exists, going directly to step S14.

Step S132, matching the character string immediately following the number against the entries in the specification dictionary and the packaging specification dictionary. If the successfully matched entry comes from the specification dictionary, the number together with the immediately following character string that matches the entry is cut out as the specification character string; if the successfully matched entry comes from the packaging specification dictionary, the number together with the immediately following character string that matches the entry is cut out as the packaging specification character string.

For example, suppose the preprocessed medicine information character string is "foscarnet cream | fuxianling 0.15 g". It is first determined that the number "0.15" is present; the character "g" that follows it is then matched against the specification dictionary and the packaging specification dictionary and is found to match the loading specification unit synonym "g" in the specification synonym table, so "0.15 g" is cut out of the preprocessed medicine information character string as the specification character string.
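Steps S131 and S132 can be sketched as follows. The unit lists stand in for the specification dictionary and the packaging specification dictionary and are hypothetical, as are the function name and the choice to try longer units first; the real dictionaries are defined in the tables of the description.

```python
import re

# Hypothetical unit entries: unit -> which dictionary it belongs to.
UNITS = {
    "g": "specification",
    "mg": "specification",
    "ml": "specification",
    "tablets/box": "packaging specification",
    "capsules/box": "packaging specification",
}
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def cut_specifications(text):
    """Return (specification strings, packaging specification strings, remainder)."""
    specs, packs, rest = [], [], []
    pos = 0
    for match in NUMBER.finditer(text):          # S131: is there a number?
        tail = text[match.end():]
        stripped = tail.lstrip()
        gap = len(tail) - len(stripped)
        hit = None
        # S132: match the text right after the number, longest unit first.
        for unit in sorted(UNITS, key=len, reverse=True):
            if stripped.lower().startswith(unit):
                hit = unit
                break
        if hit is None:
            continue
        end = match.end() + gap + len(hit)
        rest.append(text[pos:match.start()])
        target = specs if UNITS[hit] == "specification" else packs
        target.append(text[match.start():end])
        pos = end
    rest.append(text[pos:])
    return specs, packs, "".join(rest).strip()
```

The remainder returned here is what step S14 would then segment. A prefix test such as this one can misfire on words that merely begin with a unit, so a production version would need a stricter boundary check.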

Step S14, based on the dictionary set, dividing the remaining characters of the preprocessed medicine information character string into a plurality of substrings, each of which is a first-type substring or a second-type substring.

The dictionary set comprises a plurality of entries representing the common names, commodity names, product names, administration routes, dosage forms, manufacturers, packing materials, common name large names and nosocomial preparation names of the medicines. The dictionary set is composed of a universal name dictionary, a commodity name dictionary, a product name dictionary, an administration route dictionary, a formulation dictionary, a manufacturer dictionary, a packing material dictionary, a common name large name dictionary and a nosocomial preparation dictionary.

The sub-strings that are cut out of the remaining characters of the pre-processed drug information string have independent semantics, i.e., the represented information is not affected by the characters before or after it. The first type of substring can be directly matched with an entry in the dictionary set and the second type of substring cannot be directly matched with an entry in the dictionary set.

Since a first-type substring can be directly matched with an entry in the dictionary set, it may be any of the following: a standard common name, a common name synonym, a standard commodity name, a commodity name synonym, a standard product name, a product name synonym, a standard administration route term, an administration route synonym, a standard dosage form term, a dosage form synonym, a standard manufacturer name, a manufacturer name synonym, a standard packing material name, a packing material name synonym, a common name large name term, or a nosocomial preparation name.

The purpose of this step is to segment the medicine information into substrings with independent semantics, which avoids the recognition errors that arise when several mutually associated characters are recognized separately.

A specific implementation of step S14 will be described in detail in the following.

Step S15, judging whether all the substrings cut out in step S14 are first-type substrings; if so, determining the entry attribute of the entry matched with each substring and executing step S16; if any second-type substring exists, ending the processing.

This step follows the principle of: the subsequent encoding step can be continued only if the cut substrings directly match the entries in the dictionary set, otherwise no encoding is performed.

Specifically, if the second-type substring is segmented in this step, subsequent joint entry matching is not required, and only when all the segmented substrings are the first-type substrings that can be directly matched with the entry, subsequent joint entry matching is performed.
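The interplay of steps S14 and S15 can be illustrated with a simple segmenter. The sketch below uses forward maximum matching, which is a common dictionary segmentation technique chosen here purely for illustration; the description's own implementation of step S14 is given elsewhere and may differ. The dictionary contents and function names are hypothetical, and the sample strings contain no spaces to mimic undelimited Chinese text.

```python
# Hypothetical dictionary set: entry -> entry attribute.
DICTIONARY_SET = {
    "amoxicillin": "common name",
    "capsule": "dosage form",
    "oraladministration": "administration route",
}


def segment(text, dictionary):
    """Forward maximum matching; returns a list of (substring, attribute or None)."""
    max_len = max(len(entry) for entry in dictionary)
    pieces, unmatched, i = [], "", 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                if unmatched:                      # flush a second-type substring
                    pieces.append((unmatched, None))
                    unmatched = ""
                pieces.append((candidate, dictionary[candidate]))
                i += size
                break
        else:
            unmatched += text[i]                   # no entry starts here
            i += 1
    if unmatched:
        pieces.append((unmatched, None))
    return pieces


def all_first_type(pieces):
    """S15: continue only if every substring matched a dictionary entry."""
    return all(attribute is not None for _, attribute in pieces)
```

Substrings with attribute `None` play the role of second-type substrings; as soon as one exists, `all_first_type` is false and no encoding is attempted.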

In the invention, the entry attributes of the entries are in one-to-one correspondence with the dictionary types to which the entries belong, and each dictionary has preset entry attributes.

The correspondence between the entry attribute and the dictionary type is shown in table 24.

TABLE 24

Entry attribute	Dictionary type
Common name	Universal name dictionary
Commodity name	Commodity name dictionary
Product name	Product name dictionary
Administration route	Administration route dictionary
Dosage form	Formulation dictionary
Manufacturer	Manufacturer dictionary
Packing material	Packing material dictionary
Specification	Specification dictionary
Packaging specification	Packaging specification dictionary
Common name large name	Common name large name dictionary
Nosocomial preparation	Nosocomial preparation dictionary

The subsequent steps of this exemplary method encode the medicine information represented by the medicine information character string. However, some medicine information cannot be encoded. For example, if the entries matched with the substrings include an entry whose entry attribute is "common name large name", the medicine information character string represents information on several medicines (corresponding to the several standard common names that make up the common name large name). As another example, if the entries matched with the substrings include an entry whose entry attribute is "nosocomial preparation", the medicine information character string represents a medicine made by a hospital itself; such medicines are not among the medicines published by the CFDA and have no code that is generally used in the medical field.

For such cases above, the present exemplary method employs the step S16 processing:

Step S16, judging whether the entries matched with the substrings include an entry with a blocking entry attribute; if so, ending the processing so that the subsequent encoding steps are not performed; if not, continuing with step S17.

In this step, the blocking entry attribute is "common name large name" (indicating that the medicine information character string represents information on several medicines) or "nosocomial preparation" (indicating that the medicine represented by the medicine information character string is made by a hospital itself).

Step S17, judging whether the entries matched with the substrings include entries corresponding to the target entry attributes; if so, going to step S18; if not, ending the processing.

In this step, the target entry attributes are the entry attributes of the sub-entries in the medicine joint information dictionary.

For example, if the full-version medicine joint information dictionary is used, the target entry attributes are: common name, dosage form, specification, packing material name and manufacturer name; if the simplified medicine joint information dictionary is used, the target entry attributes are: common name and administration route.

In this step, different numbers and types of target entry attributes can be selected when matching a joint entry, and the joint entry is matched to the medicine information character string according to the selected target entry attributes. For example, when more target entry attributes are selected, the matched joint entry contains more sub-entries and the final code of the medicine information character string has correspondingly more digits, which benefits the subsequent management and use of the medicine codes.

Step S18, merging the entries corresponding to the target entry attributes into an entry merging group, and matching the entry merging group against the joint entries in the medicine joint information dictionary; if a directly matching joint entry exists, assigning the joint code of that joint entry to the medicine information character string and outputting it; if no directly matching joint entry exists, ending the processing or continuing with the fuzzy matching process of steps S19 to S112.

The purpose of this step is to assign, to a medicine information character string whose entry merging group directly matches a joint entry, the joint code of that joint entry.

This step follows the principle that coding is performed only when the result of classification and recognition completely matches a joint entry in the medicine joint information dictionary; otherwise no coding is performed.
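Steps S16 to S18 can be condensed into one lookup function, sketched below for the simplified dictionary. Several points are assumptions rather than statements of the source: step S17 is read as requiring an entry for every target attribute, matched entries are assumed to be already normalized to standard terms, and the joint entry and its code "J0001" are invented placeholders.

```python
BLOCKING_ATTRIBUTES = {"common name large name", "nosocomial preparation"}

# Target entry attributes of the simplified dictionary, in a fixed order.
TARGET_ATTRIBUTES = ("common name", "administration route")

# Hypothetical joint entries: tuple of sub-entries -> joint code.
JOINT_DICTIONARY = {
    ("amoxicillin", "oral administration"): "J0001",
}


def assign_joint_code(matched_entries):
    """matched_entries: list of (entry, entry attribute). Returns a code or None."""
    attributes = [attribute for _, attribute in matched_entries]
    # S16: stop if any matched entry carries a blocking entry attribute.
    if any(attribute in BLOCKING_ATTRIBUTES for attribute in attributes):
        return None
    by_attribute = {attribute: entry for entry, attribute in matched_entries}
    # S17: every target attribute must be present (assumed reading).
    if not all(target in by_attribute for target in TARGET_ATTRIBUTES):
        return None
    # S18: merge the target entries and look for a directly matching joint entry.
    merging_group = tuple(by_attribute[target] for target in TARGET_ATTRIBUTES)
    return JOINT_DICTIONARY.get(merging_group)
```

Entries with non-target attributes, such as a dosage form here, are simply ignored by the lookup. A `None` result corresponds either to ending the processing or to handing over to the fuzzy matching steps.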

Hereinafter, a specific implementation of the steps S17 to S18 will be described in detail in the second embodiment.

Step S19, determining each entry corresponding to the preset entry attribute as a reference entry; analyzing preset dimensions of each reference entry to obtain analysis results of each dimension of each reference entry; meanwhile, analyzing each sub-entry in the medicine combined information dictionary in a preset dimension respectively to obtain an analysis result of each dimension of each sub-entry.

In this step, the reference entry and the sub-entry are respectively used as analysis objects, and optionally, performing a preset dimension analysis on the analysis objects may include but is not limited to:

(1) determining each Chinese character in the analysis object;

(2) determining the initial consonant of each Chinese character in the analysis object;

(3) determining the vowel of each Chinese character in the analysis object;

(4) determining a first character of the analysis object;

(5) determining the pinyin of the first character of the analysis object; and

(6) determining the non-Chinese characters in the analysis object; if the analysis object contains no non-Chinese characters, the analysis result of this item is null.

When the parsing object is a reference entry, the parsing results of the dimensions thereof may include, but are not limited to: each Chinese character in the reference entry, the initial consonant of each Chinese character in the reference entry, the vowel of each Chinese character in the reference entry, the first character of the reference entry, the pinyin of the first character of the reference entry, and the non-Chinese character in the reference entry.

When the parsing object is a sub-entry, the parsing results may include, but are not limited to: each Chinese character in the sub-entry, the initial consonant of each Chinese character in the sub-entry, the vowel of each Chinese character in the sub-entry, the first character of the sub-entry, the pinyin of the first character of the sub-entry, and the non-Chinese characters in the sub-entry.

For example, Table 25 shows the results of the dimension analysis of the sub-entry "famotidine".

TABLE 25
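The six-dimension analysis can be sketched as follows. The tiny pinyin table is a hypothetical stand-in for a real pinyin resource, and the Chinese string 法莫替丁 is assumed here to be the Chinese form of "famotidine"; neither is taken from Table 25.

```python
from typing import Dict, List, Tuple

# Hypothetical pinyin table: character -> (initial consonant, vowel part).
PINYIN: Dict[str, Tuple[str, str]] = {
    "法": ("f", "a"), "莫": ("m", "o"), "替": ("t", "i"), "丁": ("d", "ing"),
}


def is_chinese(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def parse_dimensions(entry: str) -> Dict[str, List[str]]:
    """Analyse an entry along the six preset dimensions."""
    hanzi = [c for c in entry if is_chinese(c)]
    first = entry[:1]
    first_pinyin = "".join(PINYIN.get(first, ("", "")))
    return {
        "chars": hanzi,
        "initials": [PINYIN[c][0] for c in hanzi if c in PINYIN],
        "finals": [PINYIN[c][1] for c in hanzi if c in PINYIN],
        "first_char": [first] if first else [],
        "first_char_pinyin": [first_pinyin] if first_pinyin else [],
        # An empty list plays the role of the "null" result of item (6).
        "non_chinese": [c for c in entry if not is_chinese(c)],
    }
```

For instance, `parse_dimensions("法莫替丁20mg")` yields the initials `f, m, t, d` and the non-Chinese characters `2, 0, m, g`.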

Step S110, for each reference entry, matching the analysis result of each dimension of the reference entry with the analysis result of each dimension of each sub-entry corresponding to the same entry attribute, searching for one or more sub-entries matched with the reference entry, and determining the searched sub-entries as the target sub-entries of the reference entry.

A specific implementation of step S110 is described in detail in the third embodiment below.

Step S111, searching the drug combined information dictionary for a combined entry consisting of the target sub-entries of all the reference entries, and determining the found combined entry as the combined entry that fuzzily matches the drug information character string.

Two specific implementations of step S111 are described in detail in the fourth and fifth embodiments below.

Step S112, sending the combined entry that fuzzily matches the drug information character string to a manual processing platform for manual processing.

Steps S19 to S112 search for fuzzy-matching combined entries for a drug information character string whose entry merging group cannot be directly matched. Preset dimension analysis is performed on each reference entry and on each sub-entry in the drug combined information dictionary; the analysis result of each dimension of a reference entry is then matched against the analysis result of each dimension of each sub-entry to find the sub-entries matching that reference entry; the fuzzy-matching combined entries of the drug information character string are then found from the sub-entries matching each reference entry; finally, the fuzzy-matching combined entries are sent to a manual processing platform for further manual processing (for example, manual coding). The specific processing procedure is not limited by the present invention. For example, one of the fuzzy-matching combined entries may be manually selected as the final matched combined entry, and its code assigned to the drug information character string.

It should be noted that the fuzzy matching process of steps S19 to S112 may be selectively retained or omitted in the present exemplary method. That is, the principle strictly followed by the present invention is that a drug information character string is automatically coded only when the result of classification recognition directly matches a combined entry in the drug combined information dictionary; for drug information character strings that cannot directly match a combined entry in the drug combined information dictionary, the present invention may optionally search for fuzzy-matching combined entries and send them to a manual processing platform for manual processing (e.g., manual coding).

The method provided by the invention fully considers the characteristics that the medicine information character string input by medical practitioners belongs to natural language, has complex and various formats and has no unified standard, and the like, and utilizes various dictionaries established in advance according to the general standard in the medical field to segment and match the medicine information character string, so that medicine information is classified and identified, and the medicine information is coded according to the identification result. The invention strictly follows the following principle that the sub-character string cut from the medicine information character string can be used as the result of classification recognition only when being directly matched with the vocabulary entry in the dictionary set, and the automatic coding is carried out only when the result of classification recognition can be directly matched with the combined vocabulary entry in the medicine combined information dictionary, otherwise, the automatic coding is not carried out.

Example one

Referring now to FIG. 3, an exemplary embodiment of step S14 of the exemplary method of the present invention is shown.

As shown in fig. 3, the process of segmenting the remaining characters of the preprocessed medicine information character string into a plurality of sub-character strings (first type sub-character strings or second type sub-character strings) based on the dictionary set may include:

Step S20, judging whether the remaining characters of the preprocessed medicine information character string contain symbols; if they contain symbols, step S21 is performed; if they contain no symbols, step S22 is performed.

Step S21, matching characters between every two adjacent symbols in the residual characters of the preprocessed medicine information character string with entries in a dictionary set as a whole; if the matching is successful, executing step S211; if the matching fails, step S212 is executed.

Step S211, the characters between the two adjacent symbols are cut out as the first type substring.

Step S212, the two adjacent symbols and the characters between them are determined as a temporarily unsegmented character string, and then step S23 is performed.

The processing rule followed by steps S21, S211, and S212 is: the characters between adjacent symbols are matched as a whole with the entries in the dictionary set, and are cut out only when a match is found; otherwise they are temporarily left unsegmented.

For example, Table 26 shows the segmentation of "(lipitor) atorvastatin calcium tablet", in which "lipitor", "atorvastatin calcium tablet", and "tablet" are each the characters between symbols and each have a matching entry, and are therefore cut out separately.

TABLE 26
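The whole-segment rule of steps S21, S211, and S212 can be sketched as below. The symbol set, the function name, and the treatment of the start and end of the string as segment boundaries are assumptions made for illustration; the sketch also keeps only the characters of an unmatched segment, not its surrounding symbols.

```python
import re
from typing import List, Set, Tuple

# Assumed symbol set; a real list of symbols would be configured separately.
SYMBOLS = r"[()（）\[\]|,;:+/]"


def split_between_symbols(text: str, dictionary: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (first-type substrings, temporarily unsegmented strings)."""
    matched: List[str] = []
    pending: List[str] = []
    for segment in re.split(SYMBOLS, text):
        segment = segment.strip()
        if not segment:
            continue
        # A segment is cut out only if it matches an entry as a whole.
        (matched if segment in dictionary else pending).append(segment)
    return matched, pending
```

Segments that land in the second list would be handed on to step S23 for model-based segmentation.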

Step S22, matching the residual characters of the character string of the preprocessed medicine information with entries in a dictionary set by adopting a mechanical word segmentation method; if all the remaining characters in the preprocessed medicine information character string can be matched with the entries, executing step S221; if there is a single character or a plurality of continuous characters that cannot be matched with the entry in the remaining characters of the preprocessed medicine information character string, step S222 is executed.

Step S221, segmenting the remaining characters of the preprocessed medicine information character string according to the matched entries, the resulting pieces being first-type substrings.

Step S222, cutting out the remaining characters of the preprocessed medicine information character string as a whole as a second-type substring.

The processing rule followed by steps S22, S221, and S222 is: the remaining characters of the preprocessed medicine information character string are matched with the entries by a mechanical word segmentation method, and are segmented only when every character is covered by a matched entry; otherwise they are temporarily left unsegmented.

For example, when "noh-long repaglinide tablet" is segmented, "noh-long" and "repaglinide tablet" both have matching entries, that is, every character is covered by a matched entry; the string is therefore segmented, and the segmentation result is "noh-long" and "repaglinide tablet".

The mechanical word segmentation method adopted in step S22 may be a forward maximum matching type, a reverse maximum matching type, or a least-segmentation type. The specific segmentation process is not described in detail in this embodiment.
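As one illustration of the options named above, forward maximum matching combined with the all-or-nothing rule of steps S221 and S222 might look like the sketch below. The function names and the sample dictionary are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def forward_max_match(text: str, dictionary: Set[str]) -> Optional[List[str]]:
    """Greedy longest-prefix segmentation; None if some character is uncovered."""
    max_len = max((len(word) for word in dictionary), default=0)
    pieces: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                pieces.append(text[i:i + size])
                i += size
                break
        else:
            return None  # no entry starts at position i
    return pieces


def segment(text: str, dictionary: Set[str]) -> Tuple[str, List[str]]:
    """Segment only when every character is covered; otherwise keep the whole."""
    pieces = forward_max_match(text, dictionary)
    if pieces is None:
        return "second", [text]   # step S222: whole string, second-type
    return "first", pieces        # step S221: first-type substrings
```

Because the matching is greedy, a string may be reported as unsegmentable even though a different split would cover it; reverse maximum matching or least segmentation would trade off differently.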

Step S23, judging whether the temporarily unsegmented character string contains a preset special symbol; if it contains a special symbol, step S231 is performed; if it does not, step S233 is performed.

Step S231, searching for the character model to which the temporarily unsegmented character string belongs, and segmenting the string according to the segmentation rule corresponding to that character model; the character models are provided by a pre-established character model library, and each character model has a one-to-one corresponding segmentation rule.

Step S232, matching the cut-out characters with the entries in the dictionary set; if the matching succeeds, the cut-out characters are determined as first-type substrings, and if the matching fails, they are determined as second-type substrings.

Step S233, the temporarily unsegmented character string is directly determined as a second-type substring.

The processing rule followed by steps S23, S231, S232, and S233 is: when the temporarily unsegmented character string contains a preset special symbol, it is segmented according to the character model to which it belongs; otherwise it is directly cut out as a second-type substring. The characters cut out based on the character model are matched again with the entries in the dictionary set; those that directly match an entry become first-type substrings, and those that do not become second-type substrings.

For example, the predetermined special symbols may include, but are not limited to, vertical lines, parentheses, commas, pause signs, periods, colons, plus signs, semicolons, slashes, and the like.

For example, the following are part of the character models in the character model library and the segmentation rules thereof:

(1) Character model: BCDE type, where C and E are parentheses and B and D are letters;

Segmentation rule: B and D are each cut out.

(2) Character model: FGH type, where F and H are both Chinese characters and G is a vertical line;

Segmentation rule: F and H are each cut out.

(3) Character model: IJK type, where I and K are both Chinese characters and J is a semicolon, period, question mark, exclamation mark, or pause mark;

Segmentation rule: I and K are each cut out.

(4) Character model: STU type, where T is a slash and S and U cannot be successfully matched with the dictionary;

Segmentation rule: STU is cut out as a whole.

The following are several examples of segmentation according to the character model:

The original character string "juhe li (shandong qilu)" conforms to the BCDE character model and is therefore segmented into "juhe li" and "shandong qilu".

The original character string "omeprazole magnesium enteric-coated tablet | Luxek MUPS" conforms to the FGH character model and is therefore segmented into "omeprazole magnesium enteric-coated tablet" and "Luxek MUPS".

The original character string consisting of "meisha mulberry sustained release granules" and "adisha" joined by a separator symbol conforms to the IJK character model and is therefore segmented into "meisha mulberry sustained release granules" and "adisha".

The original character string "haemophilus B/hepatitis B vaccine" conforms to the STU character model and is therefore cut out as a whole, as "haemophilus B/hepatitis B vaccine".
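The four character models above can be approximated with regular expressions. This is a rough sketch under simplifying assumptions: it does not check that the parts are Chinese characters or letters, it merges the FGH and IJK models because their rule is the same, and the name `segment_by_model` is hypothetical.

```python
import re
from typing import FrozenSet, List, Optional, Tuple


def segment_by_model(s: str, dictionary: FrozenSet[str] = frozenset()
                     ) -> Tuple[Optional[str], List[str]]:
    """Return (model name or None, cut-out pieces) for an unsegmented string."""
    # BCDE: text "(" text ")" -> cut out B and D.
    m = re.fullmatch(r"([^()]+)\(([^()]+)\)", s)
    if m:
        return "BCDE", [m.group(1).strip(), m.group(2).strip()]
    # FGH / IJK: two parts around a vertical line or ; . ? ! 、 -> cut out both.
    m = re.fullmatch(r"([^|;.?!、]+)[|;.?!、]([^|;.?!、]+)", s)
    if m:
        return "FGH/IJK", [m.group(1).strip(), m.group(2).strip()]
    # STU: S "/" U where neither S nor U is a dictionary entry -> keep whole.
    if "/" in s:
        left, right = (part.strip() for part in s.split("/", 1))
        if left not in dictionary and right not in dictionary:
            return "STU", [s]
    return None, [s]  # no model applies; the string stays whole
```

The pieces returned for the first two models would then be matched against the dictionary set again, as step S232 describes.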

Example two

Referring to FIG. 4, a detailed description of steps S17-S18 of the exemplary method of the present invention is shown.

As shown in fig. 4, the process of matching the entry merging group with the combined entries in the drug combined information dictionary may include:

Step S31, determining the entry attributes of the sub-entries in the full-version drug combined information dictionary as the current target entry attributes, and judging whether the entries matching the substrings include an entry corresponding to each current target entry attribute; if so, step S32 is executed, and if not, step S33 is executed.

Specifically, referring to the full-version drug combined information dictionary in the exemplary method, the target entry attributes are: common name, dosage form, specification, packaging material name, and manufacturer name.

The step is to judge whether each entry matched with the substring has entries respectively corresponding to a general name, a dosage form, a specification, a packaging material name and a manufacturer name.

Step S32, merging the entries that correspond to the current target entry attributes (common name, dosage form, specification, packaging material name, and manufacturer name) among the entries matching the substrings into an entry merging group, and matching the entry merging group with the combined entries in the full-version drug combined information dictionary; if a directly matching combined entry exists, assigning the combined code of that combined entry to the medicine information character string and outputting it; if no directly matching combined entry exists, performing the fuzzy matching process of steps S19 to S112 in the exemplary method.

Step S33, determining the entry attribute of each sub-entry in the simple drug combined information dictionary as the current target entry attribute, and judging whether each entry matched with the sub-character string has an entry corresponding to the current target entry attribute; if so, step S34 is executed, and if not, the process ends.

Specifically, referring to the simple-version drug combined information dictionary in the exemplary method, the target entry attributes are: common name and route of administration.

The step is to judge whether each entry matched with the sub-character string has entries respectively corresponding to the common name and the administration route.

Step S34, merging the entries that correspond to the current target entry attributes (common name and route of administration) among the entries matching the substrings into an entry merging group, and matching the entry merging group with the combined entries in the simple-version drug combined information dictionary; if a directly matching combined entry exists, assigning the combined code of that combined entry to the medicine information character string and outputting it; if no directly matching combined entry exists, performing the fuzzy matching process of steps S19 to S112 in the exemplary method.

In the process of matching the combined vocabulary entry, the vocabulary entry attribute of the sub-vocabulary entry in the full version medicine combined information dictionary is preferentially adopted as the target vocabulary entry attribute, if the vocabulary entry corresponding to the sub-character string does not meet the target vocabulary entry attribute of the full version medicine combined information dictionary, the vocabulary entry attribute of the sub-vocabulary entry in the simple version medicine combined information dictionary is adopted as the target vocabulary entry attribute, the purpose is to preferentially match the combined vocabulary entry with more sub-vocabulary entries, and codes containing more digits are given to the medicine, so that the follow-up management and utilization of the medicine codes are facilitated.
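The full-version-first, simple-version-second order of steps S31 to S34 can be sketched as follows. The status strings, the function name `match_joint`, and the dictionary shapes are hypothetical choices for this illustration.

```python
from typing import Dict, Optional, Tuple

FULL_ATTRS = ("common name", "dosage form", "specification",
              "packaging material name", "manufacturer name")
SIMPLE_ATTRS = ("common name", "route of administration")

JointDict = Dict[Tuple[str, ...], str]


def match_joint(entries: Dict[str, str], full_dict: JointDict,
                simple_dict: JointDict) -> Tuple[str, Optional[str]]:
    """Return ('coded', code), ('fuzzy', None) or ('end', None)."""
    for attrs, joint_dict in ((FULL_ATTRS, full_dict), (SIMPLE_ATTRS, simple_dict)):
        if all(attr in entries for attr in attrs):
            group = tuple(entries[attr] for attr in attrs)  # entry merging group
            code = joint_dict.get(group)
            if code is not None:
                return "coded", code
            return "fuzzy", None  # hand over to steps S19 to S112
    return "end", None  # neither attribute set is fully covered
```

Trying the full-version attributes first means that a string carrying all five attributes is never coded with the shorter simple-version code.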

Example three

Referring now to FIG. 5, a specific embodiment of step S110 of the exemplary method of the present invention is shown.

As shown in fig. 5, the process of finding the target sub-entry of the reference entry may include:

step S41, for each reference entry, calculating the similarity between the reference entry and the sub-entries corresponding to the same entry attribute according to the following formula:

wherein M represents the similarity;

t represents the analysis result of each dimension of the reference entry;

q represents the reference entry;

t in q represents each dimension of the reference entry;

d represents a sub-entry corresponding to the same entry attribute as the reference entry;

tf(t in d) represents the frequency with which the analysis result of the reference entry matches the analysis result of the sub-entry corresponding to the same entry attribute in the same dimension;

T represents the total number of sub-entries in the drug combined information dictionary that correspond to the same entry attribute as the reference entry; T(t) represents, among all those sub-entries, the total number whose analysis results of each dimension match the analysis results of each dimension of the reference entry;

getboost() represents the preset weight of each dimension;

norm(t, d) represents the length normalization factor of the sub-entry corresponding to the same entry attribute as the reference entry;

wherein the dimensions of the analysis object are: each Chinese character, the initial consonant of each Chinese character, the vowel of each Chinese character, the first character, the pinyin of the first character, and the non-Chinese characters.
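The symbol definitions above use the same factor names as the classic TF-IDF scoring function of the Lucene search library. As a sketch only, and assuming that the similarity is a sum over the dimensions of the product of the listed factors, with the inverse-frequency term built from T and T(t) in Lucene's classic form (both are assumptions, not confirmed by the text), the similarity could be written as:

```latex
M(q, d) = \sum_{t \,\in\, q} \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^{2}
          \cdot \mathrm{getboost}(t) \cdot \mathrm{norm}(t, d),
\qquad
\mathrm{idf}(t) = 1 + \ln\frac{T}{T(t) + 1}
```

Under this reading, a dimension whose analysis result matches only a few sub-entries (small T(t)) contributes more to M than one that matches many.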

Step S42, determining one or more sub-terms as target sub-terms of the reference term according to the similarity between the reference term and each sub-term corresponding to the same term attribute.

Alternatively, the step may be embodied as follows: sequencing each sub-entry corresponding to the same entry attribute according to the similarity of the reference entry, and determining a preset number (for example, 10) of sub-entries sequenced at the top as a target sub-entry of the reference entry; or, determining one or more sub-entries corresponding to the same entry attribute, the similarity of which with the reference entry reaches a preset threshold (for example, the similarity is greater than 0.8), as the target sub-entry of the reference entry.
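Both selection options in step S42 can be sketched in a few lines. The function name `select_targets` and the sample similarity values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def select_targets(similarities: Dict[str, float], top_k: Optional[int] = None,
                   threshold: Optional[float] = None) -> List[Tuple[str, float]]:
    """Pick target sub-entries by rank (top_k) and/or by a similarity threshold."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]          # keep the top-ranked sub-entries
    if threshold is not None:
        ranked = [(entry, sim) for entry, sim in ranked if sim > threshold]
    return ranked
```

Returning the similarity alongside each sub-entry keeps the value available for the later weighting steps.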

So that the similarity between a reference entry and each of its target sub-entries can be readily consulted and used when the present invention is implemented, the similarity between each reference entry and each of its target sub-entries may be output together with the final output result.

In a specific implementation of the invention, if a more accurate measure of the degree of similarity between the reference entry and the target sub-entry is required, the total confidence between the reference entry and the target sub-entry may be calculated. The total confidence is calculated according to the following process:

Step (1), determining each Chinese character in the reference entry.

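Step (2), calculating the cosine confidence of the reference entry and its target sub-entry according to the following formula:

$$N = \cos(Q, d') = \frac{\sum_{j=1}^{V} w_{Q,j}\, w_{d',j}}{\sqrt{\sum_{j=1}^{V} w_{Q,j}^{2}}\;\sqrt{\sum_{j=1}^{V} w_{d',j}^{2}}}$$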

wherein N represents the cosine confidence;

V represents the total number of Chinese characters contained in the reference entry and its target sub-entry;

Q represents the reference entry;

d' represents a target sub-entry of the reference entry;

w_{Q,j} represents the frequency of occurrence of the j-th Chinese character in the reference entry;

w_{d',j} represents the frequency of occurrence of the j-th Chinese character in the target sub-entry of the reference entry;

j denotes the serial number of each Chinese character contained in the reference entry and the target sub-entry.

Step (3), calculating the total confidence of the reference entry and its target sub-entry according to the following formula:

S＝M×a+N×b

wherein S represents the total confidence;

a represents a preset weight corresponding to the similarity M;

b represents a preset weight corresponding to the cosine confidence N, where b = 1 − a.
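The two-part confidence can be sketched as below, assuming that the cosine confidence is the standard cosine of the two character-frequency vectors. The sketch counts every character rather than only Chinese characters, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_confidence(reference: str, target: str) -> float:
    """Cosine of the character-frequency vectors of the two strings."""
    w_q, w_d = Counter(reference), Counter(target)
    vocabulary = set(w_q) | set(w_d)  # the V distinct characters
    dot = sum(w_q[ch] * w_d[ch] for ch in vocabulary)
    norm = (math.sqrt(sum(v * v for v in w_q.values()))
            * math.sqrt(sum(v * v for v in w_d.values())))
    return dot / norm if norm else 0.0


def total_confidence(m: float, n: float, a: float) -> float:
    """S = M*a + N*b with b = 1 - a."""
    return m * a + n * (1 - a)
```

As a quick check of the arithmetic, a seven-character entry and a four-character sub-entry that share four distinct characters give a cosine of 4 / (√7 · √4), about 0.756.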

For example, assume that one target sub-entry of the reference entry "nimodipine injection" is "nimodipine", where the frequency of occurrence of each Chinese character is shown in Table 27.

TABLE 27

The cosine confidence N of the reference entry "nimodipine injection" and the target sub-entry "nimodipine" is then calculated according to the cosine confidence formula; in this example N = 0.75.

According to the similarity formula, the similarity M is calculated to be 0.92.

The total confidence of "nimodipine injection" and "nimodipine" is then calculated according to the total confidence formula, with a = 40% and b = 60%: S = M×a + N×b = 0.92×40% + 0.75×60% ≈ 0.82.

Example four

Referring now to FIG. 6, a specific embodiment of step S111 of the exemplary method of the present invention is shown.

As shown in fig. 6, the process of finding the combined entry fuzzy-matched with the character string of the medicine information according to the target sub-entry may include:

Step S51, searching the drug combined information dictionary for combined entries that simultaneously include a target sub-entry of every reference entry, and determining the found combined entries as the to-be-selected combined entries.

For example, assume that the entry attribute is a reference entry of a common name, and its target sub-entry is "enteral nutrition (TP-HE)"; the entry attribute is a reference entry of a dosage form, and a target sub-entry is 'emulsion'; the entry attribute is a reference entry of the specification, and the target sub-entry is '500 ml'; the entry attribute is a reference entry of the packaging specification, and the target sub-entries are '1 bag' and '1 bottle'; the entry attribute is a reference entry of a package material name, and target sub-entries of the entry are a non-PVC film and a glass bottle; the entry attribute is a reference entry of the name of the manufacturer, and the target sub-entry is "fresenius kabi deutschland gmbh", so that the to-be-selected joint entry can be determined to be the joint entry shown in the shadow in the medicine joint information dictionary shown in table 28.

TABLE 28
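The candidate search of step S51 can be sketched as a filter over the rows of the combined dictionary. The row layout (one dictionary per combined entry, keyed by entry attribute) and the function name are assumptions; the sample values are taken from the example above, except for the hypothetical non-matching row used in the usage note.

```python
from typing import Dict, List, Set


def find_candidates(joint_rows: List[Dict[str, str]],
                    targets_by_attr: Dict[str, Set[str]]) -> List[Dict[str, str]]:
    """Keep rows whose sub-entry for every attribute is one of its target sub-entries."""
    return [
        row for row in joint_rows
        if all(row.get(attr) in targets for attr, targets in targets_by_attr.items())
    ]


TARGETS = {
    "common name": {"enteral nutrition (TP-HE)"},
    "dosage form": {"emulsion"},
    "specification": {"500ml"},
    "packaging specification": {"1 bag", "1 bottle"},
    "packaging material name": {"non-PVC film", "glass bottle"},
    "manufacturer name": {"fresenius kabi deutschland gmbh"},
}
```

A row with a different dosage form, for example "powder", is rejected even if all of its other sub-entries are target sub-entries.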

Step S52, determining the similarity between each target sub-entry in the to-be-selected combined entry and the corresponding reference entry according to the calculation result of step S41 in the third embodiment.

For example, assume that the similarity between each target sub-entry in Table 28 and the corresponding reference entry is the value given in parentheses, specifically:

the similarity between the target sub-entry 'enteral nutrition (TP-HE)' and the reference entry with the entry attribute being a common name is 0.6;

the similarity between the target sub-entry emulsion and the reference entry with the entry attribute being the dosage form is 0.7;

the similarity between the target sub-entry of 500ml and the reference entry with the entry attribute as the specification is 0.7;

the similarity between the target sub-entry of 1 bag and the reference entry with the entry attribute of packaging specification is 0.5;

the similarity between the target sub-entry of 1 bottle and the reference entry with the entry attribute of packaging specification is 0.9;

the similarity between the target sub-entry 'non-PVC film' and the reference entry with the entry attribute being the wrapper name is 0.8;

the similarity between the target sub-entry glass bottle and the reference entry with the entry attribute being the package name is 0.5;

the similarity between the target sub-entry "fresenius kabi deutschland gmbh" and the reference entry whose entry attribute is the name of the manufacturer is 0.6.

Step S53, calculating a weighted average of the similarity of each target sub-entry and the corresponding reference entry in the to-be-selected joint entry, wherein the weight of the similarity of each target sub-entry and the corresponding reference entry is equal to the preset weight of the entry attribute corresponding to the target sub-entry and the corresponding reference entry.

Specifically, weights are set for the entry attributes such as a general name, a dosage form, a specification, a packaging material name, a manufacturer name and the like in advance, and when a weighted average is calculated, the weight of the similarity between a target sub-entry and a corresponding reference entry is equal to the weight of the entry attribute corresponding to the target sub-entry.

For example, assume that the preset weights of the entry attributes such as common name, dosage form, specification, packaging material name, and manufacturer name are, respectively, 50%, 10%, and so on. The weighted average is calculated from the similarity between each target sub-entry and the corresponding reference entry in Table 28; the results are shown in the rightmost column of Table 28 and are 0.105 and 0.107 from top to bottom.

Step S54, selecting, according to the weighted average of each to-be-selected combined entry, one or more combined entries from all the to-be-selected combined entries as the combined entries matching the entry merging group.

Alternatively, the step may be embodied as follows: sorting all the to-be-selected joint entries according to the weighted average value of all the to-be-selected joint entries, and selecting a preset number (for example, 2) of to-be-selected joint entries with top sorting as joint entries matched with entry merging groups; or selecting one or more to-be-selected joint entries of all to-be-selected joint entries, the weighted average value of which is greater than a preset threshold (for example, greater than 0.1), as joint entries matched with the entry combination group.

For example, the candidate joint entries with the weighted average value greater than 0.106 in each candidate joint entry in the table 28 may be selected as the joint entry matching the entry merge group, that is, the joint entry corresponding to the second row from top to bottom.
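The arithmetic of steps S53 and S54 can be checked with a short sketch. Three things are assumed here because the text does not state them outright: a weight of 50% for the common name and 10% for each of the other five attributes, division of the weighted sum by the number of attributes, and the pairing of "1 bag" with "non-PVC film" in the first row and "1 bottle" with "glass bottle" in the second. Under these assumptions the sketch reproduces the stated values 0.105 and 0.107.

```python
from typing import Dict

# Assumed weights: 50% for the common name, 10% for every other attribute.
WEIGHTS: Dict[str, float] = {
    "common name": 0.5, "dosage form": 0.1, "specification": 0.1,
    "packaging specification": 0.1, "packaging material name": 0.1,
    "manufacturer name": 0.1,
}


def weighted_average(similarities: Dict[str, float]) -> float:
    """Weighted sum of the similarities divided by the number of attributes."""
    total = sum(WEIGHTS[attr] * sim for attr, sim in similarities.items())
    return total / len(similarities)


COMMON = {"common name": 0.6, "dosage form": 0.7, "specification": 0.7,
          "manufacturer name": 0.6}
ROW_1 = {**COMMON, "packaging specification": 0.5, "packaging material name": 0.8}
ROW_2 = {**COMMON, "packaging specification": 0.9, "packaging material name": 0.5}

# Threshold selection as in the example: keep rows whose average exceeds 0.106.
SELECTED = [name for name, row in (("row 1", ROW_1), ("row 2", ROW_2))
            if weighted_average(row) > 0.106]
```

With the threshold of 0.106 only the second row is kept, which agrees with the example above.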

Example five

Referring now to FIG. 7, another embodiment of step S111 of the exemplary method of the present invention is shown.

As shown in fig. 7, the process of finding the combined entry fuzzy-matched with the character string of the medicine information according to the target sub-entry may include:

Step S61, searching the drug combined information dictionary for combined entries that simultaneously include a target sub-entry of every reference entry, and determining the found combined entries as the to-be-selected combined entries.

For a detailed implementation of this step, refer to step S51, which is not described herein again.

And step S62, determining the total confidence of each target sub-entry in the to-be-selected combined entry and the corresponding reference entry.

For a detailed implementation of this step, refer to step S52, which is not described herein again. The total confidence is calculated according to the same process as in the third embodiment:

Step (1), determining each Chinese character in the reference entry.

Step (2), calculating the cosine confidence N of the reference entry Q and its target sub-entry d'.

Step (3), calculating the total confidence of the reference entry and its target sub-entry according to the following formula:

S＝M×a+N×b

wherein S represents the total confidence;

a represents a preset weight corresponding to the similarity M;

b represents a preset weight corresponding to the cosine confidence N, where b = 1 − a.

Step S63, calculating a weighted average of the total confidence between each target sub-entry in the to-be-selected combined entry and the corresponding reference entry, wherein the weight of each total confidence is equal to the preset weight of the target entry attribute to which the target sub-entry and its reference entry correspond.

Specifically, weights are set in advance for target entry attributes such as common name, dosage form, specification, packaging material name, manufacturer name, route of administration, commodity name, and product name; when the weighted average is calculated, the weight of the total confidence between a target sub-entry and the corresponding reference entry is equal to the weight of the target entry attribute corresponding to that target sub-entry.

For a detailed implementation of this step, refer to step S53, which is not described herein again.

Step S64, selecting, according to the weighted average of each to-be-selected combined entry, one or more combined entries from all the to-be-selected combined entries as the combined entries matching the entry merging group.

Exemplary System

An exemplary system of the present invention, which corresponds to an exemplary method, is described below with reference to fig. 8 in conjunction with the application scenario of fig. 1.

Fig. 8 is a block diagram illustrating an exemplary natural language processing system for drug information according to the present invention, where the natural language processing system for drug information, as shown in fig. 8, includes:

A dictionary database 71, configured to provide a preset specification dictionary, a packaging specification dictionary, a dictionary set, and a drug combined information dictionary. For the specific information of the specification dictionary, the packaging specification dictionary, the dictionary set, and the drug combined information dictionary, reference is made to the exemplary method, which is not described herein again.

An input module 72, configured to input a medicine information character string.

A preprocessing module 73, configured to preprocess the medicine information character string to obtain a preprocessed medicine information character string.

A first segmentation module 74 configured to segment a specification character string and a packaging specification character string from the preprocessed medicine information character string based on the specification dictionary and the packaging specification dictionary; wherein the specification dictionary comprises a plurality of entries representing specification units of the medicine; the packaging specification dictionary comprises a plurality of entries which represent packaging specification units of the medicines; the specification character string represents specification information of the medicine; the packing specification character string represents packing specification information of the medicine.

A second segmentation module 75, configured to segment remaining characters of the preprocessed medicine information character string into a plurality of sub-character strings based on the dictionary set, where the sub-character strings are first type sub-character strings and/or second type sub-character strings; the dictionary set consists of a plurality of dictionaries, wherein the dictionaries comprise a plurality of entries for representing the universal names, commodity names, product names, administration routes, dosage forms, manufacturers and packaging materials of medicines, and a plurality of entries for representing two or more combined universal names and names of medicines self-made by hospitals; the first type of substring can be directly matched with an entry in the dictionary set, and the second type of substring cannot be directly matched with an entry in the dictionary set.

A first judgment processing module 76, configured to judge whether all sub-character strings segmented from the remaining characters of the preprocessed medicine information character string are first-type sub-character strings; if any segmented sub-character string is a second-type sub-character string, ending the processing; if all the segmented sub-character strings are first-type sub-character strings, determining the entry attributes of the entries matched with the sub-character strings, and triggering a second judgment processing module 77; the entry attributes correspond one to one to the dictionaries to which the entries belong, and each dictionary has a preset entry attribute. For the division of entry attributes, reference is made to the exemplary method, which is not described herein again.

A second judgment processing module 77, configured to judge whether any entry matched with the sub-character strings corresponds to a shielding-type entry attribute; if such an entry exists, ending the processing; if no such entry exists, triggering a third judgment processing module 78; wherein an entry corresponding to a shielding-type entry attribute indicates that the medicine represented by the medicine information character string is composed of a plurality of medicines, or is a medicine self-made by a hospital.

A third judgment processing module 78, configured to judge whether there is an entry corresponding to the target entry attribute in each entry matched with the sub-character string; if no entry corresponding to the target entry attribute exists, ending the processing; if the entries corresponding to the target entry attributes exist, combining the entries corresponding to the target entry attributes into entry combination groups, and matching the entry combination groups with the combined entries in the drug combined information dictionary; if the directly matched combined entry exists, assigning the combined code of the directly matched combined entry to the medicine information character string; the target entry attribute is the entry attribute of each sub-entry in the drug combined information dictionary; the drug combined information dictionary comprises a plurality of combined entries, each combined entry is provided with a one-to-one corresponding combined code, each combined entry consists of a plurality of sub-entries, and the sub-entries are entries which represent the general names, commodity names, product names, administration routes, dosage forms, manufacturers or packing materials of drugs in the dictionary.

An output module 79, configured to output the combined code of the medicine information character string.
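The chain formed by judgment modules 76, 77 and 78 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function name `encode`, the attribute labels, and the toy dictionary contents are all hypothetical, and the target attributes are taken to be those of a simple common-name-plus-route combined dictionary.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical attribute labels; the source names the attributes but not their encoding.
SHIELDING_ATTRS = {"common_name_large_name", "hospital_preparation"}
TARGET_ATTRS = ("common_name", "route")  # attributes of the sub-entries (assumed)


def encode(substrings: List[Tuple[str, Optional[str]]],
           combined_dict: Dict[Tuple[str, ...], str]) -> Optional[str]:
    """substrings holds (text, entry attribute); attribute is None for second-type substrings."""
    # Module 76: stop if any substring failed to match a dictionary entry.
    if any(attr is None for _, attr in substrings):
        return None
    # Module 77: stop if any matched entry carries a shielding-type attribute.
    if any(attr in SHIELDING_ATTRS for _, attr in substrings):
        return None
    # Module 78: keep entries with target attributes and look up the combined entry.
    by_attr = {attr: text for text, attr in substrings if attr in TARGET_ATTRS}
    if not by_attr:
        return None
    group = tuple(by_attr.get(attr, "") for attr in TARGET_ATTRS)
    return combined_dict.get(group)  # None when there is no direct match
```

A return value of `None` covers both "processing ends" and "no direct match"; a fuller sketch would distinguish the two so that the fuzzy-matching path can be triggered in the second case.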

Optionally, as shown in fig. 9, the exemplary system may further include: a parsing module 81, a matching module 82, a search module 83, and a sending module 84.

The third judgment processing module 78 is further configured to trigger the parsing module 81 if no directly matched combined entry exists when matching the entry combination group with the combined entries in the drug combined information dictionary.

The parsing module 81 is configured to determine each entry corresponding to a preset entry attribute as a reference entry, and to parse each reference entry and each sub-entry in the drug combined information dictionary along preset dimensions, obtaining a parsing result for each dimension of each reference entry and of each sub-entry.

The matching module 82 is configured to, for each reference entry, match an analysis result of each dimension of the reference entry with an analysis result of each dimension of each sub-entry corresponding to the same entry attribute, search for one or more sub-entries matched with the reference entry, and determine the searched sub-entries as target sub-entries of the reference entry.

The search module 83 is configured to search the drug combined information dictionary for combined entries composed of the target sub-entries of the reference entries, and to determine each combined entry found as a combined entry that fuzzily matches the drug information character string.

The sending module 84 is configured to send the combined entry fuzzy-matched with the drug information character string to a manual processing platform for manual processing.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Those of skill in the art will further appreciate that the various illustrative logical blocks, elements, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.

The various illustrative logical blocks, or elements, or devices described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.

In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media can include, but is not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and thus is included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., by infrared, radio, or microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.

Claims

1. A method of automatically encoding drug information, comprising:

step 1, inputting a medicine information character string;

and step 8, outputting the joint code of the medicine information character string.

2. The automatic encoding method of medicine information according to claim 1,

the specification dictionary includes entries of the following types: standard loading specification unit, standard component specification unit, synonym of loading specification unit and synonym of component specification unit;

the standard filling specification unit is the weight or filling amount of the minimum preparation unit of the medicine;

the standard component specification unit is the dosage or the potency of the effective component contained in the minimum preparation unit of the medicine;

the synonym of the loading specification unit is an alias, a common name, an English abbreviation or a wrongly written or mispronounced character of the standard loading specification unit;

the synonym of the component specification unit is an alias, a common name, an English abbreviation or a wrongly written character of the standard component specification unit;

the packaging specification dictionary comprises the following types of entries: a standard formulation minimum unit, a standard packaging specification unit, a formulation minimum unit synonym, a packaging specification unit synonym;

the standard preparation minimum unit is the minimum preparation unit of the medicine;

the standard packaging specification unit is the minimum packaging unit of the medicine;

the synonym of the minimum unit of the preparation is an alias, common name, English abbreviation or wrongly written or mispronounced character of the minimum unit of the standard preparation;

the synonym of the packaging specification unit is an alias, a common name, an English abbreviation or a wrongly written or mispronounced character of the standard packaging specification unit;

the dictionary set comprises a generic name dictionary, a commodity name dictionary, a product name dictionary, an administration route dictionary, a dosage form dictionary, a manufacturer dictionary, a packing material dictionary, a common-name large-name dictionary and a hospital preparation dictionary;

the generic name dictionary includes entries of the following types: standard generic name, generic name synonyms;

the standard universal name is a Chinese medicine universal name;

the generic name synonym is an alias, a common name, an English abbreviation or a wrongly written or mispronounced character of a standard generic name;

the commodity name dictionary includes the following types of entries: standard commodity name, commodity name synonym;

the standard commodity name is the commodity name information of medicines published by the China Food and Drug Administration (CFDA), and the commodity name information in official documents and drug specifications of manufacturers;

the commodity name synonym is an alias, a common name, an English abbreviation or a wrongly written or mispronounced character of the standard commodity name;

the product name dictionary includes entries of the following types: standard product name, product name synonym;

the standard product name is the name information of the medicine product published by CFDA;

the product name synonym is an alias, a common name, an English abbreviation or a wrongly written or mispronounced character of the standard product name;

the administration route dictionary includes entries of the following types: standard route of administration terminology, route of administration synonyms;

the standard route of administration term is a route of administration specified in the Anatomical Therapeutic Chemical (ATC) classification system for drugs;

said route of administration synonym is an alias, common name, English abbreviation or wrongly written or mispronounced character of said standard route of administration term;

the dosage form dictionary includes entries of the following types: standard dosage form terminology, dosage form synonyms;

the standard dosage form terms include: the dosage forms obtained by standardizing the registered dosage forms of medicines published by CFDA according to the general rules for preparations in the Chinese Pharmacopoeia, and the dosage forms listed in the national medical insurance catalogue for which no relevant registration information can be found in CFDA;

said dosage form synonym is an alias, common name, English abbreviation, wrongly written or mispronounced character, or subtype of said standard dosage form term;

the manufacturer dictionary comprises the following types of entries: synonyms of standard manufacturer name and manufacturer name;

the name of the standard manufacturer is the information of a drug manufacturing enterprise published by CFDA, and the Chinese information or English information of a manufacturer;

the synonyms of the names of the manufacturers are abbreviations or English names and great names of the standard manufacturers;

the packing material dictionary comprises the following types of entries: the synonyms of the standard packing material name and the packing material name;

the name of the standard packaging material is a medicine packaging material published by CFDA;

the synonym of the packing material name is an alias, common name or English abbreviation of the standard packing material name;

the common-name large-name dictionary comprises the following types of entries: common name large name terms;

the common name large name term is formed by combining two or more standard common names;

the hospital preparation dictionary includes entries of the following type: hospital preparation names;

the hospital preparation name is a name representing a medicine self-made by a hospital;

the entry attributes corresponding to the specification character string and the packaging specification character string are specification and packaging specification respectively;

when the entries matched with the first type substring and the second type substring belong to a universal name dictionary, the corresponding entry attributes are universal names;

when the vocabulary entry matched with the first type substring and the second type substring belongs to a commodity name dictionary, the corresponding vocabulary entry attribute is a commodity name;

when the entries matched with the first type substrings and the second type substrings belong to a product name dictionary, the corresponding entry attributes are product names;

when the entries matched with the first type substrings and the second type substrings belong to an administration route dictionary, the corresponding entry attributes are administration routes;

when the entries matched with the first type substrings and the second type substrings belong to a dosage form dictionary, the corresponding entry attributes are dosage forms;

when the entries matched with the first type substrings and the second type substrings belong to a dictionary of a manufacturer, the corresponding entry attributes are of the manufacturer;

when the entries matched with the first type substrings and the second type substrings belong to a packing material dictionary, the corresponding entry attributes are packing materials;

when the entries matched with the first type substrings and the second type substrings belong to a common name large name dictionary, the corresponding entry attributes are common name large names;

and when the entries matched with the first type substrings and the second type substrings belong to the hospital preparation dictionary, the corresponding entry attribute is hospital preparation.
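The one-to-one link between a dictionary and its entry attribute can be illustrated with a small lookup table. The sketch below is a hedged illustration only: the attribute labels, the toy entries, and the function name `entry_attribute` are hypothetical, and a real dictionary set would also carry the synonym entries described above.

```python
from typing import Dict, Optional, Set

# Hypothetical miniature dictionary set: attribute label -> entries of that dictionary.
DICTIONARY_SET: Dict[str, Set[str]] = {
    "common_name": {"amoxicillin"},
    "dosage_form": {"capsule", "cap"},          # "cap" stands in for a synonym entry
    "route": {"oral"},
    "hospital_preparation": {"ward mix no. 1"},
}


def entry_attribute(substring: str) -> Optional[str]:
    """Return the attribute of the dictionary that holds `substring`, if any."""
    for attribute, entries in DICTIONARY_SET.items():
        if substring in entries:
            return attribute
    return None  # a second-type substring: no dictionary contains it
```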

3. The method for automatically encoding drug information according to claim 1, wherein the step 2 comprises:

carrying out format normalization processing on non-Chinese characters in the medicine information character string, and deleting irrelevant characters in the medicine information character string to obtain the preprocessed medicine information character string;

wherein the irrelevant characters are provided by a preset irrelevant character dictionary.
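The preprocessing of claim 3 can be sketched as follows. The sketch assumes that "format normalization" means folding full-width characters to half-width and lower-casing Latin letters; the source does not spell this out. The irrelevant-character set and the function name `preprocess` are hypothetical.

```python
import unicodedata

# Hypothetical irrelevant-character dictionary; the source says only that one is preset.
IRRELEVANT = {"*", "#", "★"}


def preprocess(raw: str) -> str:
    # NFKC folds full-width letters, digits and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", raw)
    # Lower-case Latin letters so that "MG" and "mg" compare equal (assumption).
    text = text.lower()
    # Drop every character listed in the irrelevant-character dictionary.
    return "".join(ch for ch in text if ch not in IRRELEVANT)
```

Because normalization runs first, a full-width asterisk is folded to `*` and then removed by the same dictionary entry.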

4. The method for automatically encoding drug information according to claim 1, wherein the step 3 comprises:

judging whether numbers exist in the preprocessed medicine information character string;

if no number exists in the character string of the preprocessed medicine information, directly executing the step 4;

if the character string of the preprocessed medicine information has a number, matching the character string which is next to the number with entries in the specification dictionary and the packaging specification dictionary;

if the successfully matched entry comes from the specification dictionary, segmenting the number and the character string which is next to the number and can be matched with the entry in the specification dictionary to serve as the specification character string;

and if the successfully matched entry is from the packaging specification dictionary, segmenting the number and the character string which is next to the number and can be matched with the entry in the packaging specification dictionary to be used as the packaging specification character string.
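The number-driven segmentation of claim 4 can be sketched with a regular expression that finds each number and a dictionary probe on the text right after it. The unit sets, the function names, and the decision to prefer the longest matching unit are assumptions made for illustration only.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical toy dictionaries of specification and packaging specification units.
SPEC_UNITS = {"mg", "g", "ml"}
PACK_UNITS = {"粒", "片", "盒"}

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def longest_unit(text: str, start: int, units: Iterable[str]) -> str:
    """Return the longest unit that begins at text[start], or '' if none does."""
    best = ""
    for unit in units:
        if text.startswith(unit, start) and len(unit) > len(best):
            best = unit
    return best


def split_specs(text: str) -> Tuple[List[str], List[str], str]:
    """Cut out specification and packaging specification strings; return the rest."""
    specs, packs, rest = [], [], []
    pos = 0
    for match in NUMBER.finditer(text):
        spec_unit = longest_unit(text, match.end(), SPEC_UNITS)
        pack_unit = longest_unit(text, match.end(), PACK_UNITS)
        unit = spec_unit or pack_unit
        if not unit:
            continue  # the number is not followed by a known unit; leave it in place
        rest.append(text[pos:match.start()])
        (specs if spec_unit else packs).append(match.group() + unit)
        pos = match.end() + len(unit)
    rest.append(text[pos:])
    return specs, packs, "".join(rest)
```

The third return value is the remaining text that would be passed on to step 4.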

5. The method for automatically encoding drug information according to claim 1, wherein the step 4 comprises:

judging whether the rest characters of the preprocessed medicine information character string contain symbols or not;

if the residual characters of the preprocessed medicine information character string contain symbols, matching characters between every two adjacent symbols in the residual characters of the preprocessed medicine information character string with entries in the dictionary set as a whole;

if the characters between two adjacent symbols in the residual characters of the preprocessed medicine information character string are successfully matched with the entries in the dictionary set as a whole, segmenting the characters between the two adjacent symbols to be used as a first type sub-character string;

if the matching of the characters between two adjacent symbols in the residual characters of the preprocessed medicine information character string as a whole with the entries in the dictionary set fails, determining the two adjacent symbols and the characters between the two adjacent symbols as a temporary non-segmentation character string;

if the residual characters of the preprocessed medicine information character string do not contain symbols, matching the residual characters of the preprocessed medicine information character string with entries in the dictionary set by adopting a mechanical word segmentation method;

if all the remaining characters in the preprocessed medicine information character string can be matched with the entry, segmenting the remaining characters of the preprocessed medicine information character string according to the matched entry to serve as a first type sub-character string;

if the residual characters of the preprocessed medicine information character string comprise single characters or a plurality of continuous characters which cannot be matched with the entries, cutting the residual characters of the preprocessed medicine information character string into a whole as a second type sub-character string;

judging whether the temporarily unsingulated character string contains a preset special symbol or not;

if the temporarily unsingulated character string contains a preset special symbol, searching a character model to which the temporarily unsingulated character string belongs, segmenting the temporarily unsingulated character string according to a segmentation rule corresponding to the character model to which the temporarily unsingulated character string belongs, and matching segmented characters with entries in the dictionary set;

if the characters segmented from the temporary non-segmented character string are successfully matched with the entries in the dictionary set, determining the segmented characters as first-type sub-character strings;

if the characters cut out from the temporary non-segmentation character string are unsuccessfully matched with the entries in the dictionary set, determining the cut-out characters as second type sub-character strings;

and if the temporarily unsingulated character string does not contain the preset special symbol, directly determining the temporarily unsingulated character string as a second type substring.

6. The method of claim 5, wherein the mechanical segmentation is a forward maximum matching type, or a reverse maximum matching type, or a least-segmentation type.
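Of the mechanical word segmentation variants named in claim 6, forward maximum matching is the simplest to illustrate. The sketch below is a minimal version under assumptions: the function names and the toy entries are hypothetical, and the rule that one unmatched character turns the whole remainder into a second-type substring follows the symbol-free branch of claim 5.

```python
from typing import List, Set, Tuple


def forward_max_match(text: str, entries: Set[str]) -> Tuple[List[str], bool]:
    """Greedy left-to-right segmentation; returns (pieces, all_matched)."""
    max_len = max(map(len, entries), default=1)
    pieces: List[str] = []
    all_matched = True
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in entries:
                pieces.append(text[i:i + size])
                i += size
                break
        else:
            all_matched = False  # no entry starts at position i
            pieces.append(text[i])
            i += 1
    return pieces, all_matched


def segment(text: str, entries: Set[str]) -> Tuple[str, List[str]]:
    """Label the result as first-type pieces, or one second-type substring."""
    pieces, ok = forward_max_match(text, entries)
    return ("first", pieces) if ok else ("second", [text])
```

Reverse maximum matching follows the same pattern but scans from the end of the string toward the start.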

7. The method of automatically encoding pharmaceutical information according to claim 2, wherein the shielding-type entry attribute is common name large name or hospital preparation.

8. The automatic encoding method of medicine information according to claim 2,

the sub-entries included in the medicine combined information dictionary are respectively: a common name sub-entry, a dosage form sub-entry, a specification sub-entry, a packaging specification sub-entry, a packing material name sub-entry and a manufacturer name sub-entry;

the entry attributes of the sub-entries in the drug combined information dictionary are respectively: common name, dosage form, specification, packaging specification, packing material name and manufacturer name;

the common name sub-entry is a standard generic name or a generic name synonym included in the generic name dictionary;

the dosage form sub-entry is a standard dosage form term or a dosage form synonym included in the dosage form dictionary;

the specification sub-entry is the specification of a medicine published by CFDA;

the packaging specification sub-entry is the packaging specification of a medicine published by CFDA, or the packaging specification given on the official website of the medicine manufacturer or in the medicine specification;

the packing material name sub-entry is a standard packing material name or a packing material name synonym included in the packing material dictionary;

the manufacturer name sub-entry is a standard manufacturer name or a manufacturer name synonym included in the manufacturer dictionary.

9. The automatic encoding method of medicine information according to claim 2,

the sub-entries included in the medicine combined information dictionary are respectively: a common name sub-entry and a route of administration sub-entry;

the entry attributes of the sub-entries in the drug combined information dictionary are respectively as follows: common name, route of administration;

the sub-entry of the administration route is a standard administration route term and a synonym of the administration route, which are included in the administration route dictionary.

10. The automatic encoding method of medicine information according to claim 2,

the drug combination information dictionary includes: a full version medicine combined information dictionary and a simple version medicine combined information dictionary;

the sub-entries included in the full version drug combined information dictionary are respectively: a common name sub-entry, a dosage form sub-entry, a specification sub-entry, a packaging specification sub-entry, a packing material name sub-entry and a manufacturer name sub-entry;

the entry attributes of the sub-entries in the full version drug combined information dictionary are respectively: common name, dosage form, specification, packaging specification, packing material name and manufacturer name;

the manufacturer name sub-entry is a standard manufacturer name or a manufacturer name synonym included in the manufacturer dictionary;

the sub-entries included in the simple version medicine combined information dictionary are respectively: a common name sub-entry and a route of administration sub-entry;

the entry attributes of the sub-entries in the simple version drug combined information dictionary are respectively: common name and route of administration;

the route of administration sub-entry is a standard route of administration term or a route of administration synonym included in the administration route dictionary;

the step 7 comprises the following steps:

step 71, determining the entry attribute of each sub-entry in the full-version drug combined information dictionary as the current target entry attribute, and judging whether the entry matched with the sub-character string has an entry corresponding to the current target entry attribute; if there is an entry corresponding to the current target entry attribute, executing step 72, and if there is no entry corresponding to the current target entry attribute, executing step 73;

step 72, combining a plurality of entries corresponding to the current target entry attribute in the entries matched with the sub-character strings to form an entry combination group, matching the entry combination group with the combined entry in the full-version medicine combined information dictionary, and assigning the combined code of the directly matched combined entry to the medicine information character string if the directly matched combined entry exists;

step 73, determining the entry attribute of each sub-entry in the simple drug combined information dictionary as the current target entry attribute, and judging whether the entry matched with the sub-character string has an entry corresponding to the current target entry attribute; if there is an entry corresponding to the current target entry attribute, executing step 74, and if there is no entry corresponding to the current target entry attribute, ending the processing;

and step 74, combining a plurality of entries corresponding to the current target entry attribute in the entries matched with the sub-character strings to form an entry combination group, matching the entry combination group with the combined entry in the simple version medicine combined information dictionary, and assigning the combined code of the directly matched combined entry to the medicine information character string if the directly matched combined entry exists.
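Steps 71 to 74 describe a lookup in the full version dictionary with a fallback to the simple version. The sketch below is one possible reading, with loud assumptions: it treats "there is an entry corresponding to the current target entry attribute" as requiring every attribute of that dictionary to be present, although the claim text could also be read as requiring only some of them. The attribute labels and the name `assign_code` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical attribute labels for the two dictionary versions.
FULL_ATTRS = ("common_name", "dosage_form", "specification",
              "packaging_specification", "packing_material", "manufacturer")
SIMPLE_ATTRS = ("common_name", "route")


def assign_code(matched: Dict[str, str],
                full_dict: Dict[Tuple[str, ...], str],
                simple_dict: Dict[Tuple[str, ...], str]) -> Optional[str]:
    """matched maps an entry attribute to the matched entry text."""
    # Steps 71-72: use the full version when all of its attributes were matched
    # (assumption: "all" rather than "any").
    if all(attr in matched for attr in FULL_ATTRS):
        return full_dict.get(tuple(matched[attr] for attr in FULL_ATTRS))
    # Steps 73-74: otherwise fall back to the simple version.
    if all(attr in matched for attr in SIMPLE_ATTRS):
        return simple_dict.get(tuple(matched[attr] for attr in SIMPLE_ATTRS))
    return None  # no target attributes present: processing ends
```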

11. The method for automatically encoding pharmaceutical information according to claim 1, wherein, when matching the entry combination group with the combined entries in the drug combined information dictionary in step 7, if there is no directly matching combined entry, the method further comprises:

step 9, respectively determining the entries corresponding to the preset entry attributes as reference entries, and performing preset dimension analysis on the reference entries and sub-entries in the drug combined information dictionary to respectively obtain analysis results of the dimensions of the reference entries and the dimensions of the sub-entries;

step 10, aiming at each reference entry, matching the analysis result of each dimension of the reference entry with the analysis result of each dimension of each sub-entry corresponding to the same entry attribute, searching one or more sub-entries matched with the reference entry, and determining the searched sub-entries as target sub-entries of the reference entry;

step 11, searching a combined entry composed of target sub-entries of all reference entries in the drug combined information dictionary, and determining the searched combined entry as a combined entry matched with the drug information character string in a fuzzy manner;

and step 12, sending the combined entry fuzzy-matched with the medicine information character string to a manual processing platform for manual processing.

12. The method for automatically encoding pharmaceutical information according to claim 11, wherein the parsing result of the reference entry or of the sub-entry includes:

each Chinese character of the reference entry/sub-entry;

the initial consonant of each Chinese character of the reference entry/sub-entry;

the final (vowel part) of each Chinese character of the reference entry/sub-entry;

the first character of the reference entry/sub-entry;

the pinyin of the first character of the reference entry/sub-entry; and

the non-Chinese characters in the reference entry/sub-entry.
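Three of the six dimensions listed in claim 12 can be computed with the standard library alone, as the sketch below shows. The pinyin-based dimensions (initial, final, and pinyin of the first character) are deliberately left out because they need a pinyin lookup table that is not assumed here. The function names, the dictionary keys, and the use of the basic CJK Unified Ideographs range as the test for "Chinese character" are assumptions.

```python
from typing import Dict, List, Union


def is_chinese(ch: str) -> bool:
    # Basic CJK Unified Ideographs block only (a simplifying assumption).
    return "\u4e00" <= ch <= "\u9fff"


def parse_dimensions(entry: str) -> Dict[str, Union[str, List[str]]]:
    """Split an entry into the character-level dimensions of claim 12."""
    return {
        "chars": [ch for ch in entry if is_chinese(ch)],      # each Chinese character
        "first_char": entry[:1],                               # first character
        "non_chinese": "".join(ch for ch in entry if not is_chinese(ch)),
    }
```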

13. The method for automatically encoding drug information according to claim 11, wherein the step 10 comprises:

for each reference entry, calculating the similarity between the reference entry and the sub-entries corresponding to the same entry attribute according to the following formula:

wherein M represents similarity;

t represents the parsing result of each dimension of the reference entry;

q represents a reference entry;

t in q represents each dimension of the reference entry;

getboost() represents the preset weight of each dimension;

and determining one or more sub-entries as target sub-entries of the reference entry according to the similarity between the reference entry and each sub-entry corresponding to the same entry attribute.

14. The method for automatically encoding pharmaceutical information according to claim 13, wherein the step of determining one or more sub-terms as the target sub-terms of the reference term according to the similarity between the reference term and each sub-term corresponding to the same term attribute comprises:

sorting all sub-entries corresponding to the same entry attribute by their similarity to the reference entry, and determining a preset number of top-ranked sub-entries as target sub-entries of the reference entry;

or,

and determining one or more sub-entries corresponding to the same entry attribute and having similarity with the reference entry reaching a preset threshold as target sub-entries of the reference entry.
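The two alternatives of claim 14 (a preset number of top-ranked sub-entries, or all sub-entries reaching a preset threshold) can be sketched as below. The function name `select_target_sub_entries`, its parameters, and the similarity scores are hypothetical and chosen only for illustration; the similarity values are assumed to have been computed already.

```python
from typing import Dict, List, Optional


def select_target_sub_entries(
    similarities: Dict[str, float],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Pick target sub-entries by rank (top_n) or by score (threshold)."""
    if (top_n is None) == (threshold is None):
        raise ValueError("specify exactly one of top_n or threshold")
    # Sort sub-entries by similarity to the reference entry, highest first.
    ranked = sorted(similarities, key=lambda s: similarities[s], reverse=True)
    if top_n is not None:
        return ranked[:top_n]
    # "Reaching" the threshold is read here as greater than or equal to it.
    return [s for s in ranked if similarities[s] >= threshold]


# Made-up similarity scores for sub-entries sharing one entry attribute.
scores = {"阿莫西林": 0.92, "阿莫西林钠": 0.81, "阿奇霉素": 0.40}
by_rank = select_target_sub_entries(scores, top_n=2)
by_threshold = select_target_sub_entries(scores, threshold=0.9)
```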

15. The method for automatically encoding drug information of claim 13, wherein the step 11 comprises:

searching the drug combined information dictionary for combined entries that include the target sub-entries of the reference entries, and determining each combined entry found as a candidate combined entry;

determining the similarity between each target sub-entry in the candidate combined entry and the corresponding reference entry;

calculating a weighted average of the similarities between each target sub-entry in the candidate combined entry and the corresponding reference entry, wherein the weight applied to each similarity is equal to the preset weight of the entry attribute of the target sub-entry;

and selecting, according to the magnitude of the weighted average of each candidate combined entry, one or more of the candidate combined entries as the combined entries matched with the entry combination group.

16. The method for automatically encoding medicine information according to claim 15, wherein the step of selecting one or more of the candidate combined entries as the combined entries matched with the entry combination group according to the magnitude of the weighted average of each candidate combined entry comprises:

sorting all the candidate combined entries by their weighted averages, and selecting a preset number of top-ranked candidate combined entries as the combined entries matched with the entry combination group;

or,

selecting, from all the candidate combined entries, one or more candidate combined entries whose weighted average is larger than a preset threshold as the combined entries matched with the entry combination group.
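The weighted averaging of claim 15 and the selection of claim 16 can be sketched together as follows. This is a minimal sketch under stated assumptions: the attribute names (`generic_name`, `dosage_form`, `specification`), the weights, the per-attribute similarities, and the function names are all hypothetical and are not taken from the claims.

```python
from typing import Dict, List, Optional, Tuple


def weighted_average(sims: Dict[str, float], weights: Dict[str, float]) -> float:
    """Average per-attribute similarities using preset attribute weights."""
    total_weight = sum(weights[attr] for attr in sims)
    if total_weight == 0:
        raise ValueError("attribute weights must not sum to zero")
    return sum(sims[attr] * weights[attr] for attr in sims) / total_weight


def rank_candidates(
    candidates: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Select candidates by rank (top_n) or by weighted average (threshold)."""
    if (top_n is None) == (threshold is None):
        raise ValueError("specify exactly one of top_n or threshold")
    scored = [(name, weighted_average(s, weights)) for name, s in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if top_n is not None:
        return scored[:top_n]
    # Claim 16 asks for a weighted average larger than the threshold.
    return [pair for pair in scored if pair[1] > threshold]


# Hypothetical attribute weights and per-attribute similarities.
weights = {"generic_name": 0.5, "dosage_form": 0.3, "specification": 0.2}
candidates = {
    "candidate_A": {"generic_name": 1.0, "dosage_form": 0.5, "specification": 0.5},
    "candidate_B": {"generic_name": 0.4, "dosage_form": 1.0, "specification": 1.0},
}
best = rank_candidates(candidates, weights, top_n=1)
above = rank_candidates(candidates, weights, threshold=0.72)
```

With these made-up numbers, candidate_A averages about 0.75 and candidate_B about 0.70, so both selection modes keep only candidate_A.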

17. The method for automatically encoding pharmaceutical information according to claim 13, wherein said step 10 further comprises calculating the total confidence of the reference entry and its target sub-entries; wherein, the calculation process of the total confidence coefficient comprises the following steps:

determining each Chinese character in the reference entry;

calculating the cosine confidence coefficient of the reference entry and the target sub-entry thereof according to the following formula:

wherein N represents a cosine confidence;

q represents a reference entry;

d' represents a target sub-entry of the reference entry;

j represents the sequence number of the Chinese character contained in the reference entry and the target sub-entry;

calculating the total confidence of the reference entry and the target sub-entries thereof according to the following formula:

S = M × a + N × b

wherein S represents the total confidence;

a represents a preset weight corresponding to the similarity M; and

b represents a preset weight corresponding to the cosine confidence N.
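As a hedged illustration of the total confidence S = M × a + N × b, the sketch below combines a precomputed similarity M with a cosine confidence N. The cosine formula itself does not survive in this text, so the character-count cosine used here is an assumption (a standard cosine similarity over per-character counts), not the formula of the claim; it also counts every character of the two strings rather than only Chinese characters. The function names and the weights a and b are likewise hypothetical.

```python
import math
from collections import Counter


def cosine_confidence(q: str, d: str) -> float:
    """Cosine similarity over character counts (assumed stand-in for N)."""
    qv, dv = Counter(q), Counter(d)
    dot = sum(qv[ch] * dv[ch] for ch in qv)
    norm = math.sqrt(sum(v * v for v in qv.values())) * math.sqrt(
        sum(v * v for v in dv.values())
    )
    return dot / norm if norm else 0.0


def total_confidence(m: float, n: float, a: float, b: float) -> float:
    """Total confidence S = M * a + N * b, as stated in claim 17."""
    return m * a + n * b


# Reference entry and one of its target sub-entries (example strings).
n_value = cosine_confidence("阿莫西林", "阿莫西林钠")
# M = 0.9 and the weights a = 0.6, b = 0.4 are made-up illustrative values.
s_value = total_confidence(0.9, n_value, a=0.6, b=0.4)
```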

18. The method for automatically encoding drug information of claim 17, wherein the step 11 comprises:

determining the total confidence between each target sub-entry in the candidate combined entry and the corresponding reference entry;

calculating a weighted average of the total confidences between each target sub-entry in the candidate combined entry and the corresponding reference entry, wherein the weight applied to each total confidence is equal to the preset weight of the entry attribute of the target sub-entry;

19. The method for automatically encoding medicine information according to claim 18, wherein the step of selecting one or more of the candidate combined entries as the combined entries matched with the entry combination group according to the magnitude of the weighted average of each candidate combined entry comprises:

or,

20. An automated coding system for drug information, comprising:

the input module is used for inputting a medicine information character string;

21. The system for automatically encoding pharmaceutical information according to claim 20, wherein,

the standard universal name is a Chinese medicine universal name;

22. The system for automatically encoding drug information of claim 20, wherein the system further comprises: a parsing module, a matching module, a searching module and a sending module;

the third judgment processing module is further configured to trigger the parsing module if, when the entry combination group is matched against the combined entries in the drug combined information dictionary, no directly matching combined entry exists;

the parsing module is used for determining each entry corresponding to a preset entry attribute as a reference entry, and parsing each reference entry and each sub-entry in the drug combined information dictionary along preset dimensions, to obtain a parsing result of each dimension of each reference entry and a parsing result of each dimension of each sub-entry;

the matching module is used for matching the parsing result of each dimension of the reference entry against the parsing result of each dimension of each sub-entry corresponding to the same entry attribute, searching for one or more sub-entries matching the reference entry, and determining the sub-entries found as target sub-entries of the reference entry;

the searching module is used for searching the drug combined information dictionary for a combined entry consisting of target sub-entries of all the reference entries, and determining the combined entry found as a combined entry fuzzy-matched with the medicine information character string;

and the sending module is used for sending the combined entry fuzzy-matched with the medicine information character string to a manual processing platform for manual processing.