CN114330309A - Term processing method, device, equipment, storage medium and program product

Info

Publication number: CN114330309A
Application number: CN202111666306.7A
Authority: CN (China)
Prior art keywords: standard, word, term, term operation, operation tree
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张子恒, 李文琪, 郑冶枫
Assignee (current and original): Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111666306.7A
Publication of CN114330309A
Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a term processing method, apparatus, device, and computer-readable storage medium. The method includes: acquiring an input word of a specific field; constructing a first term operation tree corresponding to the input word; performing recall processing on a term standard table of the specific field based on the first term operation tree and a second term operation tree pre-constructed for each standard word in the term standard table, to obtain a plurality of candidate standard words corresponding to the input word; determining the tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word, sorting the candidate standard words in descending order of tree similarity, and determining the candidate standard words at the head of the descending-order result as standard words to be examined; and screening the composition of each standard word to be examined, and determining the standard words to be examined that satisfy the rationality index as the standard words matching the input word. The method and the apparatus can effectively improve the accuracy of standard word matching.

Description

Term processing method, device, equipment, storage medium and program product
Technical Field
The present application relates to artificial intelligence technologies, and in particular, to a method and an apparatus for processing terms, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Term standardization is an important application of artificial intelligence in natural language processing and is used in fields such as medicine and recommendation.
In the related art, because the input words to be processed are expressed with great randomness, the accuracy of the standard words determined to match the input words is low, and the related art offers no effective solution for improving the accuracy of standard word matching in the term standardization process.
Disclosure of Invention
The embodiment of the application provides a method and a device for processing a term, an electronic device, a computer readable storage medium and a computer program product, which can effectively improve the accuracy of matching a standard word.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for processing terms, which comprises the following steps:
acquiring input words of a specific field;
constructing a first term operation tree corresponding to the input word;
recalling the term standard table based on the first term operation tree and a second term operation tree which is constructed in advance for each standard word in the term standard table of the specific field to obtain a plurality of candidate standard words corresponding to the input word;
determining the tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word, sorting the plurality of candidate standard words in descending order of tree similarity, and determining the candidate standard words at the head of the descending-order result as standard words to be examined;
and screening the composition of each standard word to be examined, and determining the standard words to be examined that satisfy the rationality index as the standard words matching the input word.
An embodiment of the present application provides a term processing apparatus, including:
the acquisition module is used for acquiring input words in a specific field;
the building module is used for building a first term operation tree corresponding to the input word;
a first recall module, configured to perform recall processing on the term standard table based on the first term operation tree and a second term operation tree pre-constructed for each standard word in the term standard table in the specific field, so as to obtain a plurality of candidate standard words corresponding to the input word;
a first determining module, configured to determine the tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word, sort the plurality of candidate standard words in descending order of tree similarity, and determine the candidate standard words at the head of the descending-order result as standard words to be examined;
and a second determining module, configured to screen the composition of each standard word to be examined and determine the standard words to be examined that satisfy the rationality index as the standard words matching the input word.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the term processing method provided by the embodiment of the application when executing the executable instructions stored in the memory.
Embodiments of the present application provide a computer-readable storage medium, which stores executable instructions for causing a processor to implement a term processing method provided by embodiments of the present application when the processor executes the executable instructions.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the term processing method described above in the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
The term standard table is recalled based on the first term operation tree corresponding to the input word and the second term operation tree corresponding to each standard word, so as to obtain a plurality of candidate standard words corresponding to the input word, thereby preliminarily determining the range of standard words that may match the input word. The standard words to be examined are then determined among the plurality of candidate standard words according to the tree similarity between the first term operation tree and the second term operation tree, further narrowing that range. Finally, the standard word matching the input word is accurately determined through the screening processing. In this way, the range of standard words is gradually narrowed through multiple layers of screening, so that the standard word matching the input word is determined accurately; moreover, because the term operation tree can adapt to the disorder and randomness with which input words are expressed in different scenarios, the accuracy of standard word matching can be effectively improved.
Drawings
FIG. 1 is a block diagram of a processing system architecture provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a term processing device provided in an embodiment of the present application;
fig. 3A to fig. 3E are schematic flow charts of a term processing method provided in an embodiment of the present application;
FIGS. 4A-4E are schematic diagrams of a term processing method provided by an embodiment of the present application;
fig. 5A to 5G are schematic diagrams of a term processing method provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first \ second \ third" are used only to distinguish similar objects and do not denote a particular order; it should be understood that "first \ second \ third" may be interchanged in a specific order or sequence, where permissible, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) Term standardization: an indispensable task in medical statistics. In the clinic, there are often hundreds of different ways of expressing the same diagnosis, and the problem that term standardization solves is to find the corresponding standard medical expression for these various clinical expressions.
2) International Classification of Diseases (ICD): a system in which the World Health Organization (WHO) classifies diseases according to certain characteristics and rules and expresses them by codes. It is the basis for determining global health trends and statistics, and contains about 55,000 unique codes related to injuries, diseases and causes of death, enabling health practitioners worldwide to exchange health information through a common language.
3) Short text matching task: the task is to predict semantic relevance of two short texts by using a Natural Language Processing (NLP) model, and generally performs matching by adopting a distance measurement mode in a vector space.
4) Operation tree (calculation tree): a data structure; a set with a hierarchical relationship composed of n (n is greater than or equal to 1) finite nodes. Each node in the operation tree has zero or more child nodes; the node without a parent node is called the root node, and every node other than the root node has one and only one parent node. Apart from the root node, the remaining nodes can be divided into a plurality of disjoint subtrees. (A minimal sketch of such a structure is given after this glossary.)
5) Standard word: a word stored in the standard table; a standard expression defined for describing a certain thing, i.e. a unified, normative expression for a repetitive matter, issued in a given form on the basis of science, technology and practical experience, and serving as a commonly observed basis and criterion. For example, in a medical scenario, a standard word may be a standard expression of a medical term concerning a disease, injury, medication, etc., e.g. an expression of the unified specification issued by the World Health Organization (WHO) in the International Classification of Diseases for such medical terms.
6) Non-standard word: a word not yet stored in the standard table, i.e. an expression that is for the time being not unified. For example, in a medical scenario, a non-standard word may be a non-standardized expression of a medical term concerning a disease, injury, medication, etc., such as a spoken expression of such a term used by a doctor, a patient, or the like.
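To make the operation tree of item 4) above concrete, the following is a minimal, purely illustrative sketch in Python; the class name OperationTreeNode and its fields are assumptions introduced here for readability and do not come from the original filing. Later sketches in this description reuse this node structure.

class OperationTreeNode:
    """One node of a term operation tree; each node carries a single term component."""
    def __init__(self, component, children=None):
        self.component = component      # the term component held by this node
        self.children = children or []  # zero or more child nodes

# Example hierarchy: a root node with two disjoint subtrees.
example_tree = OperationTreeNode("with", [
    OperationTreeNode("hemorrhage", [OperationTreeNode("chronic")]),
    OperationTreeNode("tumor", [OperationTreeNode("peritoneum")]),
])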
In the implementation process of the embodiment of the present application, the applicant finds that the following problems exist in the related art:
in the related art, input words are generally standardized on the basis of machine learning model algorithms, but the premise that the result must satisfy rationality is ignored; the modeling of input words and standard words by these machine learning models is too simple, so the related art cannot accurately distinguish similar concepts. The related art also fails to address the mismatching caused by the disordered word order of input words; for example, the input word "chronic peritoneal hemorrhage with tumor" should first be properly decomposed into "chronic peritoneal hemorrhage with peritoneal tumor" before the standardization processing of the machine learning model is performed. The term operation tree in the embodiments of the application is flexible enough to adapt to the disorder and randomness with which input words are expressed in different scenarios, so the accuracy of standard word matching can be effectively improved.
Medical term standardization is an important technical capability in the process of medical informatization and an important foundation for medical artificial intelligence. It aims to map non-standard or non-normalized diagnostic expressions to the standardized or normalized diagnostic expressions of a medical standards body. Most related term standardization technologies adopt machine learning or deep learning model algorithms and simply and crudely treat the standardization task as a short text matching task, while the medical meaning in the task is ignored; such technical schemes lack medical rationality, and the explanations they give are mostly black boxes that cannot be accepted by doctors or related practitioners.
By fully analyzing and understanding the term standardization task, the term processing method provided by the embodiments of the application decouples the medical part from the model algorithm part, and at the same time realizes the interaction between the two through a specially designed data structure, the term operation tree (namely the operation tree described above), so that the medical part and the model algorithm part each perform their own function, which further improves the performance and interpretability of the whole engine.
Embodiments of the present application provide a term processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can effectively improve the accuracy of standard word matching. An exemplary application of the electronic device provided in the embodiments of the present application is described below.
Referring to fig. 1, fig. 1 is an architecture diagram of a term processing system 100 provided in an embodiment of the present application, in order to implement an application scenario of term processing (for example, in an application scenario of medical informatization, standardizing medical terms, in an application scenario of educational informatization, standardizing professional terms, and in an application scenario of network search, standardizing search keywords), a terminal (terminal 400 is exemplarily shown) is connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
The terminal 400 is used by a user through the client 410 and displays results on a graphical interface 410-1 (shown illustratively). The terminal 400 and the server 200 are connected to each other through a wired or wireless network.
In some embodiments, the server 200 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal 400 may be a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, etc., but is not limited thereto. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.
In some embodiments, the client of the terminal 400 receives the input word and transmits the input word to the server 200 through the network 300, and the server 200 determines the standard word matching the input word and transmits the matching standard word to the graphical interface 410-1 in the terminal 400 for display.
In some embodiments, the client of the terminal 400 receives the input word, determines a standard word matching the input word, and displays the standard word matching the input word in the graphical interface 410-1 in the terminal 400.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 200 of the term processing method according to an embodiment of the present application, where the server 200 shown in fig. 2 includes: at least one processor 210, memory 250, at least one network interface 220. The various components in server 200 are coupled together by a bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 240 in fig. 2.
The Processor 210 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210.
The memory 250 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
The operating system 251, which includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., is used for implementing various basic services and for processing hardware-based tasks.
A network communication module 252 for communicating to other electronic devices via one or more (wired or wireless) network interfaces 220, exemplary network interfaces 220 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.
In some embodiments, the term processing device provided by the embodiments of the present application may be implemented in software, and fig. 2 shows the term processing device 255 stored in the memory 250, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: obtaining module 2551, constructing module 2552, first recalling module 2553, first determining module 2554, second determining module 2555, which are logical and therefore can be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
In some embodiments, a terminal or a server may implement the term processing method provided by the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; it may be a local (Native) Application program (APP), i.e. a program that needs to be installed in an operating system to be run, such as a medical informatization APP.
In other embodiments, the term processing Device provided in this Application may be implemented in hardware, and for example, the term processing Device provided in this Application may be a processor in the form of a hardware decoding processor, which is programmed to execute the term processing method provided in this Application, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
It is understood that, in the embodiments of the present application, data related to input words and the like need to be obtained with user permission or consent when the embodiments of the present application are applied to specific products or technologies, and collection, use and processing of the related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
The term processing method provided by the embodiments of the present application will be described in conjunction with exemplary applications and implementations of the term processing server provided by the embodiments of the present application.
In some embodiments, fig. 4A is a schematic diagram of a term processing method provided by embodiments of the present application. Referring to fig. 4A, based on the first term operation tree corresponding to the input word and the second term operation tree pre-constructed for each standard word in the term standard table in the specific field, the standard words in the term standard table are recalled, sorted, and screened to obtain the standard words matched with the input word, so that the accuracy of matching the standard words can be effectively improved.
Referring to fig. 3A, fig. 3A is a schematic flowchart of a term processing method provided in an embodiment of the present application, and will be described with reference to steps 101 to 105 shown in fig. 3A, where an execution subject of the following steps 101 to 105 may be the aforementioned server or terminal.
In step 101, a domain-specific input word is obtained.
As an example, the specific field may be the medical field, the educational field, the recommendation field, and the like. Taking the medical field as an example, the obtained input word of the medical field may be a spoken expression used by a doctor, a patient, or the like for a medical term concerning a disease, injury, medication, etc. Taking the recommendation field as an example, the obtained input word may be a spoken expression of some standard word.
In step 102, a first term operation tree corresponding to the input word is constructed.
In some embodiments, constructing the first term operation tree corresponding to the input word may be implemented by: splitting and coding the input word to obtain the components of the input word; wherein the components include a modification component, a site component, a root component, and a logic component. And constructing a first term operation tree corresponding to the input word based on the modification component, the part component, the root component and the logic component.
By way of example, referring to fig. 4B, fig. 4B is a schematic diagram of a term processing method provided by an embodiment of the present application. When the input word is "progressive diabetic chronic hemorrhage / with / peritoneal tumor", the first term operation tree corresponding to the input word may be constructed as shown in fig. 4B, where the root node of the first term operation tree is "with", and the leaf nodes of the first term operation tree are "diabetic", "progressive", "chronic" and "peritoneum"; the parent node of the leaf nodes "diabetic", "progressive" and "chronic" is "hemorrhage", and the parent node of the leaf node "peritoneum" is "tumor".
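As a hedged illustration of step 102 (not the literal implementation of the embodiment), the tree of fig. 4B could be assembled from the components obtained by splitting and encoding the input word. The helper build_term_operation_tree and the grouping of components shown below are assumptions; an upstream component recognizer is assumed to have already labelled each component as a logic, root, site or modification component, and OperationTreeNode is the node structure sketched after the glossary above.

def build_term_operation_tree(logic_component, groups):
    """logic_component: text of the logical component (becomes the root node).
    groups: list of (root_component, [site/modification components]) pairs;
    each root component becomes an intermediate node with its leaves below it."""
    root = OperationTreeNode(logic_component)
    for root_component, leaves in groups:
        intermediate = OperationTreeNode(root_component,
                                         [OperationTreeNode(leaf) for leaf in leaves])
        root.children.append(intermediate)
    return root

# Hypothetical usage for the input word of fig. 4B.
first_tree = build_term_operation_tree(
    "with",
    [("hemorrhage", ["progressive", "diabetic", "chronic"]),
     ("tumor", ["peritoneum"])])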
In step 103, a recall process is performed on the term standard table based on the first term operation tree and a second term operation tree pre-constructed for each standard word in the term standard table of the specific field, so as to obtain a plurality of candidate standard words corresponding to the input word.
As an example, in the medical field, the term standard table may be an international disease classification, which is a system in which the world health organization classifies diseases according to certain characteristics of the diseases according to rules and expresses the diseases in a coded manner. Referring to table 1 below, table 1 below is a partial standard list of terms provided in the examples of the present application.
Table 1 Partial term standard table
[Table 1 is provided as an image in the original publication and is not reproduced here; it maps codes to standard words, e.g. code B45 to the standard word "cryptococcosis" and code B45.0 to "pulmonary cryptococcosis".]
As an example, based on the first term operation tree and the second term operation tree pre-constructed for each standard word in the domain-specific term standard table, the term standard table in table 1 above is recalled to obtain a plurality of candidate standard words corresponding to the input word, for example, when the input word is "cryptococcus neoformans encephalitis", the corresponding candidate standard words may be "cryptococcosis cerebri", "cryptococcal meningitis neoformans encephalitis".
As an example, referring to fig. 4C, the second term operation tree corresponding to the standard word "cryptococcus neoformans meningitis" in the term standard table may be as shown in fig. 4C; the root node of the second term operation tree is "meningitis", and the child nodes of the root node are "neoformans" and "cryptococcus".
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart illustrating a term processing method provided in an embodiment of the present application, and step 103 illustrated in fig. 3B may be implemented through steps 1031 to 1032, which are described below respectively.
In step 1031, node indexes between the first term operation tree and a second term operation tree pre-constructed for each standard word in the domain-specific term standard table are determined.
The node index characterizes the consistency between the nodes of the first term operation tree and the nodes of the second term operation tree, and each node corresponds to one term component.
For example, referring to fig. 4D, fig. 4D is a schematic diagram of a term processing method provided by an embodiment of the present application. A node index between the term component "companion" of the root node of the first term operation tree and the term component "meningitis" of the root node of the second term operation tree, which is pre-constructed for each standard word in the domain-specific term standard table, is determined.
In some embodiments, referring to fig. 3C, fig. 3C is a schematic flowchart of a term processing method provided in an embodiment of the present application, and step 1031 illustrated in fig. 3C may be implemented by performing steps 10311 to step 10313 on a second term operation tree corresponding to any one standard word in a term standard table in a specific field, which is described below.
In step 10311, a root node index is determined from the root node of the first term operation tree and the root node of the second term operation tree.
In some embodiments, referring to fig. 4D, the root node indicator is determined from the root node "companion" of the first term operation tree and the root node "meningitis" of the second term operation tree. The root node indicator characterizes a correspondence between a root node of the first term operation tree and a root node of the second term operation tree.
In step 10312, a leaf node index is determined according to the position leaf node of the first term operation tree and the position leaf node of the second term operation tree.
The leaf nodes comprise part leaf nodes and modified leaf nodes, the part leaf nodes correspond to part components, the modified leaf nodes correspond to modified components, the part leaf nodes of the first term operation tree correspond to the part components of the input words, and the part leaf nodes of the second term operation tree correspond to the part components of the standard words.
In some embodiments, referring to fig. 4D, the leaf node indicator is determined from the first term operation tree's part leaf node "peritoneum" and the second term operation tree's part leaf node "cryptococcus". The leaf node indicator characterizes a correspondence between leaf nodes of the first term operation tree and leaf nodes of the second term operation tree.
In step 10313, the root node index and the leaf node index are determined as node indexes between the first term operation tree and the second term operation tree.
In this way, the root node index and the leaf node index are determined as the node indexes between the first term operation tree and the second term operation tree, so that the determined node indexes represent the consistency between the leaf nodes of the first term operation tree and the leaf nodes of the second term operation tree and the consistency between the root node of the first term operation tree and the root node of the second term operation tree, and the determined node indexes can accurately reflect the consistency between the nodes of the first term operation tree and the nodes of the second term operation tree.
In step 1032, the standard words corresponding to the node indexes meeting the recall condition are determined as a plurality of candidate standard words corresponding to the input words.
The recall condition comprises that partial nodes of the first term operation tree and the second term operation tree are the same, and the recall condition comprises a root node index recall condition and a leaf node index recall condition.
As an example, referring to table 1 above, when the node index of the standard word "cryptococcosis" satisfies the recall condition and the standard word "cryptococcosis pulmonary" does not satisfy the recall condition, the standard word "cryptococcosis" is determined as a candidate standard word corresponding to the input word.
In some embodiments, the above step 1031 may further perform the following processing for each standard word in the term standard table: and when the root node index corresponding to the standard word meets the root node index recall condition and the leaf node index corresponding to the standard word meets the leaf node index recall condition, determining the standard word as a candidate standard word corresponding to the input word.
The root node index recall condition comprises that a root node of the first term operation tree is the same as a root node of the second term operation tree, and the leaf node index recall condition comprises that a leaf node at the position of the first term operation tree is the same as a leaf node at the position of the second term operation tree.
As an example, when the root node indicator corresponding to the standard word "cryptococcosis" characterizes that the root node of the first term operation tree is the same as the root node of the second term operation tree corresponding to the standard word "cryptococcosis", the root node indicator corresponding to the standard word "cryptococcosis" satisfies the root node indicator recall condition. When the leaf node index of the part corresponding to the standard word cryptococcosis represents that the leaf node of the part of the first term operation tree is the same as the leaf node of the part of the second term operation tree corresponding to the standard word cryptococcosis, the leaf node index corresponding to the standard word cryptococcosis meets the leaf node index recall condition. At this time, if the root node index corresponding to the standard word "cryptococcosis" satisfies the root node index recall condition and the leaf node index corresponding to the standard word "cryptococcosis" satisfies the leaf node index recall condition, the standard word "cryptococcosis" is determined as a candidate standard word corresponding to the input word.
In this way, whether a standard word is determined as a candidate standard word corresponding to the input word is decided by judging whether the root node index corresponding to the standard word satisfies the root node index recall condition and whether the leaf node index corresponding to the standard word satisfies the leaf node index recall condition; standard words are thus recalled and screened from both the leaf node and the root node aspects, which significantly improves the accuracy of the obtained candidate standard words.
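The recall conditions of steps 1031 to 1032 can be sketched as follows. This is an assumption-laden illustration rather than the exact rule of the embodiment: standard_word_trees is a hypothetical mapping from each standard word to its pre-constructed second term operation tree, first_tree is the tree built for the input word in the earlier sketch, and the leaf comparison below does not distinguish site leaves from modification leaves.

def leaf_components(tree):
    """Collect the components of all leaf nodes under the intermediate nodes."""
    return {leaf.component for intermediate in tree.children for leaf in intermediate.children}

def satisfies_recall_condition(first_tree, second_tree):
    """Root node index recall condition: the root nodes carry the same component.
    Leaf node index recall condition: the trees share at least one leaf component
    (a simplification; the original compares site leaf nodes specifically)."""
    same_root = first_tree.component == second_tree.component
    shared_leaf = bool(leaf_components(first_tree) & leaf_components(second_tree))
    return same_root and shared_leaf

# Hypothetical mapping from each standard word to its pre-constructed second term operation tree.
standard_word_trees = {"chronic hemorrhage with peritoneal tumor": build_term_operation_tree(
    "with", [("hemorrhage", ["chronic"]), ("tumor", ["peritoneum"])])}

candidate_standard_words = [word for word, tree in standard_word_trees.items()
                            if satisfies_recall_condition(first_tree, tree)]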
In step 104, the tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word is determined, the candidate standard words are sorted in descending order of tree similarity, and the candidate standard words at the head of the descending-order result are determined as the standard words to be examined.
In some embodiments, the tree similarity characterizes the degree of similarity between the first term operation tree and the second term operation tree. The descending sorting is performed from the largest to the smallest numerical value of tree similarity. The candidate standard words at the head of the descending-order result may be, for example, the first and second candidate standard words in the descending-order result, and so on.
In some embodiments, referring to fig. 3B, fig. 3B is a flowchart illustrating a term processing method provided in an embodiment of the present application, and fig. 3B illustrates that, in step 104, the tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word is determined, and steps 1041 to 1044 may be performed on the second term operation tree corresponding to each candidate standard word, which is described below.
In step 1041, a first edit distance between a root node of the first term operation tree and a root node of the second term operation tree is determined.
In some embodiments, the first edit distance characterizes a minimum number of single character edit operations required to transition from a root node of the first term operation tree to a root node of the second term operation tree.
As an example, referring to fig. 4E, a first edit distance dist (C1, C2) between the root node C2 of the first term operation tree and the root node C1 of the second term operation tree may be determined.
In step 1042, a second edit distance is determined between each intermediate node of the first term operation tree and each intermediate node of the second term operation tree.
Wherein the intermediate node is a child node of the root node.
In some embodiments, the second edit distance characterizes a minimum number of single character edit operations required to transition from an intermediate node of the first term operation tree to a co-located intermediate node of the second term operation tree.
By way of example, referring to FIG. 4E, in the second term operation tree, intermediate node R1 and intermediate node R2 are each children of the root node C1. A second edit distance dist (R1, R3) between the intermediate node R3 of the first term operation tree and the intermediate node R1 of the second term operation tree may be determined. A second edit distance dist (R2, R4) between the intermediate node R4 of the first term operation tree and the intermediate node R2 of the second term operation tree may be determined.
In step 1043, a third edit distance between a leaf node of the first term operation tree and a leaf node of the second term operation tree is determined.
Wherein the leaf nodes are children of the intermediate node.
In some embodiments, the third edit distance characterizes a minimum number of single character edit operations required to transition from a leaf node of the first term operation tree to a co-located leaf node of the second term operation tree.
By way of example, referring to FIG. 4E, in the second term operation tree, leaf node A1 and leaf node A2 are both children of the intermediate node R1. A third edit distance dist (a1, A3) between the leaf node A3 of the first term operation tree and the leaf node a1 of the second term operation tree may be determined. A third edit distance dist (a2, a4) between the leaf node a4 of the first term operation tree and the leaf node a2 of the second term operation tree may be determined.
In step 1044, a tree similarity between the first term operation tree and a second term operation tree corresponding to the candidate standard word is determined based on the first edit distance, the second edit distance, and the third edit distance.
In some embodiments, the determining the tree similarity between the first term operation tree and the second term operation tree corresponding to the candidate standard word based on the first edit distance, the second edit distance and the third edit distance in the above step 1044 may be implemented as follows: carrying out parameterization processing on the first editing distance to obtain a first tree similarity corresponding to the first editing distance; carrying out parameterization processing on the second editing distance to obtain a second tree similarity corresponding to the second editing distance; carrying out parameterization processing on the second editing distance and the third editing distance to obtain a third tree similarity corresponding to the third editing distance; and adding the first tree similarity, the second tree similarity and the third tree similarity to obtain the tree similarity between the first term operation tree and the second term operation tree corresponding to the candidate standard word.
As an example, parameterizing the first edit distance, and obtaining an expression of the first tree similarity corresponding to the first edit distance may be:
SIM1(X, Y) = α(1 - dist(C_X, C_Y))    (1)
wherein dist(C_X, C_Y) characterizes the first edit distance, SIM1(X, Y) characterizes the first tree similarity, and α is a hyper-parameter that adjusts the first edit distance.
As an example, the second edit distance is parameterized, and the obtained expression of the second tree similarity corresponding to the second edit distance may be:
[Formula (2), given as an image in the original publication, expresses the second tree similarity SIM2(X, Y) in terms of the second edit distances dist(R_Xi, R_Yj) and the hyper-parameter β.]
wherein dist(R_Xi, R_Yj) characterizes the second edit distance, β is a hyper-parameter that adjusts the second edit distance, and SIM2(X, Y) characterizes the second tree similarity.
As an example, the second edit distance and the third edit distance are parameterized, and the obtained expression of the third tree similarity corresponding to the third edit distance may be:
[Formula (3), given as an image in the original publication, expresses the third tree similarity SIM3(X, Y) in terms of the second edit distances dist(R_Xi, R_Yj), the third edit distances dist(A_Xi, A_Yj) and dist(B_Xi, B_Yj), and the hyper-parameter γ.]
wherein SIM3(X, Y) characterizes the third tree similarity, dist(R_Xi, R_Yj) characterizes the second edit distance, dist(A_Xi, A_Yj) and dist(B_Xi, B_Yj) characterize the third edit distances, and γ is a hyper-parameter that adjusts the third edit distance.
As an example, the first tree similarity, the second tree similarity, and the third tree similarity are summed, and the expression of the tree similarity between the first term operation tree and the second term operation tree corresponding to the candidate standard word may be:
SIM(X, Y) = SIM1(X, Y) + SIM2(X, Y) + SIM3(X, Y)    (4)
wherein dist(C_X, C_Y) characterizes the first edit distance, dist(R_Xi, R_Yj) characterizes the second edit distance, dist(A_Xi, A_Yj) and dist(B_Xi, B_Yj) characterize the third edit distances, SIM(X, Y) characterizes the similarity between the first term operation tree X and the second term operation tree Y, α is the hyper-parameter that adjusts the first edit distance, β is the hyper-parameter that adjusts the second edit distance, γ is the hyper-parameter that adjusts the third edit distance, SIM1(X, Y) characterizes the first tree similarity, SIM2(X, Y) characterizes the second tree similarity, and SIM3(X, Y) characterizes the third tree similarity.
In this way, the tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word is determined based on the first edit distance, the second edit distance and the third edit distance, so that the similarity between the two trees can be judged accurately and the standard words to be examined can be accurately determined among the candidate standard words.
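Steps 1041 to 1044 can be sketched as below. Because formulas (2) and (3) are only reproduced as images, the normalisation inside the second and third terms is an assumption (a simple average over co-located node pairs); dist is assumed to be a character-level edit distance normalised to [0, 1], as sketched after step 1091 below.

def tree_similarity(tree_x, tree_y, alpha=1.0, beta=1.0, gamma=1.0):
    """Tree similarity SIM(X, Y) = SIM1 + SIM2 + SIM3 (formula (4))."""
    # SIM1: root nodes, formula (1)
    sim1 = alpha * (1 - dist(tree_x.component, tree_y.component))
    # SIM2: co-located intermediate nodes (assumed averaged form)
    mid_pairs = list(zip(tree_x.children, tree_y.children))
    sim2 = beta * sum(1 - dist(rx.component, ry.component)
                      for rx, ry in mid_pairs) / max(len(mid_pairs), 1)
    # SIM3: co-located leaf nodes under those intermediate nodes (assumed averaged form)
    leaf_pairs = [(ax, ay) for rx, ry in mid_pairs
                  for ax, ay in zip(rx.children, ry.children)]
    sim3 = gamma * sum(1 - dist(ax.component, ay.component)
                       for ax, ay in leaf_pairs) / max(len(leaf_pairs), 1)
    return sim1 + sim2 + sim3

# Descending sort of the candidates; the head candidates become the standard words to be examined.
ranked = sorted(candidate_standard_words, reverse=True,
                key=lambda word: tree_similarity(first_tree, standard_word_trees[word]))
standard_words_to_examine = ranked[:10]   # cut-off for the head of the result is assumed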
In step 105, the composition of each standard word to be examined is screened, and the standard words to be examined that satisfy the rationality index are determined as the standard words matching the input word.
As an example, suppose the standard words to be examined are "male asymptomatic uterine cancer" and "symptomatic uterine cancer". The composition of the standard word to be examined "male asymptomatic uterine cancer" may be "male", "asymptomatic", "uterine cancer", and the composition of "symptomatic uterine cancer" may be "symptomatic", "uterine cancer"; these compositions are subjected to screening processing. Specifically, the screening processing may include a negative-word check, a medical rationality check, and the like. The negative-word check deletes a standard word to be examined when a negative word is detected in its composition; for example, since "asymptomatic" expresses "without symptoms", the standard word to be examined "male asymptomatic uterine cancer" is deleted. The medical rationality check deletes a standard word to be examined when its composition contains components that are not medically reasonable; for example, since a male does not have a uterus, "male" and "uterine cancer" cannot appear in a standard word to be examined at the same time, so "male asymptomatic uterine cancer" is determined not to be medically reasonable and is deleted. By the same screening processing, the standard word to be examined "symptomatic uterine cancer" may be determined as the standard word matching the input word.
In this way, by screening the composition of each standard word to be examined and determining the standard words to be examined that satisfy the rationality index as the standard words matching the input word, the standard words finally matched to the input word have a higher degree of rationality.
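The screening of step 105 can be sketched as follows. The negative-word list and the incompatible-pair rule base below are assumptions standing in for the medical knowledge that the embodiment leaves to the medical part of the engine, and split_components (the splitting-and-encoding helper of step 107) is only referenced, not implemented.

NEGATIVE_WORDS = {"no", "without", "asymptomatic"}            # assumed negation markers
INCOMPATIBLE_PAIRS = {("male", "uterine cancer")}             # assumed medical rationality rules

def passes_rationality_check(components):
    """Return False if the composition contains a negation word or a medically
    incompatible pair of components, mirroring the negative-word check and the
    medical rationality check described above."""
    if any(component in NEGATIVE_WORDS for component in components):
        return False
    return not any(a in components and b in components for a, b in INCOMPATIBLE_PAIRS)

print(passes_rationality_check(["symptomatic", "uterine cancer"]))            # True: kept
print(passes_rationality_check(["male", "asymptomatic", "uterine cancer"]))   # False: deleted
# matched = [w for w in standard_words_to_examine if passes_rationality_check(split_components(w))]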
In some embodiments, referring to fig. 3D, fig. 3D is a schematic flowchart of a term processing method provided in an embodiment of the present application, and before step 101 shown in fig. 3D, building a second term operation tree may be implemented by performing steps 106 to 108, which are described below separately.
In step 106, a glossary of terms is obtained.
For example, referring to the term standard table shown in Table 1 above, the standard word corresponding to code B45 is "cryptococcosis", the standard word corresponding to B45.0 is "pulmonary cryptococcosis", and so on. The term standard table characterizes the mapping between codes and standard words.
In step 107, a splitting encoding process is performed on each standard word in the term standard table to obtain a component of each standard word.
Wherein the components include a modification component, a site component, a root component, and a logic component.
As an example, referring to table 1 above, the standard word "cryptococcus neoformans pneumonia" in the term standard table is subjected to a splitting coding process, and the obtained standard word "cryptococcus neoformans pneumonia" can be composed of "cryptococcus", "new" and "pneumonia", wherein the modified component is "new", the root component is "pneumonia", the site component is "cryptococcus", and the standard word "cryptococcus neoformans pneumonia" has no logical component.
In step 108, a second term operation tree corresponding to the standard word is constructed based on the modifier component, the part component, the root component, and the logic component.
In some embodiments, constructing the second term operation tree corresponding to the standard word based on the modification component, the part component, the root component and the logic component in the step 108 may be implemented by: determining the logical component as a root node of a second term operation tree; determining the root component as an intermediate node of the second term operation tree, wherein the intermediate node is a child node of the root node; determining the part component and the modification component as leaf nodes of the second term operation tree, wherein the leaf nodes are child nodes of the middle node; and constructing a second term operation tree corresponding to the standard words based on the root node, the middle node and the leaf nodes.
By way of example, referring to the second term operation tree in fig. 4E, logical component C1 is determined as the root node of the second term operation tree; determining root components R1 and R2 as intermediate nodes of a second term operation tree; determining the part components A1, A2 and the modified components B1, B2 as leaf nodes of a second term operation tree; a second term operation tree corresponding to the standard word is constructed based on the root node C1, the intermediate nodes R1, R2, and the leaf nodes a1, a2, B1, B2.
In some embodiments, referring to fig. 3D, before step 104 shown in fig. 3D, the candidate standard words corresponding to the input word can be determined by performing at least one of step 109 and step 110, which are described below separately.
In step 109, a recall index of the input word with respect to each standard word in the term standard table is determined, and the standard word with the largest recall index value is determined as the candidate standard word corresponding to the input word.
Wherein the recall index characterizes a degree of similarity between the input word and the standard word.
In some embodiments, referring to fig. 3E, fig. 3E is a flowchart of a term processing method provided in an embodiment of the present application, and the determining the recall index of the input word with respect to each standard word in the term standard table in step 109 may be implemented by performing steps 1091 to 1095.
In step 1091, edit distances between the input word and each of the standard words in the term standard table are determined, wherein the edit distances represent how similar the literal features between the input word and the standard words are.
As an example, the edit distance between the input word a and the standard word B in the term standard table may be dist (a, B). The edit distance of the input word and the standard word in the term standard table characterizes the minimum number of single character edit operations required to convert the input word to the standard word.
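A sketch of the edit distance used throughout this description (steps 1041 to 1043 and step 1091) is given below; the dynamic-programming form is the standard Levenshtein distance, and the normalisation by the longer string length is an assumption made here so that 1 - dist(., .) stays in [0, 1] in the similarity formulas.

def dist(a, b):
    """Minimum number of single-character edit operations to turn a into b,
    divided by the length of the longer string (normalisation is assumed)."""
    m, n = len(a), len(b)
    previous = list(range(n + 1))
    for i in range(1, m + 1):
        current = [i] + [0] * n
        for j in range(1, n + 1):
            current[j] = min(previous[j] + 1,                           # deletion
                             current[j - 1] + 1,                        # insertion
                             previous[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        previous = current
    return previous[n] / max(m, n, 1)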
In step 1092, a vectorization process is performed based on the input word calling vector model to obtain an input word vector corresponding to the input word.
In step 1093, a vector model is sequentially invoked for vectorization based on each standard word in the term standard table, so as to obtain a standard word vector corresponding to the standard word.
In some embodiments, the vectorization process may be implemented by calling a vector model, where the vector model may be a BERT model, an ALBERT model, a word2vec model, a MedBERT model, or the like.
In step 1094, word vector similarity between the input word vector and the standard word vector corresponding to each standard word in the term standard table is determined, and the word vector similarity represents the similarity of semantic features between the input word and the standard words.
In some embodiments, the word vector similarity between the input word vector and the standard word vector may be a hamming distance, a euclidean distance, a cosine similarity, etc., between the input word vector and the standard word vector.
In step 1095, recall indexes of the input words with respect to each standard word in the term standard table are determined according to the word vector similarity and the edit distance.
In some embodiments, in step 1095, the recall index of the input word with respect to each standard word in the term standard table is determined according to the word vector similarity and the edit distance, and the following processing may be performed for each standard word in the term standard table: determining the negative of the edit distance between the input word and the standard word as the literal feature, and determining the word vector similarity between the input word and the standard word as the semantic feature; and performing weighted averaging on the literal feature and the semantic feature, and determining the obtained weighted average as the recall index of the input word with respect to the standard word.
As an example, the expression for the determined recall metric may be:
[Formula (5), given as an image in the original publication, defines the recall index score(A, B) as a weighted average of the literal feature -dist(A, B) and the semantic feature sim(A', B').]
wherein A denotes the input word, B denotes the standard word, score(A, B) denotes the recall index, dist(A, B) denotes the edit distance between the input word A and the standard word B, sim(A', B') denotes the word vector similarity between the input word vector A' and the standard word vector B', A' denotes the input word vector, and B' denotes the standard word vector.
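Steps 1091 to 1095 can be sketched as below. Since formula (5) is only reproduced as an image, equal weights for the literal and semantic features are an assumption; embed stands for any of the vector models listed above (BERT, word2vec, etc.) and is not implemented here, and cosine similarity is used as one of the word vector similarities mentioned in step 1094.

import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def recall_index(input_word, standard_word, embed, literal_weight=0.5, semantic_weight=0.5):
    """Weighted average of the literal feature (negative edit distance) and the
    semantic feature (word vector similarity), following step 1095."""
    literal_feature = -dist(input_word, standard_word)
    semantic_feature = cosine_similarity(embed(input_word), embed(standard_word))
    return literal_weight * literal_feature + semantic_weight * semantic_feature

# Hypothetical usage (step 109): term_standard_table is an iterable of standard words.
# best_candidate = max(term_standard_table, key=lambda word: recall_index(input_word, word, embed))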
In step 110, a target root level matching the input word is determined among a plurality of root levels of the term standard table, and a standard word of a sub-level under the target root level and a standard word of the target root level are determined as candidate standard words corresponding to the input word.
The granularity of a standard word at a root level is larger than that of a standard word at a sub-level under that root level, and the granularity of a standard word is inversely related to how detailed its expression is.
As an example, referring to Table 1 above, a root level of the term standard table may be the level coded B45 whose standard word is "cryptococcosis"; the codes of the root levels shown in Table 1 are 3 characters long, and the other levels containing these three characters are all sub-levels of that root level. The standard word "cryptococcosis" at the root level has a larger granularity than the standard word "pulmonary cryptococcosis" at a sub-level under the root level; since granularity is inversely related to how detailed the expression is, the expression of the standard word "pulmonary cryptococcosis" is more detailed than that of the standard word "cryptococcosis".
In some embodiments, the determining a target root level matching the input word from among the plurality of root levels of the term criteria table in step 110 may be implemented as follows: determining root level similarity between the input words and standard words of each root level in the term standard table; and determining the root level corresponding to the standard words with the root level similarity larger than the similarity threshold as the target root level matched with the input words.
As an example, a root level similarity 5 between the input word and a standard word of a root level (a level coded as a B45 standard word) in the term standard table is determined, a root level similarity 6 between the input word and a standard word of a root level (a level coded as a B46 standard word) in the term standard table is determined, wherein the root level similarity 5 is greater than a similarity threshold, the root level similarity 6 is less than the similarity threshold, and a root level corresponding to the root level similarity 5 is determined as a target root level matched with the input word.
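Step 110 can be sketched as follows, treating 3-character codes (such as B45 in Table 1) as root levels; the code layout, the similarity function and the threshold below are assumptions for illustration only.

def candidates_from_root_levels(input_word, code_to_standard_word, similarity, threshold):
    """Determine the target root levels whose standard words are similar enough to the
    input word, then return the root-level standard word together with the standard
    words of all sub-levels under each target root level."""
    root_levels = {code: word for code, word in code_to_standard_word.items() if len(code) == 3}
    target_codes = [code for code, word in root_levels.items()
                    if similarity(input_word, word) > threshold]
    return [word for code, word in code_to_standard_word.items()
            if any(code.startswith(target) for target in target_codes)]

# Hypothetical usage with a fragment of Table 1.
input_word = "cryptococcus neoformans encephalitis"
candidates = candidates_from_root_levels(
    input_word,
    {"B45": "cryptococcosis", "B45.0": "pulmonary cryptococcosis"},
    similarity=lambda a, b: 1 - dist(a, b),   # assumed root level similarity
    threshold=0.3)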
Thus, the term standard table is recalled based on the first term operation tree corresponding to the input word and the second term operation tree corresponding to each standard word, so as to obtain a plurality of candidate standard words corresponding to the input word, thereby preliminarily determining the range of standard words that may match the input word. The standard words to be examined are then determined among the plurality of candidate standard words according to the tree similarity between the first term operation tree and the second term operation tree, further narrowing that range. Finally, the standard word matching the input word is accurately determined through the screening processing. In this way, the range of standard words is gradually narrowed through multiple layers of screening; moreover, because the term operation tree can adapt to the disorder and randomness with which input words are expressed in different scenarios, the accuracy of standard word matching can be effectively improved.
Next, an exemplary application of the embodiment of the present application in an actual medical informatization application scenario will be described.
Medical term standardization is an important foundation for medical artificial intelligence and plays an important role in many scenes. For example, in a health-record data center scenario, medical term standardization can greatly reduce the workload of medical record coders and help hospitals quickly and inexpensively build a data center for information storage and query. For another example, medical term standardization can also standardize hospital data from multiple regions and hospitals of different levels, supporting the construction of intelligent epidemic prevention and control dashboards and intelligent epidemic monitoring.
By way of example, referring to fig. 5A, the clinical terminology standardization engine may help the relevant departments perform intelligent assisted underwriting, integrate data from the various parties, and provide a unified, annotated diagnostic data interface. In the display interface shown in fig. 5A, the subclasses corresponding to gastric cancer may include pyloric gastric cancer and virus-related gastric cancer; the parent classes corresponding to gastric cancer may include cancer and primary malignant tumor of the stomach; the morphology of gastric cancer is carcinoma, its site of occurrence is the stomach, and a series of specialized terms about gastric cancer are presented in this way.
Referring to fig. 5B, fig. 5B is a schematic diagram of an overall architecture of a term processing method according to an embodiment of the present application. The overall architecture of the term processing method as shown in FIG. 5B includes an engine input section, an offline computation section, and an online processing section. The engine input section, offline calculation section, and online processing section are explained below, respectively.
Referring to fig. 5B, in the engine input 1, the non-standard words (i.e., the input words described above) and the medical term standard table (i.e., the standard table described above) are input to the engine. After rejection processing, replacement processing and word segmentation processing are carried out on the input non-standard words, term principal component recognition is carried out on the basis of processing results, and a term operation tree corresponding to the non-standard words is constructed on the basis of recognition results.
Referring to fig. 5B, in the offline calculation 2, the medical term standard table is subjected to data processing, so as to obtain a medical term bridging table, wherein the data processing includes a blocking process, an indexing process and a replacing process. And performing term principal component recognition processing on the standard words in the medical term bridging table, and constructing a term operation tree based on the term principal component recognition processing result.
Referring to FIG. 5B, a multi-recall model, a fine-order model, and an output inspection model are included in the online computation 3. The input non-standard words (i.e. the input words described above) are processed by the multi-way recall model, the fine sorting model and the output check model after being subjected to the replacement processing, the word segmentation processing and the rejection processing, so as to obtain the output standard words.
The construction process of the term operation tree is explained below. Referring to fig. 5C, fig. 5C is a schematic flow chart of a term processing method provided in the embodiments of the present application.
Firstly, splitting an input word to obtain key components in the term operation tree.
By way of example, the term key components in an operation tree may include: a disease modifying component, a disease anatomical region component, a disease root component, and a disease link logic component.
As an example, splitting the input word may first identify and locate the disease link logic component C of the input word; specifically, a dictionary recognition function (lookup) may be used to identify three types of linking logic, namely 'accompanied by', 'and' and 'and/or', which are combined into the disease link logic component. The disease modification component, the disease anatomical part component and the disease root component of the input word are then identified and located, specifically by the dictionary recognition function (lookup) or a sequence labeling model. The disease modification component, the disease anatomical part component and the disease root component of the input word can be identified and located independently and in parallel, which avoids the possibility that the identified components are nested or partially overlapping.
As an example, for an input word M = {m1, m2, m3, ..., mn}, where n is the length of the input word, the disease modification component, the disease anatomical part component and the disease root component of the input word are identified and located separately and in parallel as follows:
The disease modification component of the input word is identified and located, obtaining A = {mk, mk+1, mk+2, ..., mk+l} ⊆ M, where k represents the position of the disease modification component in the input word, l represents its length, 1 ≤ k ≤ n and 0 ≤ l ≤ n − k.
The disease anatomical part component of the input word is identified and located, obtaining B = {mq, mq+1, mq+2, ..., mq+w} ⊆ M, where q represents the position of the disease anatomical part component in the input word, w represents its length, 1 ≤ q ≤ n and 0 ≤ w ≤ n − q.
The disease root component of the input word is identified and located, obtaining R = {mr, mr+1, mr+2, ..., mr+x} ⊆ M, where r represents the position of the disease root component in the input word, x represents its length, 1 ≤ r ≤ n and 0 ≤ x ≤ n − r.
And then combining the disease modification component A of the input word, the disease anatomical part component B of the input word and the disease root component R of the input word to obtain a subtree structure of the term operation tree, wherein the subtree structure does not comprise a root node of the term operation tree.
Referring to fig. 5C, after the disease modification component A, the disease anatomical part component B, the disease root component R and the disease link logic component C of the input word are obtained, these components are encoded, and a term operation tree is then constructed based on the encoding result and the components. The term operation tree can also be decoded to obtain a corresponding decoding result.
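As an illustration of this construction process, the following sketch identifies components by lookup in small hypothetical dictionaries and assembles them into a tree; the dictionaries, node class and tokenized input are assumptions for the example, not the engine's actual resources:

```python
from dataclasses import dataclass, field

# Hypothetical component dictionaries; in the embodiment these would come from
# medical lexicons or a sequence labeling model.
MODIFIERS = {"diabetic", "progressive", "chronic"}   # disease modifying components (A)
SITES = {"peritoneal", "gastric", "pulmonary"}       # disease anatomical part components (B)
ROOTS = {"bleeding", "tumor", "cryptococcosis"}      # disease root components (R)
LINKS = {"with", "and", "or"}                        # disease link logic components (C)

@dataclass
class Node:
    label: str                                       # component text
    kind: str                                        # "A", "B", "R" or "C"
    children: list = field(default_factory=list)

def build_term_operation_tree(tokens):
    """Link logic C becomes the root; each disease root R becomes a child of C,
    and the A/B tokens seen since the previous R become that R's children."""
    link = "/".join(t for t in tokens if t in LINKS) or "AND"
    root = Node(link, "C")
    pending = []                                     # A/B components waiting for their R
    for tok in tokens:
        if tok in LINKS:
            continue
        if tok in ROOTS:
            root.children.append(Node(tok, "R", pending))
            pending = []
        elif tok in MODIFIERS:
            pending.append(Node(tok, "A"))
        elif tok in SITES:
            pending.append(Node(tok, "B"))
    return root

tree = build_term_operation_tree(
    ["diabetic", "progressive", "chronic", "bleeding", "with", "peritoneal", "tumor"])
print(tree.label, [(r.label, [c.label for c in r.children]) for r in tree.children])
# with [('bleeding', ['diabetic', 'progressive', 'chronic']), ('tumor', ['peritoneal'])]
```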
Referring to fig. 5D, fig. 5D is a schematic diagram of a term operation tree provided in the embodiment of the present application. Referring to the term operation tree 51 in fig. 5D, the disease link logical component C is used as a root node of the term operation tree 51, the disease root component R (R1 and R2) of the input word is used as a child node of the root node of the term operation tree 51, and the disease modification component a of the input word and the disease anatomical region component B of the input word are used as leaf nodes of the term operation tree 51, wherein the disease modification component a1 of the input word, the disease modification component a2 of the input word and the disease anatomical region component B1 of the input word are child nodes of the disease root component R1 of the input word, and the disease anatomical region component B2 of the input word is a child node of the disease root component R2 of the input word.
In some embodiments, referring to the term operation tree 52 in fig. 5D, the term operation tree 51 and the term operation tree 52 have the same structure, and the term operation tree 52 is a concrete instance of the term operation tree 51. In the term operation tree 52, the 'bleeding' disease root component carries the 'diabetic', 'progressive' and 'chronic' disease modifying components, and the 'tumor' disease root component carries the 'peritoneal' disease anatomical part component.
Referring to fig. 5E, fig. 5E is a schematic diagram of a term operation tree provided in the embodiment of the present application. In the decoding process, input words may be recovered from the constructed term operation tree, and the decoded input words do not consider the order among sibling components. For example, decoding the term operation tree 51 may yield readings such as A1A2B1R1/C/B2R2, A2A1B1R1/C/B2R2 and A1B1A2R1/C/B2R2, which differ only in sibling order. Decoding the term operation tree 52, the resulting input words may be diabetic/progressive/chronic/bleeding/concomitant/peritoneal/tumor, diabetic/chronic/progressive/bleeding/concomitant/peritoneal/tumor, or progressive/diabetic/chronic/bleeding/concomitant/peritoneal/tumor.
In this way, input words recovered from the constructed term operation tree do not depend on the order among sibling components, so converting an input word from a short text into a term operation tree structure can effectively overcome the disorder and randomness present in input word expressions across different scenes.
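A minimal sketch of such order-insensitive decoding, over a term operation tree written as nested tuples (an illustrative stand-in for the structure of fig. 5D), could be:

```python
from itertools import permutations, product

# Term operation tree 52 written as nested tuples: (link, [(root, [children...]), ...])
TREE_52 = ("concomitant", [("bleeding", ["diabetic", "progressive", "chronic"]),
                           ("tumor", ["peritoneal"])])

def decode(tree):
    """Yield textual readings of the tree; the order of sibling components under
    each disease root is deliberately treated as interchangeable."""
    link, roots = tree
    per_root = []
    for root_word, children in roots:
        per_root.append(["/".join(list(p) + [root_word]) for p in permutations(children)])
    for combo in product(*per_root):
        yield ("/" + link + "/").join(combo)

for reading in list(decode(TREE_52))[:3]:
    print(reading)
# diabetic/progressive/chronic/bleeding/concomitant/peritoneal/tumor  (and other sibling orders)
```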
In some embodiments, referring to fig. 5F, since in real-world scenarios doctors and related medical workers follow different writing specifications, and natural language itself often omits preceding content, the term operation tree can be further expanded after the above processing. For example, the term operation tree 51 is expanded to obtain a term operation tree 53. As shown in fig. 5F, when the R2 component has no disease modifying component of its own, the preceding R1 component can transfer its disease modifying components A1/A2 to the succeeding R2 component. For example, 'diabetes/progressive/chronic/hemorrhage/concomitant/peritoneum/tumor' exemplified above can be expanded by transferring any one or more of 'diabetes', 'progressive' and 'chronic' to 'tumor', for example yielding 'diabetes/progressive/chronic/hemorrhage/concomitant/chronic/peritoneum/tumor'.
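The expansion of fig. 5F can be sketched as follows; the modifier dictionary and the choice to copy all (rather than any subset of) the preceding modifiers are assumptions for the example:

```python
# Hypothetical modifier dictionary for the example.
MODIFIER_DICT = {"diabetic", "progressive", "chronic"}

def expand_modifiers(tree):
    """If a later disease root carries no modifying component of its own, copy the
    preceding root's modifiers onto it (the expansion illustrated by fig. 5F)."""
    link, roots = tree
    expanded, prev_modifiers = [], []
    for root_word, children in roots:
        has_own_modifier = any(c in MODIFIER_DICT for c in children)
        if not has_own_modifier and prev_modifiers:
            children = prev_modifiers + children
        expanded.append((root_word, children))
        prev_modifiers = [c for c in children if c in MODIFIER_DICT]
    return (link, expanded)

print(expand_modifiers(("concomitant",
                        [("bleeding", ["diabetic", "progressive", "chronic"]),
                         ("tumor", ["peritoneal"])])))
# ('concomitant', [('bleeding', [...]), ('tumor', ['diabetic', 'progressive', 'chronic', 'peritoneal'])])
```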
Thus, by constructing the term operation tree, medical terms in short-text form (such as diagnoses) can be structured at a finer granularity. The term operation tree needs to be strictly defined, with each part corresponding to a different medical meaning, and its flexibility and lateral extensibility also ensure that the business requirements of real scenes are met to the maximum extent.
Referring to fig. 5B, the BERT word vector similarity recall in the multi-way recall model depends on the selected features, and different features often greatly affect the final recall effect. To effectively improve the overall recall rate of the multi-way recall model, recall can be performed by combining literal features and semantic features. The literal feature is mainly the edit distance (Levenshtein Distance): simply put, the edit distance between two entity words is computed, and the smaller the edit distance, the closer the two entity words. The semantic feature refers to a similarity computed from semantic word vectors of the input texts; the word vector models compatible with the embodiments of the present application include, but are not limited to, word2vec, BERT, ALBERT, etc. In a medical scenario, a word vector model (Med BERT) may be used to obtain the word vector, and the word vector model may be expressed as:
w′ = enc(w), enc ∈ {MedBERT, BERT, ALBERT} (6)
for input text w of different lengths, the word vector model can model the input text w into word vectors of equal dimensions.
The cosine similarity between the input word and each standard word is then calculated from their word vectors, and the results are sorted to obtain the entity word with the highest similarity, i.e. the entity linking result of the input entity word.
(Formula rendered as an image in the original: the recall index combines the negated edit distance −dist(A, B), as the literal feature, with the cosine similarity sim(A′, B′), as the semantic feature, in a weighted average.)
Wherein, A represents an input word, B represents a standard word, A 'represents a word vector corresponding to the input word, and B' represents a word vector corresponding to the standard word. dist (a, B) represents an edit distance between the input word and the standard word, and sim (a ', B') represents a cosine similarity between a word vector corresponding to the input word and a word vector corresponding to the standard word.
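As a sketch of this combined literal/semantic scoring, the following toy example (hypothetical names; a character-count vector stands in for the Med BERT encoder, and the 0.5 weighting is an arbitrary choice) computes a recall score as the weighted average of the negated edit distance and the cosine similarity:

```python
import math
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic programming over a single row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def embed(text: str) -> Counter:
    # Toy character-count "word vector"; the embodiment would instead call
    # enc(w) with a model such as Med BERT.
    return Counter(text)

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_score(input_word: str, standard_word: str, w_literal: float = 0.5) -> float:
    literal = -edit_distance(input_word, standard_word)           # literal feature
    semantic = cosine(embed(input_word), embed(standard_word))    # semantic feature
    return w_literal * literal + (1 - w_literal) * semantic       # weighted average

print(recall_score("gastric bleeding", "gastric hemorrhage"))
```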
Referring to fig. 5B, the multi-way recall model also includes a three-digit/six-digit code multi-level recall. For an input word A, all possible standard words B would otherwise need to be recalled from the standard system, which causes excessive computational complexity and may also introduce noise. The embodiments of the present application therefore design a data object around the encoding characteristics of ICD-10 to improve the efficiency and effect of recall. Referring to table 2 below, table 2 is a term standard table provided in the embodiments of the present application.
Table 2 Term standard table
(Table rendered as an image in the original; the examples cited in the text include codes such as B45 cryptococcosis, B45.0 pulmonary cryptococcosis and B45.1 cerebral cryptococcosis, with codes ranging from 3-character codes to 6-character codes plus additional codes.)
Referring to table 2 above, the shortest code is a 3-character code and the longest is a 6-character code plus an additional code. Such a term system ensures that terms sharing the same 3-character code express similar concepts, or concepts under one general category, while the differences between different 6-character codes are mostly finer-grained, such as 'B45.0 pulmonary cryptococcosis' and 'B45.1 cerebral cryptococcosis'. Therefore, during recall, only the input non-standard word and all 3-character-code terms need to be processed, i.e. only the similarity to 'B45 cryptococcosis' is calculated. If the similarity is greater than a certain threshold, all terms under that 3-character-code concept (i.e. all terms whose first three code characters are 'B45') are recalled. This reduces the computational complexity, and the coarse-grained concept helps avoid some false recalls, thereby ensuring both the efficiency and the effect of the recall module.
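A minimal sketch of this coarse-to-fine recall over 3-character codes, using an illustrative table fragment and a toy similarity function rather than the engine's actual data and model, could look like:

```python
from difflib import SequenceMatcher

# Illustrative fragment of a term standard table keyed by ICD-10-style codes.
STANDARD_TABLE = {
    "B45":   "cryptococcosis",
    "B45.0": "pulmonary cryptococcosis",
    "B45.1": "cerebral cryptococcosis",
    "B46":   "zygomycosis",
}

def recall_by_root_code(input_word: str, table: dict, threshold: float = 0.5):
    candidates = {}
    # Step 1: compare the input word only against 3-character (root-level) codes.
    for code, word in table.items():
        if len(code) != 3:
            continue
        if SequenceMatcher(None, input_word, word).ratio() > threshold:
            # Step 2: recall every term whose code starts with the matched root code.
            candidates.update({c: w for c, w in table.items() if c.startswith(code)})
    return candidates

print(recall_by_root_code("pulmonary cryptococcosis", STANDARD_TABLE))
# {'B45': 'cryptococcosis', 'B45.0': 'pulmonary cryptococcosis', 'B45.1': 'cerebral cryptococcosis'}
```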
Referring to fig. 5G, fig. 5G is a schematic diagram of a term operation tree provided in an embodiment of the present application. In the multi-way recall model, the term operation trees constructed from the standard words and from the original word (i.e., the input word described above) can help screen the recalled candidate standard words. As shown in fig. 5G, the disease root component D and the anatomical part component B in the term operation tree can be used as screening criteria: it can be required that the disease root component D of a recalled standard word is consistent with that of the input original word, and that the anatomical part component B is consistent as well. For example, if the input original word is 'gastrorrhagia' (gastric bleeding), standard words whose anatomical part component B is not 'stomach' are excluded after the term principal component screening. This screening exploits the advantages of the term operation tree to filter standard words already at the recall stage, further reducing loss in the recall process and reducing the computation of the fine-ranking model. Referring to fig. 5G, the recall module recalls based on the disease root D and the anatomical part B.
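The principal-component screening described above can be sketched as follows; the component dictionaries and the whitespace tokenization are illustrative assumptions, since in the engine the components come from the already-built term operation trees:

```python
# Hypothetical component dictionaries for the example.
ROOT_WORDS = {"bleeding", "ulcer", "tumor"}
SITE_WORDS = {"stomach", "lung", "brain"}

def principal_components(term: str):
    tokens = term.split()
    roots = {t for t in tokens if t in ROOT_WORDS}
    sites = {t for t in tokens if t in SITE_WORDS}
    return roots, sites

def screen_candidates(input_word: str, candidates: list) -> list:
    """Keep only candidates whose disease root and anatomical site components
    are consistent with those of the input word."""
    in_roots, in_sites = principal_components(input_word)
    kept = []
    for cand in candidates:
        c_roots, c_sites = principal_components(cand)
        if in_roots <= c_roots and in_sites <= c_sites:
            kept.append(cand)
    return kept

print(screen_candidates("stomach bleeding",
                        ["stomach bleeding", "stomach ulcer", "lung bleeding"]))
# ['stomach bleeding']
```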
The embodiment of the application makes full use of the data structure of the term operation tree while avoiding a large increase in engine computation. The importance of information at different levels is distinguished by adjusting the weights of three hyperparameters, alpha (α), beta (β) and gamma (γ), where α weights the consistency of the link component, β weights the consistency of the disease root components in the two diagnostic words, and γ weights the consistency of the disease modifying components and anatomical part components under the same disease root in the two diagnostic words.
(Formulas rendered as images in the original; as described below and in the summing step of the fine-ranking stage, the overall tree similarity is SIM(X, Y) = SIM1(X, Y) + SIM2(X, Y) + SIM3(X, Y), where SIM1 is derived from the first edit distance dist(CX, CY) weighted by α, SIM2 from the second edit distances dist(RXi, RYj) weighted by β, and SIM3 from the third edit distances dist(AXi, AYj) and dist(BXi, BYj) weighted by γ.)
Here, dist(CX, CY) characterizes the first edit distance, dist(RXi, RYj) the second edit distance, and dist(AXi, AYj) and dist(BXi, BYj) the third edit distances; SIM(X, Y) represents the similarity between the first term operation tree X and the second term operation tree Y; α is the hyperparameter adjusting the first edit distance, β the hyperparameter adjusting the second edit distance, and γ the hyperparameter adjusting the third edit distance; SIM1(X, Y), SIM2(X, Y) and SIM3(X, Y) represent the first, second and third tree similarities respectively.
Therefore, similarity between each recalled standard word and the input word is calculated by the edit-distance method over the term operation trees described above, and the standard words are then sorted from large to small by similarity to obtain the final ranking result.
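Since the exact parameterization in the image-rendered formulas is not reproduced here, the following sketch assumes one plausible form, SIM = SIM1 + SIM2 + SIM3 with each level's edit distance negated and weighted by α, β or γ, over trees written as (link, [(root, [leaves...]), ...]); pairing leaves after sorting is a simplification of per-leaf matching:

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic programming over a single row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def tree_similarity(tree_x, tree_y, alpha=1.0, beta=0.5, gamma=0.25):
    """Assumed form: SIM = SIM1 + SIM2 + SIM3, each SIMk being a negated,
    hyperparameter-weighted edit distance over one level of the two trees."""
    cx, roots_x = tree_x
    cy, roots_y = tree_y
    sim1 = -alpha * edit_distance(cx, cy)                       # link components
    sim2 = sim3 = 0.0
    for (rx, leaves_x), (ry, leaves_y) in zip(roots_x, roots_y):
        sim2 += -beta * edit_distance(rx, ry)                   # disease root components
        for lx, ly in zip(sorted(leaves_x), sorted(leaves_y)):
            sim3 += -gamma * edit_distance(lx, ly)              # modifier / site components
    return sim1 + sim2 + sim3

x = ("with", [("bleeding", ["chronic", "stomach"])])
y = ("with", [("bleeding", ["chronic", "gastric"])])
print(tree_similarity(x, y))
```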
Referring to fig. 5B, the output inspection model mainly performs a medical rationality inspection on the sorted standard words, including: a negative-word check, for example, 'no fever' is a negative expression, so a standard word containing 'no fever' is directly excluded; and a medical rationality check, for example, if 'male' and 'uterus' appear simultaneously in one standard word, that standard word is directly excluded. The output inspection model can be configured according to actual business requirements and requires a large amount of medical knowledge for support.
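A minimal sketch of such a configurable output check (the negative-word list and the contradiction pairs below are illustrative, not the engine's actual rules):

```python
NEGATIVE_WORDS = {"no fever", "denies"}                     # illustrative negative expressions
CONTRADICTIONS = [({"male"}, {"uterus", "uterine"})]        # illustrative incompatible concept pairs

def passes_rationality_check(standard_word: str) -> bool:
    # Negative-word check: exclude standard words containing a negative expression.
    if any(neg in standard_word for neg in NEGATIVE_WORDS):
        return False
    # Medical rationality check: exclude words containing mutually exclusive concepts.
    # (A production check would tokenize rather than use substring matching.)
    for group_a, group_b in CONTRADICTIONS:
        if any(a in standard_word for a in group_a) and any(b in standard_word for b in group_b):
            return False
    return True

ranked = ["uterine fibroid, male", "no fever observed", "gastric ulcer"]
print([w for w in ranked if passes_rationality_check(w)])   # ['gastric ulcer']
```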
With continued reference to fig. 5B, the input non-standard words may also be subjected to service white list matching, and when the input non-standard words are matched with the service white list, the input non-standard words are directly determined as output standard words.
To verify the validity of the term processing method provided by the embodiments of the present application, experimental results obtained on a large amount of real diagnostic data are provided below.
For example, the diagnostic data were labeled, yielding 253 valid data items, with a further 400 data items that should not be normalized. The term processing method provided in the embodiments of the present application and the related techniques were evaluated on these data. Evaluation is performed along two dimensions: the individual-diagnosis level (including precision rate, recall rate and score), where a one-to-many case is treated as multiple data items; and the input level (accuracy), where a one-to-many case is counted as correct only if all of its normalized results are correct.
Table 3 Experimental results

                    Precision rate    Recall rate    Score    Accuracy
Related art             0.513            0.544       0.528     0.496
This application        0.802            0.601       0.687     0.560
Referring to table 3 above, the normalization effect of the term processing method provided in the embodiments of the present application on diagnostic data from a real scene is improved by close to 30% in score compared with the related art, which demonstrates the effectiveness of the method; the term operation tree also overcomes the defects of the related art while keeping the output results medically reasonable. On this basis, outpatient record data from multiple hospitals of different levels can be aligned and mapped to the term system, so that heterogeneous data can be connected through a unified terminology system.
Continuing with the exemplary structure provided by the present application for the term processing device 255 implemented as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the memory 250 in the term processing device 255 may include: an obtaining module 2551, configured to obtain an input word in a specific field; a construction module 2552, configured to construct a first term operation tree corresponding to the input word; a first recall module 2553, configured to perform recall processing on the term standard table based on the first term operation tree and a second term operation tree pre-constructed for each standard word in the term standard table in the specific field, so as to obtain a plurality of candidate standard words corresponding to the input word; a first determining module 2554, configured to determine tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word, perform descending order sorting on the multiple candidate standard words according to the tree similarity, and determine part of the candidate standard words located at the head in a descending order sorting result as standard words to be checked; and the second determining module 2555 is configured to perform screening processing on the composition of each standard word to be searched, and determine the obtained standard word to be searched, which meets the reasonableness index, as the standard word matched with the input word.
In some embodiments, the first recall module 2553 is further configured to determine a node index between the first term operation tree and a second term operation tree pre-constructed for each standard word in the domain-specific term standard table, wherein the node index characterizes a correspondence between nodes of the first term operation tree and nodes of the second term operation tree, and each node corresponds to a term component; and determining the standard words corresponding to the node indexes meeting the recall condition as a plurality of candidate standard words corresponding to the input words, wherein the recall condition comprises that partial nodes of the first term operation tree and partial nodes of the second term operation tree are the same.
In some embodiments, the first recall module 2553 is further configured to perform the following processing for a second term operation tree corresponding to any one of the standard words in the domain-specific term standard table: determining a root node index according to a root node of the first term operation tree and a root node of the second term operation tree; determining a leaf node index according to a part leaf node of a first term operation tree and a part leaf node of a second term operation tree, wherein the leaf nodes comprise part leaf nodes and modified leaf nodes; and determining the root node index and the leaf node index as the node index between the first term operation tree and the second term operation tree.
In some embodiments, the first recall module 2553 is further configured to perform the following for each standard word in the terminology standard table: when the root node index corresponding to the standard word meets the root node index recall condition and the leaf node index corresponding to the standard word meets the leaf node index recall condition, determining the standard word as a candidate standard word corresponding to the input word; the root node index recall condition comprises that a root node of the first term operation tree is the same as a root node of the second term operation tree, and the leaf node index recall condition comprises that a leaf node at the position of the first term operation tree is the same as a leaf node at the position of the second term operation tree.
In some embodiments, the first determining module 2554 is further configured to perform the following processing for the second term operation tree corresponding to each candidate standard word: determining a first edit distance between a root node of the first term operation tree and a root node of the second term operation tree; determining a second edit distance between each intermediate node of the first term operation tree and each intermediate node of the second term operation tree, wherein the intermediate nodes are child nodes of the root node; determining a third edit distance between a leaf node of the first term operation tree and a leaf node of the second term operation tree, wherein the leaf node is a child node of the intermediate node; and determining tree similarity between the first term operation tree and a second term operation tree corresponding to the candidate standard word based on the first editing distance, the second editing distance and the third editing distance.
In some embodiments, the first determining module 2554 is further configured to perform parameterization on the first editing distance to obtain a first tree similarity corresponding to the first editing distance; carrying out parameterization processing on the second editing distance to obtain a second tree similarity corresponding to the second editing distance; carrying out parameterization processing on the second editing distance and the third editing distance to obtain a third tree similarity corresponding to the third editing distance; and adding the first tree similarity, the second tree similarity and the third tree similarity to obtain the tree similarity between the first term operation tree and the second term operation tree corresponding to the candidate standard word.
In some embodiments, the term processing device 255 further includes: the second acquisition module is used for acquiring a term standard table; the splitting and coding module is used for splitting and coding each standard word in the term standard table to obtain the component of each standard word, wherein the component comprises a modification component, a part component, a root component and a logic component; and the second construction module is used for constructing a second term operation tree corresponding to the standard word based on the modification component, the part component, the root component and the logic component.
In some embodiments, the second building module is further configured to determine the logic component as a root node of a second term operation tree; determining the root component as an intermediate node of the second term operation tree, wherein the intermediate node is a child node of the root node; determining the part component and the modification component as leaf nodes of the second term operation tree, wherein the leaf nodes are child nodes of the middle node; and constructing a second term operation tree corresponding to the standard words based on the root node, the middle node and the leaf nodes.
In some embodiments, the term processing device 255 further includes: and the second recall module is used for determining the recall indexes of the input words relative to each standard word in the term standard table respectively, and determining the standard word with the maximum recall index value as a candidate standard word corresponding to the input word, wherein the recall indexes represent the similarity between the input word and the standard words. And the third recalling module is used for determining a target root level matched with the input words in a plurality of root levels of the term standard table, and determining the standard words of the sub-levels under the target root level and the standard words of the target root level as candidate standard words corresponding to the input words, wherein the granularity of the standard words of the root level is larger than that of the standard words of the sub-levels under the root level, and the granularity of the standard words is negatively related to the expression detail degree of the standard words.
In some embodiments, the second recall module is further configured to determine an edit distance between the input word and each standard word in the term standard table, where the edit distance represents a degree of similarity of the character features between the input word and the standard words; based on the input word, calling a vector model to carry out vectorization processing to obtain an input word vector corresponding to the input word; sequentially calling a vector model to carry out vectorization processing based on each standard word in the term standard table to obtain a standard word vector corresponding to the standard word; determining word vector similarity between the input word vector and a standard word vector corresponding to each standard word in the term standard table, wherein the word vector similarity represents semantic feature similarity between the input word and the standard words; and determining recall indexes of the input words relative to each standard word in the term standard table respectively according to the word vector similarity and the editing distance.
In some embodiments, the above-mentioned second recall module is further configured to perform the following processing for each standard word in the terminology standard table: determining the opposite number of the editing distance between the input word and the standard word as a character surface feature, and determining the word vector similarity between the input word and the standard word as a semantic feature; and performing weighted average processing on the literal characteristics and the semantic characteristics, and determining the obtained weighted average processing result as a recall index of the input word relative to each standard word in the term standard table.
In some embodiments, the third recall module is further configured to determine root-level similarities between the input word and the standard words of each root level in the term standard table; and determining the root level corresponding to the standard words with the root level similarity larger than the similarity threshold as the target root level matched with the input words.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the term processing method described above in the embodiments of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform the term processing method provided by embodiments of the present application, for example, the term processing method as illustrated in fig. 3A.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the present application has the following beneficial effects:
(1) By recalling the term standard table based on the first term operation tree corresponding to the input word and the second term operation trees corresponding to the standard words, a plurality of candidate standard words corresponding to the input word are obtained, which preliminarily determines the range of standard words that may match the input word. The standard words to be checked are then determined among the candidate standard words according to the tree similarity between the first term operation tree and each second term operation tree, further narrowing that range. Finally, the standard word matched with the input word is accurately determined through screening processing. The range of standard words is thus gradually narrowed through multi-layer screening, and, because the term operation tree can accommodate the disorder and randomness of input word expressions in different scenes, the accuracy of matching standard words can be effectively improved.
(2) The root node index and the leaf node index are determined as the node indexes between the first term operation tree and the second term operation tree, so that the determined node indexes represent the consistency between the leaf nodes of the first term operation tree and the leaf nodes of the second term operation tree and the consistency between the root nodes of the first term operation tree and the root nodes of the second term operation tree, and the determined node indexes can accurately reflect the consistency between the nodes of the first term operation tree and the nodes of the second term operation tree.
(3) Whether the standard words are determined to be candidate standard words corresponding to the input words or not is determined by judging whether the root node indexes corresponding to the standard words meet the root node index recall condition or not and whether the leaf node indexes corresponding to the standard words meet the leaf node index recall condition or not, so that the standard words are recalled or screened from the two aspects of the leaf node index recall condition and the root node index recall condition to obtain corresponding candidate standard words, and the accuracy of the obtained candidate standard words is remarkably improved.
(4) The tree similarity between the first term operation tree and the second term operation tree corresponding to the candidate standard words is determined based on the first editing distance, the second editing distance and the third editing distance, so that the similarity between the first term operation tree and the second term operation tree can be accurately judged, and the standard words to be searched in the candidate standard words can be accurately determined.
(5) The obtained standard words to be searched meeting the reasonability index are determined to be the standard words matched with the input words by screening the composition of each standard word to be searched, so that the reasonability degree of the obtained standard words matched with the input words is higher.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (16)

1. A method for term processing, the method comprising:
acquiring input words of a specific field;
constructing a first term operation tree corresponding to the input word;
recalling the term standard table based on the first term operation tree and a second term operation tree which is constructed in advance for each standard word in the term standard table of the specific field to obtain a plurality of candidate standard words corresponding to the input word;
determining tree similarity between the first term operation tree and a second term operation tree corresponding to each candidate standard word respectively, performing descending ordering on the candidate standard words according to the tree similarity, and determining part of the candidate standard words positioned at the head in the descending ordering result as standard words to be checked;
and screening the composition of each standard word to be searched, and determining the obtained standard word to be searched meeting the reasonability index as the standard word matched with the input word.
2. The method according to claim 1, wherein the recalling the term standard table based on the first term operation tree and a second term operation tree pre-constructed for each standard word in the domain-specific term standard table to obtain a plurality of candidate standard words corresponding to the input word comprises:
determining a node index between the first term operation tree and a second term operation tree pre-constructed for each standard word in the domain-specific term standard table, wherein the node index characterizes consistency between nodes of the first term operation tree and nodes of the second term operation tree, and each node corresponds to a term component;
determining a standard word corresponding to the node index meeting a recall condition as a plurality of candidate standard words corresponding to the input word, wherein the recall condition comprises that partial nodes of the first term operation tree and the second term operation tree are the same.
3. The method of claim 2, wherein determining a node index between the first term operation tree and a second term operation tree pre-constructed for each standard word in the domain-specific term criteria table comprises:
performing the following processing for a second term operation tree corresponding to any one of the standard words in the domain-specific term standard table:
determining a root node indicator according to a root node of the first term operation tree and a root node of the second term operation tree;
determining a leaf node index according to a part leaf node of the first term operation tree and a part leaf node of the second term operation tree, wherein the leaf nodes comprise the part leaf node and a modified leaf node;
determining the root node index and the leaf node index as node indexes between the first term operation tree and the second term operation tree.
4. The method of claim 2, wherein the recall condition comprises a root node indicator recall condition and a leaf node indicator recall condition;
the determining, as a plurality of candidate standard words corresponding to the input word, a standard word corresponding to the node indicator that satisfies a recall condition includes:
performing the following processing for each of the standard words in the term standard table:
when the root node index corresponding to the standard word meets the root node index recall condition and the leaf node index corresponding to the standard word meets the leaf node index recall condition, determining the standard word as the candidate standard word corresponding to the input word;
wherein the root node indicator recall condition comprises a root node of the first term operation tree and a root node of the second term operation tree being the same, and the leaf node indicator recall condition comprises the site leaf node of the first term operation tree and the site leaf node of the second term operation tree being the same.
5. The method of claim 1, wherein determining tree similarity between the first term operation tree and the second term operation tree corresponding to each candidate standard word respectively comprises:
executing the following processing for the second term operation tree corresponding to each candidate standard word:
determining a first edit distance between a root node of the first term operation tree and a root node of the second term operation tree;
determining a second edit distance between each intermediate node of the first term operation tree and each intermediate node of the second term operation tree, wherein the intermediate nodes are children of the root node;
determining a third edit distance between leaf nodes of the first term operation tree and leaf nodes of the second term operation tree, wherein the leaf nodes are child nodes of the intermediate node;
determining tree similarity between the first term operation tree and a second term operation tree corresponding to the candidate standard word based on the first edit distance, the second edit distance, and the third edit distance.
6. The method of claim 5, wherein determining a tree similarity between the first term operation tree and a second term operation tree corresponding to the candidate standard word based on the first edit distance, the second edit distance, and the third edit distance comprises:
carrying out parameterization processing on the first editing distance to obtain a first tree similarity corresponding to the first editing distance;
carrying out parameterization processing on the second editing distance to obtain a second tree similarity corresponding to the second editing distance;
carrying out parameterization processing on the second editing distance and the third editing distance to obtain a third tree similarity corresponding to the third editing distance;
and summing the first tree similarity, the second tree similarity and the third tree similarity to obtain the tree similarity between the first term operation tree and the second term operation tree corresponding to the candidate standard word.
7. The method of claim 1, wherein prior to obtaining the domain-specific input word, the method further comprises:
acquiring the term standard table;
splitting and coding each standard word in the term standard table to obtain a component of each standard word, wherein the component comprises a modification component, a part component, a root component and a logic component;
and constructing a second term operation tree corresponding to the standard word based on the modification component, the part component, the root component and the logic component.
8. The method of claim 7, wherein constructing a second term operation tree corresponding to the standard word based on the modification component, the part component, the root component, and the logic component comprises:
determining the logical component as a root node of the second term operation tree;
determining the root component as an intermediate node of the second term operation tree, wherein the intermediate node is a child node of the root node;
determining the site component and the modification component as leaf nodes of the second term operation tree, wherein the leaf nodes are child nodes of the intermediate node;
and constructing a second term operation tree corresponding to the standard word based on the root node, the middle node and the leaf node.
9. The method of claim 1, wherein prior to determining the tree similarity between the first term operation tree and the second term operation tree corresponding to each of the candidate standard words, the method further comprises:
performing at least one of the following processes:
determining a recall index of the input word relative to each standard word in the term standard table respectively, and determining the standard word with the maximum recall index value as a candidate standard word corresponding to the input word, wherein the recall index represents the similarity degree between the input word and the standard word;
determining a target root level matched with the input word in a plurality of root levels of the term standard table, and determining a standard word of a sub-level under the target root level and a standard word of the target root level as a candidate standard word corresponding to the input word, wherein the granularity of the standard word of the root level is larger than that of the standard word of the sub-level under the root level, and the granularity of the standard word is inversely related to the expression elaboration degree of the standard word.
10. The method of claim 9, wherein determining recall indicators for the input word relative to each standard word in the term standard table, respectively, comprises:
determining an edit distance between the input word and each standard word in the term standard table, wherein the edit distance represents a degree of similarity of the literal features between the input word and the standard words;
based on the input word, calling a vector model to carry out vectorization processing to obtain an input word vector corresponding to the input word;
sequentially calling the vector model to carry out vectorization processing on the basis of each standard word in the term standard table to obtain a standard word vector corresponding to the standard word;
determining word vector similarity between the input word vector and a standard word vector corresponding to each standard word in the term standard table, wherein the word vector similarity characterizes semantic feature similarity between the input word and the standard words;
and determining recall indexes of the input words relative to each standard word in the term standard table respectively according to the word vector similarity and the editing distance.
11. The method of claim 10, wherein the determining recall indicators of the input words with respect to each standard word in the term standard table respectively according to the word vector similarity and the edit distance comprises:
performing the following for each standard word in the terminology standard table:
determining the opposite number of the editing distance between the input word and the standard word as a character surface feature, and determining the word vector similarity between the input word and the standard word as a semantic feature;
and performing weighted average processing on the literal features and the semantic features, and determining an obtained weighted average processing result as a recall index of the input word relative to each standard word in the term standard table.
12. The method of claim 9, wherein determining a target root level matching the input word among the plurality of root levels of the term criteria table comprises:
determining root level similarity between the input words and standard words of each root level in the term standard table respectively;
and determining the root level corresponding to the standard word with the root level similarity larger than the similarity threshold as a target root level matched with the input word.
13. A term processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring input words in a specific field;
the building module is used for building a first term operation tree corresponding to the input word;
a first recall module, configured to perform recall processing on the term standard table based on the first term operation tree and a second term operation tree pre-constructed for each standard word in the term standard table in the specific field, so as to obtain a plurality of candidate standard words corresponding to the input word;
a first determining module, configured to determine tree similarity between the first term operation tree and a second term operation tree corresponding to each candidate standard word, perform descending order on the multiple candidate standard words according to the tree similarity, and determine part of the candidate standard words located at the head in the descending order result as standard words to be checked;
and the second determining module is used for screening the composition of each standard word to be searched and determining the obtained standard word to be searched meeting the reasonableness index as the standard word matched with the input word.
14. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the term processing method as claimed in any one of claims 1 to 12 when executing executable instructions or computer programs stored in the memory.
15. A computer-readable storage medium storing executable instructions or a computer program, characterized in that the executable instructions, when executed by a processor, implement the term processing method of any one of claims 1 to 12.
16. A computer program product comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, implement the term processing method as claimed in any one of claims 1 to 12.
CN202111666306.7A 2021-12-31 2021-12-31 Term processing method, device, equipment, storage medium and program product Pending CN114330309A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111666306.7A CN114330309A (en) 2021-12-31 2021-12-31 Term processing method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111666306.7A CN114330309A (en) 2021-12-31 2021-12-31 Term processing method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN114330309A true CN114330309A (en) 2022-04-12

Family

ID=81020016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111666306.7A Pending CN114330309A (en) 2021-12-31 2021-12-31 Term processing method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN114330309A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080751A (en) * 2022-08-16 2022-09-20 之江实验室 Medical standard term management system and method based on general model
CN115080751B (en) * 2022-08-16 2022-11-11 之江实验室 Medical standard term management system and method based on general model
CN116127979A (en) * 2023-04-04 2023-05-16 浙江太美医疗科技股份有限公司 Named entity name standardization method and device, electronic equipment and storage medium
CN116127979B (en) * 2023-04-04 2023-09-19 浙江太美医疗科技股份有限公司 Named entity name standardization method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110442869B (en) Medical text processing method and device, equipment and storage medium thereof
US11625424B2 (en) Ontology aligner method, semantic matching method and apparatus
CN112183026B (en) ICD (interface control document) encoding method and device, electronic device and storage medium
CN116682553B (en) Diagnosis recommendation system integrating knowledge and patient representation
CN112149400B (en) Data processing method, device, equipment and storage medium
CN113707297A (en) Medical data processing method, device, equipment and storage medium
CN114330309A (en) Term processing method, device, equipment, storage medium and program product
CN113707339B (en) Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases
Toole et al. Inventing AI: Tracing the diffusion of artificial intelligence with US patents
CN113707307A (en) Disease analysis method and device, electronic equipment and storage medium
WO2022227203A1 (en) Triage method, apparatus and device based on dialogue representation, and storage medium
CN114912887B (en) Clinical data input method and device based on electronic medical record
CN117542467B (en) Automatic construction method of disease-specific standard database based on patient data
CN113657086B (en) Word processing method, device, equipment and storage medium
CN115438040A (en) Pathological archive information management method and system
CN113658720A (en) Method, apparatus, electronic device and storage medium for matching diagnostic name and ICD code
Memarzadeh et al. A study into patient similarity through representation learning from medical records
CN116992002A (en) Intelligent care scheme response method and system
Satti et al. Unsupervised semantic mapping for healthcare data storage schema
CN115132372A (en) Term processing method, apparatus, electronic device, storage medium, and program product
CN116230151A (en) Method, device and system for uploading double first pages of medical records
CN112182253B (en) Data processing method, data processing equipment and computer readable storage medium
Moya-Carvajal et al. ML models for severity classification and length-of-stay forecasting in emergency units
WO2021114626A1 (en) Method for detecting quality of medical record data and related device
Ma et al. Knowledge and data-driven prediction of organ failure in critical care patients

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination