CN116680377B - Chinese medical term self-adaptive alignment method based on log feedback - Google Patents


Info

Publication number
CN116680377B (application CN202310647595.9A)
Authority
CN (China)
Prior art keywords
term, log, sample, terms, medical
Prior art date
Legal status: Active
Application number
CN202310647595.9A
Other languages: Chinese (zh)
Other versions: CN116680377A (en)
Inventor
梁锐
唐珂轲
陈美莲
黄毅宁
钟冬赐
林少泽
吴豪
Current Assignee: Guangzhou Zhongkang Digital Technology Co ltd
Original Assignee: Guangzhou Zhongkang Digital Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Zhongkang Digital Technology Co ltd
Priority to CN202310647595.9A
Publication of CN116680377A
Application granted
Publication of CN116680377B
Legal status: Active
Anticipated expiration


Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F11/302 Monitoring arrangements where the monitored computing system component is a software system
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3438 Recording or statistical evaluation of user activity; monitoring of user actions
    • G06F16/23 Updating (structured data retrieval)
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a Chinese medical term self-adaptive alignment method based on log feedback, implemented with log feedback, weak supervision and contrastive learning. By recording the client's operation logs, analyzing them, identifying and extracting medical terms, and splitting concept subgraphs, training samples are constructed automatically and self-learning and automatic indexing are performed. The model can learn and improve itself as log data from downstream business systems is connected, after which the self-learned model is served to the downstream systems; through this closed loop of the whole flow, term alignment becomes automated and efficient.

Description

Chinese medical term self-adaptive alignment method based on log feedback
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a Chinese medical term self-adaptive alignment method based on log feedback.
Background
Medical concept alignment is an important research direction in the field of medical information processing. It mainly refers to standardizing the terms, symbols, abbreviations, etc. used in the medical field. In a medical information system, the same medical concept may be expressed by a number of different medical terms. This non-uniform and inaccurate mode of expression seriously hinders the integration, sharing and utilization of medical big data, and creates difficulties for clinical work, teaching and scientific research in the medical field; typical problems include confusion of terms, inaccurate information, omitted information, duplicated information, and difficulty communicating across institutions.
Medical institutions adopt manual coding, in which medical concepts in clinical texts are manually mapped to medical term codes based on existing standard medical term dictionaries. Manual coding requires a large number of professionals with medical knowledge to operate, and is costly, limited in efficiency and low in accuracy.
In recent years, to address the problems of cost and efficiency, deep neural networks and knowledge graphs have been widely applied, especially to Chinese medical terminology in the NLP field. Various approaches have emerged, such as finer-grained decomposition of text based on NER, entity extraction combining semi-supervised and active learning, and extraction based on deep models and search. These methods typically recall candidate terms and then re-rank them. Most of them need to collect data or translate from foreign English term libraries, with expert annotation where necessary. Evidently, several problems remain:
Problem 1, data collection is an expensive undertaking, requiring substantial funds, manpower and time;
Problem 2, the correctness of translation from an English term library to Chinese remains an issue; adding manual auditing can improve the accuracy of the Chinese translation, but for a term-alignment project the resulting audit workload is enormous and practically unbounded.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a Chinese medical term self-adaptive alignment method based on log feedback.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A Chinese medical term self-adaptive alignment method based on log feedback specifically comprises the following steps:
S1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
S2, a user inputs a medical term to query through the client; the application server then retrieves the concept codes related to the query term through the Chinese medical term alignment model of the term server and returns a candidate concept-code sequence, from which the user selects and submits candidate concept codes at the client; the log system of the application server records the user's query operations to obtain the user's operation log data, obtains the transaction log data generated by the application server, and feeds both the user's operation log data and the application server's transaction log data back to the log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain a high-quality training sample; the term server trains the Chinese medical term alignment model based on contrast learning by using the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
Further, the specific process of constructing the initial medical term sample data in step S1 is:
S1.1, selecting the UMLS open-source medical resource library and collecting UMLS terms;
S1.2, translating the UMLS terms collected in step S1.1.
Further, the specific implementation process of step S2 is as follows:
S2.1, instrumenting the client application interface with event tracking points and log-collection requests; once an event point is triggered, the client sends a log-record request to the application server via script code, completing one operation-log record;
S2.2, the log system of the application server responds to the log record request of the client, completes operation log record, and records the log of the service processing process of the application server to obtain transaction log data;
S2.3, synchronizing the operation log data and the transaction log data to the log warehouse:
The client's operation-log record request includes fields such as request IP, user uuid, time, event type and business parameters; the application server's transaction-log fields likewise include request time, user uuid, event type, event method, and business parameters;
All obtained logs are grouped by user uuid, the user's behavior process is restored in time order, the user's selections are extracted from it, and the final data annotation is completed; the log data is processed into the data structure (server code, uuid, date, queried medical term, candidate set, selection set, whether custom), and the formatted data is updated into the log warehouse.
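As an illustration, the grouping-and-restoration step above can be sketched in Python; the field names and event types below are hypothetical, chosen only to mirror the described structure, not the patent's actual schema:

```python
from collections import defaultdict

# Hypothetical raw log records (operation + transaction logs after cleaning).
raw_logs = [
    {"uuid": "u1", "time": 1, "event": "query", "term": "呼吸道感染"},
    {"uuid": "u1", "time": 2, "event": "candidates",
     "candidates": ["C0035243", "C0041912", "C0729531"]},
    {"uuid": "u1", "time": 3, "event": "select", "selected": ["C0035243"]},
]

def restore_sessions(logs):
    """Group logs by user uuid, sort by time, and emit one formatted
    record per query: (uuid, query term, candidate set, selection set)."""
    by_user = defaultdict(list)
    for rec in logs:
        by_user[rec["uuid"]].append(rec)
    records = []
    for uuid, events in by_user.items():
        events.sort(key=lambda r: r["time"])
        term, candidates, selected = None, [], []
        for ev in events:
            if ev["event"] == "query":
                term = ev["term"]
            elif ev["event"] == "candidates":
                candidates = ev["candidates"]
            elif ev["event"] == "select":
                selected = ev["selected"]
        records.append((uuid, term, candidates, selected))
    return records

sessions = restore_sessions(raw_logs)
```

The restored records then carry exactly the fields needed for the (query term, candidate set, selection set) annotation described above.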
Further, the specific process of step S3 is as follows:
S3.1, defining a sample format: first convert the structure of the log data into the form (term 1, term 2, y), $y \in \{1, -1\}$, each sample containing one term pair; the specific conversion rules are as follows:
(1) Constructing the queried medical terms corresponding to the selection set into positive samples;
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample;
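The two conversion rules above can be sketched as follows, assuming a restored log record already provides the query term, its candidate set and its selection set (names here are illustrative):

```python
def build_samples(query_term, candidate_terms, selected_terms):
    """Rule (1): pair the query term with each selected term -> label +1.
    Rule (2): pair the query term with each candidate that was NOT
    selected (the residual term set) -> label -1."""
    samples = []
    for t in selected_terms:
        samples.append((query_term, t, 1))
    for t in candidate_terms:
        if t not in selected_terms:
            samples.append((query_term, t, -1))
    return samples

pairs = build_samples("呼吸道感染",
                      ["呼吸道感染", "上呼吸道感染"],
                      ["呼吸道感染"])
```

Each emitted triple is one weakly supervised term-pair sample in the (term 1, term 2, y) format.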
S3.2, defining and establishing a learning model: from the positive and negative sample sets generated by the two rules in step S3.1, samples with frequency smaller than 3 are deleted, finally obtaining a sample set S; the number of log sources is M and the total sample data size is N. The sample labeling matrix is $A \in \{-1,1\}^{N \times (2M+|C|)}$, where C denotes the sampled pairwise combinations of log sources j and k and captures the correlation relations among sources; the true label of the samples is the hidden variable $Y \in \{-1,1\}^{N}$. The relation between the multi-source labels and the true labels is defined as a factor graph (a probabilistic graphical model), denoted $P_\theta(A,Y)$, with three factors defined as follows:
The labeling matrix generated by formula (1) is denoted $\phi^{Lab}$, the accuracy matrix generated by formula (2) is denoted $\phi^{Acc}$, and the correlation matrix generated by formula (3) is denoted $\phi^{Corr}$. Specifically, the element $\phi^{Lab}_{i,j}$ is defined from the log data of sample $x_i$ originating from source j: if source j marks the terms as similar then $\phi^{Lab}_{i,j}=1$, otherwise $\phi^{Lab}_{i,j}=-1$. For the element $\phi^{Acc}_{i,j}$: if the label assigned by the source agrees with the true label then $\phi^{Acc}_{i,j}=1$, otherwise $\phi^{Acc}_{i,j}=-1$. For the element $\phi^{Corr}_{i,j,k}$: if sample $x_i$ receives the same label from source j and source k then $\phi^{Corr}_{i,j,k}=1$, otherwise $\phi^{Corr}_{i,j,k}=-1$. Combining the three factors yields $\phi_i(A,Y_i) = \big[\phi^{Lab}_{i,\cdot};\, \phi^{Acc}_{i,\cdot};\, \phi^{Corr}_{i,\cdot}\big]$.
In summary, the combined factor vector is denoted $\phi_i(A,Y_i)$, and the learning model is defined as:
$$P_\theta(A,Y) = Z_\theta^{-1} \exp\Big(\theta^{\top} \sum_{i=1}^{N} \phi_i(A,Y_i)\Big)$$
where $Z_\theta$ is the normalizing constant and $\theta$ represents the weights of the probability distribution;
S3.3, training the learning model: for the learning model $P_\theta(A,Y)$ containing the hidden variables Y, the negative log marginal likelihood is minimized given the observable labeling matrix A:
$$\hat{\theta} = \arg\min_{\theta} \; -\log \sum_{Y} P_\theta(A,Y)$$
The optimization problem is solved by gradient descent, with the gradient estimated via a Gibbs sampling algorithm; Stanford's Snorkel toolkit is adopted for the solution, and the learned parameters are denoted $\theta^{*}$;
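The factor-graph fitting itself is delegated to Snorkel. As a simplified, self-contained stand-in, a weighted log-odds vote over the multi-source labels conveys the merging idea; note the per-source weights below are fixed by hand, whereas the learned model estimates them from A:

```python
import math

def soft_labels(A, weights):
    """Merge multi-source labels A[i][j] in {-1, 0, +1} (0 = abstain) into a
    soft probability P(y = +1 | A_i) via a weighted log-odds vote with
    per-source weights theta_j. This is a simplified stand-in for the
    factor-graph model P_theta(A, Y); the weights here are given, not
    learned by minimizing the marginal likelihood."""
    probs = []
    for row in A:
        z = sum(theta * a for theta, a in zip(weights, row))
        probs.append(1.0 / (1.0 + math.exp(-2 * z)))  # sigmoid of the vote
    return probs

# Three sources, two samples: all agree on sample 0, conflict on sample 1.
A = [[1, 1, 1],
     [1, -1, 0]]
p = soft_labels(A, weights=[1.0, 0.5, 0.5])
```

Unanimous agreement yields a confident soft label that survives the α ≥ 0.95 filter defined below, while conflicting sources yield an uncertain label that is filtered out.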
S3.4, with the parameters $\theta^{*}$ obtained in step S3.3, the trained learning model $P_{\theta^{*}}(Y \mid A)$ is obtained;
S3.5, through the learned model, the noisy labels of the multi-source samples are merged to obtain a soft label distribution, generating a term-pair sample set $X_{soft}$ with soft labels;
A filtering labeling threshold $\alpha$ ($\alpha \ge 0.95$) is set, $X_{soft}$ is filtered to obtain the term-pair sample set $X_{hard}$, and a concept graph is constructed:
B1, for the term-pair sample set $X_{hard}$, take terms as nodes of the concept graph and term pairs as edges between two nodes, constructing the concept subgraph $G_{sample}$ with term set $Term\_set_{sample}$;
B2, based on the UMLS prior library, with concept code CUI as the unit, take the term set $Term\_set_{umls}$ and create a node for each $T_i \in Term\_set_{umls}$ (nodes may be Chinese terms or English terms); all terms are built into the node set, and edges connect pairs of nodes within the same CUI term set, forming the independent concept subgraph $G_{UMLS}$;
B3, constructing the concept subgraphs $G_x$ of other term libraries following the same process as for $G_{UMLS}$;
B4, merging on identical term nodes, the concept graph G is obtained from the several concept subgraphs:
$$G = G_{sample} \cup G_{UMLS} \cup \bigcup_{x} G_{x}$$
Similarly, the overall set of medical terms is expressed as:
$$Term\_set = Term\_set_{sample} \cup Term\_set_{umls} \cup \bigcup_{x} Term\_set_{x}$$
B5, for the graph G, the independent connected subgraphs are separated, each defined as a concept subgraph $G^{sub}_{i}$; the acquisition uses the connected_components method of python's third-party package networkx, in the form:
$$\{G^{sub}_{1}, \dots, G^{sub}_{K}\} = \mathrm{connected\_components}(G)$$
For each $G^{sub}_{i}$, a unique global concept code is assigned, called the concept code and denoted $gid^{C}_{i}$. The node terms of the subgraph corresponding to $gid^{C}_{i}$ are collected into the constructed medical-term semantic equivalence set $S^{C}_{i}$; the full-scale term equivalence set is denoted $S^{C}$;
After unifying the concept codes $gid^{C}$, the term mapping list gid2cid_list between $gid^{C}$ and the public open-source term libraries is obtained simultaneously;
The medical terms of Term_set are numbered automatically; the numbered medical term set is denoted Term_set' and the code number field is denoted tid, while the term mapping list tid2gid_list between $gid^{C}$ and Term_set' is obtained.
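Step B5's subgraph separation can be sketched without networkx by a plain breadth-first search over the merged concept graph; the terms and edges below are toy illustrative data, not from an actual term library:

```python
def connected_components(nodes, edges):
    """BFS connected components over an undirected graph; a pure-Python
    stand-in for networkx.connected_components used in step B5."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, queue = set(), [n]
        while queue:
            cur = queue.pop()
            if cur in comp:
                continue
            comp.add(cur)
            queue.extend(adj[cur] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Toy concept graph: two term-equivalence clusters.
nodes = ["呼吸道感染", "respiratory tract infection", "上呼吸道感染", "URI"]
edges = [("呼吸道感染", "respiratory tract infection"), ("上呼吸道感染", "URI")]
# Assign each connected subgraph a unique concept code gid.
S_C = {f"gid{i}": comp
       for i, comp in enumerate(connected_components(nodes, edges))}
```

Each resulting component plays the role of one concept subgraph, and its node set is the semantic equivalence set attached to that concept code.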
Further, the Chinese medical term alignment model mainly comprises a term library, a text search engine and a semantic search engine; the term library contains the open term libraries, the concept-code-to-term-library mapping table gid2cid_list, and the term concept equivalence set $S^{C}$;
the text search engine is specifically constructed as follows:
C1, acquire the concept set $S_{Concept}$, recorded with the term concept as the unit; create a data model whose fields include ID, English standard word, English synonyms, Chinese standard word and Chinese synonyms, where the ID is $gid^{C}_{i}$; the corresponding English standard word is obtained from the open-source library, and if no English standard word exists, the first corresponding English term in the open-source library is selected as the standard word, with the remaining English terms used as English synonyms;
C2, performing text cleaning on the data of the term library, then storing the data into a database, and establishing an index for the term;
C3, for a given query Chinese term $Q_{zh\_txt}$, perform open professional medical-term dictionary translation or open API translation to obtain the English term $Q_{en\_txt}$; $Q_{zh\_txt}$ and $Q_{en\_txt}$ undergo text cleaning and word segmentation, and the BM25 similarity of the corresponding Chinese and English fields is calculated respectively;
the medical term text similarity is calculated as follows:
wherein D zh_fsn,Dzh_sim,Dzh_fsn,Dzh_sim respectively represents English standard words, english synonyms, chinese standard words and field documents of Chinese synonyms, alpha i is a super-parameter, and represents the weight of each term, and the method meets the following requirements:
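A minimal sketch of the weighted-field BM25 scoring follows; this is a bare-bones BM25 with default parameters, not a production search engine's tuned implementation, and the field contents are illustrative:

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Minimal BM25 of one document against a query; `corpus` is the list
    of all field documents (token lists) used for IDF and average length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_tokens:
        df = sum(1 for d in corpus if q in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_tokens.count(q)
        denom = tf + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

def text_similarity(query_fields, doc_fields, corpora, alphas):
    """Weighted sum of per-field BM25 scores; the alphas must sum to 1,
    mirroring the constraint on the field weights."""
    assert abs(sum(alphas) - 1.0) < 1e-9
    return sum(a * bm25_score(q, d, c)
               for a, q, d, c in zip(alphas, query_fields, doc_fields, corpora))

# Illustrative field corpora (English and Chinese standard-word fields only).
corpus_en = [["respiratory", "tract", "infection"], ["heart", "failure"]]
corpus_zh = [["呼吸道", "感染"], ["心力", "衰竭"]]
score = text_similarity(
    query_fields=[["infection"], ["感染"]],
    doc_fields=[corpus_en[0], corpus_zh[0]],
    corpora=[corpus_en, corpus_zh],
    alphas=[0.5, 0.5],
)
```

A document sharing no query tokens scores zero in that field, so the fused score is dominated by the fields where the query actually matches.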
The semantic search engine is modeled on the SimCSE model of contrastive learning; the specific process is as follows:
D1, model selection: the two terms are encoded separately and their similarity is then calculated; the SimCSE model is selected and trained based on contrastive learning;
D2, since SimCSE is a contrastive learning model, it can learn and train without annotation, so part of the samples are constructed from the terms themselves, denoted $X^{self}_{cse}$; in addition, since a large amount of supervised data is obtained from the log data, a hybrid SimCSE model is adopted: with the term concept as the unit, samples are combined pairwise from $S^{C}$ to form $X^{pair}_{cse}$; finally, $X_{cse} = X^{self}_{cse} \cup X^{pair}_{cse}$;
D3, forward calculation: for a sample $x_i \in X_{cse}$, the two terms are passed through the encoder, expressed as:
$$vec_1, vec_2 = encoder(x_{i,1}, \mu_1),\; encoder(x_{i,2}, \mu_2)$$
where $encoder(\cdot)$ denotes the encoder function (Chinese BERT is taken), $\mu_1, \mu_2$ denote the Dropout-layer parameters in the encoder (the larger the setting, the more neural units the network deactivates), and $x_{i,1}, x_{i,2}$ denote terms 1 and 2 of sample $x_i$ respectively;
Because the input is a term pair, the pair itself serves as the positive sample in the model's calculation, while the negative samples are formed within a training batch by pairing with $x_{j,2}$ of the other samples $x_j \in X_{cse}$, $j \neq i$, in the same batch;
D4, defining the loss function; for a batch size of B, the training loss function is:
$$\mathcal{L} = -\sum_{i=1}^{B} \log \frac{e^{\,sim(vec_{i,1},\, vec_{i,2})/\tau}}{\sum_{j=1}^{B} e^{\,sim(vec_{i,1},\, vec_{j,2})/\tau}}$$
where $\tau$ is a temperature hyper-parameter and the similarity is calculated by cosine:
$$sim(u, v) = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}$$
D5, reverse calculation: the gradients are solved and iterative backward updates are performed to complete model learning, yielding the inference encoder $encoder^{*}(\cdot) = encoder(\cdot, 0)$, i.e. the learned encoder with Dropout disabled;
D6, indexing the term vector set; for each $T_i' \in Term\_set'$, the vector set Term_vecs is calculated as follows:
$$term\_vec_i = encoder^{*}(T_i'), \quad term\_vec_i \in Term\_vecs$$
D7, index the Term_vecs vector set and search: a vector database or a vector retrieval component or tool is adopted to store and retrieve the vector set, and cosine calculation is adopted for vector similarity;
For $Q_{zh\_txt}$ and $Q_{en\_txt}$, encode first, then calculate cosine similarity to obtain the query scores:
$$score_{zh\_vec,i} = sim\big(encoder^{*}(Q_{zh\_txt}),\, term\_vec_i\big), \quad score_{en\_vec,i} = sim\big(encoder^{*}(Q_{en\_txt}),\, term\_vec_i\big)$$
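A toy sketch of step D7's vector retrieval with cosine similarity; the 3-dimensional vectors below stand in for real encoder* outputs, and a production system would use a vector database or retrieval component as noted above:

```python
import math

def cosine(u, v):
    """Cosine similarity sim(u, v) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_search(query_vec, term_vecs, top_k=3):
    """Rank all indexed term vectors by cosine similarity to the query
    vector and return the top_k (tid, score) pairs."""
    scored = [(tid, cosine(query_vec, v)) for tid, v in term_vecs.items()]
    scored.sort(key=lambda x: -x[1])
    return scored[:top_k]

# Toy index: three term vectors keyed by tid.
term_vecs = {"t1": [1.0, 0.0, 0.0], "t2": [0.7, 0.7, 0.0], "t3": [0.0, 0.0, 1.0]}
top = semantic_search([1.0, 0.1, 0.0], term_vecs, top_k=2)
```

The query vector closest in direction ranks first, independent of vector magnitude, which is why cosine rather than Euclidean distance fits normalized sentence embeddings.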
Furthermore, the Chinese medical term alignment model retrieves the concept codes related to the query term and returns the candidate concept-code sequence as follows:
E1, for the Chinese term $Q_{zh\_txt}$ input by the user, translate to obtain the corresponding English term $Q_{en\_txt}$; candidate sequences are retrieved, the TOP 150 are taken, and the scores are normalized to [0,1], obtaining three final sequences $cand\_seq_{txt}$, $cand\_seq_{vec\_zh}$, $cand\_seq_{vec\_en}$;
E2, the sequences $cand\_seq_{vec\_zh}$ and $cand\_seq_{vec\_en}$ are mapped to $gid^{C}$ via the mapping list tid2gid_list; when several tids map to one $gid^{C}$ in this process, the maximum value is taken as the score, thereby forming sequences of $(gid^{C}, score)$ elements, denoted $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$;
E3, merging $cand\_seq_{txt}$, $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$ with $gid^{C}$ as the key, an outer-join operation is adopted to generate the final candidate sequence cand_seq, and the weighted score is calculated as follows:
$$score_i = \kappa_1 \cdot score_{txt,i} + \kappa_2 \cdot score_{zh\_vec,i} + \kappa_3 \cdot score_{en\_vec,i}$$
where $score_{txt,i}, score_{zh\_vec,i}, score_{en\_vec,i}$ respectively represent the scores in $cand\_seq_{txt}, cand\_seq'_{vec\_zh}, cand\_seq'_{vec\_en}$, and $\kappa_i$ is a hyper-parameter representing the weight of each score, satisfying:
$$\kappa_i \in [0,1], \quad \sum_{i=1}^{3} \kappa_i = 1$$
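Step E3's outer join and weighted fusion can be sketched as follows; the candidate sequences are represented as gid-to-score dicts with made-up values, and the κ weights are illustrative:

```python
def merge_candidates(seq_txt, seq_zh, seq_en, kappas=(0.4, 0.3, 0.3)):
    """Outer-join the three candidate sequences on the concept code gid
    and compute the weighted score; a gid missing from a sequence
    contributes 0 to that component, as in an outer join."""
    assert abs(sum(kappas) - 1.0) < 1e-9
    gids = set(seq_txt) | set(seq_zh) | set(seq_en)
    merged = {
        gid: kappas[0] * seq_txt.get(gid, 0.0)
           + kappas[1] * seq_zh.get(gid, 0.0)
           + kappas[2] * seq_en.get(gid, 0.0)
        for gid in gids
    }
    return sorted(merged.items(), key=lambda x: -x[1])

cand = merge_candidates({"g1": 0.9, "g2": 0.4},
                        {"g1": 0.8},
                        {"g2": 0.6, "g3": 0.5})
```

A concept supported by both text and semantic retrieval (g1 here) outranks concepts supported by only one channel, which is the point of the fused score.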
The invention has the beneficial effects that:
1. The method is realized based on log feedback, weak supervision and contrastive learning: by recording the client's operation logs, analyzing the behavior captured in them, and identifying and extracting medical terms, training samples are constructed automatically and self-learning and automatic indexing are achieved. The model learns and improves itself as log data from downstream business systems is connected, and the self-learned model is then served back to the downstream systems; through this closed loop of the whole process, term alignment becomes automated and efficient.
2. A graph model is adopted to partition and split the medical terms, embodying the core idea of term concepts. An abstract medical concept exists only as an intangible notion, while the terms that represent it are scattered across term data and databases. The graph effectively concatenates these fragmented terms and assigns each concept a unique concept code; with the constructed graph model, a graph algorithm automatically splits out the connected subgraphs, producing the concept subgraphs.
3. For the term feedback in the logs, a weakly supervised data-processing scheme is innovatively provided: a learning model built over noisy annotations and true annotations performs inference on the noisy labels. The scheme suits services with multiple application scenarios, resolves the data-quality problems caused by differences among many users, and improves the performance of the final model.
4. Term vector embedding with a hybrid SimCSE model is proposed, which unifies training on single terms and on similar term pairs, thereby completing the embedding of massive term sets.
5. The alignment model computes over a combination of text and semantics, uniting the advantages of both. Text search is simple and efficient: once data saturates with the running and iteration of the model, direct text matching and retrieval alone achieves a good term-alignment effect, and it also covers basic matching when term samples are insufficient to train the model adequately. Semantic retrieval directly reflects the semantic characteristics of Chinese medical terms and resolves cases where literally similar strings are semantically inconsistent.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an implementation architecture of a method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture of a Chinese medical term alignment model according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, and it should be noted that, while the present embodiment provides a detailed implementation and a specific operation process on the premise of the present technical solution, the protection scope of the present invention is not limited to the present embodiment.
The embodiment provides a Chinese medical term self-adaptive alignment method based on log feedback, which is realized based on log feedback, weak supervision and contrast learning. The specific process is shown in fig. 1.
S1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
S2, a user can input medical terms of a query through a client (i, j and k shown in fig. 2), such as inputting a query word of 'respiratory tract infection'; then the application server (g and h shown in figure 2) searches the concept codes related to the query terms through the Chinese medical term alignment model of the term server (a-f shown in figure 2) and returns a candidate concept code sequence, and at the moment, the user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
the log warehouse learns the log data fed back by the application server through weak supervision to obtain a high-quality training sample; the term server trains the Chinese medical term alignment model based on contrast learning by using the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
The main purpose of this embodiment is to solve the alignment problem of Chinese medical terms. In the initial stage, therefore, massive open-source English term libraries are used as the starting point, and authoritative Chinese-English translation dictionaries and standards are collected as bridges to construct the initial medical term samples. The specific process of step S1 is as follows:
S1.1, selecting the UMLS open-source medical resource library and collecting UMLS terms. As is well known, UMLS is a union of a large number of term libraries, including SNOMED CT, the ICD series, MeSH, LOINC, ATC, etc., which are linked whenever the CUI code of a UMLS term (the unique code representing a concept in the UMLS medical library) is known.
S1.2, translating the UMLS terms collected in step S1.1. Term translation plays a critical role; the main sources of translation data selected in this example are:
1. Chinese-English comparison tables issued by authorities, such as the International Classification of Diseases, 10th revision, the Chinese electronic version of SNOMED 3.4 published by the institute of health and management of China, the ATC Chinese version, and so on, recorded as CN_medical_terminal_Set1.
2. Medical dictionaries, such as the authoritative Xiangya Medical Dictionary, the English-Chinese Medical Cihai, the English-Chinese Chinese-English Bidirectional Medical Dictionary, and so on. English terms are matched against these dictionaries by full-word matching.
Further, in step S2, the difficulty of collecting data samples is solved through user behavior fed back in logs and the application server's transaction-log collection. The collected logs comprise the operation logs generated by users' term-alignment actions on the client and the transaction logs generated by the application server from those user operations. The application server synchronizes the operation logs collected at each node, together with its transaction logs, to the log warehouse of the term server at fixed intervals or in real time. The specific implementation of step S2 is as follows:
S2.1, instrumenting the client application interface with event tracking points and log-collection requests, for example event points for candidate-concept-code loading completion, selection, and completion-and-submission. Once an event point is triggered, the client sends a log-record request to the application server via script code, completing one operation-log record; for example, client-side JavaScript sends an asynchronous request to the server via Ajax.
S2.2, the log system of the application server responds to the log-record request of the client, completes the operation-log record, and also logs the application server's own business processing to obtain transaction log data. For example, when a user queries the term "respiratory tract infection" at the application interface of the client, the log system of the application server receives the query log-record request of the client and, at the same time, the business logic processes the request for the candidate term list and related data, such as "respiratory tract infection (C0035243)", "upper respiratory tract infection (C0041912)", "upper respiratory tract infection (C0149725)", "viral respiratory tract infection (C0729531)", where the codes in brackets are the concept codes after term standardization.
S2.3, log synchronization. This embodiment synchronizes the logs to the log warehouse based on a real-time log-stream technique, combining Flume + Kafka + Spark Streaming: Flume reads the log information, Kafka queues it, Spark Streaming consumes it, filters and cleans it based on rules, and finally stores it in the log warehouse. The storage medium may be a file system, a distributed file system, or a relational database such as MySQL, a NoSQL database, etc. This embodiment is based on Hadoop, adopting the HDFS distributed file system and the Hive data warehouse to store the data. Hadoop is a distributed system infrastructure developed by the Apache foundation; HDFS is a file system built on top of Hadoop; and Hive is a Hadoop-based data-warehouse tool for data extraction, transformation and loading, which provides a mechanism to store, query and analyze large-scale data stored in Hadoop.
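The rule-based filter-and-clean step applied by the streaming consumer before writing to the log warehouse can be sketched as follows; the required fields and cleaning rules are illustrative assumptions, not the embodiment's exact schema:

```python
import json

# Minimal sketch of the rule-based filter/clean applied to each raw log line
# before it reaches the log warehouse; field names are illustrative.
REQUIRED_FIELDS = {"uui", "time", "event_type"}

def clean_log_line(line: str):
    """Parse one raw log line; return a cleaned dict or None if it fails the rules."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None                       # drop malformed lines
    if not REQUIRED_FIELDS.issubset(record):
        return None                       # drop records missing mandatory fields
    record["event_type"] = record["event_type"].strip().lower()
    return record
```

In the embodiment this logic would run inside the Spark Streaming consumer; here it is a pure function so the rule can be shown in isolation.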
The content structure of the log data collection and the final output data form are illustrated below.
The operation-log record request of the client comprises the following main fields: request IP, user uui, time, event type, service parameters, etc. In addition, the transaction log of the application server also includes the request time, user uui, event type, event method, service parameters, etc.
All the obtained logs are grouped by user uui, the user behavior process is restored in time order, the user's selection is extracted from the sequence, and the final data annotation is completed. For example, for one user uui, this process may wash out the candidate concept-code set "[C0035243, C0041912, C0149725, C0581381, C0729531]" and the user's selection "C0035243", finally forming the data structure (server code, uui, date, medical term queried, candidate set, selection set, whether custom) in the format (s10009, 1657008203_1315242ec22b0d006e24624429a744b, 2022-06-06, {respiratory tract infection}, {C0035243, C0041912, C0149725, C0581381, C0729531}, {C0035243}, no), and the formatted log data is updated to the Hive data warehouse in the log warehouse.
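The grouping-and-restoring step can be sketched as follows; the event names ("query", "candidates_loaded", "submit") and payload keys are assumptions for illustration, since the full buried-point schema is not given:

```python
from collections import defaultdict

# Sketch of restoring each user's behavior sequence from mixed logs and
# extracting the (query term, candidate set, selection set) triple.
def format_session(logs):
    """logs: list of dicts with keys uui, time, event_type, payload (assumed schema)."""
    sessions = defaultdict(list)
    for log in logs:
        sessions[log["uui"]].append(log)          # group by user uui
    records = []
    for uui, events in sessions.items():
        events.sort(key=lambda e: e["time"])      # restore temporal order
        query = candidates = selection = None
        for e in events:
            if e["event_type"] == "query":
                query = e["payload"]["term"]
            elif e["event_type"] == "candidates_loaded":
                candidates = e["payload"]["codes"]
            elif e["event_type"] == "submit":
                selection = e["payload"]["codes"]
        if query and candidates and selection:    # keep only complete sessions
            records.append((uui, query, candidates, selection))
    return records
```

Each returned tuple corresponds to one row of the formatted data structure written to the Hive warehouse.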
In this embodiment, the log data comes from the log systems of a plurality of application servers, each of which serves a plurality of business-scenario applications, and each application may be used by annotators with different characteristics. The data samples can therefore be regarded as datasets from a plurality of heterogeneous sources; a great deal of noise and even conflicting labels is inevitable, since users understand the complexity of the actual business at different angles and granularities and make different judgments.
For multi-source noise data samples, noise elimination is needed, and then labels are combined to construct a better training corpus. The specific process of step S3 is as follows:
S3.1, defining a sample format. The structure of the log data (server code, uui, date, medical terms of query, candidate set, selection set, whether custom) is first converted into a form (term 1, term 2, {1, -1 }) with each sample containing a term pair. The specific conversion rule is as follows:
(1) The medical terms of the query are constructed as positive samples with terms corresponding to the selection set. For example, the query word "respiratory tract infection" input by the user and the term set corresponding to the C0035243 selected by the user are { "respiratory tract (upper respiratory tract and lower respiratory tract) infection", "upper and lower respiratory tract infection" }, and positive samples (respiratory tract infection, respiratory tract (upper respiratory tract and lower respiratory tract) infection, 1) and (respiratory tract infection, upper and lower respiratory tract infection, 1) are constructed to form positive sample sets { (respiratory tract infection, respiratory tract (upper respiratory tract and lower respiratory tract) infection, 1), (respiratory tract infection, upper and lower respiratory tract infection, 1) };
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample. For example, the medical term "respiratory tract infection" of a query is constructed with the term set of candidate sets { C0041912, C0149725, C0581381, C0729531}, resulting in a negative sample set { (respiratory tract infection, upper Respiratory Tract Infection (URTI), -1), (respiratory tract infection, respiratory tract viral infection, -1), }.
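The two conversion rules above can be sketched as follows, assuming a hypothetical code2terms lookup that maps each concept code to its term set:

```python
# Sketch of rules (1) and (2): turning one formatted log record into
# (term1, term2, {1, -1}) samples; code2terms is an assumed lookup table.
def build_samples(query_term, candidate_codes, selected_codes, code2terms):
    positives, negatives = [], []
    for code in selected_codes:                       # rule (1): selected codes -> positive pairs
        for term in code2terms.get(code, []):
            positives.append((query_term, term, 1))
    for code in set(candidate_codes) - set(selected_codes):
        for term in code2terms.get(code, []):         # rule (2): rejected candidates -> negative pairs
            negatives.append((query_term, term, -1))
    return positives, negatives
```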
S3.2, defining and establishing a learning model. Samples whose frequency is smaller than 3 are deleted from the positive and negative sample sets generated by the two rules in step S3.1, finally obtaining a sample set S; the number of log sources is M and the total sample size is N. The sample labeling matrix is A ∈ {-1, 1}^(N×(2M+|C|)), where C represents the sampled combinations of log sources (i, k) used to model the correlation between sources; the true label of each sample is Y ∈ {-1, 1}^N, which is a hidden variable. Inspired by weak supervision, this embodiment defines the relation between the multi-source labels and the true labels as a factor graph, a kind of probabilistic graphical model, denoted P_θ(A, Y), with three factors defined as follows:
The label matrix generated by formula (1) is denoted φ^Lab(A, Y), the accuracy matrix generated by formula (2) is denoted φ^Acc(A, Y), and the correlation matrix generated by formula (3) is denoted φ^Corr(A, Y). Specifically, the element φ^Lab_(i,j) corresponds to the log data of sample x_i derived from source j: if source j labels the terms as similar, φ^Lab_(i,j) = 1, otherwise φ^Lab_(i,j) = -1. For the element φ^Acc_(i,j): if the label given by source j is consistent with the true label y_i, φ^Acc_(i,j) = 1, otherwise φ^Acc_(i,j) = -1. For the element φ^Corr_(i,j,k): if the labels of sample x_i from source j and source k are the same, φ^Corr_(i,j,k) = 1, otherwise φ^Corr_(i,j,k) = -1. Combining the three factors gives the factor vector φ_i(A, y_i) = [φ^Lab_i, φ^Acc_i, φ^Corr_i].
To sum up, the combined factor expression is denoted φ(A, Y) = [φ_1(A, y_1), ..., φ_N(A, y_N)], and the learning model is defined as:

P_θ(A, Y) = Z_θ^(-1) · exp( Σ_(i=1..N) θ^T φ_i(A, y_i) )

where Z_θ is the normalizing constant and θ represents the weight vector of the probability distribution.
S3.3, training the learning model. For the learning model P_θ(A, Y) containing the hidden variable Y, the negative log marginal likelihood is minimized based on the label matrix A, which is visible from the log annotations:

min_θ ( -log Σ_Y P_θ(A, Y) )
The optimization problem is solved by gradient descent, with the gradients estimated by a Gibbs sampling algorithm; Stanford's Snorkel toolkit is adopted for the solution, and the learned parameters are recorded as θ*.
S3.4, the parameters θ* of the learning model are obtained from step S3.3, giving the trained learning model P_θ*(A, Y), i.e. the posterior P_θ*(Y | A) used to infer the hidden true labels.
S3.5, through the trained learning model, the noisy labels of the multi-source samples are fused to obtain a soft-label distribution. For example, for the medical entity "respiratory tract infection", a soft-labeled term-pair sample set X_soft is generated: {(respiratory tract infection, upper and lower respiratory tract infection, 1.0), (respiratory tract infection, respiratory tract (upper and lower respiratory tract) infection, 0.989), (respiratory tract infection, upper respiratory tract infection (URTI), 0.75), (respiratory tract infection, respiratory tract viral infection, 0.65)}.
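As a much-simplified stand-in for the trained factor-graph model (the embodiment itself uses the Snorkel toolkit), multi-source labels can be fused into soft labels by weighting each source by its estimated accuracy; the accuracy estimate via majority-vote agreement is an assumption of this sketch:

```python
import numpy as np

# Simplified label fusion: sources are weighted by their (smoothed) agreement
# with the rough majority vote, and the weighted vote is squashed into [0, 1].
def fuse_labels(A):
    """A: (N, M) label matrix over {-1, 0, 1}; 0 means the source gave no label."""
    majority = np.sign(A.sum(axis=1))                          # rough consensus per sample
    labeled = A != 0
    agree = (A == majority[:, None]) & labeled
    acc = (agree.sum(axis=0) + 1) / (labeled.sum(axis=0) + 2)  # smoothed per-source accuracy
    weights = np.log(acc / (1 - acc))                          # log-odds weight per source
    scores = (A * weights).sum(axis=1)
    return 1 / (1 + np.exp(-scores))                           # soft label per sample
```

A soft label near 1.0 plays the role of the 0.989-style scores in X_soft above; the real generative model additionally accounts for source correlations.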
So far, the similarity relationship between term pairs has been handled at the data layer; this embodiment further unifies medical concepts at the term-semantic layer, representing each concept with a unique concept code, such as the CUI of UMLS and the CODE of SNOMED CT. Specifically, a filtering threshold α (α ≥ 0.95) is set, X_soft is filtered to obtain the term-pair sample set X_hard, and a concept graph is constructed. The specific process is as follows:
B1, for the term-pair sample set X_hard, each term is taken as a node of the concept graph and each term pair forms an edge between two nodes, constructing the concept subgraph G_sample, whose term set is Term_set_sample. For example, for (respiratory tract infection, upper and lower respiratory tract infection, 1.0), the two nodes are node_1 = {code = "sample_1", name = "respiratory tract infection"} and node_2 = {code = "sample_2", name = "upper and lower respiratory tract infection"}, and the edge is edge_12 = <node_1, node_2>.
B2, based on the UMLS prior library and taking the concept code CUI as the unit, the term set Term_set_umls is taken; each T_i ∈ Term_set_umls creates a node node_i = {code = "umls_i", name = T_i's string}, where a node can be a Chinese term or an English term. All terms are built into a node set, and the nodes within the same CUI term set are connected pairwise by edges, forming the independent concept subgraph G_UMLS.
B3, constructing a conceptual subgraph G x of other term libraries according to the process of constructing G UMLS, for example, constructing G TCM for related word libraries of traditional Chinese medicine.
B4, a concept graph G is obtained by merging the several concept subgraphs on nodes whose terms are identical:

G = G_sample ∪ G_UMLS ∪ (∪_x G_x)

Similarly, the overall set of medical terms is expressed as:

Term_set = Term_set_sample ∪ Term_set_umls ∪ (∪_x Term_set_x)
B5, for the graph G, the independent connected subgraphs are separated; each connected subgraph is defined as a concept subgraph G_i^C. The acquisition uses the connected_components method of the Python third-party package networkx, in the form:

{G_i^C} = networkx.connected_components(G)
Each G_i^C is given a unique global concept code, called the concept code and denoted gid_i^C. The node terms in the subgraph corresponding to gid_i^C are acquired, and the medical-term semantic equivalence set S_i^C is constructed; the full set of term equivalence sets is denoted S_C.
After the concept codes gid^C are unified, the term mapping relation list gid2cid_list between gid^C and the public source term libraries is obtained at the same time.
The medical terms of Term_set are automatically numbered; the numbered medical-term set is denoted Term_set' and the code-number field is denoted tid, while the term mapping list tid2gid_list between gid^C and Term_set' is obtained.
For example, for the concept coded "0085580", the corresponding medical-term equivalence set is {"essential hypertension", "idiopathic (primary) hypertension", "primary hypertension", "Essential hypertension", "Primary hypertension", "hypertension"}, and the corresponding gid2cid_list entry is ["0085580": {"umls": "C0085580", "sct": "194760004, 59621000", "icd10": "I10", ...}], with tid2gid_list entries such as ("0002": "0085580").
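The connected-component separation of step B5 can be reproduced without networkx by a small union-find; this pure-Python sketch is an equivalent of networkx.connected_components, shown for illustration:

```python
# Pure-Python union-find equivalent of networkx.connected_components for the
# concept graph G: nodes are terms, edges come from term pairs and CUI sets.
def connected_components(nodes, edges):
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path compression
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two endpoints
    comps = {}
    for n in nodes:
        comps.setdefault(find(n), set()).add(n)
    return list(comps.values())
```

Each returned component corresponds to one G_i^C and would be assigned one global concept code gid_i^C.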
Term models can be technically divided into two main types: text-similarity models and semantic-similarity models. Text similarity generally calculates the distance between two terms based on character matching. For semantic-similarity calculation, a term is first embedded into some vector space to obtain its vector representation, such as the encoding of word2vec or a BERT model, and the approximation is then calculated based on a distance formula. However, the accuracy of general-purpose embedding models is not high and hardly meets the semantic requirements of terms; even large models such as BERT exhibit a pronounced representation-collapse phenomenon.
To sum up, this embodiment proposes a Chinese medical term alignment model that fuses Chinese-English text retrieval with contrastive-learning semantic reasoning, as shown in Fig. 3 (taking UMLS as the example). The Chinese medical term alignment model mainly comprises a term library, a text search engine and a semantic search engine. The term library comprises the contents of the open term libraries, the concept-code-to-term-library mapping table gid2cid_list, the term-concept equivalence set S_C, and the like. The text search engine covers index persistence and retrieval-similarity calculation for Chinese and English terms. The semantic search engine covers sampling strategies, semantic-model training, vector-data persistence, semantic retrieval and other related contents.
The text search engine is specifically constructed as follows:
C1, the concept set S_Concept is acquired and recorded in units of term concepts. The data model of the term-library table header shown in Fig. 3 (ID, English standard word, English synonyms, Chinese standard word, Chinese synonyms) is created, where the ID is gid^C. The corresponding English standard word is obtained from the open-source library; if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word, and the remaining English terms are used as English synonyms.
C2, the data of the term library is text-cleaned, e.g. converted to lower case and stripped of special symbols, and then stored in a database with indexes built for the terms. In this embodiment ES (Elasticsearch) is selected; the data model creates the corresponding fields, and the characters are filtered and processed by the "html_strip", "lowercase", "asciifolding", "stemmer", "english_stop" and similar filters; then English word segmentation and Chinese word segmentation are performed, and the inverted index is stored in the database.
C3, for a given query Chinese term Q_zh_txt, the English term Q_en_txt is obtained through open professional medical-term dictionary translation or open API translation; Q_zh_txt and Q_en_txt are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated respectively.
Note that the calculation formula of BM25 is as follows:

score_bm25(Q, D) = Σ_(q_i ∈ Q) ln( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 ) · f(q_i, D)·(k1 + 1) / ( f(q_i, D) + k1·(1 - b + b·|D|/avgdl) )

where score_bm25(Q, D) represents the approximation score of query term Q and term document D, q_i represents an element after segmenting Q, n(q_i) represents the number of term documents containing element q_i, N represents the number of all term documents, f(q_i, D) represents the frequency of element q_i in document D, |D| represents the term-document length, avgdl represents the average length of term documents, and k1, b are hyper-parameters with default values 1.2 and 0.75 respectively. From this, the medical-term text-similarity calculation can be deduced as:

score_txt(Q) = α_1·score_bm25(Q_en_txt, D_en_fsn) + α_2·score_bm25(Q_en_txt, D_en_sim) + α_3·score_bm25(Q_zh_txt, D_zh_fsn) + α_4·score_bm25(Q_zh_txt, D_zh_sim)

where D_en_fsn, D_en_sim, D_zh_fsn, D_zh_sim respectively represent the field documents of the English standard word, English synonyms, Chinese standard word and Chinese synonyms, and α_i is a hyper-parameter representing the weight of each field, satisfying:

α_i ∈ [0, 1], and Σ_i α_i = 1
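The BM25 calculation can be sketched over a toy corpus of segmented term documents as follows; the +1 inside the logarithm is the Lucene/Elasticsearch-style IDF smoothing, assumed here as one common choice:

```python
import math

# Minimal BM25 over a toy corpus of pre-segmented term documents,
# with the default hyper-parameters k1 = 1.2 and b = 0.75.
def bm25_score(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_tokens:
        n_q = sum(1 for d in corpus if q in d)            # document frequency n(q)
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1) # smoothed IDF
        f = doc_tokens.count(q)                           # term frequency f(q, D)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score
```

In the embodiment this scoring is performed by the Elasticsearch inverted index rather than by hand; the sketch only makes the formula concrete.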
The semantic search engine is modeled on the SimCSE model of contrastive learning; the specific process is as follows:
D1, model selection. The semantic framework selected in this embodiment is based on a dual-tower model, that is, the two terms are first encoded separately and their similarity is then calculated. In addition, in order to overcome the collapse phenomenon of large models, the SimCSE model is selected and the terms are trained based on contrastive learning;
D2, since SimCSE is a contrastive-learning model, it can be trained in a self-supervised way without annotation, so part of the samples are constructed from the terms themselves and recorded as a self-supervised sample set. In addition, since this embodiment obtains a large amount of supervision data from the log data, a hybrid SimCSE model is adopted: samples are combined pairwise from S_C in units of term concepts to form a supervised sample set, finally obtaining the overall sample set X_cse.
D3, forward calculation: for a sample x_i ∈ X_cse there are self-supervised samples of the form <headache, headache> and supervised samples pairing "headache" with its equivalent terms; the two terms are encoded by the encoder respectively, and this embodiment selects Chinese BERT as the encoder, which can be expressed as:
vec_1, vec_2 = encoder(x_(i,1), μ_1), encoder(x_(i,2), μ_2)
where encoder(·) represents the encoder function, here Chinese BERT; μ_1, μ_2 represent the parameters of the Dropout layers in the encoder (the larger the setting, the more neural units the network drops); and x_(i,1), x_(i,2) represent the first and second terms of sample x_i, respectively.
Because the input is a term pair, the term pair itself is taken as the positive sample in the model calculation, while the negative samples are formed by pairing x_(i,1) with the second terms x_(j,2) of the other samples in the same training batch, where x_j ∈ X_cse, j ≠ i.
D4, defining the loss function. For a batch size of B, the training loss function is:

L = -(1/B) Σ_(i=1..B) log( e^(sim(vec_(i,1), vec_(i,2))/τ) / Σ_(j=1..B) e^(sim(vec_(i,1), vec_(j,2))/τ) )

where τ is the temperature hyper-parameter and the similarity is calculated by cosine:

sim(u, v) = (u · v) / (||u|| · ||v||)
D5, reverse calculation. The gradients are solved and reverse iterative updates are performed, completing the learning of the model and giving the learned encoder encoder*(·) = encoder(·, 0), i.e. the encoder with Dropout disabled.
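The in-batch contrastive loss of steps D3-D4 can be sketched numerically as follows; the temperature τ = 0.05 is an assumed SimCSE-style value, and the encoder outputs are taken as given vectors:

```python
import numpy as np

def cosine(u, v):
    # cosine similarity sim(u, v) between two vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# InfoNCE loss over a batch of B term pairs: the diagonal entries are the
# positive pairs, every other vec2 in the batch serves as a negative.
def info_nce_loss(vecs1, vecs2, tau=0.05):
    """vecs1[i], vecs2[i]: encoder outputs for the two terms of sample x_i."""
    B = len(vecs1)
    sims = np.array([[cosine(vecs1[i], vecs2[j]) for j in range(B)] for i in range(B)])
    logits = sims / tau
    # log-softmax of each positive pair against its whole row
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()
```

When each pair's two vectors coincide and differ from the rest of the batch, the loss approaches zero, which is the behavior the training drives toward.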
D6, term-vector-set indexing. Each T_i' ∈ Term_set' is taken and the vector set Term_vecs is calculated as follows:

term_vec_i ∈ Term_vecs, term_vec_i = encoder*(T_i')
D7, indexing and retrieval of the Term_vecs vector set. Term_vecs is indexed into a vector database such as Milvus, Vald or Qdrant, or a vector-retrieval component or tool such as Facebook FAISS, Microsoft SPTAG or Spotify Annoy is selected; the similarity-calculation algorithms are mainly based on the approximate-nearest-neighbor (ANN) search concept. In this embodiment the open-source Milvus is selected to store and retrieve the vector set, and cosine calculation is used for the vector similarity.
For Q_zh_txt and Q_en_txt, encoding is performed first and the cosine similarity is then calculated, giving the query scores:

score_zh_vec,i = sim(encoder*(Q_zh_txt), term_vec_i), score_en_vec,i = sim(encoder*(Q_en_txt), term_vec_i)
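The vector retrieval of step D7 can be illustrated with a brute-force cosine top-k search; this is a stand-in for the Milvus index, not the production component:

```python
import numpy as np

# Brute-force cosine top-k over the indexed Term_vecs: normalize everything,
# take dot products against the query, and return (index, similarity) pairs.
def cosine_topk(query_vec, term_vecs, k=5):
    term_vecs = np.asarray(term_vecs, dtype=float)
    q = query_vec / np.linalg.norm(query_vec)
    m = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    sims = m @ q
    top = np.argsort(-sims)[:k]                    # indices of the k best matches
    return [(int(i), float(sims[i])) for i in top]
```

A vector database replaces this linear scan with an ANN index, but the returned <term index, score> pairs have the same meaning.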
Further, the Chinese medical term alignment model predicts the query term input by the user at the client; the specific process of giving the candidate concept-code result is as follows:
E1, the Chinese term Q_zh_txt input and queried by the user is translated to obtain the corresponding English term Q_en_txt; candidate sequences are obtained, the TOP-150 results are taken, and the scores are normalized to 0-1, giving three final sequences:

cand_seq_txt, cand_seq_vec_zh, cand_seq_vec_en
E2, cand_seq_vec_zh and cand_seq_vec_en are combined with the mapping relation tid2gid_list and mapped to gid^C; when several tids map to one gid^C in this process, the maximum score is taken, thereby forming sequences with <gid_i^C, score_i> as elements, denoted cand_seq'_vec_zh and cand_seq'_vec_en respectively.
E3, cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en are merged with gid^C as the key; an outer-join operation generates the final candidate sequence cand_seq, and the weighted score is calculated as follows:
score_i = κ_1·score_txt,i + κ_2·score_zh_vec,i + κ_3·score_en_vec,i
where score_txt,i, score_zh_vec,i, score_en_vec,i represent the scores in cand_seq_txt, cand_seq'_vec_zh, cand_seq'_vec_en respectively, and κ_i is a hyper-parameter representing the weight of each score, satisfying:

κ_i ∈ [0, 1], and Σ_(i=1..3) κ_i = 1
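The outer-join score fusion of step E3 can be sketched as follows; the κ weights 0.4/0.3/0.3 are illustrative values satisfying the constraint, not the embodiment's tuned ones:

```python
# Outer-join fusion of the three candidate sequences keyed by gid: a gid
# missing from one sequence simply contributes a zero score from that source.
def fuse_candidates(seq_txt, seq_zh, seq_en, k1=0.4, k2=0.3, k3=0.3):
    """Each seq_* is a dict gid -> normalized score in [0, 1]."""
    gids = set(seq_txt) | set(seq_zh) | set(seq_en)       # outer join on gid
    fused = {g: k1 * seq_txt.get(g, 0.0)
                + k2 * seq_zh.get(g, 0.0)
                + k3 * seq_en.get(g, 0.0) for g in gids}
    return sorted(fused.items(), key=lambda kv: -kv[1])   # best candidates first
```

The sorted result plays the role of the final candidate sequence cand_seq returned to the client.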
Various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims (5)

1. The Chinese medical term self-adaptive alignment method based on log feedback is characterized by comprising the following steps of:
S1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
S2, a user can input a queried medical term through the client; then the application server searches the concept codes related to the query words through the Chinese medical term alignment model of the term server and returns a candidate concept code sequence, and at the moment, a user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain a high-quality training sample; the term server trains the Chinese medical term alignment model based on contrast learning by using the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server:
S3.1, defining a sample format: first converting the structure of the log data into a form of (term 1, term 2, {1, -1 }), each sample containing a term pair; the specific conversion rule is as follows:
(1) Constructing the queried medical terms corresponding to the selection set into positive samples;
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample;
S3.2, defining and establishing a learning model: samples whose frequency is smaller than 3 are deleted from the positive and negative sample sets generated by the two rules in step S3.1, finally obtaining a sample set S; the number of log sources is M and the total sample size is N; the sample labeling matrix is A ∈ {-1, 1}^(N×(2M+|C|)), where C represents the sampled combinations of log sources (i, k) used to model the correlation between sources; the true label of each sample is Y ∈ {-1, 1}^N, which is a hidden variable; the relation between the multi-source labels and the true labels is defined as a factor graph, a kind of probabilistic graphical model, denoted P_θ(A, Y), with three factors defined as follows:
The label matrix generated by formula (1) is denoted φ^Lab(A, Y), the accuracy matrix generated by formula (2) is denoted φ^Acc(A, Y), and the correlation matrix generated by formula (3) is denoted φ^Corr(A, Y); specifically, the element φ^Lab_(i,j) corresponds to the log data of sample x_i derived from source j: if source j labels the terms as similar, φ^Lab_(i,j) = 1, otherwise φ^Lab_(i,j) = -1; for the element φ^Acc_(i,j): if the label given by source j is consistent with the true label y_i, φ^Acc_(i,j) = 1, otherwise φ^Acc_(i,j) = -1; for the element φ^Corr_(i,j,k): if the labels of sample x_i from source j and source k are the same, φ^Corr_(i,j,k) = 1, otherwise φ^Corr_(i,j,k) = -1; combining the three factors gives the factor vector φ_i(A, y_i) = [φ^Lab_i, φ^Acc_i, φ^Corr_i];
To sum up, the combined factor expression is denoted φ(A, Y) = [φ_1(A, y_1), ..., φ_N(A, y_N)], and the learning model is defined as:

P_θ(A, Y) = Z_θ^(-1) · exp( Σ_(i=1..N) θ^T φ_i(A, y_i) )

where Z_θ is the normalizing constant and θ represents the weight vector of the probability distribution;
S3.3, training the learning model: for the learning model P_θ(A, Y) containing the hidden variable Y, the negative log marginal likelihood is minimized based on the label matrix A, which is visible from the log annotations:

min_θ ( -log Σ_Y P_θ(A, Y) )
the optimization problem is solved by gradient descent, with the gradients estimated by a Gibbs sampling algorithm; Stanford's Snorkel toolkit is adopted for the solution, and the learned parameters are recorded as θ*;
S3.4, the parameters θ* of the learning model are obtained from step S3.3, giving the trained learning model P_θ*(A, Y), i.e. the posterior P_θ*(Y | A) used to infer the hidden true labels;
S3.5, through learning of a learning model, merging noisy labels of the multi-source samples to obtain a soft label distribution, and generating a term pair sample set X soft of the soft labels;
Setting a filtering labeling threshold alpha (alpha is more than or equal to 0.95), filtering X soft to obtain a term pair sample set X hard, and constructing a conceptual diagram:
B1, regarding a Term pair sample set X hard, taking terms as nodes of a conceptual diagram, and taking Term pairs as edges forming two nodes to construct a conceptual subgraph G sample, wherein the Term set is term_set sample;
B2, based on the UMLS prior library and taking the concept code CUI as the unit, the term set Term_set_umls is taken; each T_i ∈ Term_set_umls creates a node, where a node can be a Chinese term or an English term; all terms are built into a node set, and the nodes within the same CUI term set are connected pairwise by edges, forming the independent concept subgraph G_UMLS;
B3, constructing a conceptual subgraph G x of other term libraries according to the process of constructing G UMLS;
b4, a concept graph G is obtained by merging the several concept subgraphs on nodes whose terms are identical:

G = G_sample ∪ G_UMLS ∪ (∪_x G_x)

Similarly, the overall set of medical terms is expressed as:

Term_set = Term_set_sample ∪ Term_set_umls ∪ (∪_x Term_set_x)
b5, for the graph G, the independent connected subgraphs are separated; each connected subgraph is defined as a concept subgraph G_i^C; the acquisition uses the connected_components method of the Python third-party package networkx, in the form:

{G_i^C} = networkx.connected_components(G)
Each G_i^C is given a unique global concept code, called the concept code and denoted gid_i^C; the node terms in the subgraph corresponding to gid_i^C are acquired, and the medical-term semantic equivalence set S_i^C is constructed; the full set of term equivalence sets is denoted S_C;
After the concept codes gid^C are unified, the term mapping relation list gid2cid_list between gid^C and the public source term libraries is obtained at the same time;
The medical terms of Term_set are automatically numbered; the numbered medical-term set is denoted Term_set' and the code-number field is denoted tid, while the term mapping list tid2gid_list between gid^C and Term_set' is obtained.
2. The method according to claim 1, wherein the specific process of constructing the initial medical term sample data in step S1 is:
S1.1, selecting an UMLS open source medical resource library and collecting UMLS terms;
s1.2, translating the UMLS terms collected in step S1.1.
3. The method according to claim 1, wherein the specific implementation procedure of step S2 is as follows:
S2.1, performing embedded point and log acquisition requests on an application interface of a client, and once an event point is excited, sending a log recording request to an application server by the client based on a script code to complete one-time operation log recording;
S2.2, the log system of the application server responds to the log record request of the client, completes operation log record, and records the log of the service processing process of the application server to obtain transaction log data;
S2.3, synchronizing the operation log data and the transaction log data to a log warehouse:
The operation-log record request of the client comprises the fields: request IP, user uui, time, event type and service parameters; the transaction log of the application server also includes the request time, user uui, event type, event method and business parameters;
Grouping all the obtained logs according to the user uui, restoring the user behavior process according to the time sequence, extracting the user selection from the time sequence, and finishing the final data annotation; the log data is processed to form a data structure (server code, uui, date, medical terms of query, candidate set, selection set, whether custom) and the formatted data is updated into the log repository.
4. The method of claim 1, wherein the chinese medical term alignment model consists essentially of a term library, a text search engine, and a semantic search engine; the content of the term library comprises an open term library, a concept coding and term library mapping table gid2 cid_list and a term concept equivalence set S C;
the text search engine is specifically constructed as follows:
C1, the concept set S_Concept is acquired and recorded in units of term concepts; a data model is created whose fields comprise ID, English standard word, English synonyms, Chinese standard word and Chinese synonyms, where the ID is gid^C; the corresponding English standard word is obtained from the open-source library; if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word, and the remaining English terms are used as English synonyms;
C2, performing text cleaning on the data of the term library, then storing the data into a database, and establishing an index for the term;
C3, for a given query Chinese term Q_zh_txt, open professional medical-term dictionary translation or open API translation is performed to obtain the English term Q_en_txt; Q_zh_txt and Q_en_txt are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated respectively;
the medical-term text similarity is calculated as follows:

score_txt(Q) = α_1·score_bm25(Q_en_txt, D_en_fsn) + α_2·score_bm25(Q_en_txt, D_en_sim) + α_3·score_bm25(Q_zh_txt, D_zh_fsn) + α_4·score_bm25(Q_zh_txt, D_zh_sim)

where D_en_fsn, D_en_sim, D_zh_fsn, D_zh_sim respectively represent the field documents of the English standard word, English synonyms, Chinese standard word and Chinese synonyms, and α_i is a hyper-parameter representing the weight of each field, satisfying:

α_i ∈ [0, 1], and Σ_i α_i = 1
The semantic search engine is modeled on the SimCSE model of contrastive learning; the specific process is as follows:
d1, model selection: the two terms are encoded separately and the similarity is then calculated, and the SimCSE model is selected to train the terms based on contrastive learning;
D2, since SimCSE is a contrastive-learning model, it can be trained in a self-supervised way without annotation, so part of the samples are constructed from the terms themselves and recorded as a self-supervised sample set; in addition, since a large amount of supervision data is obtained from the log data, a hybrid SimCSE model is adopted: samples are combined pairwise from S_C in units of term concepts to form a supervised sample set, finally obtaining the overall sample set X_cse;
D3, forward calculation: for sample x i∈Xcse, the two terms are calculated by the encoder, expressed as:
vec_1, vec_2 = encoder(x_(i,1), μ_1), encoder(x_(i,2), μ_2)
where encoder(·) represents the encoder function, here Chinese BERT; μ_1, μ_2 represent the parameters of the Dropout layers in the encoder (the larger the setting, the more neural units the network drops); and x_(i,1), x_(i,2) represent the first and second terms of sample x_i, respectively;
Because the input is a term pair, the term pair itself is taken as the positive sample in the model calculation, while the negative samples are formed by pairing x_(i,1) with the second terms x_(j,2) of the other samples in the same training batch, where x_j ∈ X_cse, j ≠ i;
d4, defining the loss function; for a batch size of B, the training loss function is:

L = -(1/B) Σ_(i=1..B) log( e^(sim(vec_(i,1), vec_(i,2))/τ) / Σ_(j=1..B) e^(sim(vec_(i,1), vec_(j,2))/τ) )

where τ is the temperature hyper-parameter and the similarity is calculated by cosine:

sim(u, v) = (u · v) / (||u|| · ||v||)
D5, reverse calculation: the gradients are solved and reverse iterative updates are performed, completing the learning of the model and giving the learned encoder encoder*(·) = encoder(·, 0);
D6, term-vector-set indexing; each T_i' ∈ Term_set' is taken and the vector set Term_vecs is calculated as follows:

term_vec_i ∈ Term_vecs, term_vec_i = encoder*(T_i')
D7, index term_ vecs vector set and search: a vector database or a vector retrieval component or tool is adopted to save and retrieve a vector set, and cosine calculation is adopted for similarity calculation of vectors;
For Q_zh_txt and Q_en_txt, encoding is performed first and the cosine similarity is then calculated, giving the query scores:

score_zh_vec,i = sim(encoder*(Q_zh_txt), term_vec_i), score_en_vec,i = sim(encoder*(Q_en_txt), term_vec_i)
5. The method of claim 4, wherein the chinese medical term alignment model retrieves the query term-related concept codes and returns the candidate concept code sequence as follows:
e1, for the Chinese term Q_zh_txt input by the user, translation is performed to obtain the corresponding English term Q_en_txt; candidate sequences are obtained, the TOP-150 results are taken, and the scores are normalized to 0-1, giving three final sequences:

cand_seq_txt, cand_seq_vec_zh, cand_seq_vec_en
E2, combine cand_seq_vec_zh and cand_seq_vec_en with the mapping relation tid2gid_list, mapping them to gid^C; when several tids map to one gid^C in this process, take the maximum of their scores, forming sequences with <gid_i^C, score_i> as elements, denoted cand_seq'_vec_zh and cand_seq'_vec_en respectively;
E3, merge cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en with gid^C as the key, adopting an outer-join operation to generate the final candidate sequence cand_seq, whose scores are computed as the weighted sum:

score_i = κ_1·score_txt,i + κ_2·score_zh_vec,i + κ_3·score_en_vec,i

where score_txt,i, score_zh_vec,i and score_en_vec,i denote the scores in cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en respectively, and the κ_i are hyper-parameters representing the weight of each score, satisfying:

κ_i ∈ [0, 1] and Σ_i κ_i = 1.
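The outer-join merge and weighted score fusion of E3 can be sketched as follows. The weights κ = (0.4, 0.3, 0.3) are hypothetical values chosen only to satisfy the stated constraints, and a concept code absent from one sequence simply contributes a score of 0 for that component.

```python
def fuse(seq_txt, seq_zh, seq_en, kappa=(0.4, 0.3, 0.3)):
    """Full outer join on concept code gid; missing scores default to 0."""
    assert abs(sum(kappa) - 1.0) < 1e-9 and all(0 <= k <= 1 for k in kappa)
    gids = set(seq_txt) | set(seq_zh) | set(seq_en)   # union = outer join
    fused = {
        gid: kappa[0] * seq_txt.get(gid, 0.0)
           + kappa[1] * seq_zh.get(gid, 0.0)
           + kappa[2] * seq_en.get(gid, 0.0)
        for gid in gids
    }
    # Final candidate sequence, best score first
    return sorted(fused.items(), key=lambda kv: -kv[1])

cand = fuse(
    {"G1": 0.9, "G2": 0.5},   # text-retrieval scores (cand_seq_txt)
    {"G1": 0.8, "G3": 0.7},   # Chinese vector scores (cand_seq'_vec_zh)
    {"G2": 0.6, "G3": 0.9},   # English vector scores (cand_seq'_vec_en)
)
# G1: 0.60, G3: 0.48, G2: 0.38 (best first)
```

The outer join matters here: a concept found only by the vector channels (like G3 above) still survives into the final candidate sequence instead of being dropped by an inner join.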
CN202310647595.9A 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback Active CN116680377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310647595.9A CN116680377B (en) 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback

Publications (2)

Publication Number Publication Date
CN116680377A (en) 2023-09-01
CN116680377B (en) 2024-04-23

Family

ID=87781772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310647595.9A Active CN116680377B (en) 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback

Country Status (1)

Country Link
CN (1) CN116680377B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725076B (en) * 2024-02-01 2024-04-09 Xiamen Taqu Information Technology Co., Ltd. Faiss-based distributed massive similarity vector incremental training system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712804A (en) * 2020-12-23 2021-04-27 Harbin Institute of Technology (Weihai) Speech recognition method, system, medium, computer device, terminal and application
CN113254418A (en) * 2021-05-11 2021-08-13 Guangzhou Zhongkang Digital Technology Co., Ltd. Medicine classification management data analysis system
CN113377897A (en) * 2021-05-27 2021-09-10 Hangzhou Laimai Medical Information Technology Co., Ltd. Multi-language medical term standardization system and method based on deep adversarial learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10169454B2 (en) * 2016-05-17 2019-01-01 Xerox Corporation Unsupervised ontology-based graph extraction from texts


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic Extraction of Bilingual Terms from Chinese-English Parallel Patents; Sun Maosong et al.; Journal of Tsinghua University (Science and Technology); 2014-10-31; Vol. 54, No. 10; pp. 1339-1343 *

Also Published As

Publication number Publication date
CN116680377A (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN110781683B (en) Entity relation joint extraction method
CN112559556B (en) Language model pre-training method and system for table mode analysis and sequence mask
CN109902145B (en) Attention mechanism-based entity relationship joint extraction method and system
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN110134946B (en) Machine reading understanding method for complex data
CN111222340B (en) Breast electronic medical record entity recognition system based on multi-standard active learning
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
WO2023065858A1 (en) Medical term standardization system and method based on heterogeneous graph neural network
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
CN111026941A (en) Intelligent query method for demonstration and evaluation of equipment system
CN116680377B (en) Chinese medical term self-adaptive alignment method based on log feedback
CN110619121A (en) Entity relation extraction method based on improved depth residual error network and attention mechanism
CN112925918B (en) Question-answer matching system based on disease field knowledge graph
CN115080694A (en) Power industry information analysis method and equipment based on knowledge graph
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN113988075A (en) Network security field text data entity relation extraction method based on multi-task learning
CN116049422A (en) Echinococcosis knowledge graph construction method based on combined extraction model and application thereof
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN116127097A (en) Structured text relation extraction method, device and equipment
CN116227594A (en) Construction method of high-credibility knowledge graph of medical industry facing multi-source data
CN115481220A (en) Post and resume content-based intelligent matching method and system for comparison learning human posts
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN115081445A (en) Short text entity disambiguation method based on multitask learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant