CN116680377A - Chinese medical term self-adaptive alignment method based on log feedback - Google Patents

Chinese medical term self-adaptive alignment method based on log feedback

Info

Publication number
CN116680377A
Authority
CN
China
Prior art keywords
term
log
sample
terms
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310647595.9A
Other languages
Chinese (zh)
Other versions
CN116680377B (en)
Inventor
梁锐
唐珂轲
陈美莲
黄毅宁
钟冬赐
林少泽
吴豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhongkang Digital Technology Co ltd
Original Assignee
Guangzhou Zhongkang Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Zhongkang Digital Technology Co ltd filed Critical Guangzhou Zhongkang Digital Technology Co ltd
Priority to CN202310647595.9A priority Critical patent/CN116680377B/en
Publication of CN116680377A publication Critical patent/CN116680377A/en
Application granted granted Critical
Publication of CN116680377B publication Critical patent/CN116680377B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F11/302 Monitoring arrangements where the monitored computing system component is a software system
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3438 Recording or statistical evaluation of user activity; monitoring of user actions
    • G06F16/23 Updating
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese medical term self-adaptive alignment method based on log feedback, implemented with log feedback, weak supervision and contrastive learning. Operation logs of the client are recorded, the medical terms in them are analyzed, identified and extracted, and concept subgraphs are split out, so that training samples are constructed automatically and the model performs self-learning and automatic indexing. As log data from downstream business systems are accessed, the model learns and improves by itself, and the self-learned model is then served back to the downstream systems, achieving automated and efficient term alignment through a closed loop of the whole process.

Description

Chinese medical term self-adaptive alignment method based on log feedback
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a Chinese medical term self-adaptive alignment method based on log feedback.
Background
Medical concept alignment is an important research direction in the field of medical information processing. It mainly refers to standardizing the terms, symbols, abbreviations and so on used in the medical field. In a medical information system, the same medical concept may be expressed by a plurality of different terms. This non-uniformity and inaccuracy of expression seriously hinders the integration, sharing and utilization of medical big data, and brings difficulties to clinical work, teaching and scientific research in the medical field: for example, confusion of terms, inaccurate information, omitted information, duplicated information, and difficulty in communicating across institutions.
Medical institutions adopt manual coding, in which medical concepts in clinical medical texts are mapped by hand into medical term codes based on an existing standard medical term dictionary. Manual coding requires a large number of professionals with medical knowledge, and is costly, limited in efficiency and low in accuracy.
In recent years, to address the cost and efficiency problems, deep neural networks and knowledge graphs have been widely used, especially in NLP applications for Chinese medical terms. Various approaches have emerged, such as finer-grained decomposition of text based on NER, entity extraction based on a combination of semi-supervised and active learning, and extraction based on deep learning and retrieval. These methods typically recall candidate terms and then re-rank them. Most of them need to collect data or translate from foreign English term libraries and, where necessary, rely on expert annotation. There are evidently several problems:
Problem 1: data collection is an expensive project that requires a great deal of funding, manpower and time;
Problem 2: the correctness of translating an English term library into Chinese remains a problem; adding manual review can improve the accuracy of the Chinese translation, but the huge review workload would make a term alignment project drag on indefinitely.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a Chinese medical term self-adaptive alignment method based on log feedback.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a Chinese medical term self-adaptive alignment method based on log feedback specifically comprises the following steps:
s1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
s2, a user can input a queried medical term through the client; then the application server searches the concept codes related to the query words through the Chinese medical term alignment model of the term server and returns a candidate concept code sequence, and at the moment, a user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain high-quality training samples; the term server trains the Chinese medical term alignment model based on contrastive learning with the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
Further, the specific process of constructing the initial medical term sample data in step S1 is:
S1.1, selecting a UMLS open-source medical resource library and collecting UMLS terms;
S1.2, translating the UMLS terms collected in step S1.1.
Further, the specific implementation process of step S2 is as follows:
S2.1, instrumenting the application interface of the client with event tracking points (buried points) and log collection requests; once an event point is triggered, the client sends a log record request to the application server via script code, completing one operation log record;
s2.2, the log system of the application server responds to the log record request of the client, completes operation log record, and records the log of the service processing process of the application server to obtain transaction log data;
s2.3, synchronizing the operation log data and the transaction log data to a log warehouse:
The operation log record request of the client includes fields such as the request IP, user UUID, time, event type and service parameters; the transaction log fields of the application server additionally include the request time, user UUID, event type, event method and service parameters;
All the obtained logs are grouped by user UUID, the user behavior process is restored in time order, the user's selection is extracted from the time sequence, and the final data annotation is completed; the log data are processed into a data structure of the form (server code, UUID, date, queried medical term, candidate set, selection set, whether custom), and the formatted data are updated into the log warehouse.
Further, the specific process of step S3 is as follows:
s3.1, defining a sample format: first converting the structure of the log data into a form of (term 1, term 2, {1, -1 }), each sample containing a term pair; the specific conversion rule is as follows:
(1) Constructing positive samples from the queried medical term and the terms corresponding to the selection set;
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample;
S3.2, defining and establishing a learning model: from the positive and negative sample sets generated by the two rules in step S3.1, samples occurring with a frequency lower than 3 are deleted, finally giving a sample set S, where the number of log sources is M and the total number of samples is N; the sample labeling matrix is A ∈ {-1, 1}^{N×(2M+|C|)}, where C denotes the sampled pairwise combinations of log sources j and k and characterizes the correlation between log sources, and the true label of each sample, Y ∈ {-1, 1}^N, is a hidden variable; the relation between the multi-source labels and the true labels is defined as a factor graph model (a probabilistic graphical model), denoted P_θ(A, Y), with three factors:
φ^Lab_{i,j}(A, Y) = 1{A_{i,j} ≠ 0}    (1)
φ^Acc_{i,j}(A, Y) = 1{A_{i,j} = y_i}    (2)
φ^Corr_{i,j,k}(A, Y) = 1{A_{i,j} = A_{i,k}}, (j, k) ∈ C    (3)
The labeling matrix generated by formula (1) is denoted φ^Lab, the accuracy matrix generated by formula (2) is denoted φ^Acc, and the correlation matrix generated by formula (3) is denoted φ^Corr. Specifically, the element φ^Lab_{i,j} indicates whether sample x_i is labeled by the log data of source j: it equals 1 if source j gives a similarity label for the term pair, and 0 otherwise; the element φ^Acc_{i,j} equals 1 if the label given by source j is identical to the true label y_i, and 0 otherwise; the element φ^Corr_{i,j,k} equals 1 if sources j and k give the same label for sample x_i, and 0 otherwise. Concatenating the three factors for sample i gives the combined factor vector φ_i(A, Y) = [φ^Lab_i, φ^Acc_i, φ^Corr_i].
In summary, with the combined factor expression denoted φ_i(A, Y), the learning model is defined as:
P_θ(A, Y) = Z_θ^{-1} · exp( Σ_{i=1}^{N} θ^T φ_i(A, Y) )
where θ represents the weights of the probability distribution and Z_θ is the normalizing constant;
S3.3, training the learning model: since the true labels Y are hidden variables, the learning model P_θ(A, Y) is trained by minimizing the negative log marginal likelihood with respect to the observable label matrix A:
θ̂ = argmin_θ ( − log Σ_Y P_θ(A, Y) )
The optimization problem is solved by gradient descent combined with a Gibbs sampling algorithm, using the Stanford Snorkel toolkit; the learned parameters are denoted θ*;
S3.4, with the parameters θ* obtained in step S3.3, the trained learning model P_{θ*}(A, Y) is obtained;
S3.5, through the trained learning model, the noisy labels of the multi-source samples are merged into a soft label distribution Ỹ = P_{θ*}(Y | A), generating a soft-labeled term pair sample set X_soft;
A filtering threshold α (α ≥ 0.95) is set and X_soft is filtered to obtain the term pair sample set X_hard, from which a concept graph is constructed:
B1, for the term pair sample set X_hard, each term is taken as a node of the concept graph and each term pair forms an edge between its two nodes, constructing a concept subgraph G_sample; the term set at this point is Term_set_sample;
B2, based on the UMLS prior library, with the concept code CUI as the unit, for the term set Term_set_umls a node is created for each T_i ∈ Term_set_umls; a node may be a Chinese term or an English term, all terms are built into a node set, and the nodes within the same CUI term set are pairwise connected by edges, forming independent concept subgraphs G_UMLS;
B3, following the construction of G_UMLS, the concept subgraphs G_x of the other term libraries are constructed;
B4, based on identical node terms, the concept graph G is obtained from the several concept subgraphs:
G = G_sample ∪ G_UMLS ∪ (⋃_x G_x)
Similarly, the overall set of medical terms is expressed as:
Term_set = Term_set_sample ∪ Term_set_umls ∪ (⋃_x Term_set_x)
B5, for the graph G, the independent connected subgraphs are separated, each connected subgraph being defined as a concept subgraph G_i^sub; the separation is computed with the connected_components method of the Python third-party package networkx, expressed as:
{G_1^sub, G_2^sub, ..., G_K^sub} = connected_components(G)
Each G_i^sub is assigned a unique global concept code, called the concept code and denoted gid_i^C; the node terms in the subgraph corresponding to gid_i^C are collected to construct the medical term semantic equivalence set S_i^{gid_C}, and the full set of term equivalence sets is denoted S_C;
After the concept codes gid^C are unified, the term mapping relation list gid2cid_list between gid^C and the public open-source term libraries is obtained;
The medical terms of Term_set are automatically numbered; the numbered medical term set is denoted Term_set' and the code number field is denoted tid, and the term mapping list tid2gid_list between gid^C and Term_set' is obtained.
Further, the Chinese medical term alignment model mainly comprises a term library, a text search engine and a semantic search engine; the content of the term library comprises the open term libraries, the concept-code-to-term-library mapping table gid2cid_list and the term concept equivalence set S_C.
The text search engine is specifically constructed as follows:
C1, acquiring the concept set S_Concept, with the term concept as the recording unit; creating a data model whose fields comprise an ID, the English standard word, English synonyms, the Chinese standard word and Chinese synonyms, where the ID is gid_C; the corresponding English standard word is obtained from the open-source library, and if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word and the remaining English terms are used as English synonyms;
C2, performing text cleaning on the data of the term library, then storing the data into a database, and establishing an index for the term;
C3, for a given query Chinese term Q_zh_txt, the English term Q_en_txt is obtained through translation with an open professional medical term dictionary or an open translation API; Q_zh_txt and Q_en_txt are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated;
The medical term text similarity is calculated as follows:
score_txt(Q, D) = α_1·score_bm25(Q_en_txt, D_en_fsn) + α_2·score_bm25(Q_en_txt, D_en_sim) + α_3·score_bm25(Q_zh_txt, D_zh_fsn) + α_4·score_bm25(Q_zh_txt, D_zh_sim)
where D_en_fsn, D_en_sim, D_zh_fsn and D_zh_sim denote the field documents of the English standard word, English synonyms, Chinese standard word and Chinese synonyms respectively, and α_i is a hyper-parameter representing the weight of each field, satisfying α_i ∈ [0, 1] and Σ_i α_i = 1.
The semantic search engine is modeled on a SimCSE model based on contrastive learning; the specific process is as follows:
D1, model selection: the two terms are separately encoded and their similarity is then calculated; a SimCSE model based on contrastive learning is selected to train the terms;
D2, since SimCSE is a contrastive learning model, it can be trained in a self-supervised way without labels, so part of the samples are constructed from the terms themselves; in addition, since a large amount of supervision data is acquired from the log data, a hybrid SimCSE model is adopted, in which samples are constructed with terms and concepts as the units by taking the pairwise combinations within each equivalence set of S_C as supervised samples; together these form the sample set X_cse;
D3, forward calculation: for sample x i ∈X cse The two terms are calculated by the encoder, expressed as:
vec 1 ,vec 2 =encoder(x i,11 ),encoder(x i,22 )
wherein, the encoder (·) represents the encoder function, and Chinese BERT, μ is taken 12 The larger the parameter setting, the more dead neural network elements output by the neural network, x i,1 ,x i,2 Respectively represent sample x i Is the term 1 and the term 2 of (a);
because the input is a term pair, the term pair is taken as a positive sample in the calculation of the model, and the negative sample is taken from x in a batch of samples in training j,1 X of the same batch j,2 Composition, where x j ∈X cse ,j>1;
D4, defining a loss function; for a batch size of B, the trained penalty function is:
wherein, the similarity is calculated by cosine:
d5, reverse calculation: solving gradient, reverse iteration updating, completing model learning and completing encoder * Learning of (-) = encoder (·, 0);
D6, indexing the term vector set; for each T_i' ∈ Term_set', the vector set Term_vecs is calculated as follows:
term_vec_i = encoder*(T_i'), term_vec_i ∈ Term_vecs
D7, indexing and retrieval of the Term_vecs vector set: a vector database or a vector retrieval component or tool is adopted to store and retrieve the vector set, and cosine similarity is used for comparing vectors;
For Q_zh_txt and Q_en_txt, the queries are first encoded and the query scores are then calculated as the cosine similarity between the encoded queries and the indexed term vectors.
Furthermore, the Chinese medical term alignment model retrieves the concept codes related to the query term and returns the candidate concept code sequence as follows:
E1, the query Chinese term Q_zh_txt input by the user is translated to obtain the corresponding English term Q_en_txt; candidate sequences can then be obtained, the TOP 150 are taken and the scores are normalized to 0–1, giving the final three sequences cand_seq_txt, cand_seq_vec_zh and cand_seq_vec_en;
E2, cand_seq_vec_zh and cand_seq_vec_en are combined with the mapping relation tid2gid_list and mapped to gid_C; in this process, when multiple tids correspond to one gid_C, the maximum score is taken, thereby forming sequences whose elements are keyed by gid_C, denoted cand_seq'_vec_zh and cand_seq'_vec_en;
E3, with gid_C as the key, cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en are merged using an outer-join operation to generate the final candidate sequence cand_seq, and the weighted score is calculated as follows:
score_i = κ_1·score_txt,i + κ_2·score_zh_vec,i + κ_3·score_en_vec,i
where score_txt,i, score_zh_vec,i and score_en_vec,i denote the scores of cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en respectively, and κ_i is a hyper-parameter representing the weight of each score, satisfying:
κ_i ∈ [0, 1] and Σ_i κ_i = 1.
the invention has the beneficial effects that:
1. The method is implemented based on log feedback, weak supervision and contrastive learning: the client operation logs are recorded, the behavior process in the logs is analyzed, and medical terms are identified and extracted, so that training samples are constructed automatically and the model performs self-learning and automatic indexing; as log data from downstream business systems are accessed, the model learns and improves by itself, and the self-learned model then serves the downstream systems again, achieving automated and efficient term alignment through a closed loop of the whole process.
2. A graph model is adopted to break up and split the medical terms, which embodies the core idea of term concepts. An abstract medical concept exists only as an intangible notion, while the terms representing it are scattered across term data and databases. The fragmented terms are effectively connected through the graph and given a unique concept code; with the constructed graph model, a graph algorithm automatically splits out the connected subgraphs, forming the concept subgraphs.
3. For the term feedback from the logs, weakly supervised data processing is innovatively provided: a learning model constructed from noisy annotations and true annotations is used for noisy-label inference. The approach is suitable for services with multiple application scenarios, solves the data-quality problem caused by differences among many users, and improves the performance of the final model.
4. Term vector embedding with a hybrid SimCSE model is proposed, which unifies the training of single terms and similar term pairs, thereby completing the embedding of massive terms.
5. The alignment model is computed by combining text and semantics, exploiting the advantages of both: text retrieval is simple and efficient, so when the data tend to saturate as the model runs and iterates, direct text matching and retrieval already achieve a good term alignment effect, and it also compensates for insufficient model training when term samples are scarce; semantic retrieval directly reflects the semantic characteristics of Chinese medical terms and solves the problem that literally similar terms can be semantically inconsistent.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an implementation architecture of a method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture of a Chinese medical term alignment model according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, and it should be noted that, while the present embodiment provides a detailed implementation and a specific operation process on the premise of the present technical solution, the protection scope of the present invention is not limited to the present embodiment.
This embodiment provides a Chinese medical term self-adaptive alignment method based on log feedback, implemented with log feedback, weak supervision and contrastive learning. The specific flow is shown in FIG. 1.
S1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
s2, a user can input medical terms of a query through a client (i, j and k shown in fig. 2), such as inputting a query word of 'respiratory tract infection'; then the application server (g and h shown in figure 2) searches the concept codes related to the query terms through the Chinese medical term alignment model of the term server (a-f shown in figure 2) and returns a candidate concept code sequence, and at the moment, the user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain high-quality training samples; the term server trains the Chinese medical term alignment model based on contrastive learning with the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
The main purpose of this embodiment is to solve the alignment problem of Chinese medical terms. In the initial stage, a huge amount of open-source English term libraries is therefore used as the starting point, and authoritative Chinese-English translation dictionaries and standards are collected as bridges to construct the initial medical term samples. The specific process of step S1 is as follows:
S1.1, selecting the UMLS open-source medical resource library and collecting UMLS terms. As is well known, UMLS is a union of a large number of term libraries that can be linked through the CUI code of a UMLS term (the unique code representing a concept in the UMLS medical library), including SNOMED CT, the ICD family, MeSH, LOINC, ATC and other term libraries.
S1.2, translating the UMLS terms collected in step S1.1. The term translation plays a critical role; the main sources of translation data selected in this example are:
1. Chinese-English comparison tables issued by authorities, such as the 10th edition of the International Classification of Diseases, the Chinese electronic edition of SNOMED 3.4 published by the institute of health and management of China, the ATC Chinese edition, and so on, recorded as CN_medical_terminal_Set1.
2. Medical dictionaries, such as the authoritative "Xiangya Medical Dictionary", the "English-Chinese Medical Dictionary" and the "English-Chinese/Chinese-English Bidirectional Medical Dictionary". English terms are matched against these dictionaries by whole-word matching.
Further, in step S2, the problem of difficult data-sample collection is solved through user behavior collected via log feedback and the transaction logs of the application server. The collected logs comprise the operation logs generated when users perform term alignment on the client and the transaction logs generated by the application server from the users' operations. The application server synchronizes the operation logs collected by each node, together with its transaction logs, to the log warehouse of the term server periodically or in real time. The specific implementation process of step S2 is as follows:
S2.1, instrumenting the application interface of the client with event tracking points (buried points) and log collection requests, for example event points for candidate-concept-code loading completed, selection made, and submission completed. Once an event point is triggered, the client sends a log record request to the application server via script code, completing one operation log record; for example, JavaScript on the user side sends an asynchronous request to the server based on Ajax.
S2.2, the log system of the application server responds to the log record request of the client, completes the operation log record, and also records the log of the application server's business processing to obtain transaction log data. For example, a user queries the term "respiratory tract infection" at the application interface of the client; the log system of the application server receives the query log record request of the client, and at the same time the business logic handles the request for the queried candidate term list and related data, such as "respiratory tract infection (C0035243)", "upper respiratory tract infection (C0041912)", "upper respiratory tract infection (C0149725)", "viral respiratory tract infection (C0729531)"; the codes in brackets are the concept codes after term standardization.
S2.3, log synchronization. This embodiment synchronizes logs to the log warehouse based on real-time log streaming, combining Flume + Kafka + Spark Streaming: Flume reads the log information, Kafka queues the log messages, and Spark Streaming consumes the log messages, filters and cleans them based on rules, and finally stores them into the log warehouse. The storage medium may be a file system, a distributed file system, or a relational database such as MySQL or a NoSQL database. This embodiment is based on Hadoop, adopting the HDFS distributed file system and the Hive data warehouse to store the data. Hadoop is a distributed system infrastructure developed by the Apache Foundation; HDFS is a file system built on top of Hadoop; Hive is a Hadoop-based data warehouse tool for data extraction, transformation and loading, providing a mechanism to store, query and analyze large-scale data stored in Hadoop.
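As an illustration of this synchronization path (Flume reads, Kafka queues, Spark consumes, Hive stores), the following is a minimal PySpark Structured Streaming sketch rather than the embodiment's exact Spark Streaming job; the broker address, topic name, table name and the rule-based cleaning step are placeholders.

```python
# Minimal sketch (assumed topic/broker/table names): consume operation/transaction
# logs from Kafka, apply a simple rule-based filter, and append them to a Hive table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("term-log-sync")
         .enableHiveSupport()
         .getOrCreate())

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker
       .option("subscribe", "term-operation-logs")        # assumed topic
       .load())

logs = (raw.selectExpr("CAST(value AS STRING) AS line")
        .filter(F.col("line").isNotNull() & (F.length("line") > 0)))  # rule-based cleaning placeholder

def write_batch(df, batch_id):
    # Append each micro-batch to the log warehouse (assumed Hive table name).
    df.write.mode("append").saveAsTable("log_warehouse.operation_logs")

query = (logs.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/term-log-sync-ckpt")
         .start())
```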
The content structure of the log data collection and the final output data form are illustrated below.
The operation log record request of the client comprises the following main fields: request IP, user UUID, time, event type, service parameters and so on. In addition, the transaction log fields of the application server also include the request time, user UUID, event type, event method, service parameters and so on.
All obtained logs are grouped by user UUID, the user behavior process is restored in time order, the user's selection is extracted from the sequence, and the final data annotation is completed. For example, for a unique user UUID, the candidate concept code set "[C0035243, C0041912, C0149725, C0581381, C0729531]" is returned for the "respiratory tract infection" query action; the user then selects "C0035243"; finally, data processing produces a record in the format (server code, UUID, date, queried medical term, candidate set, selection set, whether custom), such as (s10009, 1657008203_1315242ec22b0d006e2462442974b4b, 2022-06-06, {respiratory tract infection}, {C0035243, C0041912, C0149725, C0581381, C0729531}, {C0035243}, no), and the formatted data are updated into the Hive data repository of the log warehouse.
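The grouping and annotation step just described can be sketched as follows; the event dictionaries, field names and the rule that an empty selection marks a custom term are illustrative assumptions, not the exact schema of the log system.

```python
# Hypothetical sketch: group raw log events by user UUID, sort them by time, and
# emit (server_code, uuid, date, query_term, candidate_set, selection_set, is_custom)
# records in the format described above.  Field names are assumptions.
from collections import defaultdict

def build_feedback_records(log_events):
    by_user = defaultdict(list)
    for ev in log_events:                         # ev: one parsed log line (dict)
        by_user[ev["uuid"]].append(ev)

    records = []
    for uuid, events in by_user.items():
        events.sort(key=lambda e: e["time"])      # restore the behaviour sequence
        query, candidates, selection = None, set(), set()
        for ev in events:
            if ev["event_type"] == "query":
                query = ev["params"]["term"]
            elif ev["event_type"] == "candidates_loaded":
                candidates = set(ev["params"]["codes"])
            elif ev["event_type"] == "select_submit":
                selection = set(ev["params"]["codes"])
        if query and candidates:
            first = events[0]
            records.append((first.get("server_code", ""), uuid, first["time"][:10],
                            query, candidates, selection,
                            not selection))       # no selection -> treated as a custom term
    return records
```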
In this embodiment, the log data come from the log systems of multiple application servers, each serving multiple business-scenario applications, and each application may be used by crowd workers with different characteristics. The data samples can therefore be regarded as data sets from multiple heterogeneous sources; a lot of noise and even conflicting labels are inevitable, since different judgments arise when the complexity of the actual business application is understood from different angles and at different granularities.
For the multi-source noisy data samples, noise elimination is needed, after which the labels are merged to construct a better training corpus. The specific process of step S3 is as follows:
S3.1, defining the sample format. The structure of the log data (server code, UUID, date, queried medical term, candidate set, selection set, whether custom) is first converted into the form (term 1, term 2, {1, -1}), with each sample containing a term pair. The specific conversion rules are as follows:
(1) The medical terms of the query are constructed as positive samples with terms corresponding to the selection set. For example, the query word "respiratory tract infection" input by the user and the term set corresponding to the C0035243 selected by the user are { "respiratory tract (upper respiratory tract and lower respiratory tract) infection", "upper and lower respiratory tract infection" }, and positive samples (respiratory tract infection, respiratory tract (upper respiratory tract and lower respiratory tract) infection, 1) and (respiratory tract infection, upper and lower respiratory tract infection, 1) are constructed to form positive sample sets { (respiratory tract infection, respiratory tract (upper respiratory tract and lower respiratory tract) infection, 1), (respiratory tract infection, upper and lower respiratory tract infection, 1) };
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample. For example, the medical term "respiratory tract infection" of a query is combined with the term set of candidate sets { C0041912, C0149725, C0581381, C0729531} to construct a negative sample set { (respiratory tract infection, upper Respiratory Tract Infection (URTI), -1), (respiratory tract infection, respiratory tract viral infection, -1), }.
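A small sketch of these two conversion rules is given below; the record layout follows the tuple format of S2.3, and the concept-code-to-term lookup code2terms is an assumed helper.

```python
# Sketch of rule (1) and rule (2): build (term1, term2, label) pairs from one
# feedback record; `code2terms` (concept code -> set of surface terms) is assumed.
def record_to_pairs(record, code2terms):
    _, _, _, query_term, candidate_set, selection_set, _ = record
    pairs = []
    for code in selection_set:                      # rule (1): positives
        for term in code2terms.get(code, set()):
            pairs.append((query_term, term, 1))
    for code in candidate_set - selection_set:      # rule (2): negatives
        for term in code2terms.get(code, set()):
            pairs.append((query_term, term, -1))
    return pairs
```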
S3.2, defining and establishing a learning model. From the positive and negative sample sets generated by the two rules in step S3.1, samples occurring with a frequency lower than 3 are deleted, finally giving a sample set S, where the number of log sources is M and the total number of samples is N; the sample labeling matrix is A ∈ {-1, 1}^{N×(2M+|C|)}, where C denotes the sampled pairwise combinations of log sources j and k and characterizes the correlation between log sources, and the true label of each sample, Y ∈ {-1, 1}^N, is a hidden variable. Inspired by weak supervision, this embodiment defines the relation between the multi-source labels and the true labels as a factor graph model (a probabilistic graphical model), denoted P_θ(A, Y), with three factors:
φ^Lab_{i,j}(A, Y) = 1{A_{i,j} ≠ 0}    (1)
φ^Acc_{i,j}(A, Y) = 1{A_{i,j} = y_i}    (2)
φ^Corr_{i,j,k}(A, Y) = 1{A_{i,j} = A_{i,k}}, (j, k) ∈ C    (3)
The labeling matrix generated by formula (1) is denoted φ^Lab, the accuracy matrix generated by formula (2) is denoted φ^Acc, and the correlation matrix generated by formula (3) is denoted φ^Corr. Specifically, the element φ^Lab_{i,j} indicates whether sample x_i is labeled by the log data of source j: it equals 1 if source j gives a similarity label for the term pair, and 0 otherwise; the element φ^Acc_{i,j} equals 1 if the label given by source j is identical to the true label y_i, and 0 otherwise; the element φ^Corr_{i,j,k} equals 1 if sources j and k give the same label for sample x_i, and 0 otherwise. Concatenating the three factors for sample i gives the combined factor vector φ_i(A, Y) = [φ^Lab_i, φ^Acc_i, φ^Corr_i].
In summary, with the combined factor expression denoted φ_i(A, Y), the learning model is defined as:
P_θ(A, Y) = Z_θ^{-1} · exp( Σ_{i=1}^{N} θ^T φ_i(A, Y) )
where θ represents the weights of the probability distribution and Z_θ is the normalizing constant.
S3.3, training the learning model. Since the true labels Y are hidden variables, the learning model P_θ(A, Y) is trained by minimizing the negative log marginal likelihood with respect to the observable label matrix A:
θ̂ = argmin_θ ( − log Σ_Y P_θ(A, Y) )
The optimization problem is solved by gradient descent combined with a Gibbs sampling algorithm, using the Stanford Snorkel toolkit; the learned parameters are denoted θ*.
S3.4, with the parameters θ* obtained in step S3.3, the trained learning model P_{θ*}(A, Y) is obtained.
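The label fusion of S3.2–S3.4 can be approximated with the off-the-shelf Snorkel LabelModel, as in the hedged sketch below; the embodiment's own factor-graph parameterisation may differ, and the remapping of the {-1, 1} labels to {0, 1} classes (with -1 reserved for "no label from this source") is an assumption of the sketch.

```python
# Hedged sketch: fuse the multi-source noisy labels with Snorkel's LabelModel.
# L is an (N x M) matrix with one column per log source: 1 = similar, 0 = not
# similar, -1 = the source gave no label for this sample (abstain).
import numpy as np
from snorkel.labeling.model import LabelModel

def fuse_labels(L, alpha=0.95):
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L, n_epochs=500, seed=42)        # gradient-based fitting
    probs = label_model.predict_proba(L)             # soft label distribution
    p_similar = probs[:, 1]                          # probability of the "similar" class
    hard_idx = np.where(p_similar >= alpha)[0]       # filtering threshold alpha >= 0.95
    return p_similar, hard_idx
```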
S3.5, through the learning of the trained model, the noisy labels of the multi-source samples are fused into a soft label distribution. For example, for the medical entity "respiratory tract infection", a soft-labeled term pair sample set X_soft is generated: {(respiratory tract infection, upper and lower respiratory tract infection, 1.0), (respiratory tract infection, respiratory tract (upper and lower respiratory tract) infection, 0.989), (respiratory tract infection, upper respiratory tract infection (URTI), 0.75), (respiratory tract infection, respiratory tract viral infection, 0.65)}.
So far, the similarity relations between term pairs have been processed at the data layer. This embodiment further unifies medical concepts at the term semantic layer and represents them with unique concept codes, such as the CUI of UMLS and the CODE of SNOMED CT. Specifically, a filtering threshold α (α ≥ 0.95) is set and X_soft is filtered to obtain the term pair sample set X_hard, from which a concept graph is constructed. The specific process is as follows:
B1, for the term pair sample set X_hard, each term is taken as a node of the concept graph and each term pair forms an edge between its two nodes, constructing a concept subgraph G_sample; the term set at this point is Term_set_sample. For example, for (respiratory tract infection, upper and lower respiratory tract infection, 1.0), the two nodes are node_1 = {code="sample_1", name="respiratory tract infection"} and node_2 = {code="sample_2", name="upper and lower respiratory tract infection"}, and the edge is edge_12 = <node_1, node_2>.
B2, based on the UMLS prior library, with the concept code CUI as the unit, for the term set Term_set_umls a node node_i = {code="umls_i", name="t_str"} is created for each T_i ∈ Term_set_umls; a node may be a Chinese term or an English term, all terms are built into a node set, and the nodes within the same CUI term set are pairwise connected by edges, forming a plurality of independent concept subgraphs G_UMLS.
B3, following the construction of G_UMLS, the concept subgraphs G_x of other term libraries are constructed; for example, for a traditional-Chinese-medicine term library, G_TCM is constructed.
B4, based on identical node terms, the concept graph G is obtained from the several concept subgraphs:
G = G_sample ∪ G_UMLS ∪ (⋃_x G_x)
Similarly, the overall set of medical terms is expressed as:
Term_set = Term_set_sample ∪ Term_set_umls ∪ (⋃_x Term_set_x)
B5, for the graph G, the independent connected subgraphs are separated, each connected subgraph being defined as a concept subgraph G_i^sub; the separation is computed with the connected_components method of the Python third-party package networkx, expressed as:
{G_1^sub, G_2^sub, ..., G_K^sub} = connected_components(G)
Each G_i^sub is assigned a unique global concept code, called the concept code and denoted gid_i^C; the node terms in the subgraph corresponding to gid_i^C are collected to construct the medical term semantic equivalence set S_i^{gid_C}, and the full set of term equivalence sets is denoted S_C.
After the concept codes gid^C are unified, the term mapping relation list gid2cid_list between gid^C and the public open-source term libraries is obtained.
The medical terms of Term_set are automatically numbered; the numbered medical term set is denoted Term_set' and the code number field is denoted tid, and the term mapping list tid2gid_list between gid^C and Term_set' is obtained.
For example, for the code "0085580", the corresponding equivalent set of medical terms is {"essential hypertension", "idiopathic (primary) hypertension", "primary hypertension", "essential hypertension", "Primary hypertension", "hypertension"}, the corresponding gid2cid_list entry is ["0085580": {"umls": "C0085580", "sct": "194760004, 59621000", "icd10": "I10", ...}], and the tid2gid_list contains entries such as ("0002": "0085580").
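A minimal sketch of the graph construction and splitting in B1–B5, using the networkx connected_components method mentioned above; the node representation and the zero-padded gid numbering are illustrative assumptions.

```python
# Sketch: build the concept graph from term pairs and prior term libraries,
# split it into connected subgraphs, and assign a global concept code (gid)
# to each subgraph; the zero-padded gid format is an assumption.
import networkx as nx

def build_concept_sets(hard_pairs, cui_term_sets):
    G = nx.Graph()
    for t1, t2 in hard_pairs:                   # B1: edges from X_hard term pairs
        G.add_edge(t1, t2)
    for terms in cui_term_sets:                 # B2: pairwise edges inside one CUI term set
        terms = list(terms)
        for i in range(len(terms)):
            for j in range(i + 1, len(terms)):
                G.add_edge(terms[i], terms[j])

    equiv_sets = {}                             # B5: one equivalence set per connected subgraph
    for k, component in enumerate(nx.connected_components(G)):
        gid = f"{k:07d}"                        # unique global concept code
        equiv_sets[gid] = set(component)
    return equiv_sets
```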
The term model can technically be divided into two main types: text similarity models and semantic similarity models. Text similarity generally calculates the distance between two terms based on character matching. For semantic similarity calculation, the terms are first embedded into some vector space to obtain their vector representations, such as word2vec or BERT encodings, and the approximation is then calculated with a distance formula. However, the accuracy of general-purpose embedding models is not high enough to meet the semantic requirements of terms, and even large models such as BERT exhibit a considerable representation-collapse phenomenon.
In summary, this embodiment proposes a Chinese medical term alignment model that fuses Chinese-English text retrieval with contrastive-learning semantic reasoning, as shown in FIG. 3 (taking UMLS as an example). It mainly comprises a term library, a text search engine and a semantic search engine. The term library includes the open term libraries, the concept-code-to-term-library mapping table gid2cid_list, the term concept equivalence set S_C and other related global content. The text search engine covers index preservation and retrieval similarity calculation for Chinese and English terms. The semantic search engine covers the sample strategies, semantic model training, vector data persistence, semantic retrieval and other related content.
The text search engine is specifically constructed as follows:
C1, acquiring the concept set S_Concept, recording with the term concept as the unit. A data model with the term library table head shown in FIG. 3 (ID, English standard word, English synonyms, Chinese standard word, Chinese synonyms) is created, where the ID is gid_C; the corresponding English standard word is obtained from the open-source library, and if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word, with the remaining English terms used as English synonyms.
C2, the data of the term library are text-cleaned, for example converted to lower case and stripped of special symbols, then stored into a database and indexed by term. In this embodiment ES (Elasticsearch) is selected: the data model creates the corresponding fields, and the characters are filtered and processed with the "html_strip", "lowercase", "asciifolding", "stemmer" and "english_stop" filters; English and Chinese word segmentation is then performed, and the inverted index is stored in the database.
C3, for a given query Chinese term Q_zh_txt, the English term Q_en_txt is obtained through translation with an open professional medical term dictionary or an open translation API; Q_zh_txt and Q_en_txt are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated respectively.
Note that the BM25 score is calculated as follows:
score_bm25(Q, D) = Σ_i IDF(q_i) · f(q_i, D)·(k_1 + 1) / ( f(q_i, D) + k_1·(1 − b + b·|D|/avgdl) )
IDF(q_i) = ln( (N − n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )
where score_bm25(Q, D) represents the approximation score of the query term Q and the term document D, q_i represents an element of the segmented Q, n(q_i) represents the number of term documents containing element q_i, N represents the total number of term documents, f(q_i, D) represents the frequency of element q_i in term document D, |D| represents the term document length, avgdl represents the average length of the term documents, and k_1 and b are hyper-parameters with default values 1.2 and 0.75 respectively. From this, the medical term text similarity calculation can be derived as:
score_txt(Q, D) = α_1·score_bm25(Q_en_txt, D_en_fsn) + α_2·score_bm25(Q_en_txt, D_en_sim) + α_3·score_bm25(Q_zh_txt, D_zh_fsn) + α_4·score_bm25(Q_zh_txt, D_zh_sim)
where D_en_fsn, D_en_sim, D_zh_fsn and D_zh_sim denote the field documents of the English standard word, English synonyms, Chinese standard word and Chinese synonyms respectively, and α_i is a hyper-parameter representing the weight of each field, satisfying α_i ∈ [0, 1] and Σ_i α_i = 1.
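The field-weighted text score can be sketched directly from the BM25 formula above; the tokenisation, the per-field statistics dictionary and the equal default weights α are placeholders.

```python
# Sketch of score_txt: BM25 per field, then a weighted sum over the four fields
# (English standard word / synonyms, Chinese standard word / synonyms).
import math

def bm25(query_tokens, doc_tokens, docs_with, n_docs, avgdl, k1=1.2, b=0.75):
    score = 0.0
    for q in query_tokens:
        n_q = docs_with.get(q, 0)                        # documents containing q
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1.0)
        f_q = doc_tokens.count(q)                        # term frequency in this document
        denom = f_q + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f_q * (k1 + 1) / denom
    return score

def score_txt(q_en, q_zh, doc, stats, alphas=(0.25, 0.25, 0.25, 0.25)):
    # doc[field] is the token list of that field; stats[field] holds corpus statistics.
    fields = [("en_fsn", q_en), ("en_sim", q_en), ("zh_fsn", q_zh), ("zh_sim", q_zh)]
    return sum(a * bm25(q, doc[f], stats[f]["docs_with"], stats[f]["n_docs"], stats[f]["avgdl"])
               for a, (f, q) in zip(alphas, fields))
```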
The semantic search engine is modeled on a SimCSE model based on contrastive learning; the specific process is as follows:
D1, model selection. The semantic framework selected in this embodiment is based on a dual-tower model, that is, the two terms are first encoded and their similarity is then calculated. In addition, to overcome the collapse phenomenon of large models, a SimCSE model based on contrastive learning is selected to train the terms;
D2, since SimCSE is a contrastive learning model, it can be trained in a self-supervised way without labels, so part of the samples are constructed from the terms themselves. In addition, since this embodiment acquires a large amount of supervision data from the log data, a hybrid SimCSE model is adopted: samples are constructed with terms and concepts as the units by taking the pairwise combinations within each equivalence set of S_C as supervised samples; together these form the sample set X_cse.
D3, forward calculation: for sample x i ∈X cse Samples such as "< headache" >, "< headach >", and ">" exist, and samples such as "< headache" >, "< headach" > ", and" > "exist, and the two terms are calculated by the encoder respectively, in this embodiment, chinese BERT is selected as the encoder to calculate, which can be expressed as:
vec 1 ,vec 2 =encoder(x i,11 ),encoder(x i,22 )
Wherein, the encoder (·) represents the encoder function, and Chinese BERT, μ is taken 12 The larger the parameter setting, the more dead neural network elements output by the neural network, x i,1 ,x i,2 Respectively represent sample x i Is a first term and a second term.
Because the input is a term pair, the term pair is taken as a positive sample in the calculation of the model, and the negative sample is taken from x in a batch of samples in training j,1 X of the same batch j, 2, wherein x is j ∈X cse ,j>1。
D4, defining a loss function. For a batch size of B, the trained penalty function is:
wherein, the similarity is calculated by cosine:
and D5, reversely calculating. Solving gradient, reverse iteration updating, completing model learning and completing encoder * (·) =learning of the encoder (·, 0).
D6, term vector set indexing. For each T_i' ∈ Term_set', the vector set Term_vecs is calculated as follows:
term_vec_i = encoder*(T_i'), term_vec_i ∈ Term_vecs
D7, indexing and retrieval of the Term_vecs vector set. Term_vecs is indexed into a vector database such as Milvus, Vald or Qdrant, or with a vector retrieval component or tool such as Facebook's FAISS, Microsoft's SPTAG or Spotify's Annoy; the similarity-computation algorithms are mainly based on the approximate nearest-neighbor search idea. In this embodiment the open-source Milvus is selected to store and retrieve the vector set, and cosine calculation is adopted for the similarity of vectors.
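For the vector indexing and retrieval of D6–D7, the sketch below uses FAISS (one of the tools listed above) with L2-normalised vectors so that inner product equals cosine similarity; the embodiment itself uses Milvus, and the dimensions and top-k are placeholders.

```python
# Sketch of vector indexing and cosine retrieval with FAISS.
import faiss
import numpy as np

def build_index(term_vecs):                     # term_vecs: (num_terms, dim) array
    vecs = term_vecs.astype("float32").copy()
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])    # inner product on unit vectors = cosine
    index.add(vecs)
    return index

def search(index, query_vec, top_k=150):
    q = query_vec.astype("float32").reshape(1, -1).copy()
    faiss.normalize_L2(q)
    scores, ids = index.search(q, top_k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```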
For Q_zh_txt and Q_en_txt, the queries are first encoded and the query scores are then calculated as the cosine similarity between the encoded queries and the indexed term vectors.
Further, the Chinese medical term alignment model predicts the query term input by the user at the client and gives the candidate concept coding result; the specific process is as follows:
E1, the query Chinese term Q_zh_txt input by the user is translated to obtain the corresponding English term Q_en_txt; candidate sequences can then be obtained, the TOP 150 are taken and the scores are normalized to 0–1, giving the final three sequences cand_seq_txt, cand_seq_vec_zh and cand_seq_vec_en;
E2, cand_seq_vec_zh and cand_seq_vec_en are combined with the mapping relation tid2gid_list and mapped to gid_C; in this process, when multiple tids correspond to one gid_C, the maximum score is taken, thereby forming sequences whose elements are keyed by gid_C, denoted cand_seq'_vec_zh and cand_seq'_vec_en;
E3, with gid_C as the key, cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en are merged using an outer-join operation to generate the final candidate sequence cand_seq, and the weighted score is calculated as follows:
score_i = κ_1·score_txt,i + κ_2·score_zh_vec,i + κ_3·score_en_vec,i
where score_txt,i, score_zh_vec,i and score_en_vec,i denote the scores of cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en respectively, and κ_i is a hyper-parameter representing the weight of each score, satisfying:
κ_i ∈ [0, 1] and Σ_i κ_i = 1.
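The gid mapping and outer-join fusion of E2–E3 can be sketched with plain dictionaries as below; defaulting missing scores to zero and the example κ weights are assumptions of the sketch.

```python
# Sketch of E2/E3: map tid-keyed candidates to gid (keeping the max score per gid),
# outer-join the three candidate sequences on gid, and weight the scores with kappa.
def to_gid_scores(cand_seq, tid2gid):
    gid_scores = {}
    for tid, score in cand_seq:
        gid = tid2gid[tid]
        gid_scores[gid] = max(score, gid_scores.get(gid, 0.0))
    return gid_scores

def fuse(cand_txt, cand_zh, cand_en, kappa=(0.4, 0.3, 0.3)):
    gids = set(cand_txt) | set(cand_zh) | set(cand_en)          # outer join on gid
    fused = {g: kappa[0] * cand_txt.get(g, 0.0)
                + kappa[1] * cand_zh.get(g, 0.0)
                + kappa[2] * cand_en.get(g, 0.0) for g in gids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```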
various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims (6)

1. The Chinese medical term self-adaptive alignment method based on log feedback is characterized by comprising the following steps of:
s1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
s2, a user can input a queried medical term through the client; then the application server searches the concept codes related to the query words through the Chinese medical term alignment model of the term server and returns a candidate concept code sequence, and at the moment, a user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain high-quality training samples; the term server trains the Chinese medical term alignment model based on contrastive learning with the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
2. The method according to claim 1, wherein the specific process of constructing the initial medical term sample data in step S1 is:
S1.1, selecting a UMLS open-source medical resource library and collecting UMLS terms;
S1.2, translating the UMLS terms collected in step S1.1.
3. The method according to claim 1, wherein the specific implementation procedure of step S2 is as follows:
S2.1, instrumenting the application interface of the client with event tracking points (buried points) and log collection requests; once an event point is triggered, the client sends a log record request to the application server via script code, completing one operation log record;
s2.2, the log system of the application server responds to the log record request of the client, completes operation log record, and records the log of the service processing process of the application server to obtain transaction log data;
s2.3, synchronizing the operation log data and the transaction log data to a log warehouse:
The operation log record request of the client includes fields such as the request IP, user UUID, time, event type and service parameters; the transaction log fields of the application server additionally include the request time, user UUID, event type, event method and service parameters;
All the obtained logs are grouped by user UUID, the user behavior process is restored in time order, the user's selection is extracted from the time sequence, and the final data annotation is completed; the log data are processed into a data structure of the form (server code, UUID, date, queried medical term, candidate set, selection set, whether custom), and the formatted data are updated into the log warehouse.
4. The method according to claim 1, wherein the specific procedure of step S3 is as follows:
s3.1, defining a sample format: first converting the structure of the log data into a form of (term 1, term 2, {1, -1 }), each sample containing a term pair; the specific conversion rule is as follows:
(1) Constructing positive samples from the queried medical term and the terms corresponding to the selection set;
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample;
S3.2, defining and establishing the learning model: samples whose frequency is smaller than 3 are deleted from the positive and negative sample sets generated by the two rules of step S3.1, finally obtaining a sample set S; the number of log sources is M and the total number of samples is N; the sample labelling matrix is $A \in \{-1,1\}^{N \times (2M+|C|)}$, where C denotes the set of pairwise combinations of log sources $(j,k)$ and captures the correlation between log sources; the true label of the samples, $Y \in \{-1,1\}^{N}$, is a hidden variable; the relation between the multi-source labels and the true labels is defined as a factor graph, a probabilistic graphical model denoted $P_\theta(A, Y)$, for which three factors are defined:
the labelling matrix generated by factor (1) is denoted $\phi^{Lab}$, the accuracy matrix generated by factor (2) is denoted $\phi^{Acc}$, and the correlation matrix generated by factor (3) is denoted $\phi^{Corr}$; specifically, the element $\phi^{Lab}_{i,j} = 1$ if the log data of source $j$ contains a similar term for sample $x_i$, and $\phi^{Lab}_{i,j} = -1$ otherwise; the element $\phi^{Acc}_{i,j} = 1$ if the label annotated by source $j$ agrees with the true label of the sample, and $\phi^{Acc}_{i,j} = -1$ otherwise; the element $\phi^{Corr}_{i,j,k} = 1$ if sources $j$ and $k$ give the same label to sample $x_i$, and $\phi^{Corr}_{i,j,k} = -1$ otherwise; combining the three factors therefore yields the factor vector
$$\phi_i(A, y_i) = \big[\phi^{Lab}_i(A, y_i),\ \phi^{Acc}_i(A, y_i),\ \phi^{Corr}_i(A, y_i)\big] \in \{-1,1\}^{2M+|C|}$$
In summary, with the combined factor expression denoted $\phi_i(A, y_i)$, the learning model is defined as:
$$P_\theta(A, Y) = \frac{1}{Z_\theta} \exp\!\Big(\sum_{i=1}^{N} \theta^{T}\, \phi_i(A, y_i)\Big)$$
wherein $\theta$ denotes the weights of the probability distribution and $Z_\theta$ is the normalization constant;
S3.3, training the learning model: since the learning model $P_\theta(A, Y)$ contains the hidden variable $Y$, the negative log marginal likelihood given the observable label matrix $A$ is minimized:
$$\hat{\theta} = \arg\min_{\theta}\; -\log \sum_{Y} P_\theta(A, Y)$$
the optimization problem is solved by gradient descent, with the gradient estimated by a Gibbs sampling algorithm, using Stanford's Snorkel toolkit, and the learned parameters are denoted $\theta^{*}$;
S3.4, the learning model parameters $\theta^{*}$ obtained in step S3.3 give the trained learning model $P_{\theta^{*}}(A, Y)$;
S3.5, the noisy labels of the multi-source samples are merged through the trained learning model to obtain a soft-label distribution, generating the soft-labelled term-pair sample set $X_{soft}$.
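A minimal sketch of the S3.3-S3.5 aggregation using the Snorkel LabelModel; the toy label matrix, the abstain convention (0 standing for "this source gave no label") and the remapping are illustrative assumptions, and the current Snorkel release fits its label model with a matrix-completion style estimator rather than Gibbs sampling, though it serves the same purpose of merging noisy multi-source labels into a soft distribution.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Toy multi-source label matrix in the claim's {-1, +1} convention,
# with 0 standing in for "this log source gave no label" (assumption).
A = np.array([
    [ 1,  1,  0],
    [-1,  1, -1],
    [ 1,  0,  1],
])

# Snorkel expects class labels in {0, ..., k-1} and -1 for abstain, so remap:
L = np.full_like(A, -1)
L[A == 1] = 1      # positive term pairs
L[A == -1] = 0     # negative term pairs

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

# Soft label distribution (columns: P(negative), P(positive)) -> X_soft;
# it is later filtered with the threshold alpha to obtain X_hard.
soft_labels = label_model.predict_proba(L)
```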
A filtering annotation threshold $\alpha$ ($\alpha \geq 0.95$) is set and $X_{soft}$ is filtered with it to obtain the term-pair sample set $X_{hard}$, from which a concept graph is built:
B1, for the term-pair sample set $X_{hard}$, each term is taken as a node of the concept graph and each term pair forms an edge between its two nodes, constructing the concept subgraph $G_{sample}$; the term set at this point is $Term\_set_{sample}$;
B2, based on the UMLS prior library, with the concept code CUI as the unit, the term set $Term\_set_{umls}$ is taken; for each $T_i \in Term\_set_{umls}$ a node is created (a node may be a Chinese term or an English term), all terms are constructed into the node set, and the nodes within the same CUI term set are connected pairwise by edges, forming the independent concept subgraph $G_{UMLS}$;
B3, following the construction method of $G_{UMLS}$, the concept subgraphs $G_x$ of the other term libraries are constructed;
B4, merging the concept subgraphs on identical node terms to obtain the concept graph $G$:
$$G = G_{sample} \cup G_{UMLS} \cup \bigcup_{x} G_{x}$$
similarly, the overall medical term set is expressed as:
$$Term\_set = Term\_set_{sample} \cup Term\_set_{umls} \cup \bigcup_{x} Term\_set_{x}$$
B5, for the graph $G$, the independent connected subgraphs are separated, each connected subgraph being defined as a concept subgraph $G_i^{C}$; they are computed with the connected_components method of the third-party Python package networkx, expressed in the form:
$$\{G_1^{C}, G_2^{C}, \ldots\} = \mathrm{connected\_components}(G)$$
each $G_i^{C}$ is given a unique global concept code, called the concept code and denoted $gid_i^{C}$; the node terms of the subgraph corresponding to $gid_i^{C}$ are collected to construct the medical-term semantic equivalence set $S_i^{C}$, and the full-scale term equivalence set is denoted $S_C$;
after the concept codes $gid^{C}$ are unified, the term mapping relation list gid2cid_list between $gid^{C}$ and the open-source term libraries is obtained;
the medical terms of $Term\_set$ are automatically numbered, the numbered medical term set is denoted $Term\_set'$ and the number field is denoted tid, and at the same time the term mapping relation list tid2gid_list between $gid^{C}$ and $Term\_set'$ is obtained (a graph-construction sketch follows).
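A minimal sketch of the graph construction in B1-B5 with networkx; the GID numbering scheme and the restriction of X_hard edges to positively labelled pairs are illustrative assumptions.

```python
import networkx as nx

def _connect_pairwise(graph, terms):
    """Connect every pair of terms belonging to one concept group (as in B2)."""
    terms = list(terms)
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            graph.add_edge(terms[i], terms[j])

def build_concept_codes(term_pairs_hard, umls_cui_terms, other_term_groups=()):
    """B1-B5: merge the sample, UMLS and other term-library subgraphs on shared
    term nodes, then give each connected component one concept code (gid)."""
    G = nx.Graph()
    # G_sample: positively aligned pairs from X_hard become edges (assumption)
    G.add_edges_from((t1, t2) for t1, t2, label in term_pairs_hard if label == 1)
    # G_UMLS: terms sharing a CUI are connected pairwise
    for terms in umls_cui_terms.values():
        _connect_pairwise(G, terms)
    # G_x: other term libraries, built the same way
    for terms in other_term_groups:
        _connect_pairwise(G, terms)
    # B5: each connected component is one concept subgraph with its own gid
    gid2terms = {f"GID{i:08d}": set(c) for i, c in enumerate(nx.connected_components(G))}
    term2gid = {t: gid for gid, terms in gid2terms.items() for t in terms}
    return gid2terms, term2gid
```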
5. The method according to claim 4, wherein the Chinese medical term alignment model mainly consists of a term library, a text search engine and a semantic search engine; the term library comprises the open term libraries, the concept-code-to-term-library mapping table gid2cid_list, and the term concept equivalence set $S_C$.
The text search engine is specifically constructed as follows:
C1, acquiring the concept set $S_{Concept}$, recorded with the term concept as the unit; a data model is created whose fields comprise an ID, the English standard word, English synonyms, the Chinese standard word and Chinese synonyms, the ID being the concept code $gid^{C}$; the corresponding English standard word is obtained from the open-source library, and if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word and the remaining English terms are used as English synonyms;
C2, the data of the term library are text-cleaned and then stored into a database, and an index is built on the terms;
C3, for a given query Chinese term $Q_{zh\_txt}$, the English term $Q_{en\_txt}$ is obtained through an open professional medical term dictionary or an open translation API; $Q_{zh\_txt}$ and $Q_{en\_txt}$ are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated;
the medical term text similarity is calculated as follows:
$$score_{txt} = \alpha_1 \cdot \mathrm{BM25}(Q_{en\_txt}, D_{en\_fsn}) + \alpha_2 \cdot \mathrm{BM25}(Q_{en\_txt}, D_{en\_sim}) + \alpha_3 \cdot \mathrm{BM25}(Q_{zh\_txt}, D_{zh\_fsn}) + \alpha_4 \cdot \mathrm{BM25}(Q_{zh\_txt}, D_{zh\_sim})$$
wherein $D_{en\_fsn}$, $D_{en\_sim}$, $D_{zh\_fsn}$, $D_{zh\_sim}$ denote the field documents of the English standard word, the English synonyms, the Chinese standard word and the Chinese synonyms respectively, and $\alpha_i$ is a hyper-parameter representing the weight of each field, satisfying:
$$\alpha_i \in [0,1] \quad \text{and} \quad \sum_{i} \alpha_i = 1$$
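A minimal sketch of the weighted four-field BM25 score; the claim does not name an implementation, so the rank_bm25 package and jieba segmentation are illustrative choices, as are the field keys and the weight dictionary.

```python
import jieba                      # Chinese word segmentation (illustrative choice)
from rank_bm25 import BM25Okapi   # BM25 implementation (illustrative choice)

def tokenize(text, lang):
    return jieba.lcut(text) if lang == "zh" else text.lower().split()

def build_field_index(field_docs, lang):
    """Index one field document collection, e.g. all Chinese standard words."""
    return BM25Okapi([tokenize(doc, lang) for doc in field_docs])

def text_score(q_zh, q_en, field_indexes, alphas):
    """Weighted BM25 over the four fields en_fsn, en_sim, zh_fsn, zh_sim."""
    queries = {"en_fsn": (q_en, "en"), "en_sim": (q_en, "en"),
               "zh_fsn": (q_zh, "zh"), "zh_sim": (q_zh, "zh")}
    total = None
    for field, (query, lang) in queries.items():
        scores = field_indexes[field].get_scores(tokenize(query, lang))
        total = alphas[field] * scores if total is None else total + alphas[field] * scores
    return total   # one aggregated score_txt per indexed concept record
```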
The semantic search engine is modelled on the SimCSE model based on contrastive learning; the specific process is as follows:
D1, model selection: the two terms are each encoded and their similarity is then calculated; the SimCSE model based on contrastive learning is selected for training;
D2, since SimCSE is a contrastive-learning model, it can be trained in a self-supervised manner without annotation, so part of the samples are constructed from the terms themselves; in addition, since a large amount of supervised data is obtained from the log data, a hybrid SimCSE model is adopted: with terms and concepts as the unit, the terms within each equivalence set of $S_C$ are paired two by two to form further samples, finally obtaining the sample set $X_{cse}$;
D3, forward calculation: for a sample $x_i \in X_{cse}$, the two terms are computed by the encoder, expressed as:
$$vec_1, vec_2 = \mathrm{encoder}(x_{i,1}, \mu_1),\ \mathrm{encoder}(x_{i,2}, \mu_2)$$
wherein $\mathrm{encoder}(\cdot)$ denotes the encoder function, for which a Chinese BERT is taken; $\mu_1, \mu_2$ are dropout parameters: the larger their setting, the more neural network units are dropped (zeroed) in the network output; $x_{i,1}, x_{i,2}$ respectively denote term 1 and term 2 of sample $x_i$;
since the input is a term pair, the term pair itself is taken as the positive sample in the model computation, and during training the negative samples are taken from the $x_{j,2}$ of the other samples $x_j$ in the same batch, where $x_j \in X_{cse}$ and $j \neq i$;
D4, defining the loss function: for a batch of size $B$, the training loss is:
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{e^{\mathrm{sim}(vec_{i,1},\, vec_{i,2})/\tau}}{\sum_{j=1}^{B} e^{\mathrm{sim}(vec_{i,1},\, vec_{j,2})/\tau}}$$
wherein $\tau$ is the temperature hyper-parameter and the similarity is calculated by cosine:
$$\mathrm{sim}(vec_1, vec_2) = \frac{vec_1 \cdot vec_2}{\lVert vec_1 \rVert \, \lVert vec_2 \rVert}$$
D5, backward calculation: the gradients are computed and the parameters are updated by reverse iteration, completing the model learning and obtaining the trained encoder $\mathrm{encoder}^{*}(\cdot) = \mathrm{encoder}(\cdot, 0)$, i.e. the encoder with dropout set to 0;
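A minimal sketch of the in-batch contrastive training in D3-D5, using PyTorch and a HuggingFace Chinese BERT; the checkpoint name, [CLS] pooling and temperature value are illustrative assumptions, with BERT's built-in dropout playing the role of μ1, μ2.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # illustrative checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")         # dropout acts as mu_1 / mu_2

def encode(terms):
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]              # [CLS] pooling (illustrative)

def simcse_loss(terms_1, terms_2, tau=0.05):
    """In-batch InfoNCE: pair (i, i) is the positive, pairs (i, j != i) are the negatives."""
    vec1 = F.normalize(encode(terms_1), dim=-1)   # B x d
    vec2 = F.normalize(encode(terms_2), dim=-1)   # B x d
    sim = vec1 @ vec2.T / tau                     # B x B cosine-similarity matrix
    labels = torch.arange(sim.size(0))            # the diagonal holds the positive pairs
    return F.cross_entropy(sim, labels)

# One training step (optimizer setup omitted):
# loss = simcse_loss(["糖尿病", "高血压"], ["diabetes mellitus", "hypertension"])
# loss.backward()
```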
D6, computing the term vector set: for each $T_i' \in Term\_set'$, the vector set Term_vecs is calculated as follows:
$$term\_vec_i = \mathrm{encoder}^{*}(T_i'), \quad term\_vec_i \in Term\_vecs$$
D7, indexing and retrieving the Term_vecs vector set: a vector database or a vector retrieval component or tool is adopted to store and retrieve the vector set, and cosine similarity is used for the similarity calculation of the vectors;
$Q_{zh\_txt}$ and $Q_{en\_txt}$ are first encoded and the cosine similarity is then calculated to obtain the query scores:
$$score_{zh\_vec,i} = \mathrm{sim}\big(\mathrm{encoder}^{*}(Q_{zh\_txt}),\ term\_vec_i\big), \qquad score_{en\_vec,i} = \mathrm{sim}\big(\mathrm{encoder}^{*}(Q_{en\_txt}),\ term\_vec_i\big)$$
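The claim leaves the vector database or retrieval component open; below is a minimal sketch using Faiss with inner product over L2-normalized vectors, which is equivalent to cosine similarity. Faiss and the flat index type are illustrative choices.

```python
import numpy as np
import faiss

def build_index(term_vecs):
    """Index L2-normalized term vectors so that inner product equals cosine similarity."""
    vecs = np.array(term_vecs, dtype="float32")
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

def search(index, query_vec, top_k=150):
    """Return (scores, term ids) of the top_k most similar indexed terms (TOP 150 in claim 6)."""
    query = np.array([query_vec], dtype="float32")
    faiss.normalize_L2(query)
    scores, tids = index.search(query, top_k)
    return scores[0], tids[0]
```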
6. The method according to claim 5, wherein the Chinese medical term alignment model retrieves the concept codes related to the query term and returns the candidate concept code sequence as follows:
E1, the query Chinese term $Q_{zh\_txt}$ input by the user is translated to obtain the corresponding English term $Q_{en\_txt}$; the candidate sequences are retrieved, the TOP 150 results are taken and the scores are normalized to [0, 1], finally obtaining the three sequences $cand\_seq_{txt}$, $cand\_seq_{vec\_zh}$ and $cand\_seq_{vec\_en}$;
E2, $cand\_seq_{vec\_zh}$ and $cand\_seq_{vec\_en}$ are mapped to $gid^{C}$ by combining the mapping relation tid2gid_list; in this process, when multiple tids correspond to one $gid^{C}$, the maximum score is taken, thereby forming sequences of $\langle gid_i^{C}, score_i \rangle$ elements, denoted $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$;
E3, with $gid^{C}$ as the key, $cand\_seq_{txt}$, $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$ are merged using an outer-join operation, generating the final candidate sequence cand_seq; the weighted score is calculated as follows:
$$score_i = \kappa_1 \cdot score_{txt,i} + \kappa_2 \cdot score_{zh\_vec,i} + \kappa_3 \cdot score_{en\_vec,i}$$
wherein $score_{txt,i}$, $score_{zh\_vec,i}$, $score_{en\_vec,i}$ respectively denote the scores from $cand\_seq_{txt}$, $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$, and $\kappa_i$ is a hyper-parameter representing the weight of each score, satisfying:
$$\kappa_i \in [0,1] \quad \text{and} \quad \sum_{i=1}^{3} \kappa_i = 1$$
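A minimal sketch of the E2 remapping and E3 outer-join merge, assuming each candidate sequence is a dict from concept code (or tid) to its normalized score; the helper names and the default κ weights are illustrative assumptions.

```python
def remap_to_gid(cand_seq_vec, tid2gid):
    """E2: map tid-keyed vector scores to concept codes, keeping the maximum score per gid."""
    remapped = {}
    for tid, score in cand_seq_vec.items():
        gid = tid2gid[tid]
        remapped[gid] = max(score, remapped.get(gid, 0.0))
    return remapped

def merge_candidates(cand_seq_txt, cand_seq_vec_zh, cand_seq_vec_en, kappa=(0.4, 0.3, 0.3)):
    """E3: outer join on gid, then a weighted sum of the three normalized scores."""
    all_gids = set(cand_seq_txt) | set(cand_seq_vec_zh) | set(cand_seq_vec_en)
    merged = {gid: kappa[0] * cand_seq_txt.get(gid, 0.0)
                 + kappa[1] * cand_seq_vec_zh.get(gid, 0.0)
                 + kappa[2] * cand_seq_vec_en.get(gid, 0.0)
              for gid in all_gids}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)   # final cand_seq
```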
CN202310647595.9A 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback Active CN116680377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310647595.9A CN116680377B (en) 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback


Publications (2)

Publication Number Publication Date
CN116680377A true CN116680377A (en) 2023-09-01
CN116680377B CN116680377B (en) 2024-04-23

Family

ID=87781772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310647595.9A Active CN116680377B (en) 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback

Country Status (1)

Country Link
CN (1) CN116680377B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337268A1 (en) * 2016-05-17 2017-11-23 Xerox Corporation Unsupervised ontology-based graph extraction from texts
CN112712804A (en) * 2020-12-23 2021-04-27 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN113254418A (en) * 2021-05-11 2021-08-13 广州中康数字科技有限公司 Medicine classification management data analysis system
CN113377897A (en) * 2021-05-27 2021-09-10 杭州莱迈医疗信息科技有限公司 Multi-language medical term standard standardization system and method based on deep confrontation learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN Maosong et al.: "Automatic Extraction of Bilingual Terms from Chinese-English Parallel Patents", Journal of Tsinghua University (Science and Technology), vol. 54, no. 10, 31 October 2014 (2014-10-31), pages 1339-1343 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725076A (en) * 2024-02-01 2024-03-19 厦门她趣信息技术有限公司 Faiss-based distributed massive similarity vector increment training system
CN117725076B (en) * 2024-02-01 2024-04-09 厦门她趣信息技术有限公司 Faiss-based distributed massive similarity vector increment training system

Also Published As

Publication number Publication date
CN116680377B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN110781683B (en) Entity relation joint extraction method
CN108875051B (en) Automatic knowledge graph construction method and system for massive unstructured texts
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN112559556B (en) Language model pre-training method and system for table mode analysis and sequence mask
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
WO2023065858A1 (en) Medical term standardization system and method based on heterogeneous graph neural network
CN116680377B (en) Chinese medical term self-adaptive alignment method based on log feedback
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN113988075A (en) Network security field text data entity relation extraction method based on multi-task learning
CN110377690B (en) Information acquisition method and system based on remote relationship extraction
CN116049422A (en) Echinococcosis knowledge graph construction method based on combined extraction model and application thereof
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
Jin et al. Fintech key-phrase: a new Chinese financial high-tech dataset accelerating expression-level information retrieval
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN117149974A (en) Knowledge graph question-answering method for sub-graph retrieval optimization
CN116562302A (en) Multi-language event viewpoint object identification method integrating Han-Yue association relation
CN114943216B (en) Case microblog attribute level view mining method based on graph attention network
CN116127097A (en) Structured text relation extraction method, device and equipment
CN116227594A (en) Construction method of high-credibility knowledge graph of medical industry facing multi-source data
CN115481220A (en) Post and resume content-based intelligent matching method and system for comparison learning human posts
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN115081445A (en) Short text entity disambiguation method based on multitask learning
Santosh et al. ECtHR-PCR: A Dataset for Precedent Understanding and Prior Case Retrieval in the European Court of Human Rights
CN117540734B (en) Chinese medical entity standardization method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant