CN116680377A - Chinese medical term self-adaptive alignment method based on log feedback - Google Patents

Chinese medical term self-adaptive alignment method based on log feedback

Info

Publication number
CN116680377A
Authority
CN
China
Prior art keywords
term
log
sample
terms
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310647595.9A
Other languages
Chinese (zh)
Other versions
CN116680377B (en)
Inventor
梁锐
唐珂轲
陈美莲
黄毅宁
钟冬赐
林少泽
吴豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhongkang Digital Technology Co ltd
Original Assignee
Guangzhou Zhongkang Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Zhongkang Digital Technology Co ltd filed Critical Guangzhou Zhongkang Digital Technology Co ltd
Priority to CN202310647595.9A priority Critical patent/CN116680377B/en
Publication of CN116680377A publication Critical patent/CN116680377A/en
Application granted granted Critical
Publication of CN116680377B publication Critical patent/CN116680377B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F11/302 Monitoring arrangements where the monitored computing system component is a software system
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3438 Recording or statistical evaluation of user activity; monitoring of user actions
    • G06F16/23 Updating
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a Chinese medical term self-adaptive alignment method based on log feedback, implemented with log feedback, weak supervision and contrastive learning. Operation logs of the client are recorded, the medical terms in them are analyzed, identified and extracted, and concept subgraphs are split out, so that training samples are constructed automatically and the model performs self-learning and automatic indexing. As log data from downstream business systems are accessed, the model learns and improves by itself, and the self-learned model is then served back to the downstream systems, achieving automated and efficient term alignment through a closed loop of the whole process.

Description

Chinese medical term self-adaptive alignment method based on log feedback
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a Chinese medical term self-adaptive alignment method based on log feedback.
Background
Medical concept alignment is an important research direction in the field of medical information processing. It mainly refers to standardizing the terms, symbols, abbreviations and so on used in the medical field. In a medical information system, the same medical concept may be expressed by a plurality of different terms. This non-uniformity and inaccuracy of expression seriously hinders the integration, sharing and utilization of medical big data, and brings difficulties to clinical work, teaching and scientific research in the medical field: for example, confusion of terms, inaccurate information, omitted information, duplicated information, and difficulty in communicating across institutions.
Medical institutions adopt manual coding, in which medical concepts in clinical medical texts are mapped by hand into medical term codes based on an existing standard medical term dictionary. Manual coding requires a large number of professionals with medical knowledge, and is costly, limited in efficiency and low in accuracy.
In recent years, to address the cost and efficiency problems, deep neural networks and knowledge graphs have been widely used, especially in NLP applications for Chinese medical terms. Various approaches have emerged, such as finer-grained decomposition of text based on NER, entity extraction based on a combination of semi-supervised and active learning, and extraction based on deep learning and retrieval. These methods typically recall candidate terms and then re-rank them. Most of them need to collect data or translate from foreign English term libraries and, where necessary, rely on expert annotation. There are evidently several problems:
Problem 1: data collection is an expensive project that requires a great deal of funding, manpower and time;
Problem 2: the correctness of translating an English term library into Chinese remains a problem; adding manual review can improve the accuracy of the Chinese translation, but the huge review workload would make a term alignment project drag on indefinitely.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide a Chinese medical term self-adaptive alignment method based on log feedback.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a Chinese medical term self-adaptive alignment method based on log feedback specifically comprises the following steps:
s1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
s2, a user can input a queried medical term through the client; then the application server searches the concept codes related to the query words through the Chinese medical term alignment model of the term server and returns a candidate concept code sequence, and at the moment, a user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain high-quality training samples; the term server trains the Chinese medical term alignment model based on contrastive learning with the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
Further, the specific process of constructing the initial medical term sample data in step S1 is:
S1.1, selecting a UMLS open-source medical resource library and collecting UMLS terms;
S1.2, translating the UMLS terms collected in step S1.1.
Further, the specific implementation process of step S2 is as follows:
S2.1, instrumenting the application interface of the client with event tracking points (buried points) and log collection requests; once an event point is triggered, the client sends a log record request to the application server via script code, completing one operation log record;
s2.2, the log system of the application server responds to the log record request of the client, completes operation log record, and records the log of the service processing process of the application server to obtain transaction log data;
s2.3, synchronizing the operation log data and the transaction log data to a log warehouse:
The operation log record request of the client includes fields such as the request IP, user UUID, time, event type and service parameters; the transaction log fields of the application server additionally include the request time, user UUID, event type, event method and service parameters;
All the obtained logs are grouped by user UUID, the user behavior process is restored in time order, the user's selection is extracted from the time sequence, and the final data annotation is completed; the log data are processed into a data structure of the form (server code, UUID, date, queried medical term, candidate set, selection set, whether custom), and the formatted data are updated into the log warehouse.
Further, the specific process of step S3 is as follows:
s3.1, defining a sample format: first converting the structure of the log data into a form of (term 1, term 2, {1, -1 }), each sample containing a term pair; the specific conversion rule is as follows:
(1) Constructing positive samples from the queried medical term and the terms corresponding to the selection set;
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample;
S3.2, defining and establishing a learning model: from the positive and negative sample sets generated by the two rules in step S3.1, samples occurring with a frequency lower than 3 are deleted, finally giving a sample set S, where the number of log sources is M and the total number of samples is N; the sample labeling matrix is A ∈ {-1, 1}^{N×(2M+|C|)}, where C denotes the sampled pairwise combinations of log sources j and k and characterizes the correlation between log sources, and the true label of each sample, Y ∈ {-1, 1}^N, is a hidden variable; the relation between the multi-source labels and the true labels is defined as a factor graph model (a probabilistic graphical model), denoted P_θ(A, Y), with three factors:
φ^Lab_{i,j}(A, Y) = 1{A_{i,j} ≠ 0}    (1)
φ^Acc_{i,j}(A, Y) = 1{A_{i,j} = y_i}    (2)
φ^Corr_{i,j,k}(A, Y) = 1{A_{i,j} = A_{i,k}}, (j, k) ∈ C    (3)
The labeling matrix generated by formula (1) is denoted φ^Lab, the accuracy matrix generated by formula (2) is denoted φ^Acc, and the correlation matrix generated by formula (3) is denoted φ^Corr. Specifically, the element φ^Lab_{i,j} indicates whether sample x_i is labeled by the log data of source j: it equals 1 if source j gives a similarity label for the term pair, and 0 otherwise; the element φ^Acc_{i,j} equals 1 if the label given by source j is identical to the true label y_i, and 0 otherwise; the element φ^Corr_{i,j,k} equals 1 if sources j and k give the same label for sample x_i, and 0 otherwise. Concatenating the three factors for sample i gives the combined factor vector φ_i(A, Y) = [φ^Lab_i, φ^Acc_i, φ^Corr_i].
In summary, with the combined factor expression denoted φ_i(A, Y), the learning model is defined as:
P_θ(A, Y) = Z_θ^{-1} · exp( Σ_{i=1}^{N} θ^T φ_i(A, Y) )
where θ represents the weights of the probability distribution and Z_θ is the normalizing constant;
S3.3, training the learning model: since the true labels Y are hidden variables, the learning model P_θ(A, Y) is trained by minimizing the negative log marginal likelihood with respect to the observable label matrix A:
θ̂ = argmin_θ ( − log Σ_Y P_θ(A, Y) )
The optimization problem is solved by gradient descent combined with a Gibbs sampling algorithm, using the Stanford Snorkel toolkit; the learned parameters are denoted θ*;
S3.4, with the parameters θ* obtained in step S3.3, the trained learning model P_{θ*}(A, Y) is obtained;
S3.5, through the trained learning model, the noisy labels of the multi-source samples are merged into a soft label distribution Ỹ = P_{θ*}(Y | A), generating a soft-labeled term pair sample set X_soft;
A filtering threshold α (α ≥ 0.95) is set and X_soft is filtered to obtain the term pair sample set X_hard, from which a concept graph is constructed:
B1, for the term pair sample set X_hard, each term is taken as a node of the concept graph and each term pair forms an edge between its two nodes, constructing a concept subgraph G_sample; the term set at this point is Term_set_sample;
B2, based on the UMLS prior library, with the concept code CUI as the unit, for the term set Term_set_umls a node is created for each T_i ∈ Term_set_umls; a node may be a Chinese term or an English term, all terms are built into a node set, and the nodes within the same CUI term set are pairwise connected by edges, forming independent concept subgraphs G_UMLS;
B3, following the construction of G_UMLS, the concept subgraphs G_x of the other term libraries are constructed;
B4, based on identical node terms, the concept graph G is obtained from the several concept subgraphs:
G = G_sample ∪ G_UMLS ∪ (⋃_x G_x)
Similarly, the overall set of medical terms is expressed as:
Term_set = Term_set_sample ∪ Term_set_umls ∪ (⋃_x Term_set_x)
B5, for the graph G, the independent connected subgraphs are separated, each connected subgraph being defined as a concept subgraph G_i^sub; the separation is computed with the connected_components method of the Python third-party package networkx, expressed as:
{G_1^sub, G_2^sub, ..., G_K^sub} = connected_components(G)
Each G_i^sub is assigned a unique global concept code, called the concept code and denoted gid_i^C; the node terms in the subgraph corresponding to gid_i^C are collected to construct the medical term semantic equivalence set S_i^{gid_C}, and the full set of term equivalence sets is denoted S_C;
After the concept codes gid^C are unified, the term mapping relation list gid2cid_list between gid^C and the public open-source term libraries is obtained;
The medical terms of Term_set are automatically numbered; the numbered medical term set is denoted Term_set' and the code number field is denoted tid, and the term mapping list tid2gid_list between gid^C and Term_set' is obtained.
Further, the Chinese medical term alignment model mainly comprises a term library, a text search engine and a semantic search engine; the content of the term library comprises the open term libraries, the concept-code-to-term-library mapping table gid2cid_list and the term concept equivalence set S_C.
The text search engine is specifically constructed as follows:
C1, acquiring the concept set S_Concept, with the term concept as the recording unit; creating a data model whose fields comprise an ID, the English standard word, English synonyms, the Chinese standard word and Chinese synonyms, where the ID is gid_C; the corresponding English standard word is obtained from the open-source library, and if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word and the remaining English terms are used as English synonyms;
C2, performing text cleaning on the data of the term library, then storing the data into a database, and establishing an index for the term;
C3, for a given query Chinese term Q_zh_txt, the English term Q_en_txt is obtained through translation with an open professional medical term dictionary or an open translation API; Q_zh_txt and Q_en_txt are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated;
The medical term text similarity is calculated as follows:
score_txt(Q, D) = α_1·score_bm25(Q_en_txt, D_en_fsn) + α_2·score_bm25(Q_en_txt, D_en_sim) + α_3·score_bm25(Q_zh_txt, D_zh_fsn) + α_4·score_bm25(Q_zh_txt, D_zh_sim)
where D_en_fsn, D_en_sim, D_zh_fsn and D_zh_sim denote the field documents of the English standard word, English synonyms, Chinese standard word and Chinese synonyms respectively, and α_i is a hyper-parameter representing the weight of each field, satisfying α_i ∈ [0, 1] and Σ_i α_i = 1.
The semantic search engine is modeled on a SimCSE model based on contrastive learning; the specific process is as follows:
D1, model selection: the two terms are separately encoded and their similarity is then calculated; a SimCSE model based on contrastive learning is selected to train the terms;
D2, since SimCSE is a contrastive learning model, it can be trained in a self-supervised way without labels, so part of the samples are constructed from the terms themselves; in addition, since a large amount of supervision data is acquired from the log data, a hybrid SimCSE model is adopted, in which samples are constructed with terms and concepts as the units by taking the pairwise combinations within each equivalence set of S_C as supervised samples; together these form the sample set X_cse;
D3, forward calculation: for sample x i ∈X cse The two terms are calculated by the encoder, expressed as:
vec 1 ,vec 2 =encoder(x i,11 ),encoder(x i,22 )
wherein, the encoder (·) represents the encoder function, and Chinese BERT, μ is taken 12 The larger the parameter setting, the more dead neural network elements output by the neural network, x i,1 ,x i,2 Respectively represent sample x i Is the term 1 and the term 2 of (a);
because the input is a term pair, the term pair is taken as a positive sample in the calculation of the model, and the negative sample is taken from x in a batch of samples in training j,1 X of the same batch j,2 Composition, where x j ∈X cse ,j>1;
D4, defining a loss function; for a batch size of B, the trained penalty function is:
wherein, the similarity is calculated by cosine:
d5, reverse calculation: solving gradient, reverse iteration updating, completing model learning and completing encoder * Learning of (-) = encoder (·, 0);
D6, indexing the term vector set; for each T_i' ∈ Term_set', the vector set Term_vecs is calculated as follows:
term_vec_i = encoder*(T_i'), term_vec_i ∈ Term_vecs
D7, indexing and retrieval of the Term_vecs vector set: a vector database or a vector retrieval component or tool is adopted to store and retrieve the vector set, and cosine similarity is used for comparing vectors;
For Q_zh_txt and Q_en_txt, the queries are first encoded and the query scores are then calculated as the cosine similarity between the encoded queries and the indexed term vectors.
Furthermore, the Chinese medical term alignment model retrieves the concept codes related to the query term and returns the candidate concept code sequence as follows:
E1, the query Chinese term Q_zh_txt input by the user is translated to obtain the corresponding English term Q_en_txt; candidate sequences can then be obtained, the TOP 150 are taken and the scores are normalized to 0–1, giving the final three sequences cand_seq_txt, cand_seq_vec_zh and cand_seq_vec_en;
E2, cand_seq_vec_zh and cand_seq_vec_en are combined with the mapping relation tid2gid_list and mapped to gid_C; in this process, when multiple tids correspond to one gid_C, the maximum score is taken, thereby forming sequences whose elements are keyed by gid_C, denoted cand_seq'_vec_zh and cand_seq'_vec_en;
E3, with gid_C as the key, cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en are merged using an outer-join operation to generate the final candidate sequence cand_seq, and the weighted score is calculated as follows:
score_i = κ_1·score_txt,i + κ_2·score_zh_vec,i + κ_3·score_en_vec,i
where score_txt,i, score_zh_vec,i and score_en_vec,i denote the scores of cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en respectively, and κ_i is a hyper-parameter representing the weight of each score, satisfying:
κ_i ∈ [0, 1] and Σ_i κ_i = 1.
the invention has the beneficial effects that:
1. The method is implemented based on log feedback, weak supervision and contrastive learning: the client operation logs are recorded, the behavior process in the logs is analyzed, and medical terms are identified and extracted, so that training samples are constructed automatically and the model performs self-learning and automatic indexing; as log data from downstream business systems are accessed, the model learns and improves by itself, and the self-learned model then serves the downstream systems again, achieving automated and efficient term alignment through a closed loop of the whole process.
2. A graph model is adopted to break up and split the medical terms, which embodies the core idea of term concepts. An abstract medical concept exists only as an intangible notion, while the terms representing it are scattered across term data and databases. The fragmented terms are effectively connected through the graph and given a unique concept code; with the constructed graph model, a graph algorithm automatically splits out the connected subgraphs, forming the concept subgraphs.
3. For the term feedback from the logs, weakly supervised data processing is innovatively provided: a learning model constructed from noisy annotations and true annotations is used for noisy-label inference. The approach is suitable for services with multiple application scenarios, solves the data-quality problem caused by differences among many users, and improves the performance of the final model.
4. Term vector embedding with a hybrid SimCSE model is proposed, which unifies the training of single terms and similar term pairs, thereby completing the embedding of massive terms.
5. The alignment model is computed by combining text and semantics, exploiting the advantages of both: text retrieval is simple and efficient, so when the data tend to saturate as the model runs and iterates, direct text matching and retrieval already achieve a good term alignment effect, and it also compensates for insufficient model training when term samples are scarce; semantic retrieval directly reflects the semantic characteristics of Chinese medical terms and solves the problem that literally similar terms can be semantically inconsistent.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an implementation architecture of a method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture of a Chinese medical term alignment model according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings, and it should be noted that, while the present embodiment provides a detailed implementation and a specific operation process on the premise of the present technical solution, the protection scope of the present invention is not limited to the present embodiment.
This embodiment provides a Chinese medical term self-adaptive alignment method based on log feedback, implemented with log feedback, weak supervision and contrastive learning. The specific flow is shown in FIG. 1.
S1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
s2, a user can input medical terms of a query through a client (i, j and k shown in fig. 2), such as inputting a query word of 'respiratory tract infection'; then the application server (g and h shown in figure 2) searches the concept codes related to the query terms through the Chinese medical term alignment model of the term server (a-f shown in figure 2) and returns a candidate concept code sequence, and at the moment, the user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain high-quality training samples; the term server trains the Chinese medical term alignment model based on contrastive learning with the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
The main purpose of this embodiment is to solve the alignment problem of Chinese medical terms. In the initial stage, a huge amount of open-source English term libraries is therefore used as the starting point, and authoritative Chinese-English translation dictionaries and standards are collected as bridges to construct the initial medical term samples. The specific process of step S1 is as follows:
S1.1, selecting the UMLS open-source medical resource library and collecting UMLS terms. As is well known, UMLS is a union of a large number of term libraries that can be linked through the CUI code of a UMLS term (the unique code representing a concept in the UMLS medical library), including SNOMED CT, the ICD family, MeSH, LOINC, ATC and other term libraries.
S1.2, translating the UMLS terms collected in step S1.1. The term translation plays a critical role; the main sources of translation data selected in this example are:
1. Chinese-English comparison tables issued by authorities, such as the 10th edition of the International Classification of Diseases, the Chinese electronic edition of SNOMED 3.4 published by the institute of health and management of China, the ATC Chinese edition, and so on, recorded as CN_medical_terminal_Set1.
2. Medical dictionaries, such as the authoritative "Xiangya Medical Dictionary", the "English-Chinese Medical Dictionary" and the "English-Chinese/Chinese-English Bidirectional Medical Dictionary". English terms are matched against these dictionaries by whole-word matching.
Further, in step S2, the problem of difficult data-sample collection is solved through user behavior collected via log feedback and the transaction logs of the application server. The collected logs comprise the operation logs generated when users perform term alignment on the client and the transaction logs generated by the application server from the users' operations. The application server synchronizes the operation logs collected by each node, together with its transaction logs, to the log warehouse of the term server periodically or in real time. The specific implementation process of step S2 is as follows:
S2.1, instrumenting the application interface of the client with event tracking points (buried points) and log collection requests, for example event points for candidate-concept-code loading completed, selection made, and submission completed. Once an event point is triggered, the client sends a log record request to the application server via script code, completing one operation log record; for example, JavaScript on the user side sends an asynchronous request to the server based on Ajax.
S2.2, the log system of the application server responds to the log record request of the client, completes the operation log record, and also records the log of the application server's business processing to obtain transaction log data. For example, a user queries the term "respiratory tract infection" at the application interface of the client; the log system of the application server receives the query log record request of the client, and at the same time the business logic handles the request for the queried candidate term list and related data, such as "respiratory tract infection (C0035243)", "upper respiratory tract infection (C0041912)", "upper respiratory tract infection (C0149725)", "viral respiratory tract infection (C0729531)"; the codes in brackets are the concept codes after term standardization.
S2.3, log synchronization. This embodiment synchronizes logs to the log warehouse based on real-time log streaming, combining Flume + Kafka + Spark Streaming: Flume reads the log information, Kafka queues the log messages, and Spark Streaming consumes the log messages, filters and cleans them based on rules, and finally stores them into the log warehouse. The storage medium may be a file system, a distributed file system, or a relational database such as MySQL or a NoSQL database. This embodiment is based on Hadoop, adopting the HDFS distributed file system and the Hive data warehouse to store the data. Hadoop is a distributed system infrastructure developed by the Apache Foundation; HDFS is a file system built on top of Hadoop; Hive is a Hadoop-based data warehouse tool for data extraction, transformation and loading, providing a mechanism to store, query and analyze large-scale data stored in Hadoop.
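As an illustration of this synchronization path (Flume reads, Kafka queues, Spark consumes, Hive stores), the following is a minimal PySpark Structured Streaming sketch rather than the embodiment's exact Spark Streaming job; the broker address, topic name, table name and the rule-based cleaning step are placeholders.

```python
# Minimal sketch (assumed topic/broker/table names): consume operation/transaction
# logs from Kafka, apply a simple rule-based filter, and append them to a Hive table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("term-log-sync")
         .enableHiveSupport()
         .getOrCreate())

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker
       .option("subscribe", "term-operation-logs")        # assumed topic
       .load())

logs = (raw.selectExpr("CAST(value AS STRING) AS line")
        .filter(F.col("line").isNotNull() & (F.length("line") > 0)))  # rule-based cleaning placeholder

def write_batch(df, batch_id):
    # Append each micro-batch to the log warehouse (assumed Hive table name).
    df.write.mode("append").saveAsTable("log_warehouse.operation_logs")

query = (logs.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/term-log-sync-ckpt")
         .start())
```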
The content structure of the log data collection and the final output data form are illustrated below.
The operation log record request of the client comprises the following main fields: request IP, user UUID, time, event type, service parameters and so on. In addition, the transaction log fields of the application server also include the request time, user UUID, event type, event method, service parameters and so on.
All obtained logs are grouped by user UUID, the user behavior process is restored in time order, the user's selection is extracted from the sequence, and the final data annotation is completed. For example, for a unique user UUID, the candidate concept code set "[C0035243, C0041912, C0149725, C0581381, C0729531]" is returned for the "respiratory tract infection" query action; the user then selects "C0035243"; finally, data processing produces a record in the format (server code, UUID, date, queried medical term, candidate set, selection set, whether custom), such as (s10009, 1657008203_1315242ec22b0d006e2462442974b4b, 2022-06-06, {respiratory tract infection}, {C0035243, C0041912, C0149725, C0581381, C0729531}, {C0035243}, no), and the formatted data are updated into the Hive data repository of the log warehouse.
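The grouping and annotation step just described can be sketched as follows; the event dictionaries, field names and the rule that an empty selection marks a custom term are illustrative assumptions, not the exact schema of the log system.

```python
# Hypothetical sketch: group raw log events by user UUID, sort them by time, and
# emit (server_code, uuid, date, query_term, candidate_set, selection_set, is_custom)
# records in the format described above.  Field names are assumptions.
from collections import defaultdict

def build_feedback_records(log_events):
    by_user = defaultdict(list)
    for ev in log_events:                         # ev: one parsed log line (dict)
        by_user[ev["uuid"]].append(ev)

    records = []
    for uuid, events in by_user.items():
        events.sort(key=lambda e: e["time"])      # restore the behaviour sequence
        query, candidates, selection = None, set(), set()
        for ev in events:
            if ev["event_type"] == "query":
                query = ev["params"]["term"]
            elif ev["event_type"] == "candidates_loaded":
                candidates = set(ev["params"]["codes"])
            elif ev["event_type"] == "select_submit":
                selection = set(ev["params"]["codes"])
        if query and candidates:
            first = events[0]
            records.append((first.get("server_code", ""), uuid, first["time"][:10],
                            query, candidates, selection,
                            not selection))       # no selection -> treated as a custom term
    return records
```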
In this embodiment, the log data come from the log systems of multiple application servers, each serving multiple business-scenario applications, and each application may be used by crowd workers with different characteristics. The data samples can therefore be regarded as data sets from multiple heterogeneous sources; a lot of noise and even conflicting labels are inevitable, since different judgments arise when the complexity of the actual business application is understood from different angles and at different granularities.
For the multi-source noisy data samples, noise elimination is needed, after which the labels are merged to construct a better training corpus. The specific process of step S3 is as follows:
S3.1, defining the sample format. The structure of the log data (server code, UUID, date, queried medical term, candidate set, selection set, whether custom) is first converted into the form (term 1, term 2, {1, -1}), with each sample containing a term pair. The specific conversion rules are as follows:
(1) The medical terms of the query are constructed as positive samples with terms corresponding to the selection set. For example, the query word "respiratory tract infection" input by the user and the term set corresponding to the C0035243 selected by the user are { "respiratory tract (upper respiratory tract and lower respiratory tract) infection", "upper and lower respiratory tract infection" }, and positive samples (respiratory tract infection, respiratory tract (upper respiratory tract and lower respiratory tract) infection, 1) and (respiratory tract infection, upper and lower respiratory tract infection, 1) are constructed to form positive sample sets { (respiratory tract infection, respiratory tract (upper respiratory tract and lower respiratory tract) infection, 1), (respiratory tract infection, upper and lower respiratory tract infection, 1) };
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample. For example, the medical term "respiratory tract infection" of a query is combined with the term set of candidate sets { C0041912, C0149725, C0581381, C0729531} to construct a negative sample set { (respiratory tract infection, upper Respiratory Tract Infection (URTI), -1), (respiratory tract infection, respiratory tract viral infection, -1), }.
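A small sketch of these two conversion rules is given below; the record layout follows the tuple format of S2.3, and the concept-code-to-term lookup code2terms is an assumed helper.

```python
# Sketch of rule (1) and rule (2): build (term1, term2, label) pairs from one
# feedback record; `code2terms` (concept code -> set of surface terms) is assumed.
def record_to_pairs(record, code2terms):
    _, _, _, query_term, candidate_set, selection_set, _ = record
    pairs = []
    for code in selection_set:                      # rule (1): positives
        for term in code2terms.get(code, set()):
            pairs.append((query_term, term, 1))
    for code in candidate_set - selection_set:      # rule (2): negatives
        for term in code2terms.get(code, set()):
            pairs.append((query_term, term, -1))
    return pairs
```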
S3.2, defining and establishing a learning model. From the positive and negative sample sets generated by the two rules in step S3.1, samples occurring with a frequency lower than 3 are deleted, finally giving a sample set S, where the number of log sources is M and the total number of samples is N; the sample labeling matrix is A ∈ {-1, 1}^{N×(2M+|C|)}, where C denotes the sampled pairwise combinations of log sources j and k and characterizes the correlation between log sources, and the true label of each sample, Y ∈ {-1, 1}^N, is a hidden variable. Inspired by weak supervision, this embodiment defines the relation between the multi-source labels and the true labels as a factor graph model (a probabilistic graphical model), denoted P_θ(A, Y), with three factors:
φ^Lab_{i,j}(A, Y) = 1{A_{i,j} ≠ 0}    (1)
φ^Acc_{i,j}(A, Y) = 1{A_{i,j} = y_i}    (2)
φ^Corr_{i,j,k}(A, Y) = 1{A_{i,j} = A_{i,k}}, (j, k) ∈ C    (3)
The labeling matrix generated by formula (1) is denoted φ^Lab, the accuracy matrix generated by formula (2) is denoted φ^Acc, and the correlation matrix generated by formula (3) is denoted φ^Corr. Specifically, the element φ^Lab_{i,j} indicates whether sample x_i is labeled by the log data of source j: it equals 1 if source j gives a similarity label for the term pair, and 0 otherwise; the element φ^Acc_{i,j} equals 1 if the label given by source j is identical to the true label y_i, and 0 otherwise; the element φ^Corr_{i,j,k} equals 1 if sources j and k give the same label for sample x_i, and 0 otherwise. Concatenating the three factors for sample i gives the combined factor vector φ_i(A, Y) = [φ^Lab_i, φ^Acc_i, φ^Corr_i].
In summary, with the combined factor expression denoted φ_i(A, Y), the learning model is defined as:
P_θ(A, Y) = Z_θ^{-1} · exp( Σ_{i=1}^{N} θ^T φ_i(A, Y) )
where θ represents the weights of the probability distribution and Z_θ is the normalizing constant.
S3.3, training the learning model. Since the true labels Y are hidden variables, the learning model P_θ(A, Y) is trained by minimizing the negative log marginal likelihood with respect to the observable label matrix A:
θ̂ = argmin_θ ( − log Σ_Y P_θ(A, Y) )
The optimization problem is solved by gradient descent combined with a Gibbs sampling algorithm, using the Stanford Snorkel toolkit; the learned parameters are denoted θ*.
S3.4, with the parameters θ* obtained in step S3.3, the trained learning model P_{θ*}(A, Y) is obtained.
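The label fusion of S3.2–S3.4 can be approximated with the off-the-shelf Snorkel LabelModel, as in the hedged sketch below; the embodiment's own factor-graph parameterisation may differ, and the remapping of the {-1, 1} labels to {0, 1} classes (with -1 reserved for "no label from this source") is an assumption of the sketch.

```python
# Hedged sketch: fuse the multi-source noisy labels with Snorkel's LabelModel.
# L is an (N x M) matrix with one column per log source: 1 = similar, 0 = not
# similar, -1 = the source gave no label for this sample (abstain).
import numpy as np
from snorkel.labeling.model import LabelModel

def fuse_labels(L, alpha=0.95):
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L, n_epochs=500, seed=42)        # gradient-based fitting
    probs = label_model.predict_proba(L)             # soft label distribution
    p_similar = probs[:, 1]                          # probability of the "similar" class
    hard_idx = np.where(p_similar >= alpha)[0]       # filtering threshold alpha >= 0.95
    return p_similar, hard_idx
```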
S3.5, through the learning of the trained model, the noisy labels of the multi-source samples are fused into a soft label distribution. For example, for the medical entity "respiratory tract infection", a soft-labeled term pair sample set X_soft is generated: {(respiratory tract infection, upper and lower respiratory tract infection, 1.0), (respiratory tract infection, respiratory tract (upper and lower respiratory tract) infection, 0.989), (respiratory tract infection, upper respiratory tract infection (URTI), 0.75), (respiratory tract infection, respiratory tract viral infection, 0.65)}.
So far, the similarity relations between term pairs have been processed at the data layer. This embodiment further unifies medical concepts at the term semantic layer and represents them with unique concept codes, such as the CUI of UMLS and the CODE of SNOMED CT. Specifically, a filtering threshold α (α ≥ 0.95) is set and X_soft is filtered to obtain the term pair sample set X_hard, from which a concept graph is constructed. The specific process is as follows:
B1, for the term pair sample set X_hard, each term is taken as a node of the concept graph and each term pair forms an edge between its two nodes, constructing a concept subgraph G_sample; the term set at this point is Term_set_sample. For example, for (respiratory tract infection, upper and lower respiratory tract infection, 1.0), the two nodes are node_1 = {code="sample_1", name="respiratory tract infection"} and node_2 = {code="sample_2", name="upper and lower respiratory tract infection"}, and the edge is edge_12 = <node_1, node_2>.
B2, based on the UMLS prior library, with the concept code CUI as the unit, for the term set Term_set_umls a node node_i = {code="umls_i", name="t_str"} is created for each T_i ∈ Term_set_umls; a node may be a Chinese term or an English term, all terms are built into a node set, and the nodes within the same CUI term set are pairwise connected by edges, forming a plurality of independent concept subgraphs G_UMLS.
B3, following the construction of G_UMLS, the concept subgraphs G_x of other term libraries are constructed; for example, for a traditional-Chinese-medicine term library, G_TCM is constructed.
B4, based on identical node terms, the concept graph G is obtained from the several concept subgraphs:
G = G_sample ∪ G_UMLS ∪ (⋃_x G_x)
Similarly, the overall set of medical terms is expressed as:
Term_set = Term_set_sample ∪ Term_set_umls ∪ (⋃_x Term_set_x)
B5, for the graph G, the independent connected subgraphs are separated, each connected subgraph being defined as a concept subgraph G_i^sub; the separation is computed with the connected_components method of the Python third-party package networkx, expressed as:
{G_1^sub, G_2^sub, ..., G_K^sub} = connected_components(G)
Each G_i^sub is assigned a unique global concept code, called the concept code and denoted gid_i^C; the node terms in the subgraph corresponding to gid_i^C are collected to construct the medical term semantic equivalence set S_i^{gid_C}, and the full set of term equivalence sets is denoted S_C.
After the concept codes gid^C are unified, the term mapping relation list gid2cid_list between gid^C and the public open-source term libraries is obtained.
The medical terms of Term_set are automatically numbered; the numbered medical term set is denoted Term_set' and the code number field is denoted tid, and the term mapping list tid2gid_list between gid^C and Term_set' is obtained.
For example, for the code "0085580", the corresponding equivalent set of medical terms is {"essential hypertension", "idiopathic (primary) hypertension", "primary hypertension", "essential hypertension", "Primary hypertension", "hypertension"}, the corresponding gid2cid_list entry is ["0085580": {"umls": "C0085580", "sct": "194760004, 59621000", "icd10": "I10", ...}], and the tid2gid_list contains entries such as ("0002": "0085580").
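A minimal sketch of the graph construction and splitting in B1–B5, using the networkx connected_components method mentioned above; the node representation and the zero-padded gid numbering are illustrative assumptions.

```python
# Sketch: build the concept graph from term pairs and prior term libraries,
# split it into connected subgraphs, and assign a global concept code (gid)
# to each subgraph; the zero-padded gid format is an assumption.
import networkx as nx

def build_concept_sets(hard_pairs, cui_term_sets):
    G = nx.Graph()
    for t1, t2 in hard_pairs:                   # B1: edges from X_hard term pairs
        G.add_edge(t1, t2)
    for terms in cui_term_sets:                 # B2: pairwise edges inside one CUI term set
        terms = list(terms)
        for i in range(len(terms)):
            for j in range(i + 1, len(terms)):
                G.add_edge(terms[i], terms[j])

    equiv_sets = {}                             # B5: one equivalence set per connected subgraph
    for k, component in enumerate(nx.connected_components(G)):
        gid = f"{k:07d}"                        # unique global concept code
        equiv_sets[gid] = set(component)
    return equiv_sets
```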
The term model can technically be divided into two main types: text similarity models and semantic similarity models. Text similarity generally calculates the distance between two terms based on character matching. For semantic similarity calculation, the terms are first embedded into some vector space to obtain their vector representations, such as word2vec or BERT encodings, and the approximation is then calculated with a distance formula. However, the accuracy of general-purpose embedding models is not high enough to meet the semantic requirements of terms, and even large models such as BERT exhibit a considerable representation-collapse phenomenon.
In summary, this embodiment proposes a Chinese medical term alignment model that fuses Chinese-English text retrieval with contrastive-learning semantic reasoning, as shown in FIG. 3 (taking UMLS as an example). It mainly comprises a term library, a text search engine and a semantic search engine. The term library includes the open term libraries, the concept-code-to-term-library mapping table gid2cid_list, the term concept equivalence set S_C and other related global content. The text search engine covers index preservation and retrieval similarity calculation for Chinese and English terms. The semantic search engine covers the sample strategies, semantic model training, vector data persistence, semantic retrieval and other related content.
The text search engine is specifically constructed as follows:
C1, acquiring the concept set S_Concept, recording with the term concept as the unit. A data model with the term library table head shown in FIG. 3 (ID, English standard word, English synonyms, Chinese standard word, Chinese synonyms) is created, where the ID is gid_C; the corresponding English standard word is obtained from the open-source library, and if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word, with the remaining English terms used as English synonyms.
C2, the data of the term library are text-cleaned, for example converted to lower case and stripped of special symbols, then stored into a database and indexed by term. In this embodiment ES (Elasticsearch) is selected: the data model creates the corresponding fields, and the characters are filtered and processed with the "html_strip", "lowercase", "asciifolding", "stemmer" and "english_stop" filters; English and Chinese word segmentation is then performed, and the inverted index is stored in the database.
C3, for a given query Chinese term Q_zh_txt, the English term Q_en_txt is obtained through translation with an open professional medical term dictionary or an open translation API; Q_zh_txt and Q_en_txt are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated respectively.
Note that the BM25 score is calculated as follows:
score_bm25(Q, D) = Σ_i IDF(q_i) · f(q_i, D)·(k_1 + 1) / ( f(q_i, D) + k_1·(1 − b + b·|D|/avgdl) )
IDF(q_i) = ln( (N − n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )
where score_bm25(Q, D) represents the approximation score of the query term Q and the term document D, q_i represents an element of the segmented Q, n(q_i) represents the number of term documents containing element q_i, N represents the total number of term documents, f(q_i, D) represents the frequency of element q_i in term document D, |D| represents the term document length, avgdl represents the average length of the term documents, and k_1 and b are hyper-parameters with default values 1.2 and 0.75 respectively. From this, the medical term text similarity calculation can be derived as:
score_txt(Q, D) = α_1·score_bm25(Q_en_txt, D_en_fsn) + α_2·score_bm25(Q_en_txt, D_en_sim) + α_3·score_bm25(Q_zh_txt, D_zh_fsn) + α_4·score_bm25(Q_zh_txt, D_zh_sim)
where D_en_fsn, D_en_sim, D_zh_fsn and D_zh_sim denote the field documents of the English standard word, English synonyms, Chinese standard word and Chinese synonyms respectively, and α_i is a hyper-parameter representing the weight of each field, satisfying α_i ∈ [0, 1] and Σ_i α_i = 1.
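The field-weighted text score can be sketched directly from the BM25 formula above; the tokenisation, the per-field statistics dictionary and the equal default weights α are placeholders.

```python
# Sketch of score_txt: BM25 per field, then a weighted sum over the four fields
# (English standard word / synonyms, Chinese standard word / synonyms).
import math

def bm25(query_tokens, doc_tokens, docs_with, n_docs, avgdl, k1=1.2, b=0.75):
    score = 0.0
    for q in query_tokens:
        n_q = docs_with.get(q, 0)                        # documents containing q
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1.0)
        f_q = doc_tokens.count(q)                        # term frequency in this document
        denom = f_q + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f_q * (k1 + 1) / denom
    return score

def score_txt(q_en, q_zh, doc, stats, alphas=(0.25, 0.25, 0.25, 0.25)):
    # doc[field] is the token list of that field; stats[field] holds corpus statistics.
    fields = [("en_fsn", q_en), ("en_sim", q_en), ("zh_fsn", q_zh), ("zh_sim", q_zh)]
    return sum(a * bm25(q, doc[f], stats[f]["docs_with"], stats[f]["n_docs"], stats[f]["avgdl"])
               for a, (f, q) in zip(alphas, fields))
```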
The semantic search engine is modeled on a SimCSE model based on contrastive learning; the specific process is as follows:
D1, model selection. The semantic framework selected in this embodiment is based on a dual-tower model, that is, the two terms are first encoded and their similarity is then calculated. In addition, to overcome the collapse phenomenon of large models, a SimCSE model based on contrastive learning is selected to train the terms;
D2, since SimCSE is a contrastive learning model, it can be trained in a self-supervised way without labels, so part of the samples are constructed from the terms themselves. In addition, since this embodiment acquires a large amount of supervision data from the log data, a hybrid SimCSE model is adopted: samples are constructed with terms and concepts as the units by taking the pairwise combinations within each equivalence set of S_C as supervised samples; together these form the sample set X_cse.
D3, forward calculation: for sample x i ∈X cse Samples such as "< headache" >, "< headach >", and ">" exist, and samples such as "< headache" >, "< headach" > ", and" > "exist, and the two terms are calculated by the encoder respectively, in this embodiment, chinese BERT is selected as the encoder to calculate, which can be expressed as:
vec 1 ,vec 2 =encoder(x i,11 ),encoder(x i,22 )
Wherein, the encoder (·) represents the encoder function, and Chinese BERT, μ is taken 12 The larger the parameter setting, the more dead neural network elements output by the neural network, x i,1 ,x i,2 Respectively represent sample x i Is a first term and a second term.
Because the input is a term pair, the term pair is taken as a positive sample in the calculation of the model, and the negative sample is taken from x in a batch of samples in training j,1 X of the same batch j, 2, wherein x is j ∈X cse ,j>1。
D4, defining a loss function. For a batch size of B, the trained penalty function is:
wherein, the similarity is calculated by cosine:
and D5, reversely calculating. Solving gradient, reverse iteration updating, completing model learning and completing encoder * (·) =learning of the encoder (·, 0).
D6, term vector set indexing. For each T_i' ∈ Term_set', the vector set Term_vecs is calculated as follows:
term_vec_i = encoder*(T_i'), term_vec_i ∈ Term_vecs
D7, indexing and retrieval of the Term_vecs vector set. Term_vecs is indexed into a vector database such as Milvus, Vald or Qdrant, or with a vector retrieval component or tool such as Facebook's FAISS, Microsoft's SPTAG or Spotify's Annoy; the similarity-computation algorithms are mainly based on the approximate nearest-neighbor search idea. In this embodiment the open-source Milvus is selected to store and retrieve the vector set, and cosine calculation is adopted for the similarity of vectors.
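For the vector indexing and retrieval of D6–D7, the sketch below uses FAISS (one of the tools listed above) with L2-normalised vectors so that inner product equals cosine similarity; the embodiment itself uses Milvus, and the dimensions and top-k are placeholders.

```python
# Sketch of vector indexing and cosine retrieval with FAISS.
import faiss
import numpy as np

def build_index(term_vecs):                     # term_vecs: (num_terms, dim) array
    vecs = term_vecs.astype("float32").copy()
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])    # inner product on unit vectors = cosine
    index.add(vecs)
    return index

def search(index, query_vec, top_k=150):
    q = query_vec.astype("float32").reshape(1, -1).copy()
    faiss.normalize_L2(q)
    scores, ids = index.search(q, top_k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```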
For Q_zh_txt and Q_en_txt, the queries are first encoded and the query scores are then calculated as the cosine similarity between the encoded queries and the indexed term vectors.
Further, the Chinese medical term alignment model predicts the query term input by the user at the client and gives the candidate concept coding result; the specific process is as follows:
E1, the query Chinese term Q_zh_txt input by the user is translated to obtain the corresponding English term Q_en_txt; candidate sequences can then be obtained, the TOP 150 are taken and the scores are normalized to 0–1, giving the final three sequences cand_seq_txt, cand_seq_vec_zh and cand_seq_vec_en;
E2, cand_seq_vec_zh and cand_seq_vec_en are combined with the mapping relation tid2gid_list and mapped to gid_C; in this process, when multiple tids correspond to one gid_C, the maximum score is taken, thereby forming sequences whose elements are keyed by gid_C, denoted cand_seq'_vec_zh and cand_seq'_vec_en;
E3, with gid_C as the key, cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en are merged using an outer-join operation to generate the final candidate sequence cand_seq, and the weighted score is calculated as follows:
score_i = κ_1·score_txt,i + κ_2·score_zh_vec,i + κ_3·score_en_vec,i
where score_txt,i, score_zh_vec,i and score_en_vec,i denote the scores of cand_seq_txt, cand_seq'_vec_zh and cand_seq'_vec_en respectively, and κ_i is a hyper-parameter representing the weight of each score, satisfying:
κ_i ∈ [0, 1] and Σ_i κ_i = 1.
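The gid mapping and outer-join fusion of E2–E3 can be sketched with plain dictionaries as below; defaulting missing scores to zero and the example κ weights are assumptions of the sketch.

```python
# Sketch of E2/E3: map tid-keyed candidates to gid (keeping the max score per gid),
# outer-join the three candidate sequences on gid, and weight the scores with kappa.
def to_gid_scores(cand_seq, tid2gid):
    gid_scores = {}
    for tid, score in cand_seq:
        gid = tid2gid[tid]
        gid_scores[gid] = max(score, gid_scores.get(gid, 0.0))
    return gid_scores

def fuse(cand_txt, cand_zh, cand_en, kappa=(0.4, 0.3, 0.3)):
    gids = set(cand_txt) | set(cand_zh) | set(cand_en)          # outer join on gid
    fused = {g: kappa[0] * cand_txt.get(g, 0.0)
                + kappa[1] * cand_zh.get(g, 0.0)
                + kappa[2] * cand_en.get(g, 0.0) for g in gids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```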
various modifications and variations of the present invention will be apparent to those skilled in the art in light of the foregoing teachings and are intended to be included within the scope of the following claims.

Claims (6)

1. The Chinese medical term self-adaptive alignment method based on log feedback is characterized by comprising the following steps of:
s1, collecting open medical term resources, initializing medical terms, constructing an initial medical term sample, and training to obtain a Chinese medical term alignment model;
s2, a user can input a queried medical term through the client; then the application server searches the concept codes related to the query words through the Chinese medical term alignment model of the term server and returns a candidate concept code sequence, and at the moment, a user selects and submits the candidate concept codes at the client; the log system of the application server records the query operation of the user to obtain operation log data of the user, obtains transaction log data generated by the application server, and feeds back the operation log data of the user and the transaction log data of the application server to a log warehouse of the term server;
S3, the log warehouse learns the log data fed back by the application server through weak supervision to obtain high-quality training samples; the term server trains the Chinese medical term alignment model based on contrastive learning with the obtained training samples, and the trained Chinese medical term alignment model continuously provides service for the application server.
2. The method according to claim 1, wherein the specific process of constructing the initial medical term sample data in step S1 is:
S1.1, selecting a UMLS open-source medical resource library and collecting UMLS terms;
S1.2, translating the UMLS terms collected in step S1.1.
3. The method according to claim 1, wherein the specific implementation procedure of step S2 is as follows:
S2.1, instrumenting the application interface of the client with event tracking points (buried points) and log collection requests; once an event point is triggered, the client sends a log record request to the application server via script code, completing one operation log record;
s2.2, the log system of the application server responds to the log record request of the client, completes operation log record, and records the log of the service processing process of the application server to obtain transaction log data;
s2.3, synchronizing the operation log data and the transaction log data to a log warehouse:
The operation log record request of the client includes fields such as the request IP, user UUID, time, event type and service parameters; the transaction log fields of the application server additionally include the request time, user UUID, event type, event method and service parameters;
All the obtained logs are grouped by user UUID, the user behavior process is restored in time order, the user's selection is extracted from the time sequence, and the final data annotation is completed; the log data are processed into a data structure of the form (server code, UUID, date, queried medical term, candidate set, selection set, whether custom), and the formatted data are updated into the log warehouse.
4. The method according to claim 1, wherein the specific procedure of step S3 is as follows:
s3.1, defining a sample format: first converting the structure of the log data into a form of (term 1, term 2, {1, -1 }), each sample containing a term pair; the specific conversion rule is as follows:
(1) Constructing positive samples from the queried medical term and the terms corresponding to the selection set;
(2) Removing the terms corresponding to the selection set from the terms corresponding to the candidate set to obtain a duplicate term set, and constructing the queried medical terms and the terms in the duplicate term set into a negative sample;
S3.2, defining and establishing the learning model: samples whose frequency is smaller than 3 are deleted from the positive and negative sample sets generated by the two rules of step S3.1, finally obtaining a sample set S; the number of log sources is M and the total number of samples is N; the sample labelling matrix is $A \in \{-1,1\}^{N \times (2M+|C|)}$, where C denotes the set of pairwise combinations of log sources $(j,k)$ and captures the correlation between log sources; the true label of the samples, $Y \in \{-1,1\}^{N}$, is a hidden variable; the relation between the multi-source labels and the true labels is defined as a factor graph, a probabilistic graphical model denoted $P_\theta(A, Y)$, for which three factors are defined:
the labelling matrix generated by factor (1) is denoted $\phi^{Lab}$, the accuracy matrix generated by factor (2) is denoted $\phi^{Acc}$, and the correlation matrix generated by factor (3) is denoted $\phi^{Corr}$; specifically, the element $\phi^{Lab}_{i,j} = 1$ if the log data of source $j$ contains a similar term for sample $x_i$, and $\phi^{Lab}_{i,j} = -1$ otherwise; the element $\phi^{Acc}_{i,j} = 1$ if the label annotated by source $j$ agrees with the true label of the sample, and $\phi^{Acc}_{i,j} = -1$ otherwise; the element $\phi^{Corr}_{i,j,k} = 1$ if sources $j$ and $k$ give the same label to sample $x_i$, and $\phi^{Corr}_{i,j,k} = -1$ otherwise; combining the three factors therefore yields the factor vector
$$\phi_i(A, y_i) = \big[\phi^{Lab}_i(A, y_i),\ \phi^{Acc}_i(A, y_i),\ \phi^{Corr}_i(A, y_i)\big] \in \{-1,1\}^{2M+|C|}$$
In summary, with the combined factor expression denoted $\phi_i(A, y_i)$, the learning model is defined as:
$$P_\theta(A, Y) = \frac{1}{Z_\theta} \exp\!\Big(\sum_{i=1}^{N} \theta^{T}\, \phi_i(A, y_i)\Big)$$
wherein $\theta$ denotes the weights of the probability distribution and $Z_\theta$ is the normalization constant;
S3.3, training the learning model: since the learning model $P_\theta(A, Y)$ contains the hidden variable $Y$, the negative log marginal likelihood given the observable label matrix $A$ is minimized:
$$\hat{\theta} = \arg\min_{\theta}\; -\log \sum_{Y} P_\theta(A, Y)$$
the optimization problem is solved by gradient descent, with the gradient estimated by a Gibbs sampling algorithm, using Stanford's Snorkel toolkit, and the learned parameters are denoted $\theta^{*}$;
S3.4, the learning model parameters $\theta^{*}$ obtained in step S3.3 give the trained learning model $P_{\theta^{*}}(A, Y)$;
S3.5, the noisy labels of the multi-source samples are merged through the trained learning model to obtain a soft-label distribution, generating the soft-labelled term-pair sample set $X_{soft}$.
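A minimal sketch of the S3.3-S3.5 aggregation using the Snorkel LabelModel; the toy label matrix, the abstain convention (0 standing for "this source gave no label") and the remapping are illustrative assumptions, and the current Snorkel release fits its label model with a matrix-completion style estimator rather than Gibbs sampling, though it serves the same purpose of merging noisy multi-source labels into a soft distribution.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Toy multi-source label matrix in the claim's {-1, +1} convention,
# with 0 standing in for "this log source gave no label" (assumption).
A = np.array([
    [ 1,  1,  0],
    [-1,  1, -1],
    [ 1,  0,  1],
])

# Snorkel expects class labels in {0, ..., k-1} and -1 for abstain, so remap:
L = np.full_like(A, -1)
L[A == 1] = 1      # positive term pairs
L[A == -1] = 0     # negative term pairs

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=500, seed=123)

# Soft label distribution (columns: P(negative), P(positive)) -> X_soft;
# it is later filtered with the threshold alpha to obtain X_hard.
soft_labels = label_model.predict_proba(L)
```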
A filtering annotation threshold $\alpha$ ($\alpha \geq 0.95$) is set and $X_{soft}$ is filtered with it to obtain the term-pair sample set $X_{hard}$, from which a concept graph is built:
B1, for the term-pair sample set $X_{hard}$, each term is taken as a node of the concept graph and each term pair forms an edge between its two nodes, constructing the concept subgraph $G_{sample}$; the term set at this point is $Term\_set_{sample}$;
B2, based on the UMLS prior library, with the concept code CUI as the unit, the term set $Term\_set_{umls}$ is taken; for each $T_i \in Term\_set_{umls}$ a node is created (a node may be a Chinese term or an English term), all terms are constructed into the node set, and the nodes within the same CUI term set are connected pairwise by edges, forming the independent concept subgraph $G_{UMLS}$;
B3, following the construction method of $G_{UMLS}$, the concept subgraphs $G_x$ of the other term libraries are constructed;
B4, merging the concept subgraphs on identical node terms to obtain the concept graph $G$:
$$G = G_{sample} \cup G_{UMLS} \cup \bigcup_{x} G_{x}$$
similarly, the overall medical term set is expressed as:
$$Term\_set = Term\_set_{sample} \cup Term\_set_{umls} \cup \bigcup_{x} Term\_set_{x}$$
B5, for the graph $G$, the independent connected subgraphs are separated, each connected subgraph being defined as a concept subgraph $G_i^{C}$; they are computed with the connected_components method of the third-party Python package networkx, expressed in the form:
$$\{G_1^{C}, G_2^{C}, \ldots\} = \mathrm{connected\_components}(G)$$
each $G_i^{C}$ is given a unique global concept code, called the concept code and denoted $gid_i^{C}$; the node terms of the subgraph corresponding to $gid_i^{C}$ are collected to construct the medical-term semantic equivalence set $S_i^{C}$, and the full-scale term equivalence set is denoted $S_C$;
after the concept codes $gid^{C}$ are unified, the term mapping relation list gid2cid_list between $gid^{C}$ and the open-source term libraries is obtained;
the medical terms of $Term\_set$ are automatically numbered, the numbered medical term set is denoted $Term\_set'$ and the number field is denoted tid, and at the same time the term mapping relation list tid2gid_list between $gid^{C}$ and $Term\_set'$ is obtained (a graph-construction sketch follows).
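A minimal sketch of the graph construction in B1-B5 with networkx; the GID numbering scheme and the restriction of X_hard edges to positively labelled pairs are illustrative assumptions.

```python
import networkx as nx

def _connect_pairwise(graph, terms):
    """Connect every pair of terms belonging to one concept group (as in B2)."""
    terms = list(terms)
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            graph.add_edge(terms[i], terms[j])

def build_concept_codes(term_pairs_hard, umls_cui_terms, other_term_groups=()):
    """B1-B5: merge the sample, UMLS and other term-library subgraphs on shared
    term nodes, then give each connected component one concept code (gid)."""
    G = nx.Graph()
    # G_sample: positively aligned pairs from X_hard become edges (assumption)
    G.add_edges_from((t1, t2) for t1, t2, label in term_pairs_hard if label == 1)
    # G_UMLS: terms sharing a CUI are connected pairwise
    for terms in umls_cui_terms.values():
        _connect_pairwise(G, terms)
    # G_x: other term libraries, built the same way
    for terms in other_term_groups:
        _connect_pairwise(G, terms)
    # B5: each connected component is one concept subgraph with its own gid
    gid2terms = {f"GID{i:08d}": set(c) for i, c in enumerate(nx.connected_components(G))}
    term2gid = {t: gid for gid, terms in gid2terms.items() for t in terms}
    return gid2terms, term2gid
```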
5. The method according to claim 4, wherein the Chinese medical term alignment model mainly consists of a term library, a text search engine and a semantic search engine; the term library comprises the open term libraries, the concept-code-to-term-library mapping table gid2cid_list, and the term concept equivalence set $S_C$.
The text search engine is specifically constructed as follows:
C1, acquiring the concept set $S_{Concept}$, recorded with the term concept as the unit; a data model is created whose fields comprise an ID, the English standard word, English synonyms, the Chinese standard word and Chinese synonyms, the ID being the concept code $gid^{C}$; the corresponding English standard word is obtained from the open-source library, and if there is no English standard word, the first corresponding English term in the open-source library is selected as the standard word and the remaining English terms are used as English synonyms;
C2, the data of the term library are text-cleaned and then stored into a database, and an index is built on the terms;
C3, for a given query Chinese term $Q_{zh\_txt}$, the English term $Q_{en\_txt}$ is obtained through an open professional medical term dictionary or an open translation API; $Q_{zh\_txt}$ and $Q_{en\_txt}$ are text-cleaned and word-segmented, and the BM25 similarity against the corresponding Chinese and English fields is calculated;
the medical term text similarity is calculated as follows:
$$score_{txt} = \alpha_1 \cdot \mathrm{BM25}(Q_{en\_txt}, D_{en\_fsn}) + \alpha_2 \cdot \mathrm{BM25}(Q_{en\_txt}, D_{en\_sim}) + \alpha_3 \cdot \mathrm{BM25}(Q_{zh\_txt}, D_{zh\_fsn}) + \alpha_4 \cdot \mathrm{BM25}(Q_{zh\_txt}, D_{zh\_sim})$$
wherein $D_{en\_fsn}$, $D_{en\_sim}$, $D_{zh\_fsn}$, $D_{zh\_sim}$ denote the field documents of the English standard word, the English synonyms, the Chinese standard word and the Chinese synonyms respectively, and $\alpha_i$ is a hyper-parameter representing the weight of each field, satisfying:
$$\alpha_i \in [0,1] \quad \text{and} \quad \sum_{i} \alpha_i = 1$$
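A minimal sketch of the weighted four-field BM25 score; the claim does not name an implementation, so the rank_bm25 package and jieba segmentation are illustrative choices, as are the field keys and the weight dictionary.

```python
import jieba                      # Chinese word segmentation (illustrative choice)
from rank_bm25 import BM25Okapi   # BM25 implementation (illustrative choice)

def tokenize(text, lang):
    return jieba.lcut(text) if lang == "zh" else text.lower().split()

def build_field_index(field_docs, lang):
    """Index one field document collection, e.g. all Chinese standard words."""
    return BM25Okapi([tokenize(doc, lang) for doc in field_docs])

def text_score(q_zh, q_en, field_indexes, alphas):
    """Weighted BM25 over the four fields en_fsn, en_sim, zh_fsn, zh_sim."""
    queries = {"en_fsn": (q_en, "en"), "en_sim": (q_en, "en"),
               "zh_fsn": (q_zh, "zh"), "zh_sim": (q_zh, "zh")}
    total = None
    for field, (query, lang) in queries.items():
        scores = field_indexes[field].get_scores(tokenize(query, lang))
        total = alphas[field] * scores if total is None else total + alphas[field] * scores
    return total   # one aggregated score_txt per indexed concept record
```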
The semantic search engine is modelled on the SimCSE model based on contrastive learning; the specific process is as follows:
D1, model selection: the two terms are each encoded and their similarity is then calculated; the SimCSE model based on contrastive learning is selected for training;
D2, since SimCSE is a contrastive-learning model, it can be trained in a self-supervised manner without annotation, so part of the samples are constructed from the terms themselves; in addition, since a large amount of supervised data is obtained from the log data, a hybrid SimCSE model is adopted: with terms and concepts as the unit, the terms within each equivalence set of $S_C$ are paired two by two to form further samples, finally obtaining the sample set $X_{cse}$;
D3, forward calculation: for a sample $x_i \in X_{cse}$, the two terms are computed by the encoder, expressed as:
$$vec_1, vec_2 = \mathrm{encoder}(x_{i,1}, \mu_1),\ \mathrm{encoder}(x_{i,2}, \mu_2)$$
wherein $\mathrm{encoder}(\cdot)$ denotes the encoder function, for which a Chinese BERT is taken; $\mu_1, \mu_2$ are dropout parameters: the larger their setting, the more neural network units are dropped (zeroed) in the network output; $x_{i,1}, x_{i,2}$ respectively denote term 1 and term 2 of sample $x_i$;
since the input is a term pair, the term pair itself is taken as the positive sample in the model computation, and during training the negative samples are taken from the $x_{j,2}$ of the other samples $x_j$ in the same batch, where $x_j \in X_{cse}$ and $j \neq i$;
D4, defining the loss function: for a batch of size $B$, the training loss is:
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{e^{\mathrm{sim}(vec_{i,1},\, vec_{i,2})/\tau}}{\sum_{j=1}^{B} e^{\mathrm{sim}(vec_{i,1},\, vec_{j,2})/\tau}}$$
wherein $\tau$ is the temperature hyper-parameter and the similarity is calculated by cosine:
$$\mathrm{sim}(vec_1, vec_2) = \frac{vec_1 \cdot vec_2}{\lVert vec_1 \rVert \, \lVert vec_2 \rVert}$$
D5, backward calculation: the gradients are computed and the parameters are updated by reverse iteration, completing the model learning and obtaining the trained encoder $\mathrm{encoder}^{*}(\cdot) = \mathrm{encoder}(\cdot, 0)$, i.e. the encoder with dropout set to 0;
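A minimal sketch of the in-batch contrastive training in D3-D5, using PyTorch and a HuggingFace Chinese BERT; the checkpoint name, [CLS] pooling and temperature value are illustrative assumptions, with BERT's built-in dropout playing the role of μ1, μ2.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # illustrative checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")         # dropout acts as mu_1 / mu_2

def encode(terms):
    batch = tokenizer(terms, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]              # [CLS] pooling (illustrative)

def simcse_loss(terms_1, terms_2, tau=0.05):
    """In-batch InfoNCE: pair (i, i) is the positive, pairs (i, j != i) are the negatives."""
    vec1 = F.normalize(encode(terms_1), dim=-1)   # B x d
    vec2 = F.normalize(encode(terms_2), dim=-1)   # B x d
    sim = vec1 @ vec2.T / tau                     # B x B cosine-similarity matrix
    labels = torch.arange(sim.size(0))            # the diagonal holds the positive pairs
    return F.cross_entropy(sim, labels)

# One training step (optimizer setup omitted):
# loss = simcse_loss(["糖尿病", "高血压"], ["diabetes mellitus", "hypertension"])
# loss.backward()
```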
D6, computing the term vector set: for each $T_i' \in Term\_set'$, the vector set Term_vecs is calculated as follows:
$$term\_vec_i = \mathrm{encoder}^{*}(T_i'), \quad term\_vec_i \in Term\_vecs$$
D7, indexing and retrieving the Term_vecs vector set: a vector database or a vector retrieval component or tool is adopted to store and retrieve the vector set, and cosine similarity is used for the similarity calculation of the vectors;
$Q_{zh\_txt}$ and $Q_{en\_txt}$ are first encoded and the cosine similarity is then calculated to obtain the query scores:
$$score_{zh\_vec,i} = \mathrm{sim}\big(\mathrm{encoder}^{*}(Q_{zh\_txt}),\ term\_vec_i\big), \qquad score_{en\_vec,i} = \mathrm{sim}\big(\mathrm{encoder}^{*}(Q_{en\_txt}),\ term\_vec_i\big)$$
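The claim leaves the vector database or retrieval component open; below is a minimal sketch using Faiss with inner product over L2-normalized vectors, which is equivalent to cosine similarity. Faiss and the flat index type are illustrative choices.

```python
import numpy as np
import faiss

def build_index(term_vecs):
    """Index L2-normalized term vectors so that inner product equals cosine similarity."""
    vecs = np.array(term_vecs, dtype="float32")
    faiss.normalize_L2(vecs)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)
    return index

def search(index, query_vec, top_k=150):
    """Return (scores, term ids) of the top_k most similar indexed terms (TOP 150 in claim 6)."""
    query = np.array([query_vec], dtype="float32")
    faiss.normalize_L2(query)
    scores, tids = index.search(query, top_k)
    return scores[0], tids[0]
```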
6. The method according to claim 5, wherein the Chinese medical term alignment model retrieves the concept codes related to the query term and returns the candidate concept code sequence as follows:
E1, the query Chinese term $Q_{zh\_txt}$ input by the user is translated to obtain the corresponding English term $Q_{en\_txt}$; the candidate sequences are retrieved, the TOP 150 results are taken and the scores are normalized to [0, 1], finally obtaining the three sequences $cand\_seq_{txt}$, $cand\_seq_{vec\_zh}$ and $cand\_seq_{vec\_en}$;
E2, $cand\_seq_{vec\_zh}$ and $cand\_seq_{vec\_en}$ are mapped to $gid^{C}$ by combining the mapping relation tid2gid_list; in this process, when multiple tids correspond to one $gid^{C}$, the maximum score is taken, thereby forming sequences of $\langle gid_i^{C}, score_i \rangle$ elements, denoted $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$;
E3, with $gid^{C}$ as the key, $cand\_seq_{txt}$, $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$ are merged using an outer-join operation, generating the final candidate sequence cand_seq; the weighted score is calculated as follows:
$$score_i = \kappa_1 \cdot score_{txt,i} + \kappa_2 \cdot score_{zh\_vec,i} + \kappa_3 \cdot score_{en\_vec,i}$$
wherein $score_{txt,i}$, $score_{zh\_vec,i}$, $score_{en\_vec,i}$ respectively denote the scores from $cand\_seq_{txt}$, $cand\_seq'_{vec\_zh}$ and $cand\_seq'_{vec\_en}$, and $\kappa_i$ is a hyper-parameter representing the weight of each score, satisfying:
$$\kappa_i \in [0,1] \quad \text{and} \quad \sum_{i=1}^{3} \kappa_i = 1$$
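A minimal sketch of the E2 remapping and E3 outer-join merge, assuming each candidate sequence is a dict from concept code (or tid) to its normalized score; the helper names and the default κ weights are illustrative assumptions.

```python
def remap_to_gid(cand_seq_vec, tid2gid):
    """E2: map tid-keyed vector scores to concept codes, keeping the maximum score per gid."""
    remapped = {}
    for tid, score in cand_seq_vec.items():
        gid = tid2gid[tid]
        remapped[gid] = max(score, remapped.get(gid, 0.0))
    return remapped

def merge_candidates(cand_seq_txt, cand_seq_vec_zh, cand_seq_vec_en, kappa=(0.4, 0.3, 0.3)):
    """E3: outer join on gid, then a weighted sum of the three normalized scores."""
    all_gids = set(cand_seq_txt) | set(cand_seq_vec_zh) | set(cand_seq_vec_en)
    merged = {gid: kappa[0] * cand_seq_txt.get(gid, 0.0)
                 + kappa[1] * cand_seq_vec_zh.get(gid, 0.0)
                 + kappa[2] * cand_seq_vec_en.get(gid, 0.0)
              for gid in all_gids}
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)   # final cand_seq
```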
CN202310647595.9A 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback Active CN116680377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310647595.9A CN116680377B (en) 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback


Publications (2)

Publication Number Publication Date
CN116680377A true CN116680377A (en) 2023-09-01
CN116680377B CN116680377B (en) 2024-04-23

Family

ID=87781772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310647595.9A Active CN116680377B (en) 2023-06-01 2023-06-01 Chinese medical term self-adaptive alignment method based on log feedback

Country Status (1)

Country Link
CN (1) CN116680377B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170337268A1 (en) * 2016-05-17 2017-11-23 Xerox Corporation Unsupervised ontology-based graph extraction from texts
CN112712804A (en) * 2020-12-23 2021-04-27 哈尔滨工业大学(威海) Speech recognition method, system, medium, computer device, terminal and application
CN113254418A (en) * 2021-05-11 2021-08-13 广州中康数字科技有限公司 Medicine classification management data analysis system
CN113377897A (en) * 2021-05-27 2021-09-10 杭州莱迈医疗信息科技有限公司 Multi-language medical term standard standardization system and method based on deep confrontation learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN Maosong et al.: "Automatic Extraction of Bilingual Terms from Chinese-English Parallel Patents", Journal of Tsinghua University (Science and Technology), vol. 54, no. 10, 31 October 2014 (2014-10-31), pages 1339-1343 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725076A (en) * 2024-02-01 2024-03-19 厦门她趣信息技术有限公司 Faiss-based distributed massive similarity vector increment training system
CN117725076B (en) * 2024-02-01 2024-04-09 厦门她趣信息技术有限公司 Faiss-based distributed massive similarity vector increment training system

Also Published As

Publication number Publication date
CN116680377B (en) 2024-04-23

Similar Documents

Publication Publication Date Title
CN110781683B (en) Entity relation joint extraction method
CN108875051B (en) Automatic knowledge graph construction method and system for massive unstructured texts
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN112559556B (en) Language model pre-training method and system for table mode analysis and sequence mask
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN114048350A (en) Text-video retrieval method based on fine-grained cross-modal alignment model
WO2023065858A1 (en) Medical term standardization system and method based on heterogeneous graph neural network
CN116680377B (en) Chinese medical term self-adaptive alignment method based on log feedback
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN113988075A (en) Network security field text data entity relation extraction method based on multi-task learning
CN110377690B (en) Information acquisition method and system based on remote relationship extraction
CN116049422A (en) Echinococcosis knowledge graph construction method based on combined extraction model and application thereof
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
Jin et al. Fintech key-phrase: a new Chinese financial high-tech dataset accelerating expression-level information retrieval
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN117149974A (en) Knowledge graph question-answering method for sub-graph retrieval optimization
CN116562302A (en) Multi-language event viewpoint object identification method integrating Han-Yue association relation
CN114943216B (en) Case microblog attribute level view mining method based on graph attention network
CN116127097A (en) Structured text relation extraction method, device and equipment
CN116227594A (en) Construction method of high-credibility knowledge graph of medical industry facing multi-source data
CN115481220A (en) Post and resume content-based intelligent matching method and system for comparison learning human posts
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN115081445A (en) Short text entity disambiguation method based on multitask learning
Santosh et al. ECtHR-PCR: A Dataset for Precedent Understanding and Prior Case Retrieval in the European Court of Human Rights
CN117540734B (en) Chinese medical entity standardization method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant