CN115982222A - Searching method based on special disease and special medicine scenes - Google Patents

Searching method based on special disease and special medicine scenes

Info

Publication number
CN115982222A
Authority
CN
China
Prior art keywords
special
character
term
vocabulary
recall
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310017705.3A
Other languages
Chinese (zh)
Inventor
田东坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maxin Health Technology Co ltd
Original Assignee
Shanghai Maxin Health Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maxin Health Technology Co ltd
Priority to CN202310017705.3A
Publication of CN115982222A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a searching method based on special disease and special medicine scenes, which comprises the following steps: acquiring the search keywords or text input by a user and preprocessing them by cleaning, filtering, and removal, wherein unnecessary symbols, spaces, and modal particles are cleaned and filtered out and special characters are removed; based on the processed data, performing exact entry-vocabulary matching in a category term lexicon to obtain the closest term configured by the user; and repackaging the retrieval conditions, accessing the recall and ranking recognition system through a data interface based on the repackaged retrieval conditions, performing recall and ranking, and outputting the results. The invention adds an understanding of the user's query semantics to the search system, applies a normalization model to terms such as diseases and medicines, and improves the recall rate through search functions including semantic error correction, normalization, recall, and ranking.

Description

Searching method based on special disease and special medicine scenes
Technical Field
The invention relates to the field of special disease and special medicine scenes, in particular to a searching method based on a special disease and special medicine scene.
Background
A special medicine is a specific drug for treating serious diseases such as malignant tumors. It is generally expensive, has a well-established curative effect and few side effects, and cannot be replaced by any other treatment scheme.
Most traditional retrieval methods for special-medicine scenes use schemes based on character-string matching, for example computing the matching degree with methods such as LCS and BM25 and using an inverted index to present the search information. Given an input question Q0, BM25 can be used to rank the other questions Q in the data that are to be matched against it. "BM" stands for "Best Matching"; BM25 is also known as Okapi BM25.
However, both LCS and BM25 have the following problems when used for matching in the special-medicine field:
1. Traditional search relies on the default, character-level semantics of the query. It pays no attention to the disease type or to the user's search behavior, and it lacks accurate search ranking for the user's personalized needs.
2. Traditional search struggles to accommodate the flexible and complex ways in which users or medical workers describe special diseases and special medicines. Cancers in special disease and special medicine scenes are divided into many types, descriptions of the different stages are even more varied, and different drugs or treatment schemes are required. For example, "squamous non-small cell lung cancer" may be described as "squamous cell lung cancer", and so on. Traditional search methods also cannot retrieve the results the user wants along multiple dimensions, such as drug welfare and patient rights and interests.
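As a rough illustration of the string-matching baseline criticized above, the following is a minimal sketch of Okapi BM25 scoring over whitespace-tokenized text. The parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy documents are assumptions chosen for illustration; none of them come from the patent.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (always positive), an assumed common variant.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (illustrative only).
docs = [
    "squamous non-small cell lung cancer".split(),
    "breast cancer targeted therapy".split(),
    "hospital appointment guide".split(),
]
scores = bm25_scores("squamous cell lung cancer".split(), docs)
# An abbreviation that shares no token with any document scores zero everywhere.
alias_scores = bm25_scores(["NSCLC"], docs)
```

The second call shows the limitation described in item 2: a query phrased with an alias or abbreviation that shares no tokens with the indexed text cannot be matched by token overlap alone.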
Disclosure of Invention
The invention aims to provide a searching method based on a special disease and special medicine scene so as to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: a searching method based on special disease and special medicine scenes comprises the following steps:
S1: acquiring the search keywords or text input by a user, and preprocessing them by cleaning, filtering, and removal, wherein unnecessary symbols, spaces, and modal particles are cleaned and filtered out, and special characters are removed;
S2: based on the processed data, performing exact entry-vocabulary matching in a category term lexicon to obtain the closest term configured by the user;
S3: repackaging the retrieval conditions, accessing the recall and ranking recognition system through a data interface based on the repackaged retrieval conditions, performing recall and ranking, and outputting the results.
Preferably, the spaces in S1 include zero-width spaces, zero-width joiners, and zero-width non-joiners.
Preferably, the step of removing special characters in S1 includes: cleaning the special characters in the text keywords; when a special character is a pictographic character, replacing it with its original character according to the mapping between pictographic characters and original characters given in the pictographic character table; when a special character is a deletable character, applying a cleaning mode that depends on the kind of deletable character: when the deletable character is a backspace character, deleting both the backspace character and the character before it; and when the deletable character is a delete character, deleting both the delete character and the character after it.
Preferably, the specific process of exact entry-vocabulary matching in the category term lexicon in S2 includes: acquiring the processed data, and calculating its correlation coefficient with each of the other vocabulary items in the category term lexicon using the Pearson correlation coefficient algorithm; determining the classification level of each pair of vocabulary items from the correlation coefficients according to a preset level classification rule; extracting vocabulary according to a preset vocabulary extraction rule, and determining at least one associated search word based on the extracted vocabulary; for each associated search word, processing the current associated search word with a pre-trained model to obtain its to-be-associated word vector; and, for each to-be-associated word vector, determining the similarity value between the current to-be-associated word vector and at least one feature vector corresponding to each piece of entity information to be matched, the closest term being determined from the similarity value, which is computed with the cosine similarity algorithm, namely
$$\mathrm{sim}(S_1, S_2) = \frac{S_1 \cdot S_2}{\lVert S_1 \rVert \, \lVert S_2 \rVert}$$
where S_1 is the current to-be-associated word vector and S_2 is the feature vector corresponding to each piece of entity information to be matched.
Preferably, the category term lexicon in S3 is based on N term lexicons for the target recognition scenario, where N is a positive integer. The N term lexicons include but are not limited to a disease term lexicon, a drug term lexicon, a rights-and-interests term lexicon, and a hospital term lexicon. They contain a plurality of normalized terms and their aliases, which are output after pinyin extraction, glyph extraction, and vector extraction.
Preferably, the category term lexicon provides two interfaces: one serves the preprocessing steps of S1, such as cleaning and filtering, and the other serves the recall and ranking process of the recall and ranking recognition system in S3, providing functions such as word segmentation, monitoring/polling of vocabulary changes, and dictionary rebuilding with index addition. Based on the term lexicons provided by the category term lexicon, the recall and ranking recognition system builds a retrieval system that integrates lexicons for medicines, rights and interests, hospitals, consultations, and experience.
Preferably, the recall and ranking recognition system in S3 is also connected to a service database; the service database supports Logstash/script data synchronization, and its data is fed into the recall and ranking recognition system after steps such as ES index structure design.
Preferably, in S3, the recall and ranking recognition system is accessed through a data interface based on the repackaged retrieval conditions, performs retrieval and recognition according to its retrieval logic and ranking policy, and outputs the results.
Compared with the prior art, the invention has the beneficial effects that:
1. The method preprocesses the search text by cleaning, filtering, and removal, and applies removal logic to the special characters in the text data. This achieves deep cleaning of the data and prevents special characters from distorting the semantics and causing inaccurate indexing.
2. The invention adds an understanding of the user's query semantics to the search system, applies a normalization model to terms such as diseases and medicines, and improves the recall rate through search functions including semantic error correction, normalization, recall, and ranking. The recall and recognition system adopts lightweight models and recalls with a multi-model weighting scheme to extract a top-50 candidate set; texts are then embedded with a pre-trained model and fused with vector representations of the user's relevant information and with the inverted index to obtain the user's search results.
3. The method calculates the correlation coefficient between the data and the other vocabulary items in the category term lexicon using the Pearson correlation coefficient algorithm and classifies them according to a preset level classification rule. This quickly grades the vocabulary and provides a basis for retrieving subsequent key search words. In addition, associated vocabulary is linked through a similarity model, which enables vocabulary-based association, guides users toward more accurate search keywords, and improves search results and user experience.
Drawings
FIG. 1 is a flow chart of a search method of the present invention;
FIG. 2 is a schematic block diagram of a system method of the present invention;
FIG. 3 is a logic diagram of special character culling according to an embodiment of the invention;
FIG. 4 is a diagram illustrating a specific process of vocabulary exact matching according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-2, the present invention provides a technical solution: a searching method based on special disease and special medicine scenes, comprising the following steps:
S1: acquiring the search keywords or text input by a user, and preprocessing them by cleaning, filtering, and removal, wherein unnecessary symbols, spaces, and modal particles are cleaned and filtered out, and special characters are removed;
S2: based on the processed data, performing exact entry-vocabulary matching in a category term lexicon to obtain the closest term configured by the user;
S3: repackaging the retrieval conditions, accessing the recall and ranking recognition system through a data interface based on the repackaged retrieval conditions, performing recall and ranking, and outputting the results.
In this embodiment, the spaces in S1 include zero-width spaces, zero-width joiners, and zero-width non-joiners.
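The cleaning and filtering part of S1 might look roughly like the sketch below. It assumes that the zero-width characters meant here are U+200B, U+200C, and U+200D, that "unnecessary symbols" means punctuation, and that modal particles are stripped only at the end of the query; the particle list and the function name `clean_query` are hypothetical and not taken from the patent.

```python
import re

# Zero-width space, zero-width non-joiner, zero-width joiner (assumed code points).
ZERO_WIDTH = "\u200b\u200c\u200d"
# Hypothetical list of sentence-final modal particles: 吗 呢 吧 啊 呀
TRAILING_PARTICLES = re.compile("[\u5417\u5462\u5427\u554a\u5440]+$")


def clean_query(text):
    """Clean a raw search query: zero-width characters, symbols, spaces, particles."""
    # 1. Drop zero-width characters, which are invisible but break matching.
    text = text.translate({ord(c): None for c in ZERO_WIDTH})
    # 2. Drop punctuation and other symbols (keep word characters, spaces, hyphens).
    text = re.sub(r"[^\w\s-]", "", text)
    # 3. Collapse whitespace runs and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # 4. Strip modal particles only at the end of the query, so that terms
    #    containing the same characters elsewhere are left intact.
    return TRAILING_PARTICLES.sub("", text)


# "肺<ZWSP>癌 靶向药  有吗？" becomes "肺癌 靶向药 有"
cleaned = clean_query("\u80ba\u200b\u764c \u9776\u5411\u836f  \u6709\u5417\uff1f")
```

Stripping particles only at the end is a deliberate choice in this sketch: a character such as 吗 also appears inside drug names (for example 吗啡, morphine), so removing it everywhere would corrupt legitimate terms.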
In this embodiment, S1 mainly aims at query semantic understanding, namely cleaning the query, extracting pinyin, embedding vector representations, correcting errors, normalizing, and the like. An understanding of the user's query semantics is added to the search system, a normalization model is applied to terms such as diseases and medicines, and the recall rate is improved through search functions including semantic error correction, normalization, recall, and ranking.
Referring to FIG. 3, in this embodiment, the step of removing special characters in S1 includes: cleaning the special characters in the text keywords; when a special character is a pictographic character, replacing it with its original character according to the mapping between pictographic characters and original characters given in the pictographic character table; when a special character is a deletable character, applying a cleaning mode that depends on the kind of deletable character: when the deletable character is a backspace character, deleting both the backspace character and the character before it; and when the deletable character is a delete character, deleting both the delete character and the character after it. The method thus preprocesses the search text by cleaning, filtering, and removal, and applies removal logic to the special characters in the text data, which achieves deep cleaning of the data and prevents special characters from distorting the semantics and causing inaccurate indexing.
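The removal logic described above can be sketched as a single pass over the text. The sketch assumes that the backspace character is U+0008 and the delete character is U+007F, and it uses a two-entry `PICTOGRAPH_TABLE` as a hypothetical stand-in for the pictographic character table, whose actual contents the patent does not give.

```python
BACKSPACE = "\b"   # U+0008, assumed to be the "backspace character"
DELETE = "\x7f"    # U+007F, assumed to be the "delete character"

# Hypothetical pictographic character table: pictograph -> original characters.
PICTOGRAPH_TABLE = {
    "\u2460": "1",   # circled digit one
    "\u338e": "mg",  # square "mg" symbol
}


def cull_special_characters(text):
    """Apply the backspace/delete rules and map pictographs to original characters."""
    out = []
    skip_next = False
    for ch in text:
        if skip_next:
            # The character right after a delete character is dropped.
            skip_next = False
            continue
        if ch == BACKSPACE:
            # Drop the backspace and the character before it (if any).
            if out:
                out.pop()
        elif ch == DELETE:
            # Drop the delete character and mark the next character for removal.
            skip_next = True
        else:
            out.append(PICTOGRAPH_TABLE.get(ch, ch))
    return "".join(out)
```

For example, `cull_special_characters("abx\bc")` and `cull_special_characters("ab\x7fxc")` both yield `"abc"`, since the stray `x` is removed by the backspace rule in the first case and by the delete rule in the second.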
Referring to FIG. 4, in this embodiment, the specific process of exact entry-vocabulary matching in the category term lexicon in S2 includes: acquiring the processed data, and calculating its correlation coefficient with each of the other vocabulary items in the category term lexicon using the Pearson correlation coefficient algorithm; determining the classification level of each pair of vocabulary items from the correlation coefficients according to a preset level classification rule; extracting vocabulary according to a preset vocabulary extraction rule, and determining at least one associated search word based on the extracted vocabulary; for each associated search word, processing the current associated search word with a pre-trained model to obtain its to-be-associated word vector; and, for each to-be-associated word vector, determining the similarity value between the current to-be-associated word vector and at least one feature vector corresponding to each piece of entity information to be matched, the closest term being determined from the similarity value, which is computed with the cosine similarity algorithm, namely
$$\mathrm{sim}(S_1, S_2) = \frac{S_1 \cdot S_2}{\lVert S_1 \rVert \, \lVert S_2 \rVert}$$
where S_1 is the current to-be-associated word vector and S_2 is the feature vector corresponding to each piece of entity information to be matched. With this processing method and logic, the vocabulary can be graded quickly, which provides a basis for retrieving subsequent key search words. In addition, associated vocabulary is linked through a similarity model, which enables vocabulary-based association, guides users toward more accurate search keywords, and improves search results and user experience.
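The two measures named in this step can be sketched in plain Python as follows. The vectors below are toy values, `closest_term` is a hypothetical helper name, and choosing the entity with the highest cosine similarity is an assumed reading of how the closest term is derived from the similarity value; in the patent the vectors come from a pre-trained model that is not specified.

```python
import math


def cosine_similarity(s1, s2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(s1, s2))
    n1 = math.sqrt(sum(a * a for a in s1))
    n2 = math.sqrt(sum(b * b for b in s2))
    if n1 == 0 or n2 == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (n1 * n2)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])


def closest_term(query_vec, entity_vectors):
    """Return the entity whose feature vector is most similar to the query vector."""
    return max(entity_vectors,
               key=lambda name: cosine_similarity(query_vec, entity_vectors[name]))


# Toy feature vectors (illustrative only).
entity_vectors = {
    "non-small cell lung cancer": [0.9, 0.1, 0.0],
    "breast cancer": [0.1, 0.9, 0.0],
}
best = closest_term([0.8, 0.2, 0.1], entity_vectors)
```

Writing the Pearson coefficient as a cosine of centered vectors makes the relationship between the two steps explicit: both measure alignment, but Pearson first removes each vector's mean.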
Referring to FIGS. 1-4, in this embodiment, the category term lexicon in S3 is based on N term lexicons for the target recognition scenario, where N is a positive integer. The N term lexicons include but are not limited to a disease term lexicon, a drug term lexicon, a rights-and-interests term lexicon, and a hospital term lexicon. They contain a plurality of normalized terms and their aliases, which are output after pinyin extraction, glyph extraction, and vector extraction.
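A minimal sketch of how normalized terms and their aliases could be organized per category is shown below. The `LEXICONS` structure, the function names, and the sample entries are illustrative assumptions; pinyin, glyph, and vector extraction are omitted because the patent does not describe how they are performed.

```python
# Hypothetical layout: category -> normalized term -> list of aliases.
LEXICONS = {
    "disease": {
        "squamous non-small cell lung cancer": ["squamous cell lung cancer"],
    },
    "drug": {
        "gefitinib": ["Iressa"],
    },
}


def build_alias_index(lexicons):
    """Map every surface form (normalized term or alias) to (category, normalized term)."""
    index = {}
    for category, terms in lexicons.items():
        for normalized, aliases in terms.items():
            for surface in [normalized, *aliases]:
                index[surface.lower()] = (category, normalized)
    return index


def normalize_term(query, index):
    """Return (category, normalized term) for a query, or None if it is unknown."""
    return index.get(query.strip().lower())


alias_index = build_alias_index(LEXICONS)
```

With this index, `normalize_term("Squamous cell lung cancer", alias_index)` resolves to the disease entry for "squamous non-small cell lung cancer", which is the kind of alias handling that plain string matching misses.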
In this embodiment, the category term lexicon provides two interfaces: one serves the preprocessing steps of S1, such as cleaning and filtering, and the other serves the recall and ranking process of the recall and ranking recognition system in S3, providing functions such as word segmentation, monitoring/polling of vocabulary changes, and dictionary rebuilding with index addition. Based on the term lexicons provided by the category term lexicon, the recall and ranking recognition system builds a retrieval system that integrates lexicons for medicines, rights and interests, hospitals, consultations, and experience.
In this embodiment, the category term lexicon mainly integrates drug welfare, rights and interests, patient commentary, expert commentary, medical maps, and the like, and represents them by extracting pinyin and vectors.
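The "monitoring/polling of vocabulary changes" and "dictionary rebuilding" functions could be approximated as below. This is a sketch under the assumption that the lexicon is available as a dictionary from a loader callback and that a content hash is enough to detect changes; `LexiconWatcher` and its fields are hypothetical names, not part of the patent.

```python
import hashlib
import json


class LexiconWatcher:
    """Poll a lexicon source and rebuild the alias index only when it changes."""

    def __init__(self, load_lexicon):
        self._load = load_lexicon   # callable returning {term: [aliases]}
        self._digest = None
        self.index = {}
        self.rebuild_count = 0

    def poll(self):
        """Return True if the lexicon changed and the index was rebuilt."""
        lexicon = self._load()
        payload = json.dumps(lexicon, sort_keys=True, ensure_ascii=False)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if digest == self._digest:
            return False  # nothing changed since the last poll
        self._digest = digest
        # Rebuild the index: every surface form points at its normalized term.
        self.index = {
            surface: term
            for term, aliases in lexicon.items()
            for surface in [term, *aliases]
        }
        self.rebuild_count += 1
        return True


# Usage: the first poll builds the index, later polls rebuild only on change.
store = {"gefitinib": ["Iressa"]}
watcher = LexiconWatcher(lambda: store)
first = watcher.poll()
second = watcher.poll()
store["osimertinib"] = ["Tagrisso"]
third = watcher.poll()
```

Hashing the serialized lexicon keeps each poll cheap compared with rebuilding the index every time, at the cost of one full read of the lexicon per poll.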
In this embodiment, the recall and ranking recognition system in S3 is further connected to the service database; the service database supports Logstash/script data synchronization, and its data is fed into the recall and ranking recognition system after steps such as ES index structure design.
In this embodiment, in S3, the recall and ranking recognition system is accessed through the data interface based on the repackaged retrieval conditions, performs retrieval and recognition according to its retrieval logic and ranking policy, and outputs the results.
In this embodiment, the recall and ranking recognition system comprises:
a recall module, which performs recall with lightweight models under a multi-model weighting scheme to extract a top-50 candidate set;
and a ranking module, which embeds the text using a pre-trained model while fusing vector representations of the user's relevant information with the inverted index to obtain the user's search results.
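The two modules can be sketched as a weighted score fusion followed by a re-ranking pass. Everything specific here is an assumption made for illustration: the model names, the weights, the mixing factor `alpha`, and the use of cosine similarity between document, query, and user vectors. The patent fixes only the multi-model weighting idea and the top-50 cutoff.

```python
import math


def weighted_recall(model_scores, weights, top_k=50):
    """Fuse per-model scores ({model: {doc_id: score}}) and keep the top_k candidates."""
    fused = {}
    for name, scores in model_scores.items():
        w = weights.get(name, 0.0)
        for doc_id, s in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * s
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rerank(candidates, query_vec, user_vec, doc_vecs, alpha=0.7):
    """Order candidates by a mix of query-document and user-document similarity."""
    def score(doc_id):
        d = doc_vecs[doc_id]
        return alpha * _cos(query_vec, d) + (1 - alpha) * _cos(user_vec, d)
    return sorted((doc_id for doc_id, _ in candidates), key=score, reverse=True)


# Toy scores from two hypothetical recall models.
model_scores = {
    "keyword": {"d1": 0.9, "d2": 0.2},
    "vector": {"d2": 0.8, "d3": 0.5},
}
candidates = weighted_recall(model_scores, {"keyword": 0.4, "vector": 0.6}, top_k=2)
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
order = rerank(candidates, [1.0, 0.0], [0.0, 1.0], doc_vecs)
```

In this toy run the recall stage prefers `d2`, while the ranking stage moves `d1` to the top because it aligns with the query vector; lowering `alpha` shifts the balance toward the user vector.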
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A searching method based on special disease and special medicine scenes is characterized by comprising the following steps:
S1: acquiring the search keywords or text input by a user, and preprocessing them by cleaning, filtering, and removal, wherein unnecessary symbols, spaces, and modal particles are cleaned and filtered out, and special characters are removed;
S2: based on the processed data, performing exact entry-vocabulary matching in a category term lexicon to obtain the closest term configured by the user;
S3: repackaging the retrieval conditions, accessing the recall and ranking recognition system through a data interface based on the repackaged retrieval conditions, performing recall and ranking, and outputting the results.
2. The special medicine scene searching method according to claim 1, wherein: the spaces in S1 include zero-width spaces, zero-width joiners, and zero-width non-joiners.
3. The special medicine scene searching method according to claim 1, wherein: the step of removing special characters in S1 specifically comprises: cleaning the special characters in the text keywords; when a special character is a pictographic character, replacing it with its original character according to the mapping between pictographic characters and original characters given in the pictographic character table; when a special character is a deletable character, applying a cleaning mode that depends on the kind of deletable character: when the deletable character is a backspace character, deleting both the backspace character and the character before it; and when the deletable character is a delete character, deleting both the delete character and the character after it.
4. The special medicine scene searching method according to claim 1, wherein: the specific process of exact entry-vocabulary matching in the category term lexicon in S2 comprises: acquiring the processed data, and calculating its correlation coefficient with each of the other vocabulary items in the category term lexicon using the Pearson correlation coefficient algorithm; determining the classification level of each pair of vocabulary items from the correlation coefficients according to a preset level classification rule; extracting vocabulary according to a preset vocabulary extraction rule, and determining at least one associated search word based on the extracted vocabulary; for each associated search word, processing the current associated search word with a pre-trained model to obtain its to-be-associated word vector; and, for each to-be-associated word vector, determining the similarity value between the current to-be-associated word vector and at least one feature vector corresponding to each piece of entity information to be matched, the closest term being determined from the similarity value, which is computed with the cosine similarity algorithm, namely
$$\mathrm{sim}(S_1, S_2) = \frac{S_1 \cdot S_2}{\lVert S_1 \rVert \, \lVert S_2 \rVert}$$
where S_1 is the current to-be-associated word vector and S_2 is the feature vector corresponding to each piece of entity information to be matched.
5. The special medicine scene searching method according to claim 1, wherein: the category term lexicon in S3 is based on N term lexicons for the target recognition scenario, N being a positive integer; the N term lexicons include but are not limited to a disease term lexicon, a drug term lexicon, a rights-and-interests term lexicon, and a hospital term lexicon; the N term lexicons contain a plurality of normalized terms and their aliases, which are output after pinyin extraction, glyph extraction, and vector extraction.
6. The special medicine scene searching method according to claim 5, wherein: the category term lexicon provides two interfaces, one serving the preprocessing steps of S1, such as cleaning and filtering, and the other serving the recall and ranking process of the recall and ranking recognition system in S3 and providing functions such as word segmentation, monitoring/polling of vocabulary changes, and dictionary rebuilding with index addition; and the recall and ranking recognition system builds, based on the term lexicons provided by the category term lexicon, a retrieval system that integrates lexicons for medicines, rights and interests, hospitals, consultations, and experience.
7. The special medicine scene searching method according to claim 1, wherein: the recall and ranking recognition system in S3 is further connected to a service database; the service database supports Logstash/script data synchronization, and its data is fed into the recall and ranking recognition system after steps such as ES index structure design.
8. The special medicine scene searching method according to claim 1, wherein: in S3, the recall and ranking recognition system is accessed through a data interface based on the repackaged retrieval conditions, performs retrieval and recognition according to its retrieval logic and ranking policy, and outputs the results.
CN202310017705.3A 2023-01-06 2023-01-06 Searching method based on special disease and special medicine scenes Pending CN115982222A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310017705.3A CN115982222A (en) 2023-01-06 2023-01-06 Searching method based on special disease and special medicine scenes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310017705.3A CN115982222A (en) 2023-01-06 2023-01-06 Searching method based on special disease and special medicine scenes

Publications (1)

Publication Number Publication Date
CN115982222A true CN115982222A (en) 2023-04-18

Family

ID=85960918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310017705.3A Pending CN115982222A (en) 2023-01-06 2023-01-06 Searching method based on special disease and special medicine scenes

Country Status (1)

Country Link
CN (1) CN115982222A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628129A (en) * 2023-07-21 2023-08-22 南京爱福路汽车科技有限公司 Auto part searching method and system
CN116628129B (en) * 2023-07-21 2024-02-27 南京爱福路汽车科技有限公司 Auto part searching method and system
CN116936024A (en) * 2023-09-05 2023-10-24 北京中薪科技有限公司 Data processing system of traditional Chinese medicine recuperation scheme based on AI
CN116936024B (en) * 2023-09-05 2023-12-15 北京中薪科技有限公司 Data processing system of traditional Chinese medicine recuperation scheme based on AI

Similar Documents

Publication Publication Date Title
CN111709233B (en) Intelligent diagnosis guiding method and system based on multi-attention convolutional neural network
KR101999152B1 (en) English text formatting method based on convolution network
CN115982222A (en) Searching method based on special disease and special medicine scenes
CN111899890B (en) Medical data similarity detection system and method based on bit string hash
CN111611775B (en) Entity identification model generation method, entity identification device and equipment
EP0996927A1 (en) Text classification system and method
CN112559684A (en) Keyword extraction and information retrieval method
CN110929498B (en) Method and device for calculating similarity of short text and readable storage medium
CN111950283B (en) Chinese word segmentation and named entity recognition system for large-scale medical text mining
CN111402092B (en) Law and regulation retrieval system based on multilevel semantic analysis
CN111695336A (en) Disease name code matching method and device, computer equipment and storage medium
CN112307190B (en) Medical literature ordering method, device, electronic equipment and storage medium
CN111444704A (en) Network security keyword extraction method based on deep neural network
CN115983233A (en) Electronic medical record duplication rate estimation method based on data stream matching
CN114358001A (en) Method for standardizing diagnosis result, and related device, equipment and storage medium thereof
CN110399493B (en) Author disambiguation method based on incremental learning
CN115713078A (en) Knowledge graph construction method and device, storage medium and electronic equipment
CN115828854B (en) Efficient table entity linking method based on context disambiguation
CN111104481A (en) Method, device and equipment for identifying matching field
CN116662479A (en) Text matching method for medical insurance catalogs
CN112287217B (en) Medical document retrieval method, medical document retrieval device, electronic equipment and storage medium
CN110532538A (en) Property dispute judgement document's critical entities extraction algorithm
CN114970554A (en) Document checking method based on natural language processing
CN111046665B (en) Domain term semantic drift extraction method
CN112735584A (en) Malignant tumor diagnosis and treatment auxiliary decision generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination