CN112667571A - Biomedical literature search and sorting method and device

Info

Publication number: CN112667571A
Application number: CN201910980643.XA
Authority: CN
Inventors: 郭敏; 裴健新; 余晴; 于雪
Original assignee: Kangmaxin Shanghai Intelligent Technology Co ltd
Current assignee: Kangmaxin Shanghai Intelligent Technology Co ltd
Priority date: 2019-10-16
Filing date: 2019-10-16
Publication date: 2021-04-16

Abstract

The invention discloses a biomedical literature search ranking method and device. The method comprises the following steps: acquiring query content input by a user; preprocessing the query content to obtain a search term set comprising at least one search term; searching for the search term set on a specified data search platform to obtain related medical documents; coarsely sorting the related medical documents by relevance from high to low; extracting a specified number or a specified percentage of the top-ranked related medical documents from the coarsely sorted related medical documents as target medical documents; inputting the target medical documents into a trained optimized sorting model for optimized sorting, and outputting the optimally sorted target medical documents; and outputting to the user the optimally sorted target medical documents together with the coarsely sorted related medical documents that remain after extraction. With this method and device, search results can be ranked more accurately, the content the user is looking for is shown to the user, and the user experience is greatly improved.

Description

Biomedical literature search and sorting method and device

Technical Field

The invention relates to the field of data search, in particular to a biomedical literature search sorting method and device.

Background

As the volume of biomedical literature grows, users' demand for searching it grows as well, driven by the varied needs of specific biological questions, and finding the articles most relevant to a query becomes increasingly challenging. Users' expectations keep rising: searches must be fast, the retrieved documents must be ranked in a meaningful order, and the required information must be quick to find.

In existing biomedical search ranking technology, BM25 (BM stands for "best match") is an algorithm for evaluating the relevance between search terms and documents in information retrieval. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson and others. BM25F is an improved version of the classic BM25. BM25 treats a document as a whole when calculating relevance, but as search technology has developed, plain documents have gradually been replaced by structured data in which each document is split into multiple independent fields, especially in vertical search. For example, a web page may be divided into fields such as title, content and subject terms. These fields do not contribute equally to the topic of the article, so they should be weighted differently. BM25 does not take this into account, so BM25F improves on it: a search term is no longer scored against the document as a single unit; instead, the document is scored field by field.

In the BM25F algorithm, a higher score indicates a more relevant document, and a result is produced only when all the words in the search query are matched. Current biomedical literature queries therefore rank results based only on certain document factors such as the title and the text. Some documents may be relevant, yet when they are ranked near the top users do not click on them, which means they are not what users most want. As a result, the search ranking does not achieve a good ordering, the search results are pushed to the user as they are, and the user experience suffers.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a biomedical literature search ranking method and device. Specifically, the technical solution of the invention is as follows:

In one aspect, the invention discloses a biomedical literature search ranking method, which comprises the following steps: acquiring query content input by a user; preprocessing the query content to obtain a search term set comprising at least one search term; searching for the search term set on a specified data search platform and acquiring related medical documents related to the search term set; coarsely sorting the related medical documents by relevance from high to low; extracting a specified number or a specified percentage of the top-ranked related medical documents from the coarsely sorted related medical documents as target medical documents; inputting the target medical documents into a trained optimized sorting model for optimized sorting, and outputting the optimally sorted target medical documents; and outputting to the user the optimally sorted target medical documents and the coarsely sorted related medical documents remaining after extraction.

Preferably, coarsely sorting the related medical documents by relevance from high to low specifically comprises: calculating the inverse document frequency (IDF) and the term frequency (TF) of each search term in the search term set, the length of the document currently being scored, and the average length of all documents; and scoring the related medical documents according to the following formula and sorting them by score:

wherein d is the document currently being scored among the related medical documents; q is the search term set; score(d, q) is the ranking score of document d for the search term set q; t is a single search term in the search term set; TFt is the term frequency of search term t; IDFt is the inverse document frequency of search term t; dl is the length of document d; avdl is the average length of all related medical documents containing the search term t in the search term set; and k1 and b are free tuning parameters.

Preferably, the biomedical literature search ranking method further comprises training the optimized sorting model, which specifically comprises: acquiring training sample data, wherein the training sample data comprises biomedical documents and their search and click data; extracting features from the training sample data and labeling the training sample data; dividing the labeled training sample data into a training set, a test set and a validation set according to a preset ratio; calling a ranking model from a specified machine learning library and setting the learning parameters of the ranking model; loading the training set and the test set to train an initial model; performing a validation test on the trained model using the training sample data in the validation set, according to a preset evaluation metric for ranking quality; and taking the model that passes the validation test as the trained optimized sorting model.

Preferably, extracting features from the training sample data specifically comprises: extracting, from the training sample data, basic information of the medical documents, stop-word information, parameter information on the occurrence of search terms from the search term set in specified fields of the medical documents, and traffic information; wherein the traffic information of a medical document comprises any one or more of the number of clicks, the number of favorites and the number of likes of the medical document.

Preferably, labeling the training sample data specifically comprises: calculating, from the search click data in the training sample data, a relevance score for each related medical document retrieved by user searches; sorting the related medical documents in the training sample data by relevance score and taking the sorted result as a gold standard; and dividing the medical documents in the training sample data into several grades according to the gold standard and setting corresponding labels.

Preferably, dividing each medical document retrieved by user searches in the training sample data into several grades according to the gold standard and setting a corresponding label specifically comprises: for each medical document retrieved by user searches in the training sample data, if the medical document is ranked in the top 10 of the gold standard, setting its label to 12 minus its ranking number; if the medical document is ranked between 10 and 20 in the gold standard, setting its label to 2; if the medical document is ranked beyond 20 in the gold standard, setting its label to 1; and if the medical document is not in the gold standard, setting its label to 0.
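The grading rule above can be sketched in a few lines of Python. This is only an illustration: the function name `relevance_label` is hypothetical, the rank is assumed to be the 1-based position in the gold standard, and `None` is assumed to mean "not in the gold standard".

```python
from typing import Optional


def relevance_label(gold_rank: Optional[int]) -> int:
    """Map a 1-based gold-standard rank to a graded relevance label.

    gold_rank is None when the document does not appear in the gold standard
    (an assumption of this sketch).
    """
    if gold_rank is None:
        return 0                 # not in the gold standard
    if gold_rank <= 10:
        return 12 - gold_rank    # top 10: labels run from 11 down to 2
    if gold_rank <= 20:
        return 2                 # ranked between 10 and 20
    return 1                     # ranked beyond 20


print([relevance_label(r) for r in (1, 10, 15, 37, None)])  # [11, 2, 2, 1, 0]
```

Note that rank 10 receives label 2 under either of the first two rules, so the two ranges join without a jump.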

Preferably, the ranking model called is a LambdaMART model, and setting the learning parameters of the ranking model specifically comprises: setting the number of trees in the LambdaMART parameters to 200 and the learning rate to 0.3.

In another aspect, the invention also discloses a biomedical literature search ranking device, which performs search ranking using the biomedical literature search ranking method according to any one of claims 1 to 7, the device comprising: an input acquisition module for acquiring query content input by a user; a preprocessing module for preprocessing the query content to obtain a search term set comprising at least one search term; a database search module for searching for the search term set on a specified data search platform and acquiring related medical documents related to the search term set; a coarse sorting module for coarsely sorting the related medical documents by relevance from high to low; an extraction module for extracting a specified number or a specified percentage of the top-ranked related medical documents from the coarsely sorted related medical documents as target medical documents; an optimized sorting module for inputting the target medical documents into a trained optimized sorting model for optimized sorting and outputting the optimally sorted target medical documents; and an output feedback module for outputting to the user the optimally sorted target medical documents and the coarsely sorted related medical documents remaining after extraction.

Preferably, the biomedical literature search ranking device further comprises a model training module for training the optimized sorting model, the model training module specifically comprising: a sample acquisition submodule for acquiring training sample data, wherein the training sample data comprises biomedical documents and their search and click data; a sample processing submodule for extracting features from the training sample data and labeling the training sample data; a sample division submodule for dividing the labeled training sample data into a training set, a test set and a validation set according to a preset ratio; a model selection submodule for calling a ranking model from a specified machine learning library and setting the learning parameters of the ranking model; a loading and training submodule for loading the training set and the test set to train an initial model; and a validation submodule for performing a validation test on the trained model using the training sample data in the validation set, according to a preset evaluation metric for ranking quality, and taking the model that passes the validation test as the trained optimized sorting model.

Preferably, the sample processing submodule comprises: a feature extraction unit for extracting, from the training sample data, basic information of the medical documents, stop-word information, parameter information on the occurrence of search terms from the search term set in specified fields of the medical documents, and traffic information, wherein the traffic information comprises any one or more of the number of clicks, the number of favorites and the number of likes of the medical document; and a label processing unit for labeling the training sample data. Specifically, the label processing unit calculates, from the search click data in the training sample data, a relevance score for each related medical document retrieved by user searches; sorts the related medical documents by score and takes the sorted result as a gold standard; and divides the medical documents in the training sample data into several grades according to the gold standard and sets corresponding labels.

The invention at least comprises the following technical effects:

(1) Unlike traditional search ranking methods, which rank only by a single factor such as relevance, the invention obtains the documents related to the user's query, coarsely sorts them by relevance, and then further applies optimized sorting using a machine learning model. This greatly improves ranking accuracy, brings the ranking closer to the results the user wants, and improves the user experience.

(2) The invention selects a specified number or percentage of the top-ranked related medical documents from the coarse sorting as target medical documents and applies optimized sorting only to them, rather than to all related documents. On the one hand, the related medical documents ranked lower have low relevance and are unlikely to be what the user is looking for; on the other hand, selecting only the top portion greatly improves the efficiency of the optimized sorting and thus the speed of the search query.

(3) The coarse sorting of the invention jointly considers factors such as term frequency, inverse document frequency, document length and the average length of all documents, so that the resulting document relevance score is more accurate.

(4) On top of the coarse sorting, a re-ranking layer that uses machine learning is applied to part of the coarsely sorted documents, which makes the method more intelligent. In particular, traffic information such as click-through rate, number of favorites and number of likes is taken into account, so that the ranking is genuinely optimized and the user's search experience is greatly improved.

(5) The invention adopts a LambdaMART model in which 200 trees make the decision jointly, with a learning rate of 0.3 and default settings for the remaining parameters. Compared with using 1000 trees at the default learning rate of 0.1, the invention trains 800 fewer trees and takes less time, while the evaluation metric rises by nearly 4% on the training set, nearly 1% on the validation set and nearly 0.4% on the test set, so a good model is obtained with reduced training time. The same effect is achieved with far fewer trees, and the best-performing model is trained in the least time.

Drawings

In order to illustrate the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow diagram of one embodiment of a biomedical document search ranking method of the present invention;

FIG. 2 is a flow chart of training an optimized ranking model in accordance with the present invention;

FIG. 3 is a block diagram of the biomedical document search ranking device according to one embodiment of the present invention;

FIG. 4 is a flowchart of the operation of the biomedical literature search ranking device of the present invention;

FIG. 5 is a block diagram of a model training module according to the present invention.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically depicted, or only one of them is labeled. In this document, "one" means not only "only one" but also a case of "more than one".

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

In particular implementations, where terminal devices are described in the embodiments herein, they include, but are not limited to, portable devices such as mobile phones, laptop computers, home computers, or tablet computers having touch-sensitive surfaces (e.g., touch screen displays and/or touch pads). It should also be understood that in some embodiments the terminal device is not a portable communication device, but is a desktop computer having a touch-sensitive surface (e.g., a touch screen display and/or touchpad).

In the discussion that follows, where a terminal device is described that includes a display and a touch-sensitive surface, it should be understood that the terminal device may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.

The terminal device supports various applications, such as one or more of the following: a drawing application, a presentation application, a network creation application, a word processing application, a disc burning application, a spreadsheet application, a gaming application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a digital video camera application, a Web browsing application, a digital music player application, and/or a digital video player application.

Various applications that may be executed on the terminal device may use at least one common physical user interface device, such as a touch-sensitive surface. One or more functions of the touch-sensitive surface and corresponding information displayed on the terminal can be adjusted and/or changed between applications and/or within respective applications. In this way, a common physical architecture (e.g., touch-sensitive surface) of the terminal can support various applications with user interfaces that are intuitive and transparent to the user.

In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

The invention discloses a biomedical literature search and sorting method, which comprises the following steps:

S101, acquiring query content input by a user;

The user inputs a biomedical keyword, i.e., a search term, for example a biomedical subject term such as estrogen receptor alpha, gene regulation, brain, antipollogenic valid, mucin protease store, and the like.

S102, preprocessing the query content to obtain a search term set comprising at least one search term;

Specifically, preprocessing the query content mainly refers to performing word segmentation on it. For example, if the query content consists of several words, it is first split into individual search terms, and the set of these search terms is the search term set.

Further, the preprocessing may also include synonym lookup for the search terms: for example, after the individual search terms are obtained through word segmentation, the synonyms of each search term are looked up in a pre-built synonym hash table and are also used as query targets.
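A minimal sketch of this preprocessing step is given below. It assumes whitespace word segmentation and a tiny, made-up synonym table; the names `SYNONYMS` and `preprocess` and the table entries are hypothetical and only illustrate the idea of a synonym hash table. A real segmenter would also need to keep multi-word terms such as "estrogen receptor alpha" together.

```python
from typing import Dict, List

# Hypothetical synonym hash table; the entries are illustrative only.
SYNONYMS: Dict[str, List[str]] = {
    "brain": ["cerebrum"],
    "tumor": ["tumour", "neoplasm"],
}


def preprocess(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Split a query into search terms and append known synonyms, without duplicates."""
    terms: List[str] = []
    for word in query.lower().split():          # naive whitespace segmentation
        for candidate in [word, *synonyms.get(word, [])]:
            if candidate not in terms:          # keep first occurrence, preserve order
                terms.append(candidate)
    return terms


print(preprocess("Brain tumor"))
# ['brain', 'cerebrum', 'tumor', 'tumour', 'neoplasm']
```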

S103, searching for the search term set on a specified data search platform to obtain related medical documents related to the search term set;

Specifically, the specified data search platform may be, for example, Elasticsearch (a non-relational database and real-time search platform). After the search term set (the query target) composed of all search terms is obtained in the previous step, it is searched on the data search platform to obtain the medical documents related to the query content input by the user.

S104, coarsely sorting the related medical documents by relevance from high to low;

After the related medical documents relevant to the query content input by the user are obtained from the database, they are coarsely sorted by relevance from high to low. Specifically, the coarse sorting by relevance can use a traditional relevance calculation method, for example the BM25 algorithm or the BM25F algorithm (an algorithm based on the probabilistic retrieval model), in which case the coarse sorting is the ranking produced by that algorithm.

S105, extracting a specified number or percentage of the top-ranked related medical documents from the coarsely sorted related medical documents as target medical documents;

Specifically, for example, the documents are sorted by score to obtain the coarse sorting result, and the top 10% of the documents are then taken for the next optimization step (generally no more than 100 documents; if the top 10% exceeds 100, only the top 100 are taken).
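The cutoff described here (top 10%, capped at 100 documents) could be written as follows. The function name `split_top` and its parameters are hypothetical; integer arithmetic is used so the cutoff does not depend on floating-point rounding.

```python
from typing import List, Sequence, Tuple


def split_top(ranked: Sequence[str], percent: int = 10, cap: int = 100) -> Tuple[List[str], List[str]]:
    """Return (target documents, remaining documents) from a coarse ranking."""
    k = min(len(ranked) * percent // 100, cap)   # top `percent`%, never more than `cap`
    return list(ranked[:k]), list(ranked[k:])


coarse = [f"doc{i}" for i in range(1, 2501)]     # 2500 coarsely sorted document ids
targets, rest = split_top(coarse)
print(len(targets), len(rest))                   # 100 2400
```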

S106, inputting the target medical documents into a trained optimized sorting model for optimized sorting, and outputting the optimally sorted target medical documents;

Specifically, the coarse sorting is performed only by a relevance algorithm and does not consider factors such as click-through rate and number of likes, so documents ranked by relevance alone are not necessarily the ones the user is searching for and may not give the user a good experience. To improve on this, an optimized sorting model can be obtained through machine learning training, and the trained model is used for the optimized sorting, yielding a ranking closer to the user's intent.

S107, outputting to the user the optimally sorted target medical documents and the coarsely sorted related medical documents remaining after extraction.

Finally, the optimized sorting result and the remaining coarsely sorted documents are combined and displayed to the user. For example, suppose there are 1000 related medical documents after the coarse sorting and the first 100 are designated for optimized sorting. After optimized sorting, these 100 documents are ordered according to the optimized sorting result and are followed by the remaining 900 related medical documents, which keep their coarse sorting order.
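The merge of the two parts can be illustrated as below. Here `model_score` is a hypothetical stand-in for the trained optimized sorting model, assumed to return a higher value for a document that should be shown earlier; the function name `rerank_and_merge` is likewise an assumption.

```python
from typing import Callable, List, Sequence


def rerank_and_merge(coarse: Sequence[str], k: int,
                     model_score: Callable[[str], float]) -> List[str]:
    """Re-rank the first k coarse results by model score and keep the tail unchanged."""
    head = sorted(coarse[:k], key=model_score, reverse=True)  # optimized sorting of targets
    return head + list(coarse[k:])                            # tail keeps coarse order


coarse = ["a", "b", "c", "d", "e"]
scores = {"a": 0.2, "b": 0.9, "c": 0.5}          # made-up model outputs for the top 3
print(rerank_and_merge(coarse, 3, scores.get))   # ['b', 'c', 'a', 'd', 'e']
```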

In the above embodiment, step S104 of coarsely sorting the related medical documents by relevance from high to low specifically includes:

S1041, calculating the inverse document frequency (IDF) and the term frequency (TF) of each search term in the search term set, the length of the document currently being scored, and the average length of all documents;

Specifically, taking coarse sorting with the BM25F algorithm as an example, we need to calculate the inverse document frequency IDF and the term frequency TF of each search term, the length dl of the document currently being scored, and the average length avdl of all documents.

For the calculation of the inverse text frequency index IDF, the following formula can be used:

where N is the total number of medical documents in the database and n is the number of medical documents containing the search term. The IDF is based on the ratio of the total number of documents plus a smoothing factor of 1 (to avoid a numerator of 0) to the number of documents containing the search term plus a smoothing factor of 1 (to avoid a denominator of 0).
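The IDF formula itself is not reproduced in the text. One reading of the description, assumed here purely for illustration, is the logarithm of (N + 1) / (n + 1); the use of the natural logarithm and the function name `idf` are also assumptions.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Smoothed inverse document frequency, assumed form: log((N + 1) / (n + 1))."""
    return math.log((total_docs + 1) / (docs_with_term + 1))


# A term found in every document gets IDF 0; a rarer term gets a larger value.
print(idf(9, 9))                  # 0.0
print(idf(999, 9) > idf(999, 99)) # True
```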

Regarding the term frequency TF: different fields (title, medical subject terms, abstract) contribute differently to the document as a whole, so each field has its own weight, and an occurrence in certain fields matters more than an occurrence in others. For example, when setting the field weights, a search term appearing in the title has a weight of 5, one appearing in the medical subject terms has a weight of 5, and one appearing in the abstract has a weight of 1; an occurrence in the title or the medical subject terms is thus more important than one in the abstract.

Specifically, the following formula can be used to calculate TF (word frequency):

where "occurrences of t in f" denotes the number of occurrences of search term t in a specific field f (counted separately for each field, such as the title or the abstract), FL denotes the length of the field, and FW denotes the weight of the field.
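The TF formula is likewise not reproduced in the text. The sketch below assumes one plausible combination of the three quantities named above, namely the sum over fields of FW × (occurrences of t in f) / FL; this form, the field names and the function name `field_weighted_tf` are assumptions, while the weights 5, 5 and 1 are the example values given in the description.

```python
from typing import Dict, List

# Example field weights from the description: title 5, medical subject terms 5, abstract 1.
FIELD_WEIGHTS: Dict[str, float] = {"title": 5.0, "subject": 5.0, "abstract": 1.0}


def field_weighted_tf(term: str, doc: Dict[str, List[str]],
                      weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Assumed form: sum over fields of FW * occurrences / FL (FL = token count of the field)."""
    tf = 0.0
    for field, weight in weights.items():
        tokens = doc.get(field, [])
        if tokens:                                # skip missing or empty fields
            tf += weight * tokens.count(term) / len(tokens)
    return tf


doc = {
    "title": ["brain", "tumor"],
    "subject": ["brain"],
    "abstract": ["brain", "cells", "in", "brain"],
}
print(field_weighted_tf("brain", doc))            # 2.5 + 5.0 + 0.5 = 8.0
```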

Because different fields have different influence, their average weights also need to be computed. First the length dl of the document to be scored is obtained, and then the average length (avdl) of all related medical documents:

avdl = average of dl across documents (4)

S1042, scoring the related medical documents according to the following formula, and sorting the related medical documents by score:

wherein d is the document currently being scored among the related medical documents; q is the search term set; score(d, q) is the ranking score of document d for the search term set q; t is a single search term in the search term set; TFt is the term frequency of search term t; IDFt is the inverse document frequency of search term t; dl is the length of document d; avdl is the average length of all related medical documents containing the search term t in the search term set; and k1 and b are free tuning parameters.
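The scoring formula itself is not reproduced in the text. As a hedged reconstruction only, a standard BM25-style expression that uses exactly the symbols defined above would read as follows; the patent's actual formula may differ in detail, for example in whether the (k1 + 1) factor is present.

```latex
\mathrm{score}(d, q) \;=\; \sum_{t \in q} \mathrm{IDF}_t \cdot
  \frac{\mathrm{TF}_t \,(k_1 + 1)}
       {\mathrm{TF}_t + k_1 \left( 1 - b + b \cdot \dfrac{dl}{avdl} \right)}
```

In this form, b controls how strongly long documents are penalized (b = 0 disables length normalization), and k1 controls how quickly the contribution of a repeated term saturates.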

Specifically, only the documents containing all the search terms are scored, that is, the sets of documents related to each search term are intersected. The parameter d is a document, q is all the search terms, namely the search term set, and t is a single search term when the number of search terms is more than 2. The BM25F formula has 2 free tuning parameters, k1 (1.2) and b (0.75). We tuned k1 and b experimentally: with k1 held fixed, b was varied repeatedly and each resulting ranking was compared against the gold standard to obtain an optimal b value (b is generally set in the range of 0 to 1000); an optimal k1 value can be obtained in the same way. These values can then be used as the free tuning parameters.
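A minimal Python sketch of this scoring step follows. Since the formula is not reproduced in the text, the common BM25 saturation form is assumed; the function names, the input layout (per-document length plus a per-term TF map) and the example numbers are all hypothetical. Only k1 = 1.2 and b = 0.75 come from the description.

```python
from typing import Dict, List, Tuple


def bm25_score(tf_idf: List[Tuple[float, float]], dl: float, avdl: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Assumed form: sum of IDF * TF * (k1 + 1) / (TF + k1 * (1 - b + b * dl / avdl))."""
    norm = k1 * (1.0 - b + b * dl / avdl)         # document-length normalization
    return sum(idf * tf * (k1 + 1.0) / (tf + norm) for tf, idf in tf_idf)


def coarse_rank(docs: Dict[str, Tuple[float, Dict[str, float]]],
                idf: Dict[str, float], query: List[str]) -> List[str]:
    """Score only documents containing every search term; return ids, best first."""
    hits = {doc_id: (dl, tfs) for doc_id, (dl, tfs) in docs.items()
            if all(tfs.get(t, 0.0) > 0.0 for t in query)}   # intersection over terms
    if not hits:
        return []
    avdl = sum(dl for dl, _ in hits.values()) / len(hits)   # average length of the hits
    scores = {doc_id: bm25_score([(tfs[t], idf[t]) for t in query], dl, avdl)
              for doc_id, (dl, tfs) in hits.items()}
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "d1": (100.0, {"brain": 2.0, "tumor": 1.0}),
    "d2": (100.0, {"brain": 1.0, "tumor": 1.0}),
    "d3": (100.0, {"brain": 3.0}),                # lacks "tumor", so it is not scored
}
idf_values = {"brain": 1.0, "tumor": 2.0}
print(coarse_rank(docs, idf_values, ["brain", "tumor"]))    # ['d1', 'd2']
```

Tuning as described would then amount to re-running `coarse_rank` with different k1 and b values and comparing each ranking with the gold standard.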

On the basis of the above embodiment, the biomedical literature search ranking method further comprises training the optimized sorting model. As shown in FIG. 2, training the optimized sorting model specifically includes:

S201, obtaining training sample data, wherein the training sample data comprises biomedical documents and their search and click data;

Specifically, for example, crawler software or the like may be used to crawl website search log data and the biomedical documents searched for and clicked on by users as training data.

S202, extracting features from the training sample data, and labeling the training sample data;

Specifically, features are constructed for machine learning training. Extracting features from the training sample data specifically comprises: extracting, from the training sample data, basic information of the medical documents, stop-word information, parameter information on the occurrence of search terms from the search term set in specified fields of the medical documents, and traffic information; wherein the traffic information of a medical document comprises any one or more of the number of clicks, the number of favorites and the number of likes of the medical document.
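As a rough illustration of the stop-word and search-term features, the helpers below compute a stop-word ratio, an occurrence count and an average position for one field. The function names are hypothetical, positions are counted per token from 0 (the text speaks of positions in the character string, so this is a simplification), and -1.0 is an assumed placeholder for "term not present".

```python
from typing import List

# A few stop words taken from the examples named in this document.
STOP_WORDS = {"a", "are", "about", "and", "after", "before", "but"}


def stop_word_ratio(tokens: List[str]) -> float:
    """Share of tokens in a field that are stop words (0.0 for an empty field)."""
    return sum(t in STOP_WORDS for t in tokens) / len(tokens) if tokens else 0.0


def occurrence_count(term: str, tokens: List[str]) -> int:
    """Number of times the search term occurs in the field."""
    return tokens.count(term)


def average_position(term: str, tokens: List[str]) -> float:
    """Mean 0-based token position of the term, or -1.0 if it does not occur."""
    positions = [i for i, t in enumerate(tokens) if t == term]
    return sum(positions) / len(positions) if positions else -1.0


title = ["brain", "and", "gene", "regulation", "about", "brain"]
print(occurrence_count("brain", title))   # 2
print(average_position("brain", title))   # 2.5
```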

Preferably, the present invention constructs 27 features to train the model, such as:

1. a literature language.

2. Date of filing.

3. Document category (these features can all be extracted directly from the database document field).

4. Whether the documents are available for full-text search or not (the documents available for full-text search in the database are firstly obtained, and then compared according to the documents searched by the search words, the existence is 1, and otherwise, the existence is 0).

5. Number of characters of document title.

6. Number of characters of the abstract.

7. Number of characters of the medical subject words.

8. Frequency of stop words in the title (stop words are words that appear frequently but carry no meaning in the literature, such as a, are, about, and, after, before, but …), computed as the ratio of the number of stop words in the title to the total number of words in the title.

9. Frequency of stop words in the abstract (the ratio of the number of stop words in the abstract to the total number of words in the abstract).

10. The weight of the search term in the document title (the invention takes both the search term and its synonyms and calculates their weight in the document title).

11. The number of occurrences of the search term in the document title (a counter is defined and incremented by 1 at each occurrence).

12. The ratio of the number of occurrences of the search term in the document title to the number of search terms.

13. The average position of the search term in the document title (the character positions of the title are defined; whenever the characters match those of the search term, the position in the string is recorded, and all recorded positions are then averaged).

14. The weight of the search term in the document abstract (the invention takes both the search term and its synonyms and calculates their weight in the document abstract).

15. The number of occurrences of the search term in the document abstract (a counter is defined and incremented by 1 at each occurrence).

16. The ratio of the number of occurrences of the search term in the document abstract to the number of search terms.

17. The average position of the search term in the document abstract (the character positions of the abstract are defined; whenever the characters match those of the search term, the position in the string is recorded, and all recorded positions are then averaged).

18. The weight of the search term in the medical subject words (the invention takes both the search term and its synonyms and calculates their weight in the medical subject words).

19. The number of occurrences of the search term in the medical subject words (a counter is defined and incremented by 1 at each occurrence).

20. The ratio of the number of occurrences of the search term in the medical subject words to the number of search terms.

21. The average position of the search term in the medical subject words (the character positions of the medical subject words are defined; whenever the characters match those of the search term, the position in the string is recorded, and all recorded positions are then averaged).

22. The number of characters in the search terms that are neither letters nor digits.

23. The number of search terms.

24. The number of search terms that do not repeat.

25. Number of clicks on the document.

26. Number of collections of the document.

27. Number of praises of the document.

These features are used to train the model that optimizes the sorting. In particular, the invention additionally adds the number of clicks, the number of collections and the number of praises of a document, so that a better overall sorting result is obtained.
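A few of the text-based features listed above can be sketched as follows. This is a minimal illustration and not the implementation of the invention: the stop-word list, the function names and the whitespace-style tokenization are assumptions, positions are counted per token rather than per character, and each search term is assumed to be a single token.

```python
import re

# Hypothetical, very small stop-word list used only for illustration.
STOP_WORDS = {"a", "are", "about", "and", "after", "before", "but"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stop_word_frequency(text):
    """Ratio of stop-word tokens to all tokens (features 8 and 9)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok in STOP_WORDS for tok in tokens) / len(tokens)


def field_features(search_terms, field_text):
    """Occurrence count, count-to-term ratio and average token position
    of the search terms in one field (title, abstract or subject words)."""
    tokens = tokenize(field_text)
    terms = {term.lower() for term in search_terms}
    positions = [i for i, tok in enumerate(tokens) if tok in terms]
    count = len(positions)
    ratio = count / len(search_terms) if search_terms else 0.0
    avg_position = sum(positions) / count if count else 0.0
    return {"count": count, "ratio": ratio, "avg_position": avg_position}


def non_alnum_chars(search_terms):
    """Number of characters that are neither letters nor digits (feature 22)."""
    return sum(1 for term in search_terms for ch in term if not ch.isalnum())
```

The same `field_features` call can be applied to the title, the abstract and the medical subject words in turn, which yields the count, ratio and position features of each field.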

After the features are extracted, label processing is performed on the training samples so that they can be input into the sorting model for training. Constructing the labels, that is, performing label processing on the training sample data, specifically comprises:

Step A, calculating a relevance score for the relevant medical documents searched by users in the training sample data according to the search click data in the training sample data;

Step B, sorting the relevant medical documents in the training sample data according to their relevance scores, and taking the sorting result as the gold standard;

Step C, dividing the medical documents obtained by user search in the training sample data into several grades according to the gold standard, and setting corresponding labels. Specifically, for each medical document obtained by user search in the training sample data: if the medical document is among the top 10 documents of the gold standard, its label is set to 12 minus its ranking number; if the medical document is ranked between 10 and 20 in the gold standard, its label is set to 2; if the medical document is ranked beyond 20 in the gold standard, its label is set to 1; and if the medical document is not in the gold standard, its label is set to 0.

For example, let q be a query and d be a document; a(d, q) is the number of abstract requests for d after q, and f(d, q) is the number of full-text requests for d after q. FT denotes the subset of articles in the corpus whose full text is available. 1FT(d) is an indicator function: 1FT(d) = 1 if d ∈ FT (the full text is available), and 1FT(d) = 0 if d ∉ FT (the full text is not available). μ ∈ (0, 1) is the weight between abstract clicks and full-text clicks, λ ∈ R+ is the number of papers in PubMed without full-text links, and K is the relevance score. The relevance score of d with respect to q is calculated as follows:

K＝μ·a(d,q)+(1-μ)·f(d,q)+a(d,q)λ·(1-1FT(d)) (6)

Sorting is then performed and the sorted documents are taken as the gold standard. The searched documents are then compared with the gold standard as follows:

1. If the document appears in the first 10, its label is 12 minus its index.

2. If its index is between 10 and 20, its label is 2.

3. If its index is greater than 20, its label is 1.

4. If the document is not in the gold standard, its label is 0.

In this manner, the levels are set in the range of 0 to 11.
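The labeling rule above can be sketched as a small function. The function name and the representation of the gold standard as an ordered list of document identifiers are assumptions made for illustration; the index is taken to be 1-based, so that the first document receives the label 11.

```python
def assign_label(doc_id, gold_ranking):
    """Return the graded label of a document from its 1-based gold-standard rank.

    gold_ranking is an ordered list of document identifiers, best first
    (an assumed representation of the gold standard).
    """
    try:
        rank = gold_ranking.index(doc_id) + 1
    except ValueError:
        return 0              # not in the gold standard
    if rank <= 10:
        return 12 - rank      # ranks 1..10 map to labels 11..2
    if rank <= 20:
        return 2              # ranks 11..20
    return 1                  # ranks beyond 20
```

With this rule the labels of one query always fall in the range 0 to 11, matching the levels stated in the text.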

S203, dividing the training sample data after label processing into a training set, a test set and a verification set according to a preset proportion;

S204, calling a sorting model in a designated machine learning library, and setting the learning parameters of the sorting model;

Specifically, this step mainly constructs the machine learning model. A designated machine learning library can be adopted, for example the RankLib package, which contains 8 sorting algorithms. RankLib is an open-source implementation of a set of well-performing algorithms in the Learning to Rank field. A model of the RankLib library can therefore be invoked and its learning parameters set.

Preferably, LambdaMART, from Microsoft, is a well-performing Learning to Rank model, so a LambdaMART model can be constructed as the sorting model for this training. Regarding the learning parameters, the invention uses 200 trees for decision learning and a learning rate of 0.3, and keeps the default settings for the remaining parameters. The default of the Best Match of PubMed is 1000 trees with a learning rate of 0.1. By using 200 trees and a learning rate of 0.3, the method achieves the same effect with fewer trees and trains a well-performing model in less time: with 800 fewer trees to train, the evaluation index rises by nearly 4% on the training set, by nearly 1% on the verification set and by nearly 0.4% on the test set, so that a good model is trained while the training time is reduced.
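As a rough sketch of how such a run might be prepared, the following helpers write one training row in the LETOR-style text format read by RankLib and assemble a command line with 200 trees and a learning rate (shrinkage) of 0.3. The helper names, the file names and the jar name are hypothetical, the flags are those of RankLib's command-line interface as commonly documented, and the mapping of the three data sets to the `-train`, `-validate` and `-test` flags is an assumption rather than something stated in the text.

```python
def to_letor_line(label, qid, features, comment=""):
    """Format one sample as '<label> qid:<qid> 1:<v1> 2:<v2> ... # comment'."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    line = f"{label} qid:{qid} {feats}"
    return f"{line} # {comment}" if comment else line


def ranklib_command(train, validate, test, model_out, jar="RankLib.jar"):
    """Build an argument list for a LambdaMART run (ranker type 6 in RankLib).

    The jar name and file paths are placeholders chosen for this sketch.
    """
    return [
        "java", "-jar", jar,
        "-train", train, "-validate", validate, "-test", test,
        "-ranker", "6",            # LambdaMART
        "-tree", "200",            # number of trees used in the invention
        "-shrinkage", "0.3",       # learning rate used in the invention
        "-metric2t", "NDCG@20",    # metric optimized on the training data
        "-save", model_out,
    ]
```

The returned list can be passed to `subprocess.run` once the three data files exist; all other LambdaMART parameters are left at their defaults, as in the text.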

In addition, the construction of the LambdaMART model in the invention refers to the Best Match of PubMed and improves on it to a certain extent. For example, the downloaded Best Match code has a small problem when extracting the PMID: only PMIDs of up to 9 digits are accepted, and an error is reported if the PMID has more than 9 digits. The LambdaMART model constructed by this method modifies and perfects this code so that PMIDs of more than 9 digits are accepted. In addition, the splitting of the synonym part of a search term is improved, since the original code cut out single letters rather than whole words.

S205, loading a training set and a testing set to train an initial training model;

S206, according to a preset evaluation index for measuring the sorting quality, performing a verification test on the trained model with the training sample data in the verification set;

Specifically, in the previous step the training data is divided into a training set, a test set and a verification set. The training set and the test set are then passed as training data into the parameters of the training model, the verification set is used for testing, and NDCG@K is used as the evaluation index for training the model (for example, in this embodiment K is 20, that is, only whether the first 20 documents are correctly ranked is of concern, rather than whether all documents are correctly ranked).
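For reference, NDCG@K can be computed as in the following sketch. It uses the common exponential gain (2^label - 1) with a log2 position discount; the text does not state which gain variant is used, so that choice is an assumption, as are the function names.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain of the first k labels (exponential gain)."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(labels[:k]))


def ndcg_at_k(labels_in_predicted_order, k=20):
    """NDCG@k for one query: DCG of the predicted order over the ideal DCG."""
    ideal = dcg_at_k(sorted(labels_in_predicted_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document for this query
    return dcg_at_k(labels_in_predicted_order, k) / ideal
```

The input is the list of graded labels (0 to 11 in this document) of one query's documents, arranged in the order produced by the model; the per-query values are typically averaged over all queries of the evaluated set.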

S207, taking the model that passes the verification test as the trained optimized sorting model.

Finally, the model passing the verification test can be used as a trained optimized sorting model, so that the roughly sorted related medical documents in the specified number are further optimized and sorted by using the model, the sorting effect is improved, and the user experience is improved.

Based on the same technical concept, the invention also discloses a biomedical literature search ranking device, which can use the biomedical literature search ranking method described above to search and sort the biomedical documents queried by the user. Specifically, the structural block diagram of one embodiment of the biomedical literature search ranking device of the invention is shown in fig. 3, and the work flow diagram of one embodiment of the device is shown in fig. 4. The biomedical literature search ranking device includes:

an input obtaining module 100, configured to obtain query content input by a user;

the preprocessing module 200 is configured to preprocess the query content to obtain a search word set including at least one search word; specifically, preprocessing the query content mainly refers to performing word segmentation on the query content. For example, if the search content consists of a plurality of words, it is first split into individual search words, and the set of these search words is the search word set.

The database searching module 300 is configured to search the search word set on a specified data search platform and obtain the related medical documents related to the search word set; specifically, the specified data search platform may be, for example, Elasticsearch (a non-relational database and real-time search platform). After the search word set (the query target) composed of all the search words is obtained in the previous step, the search is performed on the data search platform so as to obtain the medical documents related to the query content input by the user.

A rough ordering module 400, configured to perform rough ordering on the relevant medical documents according to the relevance from high to low; and after the relevant medical documents relevant to the query content input by the user in the database are obtained, roughly sorting the medical documents according to the relevance from high to low. Specifically, the documents are roughly sorted by correlation, and a conventional correlation calculation method can be used, for example, BM25 or BM25F (an algorithm proposed based on a probability retrieval model) is used for roughly sorting.

An extracting module 500, configured to extract, as a target medical document, a specified number or a specified percentage of related medical documents ranked in the top from among the roughly ranked related medical documents;

In order to improve the sorting efficiency, medical documents with poor relevance do not need to undergo optimized sorting, so only a specified number or a specified percentage of the top-ranked related medical documents is selected as the target medical documents for optimized sorting.

The optimizing and sequencing module 600 is configured to input the target medical documents into a trained optimizing and sequencing model for optimizing and sequencing, and output the optimized and sequenced target medical documents;

Specifically, the rough sorting is performed only according to a relevance algorithm and does not consider factors such as the number of clicks and the number of praises, so some documents ranked only by relevance are not necessarily what the user wants to find, and a good user experience cannot be provided. The target medical documents are therefore further sorted by the trained optimized sorting model.

The output feedback module 700 is configured to output to the user the optimally sorted target medical documents and the coarse-sorted related medical documents remaining after extraction.
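The cooperation of the extraction, optimized sorting and output modules can be sketched as follows. The function and parameter names are hypothetical, and `model_score` stands in for whatever scoring function the trained optimized sorting model exposes.

```python
def rerank_top(coarse_sorted, top_n, model_score):
    """Re-rank only the first top_n documents and keep the rest in coarse order.

    coarse_sorted: documents already sorted by coarse relevance, best first.
    model_score:   callable returning the optimized-model score of a document
                   (a placeholder for the trained model).
    """
    head = coarse_sorted[:top_n]      # target medical documents
    tail = coarse_sorted[top_n:]      # documents left after extraction
    reranked = sorted(head, key=model_score, reverse=True)
    return reranked + tail
```

Because only the head of the list is passed through the model, the cost of the optimized sorting is bounded by `top_n` regardless of how many documents the coarse search returns.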

The rough sorting module can perform the rough sorting with the BM25 or BM25F algorithm. Specifically, the documents are roughly sorted by relevance: a conventional relevance calculation method is used to calculate a score for each document, and the documents are sorted by that score. The invention takes BM25F as an example. The IDF (inverse text frequency index) and related quantities are calculated for each word, and all PMIDs (the unique document identification codes in PubMed) and the document contents are obtained.

Specifically, the rough sorting module includes:

the inverse text frequency index calculation sub-module, configured to calculate the inverse text frequency index IDF of each search term. In the formula used, N is the total number of biomedical documents in the database, n is the number of documents containing the search term, and a smoothing factor of 1 is added to the numerator (to avoid a numerator of 0) and to the denominator (to avoid a denominator of 0) of the ratio.

The word frequency calculation sub-module is used for calculating the word frequency of the search term. The specific calculation process is as follows. The intersection is first taken to obtain the documents containing all the search terms. The fields contribute differently to the overall weight of the document and each field has its own weight, so an occurrence in a specific field is more important than an occurrence in other fields. For example, when calculating the weight of a document, search terms appearing in the title of the document are more important than those appearing in the body of the article. As an example of field weights, a search term appearing in the title has a weight of 5, one appearing in the medical subject words has a weight of 5, and one appearing in the abstract has a weight of 1, so an occurrence in the title or in the medical subject words is more important than an occurrence in the abstract. The length of each field is then obtained and the sum of the weighted frequencies over all fields is calculated; the higher the frequency of the search term in a specific field of the document, the more important the document. The calculation is performed according to the following formula:

where occurrences of t in f is the number of occurrences of search term t in field f (calculated separately for each search term), FL is the length of the field, and FW is the weight of the field.

The document length calculation sub-module is used for calculating the length of the document currently being scored and the average length of all biomedical documents. Because different fields have different influence, the average weight of all fields is computed, and the average length (avdl) of all documents is calculated:

avdl＝average of dl across documents (4)

The scoring sub-module is used for calculating the score of each search term and then summing these scores to obtain the final score of the document, calculated according to the following formula:

Only the documents containing all the search terms, that is, the intersection of the documents related to each search term, are scored. In the formula, d is the document, q is the set of all search terms, and t is a single search term when the number of search terms is more than 2. The BM25F formula contains two free adjustment parameters, k1 (1.2) and b (0.75). k1 and b are tuned by testing: with k1 kept unchanged, b is repeatedly modified and the result is compared with the gold standard to obtain an optimal b value (the value is generally set in the range of 0 to 1000); an optimal k1 value can be obtained in the same way. The optimal k1 and b can then be used as the free adjustment parameters.

The relevant medical documents are sorted according to their scores to obtain the coarse sorting result, and the specified number or percentage of top-ranked medical documents is then cut off for optimized sorting. For example, the top 10% of the documents after the coarse sorting (but no more than 100 documents if the top 10% exceeds 100) are selected for the subsequent optimization.
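The coarse scoring and the cut-off can be illustrated with the sketch below. Since the formulas themselves are not reproduced in this text, the sketch assumes the textbook BM25 form with a field-weighted word frequency: the smoothed IDF log((N + 1) / (n + 1)), the word frequency summed over fields as FW x occurrences / FL, and the document length taken as the total token count are all assumptions and may differ from the formulas of the invention. The field weights 5, 5 and 1 and the defaults k1 = 1.2 and b = 0.75 are taken from the text; the field and function names are hypothetical.

```python
import math

# Field weights stated in the text: title 5, medical subject words 5, abstract 1.
FIELD_WEIGHTS = {"title": 5.0, "mesh": 5.0, "abstract": 1.0}


def idf(total_docs, docs_with_term):
    """Smoothed inverse text frequency (assumed form with +1 smoothing)."""
    return math.log((total_docs + 1) / (docs_with_term + 1))


def weighted_tf(term, doc_fields):
    """Sum over fields of FW * occurrences / FL (assumed form)."""
    tf = 0.0
    for name, tokens in doc_fields.items():
        if tokens:
            tf += FIELD_WEIGHTS.get(name, 1.0) * tokens.count(term) / len(tokens)
    return tf


def bm25_score(terms, doc_fields, doc_freq, total_docs, avdl, k1=1.2, b=0.75):
    """Sum of per-term scores in the standard BM25 shape."""
    dl = sum(len(tokens) for tokens in doc_fields.values())
    norm = k1 * (1 - b + b * dl / avdl)
    score = 0.0
    for t in terms:
        tf = weighted_tf(t, doc_fields)
        score += idf(total_docs, doc_freq.get(t, 0)) * tf * (k1 + 1) / (tf + norm)
    return score


def rerank_cutoff(num_docs, percent=10, cap=100):
    """Top 10% of the coarse-sorted documents, but no more than 100."""
    return min((num_docs * percent + 99) // 100, cap)
```

Documents are then sorted by `bm25_score` in descending order, and the first `rerank_cutoff(len(documents))` of them are handed to the optimized sorting model.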

In another embodiment of the biomedical literature search ranking device of the present invention, on the basis of the above device embodiments, the device further includes a model training module 800. As shown in fig. 5, the model training module 800 specifically includes:

the sample acquisition submodule 810, configured to acquire training sample data, where the training sample data includes biomedical documents and their search and click data; specifically, for example, website search log data and the biomedical articles searched for and clicked by users are used as training data.

The sample processing submodule 820 is used for extracting the characteristics of the training sample data and performing label processing on the training sample data;

The features are extracted for machine learning training. The extracted features include, for example, the document date, the language, the number of times a search term appears in the title, the weight of the occurrences of search terms in the medical subject words, and the numbers of clicks, praises and collections. The number of clicks and the number of collections of a document are added because these two features reflect, to a certain extent, that users consider the article good; they are used as features to train the model that optimizes the sorting, so that a better overall sorting result is obtained.

The sample division submodule 830 is configured to divide the training sample data after the label processing into a training set, a test set, and a verification set according to a preset ratio; specifically, for example, the training sample data is divided into a training set, a test set and a verification set according to the proportions of 80%, 10% and 10%, respectively.
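The 80% / 10% / 10% division can be sketched as follows. Splitting by query identifier, so that all documents of one query stay in the same subset, is an assumption that is usual for Learning to Rank data but is not stated in the text; the function name and the fixed seed are likewise illustrative.

```python
import random


def split_queries(query_ids, percents=(80, 10, 10), seed=0):
    """Shuffle the distinct query ids and split them into
    training, test and verification subsets by the given percentages."""
    ids = sorted(set(query_ids))
    random.Random(seed).shuffle(ids)      # reproducible shuffle
    n_train = len(ids) * percents[0] // 100
    n_test = len(ids) * percents[1] // 100
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    verification = ids[n_train + n_test:]  # remainder goes to verification
    return train, test, verification
```

Each labeled sample is then written to the file of the subset to which its query identifier was assigned.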

The model selection submodule 840 is used for calling a sorting model in a designated machine learning library and setting the learning parameters of the sorting model; specifically, this step mainly constructs the machine learning model. A designated machine learning library can be adopted, for example the RankLib package, which contains 8 sorting algorithms. RankLib is an open-source implementation of a set of well-performing algorithms in the Learning to Rank field. A model of the RankLib library can therefore be invoked and its parameters set.

Preferably, LambdaMART, from Microsoft, is a well-performing Learning to Rank model, so a LambdaMART model can be constructed as the sorting model for this training. Regarding the parameters, the Best Match of PubMed uses 1000 trees and a learning rate of 0.1 by default, whereas the invention uses 200 trees for decision making and a learning rate of 0.3 and keeps the default settings for the remaining parameters. The same effect is achieved with fewer trees and a well-performing model is trained in less time: with 800 fewer trees to train, the evaluation index rises by nearly 4% on the training set, by nearly 1% on the verification set and by nearly 0.4% on the test set, so that a good model is trained while the training time is reduced.

A loading training submodule 850, configured to load a training set and a test set to train an initial training model;

the verification sub-module 860 is used for performing verification test on the trained model by adopting training sample data in a verification set according to a preset evaluation index for measuring the sequencing quality; and taking the model passing the verification test as a trained optimized sequencing model.

Specifically, after the training set and the test set are passed as training data into the parameters of the training model, the sample data in the verification set are used for testing, and NDCG@K is used as the evaluation index for training the model.

After the model training is finished, the top-ranked part of the rough sorting result is selected for optimized sorting. For example, the top 100 documents cut from the rough sorting are optimally sorted and then placed together with the rest of the rough sorting result. The output feedback module outputs the sorted related documents to the user in that order, so that the search requirement of the user is met.

Preferably, on the basis of the above-mentioned embodiment of the apparatus, the sample processing sub-module 820 comprises:

the feature extraction unit 821 is configured to extract basic information of the medical document, stop word information, parameter information of a search word in the search word set appearing in a specified domain of the medical document, and flow information in the training sample data; wherein the flow information comprises any one or more of the number of clicks, the number of collections and the number of praises of the medical literature.

Specifically, the basic information includes a document language, a document date, a document category, a document title character number, whether the document is available for a full-text search, a document abstract character number, a medical subject word character number, and the like. The stop word information comprises the frequency of stop words appearing in the title and the abstract; the parameter information of the search word appearing in the designated field of the medical document comprises the weight of the search word in the document title, the number of times of the search word appearing in the document title, the ratio of the number of times of the search word appearing in the document title to the number of the search word, the average position of the search word in the document title, the weight of the search word in the document abstract, the number of times of the search word appearing in the document abstract, the ratio of the number of times of the search word appearing in the document abstract to the number of the search word, the average position of the search word in the document abstract, the weight of the search word in the medical subject word, the number of times of the search word appearing in the medical subject word, the ratio of the number of times of the search word appearing in the medical subject word to the number of the search word, the average position of the search word in the medical subject word, the number of characters of the search word which are not alphabetical and not numeric. The traffic information includes the number of clicks on documents, the number of collections on documents, the number of praise on documents, and the like.

A label processing unit 822, configured to perform label processing on the training sample data. The processing specifically comprises: the label processing unit calculates the relevance score of the relevant medical documents searched by users in the training sample data according to the search click data in the training sample data; the label processing unit sorts the relevant medical documents according to the scores and takes the sorting result as the gold standard; and the label processing unit divides the medical documents in the training sample data into several grades according to the gold standard and sets corresponding labels.

K＝μ·a(d,q)+(1-μ)·f(d,q)+a(d,q)λ·(1-1FT(d)) (6)

1. If the document appears in the first 10, its label is 12 minus its index.

2. If its index is between 10 and 20, its label is 2.

3. If its index is greater than 20, its label is 1.

4. If the document is not in the gold standard, its label is 0.

In this manner, the levels are set in the range of 0 to 11.

In this way, label processing can be carried out accurately on each piece of training sample data, which benefits the subsequent machine learning.

The biomedical literature search ranking method corresponds to the biomedical literature search ranking device; the technical details of the method embodiments also apply to the device embodiments and are not repeated here.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A biomedical literature search ranking method, comprising:

acquiring query content input by a user;

preprocessing the query content to obtain a search word set at least comprising one search word;

searching the search word set on a specified data search platform, and acquiring related medical documents related to the search word set;

roughly sorting the relevant medical documents according to the relevance from high to low;

extracting a specified number or a specified percentage of related medical documents ranked in the top from the roughly ranked related medical documents as target medical documents;

inputting the target medical documents into a trained optimized sorting model for optimized sorting, and outputting the optimized sorted target medical documents;

and outputting the optimally sorted target medical documents and the coarse sorting related medical documents left after extraction to a user.

2. The biomedical literature search ranking method according to claim 1, wherein the coarse ranking of the relevant medical literature from high to low according to relevance specifically comprises:

calculating the inverse text frequency index and the word frequency of each search word in the search word set; the length of the document to be scored currently and the average length of all documents;

scoring the relevant medical documents according to the following formula, and sorting the relevant medical documents according to the scoring size:

wherein, the parameter d is a scoring document needing scoring currently in the relevant medical documents, q is a search word set, score (d, q) is a ranking score aiming at the search word set q and the scoring document d;

t is a single search term in the search term set;

TFt is the word frequency of the search word t;

IDFt is the inverse text frequency index of the search term t;

dl is the length of document d currently needing to be scored;

avdl is the average length of all relevant medical documents containing the search term t in the search term set;

k1 is a free adjustment parameter, b is a free adjustment parameter.

3. The biomedical literature search ranking method according to claim 1, further comprising:

training the optimized sequencing model; the method specifically comprises the following steps:

acquiring training sample data, wherein the training sample data comprises biomedical documents and search and click data thereof;

extracting the characteristics of the training sample data, and performing label processing on the training sample data;

dividing the training sample data after label processing into a training set, a test set and a verification set according to a preset proportion;

calling a sequencing model in a designated machine learning library, and setting learning parameters of the sequencing model;

loading the training set and the test set to train an initial training model;

according to a preset evaluation index for measuring the sequencing quality, adopting the training sample data in the verification set to perform verification test on the trained model;

and taking the model passing the verification test as a trained optimized sequencing model.

4. The method according to claim 3, wherein said extracting features of the training sample data specifically comprises:

extracting basic information, stop word information, parameter information and flow information of the medical literature in the training sample data, wherein the parameter information and the flow information of search words in the search word set appear in a specified domain of the medical literature; wherein the flow information of the medical literature comprises: any one or more of the number of clicks, collections, praise for the medical document.

5. The method according to claim 3, wherein the labeling of the training sample data specifically comprises:

calculating the relevancy score of the relevant medical literature searched by the user in the training sample data according to the search click data in the training sample data;

sorting according to the degree of relevancy scores of the relevant medical documents in the training sample data, and taking the sorting result as a gold standard;

and dividing each medical document obtained by the user in the training sample data by several grades according to the gold standard, and setting corresponding labels.

6. The method according to claim 5, wherein dividing each medical document obtained by user search in the training sample data into several grades according to the gold standard and setting a corresponding label specifically comprises:

for each medical document obtained by user search in the training sample data, if the medical document is among the top 10 medical documents in the gold standard, setting the label of the medical document to 12 minus its search ranking number;

if the medical document is a medical document ranked between 10 and 20 in the gold standard, setting the label of the medical document to 2;

if the medical document has a ranking greater than 20 in the gold standard, setting the label of the medical document to 1;

if the medical document is not in the gold standard, setting the label of the medical document to 0.
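
The grading rule of claim 6 can be written as a small function. The sketch assumes that the "search ranking number" is the document's 1-based position in the gold standard; under that reading a document at position 10 receives label 2 whether it is treated as top 10 (12 minus 10) or as ranked between 10 and 20. The function name and the list representation of the gold standard are illustrative.

```python
def grade_label(doc_id, gold_ranking):
    """Map a document to its training label from its gold standard position.

    gold_ranking is a list of document ids, most relevant first.
    """
    try:
        rank = gold_ranking.index(doc_id) + 1  # 1-based position (assumed)
    except ValueError:
        return 0            # not in the gold standard
    if rank <= 10:
        return 12 - rank    # top 10: labels 11 down to 2
    if rank <= 20:
        return 2            # ranked between 10 and 20
    return 1                # ranked beyond 20
```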

7. The biomedical literature search ranking method according to claim 3, wherein the called ranking model is a LambdaMART model, and the setting of the learning parameters of the ranking model specifically comprises:

the number of trees in the LambdaMART parameters is set to 200, and the learning rate is set to 0.3.
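
The claim does not name the machine learning library. As one possible realization, assuming LightGBM is the designated library, LambdaMART corresponds to gradient-boosted trees trained with the `lambdarank` objective, and the two claimed parameters map onto `n_estimators` and `learning_rate` of `LGBMRanker`:

```python
# Claimed learning parameters, expressed with LightGBM's scikit-learn
# parameter names. The choice of LightGBM is an assumption.
LAMBDAMART_PARAMS = {
    "objective": "lambdarank",  # LambdaMART = lambdarank objective on GBDT
    "n_estimators": 200,        # number of trees
    "learning_rate": 0.3,
}


def build_ranker(params=LAMBDAMART_PARAMS):
    """Construct an untrained ranker; requires the lightgbm package."""
    import lightgbm as lgb  # imported lazily so the constants work without it
    return lgb.LGBMRanker(**params)
```

Training would then call `fit(X, y, group=...)` on the returned ranker, where `group` lists how many consecutive rows belong to each query.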

8. A biomedical literature search ranking device, wherein search ranking is performed by the biomedical literature search ranking method according to any one of claims 1 to 7, the biomedical literature search ranking device comprising:

the input acquisition module is used for acquiring query contents input by a user;

the preprocessing module is used for preprocessing the query content to obtain a search word set at least comprising one search word;

the database searching module is used for searching the search word set on a specified data search platform and acquiring related medical documents related to the search word set;

the rough sorting module is used for roughly sorting the relevant medical documents from high to low according to the relevance;

the extraction module is used for extracting a specified number or a specified percentage of the top-ranked related medical documents from the roughly sorted related medical documents as target medical documents;

the optimized sorting module is used for inputting the target medical documents into a trained optimized sorting model for optimized sorting and outputting the optimized sorted target medical documents;

and the output feedback module is used for outputting, to a user, the optimally sorted target medical documents and the roughly sorted related medical documents remaining after extraction.
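
The cooperation of the rough sorting, extraction, optimized sorting and output feedback modules can be sketched as a single function. The names `search_and_rank`, `rough_score` and `rerank` are hypothetical; `rerank` stands in for the trained optimized sorting model and is assumed to return a list.

```python
import math


def search_and_rank(candidates, rough_score, rerank, top_n=None, top_percent=None):
    """Roughly sort, re-rank the head with the model, and append the rest.

    Exactly one of top_n (a count) or top_percent (0 to 100) selects how many
    top-ranked documents are extracted as target documents.
    """
    roughly_sorted = sorted(candidates, key=rough_score, reverse=True)
    if top_n is not None:
        cut = min(top_n, len(roughly_sorted))
    elif top_percent is not None:
        cut = math.ceil(len(roughly_sorted) * top_percent / 100)
    else:
        raise ValueError("specify top_n or top_percent")
    targets, remainder = roughly_sorted[:cut], roughly_sorted[cut:]
    # Only the extracted head is passed to the model; the documents left
    # after extraction keep their rough order and follow the re-ranked head.
    return rerank(targets) + remainder
```

Restricting the model to the extracted head keeps the more expensive re-ranking cost bounded regardless of how many documents the search returns.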

9. The biomedical literature search ranking device according to claim 8, further comprising: a model training module for training the optimized ranking model, the model training module specifically comprising:

the sample acquisition submodule is used for acquiring training sample data, and the training sample data comprises biomedical documents and search and click data thereof;

the sample processing submodule is used for extracting the characteristics of the training sample data and carrying out label processing on the training sample data;

the sample division submodule is used for dividing the training sample data after the label processing into a training set, a test set and a verification set according to a preset proportion;

the model selection submodule is used for calling a ranking model in a designated machine learning library and setting learning parameters of the ranking model;

the loading training submodule is used for loading the training set and the test set to train an initial training model;

the verification submodule is used for performing a verification test on the trained model using the training sample data in the verification set, according to a preset evaluation index for measuring the ranking quality, and taking the model that passes the verification test as the trained optimized ranking model.

10. The biomedical literature search ranking device according to claim 9, wherein said sample processing sub-module comprises:

the characteristic extraction unit is used for extracting basic information of the medical literature, stop word information, parameter information of search words in the search word set appearing in a specified domain of the medical literature, and flow information in the training sample data; wherein the flow information comprises any one or more of the number of clicks, the number of collections, and the number of praises of the medical literature;

the label processing unit is used for carrying out label processing on the training sample data; specifically, the label processing unit calculates the relevancy score of the relevant medical literature searched by the user in the training sample data according to the search click data in the training sample data; the label processing unit sorts the relevant medical documents according to the relevancy scores and takes the sorting result as a gold standard; and the label processing unit divides each medical document in the training sample data into several grades according to the gold standard and sets corresponding labels.