CN112214580A - Article identification method and device, computer equipment and storage medium - Google Patents

Article identification method and device, computer equipment and storage medium

Info

Publication number
CN112214580A
Authority
CN
China
Prior art keywords
medical
word
words
article
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011213480.1A
Other languages
Chinese (zh)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011213480.1A priority Critical patent/CN112214580A/en
Publication of CN112214580A publication Critical patent/CN112214580A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The embodiment of the invention discloses an article identification method, an article identification device, computer equipment and a storage medium based on an artificial intelligence technology, wherein the method comprises the following steps: extracting a medical word set from a target medical article to be identified, and constructing a medical knowledge graph of the target medical article by adopting a plurality of medical words in the medical word set; calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph; and selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing a key topic vector of the target medical article by using the word vectors of the key disease words, wherein the key topic vector is used for indicating the key disease topic of the target medical article. The embodiment of the invention can automatically identify the key disease theme of the medical article, effectively save labor cost and improve the accuracy of the identification result.

Description

Article identification method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, in particular to the field of computer technologies, and more particularly to an article identification method, an article identification apparatus, a computer device, and a computer storage medium.
Background
With the development of internet technology, internet medical platforms (such as internet medical apps (applications) and internet medical websites) have emerged. An internet medical platform can provide users with a large number of medical information articles (medical articles for short), so that users can obtain relevant medical information by reading them. A medical article typically includes content related to one or more diseases, and therefore typically has one or more disease topics. To better classify, retrieve, and recommend the large number of medical articles provided by an internet medical platform, it is often necessary to identify the key disease topic (also called the main disease topic) of each medical article.
At present, the key disease topic of each medical article is usually identified by manual labeling. This approach not only consumes a large amount of labor, but also yields low accuracy because manual identification is subjective. How to better identify the key disease topics of medical articles has therefore become a research hotspot.
Disclosure of Invention
The embodiment of the invention provides an article identification method, an article identification device, computer equipment and a storage medium, which can automatically identify key disease topics of medical articles, effectively save labor cost and improve the accuracy of identification results.
In one aspect, an embodiment of the present invention provides an article identification method, where the method includes:
extracting a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words at least comprise disease words;
constructing a medical knowledge graph of the target medical article by adopting the plurality of medical words, wherein the medical knowledge graph comprises a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph;
and selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing a key topic vector of the target medical article by using word vectors of the key disease words, wherein the key topic vector is used for indicating key disease topics of the target medical article.
In another aspect, an embodiment of the present invention provides an article identification apparatus, where the apparatus includes:
an extraction unit, configured to extract a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words at least comprise disease words;
the construction unit is used for constructing a medical knowledge graph of the target medical article by adopting the plurality of medical words, and the medical knowledge graph comprises a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
the processing unit is used for calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph;
the processing unit is further configured to select a key disease word of the target medical article from the set of medical words according to the importance of each medical word, and construct a key topic vector of the target medical article by using a word vector of the key disease word, where the key topic vector is used to indicate a key disease topic of the target medical article.
In one embodiment, the processing unit, when being configured to construct the key topic vector of the target medical article using the word vector of the key disease word, may be specifically configured to:
acquiring related non-disease words corresponding to the key disease words from the medical word set, wherein the related non-disease words meet the following conditions: in the medical knowledge graph, nodes for recording the related non-disease words are connected with nodes for recording the key disease words;
acquiring word vectors of the key disease words and word vectors of the related non-disease words;
and fusing the word vector of the key disease word and the word vector of the related non-disease word to obtain the key topic vector of the target medical article.
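The fusion step above can be sketched as follows. This is a minimal illustration only: the source does not state how the vectors are fused, so element-wise averaging is an assumption, and the function names, the adjacency-dictionary graph layout, and the toy 3-dimensional vectors are all hypothetical.

```python
def related_non_disease_words(graph, key_word, disease_words):
    """Neighbours of the key disease word's node that are not disease words.

    `graph` is assumed to map each medical word to the set of words
    whose nodes are connected to it in the medical knowledge graph.
    """
    return sorted(w for w in graph.get(key_word, ()) if w not in disease_words)


def fuse_vectors(key_vector, related_vectors):
    """Fuse by element-wise averaging (an assumed fusion rule)."""
    vectors = [list(key_vector)] + [list(v) for v in related_vectors]
    dim = len(key_vector)
    if any(len(v) != dim for v in vectors):
        raise ValueError("all word vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical graph and word vectors, for illustration only.
graph = {
    "gastric ulcer": {"stomach", "gastroscope", "chronic gastritis"},
    "stomach": {"gastric ulcer"},
    "gastroscope": {"gastric ulcer"},
    "chronic gastritis": {"gastric ulcer"},
}
disease_words = {"gastric ulcer", "chronic gastritis"}
word_vectors = {
    "gastric ulcer": [1.0, 0.0, 2.0],
    "stomach": [2.0, 3.0, 1.0],
    "gastroscope": [3.0, 0.0, 0.0],
}

related = related_non_disease_words(graph, "gastric ulcer", disease_words)
key_topic_vector = fuse_vectors(
    word_vectors["gastric ulcer"], [word_vectors[w] for w in related]
)
print(related)           # ['gastroscope', 'stomach']
print(key_topic_vector)  # [2.0, 1.0, 1.0]
```

Other fusion rules, such as a weighted sum that favours the key disease word, would fit the same description equally well.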
In another embodiment, the processing unit, when configured to select the key disease word of the target medical article from the medical word set according to the importance of each medical word, may be specifically configured to:
selecting a plurality of candidate keywords of the target medical article from the medical word set according to the importance of each medical word and a keyword selection strategy; the candidate keywords comprise at least one candidate disease word;
and selecting the candidate disease word with the maximum importance degree from the at least one candidate disease word as a key disease word of the target medical article.
In yet another embodiment, the processing unit is further operable to:
obtaining word vectors of all candidate keywords, and calculating the vector similarity between the word vectors of all candidate keywords and the key topic vectors;
selecting article keywords of the target medical article from the candidate keywords according to the vector similarity between the word vector of each candidate keyword and the key topic vector; wherein the vector similarity between the word vector of the article keyword and the key topic vector is greater than a similarity threshold;
and performing associated storage on the target medical article and the article keywords so as to perform business processing on the target medical article according to the article keywords.
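The keyword selection described above can be sketched as follows. The source speaks only of a "vector similarity", so the use of cosine similarity here is an assumption; the function names, the toy vectors, and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_article_keywords(candidate_vectors, key_topic_vector, threshold):
    """Keep candidates whose similarity to the key topic vector exceeds the threshold."""
    return [
        word
        for word, vector in candidate_vectors.items()
        if cosine_similarity(vector, key_topic_vector) > threshold
    ]


# Hypothetical 2-dimensional word vectors, for illustration only.
candidate_vectors = {
    "gastric ulcer": [1.0, 0.0],
    "stomach": [1.0, 1.0],
    "headache": [0.0, 1.0],
}
keywords = select_article_keywords(candidate_vectors, [1.0, 0.0], threshold=0.5)
print(keywords)  # ['gastric ulcer', 'stomach']
```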
In another embodiment, when the processing unit is configured to select, according to a keyword selection policy, a plurality of candidate keywords of the target medical article from the set of medical words according to the importance of each medical word, the processing unit may be specifically configured to:
selecting, in descending order of importance, a preset number of medical words from the medical word set as the plurality of candidate keywords of the target medical article; or,
and selecting the medical words with the importance degrees larger than the importance degree threshold value from the medical word set as a plurality of candidate keywords of the target medical article.
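The two keyword selection strategies, together with the choice of the key disease word from the resulting candidates, can be sketched as follows. The function names and the importance scores are hypothetical and serve only to illustrate the selection logic.

```python
def top_k_candidates(importance, k):
    """Strategy 1: the k medical words with the highest importance."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


def threshold_candidates(importance, threshold):
    """Strategy 2: every medical word whose importance exceeds the threshold."""
    return [word for word, score in importance.items() if score > threshold]


def key_disease_word(candidates, importance, disease_words):
    """The candidate disease word with the maximum importance, or None."""
    candidate_diseases = [w for w in candidates if w in disease_words]
    if not candidate_diseases:
        return None
    return max(candidate_diseases, key=importance.get)


# Hypothetical importance scores, for illustration only.
importance = {
    "gastric ulcer": 0.9,
    "stomach": 0.7,
    "chronic gastritis": 0.5,
    "gastroscope": 0.2,
}
disease_words = {"gastric ulcer", "chronic gastritis"}

by_count = top_k_candidates(importance, k=3)
by_threshold = threshold_candidates(importance, threshold=0.6)
print(by_count)      # ['gastric ulcer', 'stomach', 'chronic gastritis']
print(by_threshold)  # ['gastric ulcer', 'stomach']
print(key_disease_word(by_count, importance, disease_words))  # gastric ulcer
```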
In yet another embodiment, the key topic vector is a dominant topic vector of the target medical article, and the key disease topic is a main disease topic of the target medical article; accordingly, the processing unit may be further operable to:
selecting a reference disease word of the target medical article from the at least one candidate disease word, wherein the importance of the reference disease word is less than that of the key disease word;
constructing a subordinate topic vector of the target medical article by using the word vector of the reference disease word, wherein the subordinate topic vector of the target medical article is used for indicating a subordinate disease topic of the target medical article;
and storing the target medical article, the key topic vector and the subordinate topic vector into a storage space in an associated manner, so that when an article search request exists, article search processing is carried out according to the key topic vector and the subordinate topic vector.
In yet another embodiment, the storage space further includes at least one other medical article, and each other medical article has a corresponding dominant topic vector and a corresponding subordinate topic vector; accordingly, the processing unit may be further operable to:
when an article searching request exists, acquiring an information vector of article searching information carried by the article searching request;
acquiring each medical article in the storage space and at least one topic vector of each medical article; wherein each medical article has a recommendation weight value, and the at least one topic vector of each medical article comprises the dominant topic vector and the subordinate topic vector of that medical article;
respectively calculating the matching degree between each topic vector of each medical article and the information vector, and updating the recommendation weight value of each medical article according to the calculated matching degree;
sorting the medical articles in descending order of their updated recommendation weight values; and selecting the top-ranked medical article as the medical article to be recommended and outputting it.
In another embodiment, when the processing unit is configured to update the recommendation weight value of each medical article according to the calculated matching degree, the processing unit may be specifically configured to:
for any medical article, according to the matching degree between each topic vector of the medical article and the information vector, determining a topic vector with the maximum matching degree from at least one topic vector of the medical article;
if the topic vector with the maximum matching degree is the dominant topic vector of the medical article, increasing the recommendation weight value of the medical article;
and if the topic vector with the maximum matching degree is the subordinate topic vector of the medical article, reducing the recommendation weight value of the medical article.
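The weight update and ranking described above can be sketched as follows. Several details are assumptions not given in the source: the matching degree is taken to be cosine similarity, the weight is adjusted by a fixed step, and a tie between the two topic vectors is resolved in favour of the dominant one. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredArticle:
    title: str
    dominant: list        # dominant topic vector
    subordinate: list     # subordinate topic vector
    weight: float = 1.0   # recommendation weight value


def cosine(a, b):
    """Assumed matching degree: cosine similarity (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(articles, information_vector, step=0.1):
    """Update each article's weight, then return the top-ranked article."""
    for article in articles:
        dominant_match = cosine(article.dominant, information_vector)
        subordinate_match = cosine(article.subordinate, information_vector)
        if dominant_match >= subordinate_match:
            article.weight += step   # best match is the dominant topic
        else:
            article.weight -= step   # best match is the subordinate topic
    ranked = sorted(articles, key=lambda a: a.weight, reverse=True)
    return ranked[0] if ranked else None


# Hypothetical stored articles and search vector, for illustration only.
articles = [
    StoredArticle("Article A", dominant=[1.0, 0.0], subordinate=[0.0, 1.0]),
    StoredArticle("Article B", dominant=[0.0, 1.0], subordinate=[1.0, 0.0]),
]
best = recommend(articles, information_vector=[1.0, 0.0])
print(best.title)  # Article A
```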
In another embodiment, the construction unit, when configured to construct the medical knowledge graph of the target medical article using the plurality of medical words, may be specifically configured to:
constructing an initial knowledge graph of the target medical article by adopting the plurality of medical words, wherein the initial knowledge graph comprises a plurality of nodes, and each node records one medical word;
selecting at least one pair of co-occurring word pairs from the plurality of medical words, wherein the co-occurring word pairs are word pairs formed by two medical words having a co-occurring relation in the target medical article;
determining at least one node group from the initial knowledge graph according to the at least one pair of co-occurring word pairs, wherein any node group comprises: respectively recording two nodes of two medical words in a pair of co-occurrence word pairs;
and respectively connecting two nodes in each node group in the initial knowledge graph to obtain a medical knowledge graph of the target medical article.
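The graph construction steps above can be sketched as follows, assuming the graph is held as an adjacency dictionary that maps each medical word to the set of words it is connected to. The function name and the sample words are hypothetical.

```python
def build_medical_knowledge_graph(medical_words, cooccurring_pairs):
    """Build an undirected graph: one node per medical word, one edge per co-occurring pair."""
    # Initial knowledge graph: every medical word is an isolated node.
    graph = {word: set() for word in medical_words}
    # Connect the two nodes of each node group.
    for first, second in cooccurring_pairs:
        if first in graph and second in graph and first != second:
            graph[first].add(second)
            graph[second].add(first)
    return graph


# Hypothetical medical words and co-occurring word pairs, for illustration only.
words = ["gastric ulcer", "stomach", "gastroscope", "headache"]
pairs = [("gastric ulcer", "stomach"), ("stomach", "gastroscope")]
graph = build_medical_knowledge_graph(words, pairs)
print(sorted(graph["stomach"]))  # ['gastric ulcer', 'gastroscope']
print(graph["headache"])         # set()
```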
In another embodiment, the construction unit, when being configured to select at least one pair of co-occurrence word pairs from the plurality of medical words, may be specifically configured to:
determining a first distribution position of the first medical word in the target medical article, wherein the first medical word is any one of the plurality of medical words;
acquiring a second medical word from the plurality of medical words according to a first distribution position of the first medical word, wherein the distance between a second distribution position of the second medical word in the target medical article and the first distribution position is smaller than a position distance threshold value;
calculating a semantic distance value between the first medical word and the second medical word, the semantic distance value being indicative of a semantic similarity between the first medical word and the second medical word;
if the semantic distance value between the first medical word and the second medical word is larger than a semantic threshold value, determining that the first medical word and the second medical word have the co-occurrence relationship in the target medical article, and constructing a pair of co-occurrence word pairs by using the first medical word and the second medical word.
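The selection of co-occurring word pairs can be sketched as follows. The sketch assumes that each occurrence of a medical word is represented by a token index in the article and that the semantic distance value is supplied by a caller-provided function; the names and the toy similarity table are hypothetical.

```python
from itertools import combinations


def select_cooccurring_pairs(occurrences, semantic_value, position_threshold, semantic_threshold):
    """Return word pairs that are close in position and semantically related.

    `occurrences` is a list of (position, word) tuples; `semantic_value`
    is any function returning the semantic distance value of two words.
    """
    pairs = set()
    for (pos_a, word_a), (pos_b, word_b) in combinations(occurrences, 2):
        if word_a == word_b:
            continue
        close = abs(pos_a - pos_b) < position_threshold
        if close and semantic_value(word_a, word_b) > semantic_threshold:
            pairs.add(tuple(sorted((word_a, word_b))))
    return pairs


# Hypothetical semantic values, for illustration only.
toy_values = {
    frozenset(("gastric ulcer", "stomach")): 0.8,
    frozenset(("stomach", "gastroscope")): 0.7,
    frozenset(("gastric ulcer", "gastroscope")): 0.3,
    frozenset(("gastric ulcer", "headache")): 0.9,
}


def toy_semantic_value(a, b):
    return toy_values.get(frozenset((a, b)), 0.0)


occurrences = [(0, "gastric ulcer"), (2, "stomach"), (3, "gastroscope"), (40, "headache")]
pairs = select_cooccurring_pairs(occurrences, toy_semantic_value, 5, 0.5)
print(sorted(pairs))  # [('gastric ulcer', 'stomach'), ('gastroscope', 'stomach')]
```

In this example "headache" is semantically close to "gastric ulcer" under the toy table but lies too far away in the article, so no pair is formed for it.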
In another embodiment, the processing unit, when configured to calculate the importance of the medical word recorded by each node based on the connection relationship between the nodes in the medical knowledge graph, may be specifically configured to:
aiming at a medical word recorded by any node, determining at least one associated node connected with the any node based on the connection relation among all nodes in the medical knowledge graph;
calculating semantic distance values between the medical words recorded by any node and the medical words recorded by each associated node;
and calculating the importance of the medical word recorded by any node according to the calculated semantic distance value and the importance of the medical word recorded by each associated node.
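One way to realize the importance calculation above is an iterative, TextRank-style update in which the semantic distance values act as edge weights. The passage above does not give a formula, so the damping factor of 0.85, the weight normalization, and the fixed iteration count are all assumptions borrowed from TextRank; the function name and sample graph are hypothetical.

```python
def word_importance(graph, edge_weight, damping=0.85, iterations=50):
    """Iteratively score each node from its neighbours' scores and edge weights.

    `graph` maps each medical word to the set of connected words;
    `edge_weight(a, b)` returns the semantic distance value of an edge.
    """
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            total = 0.0
            for neighbour in neighbours:
                # Share of the neighbour's score that flows to this node.
                outgoing = sum(edge_weight(neighbour, k) for k in graph[neighbour])
                if outgoing > 0:
                    total += edge_weight(neighbour, node) / outgoing * scores[neighbour]
            updated[node] = (1 - damping) + damping * total
        scores = updated
    return scores


# Hypothetical star-shaped graph with uniform edge weights, for illustration only.
graph = {
    "stomach": {"gastric ulcer", "gastroscope", "chronic gastritis"},
    "gastric ulcer": {"stomach"},
    "gastroscope": {"stomach"},
    "chronic gastritis": {"stomach"},
}
scores = word_importance(graph, edge_weight=lambda a, b: 1.0)
print(max(scores, key=scores.get))  # stomach
```

The word with the most co-occurrence relations receives the highest score, which matches the intuition that a medical word with more co-occurrence relations is generally more important.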
In another embodiment, the extracting unit, when configured to extract the medical word set from the target medical article to be identified, may be specifically configured to:
performing word segmentation processing on a target medical article to be recognized to obtain an initial word set, wherein the initial word set comprises a plurality of initial words;
screening a plurality of intermediate words from the initial word set according to at least one medical dictionary, wherein the intermediate words refer to the initial words existing in the at least one medical dictionary;
and constructing a medical word set of the target medical article by adopting the plurality of intermediate words.
In yet another embodiment, each initial word in the set of initial words has a part of speech; correspondingly, the extracting unit, when being configured to filter out a plurality of intermediate words from the initial word set according to at least one medical dictionary, may be specifically configured to:
screening initial words of a target part of speech from the initial word set, wherein the target part of speech comprises at least one of the following items: nouns, verbs, and adjectives;
and screening a plurality of intermediate words from the initial words of the target part of speech according to at least one medical dictionary.
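The screening steps above can be sketched as follows. Word segmentation and part-of-speech tagging are assumed to have been performed by an external tool, so the input is a list of already tagged words; the tag names, function name, and dictionary contents are hypothetical.

```python
TARGET_PARTS_OF_SPEECH = {"noun", "verb", "adjective"}


def screen_intermediate_words(tagged_words, medical_dictionaries,
                              target_pos=TARGET_PARTS_OF_SPEECH):
    """Keep words that have a target part of speech and appear in any medical dictionary.

    `tagged_words` is a list of (word, part_of_speech) tuples produced by
    a segmenter; `medical_dictionaries` is a list of sets of medical words.
    """
    vocabulary = set().union(*medical_dictionaries)
    return [
        word
        for word, pos in tagged_words
        if pos in target_pos and word in vocabulary
    ]


# Hypothetical segmentation output and dictionaries, for illustration only.
tagged = [
    ("patients", "noun"),
    ("with", "preposition"),
    ("gastric ulcer", "noun"),
    ("often", "adverb"),
    ("need", "verb"),
    ("gastroscope", "noun"),
]
dictionaries = [{"gastric ulcer", "chronic gastritis"}, {"gastroscope", "stomach"}]
print(screen_intermediate_words(tagged, dictionaries))  # ['gastric ulcer', 'gastroscope']
```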
In another embodiment, the extracting unit, when configured to construct the medical word set of the target medical article by using the plurality of intermediate words, may be specifically configured to:
if the intermediate words meeting the bonding conditions exist in the plurality of intermediate words, performing bonding treatment on the intermediate words meeting the bonding conditions; adding the words after bonding treatment and the intermediate words without bonding treatment into a medical word set of the target medical article as medical words;
if the intermediate words meeting the bonding conditions do not exist in the plurality of intermediate words, all the intermediate words are used as medical words and added to a medical word set of the target medical article;
wherein the bonding conditions include: the distribution locations in the target medical article are adjacent and exist in the same medical dictionary.
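The bonding treatment can be sketched as follows. The sketch assumes that "adjacent" means consecutive token indices and that bonding means joining the words into one string (with a space here; Chinese text would typically use an empty joiner). The function name and sample data are hypothetical.

```python
def bond_intermediate_words(intermediate, medical_dictionaries, joiner=" "):
    """Merge runs of adjacent intermediate words that share a medical dictionary.

    `intermediate` is a list of (position, word) tuples sorted by position.
    Words that do not meet the bonding conditions are kept unchanged.
    """
    def share_dictionary(a, b):
        return any(a in d and b in d for d in medical_dictionaries)

    medical_words = []
    i = 0
    while i < len(intermediate):
        parts = [intermediate[i][1]]
        # Extend the run while the next word is adjacent and shares a dictionary.
        while (i + 1 < len(intermediate)
               and intermediate[i + 1][0] == intermediate[i][0] + 1
               and share_dictionary(intermediate[i][1], intermediate[i + 1][1])):
            parts.append(intermediate[i + 1][1])
            i += 1
        medical_words.append(joiner.join(parts))
        i += 1
    return medical_words


# Hypothetical intermediate words and dictionaries, for illustration only.
intermediate = [(0, "chronic"), (1, "gastritis"), (5, "stomach"), (6, "gastroscope")]
dictionaries = [{"chronic", "gastritis"}, {"stomach"}, {"gastroscope"}]
print(bond_intermediate_words(intermediate, dictionaries))
# ['chronic gastritis', 'stomach', 'gastroscope']
```

Here "stomach" and "gastroscope" are adjacent but come from different dictionaries, so they are not bonded.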
In another aspect, an embodiment of the present invention provides a computer device, where the computer device includes an input interface and an output interface, and the computer device further includes:
a processor adapted to implement one or more instructions; and,
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the steps of:
extracting a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words at least comprise disease words;
constructing a medical knowledge graph of the target medical article by adopting the plurality of medical words, wherein the medical knowledge graph comprises a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph;
and selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing a key topic vector of the target medical article by using word vectors of the key disease words, wherein the key topic vector is used for indicating key disease topics of the target medical article.
In yet another aspect, an embodiment of the present invention provides a computer storage medium, where one or more instructions are stored, and the one or more instructions are adapted to be loaded by a processor and execute the following steps:
extracting a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words at least comprise disease words;
constructing a medical knowledge graph of the target medical article by adopting the plurality of medical words, wherein the medical knowledge graph comprises a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph;
and selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing a key topic vector of the target medical article by using word vectors of the key disease words, wherein the key topic vector is used for indicating key disease topics of the target medical article.
According to the embodiment of the invention, for a target medical article to be identified, a plurality of medical words can be extracted from the target medical article, and a medical knowledge graph of the target medical article can be constructed from the plurality of medical words. The medical words recorded by any two connected nodes in the medical knowledge graph have a co-occurrence relation in the target medical article, and a medical word that has more co-occurrence relations is generally more important; therefore, the importance of the medical word recorded by each node can be accurately calculated based on the connection relation among the nodes in the medical knowledge graph. Then, the key disease words of the target medical article can be selected from the medical word set according to the importance of each medical word, and a key topic vector indicating the key disease topic of the target medical article can be constructed from the word vectors of the key disease words. Improving the accuracy of the importance of each medical word thus improves the accuracy of the key disease words, and in turn the accuracy of the key disease topic. In addition, the key disease topic of the target medical article can be identified automatically, without manual participation by a user in the topic identification process, which effectively saves labor cost.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1a is a system architecture diagram of an identification system according to an embodiment of the present invention;
FIG. 1b is a system architecture diagram of another recognition system provided by an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an article identification method according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of a sliding window for sliding on a target medical article according to an embodiment of the present invention;
FIG. 3b is a diagram illustrating a medical knowledge graph of a target medical article constructed using a plurality of medical words according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an article recognition method according to another embodiment of the present invention;
FIG. 5 is a block diagram of a word vector generator model according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an article recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
With the continuous development of internet technology, AI (Artificial Intelligence) technology has also advanced. AI refers to theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science; it seeks to understand the essence of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence, so that such machines have functions such as perception, reasoning, and decision making. Accordingly, AI technology is a comprehensive discipline that mainly includes Natural Language Processing (NLP), Computer Vision (CV), speech technology, and Machine Learning (ML)/deep learning. Natural language processing is an important direction in the fields of computer science and artificial intelligence, and integrates linguistics, computer science, and mathematics; natural language processing techniques generally include knowledge graphs, text processing, semantic understanding, machine translation, and question-answering robots.
Based on the natural language processing technology in the artificial intelligence technology, the embodiment of the invention provides an article identification scheme for a medical article (i.e., a medical information article), so that a key disease topic of the medical article can be identified more accurately. Reference herein to a key disease topic refers to the disease topic that is most predominant or most important among one or more disease topics possessed by a medical article; the term "disease topic" refers to a word or sentence that can summarize the content related to a disease in a medical article. In a specific implementation, the article identification scheme may be executed by a computer device, where the computer device may be a terminal device (hereinafter referred to as a terminal) or a server; specifically, the general principle of the article recognition scheme is as follows: the computer device may obtain at least one medical dictionary for the target medical article to be identified, the medical dictionary being a dictionary composed of a large number of medical words in a medical field issued by an authoritative medical institution. The term "medical word" as referred to herein means a word for describing disease information, such as a disease word constituted by a disease name, a non-disease word constituted by a non-disease name such as a disease symptom or a drug name, and the like. After the target medical article to be recognized is acquired, the computer device may extract a plurality of medical words included in the target medical article according to the at least one medical dictionary. 
Then, according to the co-occurrence relation of the plurality of medical words in the target medical article, a medical knowledge graph is constructed by adopting the plurality of medical words; and selecting key disease words of the target medical article from the plurality of medical words according to the medical knowledge graph, so that the key disease topics of the target medical article can be determined according to the selected key disease words. Optionally, after determining the key disease topic of the target medical article, the computer device may also identify article keywords of the target medical article based on the key disease topic.
In order to better implement the article identification scheme, the embodiment of the invention also provides a related topic identification system. The topic identification system may include at least a computer device 11 and a medical dictionary provider 12, where the medical dictionary provider 12 refers to a terminal, client (i.e., APP), or server that can provide at least one medical dictionary for the computer device 11. As can be seen from the foregoing, the computer device 11 may be a terminal or a server; when the computer device 11 is a terminal, the system architecture of the topic identification system can be seen in FIG. 1a, and when the computer device 11 is a server, the system architecture can be seen in FIG. 1b. As shown in FIG. 1b, the topic identification system in this case may further include at least one terminal 13, the terminal 13 being configured to send the target medical article to be identified to the computer device 11 (i.e., the server); the terminal 13 and the computer device 11 (i.e., the server) may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present invention is not limited herein. It should be noted that the above-mentioned terminal may be a smart phone, a tablet computer, a notebook computer, a smart watch, a desktop computer, or the like; the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
As can be seen from the above description, the topic identification system and the article identification scheme provided in the embodiments of the present invention can automatically identify the key disease topic of the target medical article; the whole theme identification process does not need manual participation of a user, and the labor cost can be effectively saved. Moreover, the subjectivity of human identification can be reduced through an automatic identification mode, the fact that the subjectivity of human identification influences the accuracy of an identification result is avoided, and the accuracy of the identification result can be effectively improved.
Based on the above description, the embodiment of the present invention proposes an article identification method, which can be executed by the above-mentioned computer device. Referring to fig. 2, the article identification method may include the following steps S201 to S204:
S201, extracting a medical word set from a target medical article to be identified.
Research has shown that any medical article usually includes a large number of medical words, and these medical words may include one or more disease words; a disease word is a word composed of the name of a disease, such as "gastric ulcer", "acute gastritis", "chronic gastritis", and the like. Optionally, the medical words may also include one or more non-disease words; a non-disease word is a word composed of disease-related information other than the name of the disease, where such information may include, but is not limited to: disease symptoms, disease action sites, disease treatment means (disease treatment methods, disease treatment apparatus), drug names of drugs for treating diseases, and the like. Accordingly, a non-disease word may be a word composed of a disease symptom (e.g., "lower abdominal distention pain"), a word composed of a disease action site (e.g., "stomach"), a word composed of a disease treatment means (e.g., "gastroscope" or "color Doppler ultrasound"), a word composed of the drug name of a treatment drug (e.g., "molt-dine"), and the like.
These medical words describe the disease information (such as disease name, disease symptom, disease action site, etc.) involved in the medical article, while the key disease topic of the medical article summarizes the content related to the main disease (i.e., the key disease) involved in the medical article; the medical words in a medical article are therefore related, to some extent, to the key disease topic of that article. Accordingly, embodiments of the present invention propose an identification concept in which the key disease topic of a medical article is determined by means of the medical words included in the medical article. Based on this concept, when there is a need to identify the disease topic of a target medical article, the computer device can acquire the target medical article to be identified and then extract a medical word set from it through step S201; the medical word set may include a plurality of medical words, and the plurality of medical words include at least one disease word. In a specific implementation, the computer device may perform word segmentation on the target medical article and match the initial word set obtained by the segmentation against at least one medical dictionary, so as to extract the initial words that are included in the initial word set and also present in the at least one medical dictionary. The medical words in the target medical article are then determined from the initial words extracted by matching, so as to construct the medical word set of the target medical article.
S202, constructing a medical knowledge graph of the target medical article by adopting a plurality of medical words.
In an embodiment of the present invention, the medical knowledge graph of the target medical article may include a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relationship in the target medical article. That is, the nodes corresponding to two medical words having a co-occurrence relationship in the target medical article are connected to each other in the medical knowledge graph. Wherein the co-occurrence referred to herein may include any of the following meanings:
In one embodiment, the above-mentioned co-occurrence relationship may mean that, while a sliding window slides over the target medical article, two medical words appear in the sliding window at the same time. The window length of the sliding window can be set according to an empirical value or according to the maximum sentence length in the target medical article (i.e., the number of characters in the longest sentence); for example, if the maximum sentence length in the target medical article is 20 characters, the window length may be set to be less than or equal to 20 characters. As an example, suppose a sliding window with a window length of 5 characters slides over the target medical article, and the plurality of medical words include the medical word A, the medical word B, the medical word C, the medical word D, and so on, with the distribution position of each medical word in the target medical article as shown in the first diagram in fig. 3a. Since the medical word A and the medical word B may appear in the sliding window at the same time during the sliding process (as shown in the second diagram in fig. 3a), the medical word A and the medical word B may be considered to have a co-occurrence relationship in the target medical article. Since the medical word B and the medical word C may appear in the sliding window at the same time during the sliding process (as shown in the third diagram in fig. 3a), the medical word B and the medical word C may be considered to have a co-occurrence relationship in the target medical article. Since the medical word C and the medical word D cannot appear in the sliding window at the same time (as shown in the fourth diagram in fig. 3a), the medical word C and the medical word D may be considered to have no co-occurrence relationship in the target medical article, and so on.
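The sliding-window test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `find_occurrences` and `cooccurring_pairs` are hypothetical, the window length is measured in characters, and two words are assumed to co-occur when some occurrence of each fits entirely inside one window position.

```python
from itertools import combinations


def find_occurrences(text, words):
    """Return {word: [start offsets]} for every occurrence of each word in text."""
    occurrences = {}
    for word in words:
        starts = []
        i = text.find(word)
        while i != -1:
            starts.append(i)
            i = text.find(word, i + 1)
        occurrences[word] = starts
    return occurrences


def cooccurring_pairs(text, words, window_len):
    """Return the set of word pairs that fit together inside one window.

    Assumption: a pair co-occurs if the span covering one occurrence of each
    word is at most window_len characters long.
    """
    occurrences = find_occurrences(text, words)
    pairs = set()
    for a, b in combinations(sorted(set(words)), 2):
        for start_a in occurrences[a]:
            for start_b in occurrences[b]:
                span_end = max(start_a + len(a), start_b + len(b))
                span_start = min(start_a, start_b)
                if span_end - span_start <= window_len:
                    pairs.add((a, b))
    return pairs


# Toy article: single letters stand in for the medical words A, B, C, D.
toy_text = "AxxBxxCxxxxxxD"
toy_pairs = cooccurring_pairs(toy_text, ["A", "B", "C", "D"], window_len=5)
```

With the toy text, A/B and B/C fall within a 5-character window while C/D do not, which mirrors the situation described for fig. 3a.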
In another embodiment, the above-mentioned co-occurrence relationship may mean that, while a sliding window slides over the target medical article, two medical words appear in the sliding window at the same time and the semantic distance value between the two medical words is greater than a semantic threshold. The semantic distance value between the two medical words can be calculated from the word vectors of the two medical words; it reflects the semantic similarity between the two medical words and is in direct proportion to that similarity, that is, the greater the semantic distance value between two medical words, the greater their semantic similarity. The semantic threshold may be set based on empirical values or business requirements. For example, let $K$ be the semantic threshold, let $k_{AB}$ be the semantic distance value between the medical word A and the medical word B, let $k_{BC}$ be the semantic distance value between the medical word B and the medical word C, and let $k_{CD}$ be the semantic distance value between the medical word C and the medical word D, with $k_{AB} < K$, $k_{BC} > K$, and $k_{CD} < K$. Continuing the example shown in fig. 3a above: since the semantic distance value between the medical word A and the medical word B (i.e., $k_{AB}$) is less than the semantic threshold ($K$), the computer device may consider that the medical word A and the medical word B have no co-occurrence relationship in the target medical article, even though they may appear in the sliding window at the same time.
Since the semantic distance value between the medical word B and the medical word C (i.e., $k_{BC}$) is greater than the semantic threshold ($K$), and the medical word B and the medical word C may appear in the sliding window at the same time, the computer device may consider the medical word B and the medical word C to have a co-occurrence relationship in the target medical article, and so on. Therefore, when judging whether two medical words have a co-occurrence relationship in the target medical article, the method considers not only how close the distribution positions of the two medical words are in the target medical article (through the sliding window) but also the semantic similarity between the two medical words, so that the judgment accuracy of the co-occurrence relationship can be effectively improved, and the accuracy of the medical knowledge graph is improved.
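The additional semantic check can be sketched as a filter applied to the word pairs that already passed the sliding-window test. The text only says that the semantic distance value is computed from word vectors and grows with semantic similarity; using cosine similarity here is an assumption, and the names `cosine_similarity` and `filter_by_semantics` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def filter_by_semantics(window_pairs, vectors, threshold):
    """Keep only window pairs whose semantic value exceeds the threshold K.

    Assumption: cosine similarity plays the role of the semantic distance value.
    """
    return {
        (a, b)
        for a, b in window_pairs
        if cosine_similarity(vectors[a], vectors[b]) > threshold
    }


# Toy word vectors chosen so that k_AB < K and k_BC > K for K = 0.5.
toy_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.1, 1.0]}
kept = filter_by_semantics({("A", "B"), ("B", "C")}, toy_vectors, threshold=0.5)
```

In this toy setting the pair (A, B) is dropped even though both words share a window, while (B, C) is kept, matching the example in the text.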
Based on the above description, in the process of implementing step S202, the computer device may first construct an initial knowledge graph of the target medical article by using a plurality of medical words; the initial knowledge graph includes a plurality of nodes, each node recording a medical term. Second, the computer device may select at least one pair of co-occurring word pairs from the plurality of medical words, the co-occurring word pair being a pair of words consisting of two medical words having a co-occurring relationship in the target medical article. Then, the computer device may traverse each pair of co-occurring word pairs; aiming at the current co-occurrence word pair traversed currently, two nodes for recording two medical treatment words in the current co-occurrence word pair can be respectively connected in the initial knowledge graph; when all the co-occurrence word pairs are traversed, a medical knowledge graph of the target medical article can be obtained. See, for example, FIG. 3 b: setting a plurality of medical terms includes: medical word a (recorded with node a), medical word B (recorded with node B), medical word C (recorded with node C), medical word D (recorded with node D), medical word E (recorded with node E) … …; and there are 5 pairs of co-occurring words in total among the plurality of medical words, which are: (medical word a, medical word B), (medical word a, medical word D), (medical word B, medical word C), (medical word B, medical word E), and (medical word D, medical word E). Then, the computer device may connect node a and node B, connect node a and node D, connect node B and node C, connect node B and node E, and connect node D and node E, respectively, in the initial knowledge graph to obtain a medical knowledge graph of the target medical article.
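The graph construction in this step can be sketched with a plain adjacency-set dictionary. The representation and the function name `build_knowledge_graph` are illustrative assumptions; the word pairs are the five co-occurring pairs from the fig. 3b example.

```python
def build_knowledge_graph(medical_words, cooccurring_pairs):
    """Build an undirected graph: one node per medical word, one edge per pair."""
    # Initial knowledge graph: every medical word is a node with no edges yet.
    graph = {word: set() for word in medical_words}
    # Traverse each co-occurring word pair and connect its two nodes.
    for a, b in cooccurring_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


words = ["A", "B", "C", "D", "E"]
pairs = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "E"), ("D", "E")]
graph = build_knowledge_graph(words, pairs)
```

An adjacency-set layout makes the later step of counting a node's co-occurrence relationships a simple `len(graph[word])`.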
S203, calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph.
In a specific implementation, the co-occurrence relationship refers either to two medical words appearing in the sliding window at the same time, or to two medical words appearing in the sliding window at the same time with a semantic distance value between them greater than the semantic threshold. It follows that a medical word with more co-occurrence relationships occurs more frequently in the target medical article and can therefore be considered more important. Based on this, when executing step S203, the computer device may count the number of co-occurrence relationships of each medical word according to the connection relationship between the nodes in the medical knowledge graph, and then determine the importance of each medical word from its relationship count, following the principle that the relationship count and the importance are positively correlated.
Specifically, the corresponding relationship number of each medical term can be directly used as the importance of each medical term. Or, carrying out normalization processing on the corresponding relation quantity of each medical word to obtain the importance of each medical word; wherein, the normalization processing refers to the processing of mapping the relation quantity to the range of 0-1. Or, weighting calculation can be carried out on the corresponding relation quantity of each medical word by adopting the importance parameter to obtain the importance of each medical word; wherein, the importance parameter can be set according to an empirical value or a service requirement. For example, referring to the medical knowledge graph shown in fig. 3B, if the node a is connected to the node B, and the node a is connected to the node D, it may be statistically determined that the number of the relationships of the co-occurrence relationship of the medical word a recorded by the node a is 2; the number of relationships may be directly used as the importance of the medical word a (i.e., the importance is 2), or the number of relationships may be weighted by using an importance parameter (e.g., 1.5) to obtain the importance of the medical word a (i.e., the importance is 3), and so on.
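The three options above (raw count, normalized count, weighted count) can be sketched as follows. The function name `degree_importance`, the `mode` argument, and the choice of dividing by the maximum count as the normalization are assumptions; the text only requires that normalization map the counts into the range 0 to 1.

```python
def degree_importance(graph, mode="count", parameter=1.0):
    """Importance of each word from its number of co-occurrence relationships.

    graph: {word: set of connected words}
    mode:  "count" (raw count), "normalized" (count / max count, an assumed
           normalization), or "weighted" (importance parameter * count).
    """
    counts = {word: len(neighbors) for word, neighbors in graph.items()}
    if mode == "count":
        return {word: float(c) for word, c in counts.items()}
    if mode == "normalized":
        top = max(counts.values(), default=0)
        return {word: (c / top if top else 0.0) for word, c in counts.items()}
    if mode == "weighted":
        return {word: parameter * c for word, c in counts.items()}
    raise ValueError(f"unknown mode: {mode}")


# Graph from the fig. 3b example.
example_graph = {
    "A": {"B", "D"},
    "B": {"A", "C", "E"},
    "C": {"B"},
    "D": {"A", "E"},
    "E": {"B", "D"},
}
raw = degree_importance(example_graph)
weighted = degree_importance(example_graph, mode="weighted", parameter=1.5)
normalized = degree_importance(example_graph, mode="normalized")
```

For the medical word A this yields an importance of 2 with the raw count and 3 with an importance parameter of 1.5, as in the worked example.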
In yet another implementation, studies have shown that if two medical words have a co-occurrence relationship in the target medical article, the importance of the two medical words will generally affect each other because the two medical words occur simultaneously. Based on this, when the computer device executes step S203, for the medical word recorded by any node, the importance of the medical word of any node can be calculated by combining the importance of the medical word recorded by the associated node connected to the any node, so as to improve the accuracy of the importance. Specifically, for a medical word recorded by any node, at least one associated node connected with the node can be determined based on the connection relationship between the nodes in the medical knowledge graph; then, the importance of the medical word recorded by any one node is calculated according to the importance of the medical word recorded by each associated node.
The specific implementation of calculating the importance of the medical word recorded by any node according to the importance of the medical word recorded by each associated node may include any one of the following:
The first implementation: the computer device may first calculate an initial value for the medical word recorded by any node according to the number of co-occurrence relationships of that medical word. Secondly, the number of times the medical word recorded by the node and the medical word recorded by each associated node appear in the sliding window at the same time may be counted respectively, and the counts may be normalized to obtain the weight value of each associated node. For example, for the medical word A recorded by node A, node A has two associated nodes, node B and node D; if the medical word A and the medical word B recorded by node B appear in the sliding window at the same time 15 times, and the medical word A and the medical word D recorded by node D appear in the sliding window at the same time 5 times, then node B has a weight of 15/(15+5) = 0.75 and node D has a weight of 5/(15+5) = 0.25. After the weight value of each associated node is obtained, the importance of each associated node can be weighted and summed using these weight values; for example, if the importance of node B is 0.4 and the importance of node D is 0.2, the weighted sum is 0.4 × 0.75 + 0.2 × 0.25 = 0.35. Then, the value obtained by the weighted summation and the initial value of the medical word recorded by the node can be added to obtain the importance of the medical word recorded by the node.
The second implementation: the computer device may instead use the calculation formula shown in formula 1.1 to calculate the importance of the medical word recorded by any node from the importance of the medical word recorded by each associated node. In formula 1.1, $d$ is a damping coefficient whose value usually lies in the range (0, 1); for example, $d$ may be set to 0.85. $V_i$ represents the i-th node in the medical knowledge graph, and $WS(V_i)$ represents the importance of the medical word recorded by the i-th node. $In(V_i)$ represents the associated node set corresponding to the i-th node, and $\in$ denotes "belongs to". $V_j$ represents the j-th associated node in the associated node set corresponding to the i-th node, that is, the j-th associated node connected to the i-th node. $Out(V_j)$ represents the associated node set corresponding to the j-th associated node connected to the i-th node, and $|Out(V_j)|$ represents the number of associated nodes contained in that set. $WS(V_j)$ represents the importance of the medical word recorded by the j-th associated node connected to the i-th node.
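The arithmetic of this first implementation can be sketched as below. The function names `neighbor_weights` and `importance_from_neighbors` are hypothetical, and using the relationship count 2 as the initial value of node A is an assumption made only for the demonstration.

```python
def neighbor_weights(cooccurrence_counts):
    """Normalize window co-occurrence counts so that the weights sum to 1."""
    total = sum(cooccurrence_counts.values())
    return {node: count / total for node, count in cooccurrence_counts.items()}


def importance_from_neighbors(initial_value, cooccurrence_counts, neighbor_importance):
    """Initial value plus the weighted sum of the associated nodes' importance."""
    weights = neighbor_weights(cooccurrence_counts)
    weighted_sum = sum(weights[n] * neighbor_importance[n] for n in weights)
    return initial_value + weighted_sum


# Worked example from the text: A co-occurs with B 15 times and with D 5 times.
counts_for_a = {"B": 15, "D": 5}
importance_of_neighbors = {"B": 0.4, "D": 0.2}
weights_for_a = neighbor_weights(counts_for_a)  # B: 0.75, D: 0.25
# Assumed initial value for A: its relationship count (2) in the fig. 3b graph.
importance_a = importance_from_neighbors(2.0, counts_for_a, importance_of_neighbors)
```

The weighted sum in this example is 0.35, so with the assumed initial value of 2 the importance of the medical word A comes out at approximately 2.35.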
Figure BDA0002758716640000141
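Formula 1.1 is given only as a figure reference in this text. Judging from the symbol definitions above, it appears to take the form of the standard unweighted graph-ranking recurrence (as used in PageRank and TextRank); the following is a reconstruction under that assumption, not a transcription of the figure:

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{\lvert Out(V_j) \rvert} \, WS(V_j)
```

Under this reading, each associated node $V_j$ distributes its importance equally among the nodes it is connected to, and the damping coefficient $d$ controls how much of a node's importance comes from its associated nodes.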
In another specific implementation, since the semantic distance value can be used to indicate semantic similarity between two medical words, research has shown that for any medical word, if the semantic similarity between another medical word and the medical word is greater, the influence of the importance of the other medical word on the importance of the medical word is generally greater. Based on this, when the computer device executes step S203, for the medical word recorded by any node, the importance of the medical word of any node can be calculated by combining the importance of the medical word recorded by the associated node connected to the any node and the semantic distance value between the medical word recorded by any node and the medical word recorded by each associated node, so as to further improve the accuracy of the importance. Specifically, for a medical word recorded by any node, at least one associated node connected to any node may be determined based on a connection relationship between nodes in the medical knowledge graph. Then, calculating semantic distance values between the medical words recorded by any node and the medical words recorded by each associated node; and calculating the importance of the medical word recorded by any node according to the calculated semantic distance value and the importance of the medical word recorded by each associated node.
The specific implementation of calculating the importance of the medical word recorded by any node according to the calculated semantic distance value and the importance of the medical word recorded by each associated node may include any of the following:
The first implementation: the computer device may calculate an initial value for the medical word recorded by any node according to the number of co-occurrence relationships of that medical word. Secondly, the importance of each associated node can be weighted and summed using the semantic distance values. For example, for the medical word A recorded by node A, node A has two associated nodes, node B and node D, and the importance of node B is 0.4 and the importance of node D is 0.2. If the semantic distance value between the medical word A and the medical word B recorded by node B is $k_{AB}$, and the semantic distance value between the medical word A and the medical word D recorded by node D is $k_{AD}$, then the weighted sum $0.4 \times k_{AB} + 0.2 \times k_{AD}$ can be computed. Then, the value obtained by the weighted summation and the initial value of the medical word recorded by the node can be added to obtain the importance of the medical word recorded by the node.
The second implementation: the computer device may instead use the calculation formula shown in formula 1.2 to calculate the importance of the medical word recorded by any node from the calculated semantic distance values and the importance of the medical word recorded by each associated node. In formula 1.2, $w_{ji}$ represents the semantic distance value between the medical word recorded by the i-th node and the medical word recorded by the j-th associated node connected to the i-th node; $V_k$ represents the k-th associated node in the associated node set corresponding to the j-th associated node connected to the i-th node, that is, the k-th associated node connected to that j-th associated node; $w_{jk}$ represents the semantic distance value between the medical word recorded by the j-th associated node and the medical word recorded by the k-th associated node. It should be noted that the definitions of the other parameters in formula 1.2 are the same as in the description of formula 1.1 and are not repeated here.
Figure BDA0002758716640000151
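Formula 1.2 is given only as a figure reference in this text. From the symbol definitions, it presumably has the weighted form $WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$; this is an assumption, not a transcription of the figure. The Python sketch below iterates that assumed recurrence to a fixed point. The function name `weighted_rank`, the starting score of 1.0, the iteration cap, and the tolerance are all illustrative choices; with all weights equal to 1 the recurrence reduces to the unweighted form described for formula 1.1.

```python
def weighted_rank(weights, d=0.85, max_iterations=100, tolerance=1e-6):
    """Iterate the assumed weighted ranking recurrence on an undirected graph.

    weights: {node: {associated node: semantic distance value}}, symmetric.
    Returns {node: importance}.
    """
    scores = {node: 1.0 for node in weights}  # assumed starting scores
    for _ in range(max_iterations):
        new_scores = {}
        for v_i in weights:
            total = 0.0
            for v_j, w_ji in weights[v_i].items():
                # Sum of the weights on all edges of the associated node V_j.
                out_sum = sum(weights[v_j].values())
                if out_sum > 0:
                    total += w_ji / out_sum * scores[v_j]
            new_scores[v_i] = (1 - d) + d * total
        delta = max((abs(new_scores[v] - scores[v]) for v in weights), default=0.0)
        scores = new_scores
        if delta < tolerance:
            break
    return scores


# Fig. 3b graph with unit weights (equivalent to the unweighted case).
edges = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "E"), ("D", "E")]
unit_weights = {n: {} for n in "ABCDE"}
for a, b in edges:
    unit_weights[a][b] = 1.0
    unit_weights[b][a] = 1.0
rank = weighted_rank(unit_weights)
```

On this toy graph node B, which has the most connections, receives the highest score and the leaf node C the lowest; replacing the unit weights with semantic distance values shifts importance toward semantically closer associated nodes.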
S204, selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing key topic vectors of the target medical article by using the word vectors of the key disease words, wherein the key topic vectors are used for indicating key disease topics of the target medical article.
Since the key disease topic of the target medical article is used for summarizing the content related to the main diseases (namely, key diseases) involved in the target medical article, after the importance of each medical word is obtained, the computer device can select the disease word with the highest importance from the medical word set according to the importance of each medical word as the key disease word of the target medical article. Then, a key topic vector of the target medical article can be constructed by using the word vector of the key disease word, and the key topic vector is used for indicating the key disease topic of the target medical article.
The specific implementation manner of selecting the key disease word from the medical word set by the computer device may be as follows: screening various disease words included in the medical word set, and then selecting the disease word with the maximum importance degree from the screened disease words as the key disease word of the target medical article. Or the computer device may select a plurality of candidate keywords of the target medical article from the medical word set according to the keyword selection policy and the importance of each medical word. The plurality of candidate keywords comprise at least one candidate disease word, and the keyword selection strategy can be set according to business requirements. For example, a keyword selection strategy may be set to indicate that a preset number of medical words are selected from the medical word set as candidate keywords in an order from a large importance level to a small importance level; or a keyword selection strategy can be set for indicating that medical words in the medical word set are selected, medical words with the importance degree larger than the importance degree threshold value are selected as candidate keywords, and the like.
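The selection strategies described above can be sketched as follows. The function names `select_key_disease_word` and `select_candidates`, the importance values, and the small word lists are hypothetical and serve only to illustrate the top-N and threshold strategies.

```python
def select_key_disease_word(importance, disease_words):
    """Return the disease word with the greatest importance, or None if none exists."""
    candidates = {w: s for w, s in importance.items() if w in disease_words}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def select_candidates(importance, top_n=None, threshold=None):
    """Select candidate keywords by rank (top_n) and/or by importance threshold."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if threshold is not None:
        ranked = [(word, score) for word, score in ranked if score > threshold]
    return [word for word, _ in ranked]


# Hypothetical importance values for illustration only.
scores = {
    "stomach": 1.2,
    "gastric ulcer": 0.9,
    "gastroscope": 0.5,
    "acute gastritis": 0.4,
}
diseases = {"gastric ulcer", "acute gastritis"}
key_word = select_key_disease_word(scores, diseases)
top_two = select_candidates(scores, top_n=2)
above = select_candidates(scores, threshold=0.45)
```

Note that the most important medical word overall ("stomach" here) is not necessarily a disease word, which is why the disease words are screened first when choosing the key disease word.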
According to the embodiment of the invention, aiming at the target medical article to be identified, a plurality of medical words can be extracted from the target medical article, and a medical knowledge graph of the target medical article is constructed by adopting the plurality of medical words. Since the medical words recorded by any two connected nodes in the medical knowledge graph have a co-occurrence relationship in the target medical article, the more co-occurrence relationship the medical words are generally more important; therefore, the importance of the medical words recorded by each node can be accurately calculated based on the connection relation among all nodes in the medical knowledge graph. Then, key disease words of the target medical article can be selected from the medical word set according to the importance of each medical word, and a key topic vector for indicating the key disease topic of the target medical article is constructed by adopting the word vectors of the key disease words. Therefore, the accuracy of the key disease words can be effectively improved by improving the accuracy of the importance of each medical word, so that the accuracy of the key disease topics is improved; in addition, the key disease topics of the medical articles can be automatically identified without the manual participation of the user in the whole topic identification process, and the labor cost is effectively saved.
Based on the above description of the embodiment of the article identification method shown in fig. 2, an embodiment of the present invention further provides a flowchart of another, more specific article identification method, and this article identification method may be executed by the above-mentioned computer device. Referring to fig. 4, the article identification method may include the following steps S401 to S409:
S401, extracting a medical word set from a target medical article to be identified.
In a specific implementation, the computer device can first perform word segmentation on the target medical article to be identified to obtain an initial word set; the initial word set includes a plurality of initial words, and each initial word in the initial word set has a part of speech. Specifically, the computer device may perform full-segmentation word segmentation on the target medical article using an open word segmenter to obtain the initial word set. Full-segmentation word segmentation means performing word segmentation on the target medical article using multiple segmentation forms; the segmentation forms mentioned here may include, but are not limited to: a segmentation form based on word frequency statistics, a segmentation form based on lexicon matching, a segmentation form based on knowledge understanding, and the like. The segmentation form based on word frequency statistics indicates: counting the frequency with which any two characters appear together in the article, and if the counted frequency is greater than a frequency threshold, segmenting those two characters as one word. The segmentation form based on lexicon matching indicates: matching the article against the entries in a lexicon, and if a character string in the article can be found in the lexicon, segmenting the characters of that character string together as one word. The segmentation form based on knowledge understanding indicates: performing semantic analysis on the context of the article in combination with syntactic and grammatical analysis, and segmenting the characters in the article according to the analysis result of the information provided by the context.
After obtaining the initial word set, the computer device may screen a plurality of intermediate words from the initial word set according to at least one medical dictionary; an intermediate word is an initial word that is present in the at least one medical dictionary, i.e., a word that is present in both the at least one medical dictionary and the target medical article. In one embodiment, the computer device may screen the plurality of intermediate words from the initial word set directly according to the at least one medical dictionary; specifically, the computer device may traverse each initial word in the initial word set, match the currently traversed initial word against the at least one medical dictionary to detect whether it exists in the at least one medical dictionary, and if so, take the currently traversed initial word as an intermediate word. In another embodiment, the initial word set may contain some stop words, i.e., function words that have no actual medical meaning and need to be filtered out; specifically, the stop words may include, but are not limited to: qualifiers (e.g., "a", "these", "there", etc.) for expressing concepts such as location and quantity, modal particles (e.g., "a", "good", etc.) for expressing mood, and so on. In this case, the computer device may first remove these stop words from the initial word set to update the initial word set, such that the updated initial word set (which may be denoted by M) retains only words of the specified parts of speech; the at least one medical dictionary is then matched against the updated initial word set, which effectively reduces the number of words to be matched, improves matching efficiency, and saves processing resources.
Based on the method, when the computer equipment screens out a plurality of intermediate words from the initial word set according to at least one medical dictionary, the computer equipment can screen out initial words of target parts of speech from the initial word set; the target part of speech mentioned herein may be specified according to business needs or empirical values, which may include at least one of: nouns, verbs, and adjectives. Where nouns are used to denote uniform names of people, things, or abstractions; verbs are used to denote actions or states, and adjectives are used to describe or modify nouns or pronouns, which are used primarily to denote the nature, state, character, or property of a person or thing. Then, the computer device may select a plurality of intermediate words from the initial words of the target part of speech according to the at least one medical dictionary, and the specific implementation thereof is similar to the aforementioned specific implementation of the step of "selecting a plurality of intermediate words from the initial word set directly according to the at least one medical dictionary", and will not be described herein again.
After the plurality of intermediate words are screened out, a medical word set (which may be denoted by M') of the target medical article may be constructed using the plurality of intermediate words. In one embodiment, the plurality of intermediate words may be used directly to construct the medical word set of the target medical article; in this embodiment, the medical words in the medical word set are the intermediate words, and the number of medical words is equal to the number of intermediate words. In another embodiment, the screened intermediate words may include some intermediate words that are adjacent to each other in the target medical article and have a special meaning together; such intermediate words usually appear in the same medical dictionary, and the word formed by bonding them together carries more medical meaning than any single one of them. For example, the intermediate words "lower abdomen" and "distending pain" would typically appear in the same medical dictionary, and "lower abdomen distending pain" carries more medical meaning than "lower abdomen" or "distending pain" alone. In this case, the computer device can bond these intermediate words together and use the bonded word as a medical word, so as to improve the accuracy of subsequent topic identification. Based on this, when the computer device uses the plurality of intermediate words to construct the medical word set of the target medical article, it can determine whether any of the intermediate words satisfy the bonding condition; the bonding condition here may include: the intermediate words are adjacent in their distribution positions in the target medical article and exist in the same medical dictionary.
If the intermediate words meeting the bonding condition exist in the plurality of intermediate words, performing bonding treatment on the intermediate words meeting the bonding condition; and adding the words after the bonding treatment and the intermediate words without the bonding treatment into the medical word set of the target medical article as medical words. If there is no intermediary word that satisfies the adhesion condition among the plurality of intermediary words, each intermediary word may be added as a medical word to the medical word set of the target medical article.
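As a non-limiting illustration, the bonding processing described above might be sketched as follows. The sketch assumes that each intermediate word carries an integer token position, that "adjacent" means consecutive positions, that "the same medical dictionary" means at least one dictionary contains both words, and that bonded words are joined with a space; all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def in_same_dictionary(a: str, b: str, dictionaries: Dict[str, Set[str]]) -> bool:
    """True if at least one dictionary contains both words (assumed reading)."""
    return any(a in entries and b in entries for entries in dictionaries.values())


def bond_intermediate_words(
    intermediate: List[Tuple[int, str]],
    dictionaries: Dict[str, Set[str]],
    joiner: str = " ",
) -> List[str]:
    """Merge runs of adjacent intermediate words that share a dictionary."""
    medical_words: List[str] = []
    prev_pos = None
    prev_word = None
    for pos, word in sorted(intermediate):
        bondable = (
            prev_pos is not None
            and pos == prev_pos + 1
            and in_same_dictionary(prev_word, word, dictionaries)
        )
        if bondable:
            # Extend the last medical word instead of starting a new one.
            medical_words[-1] = medical_words[-1] + joiner + word
        else:
            medical_words.append(word)
        prev_pos, prev_word = pos, word
    return medical_words


words = [(3, "lower abdomen"), (4, "distending pain"), (9, "gastroscope")]
dicts = {"symptoms": {"lower abdomen", "distending pain"},
         "examinations": {"gastroscope"}}
medical_word_set = bond_intermediate_words(words, dicts)
```

With this input, "lower abdomen" and "distending pain" are bonded into one medical word, while "gastroscope" is added unchanged.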
S402, constructing a medical knowledge graph of the target medical article by adopting a plurality of medical words.
In a specific implementation, an initial knowledge graph of a target medical article can be constructed by adopting a plurality of medical words; the initial knowledge graph includes a plurality of nodes, each node recording a medical term. Next, at least one pair of co-occurring word pairs may be selected from the plurality of medical words, the co-occurring word pair being a word pair consisting of two medical words having a co-occurring relationship in the target medical article. Then, at least one node group can be determined from the initial knowledge graph according to the at least one pair of co-occurrence word pairs; wherein, any node group can include: two nodes of two medical words in a pair of co-occurring word pairs are recorded, respectively. Then, two nodes in each node group can be connected in the initial knowledge graph respectively to obtain a medical knowledge graph of the target medical article.
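As a non-limiting illustration, the construction of the medical knowledge graph from an initial set of unconnected nodes and a list of co-occurring word pairs might be sketched as follows. Representing the graph as an adjacency dictionary is an assumption of this example, not a requirement of the embodiment.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_medical_graph(
    medical_words: List[str],
    cooccurring_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Create one node per medical word, then connect the two nodes
    of every co-occurring word pair with an undirected edge."""
    # Initial knowledge graph: nodes only, no edges.
    graph: Dict[str, Set[str]] = {word: set() for word in medical_words}
    for first, second in cooccurring_pairs:
        # Each pair identifies a node group; skip pairs with unknown words.
        if first in graph and second in graph and first != second:
            graph[first].add(second)
            graph[second].add(first)
    return graph


graph = build_medical_graph(
    ["gastric ulcer", "gastroscope", "lower abdominal pain", "fracture"],
    [("gastric ulcer", "gastroscope"), ("gastric ulcer", "lower abdominal pain")],
)
```

Here "fracture" remains an isolated node because it does not take part in any co-occurring word pair.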
And S403, calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph.
S404, selecting a plurality of candidate keywords of the target medical article from the medical word set according to the keyword selection strategy and the importance of each medical word.
In one embodiment, the computer device may select a preset number of medical words from the medical word set as a plurality of candidate keywords of the target medical article according to the order of the importance degree from large to small. In another embodiment, the computer device may select a medical word with an importance greater than an importance threshold from the set of medical words as a plurality of candidate keywords of the target medical article. Wherein the plurality of candidate keywords comprise at least one candidate disease word. For convenience of explanation, a preset number of medical terms selected from the medical term set according to the order of importance degrees from high to low are used as examples of the candidate keywords in the following description.
S405, selecting the candidate disease word with the maximum importance degree from the at least one candidate disease word as a key disease word of the target medical article.
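As a non-limiting illustration, steps S404 and S405 might be sketched as follows. The function names and the use of a plain set to mark which medical words are disease words are assumptions of this example.

```python
from typing import Dict, List, Optional, Set


def select_candidates(
    importance: Dict[str, float],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Keyword selection strategy: either the top_n most important
    medical words, or all words whose importance exceeds threshold."""
    if (top_n is None) == (threshold is None):
        raise ValueError("specify exactly one of top_n or threshold")
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    if top_n is not None:
        return [word for word, _ in ranked[:top_n]]
    return [word for word, score in ranked if score > threshold]


def pick_key_disease_word(
    candidates: List[str],
    importance: Dict[str, float],
    disease_words: Set[str],
) -> str:
    """Return the candidate disease word with the greatest importance."""
    candidate_diseases = [w for w in candidates if w in disease_words]
    if not candidate_diseases:
        raise ValueError("no candidate disease word among the candidates")
    return max(candidate_diseases, key=lambda w: importance[w])


importance = {"gastric ulcer": 0.9, "gastritis": 0.7,
              "gastroscope": 0.5, "rest": 0.1}
candidates = select_candidates(importance, top_n=3)
key_word = pick_key_disease_word(candidates, importance,
                                 {"gastric ulcer", "gastritis"})
```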
S406, constructing a key topic vector of the target medical article by using the word vector of the key disease word, wherein the key topic vector is used for indicating the key disease topic of the target medical article.
In a specific implementation, the computer device may call a word vector generation model to obtain the word vector of the key disease word. The significance of a word vector is that it converts natural language into a vector that a computer device can process while capturing the context and semantics of the word, so that word vectors can be used to measure the semantic similarity between words. The word vector generation model referred to herein may include, but is not limited to: a medical Word2vec model, a medical BERT model, a medical GloVe model, and the like. The medical Word2vec model is a word vector calculation model that determines the word vector of a medical word by combining the context of the medical word; the medical BERT model is a word vector calculation model that determines the word vector of a medical word by combining the meanings of the medical word in different contexts, so that the same medical word can have different word vectors in different contexts; the medical GloVe model is a word vector calculation model that determines a word vector by using a co-occurrence matrix while considering both local information and global information.
For convenience of explanation, the following description takes the medical Word2vec model as an example of the word vector generation model. Referring to fig. 5, the medical Word2vec model may be a three-layer neural network model, which may include an input layer, a hidden layer, and an output layer. The input layer is used for acquiring the plurality of candidate keywords of the target medical article, constructing a vocabulary from the candidate keywords, and constructing an initial sparse vector of the key disease word according to the position of the key disease word in the vocabulary. For example, if the target medical article has 5 candidate keywords, the vocabulary may include these 5 candidate keywords; assuming that the key disease word is at the 3rd position in the vocabulary, the initial sparse vector of the key disease word may be [0, 0, 1, 0, 0]. After obtaining the initial sparse vector, the input layer transmits it to the hidden layer. The hidden layer is used for converting the initial sparse vector transmitted by the input layer into a dense vector, so as to obtain the word vector of the key disease word; correspondingly, the output layer is used for outputting the word vector obtained by the hidden layer. It should be noted that, under the assumptions of the Word2vec model, the input order of the words is not important. Moreover, the medical Word2vec model is obtained in advance by the computer device by fine-tuning a reference Word2vec model on a large amount of medical information text; the reference Word2vec model refers to a Word2vec model obtained by other devices through model training on a large amount of other information text (such as news text).
Therefore, the medical Word2vec model is obtained in a model fine tuning mode, so that resources consumed by computer equipment due to model training can be effectively saved, and the training time can be shortened to improve the training efficiency.
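As a non-limiting illustration, the conversion of the initial sparse vector into a dense word vector by the hidden layer might be sketched as follows. The weight matrix values are arbitrary placeholders rather than trained parameters, and the function names are hypothetical; the point shown is only that multiplying a one-hot vector by the hidden-layer weight matrix selects one row of that matrix.

```python
from typing import List


def one_hot(index: int, vocabulary_size: int) -> List[float]:
    """Initial sparse vector: 1 at the word's vocabulary position, 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(vocabulary_size)]


def hidden_layer(sparse: List[float], weights: List[List[float]]) -> List[float]:
    """Dense vector = sparse vector multiplied by the weight matrix
    (one row per vocabulary entry, one column per embedding dimension)."""
    dimensions = len(weights[0])
    return [
        sum(sparse[row] * weights[row][dim] for row in range(len(sparse)))
        for dim in range(dimensions)
    ]


# Placeholder 5 x 3 weight matrix (5 candidate keywords, 3 dimensions).
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]
sparse_vector = one_hot(2, 5)          # key disease word at the 3rd position
word_vector = hidden_layer(sparse_vector, weights)
```

Because the sparse vector has a single non-zero entry, the resulting word vector equals the third row of the weight matrix, which is why practical implementations usually replace the multiplication with a row lookup.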
After the word vector of the key disease word is obtained, it can be used to construct the key topic vector of the target medical article. Specifically, related non-disease words corresponding to the key disease word can first be obtained from the medical word set; a related non-disease word satisfies the following condition: in the medical knowledge graph, the node recording the related non-disease word is connected with the node recording the key disease word. Secondly, the word vector of the key disease word and the word vectors of the related non-disease words can be obtained; it should be noted that any word vector mentioned herein may be generated by the computer device by invoking the word vector generation model, and any word vector may be a P-dimensional vector, where the value of P may be set according to empirical values, for example P = 200. Then, the word vector of the key disease word and the word vectors of the related non-disease words can be fused to obtain the key topic vector of the target medical article; specifically, the word vector of the key disease word and the word vectors of the related non-disease words can be accumulated dimension by dimension to obtain the key topic vector of the target medical article. For example, let the key disease word be "gastric ulcer", with corresponding word vector $V(s_0) = (s_0^1, s_0^2, \dots, s_0^P)$. The related non-disease words include "lower abdominal pain", "gastroscope", "color Doppler ultrasound" and "molsidine", whose word vectors are, in order, $V(s_1) = (s_1^1, s_1^2, \dots, s_1^P)$, $V(s_2) = (s_2^1, s_2^2, \dots, s_2^P)$, $V(s_3) = (s_3^1, s_3^2, \dots, s_3^P)$ and $V(s_4) = (s_4^1, s_4^2, \dots, s_4^P)$. Accumulating these word vectors dimension by dimension yields the key topic vector $V(s) = (s_0^1 + s_1^1 + s_2^1 + s_3^1 + s_4^1,\ s_0^2 + s_1^2 + s_2^2 + s_3^2 + s_4^2,\ \dots,\ s_0^P + s_1^P + s_2^P + s_3^P + s_4^P)$.
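As a non-limiting illustration, obtaining the related non-disease words from the medical knowledge graph and accumulating the word vectors dimension by dimension might be sketched as follows. The adjacency-dictionary graph, the toy 3-dimensional integer vectors, and the function names are assumptions of this example.

```python
from typing import Dict, List, Sequence, Set


def related_non_disease_words(
    graph: Dict[str, Set[str]], key_word: str, disease_words: Set[str]
) -> List[str]:
    """Neighbours of the key disease word's node that are not disease words."""
    return sorted(w for w in graph[key_word] if w not in disease_words)


def fuse_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Accumulate word vectors dimension by dimension."""
    dimensions = len(vectors[0])
    if any(len(v) != dimensions for v in vectors):
        raise ValueError("all word vectors must have the same dimension P")
    return [sum(v[d] for v in vectors) for d in range(dimensions)]


graph = {
    "gastric ulcer": {"gastroscope", "lower abdominal pain", "gastritis"},
    "gastroscope": {"gastric ulcer"},
    "lower abdominal pain": {"gastric ulcer"},
    "gastritis": {"gastric ulcer"},
}
word_vectors = {
    "gastric ulcer": [1, 2, 3],
    "gastroscope": [4, 0, 1],
    "lower abdominal pain": [0, 5, 2],
    "gastritis": [9, 9, 9],
}
related = related_non_disease_words(graph, "gastric ulcer",
                                    {"gastric ulcer", "gastritis"})
key_topic_vector = fuse_vectors(
    [word_vectors["gastric ulcer"]] + [word_vectors[w] for w in related]
)
```

The vector of "gastritis" is not accumulated because it is a disease word, even though its node is connected to the node of the key disease word.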
S407, obtaining word vectors of the candidate keywords, and calculating vector similarity between the word vectors of the candidate keywords and the key topic vectors.
S408, selecting the article keywords of the target medical article from the candidate keywords according to the vector similarity between the word vector and the key topic vector of each candidate keyword.
In the implementation of steps S407-S408, the computer device may first invoke the word vector generation model to represent each candidate keyword as a P-dimensional vector, so as to obtain the word vector (i.e., a P-dimensional vector) of each candidate keyword. Secondly, the vector similarity between the word vector of each candidate keyword and the key topic vector is calculated by using a vector similarity algorithm; the vector similarity algorithm herein may include, but is not limited to: a cosine similarity algorithm, a Hamming distance algorithm, a Euclidean distance algorithm, a Manhattan distance algorithm, and the like. Then, according to the vector similarity between the word vector of each candidate keyword and the key topic vector, the candidate keywords whose similarity is greater than a similarity threshold are selected from the candidate keywords as the article keywords of the target medical article. It should be noted that the similarity threshold mentioned herein may be set according to business requirements or empirical values, for example 0.8 or 0.7; the article keywords selected in steps S407-S408 may or may not include the aforementioned key disease word, which is not limited herein.
Therefore, through steps S407-S408, the vector similarity between the word vector of every article keyword and the key topic vector is greater than the similarity threshold, so that candidate keywords irrelevant to the key disease topic are prevented from being selected as article keywords. For example, if the key disease topic of the target medical article concerns gynecological diseases, the candidate keyword "orthopedics", which is irrelevant to the key disease topic, can be prevented from being selected as an article keyword of the target medical article. In this way, the article keywords selected through steps S407-S408 better reflect the subject content of the target medical article, which improves the accuracy of the article keywords; moreover, the article keywords of the target medical article all belong to the same key disease topic, which ensures topic consistency among the article keywords.
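As a non-limiting illustration, steps S407-S408 might be sketched as follows using cosine similarity, which is only one of the algorithms listed above. The toy 2-dimensional vectors and the function names are assumptions of this example.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def select_article_keywords(
    candidate_vectors: Dict[str, Sequence[float]],
    key_topic_vector: Sequence[float],
    similarity_threshold: float = 0.8,
) -> List[str]:
    """Keep candidates whose similarity to the key topic vector
    is greater than the similarity threshold."""
    return [
        word
        for word, vector in candidate_vectors.items()
        if cosine_similarity(vector, key_topic_vector) > similarity_threshold
    ]


candidate_vectors = {
    "ovarian cyst": [2.0, 0.0],      # same direction as the topic
    "pelvic ultrasound": [1.0, 1.0], # 45 degrees away from the topic
    "orthopedics": [0.0, 1.0],       # orthogonal to the topic
}
article_keywords = select_article_keywords(candidate_vectors, [1.0, 0.0], 0.8)
```

With a threshold of 0.8 only the first candidate is kept; lowering the threshold to 0.7 would also keep the second candidate, whose cosine similarity is about 0.707.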
S409, the target medical article and the article keywords are stored in a correlated mode.
In a specific implementation, after the computer device selects the article keyword of the target medical article through the above steps S407-S408, the target medical article and the article keyword may be stored in an associated manner. Specifically, the computer device can directly add and store the article keywords into a keyword list related to the target medical article; alternatively, the computer device may add and store word vectors for individual article keywords to a list of keywords associated with the target medical article. By storing the target medical article and the article keywords in an associated manner, the target medical article can be subjected to business processing subsequently according to the article keywords; the business processes referred to herein may include, but are not limited to: an article classification process, a keyword reminding process, a keyword search process, an article recommendation reading process, and the like.
The article classification processing refers to: acquiring related medical articles that have the same or similar article keywords as the target medical article, and placing the acquired related medical articles and the target medical article into the same article set. The keyword reminding processing refers to: displaying the article keywords of the target medical article together with the target medical article when the target medical article is displayed. The keyword search processing refers to: when an input keyword entered by a user is received, outputting the target medical article if the similarity between the input keyword and an article keyword is identified as being greater than a threshold. The article recommendation reading processing refers to: when a user browses current content, recommending the target medical article to the user if the current content is identified as including an article keyword.
It should be noted that the above-mentioned key topic vector is the dominant topic vector (i.e., the main topic vector) of the target medical article, and the key disease topic is the main disease topic of the target medical article. In an alternative embodiment, when a plurality of candidate disease words are extracted from the target medical article, this indicates that the target medical article has a plurality of disease topics. In this case, the computer device may further select a reference disease word of the target medical article from the at least one candidate disease word, the importance of the reference disease word being less than the importance of the key disease word. Secondly, the word vector of the reference disease word may be used to construct a subordinate topic vector (i.e., a non-main topic vector) of the target medical article, which is used to indicate a subordinate disease topic of the target medical article. The target medical article, the key topic vector, and the subordinate topic vector may then be stored in association in a storage space, so that when there is an article search request, article search processing may be performed according to the key topic vector and the subordinate topic vector.
Specifically, the computer device can also calculate the dominant topic vectors and the subordinate topic vectors of other medical articles by using the steps of the above method, and store the calculated topic vectors into the storage space; that is, the storage space also includes at least one other medical article, and each other medical article has a corresponding dominant topic vector and a corresponding subordinate topic vector. When there is an article search request, the computer device may obtain an information vector of the article search information carried by the article search request. Secondly, each medical article in the storage space and at least one topic vector of each medical article can be obtained; each medical article has a recommendation weight value, and the at least one topic vector of each medical article includes the dominant topic vector and the subordinate topic vector of that medical article. Then, the matching degree between each topic vector of each medical article and the information vector can be calculated, and the recommendation weight value of each medical article is updated according to the calculated matching degrees. Specifically, for any medical article, the topic vector with the maximum matching degree can be determined from the at least one topic vector of that medical article according to the matching degree between each of its topic vectors and the information vector. If the topic vector with the maximum matching degree is the dominant topic vector of the medical article, the recommendation weight value of the medical article can be increased; if the topic vector with the maximum matching degree is the subordinate topic vector of the medical article, the recommendation weight value of the medical article can be reduced.
After the recommendation weight values of the medical articles are updated based on the above updating principle, the computer device can arrange the medical articles in descending order of their updated recommendation weight values, and select the first medical article as the medical article to be recommended and output it. Therefore, for a medical article with a plurality of disease topics, the embodiment of the invention can raise or lower the ranking weight of the medical article according to how the search information input by the user matches its disease topics, so that the medical article that best matches the search information is output to the user; this can effectively improve the accuracy of article retrieval and thereby improve the user experience.
According to the embodiment of the invention, for a target medical article to be identified, a plurality of medical words can be extracted from the target medical article, and a medical knowledge graph of the target medical article is constructed by using the plurality of medical words. The medical words recorded by any two connected nodes in the medical knowledge graph have a co-occurrence relationship in the target medical article, and a medical word that has more co-occurrence relationships is generally more important; therefore, the importance of the medical word recorded by each node can be accurately calculated based on the connection relationships among the nodes in the medical knowledge graph. Then, the key disease word of the target medical article can be selected from the medical word set according to the importance of each medical word, and a key topic vector indicating the key disease topic of the target medical article is constructed by using the word vector of the key disease word. In this way, improving the accuracy of the importance of each medical word effectively improves the accuracy of the key disease word, and thus the accuracy of the key disease topic; in addition, the key disease topic of a medical article can be identified automatically, without manual participation of the user in the topic identification process, which effectively saves labor cost. Moreover, after the key topic vector is obtained, the computer device can also refine the keywords of the target medical article based on the vector similarity between the candidate keywords and the key topic vector to obtain article keywords with topic consistency, which can effectively improve the accuracy of the article keywords.
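As a non-limiting illustration, the updating of recommendation weight values and the descending arrangement described above might be sketched as follows. The sketch assumes that the matching degree is cosine similarity, that weights are raised or lowered by a fixed hypothetical step of 0.1, and that a tie between the two topic vectors is resolved in favour of the dominant topic vector; none of these choices is specified by the embodiment.

```python
import math
from typing import Dict, List, Sequence, Tuple


def matching_degree(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed matching degree: cosine similarity (0.0 for zero vectors)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def rank_articles(
    articles: List[Dict], information_vector: Sequence[float], step: float = 0.1
) -> List[Tuple[str, float]]:
    """Update each article's recommendation weight, then sort descending."""
    updated = []
    for article in articles:
        dominant = matching_degree(article["dominant"], information_vector)
        subordinate = matching_degree(article["subordinate"], information_vector)
        if dominant >= subordinate:
            weight = article["weight"] + step   # best match is the main topic
        else:
            weight = article["weight"] - step   # best match is a side topic
        updated.append((article["id"], weight))
    updated.sort(key=lambda item: item[1], reverse=True)
    return updated


articles = [
    {"id": "B", "weight": 1.0, "dominant": [0.0, 1.0], "subordinate": [1.0, 0.0]},
    {"id": "A", "weight": 1.0, "dominant": [1.0, 0.0], "subordinate": [0.0, 1.0]},
]
ranking = rank_articles(articles, [1.0, 0.0])
recommended_id = ranking[0][0]
```

Both toy articles mention the searched topic, but article "A" treats it as its dominant topic and is therefore ranked first.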
Based on the description of the above embodiments of the article identification method, an embodiment of the present invention further discloses an article identification apparatus, which may be a computer program (including program code) running in the above-mentioned computer device. The article identification apparatus may perform the method shown in fig. 2 or fig. 4. Referring to fig. 6, the article identification apparatus may run the following units:
an extracting unit 601, configured to extract a medical word set from a target medical article to be identified, where the medical word set includes a plurality of medical words, and the plurality of medical words at least includes a disease word;
a constructing unit 602, configured to construct a medical knowledge graph of the target medical article by using the plurality of medical words, where the medical knowledge graph includes a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
the processing unit 603 is configured to calculate, based on a connection relationship between each node in the medical knowledge graph, an importance of the medical word recorded by each node;
the processing unit is further configured to select a key disease word of the target medical article from the set of medical words according to the importance of each medical word, and construct a key topic vector of the target medical article by using a word vector of the key disease word, where the key topic vector is used to indicate a key disease topic of the target medical article.
In one embodiment, the processing unit 603, when configured to construct the key topic vector of the target medical article using the word vector of the key disease word, may be specifically configured to:
acquiring related non-disease words corresponding to the key disease words from the medical word set, wherein the related non-disease words meet the following conditions: in the medical knowledge graph, nodes for recording the related non-disease words are connected with nodes for recording the key disease words;
acquiring word vectors of the key disease words and word vectors of the related non-disease words;
and fusing the word vector of the key disease word and the word vector of the related non-disease word to obtain the key topic vector of the target medical article.
In another embodiment, when the processing unit 603 is configured to select a key disease word of the target medical article from the set of medical words according to the importance of each medical word, the processing unit may be specifically configured to:
selecting a plurality of candidate keywords of the target medical article from the medical word set according to the importance of each medical word and a keyword selection strategy; the candidate keywords comprise at least one candidate disease word;
and selecting the candidate disease word with the maximum importance degree from the at least one candidate disease word as a key disease word of the target medical article.
In yet another embodiment, the processing unit 603 is further configured to:
obtaining word vectors of all candidate keywords, and calculating the vector similarity between the word vectors of all candidate keywords and the key topic vectors;
selecting article keywords of the target medical article from the candidate keywords according to the vector similarity between the word vector of each candidate keyword and the key topic vector; wherein the vector similarity between the word vector of the article keyword and the key topic vector is greater than a similarity threshold;
and performing associated storage on the target medical article and the article keywords so as to perform business processing on the target medical article according to the article keywords.
In another embodiment, when the processing unit 603 is configured to select, according to a keyword selection policy and according to the importance of each medical term, a plurality of candidate keywords of the target medical article from the medical term set, the processing unit may be specifically configured to:
selecting, in descending order of importance, a preset number of medical words from the medical word set as the plurality of candidate keywords of the target medical article; or,
and selecting the medical words with the importance degrees larger than the importance degree threshold value from the medical word set as a plurality of candidate keywords of the target medical article.
In yet another embodiment, the key topic vector is a dominant topic vector of the target medical article, and the key disease topic is a main disease topic of the target medical article; accordingly, the processing unit 603 is further operable to:
selecting a reference disease word of the target medical article from the at least one candidate disease word, wherein the importance of the reference disease word is less than that of the key disease word;
constructing a subordinate topic vector of the target medical article by using the word vector of the reference disease word, wherein the subordinate topic vector of the target medical article is used for indicating a subordinate disease topic of the target medical article;
and storing the target medical article, the key topic vector and the subordinate topic vector into a storage space in an associated manner, so that when an article search request exists, article search processing is carried out according to the key topic vector and the subordinate topic vector.
In yet another embodiment, the storage space further includes at least one other medical article, and each other medical article has a corresponding dominant topic vector and a corresponding subordinate topic vector; accordingly, the processing unit 603 is further operable to:
when an article searching request exists, acquiring an information vector of article searching information carried by the article searching request;
acquiring each medical article in the storage space and at least one topic vector of each medical article; wherein each medical article has a recommendation weight value, and the at least one topic vector of each medical article comprises a dominant topic vector and a subordinate topic vector of each medical article;
respectively calculating the matching degree between each topic vector of each medical article and the information vector, and updating the recommendation weight value of each medical article according to the calculated matching degree;
according to the updated recommended weight value of each medical article, performing descending arrangement on each medical article; and selecting the first medical article as the medical article to be recommended and outputting the medical article.
In another embodiment, the processing unit 603, when configured to update the recommended weight value of each medical article according to the calculated matching degree, may specifically be configured to:
for any medical article, according to the matching degree between each topic vector of the medical article and the information vector, determining a topic vector with the maximum matching degree from at least one topic vector of the medical article;
if the topic vector with the maximum matching degree is the dominant topic vector of any medical article, increasing the recommendation weight value of any medical article;
and if the topic vector with the maximum matching degree is the subordinate topic vector of any medical article, reducing the recommended weight value of any medical article.
In another embodiment, the constructing unit 602, when configured to construct the medical knowledge graph of the target medical article by using the plurality of medical words, may specifically be configured to:
constructing an initial knowledge graph of the target medical article by adopting the plurality of medical words, wherein the initial knowledge graph comprises a plurality of nodes, and each node records one medical word;
selecting at least one pair of co-occurring word pairs from the plurality of medical words, wherein the co-occurring word pairs are word pairs formed by two medical words having a co-occurring relation in the target medical article;
determining at least one node group from the initial knowledge graph according to the at least one pair of co-occurring word pairs, wherein any node group comprises: respectively recording two nodes of two medical words in a pair of co-occurrence word pairs;
and respectively connecting two nodes in each node group in the initial knowledge graph to obtain a medical knowledge graph of the target medical article.
In another embodiment, the constructing unit 602, when being configured to select at least one pair of co-occurrence word pairs from the plurality of medical words, may be specifically configured to:
determining a first distribution position of the first medical word in the target medical article, wherein the first medical word is any one of the plurality of medical words;
acquiring a second medical word from the plurality of medical words according to a first distribution position of the first medical word, wherein the distance between a second distribution position of the second medical word in the target medical article and the first distribution position is smaller than a position distance threshold value;
calculating a semantic distance value between the first medical word and the second medical word, the semantic distance value being indicative of a semantic similarity between the first medical word and the second medical word;
if the semantic distance value between the first medical word and the second medical word is larger than a semantic threshold value, determining that the first medical word and the second medical word have the co-occurrence relationship in the target medical article, and constructing a pair of co-occurrence word pairs by using the first medical word and the second medical word.
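As a non-limiting illustration, the selection of co-occurring word pairs listed above might be sketched as follows. The sketch assumes that each medical word has a single integer distribution position, and that the semantic distance value is supplied by a caller-provided function; the lookup table used below is a hypothetical stand-in for a real semantic model.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def select_cooccurring_pairs(
    positions: Dict[str, int],
    semantic_value: Callable[[str, str], float],
    position_distance_threshold: int,
    semantic_threshold: float,
) -> List[Tuple[str, str]]:
    """Pair two medical words when they are close in the article and
    their semantic distance value exceeds the semantic threshold."""
    pairs = []
    for (first, p1), (second, p2) in combinations(positions.items(), 2):
        close_enough = abs(p1 - p2) < position_distance_threshold
        if close_enough and semantic_value(first, second) > semantic_threshold:
            pairs.append((first, second))
    return pairs


# Hypothetical semantic values for the toy example.
toy_values = {
    frozenset({"gastric ulcer", "gastroscope"}): 0.9,
    frozenset({"gastric ulcer", "fracture"}): 0.1,
    frozenset({"gastroscope", "fracture"}): 0.2,
}


def toy_semantic_value(a: str, b: str) -> float:
    return toy_values.get(frozenset({a, b}), 0.0)


positions = {"gastric ulcer": 2, "gastroscope": 5, "fracture": 6}
pairs = select_cooccurring_pairs(positions, toy_semantic_value, 5, 0.5)
```

Although "fracture" lies within the position distance threshold of both other words, it forms no pair because its semantic values do not exceed the semantic threshold.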
In another embodiment, the processing unit 603, when configured to calculate the importance of the medical word recorded by each node based on the connection relationship between the nodes in the medical knowledge graph, may specifically be configured to:
aiming at a medical word recorded by any node, determining at least one associated node connected with the any node based on the connection relation among all nodes in the medical knowledge graph;
calculating semantic distance values between the medical words recorded by any node and the medical words recorded by each associated node;
and calculating the importance of the medical word recorded by any node according to the calculated semantic distance value and the importance of the medical word recorded by each associated node.
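As a non-limiting illustration, the importance calculation listed above might be sketched as follows. The passage above does not state the formula, so the sketch swaps in a weighted TextRank-style iteration as an assumed stand-in: the semantic distance value between two connected words serves as the edge weight, and the damping factor of 0.85 and the iteration count are likewise assumptions.

```python
from typing import Callable, Dict, Set


def compute_importance(
    graph: Dict[str, Set[str]],
    semantic_value: Callable[[str, str], float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Iteratively score each node from its associated nodes' scores,
    weighted by the semantic value of the connecting edge."""
    score = {node: 1.0 for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            received = 0.0
            for neighbour in graph[node]:
                # Total edge weight leaving the associated node.
                outgoing = sum(semantic_value(neighbour, k) for k in graph[neighbour])
                if outgoing > 0:
                    share = semantic_value(neighbour, node) / outgoing
                    received += share * score[neighbour]
            updated[node] = (1 - damping) + damping * received
        score = updated
    return score


graph = {
    "gastric ulcer": {"gastroscope", "lower abdominal pain", "molsidine"},
    "gastroscope": {"gastric ulcer"},
    "lower abdominal pain": {"gastric ulcer"},
    "molsidine": {"gastric ulcer"},
}
importance = compute_importance(graph, lambda a, b: 1.0)
most_important = max(importance, key=importance.get)
```

In this star-shaped toy graph with uniform edge weights, the word connected to every other word receives the greatest importance.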
In another embodiment, the extracting unit 601, when being configured to extract the medical word set from the target medical article to be identified, may be specifically configured to:
performing word segmentation processing on a target medical article to be recognized to obtain an initial word set, wherein the initial word set comprises a plurality of initial words;
screening a plurality of intermediate words from the initial word set according to at least one medical dictionary, wherein the intermediate words refer to the initial words existing in the at least one medical dictionary;
and constructing a medical word set of the target medical article by adopting the plurality of intermediate words.
In yet another embodiment, each initial word in the set of initial words has a part of speech; accordingly, the extracting unit 601, when configured to filter out a plurality of intermediate words from the initial word set according to at least one medical dictionary, may be specifically configured to:
screening initial words of a target part of speech from the initial word set, wherein the target part of speech comprises at least one of the following items: nouns, verbs, and adjectives;
and screening a plurality of intermediate words from the initial words of the target part of speech according to at least one medical dictionary.
In another embodiment, the extracting unit 601, when configured to construct the medical word set of the target medical article by using the plurality of intermediate words, may specifically be configured to:
if the intermediate words meeting the bonding conditions exist in the plurality of intermediate words, performing bonding treatment on the intermediate words meeting the bonding conditions; adding the words after bonding treatment and the intermediate words without bonding treatment into a medical word set of the target medical article as medical words;
if the intermediate words meeting the bonding conditions do not exist in the plurality of intermediate words, all the intermediate words are used as medical words and added to a medical word set of the target medical article;
wherein the bonding conditions include: the distribution locations in the target medical article are adjacent and exist in the same medical dictionary.
According to an embodiment of the present application, the steps involved in the method shown in fig. 2 or fig. 4 may be performed by the units in the article identification apparatus shown in fig. 6. For example, steps S201 and S202 shown in fig. 2 may be performed by the extraction unit 601 and the construction unit 602 shown in fig. 6, respectively, and steps S203 to S204 may be performed by the processing unit 603 shown in fig. 6. As another example, steps S401 and S402 shown in fig. 4 may be performed by the extraction unit 601 and the construction unit 602 shown in fig. 6, respectively, steps S403 to S409 may be performed by the processing unit 603 shown in fig. 6, and so on.
According to another embodiment of the present application, the units in the article identification apparatus shown in fig. 6 may be combined, individually or entirely, into one or several other units, or one or more of the units may be further split into multiple functionally smaller units; either arrangement can perform the same operations without affecting the technical effect of the embodiment of the present invention. The units are divided based on logical functions; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present invention, the article identification apparatus may also include other units, and in practical applications these functions may also be implemented with the assistance of other units or through the cooperation of multiple units.
According to another embodiment of the present application, the article identification apparatus shown in fig. 6 may be constructed, and the article identification method of the embodiment of the present invention may be implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 2 or fig. 4 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a Central Processing Unit (CPU), a random access storage medium (RAM), and a read-only storage medium (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and executed in the above-described computing device via the computer-readable recording medium.
According to the embodiment of the invention, for a target medical article to be identified, a plurality of medical words can be extracted from the target medical article, and a medical knowledge graph of the target medical article is constructed by using the plurality of medical words. Because the medical words recorded by any two connected nodes in the medical knowledge graph have a co-occurrence relationship in the target medical article, and a medical word that has more co-occurrence relationships is generally more important, the importance of the medical word recorded by each node can be accurately calculated based on the connection relationships among the nodes in the medical knowledge graph. Then, key disease words of the target medical article can be selected from the medical word set according to the importance of each medical word, and a key topic vector indicating the key disease topic of the target medical article is constructed by using the word vectors of the key disease words. Improving the accuracy of the importance of each medical word thus effectively improves the accuracy of the key disease words, and in turn the accuracy of the key disease topic; in addition, the key disease topic of the target medical article can be identified automatically, without manual participation by a user anywhere in the topic identification process, which effectively saves labor cost.
Based on the description of the method embodiment and the device embodiment, the embodiment of the invention also provides computer equipment. Referring to fig. 7, the computer device may include at least a processor 701, an input interface 702, an output interface 703, and a computer storage medium 704. The processor 701, the input interface 702, the output interface 703, and the computer storage medium 704 in the computer device may be connected by a bus or other means. A computer storage medium 704 may be stored in the memory of the computer device, the computer storage medium 704 being used to store a computer program comprising program instructions, the processor 701 being used to execute the program instructions stored by the computer storage medium 704. The processor 701 (or CPU) is a computing core and a control core of the computer device, and is adapted to implement one or more instructions, and in particular, is adapted to load and execute one or more instructions to implement a corresponding method flow or a corresponding function.
In an embodiment, the processor 701 according to the embodiment of the present invention may be configured to perform a series of article identification processes, specifically including: extracting a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words at least comprise disease words; constructing a medical knowledge graph of the target medical article by adopting the plurality of medical words, wherein the medical knowledge graph comprises a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article; calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph; and selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing a key topic vector of the target medical article by using word vectors of the key disease words, wherein the key topic vector is used for indicating key disease topics of the target medical article, and the like.
An embodiment of the present invention further provides a computer storage medium (Memory), which is a Memory device in a computer device and is used to store programs and data. It is understood that the computer storage medium herein may include both built-in storage media in the computer device and, of course, extended storage media supported by the computer device. Computer storage media provide storage space that stores an operating system of a computer device. Also stored in this memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by processor 701. The computer storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by processor 701 to perform the corresponding steps of the methods described above in connection with the article identification method embodiment; in particular implementations, one or more instructions in the computer storage medium are loaded by processor 701 and perform the following steps:
extracting a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words at least comprise disease words;
constructing a medical knowledge graph of the target medical article by adopting the plurality of medical words, wherein the medical knowledge graph comprises a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph;
and selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing a key topic vector of the target medical article by using word vectors of the key disease words, wherein the key topic vector is used for indicating key disease topics of the target medical article.
In one embodiment, in constructing the key topic vector for the target medical article using the word vector for the key disease word, the one or more instructions may be loaded and specifically executed by processor 701:
acquiring related non-disease words corresponding to the key disease words from the medical word set, wherein the related non-disease words meet the following conditions: in the medical knowledge graph, nodes for recording the related non-disease words are connected with nodes for recording the key disease words;
acquiring word vectors of the key disease words and word vectors of the related non-disease words;
and fusing the word vector of the key disease word and the word vector of the related non-disease word to obtain the key topic vector of the target medical article.
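The fusion step above can be illustrated with a short sketch. The text does not specify how the vectors are fused, so element-wise averaging is used here purely as an assumption; the function name `build_key_topic_vector`, the adjacency-dictionary graph layout, and the `is_disease` predicate are likewise hypothetical.

```python
from typing import Callable, Dict, List, Set


def build_key_topic_vector(
    key_disease_word: str,
    graph: Dict[str, Set[str]],
    is_disease: Callable[[str], bool],
    vectors: Dict[str, List[float]],
) -> List[float]:
    """Fuse the key disease word with its connected non-disease words.

    `graph` maps each medical word to the words its node is connected to.
    Fusion by element-wise mean is an assumption, not stated in the text.
    """
    # Related non-disease words: neighbours of the key disease word's node.
    related = [w for w in sorted(graph.get(key_disease_word, ())) if not is_disease(w)]
    members = [key_disease_word] + related
    dim = len(vectors[key_disease_word])
    return [sum(vectors[w][k] for w in members) / len(members) for k in range(dim)]
```

Other fusion choices, such as importance-weighted averaging or concatenation, would fit the same interface.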
In yet another embodiment, when selecting the key disease word of the target medical article from the set of medical words according to the importance of each medical word, the one or more instructions may be loaded and specifically executed by the processor 701:
selecting a plurality of candidate keywords of the target medical article from the medical word set according to the importance of each medical word and a keyword selection strategy; the candidate keywords comprise at least one candidate disease word;
and selecting the candidate disease word with the maximum importance degree from the at least one candidate disease word as a key disease word of the target medical article.
In yet another embodiment, the one or more instructions may be further loaded and specifically executed by the processor 701:
obtaining word vectors of all candidate keywords, and calculating the vector similarity between the word vectors of all candidate keywords and the key topic vectors;
selecting article keywords of the target medical article from the candidate keywords according to the vector similarity between the word vector of each candidate keyword and the key topic vector; wherein the vector similarity between the word vector of the article keyword and the key topic vector is greater than a similarity threshold;
and performing associated storage on the target medical article and the article keywords so as to perform business processing on the target medical article according to the article keywords.
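A minimal sketch of the keyword filtering above follows. It assumes cosine similarity as the vector similarity measure, which the text does not fix; the function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_article_keywords(
    candidates: List[str],
    vectors: Dict[str, List[float]],
    key_topic_vector: List[float],
    similarity_threshold: float,
) -> List[str]:
    """Keep candidates whose similarity to the key topic vector exceeds the threshold."""
    return [
        w
        for w in candidates
        if cosine_similarity(vectors[w], key_topic_vector) > similarity_threshold
    ]
```

The surviving keywords would then be stored in association with the article for later business processing.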
In another embodiment, when selecting a plurality of candidate keywords of the target medical article from the set of medical words according to the importance of each medical word and a keyword selection strategy, the one or more instructions may be loaded and specifically executed by the processor 701:
selecting, in descending order of importance, a preset number of medical words from the medical word set as the plurality of candidate keywords of the target medical article; or,
selecting the medical words whose importance is greater than an importance threshold from the medical word set as the plurality of candidate keywords of the target medical article.
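The two keyword selection strategies, together with the choice of the key disease word among the candidates, can be sketched as below. The function names and the representation of importance as a dictionary of scores are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set


def select_candidate_keywords(
    importance: Dict[str, float],
    preset_number: Optional[int] = None,
    importance_threshold: Optional[float] = None,
) -> List[str]:
    """Pick candidates by top-N (if preset_number is given) or by threshold."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    if preset_number is not None:
        return [word for word, _ in ranked[:preset_number]]
    if importance_threshold is None:
        raise ValueError("give either preset_number or importance_threshold")
    return [word for word, score in ranked if score > importance_threshold]


def pick_key_disease_word(
    candidates: List[str],
    importance: Dict[str, float],
    disease_words: Set[str],
) -> Optional[str]:
    """Return the candidate disease word with the greatest importance, if any."""
    candidate_diseases = [w for w in candidates if w in disease_words]
    if not candidate_diseases:
        return None
    return max(candidate_diseases, key=lambda w: importance[w])
```

A candidate disease word with lower importance than the key disease word could serve as the reference disease word used for the subordinate topic vector.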
In yet another embodiment, the key topic vector is a dominant topic vector of the target medical article, and the key disease topic is a main disease topic of the target medical article; accordingly, the one or more instructions may also be loaded and specifically executed by the processor 701:
selecting a reference disease word of the target medical article from the at least one candidate disease word, wherein the importance of the reference disease word is less than that of the key disease word;
constructing a subordinate topic vector of the target medical article by using the word vector of the reference disease word, wherein the subordinate topic vector of the target medical article is used for indicating a subordinate disease topic of the target medical article;
and storing the target medical article, the key topic vector and the subordinate topic vector into a storage space in an associated manner, so that when an article search request exists, article search processing is carried out according to the key topic vector and the subordinate topic vector.
In yet another embodiment, the storage space further includes at least one other medical article, and each other medical article has a corresponding dominant topic vector and a corresponding subordinate topic vector; accordingly, the one or more instructions may also be loaded and specifically executed by the processor 701:
when an article searching request exists, acquiring an information vector of article searching information carried by the article searching request;
acquiring each medical article in the storage space and at least one topic vector of each medical article; wherein each medical article has a recommendation weight value, and the at least one topic vector of each medical article comprises the dominant topic vector and the subordinate topic vector of that medical article;
respectively calculating the matching degree between each topic vector of each medical article and the information vector, and updating the recommendation weight value of each medical article according to the calculated matching degree;
arranging the medical articles in descending order of their updated recommendation weight values, and selecting the first-ranked medical article as the medical article to be recommended and outputting it.
In yet another embodiment, when the recommendation weight value of each medical article is updated according to the calculated matching degree, the one or more instructions may be loaded and specifically executed by the processor 701:
for any medical article, determining the topic vector with the maximum matching degree from the at least one topic vector of the medical article according to the matching degree between each topic vector of the medical article and the information vector;
if the topic vector with the maximum matching degree is the dominant topic vector of the medical article, increasing the recommendation weight value of the medical article;
and if the topic vector with the maximum matching degree is the subordinate topic vector of the medical article, reducing the recommendation weight value of the medical article.
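The search-time weighting and ranking described above can be sketched as follows. The text does not give the matching measure or the size of the weight adjustment, so cosine similarity and a fixed `step` are assumed here; the `StoredArticle` class, the `recommend` function, and the rule that ties favor the dominant topic vector are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class StoredArticle:
    title: str
    dominant: List[float]      # dominant (key) topic vector
    subordinate: List[float]   # subordinate topic vector
    weight: float = 1.0        # recommendation weight value


def matching_degree(a: List[float], b: List[float]) -> float:
    """Assumed matching measure: cosine similarity (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recommend(
    articles: List[StoredArticle], info_vector: List[float], step: float = 0.1
) -> StoredArticle:
    """Update each article's weight in place and return the first-ranked article."""
    if not articles:
        raise ValueError("no stored articles to rank")
    for article in articles:
        dominant_match = matching_degree(article.dominant, info_vector)
        subordinate_match = matching_degree(article.subordinate, info_vector)
        # Raise the weight when the dominant topic matches best, lower it otherwise.
        if dominant_match >= subordinate_match:
            article.weight += step
        else:
            article.weight -= step
    ranked = sorted(articles, key=lambda a: a.weight, reverse=True)
    return ranked[0]
```

The effect is that an article whose main disease topic matches the search information outranks one that only mentions the topic in passing.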
In yet another embodiment, when constructing the medical knowledge graph of the target medical article by using the plurality of medical words, the one or more instructions may be loaded and specifically executed by the processor 701:
constructing an initial knowledge graph of the target medical article by adopting the plurality of medical words, wherein the initial knowledge graph comprises a plurality of nodes, and each node records one medical word;
selecting at least one co-occurring word pair from the plurality of medical words, wherein a co-occurring word pair is a word pair formed by two medical words having a co-occurrence relation in the target medical article;
determining at least one node group from the initial knowledge graph according to the at least one co-occurring word pair, wherein any node group comprises the two nodes that respectively record the two medical words of one co-occurring word pair;
and connecting the two nodes in each node group in the initial knowledge graph to obtain the medical knowledge graph of the target medical article.
In yet another embodiment, when selecting at least one co-occurring word pair from the plurality of medical words, the one or more instructions may be loaded and executed by the processor 701:
determining a first distribution position of the first medical word in the target medical article, wherein the first medical word is any one of the plurality of medical words;
acquiring a second medical word from the plurality of medical words according to a first distribution position of the first medical word, wherein the distance between a second distribution position of the second medical word in the target medical article and the first distribution position is smaller than a position distance threshold value;
calculating a semantic distance value between the first medical word and the second medical word, the semantic distance value being indicative of a semantic similarity between the first medical word and the second medical word;
if the semantic distance value between the first medical word and the second medical word is greater than a semantic threshold, determining that the first medical word and the second medical word have the co-occurrence relation in the target medical article, and constructing a co-occurring word pair from the first medical word and the second medical word.
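The graph construction and co-occurrence test above can be sketched together. The sketch assumes that the semantic distance value is the cosine similarity of word vectors, since the text says only that the value indicates semantic similarity, and that each occurrence is given as a (position, word) pair. The function names and the adjacency-dictionary representation are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def semantic_value(a: List[float], b: List[float]) -> float:
    """Assumed semantic distance value: cosine similarity of word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_medical_knowledge_graph(
    occurrences: List[Tuple[int, str]],
    vectors: Dict[str, List[float]],
    position_threshold: int,
    semantic_threshold: float,
) -> Dict[str, Set[str]]:
    """Create one node per medical word, then connect co-occurring word pairs."""
    # Initial knowledge graph: nodes only, no connections.
    graph: Dict[str, Set[str]] = {word: set() for _, word in occurrences}
    for i, (pos_a, word_a) in enumerate(occurrences):
        for pos_b, word_b in occurrences[i + 1:]:
            if word_a == word_b:
                continue
            close = abs(pos_a - pos_b) < position_threshold
            related = semantic_value(vectors[word_a], vectors[word_b]) > semantic_threshold
            if close and related:
                # The two words form a co-occurring word pair: connect their nodes.
                graph[word_a].add(word_b)
                graph[word_b].add(word_a)
    return graph
```

The pairwise loop is quadratic in the number of occurrences; a sliding window over positions would be a natural refinement for long articles.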
In another embodiment, when calculating the importance of the medical word recorded by each node based on the connection relationship between the nodes in the medical knowledge graph, the one or more instructions may be loaded and specifically executed by the processor 701:
for the medical word recorded by any node, determining at least one associated node connected with that node based on the connection relation among the nodes in the medical knowledge graph;
calculating semantic distance values between the medical word recorded by that node and the medical word recorded by each associated node;
and calculating the importance of the medical word recorded by that node according to the calculated semantic distance values and the importance of the medical word recorded by each associated node.
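One plausible way to realize this calculation is a weighted, TextRank-style iteration in which each node's importance is built from its neighbours' importance, weighted by the semantic distance values on the connections. This is an assumed reading rather than a formula taken from this passage; the damping factor, iteration count, and function name `compute_importance` are hypothetical.

```python
from typing import Callable, Dict, Set


def compute_importance(
    graph: Dict[str, Set[str]],
    semantic_value: Callable[[str, str], float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Iteratively score nodes from neighbour scores and edge weights."""
    scores = {node: 1.0 for node in graph}
    # Total edge weight leaving each node, used to normalise contributions.
    total_weight = {
        node: sum(semantic_value(node, other) for other in graph[node])
        for node in graph
    }
    for _ in range(iterations):
        updated = {}
        for node in graph:
            received = 0.0
            for neighbour in graph[node]:
                if total_weight[neighbour] > 0:
                    share = semantic_value(neighbour, node) / total_weight[neighbour]
                    received += share * scores[neighbour]
            updated[node] = (1.0 - damping) + damping * received
        scores = updated
    return scores
```

Under this scheme a word connected to many other words accumulates more importance, in line with the observation that medical words with more co-occurrence relations are generally more important.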
In yet another embodiment, the one or more instructions may be loaded and executed by the processor 701 when extracting a medical word set from a target medical article to be identified:
performing word segmentation processing on a target medical article to be recognized to obtain an initial word set, wherein the initial word set comprises a plurality of initial words;
screening a plurality of intermediate words from the initial word set according to at least one medical dictionary, wherein the intermediate words refer to the initial words existing in the at least one medical dictionary;
and constructing a medical word set of the target medical article by adopting the plurality of intermediate words.
In yet another embodiment, each initial word in the set of initial words has a part of speech; accordingly, when screening a plurality of intermediate words from the initial word set according to at least one medical dictionary, the one or more instructions may be loaded and executed by the processor 701:
screening initial words of a target part of speech from the initial word set, wherein the target part of speech comprises at least one of the following items: nouns, verbs, and adjectives;
and screening a plurality of intermediate words from the initial words of the target part of speech according to at least one medical dictionary.
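The screening of intermediate words can be sketched as below. The sketch assumes that word segmentation and part-of-speech tagging have already produced (word, part_of_speech) pairs, and that each medical dictionary is a Python set; the tag names and the function name `screen_intermediate_words` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumed tag names for the target parts of speech.
TARGET_PARTS_OF_SPEECH = frozenset({"noun", "verb", "adjective"})


def screen_intermediate_words(
    tagged_words: Iterable[Tuple[str, str]],
    dictionaries: List[Set[str]],
    target_pos: frozenset = TARGET_PARTS_OF_SPEECH,
) -> List[str]:
    """Keep words of a target part of speech that appear in any medical dictionary."""
    vocabulary: Set[str] = set().union(*dictionaries)
    return [
        word
        for word, pos in tagged_words
        if pos in target_pos and word in vocabulary
    ]
```

Filtering by part of speech before the dictionary lookup discards function words early, so fewer lookups are needed.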
In yet another embodiment, when constructing the medical word set of the target medical article using the plurality of intermediate words, the one or more instructions may be loaded and specifically executed by processor 701:
if the intermediate words meeting the bonding conditions exist in the plurality of intermediate words, performing bonding treatment on the intermediate words meeting the bonding conditions; adding the words after bonding treatment and the intermediate words without bonding treatment into a medical word set of the target medical article as medical words;
if the intermediate words meeting the bonding conditions do not exist in the plurality of intermediate words, all the intermediate words are used as medical words and added to a medical word set of the target medical article;
wherein the bonding conditions include: the distribution positions in the target medical article are adjacent and exist in the same medical dictionary.
According to the embodiment of the invention, for a target medical article to be identified, a plurality of medical words can be extracted from the target medical article, and a medical knowledge graph of the target medical article is constructed by using the plurality of medical words. Because the medical words recorded by any two connected nodes in the medical knowledge graph have a co-occurrence relationship in the target medical article, and a medical word that has more co-occurrence relationships is generally more important, the importance of the medical word recorded by each node can be accurately calculated based on the connection relationships among the nodes in the medical knowledge graph. Then, key disease words of the target medical article can be selected from the medical word set according to the importance of each medical word, and a key topic vector indicating the key disease topic of the target medical article is constructed by using the word vectors of the key disease words. Improving the accuracy of the importance of each medical word thus effectively improves the accuracy of the key disease words, and in turn the accuracy of the key disease topic; in addition, the key disease topic of the target medical article can be identified automatically, without manual participation by a user anywhere in the topic identification process, which effectively saves labor cost.
It should be noted that according to an aspect of the present application, a computer program product or a computer program is also provided, and the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the method provided in the various alternatives in the aspect of the article identification method embodiment shown in fig. 2 or fig. 4 described above.
It should be understood, however, that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

Claims (15)

1. An article identification method, comprising:
extracting a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words at least comprise disease words;
constructing a medical knowledge graph of the target medical article by adopting the plurality of medical words, wherein the medical knowledge graph comprises a plurality of nodes; one node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
calculating the importance of the medical words recorded by each node based on the connection relation among the nodes in the medical knowledge graph;
and selecting key disease words of the target medical article from the medical word set according to the importance of each medical word, and constructing a key topic vector of the target medical article by using word vectors of the key disease words, wherein the key topic vector is used for indicating key disease topics of the target medical article.
2. The method of claim 1, wherein the constructing a key topic vector for the target medical article using the word vector for the key disease word comprises:
acquiring related non-disease words corresponding to the key disease words from the medical word set, wherein the related non-disease words meet the following conditions: in the medical knowledge graph, nodes for recording the related non-disease words are connected with nodes for recording the key disease words;
acquiring word vectors of the key disease words and word vectors of the related non-disease words;
and fusing the word vector of the key disease word and the word vector of the related non-disease word to obtain the key topic vector of the target medical article.
3. The method of claim 1 or 2, wherein the selecting the key disease word of the target medical article from the set of medical words according to the importance of each medical word comprises:
selecting a plurality of candidate keywords of the target medical article from the medical word set according to the importance of each medical word and a keyword selection strategy; the candidate keywords comprise at least one candidate disease word;
and selecting the candidate disease word with the maximum importance degree from the at least one candidate disease word as a key disease word of the target medical article.
4. The method of claim 3, wherein the method further comprises:
obtaining word vectors of all candidate keywords, and calculating the vector similarity between the word vectors of all candidate keywords and the key topic vectors;
selecting article keywords of the target medical article from the candidate keywords according to the vector similarity between the word vector of each candidate keyword and the key topic vector; wherein the vector similarity between the word vector of the article keyword and the key topic vector is greater than a similarity threshold;
and performing associated storage on the target medical article and the article keywords so as to perform business processing on the target medical article according to the article keywords.
5. The method of claim 3, wherein said selecting a plurality of candidate keywords of the target medical article from the medical word set according to the importance of each medical word and a keyword selection strategy comprises:
selecting, in descending order of importance, a preset number of medical words from the medical word set as the plurality of candidate keywords of the target medical article; or,
selecting the medical words whose importance is greater than an importance threshold from the medical word set as the plurality of candidate keywords of the target medical article.
6. The method of claim 3, wherein the key topic vector is a dominant topic vector of the target medical article, the key disease topic being a main disease topic of the target medical article; the method further comprises the following steps:
selecting a reference disease word of the target medical article from the at least one candidate disease word, wherein the importance of the reference disease word is less than that of the key disease word;
constructing a subordinate topic vector of the target medical article by using the word vector of the reference disease word, wherein the subordinate topic vector of the target medical article is used for indicating a subordinate disease topic of the target medical article;
and storing the target medical article, the key topic vector and the subordinate topic vector into a storage space in an associated manner, so that when an article search request exists, article search processing is carried out according to the key topic vector and the subordinate topic vector.
7. The method of claim 6, wherein the storage space further comprises at least one other medical article, and each other medical article has a corresponding dominant topic vector and a corresponding subordinate topic vector; the method further comprises the following steps:
when an article searching request exists, acquiring an information vector of article searching information carried by the article searching request;
acquiring each medical article in the storage space and at least one topic vector of each medical article; wherein each medical article has a recommendation weight value, and the at least one topic vector of each medical article comprises the dominant topic vector and the subordinate topic vector of that medical article;
respectively calculating the matching degree between each topic vector of each medical article and the information vector, and updating the recommendation weight value of each medical article according to the calculated matching degree;
arranging the medical articles in descending order of their updated recommendation weight values, and selecting the first-ranked medical article as the medical article to be recommended and outputting it.
8. The method of claim 7, wherein said updating the recommendation weight value of each medical article according to the calculated matching degree comprises:
for any medical article, determining the topic vector with the maximum matching degree from the at least one topic vector of the medical article according to the matching degree between each topic vector of the medical article and the information vector;
if the topic vector with the maximum matching degree is the dominant topic vector of the medical article, increasing the recommendation weight value of the medical article;
and if the topic vector with the maximum matching degree is the subordinate topic vector of the medical article, reducing the recommendation weight value of the medical article.
9. The method of claim 1 or 2, wherein said constructing a medical knowledge graph of the target medical article using the plurality of medical words comprises:
constructing an initial knowledge graph of the target medical article by adopting the plurality of medical words, wherein the initial knowledge graph comprises a plurality of nodes, and each node records one medical word;
selecting at least one co-occurring word pair from the plurality of medical words, wherein a co-occurring word pair is a word pair formed by two medical words having a co-occurrence relation in the target medical article;
determining at least one node group from the initial knowledge graph according to the at least one co-occurring word pair, wherein any node group comprises the two nodes that respectively record the two medical words of one co-occurring word pair;
and connecting the two nodes in each node group in the initial knowledge graph to obtain the medical knowledge graph of the target medical article.
10. The method of claim 9, wherein said selecting at least one co-occurring word pair from said plurality of medical words comprises:
determining a first distribution position of the first medical word in the target medical article, wherein the first medical word is any one of the plurality of medical words;
acquiring a second medical word from the plurality of medical words according to a first distribution position of the first medical word, wherein the distance between a second distribution position of the second medical word in the target medical article and the first distribution position is smaller than a position distance threshold value;
calculating a semantic distance value between the first medical word and the second medical word, the semantic distance value being indicative of a semantic similarity between the first medical word and the second medical word;
if the semantic distance value between the first medical word and the second medical word is greater than a semantic threshold, determining that the first medical word and the second medical word have the co-occurrence relation in the target medical article, and constructing a co-occurring word pair from the first medical word and the second medical word.
11. The method of claim 1 or 2, wherein calculating the importance of the medical word recorded by each node based on the connection relationship between each node in the medical knowledge graph comprises:
for the medical word recorded by any node, determining at least one associated node connected to that node based on the connection relation among the nodes in the medical knowledge graph;
calculating a semantic distance value between the medical word recorded by that node and the medical word recorded by each associated node;
and calculating the importance of the medical word recorded by that node according to the calculated semantic distance values and the importance of the medical word recorded by each associated node.
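Claim 11 states that the importance of a word depends on the semantic distance values to its associated nodes and on their importance, but gives no formula here. One way to realise such a dependency is a weighted TextRank-style iteration, sketched below. The update rule, the damping factor 0.85, the iteration count, and the name `rank_importance` are all assumptions of this sketch and are not taken from the claims.

```python
def rank_importance(graph, similarity, damping=0.85, iterations=50):
    """graph: word -> set of neighbours; similarity: frozenset({a, b}) -> edge weight."""
    scores = {word: 1.0 for word in graph}
    # Total edge weight at each node, used to normalise what it passes on.
    out_weight = {
        word: sum(similarity[frozenset((word, n))] for n in neighbours)
        for word, neighbours in graph.items()
    }
    for _ in range(iterations):
        updated = {}
        for word, neighbours in graph.items():
            # Each associated node contributes its importance, scaled by the
            # share of its total edge weight that points at this word.
            received = sum(
                similarity[frozenset((word, n))] / out_weight[n] * scores[n]
                for n in neighbours
                if out_weight[n] > 0
            )
            updated[word] = (1 - damping) + damping * received
        scores = updated
    return scores


# Hypothetical sample data.
graph = {
    "diabetes": {"insulin", "blood sugar"},
    "insulin": {"diabetes"},
    "blood sugar": {"diabetes"},
}
similarity = {
    frozenset(("diabetes", "insulin")): 0.9,
    frozenset(("diabetes", "blood sugar")): 0.8,
}
scores = rank_importance(graph, similarity)
```

In this example the hub word "diabetes" receives the highest score, and "insulin" ranks above "blood sugar" because its edge to the hub carries the larger weight.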
12. The method of claim 1 or 2, wherein extracting the set of medical words from the target medical article to be identified comprises:
performing word segmentation processing on a target medical article to be recognized to obtain an initial word set, wherein the initial word set comprises a plurality of initial words;
screening a plurality of intermediate words from the initial word set according to at least one medical dictionary, wherein the intermediate words refer to the initial words existing in the at least one medical dictionary;
and constructing the medical word set of the target medical article using the plurality of intermediate words.
13. The method of claim 12, wherein each initial word in the set of initial words has a part of speech; the screening out a plurality of intermediate words from the initial word set according to at least one medical dictionary comprises:
screening initial words of a target part of speech from the initial word set, wherein the target part of speech comprises at least one of the following items: nouns, verbs, and adjectives;
and screening a plurality of intermediate words from the initial words of the target part of speech according to at least one medical dictionary.
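The screening described in claims 12 and 13 can be sketched as below. The sketch assumes that word segmentation and part-of-speech tagging have already been performed by some external tokenizer, so the input is a list of (word, part of speech) tuples. The tag names, the function name `screen_intermediate_words`, and the sample dictionaries are hypothetical.

```python
TARGET_POS = {"noun", "verb", "adjective"}


def screen_intermediate_words(tagged_tokens, dictionaries, target_pos=TARGET_POS):
    """tagged_tokens: list of (word, part_of_speech); dictionaries: name -> set of words.

    Returns (position, word, dictionary_names) for each initial word that has a
    target part of speech and exists in at least one medical dictionary.
    """
    intermediate = []
    for position, (word, pos) in enumerate(tagged_tokens):
        # Part-of-speech screening (claim 13).
        if pos not in target_pos:
            continue
        # Dictionary screening (claim 12): remember which dictionaries matched.
        sources = {name for name, entries in dictionaries.items() if word in entries}
        if sources:
            intermediate.append((position, word, sources))
    return intermediate


# Hypothetical sample data.
tokens = [
    ("the", "determiner"),
    ("patient", "noun"),
    ("has", "verb"),
    ("diabetes", "noun"),
    ("and", "conjunction"),
    ("thirst", "noun"),
]
dictionaries = {"disease": {"diabetes"}, "symptom": {"thirst"}}
intermediate = screen_intermediate_words(tokens, dictionaries)
```

Keeping the position and the matching dictionary names alongside each word is a design choice of this sketch; it makes later steps that depend on adjacency or on dictionary membership straightforward.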
14. The method of claim 12, wherein said constructing a medical word set of said target medical article using said plurality of intermediate words comprises:
if intermediate words meeting a bonding condition exist among the plurality of intermediate words, bonding the intermediate words that meet the bonding condition, and adding the bonded words and the intermediate words that were not bonded to the medical word set of the target medical article as medical words;
if no intermediate words meeting the bonding condition exist among the plurality of intermediate words, adding all the intermediate words to the medical word set of the target medical article as medical words;
wherein the bonding condition comprises: the intermediate words are at adjacent distribution positions in the target medical article and exist in the same medical dictionary.
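The bonding step of claim 14 can be sketched as follows. The sketch assumes that each intermediate word is given with its token position and the names of the medical dictionaries that contain it, and that bonding means concatenating the words. The function name `bond_intermediate_words`, the `joiner` parameter, and the sample data are hypothetical.

```python
def bond_intermediate_words(intermediate, joiner=" "):
    """intermediate: list of (position, word, dictionary_names), sorted by position."""
    medical_words = []
    current = None  # (last_position, bonded_text, dictionaries shared so far)
    for position, word, sources in intermediate:
        if current is not None:
            last_position, text, shared = current
            common = shared & set(sources)
            # Bonding condition: adjacent positions and a common dictionary.
            if position == last_position + 1 and common:
                current = (position, text + joiner + word, common)
                continue
            medical_words.append(text)
        current = (position, word, set(sources))
    if current is not None:
        medical_words.append(current[1])
    return medical_words


# Hypothetical sample data.
intermediate = [
    (3, "heart", {"disease"}),
    (4, "failure", {"disease"}),
    (9, "aspirin", {"drug"}),
]
medical_words = bond_intermediate_words(intermediate)
```

For text written without spaces between words, such as Chinese, an empty `joiner` would be the natural choice. When no words meet the bonding condition, every intermediate word is returned unchanged.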
15. An article recognition apparatus, comprising:
an extraction unit, configured to extract a medical word set from a target medical article to be identified, wherein the medical word set comprises a plurality of medical words, and the plurality of medical words comprise at least a disease word;
a construction unit, configured to construct a medical knowledge graph of the target medical article using the plurality of medical words, wherein the medical knowledge graph comprises a plurality of nodes, each node records one medical word, and the medical words recorded by any two connected nodes have a co-occurrence relation in the target medical article;
a processing unit, configured to calculate the importance of the medical word recorded by each node based on the connection relation among the nodes in the medical knowledge graph;
wherein the processing unit is further configured to select a key disease word of the target medical article from the medical word set according to the importance of each medical word, and to construct a key topic vector of the target medical article using a word vector of the key disease word, the key topic vector indicating a key disease topic of the target medical article.
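The final step performed by the processing unit, selecting the key disease word and building the key topic vector, can be sketched as below. The claim does not state how many key disease words are chosen or how their word vectors are combined, so this sketch assumes the top `top_k` disease words by importance are chosen and their vectors are averaged. The function name `key_topic_vector` and all sample values are hypothetical.

```python
def key_topic_vector(importance, disease_words, word_vectors, top_k=1):
    """Return (key disease words, averaged word vector) or ([], None) if none qualify."""
    # Only disease words with a known word vector are candidates.
    candidates = [w for w in importance if w in disease_words and w in word_vectors]
    key_words = sorted(candidates, key=lambda w: importance[w], reverse=True)[:top_k]
    if not key_words:
        return [], None
    dim = len(word_vectors[key_words[0]])
    # Assumed combination rule: element-wise mean of the selected word vectors.
    vector = [
        sum(word_vectors[w][i] for w in key_words) / len(key_words)
        for i in range(dim)
    ]
    return key_words, vector


# Hypothetical sample data.
importance = {"diabetes": 1.46, "insulin": 0.81, "hypertension": 0.40}
disease_words = {"diabetes", "hypertension"}
word_vectors = {
    "diabetes": [1.0, 0.0],
    "insulin": [0.5, 0.5],
    "hypertension": [0.0, 1.0],
}
key_words, topic = key_topic_vector(importance, disease_words, word_vectors)
```

With `top_k=1` the key topic vector is simply the word vector of the most important disease word, which matches the simplest reading of the claim.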
CN202011213480.1A 2020-11-03 2020-11-03 Article identification method and device, computer equipment and storage medium Pending CN112214580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011213480.1A CN112214580A (en) 2020-11-03 2020-11-03 Article identification method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011213480.1A CN112214580A (en) 2020-11-03 2020-11-03 Article identification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112214580A true CN112214580A (en) 2021-01-12

Family

ID=74058115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011213480.1A Pending CN112214580A (en) 2020-11-03 2020-11-03 Article identification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112214580A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230070715A1 (en) * 2021-09-09 2023-03-09 Canon Medical Systems Corporation Text processing method and apparatus

Similar Documents

Publication Publication Date Title
El-Beltagy et al. KP-Miner: A keyphrase extraction system for English and Arabic documents
US20190332671A1 (en) Methods, devices, and systems for constructing intelligent knowledge base
US10535106B2 (en) Selecting user posts related to trending topics on online social networks
KR101339103B1 (en) Document classifying system and method using semantic feature
KR101100830B1 (en) Entity searching and opinion mining system of hybrid-based using internet and method thereof
CN111539197B (en) Text matching method and device, computer system and readable storage medium
CN106940726B (en) Creative automatic generation method and terminal based on knowledge network
WO2015009620A1 (en) Systems and methods for keyword determination and document classification from unstructured text
CN110569920B (en) Prediction method for multi-task machine learning
WO2013151546A1 (en) Contextually propagating semantic knowledge over large datasets
WO2023029506A1 (en) Illness state analysis method and apparatus, electronic device, and storage medium
CN109992674B (en) Recommendation method fusing automatic encoder and knowledge graph semantic information
CN112559684A (en) Keyword extraction and information retrieval method
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN111783903A (en) Text processing method, text model processing method and device and computer equipment
CN113515589A (en) Data recommendation method, device, equipment and medium
Maidel et al. Ontological content‐based filtering for personalised newspapers: A method and its evaluation
KR101074820B1 (en) Recommendation searching system using internet and method thereof
CN114818724A (en) Construction method of social media disaster effective information detection model
KR20120003834A (en) Entity searching and opinion mining system of hybrid-based using internet and method thereof
Kharrat et al. Recommendation system based contextual analysis of Facebook comment
CN112214580A (en) Article identification method and device, computer equipment and storage medium
CN113571196A (en) Method and device for constructing medical training sample and method for retrieving medical text
KR102454261B1 (en) Collaborative partner recommendation system and method based on user information
CN114496231A (en) Constitution identification method, apparatus, equipment and storage medium based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK; Ref legal event code: DE; Ref document number: 40037778; Country of ref document: HK

SE01 Entry into force of request for substantive examination