WO2021139262A1 - Method and apparatus for document MeSH term aggregation, computer device, and readable storage medium - Google Patents

Publication number
WO2021139262A1
WO2021139262A1 PCT/CN2020/118699 CN2020118699W WO2021139262A1 WO 2021139262 A1 WO2021139262 A1 WO 2021139262A1 CN 2020118699 W CN2020118699 W CN 2020118699W WO 2021139262 A1 WO2021139262 A1 WO 2021139262A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
phrase
noun
citation
similarity
Application number
PCT/CN2020/118699
Other languages
English (en)
Chinese (zh)
Inventor
柴玲
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139262A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • This application relates to the field of digital medical technology, and in particular to a method, device, computer equipment, and computer-readable storage medium for document subject word aggregation.
  • The selected topic-representative words often contain a large number of synonymous or near-synonymous professional terms, which makes the information redundant and inaccurate. For example, variants such as "non-small cell lung cancer", "non-small cell cancer", "non-small cell lung cancer cells", and "human non-small cell lung cancer" should all be standardized to the same subject term, "non-small cell lung cancer".
  • This application provides a method, device, computer equipment, and computer-readable storage medium for document subject word aggregation, which can solve the problem of low accuracy of document subject word aggregation in traditional technology.
  • This application provides a method for aggregating document subject words, the method comprising: obtaining document data, the document data including the document title and document abstract of each document and the citation information corresponding to each document; using a preset natural language processing tool to extract the noun phrases contained in the document title and the document abstract; clustering the noun phrases based on the citation information and the noun phrases to obtain a set of synonyms; and selecting the target noun phrase with the highest word frequency from the set of synonyms as the subject word of the document.
  • This application also provides a document subject word aggregation device, including: an acquisition unit for acquiring document data, the document data including the document title and document abstract of each document and the citation information corresponding to each document; an extraction unit for extracting the noun phrases contained in the document title and the document abstract by using a preset natural language processing tool; a clustering unit for clustering the noun phrases based on the citation information and the noun phrases to obtain a set of synonyms; and a screening unit for selecting the target noun phrase with the highest word frequency from the set of synonyms as the subject word of the document.
  • This application also provides a computer device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the following steps when running the computer program: acquiring document data, the document data including the document title and document abstract of each document and the citation information corresponding to each document; using a preset natural language processing tool to extract the noun phrases contained in the document title and the document abstract; clustering the noun phrases based on the citation information and the noun phrases to obtain a set of synonyms; and selecting the target noun phrase with the highest word frequency from the set of synonyms as the subject word of the document.
  • This application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor implements the following steps: acquiring document data, the document data including the document title and document abstract of each document and the citation information corresponding to each document; using a preset natural language processing tool to extract the noun phrases contained in the document title and the document abstract; clustering the noun phrases based on the citation information and the noun phrases to obtain a set of synonyms; and selecting the target noun phrase with the highest word frequency from the set of synonyms as the subject word of the document.
  • The embodiment of this application obtains document data including the document title and document abstract of each document and the citation information corresponding to each document, uses a preset natural language processing tool to extract the noun phrases contained in the document title and the document abstract, clusters the noun phrases based on the citation information and the noun phrases to obtain a set of synonyms, and selects the target noun phrase with the highest word frequency from the set of synonyms as the subject word of the document.
  • Because noun phrases are combined with citation information, a phrase-level synonym processing method is applied to the document mining scenario, and citation information is used together with the phrases to represent the similarity of noun phrases. Traditional technology, by contrast, characterizes similarity using only word-level semantic similarity and considers only sentence-level information.
  • The characterization method in the embodiment of this application more fully represents the similarity between the topics represented by two noun phrases, so that the aggregated subject words accurately describe the subject of the document, which improves the accuracy of document subject word aggregation.
  • FIG. 1 is a schematic flowchart of a method for aggregation of document subject words provided by an embodiment of the application;
  • FIG. 2 is a schematic diagram of a sub-process in the method for document subject word aggregation provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of an example of a document co-citation network in a document subject word aggregation method provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of another sub-flow of the method for document subject word aggregation provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of an aggregation process of a method for aggregation of document subject words provided by an embodiment of the application;
  • FIG. 6 is a schematic block diagram of a document subject word aggregation device provided by an embodiment of the application;
  • FIG. 7 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic flowchart of a method for aggregation of document subject words provided by an embodiment of the application. As shown in FIG. 1, the method includes steps S101-S104, which in order are: obtaining document data, extracting noun phrases, clustering the noun phrases into a set of synonyms, and selecting the subject word of the document.
  • The document data corresponding to the documents can be retrieved from a preset database by keyword search.
  • The document data includes the document title and document abstract of each document and the citation information corresponding to each document.
  • The citation information is the mutual citation relationship between the documents.
  • The PubMed database is generally used to search for documents. By searching keywords, the document titles, document abstracts, and mutual citation relationships of the documents in a specific field can be obtained from PubMed; for example, searching for "lung cancer" retrieves the titles, abstracts, and citation relationships of articles related to lung cancer.
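  • As an illustration of the document data described above, the following minimal sketch holds a title, an abstract, and citation information per document, and inverts the citation relation so that each document's set of citing documents can be looked up. The class name, field names, and sample records are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """One retrieved record; all field names here are illustrative assumptions."""
    doc_id: str
    title: str
    abstract: str
    cites: set = field(default_factory=set)  # ids of the documents this one cites


def citing_sets(documents):
    """Invert the 'cites' relation: map each cited id to the set of ids citing it."""
    cited_by = {}
    for doc in documents:
        for target in doc.cites:
            cited_by.setdefault(target, set()).add(doc.doc_id)
    return cited_by


# Toy records: C, D and E cite A; C and D also cite B.
docs = [
    Document("C", "title C", "abstract C", {"A", "B"}),
    Document("D", "title D", "abstract D", {"A", "B"}),
    Document("E", "title E", "abstract E", {"A"}),
]
cited_by = citing_sets(docs)
```

  • The inverted mapping is the form that a co-citation computation needs, since co-citation compares the sets of documents that cite two given documents.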
  • natural language processing tools include Stanford nlp, TextBlob, Polyglot and other natural language processing tools that can extract noun phrases.
  • The specific process of extracting noun phrases is as follows: 1) Use a preset natural language processing tool to extract the noun phrases contained in the document title and the document abstract, for example, use Stanford NLP to extract noun phrases. 2) Filter the extracted phrases by deleting the most frequent words and phrases, for example those among the 2000 most frequent words in the Wikipedia corpus. A phrase such as "cancer" is deleted so that high-frequency general vocabulary does not affect the aggregation of subject words.
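  • The filtering in step 2) can be sketched as follows. This is only one possible reading: it assumes that a candidate phrase is dropped when every token in it belongs to the high-frequency word list, and it uses a tiny made-up list in place of the 2000 most frequent Wikipedia words. The function name and sample phrases are hypothetical.

```python
def filter_phrases(phrases, common_words):
    """Drop candidate phrases that consist only of high-frequency general words."""
    kept = []
    for phrase in phrases:
        tokens = phrase.lower().split()
        # Keep the phrase if at least one token is not a common word.
        if tokens and not all(tok in common_words for tok in tokens):
            kept.append(phrase)
    return kept


# Stand-in for a real high-frequency word list (assumption for illustration).
common_words = {"cancer", "study", "result"}
candidates = ["cancer", "non-small cell lung cancer", "study result"]
filtered = filter_phrases(candidates, common_words)
```

  • Under this reading, the general phrase "cancer" is removed while the specific phrase "non-small cell lung cancer" is kept, even though it contains a common word.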
  • Based on the citation information, a co-citation network between the documents is constructed, and the co-citation relationship between the noun phrases is obtained from the co-citation network between the documents. The semantic similarity of the noun phrases is then computed, so that the noun phrases can be clustered according to both the co-citation relationship and the semantic similarity between them to obtain a set of synonyms.
  • The target noun phrase that meets the requirements is selected from the synonym set and used as the subject word of the document. For example, the noun phrase with the highest frequency in the synonym set is taken as the target noun phrase, which gives the subject word of the document.
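  • The selection of the highest-frequency phrase from a synonym set can be sketched as below. The counts are invented for illustration, and breaking ties alphabetically is an assumption made only so that the result is deterministic.

```python
from collections import Counter


def pick_subject_word(synonym_set, phrase_counts):
    """Return the phrase in the synonym set with the highest frequency."""
    # Ties are broken alphabetically (an assumption, for determinism only).
    return max(synonym_set, key=lambda p: (phrase_counts[p], p))


# Made-up corpus frequencies for three synonymous phrases.
phrase_counts = Counter({
    "non-small cell lung cancer": 12,
    "non-small cell cancer": 3,
    "human non-small cell lung cancer": 2,
})
subject_word = pick_subject_word(set(phrase_counts), phrase_counts)
```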
  • A phrase-level synonym processing method is used for the document mining scenario, and citation information is combined to characterize the similarity of noun phrases. Traditional technology characterizes similarity using only word-level semantic similarity and considers only sentence-level information. The characterization method of the embodiment of the present application more fully represents the similarity between the topics represented by two noun phrases, so that the aggregated subject words accurately describe the subject of the document, which improves the accuracy of document subject word aggregation.
  • FIG. 2 is a schematic diagram of a sub-process in the method for aggregation of document subject words provided by an embodiment of the application.
  • the step of clustering the noun phrases based on the citation information and the noun phrases to obtain a set of synonyms includes:
  • The semantic similarity describes how similar two noun phrases are in meaning. It can be measured by cosine similarity, Euclidean distance, or Minkowski distance.
  • The similarity between two noun phrases can be computed by vectorizing the phrases and then quantifying the similarity between the resulting vectors.
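  • For reference, the standard definitions of the measures named above are given below for two n-dimensional phrase vectors u and v. The symbols u, v, n, and p are introduced here for illustration only. The Euclidean distance is the Minkowski distance with p = 2. Note that cosine is a similarity (larger means more alike), while the other two are distances (smaller means more alike).

```latex
\[
\cos(\mathbf{u},\mathbf{v})
  = \frac{\mathbf{u}\cdot\mathbf{v}}{\lVert\mathbf{u}\rVert_2\,\lVert\mathbf{v}\rVert_2},
\qquad
d_p(\mathbf{u},\mathbf{v})
  = \Bigl(\sum_{k=1}^{n} \lvert u_k - v_k\rvert^{p}\Bigr)^{1/p},
\qquad
d_2(\mathbf{u},\mathbf{v}) = \lVert\mathbf{u}-\mathbf{v}\rVert_2 .
\]
```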
  • The step of establishing the semantic similarity of the noun phrases includes: inputting each noun phrase into a preset BioBERT model to obtain the semantic vector corresponding to the noun phrase; and calculating the cosine similarity between the semantic vectors to obtain the semantic similarity corresponding to the noun phrases.
  • BioBERT is a BERT model trained on a large medical corpus, which can effectively represent the semantics of medical words and phrases. The extracted noun phrases are input into the BioBERT model to obtain semantic vector representations at the noun phrase level.
  • Each noun phrase is transformed into a 768-dimensional vector, and the cosine similarity between the vectors is then computed to obtain the semantic similarity between phrases.
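  • The cosine step can be sketched as follows. The three-dimensional vectors are made-up stand-ins for the 768-dimensional BioBERT vectors, since running the model itself is outside the scope of this sketch; the function names and phrases are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarity for a {phrase: vector} mapping."""
    names = list(vectors)
    return {(x, y): cosine_similarity(vectors[x], vectors[y])
            for x in names for y in names}


# Toy vectors standing in for phrase embeddings (assumption for illustration).
vectors = {
    "non-small cell lung cancer": [0.9, 0.1, 0.3],
    "non-small cell cancer": [0.8, 0.2, 0.3],
    "blood pressure": [0.1, 0.9, 0.0],
}
sim = similarity_matrix(vectors)
```

  • With these toy vectors, the two lung cancer phrases score much closer to each other than either does to "blood pressure", which is the behaviour the semantic similarity is meant to capture.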
  • Further, a deep learning model can be trained on phrase-level data, starting from the pre-trained BioBERT model, to represent both the contextual features and the semantic information of the noun phrase itself. Combining phrase-level synonym processing in this way lets the similarity at the noun phrase level fully characterize the similarity between the topics represented by two noun phrases, and improves the accuracy of the semantic similarity computed for the extracted noun phrases.
  • Co-citing documents are documents that cite the same document, and co-cited documents are documents that are cited by the same document; the latter relationship is referred to as co-citation.
  • From the citation information, the corresponding document citation network can be constructed. Viewed from the cited documents, it is a cited-document network; the network of co-citation relationships among multiple documents is the document co-citation network.
  • Figure 3 is a schematic diagram of a document co-citation network example in the document keyword aggregation method provided by the embodiment of this application.
  • A and B in Figure 3 are both cited by C, so there is a similarity between A and B. Therefore, a co-citation network of the documents citing A and B can be constructed to obtain the relationship between A and B.
  • Likewise, the co-citation network of documents C, D, and E, which cite document A, is constructed.
  • In the document co-citation network, each node is a document and the weight of an edge is the co-citation similarity between its two nodes. For example, if two documents A1 and A2 share citing documents, the intersection of their citation sets can be used to measure the similarity between A1 and A2. In Figure 3, A is cited by C, D, and E, B is cited by C and D, and A and B are both cited by C and D. The statistical measure of the similarity between two such documents is called the co-citation similarity.
  • In the calculation of the co-citation similarity, M and N represent the sets of documents citing document i and document j, respectively.
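  • The formula itself does not survive in this text. One form that is consistent with the surrounding definitions (citing sets M and N, a "citation intersection", and a similarity of 1 between a document and itself) is the Jaccard index shown below. This is an assumption offered for illustration, not a statement of the application's own formula; other set-overlap measures would fit the same description.

```latex
\[
\operatorname{sim}_{\mathrm{doc}}(i,j)
  = \frac{\lvert M \cap N \rvert}{\lvert M \cup N \rvert},
\qquad
\operatorname{sim}_{\mathrm{doc}}(i,i) = 1 .
\]
```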
  • Take Fig. 3 as an example. In Fig. 3 there are five documents in total, A, B, C, D, and E. The direction of an arrow indicates the citation relationship, and the dotted line between A and B marks A and B as co-cited documents. For example, the arrow from C to A indicates that document C cites document A.
  • The three documents C, D, and E all cite document A, so the citing document set of A is {C, D, E}. In the same way, the citing document set of B is {C, D}. The co-citation similarity between A and B is computed from these two citing sets.
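  • As a worked example on the Fig. 3 data, the sketch below computes an overlap score from the two citing sets. It assumes the Jaccard index (size of the intersection divided by size of the union), which is one measure consistent with the description but is not confirmed by this text; the function name is hypothetical.

```python
from fractions import Fraction


def cocitation_similarity(citing_i, citing_j):
    """Jaccard overlap of two citing sets (assumed measure, see lead-in)."""
    union = citing_i | citing_j
    if not union:
        return Fraction(0)
    return Fraction(len(citing_i & citing_j), len(union))


# Citing sets read off Fig. 3 as described in the text.
citing = {"A": {"C", "D", "E"}, "B": {"C", "D"}}
sim_ab = cocitation_similarity(citing["A"], citing["B"])
```

  • Under this assumed measure, A and B share two citing documents out of three distinct ones, so the score is 2/3.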
  • Because noun phrases describe the documents they are extracted from, a co-citation similarity between noun phrases can be established from the co-citation similarity between documents.
  • By default, the co-citation similarity between document i and itself is 1.
  • The co-citation similarity of the documents is used to characterize the similarity between the noun phrases extracted from those documents.
  • In the calculation of the phrase co-citation similarity, X and Y represent the sets of documents containing phrases x and y, respectively. This yields a co-citation similarity network between phrases. The phrase co-citation similarity corresponding to each noun phrase is obtained from the phrase co-citation similarity network, and the noun phrases are then clustered based on the phrase co-citation similarity and the semantic similarity to obtain a set of synonyms. The embodiment of this application thus introduces a phrase co-citation similarity derived from the document co-citation similarity, which better captures the professional knowledge of the specialized field the phrases belong to and better expresses the similarity between topics.
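  • The phrase-level formula is not reproduced in this text. The sketch below shows one way to lift a document similarity to phrases: it averages the document co-citation similarity over all pairs of documents drawn from X and Y, and treats a document as having similarity 1 with itself, as stated above. Both the averaging and the Jaccard document measure are assumptions made for illustration, and all names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def doc_similarity(i, j, citing):
    """Document co-citation similarity; a document is fully similar to itself."""
    if i == j:
        return 1.0
    return jaccard(citing.get(i, set()), citing.get(j, set()))


def phrase_similarity(docs_x, docs_y, citing):
    """Mean document similarity over all (i, j) with i in X and j in Y (assumed form)."""
    pairs = [(i, j) for i in docs_x for j in docs_y]
    if not pairs:
        return 0.0
    return sum(doc_similarity(i, j, citing) for i, j in pairs) / len(pairs)


citing = {"A": {"C", "D", "E"}, "B": {"C", "D"}}
# Phrase x occurs only in document A; phrase y occurs in documents A and B.
score = phrase_similarity({"A"}, {"A", "B"}, citing)
```

  • Here the pair (A, A) contributes 1 and the pair (A, B) contributes 2/3, so the averaged score is 5/6 under these assumptions.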
  • FIG. 4 is a schematic diagram of another sub-process of the method for aggregation of document subject words provided by an embodiment of the application.
  • the method before the step of clustering the noun phrases according to the phrase co-citation similarity and the semantic similarity to obtain a set of synonyms, the method further includes:
  • Community detection finds the closely connected parts of a network. These parts are called communities: the connections inside a community are dense, and the connections between communities are sparse.
  • community detection algorithms include Louvain algorithm, Newman fast algorithm, CNM algorithm and MSG-MV algorithm.
  • community detection is performed through a preset community detection algorithm, so that phrases are clustered according to the similarity network to obtain several communities, and each community contains similar words.
  • the phrase co-citation network is clustered into small communities, and the Louvain community detection algorithm is used to perform community mining on the obtained phrase co-citation similarity network, and finally a series of communities (Clusters) are obtained.
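  • The Louvain algorithm searches for a partition that maximizes modularity, a score that is high when edges inside communities are heavy and edges between communities are light. The sketch below does not implement Louvain itself; it only computes the weighted modularity of a given partition, to show what "dense inside, sparse between" means numerically. The toy graph, weights, and function name are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Weighted modularity of a partition of an undirected graph.

    edges: {(u, v): weight}, each undirected edge listed once.
    communities: list of sets of nodes covering every node exactly once.
    """
    total = sum(edges.values())                      # total edge weight m
    label = {n: idx for idx, com in enumerate(communities) for n in com}
    inside = defaultdict(float)                      # edge weight within each community
    strength = defaultdict(float)                    # summed node strength per community
    for (u, v), w in edges.items():
        strength[label[u]] += w
        strength[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    # Q = sum over communities of (L_c / m) - (d_c / 2m)^2
    return sum(inside[c] / total - (strength[c] / (2 * total)) ** 2
               for c in strength)


# Two tight triangles of phrases joined by one weak edge.
edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
         ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
         ("c", "d"): 0.1}
q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

  • Splitting the graph at the weak edge scores well above zero, while lumping every phrase into one community scores zero, so a modularity-maximizing method would prefer the two-community split.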
  • Community detection is performed on the phrase co-citation similarity network using the preset community detection method to obtain a series of communities, and hierarchical clustering is then performed within each community to obtain the sets of synonyms. Synonym mining thus combines citation information and semantic information: a phrase similarity network is constructed from co-citation information, and community detection is run on that network.
  • Because the candidate sets of synonyms are recalled first, the amount of computation in the clustering step is greatly reduced. At the same time, the approach does not rely on annotated data or a specific corpus, generalizes well, better fits the topic mining scenario, and improves the accuracy of subject word screening.
  • the step of clustering the noun phrases according to the co-citation similarity of the phrases and the semantic similarity to obtain a set of synonyms includes:
  • Cluster analysis, also called clustering, is a statistical analysis method for studying classification problems (of samples or indicators), and it is also an important class of data mining algorithms.
  • A cluster is composed of several patterns. Usually, a pattern is a vector of measurements, or a point in a multi-dimensional space. Cluster analysis is based on similarity: patterns in the same cluster are more similar to each other than to patterns in different clusters.
  • Clustering algorithms include K-means clustering algorithm, Mean-Shift clustering and Expectation Maximization (EM) clustering based on Gaussian Mixture Model (GMM).
  • After the phrase co-citation similarity network has been partitioned by community detection, hierarchical clustering is performed within each community. It is assumed that synonyms only appear within the same community, so the phrases of each community are clustered separately.
  • Thresholds for the hierarchical clustering can be set for the phrase co-citation similarity and the semantic similarity, and phrases that are clustered together in both clusterings are considered synonyms.
  • Two bottom-up hierarchical clusterings are carried out: one uses the co-citation similarity of the noun phrases as its criterion, and the other uses the semantic similarity described above, for example the BioBERT-based semantic similarity.
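  • The double clustering described above can be sketched as follows. The sketch assumes single-linkage agglomerative clustering cut at a similarity threshold, which is equivalent to taking connected components of the graph whose edges have similarity at or above the threshold; the application does not name a linkage. The similarity tables, thresholds, and function names are made up for illustration.

```python
from itertools import combinations


def threshold_clusters(items, similarity, threshold):
    """Single-linkage clustering cut at `threshold`, via union-find. Returns {item: label}."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)
    return {x: find(x) for x in items}


def synonym_pairs(items, cocite_sim, semantic_sim, t_cocite, t_semantic):
    """Pairs that share a cluster in BOTH clusterings are treated as synonyms."""
    first = threshold_clusters(items, cocite_sim, t_cocite)
    second = threshold_clusters(items, semantic_sim, t_semantic)
    return [(a, b) for a, b in combinations(items, 2)
            if first[a] == first[b] and second[a] == second[b]]


def lookup(table):
    """Symmetric lookup over a {(a, b): value} table; missing pairs count as 0."""
    return lambda a, b: table.get((a, b), table.get((b, a), 0.0))


items = ["A", "B", "C"]
cocite = lookup({("A", "B"): 0.8, ("A", "C"): 0.7, ("B", "C"): 0.2})
semantic = lookup({("A", "B"): 0.9, ("A", "C"): 0.3, ("B", "C"): 0.1})
pairs = synonym_pairs(items, cocite, semantic, 0.5, 0.5)
```

  • In this toy community, C is grouped with A and B by co-citation but not by semantics, so only A and B are reported as synonyms. Intersecting the two clusterings avoids having to choose weights for a combined score.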
  • FIG. 5 is a schematic diagram of an example of the aggregation process of the document subject word aggregation method provided by an embodiment of the application.
  • In FIG. 5, the white circles represent noun phrases extracted from one document, the black circles represent noun phrases extracted from another document, and the gray circles represent noun phrases extracted from a third document; different circles represent different noun phrases. Because A and B are clustered together in both hierarchical clusterings, A and B can be merged as synonyms.
  • This embodiment proposes synonym mining that combines two types of information, citation information and semantic information. A phrase similarity network is constructed from co-citation information, and each phrase community is clustered twice: once according to the phrase co-citation similarity of the noun phrases, and once according to their semantic similarity.
  • Community detection on the phrase similarity network recalls the sets of possible synonyms, which greatly reduces the candidate range of synonyms.
  • If two noun phrases fall in the same cluster in both clusterings, they are judged to be synonyms and are combined to obtain a set of synonyms. Clustering separately by semantic similarity and by co-citation similarity, instead of the usual strategy of a weighted sum of different similarities, avoids the influence of similarity weights on the results and ensures that the obtained synonyms have both similar semantics and similar topics. The approach does not rely on labeled data or a specific corpus, generalizes well, better fits the topic mining scenario, and improves the accuracy of subject word selection.
  • The step of selecting the target noun phrase with the highest word frequency from the synonym set as the subject word of the document includes: selecting, according to a preset TF-IDF algorithm, the noun phrase with the highest TF-IDF value from the synonym set as the target noun phrase; and using the target noun phrase as the subject word of the document.
  • TF-IDF stands for term frequency-inverse document frequency. Term frequency (TF) is the number of times a given word appears in a document. This count is usually normalized (the numerator is generally smaller than the denominator, which distinguishes it from IDF) to prevent a bias toward long documents, since the same word may have a higher raw count in a long document than in a short one regardless of its importance.
  • The inverse document frequency (IDF) of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient. A high word frequency in a particular document combined with a low document frequency of the word across the whole collection produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep important ones.
  • the TF-IDF value is calculated, and the phrase with the highest TF-IDF value is selected as the standard subject word.
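  • A minimal sketch of the TF-IDF computation described above, with TF normalized by document length and IDF taken as the logarithm of the total number of documents divided by the number of documents containing the phrase. The toy phrase lists and the function name are made up; how the application aggregates the value over a corpus is not specified in this text.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of phrase lists. Returns one {phrase: tf-idf} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each phrase occurs.
    doc_freq = Counter(p for doc in documents for p in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({p: (c / total) * math.log(n_docs / doc_freq[p])
                       for p, c in counts.items()})
    return scores


docs = [
    ["lung cancer", "egfr mutation", "egfr mutation"],
    ["lung cancer", "chemotherapy"],
    ["lung cancer", "egfr mutation"],
]
scores = tf_idf(docs)
```

  • A phrase that appears in every document, such as "lung cancer" here, gets an IDF of log(1) = 0 and therefore a TF-IDF of 0, which is how common phrases are filtered out.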
  • Further, a weighted TF-IDF value can be used.
  • The number of times a document is cited is used as the importance index of the document, and the importance of each document is normalized to the range 0-1.
  • The importance of a phrase is the average importance of all documents in which the phrase appears. This importance is multiplied by the TF-IDF of the phrase to give the final TF-IDF value of each phrase.
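  • The weighting can be sketched as below. Min-max scaling is assumed for normalizing citation counts to 0-1, since the text does not say which normalization is used; the citation counts, the TF-IDF value, and the function names are invented for illustration.

```python
def normalized_importance(citation_counts):
    """Min-max scale citation counts to [0, 1] (scaling method is an assumption)."""
    low = min(citation_counts.values())
    high = max(citation_counts.values())
    span = high - low
    # If every document has the same count, treat them all as equally important.
    return {d: (c - low) / span if span else 1.0
            for d, c in citation_counts.items()}


def weighted_tf_idf(tf_idf_value, containing_docs, importance):
    """Scale a phrase's TF-IDF by the mean importance of the documents containing it."""
    mean_importance = (sum(importance[d] for d in containing_docs)
                       / len(containing_docs))
    return mean_importance * tf_idf_value


importance = normalized_importance({"d1": 0, "d2": 50, "d3": 100})
# A phrase with TF-IDF 0.4 that appears in documents d2 and d3.
final_value = weighted_tf_idf(0.4, ["d2", "d3"], importance)
```

  • In this example the two containing documents have importance 0.5 and 1.0, so the phrase's TF-IDF of 0.4 is scaled by 0.75.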
  • A preset community detection method is used to perform community detection and obtain a series of communities, hierarchical clustering is performed within each community to obtain the sets of synonyms, and the phrase with the largest TF-IDF value in each set of synonyms is used as the standard subject word.
  • A phrase similarity network based on co-citation information is constructed, and the phrase similarity network is used for community detection for the first time.
  • The obtained synonyms have both similar semantics and similar topics, which better fits the topic mining scenario. The approach does not rely on labeled data or a specific corpus, generalizes well, and improves the accuracy of subject word selection.
  • FIG. 6 is a schematic block diagram of a document subject word aggregation device provided by an embodiment of the application.
  • an embodiment of the present application also provides a document subject word aggregation device.
  • the document topic word aggregation device includes a unit for executing the above-mentioned document topic word aggregation method, and the document topic word aggregation device may be configured in a computer device.
  • the document subject word aggregation device 600 includes an acquisition unit 601, an extraction unit 602, a clustering unit 603 and a screening unit 604.
  • The acquiring unit 601 is configured to acquire document data, the document data including the document title and document abstract of each document and the citation information corresponding to each document;
  • the extracting unit 602 is configured to use a preset natural language processing tool to extract the noun phrases contained in the document title and the document abstract;
  • the clustering unit 603 is configured to cluster the noun phrases based on the citation information and the noun phrases to obtain a set of synonyms;
  • the screening unit 604 is configured to select the target noun phrase with the highest word frequency from the set of synonyms as the subject word of the document.
  • The clustering unit 603 includes: an establishing subunit for establishing the semantic similarity of the noun phrases according to the noun phrases; a first constructing subunit for constructing, according to the citation information, the document co-citation network corresponding to the documents; a first calculation subunit for calculating the document co-citation similarity corresponding to the documents according to the document co-citation network; a second constructing subunit for constructing, according to the document co-citation similarity, the phrase co-citation similarity network corresponding to the noun phrases; a first acquisition subunit for obtaining the phrase co-citation similarity corresponding to the noun phrases from the phrase co-citation similarity network; and a first clustering subunit for clustering the noun phrases according to the phrase co-citation similarity and the semantic similarity to obtain a set of synonyms.
  • The establishing subunit includes: an input subunit for inputting the noun phrases into a preset BioBERT model to obtain the semantic vectors corresponding to the noun phrases; and a second calculation subunit for calculating the cosine similarity between the semantic vectors to obtain the semantic similarity corresponding to the noun phrases.
  • the document topic word aggregation device 600 further includes: a detection unit, configured to perform community detection using a preset community detection method based on the phrase co-citation similarity network to obtain several phrase communities;
  • The first clustering subunit includes: a second clustering subunit, configured to cluster each phrase community according to the phrase co-citation similarity corresponding to the noun phrases to obtain the first clusters;
  • a third clustering subunit, configured to cluster each phrase community according to the semantic similarity corresponding to the noun phrases to obtain the second clusters;
  • a judgment subunit, configured to judge, for every two noun phrases, whether both are contained in the same first cluster and in the same second cluster;
  • a determination subunit, configured to determine that two noun phrases are synonyms if both are contained in the same first cluster and in the same second cluster, thereby obtaining synonymous phrases;
  • a combination subunit, configured to combine all the synonymous phrases into a set to obtain a set of synonyms.
  • The screening unit 604 includes: a screening subunit, configured to select the noun phrase with the highest TF-IDF value from the set of synonyms according to a preset TF-IDF algorithm as the target noun phrase; and a second acquiring subunit, configured to use the target noun phrase as the subject word of the document.
  • The document subject word aggregation device can be divided into different units as needed, or the units in the device can adopt different connection sequences and methods, to complete all or part of the functions of the above-mentioned document subject word aggregation device.
  • the above-mentioned document subject word aggregation apparatus may be implemented in the form of a computer program, and the computer program may run on the computer device as shown in FIG. 7.
  • FIG. 7 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • The computer device 700 may be a computer device such as a desktop computer or a server, or may be a component or part of another device.
  • the computer device 700 includes a processor 702, a memory, and a network interface 705 connected through a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704.
  • The non-volatile storage medium 703 can store an operating system 7031 and a computer program 7032. When the computer program 7032 is executed, the processor 702 can execute the above-mentioned method for aggregation of document subject words.
  • the processor 702 is used to provide calculation and control capabilities to support the operation of the entire computer device 700.
  • The internal memory 704 provides an environment for running the computer program 7032 stored in the non-volatile storage medium 703. When the computer program 7032 is executed by the processor 702, the processor 702 can execute the above-mentioned method for aggregation of document subject words.
  • the network interface 705 is used for network communication with other devices.
  • a specific computer device 700 may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
  • the computer device may include only a memory and a processor. In such an embodiment, the structure and function of the memory and the processor are consistent with the embodiment shown in FIG. 7 and are not repeated here.
  • the processor 702 is configured to run a computer program 7032 stored in a memory to implement the method for aggregation of document subject terms described in the embodiment of the present application.
  • the processor 702 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor is caused to execute the steps of the method for aggregating document subject words described in the above embodiments.
  • the storage medium is a physical, non-transitory storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or any other medium that can store a computer program.
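
The screening subunit described above selects, from a synonym set, the noun phrase with the highest TF-IDF value and uses it as the subject word. The Python sketch below illustrates one way this selection could be carried out. It is not taken from the application, which only refers to "a preset TF-IDF algorithm": the function name `select_subject_term`, the input layout (one list of noun phrases per document), and the scoring formula (term frequency summed over documents, multiplied by a smoothed inverse document frequency) are all assumptions made for illustration.

```python
import math
from collections import Counter


def select_subject_term(synonym_set, documents):
    """Return (best_phrase, scores) for the phrases in synonym_set.

    documents: list of lists of noun phrases, one inner list per document.
    Assumed scoring: sum over documents of tf(phrase, doc), multiplied by
    idf(phrase) = log((1 + N) / (1 + df)) + 1, where N is the number of
    documents and df is the number of documents containing the phrase.
    """
    if not synonym_set:
        raise ValueError("synonym_set must not be empty")

    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]

    scores = {}
    for phrase in synonym_set:
        # Document frequency: in how many documents the phrase occurs.
        df = sum(1 for c in counts if phrase in c)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        # Relative frequency of the phrase within each non-empty document.
        tf_sum = sum(c[phrase] / len(doc)
                     for c, doc in zip(counts, documents) if doc)
        scores[phrase] = tf_sum * idf

    # Iterate in sorted order so that ties are broken deterministically.
    best = max(sorted(scores), key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical noun phrases extracted from four documents.
    docs = [
        ["heart attack", "myocardial infarction", "heart attack"],
        ["myocardial infarction", "stent"],
        ["myocardial infarction", "cardiac infarction"],
        ["stent", "aspirin"],
    ]
    synonyms = {"heart attack", "myocardial infarction", "cardiac infarction"}
    target, all_scores = select_subject_term(synonyms, docs)
    print(target, all_scores)
```

With the sample data in this sketch, "myocardial infarction" receives the highest score, because it occurs in three of the four documents, and would therefore be used as the subject word. A different TF-IDF variant could rank the phrases differently.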

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a document MeSH term aggregation method and apparatus, a computer device, and a storage medium, relating to the technical field of digital healthcare. The method comprises: acquiring document data, the document data comprising the document title and document abstract of each document, and citation information corresponding to each document (S101); extracting noun phrases from the document title and the document abstract using a preset natural language processing tool (S102); clustering the noun phrases on the basis of the citation information and the noun phrases to obtain a synonym set (S103); and screening, from the synonym set, a target noun phrase having the highest word frequency to serve as the MeSH term of a document (S104).
PCT/CN2020/118699 2020-07-29 2020-09-29 Document MeSH term aggregation method and apparatus, computer device, and readable storage medium WO2021139262A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010744556.7A CN111898366B (zh) 2020-07-29 2020-07-29 Document subject term aggregation method and apparatus, computer device, and readable storage medium
CN202010744556.7 2020-07-29

Publications (1)

Publication Number Publication Date
WO2021139262A1 (fr) 2021-07-15

Family

ID=73182439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118699 WO2021139262A1 (fr) 2020-07-29 2020-09-29 Document MeSH term aggregation method and apparatus, computer device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN111898366B (fr)
WO (1) WO2021139262A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
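CN113705217A (zh) * 2021-09-01 2021-11-26 国网江苏省电力有限公司电力科学研究院 Document recommendation method and apparatus for knowledge learning in the electric power field
CN115658851A (zh) * 2022-12-27 2023-01-31 药融云数字科技(成都)有限公司 Topic-based medical literature retrieval method, system, storage medium and terminal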

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
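CN112667810A (zh) * 2020-12-25 2021-04-16 平安科技(深圳)有限公司 Document clustering, apparatus, electronic device and storage medium
CN113111180B (zh) * 2021-03-22 2022-01-25 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on a deep pre-trained neural network
CN113392072B (zh) * 2021-06-25 2022-08-02 中国标准化研究院 Standard knowledge service method and apparatus, electronic device and storage medium
CN113704412B (zh) * 2021-08-31 2023-05-02 交通运输部科学研究院 Early identification method for transformative research literature in the transportation field
CN113806237B (zh) * 2021-11-18 2022-03-08 杭州费尔斯通科技有限公司 Dictionary-based evaluation method and system for language understanding models
CN114201962B (zh) * 2021-12-03 2023-07-25 中国中医科学院中医药信息研究所 Paper novelty analysis method, apparatus, medium and device
CN115713085B (zh) * 2022-10-31 2023-11-07 北京市农林科学院 Document topic content analysis method and apparatus
CN116644338B (zh) * 2023-06-01 2024-01-30 北京智谱华章科技有限公司 Document topic classification method, apparatus, device and medium based on hybrid similarity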

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110295903A1 (en) * 2010-05-28 2011-12-01 Drexel University System and method for automatically generating systematic reviews of a scientific field
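CN105956130A (zh) * 2016-05-09 2016-09-21 浙江农林大学 Multi-information-fusion method and system for topic discovery and tracking in scientific research literature
CN110020034A (zh) * 2018-06-29 2019-07-16 程宇镳 Information citation analysis method and system
CN110349632A (zh) * 2019-06-28 2019-10-18 广州序科码生物技术有限责任公司 Method for screening gene keywords from PubMed literature
CN111143511A (zh) * 2019-12-16 2020-05-12 北京工业大学 Emerging technology prediction method, apparatus, electronic device and medium
CN111259156A (zh) * 2020-02-18 2020-06-09 北京航空航天大学 Time-series-oriented hotspot clustering method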

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6978274B1 (en) * 2001-08-31 2005-12-20 Attenex Corporation System and method for dynamically evaluating latent concepts in unstructured documents
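JP4462014B2 (ja) * 2004-11-15 2010-05-12 日本電信電話株式会社 Topic word combining method, apparatus, and program
JP5474704B2 (ja) * 2010-08-16 2014-04-16 Kddi株式会社 Binary relation classification program, method and apparatus for classifying semantically similar event pairs into binary relations
CN106897436B (zh) * 2017-02-28 2018-08-07 北京邮电大学 Method for extracting academic research hotspot keywords based on variational inference
CN109117436A (zh) * 2017-06-26 2019-01-01 上海新飞凡电子商务有限公司 Topic-model-based automatic synonym discovery method and system
CN108920454A (zh) * 2018-06-13 2018-11-30 北京信息科技大学 Topic phrase extraction method
US20200117751A1 (en) * 2018-10-10 2020-04-16 Twinword Inc. Context-aware computing apparatus and method of determining topic word in document using the same
CN110321553B (zh) * 2019-05-30 2023-01-17 平安科技(深圳)有限公司 Short text topic identification method, apparatus and computer-readable storage medium
CN110489745B (zh) * 2019-07-31 2020-12-22 北京大学 Method for detecting text similarity of papers based on a citation network
CN110851602A (zh) * 2019-11-13 2020-02-28 精硕科技(北京)股份有限公司 Topic clustering method and apparatus
CN111079422B (zh) * 2019-12-13 2023-07-14 北京小米移动软件有限公司 Keyword extraction method, apparatus and storage medium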

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
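CN113705217A (zh) * 2021-09-01 2021-11-26 国网江苏省电力有限公司电力科学研究院 Document recommendation method and apparatus for knowledge learning in the electric power field
CN113705217B (zh) * 2021-09-01 2024-05-28 国网江苏省电力有限公司电力科学研究院 Document recommendation method and apparatus for knowledge learning in the electric power field
CN115658851A (zh) * 2022-12-27 2023-01-31 药融云数字科技(成都)有限公司 Topic-based medical literature retrieval method, system, storage medium and terminal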

Also Published As

Publication number Publication date
CN111898366A (zh) 2020-11-06
CN111898366B (zh) 2022-08-09

Similar Documents

Publication Publication Date Title
WO2021139262A1 (fr) Document MeSH term aggregation method and apparatus, computer device, and readable storage medium
CN111104794B (zh) Text similarity matching method based on subject words
CN108509474B (zh) Synonym expansion method and apparatus for search information
WO2019091026A1 (fr) Method for quickly searching documents in a knowledge base, application server, and computer-readable storage medium
WO2020001373A1 (fr) Ontology construction method and apparatus
US9183274B1 (en) System, methods, and data structure for representing object and properties associations
US10394851B2 (en) Methods and systems for mapping data items to sparse distributed representations
WO2018086470A1 (fr) Keyword extraction method and device, and server
CN110019732B (zh) Intelligent question answering method and related apparatus
CN112328891B (zh) Method for training a search model, method for searching for a target object, and apparatus therefor
CN111581949B (zh) Scholar name disambiguation method, apparatus, storage medium and terminal
WO2021189951A1 (fr) Text search method and apparatus, computer device and storage medium
WO2020232898A1 (fr) Text classification method and apparatus, electronic device, and computer-readable non-volatile storage medium
WO2020107835A1 (fr) Sample data processing method and device
JP5057474B2 (ja) Method and system for calculating a competition index between objects
EP2577521A2 (fr) Detection of junk in search result ranking
CN103646112A (zh) Domain adaptation method for dependency parsing using web search
CN111090771B (zh) Song search method, apparatus and computer storage medium
WO2022042297A1 (fr) Text clustering method and apparatus, electronic device and storage medium
CN116848490A (zh) Document analysis using model intersections
US20240111956A1 (en) Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor
JP3765801B2 (ja) Bilingual expression extraction device, bilingual expression extraction method, and bilingual expression extraction program
CN110019474B (zh) Method, apparatus and electronic device for automatically associating synonymous data in heterogeneous databases
CN113032573A (zh) Large-scale text classification method and system combining topic semantics with the TF*IDF algorithm
CN103034657B (zh) Document summary generation method and apparatus

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20912252; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 20912252; Country of ref document: EP; Kind code of ref document: A1