CN111898366B - Document subject word aggregation method and device, computer equipment and readable storage medium - Google Patents

Info

Publication number
CN111898366B
Authority
CN
China
Prior art keywords
document
similarity
noun phrases
phrase
noun
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010744556.7A
Other languages
Chinese (zh)
Other versions
CN111898366A (en
Inventor
柴玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010744556.7A priority Critical patent/CN111898366B/en
Priority to PCT/CN2020/118699 priority patent/WO2021139262A1/en
Publication of CN111898366A publication Critical patent/CN111898366A/en
Application granted granted Critical
Publication of CN111898366B publication Critical patent/CN111898366B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The embodiments of the present application provide a document subject term aggregation method and device, computer equipment, and a computer-readable storage medium. The embodiments belong to the technical field of semantic processing. Document data is acquired, comprising the document title, document abstract, and citation information corresponding to each document; the noun phrases contained in the document titles and document abstracts are extracted with a preset natural language processing tool; the noun phrases are clustered based on the citation information and the noun phrases to obtain a near-synonym set; and the target noun phrase with the highest word frequency is screened out of the near-synonym set as the subject term of the document, which improves the accuracy of document subject term aggregation.

Description

Document subject word aggregation method and device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of semantic processing technologies, and in particular, to a method and an apparatus for aggregating document topic terms, a computer device, and a computer-readable storage medium.
Background
When researching a technology, it is important to understand how the research hotspots in a field change over time, or what the latest hotspots are. Although document libraries label documents by topic, in many cases the labels describe the topic inaccurately. For example, it is important for medical researchers to track the changing or latest research hotspots in a field, which not only improves the efficiency of scientific research but also greatly helps in diagnosing and treating difficult and complicated diseases. Most documents in the PubMed medical literature library carry expert-assigned labels (i.e., MeSH Terms) or keywords. However, MeSH Term annotation is labor-intensive, and MeSH Terms are assigned from many different angles (such as diseases, drugs, and species), so in most cases the labels do not represent the specific research hotspot of a document; the keywords, in turn, tend to be more general and are biased by the subjective choices of the authors. Therefore, most scientometric analyses select the noun phrases in titles and abstracts as candidate subject terms of an article, so that the information they carry is closer to the actual research content of the document. However, when topic analysis is performed directly on phrases from titles and abstracts, synonyms introduce a great deal of noise. Especially in subdivided fields such as lung cancer, the topic representatives selected by existing mainstream topic models such as LDA often include a large number of similar or synonymous terms, which makes the information redundant and inaccurate. For example, non-small cell lung cancer, non-small cell lung carcinoma, non-small cell lung cancer cells, and human non-small cell lung cancer should all be standardized to the same subject term, non-small cell lung cancer.
In the conventional technology, terms are generally characterized by word-level semantic similarity when document synonyms are processed, and general synonym acquisition can only consider sentence-level information such as context and part of speech; as a result, the accuracy of document subject term aggregation is low.
Disclosure of Invention
The embodiments of the present application provide a document subject term aggregation method and device, computer equipment, and a computer-readable storage medium, which can solve the problem that the accuracy of document subject term aggregation in the conventional technology is low.
In a first aspect, an embodiment of the present application provides a document subject term aggregation method, the method including: acquiring document data, wherein the document data comprises the document title, document abstract, and citation information corresponding to each document; extracting the noun phrases contained in the document titles and the document abstracts with a preset natural language processing tool; clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set; and screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject term of the document.
In a second aspect, an embodiment of the present application further provides a document subject term aggregation device, including: an acquisition unit for acquiring document data, wherein the document data comprises the document title, document abstract, and citation information corresponding to each document; an extraction unit for extracting the noun phrases contained in the document titles and the document abstracts with a preset natural language processing tool; a clustering unit for clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set; and a screening unit for screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject term of the document.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the document subject term aggregation method when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, causes the processor to execute the steps of the document subject term aggregation method.
The embodiments of the present application provide a document subject term aggregation method and device, computer equipment, and a computer-readable storage medium. In the embodiments, document data is acquired, comprising the document title, document abstract, and citation information corresponding to each document; a preset natural language processing tool is used to extract the noun phrases contained in the document titles and document abstracts; the noun phrases are clustered based on the citation information and the noun phrases to obtain a near-synonym set; and the target noun phrase with the highest word frequency is screened out of the near-synonym set as the subject term of the document. Because noun phrases are combined with citation information, phrase-level near-synonym processing is applied to the document mining scenario, and noun phrase similarity is characterized with the help of citation information. Compared with the conventional technology, which characterizes terms by word-level semantic similarity and considers only sentence-level information, this characterization fully captures the similarity between the topics represented by two noun phrases, so that the aggregated subject terms accurately describe the topics of the documents and the accuracy of document subject term aggregation is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a document subject term aggregation method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sub-process of a document subject term aggregation method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of an example of a document co-cited network in a document subject term aggregation method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another sub-process of a document subject term aggregation method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an aggregation process of a document subject term aggregation method provided in an embodiment of the present application;
FIG. 6 is a schematic block diagram of a document subject term aggregation device according to an embodiment of the present application; and
FIG. 7 is a schematic block diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Referring to fig. 1, fig. 1 is a schematic flow chart of a document topic word aggregation method according to an embodiment of the present application. As shown in fig. 1, the method comprises the following steps S101-S104:
S101, acquiring document data, wherein the document data comprises the document title, document abstract, and citation information corresponding to each document.
Specifically, the document data can be retrieved from a preset database by keyword. The document data includes the title and abstract of each document and the citation information corresponding to each document, where the citation information is the mutual citation relationship among the documents. Documents are generally retrieved from the PubMed database: a keyword search obtains the titles, abstracts, and mutual citation relationships of all documents in a specific field contained in PubMed. For example, searching for "lung cancer" downloads the titles, abstracts, and citation relationships of the documents related to lung cancer.
S102, extracting the noun phrases contained in the document titles and the document abstracts with a preset natural language processing tool.
The natural language processing tools include Stanford NLP, TextBlob, Polyglot, and other natural language processing tools capable of extracting noun phrases.
Specifically, after the titles, abstracts, and mutual citation relationships of all documents in a preset specific field are retrieved, noun phrases are extracted from the titles and abstracts with the preset natural language processing tool, for example by using the part-of-speech tagging in the Stanford CoreNLP tool. Further, abbreviations in the extracted noun phrases can be mapped to their full names by using the abbreviation detection in scispaCy. For example, if a document is denoted A, document A is finally represented as a set of phrases {P1, P2, P3, …, Pn}.
Further, the specific process of extracting noun phrases is as follows:
1) The noun phrases contained in the document title and the document abstract are extracted with a preset natural language processing tool; for example, Stanford NLP is used to extract the noun phrases.
2) High-frequency common words are removed from the noun phrases; for example, words such as "cancer" that fall within the 2,000 most frequent words of the Wikipedia corpus are deleted, so that common high-frequency vocabulary does not affect the aggregation of subject terms.
3) Each noun phrase is checked for abbreviations. If a noun phrase contains an abbreviation, the abbreviation is replaced with its full name according to a preset replacement lexicon. For example, the scispaCy tool is used to extract the abbreviations and full names that appear in the article, such as {QOL: quality of life}, and every occurrence of an abbreviation is replaced with its full name, thereby obtaining a phrase set.
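As an illustration only, steps 2) and 3) above might be sketched in Python as follows. The sketch assumes the noun phrases have already been extracted by an NLP tool. `COMMON_WORDS` and `ABBREVIATIONS` are small hypothetical stand-ins for the high-frequency word list and the replacement lexicon, and dropping a phrase that consists solely of a common word is only one possible reading of step 2).

```python
import re

# Hypothetical stand-ins (not from the source): a real pipeline would derive the
# common-word list from a large corpus and the abbreviation table from a detector.
COMMON_WORDS = {"cancer", "study", "patient"}
ABBREVIATIONS = {"QOL": "quality of life", "NSCLC": "non-small cell lung cancer"}


def expand_abbreviations(phrase, table):
    """Replace every token of `phrase` that is a known abbreviation with its full name."""
    def lookup(match):
        return table.get(match.group(0), match.group(0))
    return re.sub(r"[A-Za-z0-9-]+", lookup, phrase)


def clean_phrases(phrases, common_words, table):
    """Drop phrases that are just a common word, expand abbreviations, deduplicate."""
    cleaned = []
    for phrase in phrases:
        if phrase.lower() in common_words:
            continue  # assumed reading of step 2): skip high-frequency common words
        expanded = expand_abbreviations(phrase, table).lower()
        if expanded not in cleaned:
            cleaned.append(expanded)
    return cleaned


phrase_set = clean_phrases(
    ["cancer", "QOL", "NSCLC cells", "quality of life"], COMMON_WORDS, ABBREVIATIONS
)
print(phrase_set)
```

In this toy run, "QOL" and "quality of life" collapse into a single entry once the abbreviation is expanded, which is the effect step 3) is meant to achieve.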
S103, clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set.
Specifically, a co-cited network between documents is constructed from the retrieved citation information, the co-cited relationship between noun phrases is derived from that network, and the semantic similarity of the noun phrases is obtained; the noun phrases are then clustered according to the co-cited relationship and the semantic similarity between them to obtain a near-synonym set.
S104, screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject term of the document.
Specifically, after the near-synonym set is obtained, a target noun phrase that meets the requirement is screened out of it and used as the subject term of the document; for example, the noun phrase that occurs most frequently in the near-synonym set is taken as the target noun phrase, thereby obtaining the subject term of the document.
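A minimal sketch of step S104, under the assumption that word frequency is counted over all phrase occurrences in the retrieved documents, might look as follows. The function name `pick_subject_term` and the sample data are hypothetical.

```python
from collections import Counter


def pick_subject_term(near_synonyms, phrase_occurrences):
    """Return the member of `near_synonyms` that occurs most often in the corpus.

    Ties are resolved in favor of the phrase listed first in `near_synonyms`.
    """
    counts = Counter(phrase_occurrences)
    return max(near_synonyms, key=lambda phrase: counts[phrase])


# Hypothetical phrase occurrences collected from titles and abstracts.
occurrences = [
    "non-small cell lung cancer",
    "non-small cell lung carcinoma",
    "non-small cell lung cancer",
    "human non-small cell lung cancer",
    "non-small cell lung cancer",
]
near_synonyms = [
    "non-small cell lung carcinoma",
    "non-small cell lung cancer",
    "human non-small cell lung cancer",
]
subject_term = pick_subject_term(near_synonyms, occurrences)
print(subject_term)
```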
In the embodiments of the present application, noun phrases are combined with citation information: phrase-level near-synonym processing is applied to the document mining scenario, and noun phrase similarity is characterized with the help of citation information. Compared with the conventional technology, which characterizes terms only by word-level semantic similarity and considers only sentence-level information, this characterization fully captures the similarity between the topics represented by two noun phrases, so that the aggregated subject terms accurately describe the topics of the documents and the accuracy of document subject term aggregation is improved.
Referring to fig. 2, fig. 2 is a schematic diagram of a sub-process of a document subject term aggregation method according to an embodiment of the present application. In this embodiment, the step of clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set includes:
S201, establishing semantic similarity between the noun phrases based on the noun phrases.
The semantic similarity describes how similar noun phrases are in meaning, and may be measured by cosine similarity, Euclidean distance, Minkowski distance, or the like.
Specifically, the noun phrases can be vectorized, and the similarity between the vectors of two noun phrases is computed to quantify the similarity between the two noun phrases.
Further, the step of establishing semantic similarity based on the noun phrases according to the noun phrases comprises:
inputting the noun phrases into a preset BioBERT model to obtain the semantic vectors corresponding to the noun phrases; and calculating the cosine similarity between the semantic vectors to obtain the semantic similarity corresponding to the noun phrases.
Specifically, BioBERT-based semantic similarity is established from the extracted noun phrases: the output vectors of a pre-trained BioBERT model represent the phrase semantics, and the cosine similarity between the vectors gives the semantic similarity of the noun phrases. BioBERT is a BERT model trained on a very large medical corpus and can effectively represent the semantics of medical words and phrases. Inputting the extracted noun phrases into the BioBERT model yields phrase-level semantic vector representations; for example, each noun phrase is converted into a 768-dimensional vector, and cosine similarity is then used to calculate the similarity between vectors, which gives the semantic similarity between phrases. For phrase-level data, a deep learning model based on the pre-trained BioBERT model can be trained in advance to represent the context features and the semantic information of noun phrases respectively. When subject terms are aggregated in the embodiments of the present application, phrase-level near-synonyms are thus used, and the phrase-level similarity fully characterizes the similarity between the topics represented by two noun phrases, which improves the accuracy of the semantic similarity computed for the noun phrases.
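The cosine similarity step can be illustrated with a short Python sketch. The three-dimensional vectors below are hypothetical placeholders for the 768-dimensional model outputs; obtaining real BioBERT vectors is outside the scope of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical phrase vectors standing in for model outputs.
vec_cancer = [0.9, 0.1, 0.3]
vec_carcinoma = [0.8, 0.2, 0.3]
vec_quality_of_life = [0.1, 0.9, 0.2]

sim_close = cosine_similarity(vec_cancer, vec_carcinoma)
sim_far = cosine_similarity(vec_cancer, vec_quality_of_life)
print(sim_close > sim_far)
```

Because cosine similarity depends only on the direction of the vectors, phrases whose vectors point in similar directions score close to 1 regardless of vector length.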
S202, constructing a document co-cited network corresponding to the documents based on the citation information.
S203, calculating the document co-cited similarity corresponding to the documents according to the document co-cited network.
Co-citation means that documents are cited together by the same document; documents cited in this way are said to be co-cited.
Specifically, after the citation information of the documents is acquired, a document citation network can be constructed from the citation relationships among the documents. Viewed from the perspective of the cited documents, this network is called a document cited network; where several documents are cited by the same document, it is called a document co-cited network.
In scientometric analysis, two articles cited by the same article have a certain topic similarity. Please refer to fig. 3, which is a schematic diagram of an example of a document co-cited network in the document subject term aggregation method provided by an embodiment of the present application. As shown in fig. 3, A and B are both cited by C, so A and B have topic similarity; a co-cited network of the documents that cite A and B can therefore be constructed to obtain the similarity between A and B. Fig. 3 shows, for example, the network built from the documents C, D, and E that cite document A. Suppose the constructed co-cited network consists of A1, A2, …, Am, where the documents A1, A2, …, Am are the nodes and the weight of an edge is the co-cited similarity between its two nodes. For two documents A1 and A2, the sets of documents citing them intersect whenever the same document cites both, and this overlap can be used to measure the similarity between A1 and A2. In fig. 3, A is cited by C, D, and E, B is cited by C and D, and A and B are therefore co-cited by C and D; the statistic that measures the similarity between A and B on this basis is called the co-cited similarity. The calculation formula is as follows:
[Formula image BDA0002607899190000071 is not reproduced in the source text; it gives the co-cited similarity of documents i and j in terms of the sets M and N.]
where M and N denote the sets of documents that cite document i and document j, respectively. Take fig. 3 as an example. Fig. 3 contains five documents, A, B, C, D, and E. The direction of each arrow represents a citation relationship, and the dotted line marks A and B as co-cited; for example, the arrow from C to A means that document C cites document A. In fig. 3, the three documents C, D, and E cite document A, so the citing document set of document A is {C, D, E}; likewise, the citing document set of document B is {C, D}. The co-cited similarity between A and B is:
[Formula image BDA0002607899190000072 is not reproduced in the source text; it evaluates the co-cited similarity of A and B from the sets {C, D, E} and {C, D}.]
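Since the formula images are not available in the text, the following Python sketch only illustrates the general idea with an assumed measure: the Jaccard overlap of the two citing sets, |M ∩ N| / |M ∪ N|. This choice is an assumption made for illustration and is not confirmed as the formula of the application; the function names are likewise hypothetical.

```python
def citing_sets(citations):
    """Invert a {citing document: set of cited documents} map into {cited: set of citing}."""
    result = {}
    for citing, cited_docs in citations.items():
        for cited in cited_docs:
            result.setdefault(cited, set()).add(citing)
    return result


def co_cited_similarity(citing_i, citing_j):
    """ASSUMED measure: Jaccard overlap of the two citing sets (not taken from the source)."""
    union = citing_i | citing_j
    if not union:
        return 0.0
    return len(citing_i & citing_j) / len(union)


# Citation relationships of fig. 3: C and D cite both A and B, E cites only A.
citations = {"C": {"A", "B"}, "D": {"A", "B"}, "E": {"A"}}
cited_by = citing_sets(citations)

# A is cited by {C, D, E}, B by {C, D}: 2 shared citing documents out of 3 in total.
similarity_ab = co_cited_similarity(cited_by["A"], cited_by["B"])
print(similarity_ab)
```

Under the Jaccard assumption the fig. 3 example evaluates to 2/3; a different normalization in the original formula would give a different value while keeping the same dependence on the overlap of M and N.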
s204, constructing a phrase co-referenced similarity network corresponding to the noun phrases according to the document co-referenced similarity.
S205, obtaining the phrase co-quoted similarity corresponding to the noun phrase according to the phrase co-quoted similarity network.
S206, clustering the noun phrases according to the phrase co-referenced similarity and the semantic similarity to obtain a near meaning word set.
Specifically, after a document co-referenced network is constructed, based on the reference relationship of the document and the extracted noun phrases, the noun phrases are used to describe the document, so that the co-referenced similarity of the noun phrase co-referenced similarity default document i and the extracted noun phrases is established to be 1, the similarity between the extracted noun phrases corresponding to the document is represented by using the document co-referenced similarity, and the calculation formula is as follows:
Figure BDA0002607899190000081
the method comprises the steps that X and Y respectively represent document sets containing phrases X and Y, so that a common-quoted similarity network between the phrases can be obtained, the common-quoted similarity between the phrases is obtained according to the common-quoted similarity network, the common-quoted similarity between the phrases corresponding to noun phrases is obtained, and the noun phrases are clustered according to the common-quoted similarity between the phrases and the semantic similarity to obtain a similar-meaning word set.
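The phrase-level formula is also unavailable in the text, so the sketch below uses an assumed aggregation purely for illustration: the mean document co-cited similarity over all pairs (i, j) with i in X and j in Y, treating a document paired with itself as having similarity 1. Both the averaging and the names used are assumptions, not the formula of the application.

```python
def phrase_co_cited_similarity(docs_x, docs_y, doc_similarity):
    """ASSUMED aggregation: average document similarity over all pairs drawn from X and Y."""
    if not docs_x or not docs_y:
        return 0.0
    total = 0.0
    for i in docs_x:
        for j in docs_y:
            # A document paired with itself is given similarity 1 (assumption).
            total += 1.0 if i == j else doc_similarity(i, j)
    return total / (len(docs_x) * len(docs_y))


# Hypothetical document co-cited similarities, stored per unordered document pair.
DOC_SIM = {frozenset(("A", "B")): 0.5}


def doc_similarity(i, j):
    return DOC_SIM.get(frozenset((i, j)), 0.0)


# Phrase x appears in document A; phrase y appears in documents A and B.
sim_xy = phrase_co_cited_similarity({"A"}, {"A", "B"}, doc_similarity)
print(sim_xy)  # (1.0 + 0.5) / 2
```

Computing this value for every pair of phrases gives the edge weights of a phrase co-cited similarity network of the kind described in S204.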
Further, please refer to fig. 4, which is a schematic diagram of another sub-process of a document subject term aggregation method according to an embodiment of the present application. In this embodiment, before the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set, the method further includes:
S401, performing community detection on the phrase co-cited similarity network in a preset community detection manner to obtain a plurality of phrase communities.
Community detection finds the closely connected parts of a network, which are called communities: connections within a community are dense, while connections between communities are sparse. Community detection algorithms include the Louvain algorithm, the Newman fast algorithm, the CNM algorithm, the MSG-MV algorithm, and the like.
Specifically, community detection is performed on the phrase co-cited similarity network with a preset community detection algorithm, so that the phrases are grouped according to the similarity network into a plurality of communities, each of which contains near-synonyms. For example, the Louvain community detection algorithm is used to mine communities in the phrase co-cited similarity network, which finally yields a series of communities (clusters), and near-synonyms are assumed by default to appear in the same community. The embodiments of the present application thus bring community detection into noun phrase clustering: the extracted noun phrases are first clustered preliminarily by community detection, and hierarchical clustering is then performed within each community to obtain the near-synonym set. Because near-synonym mining combines citation information with semantic information, a phrase similarity network based on co-citation information is constructed, community detection is applied to that network, and the community detection algorithm first recalls a candidate set of near-synonyms, the amount of computation in the clustering stage is greatly reduced. At the same time, the method does not depend on labeled data or a specific corpus, has good generality, better fits the topic mining scenario, and improves the accuracy of subject term screening.
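To show the role of this preliminary grouping without depending on a graph library, the sketch below uses a deliberately simple stand-in: it keeps only the edges whose weight reaches a threshold and returns the connected components. This is not the Louvain algorithm, which optimizes modularity; the stand-in, the threshold, and the sample weights are all assumptions made for illustration.

```python
def threshold_communities(edges, threshold):
    """Stand-in for community detection: connected components over edges >= threshold."""
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if weight >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    communities, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        communities.append(component)
    return communities


# Hypothetical phrase co-cited similarity network (edge weights are made up).
edges = {
    ("nsclc", "non-small cell lung cancer"): 0.9,
    ("non-small cell lung cancer", "lung carcinoma"): 0.6,
    ("lung carcinoma", "quality of life"): 0.05,
    ("quality of life", "life quality"): 0.8,
}
communities = threshold_communities(edges, threshold=0.3)
print(communities)
```

Only phrase pairs inside the same community need to be compared in the later hierarchical clustering, which is how the preliminary grouping reduces the amount of computation.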
Further, the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set includes:
S402, clustering each phrase community according to the phrase co-cited similarity corresponding to the noun phrases to obtain a first cluster.
S403, clustering each phrase community according to the semantic similarity corresponding to the noun phrases to obtain a second cluster.
S404, for every two noun phrases, determining whether the two noun phrases are grouped together in both the first cluster and the second cluster.
S405, if the two noun phrases are grouped together in both the first cluster and the second cluster, determining that the two noun phrases are near-synonyms, thereby obtaining near-synonym phrases.
S406, if the two noun phrases are not grouped together in both the first cluster and the second cluster, determining that the two noun phrases are not near-synonyms.
S407, combining all the near-synonym phrases into a set to obtain the near-synonym set.
Cluster analysis, also called group analysis, is a statistical method for studying the classification of samples or indicators, and is also an important class of data mining algorithms. A cluster is composed of several patterns, where a pattern is typically a vector of measurements or a point in a multidimensional space. Cluster analysis is based on similarity: patterns in the same cluster are more similar to each other than to patterns in different clusters. Clustering algorithms include the K-means clustering algorithm, Mean-Shift clustering, and expectation-maximization (EM) clustering based on a Gaussian mixture model (GMM).
Specifically, after the phrase co-cited similarity network has been partitioned by community detection, hierarchical clustering is performed within each community, on the assumption that near-synonyms appear only in the same community. The phrases in each community are hierarchically clustered twice, once using the phrase co-cited similarity and once using the semantic similarity; a threshold can be set for the hierarchical clustering, and phrases that are clustered together in both clusterings are considered near-synonyms. That is, two bottom-up hierarchical clusterings are performed for each community: one uses the co-cited similarity of the noun phrases as the similarity measure, and the other uses the semantic similarity, for example the BioBERT-based semantic similarity. Two noun phrases are finally judged to be near-synonyms if both clusterings group them together. Please refer to fig. 5, which is a schematic diagram of an example aggregation process of the document subject term aggregation method provided by an embodiment of the present application. As shown in fig. 5, the white circles represent noun phrases extracted from one document, the black circles represent noun phrases extracted from a second document, and the gray circles represent noun phrases extracted from a third document; different white, black, and gray circles represent different noun phrases. Since A and B are clustered together in both hierarchical clusterings, A and B can be merged as near-synonyms.
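The agreement rule between the two clusterings can be sketched in Python as follows. The source does not specify the linkage criterion or the threshold, so the single-linkage merging, the threshold of 0.5, and the similarity values below are all assumptions chosen for illustration.

```python
from itertools import combinations


def agglomerative_clusters(items, similarity, threshold):
    """Bottom-up clustering (single linkage assumed): merge while some pair of
    clusters has a best cross-pair similarity of at least `threshold`."""
    clusters = [{item} for item in items]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            link = max(similarity(x, y) for x in clusters[a] for y in clusters[b])
            if link >= threshold and (best is None or link > best[0]):
                best = (link, a, b)
        if best is None:
            return clusters
        _, a, b = best
        clusters[a] |= clusters[b]  # a < b, so deleting index b keeps index a valid
        del clusters[b]


def near_synonym_sets(first, second):
    """Groups of two or more phrases that share a cluster in BOTH clusterings."""
    groups = []
    for c1 in first:
        for c2 in second:
            common = c1 & c2
            if len(common) >= 2:
                groups.append(common)
    return groups


def lookup(table):
    return lambda x, y: table.get(frozenset((x, y)), 0.0)


p1, p2, p3 = "non-small cell lung cancer", "non-small cell lung carcinoma", "lung adenocarcinoma"
# Hypothetical similarity values for one phrase community.
co_cited = lookup({frozenset((p1, p2)): 0.8, frozenset((p1, p3)): 0.7, frozenset((p2, p3)): 0.6})
semantic = lookup({frozenset((p1, p2)): 0.95, frozenset((p1, p3)): 0.4, frozenset((p2, p3)): 0.4})

first = agglomerative_clusters([p1, p2, p3], co_cited, threshold=0.5)
second = agglomerative_clusters([p1, p2, p3], semantic, threshold=0.5)
synonym_sets = near_synonym_sets(first, second)
print(synonym_sets)
```

In this toy community all three phrases are close in terms of co-citation, but only the first two are also close semantically, so only those two are merged as near-synonyms.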
In the embodiment of the application, to address the synonymous and near-synonymous subject words encountered when mining document topics in a specific subfield, near-synonym mining that combines citation information and semantic information is provided. A phrase similarity network based on co-citation information is constructed, the phrase communities are clustered according to the phrase co-cited similarity corresponding to the noun phrases, and the phrase communities are also clustered according to the semantic similarity corresponding to the noun phrases. Community detection is applied to the phrase similarity network for the first time to recall sets of possible near-synonyms, which greatly reduces the candidate range of near-synonyms.
If two noun phrases are contained in the same cluster under both types of clustering, the two noun phrases are determined to be near-synonyms, and they can be merged to obtain a near-synonym set. Clustering is performed separately with the semantic similarity and the co-cited similarity, instead of the conventional strategy of adding different similarities with weights, so the influence of the similarity weights on the result is avoided. The obtained near-synonyms have both similar semantics and similar topics. The method does not rely on annotated data or a specific corpus, generalizes well, better fits the topic mining scenario, and improves the accuracy of subject word screening.
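The two-clustering rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the linkage criterion (single linkage), the thresholds, the helper names `threshold_clusters` and `near_synonym_pairs`, and the similarity values are all assumptions chosen for the example.

```python
from itertools import combinations


def threshold_clusters(phrases, similarity, threshold):
    """Single-linkage bottom-up clustering cut at `threshold` (assumed linkage).

    Merging every pair whose similarity is >= threshold gives the same
    groups as cutting a single-linkage dendrogram at that level.
    Returns a {phrase: cluster representative} mapping.
    """
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        if similarity.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)
    return {p: find(p) for p in phrases}


def near_synonym_pairs(phrases, cocited_sim, semantic_sim, t_cocited, t_semantic):
    """Keep only the pairs that share a cluster in BOTH clusterings."""
    by_citation = threshold_clusters(phrases, cocited_sim, t_cocited)
    by_semantics = threshold_clusters(phrases, semantic_sim, t_semantic)
    return [
        (a, b)
        for a, b in combinations(phrases, 2)
        if by_citation[a] == by_citation[b] and by_semantics[a] == by_semantics[b]
    ]


# Hypothetical phrases of one community, with made-up similarity values.
phrases = ["heart attack", "myocardial infarction", "stroke"]
cocited_sim = {
    frozenset(("heart attack", "myocardial infarction")): 0.8,
    frozenset(("heart attack", "stroke")): 0.7,
    frozenset(("myocardial infarction", "stroke")): 0.2,
}
semantic_sim = {
    frozenset(("heart attack", "myocardial infarction")): 0.9,
    frozenset(("heart attack", "stroke")): 0.3,
    frozenset(("myocardial infarction", "stroke")): 0.25,
}
pairs = near_synonym_pairs(phrases, cocited_sim, semantic_sim, 0.6, 0.6)
```

In this toy data, "stroke" joins the other two phrases in the co-cited clustering but not in the semantic clustering, so only the first pair survives. No weight between the two similarities has to be tuned, which is the point made in the paragraph above.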
In one embodiment, the step of screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document comprises:
screening out the noun phrase with the highest TF-IDF value from the near-synonym set according to a preset TF-IDF algorithm to serve as the target noun phrase;
and taking the target noun phrase as a subject word of the document.
TF-IDF (term frequency-inverse document frequency) is a common weighting method. In a given document, the term frequency (TF) refers to the number of times a given term appears in the document. This number is usually normalized to prevent a bias toward long documents (so the numerator is usually smaller than the denominator, unlike in the IDF), because the same term may have a higher raw count in a long document than in a short one regardless of whether the term is important. The inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a particular term may be obtained by dividing the total number of documents by the number of documents that contain the term and taking the logarithm of the resulting quotient. A high term frequency within a particular document, together with a low document frequency for that term across the document collection, results in a high TF-IDF weight. Therefore, TF-IDF tends to filter out common words and preserve important words.
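The definition above can be written out as a short sketch. The function name `tf_idf`, the choice of dividing the count by the document length as the TF normalization, and the toy corpus are assumptions made for illustration; the source only says that TF is "usually normalized".

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    TF  = term count / document length (assumed normalization).
    IDF = log(total documents / documents containing the term).
    """
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized documents.
corpus = [
    ["gene", "therapy", "gene"],
    ["gene", "expression"],
    ["gene", "tumor", "suppressor"],
]
weights = tf_idf(corpus)
```

Here "gene" appears in every document, so its IDF is log(3/3) = 0 and its weight vanishes, while "therapy" appears in only one document and keeps a positive weight. This is the filtering of common words that the paragraph describes.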
Specifically, the TF-IDF value of each phrase in the obtained near-synonym set is calculated, and the phrase with the highest TF-IDF value is selected as the standard subject word.
Further, if the importance of each document can be scored, a weighted TF-IDF value can be used. For example, the citation count of a document is used as the index of its importance and is normalized to the range 0-1. For each phrase, the mean importance of all documents in which the phrase appears is multiplied by the TF-IDF of the phrase, and the product is used as the final TF-IDF value of the phrase.
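A minimal sketch of this weighting and of the final selection is given below. The scaling used to bring citation counts into the 0-1 range (division by the largest count), the function names, and all numbers are assumptions; the source does not state which normalization is used.

```python
def normalize_citations(citation_counts):
    """Scale citation counts into [0, 1] by dividing by the maximum (assumed scaling)."""
    highest = max(citation_counts.values())
    if highest == 0:
        return {doc: 0.0 for doc in citation_counts}
    return {doc: count / highest for doc, count in citation_counts.items()}


def weighted_scores(phrase_tfidf, phrase_docs, importance):
    """Final value = mean importance of the documents containing the phrase * TF-IDF."""
    scores = {}
    for phrase, tfidf in phrase_tfidf.items():
        docs = phrase_docs[phrase]
        mean_importance = sum(importance[d] for d in docs) / len(docs)
        scores[phrase] = mean_importance * tfidf
    return scores


def pick_subject_word(synonym_set, scores):
    """Select the phrase with the highest final value as the standard subject word."""
    return max(synonym_set, key=lambda phrase: scores[phrase])


# Hypothetical inputs: citation counts, plain TF-IDF values, and phrase locations.
importance = normalize_citations({"d1": 20, "d2": 60, "d3": 100})
phrase_tfidf = {"heart attack": 0.30, "myocardial infarction": 0.25}
phrase_docs = {
    "heart attack": ["d1", "d2"],
    "myocardial infarction": ["d2", "d3"],
}
scores = weighted_scores(phrase_tfidf, phrase_docs, importance)
subject_word = pick_subject_word({"heart attack", "myocardial infarction"}, scores)
```

In this example "heart attack" has the higher plain TF-IDF, but "myocardial infarction" appears in more heavily cited documents and therefore becomes the subject word once importance is taken into account.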
In the embodiment of the application, based on the obtained phrase co-cited similarity network, community detection is carried out in a preset community detection mode to obtain a series of communities, each community is hierarchically clustered, and the phrase with the largest TF-IDF value in each obtained near-synonym set is selected as the standard subject word. Near-synonym mining combines citation information and semantic information, a phrase similarity network based on co-citation information is constructed, and community detection is applied to the phrase similarity network for the first time. The obtained near-synonyms have both similar semantics and similar topics, which better fits the topic mining scenario; the method does not rely on annotated data or a specific corpus, generalizes well, and improves the accuracy of subject word screening.
It should be noted that, in the document subject word aggregation method described in each of the above embodiments, the technical features included in different embodiments can be recombined as required to obtain a combined embodiment, and all such combinations are within the protection scope claimed in the present application.
Referring to fig. 6, fig. 6 is a schematic block diagram of a document theme word aggregation apparatus according to an embodiment of the present application. Corresponding to the document subject word aggregation method, the embodiment of the application also provides a document theme word aggregation apparatus. As shown in fig. 6, the document theme word aggregation apparatus includes units for performing the above-described document subject word aggregation method, and may be configured in a computer device. Specifically, the document theme word aggregation apparatus 600 includes an acquiring unit 601, an extracting unit 602, a clustering unit 603, and a screening unit 604.
The acquiring unit 601 is configured to acquire document data, where the document data includes a document title, a document abstract, and citation information corresponding to each document;
an extracting unit 602, configured to extract noun phrases contained in the document titles and the document summaries by using a preset natural language processing tool;
a clustering unit 603, configured to cluster the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set;
a screening unit 604, configured to screen out the target noun phrase with the highest word frequency from the near-synonym set as a subject word of the document.
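The division into four units can be pictured as a small pipeline. The sketch below is only one possible arrangement: the class and field names (`Document`, `cited_by`, `SubjectWordAggregator`) are hypothetical, the acquiring unit is represented simply by the `documents` argument, and the three callables are trivial stand-ins rather than the actual extraction, clustering, and screening logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Document:
    title: str
    abstract: str
    cited_by: List[str]  # hypothetical representation of the citation information


@dataclass
class SubjectWordAggregator:
    extract: Callable[[Document], List[str]]  # extracting unit
    cluster: Callable[[Dict[str, List[str]], Dict[str, Document]], List[Set[str]]]  # clustering unit
    screen: Callable[[Set[str]], str]  # screening unit

    def run(self, documents: Dict[str, Document]) -> Dict[str, str]:
        """Map every noun phrase to the subject word chosen for its near-synonym set."""
        phrases = {doc_id: self.extract(doc) for doc_id, doc in documents.items()}
        mapping = {}
        for synonyms in self.cluster(phrases, documents):
            subject = self.screen(synonyms)
            for phrase in synonyms:
                mapping[phrase] = subject
        return mapping


# Stand-in units, for illustration only.
aggregator = SubjectWordAggregator(
    extract=lambda doc: [doc.title.lower()],
    cluster=lambda phrases, docs: [{p for ps in phrases.values() for p in ps}],
    screen=lambda synonyms: min(synonyms),
)
documents = {
    "d1": Document("Myocardial infarction", "Abstract text.", ["d3"]),
    "d2": Document("Heart attack", "Abstract text.", ["d3"]),
}
mapping = aggregator.run(documents)
```

Passing the units in as callables mirrors the remark elsewhere in the description that the units may be divided or combined differently in other embodiments.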
In one embodiment, the clustering unit 603 includes:
the establishing subunit is used for establishing semantic similarity based on the noun phrases according to the noun phrases;
the first construction subunit is used for constructing a document co-cited network corresponding to the documents based on the citation information;
the first calculating subunit is used for calculating the document co-cited similarity corresponding to the documents according to the document co-cited network;
the second construction subunit is used for constructing a phrase co-cited similarity network corresponding to the noun phrases according to the document co-cited similarity;
a first obtaining subunit, configured to obtain the phrase co-cited similarity corresponding to the noun phrases according to the phrase co-cited similarity network;
and the first clustering subunit is used for clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity so as to obtain a near-synonym set.
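The chain from citation information to phrase-level similarity can be sketched as below. The formulas are assumptions, since this part of the description does not restate them: document co-cited similarity is taken here as the Jaccard overlap of the sets of citing documents, and the phrase-level edge weight as the highest document similarity between a document containing one phrase and a document containing the other. All names and data are hypothetical.

```python
from itertools import combinations


def document_cocited_similarity(cited_by):
    """Jaccard overlap of citing-document sets for each document pair (assumed measure)."""
    sims = {}
    for a, b in combinations(sorted(cited_by), 2):
        union = cited_by[a] | cited_by[b]
        if union:
            sims[frozenset((a, b))] = len(cited_by[a] & cited_by[b]) / len(union)
    return sims


def phrase_cocited_network(doc_phrases, doc_sims):
    """Weighted phrase graph: edge weight = best similarity between documents
    containing the two phrases (assumed aggregation)."""
    phrase_docs = {}
    for doc, phrases in doc_phrases.items():
        for phrase in phrases:
            phrase_docs.setdefault(phrase, set()).add(doc)
    edges = {}
    for p, q in combinations(sorted(phrase_docs), 2):
        best = max(
            (
                doc_sims.get(frozenset((a, b)), 0.0)
                for a in phrase_docs[p]
                for b in phrase_docs[q]
                if a != b
            ),
            default=0.0,
        )
        if best > 0:
            edges[frozenset((p, q))] = best
    return edges


# Hypothetical data: which documents cite d1..d3, and the phrases found in each.
cited_by = {"d1": {"x", "y", "z"}, "d2": {"y", "z"}, "d3": {"w"}}
doc_phrases = {
    "d1": ["heart attack"],
    "d2": ["myocardial infarction"],
    "d3": ["stroke"],
}
doc_sims = document_cocited_similarity(cited_by)
edges = phrase_cocited_network(doc_phrases, doc_sims)
```

In this toy data d1 and d2 share two of three citing documents, so their phrases are linked with weight 2/3, while "stroke" stays unconnected because d3 is never co-cited with the others.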
In one embodiment, the establishing subunit comprises:
the input subunit is used for inputting the noun phrases into a preset Biobert model so as to obtain semantic vectors corresponding to the noun phrases;
and the second calculating subunit is used for calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.
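The second of these two steps, cosine similarity between semantic vectors, can be illustrated without loading any model. In the sketch below the vectors are tiny made-up stand-ins for the embeddings that the preset Biobert model would produce; only the similarity computation itself is shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional "semantic vectors"; real embeddings have many more dimensions.
vectors = {
    "heart attack": [3.0, 4.0],
    "myocardial infarction": [6.0, 8.0],
    "stroke": [-4.0, 3.0],
}
sim_close = cosine_similarity(vectors["heart attack"], vectors["myocardial infarction"])
sim_far = cosine_similarity(vectors["heart attack"], vectors["stroke"])
```

Because cosine similarity depends only on direction, the first two vectors score 1 even though their lengths differ, and the orthogonal third vector scores 0.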
In one embodiment, the document theme word aggregation apparatus 600 further includes:
the detection unit is used for carrying out community detection in a preset community detection mode on the basis of the phrase co-cited similarity network so as to obtain a plurality of phrase communities;
the first clustering subunit includes:
the second clustering subunit is used for clustering the phrase community according to the phrase co-cited similarity corresponding to the noun phrases so as to obtain a first cluster;
the third clustering subunit is used for clustering the phrase community according to the semantic similarity corresponding to the noun phrases so as to obtain a second cluster;
a judging subunit, configured to judge whether two noun phrases are both contained in the first cluster and in the second cluster;
a determining subunit, configured to determine that the two noun phrases are near-synonyms if the two noun phrases are both contained in the first cluster and in the second cluster, so as to obtain near-synonym phrases;
and the combination subunit is used for combining all the near-synonym phrases into a set so as to obtain a near-synonym set.
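The description leaves the "preset community detection mode" open. As one possible stand-in, the sketch below uses a deterministic weighted label propagation: every phrase repeatedly adopts the label that carries the most edge weight among its neighbors. The algorithm choice, the function name, and the example graph are assumptions, not the method fixed by the source.

```python
from collections import defaultdict


def label_propagation(edges, max_rounds=100):
    """Group nodes of a weighted graph into communities by label propagation.

    `edges` maps (node_a, node_b) tuples to positive weights. Nodes are
    visited in sorted order and ties go to the smallest label, so the
    result is deterministic.
    """
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight
    labels = {node: node for node in neighbors}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbors):
            totals = defaultdict(float)
            for other, weight in neighbors[node].items():
                totals[labels[other]] += weight
            best = max(sorted(totals), key=lambda label: totals[label])
            # Switch only for a strictly heavier label, which avoids oscillation.
            if totals[best] > totals.get(labels[node], 0.0):
                labels[node] = best
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return sorted(communities.values(), key=lambda c: sorted(c))


# Hypothetical phrase graph: two tight triangles joined by one weak edge.
edges = {
    ("p1", "p2"): 0.9, ("p2", "p3"): 0.8, ("p1", "p3"): 0.85,
    ("p4", "p5"): 0.9, ("p5", "p6"): 0.8, ("p4", "p6"): 0.85,
    ("p3", "p4"): 0.1,
}
communities = label_propagation(edges)
```

The weak p3-p4 edge is not enough to pull the two triangles together, so two phrase communities come out. The first and second clusters described in the list above would then be computed inside each community separately.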
In one embodiment, the screening unit 604 includes:
the screening subunit is used for screening out the noun phrase with the highest TF-IDF value from the near-synonym set according to a preset TF-IDF algorithm to serve as the target noun phrase;
and the second acquisition subunit is used for taking the target noun phrase as a subject word of the document.
It should be noted that, as can be clearly understood by those skilled in the art, the detailed implementation process of the above-mentioned document theme word aggregation apparatus and of each of its units may refer to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, no further description is provided herein.
Meanwhile, the division and connection of the units in the document theme word aggregation apparatus are only used as examples. In other embodiments, the document theme word aggregation apparatus may be divided into different units as required, or the units in the document theme word aggregation apparatus may be connected in different orders and manners to complete all or part of the functions of the document theme word aggregation apparatus.
The above-mentioned document theme word aggregation apparatus may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 7.
Referring to fig. 7, fig. 7 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 700 may be a computer device such as a desktop computer or a server, or may be a component or part of another device.
Referring to fig. 7, the computer device 700 includes a processor 702, memory, and a network interface 705 coupled via a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704.
The non-volatile storage medium 703 may store an operating system 7031 and a computer program 7032. The computer program 7032, when executed, causes the processor 702 to perform a method for aggregation of document topics as described above.
The processor 702 is configured to provide computing and control capabilities to support the operation of the overall computer device 700.
The internal memory 704 provides an environment for running a computer program 7032 on the non-volatile storage medium 703, and the computer program 7032, when executed by the processor 702, causes the processor 702 to perform a method for aggregating topic words of documents as described above.
The network interface 705 is used for network communication with other devices. Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of the parts of the structure related to the solution of the present application and does not limit the computer device 700 to which the solution is applied; a particular computer device 700 may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 7, and are not described herein again.
Wherein the processor 702 is configured to run the computer program 7032 stored in the memory to perform the steps of: acquiring document data, wherein the document data comprises a document title, a document abstract and citation information corresponding to each document; extracting noun phrases contained in the document titles and the document abstracts by adopting a preset natural language processing tool; clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set; and screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document.
In an embodiment, when the processor 702 implements the step of clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set, the following steps are implemented:
establishing semantic similarity based on the noun phrases according to the noun phrases;
constructing a document co-cited network corresponding to the documents based on the citation information;
calculating the document co-cited similarity corresponding to the documents according to the document co-cited network;
constructing a phrase co-cited similarity network corresponding to the noun phrases according to the document co-cited similarity;
obtaining the phrase co-cited similarity corresponding to the noun phrases according to the phrase co-cited similarity network;
and clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set.
In one embodiment, when the processor 702 implements the step of establishing semantic similarity based on the noun phrases according to the noun phrases, the processor implements the following steps:
inputting the noun phrases into a preset Biobert model to obtain semantic vectors corresponding to the noun phrases;
and calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.
In an embodiment, before the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set, the processor 702 further performs the following steps:
based on the phrase co-cited similarity network, carrying out community detection in a preset community detection mode to obtain a plurality of phrase communities;
when the processor 702 implements the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set, the following steps are specifically implemented:
clustering the phrase community according to the phrase co-cited similarity corresponding to the noun phrases to obtain a first cluster;
clustering the phrase community according to the semantic similarity corresponding to the noun phrases to obtain a second cluster;
judging whether two noun phrases are both contained in the first cluster and in the second cluster;
if the two noun phrases are both contained in the first cluster and in the second cluster, determining that the two noun phrases are near-synonyms, so as to obtain near-synonym phrases;
and combining all the near-synonym phrases into a set to obtain a near-synonym set.
In one embodiment, when the processor 702 performs the step of screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document, the following steps are specifically performed:
screening out the noun phrase with the highest TF-IDF value from the near-synonym set according to a preset TF-IDF algorithm to serve as the target noun phrase;
and taking the target noun phrase as a subject word of the document.
It should be understood that, in the embodiment of the present Application, the Processor 702 may be a Central Processing Unit (CPU), and the Processor 702 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will be understood by those skilled in the art that all or part of the processes in the method for implementing the above embodiments may be implemented by a computer program, and the computer program may be stored in a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to perform the steps of the document subject word aggregation method described in the embodiments above.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the steps of the document subject word aggregation method described in the embodiments above.
The computer readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the apparatus, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the apparatus. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The storage medium is a physical, non-transitory storage medium, and may be any physical storage medium capable of storing a computer program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the part of the technical solution of the present application that is essential or that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a terminal, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A document subject word aggregation method, the method comprising:
acquiring literature data, wherein the literature data comprises a literature title, a literature abstract and citation information corresponding to each literature;
extracting noun phrases contained in the document titles and the document abstracts by adopting a preset natural language processing tool;
clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set, comprising: establishing semantic similarity based on the noun phrases according to the noun phrases; constructing a document co-cited network corresponding to the documents based on the citation information; calculating the document co-cited similarity corresponding to the documents according to the document co-cited network; constructing a phrase co-cited similarity network corresponding to the noun phrases according to the document co-cited similarity; obtaining the phrase co-cited similarity corresponding to the noun phrases according to the phrase co-cited similarity network; clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set;
and screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document.
2. The method of claim 1, wherein the step of establishing semantic similarity based on the noun phrases according to the noun phrases comprises:
inputting the noun phrases into a preset Biobert model to obtain semantic vectors corresponding to the noun phrases;
and calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.
3. The document subject word aggregation method according to claim 1 or 2, wherein before the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set, the method further comprises:
based on the phrase co-cited similarity network, carrying out community detection in a preset community detection mode to obtain a plurality of phrase communities;
the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set comprises:
clustering the phrase community according to the phrase co-cited similarity corresponding to the noun phrases to obtain a first cluster;
clustering the phrase community according to the semantic similarity corresponding to the noun phrases to obtain a second cluster;
judging whether two noun phrases are both contained in the first cluster and in the second cluster;
if the two noun phrases are both contained in the first cluster and in the second cluster, determining that the two noun phrases are near-synonyms, so as to obtain near-synonym phrases;
and combining all the near-synonym phrases into a set to obtain a near-synonym set.
4. The method of claim 1, wherein the step of screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document comprises:
screening out the noun phrase with the highest TF-IDF value from the near-synonym set according to a preset TF-IDF algorithm to serve as the target noun phrase;
and taking the target noun phrase as a subject word of the document.
5. A document theme word aggregation apparatus, comprising:
an acquiring unit, configured to acquire document data, wherein the document data comprises a document title, a document abstract and citation information corresponding to each document;
the extraction unit is used for extracting the noun phrases contained in the document titles and the document abstracts by adopting a preset natural language processing tool;
a clustering unit, configured to cluster the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set, where the clustering unit includes: the establishing subunit is used for establishing semantic similarity based on the noun phrases according to the noun phrases; the first construction subunit is used for constructing a document co-cited network corresponding to the documents based on the citation information; the first calculating subunit is used for calculating the document co-cited similarity corresponding to the documents according to the document co-cited network; the second construction subunit is used for constructing a phrase co-cited similarity network corresponding to the noun phrases according to the document co-cited similarity; the acquisition subunit is used for acquiring the phrase co-cited similarity corresponding to the noun phrases according to the phrase co-cited similarity network; the clustering subunit is used for clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain a near-synonym set;
and the screening unit is used for screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document.
6. The document theme word aggregation apparatus of claim 5, wherein the creating subunit comprises:
the input subunit is used for inputting the noun phrases into a preset Biobert model so as to obtain semantic vectors corresponding to the noun phrases;
and the second calculating subunit is used for calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.
7. A computer device, comprising a memory and a processor coupled to the memory; the memory is used for storing a computer program; the processor is adapted to run the computer program to perform the steps of the method according to any of claims 1-4.
8. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when being executed by a processor, realizes the steps of the method according to any one of claims 1 to 4.
CN202010744556.7A 2020-07-29 2020-07-29 Document subject word aggregation method and device, computer equipment and readable storage medium Active CN111898366B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010744556.7A CN111898366B (en) 2020-07-29 2020-07-29 Document subject word aggregation method and device, computer equipment and readable storage medium
PCT/CN2020/118699 WO2021139262A1 (en) 2020-07-29 2020-09-29 Document mesh term aggregation method and apparatus, computer device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010744556.7A CN111898366B (en) 2020-07-29 2020-07-29 Document subject word aggregation method and device, computer equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111898366A CN111898366A (en) 2020-11-06
CN111898366B true CN111898366B (en) 2022-08-09

Family

ID=73182439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010744556.7A Active CN111898366B (en) 2020-07-29 2020-07-29 Document subject word aggregation method and device, computer equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN111898366B (en)
WO (1) WO2021139262A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667810B (en) * 2020-12-25 2024-07-23 平安科技(深圳)有限公司 Document clustering, device, electronic equipment and storage medium
CN114691861A (en) * 2020-12-28 2022-07-01 北京市博汇科技股份有限公司 Topic clustering method based on subject term semantic similarity
CN113111180B (en) * 2021-03-22 2022-01-25 杭州祺鲸科技有限公司 Chinese medical synonym clustering method based on deep pre-training neural network
CN113392072B (en) * 2021-06-25 2022-08-02 中国标准化研究院 Standard knowledge service method, device, electronic equipment and storage medium
CN113704412B (en) * 2021-08-31 2023-05-02 交通运输部科学研究院 Early identification method for revolutionary research literature in transportation field
CN113705217B (en) * 2021-09-01 2024-05-28 国网江苏省电力有限公司电力科学研究院 Literature recommendation method and device for knowledge learning in electric power field
CN113806237B (en) * 2021-11-18 2022-03-08 杭州费尔斯通科技有限公司 Language understanding model evaluation method and system based on dictionary
CN114201962B (en) * 2021-12-03 2023-07-25 中国中医科学院中医药信息研究所 Method, device, medium and equipment for analyzing paper novelty
CN114528390B (en) * 2022-02-22 2024-10-29 黄河勘测规划设计研究院有限公司 Yellow river basin evolution analysis method based on text mining
CN115713085B (en) * 2022-10-31 2023-11-07 北京市农林科学院 Method and device for analyzing literature topic content
CN116303904A (en) * 2022-12-27 2023-06-23 药融云数字科技(成都)有限公司 Medical literature searching method, system, storage medium and terminal
CN116644338B (en) * 2023-06-01 2024-01-30 北京智谱华章科技有限公司 Literature topic classification method, device, equipment and medium based on mixed similarity
CN117391073B (en) * 2023-09-22 2024-09-06 北京工业大学 Document identification method, device, electronic equipment and storage medium
CN118052225A (en) * 2024-02-28 2024-05-17 中国科学院文献情报中心 Method, device, equipment and medium for extracting research question phrases
CN118069851B (en) * 2024-04-18 2024-08-20 中国标准化研究院 Intelligent document information intelligent classification retrieval method and system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6978274B1 (en) * 2001-08-31 2005-12-20 Attenex Corporation System and method for dynamically evaluating latent concepts in unstructured documents
JP2006139718A (en) * 2004-11-15 2006-06-01 Nippon Telegr & Teleph Corp <Ntt> Topic word association method, and topic word association/representative word extraction method, device and program
JP2012043048A (en) * 2010-08-16 2012-03-01 Kddi Corp Binomial relationship categorization program, method, and device for categorizing semantically similar situation pair by binomial relationship
CN105956130A (en) * 2016-05-09 2016-09-21 浙江农林大学 Multi-information fusion scientific research literature theme discovering and tracking method and system thereof
CN106897436A (en) * 2017-02-28 2017-06-27 北京邮电大学 A kind of academic research hot keyword extracting method inferred based on variation
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN109117436A (en) * 2017-06-26 2019-01-01 上海新飞凡电子商务有限公司 Synonym automatic discovering method and its system based on topic model
CN110321553A (en) * 2019-05-30 2019-10-11 平安科技(深圳)有限公司 Short text subject identifying method, device and computer readable storage medium
CN110489745A (en) * 2019-07-31 2019-11-22 北京大学 The detection method of paper text similarity based on citation network
CN110851602A (en) * 2019-11-13 2020-02-28 精硕科技(北京)股份有限公司 Method and device for topic clustering

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566360B2 (en) * 2010-05-28 2013-10-22 Drexel University System and method for automatically generating systematic reviews of a scientific field
CN110020034B (en) * 2018-06-29 2023-12-08 程宇镳 Information quotation analysis method and system
US20200117751A1 (en) * 2018-10-10 2020-04-16 Twinword Inc. Context-aware computing apparatus and method of determining topic word in document using the same
CN110349632B (en) * 2019-06-28 2020-06-16 南方医科大学 Method for screening gene keywords from PubMed literature
CN111079422B (en) * 2019-12-13 2023-07-14 北京小米移动软件有限公司 Keyword extraction method, keyword extraction device and storage medium
CN111143511A (en) * 2019-12-16 2020-05-12 北京工业大学 Emerging technology prediction method, emerging technology prediction device, electronic equipment and medium
CN111259156A (en) * 2020-02-18 2020-06-09 北京航空航天大学 Hot spot clustering method facing time sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
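A Recommendation System Based on Hierarchical Clustering of an Article-Level Citation Network; Jevin D. West et al.; IEEE Transactions on Big Data; 2016-07-29; Vol. 2 (No. 2); pp. 113-123 *
Research on a topic identification method for scientific and technical texts based on multi-relation fusion (基于多元关系融合的科技文本主题识别方法研究); Xu Haiyun (许海云) et al.; Journal of Library Science in China (中国图书馆学报); 2019-01-31; Vol. 45 (No. 1); pp. 82-93 *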

Also Published As

Publication number Publication date
CN111898366A (en) 2020-11-06
WO2021139262A1 (en) 2021-07-15

Similar Documents

Publication Publication Date Title
CN111898366B (en) Document subject word aggregation method and device, computer equipment and readable storage medium
CN111104794B (en) Text similarity matching method based on subject term
CN108073568B (en) Keyword extraction method and device
Trstenjak et al. KNN with TF-IDF based framework for text categorization
US20200081899A1 (en) Automated database schema matching
CN110019732B (en) Intelligent question answering method and related device
CN110334209B (en) Text classification method, device, medium and electronic equipment
WO2022121163A1 (en) User behavior tendency identification method, apparatus, and device, and storage medium
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN112329460B (en) Text topic clustering method, device, equipment and storage medium
CN110688452B (en) Text semantic similarity evaluation method, system, medium and device
CN112541056A (en) Medical term standardization method, device, electronic equipment and storage medium
CN113486670B (en) Text classification method, device, equipment and storage medium based on target semantics
CN112836039B (en) Voice data processing method and device based on deep learning
Wijewickrema et al. Selecting a text similarity measure for a content-based recommender system: A comparison in two corpora
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN114969387A (en) Document author information disambiguation method and device and electronic equipment
CN108021595B (en) Method and device for checking knowledge base triples
Mohemad et al. Performance analysis in text clustering using k-means and k-medoids algorithms for Malay crime documents
CN112417147A (en) Method and device for selecting training samples
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query
CN109522928A (en) Theme sentiment analysis method, apparatus, electronic equipment and the storage medium of text
CN112215006B (en) Organization named entity normalization method and system
CN111341404B (en) Electronic medical record data set analysis method and system based on ernie model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant