CN111898366B

CN111898366B - Document subject word aggregation method and device, computer equipment and readable storage medium

Info

Publication number: CN111898366B
Application number: CN202010744556.7A
Authority: CN
Inventors: 柴玲
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2022-08-09
Anticipated expiration: 2040-07-29
Also published as: CN111898366A; WO2021139262A1

Abstract

The embodiments of the present application provide a document subject word aggregation method and apparatus, a computer device, and a computer-readable storage medium, and belong to the technical field of semantic processing. Document data is acquired, comprising a document title, a document abstract and citation information corresponding to each document; noun phrases are extracted from the document titles and document abstracts with a preset natural language processing tool; the noun phrases are clustered based on the citation information and the noun phrases to obtain a near-synonym set; and the target noun phrase with the highest word frequency in the near-synonym set is screened out as the subject word of the document, which improves the accuracy of document subject word aggregation.

Description

Document subject word aggregation method and device, computer equipment and readable storage medium

Technical Field

The present application relates to the field of semantic processing technologies, and in particular, to a method and an apparatus for aggregating document topic terms, a computer device, and a computer-readable storage medium.

Background

In technical research, it is important to follow how the research hotspots of a field change over time and to know its latest hotspots. Although literature databases label documents with subject tags, in many cases the tags describe the subject inaccurately. For example, medical researchers need to grasp the changing and latest research hotspots of a field, which not only improves research efficiency but also greatly helps in diagnosing and treating difficult and complicated diseases. Although most documents in the PubMed medical literature database carry expert-assigned labels (i.e. MeSH terms) or keywords, MeSH annotation is labor-intensive, and MeSH terms are assigned from many different angles (such as diseases, drugs and species), so in most cases they do not represent the specific research hotspot of a document; keywords, in turn, tend to be general and biased toward the authors' subjective choices. Therefore, most scientometric analyses select noun phrases from titles and abstracts as candidate subject words of an article, so that the information they carry is closer to the actual research content of the document. However, when subject analysis is performed directly on phrases from titles and abstracts, synonyms introduce considerable noise. Especially in a subdivided field such as lung cancer, the representative topic words selected by mainstream topic models such as LDA often include many similar or synonymous terms, which makes the information redundant and inaccurate. For example, "non-small cell lung cancer", "non-small cell lung carcinoma", "non-small cell lung cancer cells" and "human non-small cell lung cancer" should all be normalized to the same subject word, "non-small cell lung cancer".

In the conventional technology, synonymous terms in documents are generally characterized by word-level semantic similarity, and general synonym acquisition considers only sentence-level information such as context and part of speech; as a result, the accuracy of document subject word aggregation is low.

Disclosure of Invention

The embodiments of the present application provide a document subject word aggregation method and apparatus, a computer device, and a computer-readable storage medium, which can solve the problem of low accuracy of document subject word aggregation in the prior art.

In a first aspect, an embodiment of the present application provides a document subject word aggregation method, the method including: acquiring document data, wherein the document data comprises a document title, a document abstract and citation information corresponding to each document; extracting the noun phrases contained in the document titles and the document abstracts with a preset natural language processing tool; clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set; and screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document.

In a second aspect, an embodiment of the present application further provides a document subject word aggregation apparatus, including: an acquisition unit for acquiring document data, wherein the document data comprises a document title, a document abstract and citation information corresponding to each document; an extraction unit for extracting the noun phrases contained in the document titles and the document abstracts with a preset natural language processing tool; a clustering unit for clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set; and a screening unit for screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document.

In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the document subject word aggregation method when executing the computer program.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, causes the processor to perform the steps of the document subject word aggregation method.

The embodiments of the present application provide a document subject word aggregation method and apparatus, a computer device, and a computer-readable storage medium. Document data is acquired, comprising a document title, a document abstract and citation information corresponding to each document; noun phrases contained in the document titles and document abstracts are extracted with a preset natural language processing tool; the noun phrases are clustered based on the citation information and the noun phrases to obtain a near-synonym set; and the target noun phrase with the highest word frequency is screened out from the near-synonym set as the subject word of the document. Because noun phrases and citation information are combined, phrase-level near-synonym processing is applied to the document mining scenario, and the similarity of noun phrases is characterized with the help of citation information. Compared with the conventional technology, which characterizes terms by word-level semantic similarity and considers only sentence-level information, the characterization used in the embodiments of the present application fully captures the similarity between the topics represented by two noun phrases, so that the aggregated subject words accurately describe the topics of the documents and the accuracy of document subject word aggregation is improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of the document subject word aggregation method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a sub-flow of the document subject word aggregation method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of an example of a document co-cited network in the document subject word aggregation method provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of another sub-flow of the document subject word aggregation method provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of an aggregation process of the document subject word aggregation method provided by an embodiment of the present application;

FIG. 6 is a schematic block diagram of the document subject word aggregation apparatus provided by an embodiment of the present application; and

FIG. 7 is a schematic block diagram of the computer device provided by an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Referring to fig. 1, fig. 1 is a schematic flow chart of a document topic word aggregation method according to an embodiment of the present application. As shown in fig. 1, the method comprises the following steps S101-S104:

S101: Acquire document data, wherein the document data comprises a document title, a document abstract and citation information corresponding to each document.

Specifically, document data can be retrieved from a preset database by keyword search. The document data includes the title and abstract of each document and the citation information corresponding to each document, where the citation information is the mutual citation relationship among the documents. For example, the PubMed database is generally used to search documents: a keyword search yields the titles, abstracts and mutual citation relationships of all documents in a specific field contained in PubMed. Searching for "lung cancer", for instance, downloads the titles, abstracts and citation relationships of the documents related to lung cancer.

S102: Extract the noun phrases contained in the document titles and the document abstracts with a preset natural language processing tool.

The natural language processing tool may be Stanford NLP, TextBlob, Polyglot, or another natural language processing tool capable of extracting noun phrases.

Specifically, after the titles, abstracts and mutual citation relationships of all documents in a preset specific field have been retrieved, noun phrases are extracted from the titles and abstracts with the preset natural language processing tool, for example by using the part-of-speech tagging in the Stanford CoreNLP tool. Further, abbreviations in the extracted noun phrases can be mapped to their full names using the abbreviation detection in scispaCy. For example, denoting a document as A, document A is finally represented as a set of phrases {P1, P2, P3, …, Pn}.

Further, the specific process of extracting noun phrases is as follows:

1) Extract the noun phrases contained in the document title and the document abstract with a preset natural language processing tool, for example Stanford NLP.

2) Remove high-frequency common words from the noun phrases. For example, noun phrases such as "cancer" that are among the 2000 most frequent words in the Wikipedia corpus are deleted, so that high-frequency common vocabulary does not affect the aggregation of subject words.

3) Detect whether the noun phrases contain abbreviations and, if so, replace each abbreviation with its full name according to a preset replacement lexicon. For example, the scispaCy tool is used to extract the abbreviations and full names appearing in the article, such as {QOL: quality of life}, and every occurrence of an abbreviation is replaced with its full name, thereby obtaining a phrase set.
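Steps 2) and 3) can be illustrated with a minimal Python sketch. It assumes that the noun phrases have already been extracted in step 1); the word list `COMMON_WORDS`, the lexicon `ABBREVIATIONS` and the function names are hypothetical stand-ins for the Wikipedia frequency list and the abbreviation pairs detected by a tool, and are not taken from the application.

```python
import re

# Hypothetical stand-ins: a real run would use the 2000 most frequent
# Wikipedia words and the abbreviation pairs detected in the articles.
COMMON_WORDS = {"cancer", "study", "patient"}
ABBREVIATIONS = {"QOL": "quality of life", "NSCLC": "non-small cell lung cancer"}


def expand_abbreviations(phrase, abbreviations):
    """Replace each whole-token abbreviation with its full name."""
    def replace(match):
        token = match.group(0)
        return abbreviations.get(token, token)
    return re.sub(r"[A-Za-z0-9-]+", replace, phrase)


def clean_phrases(phrases, common_words, abbreviations):
    """Drop high-frequency common phrases, then expand abbreviations."""
    cleaned = []
    for phrase in phrases:
        if phrase.lower() in common_words:
            continue  # step 2: skip common vocabulary such as "cancer"
        expanded = expand_abbreviations(phrase, abbreviations)  # step 3
        if expanded not in cleaned:  # keep the phrase set free of duplicates
            cleaned.append(expanded)
    return cleaned


phrases = ["cancer", "QOL", "NSCLC cells", "quality of life"]
phrase_set = clean_phrases(phrases, COMMON_WORDS, ABBREVIATIONS)
print(phrase_set)
```

In this sketch "QOL" and "quality of life" collapse into a single entry once the abbreviation is expanded, which is the effect the replacement step is meant to achieve.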

S103: Cluster the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set.

Specifically, a co-cited network of the documents is constructed from the retrieved citation information, the co-cited relationship between the noun phrases is derived from the document co-cited network, and the semantic similarity of the noun phrases is computed. The noun phrases are then clustered according to the co-cited relationship and the semantic similarity between them to obtain the near-synonym set.

S104: Screen out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document.

Specifically, after the near-synonym set is obtained, a target noun phrase that meets the requirement is screened out from it and used as the subject word of the document. For example, the noun phrase that occurs most frequently in the near-synonym set is taken as the target noun phrase, giving the subject word of the document.

In the embodiments of the present application, because noun phrases and citation information are combined, phrase-level near-synonym processing is applied to the document mining scenario, and the similarity of noun phrases is characterized with the help of citation information. Compared with the prior art, which uses only word-level semantic similarity and considers only sentence-level information, this characterization fully captures the similarity between the topics represented by two noun phrases, so that the aggregated subject words accurately describe the topics of the documents and the accuracy of document subject word aggregation is improved.

Referring to fig. 2, fig. 2 is a schematic diagram of a sub-flow of the document subject word aggregation method provided by an embodiment of the present application. In this embodiment, the step of clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set includes:

S201: Establish the semantic similarity between the noun phrases from the noun phrases.

The semantic similarity describes how close the meanings of noun phrases are, and may be measured by cosine similarity, Euclidean distance, Minkowski distance, or the like.

Specifically, the noun phrases are vectorized and the similarity between the vectors of two noun phrases is computed, which quantifies the similarity between the two noun phrases.

Further, the step of establishing the semantic similarity between the noun phrases from the noun phrases includes:

inputting the noun phrases into a preset BioBERT model to obtain the semantic vectors corresponding to the noun phrases; and calculating the cosine similarity between the semantic vectors to obtain the semantic similarity corresponding to the noun phrases.

Specifically, a BioBERT-based semantic similarity is established from the extracted noun phrases: the output vectors of a pre-trained BioBERT model represent the phrase semantics, and the cosine similarity between the vectors gives the semantic similarity of the noun phrases. BioBERT is a BERT model trained on a large medical corpus and can effectively represent the semantics of medical words and phrases. Inputting the extracted noun phrases into the BioBERT model yields phrase-level semantic vector representations; for example, each noun phrase is converted into a 768-dimensional vector, and the cosine similarity between the vectors then gives the semantic similarity between phrases. For phrase-level data, a deep learning model based on the pre-trained BioBERT model can be trained in advance to represent the context features and the semantic information of the noun phrases respectively. In this way, when subject words are aggregated in the embodiments of the present application, phrase-level near-synonym processing is used, and the similarity between the topics represented by two noun phrases is fully characterized through phrase-level similarity, which improves the accuracy of the semantic similarity computed for the noun phrases.
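The cosine similarity step can be sketched as follows. This is an illustration only: the 4-dimensional vectors are hypothetical stand-ins for the 768-dimensional model outputs, no model is loaded, and the function and variable names are assumptions rather than part of the application.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def pairwise_similarity(vectors):
    """Semantic similarity for every unordered pair of phrases."""
    names = sorted(vectors)
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical low-dimensional vectors standing in for model outputs.
phrase_vectors = {
    "non-small cell lung cancer": [0.9, 0.1, 0.3, 0.0],
    "non-small cell lung carcinoma": [0.8, 0.2, 0.3, 0.1],
    "quality of life": [0.0, 0.9, 0.1, 0.7],
}

semantic_similarity = pairwise_similarity(phrase_vectors)
for pair, value in semantic_similarity.items():
    print(pair, round(value, 3))
```

With these made-up vectors the two lung cancer phrases score close to 1, while either of them paired with "quality of life" scores much lower.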

S202: Construct the document co-cited network corresponding to the documents based on the citation information.

S203: Calculate the document co-cited similarity corresponding to the documents according to the document co-cited network.

Here, co-citation means that two documents are cited together by the same document; such documents are said to be co-cited.

Specifically, after the citation information of the documents is acquired, a document citation network can be constructed from the citation relationships among the documents. Viewed from the side of the cited documents, this network is called a document cited network; when several documents are cited together by the same documents, the network is called a document co-cited network.

In scientometric analysis, two articles cited by the same article are considered to have a certain topic similarity. Referring to fig. 3, fig. 3 is a schematic diagram of an example of a document co-cited network in the document subject word aggregation method provided by an embodiment of the present application. As shown in fig. 3, A and B are both cited by C, so A and B have topic similarity; a co-cited network of the documents citing A and B can therefore be constructed to obtain the similarity between A and B. For example, fig. 3 shows the network formed by documents C, D and E, which cite document A. Suppose the constructed co-cited network consists of documents A1, A2, …, Am, where the documents A1, A2, …, Am are nodes and the weight of an edge is the co-cited similarity between its two nodes. For two documents at nodes A1 and A2, the sets of documents citing them intersect whenever the same document cites both, and this intersection can be used to measure the similarity between A1 and A2. In fig. 3, A is cited by C, D and E, B is cited by C and D, and A and B are therefore co-cited by C and D, which measures the similarity between A and B; the statistic that measures this similarity is called the co-cited similarity. The calculation formula is as follows:

where M and N denote the sets of documents citing document i and document j, respectively. Taking fig. 3 as an example, the figure contains five documents A, B, C, D and E; an arrow indicates a citation relationship, and the dotted line indicates that A and B are co-cited. For example, the arrow from C to A indicates that document C cites document A. In fig. 3, the three documents C, D and E cite document A, so the citing document set of document A is {C, D, E}; similarly, the citing document set of document B is {C, D}. The co-cited similarity between A and B is:
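The formula and the resulting value are not reproduced in this text. Purely as an illustration, the following Python sketch builds the citing sets M and N from the fig. 3 citation pairs and scores their overlap with the Jaccard index |M ∩ N| / |M ∪ N|. The Jaccard index is an assumption chosen here because it is a common overlap measure for two sets; it is not confirmed to be the formula of the application, and the function names are hypothetical.

```python
from collections import defaultdict


def citing_sets(citations):
    """Map each cited document to the set of documents that cite it."""
    cited_by = defaultdict(set)
    for citing, cited in citations:
        cited_by[cited].add(citing)
    return dict(cited_by)


def co_cited_similarity(m, n):
    """Overlap of two citing sets, scored with the Jaccard index (assumed)."""
    union = m | n
    return len(m & n) / len(union) if union else 0.0


# Fig. 3 example: C, D and E cite A; C and D cite B.
citations = [("C", "A"), ("D", "A"), ("E", "A"), ("C", "B"), ("D", "B")]
cited_by = citing_sets(citations)
sim_ab = co_cited_similarity(cited_by["A"], cited_by["B"])
print(cited_by["A"], cited_by["B"], sim_ab)
```

Under this assumed measure the fig. 3 example gives 2/3, because two of the three documents that cite A or B cite both; a different overlap measure would give a different value.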

S204: Construct the phrase co-cited similarity network corresponding to the noun phrases according to the document co-cited similarity.

S205: Obtain the phrase co-cited similarity corresponding to the noun phrases according to the phrase co-cited similarity network.

S206: Cluster the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain the near-synonym set.

Specifically, after the document co-cited network is constructed, the noun phrases extracted from each document are used to describe that document, so a co-cited similarity between noun phrases can be established from the citation relationships of the documents. By default, the co-cited similarity of a document i with itself is 1. The similarity between noun phrases is then represented by the co-cited similarity of the documents from which they were extracted, and the calculation formula is as follows:

where X and Y denote the sets of documents containing phrase x and phrase y, respectively. A co-cited similarity network between phrases is thus obtained; the phrase co-cited similarity corresponding to the noun phrases is read from this network, and the noun phrases are clustered according to the phrase co-cited similarity and the semantic similarity to obtain the near-synonym set.
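The phrase-level formula is likewise not reproduced in this text. The sketch below shows one plausible reading, stated here as an assumption: the phrase co-cited similarity is taken as the mean document co-cited similarity over all pairs (i, j) with i in X and j in Y, and a document paired with itself counts as 1, as the text specifies. The averaging rule, the input values and all names are hypothetical.

```python
def document_similarity(i, j, doc_sim):
    """Co-cited similarity of two documents; a document with itself is 1."""
    if i == j:
        return 1.0
    return doc_sim.get(frozenset((i, j)), 0.0)


def phrase_co_cited_similarity(docs_x, docs_y, doc_sim):
    """Mean document similarity over all pairs from X and Y (assumed rule)."""
    if not docs_x or not docs_y:
        return 0.0
    total = sum(
        document_similarity(i, j, doc_sim) for i in docs_x for j in docs_y
    )
    return total / (len(docs_x) * len(docs_y))


# Hypothetical inputs: one document pair with a known co-cited similarity.
doc_sim = {frozenset(("A", "B")): 2 / 3}
phrase_docs = {
    "non-small cell lung cancer": {"A"},
    "non-small cell lung carcinoma": {"A", "B"},
}

sim_xy = phrase_co_cited_similarity(
    phrase_docs["non-small cell lung cancer"],
    phrase_docs["non-small cell lung carcinoma"],
    doc_sim,
)
print(sim_xy)
```

Computing this value for every pair of phrases gives the edge weights of a phrase co-cited similarity network in this reading.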

Further, referring to fig. 4, fig. 4 is a schematic diagram of another sub-flow of the document subject word aggregation method provided by an embodiment of the present application. In this embodiment, before the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain the near-synonym set, the method further includes:

S401: Perform community detection on the phrase co-cited similarity network with a preset community detection method to obtain a plurality of phrase communities.

Community detection finds the closely connected parts of a network, which are called communities: connections within a community are dense, while connections between communities are sparse. Community detection algorithms include the Louvain algorithm, the Newman fast algorithm, the CNM algorithm, the MSG-MV algorithm and the like.

Specifically, community detection is performed on the phrase co-cited similarity network with a preset community detection algorithm, so that the phrases are clustered according to the similarity network into a plurality of communities, each of which contains near-synonyms. For example, the Louvain community detection algorithm is used to mine communities in the phrase co-cited similarity network, finally yielding a series of communities (clusters), and near-synonyms are assumed by default to appear only within the same community. In the embodiments of the present application, community detection is thus incorporated into noun phrase clustering: the extracted noun phrases are first clustered preliminarily by community detection on the phrase co-cited similarity network, and hierarchical clustering is then performed within each community to obtain the near-synonym set. Because near-synonym mining combines citation information and semantic information, a phrase similarity network based on co-citation information is constructed, and community detection is applied to this phrase similarity network to recall a candidate set of near-synonyms first, the amount of computation in the clustering stage is greatly reduced. At the same time, the method does not depend on labeled data or a specific corpus, has good universality, better fits the topic mining scenario, and improves the accuracy of subject word screening.
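The Louvain algorithm mentioned above works by greedily increasing the modularity of a partition. The sketch below is not the Louvain algorithm itself; it only computes the modularity of a given partition of a weighted, undirected network, to show the quantity by which candidate phrase communities are compared. The six-node network and all names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: list of (u, v, weight) with u != v, each edge listed once.
    community_of: dict mapping each node to its community label.
    """
    total_weight = sum(w for _, _, w in edges)
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed node strength per community
    for u, v, w in edges:
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )


# Hypothetical phrase network: two triangles joined by a single edge.
edges = [
    ("p1", "p2", 1.0), ("p2", "p3", 1.0), ("p1", "p3", 1.0),
    ("p4", "p5", 1.0), ("p5", "p6", 1.0), ("p4", "p6", 1.0),
    ("p3", "p4", 1.0),
]
two_communities = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1, "p6": 1}
one_community = {node: 0 for node in two_communities}

q_split = modularity(edges, two_communities)
q_single = modularity(edges, one_community)
print(q_split, q_single)
```

Splitting the network into its two triangles scores higher than keeping all phrases in one community, which is the kind of partition a modularity-based method would prefer.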

Further, the step of clustering the noun phrases according to the phrase co-cited similarity and the semantic similarity to obtain the near-synonym set includes:

S402: Cluster the phrase community according to the phrase co-cited similarity corresponding to the noun phrases to obtain a first cluster.

S403: Cluster the phrase community according to the semantic similarity corresponding to the noun phrases to obtain a second cluster.

S404: For every two noun phrases, determine whether both are contained in the first cluster and in the second cluster.

S405: If the two noun phrases are both contained in the first cluster and in the second cluster, determine that the two noun phrases are near-synonyms to obtain a near-synonym phrase.

S406: If the two noun phrases are not both contained in the first cluster and in the second cluster, determine that the two noun phrases are not near-synonyms.

S407: Combine all near-synonym phrases into one set to obtain the near-synonym set.

Cluster analysis, also called group analysis, is a statistical method for studying the classification of samples or indices, and is also an important data mining technique. A cluster consists of several patterns, where a pattern is typically a vector of measurements or a point in a multidimensional space. Cluster analysis is based on similarity: patterns in the same cluster are more similar to each other than to patterns in different clusters. Clustering algorithms include the K-means algorithm, Mean-Shift clustering, and expectation-maximization (EM) clustering based on a Gaussian mixture model (GMM).

Specifically, after the phrase co-cited similarity network has been clustered by community detection, hierarchical clustering is performed within each community. On the assumption that near-synonyms appear only within the same community, the phrases in each community are hierarchically clustered twice, once using the phrase co-cited similarity and once using the semantic similarity. A threshold can be set for the hierarchical clustering, and phrases that are clustered together in both clusterings are regarded as near-synonyms. That is, for each community, two bottom-up hierarchical clusterings are performed: one uses the co-cited similarity of the noun phrases as the similarity, and the other uses the semantic similarity, for example the BioBERT-based semantic similarity. Two noun phrases are finally determined to be near-synonyms if they are clustered together in both clusterings. Referring to fig. 5, fig. 5 is a schematic diagram of an example aggregation process of the document subject word aggregation method provided by an embodiment of the present application. As shown in fig. 5, the white circles represent noun phrases extracted from one document, the black circles represent noun phrases extracted from a second document, and the gray circles represent noun phrases extracted from a third document; different circles of the same color represent different noun phrases. Since A and B are clustered together in both hierarchical clusterings, A and B can be merged as near-synonyms.
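The rule that two phrases must be grouped together in both clusterings can be sketched in Python as follows. The text does not state the linkage criterion, so single-linkage agglomerative clustering cut at a threshold is assumed here; it is equivalent to taking the connected components of the graph that links every pair whose similarity reaches the threshold. The thresholds, similarity values and names are hypothetical.

```python
from itertools import combinations


def threshold_clusters(items, similarity, threshold):
    """Single-linkage clustering cut at `threshold` (assumed linkage).

    Returns a dict mapping each item to a representative of its cluster.
    """
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if similarity.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)
    return {item: find(item) for item in items}


def near_synonym_pairs(items, co_cited_sim, semantic_sim, t_cited, t_semantic):
    """Pairs grouped together in both the first and the second clustering."""
    first = threshold_clusters(items, co_cited_sim, t_cited)
    second = threshold_clusters(items, semantic_sim, t_semantic)
    return [
        (a, b)
        for a, b in combinations(items, 2)
        if first[a] == first[b] and second[a] == second[b]
    ]


# One hypothetical community with made-up similarity values.
community = ["lung cancer X", "lung carcinoma X", "EGFR mutation"]
x, y, z = community
co_cited_sim = {frozenset((x, y)): 0.8, frozenset((x, z)): 0.7, frozenset((y, z)): 0.6}
semantic_sim = {frozenset((x, y)): 0.95, frozenset((x, z)): 0.2, frozenset((y, z)): 0.25}

pairs = near_synonym_pairs(community, co_cited_sim, semantic_sim, 0.5, 0.9)
print(pairs)
```

In this example all three phrases share a topic according to co-citation, but only the first two are also semantically close, so only that pair is kept. No weighting between the two similarities is needed.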

In the embodiments of the present application, to handle the synonymous and near-synonymous subject words encountered in document topic mining in a subdivided field, near-synonym mining that combines citation information and semantic information is provided. A phrase similarity network based on co-citation information is constructed, the phrase communities are clustered according to the phrase co-cited similarity corresponding to the noun phrases, and the phrase communities are clustered according to the semantic similarity corresponding to the noun phrases. Community detection is applied to a phrase similarity network for the first time to recall a set of possible near-synonyms, which greatly narrows the candidate range of near-synonyms.

If two noun phrases are clustered together in both clusterings, they are determined to be near-synonyms and can be merged to obtain the near-synonym set. The semantic similarity and the co-cited similarity are used for clustering separately, instead of the conventional strategy of a weighted sum of different similarities, which avoids the influence of similarity weights on the result, so that the resulting near-synonyms have both similar semantics and similar topics. The method does not depend on labeled data or a specific corpus, has good universality, better fits the topic mining scenario, and improves the accuracy of subject word screening.

In one embodiment, the step of screening out the target noun phrase with the highest word frequency from the near-synonym set as the subject word of the document comprises:

screening noun phrases with the highest TF-IDF value from the synonym set according to a preset TF-IDF algorithm to serve as target noun phrases;

and taking the target noun phrase as a subject word of the document.

TF-IDF (term frequency-inverse document frequency) is a common weighting method. In a given document, the term frequency (TF) is the number of times a given term appears in the document; this count is usually normalized (the numerator is generally smaller than the denominator, unlike in IDF) to prevent a bias toward long documents, since the same term is likely to have a higher raw count in a long document than in a short one regardless of whether the term is important. The inverse document frequency (IDF) measures the general importance of a term: the IDF of a particular term can be obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of the quotient. A high term frequency within a particular document, combined with a low document frequency of the term across the whole document collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and keep important ones.

Specifically, for the obtained similar meaning word set, the TF-IDF value is calculated, and the phrase with the highest TF-IDF value is selected as the standard subject word.
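The selection step can be sketched as follows. This is a minimal illustration, assuming each document has already been reduced to a list of extracted noun phrases. The function names `tf_idf_scores` and `pick_subject_word`, the toy corpus, and the choice to sum a phrase's TF-IDF over all documents are assumptions made for the example and are not specified in the application.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return {phrase: TF-IDF summed over all documents}.

    docs: list of documents, each a list of noun phrases (assumed input shape).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each phrase.
    df = Counter(p for doc in docs for p in set(doc))
    scores = Counter()
    for doc in docs:
        counts = Counter(doc)
        for phrase, count in counts.items():
            tf = count / len(doc)                # normalized term frequency
            idf = math.log(n_docs / df[phrase])  # log(total docs / docs with phrase)
            scores[phrase] += tf * idf
    return dict(scores)


def pick_subject_word(near_synonyms, scores):
    """Pick the phrase in the near-synonym set with the highest TF-IDF value."""
    return max(near_synonyms, key=lambda p: scores.get(p, 0.0))


# Toy corpus (made-up data).
docs = [
    ["non-small cell lung cancer", "egfr mutation", "non-small cell lung cancer"],
    ["non-small cell lung cancer", "non-small cell lung carcinoma"],
    ["immunotherapy", "egfr mutation"],
    ["immunotherapy", "pd-1 inhibitor"],
]
scores = tf_idf_scores(docs)
near_synonyms = {"non-small cell lung cancer", "non-small cell lung carcinoma"}
subject_word = pick_subject_word(near_synonyms, scores)
```

In this toy corpus the more frequent variant receives the higher summed score and is therefore chosen as the standard subject word for the set.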

Further, if the importance of each document can be scored, a weighted TF-IDF value can be used. For example, the number of times a document has been cited can serve as an indicator of its importance; this count is normalized to the range 0-1 and taken as the importance of the document. For each phrase, the importance is the mean of the importance of all documents in which the phrase appears, and this mean multiplied by the TF-IDF of the phrase gives the final TF-IDF value of the phrase.
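The weighting can be illustrated with a short sketch. The text only states that the cited amount is normalized to 0-1, so the min-max normalization used here is an assumption, as are the function names and the sample values.

```python
def normalize_citations(cited_counts):
    """Min-max normalize citation counts to the 0-1 range (assumed scheme)."""
    lo, hi = min(cited_counts.values()), max(cited_counts.values())
    if hi == lo:
        # All documents equally cited: treat them as equally important.
        return {doc_id: 1.0 for doc_id in cited_counts}
    return {doc_id: (c - lo) / (hi - lo) for doc_id, c in cited_counts.items()}


def weighted_tf_idf(tf_idf, phrase_docs, importance):
    """Multiply each phrase's TF-IDF by the mean importance of its documents."""
    weighted = {}
    for phrase, score in tf_idf.items():
        doc_ids = phrase_docs[phrase]
        mean_importance = sum(importance[d] for d in doc_ids) / len(doc_ids)
        weighted[phrase] = mean_importance * score
    return weighted


# Made-up citation counts and TF-IDF values for illustration.
cited_counts = {"d1": 0, "d2": 50, "d3": 100}
importance = normalize_citations(cited_counts)
tf_idf = {"phrase a": 0.8, "phrase b": 0.6}
phrase_docs = {"phrase a": ["d1", "d2"], "phrase b": ["d3"]}
weighted = weighted_tf_idf(tf_idf, phrase_docs, importance)
```

Here "phrase a" has the higher plain TF-IDF but appears only in rarely cited documents, so after weighting "phrase b" ranks first.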

In the embodiment of the application, community detection is performed on the obtained phrase co-referenced similarity network in a preset community detection mode to obtain a series of communities, each community is hierarchically clustered, and the phrase with the maximum TF-IDF value in each obtained near-meaning word set is selected as the standard subject word. Near-meaning words are thus identified by combining citation information with semantic information: a phrase similarity network is constructed from the co-referenced information, and community detection is applied to the phrase similarity network for the first time. The near-meaning words obtained have both similar semantics and similar subjects, which better fits the scenario of subject mining. The approach does not rely on annotated data or a specific corpus, has good universality, and improves the accuracy of subject word screening.

It should be noted that, in the document subject word aggregation method described in each of the above embodiments, the technical features included in different embodiments can be recombined as required to obtain combined embodiments, all of which fall within the protection scope claimed in the present application.

Referring to fig. 6, fig. 6 is a schematic block diagram of a document theme word aggregation apparatus according to an embodiment of the present application. Corresponding to the document theme word aggregation method, the embodiment of the application also provides a document theme word aggregation apparatus. As shown in fig. 6, the document theme word aggregation apparatus includes units for performing the above-described document theme word aggregation method and may be configured in a computer device. Specifically, referring to fig. 6, the document theme word aggregation apparatus 600 includes an obtaining unit 601, an extracting unit 602, a clustering unit 603, and a screening unit 604.

The acquiring unit 601 is configured to acquire document data, where the document data includes a document title, a document abstract, and citation information corresponding to each document;

an extracting unit 602, configured to extract noun phrases contained in the document titles and the document summaries by using a preset natural language processing tool;

a clustering unit 603, configured to cluster the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set;

a screening unit 604, configured to screen out a target noun phrase with the highest word frequency from the near-sense word set as a subject word of the document.

In one embodiment, the clustering unit 603 includes:

the establishing subunit is used for establishing semantic similarity based on the noun phrases according to the noun phrases;

the first construction subunit is used for constructing a document common quoted network corresponding to the documents based on the quoted information;

the first calculating subunit is used for calculating the literature co-quoted similarity corresponding to the literature according to the literature co-quoted network;

the second construction subunit is used for constructing a phrase co-introduced similarity network corresponding to the noun phrases according to the document co-introduced similarity;

a first obtaining subunit, configured to obtain a phrase co-referenced similarity corresponding to the noun phrase according to the phrase co-referenced similarity network;

and the first clustering subunit is used for clustering the noun phrases according to the phrase co-referenced similarity and the semantic similarity so as to obtain a near-meaning word set.
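The chain from citation information to phrase-level similarity handled by these subunits can be sketched as follows. This section does not give the formulas, so the sketch makes two loudly labeled assumptions: document co-referenced (co-citation) similarity is taken as the number of citing papers that cite both documents, and phrase similarity is obtained by adding that count to every pair of distinct phrases drawn from the two documents. All names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def document_cocitation(citations):
    """Count, for each pair of documents, how many citing papers cite both.

    citations: {citing_id: set of cited document ids} (assumed input shape).
    """
    counts = defaultdict(int)
    for cited in citations.values():
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return dict(counts)


def phrase_cocitation(doc_sim, doc_phrases):
    """Aggregate document-level co-citation onto phrase pairs (assumed rule).

    Each co-cited document pair adds its count to every pair of distinct
    phrases where one phrase comes from each of the two documents.
    """
    sim = defaultdict(float)
    for (a, b), weight in doc_sim.items():
        for p in doc_phrases.get(a, ()):
            for q in doc_phrases.get(b, ()):
                if p != q:
                    sim[tuple(sorted((p, q)))] += weight
    return dict(sim)


# Made-up citation data: c1 and c2 are citing papers, d1..d3 are cited documents.
citations = {"c1": {"d1", "d2"}, "c2": {"d1", "d2", "d3"}}
doc_phrases = {
    "d1": {"nsclc"},
    "d2": {"non-small cell lung cancer"},
    "d3": {"egfr mutation"},
}
doc_sim = document_cocitation(citations)
phrase_sim = phrase_cocitation(doc_sim, doc_phrases)
```

The resulting dictionary can be read as a weighted edge list of the phrase network: keys are phrase pairs and values are edge weights.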

In one embodiment, the establishing subunit comprises:

the input subunit is used for inputting the noun phrases into a preset Biobert model so as to obtain semantic vectors corresponding to the noun phrases;

and the second calculating subunit is used for calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.
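The cosine similarity computed by the second calculating subunit can be illustrated with plain Python. The three-dimensional vectors below are made-up stand-ins for the semantic vectors that the preset model would output; real embeddings have many more dimensions, and no model is invoked in this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy stand-ins for semantic vectors (hypothetical values).
vectors = {
    "non-small cell lung cancer": [0.9, 0.1, 0.3],
    "non-small cell lung carcinoma": [0.8, 0.2, 0.3],
    "egfr mutation": [0.1, 0.9, 0.2],
}
sim_close = cosine_similarity(
    vectors["non-small cell lung cancer"],
    vectors["non-small cell lung carcinoma"],
)
sim_far = cosine_similarity(
    vectors["non-small cell lung cancer"],
    vectors["egfr mutation"],
)
```

With these toy values, the two lung cancer variants score much closer to 1 than the unrelated phrase pair does.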

In one embodiment, the document theme word aggregation apparatus 600 further includes:

the detection unit is used for carrying out community detection in a preset community detection mode on the basis of the phrase co-introduced similarity network so as to obtain a plurality of phrase communities;

the first clustering subunit includes:

the second clustering subunit is used for clustering the phrase community according to the phrase co-referenced similarity corresponding to the noun phrases so as to obtain a first cluster;

the third clustering subunit is configured to cluster the phrase community according to the semantic similarity corresponding to the noun phrases to obtain a second cluster;

a judging subunit, configured to judge whether every two noun phrases are both contained in the first cluster and in the second cluster;

a determining subunit, configured to determine that two noun phrases are near-meaning words if the two noun phrases are both contained in the first cluster and in the second cluster, so as to obtain near-meaning word phrases;

and the combination subunit is used for combining all the similar meaning word phrases into a set so as to obtain a similar meaning word set.
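The judging, determining and combination subunits can be sketched together as follows. Two phrases count as near-meaning words only if they share a cluster in both clusterings. Merging the agreed pairs transitively with a union-find structure is an assumption about how "combining into a set" is carried out, and the function name and sample clusters are hypothetical.

```python
from itertools import combinations


def near_synonym_sets(first_clusters, second_clusters):
    """Merge phrases that share a cluster in BOTH clusterings.

    first_clusters / second_clusters: lists of sets of phrases, e.g. one from
    co-referenced similarity and one from semantic similarity.
    """
    def pairs(clusters):
        return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

    # Pairs that are clustered together under both similarity measures.
    agreed = pairs(first_clusters) & pairs(second_clusters)

    # Union-find to combine agreed pairs into sets (assumed merge rule).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair in agreed:
        a, b = tuple(pair)
        parent[find(a)] = find(b)

    groups = {}
    for phrase in list(parent):
        groups.setdefault(find(phrase), set()).add(phrase)
    return list(groups.values())


# Made-up clusters for illustration.
first = [
    {"nsclc", "non-small cell lung cancer", "egfr mutation"},
    {"pd-1 inhibitor", "pd-1 blockade"},
]
second = [
    {"nsclc", "non-small cell lung cancer"},
    {"egfr mutation", "pd-1 inhibitor", "pd-1 blockade"},
]
result = near_synonym_sets(first, second)
```

In this example "egfr mutation" is grouped with different phrases by the two clusterings, so it does not enter any near-meaning word set.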

In one embodiment, the screening unit 604 includes:

the screening subunit is used for screening out noun phrases with the highest TF-IDF value from the synonym set according to a preset TF-IDF algorithm to serve as target noun phrases;

and the second acquisition subunit is used for taking the target noun phrase as a subject word of the document.

It should be noted that, as can be clearly understood by those skilled in the art, for the detailed implementation process of the above-mentioned document theme word aggregation device and each of its units, reference may be made to the corresponding description in the foregoing method embodiment; for convenience and brevity of description, no further description is provided herein.

Meanwhile, the division and connection of the units in the document theme word aggregation device are only used as examples, in other embodiments, the document theme word aggregation device may be divided into different units as required, or the units in the document theme word aggregation device may be connected in different orders and manners to complete all or part of the functions of the document theme word aggregation device.

The above-mentioned document subject word aggregating apparatus may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 7.

Referring to fig. 7, fig. 7 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 700 may be a computer device such as a desktop computer or a server, or may be a component or part of another device.

Referring to fig. 7, the computer device 700 includes a processor 702, memory, and a network interface 705 coupled via a system bus 701, where the memory may include a non-volatile storage medium 703 and an internal memory 704.

The non-volatile storage medium 703 may store an operating system 7031 and a computer program 7032. The computer program 7032, when executed, causes the processor 702 to perform a method for aggregation of document topics as described above.

The processor 702 is configured to provide computing and control capabilities to support the operation of the overall computer device 700.

The internal memory 704 provides an environment for running a computer program 7032 on the non-volatile storage medium 703, and the computer program 7032, when executed by the processor 702, causes the processor 702 to perform a method for aggregating topic words of documents as described above.

The network interface 705 is used for network communication with other devices. Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing device 700 to which the disclosed aspects apply, as a particular computing device 700 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 7, and are not described herein again.

Wherein the processor 702 is configured to run a computer program 7032 stored in the memory to perform the steps of: acquiring literature data, wherein the literature data comprises a literature title, a literature abstract and citation information corresponding to each literature; extracting noun phrases contained in the document titles and the document abstracts by adopting a preset natural language processing tool; clustering the noun phrases based on the quotation information and the noun phrases to obtain a near-meaning word set; and screening out target noun phrases with the highest word frequency from the synonym set as subject words of the documents.

In an embodiment, when the processor 702 implements the step of clustering the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set, the following steps are implemented:

establishing semantic similarity based on the noun phrases according to the noun phrases;

constructing a document common quoted network corresponding to the documents based on the quoted information;

according to the literature co-quoted network, calculating the literature co-quoted similarity corresponding to the literature;

constructing a phrase co-referenced similarity network corresponding to the noun phrases according to the document co-referenced similarity;

obtaining phrase co-quoted similarity corresponding to the noun phrase according to the phrase co-quoted similarity network;

and clustering the noun phrases according to the phrase co-referenced similarity and the semantic similarity to obtain a near meaning word set.

In one embodiment, when the processor 702 implements the step of establishing semantic similarity based on the noun phrases according to the noun phrases, the processor implements the following steps:

inputting the noun phrases into a preset Biobert model to obtain semantic vectors corresponding to the noun phrases;

and calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.

In an embodiment, before the step of clustering the noun phrases according to the co-referenced similarity of the phrases and the semantic similarity to obtain a near word set, the processor 702 further performs the following steps:

based on the phrase co-introduced similarity network, carrying out community detection in a preset community detection mode to obtain a plurality of phrase communities;

when the processor 702 implements the step of clustering the noun phrases according to the common-quoted similarity of the phrases and the semantic similarity to obtain a near-synonym set, the following steps are specifically implemented:

clustering the phrase community according to the phrase co-referenced similarity corresponding to the noun phrases to obtain a first cluster;

clustering the phrase community according to the semantic similarity corresponding to the noun phrases to obtain a second cluster;

judging whether every two noun phrases are both contained in the first cluster and in the second cluster;

if two noun phrases are both contained in the first cluster and in the second cluster, judging the two noun phrases to be near-meaning words, so as to obtain near-meaning word phrases;

and combining all the similar meaning word phrases into a set to obtain a similar meaning word set.
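The clustering of one phrase community can be sketched with a small agglomerative procedure. Average linkage and the fixed similarity threshold are assumptions made for the sketch and are not stated in this section. The same function would be run once with the phrase co-referenced similarity (giving the first cluster) and once with the semantic similarity (giving the second cluster). All names and values are hypothetical.

```python
def agglomerative_clusters(items, similarity, threshold):
    """Greedy average-linkage agglomerative clustering.

    Repeatedly merges the two clusters with the highest average pairwise
    similarity until no pair of clusters reaches `threshold`.
    similarity: function (a, b) -> float, assumed symmetric.
    """
    clusters = [[item] for item in items]

    def avg_sim(c1, c2):
        return sum(similarity(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = avg_sim(clusters[i], clusters[j])
                if s >= best:
                    best, best_pair = s, (i, j)
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [set(c) for c in clusters]


# Made-up pairwise similarities inside one phrase community.
sims = {
    frozenset(("nsclc", "non-small cell lung cancer")): 0.9,
    frozenset(("nsclc", "egfr mutation")): 0.2,
    frozenset(("non-small cell lung cancer", "egfr mutation")): 0.3,
}
community = ["nsclc", "non-small cell lung cancer", "egfr mutation"]
clusters = agglomerative_clusters(
    community, lambda a, b: sims[frozenset((a, b))], threshold=0.5
)
```

The quadratic search over cluster pairs is adequate for small communities; a library implementation of hierarchical clustering would be preferable for large ones.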

In one embodiment, when the processor 702 performs the step of selecting the target noun phrase with the highest word frequency from the synonym set as the subject word of the document, the following steps are specifically performed:

screening noun phrases with the highest TF-IDF value from the synonym set according to a preset TF-IDF algorithm to serve as target noun phrases;

and taking the target noun phrase as a subject word of the document.

It should be understood that, in the embodiment of the present application, the processor 702 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

It will be understood by those skilled in the art that all or part of the processes in the method for implementing the above embodiments may be implemented by a computer program, and the computer program may be stored in a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.

Accordingly, the present application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. It stores a computer program which, when executed by a processor, causes the processor to perform the steps of the document subject word aggregation method described in the embodiments above.

The present application also provides a computer program product which, when run on a computer, causes the computer to perform the steps of the document subject word aggregation method described in the embodiments above.

The computer readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the apparatus, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the apparatus. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the apparatus.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The storage medium is a physical, non-transitory storage medium, and may be any of various physical storage media capable of storing computer programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functions in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.

The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a terminal, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for aggregation of subject matter words in a document, the method comprising:

acquiring literature data, wherein the literature data comprises a literature title, a literature abstract and citation information corresponding to each literature;

extracting noun phrases contained in the document titles and the document abstracts by adopting a preset natural language processing tool;

clustering the noun phrases based on the citation information and the noun phrases to obtain a near-meaning word set, comprising: establishing semantic similarity based on the noun phrases according to the noun phrases; constructing a document common quoted network corresponding to the documents based on the quoted information; according to the literature co-quoted network, calculating the literature co-quoted similarity corresponding to the literature; constructing a phrase co-referenced similarity network corresponding to the noun phrases according to the document co-referenced similarity; obtaining phrase co-quoted similarity corresponding to the noun phrase according to the phrase co-quoted similarity network; clustering the noun phrases according to the phrase co-referenced similarity and the semantic similarity to obtain a near meaning word set;

and screening out the target noun phrase with the highest word frequency from the near-sense word set as the subject word of the document.

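2. The method of claim 1, wherein the step of establishing semantic similarity based on the noun phrases according to the noun phrases comprises:

inputting the noun phrases into a preset Biobert model to obtain semantic vectors corresponding to the noun phrases;

and calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.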

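3. The method for aggregating document subject words according to claim 1 or 2, wherein before the step of clustering the noun phrases according to the phrase co-referenced similarity and the semantic similarity to obtain a near word set, the method further comprises:

based on the phrase co-referenced similarity network, carrying out community detection in a preset community detection mode to obtain a plurality of phrase communities;

the step of clustering the noun phrases according to the phrase co-referenced similarity and the semantic similarity to obtain a near-meaning word set comprises:

clustering the phrase community according to the phrase co-referenced similarity corresponding to the noun phrases to obtain a first cluster;

clustering the phrase community according to the semantic similarity corresponding to the noun phrases to obtain a second cluster;

judging whether every two noun phrases are both contained in the first cluster and in the second cluster;

if two noun phrases are both contained in the first cluster and in the second cluster, judging the two noun phrases to be near-meaning words, so as to obtain near-meaning word phrases;

and combining all the near-meaning word phrases into a set to obtain a near-meaning word set.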

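4. The method of claim 1, wherein the step of selecting the target noun phrase with the highest word frequency from the set of near-sense words as the subject word of the document comprises:

screening noun phrases with the highest TF-IDF value from the set of near-sense words according to a preset TF-IDF algorithm to serve as target noun phrases;

and taking the target noun phrase as a subject word of the document.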

5. A document theme word aggregation apparatus, comprising:

the acquisition unit is used for acquiring document data, wherein the document data comprises a document title, a document abstract and citation information corresponding to each document;

the extraction unit is used for extracting the noun phrases contained in the document titles and the document abstracts by adopting a preset natural language processing tool;

a clustering unit, configured to cluster the noun phrases based on the citation information and the noun phrases to obtain a near-synonym set, where the clustering unit includes: the establishing subunit is used for establishing semantic similarity based on the noun phrases according to the noun phrases; the first construction subunit is used for constructing a document common quoted network corresponding to the documents based on the quoted information; the first calculating subunit is used for calculating the literature co-quoted similarity corresponding to the literature according to the literature co-quoted network; the second construction subunit is used for constructing a phrase co-introduced similarity network corresponding to the noun phrases according to the document co-introduced similarity; the acquisition subunit is used for acquiring the phrase co-referenced similarity corresponding to the noun phrase according to the phrase co-referenced similarity network; the clustering subunit is used for clustering the noun phrases according to the phrase co-introduced similarity and the semantic similarity to obtain a near meaning word set;

and the screening unit is used for screening out the target noun phrase with the highest word frequency from the synonym set as the subject term of the document.

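6. The document theme word aggregation apparatus of claim 5, wherein the establishing subunit comprises:

the input subunit is used for inputting the noun phrases into a preset Biobert model so as to obtain semantic vectors corresponding to the noun phrases;

and the calculating subunit is used for calculating cosine similarity between the semantic vectors to obtain semantic similarity corresponding to the noun phrases.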

7. A computer device, comprising a memory and a processor coupled to the memory; the memory is used for storing a computer program; the processor is adapted to run the computer program to perform the steps of the method according to any of claims 1-4.

8. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when being executed by a processor, realizes the steps of the method according to any one of claims 1 to 4.