CN112765305A - Method and device for analyzing interest topic of author, electronic equipment and storage medium

Method and device for analyzing interest topic of author, electronic equipment and storage medium

Info

Publication number
CN112765305A
Authority
CN
China
Prior art keywords
author
document
authors
word
topic
Prior art date
Legal status
Granted
Application number
CN202011625275.6A
Other languages
Chinese (zh)
Other versions
CN112765305B (en)
Inventor
徐硕
李玲
翟东升
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202011625275.6A priority Critical patent/CN112765305B/en
Priority claimed from CN202011625275.6A external-priority patent/CN112765305B/en
Publication of CN112765305A publication Critical patent/CN112765305A/en
Application granted granted Critical
Publication of CN112765305B publication Critical patent/CN112765305B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a method and a device for analyzing an author's interest topics, an electronic device and a storage medium, and relate to the technical field of information analysis. The method includes the following steps: obtaining at least one document in a target field, and determining the contribution weight of each author in the document, the topic expressed by each word in the document, and the words each author is responsible for in the document; obtaining the topics expressed by each author in the document from the topic expressed by each word, the words each author is responsible for, and the contribution weight of each author; and determining the author's interest topics from the topics expressed in the content the author is responsible for across the relevant documents. With the embodiments of the present application, the interest topics of authors can be discovered under the premise that the co-authors of a multi-author article contribute unequally, so that the interest topics of researchers are reflected reasonably, research hotspots and trends in a subject field can be explored, and personalized academic research can be promoted.

Description

Method and device for analyzing interest topic of author, electronic equipment and storage medium
Technical Field
The present application relates to the field of information analysis technologies, and in particular, to a method and an apparatus for analyzing an interest topic of an author, an electronic device, and a storage medium.
Background
Nowadays, scientific and technical literature, as the main carrier of academic achievements, gathers a great deal of human intelligence and serves as a window for spreading knowledge and conducting academic communication. Scientific and technical literature resources contain a large amount of characteristic information, such as potential semantic relations between words, relations between document topics and authors (the research interests of the authors), and the rise, maturity and decline of research hotspots.
In the area of mining the research interests of researchers, Rosen-Zvi et al. introduced an author latent variable into the LDA (Latent Dirichlet Allocation) model, replaced the document-topic distribution of LDA with an author-topic distribution, and proposed the AT (Author-Topic) model. The model can mine the relation between authors and topics, that is, the research interests of researchers.
However, when the AT model and other similar models model author interests, every author of a document is assumed to contribute equally. This is inconsistent with reality, so the interest topics of the authors cannot be analyzed accurately.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, an electronic device and a storage medium for analyzing an author's interest topics, which overcome the above problems or at least partially solve them.
In a first aspect, a method for analyzing an author's interest topics is provided, the method comprising:
acquiring at least one document in a target field, and determining the contribution weight of each author in the document, the contribution weight being the normalized contribution value of the author;
for each document, determining the topic expressed by each word in the document and the words each author is responsible for in the document, and obtaining the topics expressed by each author in the document from the topic expressed by each word, the words each author is responsible for, and the contribution weight of each author;
for each author, determining the relevant documents the author is responsible for from the at least one document, and obtaining the topics the author expresses in the content he or she is responsible for in those documents, so as to determine the author's interest topics.
Further, determining the contribution weight of each author in the document comprises:
acquiring the authors of the document and the contribution value of each author;
determining an initial weight for each author according to the number of authors in the document and the contribution value of each author;
and normalizing the initial weights of the authors in the document to obtain the final weight of each author in the document.
Further, determining an initial weight for each author according to the number of authors in the document and the contribution value of each author comprises:
if the number of authors in the document does not exceed a preset number, sorting the authors of the document in descending order of contribution value to obtain the rank of each author;
and calculating the initial weight of each author from the author's rank with a preset weight algorithm.
Further, determining an initial weight for each author according to the number of authors in the document and the contribution value of each author further comprises:
if the number of authors in the document exceeds the preset number, sorting the authors of the document in descending order of contribution value to obtain the rank of each author;
for authors whose rank is smaller than or equal to the preset number, calculating their initial weights from their ranks with the preset weight algorithm;
for authors whose rank is larger than the preset number, taking a preset multiple of the initial weight of the first author as their initial weight;
the first author being the author whose rank in the document is first.
Further, determining the topic expressed by each word in the document comprises:
assigning topics to all words in the document a preset number of times; after each round of assignment, for any word in the document, calculating the probability that the word will be assigned to the target topic, and to each topic other than the target topic, in the next round, according to the number of times the word occurs in the document, the number of words assigned to the target topic in the document after the current round, and the number of occurrences of the word that are assigned to the target topic;
assigning topics to the word in the next round according to these probabilities, until the number of assignment rounds reaches a preset threshold;
obtaining the topic assigned to the word when the number of assignment rounds reaches the preset threshold;
wherein the target topic is the topic assigned to the word when it first appears in the assignment.
Further, determining the words each author is responsible for in the document comprises:
assigning authors to all words in the document a preset number of times; after each round of assignment, for any word in the document, calculating the probability that the word will be assigned to the target author, and to each author other than the target author, in the next round, according to the number of times the word occurs in the document, the number of words assigned to the target author in the document after the current round, and the number of occurrences of the word that are assigned to the target author;
assigning authors to the word in the next round according to these probabilities, until the number of assignment rounds reaches the preset threshold;
obtaining the author assigned to the word when the number of assignment rounds reaches the preset threshold;
wherein the target author is the author assigned to the word when it first appears in the assignment.
Further, obtaining the topics expressed by each author in the document from the topic expressed by each word in the document, the words each author is responsible for and the contribution weight of each author comprises:
for any author in the document, sampling, according to the author's final weight, the topics expressed by words in the document and the words the author is responsible for;
taking the words the author is responsible for as target words, and determining the topics the author expresses in the document from the topics expressed by the target words.
Further, determining the relevant documents the author is responsible for from the at least one document and obtaining the topics the author expresses in the content he or she is responsible for, so as to determine the author's interest topics, comprises:
acquiring the relevant documents the author is responsible for, and determining the topics the author expresses in the content he or she is responsible for in those documents;
determining the author's interest topics among the topics expressed by the author;
calculating the probability of occurrence of each topic according to the number of times it occurs in the relevant documents the author is responsible for, and taking the topics whose probability exceeds a preset probability value as the author's interest topics.
In a second aspect, an apparatus for analyzing an author's interest topics is provided, comprising:
a first acquisition module, configured to acquire at least one document in a target field and determine the contribution weight of each author in the document, the contribution weight being the normalized contribution value of the author;
a determining module, configured to determine, for each document, the topic expressed by each word in the document and the words each author is responsible for, and to obtain the topics expressed by each author in the document from the topic expressed by each word, the words each author is responsible for, and the contribution weight of each author;
and a second acquisition module, configured to determine, for each author, the relevant documents the author is responsible for from the at least one document, and to acquire the topics the author expresses in the content he or she is responsible for in those documents, so as to determine the author's interest topics.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method provided in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method as provided in the first aspect.
According to the method, device, electronic device and storage medium for analyzing an author's interest topics, at least one document in a target field is acquired, the contribution weight of each author in the document is determined, and the topic expressed by each word in the document and the words each author is responsible for are determined; the topics expressed by each author in the document are obtained from the topic expressed by each word, the words each author is responsible for, and the contribution weight of each author; and the author's interest topics are determined from the topics expressed in the content the author is responsible for across the relevant documents. With the embodiments of the present application, the interest topics of authors can be discovered under the premise that the co-authors of a multi-author article contribute unequally, so that the interest topics of researchers are reflected reasonably, research hotspots and trends in a subject field can be explored, and personalized academic research can be promoted.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic illustration of a document distribution with different author numbers provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for analyzing an author's interest topic according to an embodiment of the present application;
FIG. 3 is a diagram illustrating word distribution in a document provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the initial topic assignment of words provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of topic assignment of a word after an iterative process is completed according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the initial author assignment of words provided by an embodiment of the present application;
FIG. 7 is a diagram illustrating author assignment of words after iterative processing is completed according to an embodiment of the present application;
FIG. 8 is a diagram of an author interest disclosure model provided in accordance with an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an apparatus for analyzing a topic of interest of an author provided in an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
The application provides a method and a device for analyzing an interest topic of an author, an electronic device and a storage medium, and aims to solve the above technical problems in the prior art.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
First, the present application can be applied in various scientific research literature exploration scenarios, such as data mining, machine learning, information analysis, policy making, personalized academic recommendation and scientific evaluation of authors. Scientific and technical literature, as the main carrier of academic achievements, gathers a great deal of human intelligence and is a window for spreading knowledge and conducting academic communication. Price's law of exponential growth of scientific literature and the logistic growth model show that the amount of scientific literature is growing exponentially, which poses great challenges to the detection and tracking of scientific knowledge and topics. Scientific and technical literature resources contain a large amount of implicit information, such as potential semantic relations between words and relations between document topics and authors (the research interests of the authors), and can reflect research hotspots and trends in the current subject field to a certain extent. Research shows that automatically revealing document topics and mining authors' research interests can provide good support for researchers, academic exchange platforms and scientific research management institutions.
In the area of mining the research interests of science and technology workers, Rosen-Zvi et al. introduced an author latent variable into the LDA (Latent Dirichlet Allocation) model, replaced the document-topic distribution of LDA with an author-topic distribution, and proposed the AT (Author-Topic) model. The model can mine the relation between authors and topics, that is, the research interests of researchers. Science and technology have developed rapidly in the information age, and the form of scientific research has gradually changed from individual research to multi-party, collaborative, team-based research, which is reflected in the continuously rising number of authors of the scientific papers that describe research achievements. It is well known that for most scientific achievements the contribution of each author is different. However, the AT model and other similar models implicitly assume equal contributions when modeling author interests.
Fig. 1 is a schematic diagram of the distribution of documents with different numbers of authors provided in an embodiment of the present application. As shown in fig. 1, most documents are completed by 2-5 authors, which shows how common multi-author academic documents are. When multiple authors are jointly responsible for one document, the contribution weight of each author in the document must be determined in order to analyze each author's interest topics more clearly.
In the present application, a contribution weight re-allocation mechanism is introduced into the process of revealing author interests. Under the premise that each author contributes unequally to a multi-author article, the interest topics of each author are discovered, providing more scientific decision support for personalized academic recommendation systems, recruitment and promotion of scholars, scientific research rewards and fund allocation. Specifically, the invention proposes an author interest revealing model that introduces a contribution weight allocation mechanism on the basis of the AT model, named the ATcredit model; the idea is also applicable to other similar interest-revealing topic models.
It should be understood that the method for analyzing an author's interest topics provided in the present application can be applied to any computer or system having such an analysis function, for example to analyzing the interest topics of authors in the field of bioscience.
In order to solve the above problems, an embodiment of the present application provides a method for analyzing an author's interest topics. The method is described in detail below through specific embodiments and application scenarios with reference to the drawings. Fig. 2 is a schematic flow chart of the method for analyzing an author's interest topics provided by this embodiment; as shown in fig. 2, the method includes:
s201, obtaining at least one document of the target field, and determining the contribution weight of each author in the document.
In this embodiment, the target documents and their related information are acquired by a computer, or officially published documents are retrieved through the Internet, and the documents in the target field, the author list of each document and the contribution value of each author are selected. For example, the SynBio data set provided by the organizers of the 2018-2019 emerging-technology prediction competition can be used: statistics show that the 2580 academic papers in this data set are research documents related to the biological field, and the author lists of all documents and the contribution weights of the authors in each document are then collected.
S202, for each document, determining the topic expressed by each word in the document and the words each author is responsible for; and obtaining the topics expressed by each author in the document from the topic expressed by each word, the words each author is responsible for, and the contribution weight of each author.
After the documents in the target field are obtained, they need to be preprocessed: stop words are filtered out and the remaining words are kept, the authors of each document are determined, and the contribution weight of each author is obtained by analysis and calculation from the author information. The topic expressed by each word in a document and the words each author is responsible for are then analyzed with a preset author-topic model: the words assigned to each author and the topics of those words are sampled according to the authors' weights, and the topics expressed by each author in the document are determined. For example, suppose a document is jointly written by two authors, with Zhang San mainly responsible (weight 80%) and Li Si secondary (weight 20%). When analyzing the topic of a word and the author of that word, sampling is performed in proportion to these weights: words associated with Zhang San, and the topics associated with those words, are sampled with higher probability, and the topics Zhang San expresses in the document are determined through this high-probability sampling. Words associated with Li Si, and the topics associated with them, are sampled with relatively low probability, but the topics Li Si expresses in the document can still be determined.
Specifically, the target documents are segmented into words and preset words are filtered out to obtain the processed text information; the related information of the target documents is processed to obtain the author name list and the weight of each author in each target document.
In this embodiment, after the target documents are collected they need to be preprocessed, that is, redundant information and stop words are removed and sentences are segmented; after preprocessing, the cleaned text information is obtained. For example, after a target document is collected, its sentences are first segmented into words, and then stop words, numbers and words whose frequency is lower than a preset frequency are filtered out. Filtering is done by comparison against a pre-built stop-word list to decide which words are stop words; in English, words such as "first", "and" and "but" are usually stop words and are removed, but sometimes "and" is not a stop word and has to be judged through more involved analysis, for example from the surrounding context. After the documents are preprocessed, the text information is formed; it contains the remaining words, and a dictionary (vocabulary) can also be built from it, as sketched below.
Processing the related information of the target documents means disambiguating the author names in each document in the target field, that is, distinguishing whether authors with the same name in different documents are the same person, re-determining the number of signed authors of each target document, and calculating the weight of each author from the number of authors and the contribution values. For example, given two documents in which a male author named Zhang San signed the first document and a female author also named Zhang San signed the second, it must be determined whether the two authors named Zhang San are the same person, and they are disambiguated accordingly.
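The preprocessing step can be illustrated with a minimal sketch. The function below tokenizes English text, removes stop words and digits, and drops words below a frequency threshold; the function name, the regular expression and the threshold value are illustrative assumptions, not part of the patented method.

```python
import re
from collections import Counter

def preprocess(documents, stop_words, min_freq=2):
    """Tokenize each document, drop stop words and rare words, build a vocabulary.

    documents: list of raw text strings; stop_words: set of words to remove;
    words occurring fewer than min_freq times in the whole corpus are dropped.
    """
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    freq = Counter(w for doc in tokenized for w in doc)
    cleaned = [[w for w in doc if w not in stop_words and freq[w] >= min_freq]
               for doc in tokenized]
    vocabulary = sorted({w for doc in cleaned for w in doc})
    return cleaned, vocabulary
```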
S203, for each author, determining related documents responsible for the author from at least one document, and acquiring topics expressed by the content responsible for the related documents by the author to determine interest topics of the author.
In the present application, in order to determine the topics an author is interested in, the other documents related to that author need to be collected, the topics the author expresses in those documents are analyzed, and the expressed topics with the largest probability of occurrence are selected as the author's interest topics, as sketched below. For example, author A expresses topic 1 in a first document and topic 2 in a second document, where topic 1 and topic 2 are two different topics. According to the method, device, electronic device and storage medium for analyzing an author's interest topics, at least one document in a target field is acquired, the contribution weight of each author in the document is determined, the topic expressed by each word and the words each author is responsible for are determined, the topics expressed by each author are obtained from these together with the contribution weights, and the author's interest topics are determined from the topics expressed in the content the author is responsible for across the relevant documents. With the embodiments of the present application, the interest topics of authors can be discovered under the premise that each author contributes unequally to a multi-author article, so that the interest topics of researchers are reflected reasonably, research hotspots and trends in a subject field can be explored, and personalized academic research can be promoted.
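The aggregation of an author's per-document topics into interest topics can be sketched as follows, assuming the topics the author expresses in each responsible document have already been obtained; the function name and the probability threshold are illustrative.

```python
from collections import Counter

def interest_topics(expressed_topics, min_prob=0.2):
    """expressed_topics: list of topic ids, one per piece of content the
    author is responsible for across the relevant documents. Returns the
    topics whose relative frequency of occurrence exceeds min_prob."""
    counts = Counter(expressed_topics)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()
            if count / total > min_prob}

# Example: author A expresses topic 1 in three documents and topic 2 in one
print(interest_topics([1, 1, 1, 2]))   # {1: 0.75, 2: 0.25} with the default threshold
```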
On the basis of the above embodiment, as an alternative embodiment, determining the contribution weight of each author in the document includes:
acquiring authors and contribution values of each author in the literature;
determining an initial weight of each author according to the number of authors in the document and the contribution value of each author;
and normalizing the initial weight of each author in the literature to obtain the final weight of each author in the literature.
After the related information of the target documents is acquired, it needs to be processed; in particular, the authors of the target documents need to be disambiguated, because authors with the same name may exist and it must be distinguished whether they are the same person. The methods adopted in the embodiment of the present application include, but are not limited to, rule-based scoring and clustering; manual disambiguation, automatic disambiguation and other methods can also be used.
For example, rule-based scoring and clustering is used to determine whether authors with the same name in different target documents are the same person: author A and author B have the same name and are signed as authors of different documents, so it must be decided whether they are one person or merely share a name. The rule-based scoring and clustering method judges according to several rules. Rule 1: compare the e-mail addresses of the authors given in the documents; if the addresses are the same, the two records can be taken as the same person. If the names are the same but the addresses differ, judge by the work addresses given in the documents; if the work addresses are the same, the probability that the two records refer to the same person is high. Rule 2: judge by the collaborators the two records frequently work with; several authors often collaborate together, and if the collaborators of the two records overlap, the records may refer to the same person. There is also the situation that some authors like to cite their own previous work; if the references cited by two same-name records are consistent and both cite the same earlier papers, the records may refer to the same person. Based on these rules, the similarity under each rule is scored: for example, when judging by e-mail, identical addresses give high similarity and score 100 points, while different e-mail addresses but the same work unit give lower similarity and score 80 points. The scores of the rules are accumulated for each pair of records, and whether the accumulated score is high enough determines whether the two records are the same person. For instance, all records named XX (the same name) are put into one table, the information of each record is analyzed, cluster analysis is performed according to the rules, and the records clustered together are taken as one person while the same-name records clustered elsewhere are taken as other persons. After disambiguation, the author set is obtained and the number of authors signed on each target document is re-determined, which makes it convenient to determine the authors' weights later.
After the related information of the target documents is obtained and processed in this way, ambiguous authors are resolved, the number of authors signed on each target document is determined, and the contribution weight of each author can be determined, which makes it convenient to analyze the content each author is responsible for and avoids unnecessary trouble caused by identically named authors. A simplified scoring sketch is given below.
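The sketch below follows the rule-based scoring described above; the individual rules, the point values and the decision threshold are illustrative assumptions rather than the exact scoring scheme of the embodiment.

```python
def similarity_score(a, b):
    """Score how likely two same-name author records refer to one person.
    a, b: dicts with optional keys 'email', 'affiliation', 'coauthors' (set)
    and 'references' (set)."""
    score = 0
    if a.get("email") and a.get("email") == b.get("email"):
        score += 100                                 # rule 1: same e-mail address
    elif a.get("affiliation") and a.get("affiliation") == b.get("affiliation"):
        score += 80                                  # rule 1 fallback: same work address
    score += 10 * len(a.get("coauthors", set()) & b.get("coauthors", set()))    # rule 2: shared collaborators
    score += 5 * len(a.get("references", set()) & b.get("references", set()))   # rule 2: shared (self-)citations
    return score

def same_person(a, b, threshold=100):
    """Records whose accumulated score reaches the threshold are clustered together."""
    return similarity_score(a, b) >= threshold
```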
After the number of authors in a target document is determined, the authors need to be sorted by the size of their contributions to the document, and then a preset contribution weight algorithm is used to estimate the contribution weight of each author. The contribution weight algorithms include, but are not limited to, arithmetic counting, geometric counting, harmonic counting, network-based counting, axiomatic counting and golden number counting; one of them is selected for the calculation.
The concrete formulas of the following counting methods are given as images in the original publication and are not reproduced here.
Arithmetic counting: the collaborators in the author list are allotted contribution scores that decrease linearly with rank, so the difference in contribution between two adjacent collaborators is constant. Specifically, for a paper m, A_m is the number of co-authors, c_{m,i} is the contribution of the author ranked i, λ is a free parameter whose value can be set according to the actual situation, and i denotes the author's rank.
Geometric counting: the contribution scores of the collaborators in the author list form a geometric series, so the contribution ratio between two adjacent collaborators is λ (λ ≥ 1).
Harmonic counting: each collaborator in the author list receives a contribution weight that decreases with rank, and the contribution ratio between two adjacent collaborators follows from these weights; the simplified form used later in this description gives the author ranked i an initial weight of 1/i.
Network-based counting: the method consists of two steps. The first step is fractional counting; in the second step, every collaborator after the first author in the author list assigns a portion λ (λ ∈ [0,1]) of its own contribution weight to the preceding author.
Axiomatic counting: the method divides the A_m authors into G_m (G_m ≤ A_m) groups, where g_{m,k} denotes the ordered elements of the k-th group, and uses a parameter λ (λ ∈ [0,1]).
Golden number counting: the method allots the contribution of each collaborator in the author list on the basis of the golden number, again with a parameter λ (λ ∈ [0,1]).
In addition, when there are multiple first authors or corresponding authors with equal contributions, before applying any of the above contribution calculation methods (except axiomatic counting), all such collaborators are treated as first authors, the collaborators are re-ordered, their contribution weights are calculated, and those weights are then averaged.
On the basis of the above embodiment, as an alternative embodiment, determining the initial weight of each author according to the number of authors and the contribution value of each author in the document includes:
if the number of the authors in the document does not exceed the preset number value, performing descending order arrangement on the authors in the document according to the contribution value of each author in the document to obtain an ordering result of each author in the document;
and calculating the initial weight of each author according to a preset weight algorithm according to the sequencing result of each author.
Before the weights are calculated, the authors are sorted by the size of their contributions to the target document, with the author who contributed most ranked first; authors with equal contributions are grouped together and their order within the group is set randomly. After the authors are ranked, the initial weight of each author is calculated from its rank with the contribution weight algorithm; the initial weights of authors with equal contributions are summed and then shared equally among those authors. After the initial weights are calculated, normalization is needed, because the sum of the initial weights produced by the contribution weight algorithm is greater than 1. The final weight of an author is obtained by taking the sum of the initial weights of all authors in the target document as the denominator and the author's own initial weight as the numerator; the resulting value is that author's final contribution weight in the target document.
In the embodiment of the present application, harmonic counting is used for the analysis, where i is the author's rank after sorting by contribution and λ is a free parameter that is usually taken to infinity for convenience, so that the initial weight of the author ranked i is 1/i. The final weight is then obtained by normalization; for example, the final weight of the first author is
1 / (1 + 1/2 + 1/3 + ... + 1/A_m),
where A_m is the number of authors of the document, and the final weights of the other authors in the target document are obtained analogously.
After the number of authors in the target document is determined, the weight of each author is calculated with the contribution weight algorithm according to the authors' contributions, so that the contribution weight of each author in the target document is obtained. This improves the accuracy of the contribution weight allocation and facilitates the later analysis of the authors' interest topics. A minimal sketch of this computation is given below.
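The sketch below implements the harmonic weighting and normalization described above, under the simplification that the free parameter is taken to infinity so the author ranked i starts with weight 1/i; the function name is illustrative.

```python
def harmonic_contribution_weights(num_authors):
    """Initial weight 1/i for the author ranked i, normalized so the
    final weights of all authors in the document sum to 1."""
    initial = [1.0 / i for i in range(1, num_authors + 1)]
    total = sum(initial)
    return [w / total for w in initial]

# Example: a document with three ranked co-authors
# initial weights 1, 1/2, 1/3 -> final weights ~0.545, 0.273, 0.182
print(harmonic_contribution_weights(3))
```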
On the basis of the above embodiment, as an alternative embodiment, determining the initial weight of each author according to the number of authors and the contribution value of each author in the document, further includes:
if the number of the authors in the document exceeds a preset number value, performing descending order arrangement on the authors in the document according to the contribution value of each author in the document to obtain an ordering result of each author in the document;
when the sorting result of the authors is smaller than or equal to the preset quantity value, calculating to obtain the initial weight of the authors of which the sorting result is smaller than or equal to the preset quantity according to a preset weight algorithm according to the sorting result of the authors of which the sorting result is smaller than or equal to the preset quantity value in the sorting results of the authors;
when the sorting result of the author is larger than the preset quantity value, taking the preset multiple of the initial weight of the first author as the initial weight of all authors of which the sorting result is larger than the preset quantity value;
the first author is the author in the document whose ranking result is the first.
In the embodiment of the present application, after the number of authors in a target document is determined, it is checked whether that number exceeds a preset value. Collaborators beyond the preset value are called super collaborators, and for a paper that has super collaborators (i.e. more collaborators than the preset value) the contribution weights need to be re-allocated as follows (the concrete formula is given as an image in the original publication):
c_{m,i} denotes the weight of the author ranked i. Taking harmonic counting as an example and assuming a preset value of 10, a document with more than 10 authors has super collaborators. Suppose the document has 11 signed authors, exceeding the preset value of 10. The contributions of the authors are obtained first and all authors of the target document are ranked; the first-ranked author has initial weight 1, the second author 1/2, the third author 1/3, and so on, but only the weights of the first 10 authors are calculated this way. The contribution weight of the 11th author is a preset multiple of the first author's initial weight; assuming the preset multiple is 0.05, the 11th author's initial weight is 0.05, and every author ranked after the 11th also gets 0.05. In this way the initial weights of all authors in the target document are determined.
In the embodiment of the present application, the contribution weights are analyzed according to the number of authors in the document because, when the number of authors exceeds the preset value, ranking the authors by contribution and computing the weights directly would make the weights of the last-ranked authors very small. So that every author can still be represented, the preset multiple of the first author's contribution weight is used as the contribution weight of the authors beyond the preset value, which lets every author participate in the topic analysis and express the topics he or she wants to express. A short sketch of this re-allocation follows.
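The re-allocation for documents with super collaborators can be sketched as below, following the worked example above (preset number 10, preset multiple 0.05); the function name and defaults are illustrative.

```python
def capped_harmonic_weights(num_authors, preset_count=10, preset_multiple=0.05):
    """Authors ranked within preset_count get the harmonic initial weight 1/i;
    every author ranked beyond it gets preset_multiple times the first
    author's initial weight. Weights are then normalized to sum to 1."""
    initial = [1.0 / i for i in range(1, min(num_authors, preset_count) + 1)]
    if num_authors > preset_count:
        initial += [preset_multiple * initial[0]] * (num_authors - preset_count)
    total = sum(initial)
    return [w / total for w in initial]

# Example: 11 signed authors -> the 11th author's initial weight is 0.05
print(capped_harmonic_weights(11))
```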
In the embodiment of the application, a Gibbs sampling algorithm is adopted to calculate the topic z_{m,n} of the n-th word in the target document m and the author x_{m,n} of the n-th word in the target document m.
The Gibbs sampling step draws the topic and the author of a word occurrence with probability
Pr(z_{m,n} = k, x_{m,n} = a | w, z_{-(m,n)}, x_{-(m,n)}, a_m, c_m, α, β) ∝ c_{m,a} · (n_{k,ω_{m,n}}^{-(m,n)} + β_{ω_{m,n}}) / (Σ_{v=1}^{V} (n_{k,v}^{-(m,n)} + β_v)) · (n_{a,k}^{-(m,n)} + α_k) / (Σ_{k'=1}^{K} (n_{a,k'}^{-(m,n)} + α_{k'})),
where:
Pr denotes the conditional probability being calculated;
w denotes the word vector of the text information, and ω_{m,n} denotes the n-th word in the target document m;
z_{-(m,n)} denotes all topic assignments except the topic assigned to the n-th word in the target document m, and z_{m,n} denotes the topic of the n-th word in the target document m;
x_{-(m,n)} denotes all author assignments except the author assigned to the n-th word in the target document m, and x_{m,n} denotes the author of the n-th word in the target document m;
a_m denotes the author variable (the set of authors) of the target document, and c_m denotes the vector of contribution weights of the authors in the target document; λ is the parameter used when calculating the author weights of the target document;
K denotes the number of topics of the content of the target document, V denotes the number of distinct words in the processed text information, and A_m denotes the number of authors responsible for the target document;
n_{k,v}^{-(m,n)} denotes the number of times word v has been assigned topic z_{m,n} = k, where the superscript -(m,n) means the current assignment is not counted; β is the Dirichlet prior parameter vector, and β_v is the topic parameter of word v;
the denominator Σ_v (n_{k,v}^{-(m,n)} + β_v) sums, over all words v, the number of times they are assigned topic z_{m,n} together with their topic parameters;
n_{a,k}^{-(m,n)} denotes the number of times a word has been assigned both topic z_{m,n} = k and author x_{m,n} = a; α is a Dirichlet prior parameter vector, and α_k is the author parameter of topic k;
the denominator Σ_{k'} (n_{a,k'}^{-(m,n)} + α_{k'}) sums, over all topics k', the number of times they are assigned together with author x_{m,n} plus the author parameters of all topics;
c_{m,x_{m,n}} denotes the contribution weight of author x_{m,n} in the target document m.
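A compact sketch of one sampling step of the weighted author-topic model described above is given below. The count arrays, the scalar priors and the function name are illustrative assumptions; the proportionality follows the formula above, with the author's contribution weight multiplying the probability.

```python
import numpy as np

def sample_topic_and_author(word_id, doc_authors, author_weights,
                            n_topic_word, n_author_topic, alpha, beta, rng):
    """Draw a (author, topic) pair for one word occurrence.

    n_topic_word[k, v]: count of word v assigned to topic k (current word excluded)
    n_author_topic[a, k]: count of words assigned to author a and topic k (excluded)
    doc_authors: author ids of the current document
    author_weights: contribution weights of those authors (summing to 1)
    alpha, beta: symmetric Dirichlet priors."""
    K, V = n_topic_word.shape
    probs = np.zeros((len(doc_authors), K))
    # topic term: how strongly each topic is associated with this word
    word_term = (n_topic_word[:, word_id] + beta) / (n_topic_word.sum(axis=1) + V * beta)
    for j, a in enumerate(doc_authors):
        # author term: how strongly each topic is associated with this author
        author_term = (n_author_topic[a] + alpha) / (n_author_topic[a].sum() + K * alpha)
        probs[j] = author_weights[j] * word_term * author_term
    flat = probs.ravel() / probs.sum()
    idx = rng.choice(flat.size, p=flat)
    return doc_authors[idx // K], idx % K   # (sampled author, sampled topic)

# Usage: rng = np.random.default_rng(0); the count arrays come from the current assignment state.
```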
Fig. 3 is a schematic diagram of the word distribution in the documents provided by an embodiment of the present application. As shown in fig. 3, document 1 contains 4 occurrences of "bank", 6 of "money", 6 of "loan" and other words, and document 2 contains 5 occurrences of "bank", 7 of "money", 4 of "loan" and other words; this is only an illustrative sample, and a document contains many more words. It shows that the number of words must be determined before topic assignment is performed, and that identical words can be grouped together for counting.
On the basis of the above embodiment, as an alternative embodiment, determining the topic of each word in the document expressed in the document includes:
allocating themes of preset times to all words in the document, and after allocating themes to all words in the document every time, calculating the probability that the words are allocated to the target theme and other themes except the target theme when the theme is allocated next time according to the number of the words appearing in the document, the number of the words allocated to the target theme in the document after the theme is allocated at this time, and the number of the words allocated as the target theme in the document for any word in the document;
according to the probability that the words are allocated to the target theme and other themes except the target theme when the themes are allocated next time, allocating the themes for the next time on the words until the allocation times reach a preset threshold value;
obtaining a theme distributed when the distribution frequency of the words reaches a preset threshold value;
wherein, the target theme is the theme distributed when the word appears for the first time in the distribution.
Fig. 4 is a schematic diagram of the initial topic assignment of words provided in an embodiment of the present application. As shown in fig. 4, suppose there are topic 1 and topic 2; the word "bank" is assigned topic 1, and the word "money" is assigned topic 1 and topic 2. Specifically, in the embodiment of the present application, the topic a word expresses must be determined, so a topic is assigned to the word and the assignment is iterated until the calculated probabilities converge, at which point the word's topic is determined. In the general case, after the document has been preprocessed and segmented into words, the topic each word expresses is unknown and further analysis is needed. First, the number of words and the number and kinds of topics are determined; then a topic is randomly assigned to each word. This first assignment is the initialization, after which every word carries a topic. In the second round, topics are re-assigned according to calculated probabilities. Taking fig. 4 as an example, the first occurrence of "bank" was assigned topic 1 in the first round; the number of occurrences of "bank" assigned topic 1 is counted as 4, and the number of all words assigned topic 1 is counted as 11, so the probability that this occurrence of "bank" is assigned topic 1 is 4/11, and topic 2 is assigned with probability 7/11. If there are more topics, topic 1 is still assigned with probability 4/11 and the remaining topics are assigned according to the probabilities calculated from their assignment counts. The probability for the second occurrence of "bank" is then calculated and its topic re-assigned according to that probability, and so on for all words; completing the second round of topic assignment is recorded as one iteration. After the preset number of iterations is completed, the topic of each word in the text information is estimated.
The probability calculation corresponds to the topic part of the Gibbs sampling formula above, i.e. the probability of assigning topic z_{m,n} to the n-th word in the target document m:
Pr(z_{m,n} = k | ...) ∝ (n_{k,ω_{m,n}}^{-(m,n)} + β_{ω_{m,n}}) / (Σ_{v=1}^{V} (n_{k,v}^{-(m,n)} + β_v)).
The probability of assigning topic z_{m,n} determines which topic the n-th word in the target document m is likely to express; each word carries only one topic in a given assignment, but there are many different topics it could express, and this probability decides which one.
n_{k,ω_{m,n}}^{-(m,n)} denotes the number of times the n-th word in the target document m has been assigned topic z_{m,n}, which can be understood as the number of occurrences of the same word that are assigned this topic. β_{ω_{m,n}}, the topic parameter of the word, is a preset parameter whose value is usually 0.01; it prevents a topic from getting zero probability when some word has never been assigned to it. If only the raw count were used, a topic with count 0 would give a numerator of 0 and a resulting probability of 0, so the topic parameter is set to avoid probabilities of exactly 0. The superscript -(m,n) means that the current assignment is not counted.
The denominator Σ_v (n_{k,v}^{-(m,n)} + β_v) is the number of times the different words are assigned topic z_{m,n}, i.e. the total number of word occurrences carrying this topic: for example, the number of occurrences of "bank" assigned topic z_{m,n} plus the number of occurrences of "money" assigned topic z_{m,n}, and so on, summed together with the topic parameters of the different words.
The value so calculated is the probability that this word occurrence is assigned topic z_{m,n}, and topics are assigned according to the calculated probabilities; this is the expected estimation procedure. Probability calculation and topic assignment are completed for all words, and the process is iterated a preset number of times; one iteration starts from the first word, computes its topic probability and re-assigns its topic according to that probability, and continues until the last word has been re-assigned.
Fig. 5 is a schematic diagram of the topic assignment of words after the iterations are completed according to an embodiment of the present application. As shown in fig. 5, in document 1, after the preset number of iterations, the topic assigned to the word bank is topic 1 and the topic assigned to the word money is topic 1, so it can be determined that the topic of the word bank is topic 1 and the topic of the word money is topic 1.
In the embodiment of the application, topics are assigned to words, the probability of each assigned topic is calculated, topics are reassigned according to those probabilities, and the iterations proceed in turn until the result converges and tends to be constant, so that the topic of each word is determined, laying a foundation for subsequently determining the topics of interest of the authors.
On the basis of the above embodiment, as an alternative embodiment, determining the word in the document for which each author is responsible includes:
the method comprises the steps of allocating authors to all words in the document a preset number of times, and, after each allocation of authors to all words in the document, for any word in the document, calculating the probability that the word is allocated to the target author and to authors other than the target author when authors are allocated next time, according to the number of the words appearing in the document, the number of the words allocated to the target author in the document after this allocation of authors, and the number of the words allocated to the target author in the document;
according to the probability that the word is allocated to the target author and other authors except the target author when the author is allocated next time, allocating the author to the word next time until the allocation frequency reaches a preset threshold value;
acquiring an author to be distributed when the distribution frequency of the words reaches a preset threshold value;
wherein the target author is the author to which the word is assigned when it first appears in the assignment.
Fig. 6 is a schematic diagram of the initial author assignment of words provided in this embodiment of the present application. As shown in fig. 6, suppose the authors are author 1 and author 2; the word bank has been assigned author 1 and author 2, and the word money has been assigned author 1 and author 2. After the topic of a word has been determined, the author corresponding to the word needs to be determined; authors are assigned to words and the assignment is iterated until the calculated probabilities converge, at which point the author of each word can be determined. Specifically, in the general case, after the relevant documents are obtained, only the content of the documents and the contribution value of each author are known, and the content each author is specifically responsible for is not known, so a further step of analysis is required. First, the number of words and the number of authors are determined; then each word is randomly assigned to one author. This first assignment is the initialization, so that every word carries an author. Authors are then assigned a second time, and this second assignment is made according to the calculated author probabilities, in the same way as the topic assignment of words.
The probability calculation formula corresponds, on the same principle, to the following part of the Gibbs sampling algorithm:
$$p\left(z_{m,n}, x_{m,n} \mid \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{w}\right) \propto \frac{n_{z_{m,n},x_{m,n},-1}^{(w_{m,n})} + \alpha}{\sum_{v=1}^{V}\left(n_{z_{m,n},x_{m,n},-1}^{(v)} + \alpha\right)}$$
This formula gives the probability that the nth word in the target document m is simultaneously assigned the topic z_{m,n} and the author x_{m,n}; according to this probability, it is determined which author is responsible for the word that is assigned the topic z_{m,n}.
$n_{z_{m,n},x_{m,n},-1}^{(w_{m,n})}$
indicates the number of times the word that is the same as the nth word in the target document m is simultaneously assigned the topic z_{m,n} and the author x_{m,n}; it can also be understood as the number of word tokens identical to the nth word that are assigned the author x_{m,n} and whose topic is z_{m,n}.
$\alpha$
is the author parameter for the topic z_{m,n} of the nth word in the target document m, a preset parameter generally set to 0.01. It is used to avoid words being left without an author when authors are assigned: if a word occurs only once and was not assigned the author in the first assignment, the statistics would be computed from a value of 0, the numerator would be 0 and the resulting probability would be 0; setting the author parameter avoids a probability of 0. The subscript -1 indicates that the current assignment of the author x_{m,n} is not counted, meaning that the probability is calculated first and the assignment is then made according to that probability.
$\sum_{v=1}^{V}\left(n_{z_{m,n},x_{m,n},-1}^{(v)} + \alpha\right)$
indicates the number of times the different words among all words are simultaneously assigned the topic z_{m,n} and the author x_{m,n}; it can be understood as the number of word tokens, over all distinct words, assigned the author x_{m,n} and whose topic is z_{m,n}. For example, the number of bank tokens and the number of money tokens simultaneously assigned the topic z_{m,n} and the author x_{m,n} are counted; summing, over these different words, the number of tokens simultaneously assigned the topic z_{m,n} and the author x_{m,n} together with the parameters of the different words gives the total number of words simultaneously assigned the topic z_{m,n} and the author x_{m,n}.
The probability that the nth word in the target document m is simultaneously assigned the topic z_{m,n} and the author x_{m,n} can thus be calculated, namely
$p\left(z_{m,n}, x_{m,n} \mid \mathbf{z}_{\neg(m,n)}, \mathbf{x}_{\neg(m,n)}, \mathbf{w}\right)$
denotes the probability distribution over the authors x_{m,n} in the target document m. The author is assigned according to the calculated probability; the probability calculation and author assignment are completed for all words, and the iteration is repeated a preset number of times, where one iteration means calculating the author probability starting from the first word and assigning authors according to the probabilities until the last word has been reassigned.
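A minimal sketch of one such author-reassignment iteration is given below, again only as an illustration: the author parameter of 0.01 and the exclusion of the current assignment follow the description above, while the restriction of candidates to the document's own author list, the data structures and the function name are assumptions.

```python
import random
from collections import defaultdict

ALPHA = 0.01   # author parameter, prevents probabilities of 0

def author_pass(docs, doc_authors, z, x, vocab):
    """One iteration: reassign the author of every word token, given the token's current topic."""
    # n[(a, k)][w] = tokens of word w currently assigned author a together with topic k.
    n = defaultdict(lambda: defaultdict(int))
    for doc, topics, authors in zip(docs, z, x):
        for w, k, a in zip(doc, topics, authors):
            n[(a, k)][w] += 1
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[m][i]
            n[(x[m][i], k)][w] -= 1                     # exclude the current assignment
            weights = []
            for a in doc_authors[m]:                    # only the authors of this document
                numerator = n[(a, k)][w] + ALPHA
                denominator = sum(n[(a, k)][v] + ALPHA for v in vocab)
                weights.append(numerator / denominator)
            x[m][i] = random.choices(doc_authors[m], weights=weights)[0]
            n[(x[m][i], k)][w] += 1
    return x

# Hypothetical usage with the toy corpus from the previous sketch.
docs = [["bank", "money", "bank", "loan"], ["bank", "river", "money", "bank"]]
doc_authors = [["author 1", "author 2"], ["author 2"]]
vocab = sorted({w for d in docs for w in d})
z = [[random.randrange(2) for _ in d] for d in docs]
x = [[random.choice(doc_authors[m]) for _ in d] for m, d in enumerate(docs)]
x = author_pass(docs, doc_authors, z, x, vocab)
```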
Fig. 7 is a schematic diagram of the author assignment of words after the iterations are completed according to an embodiment of the present application. As shown in fig. 7, after the preset number of iterations in document 1, the author assigned to the word bank is author 2, and most occurrences of the word money are assigned author 2 while a few are assigned author 1, so it can be determined that the author of the word bank is author 2 and the author of the word money is author 2.
In the embodiment of the application, authors are assigned to words, the probability of each assigned author is calculated, authors are reassigned according to those probabilities, and the iterations proceed in turn until the result converges and tends to be constant, so that the author of each word, and thus the part of the document each author is responsible for, is determined, laying a foundation for subsequently determining the topics in which the authors are interested.
On the basis of the above embodiment, as an alternative embodiment, obtaining the topic expressed by the content responsible for each author in the document according to the topic expressed by each word in the document, the word responsible for each author in the document and the contribution weight of each author in the document includes:
for any author in the literature, selecting a topic expressed in the literature by each word and a word responsible for each author in the literature according to the final weight of the author;
and taking the word responsible by the author as a target word, and determining the topic expressed in the literature by the author according to the topic expressed in the literature by the target word.
After the topic of each word and the author of each word have been determined, the embodiment of the application samples according to the final weights of the authors. An author with a larger final weight has a larger sampling probability, and the words associated with that author are sampled more often, so that the topics of the content the author is responsible for in the document are determined from the topics of the sampled words. Because such an author contributes more, the topics of the content the author is responsible for may be more diversified. The author weight serves as the sampling probability, which highlights the contribution of the author in the document and better indicates the topics of the content the author is responsible for. Specifically, if a document is jointly the responsibility of two authors, with Zhang San mainly responsible and carrying a weight of 80% and Li Si carrying a weight of 20%, then when the topics of the words and the authors of the words are analyzed, sampling is performed in proportion to these weights. The words associated with Zhang San and the topics associated with those words are sampled with higher probability, and through this high-probability sampling the topics Zhang San expresses in the document are determined; Zhang San is likely to be responsible for more content and to be associated with more topics, so the topics in which Zhang San is interested may be more diversified. Meanwhile, the words associated with Li Si and the topics associated with those words are sampled with relatively low probability, but the topics Li Si expresses in the document can still be determined; Li Si may also be interested in many topics, but in this document the topics Li Si is responsible for are fewer, and the topics of interest that stand out may be relatively few.
After the topic of each word and the author of each word are determined, sampling analysis is performed according to the final weights of the authors, so that authors with larger weights have higher sampling probabilities; this highlights the contribution of each author in the document and indicates the topics of the content each author is responsible for.
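The weight-driven sampling can be sketched as follows, purely as an illustration: the 0.8/0.2 weights correspond to the Zhang San / Li Si example above, and the data layout and helper name are assumptions.

```python
import random
from collections import Counter

# Hypothetical final weights and per-word (author, topic) assignments from the iterations above.
final_weights = {"Zhang San": 0.8, "Li Si": 0.2}
word_author = ["Zhang San", "Zhang San", "Li Si", "Zhang San"]   # author of each word token
word_topic  = [1, 2, 3, 1]                                       # topic of each word token

def topics_by_author(final_weights, word_author, word_topic, num_samples=1000):
    """Sample word tokens in proportion to their author's final weight and tally, per author,
    the topics expressed by the content that author is responsible for."""
    tallies = {a: Counter() for a in final_weights}
    probs = [final_weights[a] for a in word_author]
    for _ in range(num_samples):
        i = random.choices(range(len(word_author)), weights=probs)[0]
        tallies[word_author[i]][word_topic[i]] += 1
    return tallies

print(topics_by_author(final_weights, word_author, word_topic))
```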
Fig. 8 is a schematic diagram of the author interest disclosure model provided in an embodiment of the present application. As shown in fig. 8, the author interest disclosure model uses the Gibbs sampling formulas to calculate the topic z_{m,n} of the nth word in the target document m and the author x_{m,n} of the nth word in the target document m. Table 1 gives a description of each parameter in the author interest disclosure model.
Table 1: Description of each parameter in the author interest disclosure model
On the basis of the above embodiment, as an alternative embodiment, determining relevant documents in charge of an author from at least one document, and obtaining topics expressed by the content in charge of the relevant documents by the author to determine topics of interest to the author includes:
acquiring relevant documents responsible for authors, and determining topics expressed by the authors in the content responsible for the relevant documents;
determining an interest topic of an author in the topics expressed by the author according to the topics expressed by the author;
calculating the probability of the occurrence of the interest topic of the author according to the occurrence times of the interest topic of the author in the related documents in charge of the author, and taking the topic with the probability exceeding a preset threshold value as the interest topic of the author.
After the topics expressed by an author in individual documents have been determined, the embodiment of the present application collects the topics expressed by the author across the documents related to that author, summarizes them, and selects the topics with a higher probability of occurrence as the author's topics of interest. For example, author A expresses topic 1 and topic 2 in a first document and topic 3 and topic 4 in a second document; to determine the topics of interest of author A, more documents related to the author are collected, the topics the author expresses in the different documents are gathered, the number of occurrences of each topic is counted, the probability of each topic is calculated from these counts, and the topics whose probability exceeds a preset threshold are taken as the author's topics of interest.
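As a simple illustration of this aggregation step, the sketch below counts an author's topics across related documents and keeps those above a threshold; the threshold value of 0.2, the function name and the example lists are assumptions.

```python
from collections import Counter

def interest_topics(topics_per_document, threshold=0.2):
    """Aggregate the topics an author expresses across the related documents the author is
    responsible for, and keep the topics whose occurrence probability exceeds the threshold."""
    counts = Counter()
    for topics in topics_per_document:          # one list of expressed topics per document
        counts.update(topics)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items() if c / total > threshold}

# Hypothetical example for author A: topics 1 and 2 in the first document,
# topics 3 and 4 in the second, and topic 1 again in a third.
print(interest_topics([[1, 2], [3, 4], [1]]))   # only topic 1 (probability 0.4) exceeds 0.2
```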
The author interest disclosure model incorporating the contribution-weight assignment mechanism proposed by the present invention (the AT_credit model) is used to find the topics of interest of each researcher in the data set. Taking two highly productive scholars of the University of Toronto, Boone, Charles and Andrews, Breda J., as an example, Table 2 lists the topics of interest and the corresponding probabilities calculated by various algorithms. As shown in Table 2, for Boone, Charles and Andrews, Breda J., the top-3 topics of interest found by the AT_credit model and their corresponding probabilities are given; for example, the topics with a probability greater than 10.00% are taken as the topics of interest of each scholar. Table 3 lists 3 in-depth domain topics mined by the AT_credit model; as shown in Table 3, each domain topic is represented by its 10 most relevant words.
Table 2: Topics of interest and topic probabilities calculated by various algorithms
Table 3: The 3 in-depth domain topics mined by the AT_credit model
From Table 3, it can be found that the research interests of Boone, Charles are mainly focused on "genetic interaction", while the interests of Andrews, Breda J.
Fig. 9 is a schematic structural diagram of an apparatus for analyzing an interest topic of an author according to an embodiment of the present application. As shown in fig. 9, the apparatus may include a first obtaining module 301, a determining module 302 and a second obtaining module 303, specifically:
a first obtaining module 301, configured to obtain at least one document in a target field, and determine a contribution weight of each author in the document; the contribution weight is a normalized result of the contribution value of the author;
a determining module 302, configured to determine, for each document, a topic in the document that each word in the document expresses in the document and a word in the document that each author is responsible for; obtaining a topic expressed by each author in the document according to the topic expressed by each word in the document, the word responsible for each author in the document and the contribution weight of each author in the document;
and a second obtaining module 303, configured to, for each author, determine, from at least one document, a relevant document for which the author is responsible, and obtain a topic expressed by the content for which the author is responsible in the relevant document, so as to determine an interest topic of the author.
The apparatus for analyzing an author's interest topic provided in the embodiment of the present invention specifically executes the process of the method embodiment; for details, refer to the contents of the method embodiment for analyzing an author's interest topic, which are not repeated here. The apparatus for analyzing an author's interest topic acquires at least one document in a target field and determines the contribution weight of each author in the document, the topic expressed by each word in the document and the words each author is responsible for in the document; it obtains the topics expressed by each author in the document according to the topics expressed by each word in the document, the words each author is responsible for in the document and the contribution weight of each author in the document, and determines the topics of interest of an author according to the topics expressed by the content the author is responsible for in the relevant documents. According to the embodiment of the application, the topics of interest of authors can be found on the premise that each author contributes unequally to a multi-author article, the topics of interest of scientific researchers are reasonably reflected, the exploration of research hotspots and trends in the subject field is facilitated, and personalized academic research can be promoted.
Further, the first obtaining module 301 includes:
the preprocessing module is used for acquiring authors and contribution values of each author in the document;
determining an initial weight of each author according to the number of authors in the document and the contribution value of each author;
and normalizing the initial weight of each author in the literature to obtain the final weight of each author in the literature.
Further, a pre-processing module comprising:
the first weight calculation module is used for performing descending arrangement on the authors in the document according to the contribution values of the authors in the document to obtain the ordering result of each author in the document if the number of the authors in the document does not exceed the preset number value;
and calculating the initial weight of each author according to a preset weight algorithm according to the sequencing result of each author.
Further, the preprocessing module further comprises:
the second weight calculation module is used for performing descending arrangement on the authors in the document according to the contribution values of the authors in the document to obtain the ordering result of each author in the document if the number of the authors in the document exceeds a preset number value;
when the sorting result of the authors is smaller than or equal to the preset quantity value, calculating to obtain the initial weight of the authors of which the sorting result is smaller than or equal to the preset quantity according to a preset weight algorithm according to the sorting result of the authors of which the sorting result is smaller than or equal to the preset quantity value in the sorting results of the authors;
when the sorting result of the author is larger than the preset quantity value, taking the preset multiple of the initial weight of the first author as the initial weight of all authors of which the sorting result is larger than the preset quantity value;
the first author is the author in the document whose ranking result is the first.
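For illustration only, the weight computation described by the preprocessing module above can be sketched as follows; the 1/rank rule for the "preset weight algorithm", the preset count of 2, the preset multiple of 0.01 and the example contribution values are all assumptions, since the embodiment only requires some preset weight algorithm, preset number value and preset multiple.

```python
def author_contribution_weights(contributions, preset_count=2, preset_multiple=0.01):
    """Sketch of the contribution-weight step: rank authors by contribution value in descending
    order, give the first `preset_count` ranks an initial weight from a rank-based rule
    (1/rank here, purely illustrative), give later ranks a preset multiple of the first
    author's initial weight, then normalize to obtain the final weights."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    first_weight = 1.0                       # the 1/rank rule at rank 1 (illustrative)
    initial = {}
    for rank, (author, _) in enumerate(ranked, start=1):
        if rank <= preset_count:
            initial[author] = 1.0 / rank     # hypothetical "preset weight algorithm"
        else:
            initial[author] = preset_multiple * first_weight
    total = sum(initial.values())
    return {author: w / total for author, w in initial.items()}

# Hypothetical contribution values for the authors of one document.
print(author_contribution_weights({"author A": 5.0, "author B": 3.0, "author C": 1.0}))
```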
Further, the determining module 302 includes:
the theme determining module is used for allocating themes to all words in the document a preset number of times, and, after each allocation of themes to all words in the document, for any word in the document, calculating the probability that the word is allocated to the target theme and to themes other than the target theme when themes are allocated next time, according to the number of the words appearing in the document, the number of the words allocated to the target theme in the document after this allocation of themes, and the number of the words allocated to the target theme in the document;
according to the probability that the words are allocated to the target theme and other themes except the target theme when the themes are allocated next time, allocating the themes for the next time on the words until the allocation times reach a preset threshold value;
obtaining a theme distributed when the distribution frequency of the words reaches a preset threshold value;
wherein, the target theme is the theme distributed when the word appears for the first time in the distribution.
Further, the determining module 302 further includes:
the author confirming module is used for allocating authors of preset times to all words in the document, and after the author allocation to all the words in the document is completed each time, for any word in the document, calculating the probability that the word is allocated to the target author and other authors except the target author when the author is allocated next time according to the number of the words appearing in the document, the number of the words allocated to the target author in the document after the author is allocated at this time, and the number of the words allocated to the target author in the document;
according to the probability that the word is allocated to the target author and other authors except the target author when the author is allocated next time, allocating the author to the word next time until the allocation frequency reaches a preset threshold value;
acquiring an author to be distributed when the distribution frequency of the words reaches a preset threshold value;
wherein the target author is the author to which the word is assigned when it first appears in the assignment.
Further, the preprocessing module further comprises:
the interest topic module is used for selecting a topic expressed by a word in the document and a word responsible for the author in the document for any author in the document according to the final weight of the author;
and taking the word responsible by the author as a target word, and determining the topic expressed in the literature by the author according to the topic expressed in the literature by the target word.
Further, the second obtaining module 303 includes:
the document acquisition module is used for acquiring relevant documents responsible for the author and determining a theme expressed by the content responsible for the relevant documents by the author;
determining an interest topic of an author in the topics expressed by the author according to the topics expressed by the author;
calculating the probability of the occurrence of the interest topic of the author according to the occurrence times of the interest topic of the author in the related documents in charge of the author, and taking the topic with the probability exceeding a preset threshold value as the interest topic of the author.
An embodiment of the present application provides an electronic device, including a memory and a processor, with at least one program stored in the memory for execution by the processor; when executed by the processor, the program implements: acquiring at least one document in a target field and determining the contribution weight of each author in the document, the topic expressed by each word in the document and the words each author is responsible for in the document; obtaining the topics expressed by each author in the document according to the topics expressed by each word in the document, the words each author is responsible for in the document and the contribution weight of each author in the document; and determining the topics of interest of an author according to the topics expressed by the content the author is responsible for in the relevant documents. According to the embodiment of the application, the topics of interest of authors can be found on the premise that each author contributes unequally to a multi-author article, the topics of interest of scientific researchers are reasonably reflected, the exploration of research hotspots and trends in the subject field is facilitated, and personalized academic research can be promoted.
In an alternative embodiment, an electronic device is provided, as shown in fig. 10, the electronic device 4000 shown in fig. 10 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The present application provides a computer-readable storage medium on which a computer program is stored; when run on a computer, the program enables the computer to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, at least one document in a target field is acquired, and the contribution weight of each author in the document, the topic expressed by each word in the document and the words each author is responsible for in the document are determined; the topics expressed by each author in the document are obtained according to the topics expressed by each word in the document, the words each author is responsible for in the document and the contribution weight of each author in the document; and the topics of interest of an author are determined according to the topics expressed by the content the author is responsible for in the relevant documents. According to the embodiment of the application, the topics of interest of authors can be found on the premise that each author contributes unequally to a multi-author article, the topics of interest of scientific researchers are reasonably reflected, the exploration of research hotspots and trends in the subject field is facilitated, and personalized academic research can be promoted.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that, for those skilled in the art, various improvements and refinements can be made without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (11)

1. A method for analyzing interest topics of authors is characterized by comprising the following steps:
acquiring at least one document of a target field, and determining the contribution weight of each author in the document; the contribution weight is a normalized result of the contribution value of the author;
for each document, determining a topic in the document that each word in the document expresses in the document, and a word in the document that each author is responsible for; obtaining the topic expressed by each author in the document according to the topic expressed by each word in the document, the word responsible for each author in the document and the contribution weight of each author in the document;
for each author, determining related documents responsible for the author from the at least one document, and obtaining topics expressed by the content responsible for the related documents by the author to determine interest topics of the author.
2. The method of claim 1, wherein the determining the contribution weight of each author in the document comprises:
acquiring the contribution value of the author and each author in the document;
determining an initial weight of each author according to the number of authors in the document and the contribution value of each author;
normalizing the initial weight of each author in the document to obtain the final weight of each author in the document.
3. The method for analyzing author interest topic according to claim 2, wherein the determining an initial weight of each author according to the number of authors and contribution value of each author in the document comprises:
if the number of the authors in the document does not exceed a preset number value, performing descending order arrangement on the authors in the document according to the contribution value of each author in the document to obtain an ordering result of each author in the document;
and calculating the initial weight of each author according to a preset weight algorithm according to the sequencing result of each author.
4. The method of analyzing author interest topic of claim 2, wherein the determining an initial weight of each author based on the number of authors and the contribution value of each author in the document further comprises:
if the number of the authors in the literature exceeds a preset number value, performing descending order arrangement on the authors in the literature according to the contribution value of each author in the literature to obtain an ordering result of each author in the literature;
when the sorting result of the authors is smaller than or equal to the preset quantity value, calculating to obtain the initial weight of the authors of which the sorting result is smaller than or equal to the preset quantity according to a preset weight algorithm according to the sorting result of the authors of which the sorting result is smaller than or equal to the preset quantity value in the sorting results of the authors;
when the sorting result of the authors is greater than a preset quantity value, taking a preset multiple of the initial weight of the first author as the initial weight of all the authors of which the sorting result is greater than the preset quantity value;
the first author is the author in the document whose ranking result is first.
5. The method for analyzing the author's topic of interest as recited in claim 1, wherein the determining the topic of each word in the document expressed in the document comprises:
allocating themes to all words in the document a preset number of times, and, after each allocation of themes to all words in the document, for any word in the document, calculating the probability that the word is allocated to the target theme and to other themes except the target theme when themes are allocated next time, according to the number of the words appearing in the document, the number of the words allocated to the target theme in the document after this allocation of themes, and the number of the words allocated to the target theme in the document;
according to the probability that the word is allocated to the target theme and other themes except the target theme when the theme is allocated next time, allocating the theme for the next time on the word until the allocation frequency reaches a preset threshold value;
obtaining a theme distributed when the distribution frequency of the words reaches a preset threshold value;
and the target theme is the theme distributed when the word appears for the first time in the distribution.
6. The method for analyzing interest topics of authors as claimed in claim 1, wherein the determining words in the document for which each author is responsible comprises:
assigning authors of a preset number of times to all words in the document, and after assigning authors to all words in the document is completed each time, calculating the probability that a word is assigned to a target author and other authors except the target author when an author is assigned next time according to the number of the words appearing in the document, the number of the words assigned to the target author in the document after the authors are assigned this time, and the number of the words assigned to the target author in the document for any word in the document;
according to the probability that the word is allocated to the target author and other authors except the target author when the author is allocated next time, allocating authors for the next time until the allocation times reach a preset threshold value;
acquiring an author to be distributed when the distribution frequency of the words reaches a preset threshold value;
wherein the target author is an author assigned to the word when the word first appears in the assignment.
7. The method for analyzing interest topics of authors according to claim 2, wherein the obtaining of the topics expressed by the content responsible for each author in the document according to the topics expressed by each word in the document, the words responsible for each author in the document and the contribution weight of each author in the document comprises:
for any author in the literature, selecting a topic expressed by a word in the literature and a word responsible for the author in the literature according to the final weight of the author;
and taking the word responsible by the author as a target word, and determining the expressed topic of the author in the literature according to the expressed topic of the target word in the literature.
8. The method for analyzing interest topics of authors according to claim 1, wherein the determining related documents responsible for authors from the at least one document, and obtaining topics expressed by contents responsible for related documents of authors to determine interest topics of authors comprises:
acquiring relevant documents responsible for the author, and determining a theme expressed by the content of the author in the responsibility of the relevant documents;
determining an interest topic of an author in the topics expressed by the author according to the topics expressed by the author;
calculating the occurrence probability of the interest topic of the author according to the occurrence times of the interest topic of the author in related documents in charge of the author, and taking the topic of which the probability exceeds a preset probability value as the interest topic of the author.
9. An apparatus for analyzing interest topics of authors, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring at least one document of a target field and determining the contribution weight of each author in the document; the contribution weight is a normalized result of the contribution value of the author;
a determining module for determining, for each document, a topic of each word in the document expressed in the document and a word in charge of each author in the document; obtaining the topic expressed by each author in the document according to the topic expressed by each word in the document, the word responsible for each author in the document and the contribution weight of each author in the document;
and the second acquisition module is used for determining, for each author, related documents for which the author is responsible from the at least one document, and acquiring the topics expressed by the content the author is responsible for in the related documents, to determine the topics of interest of the author.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method for analyzing a topic of interest of an author as claimed in any one of claims 1 to 8 when executing said program.
11. A computer-readable storage medium, characterized in that it stores computer instructions that make the computer execute the steps of the method for analyzing the author's interest topic according to any one of claims 1 to 8.
CN202011625275.6A 2020-12-31 Method and device for analyzing interest subject of author, electronic equipment and storage medium Active CN112765305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011625275.6A CN112765305B (en) 2020-12-31 Method and device for analyzing interest subject of author, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011625275.6A CN112765305B (en) 2020-12-31 Method and device for analyzing interest subject of author, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112765305A true CN112765305A (en) 2021-05-07
CN112765305B CN112765305B (en) 2024-05-14




Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605671A (en) * 2013-10-29 2014-02-26 中国科学技术信息研究所 Scientific research information evolution analyzing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
余传明 et al.: "Dynamic discovery of author research interests based on a composite topic evolution model", 《山东大学学报(理学版)》 (Journal of Shandong University (Natural Science)), no. 9, pages 23-33 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515638A (en) * 2021-09-14 2021-10-19 北京邮电大学 Student clustering-oriented research interest mining method and device and storage medium
CN113515638B (en) * 2021-09-14 2021-12-07 北京邮电大学 Student clustering-oriented research interest mining method and device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant