CN110059180B - Article author identity recognition and evaluation model training method and device and storage medium - Google Patents

Info

Publication number
CN110059180B
Authority
CN
China
Prior art keywords
article
word vector
word
vector representation
author
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910187998.3A
Other languages
Chinese (zh)
Other versions
CN110059180A (en
Inventor
刘焱
吕中厚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910187998.3A
Publication of CN110059180A
Application granted
Publication of CN110059180B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an article author identification method, an evaluation model training method, corresponding devices, and a storage medium. The article author identification method may comprise the following steps: acquiring a word vector representation of a first article to be identified, and inputting it into a pre-trained evaluation model to obtain an output feature vector of the first article, wherein the features are features that can be used to distinguish the characteristics of different authors; for a second article of a known author, acquiring a word vector representation of the second article, and inputting it into the pre-trained evaluation model to obtain an output feature vector of the second article; and determining whether the first article and the second article belong to the same author by comparing the feature vector of the first article with the feature vector of the second article. The scheme of the invention is highly extensible and can correspondingly improve recognition efficiency.

Description

Article author identity recognition and evaluation model training method and device and storage medium
[ technical field ]
The invention relates to a computer application technology, in particular to an article author identity recognition and evaluation model training method, device and storage medium.
[ background ]
In various fields such as archaeology, national defense, public opinion analysis and the like, the real identity of an anonymous author needs to be determined, namely, the identity of an article author needs to be identified.
At present, the following approach is generally adopted: features such as information entropy, word frequency, and language model (n-gram) statistics are extracted from an article, and a machine learning classification model such as a support vector machine is then used for classification. This approach has poor extensibility: during author identification, only authors that were present when the model was trained can be matched, and the model must be retrained whenever a new author is added.
[ summary of the invention ]
In view of the above, the present invention provides an article author identification and evaluation model training method, apparatus and storage medium.
The specific technical scheme is as follows:
An article author identification method, comprising:
acquiring a word vector representation of a first article to be identified, and inputting it into a pre-trained evaluation model to obtain an output feature vector of the first article, wherein the features are features that can be used to distinguish the characteristics of different authors;
for a second article of a known author, acquiring a word vector representation of the second article, and inputting it into the pre-trained evaluation model to obtain an output feature vector of the second article;
and determining whether the first article and the second article belong to the same author or not by comparing the feature vector of the first article with the feature vector of the second article.
According to a preferred embodiment of the present invention, for any article, the manner of obtaining the word vector representation of the article includes:
performing word segmentation processing on the article;
truncating the article by keeping the first L word segmentation results, wherein L is a positive integer greater than one;
obtaining an n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one;
and using the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of L rows and n columns as the word vector representation of the article.
According to a preferred embodiment of the present invention, the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
According to a preferred embodiment of the present invention, the determining whether the first article and the second article belong to the same author by comparing the feature vector of the first article and the feature vector of the second article comprises:
determining a difference between the feature vector of the first article and the feature vector of the second article;
if the difference is smaller than a preset threshold value, the first article and the second article are determined to belong to the same author, otherwise, the first article and the second article are determined not to belong to the same author.
An evaluation model training method, comprising:
acquiring articles of known authors serving as training samples, and classifying the articles belonging to the same author into a class;
respectively obtaining word vector representation of each article;
training an evaluation model according to the word vector representation and the belonged classification of each article so as to respectively evaluate a feature vector corresponding to the word vector representation of the input first article and a feature vector corresponding to the word vector representation of the second article by using the evaluation model when identifying the identity of the author of the articles, and determining whether the first article and the second article belong to the same author or not by comparing the two feature vectors, wherein the features are features capable of being used for distinguishing the characteristics of different authors.
According to a preferred embodiment of the present invention, for any article, the manner of obtaining the word vector representation of the article includes:
performing word segmentation processing on the article;
truncating the article by keeping the first L word segmentation results, wherein L is a positive integer greater than one;
obtaining an n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one;
and using the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of L rows and n columns as the word vector representation of the article.
According to a preferred embodiment of the present invention, the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
According to a preferred embodiment of the present invention, the training of the evaluation model according to the word vector representation and the classification of each article comprises:
in each training step, inputting the word vector representations of three articles, wherein two of the articles belong to the same category and correspond to the anchor example and the positive example of the Triplet Loss respectively, and the third article belongs to another category and corresponds to the negative example of the Triplet Loss.
An article author identification apparatus, comprising: a first obtaining unit and an identity recognition unit;
the first obtaining unit is used for obtaining word vector representation of a first article to be recognized, inputting an evaluation model obtained through pre-training, and obtaining a feature vector of the output first article, wherein the feature is a feature capable of being used for distinguishing features of different authors; aiming at a second article of a known author, acquiring word vector representation of the second article, and inputting an evaluation model obtained by pre-training to obtain an output feature vector of the second article;
the identity recognition unit is used for determining whether the first article and the second article belong to the same author by comparing the feature vector of the first article with the feature vector of the second article.
According to a preferred embodiment of the present invention, for any article, the first obtaining unit obtains a word vector representation of the article in the following manner: performing word segmentation on the article; truncating the article by keeping the first L word segmentation results, wherein L is a positive integer greater than one; obtaining an n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one; and using the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of L rows and n columns as the word vector representation of the article.
According to a preferred embodiment of the present invention, the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
According to a preferred embodiment of the present invention, the identity recognition unit determines a difference between the feature vector of the first article and the feature vector of the second article, and determines that the first article and the second article belong to the same author if the difference is smaller than a predetermined threshold, or otherwise determines that the first article and the second article do not belong to the same author.
An evaluation model training apparatus comprising: a second obtaining unit and a model training unit;
the second acquisition unit is used for acquiring articles of known authors serving as training samples and classifying the articles belonging to the same author into one category; respectively obtaining word vector representation of each article;
the model training unit is used for training an evaluation model according to word vector representations and belonged classifications of articles so as to respectively evaluate feature vectors corresponding to word vector representations of input first articles and feature vectors corresponding to word vector representations of second articles by using the evaluation model when article author identities are identified, and determine whether the first articles and the second articles belong to the same author or not by comparing the two feature vectors, wherein the features are features capable of being used for distinguishing characteristics of different authors.
According to a preferred embodiment of the present invention, for any article, the second obtaining unit obtains the word vector representation of the article in the following manner: performing word segmentation on the article; truncating the article by keeping the first L word segmentation results, wherein L is a positive integer greater than one; obtaining an n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one; and using the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of L rows and n columns as the word vector representation of the article.
According to a preferred embodiment of the present invention, the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
According to a preferred embodiment of the present invention, the model training unit inputs word vector representations of three articles at each training, wherein two articles belong to the same category, corresponding to the anchor instance and the positive instance of the Triplet Loss, respectively, and the other article belongs to the other category, corresponding to the negative instance of the Triplet Loss.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method as set forth above.
Based on the above description, it can be seen that, by using the scheme of the present invention, for a first article to be recognized and a second article of a known author, word vector representations of the first article and the second article can be respectively obtained, and the word vector representations of the two articles can be respectively input into an evaluation model, so as to respectively obtain feature vectors of the two articles, and further, whether the two articles belong to the same author can be determined by comparing the feature vectors of the two articles.
[ description of the drawings ]
FIG. 1 is a flowchart of an embodiment of an evaluation model training method according to the present invention.
Fig. 2 is a schematic diagram of the distance optimization method according to the present invention.
FIG. 3 is a flowchart illustrating an embodiment of an article author identification method according to the present invention.
Fig. 4 is a schematic structural diagram of an article author identification apparatus according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a composition structure of an evaluation model training apparatus according to an embodiment of the present invention.
FIG. 6 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention.
[ detailed description ]
In order to make the technical solution of the present invention clearer and more obvious, the solution of the present invention is further described below by referring to the drawings and examples.
It should be apparent that the described embodiments are only some of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
The invention provides an article author identity identification method which can obtain word vector representation of a first article to be identified, input the word vector representation of the first article into an evaluation model obtained through pre-training to obtain a feature vector of the output first article, wherein the feature is a feature capable of being used for distinguishing characteristics of different authors.
It can be seen that in order to implement the method of the present invention, training of the evaluation model is first required.
FIG. 1 is a flowchart of an embodiment of an assessment model training method according to the present invention. As shown in fig. 1, the following detailed implementation is included.
In 101, articles of known authors as training samples are obtained and articles belonging to the same author are classified into one category.
At 102, word vector representations of the articles are obtained, respectively.
In 103, training an evaluation model according to the word vector representation and the belonged classification of each article, so that when article author identity recognition is performed, the evaluation model is used to evaluate feature vectors corresponding to the word vector representation of the input first article and the word vector representation of the second article, and whether the first article and the second article belong to the same author is determined by comparing the two feature vectors, where the feature is a feature that can be used to distinguish different author characteristics.
How the articles of known authors serving as training samples are obtained is not limited. For example, articles of known authors may be crawled from mainstream media such as the national newspaper, the national network, and The Beijing News, and the articles belonging to the same author may be classified into one category.
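The grouping of step 101 can be illustrated with a minimal sketch. The function name `group_by_author` and the `(author, text)` pair format are assumptions made for this example only; the embodiment does not prescribe any data structure.

```python
from collections import defaultdict


def group_by_author(samples):
    """Group (author, article_text) pairs so that each author forms one category."""
    classes = defaultdict(list)
    for author, text in samples:
        classes[author].append(text)
    return dict(classes)


# Hypothetical training samples with known authors.
samples = [
    ("author_a", "first article by a"),
    ("author_b", "only article by b"),
    ("author_a", "second article by a"),
]
classes = group_by_author(samples)
```

Each key of `classes` then plays the role of one category in the subsequent training.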
For each article as a training sample, its word vector representation can be obtained separately.
Specifically, for any article, word segmentation may first be performed on the article, and the article may then be truncated by keeping only the first L word segmentation results, where L is a positive integer greater than one. An n-dimensional word vector representation of each word segmentation result may then be obtained, where n is a positive integer greater than one, and the word vector representations of the L word segmentation results may be used to form a word vector representation of L rows and n columns, which is taken as the word vector representation of the article.
An existing word segmentation method may be used to segment the article. Only the first L word segmentation results are then kept, that is, the article is truncated and the subsequent content is discarded; the specific value of L may be determined according to actual needs. The n-dimensional word vector representation of each word segmentation result may be obtained using word vectors trained on a large corpus, such as word2vec (word to vector); the specific value of n may likewise be determined according to actual needs. The word vector representations of the L word segmentation results thus form a word vector representation of L rows and n columns in which each row is the word vector representation of one word segmentation result: the first row may be that of the first word segmentation result, the second row that of the second, and so on. This word vector representation of L rows and n columns may be used as the word vector representation of the article.
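The construction of the L-row, n-column representation can be sketched as follows. This is an illustration under several assumptions that are not part of the embodiment: whitespace splitting stands in for a real word segmenter, the `embeddings` dictionary stands in for pre-trained word vectors such as word2vec, unknown words map to a zero vector, and articles with fewer than L segmentation results are padded with zero rows.

```python
def article_to_matrix(text, embeddings, L, n):
    """Return an L x n list-of-lists word vector representation of `text`."""
    tokens = text.split()      # placeholder for a real word segmentation step
    tokens = tokens[:L]        # keep only the first L word segmentation results
    zero = [0.0] * n
    # One row per kept token; unknown tokens fall back to a zero vector (assumption).
    rows = [list(embeddings.get(tok, zero)) for tok in tokens]
    # Assumption: pad short articles with zero rows so the shape is always L x n.
    missing = L - len(rows)
    rows.extend([0.0] * n for _ in range(missing))
    return rows


# Toy 3-dimensional "pre-trained" vectors, hypothetical values.
embeddings = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "sat": [0.7, 0.8, 0.9],
}
matrix = article_to_matrix("the cat sat on the mat", embeddings, L=4, n=3)
```

Here `matrix` has four rows: the vectors of "the", "cat" and "sat", followed by a zero row for the unknown word "on"; the remaining words are discarded by truncation.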
After the word vector representation of each article is acquired, an evaluation model can be trained according to the word vector representation and the belonged classification of each article.
The evaluation model described in this embodiment may be a convolution-based deep learning model and may use Triplet Loss as its loss function.
Accordingly, in each training step, the word vector representations of three articles may be input, two of which belong to the same category and correspond to the Anchor example and the Positive example of Triplet Loss respectively, while the third belongs to another category and corresponds to the Negative example of Triplet Loss.
Triplet Loss is a loss function used in deep learning for training on samples that differ only slightly. A triplet is formed as follows: a sample, called the Anchor, is randomly selected from the training data set; a sample Positive belonging to the same category as the Anchor and a sample Negative belonging to a different category are then randomly selected, thereby forming an (Anchor, Positive, Negative) triplet, namely the Anchor example, the Positive example and the Negative example.
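The triplet construction described above can be sketched as follows. The helper `sample_triplet` and the `classes` dictionary (author to list of articles) are hypothetical names for this example. As a simplification, the sketch picks the Anchor's author first and then two distinct articles of that author, rather than drawing the Anchor uniformly from the whole data set.

```python
import random


def sample_triplet(classes, rng):
    """Pick (anchor, positive, negative) from a dict mapping author -> list of articles."""
    # The anchor's category needs at least two articles to supply a distinct positive.
    eligible = [author for author, arts in classes.items() if len(arts) >= 2]
    anchor_author = rng.choice(eligible)
    anchor, positive = rng.sample(classes[anchor_author], 2)
    # The negative comes from any other category.
    negative_author = rng.choice([a for a in classes if a != anchor_author])
    negative = rng.choice(classes[negative_author])
    return anchor, positive, negative


classes = {
    "author_a": ["a1", "a2", "a3"],
    "author_b": ["b1", "b2"],
    "author_c": ["c1"],
}
rng = random.Random(0)  # fixed seed only to make the example repeatable
anchor, positive, negative = sample_triplet(classes, rng)
```

In practice each element of the triplet would be the word vector representation of an article rather than a string label.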
For each element (sample) in the triplet, a network with shared or unshared parameters is trained to obtain the feature expressions of the three elements, which can be respectively recorded as $f(x_i^a)$, $f(x_i^p)$ and $f(x_i^n)$, wherein $f(x_i^a)$ is the feature expression of the Anchor, $f(x_i^p)$ is the feature expression of the Positive, and $f(x_i^n)$ is the feature expression of the Negative. The purpose of Triplet Loss is to make, through learning, the distance between the feature expressions of the Anchor and the Positive as small as possible and the distance between the feature expressions of the Anchor and the Negative as large as possible, with a minimum interval $\alpha$ between the Anchor-Negative distance and the Anchor-Positive distance. The corresponding objective function is as follows:
$$\sum_{i}^{N}\left[\left\|f(x_i^a)-f(x_i^p)\right\|_2^2-\left\|f(x_i^a)-f(x_i^n)\right\|_2^2+\alpha\right]_+$$
Here the distance is measured by the Euclidean distance, and the $+$ after $[\,]$ indicates that when the value inside $[\,]$ is greater than zero, that value is taken as the loss, and when it is less than zero, the loss is zero.
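The behaviour of the loss for a single triplet can be sketched as follows. The sketch assumes the squared Euclidean distance, as in the commonly used formulation of Triplet Loss; the function names and the example vectors are hypothetical.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha):
    """Loss for one triplet of feature vectors with minimum interval `alpha`."""
    value = (squared_euclidean(anchor, positive)
             - squared_euclidean(anchor, negative)
             + alpha)
    # The [.]_+ operation: keep the value when positive, otherwise the loss is zero.
    return max(value, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]

# Negative already far from the anchor: 1 - 25 + 1 < 0, so the loss is zero.
loss_satisfied = triplet_loss(anchor, positive, [3.0, 4.0], alpha=1.0)

# Negative as close as the positive: 1 - 1 + 1 = 1, so the triplet is penalized.
loss_violated = triplet_loss(anchor, positive, [1.0, 0.0], alpha=1.0)
```

Only triplets that violate the minimum interval contribute to the total, which is what pushes same-author articles together and different-author articles apart.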
Correspondingly, in this embodiment, Triplet Loss is used as the loss function so that the Euclidean distance between the deep learning network outputs for articles of the same author is smaller and, conversely, the Euclidean distance between the outputs for articles of different authors is larger. That is, similarity between samples is learned by optimizing the distance between the Anchor example and the Positive example to be smaller than the distance between the Anchor example and the Negative example. Fig. 2 is a schematic diagram of this distance optimization according to the present invention.
In this embodiment, in each training step, the word vector representations of three articles may be input, where two articles belong to the same category and correspond to the Anchor example and the Positive example respectively, and the third article belongs to another category and corresponds to the Negative example. Through this training process, the evaluation model can learn which features can be used to distinguish different authors. In this way, when the evaluation model is subsequently used for article author identification, it can evaluate the feature vector of the article corresponding to the input word vector representation, the features being features that can be used to distinguish the characteristics of different authors. How to perform the evaluation model training itself is prior art.
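The embodiment leaves the architecture of the convolution-based model open. Purely as an illustration, the sketch below shows one possible forward pass that maps an L-row, n-column word vector representation to a feature vector: each filter spans a few consecutive rows, a ReLU is applied, and the maximum over all positions is kept. The filter width, the number of filters, the ReLU and the max pooling are all assumptions of this example, and the filter weights are fixed by hand rather than learned.

```python
def conv_features(matrix, filters, width):
    """Map an L x n matrix to one feature per filter (convolution, ReLU, max pooling)."""
    L = len(matrix)
    features = []
    for filt in filters:                      # each filter has shape width x n
        best = 0.0                            # ReLU floor: negative activations become 0
        for start in range(L - width + 1):
            window = matrix[start:start + width]
            activation = sum(
                w * x
                for filter_row, matrix_row in zip(filt, window)
                for w, x in zip(filter_row, matrix_row)
            )
            best = max(best, activation)      # max pooling over positions
        features.append(best)
    return features


# Toy 3 x 2 word vector representation and two hand-written 2 x 2 filters.
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
features = conv_features(matrix, filters, width=2)
```

In a trained model the filter weights would be adjusted by minimizing Triplet Loss over many triplets, and the resulting `features` list would serve as the feature vector of the article.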
Based on the above description, fig. 3 is a flowchart of an embodiment of an article author identification method according to the present invention. As shown in fig. 3, the following detailed implementation is included.
In 301, word vector representation of a first article to be recognized is obtained, the word vector representation of the first article is input into an evaluation model obtained through pre-training, and a feature vector of the output first article is obtained, wherein the feature is a feature capable of being used for distinguishing features of different authors.
In 302, for a second article of a known author, word vector representation of the second article is obtained, and the word vector representation of the second article is input into an evaluation model obtained through pre-training, so that an output feature vector of the second article is obtained.
In 303, it is determined whether the first article and the second article belong to the same author by comparing the feature vector of the first article and the feature vector of the second article.
For any article, the word vector representation of the article may be obtained as follows: performing word segmentation on the article; truncating the article by keeping the first L word segmentation results, wherein L is a positive integer greater than one; obtaining an n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one; and using the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, which is taken as the word vector representation of the article.
For a first article to be recognized, after the word vector representation of the first article is obtained, the word vector representation of the first article can be input into the pre-trained evaluation model, so that the feature vector of the output first article is obtained, and similarly, for a second article of a known author, after the word vector representation of the second article is obtained, the word vector representation of the second article can be input into the pre-trained evaluation model, so that the feature vector of the output second article is obtained, wherein the features are features capable of being used for distinguishing different author features.
Preferably, the evaluation model may be a convolution-based deep learning model. The convolution-based deep learning model may use Triplet Loss as a Loss function.
Then, whether the first article and the second article belong to the same author can be determined by comparing the feature vector of the first article and the feature vector of the second article.
Specifically, a difference between the feature vector of the first article and the feature vector of the second article may be determined, and then the difference may be compared with a preset threshold, and if the difference is smaller than the threshold, it may be determined that the first article and the second article belong to the same author, otherwise, it may be determined that the first article and the second article do not belong to the same author. The specific value of the threshold can be determined according to actual needs.
For example, a euclidean distance between the feature vector of the first article and the feature vector of the second article may be calculated, and if the distance is less than a threshold, it may be determined that the first article and the second article belong to the same author, otherwise, it may be determined that the first article and the second article do not belong to the same author. The smaller the distance, the greater the probability that two articles belong to the same author.
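The comparison can be sketched as follows. The function name `same_author`, the example feature vectors and the threshold values are hypothetical; in practice the feature vectors would be the outputs of the evaluation model and the threshold would be chosen according to actual needs.

```python
import math


def same_author(features_first, features_second, threshold):
    """Return (decision, distance) for two feature vectors of equal length."""
    distance = math.dist(features_first, features_second)  # Euclidean distance
    return distance < threshold, distance


# Identical feature vectors: distance 0.0, judged to be the same author.
close_decision, close_distance = same_author([1.0, 2.0], [1.0, 2.0], threshold=0.5)

# Feature vectors 5.0 apart: judged to be different authors for threshold 1.0.
far_decision, far_distance = same_author([0.0, 0.0], [3.0, 4.0], threshold=1.0)
```

Returning the distance alongside the decision keeps the information that a smaller distance indicates a higher probability of common authorship.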
In practical applications, when author identification needs to be performed on the first article, some of the known authors may first be screened out according to information known in advance, in order to reduce the workload of subsequent processing. Then, for each remaining author, any article of that author may be used as the second article and compared with the first article.
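The screening and per-author comparison can be sketched as follows. The function `identify_author`, the `candidates` parameter (the authors remaining after screening) and the sorting of matches by distance are assumptions of this example and are not prescribed by the embodiment.

```python
import math


def identify_author(first_features, known_features, threshold, candidates=None):
    """Return matching authors, closest first.

    known_features maps each known author to the feature vector of one of
    that author's articles; candidates, if given, is the set of authors that
    remain after screening.
    """
    matches = []
    for author, features in known_features.items():
        if candidates is not None and author not in candidates:
            continue                      # this author was screened out beforehand
        distance = math.dist(first_features, features)
        if distance < threshold:
            matches.append((distance, author))
    return [author for _, author in sorted(matches)]


# Hypothetical feature vectors of one article per known author.
known_features = {
    "author_a": [0.0, 0.0],
    "author_b": [0.3, 0.4],
    "author_c": [3.0, 4.0],
}
first_features = [0.0, 0.1]

all_matches = identify_author(first_features, known_features, threshold=1.0)
screened = identify_author(
    first_features, known_features, threshold=1.0,
    candidates={"author_b", "author_c"},
)
```

Because the feature vector of each known article can be computed once and reused, adding a new author only requires adding an entry to `known_features`, without retraining the evaluation model.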
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In short, by adopting the scheme of the embodiment of the method, word vector representations of a first article to be recognized and a second article of a known author can be respectively obtained, and the word vector representations of the two articles can be respectively input into an evaluation model, so that the feature vectors of the two articles can be respectively obtained, and further whether the two articles belong to the same author can be determined by comparing the feature vectors of the two articles.
The above is a description of method embodiments, and the embodiments of the present invention are further described below by way of apparatus embodiments.
Fig. 4 is a schematic structural diagram of an article author identification apparatus according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes: a first obtaining unit 401 and an identity recognition unit 402.
The first obtaining unit 401 is configured to obtain a word vector representation of a first article to be identified and input it into a pre-trained evaluation model to obtain an output feature vector of the first article, where the features are features that can be used to distinguish the characteristics of different authors; and, for a second article of a known author, to obtain a word vector representation of the second article and input it into the pre-trained evaluation model to obtain an output feature vector of the second article.
An identity recognition unit 402, configured to determine whether the first article and the second article belong to the same author by comparing the feature vector of the first article and the feature vector of the second article.
For any article, the first obtaining unit 401 may obtain a word vector representation of the article in the following manner: performing word segmentation on the article; truncating the article by keeping only the first L word segmentation results, where L is a positive integer greater than one; obtaining an n-dimensional word vector representation of each word segmentation result, where n is a positive integer greater than one; and using the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, which is taken as the word vector representation of the article.
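The four steps above can be sketched as follows. This is a minimal illustration rather than the implementation described in the embodiments: the whitespace-based `segment` function, the hash-seeded `toy_embedding` lookup, and the zero-padding of articles shorter than L are all assumptions made for the example. In practice a real word segmenter and a trained word embedding table would take their place.

```python
import hashlib
import random


def segment(text):
    """Stand-in word segmenter (assumption): split on whitespace."""
    return text.split()


def toy_embedding(word, n):
    """Hypothetical n-dimensional word vector, derived deterministically from the word."""
    seed = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def article_to_matrix(text, L, n):
    """Return an L-row, n-column word vector representation of the article."""
    words = segment(text)[:L]                  # keep the first L segmentation results
    rows = [toy_embedding(w, n) for w in words]
    while len(rows) < L:                       # assumption: zero-pad short articles
        rows.append([0.0] * n)
    return rows


matrix = article_to_matrix("the quick brown fox jumps over the lazy dog", L=5, n=4)
print(len(matrix), len(matrix[0]))  # 5 4
```

Each row of `matrix` corresponds to one word segmentation result and each column to one dimension of the word vector, so every article yields an input of the same shape regardless of its original length.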
For the first article, after obtaining its word vector representation, the first obtaining unit 401 may input that representation into the pre-trained evaluation model to obtain the feature vector of the first article output by the model. Similarly, for the second article by a known author, after obtaining its word vector representation, the first obtaining unit 401 may input that representation into the pre-trained evaluation model to obtain the feature vector of the second article output by the model. The features are those that can be used to distinguish between different authors.
Preferably, the evaluation model may be a convolution-based deep learning model that uses Triplet Loss as its loss function.
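One way to picture a convolution-based model over an input of L rows and n columns is a single one-dimensional convolution across word positions followed by global max pooling, which yields one feature per filter. The sketch below is only illustrative: the filter width, the number of filters, the ReLU activation, and the random weights are assumptions, since the network architecture is not specified here.

```python
import random


def conv_features(matrix, filters):
    """Slide each filter over consecutive rows (words) and max-pool over positions.

    matrix:  L rows of n floats (the article's word vector representation)
    filters: list of filters, each a list of k rows of n weights
    """
    features = []
    for filt in filters:
        k = len(filt)
        best = 0.0  # ReLU followed by max pooling never goes below zero
        for start in range(len(matrix) - k + 1):
            window = matrix[start:start + k]
            activation = sum(
                w * x
                for w_row, x_row in zip(filt, window)
                for w, x in zip(w_row, x_row)
            )
            best = max(best, activation)
        features.append(best)
    return features


rng = random.Random(0)
L, n, k, num_filters = 6, 4, 3, 8  # hypothetical sizes
article = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(L)]
filters = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
           for _ in range(num_filters)]

feature_vector = conv_features(article, filters)
print(len(feature_vector))  # 8
```

In a trained model the filter weights would be learned so that the resulting feature vector separates different authors. Here they are random and serve only to show the shapes involved.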
The identity recognition unit 402 may determine whether the first article and the second article belong to the same author by comparing the feature vector of the first article and the feature vector of the second article.
Specifically, the identity recognition unit 402 may determine a difference between the feature vector of the first article and the feature vector of the second article, and if the difference is smaller than a predetermined threshold, may determine that the first article and the second article belong to the same author, otherwise, may determine that the first article and the second article do not belong to the same author.
For example, the Euclidean distance between the feature vector of the first article and the feature vector of the second article may be calculated. If the distance is less than a threshold, it may be determined that the first article and the second article belong to the same author; otherwise, it may be determined that they do not. The smaller the distance, the greater the probability that the two articles belong to the same author.
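The comparison can be sketched in a few lines. The feature values and the threshold of 0.5 below are hypothetical; a usable threshold would have to be chosen for the trained model in question.

```python
import math


def same_author(feat_a, feat_b, threshold):
    """Return (decision, distance) for two feature vectors of equal length."""
    distance = math.dist(feat_a, feat_b)  # Euclidean distance (Python 3.8+)
    return distance < threshold, distance


first = [0.10, 0.80, 0.30]    # hypothetical feature vector of the first article
second = [0.12, 0.79, 0.33]   # hypothetical feature vector of the second article
decision, distance = same_author(first, second, threshold=0.5)
print(decision)  # True
```

Returning the distance alongside the decision lets a caller rank several candidate authors by closeness instead of relying on the threshold alone.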
Fig. 5 is a schematic structural diagram of an evaluation model training apparatus according to an embodiment of the present invention. As shown in Fig. 5, the apparatus includes a second obtaining unit 501 and a model training unit 502.
The second obtaining unit 501 is configured to obtain articles of known authors as training samples, classify the articles belonging to the same author into a class, and obtain word vector representations of the articles respectively.
The model training unit 502 is configured to train an evaluation model according to the word vector representation and the class of each article, so that, when article author identity recognition is performed, the evaluation model is used to evaluate the feature vector corresponding to the word vector representation of the input first article and the feature vector corresponding to the word vector representation of the input second article, and whether the first article and the second article belong to the same author is determined by comparing the two feature vectors, where the features are those that can be used to distinguish between different authors.
How the second obtaining unit 501 obtains the articles of known authors used as training samples is not limited. For example, articles of known authors may be crawled from mainstream media such as People's Daily, People's Daily Online, and The Beijing News, and the articles belonging to the same author may be grouped into one class.
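Grouping the collected articles so that each author forms one class can be sketched as below. The `(author, text)` pair format and the sample data are assumptions for illustration; how the articles are actually collected and stored is left open.

```python
from collections import defaultdict


def group_by_author(samples):
    """Map each author to the list of that author's articles (one class per author)."""
    classes = defaultdict(list)
    for author, text in samples:
        classes[author].append(text)
    return dict(classes)


samples = [  # hypothetical (author, article text) pairs
    ("author_a", "first article by a"),
    ("author_b", "first article by b"),
    ("author_a", "second article by a"),
]
classes = group_by_author(samples)
print(sorted(classes))            # ['author_a', 'author_b']
print(len(classes["author_a"]))   # 2
```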
For each article used as a training sample, the second obtaining unit 501 may obtain its word vector representation. For example, the word vector representation of each article may be obtained as follows: performing word segmentation on the article; truncating the article by keeping only the first L word segmentation results, where L is a positive integer greater than one; obtaining an n-dimensional word vector representation of each word segmentation result, where n is a positive integer greater than one; and using the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, which is taken as the word vector representation of the article.
After the word vector representation of each article is obtained, the model training unit 502 may train an evaluation model according to the word vector representation and the class of each article.
Preferably, the evaluation model may be a convolution-based deep learning model that uses Triplet Loss as its loss function.
Accordingly, the model training unit 502 may input the word vector representations of three articles in each training step: two articles that belong to the same class, corresponding to the anchor instance and the positive instance of the Triplet Loss, respectively, and a third article that belongs to a different class, corresponding to the negative instance of the Triplet Loss. Through this training process, the evaluation model learns which features can be used to distinguish between different authors. Therefore, when the evaluation model is subsequently used for article author identity recognition, it can evaluate the feature vector of the article corresponding to the input word vector representation. How to train the evaluation model is prior art.
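The triplet construction and the loss can be sketched as follows. The sketch uses the commonly cited form of the loss, max(0, d(anchor, positive) - d(anchor, negative) + margin), applied to the feature vectors that the model outputs for the three articles. The margin of 1.0, the use of plain (not squared) Euclidean distance, the uniform random sampling, and the sample vectors are assumptions, not details taken from the embodiments.

```python
import math
import random


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def sample_triplet(classes, rng):
    """Pick two articles from one class and one article from a different class."""
    eligible = [c for c, items in classes.items() if len(items) >= 2]
    pos_class = rng.choice(eligible)
    neg_class = rng.choice([c for c in classes if c != pos_class])
    anchor, positive = rng.sample(classes[pos_class], 2)
    negative = rng.choice(classes[neg_class])
    return anchor, positive, negative


# Hypothetical feature vectors grouped by author class.
classes = {
    "author_a": [[0.0, 0.0], [0.1, 0.0]],
    "author_b": [[3.0, 4.0], [3.1, 4.0]],
}
a, p, n = sample_triplet(classes, random.Random(0))
print(triplet_loss(a, p, n))
```

The loss is zero once the negative article is farther from the anchor than the positive article by at least the margin, which is the situation in the sample data. During training, a nonzero loss would push same-author articles closer together and different-author articles farther apart in feature space.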
For a specific work flow of the device embodiments shown in fig. 4 and fig. 5, reference is made to the related description in the foregoing method embodiments, and details are not repeated. In practical applications, the devices shown in fig. 4 and 5 can be independent devices respectively, or can be combined into one device.
FIG. 6 illustrates a block diagram of an exemplary computer system/server 12 suitable for use in implementing embodiments of the present invention. The computer system/server 12 shown in FIG. 6 is only one example and should not be taken to limit the scope of use or the functionality of embodiments of the present invention.
As shown in FIG. 6, computer system/server 12 is in the form of a general purpose computing device. The components of computer system/server 12 may include, but are not limited to: one or more processors (processing units) 16, a memory 28, and a bus 18 that connects the various system components, including the memory 28 and the processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer system/server 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer system/server 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, and commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The computer system/server 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer system/server 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the computer system/server 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown in FIG. 6, network adapter 20 communicates with the other modules of computer system/server 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer system/server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 16 executes various functional applications and data processing, such as implementing the methods of the embodiments shown in fig. 1 or fig. 3, by executing programs stored in the memory 28.
The invention also discloses a computer-readable storage medium on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of the embodiments shown in fig. 1 or fig. 3.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method, etc., can be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may also be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (18)

1. An article author identification method is characterized by comprising the following steps:
acquiring word vector representation of a first article to be identified, and inputting an evaluation model obtained through pre-training to obtain a feature vector of the output first article, wherein the feature is a feature capable of being used for distinguishing features of different authors, and the first article is an article of an unknown author;
aiming at a second article of a known author, acquiring word vector representation of the second article, and inputting an evaluation model obtained by pre-training to obtain an output feature vector of the second article;
and determining whether the first article and the second article belong to the same author or not by comparing the feature vector of the first article with the feature vector of the second article.
2. The method of claim 1,
for any article, the way of obtaining word vector representation of the article includes:
performing word segmentation processing on the article;
truncating the article in a mode of keeping the first L word segmentation results, wherein L is a positive integer larger than one;
respectively obtaining n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one;
and utilizing the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of the L rows and n columns as the word vector representation of the article.
3. The method of claim 1,
the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
4. The method of claim 1,
the determining whether the first article and the second article belong to the same author by comparing the feature vector of the first article with the feature vector of the second article comprises:
determining a difference between the feature vector of the first article and the feature vector of the second article;
if the difference is smaller than a preset threshold value, determining that the first article and the second article belong to the same author, otherwise, determining that the first article and the second article do not belong to the same author.
5. An assessment model training method, comprising:
acquiring articles of known authors serving as training samples, and classifying the articles belonging to the same author into a class;
respectively obtaining word vector representations of the articles;
training an evaluation model according to the word vector representation and the belonged classification of each article, so that when article author identity recognition is carried out, respectively evaluating a feature vector corresponding to the word vector representation of an input first article and a feature vector corresponding to the word vector representation of a second article by using the evaluation model, and determining whether the first article and the second article belong to the same author or not by comparing the two feature vectors, wherein the features are features capable of being used for distinguishing features of different authors, the first article is an article of an unknown author, and the second article is an article of a known author.
6. The method of claim 5,
for any article, the way of obtaining word vector representation of the article includes:
performing word segmentation processing on the article;
truncating the article in a mode of keeping the first L word segmentation results, wherein L is a positive integer larger than one;
respectively obtaining n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one;
and utilizing the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of the L rows and n columns as the word vector representation of the article.
7. The method of claim 5,
the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
8. The method of claim 7,
the training of the evaluation model according to the word vector representation and the belonged classification of each article comprises:
at each training, word vector representations of three articles are input, wherein two articles belong to the same category and respectively correspond to the anchor example and the positive example of the Triplet Loss, and the other article belongs to the other category and corresponds to the negative example of the Triplet Loss.
9. An article author identification device, comprising: the system comprises a first acquisition unit and an identity recognition unit;
the first obtaining unit is used for obtaining word vector representation of a first article to be identified, inputting an evaluation model obtained through pre-training to obtain a feature vector of the output first article, wherein the feature is a feature capable of being used for distinguishing features of different authors, and the first article is an article of an unknown author; aiming at a second article of a known author, acquiring word vector representation of the second article, and inputting an evaluation model obtained by pre-training to obtain an output feature vector of the second article;
the identity recognition unit is used for determining whether the first article and the second article belong to the same author by comparing the feature vector of the first article with the feature vector of the second article.
10. The apparatus of claim 9,
for any article, the first obtaining unit obtains a word vector representation of the article in the following manner: performing word segmentation processing on the article; truncating the article in a mode of keeping the first L word segmentation results, wherein L is a positive integer greater than one; respectively obtaining n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one; and utilizing the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of the L rows and n columns as the word vector representation of the article.
11. The apparatus of claim 9,
the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
12. The apparatus of claim 9,
the identity recognition unit determines a difference between the feature vector of the first article and the feature vector of the second article, and determines that the first article and the second article belong to the same author if the difference is smaller than a predetermined threshold, otherwise, determines that the first article and the second article do not belong to the same author.
13. An evaluation model training apparatus, comprising: a second obtaining unit and a model training unit;
the second acquisition unit is used for acquiring articles of known authors serving as training samples, classifying the articles belonging to the same author into one class, and respectively acquiring word vector representations of the articles;
the model training unit is used for training an evaluation model according to word vector representations and belonged classifications of articles, so that when article author identity recognition is carried out, feature vectors corresponding to word vector representations of input first articles and feature vectors corresponding to word vector representations of second articles are respectively evaluated by using the evaluation model, and whether the first articles and the second articles belong to the same author is determined by comparing the two feature vectors, wherein the features are features capable of being used for distinguishing features of different authors, the first articles are articles of unknown authors, and the second articles are articles of known authors.
14. The apparatus of claim 13,
for any article, the second obtaining unit obtains a word vector representation of the article in the following manner: performing word segmentation processing on the article; truncating the article in a mode of keeping the first L word segmentation results, wherein L is a positive integer greater than one; respectively obtaining n-dimensional word vector representation of each word segmentation result, wherein n is a positive integer greater than one; and utilizing the word vector representations of the L word segmentation results to form a word vector representation of L rows and n columns, and taking the word vector representation of the L rows and n columns as the word vector representation of the article.
15. The apparatus of claim 13,
the evaluation model includes: a convolution-based deep learning model;
the convolution-based deep learning model uses Triplet Loss as its loss function.
16. The apparatus of claim 15,
the model training unit inputs word vector representations of three articles during each training, wherein two articles belong to the same class and respectively correspond to the anchor example and the positive example of the Triplet Loss, and the other article belongs to the other class and corresponds to the negative example of the Triplet Loss.
17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any one of claims 1 to 8.
18. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN201910187998.3A 2019-03-13 2019-03-13 Article author identity recognition and evaluation model training method and device and storage medium Active CN110059180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910187998.3A CN110059180B (en) 2019-03-13 2019-03-13 Article author identity recognition and evaluation model training method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910187998.3A CN110059180B (en) 2019-03-13 2019-03-13 Article author identity recognition and evaluation model training method and device and storage medium

Publications (2)

Publication Number Publication Date
CN110059180A CN110059180A (en) 2019-07-26
CN110059180B true CN110059180B (en) 2022-09-23

Family

ID=67316719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910187998.3A Active CN110059180B (en) 2019-03-13 2019-03-13 Article author identity recognition and evaluation model training method and device and storage medium

Country Status (1)

Country Link
CN (1) CN110059180B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326347B (en) * 2021-05-21 2021-10-08 四川省人工智能研究院(宜宾) Syntactic information perception author attribution method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778186A (en) * 2013-12-31 2014-05-07 南京财经大学 Method for detecting sockpuppet
CN104598599A (en) * 2015-01-23 2015-05-06 清华大学 Method and system for removing name ambiguity
CN108763354A (en) * 2018-05-16 2018-11-06 浙江工业大学 A kind of academic documents recommendation method of personalization

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3123285A1 (en) * 2011-05-06 2012-11-15 Duquesne University Of The Holy Spirit Authorship technologies
CN102880631A (en) * 2012-07-05 2013-01-16 湖南大学 Chinese author identification method based on double-layer classification model, and device for realizing Chinese author identification method
US9264387B2 (en) * 2013-02-06 2016-02-16 Msc Intellectual Properties B.V. System and method for authorship disambiguation and alias resolution in electronic data
CN105653590B (en) * 2015-12-21 2019-03-26 青岛智能产业技术研究院 A kind of method that Chinese literature author duplication of name disambiguates
CN106055539B (en) * 2016-05-27 2018-12-28 中国科学技术信息研究所 The method and apparatus that name disambiguates
CN108255846A (en) * 2016-12-29 2018-07-06 北京赛时科技有限公司 A kind of method and apparatus for distinguishing author of the same name
CN106777339A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of method that author is recognized based on heterogeneous network incorporation model
US10049103B2 (en) * 2017-01-17 2018-08-14 Xerox Corporation Author personality trait recognition from short texts with a deep compositional learning approach
CN108108184B (en) * 2017-03-07 2020-12-04 北京理工大学 Source code author identification method based on deep belief network
CN107729300B (en) * 2017-09-18 2021-12-24 百度在线网络技术(北京)有限公司 Text similarity processing method, device and equipment and computer storage medium
CN107870991A (en) * 2017-10-27 2018-04-03 湖南纬度信息科技有限公司 A kind of similarity calculating method and computer-readable recording medium of paper metadata

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"An improved method for facial expression recognition using hybrid approach of CLBP and Gabor filter";Sakshi Sharma 等;《International Conference on Computing, Communication and Automation (ICCCA2017)》;20171221;第1019-1024页 *
"文字出版物的智能审读方法研究";单妍 等;《福建电脑》;20180531;第1-3页 *

Also Published As

Publication number Publication date
CN110059180A (en) 2019-07-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant