CN105095185A - Author analysis method and author analysis system - Google Patents

Publication number
CN105095185A
Authority
CN
China
Prior art keywords
language
author
authors
language material
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510431523.6A
Other languages
Chinese (zh)
Inventor
朱睿
张弛
吴家楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Beijing Aperture Science and Technology Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Beijing Aperture Science and Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd, Beijing Aperture Science and Technology Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN201510431523.6A priority Critical patent/CN105095185A/en
Publication of CN105095185A publication Critical patent/CN105095185A/en

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an author analysis method and an author analysis system. The author analysis method comprises the following steps: step S101, loading a language model of a specified author, wherein the language model is obtained by neural network training on a corpus of the specified author; and step S102, calculating, by means of the language model, the probability that the author of a newly input corpus is the specified author. The method allows more precise author analysis and optimal author analysis performance. The author analysis system has the same advantages as the author analysis method.

Description

Author analysis method and author analysis system
Technical field
The present invention relates to the field of literary work analysis, and in particular to an author analysis method and an author analysis system.
Background
Human history has produced many classic written works, and these works greatly help people today understand ancient societies. Identifying the authors of these works has therefore become one of the important problems facing archaeologists and historians. However, because printing technology in ancient times was underdeveloped, in many cases few of these manuscripts survive. Moreover, most ancient writers paid little attention to intellectual property, so the surviving manuscripts do not necessarily carry the real author's name, or carry only a pseudonym. Examples include Zhiyanzhai of the "Zhiyanzhai Re-commentary on the Story of the Stone", Lanling Xiaoxiao Sheng of "The Golden Lotus", the Spanish "Song of the Cid" and the Arabic "Harem". For various reasons, the authorship of these works remains a matter of conjecture, with no strong evidence to confirm it. Traditional scholars have developed some concrete methods to address this problem.
Among these methods, one is generally regarded as fairly scientific: screening the contemporaries who left manuscripts behind and finding the person who best matches the work. The screening mainly considers the author's life history and the characteristic style and ideas of the works. At present, however, this process relies mainly on manual appraisal, which is rather subjective. Even where methods and computer procedures have been introduced, they are mostly fairly simple statistics-based systems with poor results and weak rules, or feature classifiers based on a neural network, neither of which captures the characteristics of the written language at a fundamental level. It therefore remains difficult to reach a final conclusion about who wrote these anonymous ancient manuscripts.
Summary of the invention
To address the deficiencies of the prior art, the present invention proposes an author analysis method and an author analysis system that can significantly improve the accuracy and performance of author analysis while offering good portability and room for improvement.
One embodiment of the present invention provides an author analysis method, characterized in that the method comprises: step S101: loading a language model of a specified author, wherein the language model is obtained by neural network training on the corpus of the specified author; and step S102: calculating, by means of the language model, the probability that the author of a newly input corpus is the specified author.
By way of example, the neural network is a long short-term memory (LSTM) artificial neural network.
By way of example, step S102 uses a Viterbi algorithm for the calculation, and the Viterbi algorithm records only those states whose score under the language model is higher than a threshold.
By way of example, step S102 comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the entire newly input corpus; and normalizing the confidence of the entire newly input corpus into the probability that its author is the specified author.
By way of example, the method further comprises step S103 after step S102: reading in the newly input corpus, encoding it as data to be judged, and inputting each short word or character of the newly input corpus into the language model to generate new short words or characters.
Another embodiment of the present invention provides an author analysis system, characterized in that the system comprises: a language determination module for loading a language model of a specified author to calculate the probability that the author of a newly input corpus is the specified author, wherein the language model is obtained by neural network training on the corpus of the specified author.
By way of example, the system further comprises a language model generation module for generating the language model of the specified author; and/or the neural network is a long short-term memory (LSTM) artificial neural network.
By way of example, the language determination module is further configured to use a Viterbi algorithm for the judgment, and the Viterbi algorithm records only those states whose score under the language model is higher than a threshold.
By way of example, calculating the probability that the author of the newly input corpus is the specified author comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the entire newly input corpus; and normalizing the confidence of the entire newly input corpus into the probability that its author is the specified author.
By way of example, the author analysis system further comprises a language generation module for reading in the newly input corpus, encoding it as data to be judged, and inputting each short word or character of the newly input corpus into the language model to output new short words or characters.
In the author analysis method of the present invention, the probability that the author of a newly input corpus is the specified author is calculated with a language model of the specified author obtained by neural network training, which can ensure higher author analysis accuracy and optimal author analysis performance. The author analysis system of the present invention has the same advantages.
Brief description of the drawings
The following drawings form part of the present invention and are provided to aid understanding of it. The drawings show embodiments of the invention and their description, which serve to explain the principles of the invention.
In the drawings:
Fig. 1 is a flowchart of the author analysis method of Embodiment one of the present invention; and
Fig. 2 is a flowchart of the author analysis method of Embodiment two of the present invention.
Detailed description
In the following description, numerous specific details are given to provide a more thorough understanding of the invention. However, it will be obvious to those skilled in the art that the invention can be practiced without one or more of these details. In other instances, some technical features well known in the art are not described, in order to avoid obscuring the invention.
It should be understood that the invention can be implemented in different forms and should not be construed as limited to the embodiments set forth here. Rather, these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the invention to those skilled in the art. In the drawings, the sizes and relative sizes of layers and regions may be exaggerated for clarity. The same reference numerals denote the same elements throughout.
The terminology used herein is for the purpose of describing specific embodiments only and is not intended to limit the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "consists of" and/or "comprises", when used in this specification, specify the presence of the stated features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of the associated listed items.
For a thorough understanding of the invention, detailed steps and detailed structures are set out in the following description to explain the technical solution of the invention. Preferred embodiments of the invention are described in detail below, but the invention may also have other embodiments beyond those described.
One embodiment of the present invention provides an author analysis method that can significantly improve the accuracy and performance of author analysis.
Embodiment one
An author analysis method of one embodiment of the present invention is described below with reference to Fig. 1, which is a flowchart of the author analysis method of Embodiment one.
The author analysis method of this embodiment comprises the following steps:
Step S101: reading in the corpus of a specified author, extracting the features of the corpus with a neural network, and generating a language model of the specified author. By way of example, this step comprises: the language model generation module reads in a large number of corpus documents of the specified author, uses repeated neural network training inside the module to summarize the features of the corpus documents under that author's name, and saves the result as a language model in a specific format so that it can be loaded later.
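The train-then-save flow of step S101 can be sketched as follows. This is only an illustrative sketch: the patent trains an LSTM, whereas here a character bigram count table stands in for the neural network so that the example stays self-contained. The names `train_bigram_model`, `save_model`, `load_model`, the `<s>` start marker and the JSON file layout are all assumptions, not part of the patent.

```python
import json
from collections import Counter, defaultdict

BOS = "<s>"  # hypothetical start-of-text marker (assumption)


def train_bigram_model(texts):
    """Summarize an author's corpus as character bigram counts.

    Stand-in for the LSTM training of step S101: it only illustrates
    "read the corpus, extract features, produce a reusable model".
    """
    counts = defaultdict(Counter)
    vocab = set()
    for text in texts:
        prev = BOS
        for ch in text:
            counts[prev][ch] += 1
            vocab.add(ch)
            prev = ch
    return {
        "vocab": sorted(vocab),
        "counts": {prev: dict(row) for prev, row in counts.items()},
    }


def save_model(model, path):
    """Persist the model in a fixed format (JSON here, by assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f, ensure_ascii=False)


def load_model(path):
    """Reload a previously saved model for the later judgment step."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

One model would be trained and saved per candidate author, so that the judgment step can load each of them in turn.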
Step S102: loading the language model to judge the probability that the author of a newly input corpus is the specified author. By way of example, this step comprises: the language determination module reads in an anonymous ancient manuscript and encodes it as data to be judged, ready to be supplied to the language model in the next step. By way of example, this step further comprises: the language determination module loads the previously trained language model as the basis for judging the current data. By way of example, this step further comprises: according to the parameters of the language model and with the aid of an improved Viterbi algorithm, the language determination module calculates the confidence of each short word or character of the data under this language model, from which it derives the confidence of the whole data and normalizes it into a probability-related score. By way of example, this step further comprises: the language determination module outputs the probability-related score to the user. Based on the probability scores obtained under different language models, the user can conclude that the specified author corresponding to the language model with the highest probability score is the most likely author of the newly input corpus.
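The scoring part of step S102 (per-token confidence, then a confidence for the whole text, then normalization) might look like the sketch below. The patent does not say how the normalization is done; the geometric mean of the per-token probabilities used here is an assumption, as are the add-one smoothing, the bigram count table standing in for the LSTM output, and the names `token_probability`, `text_score` and `toy_model`.

```python
import math


def token_probability(model, prev, token):
    """Add-one smoothed P(token | prev) from bigram counts.

    Stand-in for the probability the trained language model would
    assign to each short word or character.
    """
    row = model["counts"].get(prev, {})
    vocab_size = len(model["vocab"]) + 1  # +1 reserves mass for unseen tokens
    return (row.get(token, 0) + 1) / (sum(row.values()) + vocab_size)


def text_score(model, tokens, bos="<s>"):
    """Length-normalized score in (0, 1] for a whole text.

    Sums per-token log-probabilities and takes the geometric mean, so
    texts of different lengths are comparable (assumed normalization).
    """
    if not tokens:
        raise ValueError("cannot score an empty text")
    log_sum = 0.0
    prev = bos
    for tok in tokens:
        log_sum += math.log(token_probability(model, prev, tok))
        prev = tok
    return math.exp(log_sum / len(tokens))


# Toy model of a hypothetical author whose corpus was the string "abab".
toy_model = {
    "vocab": ["a", "b"],
    "counts": {"<s>": {"a": 1}, "a": {"b": 2}, "b": {"a": 1}},
}
```

With this toy model, `text_score(toy_model, list("ab"))` is higher than `text_score(toy_model, list("bb"))`, because the sequence "ab" occurs in the training text and "bb" does not.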
By way of example, the neural network is a long short-term memory (LSTM) artificial neural network.
By way of example, step S102 uses a Viterbi algorithm for the judgment, and the Viterbi algorithm records only those states whose score under the language model is higher than a threshold. The threshold can be set according to actual needs, for example 60%, and is not limited here. Because this improved Viterbi algorithm does not record all states, it saves a large amount of time and space compared with the conventional Viterbi algorithm, which records all states.
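One possible reading of the thresholded Viterbi step is sketched below. The patent does not define what the Viterbi states are; this sketch assumes they are candidate tokens (single characters or short words) in a segmentation of the text, and that a candidate is simply not recorded when its language-model probability is not above the threshold. The function name `pruned_viterbi`, the `token_prob` lookup table standing in for the language model, and the `max_len` limit are all hypothetical.

```python
import math


def pruned_viterbi(text, token_prob, threshold=0.05, max_len=3):
    """Best segmentation of `text` into tokens, skipping weak candidates.

    token_prob maps a candidate token to its probability under the
    language model (a plain dict here, by assumption). Candidates whose
    probability is not above `threshold` are never recorded, which is
    the pruning idea described in the text.
    Returns (tokens, log_score), or None if every path was pruned.
    """
    n = len(text)
    # best[i] holds (log-score, start of last token) for text[:i], or None.
    best = [None] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue  # no surviving path reaches this position
            p = token_prob.get(text[start:end], 0.0)
            if p <= threshold:
                continue  # state not recorded: score not above threshold
            cand = best[start][0] + math.log(p)
            if best[end] is None or cand > best[end][0]:
                best[end] = (cand, start)
    if best[n] is None:
        return None
    # Walk the back-pointers to recover the token sequence.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best[n][0]
```

Raising the threshold discards more candidates and so saves time and memory, at the cost that some texts may have no surviving path at all; a caller would need to decide how to score that case.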
In one example, the author analysis method comprises only step S102. The language model of the specified author loaded in this step is obtained by neural network training on the corpus of the specified author; the specific training method may follow step S101 above and is not limited here.
In the method of this embodiment, the probability that the author of a newly input corpus is the specified author is calculated with a language model of the specified author obtained by neural network training, which can ensure higher author analysis accuracy and optimal author analysis performance. The introduction of the LSTM neural network and the improved Viterbi algorithm further supports this accuracy and performance.
Embodiment two
An author analysis method of another embodiment of the present invention is described below with reference to Fig. 2, which is a flowchart of the author analysis method of Embodiment two.
The author analysis method of this embodiment comprises steps S101 and S102, which are the same as in Embodiment one, and further comprises step S103 after step S102: reading in the newly input corpus, encoding it as data to be judged, and inputting each short word or character of the newly input corpus into the language model to generate new short words or characters. By way of example, the language generation module reads in the anonymous ancient manuscript to be judged and encodes it as data to be judged, ready to be supplied to the language model in the next step. By way of example, the language generation module loads the previously trained language model as the basis for generating language in the same style. By way of example, the language generation module feeds each short word or character of the anonymous manuscript into the language model in turn, and at each output of the language model obtains the new short words or characters most likely to appear in the current context, together with their probability distribution, which serves as another useful reference for analyzing the authorship of the anonymous manuscript. In other words, this embodiment can not only judge the author of an ancient manuscript but also generate new language: it can predict which words are most likely to appear next, and thereby help judge whether the manuscript is close in style to known writings.
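The generation step S103 can be illustrated with a small sketch that exposes the next-token probability distribution and then extends a seed greedily. A bigram count table again stands in for the trained LSTM, and the names `next_token_distribution` and `generate`, as well as the greedy choice of the most probable token, are assumptions made for illustration.

```python
def next_token_distribution(counts, prev):
    """Probability distribution over the token following `prev`.

    `counts` maps a token to a dict of follower counts (stand-in for
    the language model output). Entries are ordered most probable first.
    """
    row = counts.get(prev, {})
    total = sum(row.values())
    if total == 0:
        return {}
    ordered = sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: c / total for tok, c in ordered}


def generate(counts, seed, max_new=5):
    """Extend a non-empty seed by repeatedly feeding the last token back in."""
    out = list(seed)
    for _ in range(max_new):
        dist = next_token_distribution(counts, out[-1])
        if not dist:
            break  # the model has never seen a continuation of this token
        out.append(next(iter(dist)))  # greedy: take the most probable token
    return out
```

Comparing the distribution returned by `next_token_distribution` with the token that actually follows in the manuscript gives the kind of secondary style evidence the paragraph above describes.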
Another embodiment of the present invention provides an author analysis system, characterized in that the system comprises: a language model generation module for reading in the corpus of a specified author, extracting the features of the corpus with a neural network, and generating a language model of the specified author; and a language determination module for loading the language model to judge the probability that the author of a newly input corpus is the specified author.
By way of example, the language model generation module is mainly responsible for reading in externally supplied reference works of a specified author, encoding them, passing them through neural network layers that are trained repeatedly, and organizing the features of these documents into a language model.
By way of example, the language determination module takes a series of words and sentences as input and runs an improved Viterbi algorithm to compute the probability-related score of these words and sentences under a given language model.
By way of example, the neural network is a long short-term memory (LSTM) artificial neural network.
By way of example, the language determination module is further configured to use a Viterbi algorithm for the judgment, and the Viterbi algorithm records only those states whose score under the language model is higher than a threshold.
By way of example, the author analysis system further comprises a language generation module for reading in the newly input corpus, encoding it as data to be judged, and inputting each short word or character of the newly input corpus into the language model to output new short words or characters. By way of example, the language generation module takes some words and sentences as input and calls the trained neural network repeatedly, so that it keeps generating the new words and sentences that best fit the given language model in the current context; these new words and sentences are then used as the next input, and a long sentence is finally obtained.
In summary, to better assist in analyzing the authorship of anonymous ancient manuscripts, the present invention provides a language analysis method and system based on a neural network language model and an improved Viterbi algorithm. The system has the following modules. The language model generation module organizes a set of documents and corpora, such as the complete works of a writer, into a language model using the LSTM method; the model records most of the linguistic features and style of these documents and corpora, and can even extrapolate some closely related features. The language generation module can, for a previously generated language model, generate words and short sentences in a similar language style, and even whole articles if needed. The language determination module can, for a previously generated language model, use the per-character probabilities given by the language model and one pass of the improved Viterbi algorithm to calculate the probability that a document of unknown authorship belongs to this language model, and give a score. With these three modules, whenever an unknown ancient manuscript needs to be identified, a separate language model is first built for each contemporary candidate from that candidate's works; after the language models have captured the features and style of each candidate's language, the determination module calculates the total probability score of the manuscript under each language model, and these scores can serve as a reference for attributing authorship.
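The final comparison across candidate authors can be sketched as a ranking over per-model scores. The patent only says the scores under the different language models are compared; turning them into relative probabilities with a softmax, as done here, is an assumption, and the function name `rank_candidates` and the author labels are hypothetical.

```python
import math


def rank_candidates(avg_log_scores):
    """Rank candidate authors by the manuscript's score under each model.

    avg_log_scores maps an author name to the average per-token
    log-probability of the manuscript under that author's language model.
    Returns (author, relative probability) pairs, best first; the
    relative probabilities sum to 1 (softmax, an assumed normalization).
    """
    peak = max(avg_log_scores.values())
    # Subtract the peak before exponentiating for numerical stability.
    weights = {a: math.exp(s - peak) for a, s in avg_log_scores.items()}
    total = sum(weights.values())
    ranked = [(a, w / total) for a, w in weights.items()]
    return sorted(ranked, key=lambda pair: -pair[1])


# Hypothetical scores for three candidate authors.
example_scores = {
    "author_A": math.log(0.5),
    "author_B": math.log(0.25),
    "author_C": math.log(0.25),
}
```

The top-ranked entry is only a reference for attribution, as the text notes: a manuscript by someone outside the candidate set would still produce a ranking.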
In a specific example, the author analysis system may omit the language model generation module, and the language determination module may instead use a language model trained in advance by a neural network on the corpus of the specified author. In some examples, the language generation module may also be omitted.
The modules of the embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the author analysis system of the embodiments. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier, or provided in any other form.
Throughout the above embodiments, the language model generation and judgment system of the present invention, built on a large amount of original corpora and documents, provides an efficient and intelligent method of analyzing the authorship of anonymous literature that has great reference value. Thanks to the new techniques and methods introduced, the accuracy and reliability of the determination module are greatly improved, and the style and features of large corpora can be extracted more accurately.
The present invention has been described by way of the above embodiments, but it should be understood that these embodiments are for purposes of example and illustration only and are not intended to limit the invention to the scope of the described embodiments. Those skilled in the art will also appreciate that the invention is not limited to the above embodiments, and that further variations and modifications can be made in accordance with the teachings of the invention, all of which fall within the scope of protection claimed. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. An author analysis method, characterized in that the method comprises:
Step S101: loading a language model of a specified author, wherein the language model is obtained by neural network training on the corpus of the specified author;
Step S102: calculating, by means of the language model, the probability that the author of a newly input corpus is the specified author.
2. The author analysis method of claim 1, characterized in that the neural network is a long short-term memory (LSTM) artificial neural network.
3. The author analysis method of claim 1 or 2, characterized in that step S102 uses a Viterbi algorithm for the calculation, and the Viterbi algorithm records only those states whose score under the language model is higher than a threshold.
4. The author analysis method of claim 3, characterized in that step S102 comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the entire newly input corpus; and normalizing the confidence of the entire newly input corpus into the probability that the author of the newly input corpus is the specified author.
5. The author analysis method of claim 1, characterized in that the method further comprises step S103 after step S102: reading in the newly input corpus, encoding the corpus as data to be judged, and inputting each short word or character of the newly input corpus into the language model to generate new short words or characters.
6. An author analysis system, characterized in that the system comprises:
a language determination module for loading a language model of a specified author to calculate the probability that the author of a newly input corpus is the specified author, wherein the language model is obtained by neural network training on the corpus of the specified author.
7. The author analysis system of claim 6, characterized in that the system further comprises a language model generation module for generating the language model of the specified author;
and/or the neural network is a long short-term memory (LSTM) artificial neural network.
8. The author analysis system of claim 6 or 7, characterized in that the language determination module is further configured to use a Viterbi algorithm for the calculation, and the Viterbi algorithm records only those states whose score under the language model is higher than a threshold.
9. The author analysis system of claim 8, characterized in that calculating the probability that the author of the newly input corpus is the specified author comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the entire newly input corpus; and normalizing the confidence of the entire newly input corpus into the probability that the author of the newly input corpus is the specified author.
10. The author analysis system of claim 6, characterized in that the system further comprises a language generation module for reading in the newly input corpus, encoding the corpus as data to be judged, and inputting each short word or character of the newly input corpus into the language model to output new short words or characters.
CN201510431523.6A 2015-07-21 2015-07-21 Author analysis method and author analysis system Pending CN105095185A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510431523.6A CN105095185A (en) 2015-07-21 2015-07-21 Author analysis method and author analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510431523.6A CN105095185A (en) 2015-07-21 2015-07-21 Author analysis method and author analysis system

Publications (1)

Publication Number Publication Date
CN105095185A true CN105095185A (en) 2015-11-25

Family

ID=54575657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510431523.6A Pending CN105095185A (en) 2015-07-21 2015-07-21 Author analysis method and author analysis system

Country Status (1)

Country Link
CN (1) CN105095185A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105679308A (en) * 2016-03-03 2016-06-15 百度在线网络技术(北京)有限公司 Method and device for generating g2p model based on artificial intelligence and method and device for synthesizing English speech based on artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1293428A (en) * 2000-11-10 2001-05-02 清华大学 Information check method based on speed recognition
CN101751385A (en) * 2008-12-19 2010-06-23 华建机器翻译有限公司 Multilingual information extraction method adopting hierarchical pipeline filter system structure
US20120053935A1 (en) * 2010-08-27 2012-03-01 Cisco Technology, Inc. Speech recognition model
CN102999533A (en) * 2011-09-19 2013-03-27 腾讯科技(深圳)有限公司 Textspeak identification method and system
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
冯冲 等: "基于Multigram语言模型的主动学习中文分词", 《中文信息学报》 *

Similar Documents

Publication Publication Date Title
CN108804495B (en) Automatic text summarization method based on enhanced semantics
CN111144131B (en) Network rumor detection method based on pre-training language model
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN109885824B (en) Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
CN107291836B (en) Chinese text abstract obtaining method based on semantic relevancy model
CN109241330A (en) The method, apparatus, equipment and medium of key phrase in audio for identification
CN109543764B (en) Early warning information validity detection method and detection system based on intelligent semantic perception
CN111597350A (en) Rail transit event knowledge map construction method based on deep learning
CN110162625A (en) Based on word in sentence to the irony detection method of relationship and context user feature
CN101770580A (en) Training method and classification method of cross-field text sentiment classifier
Johannsen et al. More or less supervised supersense tagging of Twitter
CN111061951A (en) Recommendation model based on double-layer self-attention comment modeling
CN108763211A (en) The automaticabstracting and system of knowledge are contained in fusion
CN112749283A (en) Entity relationship joint extraction method for legal field
CN115841119A (en) Emotional cause extraction method based on graph structure
Liu et al. Automatic generation of personalized Chinese handwriting characters
CN117113937A (en) Electric power field reading and understanding method and system based on large-scale language model
CN105095185A (en) Author analysis method and author analysis system
CN110765768A (en) Optimized text abstract generation method
CN103942188B (en) A kind of method and apparatus identifying language material language
CN111785236A (en) Automatic composition method based on motivational extraction model and neural network
CN110059179A (en) A kind of song text name entity recognition method based on deep learning
CN115438645A (en) Text data enhancement method and system for sequence labeling task
TWI724644B (en) Spoken or text documents summarization system and method based on neural network
CN113919351A (en) Network security named entity and relationship joint extraction method and device based on transfer learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 Beijing, Haidian District Academy of Sciences, South Road, No. 2, block A, No. 313

Applicant after: MEGVII INC.

Applicant after: Beijing maigewei Technology Co., Ltd.

Address before: 100080 room 1001-011, building 3, Haidian street, Beijing, Haidian District, 1

Applicant before: MEGVII INC.

Applicant before: Beijing aperture Science and Technology Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20151125