
CN105095185A - Author analysis method and author analysis system

Info

Publication number: CN105095185A
Application number: CN201510431523.6A
Authority: CN
Inventors: 朱睿; 张弛; 吴家楠
Original assignee: Beijing Megvii Technology Co Ltd; Beijing Aperture Science and Technology Ltd
Current assignee: Beijing Megvii Technology Co Ltd; Beijing Aperture Science and Technology Ltd
Priority date: 2015-07-21
Filing date: 2015-07-21
Publication date: 2015-11-25

Abstract

The invention provides an author analysis method and an author analysis system. The author analysis method comprises the following steps: step S101, loading a language model of a specified author, wherein the language model is obtained by neural network training on a corpus of the specified author; and step S102, calculating, by means of the language model, the probability that the author of a newly input corpus is the specified author. The method can analyze authorship more precisely and thereby provide optimal author analysis performance. The author analysis system has the same advantages as the author analysis method.

Description

Author analysis method and author analysis system

Technical field

The present invention relates to the field of literary work analysis, and in particular to an author analysis method and an author analysis system.

Background art

Human history has produced many classic written works, and these works greatly help people today understand ancient societies. Determining who wrote them has therefore become one of the important problems facing archaeologists and historians. However, because printing technology in ancient times was not well developed, in many cases few copies of these works survive. In addition, most ancient writers paid little attention to intellectual property, so the surviving manuscripts do not necessarily bear the true author's name, or bear only a pseudonym. Examples include Zhiyanzhai of "Zhiyanzhai's Re-annotated Story of the Stone", Lanling Xiaoxiao Sheng of "The Golden Lotus" (Jin Ping Mei), the Spanish "Song of the Cid", and the Arabic "Harem": for various reasons, only conjectures about the authors of these works remain, and no strong evidence is available to confirm them. To address this problem, traditional scholarship has developed some concrete methods.

Among these methods, one is generally regarded as fairly scientific: screening the contemporaries of the work who left manuscripts behind, and finding the person who best fits the conditions of the work. The screening mainly covers the author's life history and the characteristic style and ideas of the works. At present, however, this process relies mainly on manual appraisal and is rather subjective. Even where methods and computer programs have been introduced into the appraisal, they are mostly fairly simple statistics-based systems with poor results and relatively weak rules, or feature classifiers based on a neural network, and they cannot fundamentally capture the features of the written language. It therefore remains difficult to reach a final conclusion about who wrote these anonymous ancient manuscripts.

Summary of the invention

In view of the deficiencies of the prior art, the present invention proposes an author analysis method and an author analysis system that can significantly improve the accuracy and performance of author analysis, while also being readily portable and open to further improvement.

One embodiment of the present invention provides an author analysis method, characterized in that the method comprises: step S101: loading a language model of a specific author, wherein the language model is obtained by neural network training on a corpus of the specific author; and step S102: calculating, by means of the language model, the probability that the author of a newly input corpus is the specific author.

By way of example, the neural network is a long short-term memory (LSTM) artificial neural network.

By way of example, in step S102 the calculation uses a Viterbi algorithm that records only the states whose score under the language model is higher than a threshold.

By way of example, step S102 comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the newly input corpus as a whole; and normalizing the confidence of the whole corpus into the probability that its author is the specific author.

By way of example, the method further comprises, after step S102, step S103: reading in the newly input corpus, encoding it as data to be determined, and inputting each short word or character of the newly input corpus into the language model to generate new short words or characters.

Another embodiment of the present invention provides an author analysis system, characterized in that the system comprises: a language determination module for loading a language model of a specific author to calculate the probability that the author of a newly input corpus is the specific author, wherein the language model is obtained by neural network training on a corpus of the specific author.

By way of example, the system further comprises a language model generation module for generating the language model of the specific author; and/or the neural network is a long short-term memory (LSTM) artificial neural network.

By way of example, the language determination module is further configured to make the determination using a Viterbi algorithm that records only the states whose score under the language model is higher than a threshold.

By way of example, calculating the probability that the author of the newly input corpus is the specific author comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the newly input corpus as a whole; and normalizing the confidence of the whole corpus into the probability that its author is the specific author.

By way of example, the author analysis system further comprises a language generation module for reading in the newly input corpus, encoding it as data to be determined, and inputting each short word or character of the newly input corpus into the language model so as to output new short words or characters.

Because the author analysis method of the present invention uses a language model of a specific author, obtained by neural network training, to calculate the probability that the author of a newly input corpus is that specific author, it can ensure higher author analysis accuracy and optimal author analysis performance. The author analysis system of the present invention has the same advantages.

Brief description of the drawings

The following drawings form part of the present invention and are provided to aid understanding of it. The drawings show embodiments of the invention and their description, and serve to explain the principles of the invention.

In the drawings:

Fig. 1 is a flowchart of the author analysis method of embodiment one of the present invention; and

Fig. 2 is a flowchart of the author analysis method of embodiment two of the present invention.

Detailed description

In the following description, numerous specific details are given to provide a more thorough understanding of the invention. However, it will be obvious to those skilled in the art that the invention can be practiced without one or more of these details. In other instances, some technical features well known in the art are not described, in order to avoid obscuring the invention.

It should be understood that the invention can be implemented in different forms and should not be construed as limited to the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, the sizes and relative sizes of layers and regions may be exaggerated for clarity. The same reference numerals denote the same elements throughout.

The terminology used here is for the purpose of describing specific embodiments only and is not intended to limit the invention. As used here, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "consists of" and/or "comprises", when used in this specification, specify the presence of the stated features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups. As used here, the term "and/or" includes any and all combinations of the associated listed items.

For a thorough understanding of the invention, detailed steps and detailed structures are set out in the following description to explain its technical solution. Preferred embodiments of the invention are described in detail below; beyond these detailed descriptions, however, the invention may also have other embodiments.

One embodiment of the present invention provides an author analysis method. The method can significantly improve the accuracy and performance of author analysis.

Embodiment one

An author analysis method according to one embodiment of the present invention is described in detail below with reference to Fig. 1, which is a flowchart of the author analysis method of embodiment one.

The author analysis method of this embodiment comprises the following steps:

Step S101: read in the corpus of a specific author, extract the features of the corpus using a neural network, and generate a language model of the specific author. By way of example, this step comprises: the language model generation module reads in a large number of corpus documents by the specific author, uses repeated neural network training inside the module to summarize the features of the documents under this author's name, and saves the result as a language model in a specific format so that it can be conveniently loaded later.

Step S102: load the language model to determine the probability that the author of a newly input corpus is the specific author. By way of example, this step comprises: the language determination module reads in an anonymous ancient manuscript and encodes it, so that the encoded text is ready as the data to be determined and can be supplied to the language model in the next step. By way of example, this step further comprises: the language determination module loads the previously trained language model as the basis for judging the current data to be determined. By way of example, this step further comprises: the language determination module, according to the parameters of the language model and with the aid of an improved Viterbi algorithm, calculates the confidence under this language model of each short word or character in the data to be determined, from which the confidence of the data as a whole can be derived and normalized into a probability-related score. By way of example, this step further comprises: the language determination module outputs the resulting probability-related score to the user. Based on the probability scores obtained under different language models, the user can conclude that the specific author corresponding to the language model with the highest probability score is the most likely author of the newly input corpus.
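The scoring part of step S102 can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how per-token confidences are combined or normalized, so the geometric mean used here is an assumed choice, the Viterbi step is left out, and `model_prob`, `token_confidences` and `text_score` are hypothetical names. `model_prob(prev, tok)` stands for whatever probability the author's trained language model assigns to a token given the previous one.

```python
import math

def token_confidences(model_prob, tokens):
    """Confidence of each token under one author's model.

    model_prob(prev, tok) is a hypothetical callable returning a probability
    in (0, 1]; prev is None for the first token of the text.
    """
    confs = []
    prev = None
    for tok in tokens:
        confs.append(model_prob(prev, tok))
        prev = tok
    return confs

def text_score(confs):
    """Combine per-token confidences into one score in (0, 1].

    Assumed normalization: the geometric mean, computed in log space so that
    long texts do not underflow. Zero confidences are not supported, so the
    model is expected to be smoothed.
    """
    if not confs:
        raise ValueError("cannot score an empty text")
    return math.exp(sum(math.log(c) for c in confs) / len(confs))
```

Because the geometric mean removes the effect of text length, scores for the same manuscript under different authors' models can be compared directly, with a higher score indicating a closer fit.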

By way of example, the neural network is a long short-term memory (LSTM) artificial neural network.

By way of example, in step S102 the determination uses a Viterbi algorithm that records only the states whose score under the language model is higher than a threshold. The threshold can be set according to actual needs, for example 60%, and is not limited here. Because this improved Viterbi algorithm does not record all states, it saves a large amount of time and space compared with the conventional Viterbi algorithm, which records every state.
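A minimal sketch of the thresholded recording rule is given below. The text does not define the states or transitions of its Viterbi algorithm, so this sketch assumes a segmentation lattice in which each state is a candidate token (a single character or a short word) ending at a given position. `pruned_viterbi`, `token_score` and `max_len` are hypothetical names, and the 0.6 default mirrors the 60% example above.

```python
import math

def pruned_viterbi(text, token_score, threshold=0.6, max_len=2):
    """Best segmentation of `text` into tokens of up to `max_len` characters.

    best[i] holds (log score, start of last token) for the best recorded
    path ending at position i. A candidate token is recorded only if its
    model score is above `threshold`; all other candidates are dropped at
    once. Returns (tokens, log_score), or None if pruning left no path.
    """
    n = len(text)
    best = {0: (0.0, None)}
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if start not in best:
                continue  # every path reaching `start` was pruned
            score = token_score(text[start:end])
            if score <= threshold:
                continue  # below threshold: this state is never stored
            candidate = best[start][0] + math.log(score)
            if end not in best or candidate > best[end][0]:
                best[end] = (candidate, start)
    if n not in best:
        return None
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1], best[n][0]
```

In this toy lattice only one entry is kept per position, so the saving is small. The same rule applied to a richer state space is presumably where the time and space savings described above would come from. One consequence of hard pruning is that a text can end up with no surviving path, which the sketch reports as `None`.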

In one example, the author analysis method comprises only step S102. The language model of the specific author loaded in this step is obtained by neural network training on the corpus of the specific author; the specific training method may follow step S101 above and is not limited here.

Because the method of this embodiment uses a language model of a specific author, obtained by neural network training, to calculate the probability that the author of a newly input corpus is that specific author, it can ensure higher author analysis accuracy and optimal author analysis performance. Moreover, the introduction of the LSTM neural network and the improved Viterbi algorithm further ensures higher accuracy and optimal performance.

Embodiment two

An author analysis method according to one embodiment of the present invention is described in detail below with reference to Fig. 2, which is a flowchart of the author analysis method of embodiment two.

The author analysis method of this embodiment comprises the following steps: steps S101 and S102 are the same as in embodiment one, and the method further comprises, after step S102, step S103: reading in the newly input corpus, encoding it as data to be determined, and inputting each short word or character of the newly input corpus into the language model to generate new short words or characters. By way of example, the language generation module reads in the anonymous ancient manuscript that was to be determined earlier and encodes it, so that the encoded text is ready as data to be determined and can be supplied to the language model in the next step. By way of example, the language generation module loads the previously trained language model as the basis for generating language in the same style. By way of example, the language generation module feeds each short word or character of the anonymous ancient manuscript into the language model in sequence; at each output of the language model it obtains the new short words or characters most likely to appear in the current context, together with their probability distribution, which serves as a further useful reference for analyzing the authorship of the anonymous manuscript. In other words, this embodiment can not only determine the author of an ancient manuscript but also generate new language: it can predict which words are most likely to appear next, and thereby judge whether the manuscript is close in style to known writings.
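The idea in step S103 of reading off the most likely next tokens and their probability distribution can be illustrated as follows. This sketch assumes a plain table of next-token counts as a stand-in for the outputs of the trained language model. The `agreement_rate` measure is not taken from the text: it is one assumed way of turning the predictions into a style reference. All names are hypothetical.

```python
def next_token_distribution(counts, context, top_k=3):
    """Top `top_k` most likely next tokens after `context`, with probabilities.

    `counts` maps a context token to {next_token: count}; it stands in for
    the output layer of a trained language model.
    """
    following = counts.get(context, {})
    total = sum(following.values())
    if total == 0:
        return []
    # Sort by descending count, breaking ties alphabetically for stability.
    ranked = sorted(following.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(tok, n / total) for tok, n in ranked[:top_k]]

def agreement_rate(counts, tokens, top_k=3):
    """Fraction of positions where the manuscript's actual next token is
    among the model's `top_k` predictions (an assumed, illustrative measure)."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    hits = 0
    for prev, actual in pairs:
        predicted = [tok for tok, _ in next_token_distribution(counts, prev, top_k)]
        if actual in predicted:
            hits += 1
    return hits / len(pairs)
```

A high agreement rate under one author's model and a low rate under another's would point in the same direction as the probability score from step S102, which is the sense in which the generated tokens act as a second reference.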

Another embodiment of the present invention provides an author analysis system, characterized in that the system comprises: a language model generation module for reading in the corpus of a specific author, extracting the features of the corpus using a neural network, and generating a language model of the specific author; and a language determination module for loading the language model to determine the probability that the author of a newly input corpus is the specific author.

By way of example, the language model generation module is mainly responsible for reading in externally supplied reference works by a specific author, encoding them, and passing them through neural network layers over several rounds of repeated training, so that the features of these documents are organized into a language model.
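The read, encode, train and save flow of this module can be sketched as follows. The sketch makes a deliberate substitution: instead of the LSTM named in the text, it counts character bigrams, which keeps the example short and dependency-free. The integer encoding, the JSON storage format, and the function names are all assumptions, since the text specifies none of them.

```python
import json
from collections import Counter, defaultdict

def build_vocab(documents):
    """Map every character seen in the author's documents to an integer id."""
    chars = sorted({ch for text in documents for ch in text})
    return {ch: i for i, ch in enumerate(chars)}

def train_author_model(documents):
    """Encode the documents and count id-to-id transitions.

    Stand-in for repeated neural network training: a real system following
    the text would fit an LSTM on the encoded sequences instead.
    """
    vocab = build_vocab(documents)
    counts = defaultdict(Counter)
    for text in documents:
        ids = [vocab[ch] for ch in text]  # encoding step
        for prev, cur in zip(ids, ids[1:]):
            counts[prev][cur] += 1
    # JSON object keys must be strings, so ids are stored as strings.
    return {
        "vocab": vocab,
        "counts": {str(p): {str(c): n for c, n in row.items()}
                   for p, row in counts.items()},
    }

def save_model(model, path):
    """Persist the model so that it can be loaded for later determinations."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(model, f, ensure_ascii=False)

def load_model(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Saving one file per candidate author matches the description of keeping each trained model in a fixed format so that it can be called up again without retraining.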

By way of example, the language determination module takes a sequence of words and sentences as input and runs an improved Viterbi algorithm to compute the probability-related score of these words and sentences under the given language model.

By way of example, the author analysis system further comprises a language generation module for reading in the newly input corpus, encoding it as data to be determined, and inputting each short word or character of the newly input corpus into the language model so as to output new short words or characters. By way of example, the language generation module takes some words and sentences as input and calls the trained neural network repeatedly, so that it keeps generating the new words and sentences that best fit the given language model in the current context; these new words and sentences are then used as the input for the next step, and a long sentence is finally obtained.
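The feedback loop described above, in which each generated token becomes the next input, can be sketched as a greedy generation loop. `next_token` is a hypothetical callable standing in for one call to the trained network. A real LSTM would also carry its hidden state from step to step, which this sketch omits by conditioning on the last token only.

```python
def generate(next_token, seed, max_new=10, stop=None):
    """Greedy autoregressive generation.

    next_token(tok) is a hypothetical callable returning the most likely
    successor of `tok`, or None when the model has nothing to propose.
    Generation ends after `max_new` tokens or when `stop` is produced.
    """
    out = list(seed)
    if not out:
        raise ValueError("seed must contain at least one token")
    for _ in range(max_new):
        tok = next_token(out[-1])
        if tok is None or tok == stop:
            break
        out.append(tok)  # the new token becomes the next step's input
    return out
```

For example, with a lookup table `{"a": "b", "b": "c"}` passed as `table.get`, the seed `["a"]` is extended one token at a time until the table has no successor to offer.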

In summary, to better assist in analyzing the authorship of anonymous ancient manuscripts, the invention provides a language analysis method and system based on a neural network language model and an improved Viterbi algorithm. The system has the following modules. The language model generation module organizes a set of documents and corpora, for example the complete works of a writer, into a language model using the LSTM method; the model records most of the linguistic features and style of these documents and can even extrapolate to some closely related features. The language generation module, given a previously generated language model, generates words and short sentences in a similar style, and can generate a whole article if necessary. The language determination module, given a previously generated language model, takes the per-character probabilities provided by the language model, applies the improved Viterbi algorithm, calculates the probability that a document by an unknown author belongs to this language model, and gives a score. With these three modules, whenever an unknown ancient manuscript needs to be appraised, a separate language model is first built for each contemporary candidate from that candidate's works; once the language models have captured the features and style of each candidate's language, the determination module calculates the total probability score of the manuscript under each language model, and these scores can serve as a reference for attributing authorship.
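The overall workflow of building one model per candidate, scoring the manuscript under each, and comparing the scores can be sketched end to end. As a stand-in for the LSTM and the improved Viterbi algorithm, this sketch uses add-one-smoothed character bigrams and a softmax over mean log-probabilities; both are assumptions made for brevity, and `train`, `avg_log_prob` and `attribute` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Character-bigram counts for one candidate (stand-in for LSTM training)."""
    counts = defaultdict(Counter)
    for text in docs:
        for prev, ch in zip(text, text[1:]):
            counts[prev][ch] += 1
    return counts

def avg_log_prob(counts, text, vocab_size):
    """Mean add-one-smoothed log-probability per character transition."""
    pairs = list(zip(text, text[1:]))
    if not pairs:
        raise ValueError("manuscript needs at least two characters")
    total = 0.0
    for prev, ch in pairs:
        row = counts.get(prev, {})
        total += math.log((row.get(ch, 0) + 1) / (sum(row.values()) + vocab_size))
    return total / len(pairs)

def attribute(candidates, manuscript):
    """candidates: {author: [documents]}. Returns {author: share}.

    Shares are a softmax over per-author scores, so they sum to 1. They are
    relative scores among the listed candidates, not calibrated probabilities.
    """
    vocab = {ch for docs in candidates.values() for d in docs for ch in d}
    vocab |= set(manuscript)
    scores = {author: avg_log_prob(train(docs), manuscript, len(vocab))
              for author, docs in candidates.items()}
    top = max(scores.values())
    weights = {author: math.exp(s - top) for author, s in scores.items()}
    z = sum(weights.values())
    return {author: w / z for author, w in weights.items()}
```

Because the shares are normalized only over the candidates supplied, a high share says that one candidate fits better than the others, not that the true author is among them, which is consistent with the text's framing of the scores as a reference rather than a verdict.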

In a specific example, the author analysis system may omit the language model generation module, with the language determination module instead using a language model trained in advance by a neural network on the corpus of the specific author. Of course, in some examples the language generation module may also be omitted.

The modules of the embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that, in practice, a microprocessor or digital signal processor (DSP) may be used to implement some or all of the functions of some or all of the components of the author analysis system according to the embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program or computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a storage carrier, or provided in any other form.

Across the above embodiments, the language model generation and determination system of the present invention, built on a large body of original corpora and documents, provides an efficient and intelligent method of analyzing the authorship of anonymous documents that has substantial reference value. Owing to the introduction of new techniques and methods, the accuracy and reliability of the determination module are greatly improved, and the style and features of large corpora can be extracted more accurately.

The present invention has been described through the above embodiments, but it should be understood that these embodiments are for the purposes of example and illustration only and are not intended to limit the invention to the scope of the described embodiments. Those skilled in the art will further appreciate that the invention is not limited to the above embodiments and that further variations and modifications can be made in accordance with its teachings, all of which fall within the scope claimed by the invention. The scope of protection of the invention is defined by the appended claims and their equivalents.

Claims

1. An author analysis method, characterized in that the method comprises:

Step S101: loading a language model of a specific author, wherein the language model is obtained by neural network training on a corpus of the specific author;

Step S102: calculating, by means of the language model, the probability that the author of a newly input corpus is the specific author.

2. The author analysis method as claimed in claim 1, characterized in that the neural network is a long short-term memory (LSTM) artificial neural network.

3. The author analysis method as claimed in claim 1 or 2, characterized in that, in step S102, the calculation uses a Viterbi algorithm that records only the states whose score under the language model is higher than a threshold.

4. The author analysis method as claimed in claim 3, characterized in that step S102 comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the newly input corpus as a whole; and normalizing the confidence of the whole corpus into the probability that its author is the specific author.

5. The author analysis method as claimed in claim 1, characterized in that the method further comprises, after step S102, step S103: reading in the newly input corpus, encoding it as data to be determined, and inputting each short word or character of the newly input corpus into the language model to generate new short words or characters.

6. An author analysis system, characterized in that the system comprises:

a language determination module for loading a language model of a specific author to calculate the probability that the author of a newly input corpus is the specific author, wherein the language model is obtained by neural network training on a corpus of the specific author.

7. The author analysis system as claimed in claim 6, characterized in that the system further comprises a language model generation module for generating the language model of the specific author;

and/or the neural network is a long short-term memory (LSTM) artificial neural network.

8. The author analysis system as claimed in claim 6 or 7, characterized in that the language determination module is further configured to perform the calculation using a Viterbi algorithm that records only the states whose score under the language model is higher than a threshold.

9. The author analysis system as claimed in claim 8, characterized in that calculating the probability that the author of the newly input corpus is the specific author comprises: calculating, by means of the language model, the confidence of each short word or character in the newly input corpus; using the Viterbi algorithm to obtain the confidence of the newly input corpus as a whole; and normalizing the confidence of the whole corpus into the probability that its author is the specific author.

10. The author analysis system as claimed in claim 6, characterized in that the system further comprises a language generation module for reading in the newly input corpus, encoding it as data to be determined, and inputting each short word or character of the newly input corpus into the language model so as to output new short words or characters.