CN106095758B

CN106095758B - A literary work guessing method based on a word vector model

Info

Publication number: CN106095758B
Application number: CN201610439566.3A
Authority: CN
Inventors: 王庆林; 李原; 刘禹; 阮海鹏
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2016-06-17
Filing date: 2016-06-17
Publication date: 2018-12-04
Anticipated expiration: 2036-06-17
Also published as: CN106095758A

Abstract

The present invention relates to a literary work guessing method based on a word vector model and belongs to the technical field of information processing. The method comprises two stages: literary work knowledge base construction and literary work knowledge guessing. In the construction stage, a small-scale corpus of a specific literary work is collected and the feature words of the work are mined from it; a word vector neural network is trained on the small-scale corpus to obtain a word vector association model; and the related words with a high degree of association with each feature word are calculated from this model to construct the literary work guessing knowledge base. In the guessing stage, the system randomly selects a feature word as the guessing target, extracts the related words of that feature word from the knowledge base, and reveals them to the guesser one by one, and the guesser reasons out an answer. The invention uses word vector model analysis to discover the associations among the feature words of a specific literary work and tests the reader's familiarity with the work in the form of a guessing game, which strengthens the interaction between the work and the reader and makes reading more engaging.

Description

A literary work guessing method based on a word vector model

Technical field

The present invention relates to a literary work guessing method based on a word vector model, and in particular to a literary work guessing method that uses a word vector model to automatically mine the deep, complex information relationships in the text of a literary work. It belongs to the field of information processing technology.

Background art

A specific literary work is a literary work or series of works with a particular story background and plot. Such works are usually long, and the relationships among their characters and things are complex. On the one hand, reading such a work takes a great deal of time and energy, and at today's fast pace of life people find it hard to set aside enough time to read a whole work from beginning to end, so a fast, engaging, and interactive way of absorbing literary knowledge is needed. On the other hand, after reading a specific literary work, a reader understands it to some degree, deeply or superficially, but can only assess that familiarity qualitatively, not quantitatively. A method is therefore needed that can quantitatively test a reader's familiarity with the knowledge related to a specific literary work.

Knowledge guessing is a way of gauging a guesser's familiarity with knowledge in the open domain or in a restricted domain. The guesser reasons out an answer from the information in a passage of text or a few words. The less prompt information there is, or the lower its degree of association with the answer, the harder the question is and the more knowledge the guesser needs.

Applying knowledge guessing to literary works not only tests a reader's familiarity with a work through a variety of answer modes, but also lets the reader quickly grasp the topological relationships among entities such as the main characters and things in the work, and stimulates the reader's interest in reading.

At present, guessing knowledge bases are built mainly by hand, which requires the cooperation of many domain experts. In building a guessing knowledge base for literary works, the literary works on a specific subject can be regarded as a domain. To build a high-quality guessing knowledge base for this domain, an expert must understand the work very deeply and be very clear about the relationships among its main characters and things. Manual construction has the following drawbacks: construction is very slow, because a domain expert must write every question and answer by hand, and a guessing knowledge base usually contains many questions, so manual construction is difficult; it relies too heavily on domain expertise, so an expert who is not familiar enough with the literature of the domain cannot build a high-quality knowledge base; and it ports poorly across literary works on different themes, because a construction method that suits works on one theme performs poorly on works on another.

To address these problems of manual construction, the present invention uses natural language processing tools and methods to construct the guessing knowledge base of a specific literary work automatically, scientifically, quickly, and efficiently, and the method is highly portable. Once the knowledge base is built, guessers can use it to answer questions in a variety of guessing modes, and thereby quickly absorb knowledge related to the work or quantitatively evaluate their own familiarity with it.

Summary of the invention

The purpose of the present invention is to solve the problems of how to construct the guessing knowledge base of a specific literary work automatically, scientifically, quickly, and efficiently, how to quantitatively evaluate a reader's familiarity with a specific literary work, and how to let a reader quickly learn the knowledge related to a work without having read its original text. To this end, a literary work guessing method based on a word vector model is proposed. For a given specific literary work, the method mines the deep, complex information relationships in its text, constructs a knowledge base, extracts related words from the knowledge base, and reveals them to the guesser for guessing.

The idea of the invention is as follows: for a given specific literary work, the deep, complex information relationships among the words of its text are mined automatically from a related small-scale corpus, a knowledge base is constructed according to certain association rules, and the content is presented visually to the guesser for answering. This makes it possible to test the guesser's familiarity with the work quickly and scientifically, and also to bring out what is interesting in the work and increase interactivity.

The purpose of the present invention is achieved through the following technical solution:

A literary work guessing method based on a word vector model is divided into two stages, literary work knowledge base construction and literary work knowledge guessing. The knowledge base construction stage includes the following steps:

Step 1: collect text corpora related to the specific literary work, including but not limited to the original work, encyclopedia entries related to the work, and related research literature, and construct a small-scale corpus of the specific literary work;

Step 2: perform natural language text preprocessing on the constructed small-scale corpus to remove irrelevant text noise;

Step 3: perform named entity recognition on the denoised small-scale corpus using a natural language processing tool or method, and add the resulting named entities to a feature word table as the feature words specific to the literary work;

Step 4: add all the feature words in the feature word table to the segmentation dictionary of a word segmentation tool, segment the small-scale corpus of the specific literary work using this dictionary to obtain a segmented corpus, and add every word of the segmented corpus to a vocabulary table without duplicates;

Step 5: using a word vector analysis tool with the segmented corpus as input, obtain the word vector model of the small-scale corpus of the literary work, calculate the group of related words most closely associated with each feature word, and construct the literary work guessing knowledge base;

The literary work knowledge guessing stage includes the following steps:

Step 6: on entering the knowledge guessing stage, the system randomly selects a feature word as the guessing target and extracts from the guessing knowledge base the top N related words with the highest degree of association with that feature word;

Step 7: divide the N related words retained in step 6 into M groups of no fewer than 2 related words each, and assign a difficulty level to each group according to degree of association;

Step 8: the system randomly draws one related word from each of the M groups and reveals them to the guesser one by one in order of increasing degree of association;

Step 9: the guesser reasons out an answer from the revealed related words. If the system judges the answer correct, it records the elapsed display time and moves on to the next question; if the guesser has still answered incorrectly or has not answered when the related words disappear, the system records a failure and moves on to the next question;

Step 10: after the guesser has answered a set number of questions, the guessing ends. The system makes an overall evaluation based on the time spent and the accuracy achieved during answering and gives a score, which reflects the guesser's familiarity with the literary work.

In step 3, when named entity recognition is performed, named entities that are synonymous are aligned.

In step 5, when the word vector analysis tool is used with the segmented text as the input corpus to obtain the word vector model of the small-scale corpus, and the group of related words most closely associated with each feature word is calculated, the cosine similarity between two word vectors is taken as the degree of association between the two words.

In step 9, the guesser reasons out an answer from the revealed related words, and the guessing may take either a single-player answer form or a multi-player buzz-in form.

Beneficial effects

Compared with the prior art, the invention has the following features:

1) The literary work guessing method provided by the invention automatically mines the deep, complex information relationships in the text from a small-scale corpus of a specific literary work, and lets the reader quickly grasp the topological relationships among entities such as the main characters and things in the work. The reader can gain a fairly deep understanding of the work without reading it in full.

2) The method can construct the guessing knowledge base of a specific literary work automatically, quickly, and scientifically, and effectively avoids the drawbacks of manual construction: low efficiency, excessive reliance on domain expertise, and poor portability.

3) Once the guessing knowledge base has been built by the method, guessers can use it to answer questions in a variety of guessing modes. The system makes an overall evaluation based on the time spent and the accuracy achieved during answering and gives a score, which quantitatively reflects the guesser's familiarity with the literary work.

Brief description of the drawings

Fig. 1 is a flow diagram of a literary work guessing method based on a word vector model according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to a specific embodiment and the accompanying drawing.

The literary work guessing method based on a word vector model of the invention is divided into two stages, literary work knowledge base construction and literary work knowledge guessing. Its principle is as follows:

In the knowledge base construction stage, a small-scale corpus of a specific literary work is first collected. Text preprocessing is then used to extract from it the feature words specific to the work, such as person names, place names, times, and events. Next, a word vector neural network is trained on the small-scale corpus to obtain a word vector association model. Finally, the related words with a high degree of association with each feature word are calculated from this model to construct the literary work guessing knowledge base.

In the knowledge guessing stage, the system first randomly selects a feature word as the guessing target, then extracts the related words of that feature word from the guessing knowledge base and reveals them to the guesser one by one. The guesser reasons from the related words received until the feature word is answered correctly.

Fig. 1 is a flow diagram of the literary work guessing method based on a word vector model provided by the invention. To better illustrate the process, the method is described in detail using the literary work Water Margin as an example. As shown in Fig. 1, the method includes the following steps:

Step 101: collect text corpora related to the specific literary work, including but not limited to the original work, encyclopedia entries related to the work, and related research literature, and construct a small-scale corpus of the specific literary work.

A specific literary work is a literary work or series of works with a particular story background and plot, such as Water Margin, The Romance of the Three Kingdoms, the Star Wars series, or the Harry Potter series. When collecting text corpora related to a specific literary work, high-quality text should be chosen, that is, text whose content is highly relevant to the original work and that introduces only a small amount of noise. The higher the quality of the collected corpus, the better the word vector model constructed in step 105.

In this embodiment, constructing the small-scale corpus of Water Margin first requires collecting text related to Water Margin. The original work has 120 chapters. In addition, 427 entries covering the main characters, events, and so on in Water Margin are used as query terms; a web crawler automatically fetches the Baidu Baike page for each entry, and the text of the entry is extracted. The result is a plain-text small-scale corpus of Water Margin that covers its characters, historical background, and literary influence, with a total size of 6.87M.

Step 102: perform natural language text preprocessing on the constructed small-scale corpus of the specific literary work to remove irrelevant text noise, such as symbols and English characters that carry no meaning, serial numbers in entries, and advertising information in web pages. This step further improves the quality of the collected corpus. After denoising, the small-scale corpus of Water Margin is reduced to 6.59M.
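The denoising in step 102 could be sketched as follows. The text names the kinds of noise but not the rules used to detect them, so the regular expressions, the advertising markers, and the function name `denoise` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns: the source lists noise categories, not concrete rules.
ENTRY_NUMBER = re.compile(r"^[ \t]*\d+[.、)][ \t]*", re.MULTILINE)  # serial numbers at line start
ASCII_LETTERS = re.compile(r"[A-Za-z]+")                            # English characters
SYMBOLS = re.compile(r"[#*|<>\[\]{}=_~^]+")                         # symbols without meaning
AD_MARKERS = ("广告", "点击下载")                                    # assumed advertising markers


def denoise(text: str) -> str:
    """Drop advertising lines, then strip serial numbers, letters and symbols."""
    kept = [line for line in text.splitlines()
            if not any(marker in line for marker in AD_MARKERS)]
    cleaned = "\n".join(kept)
    cleaned = ENTRY_NUMBER.sub("", cleaned)
    cleaned = ASCII_LETTERS.sub("", cleaned)
    cleaned = SYMBOLS.sub("", cleaned)
    # Collapse runs of spaces and discard lines that became empty.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in cleaned.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "1. 鲁智深 ABC ***\n广告:点击下载\n2、宋江"
print(denoise(sample))
```

A real corpus would need rules tuned to the crawled pages; the point of the sketch is only that each noise category maps to one removal pass.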

Step 103: with the small-scale corpus of Water Margin as input, perform named entity recognition on the corpus using a natural language processing tool or method, identifying named entities in the text including but not limited to person names and place names, and add the resulting named entities to the feature word table as the feature words specific to the literary work.

In this embodiment, named entity recognition is performed with the HanLP tool. In HanLP, named entity recognition runs as a follow-up to word segmentation: the sentence is first segmented, and each resulting word is then checked to see whether it is a named entity. In this embodiment, the small-scale text corpus of Water Margin is the input, HanLP's new word discovery function is turned on, and the output is the segmentation result with part-of-speech tags, in which each word is separated from its tag by "/", as in "Song Jiang/nr" and "apartment for the newly-weds/ns". Here "nr" indicates that the word "Song Jiang" is a person name, and "ns" indicates that the word "apartment for the newly-weds" is a place name. HanLP can automatically mine named entities such as "Song Jiang", "Song Gongming", "Lu Zhishen", and "apartment for the newly-weds". The automatically mined named entities generally contain a small number of errors, so an expert needs to filter the recognition results. In addition, different named entities may refer to the same thing, so named entities that are synonymous are aligned during named entity recognition. For example, in this embodiment "Song Jiang" and "Song Gongming" are synonymous named entities and need to be aligned, that is, "Song Gongming" is replaced with "Song Jiang". The feature words formed by these named entities are characters or objects specific to Water Margin and are clearly referential and representative within the work. They are added to the feature word table and serve as the guessing targets in the knowledge guessing stage.
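As a rough illustration of how the tagged output described above might be turned into a feature word table, the sketch below parses "word/tag" tokens, keeps the "nr" and "ns" tags mentioned in the text, and aligns aliases. It does not call HanLP; the alias table and the function name `extract_feature_words` are hypothetical, and expert filtering is assumed to happen separately.

```python
# Tags named in the text: "nr" marks a person name, "ns" a place name.
ENTITY_TAGS = {"nr": "person", "ns": "place"}

# Hypothetical alias table (alias -> canonical name), e.g. supplied by an expert.
ALIASES = {"Song Gongming": "Song Jiang"}


def extract_feature_words(tagged_tokens, aliases=ALIASES):
    """Return {feature word: entity type} from tokens written as 'word/tag'."""
    features = {}
    for token in tagged_tokens:
        word, sep, tag = token.rpartition("/")
        if not sep or tag not in ENTITY_TAGS:
            continue  # untagged token or not a named entity
        canonical = aliases.get(word, word)  # align synonymous entities
        features.setdefault(canonical, ENTITY_TAGS[tag])
    return features


tokens = ["Song Jiang/nr", "said/v", "Song Gongming/nr", "Lu Zhishen/nr"]
print(extract_feature_words(tokens))
```

Keeping the entity type next to each feature word is a design choice made here so that the later filtering of same-type related words has the information it needs.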

Step 104: using a word segmentation toolkit together with the feature word table generated in step 103, add all the feature words in the table to the segmentation dictionary of the tool, segment the small-scale corpus of the specific literary work using this dictionary to obtain a segmented corpus, and add every word of the segmented corpus to the vocabulary table without duplicates.

In this embodiment, the HanLP word segmentation toolkit is used. All the feature words in the feature word table generated in step 103 are added to HanLP's segmentation dictionary, and HanLP's new word discovery function is turned off. The input is the original small-scale corpus and the output is the segmented text corpus. Every word of the segmented corpus is then added to a data table without duplicates to build the Water Margin vocabulary table.

The segmentation in step 103 serves to obtain the named entities in Water Margin, that is, the feature words, and its results need expert filtering. The segmentation dictionary in step 104 is an extended, updated dictionary, so its segmentation results differ from those of step 103: the feature words mined through named entity recognition make the segmentation of the corpus in step 104 more accurate.
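To show why adding feature words to the segmentation dictionary changes the result, the sketch below uses forward maximum matching, a simple dictionary-based segmenter chosen here only for illustration; it is not HanLP's algorithm. The sample sentence, the dictionary contents, and the function names are assumptions.

```python
def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching: take the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # single characters are the fallback
                words.append(piece)
                i += size
                break
    return words


def build_vocabulary(segmented_sentences):
    """Collect every word once, in order of first appearance."""
    seen, vocabulary = set(), []
    for sentence in segmented_sentences:
        for word in sentence:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


base_dictionary = {"大闹", "五台山"}
extended_dictionary = base_dictionary | {"鲁智深"}  # feature word added to the dictionary

print(segment("鲁智深大闹五台山", base_dictionary))      # the name is split into characters
print(segment("鲁智深大闹五台山", extended_dictionary))  # the name stays whole
```

With the feature word in the dictionary the name survives as one token, which is what allows a word vector to be learned for it later.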

Step 105: using a word vector analysis tool with suitably chosen parameters and the segmented corpus as input, obtain the word vector model of the small-scale corpus of the literary work.

A word vector model represents words as vectors. The cosine similarity between two word vectors reflects the degree of association between the two words: the larger the value, the more closely the two words are associated. Computing the cosine similarity between any word in the vocabulary and every other word then yields the group of words most closely associated with that word. Of course, those skilled in the art will appreciate that, besides cosine similarity, any other measure that reflects the degree of association between words can be used, such as Euclidean distance or Manhattan distance.
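The ranking by cosine similarity described above can be sketched in plain Python. The three-dimensional vectors below are made-up toy values (the embodiment uses 200-dimensional Word2vec vectors), and the names `cosine_similarity` and `most_related` are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_related(word, vectors, top_n):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors, invented for this example only.
vectors = {
    "Lu Zhishen": [1.0, 0.0, 0.0],
    "dandy monk": [0.9, 0.1, 0.0],
    "wineshop": [0.5, 0.5, 0.0],
    "Song Jiang": [0.0, 0.0, 1.0],
}
print(most_related("Lu Zhishen", vectors, top_n=2))
```

Swapping in Euclidean or Manhattan distance, as the text allows, would only require replacing the scoring function and sorting in ascending order, since smaller distances mean closer association.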

In this embodiment, Word2vec is chosen as the word vector analysis tool. The Word2vec neural network is trained on the segmented small-scale corpus of Water Margin to obtain a 200-dimensional word vector model, which provides a word vector for every word in the Water Margin vocabulary. For each feature word of Water Margin, its degree of association with every other word in the vocabulary is then calculated and ranked, which yields the group of related words most closely associated with that feature word. For example, for the feature word "Lu Zhishen", the group of words with the highest degree of association includes:

From experience, it can be seen that the related words above are closely connected with the development of Lu Zhishen's storyline in Water Margin, which matches how people think when reading literary works.

Once the group of related words has been calculated for each feature word in turn, the construction of the literary work guessing knowledge base is complete.

Step 106: on entering the knowledge guessing stage, the system randomly selects a feature word as the guessing target; the feature word may be a main character, a main event, a main place, and so on in the literary work. The top N related words of that feature word and their degrees of association are then extracted from the guessing knowledge base.

In this embodiment, if the "108 heroes" of Water Margin are chosen as the guessing targets and the feature word randomly drawn by the system is "Lu Zhishen", the 8 related words of "Lu Zhishen" and their degrees of association are extracted from the guessing knowledge base. The value of N should not exceed the number of related words stored for any feature word in the guessing knowledge base.
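Step 106 and the constraint on N might look like the sketch below. The knowledge base layout (a dictionary from feature word to a list of `(related word, degree of association)` pairs sorted from high to low), the placeholder words, and the function name `pick_target` are assumptions, not details given in the text.

```python
import random


def pick_target(knowledge_base, n, rng=random):
    """Randomly choose a feature word and return it with its top-n related words."""
    smallest = min(len(related) for related in knowledge_base.values())
    if n > smallest:
        # N must not exceed the related-word count of any feature word.
        raise ValueError(f"N={n} exceeds the smallest related-word list ({smallest})")
    target = rng.choice(sorted(knowledge_base))
    return target, knowledge_base[target][:n]


# Hypothetical knowledge base with placeholder words and scores.
knowledge_base = {
    "Lu Zhishen": [("word_a", 0.9), ("word_b", 0.8), ("word_c", 0.7)],
    "Song Jiang": [("word_d", 0.9), ("word_e", 0.6), ("word_f", 0.5)],
}
target, related = pick_target(knowledge_base, n=2, rng=random.Random(0))
print(target, related)
```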

In practice, if the guessing target is a character, related words that are also characters do not give good reference or hints for the target. A method of filtering related words can therefore be provided that filters out named entities of the same type as the target: when the guessing target is a character, related words that are also characters are filtered out, and when the guessing target is a place, related words that are also places are filtered out. In this embodiment, the 8 related words of "Lu Zhishen" extracted from the guessing knowledge base include two person names, "Wu Song" and "Shi Jin". Under this filtering rule, these two related words are filtered out, and the remaining 6 related words include:

Step 107: divide the N related words retained in step 106 into M groups of N/M related words each, and assign a difficulty level to each group according to degree of association.

There are many possible rules for ordering and grouping related words; the main basis for grouping is the degree of association between the feature word and its related words. In this embodiment, the 6 related words of the feature word "Lu Zhishen" obtained in the previous step are sorted by degree of association and divided evenly into 3 groups of 2 words each: the two words with the highest degree of association form the level-1 difficulty group, the two middle words form the level-2 difficulty group, and the two words with the lowest degree of association form the level-3 difficulty group.
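The same-type filtering and the even split into difficulty groups can be sketched together. The degrees of association and the placeholder words `word_b`, `word_d`, and `word_f` are invented for the example, since the source text does not reproduce the actual list; the function names are likewise illustrative.

```python
def filter_same_type(related, entity_types, target_type):
    """Drop related words whose entity type equals that of the guessing target."""
    return [(w, s) for w, s in related if entity_types.get(w) != target_type]


def group_by_difficulty(related, m):
    """Sort by degree of association (high to low) and split evenly into m groups.

    Level 1 holds the most closely associated words, level m the least.
    """
    ranked = sorted(related, key=lambda pair: pair[1], reverse=True)
    if m <= 0 or len(ranked) % m != 0:
        raise ValueError("the related words must divide evenly into m groups")
    size = len(ranked) // m
    return {level + 1: ranked[level * size:(level + 1) * size] for level in range(m)}


# Hypothetical scores; placeholder names stand in for words not given in the text.
related = [("dandy monk", 0.80), ("Wu Song", 0.78), ("word_b", 0.75), ("wineshop", 0.70),
           ("Shi Jin", 0.68), ("word_d", 0.65), ("Baozhusi", 0.60), ("word_f", 0.55)]
entity_types = {"Wu Song": "person", "Shi Jin": "person"}

kept = filter_same_type(related, entity_types, target_type="person")
groups = group_by_difficulty(kept, m=3)
for level, words in groups.items():
    print(level, words)
```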

Step 108: the system randomly draws one related word from each of the M groups and reveals them to the guesser one by one in order of increasing degree of association.

In this embodiment, the system randomly draws the related words "dandy monk", "wineshop", and "Baozhusi" from the three groups and reveals them in order of increasing degree of association: the level-3 related word "Baozhusi" is revealed first, the level-2 related word "wineshop" is revealed after 5 seconds, the level-1 related word "dandy monk" is revealed after 10 seconds, and all related words disappear after 20 seconds.
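A minimal sketch of the reveal schedule in this embodiment follows. The 5-second interval and the 20-second cutoff come from the example above; the group layout, the function name `build_reveal_schedule`, and the tuple format are assumptions.

```python
import random

DISAPPEAR_AFTER = 20.0  # seconds after which all related words vanish (from the embodiment)


def build_reveal_schedule(groups, interval=5.0, rng=random):
    """Draw one word per group and schedule reveals from the hardest group to the easiest.

    `groups` maps difficulty level -> list of (word, degree of association);
    level 1 has the highest association, so the highest level number is shown first.
    """
    schedule = []
    for step, level in enumerate(sorted(groups, reverse=True)):
        word, _score = rng.choice(groups[level])
        schedule.append((step * interval, level, word))
    return schedule


# One word per group here, so the random draw is deterministic.
groups = {
    1: [("dandy monk", 0.80)],
    2: [("wineshop", 0.70)],
    3: [("Baozhusi", 0.60)],
}
for seconds, level, word in build_reveal_schedule(groups):
    print(f"t={seconds:>4}s  level {level}: {word}")
```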

Step 109: the guesser reasons out an answer from the revealed related words. If the system judges the answer correct, it records the elapsed display time and moves on to the next question; if the guesser has still answered incorrectly or has not answered when the related words disappear, the system records a failure and moves on to the next question.

In this embodiment, after the related word "Baozhusi" is revealed to the guesser, if the guesser correctly answers with the feature word "Lu Zhishen" at the 3rd second, the system records an answer time of 3 seconds and moves on to the next question; if after 20 seconds the guesser has still answered incorrectly or has not answered, the system records the question as failed and moves on to the next question.

Further, the guessing may take either a single-player answer form or a multi-player buzz-in form. In multi-player buzz-in mode, the buzz-in time of the first player to answer correctly is taken as the answer time for a successful answer, and the other players are recorded as having failed to buzz in.
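The buzz-in rule can be sketched as a small resolver over the attempts made on one question. The attempt format `(player, seconds, guess)`, the player labels, and the function name `resolve_buzz_in` are hypothetical; the 20-second limit is taken from the embodiment above. Single-player mode is the special case of one player.

```python
def resolve_buzz_in(attempts, answer, time_limit=20.0):
    """Return (winner, answer time, players who failed) for one question.

    `attempts` is a list of (player, seconds, guess). The earliest correct guess
    within the time limit wins; every other player is recorded as failed.
    """
    players = {player for player, _seconds, _guess in attempts}
    correct = [a for a in attempts if a[2] == answer and a[1] <= time_limit]
    if not correct:
        return None, None, sorted(players)
    winner, seconds, _guess = min(correct, key=lambda a: a[1])
    return winner, seconds, sorted(players - {winner})


attempts = [
    ("player_1", 3.0, "Wu Song"),      # wrong guess
    ("player_2", 7.5, "Lu Zhishen"),   # first correct guess
    ("player_3", 9.0, "Lu Zhishen"),   # correct but later
]
print(resolve_buzz_in(attempts, answer="Lu Zhishen"))
```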

Step 110: after the guesser has answered a set number of questions, the guessing ends. The system makes an overall evaluation based on the time spent and the accuracy achieved during answering and gives a score, which reflects the guesser's familiarity with the literary work.

In this embodiment, suppose the guesser answers 10 questions in total, each question has three groups of related words, and the related words are displayed for at most 35 seconds, so that answering 10 questions takes at most 350 seconds. If the guesser answers 9 questions correctly in a total of 140 seconds, the score is 100 × (9/10 + (350 - 140)/350) = 150 (out of a total of 200 points). The higher the score, the more familiar the guesser is with Water Margin. While answering, the guesser also learns, coming to understand the associations among feature words such as the main characters and things in the work, and so absorbs knowledge related to Water Margin more quickly and interactively. Of course, those skilled in the art will understand that other scoring schemes can also be used, provided that less time spent and higher accuracy during answering yield a higher score. Only then can the guesser's familiarity with the literary work be evaluated correctly and sensibly.
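The scoring rule of this embodiment can be written as a short function. The formula is the one given above; the function name `familiarity_score` and its parameter names are illustrative.

```python
def familiarity_score(correct, total, seconds_used, seconds_per_question):
    """Score = 100 * (accuracy + fraction of the maximum time left unused).

    Ranges from 0 to 200: up to 100 points for accuracy and up to 100 for speed.
    """
    if total <= 0 or seconds_per_question <= 0:
        raise ValueError("total and seconds_per_question must be positive")
    max_seconds = total * seconds_per_question
    accuracy = correct / total
    time_saved = (max_seconds - seconds_used) / max_seconds
    return 100 * (accuracy + time_saved)


# Values from the embodiment: 9 of 10 correct, 140 s used, 35 s maximum per question.
print(round(familiarity_score(9, 10, 140, 35), 2))
```

Both terms grow when accuracy rises or time falls, so the function satisfies the condition the text places on any alternative scoring scheme.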

The specific description above further explains the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above is only a specific embodiment of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A literary work guessing method based on a word vector model, characterized in that the method comprises two stages, literary work knowledge base construction and literary work knowledge guessing, and specifically comprises the following steps:

Step 101: collect text corpora related to the specific literary work, including but not limited to the original work, encyclopedia entries related to the work, and related research literature, and construct a small-scale corpus of the specific literary work;

Step 102: perform natural language text preprocessing on the constructed small-scale corpus to remove irrelevant text noise;

Step 103: perform named entity recognition on the denoised small-scale corpus using a natural language processing tool or method, and add the resulting named entities to a feature word table as the feature words specific to the literary work;

Step 104: add all the feature words in the feature word table to the segmentation dictionary of a word segmentation tool, segment the small-scale corpus of the specific literary work using this dictionary to obtain a segmented corpus, and add every word of the segmented corpus to a vocabulary table without duplicates;

Step 105: using a word vector analysis tool with the segmented corpus as input, obtain the word vector model of the small-scale corpus of the literary work, calculate the group of related words most closely associated with each feature word, and construct the literary work guessing knowledge base;

Step 106: on entering the knowledge guessing stage, the system randomly selects a feature word as the guessing target and extracts from the guessing knowledge base the top N related words with the highest degree of association with that feature word;

Step 107: divide the N related words retained in step 106 into M groups of no fewer than 2 related words each, and assign a difficulty level to each group according to degree of association;

Step 108: the system randomly draws one related word from each of the M groups and reveals them to the guesser one by one in order of increasing degree of association;

Step 109: the guesser reasons out an answer from the revealed related words; if the system judges the answer correct, it records the elapsed display time and moves on to the next question; if the guesser has still answered incorrectly or has not answered when the related words disappear, the system records a failure and moves on to the next question;

Step 110: after the guesser has answered a set number of questions, the guessing ends; the system makes an overall evaluation based on the time spent and the accuracy achieved during answering and gives a score, which reflects the guesser's familiarity with the literary work.
2. The literary work guessing method based on a word vector model according to claim 1, characterized in that: in step 103, when named entity recognition is performed, named entities that are synonymous are aligned.
3. The literary work guessing method based on a word vector model according to claim 1, characterized in that: in step 105, when the word vector analysis tool is used with the segmented corpus as input to obtain the word vector model of the small-scale corpus of the literary work, and the group of related words most closely associated with each feature word is calculated, the cosine similarity between two word vectors is taken as the degree of association between the two words.
4. The literary work guessing method based on a word vector model according to any one of claims 1 to 3, characterized in that: in step 109, the guesser reasons out an answer from the revealed related words, and the guessing may take either a single-player answer form or a multi-player buzz-in form.