CN109977407A - Multi-level difference analysis method for written discourse based on word embeddings - Google Patents

Multi-level difference analysis method for written discourse based on word embeddings

Info

Publication number
CN109977407A
Authority
CN
China
Prior art keywords
word
discourse
data
monologue
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910236193.3A
Other languages
Chinese (zh)
Inventor
吕学强
周强
游新冬
董志安
张学敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Information Science and Technology University
Original Assignee
Tsinghua University
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Beijing Information Science and Technology University
Priority to CN201910236193.3A
Publication of CN109977407A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Abstract

The present invention relates to a multi-level difference analysis method for written discourse based on word embeddings, comprising: step 1) performing difference analysis on monologue discourse and dialogue discourse; step 2) analyzing the correlation between different character/word embeddings and discourse wording. Step 1) includes structural difference analysis, relation difference analysis, and wording difference analysis. Step 2) includes: comparing the overlap between the characters/words occurring in monologue and dialogue discourse and the characters/words covered by each data set's embeddings; and using character/word embeddings to represent characters/words as vectors of a specific dimension for model training. Through statistical analysis, the present invention performs multi-level, multi-angle difference analysis on monologue and dialogue discourse, finds differences between them ranging from structure to the distribution of specific functions, exhibits these differences at levels such as article structure, sentence length, and vocabulary usage, and can well meet the needs of practical applications.

Description

Multi-level difference analysis method for written discourse based on word embeddings
Technical field
The invention belongs to the technical field of text processing, and in particular relates to a multi-level difference analysis method for written discourse based on word embeddings.
Background art
Discourse analysis is a set of analysis methods widely used in disciplines such as linguistics, sociology, cognitive psychology, and oral communication. It emphasizes analyzing the different linguistic expressions and implications of spoken language, written language, sign language, body language, and so on in context. Both monologue discourse and dialogue discourse are kinds of written discourse. Because their expressive functions and purposes differ, there are various differences between them in analysis structure and lexical expression. Word embeddings depend on their training data, so embeddings trained on data of different text types also differ to some extent. However, researchers often ignore the differences in lexical expression between different discourse types when carrying out tasks, which affects experimental results. Because researchers typically focus only on the features of the data in a given analysis task in order to obtain good analytical performance, research on discourse lacks substantive comparison between monologue and dialogue discourse, and difference analysis of discourse cannot be carried out from multiple angles. This situation leaves blind spots in discourse analysis research.
Discourse analysis considers linguistic expression: whether oral or written, it embodies social and cultural viewpoints and identity. In research and analysis of discourse, establishing a discourse system is particularly important, and statistical analysis is also an important research method. Annotated data of monologue and dialogue discourse provide the data basis for difference analysis research across discourse types, while research on dialogue discourse mainly relies on the analysis of dialogue functions and dependency relations. In these studies, most researchers analyze only a single type of discourse and ignore comparative analysis across multiple discourse types; comparison of the differences between monologue and dialogue discourse is lacking.
CNN (Convolutional Neural Network) models are widely used in text classification and discourse relation research. With subsequent technical developments, attention mechanisms have also begun to be applied in the analysis of coherence relations, and new progress has been made in combination with RNN (Recurrent Neural Network) models. In addition, DNN (Deep Neural Network) models have been applied in the analysis of dialogue acts. Current research is conducted for specific tasks on specific discourse types. In existing work, however, the differences between monologue and dialogue discourse have not received researchers' attention.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide a word embedding-based multi-level difference analysis method for written discourse that avoids the above technical defects.
In order to achieve the above object, the technical solution provided by the present invention is as follows:
A multi-level difference analysis method for written discourse based on word embeddings, comprising:
Step 1) performing difference analysis on monologue discourse and dialogue discourse;
Step 2) analyzing the correlation between different character/word embeddings and discourse wording.
Further, step 1) includes: structural difference analysis, relation difference analysis, and wording difference analysis.
Further, step 2) includes:
comparing the overlap between the characters/words occurring in monologue and dialogue discourse and the characters/words covered by each data set's embeddings;
using character/word embeddings to represent characters/words as vectors of a specific dimension, and training models.
Further, in step 2), the CBOW and Skip-gram models are used, and TensorFlow's TensorBoard is used to visualize the word embedding distributions.
Further, the method also includes: step 3) designing analysis tasks;
for monologue discourse, a sentence group boundary recognition task and an inter-sentence relation analysis task are selected;
the sentence group boundary recognition task is carried out on complete monologue discourse, and its processing goal is to accurately predict the boundary of each sentence group according to changes of the sentence group's gist, so as to segment a complete monologue discourse into several sentence groups;
the inter-sentence relation analysis task is carried out between adjacent sentences inside a sentence group, and its processing goal is, according to the coherence features of adjacent sentences in the sentence group, first to determine whether content coherence exists between the two, and then, for possible coherent sentence pairs, to further identify the possible connective relation between them;
the inter-sentence relation analysis task uses Tsinghua Treebank data, selects adjacent clause pairs with coherence relations in complex sentences, and constructs a CNN model to automatically predict their possible connective relations;
for dialogue discourse, a dialogue act tag prediction task for each utterance and a dependency relation recognition task for adjacency pairs are selected;
a multi-task model is used for analysis and training, and by combining a CNN model with an attention mechanism, model parameters are trained for dialogue acts and dependency relations simultaneously.
Further, in the dialogue discourse analysis task, a multi-task learning model is used; the multi-task model combines a CNN model and an attention mechanism, the CNN model is mainly responsible for the textual features of the dialogue, and the surrounding context is obtained through the processing of the attention mechanism.
Further, in step 2):
the following three classes of data are selected for matching analysis: Baidu data from Baidu Search and Baidu Zhidao, news data crawled from news portal websites, and microblog data crawled by topic from microblog websites; these three classes of data are used as the training data for character/word embeddings;
multiple versions of character/word embeddings are trained on the same training data with different models, so as to compare the specific influence of training data with different degrees of colloquialism and formality on different tasks;
based on the different character/word embedding files trained on the text data, the overlap between the characters/words occurring in monologue and dialogue discourse and the characters/words covered by each data set's embeddings is compared first;
character/word embeddings are then used to represent characters/words as vectors of a specific dimension, and models are trained.
Further, the CBOW, Skip-gram, and GloVe models are selected to train different character/word embedding vectors.
The word embedding-based multi-level difference analysis method for written discourse provided by the present invention performs multi-level, multi-angle difference analysis on monologue and dialogue discourse through statistical analysis, finds the differences between monologue and dialogue discourse ranging from structure to the distribution of specific functions, exhibits these differences at levels such as article structure, sentence length, and vocabulary usage, and can well meet the needs of practical applications.
Brief description of the drawings
Fig. 1 is a comparative visualization of the distributions of the word embeddings trained on Baidu, news, and microblog data.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
Written discourse includes two types, monologue discourse and dialogue discourse. Monologue and dialogue discourse have different expressive functions and wording characteristics, which poses new challenges for the computational modeling of the different analysis tasks based on these discourse types.
The method of the invention starts from two existing discourse annotation repositories. Using statistical analysis, the functional and structural differences of the two discourse types are quantitatively analyzed at different levels. Then, based on different word embedding vectors automatically trained on three different types of corpus text, the distribution characteristics of the two discourse types in terms of wording are preliminarily analyzed from the angle of word vectors. On this basis, for four canonical analysis tasks on the two discourse types, the influence of different word embeddings on the analysis performance of deep learning models is studied. The experimental results show that different word embeddings differ markedly in expressive ability across discourse analysis tasks, thereby demonstrating the multi-level differences between monologue and dialogue discourse.
A multi-level difference analysis method for written discourse based on word embeddings, comprising:
Step 1) performing difference analysis on monologue discourse and dialogue discourse;
Step 2) analyzing the correlation between different character/word embeddings and discourse wording;
Step 3) designing analysis tasks.
Step 1) includes:
(1) Overall statistics
The basic analysis data used in this embodiment come from 387 monologue discourses and 500 dialogue discourse records annotated by the research group. The basic analysis unit of monologue discourse is the sentence (a Chinese character segment terminated by sentence-final punctuation such as a full stop, question mark, or exclamation mark), and the minimal analysis unit is the clause (a Chinese character segment separated by punctuation such as a comma, semicolon, or dash). The basic analysis unit of dialogue discourse is the utterance (a single speech message produced by an interlocutor during the dialogue), and the minimal analysis unit is the function segment (the minimal information unit in an utterance that can characterize a communicative function).
Table 1: Overall statistics of the two discourse types
Table 1 shows the overall statistics of the two discourse types. It can be seen that the two types differ markedly at the level of the basic analysis unit. In monologue discourse, each sentence contains about 20 words and about 35 Chinese characters, whereas in dialogue discourse each utterance unit contains only about 7 words and 9 characters. Because the dialogue segments in dialogue discourse are short, extracting semantic representations in specific dialogue discourse analysis tasks can be difficult.
It can also be seen that the monologue and dialogue corpora contain 16,087 and 10,000 annotated basic units respectively, and 34,839 clauses and 12,452 function segments as minimal units. These figures also show that the ratio at which sentences in monologue discourse can be segmented into clauses is higher than the ratio at which utterances in dialogue discourse can be segmented into function segments: because utterance length is limited in dialogue discourse, fewer utterances can be segmented into function segments.
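As an illustration only (this is not the patent's data format), the analysis units defined above could be represented as follows:

    from dataclasses import dataclass, field

    @dataclass
    class Sentence:                  # basic analysis unit of monologue discourse
        text: str
        clauses: list[str] = field(default_factory=list)      # minimal units

    @dataclass
    class Utterance:                 # basic analysis unit of dialogue discourse
        speaker: str
        text: str
        function_segments: list[str] = field(default_factory=list)  # minimal units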
(2) Structural difference analysis
The primary structural units in monologue discourse are the sentence group and the coherent sentence pair. A sentence group is a combination of coherent sentences that are logically connected in meaning, structurally related in grammar, and linked in expression. A coherent sentence pair describes, through connective relation tags, the rational and logical connection between two content-coherent sentences in a sentence group.
The primary structural units in dialogue discourse are the topic thread and the adjacency pair. A topic thread describes the turn-by-turn evolution of the same topic content in a session. An adjacency pair describes the turn alternation between two different speakers within a topic thread. By analyzing the dialogue act (DA) of each turn's utterance and annotating the functional/feedback dependency relations between adjacent turns, the communicative functions and the changes of semantic content therein are accurately described.
Table 2: Structure/relation statistics of the two discourse types
Table 2 shows comparative statistics on the structural description of the two discourse types. The content of a sentence group in monologue discourse mostly unfolds around the sentence group's gist, forming functional structures such as narration, explanation, and argumentation through the rational and logical relations described by each coherent sentence pair, so as to better express the author's writing intention. A topic thread in dialogue discourse unfolds around one communicative topic, advancing the topic content through the inter-turn message dependencies formed by each adjacency pair, so as to better reflect the real dialogue intentions of the participants. The current statistics show that a sentence group in monologue discourse contains about 6 sentences on average, while a topic thread in dialogue discourse contains 12 utterances on average. This difference in information capacity may reflect the different internal control mechanisms of the two modes of information transmission, monologue and dialogue.
The connective relations in monologue discourse and the dependency relations in dialogue discourse have a certain similarity: both are relation recognition between two sentences. The connective relations of monologue discourse mainly concern the rational and logical relations between two sentences, represented by relations such as continuation and coordination. The dependency relations of dialogue discourse instead emphasize dependence, mainly functional dependency and feedback dependency. Dependency relation research concerns the dependency and functional relations between dialogue acts and is less sensitive to logical cues than monologue discourse.
(3) Relation difference analysis
The connective relation tags of coherent sentence pairs in monologue text are mainly divided into the following 8 major classes: 1) continuation; 2) coordination; 3) causality; 4) adversative; 5) comparison; 6) elaboration; 7) general comment; 8) syntactic relation. The specific distribution data are shown in Table 3.
Table 3: Distribution of connective relations in monologue discourse
It can be seen that the distribution of connective relation instances in monologue discourse is very uneven: the continuation class (1), coordination class (2+5), and elaboration class (6+7) account for 54.2%, 22.8%, and 14.1% respectively, making up the great majority. This is closely related to the genre distribution of the current monologue corpus: the news and encyclopedia data that make up the main portion mostly contain narrative, expository, and argumentative sentence groups, whose core content coherence is mainly realized through continuation, coordination, and reasoned connection.
For comparison, the distribution of connective relations between clauses inside complex sentences in the Tsinghua Chinese Treebank (TCT) was also counted, as shown in Table 4. In the process, the 11 relation classes annotated at the TCT complex-sentence level were merged into the following four categories: coherence relations (former TCT coherence and sequential relations), coordination relations (former TCT coordination and alternation relations), elaboration relations (former TCT elaboration relations), and other relations (former TCT causal, purpose, conditional, hypothetical, progressive, and adversative relations). The coherence and coordination relations here correspond roughly to the continuation and coordination classes at the sentence level. Among these four categories, elaboration relations occur least often. In contrast, single-nucleus relations such as causal, conditional, and adversative relations occur more frequently at the clause level than at the sentence level.
Table 4: Distribution of clause connective relations in TCT complex sentences
The dependency relations between adjacency pairs in dialogue discourse are mainly divided into two types: functional dependency and feedback dependency. Functional dependency mainly describes 'question-answer' pair relations; feedback dependency describes 'statement/action-reaction' pair relations. Adjacency pairs in which no dependency relation exists between the two session segments are also counted, in Table 5.
Table 5: Distribution of dependency relations in dialogue discourse
The data in Table 5 show that, among adjacency pairs with dependency relations, feedback dependency accounts for a higher proportion than functional dependency. This is because in dialogue the content the two parties express to each other exceeds questioning: the parties rarely ask questions continuously, and expressions of their own attitudes and statements about objective things are interspersed among the questions. Moreover, the interlocutors can exchange views through continuous statements and conversational feedback, so feedback dependency occurs more often than functional dependency.
The comparison of Tables 3, 4, and 5 shows that the distribution of inter-sentence relations in both monologue and dialogue discourse is unbalanced. In monologue discourse, continuation and coordination relations account for the largest proportion; such relations guarantee the content fluency of monologue discourse and are its distinguishing characteristic. In dialogue discourse, fluency is weakened and the dependency relations between dialogue sentences are emphasized. Through these dependency relations, the trend of the dialogue can be analyzed in depth, and behavioral features such as turns in the dialogue process can be extracted.
(4) Wording difference analysis
In total, 36,227 words and 4,104 characters occur in the monologue discourse, while 6,117 words and 1,926 characters occur in the dialogue discourse.
There are obvious differences in wording between monologue and dialogue discourse. Constrained by its genres, the wording of monologue discourse is generally formal and official; colloquial words such as 'uh-huh', 'yes', and 'okay' rarely occur. Dialogue discourse is mostly used in casual settings, where more formal words such as 'comrade' or 'it is reported' are unlikely to occur.
Table 6: Word and character statistics of monologue and dialogue discourse (top 5)
The statistics in Table 6 show that the words occurring most often in monologue discourse are formal and official, while the words occurring in dialogue discourse are more casual and everyday. When the character is used as the statistical unit, however, this difference is obviously weakened: a significant portion of the highest-frequency characters coincide. This means that when character vectors are used to represent task data, the differences between the data are weakened. The wording differences between monologue and dialogue discourse are thus mainly reflected at the level of lexical expression.
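The following minimal sketch (an illustrative assumption; the segmentation details are not specified in this embodiment) shows how word-level and character-level statistics of the kind reported in Table 6 could be computed with the jieba segmenter:

    from collections import Counter
    import jieba

    def word_and_char_counts(texts):
        """Count word-level and character-level units for a list of Chinese texts."""
        words, chars = Counter(), Counter()
        for t in texts:
            words.update(jieba.lcut(t))   # word-level statistical units
            chars.update(t)               # character-level statistical units
        return words, chars

    # words.most_common(5) differs sharply between monologue and dialogue corpora,
    # while chars.most_common(5) overlaps heavily, consistent with Table 6.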
Step 2) includes:
There are large wording differences between monologue and dialogue discourse: monologue discourse features standardized wording and correct syntax, while dialogue content is casual and pays little attention to syntactic rules. In order to analyze the wording of the two discourse types in more depth, the following three classes of data are selected for matching analysis: Baidu data from Baidu Search and Baidu Zhidao; news data, i.e., web page news crawled from news portal websites; and microblog data crawled by topic from microblog websites.
Among these three classes of data, the Baidu data contain a large amount of netizen question-and-answer information, with plenty of personalized content and wide coverage, but relatively little attention to syntactic rules. The news data have a strong similarity to monologue discourse: the texts are long and clearly structured, and because of the needs of reporting, the wording is mostly standardized and syntactic rules are emphasized. The microblog topic data describe length-limited paragraph structures and are closer to monologue discourse. Because these three classes of data are easy to obtain, sufficient in volume, and representative to a certain extent, they can provide corresponding comparative distribution data for the discourse types of concern here. The scale of the data is about 4 GB for the Baidu data, about 13 GB for the news data, and about 7 GB for the microblog data; therefore, this embodiment uses these three classes of data as the training data for character/word embeddings.
The CBOW and Skip-gram models are the two models for training character/word embeddings provided by the Word2vec tool. The CBOW model predicts a word from its surrounding context; that is, the input of the CBOW model is the sum of the embeddings of the n context words around a word A, and the output is the embedding of word A itself. The Skip-gram model predicts the surrounding words from the word itself; that is, the input of the Skip-gram model is word A itself, and the output is the embeddings of the n words around A. The GloVe model incorporates the global statistical information of matrix factorization; incorporating global prior statistics can accelerate model training and control the relative weighting of words.
The character/word embeddings trained by these three models have different characteristics. Therefore, this embodiment trains multiple versions of character/word embeddings on the same training data with different models, so as to compare the specific influence of training data with different degrees of colloquialism and formality on different tasks.
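As an illustration of this training step, the following minimal sketch (an assumption, not the patent's actual pipeline; file names and parameters are placeholders) trains the CBOW and Skip-gram embedding versions with gensim:

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    corpus = LineSentence("baidu_segmented.txt")   # one pre-segmented sentence per line

    for name, sg in [("cbow", 0), ("skipgram", 1)]:
        model = Word2Vec(
            sentences=corpus,
            vector_size=200,   # the embodiment uses 200- and 300-dimensional embeddings
            window=5,          # number of surrounding context words
            min_count=5,
            sg=sg,             # 0 = CBOW, 1 = Skip-gram
            workers=4,
        )
        model.wv.save_word2vec_format(f"baidu_{name}_200d.txt")

    # GloVe embeddings would be trained separately (e.g. with the Stanford GloVe
    # toolkit), since gensim does not provide GloVe training.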
Based on the different character/word embedding files trained on the above text data, the overlap between the characters/words occurring in monologue and dialogue discourse and the characters/words covered by each data set's embeddings is compared first. The specific differences are shown in Table 7.
Table 7: Character/word overlap of monologue discourse, dialogue discourse, and each data set
As can be seen from Table 7, the coincidence rates between the characters/words of the different source data and those of the task data are not consistent. Because written discourse mostly uses common characters, the character coverage of the Baidu, news, and microblog data over both monologue and dialogue discourse essentially reaches 100%. At the word level, however, gaps among the three data sets appear. Among the words occurring in monologue and dialogue discourse, the overlap with the Baidu data is the largest, because the Baidu data cover the widest range of domains. The coincidence rates of monologue discourse with the Baidu, news, and microblog data are lower than those of dialogue discourse, because monologue discourse involves diverse genres and its content ranges from expository to narrative styles. Compared with the concentrated dialogue content of dialogue discourse, the vocabulary of monologue discourse is therefore more dispersed and less likely to coincide completely.
Because the data of the three sources differ, the compatibility between the character/word embeddings trained on them and the task data also differs. The purpose of using character/word embeddings is to represent characters/words as vectors of a specific dimension, in order to train models.
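The overlap comparison of Table 7 can be sketched as follows (an illustrative assumption, not the patent's code): count what fraction of a corpus's characters/words appears in an embedding file's vocabulary:

    def load_embedding_vocab(path):
        """Read the vocabulary of a word2vec-format text embedding file."""
        vocab = set()
        with open(path, encoding="utf-8") as f:
            next(f)                      # skip the "num_vectors dim" header line
            for line in f:
                vocab.add(line.split(" ", 1)[0])
        return vocab

    def coverage(corpus_tokens, emb_vocab):
        """Fraction of corpus characters/words also covered by the embeddings."""
        tokens = set(corpus_tokens)
        return len(tokens & emb_vocab) / len(tokens)

    # e.g. coverage(monologue_words, load_embedding_vocab("baidu_skipgram_200d.txt")),
    # where monologue_words is the word set counted in step 1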
This embodiment uses TensorFlow's TensorBoard to visualize the distribution of word embeddings. The comparative visualization in Fig. 1 shows that the word embedding distributions trained on the three sources, Baidu, news, and microblog, differ: in the Baidu data the vocabulary is widely and evenly distributed, while in the news and microblog distributions there are more scattered points and the vocabulary distribution is obviously uneven. The differences in vocabulary distribution are therefore explored further in the discourse analysis tasks.
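A minimal sketch of this visualization step, assuming the TensorFlow 2 Embedding Projector workflow (paths, vocabulary, and the embedding matrix below are placeholders):

    import os
    import numpy as np
    import tensorflow as tf
    from tensorboard.plugins import projector

    log_dir = "logs/embeddings"                       # illustrative path
    os.makedirs(log_dir, exist_ok=True)

    vocab = ["你好", "新闻", "微博"]                   # placeholder vocabulary
    vectors = np.random.rand(len(vocab), 200).astype("float32")  # placeholder matrix

    # write one word per line so the projector can label the points
    with open(os.path.join(log_dir, "metadata.tsv"), "w", encoding="utf-8") as f:
        f.write("\n".join(vocab))

    # checkpoint the embedding matrix and point the projector at it
    weights = tf.Variable(vectors)
    ckpt = tf.train.Checkpoint(embedding=weights)
    ckpt.save(os.path.join(log_dir, "embedding.ckpt"))

    config = projector.ProjectorConfig()
    emb = config.embeddings.add()
    emb.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
    emb.metadata_path = "metadata.tsv"
    projector.visualize_embeddings(log_dir, config)
    # then inspect with: tensorboard --logdir logs/embeddings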
Because the training data types of the character/word embeddings differ, the effects they produce on specific monologue and dialogue discourse tasks will also differ. Therefore, this embodiment subsequently carries out a task-driven comparison of monologue and dialogue discourse, from which their differences on specific tasks can be analyzed.
During the experiments, variables are strictly controlled; only the difference in character/word representation, and the adjustments made necessary by changes of representation dimension, remain. The experimental results can therefore adequately represent the influence on task results of the character/word embeddings trained by different models on data of different sources.
Step 3) designs the analysis tasks, including:
(1) Designing the analysis tasks for monologue discourse:
Two analysis tasks are selected for monologue discourse: a sentence group boundary recognition task and an inter-sentence relation analysis task.
The sentence group boundary recognition task is carried out on complete monologue discourse. Its processing goal is to accurately predict the boundary of each sentence group according to changes of the sentence group's gist, so as to segment a complete monologue discourse into several sentence groups.
The inter-sentence relation analysis task is carried out between adjacent sentences inside a sentence group. Its processing goal is, according to the coherence features of adjacent sentences in the sentence group, first to determine whether content coherence exists between the two, and then, for possible coherent sentence pairs, to further identify the possible connective relation between them.
Considering the current scale of the monologue discourse annotation repository, the two tasks were adjusted appropriately. The sentence group boundary recognition task builds its training and test repository on the existing annotated repository of 387 sentence groups, uses the paragraph data in the large-scale People's Daily corpus as weakly supervised data, and designs a CNN model to perform boundary classification tests on adjacent sentences. The inter-sentence relation analysis task uses Tsinghua Treebank data, selects adjacent clause pairs with coherence relations in complex sentences, and constructs a CNN model to automatically predict their possible connective relation (coherence, coordination, or other).
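As a concrete illustration, the following minimal sketch shows such a CNN classifier for adjacent clause pairs (an assumed architecture with placeholder layer sizes, not the patent's exact configuration):

    import tensorflow as tf
    from tensorflow.keras import layers

    MAX_LEN, VOCAB, DIM = 50, 30000, 200
    N_CLASSES = 3   # coherence, coordination, other

    def build_pair_cnn():
        emb = layers.Embedding(VOCAB, DIM)   # in practice initialized from pretrained embeddings
        conv = layers.Conv1D(128, 3, activation="relu")
        pool = layers.GlobalMaxPooling1D()

        a = layers.Input(shape=(MAX_LEN,), dtype="int32")   # first clause (word ids)
        b = layers.Input(shape=(MAX_LEN,), dtype="int32")   # second clause (word ids)
        fa, fb = pool(conv(emb(a))), pool(conv(emb(b)))     # shared convolutional features
        h = layers.Dense(64, activation="relu")(layers.concatenate([fa, fb]))
        out = layers.Dense(N_CLASSES, activation="softmax")(h)

        model = tf.keras.Model([a, b], out)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model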
(2) Designing the analysis tasks for dialogue discourse
Two analysis tasks are also selected for dialogue discourse: a dialogue act tag prediction task for each utterance, and a dependency relation recognition task for adjacency pairs. Because each dialogue in dialogue discourse is short and the length of dialogue segments is limited, obtaining semantic information from text of limited length is a huge challenge.
Dialogue act tags describe the communicative function of an utterance or function segment in dialogue discourse and mainly include major classes such as statement, inquiry, action, answer, and reaction. Because turn context information has a major influence on the discrimination of dialogue acts, different contexts need to be considered in computational modeling. Dependency relations describe the functional and feedback dependencies between adjacency pairs; their accurate determination requires the dialogue act tag prediction information of each utterance/function segment.
Considering the internal association of the dialogue act tag prediction and dependency relation recognition tasks, a multi-task model is used for analysis and training. By combining a CNN model with an attention mechanism, model parameters are trained for dialogue acts and dependency relations simultaneously.
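A minimal sketch of this multi-task idea follows (an assumption for illustration; the patent does not disclose exact layer configurations): a shared CNN encodes each utterance, attention over the surrounding turns supplies context, and two output heads predict the dialogue act tag and the dependency relation:

    import tensorflow as tf
    from tensorflow.keras import layers

    MAX_LEN, CTX, VOCAB, DIM = 30, 5, 30000, 200
    N_DA, N_DEP = 5, 3   # e.g. statement/inquiry/action/answer/reaction; functional/feedback/none

    def build_multitask_model():
        # input: the current utterance plus surrounding context turns, as word ids
        inp = layers.Input(shape=(CTX, MAX_LEN), dtype="int32")
        emb = layers.Embedding(VOCAB, DIM)(inp)

        # shared CNN extracts each utterance's textual features
        conv = layers.TimeDistributed(layers.Conv1D(128, 3, activation="relu"))(emb)
        utts = layers.TimeDistributed(layers.GlobalMaxPooling1D())(conv)   # (CTX, 128)

        # attention over the context turns yields a context-aware representation
        att = layers.Attention()([utts, utts])
        rep = layers.GlobalAveragePooling1D()(att)

        # two task heads trained simultaneously
        da = layers.Dense(N_DA, activation="softmax", name="dialogue_act")(rep)
        dep = layers.Dense(N_DEP, activation="softmax", name="dependency")(rep)

        model = tf.keras.Model(inp, [da, dep])
        model.compile(optimizer="adam",
                      loss={"dialogue_act": "sparse_categorical_crossentropy",
                            "dependency": "sparse_categorical_crossentropy"})
        return model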
Experimental result analysis of the method of the invention:
1) Analysis of monologue discourse task results
The word embeddings trained by the three models CBOW, Skip-gram, and GloVe differ, and the characteristics of the Baidu, news, and microblog data are also not identical. Therefore 9 sets of word embeddings were trained, each of dimension 200, and the influence of these 9 embedding sets on the clause relation analysis task was verified.
Table 8: Clause relation analysis results
Table 8 shows the influence of word embeddings of different sources and different training methods on the clause relation analysis task. The experimental results show that the word embeddings trained on the news data have the most obvious effect. The data in the Tsinghua Treebank contain various genres such as narration and exposition and are analyzed as part of the monologue discourse data; the news data and the treebank data are both long discourse with similar structure, and, owing to the characteristics of the news genre, their rigorous and standardized wording agrees well with monologue discourse. Because the microblog and Baidu data agree less well with monologue discourse, the word embeddings trained on those two data sets perform worse. Table 8 also shows that the CBOW model performs best on this task, which is related to its training method. Therefore, when selecting a model for word embedding training, the characteristics of the task data should be considered in order to obtain good experimental performance.
In the sentence group boundary recognition task, because the Baidu, news, and microblog data differ significantly in text length, this embodiment mainly verifies the influence on the task of the word embeddings trained on these three training data sets.
Table 9 clearly shows that word embeddings of different sources influence the sentence group analysis task differently. For the sentence group boundary recognition task, the evaluation criterion used in this embodiment is the fuzzy boundary consistency rate.
Table 9: Sentence group fuzzy boundary consistency rates
Table 9 compares the word embeddings trained on the Baidu, news, and microblog data under the fuzzy boundary consistency rate criterion. The word embeddings trained on the Baidu data perform better in sentence group analysis, because the Baidu data cover a wide range of domains and their vocabulary distribution is more balanced and extensive, giving the word embeddings better expressive ability in the sentence group boundary recognition task. In addition, the Baidu data outperform the other data under all three models, which also verifies that the breadth of the vocabulary distribution is especially important for sentence group boundary distribution; the Baidu data are therefore more helpful in this discourse analysis task. Meanwhile, the word embeddings trained by the GloVe model outperform the other two models overall, so the GloVe model is more suitable for word embedding training in the sentence group boundary analysis task.
2) Analysis of dialogue discourse task results
In the dialogue discourse analysis tasks, this embodiment uses a multi-task learning model that combines a CNN model and an attention mechanism: the CNN model is mainly responsible for the textual features of the dialogue, and the surrounding context is obtained through the processing of the attention mechanism, so the multi-task model can complete the dialogue discourse analysis tasks well. This embodiment first verifies the influence of different word embedding training methods on the performance of the DA tag prediction task. Using the Baidu data, which agree relatively well with informal everyday conversation corpora, the three models CBOW, Skip-gram, and GloVe are selected to train different character/word embedding vectors.
Because it has not been verified whether character embeddings or word embeddings better represent the semantics of length-limited utterances/function segments, and considering that embeddings of different dimensions may represent the original word meaning differently, each model trained both 300-dimensional and 200-dimensional character and word embeddings, which were compared in the multi-task model training of the dialogue act task. In total, 12 versions of character/word embedding representations were thus obtained on the Baidu data.
Table 10: DA prediction results based on different embeddings of the Baidu data
Table 10 shows the specific comparative analysis performance data. Among the three selected character/word embedding training models, the character/word embeddings trained by the Skip-gram model have a clear advantage in the DA tag prediction task. The experimental results show that the dimension of the character/word representation has a certain influence on DA tag prediction: the prediction performance of the 200-dimensional character/word embeddings is relatively much better, and the 300-dimensional representations increase the amount of computation. Therefore, good experimental results can be obtained in the DA tag prediction task using 200-dimensional character/word representations.
Due to the length limitation of utterances, in the Skip-gram and GloVe models, using character vectors for semantic representation improves the model's prediction results. In the experimental results of the embeddings trained by the CBOW model, however, character embeddings perform worse than word embeddings. This indicates that the CBOW model may be less suited to the DA tag prediction task than the Skip-gram and GloVe models. The reason may be that the characters used in dialogue discourse are mostly common everyday characters, and predicting a character from the surrounding input characters actually reduces the expressive ability of the character vectors.
To compare the influence on task performance of the character/word vectors trained on different data sources, this embodiment likewise carries out a comparative analysis using the character/word vectors trained on the news data and the microblog data.
Table 11: DA tag prediction results for different data sources
Table 11 compares the experimental performance in the DA tag prediction task of the character/word embeddings trained by the Skip-gram model on different data sources. As the table shows, the experimental results of the Baidu character/word embeddings, which agree better with the task data, are substantially better than those of the news and microblog embeddings. At this point this embodiment has compared the agreement of the three training data sources with the dialogue task data; the experimental results show that training character/word embedding representations on data of high compatibility helps the task more effectively. The experimental results also show that using character embeddings outperforms using word embeddings, because the text in dialogue discourse is of limited length, so character vectors can provide more effective information for model analysis.
In dialogue dependency relation analysis, this embodiment likewise uses the multi-task model to compare the experimental results of the character/word embeddings of different versions trained by different models.
Table 12: Dependency relation recognition results of different embeddings
As can be seen from Table 12, the results of the CBOW model remain unsatisfactory, and the Skip-gram model still outperforms the GloVe model. The high compatibility of the Baidu data with the task data in dialogue dependency relations still provides good embedding representations for the task. Among the multiple character/word embedding representations, the 300-dimensional word embeddings give the best experimental results, and in the overall experimental results word embeddings outperform character embeddings. The experimental results show that in the dialogue dependency relation analysis task the expressive ability of word embeddings exceeds that of character embeddings, because word representations can better embody the characteristics of the dialogues in dialogue discourse. Through the attention mechanism, the weight changes between word embeddings are more obvious, which also embodies the difference between the dialogue dependency relation recognition task and the DA recognition task.
In the dialogue dependency relation analysis task, 300-dimensional word embeddings exhibit good ability in the attention mechanism and, through the relations between words, embody the dependency relations between adjacency pairs well. Therefore, in the dependency relation analysis task of dialogue discourse, the word embeddings trained by the Skip-gram model achieve the better experimental results.
Table 13: Comparison of dependency relation recognition results for different data sources
Table 13 compares the performance differences in the dependency relation analysis task of the character/word embeddings trained on different data sources (using the Skip-gram model). In Table 13, the differences between the character/word embeddings of different sources are weakened. This is because in the dependency relation analysis task the attention mechanism of the model is computed according to the relations between words, which weakens the expressive ability of individual character/word embeddings; what is mainly used in dependency relation analysis is the relational representation between words. As a result, the relations of the character/word embeddings trained on different sources are similar, which weakens the differences between sources.
The word embedding results of the Baidu data are better than its character embedding results, because the character/word representations of the Baidu data agree better with the task data, and the expressive ability of its word embeddings is substantially better than that of its character embeddings. For the news and microblog data, however, character vector representations outperform word embedding representations, because the vocabulary agreement between the training data of these two sources and the task data is insufficient, while in character embedding training the differences between source data are weakened, so good experimental results can still be reached. It can thus be concluded that when vocabulary distributions differ, character embedding representations can compensate for wording representation differences to a certain extent.
Tables 8 to 13 show the experimental effects obtained by different character/word embedding representations in the four analysis tasks of monologue and dialogue discourse. In both monologue and dialogue discourse, although embeddings are trained on the training data of all three sources, the effects obtained by the embeddings in the experimental tasks differ. Selecting the embedding source and training model according to the demands of the specific task therefore plays a vital role.
The present invention trains with three methods in total, the two Word2vec models, CBOW and Skip-gram, and GloVe, generating word embedding representations of different sources; on this basis it studies the influence of character/word representation on discourse difference analysis, further verifying the influence of different discourse data characteristics on the differences in discourse analysis.
The present invention first performs multi-level, multi-angle difference analysis on monologue and dialogue discourse through statistical analysis, and can find differences between monologue and dialogue discourse ranging from structure to the distribution of specific functions. These differences can be exhibited at levels such as article structure, sentence length, and vocabulary usage.
Furthermore, the present invention verifies word embeddings in task-driven deep learning methods. Different character/word embeddings are trained by different models on data of different sources. Comparative analysis of the experimental results shows that data of different sources and different model training have significantly different effects on the different tasks of monologue and dialogue discourse in written text. In model training, data of high compatibility with the task data should be selected as far as possible for character/word embedding training, and the embeddings and models required by different tasks should be chosen carefully in combination with the task requirements.
Through word embedding-based multi-level difference analysis, the present invention analyzes the differences between the monologue discourse and the dialogue discourse in written discourse. For these differences, follow-up research can better design models and select data features according to their characteristics, so as to study analysis methods targeted at the differing features of discourse.
The above embodiments only express implementations of the present invention, and their description is relatively specific and detailed, but they cannot therefore be understood as limiting the patent scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the present invention, and these all belong to the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (9)

1. A multi-level difference analysis method for written discourse based on word embeddings, characterized by comprising: performing difference analysis on monologue discourse and dialogue discourse.
2. The multi-level difference analysis method for written discourse according to claim 1, characterized in that the method comprises:
Step 1) performing difference analysis on monologue discourse and dialogue discourse;
Step 2) analyzing the correlation between different character/word embeddings and discourse wording.
3. The multi-level difference analysis method for written discourse according to any one of claims 1 to 2, characterized in that step 1) comprises: structural difference analysis, relation difference analysis, and wording difference analysis.
4. The multi-level difference analysis method for written discourse according to any one of claims 1 to 3, characterized in that step 2) comprises:
comparing the overlap between the characters/words occurring in monologue and dialogue discourse and the characters/words covered by each data set's embeddings;
using character/word embeddings to represent characters/words as vectors of a specific dimension, and training models.
5. The multi-level difference analysis method for written discourse according to any one of claims 1 to 4, characterized in that in step 2), the CBOW and Skip-gram models are used, and TensorFlow's TensorBoard is used to visualize the word embedding distributions.
6. The multi-level difference analysis method for written discourse according to any one of claims 1 to 5, characterized in that the method further comprises: step 3) designing analysis tasks;
for monologue discourse, selecting a sentence group boundary recognition task and an inter-sentence relation analysis task;
the sentence group boundary recognition task is carried out on complete monologue discourse, and its processing goal is to accurately predict the boundary of each sentence group according to changes of the sentence group's gist, so as to segment a complete monologue discourse into several sentence groups;
the inter-sentence relation analysis task is carried out between adjacent sentences inside a sentence group, and its processing goal is, according to the coherence features of adjacent sentences in the sentence group, first to determine whether content coherence exists between the two, and then, for possible coherent sentence pairs, to further identify the possible connective relation between them;
the inter-sentence relation analysis task uses Tsinghua Treebank data, selects adjacent clause pairs with coherence relations in complex sentences, and constructs a CNN model to automatically predict their possible connective relations;
for dialogue discourse, selecting a dialogue act tag prediction task for each utterance and a dependency relation recognition task for adjacency pairs;
a multi-task model is used for analysis and training, and by combining a CNN model with an attention mechanism, model parameters are trained for dialogue acts and dependency relations simultaneously.
7. The multi-level difference analysis method for written discourse according to any one of claims 1 to 6, characterized in that in the dialogue discourse analysis task, a multi-task learning model is used; the multi-task model combines a CNN model and an attention mechanism, the CNN model is mainly responsible for the textual features of the dialogue, and the surrounding context is obtained through the processing of the attention mechanism.
8. The multi-level difference analysis method for written discourse according to any one of claims 1 to 7, characterized in that in step 2):
the following three classes of data are selected for matching analysis: Baidu data from Baidu Search and Baidu Zhidao, news data crawled from news portal websites, and microblog data crawled by topic from microblog websites; these three classes of data are used as the training data for character/word embeddings;
multiple versions of character/word embeddings are trained on the same training data with different models, so as to compare the specific influence of training data with different degrees of colloquialism and formality on different tasks;
based on the different character/word embedding files trained on the text data, the overlap between the characters/words occurring in monologue and dialogue discourse and the characters/words covered by each data set's embeddings is compared first;
character/word embeddings are used to represent characters/words as vectors of a specific dimension, and models are trained.
9. The multi-level difference analysis method for written discourse according to any one of claims 1 to 8, characterized in that the CBOW, Skip-gram, and GloVe models are selected to train different character/word embedding vectors.
CN201910236193.3A 2019-03-27 2019-03-27 Multi-level difference analysis method for written discourse based on word embeddings Pending CN109977407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910236193.3A CN109977407A (en) Multi-level difference analysis method for written discourse based on word embeddings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910236193.3A CN109977407A (en) Multi-level difference analysis method for written discourse based on word embeddings

Publications (1)

Publication Number Publication Date
CN109977407A (en) 2019-07-05

Family

ID=67080807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910236193.3A Pending Multi-level difference analysis method for written discourse based on word embeddings

Country Status (1)

Country Link
CN (1) CN109977407A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326212A * 2016-08-26 2017-01-11 北京理工大学 Method for analyzing implicit discourse relations based on hierarchical deep semantics
US20180329880A1 (en) * 2017-05-10 2018-11-15 Oracle International Corporation Enabling rhetorical analysis via the use of communicative discourse trees
CN107679042A (en) * 2017-11-15 2018-02-09 北京灵伴即时智能科技有限公司 Multi-layer dialogue analysis method for intelligent voice dialogue systems
CN108959447A (en) * 2018-06-13 2018-12-07 北京信息科技大学 Psychological distance analysis method for interlocutors in dialogue discourse

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张学敬 et al., "Exploring multi-level differences of written discourse based on word embeddings", 《计算机工程与应用》 (Computer Engineering and Applications) *
魏天珂 et al., "Analysis of the difficulty of coherence annotation for Chinese discourse", 《计算机应用研究》 (Application Research of Computers) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110838287A (en) * 2019-10-16 2020-02-25 中国第一汽车股份有限公司 Corpus processing method and device of chat robot in vehicle-mounted environment and storage medium
CN113553830A * 2021-08-11 2021-10-26 桂林电子科技大学 Graph-based discourse coherence analysis method for English text sentences
CN113553830B * 2021-08-11 2023-01-03 桂林电子科技大学 Graph-based discourse coherence analysis method for English text sentences

Similar Documents

Publication Publication Date Title
Rakshit et al. Debbie, the debate bot of the future
CN112613305B (en) Chinese event extraction method based on cyclic neural network
Salle et al. Studying the effectiveness of conversational search refinement through user simulation
Hung et al. Towards a method for evaluating naturalness in conversational dialog systems
JP2002108856A (en) System and method for analyzing document using language conversation model
US20200364301A1 (en) Affect-enriched vector representation of words for use in machine-learning models
Basile et al. Sentiment polarity classification at evalita: Lessons learned and open challenges
Moreno-Jiménez et al. A new e-learning tool for cognitive democracies in the Knowledge Society
Liu et al. Empathetic dialogue generation with pre-trained RoBERTa-GPT2 and external knowledge
Ma et al. Implicit discourse relation identification for open-domain dialogues
CN109977407A (en) A kind of multi-level difference analysis method of Written Texts of word-based insertion
Foltz Quantitative cognitive models of text and discourse processing
Zharikova et al. DeepPavlov dream: platform for building generative AI assistants
Parupalli et al. Enrichment of ontosensenet: Adding a sense-annotated telugu lexicon
Ahmed et al. Design and implementation of a chatbot for Kurdish language speakers using Chatfuel platform
Marques-Lucena et al. Framework for customers’ sentiment analysis
Westbury Prenominal adjective order is such a fat big deal because adjectives are ordered by likely need
Zhang et al. Exploring aspect-based sentiment quadruple extraction with implicit aspects, opinions, and ChatGPT: a comprehensive survey
Satyanarayana et al. A study of artificial social intelligence in conversational agents
Rzepka et al. Emotional information retrieval for a dialogue agent
Bruno Conversational interfaces supporting the search for content within the juridical field
Ban et al. Multimodal aspect-level sentiment analysis based on deep neural networks
Lu New Media Public Relations Regulation Strategy Model Based on Generative Confrontation Network
Huda et al. Arabic part of speech (pos) tagging analysis using bee colony optimization (BCO) algorithm on Quran corpus
Karim Barznji et al. Design and Implementation of a Chatbot for Kurdish Language Speakers Using Chatfuel Platform.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190705