CN112487806B - English text concept understanding method

English text concept understanding method

Info

Publication number: CN112487806B
Authority: CN (China)
Prior art keywords: concept, word, text, candidate, nouns
Legal status: Active (granted)
Application number: CN202011382136.5A
Other languages: Chinese (zh)
Other versions: CN112487806A
Inventors: 李俊 (Li Jun), 姜兰兰 (Jiang Lanlan), 黄桂敏 (Huang Guimin)
Current Assignee: Guilin University of Electronic Technology
Original Assignee: Guilin University of Electronic Technology
Application filed by Guilin University of Electronic Technology
Priority to CN202011382136.5A
Publication of CN112487806A; application granted; publication of CN112487806B

Classifications

    • G06F40/00: Handling natural language data (G: Physics; G06: Computing; G06F: Electric digital data processing)
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/194: Calculation of difference between files
    • G06F40/216: Parsing using statistical methods
    • G06F40/30: Semantic analysis


Abstract

The invention discloses an English text concept understanding method, built on an understanding model composed of four sequentially connected modules: an English text understanding preprocessing module, an English text keyword concept semantic feature extraction module, an English text keyword and concept semantic dependency extraction module, and a candidate answer selection module. After an English text and the questions related to it are processed by the understanding method, concept-level answers to the questions are obtained. The method solves the problem of English text concept understanding, and its answers are more accurate than those of traditional English text understanding methods.

Description

English text concept understanding method
Technical Field
The invention relates to natural language processing technology, in particular to an English text concept understanding method; it applies only to English text, not to Chinese text.
Background
In machine-automated English text understanding, a passage of English text and a number of questions related to it are input, and the machine relies on its own algorithms to find the answers to the questions within the input text. Traditional English text understanding methods fall into two main categories: text-question semantic analysis methods and text-question vocabulary matching methods. Semantic analysis methods rely mainly on predefined rule templates and manually designed linguistic features to learn the relation between the text and the question; they require large amounts of manually annotated data, which causes sparse semantic features, and they are suitable only for certain limited domains. Vocabulary matching methods compute the semantic similarity between keywords in the text and in the question and select the most similar words or phrases as answers; because they only match similarity between question words and text words, they struggle to capture the precise sense of ambiguous words in the English text, which leads to inaccurate answer selection in reading comprehension. Addressing these problems, the invention proposes an English text concept understanding method that acquires concept-level semantic information of the English text by mining the deep concept semantic features of its keywords, and finally obtains more accurate answers through the concept semantic dependency relations between words in the English text and in the questions.
Disclosure of Invention
The overall processing flow of the English text concept understanding method is shown in FIG. 1. It comprises an English text understanding preprocessing module, an English text keyword concept semantic feature extraction module, an English text keyword and concept semantic dependency extraction module, and a candidate answer selection module.
The processing flow of the English text understanding preprocessing module is as follows: first, the English text to be read and the questions are input, and each is tokenized, stripped of stop words, and lowercased; the text to be read is split into sentences, forming a text sequence of several sentences. Second, word segmentation, phrase segmentation, and part-of-speech tagging are applied to the text sequence output by the first step, yielding sequences of words and phrases for the English text to be read and for the questions. Third, the lists of nouns and noun phrases, verbs, and adjectives are output for the sentence sequences of the English text to be read and for the question sentence sequences, respectively.
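A minimal sketch of this preprocessing stage in Python, assuming NLTK as the tokenizer, stop-word list, and part-of-speech tagger (the patent does not name specific tools):

```python
# Sketch of the preprocessing module; NLTK is an assumed stand-in toolkit.
import nltk
from nltk.corpus import stopwords

def preprocess(text):
    """Split into sentences, tokenize, lowercase, drop stop words, POS-tag."""
    stops = set(stopwords.words("english"))
    tagged_sentences = []
    for sent in nltk.sent_tokenize(text):
        tokens = [t.lower() for t in nltk.word_tokenize(sent) if t.isalnum()]
        tokens = [t for t in tokens if t not in stops]
        tagged_sentences.append(nltk.pos_tag(tokens))
    return tagged_sentences

def pos_lists(tagged_sentences):
    """Collect the noun, verb, and adjective lists the module outputs."""
    nouns, verbs, adjectives = [], [], []
    for sent in tagged_sentences:
        for word, tag in sent:
            if tag.startswith("NN"):
                nouns.append(word)
            elif tag.startswith("VB"):
                verbs.append(word)
            elif tag.startswith("JJ"):
                adjectives.append(word)
    return nouns, verbs, adjectives
```

Both the text to be read and each question would pass through the same pipeline.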
The processing flow of the English text keyword concept semantic feature extraction module is as follows: first, the preprocessing results of the English text to be read and of the questions are input from the English text preprocessing module, and the nouns or noun phrases in the English text to be read are selected. Second, the nouns or noun phrases selected in the first step are represented as word vectors using a pre-trained reading comprehension data set. Third, the cosine similarity between each noun or noun phrase in the questions and each selected noun or noun phrase in the English text to be read is computed; the results are sorted in descending order, and the top five are selected as candidate key nouns or noun phrases. Fourth, the co-occurrence probability of each candidate key noun or noun phrase with each of its candidate concepts is computed; if every probability is zero, the fifth step is executed, otherwise the highest-probability result is selected as the concept the candidate key noun or noun phrase belongs to. Fifth, if the co-occurrence probability with every candidate concept is zero, the current noun or noun phrase itself is used as its concept. Sixth, the importance of each selected keyword is computed: the weight coefficients between the current keyword and its context words are calculated and then summed with weights, yielding the final importance score of the current keyword.
The processing flow of the English text keyword and concept semantic dependency extraction module is as follows: first, the word vector representations of the candidate key nouns or noun phrases are input. Second, the concept representations of the candidate key nouns or noun phrases are input. Third, the semantic dependency relations among the candidate key nouns or noun phrases are extracted using a pre-trained semantic dependency set. Fourth, the concept dependency relations among the candidate key nouns or noun phrases are extracted using a pre-trained concept dependency set. Fifth, the cosine similarity between each candidate's semantic dependency and concept dependency is computed; the results are sorted in descending order, and the most similar result is selected as the current keyword and its concept semantic dependency.
The processing flow of the candidate answer selection module is as follows: first, the concept representations of the candidate key nouns or noun phrases are input. Second, the selected keywords and their concept semantic dependencies are input. Third, a concept semantic graph model is built, with the concept representations of the candidate key nouns or phrases as nodes and the selected keywords' concept semantic dependencies as edges. Fourth, the Euclidean distance between each node vector and the weighted average vector of all nodes in the concept semantic graph model is computed, and the probability distribution over these distances provides each node's weight value. Fifth, the node with the highest weight value is selected as the final answer.
The definition of the invention is as follows:
1. word part of speech tagging structure
Part-of-speech tagging in the invention tags the part of speech of each word in the text to be read and in the questions, chiefly nouns, verbs, and adjectives; the tagging format is as follows:
word₁ [#part-of-speech₁* #part-of-speech₂* #part-of-speech₃* ……]
word₂ [#part-of-speech₁* #part-of-speech₂* #part-of-speech₃* ……]
……
wordₙ [#part-of-speech₁* #part-of-speech₂* #part-of-speech₃* ……]
2. Word segmentation and phrase segmentation structure
Word segmentation and phrase segmentation in the invention split out the nouns or noun phrases in the text to be read and in the questions; the segmentation format is as follows:
noun or noun phrase₁  segmentation mark₁
noun or noun phrase₂  segmentation mark₂
……
noun or noun phraseₙ  segmentation markₙ
3. Concept structure to which nouns or noun phrases belong
In general, the semantic concepts expressed by the same noun in different texts are not identical; for example, "apple" can express the concept "fruit" and also the concept "company". Assigning concepts to nouns or noun phrases in the invention means dividing the concepts of the nouns or noun phrases in the text to be read and in the questions, thereby ensuring the accuracy of the semantic concept of the current noun or noun phrase. The structure is as follows:
noun or noun phrase₁ [possible concept₁, possible concept₂, ……, possible conceptₙ]
noun or noun phrase₂ [possible concept₁, possible concept₂, ……, possible conceptₙ]
……
noun or noun phraseₙ [possible concept₁, possible concept₂, ……, possible conceptₙ]
4. Keyword and concept semantic dependency structure thereof
Besides understanding the keyword information in the text, the semantic dependency relations among keywords must also be determined, since different semantic dependencies usually express different meanings of the text. The keywords and their concept semantic dependencies in the invention refer to the semantic dependency relations extracted and determined between nouns or noun phrases in the English text to be read and in the questions; the structure is as follows:
[keyword₁ dependency₁,₂ keyword₂]
[keyword₁ dependency₁,₃ keyword₃]
[keyword₁ dependency₁,₄ keyword₄]
……
[keyword₂ dependency₂,₃ keyword₃]
[keyword₂ dependency₂,₄ keyword₄]
……
[keywordₙ dependencyₙ,ₙ₊₁ keywordₙ₊₁]
[keywordₙ dependencyₙ,ₙ₊₂ keywordₙ₊₂]
[concept₁ dependency₁,₂ concept₂]
[concept₁ dependency₁,₃ concept₃]
[concept₁ dependency₁,₄ concept₄]
……
[concept₂ dependency₂,₃ concept₃]
[concept₂ dependency₂,₄ concept₄]
……
[conceptₙ dependencyₙ,ₙ₊₁ conceptₙ₊₁]
[conceptₙ dependencyₙ,ₙ₊₂ conceptₙ₊₂]
……
5. Directed edge structure between key words
With the keywords as nodes and the inter-word weight values as edges, a graph model is formed; the edges are directed because the weight value from word a to word b differs from the weight value from word b to word a. The structure is as follows:
[keyword₁ directed edge weight₁,₂ keyword₂]  [keyword₂ directed edge weight₂,₁ keyword₁]
[keyword₁ directed edge weight₁,₃ keyword₃]  [keyword₃ directed edge weight₃,₁ keyword₁]
……
[keyword₁ directed edge weight₁,ₙ keywordₙ]  [keywordₙ directed edge weightₙ,₁ keyword₁]
[keyword₂ directed edge weight₂,₃ keyword₃]  [keyword₃ directed edge weight₃,₂ keyword₂]
[keyword₂ directed edge weight₂,₄ keyword₄]  [keyword₄ directed edge weight₄,₂ keyword₂]
……
[keyword₂ directed edge weight₂,ₙ keywordₙ]  [keywordₙ directed edge weightₙ,₂ keyword₂]
[keywordₙ directed edge weightₙ,ₙ₊₁ keywordₙ₊₁]  [keywordₙ₊₁ directed edge weightₙ₊₁,ₙ keywordₙ]
[keywordₙ directed edge weightₙ,ₙ₊₂ keywordₙ₊₂]  [keywordₙ₊₂ directed edge weightₙ₊₂,ₙ keywordₙ]
……
[keywordₙ directed edge weightₙ,₂ₙ keyword₂ₙ]  [keyword₂ₙ directed edge weight₂ₙ,ₙ keywordₙ].
6. Certain concept calculation formula of noun or noun phrase
To determine the particular concept a noun or noun phrase belongs to in the current text, the co-occurrence relation between the current noun or noun phrase and its candidate concepts is used, computed with the following formula:
$$P(c_j \mid w) = \frac{N(w, c_j)}{\sum_{k=1}^{n} N(w, c_k)} \qquad (1)$$

In formula (1), N(w, c) is the number of co-occurrences of the noun or noun phrase w with concept c, so the probability of w belonging to concept c_j is its co-occurrence count with c_j divided by the sum of its co-occurrence counts with all n possible concepts.
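As an illustration, a small Python sketch of formula (1); the co-occurrence counts would come from the training text set, and the dictionary of counts shown here is hypothetical:

```python
# Sketch of formula (1): P(concept | word) as the word-concept co-occurrence
# count divided by the word's total co-occurrence count over all candidate
# concepts. The counts dictionary is a hypothetical stand-in for corpus counts.
def concept_probability(word, concepts, cooccur_counts):
    """cooccur_counts maps (word, concept) pairs to co-occurrence counts."""
    total = sum(cooccur_counts.get((word, c), 0) for c in concepts)
    if total == 0:
        # Fallback of the module's fifth step: the word is its own concept.
        return {word: 1.0}
    return {c: cooccur_counts.get((word, c), 0) / total for c in concepts}

counts = {("Sam", "merchant"): 40, ("Sam", "entrepreneurs"): 38, ("Sam", "president"): 10}
print(concept_probability("Sam", ["merchant", "entrepreneurs", "president"], counts))
```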
7. formula for calculating semantic similarity of nouns or noun phrases in English text to be read and questions
$$\mathrm{sim}(w_i, w_j) = \frac{\vec{v}_i \cdot \vec{v}_j}{\lVert \vec{v}_i \rVert \, \lVert \vec{v}_j \rVert} \qquad (2)$$
Formula (2) calculates the similarity between a noun or noun phrase in the question and a noun or phrase in the text to be read; the word vectors are obtained through training.
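A sketch of formula (2) together with the module's top-five candidate selection, assuming the 200-dimensional word vectors are NumPy arrays from a pre-trained embedding model:

```python
# Sketch of formula (2) and the extraction module's third step: rank text
# nouns by cosine similarity to the question nouns and keep the top 5.
import numpy as np

def cosine_similarity(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def top_candidates(question_vecs, text_vecs, k=5):
    """question_vecs, text_vecs: dicts mapping words to embedding vectors.
    Scoring each text word by its best question-word match is an assumption."""
    scores = {}
    for q_vec in question_vecs.values():
        for t_word, t_vec in text_vecs.items():
            s = cosine_similarity(q_vec, t_vec)
            scores[t_word] = max(scores.get(t_word, -1.0), s)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```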
8. Weight coefficient calculation formula between current word and context word thereof
$$\alpha_{ij} = \frac{\mathrm{rel}(w_i, w_j)}{\sum_{k=1}^{n} \mathrm{rel}(w_i, w_k)} \qquad (3)$$
In formula (3), the numerator represents the correlation between the current term i and the term j in its context, and the denominator represents the sum of the correlations between the current term i and the n terms in its context.
9. Formula for calculating importance degree of current words or phrases
With the weight coefficient between the current word i and its context word j obtained from formula (3), the importance score of the current word or phrase in the text is obtained by weighted summation; the calculation formula is as follows:
$$\mathrm{score}(w_i) = \sum_{j=1}^{n} \alpha_{ij}\, \mathrm{rel}(w_i, w_j) \qquad (4)$$
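A combined sketch of formulas (3) and (4), assuming cosine similarity as the correlation measure rel(·,·), which the patent leaves abstract:

```python
# Sketch of formulas (3) and (4): normalize the correlations of word i with
# its context words into weight coefficients, then take their weighted sum
# as the importance score.
import numpy as np

def importance_score(current_vec, context_vecs):
    """current_vec: embedding of word i; context_vecs: list of context embeddings."""
    rels = np.array([
        float(np.dot(current_vec, c) /
              (np.linalg.norm(current_vec) * np.linalg.norm(c)))
        for c in context_vecs
    ])
    alphas = rels / rels.sum()            # formula (3): weight coefficients
    return float(np.dot(alphas, rels))    # formula (4): weighted summation
```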
10. directional edge weight value calculation formula among key words
The weight value of a candidate keyword in the current graph model is its proportion of the sum of the Euclidean distances between the candidate keyword and all other adjacent node words; the calculation formula is as follows:
$$w_{ij} = \frac{d(\vec{v}_i, \vec{v}_j)}{\sum_{k=1}^{n} d(\vec{v}_i, \vec{v}_k)} \qquad (5)$$

where d(·,·) is the Euclidean distance between node vectors.
11. word weight value normalization processing formula
After the word weight values are obtained, each word's normalized score is computed through normalization, and the final answer word is selected after sorting in descending order. The normalized score of a word in the English text is the ratio of the word's weight value in the current graph model to the sum of the weight values of all words in the current graph model; the calculation formula is as follows:
$$S(w_i) = \frac{W(w_i)}{\sum_{k=1}^{n} W(w_k)} \qquad (6)$$
In formula (6), the weight value of word i in the current graph model is given by formula (5).
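A sketch of formulas (5) and (6) over a matrix of concept node vectors; aggregating incoming edge weights into a per-node weight is an assumption where the patent is not explicit:

```python
# Sketch of formulas (5) and (6): directed edge weights from row-normalized
# Euclidean distances, then normalized per-node scores.
import numpy as np

def directed_edge_weights(node_vecs):
    """node_vecs: (n, d) array. w[i, j] = d(i, j) / sum_k d(i, k); the
    row-wise denominators make the weights asymmetric, hence directed."""
    d = np.linalg.norm(node_vecs[:, None, :] - node_vecs[None, :, :], axis=-1)
    return d / d.sum(axis=1, keepdims=True)       # formula (5)

def normalized_node_scores(edge_weights):
    """Formula (6): each node's weight over the sum of all node weights;
    a node's raw weight is taken as its aggregated incoming edge weight."""
    node_w = edge_weights.sum(axis=0)
    return node_w / node_w.sum()
```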
As shown in FIG. 2, the processing flow of the English text understanding preprocessing module is as follows:
P201 begins;
P202 reads in the English text to be read and the questions;
P203 separates the text to be read from the questions using markers;
P204 removes stop words from the text to be read and the questions;
P205 lowercases the words in the text to be read and the questions;
P206 splits the text to be read and the questions into sentences, forming sentence sequences;
P207 performs word segmentation and phrase segmentation on the text to be read and the questions;
P208 tags the parts of speech of the segmented text sequence and outputs the lists of nouns or noun phrases, verbs, and adjectives in the text to be read;
P209 tags the parts of speech of the segmented question sequence and outputs the lists of nouns or noun phrases, verbs, and adjectives in the questions;
P210 counts the total number of words in the segmented text and question sequences;
P211 groups the segmented text sequence, one group per 20 words, padding groups of fewer than 20 words with NULL (see the sketch after this list);
P212 groups the segmented question sequence, which is generally shorter than 20 words, padding it with NULL;
P213 ends.
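A sketch of the grouping performed in steps P211 and P212, padding with the literal string NULL:

```python
# Sketch of P211-P212: split a token sequence into fixed groups of 20 words,
# padding the last (or only) group with "NULL".
def group_tokens(tokens, size=20, pad="NULL"):
    groups = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    if groups and len(groups[-1]) < size:
        groups[-1] += [pad] * (size - len(groups[-1]))
    return groups

# A question shorter than 20 words becomes a single NULL-padded group:
print(group_tokens(["where", "was", "sam", "born"]))
```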
As shown in FIG. 3, the processing flow of the English text keyword concept semantic feature extraction module is as follows:
P301 begins;
P302 reads the segmented text and question sequences;
P303 computes distributed word vectors for the words in the text to be read and in the questions, generating 200-dimensional vector representations;
P304 uses formula (2) to compute the cosine similarity between nouns or noun phrases in the questions and those in the English text to be read;
P305 sorts the computed cosine similarities in descending order and selects the top five results as candidate keywords or phrases of the text related to the questions;
P306 uses formula (1) to compute the co-occurrence probability of each candidate keyword or phrase with its candidate concepts;
P307 checks whether the co-occurrence probability of the keyword and its candidate concepts is zero; if so, P308 is executed, otherwise P309 is executed (see the sketch after this list);
P308 takes the current candidate keyword or phrase itself as its concept, with its 200-dimensional word vector as its concept representation;
P309 sorts the co-occurrence probabilities of the current candidate keyword with its possible concepts in descending order and determines the concept it belongs to;
P310 vectorizes the determined concept of each keyword, generating a 200-dimensional vector representation;
P311 uses formula (3) to compute the weight coefficients of the current concept within its context;
P312 uses formula (4) to compute the importance score of the current concept within its context;
P313 sorts the importance scores of the current concept within its context in descending order, yielding the semantic features of the current candidate keyword concept;
P314 ends.
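A sketch of the concept decision in steps P306 through P309, reusing the concept_probability function from the formula (1) sketch above:

```python
# Sketch of P306-P309: pick the highest-probability concept; when every
# co-occurrence count is zero, concept_probability (defined above) already
# falls back to the keyword itself, which covers P308.
def select_concept(keyword, candidate_concepts, cooccur_counts):
    probs = concept_probability(keyword, candidate_concepts, cooccur_counts)
    return max(probs, key=probs.get)
```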
As shown in FIG. 4, the processing flow of the English text keyword and concept semantic dependency extraction module is as follows:
P401 begins;
P402 reads the word vector representations of the candidate key nouns or noun phrases;
P403 reads the concept representations of the candidate key nouns or noun phrases;
P404 matches the concept representations against a pre-trained concept semantic dependency set and selects the two top-ranked candidate concept dependencies;
P405 position-encodes the candidate concept dependencies, i.e., computes the positional distance between each concept dependency and its concept pair;
P406 fuses the concept representations of the candidate key nouns or phrases with the concept dependency position codes and feeds the fused vectors into a convolutional neural network;
P407 fuses the word vector representations of the candidate key nouns or noun phrases with the concept dependency position codes and feeds the fused vectors into another convolutional neural network;
P408 applies convolution-layer computation to the input vectors of P406 and P407, which share network parameters;
P409 pools the word vectors and concept vectors produced by the convolutions of P408;
P410 concatenates the pooling results of P409;
P411 classifies the concatenated result of P410 with a softmax function to obtain the final concept dependency result (a sketch of P406-P411 follows this list);
P412 ends.
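A sketch of the two-branch convolutional classifier of steps P406 through P411, written with PyTorch as an assumed framework; a single shared convolution layer stands in for the three-layer structure described in the embodiment, and every dimension beyond the stated 200-dimensional vectors is illustrative:

```python
# Sketch of P406-P411: two input branches (concept stream, word stream) that
# share one convolution (P408), are average-pooled (P409), concatenated
# (P410), and classified with softmax (P411).
import torch
import torch.nn as nn

class DependencyClassifier(nn.Module):
    def __init__(self, dim=200, pos_dim=20, filters=64, n_relations=10):
        super().__init__()
        self.conv = nn.Conv1d(dim + pos_dim, filters, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)      # average pooling (P409)
        self.fc = nn.Linear(2 * filters, n_relations)

    def forward(self, concept_seq, word_seq):
        # Each input: (batch, dim + pos_dim, seq_len); the position codes of
        # P405 are assumed to be concatenated onto the vectors beforehand.
        h1 = self.pool(torch.relu(self.conv(concept_seq))).squeeze(-1)
        h2 = self.pool(torch.relu(self.conv(word_seq))).squeeze(-1)   # shared conv
        logits = self.fc(torch.cat([h1, h2], dim=1))                  # P410
        return torch.softmax(logits, dim=1)                           # P411
```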
As shown in FIG. 5, the processing flow of the candidate answer selection module is as follows:
P501 begins;
P502 inputs the concept representations of the candidate key nouns or noun phrases;
P503 inputs the selected keywords and their concept semantic dependencies;
P504 builds a concept semantic graph model with the concept representations of the candidate key nouns or phrases as nodes and the selected keywords and their concept semantic dependencies as edges;
P505 uses formula (5) to compute the Euclidean distance between any two nodes in the concept semantic graph model;
P506 uses formula (6) to compute the directed edge weight values between nodes;
P507 sorts the weight values of all nodes in descending order;
P508 selects the node with the maximum weight value and takes its candidate keyword and concept as the final answer (see the sketch after this list);
P509 ends.
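A sketch tying the selection steps together, reusing the formula (5) and (6) functions defined above:

```python
# Sketch of P504-P508: score the concept nodes and return the candidate
# keyword of the highest-weighted node as the answer.
import numpy as np

def select_answer(keywords, concept_vecs):
    """keywords: list of candidate keywords; concept_vecs: (n, 200) array
    of their concept representations, aligned by index."""
    w = directed_edge_weights(concept_vecs)    # formula (5), defined above
    scores = normalized_node_scores(w)         # formula (6), defined above
    return keywords[int(np.argmax(scores))]
```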
The method solves the problem of English text concept understanding, and its answer results are more accurate than those of traditional English text understanding methods. After an English text and the questions related to it are processed by the understanding method, the concept-level answers to the questions are finally obtained.
Drawings
FIG. 1 is a general processing flow diagram of the method of the invention;
FIG. 2 is a processing flow chart of the English text preprocessing module of the method of the invention;
FIG. 3 is a processing flow chart of the English text keyword concept semantic feature extraction module of the method of the invention;
FIG. 4 is a processing flow chart of the English text keyword and concept semantic dependency extraction module of the method of the invention;
FIG. 5 is a processing flow chart of the candidate answer selection module of the method of the invention.
Detailed Description
The specific implementation of the English text concept understanding method proceeds in the following steps.
First step: execute the "English text preprocessing module".
The English text input in this embodiment of the invention is taken from a standard reading comprehension passage, its questions, and its answers in the Stanford reading comprehension data set. The English text content and questions are as follows:
The English text to be read is as follows:
On June 14, 1946, Sam was born in New York City. After graduating from the military school in 1964, Sam entered the Wharton School of the University of Pennsylvania. In college, Sam carefully learned new knowledge in the business field and cultivated a smart business savvy. In college, Sam entered a real estate company founded by his father. His father's business secrets taught Sam more experience. When he was a senior, he wanted to make a breakthrough in the business world. From time to time, he went abroad to inspect the latest and future economic trends, and deeply realized that the most important corporate business strategy today is "marketing." In 1999, Sam was again active in investment activities in the real estate, casino, entertainment, sports and transportation sectors. His assets have exceeded $3 billion.
The questions and answers are as follows:
Where was Sam born?
Ground Truth Answers: [New York] [New York] [New York]
Prediction: New York
When was Sam born?
Ground Truth Answers: [1946] [1946] [1946]
Prediction: 1946
When did Sam become the President of the United States?
Ground Truth Answers: <No Answer>
Prediction: <No Answer>
(1) After word segmentation and part-of-speech tagging of the English text to be read, the generated part-of-speech tagging result is as follows:
On[on#IN*],[#null*],June[june#NNP*],[#null*],14[14#CD*],[#null*],1946[1946#CD*],[#null*],Donald[donald#NNP*],[#null*],Sam[Sam#NNP*],[#null*],was[is#VBD*],[#null*],born[born#VBN*],[#null*],in[in#IN*],[#null*],New York[new york#NNP*],[#null*],City[city#NNP*],[#null*],After[after#IN*],[#null*],graduating[graduate#VBG*],[#null*],from[from#IN*],[#null*],the[the#DT*],[#null*],military[military#JJ*],[#null*],school[school#NN*],[#null*],in[in#IN*],[#null*],1964[1946#CD*],[#null*],Sam[Sam#NNP*],[#null*],entered[enter#VBD*],[#null*],the[the#DT*],[#null*],Wharton[wharton#NNP*],[#null*],School[school#NNP*],[#null*],of[of#IN*],[#null*],the[the#DT*],[#null*],University[university#NNP*],[#null*],of[of#IN*],[#null*],Pennsylvania[pennsylvania#NNP*],[#null*],college[college#NN*],[#null*],Sam[Sam#NNP*],[#null*],carefully[carefully#RB*],[#null*],learned[learn#VBD*],[#null*],new[new#JJ*],[#null*],knowledge[knowledge#NN*],[#null*],in[in#IN*],[#null*],the[the#DT*],[#null*],business[business#NN*],[#null*],field[field#NN*],[#null*],and[and#CC*],[#null*],cultivated[cultivate#VBD*],[#null*],a[a#DT*],[#null*],smart[smart#JJ*],[#null*],business[business#NN*],[#null*],savvy[savvy#NN*],[#null*],In[in#IN*],[#null*],college[college#NN*],[#null*],Sam[Sam#NNP*],[#null*],entered[enter#VBD*],[#null*],a[a#DT*],[#null*],real[real#JJ*],[#null*],estate[estate#NN*],[#null*],company[company#NN*],[#null*],founded[found#VBD*],[#null*],by[by#IN*],[#null*],his[his#PRP*],[#null*],father[father#NN*],[#null*],His[his#PRP*],[#null*],father[father#NN*],[#null*],business[business#NN*],[#null*],secrets[secret#NNS*],[#null*],taught[teach#VBD*],[#null*],Sam[Sam#NNP*],[#null*],more[more#JJR*],[#null*],experience[experience#NN*],[#null*],When[when#WRB*],[#null*],he[he#PRP*],[#null*],was[is#VBD*],[#null*],a[a#DT*],[#null*],senior[senior#JJ*],[#null*],he[he#PRP*],[#null*],wanted[want#VBD*],[#null*],make[make#VB*],[#null*],breakthrough[breakthrough#NN*],[#null*],business[business#NN*],[#null*],world[world#NN*],[#null*],From[from#IN*],[#null*],time[time#NN*],[#null*],to[to#TO*],[#null*],time[time#NN*],[#null*],he[he#PRP*],[#null*],went[go#VBD*],[#null*],abroad[abroad#RB*],[#null*],inspect[inspect#VB*],[#null*],latest[latest#JJS*],[#null*],and[and#CC*],[#null*],future[future#JJ*],[#null*],economic[economic#JJ*],[#null*],trends[trend#NNS*],[#null*],and[and#CC*],[#null*],deeply[deeply#RB*],[#null*],realized[realize#VBD*],[#null*],that[that#IN*],[#null*],the[the#DT*],[#null*],most[most#RBS*],[#null*],important[important#JJ*],[#null*],corporate[corporate#JJ*],[#null*],business[business#NN*],[#null*],strategy[strategy#NN*],[#null*],today[today#NN*],[#null*],is[is#VBZ*],[#null*],marketing[market#NN*],[#null*],In[in#IN*],[#null*],1999[1999#CD*],[#null*],Sam[Sam#NNP*],[#null*],was[is#VBD*],[#null*],again[again#RB*],[#null*],active[active#JJ*],[#null*],in[in#IN*],[#null*],investment[investment#NN*],[#null*],activities[activity#NNS*],[#null*],real[real#JJ*],[#null*],estate[estate#NN*],[#null*],casino[casino#NN*],[#null*],entertainment[entertainment#NN*],[#null*],sports[sport#NNS*],[#null*],and[and#CC*],[#null*],transportation[transportation#NN*],[#null*],sectors[sector#NNS*],[#null*],His[his#PRP*],[#null*],assets[asset#NNS*],[#null*],have[have#VBP*],[#null*],exceeded[exceed#VBN*],[#null*],billion[billion#CD*][#null*]
Part-of-speech tagging of the question text:
[question#1,Where[where#WRB*],was[is#VBD*],Donald[donald#NNP*],Sam[Sam#NNP*],born[born#VBN*]]
[question#2,When[when#WRB*],was[is#VBD*],Donald[donald#NNP*],Sam[Sam#NNP*],born[born#VBN*]]
[question#3,When[when#WRB*],did[do#VBD*],Sam[Sam#NNP*],become[become#VB*],the[the#DT*],President[president#NNP*],of[of#IN*],the[the#DT*],United[united#NNP*],States[states#NNP*]]
(2) After the nouns and noun phrases in the English text to be read are segmented, the generated segmentation result is as follows:
/On[on#IN*]/June[june#NNP*]/14[14#CD*]/1946[1946#CD*]/DonaldSam[donaldSam#NNP*]/was[is#VBD*]/born[born#VBN*]/in[in#IN*]/NewYork[newyork#NNP*]/City[city#NNP*]/After[after#IN*]/graduating[graduate#VBG*]/from[from#IN*]/the[the#DT*]/military[military#JJ*]/school[school#NN*]/in[in#IN*]/1964[1946#CD*]/Sam[Sam#NNP*]/entered[enter#VBD*]/the[the#DT*]/Wharton[wharton#NNP*]/School[school#NNP*]/of[of#IN*]/the[the#DT*]/University[university#NNP*]/of[of#IN*]/Pennsylvania[pennsylvania#NNP*]/college[college#NN*]/Sam[Sam#NNP*]/carefully[carefully#RB*]/learned[learn#VBD*]/new[new#JJ*]/knowledge[knowledge#NN*]/in[in#IN*]/the[the#DT*]/business[business#NN*]/field[field#NN*]/and[and#CC*]/cultivated[cultivate#VBD*]/a[a#DT*]/smart[smart#JJ*]/business[business#NN*]/savvy[savvy#NN*]/In[in#IN*]/college[college#NN*]/Sam[Sam#NNP*]/entered[enter#VBD*]/a[a#DT*]/real[real#JJ*]/estate[estate#NN*]/company[company#NN*]/founded[found#VBD*]/by[by#IN*]/his[his#PRP*]/father[father#NN*]/His[his#PRP*]/father[father#NN*]/business[business#NN*]/secrets[secret#NNS*]/taught[teach#VBD*]/Sam[Sam#NNP*]/more[more#JJR*]/experience[experience#NN*]/When[when#WRB*]/he[he#PRP*]/was[is#VBD*]/a[a#DT*]/senior[senior#JJ*]/he[he#PRP*]/wanted[want#VBD*]/make[make#VB*]/breakthrough[breakthrough#NN*]/business[business#NN*]/world[world#NN*]/From[from#IN*]/time[time#NN*]/to[to#TO*]/time[time#NN*]/he[he#PRP*]/went[go#VBD*]/abroad[abroad#RB*]/inspect[inspect#VB*]/latest[latest#JJS*]/and[and#CC*]/future[future#JJ*]/economic[economic#JJ*]/trends[trend#NNS*]/and[and#CC*]/deeply[deeply#RB*]/realized[realize#VBD*]/that[that#IN*]/the[the#DT*]/most[most#RBS*]/important[important#JJ*]/corporate[corporate#JJ*]/business[business#NN*]/strategy[strategy#NN*]/today[today#NN*]/is[is#VBZ*]/marketing[market#NN*]/In[in#IN*]/1999[1999#CD*]/Sam[Sam#NNP*]/was[is#VBD*]/again[again#RB*]/active[active#JJ*]/in[in#IN*]/investment[investment#NN*]/activities[activity#NNS*]/real[real#JJ*]/estate[estate#NN*]/casino[casino#NN*]/entertainment[entertainment#NN*]/sports[sport#NNS*]/and[and#CC*]/transportation[transportation#NN*]/sectors[sector#NNS*]/His[his#PRP*]/assets[asset#NNS*]/have[have#VBP*]/exceeded[exceed#VBN*]/billion[billion#CD*]/
Second step: execute the "English text keyword concept semantic feature extraction module".
(1) On the basis of the first step, the nouns or noun phrases in the preprocessed English text are represented as word vectors, each a 200-dimensional vector; the word vector results for some of the words are as follows:
business:[-2.59042799e-01 1.56627929e+00 -1.55328619e+00 1.16095312e-01
8.28763063e-04 1.13678873e+00 1.07951772e+00 6.84864402e-01
-3.05663824e-01 -9.47709203e-01 -9.14580405e-01 1.78567588e-01
9.55694243e-02 1.46830523e+00 4.33245957e-01 5.62674284e-01
-1.20297933e+00 -3.30155420e+00 2.39313304e-01 5.39111316e-01
1.37632453e+00 -5.18846154e-01 -1.72100616e+00 -7.81766713e-01
8.12833726e-01 -6.71297908e-01 -2.55080253e-01 -9.63443890e-02
3.75874341e-02 -1.85547560e-01 -5.85621536e-01 -1.32061994e+00
-1.15084291e+00 1.19156432e+00 6.12567663e-01 -4.88826752e-01
2.49715820e-01 -1.13945462e-01 -4.11442071e-01 7.39667833e-01
7.39755988e-01 6.95835590e-01 -2.12423000e-02 -6.15605295e-01
-8.16631496e-01 -4.95573401e-01 1.19313017e-01 -2.32566208e-01
-7.09587812e-01 -2.01330781e+00 6.02940023e-01 2.97293991e-01
-8.00344229e-01 2.30241203e+00 -7.61904955e-01 -4.40068513e-01
5.51879108e-01 4.55911309e-01 7.38105178e-01 1.89581215e+00
1.05786526e+00 1.08144259e+00 -2.95965791e-01 -9.70735908e-01
7.77064264e-01 1.23684049e+00 -1.16662085e+00 1.25651217e+00
-5.55168211e-01 1.06070185e+00 6.27060890e-01 1.89990854e+00
-4.69613642e-01 3.78263712e-01 1.10785294e+00 5.32317340e-01
1.78810787e+00 -1.90469372e+00 -6.32371485e-01 5.51381886e-01
-2.27715746e-01 -1.09175253e+00 -1.68093562e+00 1.41336232e-01
8.34236890e-02 -2.33603567e-01 -1.16054632e-01 -6.98961541e-02
5.63091874e-01 1.23674989e+00 -5.66389710e-02 -9.67171729e-01
4.83761936e-01 -1.42906487e-01 6.26178682e-01 1.67304240e-02
1.24199748e+00 -3.84036869e-01 4.28546637e-01 -6.10349886e-02
1.66938648e-01 3.96170676e-01 4.63583052e-01 -9.17208970e-01
-5.85813046e-01 -6.92225516e-01 -9.51395154e-01 -6.38596237e-01
3.08472663e-01 -5.36561683e-02 -7.41630197e-02 -1.49298131e-01
-6.27747476e-01 1.96738780e+00 2.24164918e-01 3.24346006e-01
2.43802595e+00 -3.70077312e-01 8.90044630e-01 9.88620240e-03
1.34185135e-01 6.29028857e-01 -1.10365725e+00 -3.79670203e-01
5.07582128e-01 7.99743831e-01 -8.41116905e-01 -1.29741180e+00
-2.33467355e-01 -8.41176212e-01 2.48963069e-02 5.14094293e-01
1.13484383e+00 -7.05592871e-01 5.25330365e-01 -3.20291258e-02
-2.67125368e-01 -4.17263657e-01 2.82960385e-01 -9.61873531e-01
3.51352364e-01 -6.42272592e-01 -2.43765354e+00 2.40605965e-01
-1.68029988e+00 3.13021213e-01 -9.40301061e-01 1.38528538e+00
-1.08122826e-01 -8.73246133e-01 1.75076559e-01 5.97331882e-01
-1.39861321e+00 -3.17869186e-01 3.57864857e-01 -1.39695033e-01
6.25059903e-01 9.22169983e-01 -8.13591704e-02 -9.10186917e-02
-4.52748924e-01 1.60742199e+00 4.60776240e-01 -7.78419793e-01
-1.02559980e-02 1.52036750e+00 -1.84489512e+00 -6.73551381e-01
1.20446825e+00 2.46079013e-01 8.50453556e-01 -7.69736469e-01
1.84337378e-01 1.13760567e+00 4.32253242e-01 -6.89828217e-01
-7.06000090e-01 9.13547158e-01 1.73478693e-01 1.42103589e+00
7.80944586e-01 8.11390400e-01 -7.83208683e-02 -5.13207555e-01
-1.06880486e+00 -7.83280969e-01 -5.65739870e-01 -2.30160475e-01
6.54523432e-01 -9.24793482e-01 -2.84793049e-01 1.01340890e+00
9.57501888e-01 2.22771317e-01 3.90049964e-01 1.60163665e+00
2.16183096e-01 7.16380775e-01 8.28462422e-01 1.71259999e-01]
savvy:[0.06921814 0.08985148 0.10130031 -0.01975576 0.00613875 0.06860386
0.07878992 0.15682952 -0.079765 -0.01364678 0.05102079 0.00548506
0.03024285 0.11446191 0.09568619 0.04286152 -0.13500483 -0.08419026
-0.01513231 0.11023535 0.06145927 0.00069024 -0.06334386 0.02397627
-0.13211721 0.10869574 -0.01575115 0.01712319 0.10889407 -0.03390257
-0.08128685 -0.00774771 0.07443068 -0.02511345 0.02655445 0.10193694
0.01160171 -0.03776457 0.18400234 -0.05345958 0.03763071 0.01195812
0.202218 0.0132231 -0.19167267 0.04500511 0.0789397 -0.01589778
0.13028212 -0.06922863 -0.06018286 0.08444316 -0.03776797 -0.14269106
-0.13448288 0.01259283 0.01702782 -0.00926038 0.01356861 0.03965648
-0.08855332 0.06088002 -0.10612214 -0.09905583 0.06241861 0.1188715
-0.04242382 0.06692507 0.02515559 -0.00878243 0.02058123 -0.00600162
0.05146226 0.10495976 0.06806118 0.03343373 0.11794326 -0.11481091
-0.12138966 0.02585844 -0.03958427 -0.02640601 -0.05624481 -0.01868268
-0.15891208 0.03756193 -0.03025833 0.01944492 0.10282031 -0.03299379
-0.00475729 0.14685485 -0.06587423 0.0149247 0.04896393 -0.06590062
0.11573595 -0.03508269 0.0751999 -0.04895703 0.01599983 0.07251011
-0.09170596 -0.02906534 -0.04846796 0.06372514 -0.07596011 -0.02131839
-0.05209391 0.13131613 -0.22141725 -0.00611135 -0.04040148 -0.03427979
0.0410597 -0.02699451 0.04695193 0.01251158 0.03160017 0.00255954
-0.07341788 -0.05954413 -0.10209412 0.00679443 0.00787201 0.00381293
-0.05103155 -0.14217651 0.05005223 -0.00610479 0.06478029 -0.1646596
0.09607032 -0.09883969 -0.05145364 0.00964217 0.14213578 0.01998526
-0.06588282 -0.0529303 0.06216754 0.02636117 -0.11312462 0.01608072
-0.01465175 -0.00260696 -0.04901178 0.00495274 0.05634578 0.00028076
0.06987215 0.09869573 0.11174746 0.01768979 -0.12532751 -0.04939596
-0.05851451 -0.17550679 0.24233076 0.0345888 0.08057397 -0.02626101
0.00672352 -0.03837141 0.01871823 -0.07934792 0.01752568 -0.133829
0.0478517 0.0792998 -0.02651287 0.05125243 0.09184576 0.15655527
0.03717348 -0.01241744 -0.08104452 0.06890302 0.01926608 -0.10523076
-0.11265913 -0.09659582 0.04266785 0.04144118 -0.14290997 -0.02705677
-0.02053294 0.05827883 -0.01985832 -0.05965782 0.14561172 -0.04690978
0.10358934 0.04019428 0.06787848 0.01593667 -0.13111904 -0.06707609
0.08144604 0.04385952]
……
(2) The similarity between words in the text to be read and in the questions is calculated, and the 20 most related words are ranked by similarity; partial results are as follows:
The first 20 words most relevant to the word business are:
financial 0.6826215982437134
consumer 0.6628485918045044
banking 0.6589778661727905
marketing 0.6573569178581238
corporate 0.6446224451065063
firms 0.6148818731307983
investments 0.6143110990524292
insurance 0.6100685596466064
retail 0.604107141494751
financing 0.5926154851913452
management 0.5904277563095093
buying 0.5883773565292358
businesses 0.5873700380325317
markets 0.5868954062461853
employees 0.5846246480941772
customer 0.583165168762207
marketplace 0.5821336507797241
enterprise 0.5816493034362793
welfare 0.5800684690475464
jobs 0.5792907476425171
The first 20 words most relevant to the word born are:
married 0.6355774402618408
christened 0.5665861368179321
novelist 0.5470004677772522
actress 0.5364381670951843
apprenticed 0.530538022518158
maclean 0.5302119255065918
interred 0.525600790977478
beatrice 0.525336503982544
desmond 0.5203564763069153
beecher 0.5200093388557434
lafcadio 0.5169895887374878
corinne 0.5124737024307251
louise 0.5076141357421875
patricia 0.5058313012123108
anna 0.5041660070419312
sarah 0.5030679702758789
ballerina 0.5028273463249207
angela 0.500499963760376
died 0.4998953342437744
anton 0.4994434416294098
The first 20 words most relevant to the word assets are:
investments 0.7772245407104492
profits 0.7662760019302368
revenues 0.7530128359794617
revenue 0.7483336925506592
funds 0.7441127896308899
investors 0.7420588731765747
firms 0.7401308417320251
debts 0.7333177924156189
loans 0.7315413951873779
shareholders 0.7296478748321533
businesses 0.7258060574531555
employees 0.7210573554039001
costs 0.7146604061126709
expenses 0.7083866596221924
purchases 0.7039198279380798
earnings 0.7029934525489807
subsidies 0.7015666961669922
payments 0.7007849812507629
goods 0.6995357275009155
connections 0.6982542872428894
The first 20 words most relevant to the word government are:
governments 0.7221421003341675
administration 0.6922751069068909
regime 0.6741224527359009
parliament 0.6391890048980713
electorate 0.6347169876098633
prc 0.6314117908477783
legislature 0.6243986487388611
legislation 0.6075990796089172
authorities 0.6037262082099915
senate 0.5914326906204224
parliamentary 0.5884313583374023
coalition 0.5815113186836243
policies 0.5814124345779419
policy 0.5776035785675049
junta 0.5771560668945312
privatization 0.5765987038612366
economy 0.5755563974380493
taxation 0.5730693340301514
autonomy 0.5683175325393677
kmt 0.5680544376373291
The first 20 words most relevant to the word merchant are:
shipyards 0.6756232976913452
kaiserliche 0.6664568185806274
sailing 0.6592236757278442
ship 0.6573899984359741
tonnage 0.647135853767395
ships 0.635455846786499
marine 0.6257590651512146
fleet 0.6237657070159912
marines 0.6213807463645935
warship 0.6195002794265747
aboard 0.619187593460083
sailors 0.6180980205535889
frigate 0.6149691343307495
navy 0.612155556678772
surveyors 0.6083635687828064
harbours 0.6074026823043823
submarines 0.6049712896347046
hms 0.6042121052742004
escort 0.6040891408920288
cruiser 0.6031404733657837
The first 20 words most relevant to the word entrepreneurs are:
journalists 0.6671593189239502
professionals 0.6548882722854614
intellectuals 0.6519579887390137
pioneers 0.6428285241127014
hackers 0.6421672105789185
capitalists 0.6376326084136963
consultants 0.6374378204345703
comedians 0.6370235681533813
economists 0.6340476274490356
executives 0.633492112159729
distributors 0.6326943635940552
businessmen 0.6269378662109375
firms 0.6252130270004272
producers 0.6200482249259949
filmmakers 0.6186020970344543
ventures 0.6152722239494324
investors 0.6144982576370239
charities 0.6126831769943237
engineers 0.6111494302749634
writers 0.6104857325553894
The first 20 words most relevant to the word president are:
presidency 0.7305980324745178
chairman 0.7099910974502563
governor 0.6958410739898682
presidents 0.6945462226867676
taoiseach 0.6547336578369141
chancellor 0.6463114023208618
senator 0.6372398138046265
presidential 0.6284170150756836
deputy 0.6119073629379272
democrat 0.6081264019012451
incumbent 0.5973949432373047
eisenhower 0.5925225019454956
senate 0.5860650539398193
reagan 0.583078145980835
mayor 0.5807799696922302
secretary 0.5800341367721558
pinochet 0.578060507774353
resigns 0.576712429523468
ould 0.5762377381324768
taya 0.5750265121459961
……
(3) Selecting candidate keywords based on similarity ranking of key nouns or noun phrases
The top five keywords are as follows:
Sam:0.8765474881873130
business:0.7866258742548321
business savvy:0.7456898574232562
government:0.7120154685214523
assets:0.6956024587541035
……
(4) Calculating the probability of the concept each candidate keyword belongs to
The calculated concept probabilities for the candidate keyword Sam are as follows:
Sam[merchant,entrepreneurs,president]
Probability of belonging to the concept merchant: 0.8532689542652531
Probability of belonging to the concept entrepreneurs: 0.8325621421303526
Probability of belonging to the concept president: 0.2102145741021432
……
From these probability results it can be seen that, in the text to be read, the probability that the word Sam belongs to the merchant or entrepreneurs concepts is far greater than the probability that it belongs to the president concept.
Third step: execute the "English text keyword and concept semantic dependency extraction module".
(1) Semantic dependencies between key nouns are extracted, and partial results are as follows:
[Sam,born in,New York]
[Sam,born in year,1946]
[Sam,university of,Pennsylvania]
……
After the key nouns or noun phrases are conceptualized, the candidate semantic dependencies between concepts extracted from the knowledge base are as follows:
[Sam,born in,New York]
[Sam,president of,United States]
……
(2) Selection and validation of candidate semantic dependencies
After the semantic dependencies among the candidate key nouns or noun phrases and the candidate dependencies of their conceptualized forms have been extracted, the two kinds of dependency are fed into two independent convolutional networks for feature extraction; in this step, to fully capture the semantic features of both, a three-layer convolutional structure is used to obtain the hidden-layer semantic information of each.
The two independent convolutional networks pool their hidden-layer outputs separately; in this step an average pooling operation takes the weighted average of the hidden-layer information, and the pooled output is fed into the fully connected layer.
In the fully connected layer, the outputs of the two independent pooling layers are concatenated into a new feature vector.
The concatenated vector of the fully connected layer is classified with a softmax function to obtain probability scores for the candidate semantic dependencies; the scores are sorted in descending order, and the highest-probability result is selected as the final semantic dependency.
Fourth step: execute the "candidate answer selection module".
The concept representations of the candidate key nouns or phrases extracted in the second step, i.e., the 200-dimensional vectors, are input.
The concept semantic dependencies between nouns or noun phrases extracted in the third step are input.
A concept semantic graph model is built with the concept representations of the candidate key nouns or phrases as nodes and the concept semantic dependencies as edges.
The directed edge weight values between nodes in the graph model are calculated using formula (5).
After the word weight values are obtained, the normalized score of each word is computed with formula (6), and the final answer word is obtained after sorting in descending order.

Claims (7)

1. An English text concept understanding method, characterized in that the method comprises an understanding model composed of an English text understanding preprocessing module, an English text keyword concept semantic feature extraction module, an English text keyword and concept semantic dependency extraction module, and a candidate answer selection module connected in sequence, wherein the understanding method comprises the following steps:
(1) The English text understanding preprocessing module inputs the English text to be read and the questions and performs word segmentation, stop-word removal, and lowercasing on each; performs part-of-speech tagging and phrase segmentation on the English text and questions so processed; and outputs the preprocessing results of the English text to be read and of the questions;
(2) The English text keyword concept semantic feature extraction module, first, inputs the preprocessing results of the English text to be read and of the questions from the preprocessing module and marks the nouns or noun phrases in both; second, computes word vectors for the marked nouns or noun phrases in the English text and in the questions; third, computes the cosine similarity between nouns or noun phrases in the text to be read and those in the questions, sorts the results in descending order, and selects the top five as candidate key nouns or noun phrases; fourth, computes the co-occurrence probability of each candidate key noun or noun phrase with its candidate concepts and, if every probability is zero, continues to the fifth step, otherwise selects the highest-probability result as the concept the candidate belongs to; fifth, if the co-occurrence probability with every candidate concept is zero, uses the current noun or noun phrase itself as its concept; sixth, computes the weight coefficients between the current keyword and its context words and obtains the final importance score of the current keyword by weighted summation;
(3) The English text keyword and concept semantic dependency extraction module inputs the word vector representations of the candidate key nouns or noun phrases; inputs their concept representations; extracts the semantic dependency relations among the candidate key nouns or noun phrases using a pre-trained semantic dependency set; extracts the concept dependency relations among them using a pre-trained concept dependency set; computes the cosine similarity between each candidate's semantic dependency and concept dependency, sorts the results in descending order, and selects the most similar result as the current keyword and its concept semantic dependency;
(4) The candidate answer selection module inputs the concept representations of the candidate key nouns or noun phrases; inputs the selected keywords and their concept semantic dependencies; builds a concept semantic graph model with the concept representations as nodes and with the selected keywords and their concept semantic dependencies as edges; computes the Euclidean distance between each node vector and the weighted average vector of all nodes, taking the probability distribution over these distances as the node weights; and selects the node with the highest weight as the final answer.
2. The understanding method according to claim 1, characterized in that the English text understanding preprocessing module comprises the following processing steps:
P201 begins;
P202 reads in the English text to be read and the questions;
P203 separates the text to be read from the questions using markers;
P204 removes stop words from the text to be read and the questions;
P205 lowercases the words in the text to be read and the questions;
P206 splits the text to be read and the questions into sentences, forming sentence sequences;
P207 performs word segmentation and phrase segmentation on the text to be read and the questions;
P208 tags the parts of speech of the segmented text sequence and outputs the lists of nouns or noun phrases, verbs, and adjectives in the text to be read;
P209 tags the parts of speech of the segmented question sequence and outputs the lists of nouns or noun phrases, verbs, and adjectives in the questions;
P210 counts the total number of words in the segmented text and question sequences;
P211 groups the segmented text sequence, one group per 20 words, padding groups of fewer than 20 words with NULL;
P212 groups the segmented question sequence, which is generally shorter than 20 words, padding it with NULL;
P213 ends.
3. The understanding method according to claim 1, characterized in that the calculation formulas of the English text keyword concept semantic feature extraction module are defined as follows:
(1) Formula for the concept a noun or noun phrase belongs to
The probability that a noun or noun phrase in the English text belongs to a concept is the ratio of the co-occurrence count of the word and that concept in the current text to the sum of the co-occurrence counts of the word and all its possible concepts in the training text set:
$$P(c_j \mid w) = \frac{N(w, c_j)}{\sum_{k=1}^{n} N(w, c_k)} \qquad (1)$$
(2) Formula for the semantic similarity of nouns or noun phrases in the English text to be read and in the questions
The semantic similarity between nouns or phrases in the English text and in the questions is the ratio of the inner product of their word vectors to the product of the norms of the word vectors:
$$\mathrm{sim}(w_i, w_j) = \frac{\vec{v}_i \cdot \vec{v}_j}{\lVert \vec{v}_i \rVert \, \lVert \vec{v}_j \rVert} \qquad (2)$$
In formula (2), the word vectors are obtained through training;
(3) Formula for the weight coefficient between the current word and its context words
The weight coefficient between the current word and a context word is the ratio of their correlation to the sum of the correlations between the current word and all the words in its context:
$$\alpha_{ij} = \frac{\mathrm{rel}(w_i, w_j)}{\sum_{k=1}^{n} \mathrm{rel}(w_i, w_k)} \qquad (3)$$
(4) Formula for the importance of the current word or phrase
With the weight coefficient between the current word i and context word j from formula (3), the importance score of the current word or phrase in the text is obtained by weighted summation:
$$\mathrm{score}(w_i) = \sum_{j=1}^{n} \alpha_{ij}\, \mathrm{rel}(w_i, w_j) \qquad (4)$$
4. The understanding method according to claim 3, characterized in that the processing steps of the English text keyword concept semantic feature extraction module are as follows:
P301 begins;
P302 reads the segmented text and question sequences;
P303 computes distributed word vectors for the words in the text to be read and in the questions, generating 200-dimensional vector representations;
P304 uses formula (2) to compute the cosine similarity between nouns or noun phrases in the questions and those in the English text to be read;
P305 sorts the computed cosine similarities in descending order and selects the top five results as candidate keywords or phrases of the text related to the questions;
P306 uses formula (1) to compute the co-occurrence probability of each candidate keyword or phrase with its candidate concepts;
P307 checks whether the co-occurrence probability of the keyword and its candidate concepts is zero; if so, P308 is executed, otherwise P309 is executed;
P308 takes the current candidate keyword or phrase itself as its concept, with its 200-dimensional word vector as its concept representation;
P309 sorts the co-occurrence probabilities of the current candidate keyword with its possible concepts in descending order and determines the concept it belongs to;
P310 vectorizes the determined concept of each keyword, generating a 200-dimensional vector representation;
P311 uses formula (3) to compute the weight coefficients of the current concept within its context;
P312 uses formula (4) to compute the importance score of the current concept within its context;
P313 sorts the importance scores of the current concept within its context in descending order, yielding the semantic features of the current candidate keyword concept;
P314 ends.
5. The understanding method according to claim 1, characterized in that: the processing steps of the English text keyword and the concept semantic dependency relation extraction module are as follows:
P401 starts;
P402 reads the word vector representations of the candidate key nouns or noun phrases;
P403 reads the concept representations of the candidate key nouns or noun phrases;
P404 inputs the concept representations into a pre-trained concept semantic dependency set and selects the two top-ranked candidate concept dependencies;
P405 position-encodes the candidate concept dependencies, i.e., computes the positional distance between each concept dependency and the concept pair to which it belongs;
P406 fuses the concept representations of the candidate key nouns or phrases with the concept dependency position codes and inputs the fused vector into a convolutional neural network;
P407 fuses the word vector representations of the candidate key nouns or noun phrases with the concept dependency position codes and inputs the fused vector into another convolutional neural network;
P408 performs the convolution-layer computation on the input vectors of P406 and P407 respectively, with P406 and P407 sharing network parameters;
P409 performs pooling operations on the word vector and concept vector convolution results of P408;
P410 concatenates the respective pooling results obtained in P409;
P411 classifies the concatenated result of P410 with a softmax function to obtain the final concept dependency result;
P412 ends.
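A schematic PyTorch sketch of the dual-channel network in steps P406-P411 follows. The 200-dimensional representations come from the claims, while the position-encoding width, filter count, kernel size, and number of relation classes are assumptions; the single shared `Conv1d` realizes the parameter sharing required by P408.

```python
# Schematic sketch of steps P406-P411; all dimensions other than the
# 200-dim representations are assumed values for illustration.
import torch
import torch.nn as nn

class ConceptDependencyCNN(nn.Module):
    def __init__(self, dim=200, pos_dim=20, n_filters=128, n_relations=10):
        super().__init__()
        # one Conv1d used by both channels, implementing the shared
        # network parameters of P408
        self.conv = nn.Conv1d(dim + pos_dim, n_filters, kernel_size=3,
                              padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.classifier = nn.Linear(2 * n_filters, n_relations)

    def forward(self, concept_repr, word_repr, pos_enc):
        # P406/P407: fuse each representation with the position encoding;
        # inputs are (batch, seq_len, dim) and (batch, seq_len, pos_dim)
        x1 = torch.cat([concept_repr, pos_enc], dim=-1).transpose(1, 2)
        x2 = torch.cat([word_repr, pos_enc], dim=-1).transpose(1, 2)
        h1 = self.pool(torch.relu(self.conv(x1))).squeeze(-1)  # P409
        h2 = self.pool(torch.relu(self.conv(x2))).squeeze(-1)  # P409
        h = torch.cat([h1, h2], dim=-1)                        # P410
        return torch.softmax(self.classifier(h), dim=-1)       # P411
```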
6. The understanding method according to claim 1, characterized in that: the calculation formulas of the candidate answer selection module are defined as follows:
(1) Directed edge weight calculation formula between keywords
The weight value of a directed edge from a candidate keyword in the current graph model refers to the proportion that the Euclidean distance between the candidate keyword and one adjacent node word takes of the sum of the Euclidean distances between the candidate keyword and all of its adjacent node words; the Euclidean distance and the edge weight are calculated as follows:
$$d(v_i, v_j) = \lVert \vec{c}_i - \vec{c}_j \rVert_2 \tag{5}$$
$$w_{ij} = \frac{d(v_i, v_j)}{\sum_{k \in N(i)} d(v_i, v_k)} \tag{6}$$
where $\vec{c}_i$ is the concept vector of node $i$ and $N(i)$ is the set of nodes adjacent to node $i$;
(2) Word weight value normalization formula
The normalized score of a term in the English text refers to the ratio of the term's weight value in the current graph model to the sum of the weight values of all terms in the current graph model; it is calculated as follows:
$$\mathrm{score}(w_i) = \frac{W_i}{\sum_{j=1}^{N} W_j}$$
where $W_i$ is the weight value of term $i$ and $N$ is the number of terms in the current graph model.
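As a small illustration, the graph formulas above can be computed as in the following numpy sketch, assuming each node carries a concept vector; the dictionary-based graph representation is an implementation choice for this sketch, not part of the claims.

```python
# Sketch of formulas (5)/(6) and the normalization formula; the graph is
# represented with plain dictionaries, an assumption for illustration.
import numpy as np

def edge_weights(node_vec: dict, neighbours: dict) -> dict:
    """Formulas (5)/(6): Euclidean distances from each node to its
    neighbours, normalized into directed edge weights."""
    w = {}
    for i, adj in neighbours.items():
        dists = {j: float(np.linalg.norm(node_vec[i] - node_vec[j]))
                 for j in adj}
        total = sum(dists.values())
        for j, d in dists.items():
            w[(i, j)] = d / total          # share of node i's total distance
    return w

def normalize_weights(node_weight: dict) -> dict:
    """Normalization formula: each term's weight as a share of the sum
    of all term weights in the graph model."""
    total = sum(node_weight.values())
    return {n: v / total for n, v in node_weight.items()}
```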
7. The understanding method according to claim 6, characterized in that: the candidate answer selection module comprises the following processing steps:
P501 starts;
P502 inputs the concept representations of the candidate key nouns or noun phrases;
P503 inputs the selected keywords and their concept semantic dependencies;
P504 builds a concept semantic graph model, using the concept representations of the candidate key nouns or phrases as nodes and the concept semantic dependencies of the selected keywords as edges;
P505 uses formula (5) to compute the Euclidean distance between any two nodes in the concept semantic graph model;
P506 uses formula (6) to compute the directed edge weight values between nodes;
P507 sorts the weight values between all nodes in descending order;
P508 selects the maximum weight value between nodes and takes that node's candidate keyword and its concept as the final answer;
P509 ends.
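Putting steps P504-P508 together, a compact end-to-end sketch might look as follows; `concept_vec` and `dependencies` are assumed to come from the earlier modules, and the data structures are illustrative rather than prescribed by the claims.

```python
# End-to-end sketch of steps P504-P508; inputs are assumed to come from
# the preceding extraction modules.
import numpy as np

def select_answer(concept_vec: dict, dependencies: list):
    """concept_vec: keyword -> 200-dim concept vector;
    dependencies: (keyword_a, keyword_b) dependency pairs."""
    neighbours = {k: [] for k in concept_vec}
    for a, b in dependencies:                    # P504: dependencies as edges
        neighbours[a].append(b)
    weights = {}
    for i, adj in neighbours.items():
        if not adj:
            continue
        dists = {j: float(np.linalg.norm(concept_vec[i] - concept_vec[j]))
                 for j in adj}                   # P505: formula (5)
        total = sum(dists.values())
        for j, d in dists.items():
            weights[(i, j)] = d / total          # P506: formula (6)
    best_edge = max(weights, key=weights.get)    # P507/P508: largest weight
    keyword = best_edge[0]
    return keyword, concept_vec[keyword]         # the keyword and its concept
```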
CN202011382136.5A 2020-11-30 2020-11-30 English text concept understanding method Active CN112487806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011382136.5A CN112487806B (en) 2020-11-30 2020-11-30 English text concept understanding method

Publications (2)

Publication Number Publication Date
CN112487806A CN112487806A (en) 2021-03-12
CN112487806B (en) 2023-05-23

Family

ID=74938475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011382136.5A Active CN112487806B (en) 2020-11-30 2020-11-30 English text concept understanding method

Country Status (1)

Country Link
CN (1) CN112487806B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417865B (en) * 2022-01-24 2023-05-26 Ping An Technology (Shenzhen) Co., Ltd. Description text processing method, device and equipment for disaster event and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9015031B2 (en) * 2011-08-04 2015-04-21 International Business Machines Corporation Predicting lexical answer types in open domain question and answering (QA) systems

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN107357783A * 2017-07-04 2017-11-17 Guilin University of Electronic Technology A quality analysis method for Chinese-to-English translation
CN111027314A * 2019-12-10 2020-04-17 Communication University of China Character attribute extraction method based on language fragment
CN111737980A * 2020-06-22 2020-10-02 Guilin University of Electronic Technology Method for correcting English text word use errors

Non-Patent Citations (1)

Title
A survey of machine reading comprehension based on neural networks; Gu Yingjie; Gui Xiaolin; Li Defu; Shen Yi; Liao Dong; Journal of Software (07); full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant