CN111753550A - Semantic parsing method for natural language - Google Patents

Semantic parsing method for natural language

Info

Publication number
CN111753550A
CN111753550A
Authority
CN
China
Prior art keywords
word
natural language
words
chinese
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010594776.6A
Other languages
Chinese (zh)
Inventor
汪秀英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202010594776.6A priority Critical patent/CN111753550A/en
Publication of CN111753550A publication Critical patent/CN111753550A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F40/00 Handling natural language data
                    • G06F40/20 Natural language analysis
                        • G06F40/205 Parsing
                            • G06F40/216 Parsing using statistical methods
                        • G06F40/237 Lexical tools
                            • G06F40/242 Dictionaries
                        • G06F40/279 Recognition of textual entities
                            • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
                    • G06F40/30 Semantic analysis
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/044 Recurrent networks, e.g. Hopfield networks
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of semantic parsing and discloses a semantic parsing method for natural language, comprising the following steps: performing word segmentation on the natural language with a string-based forward maximum matching algorithm; removing stop words from the natural language with a preset dictionary; encoding the segmented natural language with a Hash coding scheme; re-encoding the natural language with a Chinese word vector model to obtain vectorized representations of the words; calculating the relative entropy of each word vector in the natural language and revising the natural-language length; computing word-vector weights with an improved TFIDF algorithm and taking the k word vectors with the highest weights as keyword vectors; and feeding the keyword vectors to an LSTM model trained by active learning to perform semantic parsing. The invention thereby realizes semantic parsing of natural language.

Description

Semantic parsing method for natural language
Technical Field
The invention relates to the technical field of semantic parsing, in particular to a semantic parsing method of natural language.
Background
The internet has developed rapidly for many years, and the information and knowledge in the network grow at a geometric rate. In the prior art, users mainly search the internet through general-purpose search engines, which retrieve matching web pages and rank the result set by the relevance between page links and keywords. This kind of search does not answer the user's question directly; it only narrows the search range, and the user must refine the retrieved results a second time to obtain the information actually needed. Accurate parsing of the semantics of natural language has therefore attracted researchers' attention.
In the prior art, one method of semantic parsing extracts keywords from the natural language with the TFIDF algorithm and parses the semantics with a machine learning model. TFIDF-based keyword extraction counts keyword frequency: when a word appears frequently in a text, the word is strongly associated with that text, while the word's inverse document frequency prevents common words from receiving excessive weight. However, the inverse document frequency lowers the weight of all common words in the text and ignores word position information; meanwhile, existing machine learning models need a large amount of high-quality training data, which is time-consuming and labor-intensive to obtain.
In view of this, how to extract keywords from natural language and use the extracted keywords for accurate semantic parsing has become an urgent problem for those skilled in the art.
Disclosure of Invention
The invention provides a semantic parsing method for natural language, which vectorizes the natural language and extracts keywords with an improved keyword extraction algorithm, so that the extracted keywords can be used to parse the semantics accurately.
In order to achieve the above object, the present invention provides a semantic parsing method for natural language, including:
performing word segmentation processing on the natural language by using a forward maximum matching algorithm based on character strings;
performing stop word processing on the natural language by using a preset dictionary;
coding the natural language subjected to word segmentation by utilizing a Hash coding mode;
recoding the natural language by using a Chinese word vector model to obtain vectorization expression of words;
calculating the relative entropy of each word vector in the natural language, and revising the length of the natural language;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors;
and receiving the keyword vector by using an LSTM model obtained based on active learning training, and performing semantic analysis based on the keyword vector.
Optionally, the performing word segmentation processing by using a forward maximum matching algorithm based on a character string includes:
1) Take the first n characters of the string to be processed as the matching field and look it up in a pre-constructed word segmentation dictionary, where n is the number of Chinese characters in the longest dictionary entry; if the dictionary contains the field, the match succeeds and the word is split off;
2) starting from position n+1 of the compared string, take n characters to form a field and match it against the dictionary again;
3) if the match fails, remove the last character of the n-character field and match the remaining (n−1)-character field against the dictionary; continue this process until segmentation succeeds.
Optionally, the performing, by using a preset dictionary, stop word processing on the natural language includes:
traversing and matching the word segmentation result by using a preset stop word dictionary, and deleting the matched words;
the stop word dictionary contains words that occur frequently in natural language but carry little practical meaning, mainly modal particles, auxiliary words, prepositions, and conjunctions; such words generally have no definite meaning of their own and serve a function only when placed into a complete sentence, e.g. common words such as "in", "and", and "then".
Optionally, the encoding the natural language by using a Hash encoding method includes:
converting an input of arbitrary length into a fixed-length output through a hash algorithm, the output being the hash value;
after the Chinese words are converted into hash values of equal length, the decimal hash values are converted into binary numbers, and the converted binary numbers are the final Hash codes.
Optionally, the re-encoding of the natural language by using the Chinese word vector model includes:
1) constructing a Chinese word vectorization model:
m(x, θ) = f(θᵀx)
m(·): x_i → y_i
×{0,1}^p → ×{0,1}^q
wherein:
x is the natural language to be parsed;
x_i is the Hash code of the ith Chinese word;
f(·) is a pointwise function R^q → ×{0,1}^q; f is taken as the sigmoid function;
×{0,1}^p is the product space generated by the discrete set, i.e. the Hash code space;
×{0,1}^q is the product space generated by the interval, i.e. the re-encoded word vector space;
θ is a p×q matrix, the parameter of the model;
2) the product space ×{0,1}^p generated by the discrete set is exactly the probability of each Chinese word occurring; denoting a point of ×{0,1}^p as e gives the probabilistic language model:
e = g(βᵀy) + ε
wherein:
β is a p×q matrix;
y is a q-dimensional column vector representing the result of word vectorization;
ε is a p-dimensional column vector representing the random error;
e is a point of the ×{0,1}^p space and means the probability of the following word occurring;
g(·) is a pointwise function R^q → ×{0,1}^q; g(·) is taken as the sigmoid function;
3) because the order in which Chinese words appear varies with the Chinese text, and the order in which words appear represents their probability of occurring, the invention constructs the objective function of the Chinese word vectorization model and obtains an estimate of the parameter θ by solving it, the objective function being:
[objective function; formula image not reproduced in the source]
wherein:
x_i is the Hash code of the ith Chinese word;
e_i is the probability of occurrence of the word immediately following the ith word;
4) with the solved parameter θ, the Chinese word vectorization model m(x, θ) = f(θᵀx) is used to re-encode the Hash codes of the natural language.
Optionally, the revising the length of the natural language includes:
to address the problem of inconsistent text lengths, the invention sets a revision value: if the actual length of a text is greater than the average length, the revision value constrains the text, so that the negative influence of inconsistent text lengths on keyword screening is suppressed; a word-frequency revision value is also set to prevent words repeated too often from being assigned too high a weight;
the revision value of the natural language length is set as1The word frequency revision value is set to2The word frequency control formula is as follows:
[word-frequency control formula T_c; formula image not reproduced in the source]
wherein:
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
when the natural-language length exceeds the document average, the denominator grows and T_c shrinks, so overly frequent words in long natural-language texts are suppressed.
Optionally, the weight calculation of the word vectors by using the improved TFIDF algorithm includes:
introducing a word-position weight factor ε3 into the TFIDF algorithm: when keywords are extracted from the natural language, word vectors in the first and last sentences of the text are given the weight factor ε3, and when words in the remaining positions are calculated, ε3 takes the value 0;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors, wherein the calculation formula of the word vector weight is as follows:
[word-vector weight formulas; formula images not reproduced in the source]
wherein:
iv(t) is the information entropy of the word vector;
tf_t is the word frequency of the word vector;
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
n_t is the number of natural-language texts containing the candidate keyword vector t;
ε1 is the revision value for the natural-language length;
ε2 is the word-frequency revision value.
Optionally, the active learning training process includes:
1) Randomly select i samples from the candidate sample set U and label them with their correct semantics; the correctly labeled samples constitute the initial training set T;
2) training an initial semantic parser fc by using an initial training set T;
3) predict the remaining samples (U−T) of the candidate set with the initial semantic parser fc, select the samples meeting the requirements, label them correctly, and add them to the initial training set T to obtain a new training set T';
the sample selection strategy is as follows: select samples whose prediction probability falls within the range
[sample probability range; formula image not reproduced in the source]
a sample whose probability value lies within this range is defined as an uncertain sample, where C is the number of semantic classes and a and b are constants between 0 and 1;
4) training a new semantic parser fc 'by using a new training set T';
5) predict the remaining samples (U−T−T') with the new semantic parser fc', select the samples meeting the requirements (specifically, samples whose probability falls within the range and whose prediction result is wrong), label them correctly, and add them to T';
6) if the accuracy of the semantic parser reaches the specified accuracy, the method stops and returns the semantic parser fc'; otherwise jump to 4) and continue iterating.
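The sample-selection step above can be sketched as follows. Since the probability-range formula is an image in the source, the uncertainty band [a, b] used here is an assumed form, and the names `select_uncertain` and `stub_proba` are hypothetical:

```python
def select_uncertain(pool, predict_proba, a=0.3, b=0.7):
    """Pick 'uncertain' samples: those whose highest predicted class
    probability falls inside the band [a, b] (assumed form of the
    patent's probability-range criterion; a, b are constants in (0, 1))."""
    chosen = []
    for x in pool:
        p_max = max(predict_proba(x))
        if a <= p_max <= b:
            chosen.append(x)
    return chosen

# Stub parser: pretend class probabilities over C = 3 semantic classes.
def stub_proba(x):
    table = {
        "s1": [0.90, 0.05, 0.05],  # confident -> skipped
        "s2": [0.50, 0.30, 0.20],  # uncertain -> selected for labeling
        "s3": [0.34, 0.33, 0.33],  # uncertain -> selected for labeling
    }
    return table[x]

print(select_uncertain(["s1", "s2", "s3"], stub_proba))  # → ['s2', 's3']
```

Only the uncertain samples are sent to the annotator, which is what keeps the labeled-training-set size small across iterations.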
Compared with the prior art, the invention provides a semantic parsing method of natural language, which has the following advantages:
Firstly, the existing vectorization method for natural language is one-hot coding: each Chinese word is converted into a vector of length N, where N is the total number of Chinese words to be coded, exactly one position is 1 and all other positions are 0. Under this coding, the code length grows rapidly with the feature dimension (the two are directly proportional), computer memory cannot bear matrix calculations of such large dimensions, and the coded structured data are very sparse, which makes subsequent modeling very difficult. The invention therefore proposes the Chinese word vector model coding, which takes the order in which words appear in natural language as the internal logic and builds the probabilistic model e = g(βᵀy) + ε, where y is a q-dimensional column vector representing the result of word vectorization, ε is a p-dimensional column vector representing the random error, and e is a point of the ×{0,1}^p space meaning the probability of the following word occurring. The invention thus infers the next possible word from the preceding word or words of the natural language, the probability being the code value of the next possible word. Compared with the prior art, the invention considers that the probability of an earlier word is larger, so the code value of each word after coding differs; by building a probabilistic model, the invention predicts the probability of the next word from the code value of the current word, and the predicted result becomes the code value of the next word. The word vector result obtained by the invention therefore makes full use of the internal relations between words, the code length is fixed, and no curse of dimensionality arises.
Secondly, the prior art mainly extracts keywords from natural language with the TFIDF algorithm, i.e. by counting keyword frequency: when a word appears frequently in a text, the word and the text are strongly associated, and the word's inverse document frequency prevents common words from receiving excessive weight. However, the inverse document frequency lowers the weight of all common words in the text and ignores word position information; moreover, if the natural language is too long, the frequency of some words is greatly inflated even though they are not keywords of the natural language. The invention therefore improves the existing TFIDF algorithm by adding the natural-language length revision value ε1 and the word-frequency revision value ε2 to control the word frequency, obtaining the word-frequency control formula (formula image not reproduced in the source), where L_i is the length of the current natural language and L_ave is the average length of the natural language. When the natural-language length exceeds the document average, the denominator grows, i.e. the longer the natural language, the smaller the word frequency T_c, which effectively avoids excessive word frequency in long natural-language texts. Meanwhile, because the traditional TFIDF algorithm ignores word position while the first and last sentences of a text generally summarize the whole text, words in the first and last sentences should receive higher weight; the invention therefore introduces the word-position weight factor ε3 (0 < ε3 < 1): when keywords are extracted, word vectors in the first and last sentences of the text receive the weight factor ε3, and words in the remaining positions take ε3 = 0. In addition, the prior art relies too heavily on word frequency and ignores the influence of different keyword distributions on keyword weights; the information content of a keyword is therefore taken as a control coefficient of its TFIDF value in the final result, so the larger the information content of a keyword, the larger its TFIDF value.
Finally, to address the large amount of effective training data needed to train existing machine learning models, the invention trains the machine learning model by active learning: the semantic parser does not passively receive data provided by the user but actively asks the user to label the sample data it has selected, continuously choosing useful samples according to the sample selection strategy while continuously retraining the parser on those samples until it reaches satisfactory performance. Compared with existing training schemes that require a large amount of effective data, the invention iteratively selects useful sample data from a small amount of unlabeled data, labels the qualifying samples correctly, and thereby effectively reduces the number of samples needed to train the semantic parser.
Drawings
Fig. 1 is a schematic flowchart of a semantic parsing method for natural language according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention vectorizes the natural language and extracts keywords with an improved keyword extraction algorithm, so that the extracted keywords can be used for accurate semantic parsing. Fig. 1 is a schematic diagram of a semantic parsing method for natural language according to an embodiment of the present invention.
In this embodiment, the semantic parsing method for natural language includes:
and S1, performing word segmentation processing on the natural language by using a forward maximum matching algorithm based on character strings, and performing prefix word removal processing on the natural language by using a preset dictionary.
First, the invention obtains the natural language to be parsed and segments it with a string-based forward maximum matching algorithm, whose segmentation process is as follows:
1) Take the first n characters of the string to be processed as the matching field and look it up in a pre-constructed word segmentation dictionary, where n is the number of Chinese characters in the longest dictionary entry; if the dictionary contains the field, the match succeeds and the word is split off;
2) starting from position n+1 of the compared string, take n characters to form a field and match it against the dictionary again;
3) if the match fails, remove the last character of the n-character field and match the remaining (n−1)-character field against the dictionary; continue this process until segmentation succeeds;
in one embodiment of the invention, if the character string to be processed is 'Chinese character is mostly ideographic characters', the step length of comparison with a dictionary is 5, the character string 'Chinese character is mostly table' is taken to be compared with the dictionary, no corresponding words are generated, the 'table' characters are removed, the character segment 'Chinese character is mostly' is used for matching until the character string 'Chinese character' is matched with the dictionary, then the character string 'Chinese character is mostly ideographic characters', and the process is circulated to segment a word 'characters'.
Furthermore, the invention uses the preset dictionary to perform traversal matching operation on the word segmentation result, and deletes the matched words; the preset dictionary is a stop word dictionary.
The stop word dictionary contains words that occur frequently in texts but carry little practical meaning, mainly modal particles, auxiliary words, prepositions, and conjunctions; such words generally have no definite meaning of their own and serve a function only when placed into a complete sentence, e.g. common words such as "in", "and", and "then".
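The traversal-and-delete step can be sketched as follows (the stop-word entries shown are a small hypothetical sample of such a dictionary):

```python
def remove_stop_words(tokens, stop_words):
    # Traverse the segmentation result and drop every token that
    # matches an entry of the preset stop-word dictionary.
    return [t for t in tokens if t not in stop_words]

stop_words = {"的", "在", "和", "接着"}  # hypothetical sample entries
print(remove_stop_words(["我", "在", "学习", "和", "工作"], stop_words))
# → ['我', '学习', '工作']
```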
S2, coding the natural language after word segmentation by using a Hash coding mode, and recoding the natural language by using a Chinese word vector model to obtain the vectorization expression of the word.
Further, the invention encodes the segmented natural language with Hash coding, which converts an input of arbitrary length into a fixed-length output through a hash algorithm; the output is called the hash value. After the Chinese words are converted into hash values of equal length, the invention converts the decimal hash values into binary numbers, and the converted binary numbers are the final Hash codes.
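A minimal sketch of such Hash coding, assuming MD5 as the hash algorithm and a 16-bit code width (the source fixes neither choice):

```python
import hashlib

def hash_code(word, bits=16):
    # Arbitrary-length input -> fixed-length hash value; the decimal
    # value is then converted to a fixed-width binary string, which
    # is the final Hash code.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    decimal_value = int(digest, 16) % (1 << bits)
    return format(decimal_value, "0{}b".format(bits))

code = hash_code("汉字")
print(code)  # a 16-character string of 0s and 1s, identical for equal inputs
```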
Further, because the value space of each position of the Hash code is {0, 1}, this vectorization does not consider the distance between words, i.e. the distance between vectors cannot represent word-sense similarity; the invention therefore re-encodes the Hash codes with the Chinese word vectorization model, whose encoding process is as follows:
1) constructing a Chinese word vectorization model:
m(x, θ) = f(θᵀx)
m(·): x_i → y_i
×{0,1}^p → ×{0,1}^q
wherein:
x is the natural language to be parsed;
x_i is the Hash code of the ith Chinese word;
f(·) is a pointwise function R^q → ×{0,1}^q; f is taken as the sigmoid function;
×{0,1}^p is the product space generated by the discrete set, i.e. the Hash code space;
×{0,1}^q is the product space generated by the interval, i.e. the re-encoded word vector space;
θ is a p×q matrix, the parameter of the model;
2) the product space ×{0,1}^p generated by the discrete set is exactly the probability of each Chinese word occurring; denoting a point of ×{0,1}^p as e gives the probabilistic language model:
e = g(βᵀy) + ε
wherein:
β is a p×q matrix;
y is a q-dimensional column vector representing the result of word vectorization;
ε is a p-dimensional column vector representing the random error;
e is a point of the ×{0,1}^p space and means the probability of the following word occurring;
g(·) is a pointwise function R^q → ×{0,1}^q; g(·) is taken as the sigmoid function;
3) because the order in which Chinese words appear varies with the Chinese text, and the order in which words appear represents their probability of occurring, the invention constructs the objective function of the Chinese word vectorization model and obtains an estimate of the parameter θ by solving it, the objective function being:
[objective function; formula image not reproduced in the source]
wherein:
x_i is the Hash code of the ith Chinese word;
e_i is the probability of occurrence of the word immediately following the ith word.
4) With the solved parameter θ, the Chinese word vectorization model m(x, θ) = f(θᵀx) is used to re-encode the Hash codes of the natural language.
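The re-encoding map m(x, θ) = f(θᵀx) can be sketched as follows, with sigmoid as the pointwise function f as stated above; the tiny 4×2 matrix θ shown is hypothetical (a trained θ would come from solving the objective function):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reencode(hash_code, theta):
    """m(x, θ) = f(θᵀx): map a p-bit Hash code to a q-dimensional
    word vector by a linear map followed by a pointwise sigmoid."""
    x = [int(bit) for bit in hash_code]   # p-dimensional 0/1 vector
    p, q = len(theta), len(theta[0])      # theta is a p×q matrix
    return [sigmoid(sum(theta[i][j] * x[i] for i in range(p)))
            for j in range(q)]

theta = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.7], [0.0, 1.0]]  # hypothetical 4×2
vec = reencode("1010", theta)
print(len(vec))  # → 2 (a q-dimensional vector with entries in (0, 1))
```

Unlike the raw {0, 1}-valued Hash code, the re-encoded vector lives in a continuous space, so distances between vectors can reflect word-sense similarity.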
And S3, calculating the relative entropy of each word vector in the natural language, and revising the length of the natural language.
Further, the invention calculates the relative entropy iv (t) of each word vector t in natural language:
[relative entropy iv(t); formula image not reproduced in the source]
wherein:
[formula image not reproduced in the source] is the frequency with which the word vector t appears in all natural-language texts;
in a specific embodiment, t1 and t2 are two keyword vectors in natural language, t1 occurs in one natural language text as many times as t2 occurs in multiple natural language texts, which cannot simply mark the relative entropy values of the two keyword vectors as equal, and the distribution of t1 is more concentrated than that of t2 according to the information theory, indicating that t1 has higher correlation with the natural language text where it is located. Therefore, the present invention derives the above formula:
[formula image not reproduced in the source]
wherein:
μ(t) is the probability of the word vector t occurring in a natural-language text;
S is the total number of words in the natural language to be parsed;
D is the total number of natural-language texts;
tf(t) is the word frequency of the keyword vector.
Further, to address the problem of inconsistent text lengths, the invention sets a revision value: if the actual length of a text is greater than the average length, the revision value constrains the text, so that the negative influence of inconsistent text lengths on keyword screening is suppressed. A word-frequency revision value is also set to prevent words repeated too often from being assigned too high a weight.
The invention sets the revision value for the natural-language length to ε1 and the word-frequency revision value to ε2, and the word-frequency control formula is:
[word-frequency control formula T_c; formula image not reproduced in the source]
wherein:
L_i is the length of the current natural language;
L_ave is the average length of the natural language.
When the natural-language length exceeds the document average, the denominator grows and T_c shrinks, so overly frequent words in long natural-language texts are suppressed.
S4: introduce a position weight factor and improve the TFIDF algorithm by jointly considering the relative entropy of the word vectors and the natural-language revision values, then use the improved TFIDF algorithm to compute word-vector weights and take the k word vectors with the highest weights as keyword vectors.
The traditional TFIDF algorithm ignores word position information, but the first and last sentences of a text generally summarize the whole text, so words in the first and last sentences should receive higher weight. The invention therefore introduces the word-position weight factor ε3: when keywords are extracted from the natural language, word vectors in the first and last sentences of the text are given the weight factor ε3, and when words in the remaining positions of the natural language are calculated, ε3 takes the value 0.
After jointly considering the relative entropy of the word vectors and the natural-language revision values, the invention improves the traditional TFIDF algorithm, uses the improved TFIDF algorithm to compute word-vector weights, and takes the k word vectors with the highest weights as keyword vectors; the word-vector weight is calculated as follows:
Figure BDA0002557121860000101
Figure BDA0002557121860000102
wherein:
IV(t) is the information entropy of the word vector;
tf_t is the word frequency of the word vector;
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
n_t is the number of natural-language texts containing the candidate keyword vector t;
ε1 is the revision value of the natural-language length;
ε2 is the word-frequency revision value.
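The two weight formulas above are rendered only as images, so the exact way the patent combines the controlled term frequency, the IDF term, the information entropy IV(t), and the position factor ε3 is not visible in the text. The sketch below is one plausible composition that uses every quantity the variable list defines; the multiplicative structure, the IDF smoothing, and all parameter defaults are assumptions.

```python
import math

def improved_tfidf(tf_t, n_docs, n_t, L_i, L_ave, iv_t,
                   in_first_or_last_sentence,
                   eps1=0.5, eps2=1.2, eps3=0.3):
    """Hypothetical sketch of the improved TFIDF weight.

    Combines the length-controlled term frequency T_c, a smoothed IDF
    over n_docs texts, the word vector's information entropy iv_t, and
    a position bonus eps3 for first/last-sentence words. The exact
    combination in the patent is shown only as an image.
    """
    t_c = tf_t / (tf_t + eps2 * (1.0 - eps1 + eps1 * (L_i / L_ave)))
    idf = math.log((n_docs + 1) / (n_t + 1)) + 1.0
    pos = eps3 if in_first_or_last_sentence else 0.0
    return (t_c * idf + pos) * iv_t

def top_k_keywords(weights, k):
    """Indices of the k highest-weighted word vectors (the keyword vectors)."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
```

As the text requires, a word appearing in the first or last sentence receives a strictly larger weight than the same word elsewhere.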
And S5, receiving the keyword vector by using an LSTM model obtained based on active learning training, and performing semantic analysis based on the keyword vector.
Further, the invention receives the keyword vector with an LSTM model obtained through active-learning training and performs semantic parsing based on the keyword vector; the semantic parsing process is as follows:
1) a convolution kernel is used to perform a convolution operation on the input keyword vector x_i:
ci=f(ωxi+b)
wherein:
ω ∈ R^(h×d) is the weight of the convolution kernel;
h is the number of adjacent words the kernel slides over;
b is a bias term;
f is a ReLU activation function;
The invention thus obtains the following feature map based on the keyword vectors:
c={c1,c2,...,cn-h+1}
wherein:
n is the length of the keyword vector;
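Step 1) can be sketched directly from the definitions above: slide a kernel ω over h adjacent keyword vectors, add the bias b, and apply ReLU, producing the feature map c = {c_1, …, c_(n−h+1)}. This is a minimal NumPy illustration of that formula, not the patent's implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def convolve_keywords(X, omega, b):
    """Convolution c_i = f(omega * x_i + b) over keyword vectors.

    X     : (n, d) matrix of n keyword vectors of dimension d
    omega : (h, d) kernel sliding over h adjacent words
    b     : scalar bias term
    Returns the feature map c of length n - h + 1.
    """
    n, _ = X.shape
    h = omega.shape[0]
    return np.array([relu(np.sum(omega * X[i:i + h]) + b)
                     for i in range(n - h + 1)])
```

For n = 5 keyword vectors and a kernel spanning h = 2 words, the feature map has the expected 5 − 2 + 1 = 4 entries.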
2) c is divided evenly into t segments, the maximum c_i value is taken within each segment, and these maxima are concatenated into a vector:
Figure BDA0002557121860000103
In order to capture key features of different structures, the invention adopts segmented pooling: the convolution vector output by the convolution layer is divided into several segments, each of which is itself a small convolution vector; a max-pooling operation is then applied to each small convolution vector to extract its maximum feature, and the maximum features are spliced into a new feature vector;
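The segmented pooling of step 2) reduces to a few lines: split the feature map c into t roughly equal segments and keep each segment's maximum. A minimal sketch, assuming NumPy's `array_split` handles lengths not divisible by t:

```python
import numpy as np

def segmented_max_pool(c, t):
    """Split feature map c into t segments and concatenate the per-segment
    maxima into a new feature vector, as in the segmented pooling step."""
    segments = np.array_split(c, t)
    return np.array([seg.max() for seg in segments])
```

For example, pooling c = [1, 5, 2, 7, 3, 9] with t = 3 keeps one maximum from each pair, yielding a 3-dimensional feature vector.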
3) a softmax classifier is used to complete the parsing of the natural-language semantics, and the value with the maximum probability is output as the parsed content:
Figure BDA0002557121860000111
wherein:
w is a weight matrix;
b is a bias term;
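Step 3) applies a linear layer followed by softmax and outputs the most probable class. A minimal numerically stable sketch (the shapes of W and b are assumptions tied to this example):

```python
import numpy as np

def softmax_predict(v, W, b):
    """Softmax classification of the pooled feature vector v.

    Returns (predicted class index, probability vector); the class with
    the maximum probability is output as the parsed content.
    """
    scores = W @ v + b
    scores -= scores.max()  # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(np.argmax(probs)), probs
```

The probability vector always sums to 1, and the returned index is the argmax the text calls the "value with the maximum probability".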
the active learning training process comprises the following steps:
1) i samples are randomly selected from the candidate sample set U and given correct semantic annotations; the correctly labeled samples form the initial training set T;
2) training an initial semantic parser fc by using an initial training set T;
3) the initial semantic parser fc is used to predict the remaining samples (U − T) of the candidate set; samples meeting the requirements (specifically, those whose probability falls within the selection range and whose prediction results are wrong) are selected, correctly labeled, and added to the initial training set T to obtain a new training set T';
the strategy of sample selection is as follows: selecting samples having a probability range value of
Figure BDA0002557121860000112
Samples whose probability values fall within this range are defined as uncertain samples. If, at the same time, a sample's predicted result is inconsistent with its actual result (that is, the prediction is wrong), it is a sample with great influence on the semantic parser and meets the method's requirements, where C is the number of semantic categories of the samples, and a and b are constants between 0 and 1;
4) training a new semantic parser fc 'by using a new training set T';
5) the new semantic parser fc' is used to predict the remaining samples (U − T − T'); samples meeting the requirements (specifically, those within the probability range whose prediction results are wrong) are selected, correctly labeled, and added to T';
6) if the accuracy of the semantic parser reaches the specified accuracy, the method stops and returns the semantic parser fc'; otherwise it jumps to 4) and continues iterating.
In active learning, the classifier does not passively accept user-provided data; instead it actively asks the user to label the sample data currently selected by the semantic parser. Useful samples are continually selected according to the sample-selection strategy, and the semantic parser is continually trained on them until it achieves satisfactory performance. Compared with the parsing performance achieved with randomly selected samples, the active-learning method proposed by the invention significantly reduces the number of samples required.
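The active-learning loop of steps 1)–6) can be sketched as a skeleton. The concrete probability window derived from C, a, and b appears only as an image in the patent, so the interval (a, b) below is a stand-in assumption; `label_fn` plays the role of the human annotator, and `train_fn`/`predict_fn` abstract the semantic parser.

```python
import random

def active_learning(pool, label_fn, train_fn, predict_fn,
                    i0=10, a=0.5, b=0.9, max_rounds=20):
    """Skeleton of the uncertainty-based active-learning loop.

    1) label i0 random samples to form the initial training set;
    2) train an initial parser; 3)-5) repeatedly pick uncertain,
    mispredicted samples, label them, and retrain; 6) stop when no
    qualifying samples remain (standing in for the accuracy check).
    """
    labelled = [(x, label_fn(x)) for x in random.sample(pool, i0)]
    seen = {x for x, _ in labelled}
    rest = [x for x in pool if x not in seen]
    parser = train_fn(labelled)
    for _ in range(max_rounds):
        picked = []
        for x in rest:
            pred, prob = predict_fn(parser, x)
            # uncertain AND mispredicted samples are the most informative
            if a < prob < b and pred != label_fn(x):
                picked.append((x, label_fn(x)))
        if not picked:
            break
        labelled += picked
        chosen = {x for x, _ in picked}
        rest = [x for x in rest if x not in chosen]
        parser = train_fn(labelled)
    return parser
```

With a perfectly confident and correct `predict_fn`, no sample falls in the uncertainty window, so the loop returns after the initial i0 labels, mirroring the stopping condition of step 6).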
The embodiments of the invention are illustrated below by a simulation experiment that tests the processing method of the invention. The hardware test environment for the algorithm is an Ubuntu 14.04 system; the algorithm runs on an NVIDIA TITAN X GPU server, the deep-learning framework is Caffe, the CPU is an E5-2609 v3 @ 1.90 GHz, and the operating system is Ubuntu 16.04. The comparison algorithms are a CNN model based on TF-IDF, an LSTM model based on the TF-IDF algorithm, and an RNN model based on the TF-IDF algorithm.
According to the experimental results, the TF-IDF-based CNN model completes one semantic parse in 25 s with an accuracy of 65.57%; the LSTM model based on the TF-IDF algorithm completes one semantic parse in 56 s with an accuracy of 78.04%; and the RNN model based on the TF-IDF algorithm completes one semantic parse in 23 s with an accuracy of 80.64%.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A semantic parsing method of a natural language, the method comprising:
performing word segmentation processing on the natural language by using a forward maximum matching algorithm based on character strings;
performing stop word processing on the natural language by using a preset dictionary;
coding the natural language subjected to word segmentation by utilizing a Hash coding mode;
recoding the natural language by using a Chinese word vector model to obtain vectorization expression of words;
calculating the relative entropy of each word vector in the natural language, and revising the length of the natural language;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors;
and receiving the keyword vector by using an LSTM model obtained based on active learning training, and performing semantic analysis based on the keyword vector.
2. The method for semantic parsing of natural language according to claim 1, wherein the performing word segmentation processing by using a forward maximum matching algorithm based on character strings comprises:
1) taking the first n characters of a character string to be processed as matching fields, searching a pre-constructed word segmentation dictionary, wherein the number of Chinese characters contained in the maximum entry in the word segmentation dictionary is n, if the dictionary contains the word, matching is successful, and the word is segmented;
2) starting from n +1 of the compared character string, taking n words to form a field, and matching the field in the dictionary again;
3) if the matching is not successful, the last bit of the field composed of the n words is removed, the field composed of the remaining n-1 words is matched in the dictionary, and the process is carried out until the segmentation is successful.
3. The method for semantic parsing of a natural language according to claim 2, wherein the stop-word processing of the natural language using the preset dictionary comprises:
traversing and matching the word segmentation result by using a preset stop word dictionary, and deleting the matched words;
the stop-word dictionary comprises words that occur with high frequency in natural language but carry little actual meaning, mainly modal particles, auxiliary words, prepositions, and conjunctions; such words usually have no definite meaning of their own and serve a function only when placed in a complete sentence.
4. A method for parsing semantics of a natural language according to claim 3, wherein the encoding of the natural language by using a Hash coding method comprises:
converting the input with any length into the output with fixed length through a hash algorithm, and outputting the output as a hash value;
after the Chinese words are converted into the Hash values with the same length, the decimal Hash values are converted into binary digits, and the converted binary digits are the final Hash code.
5. The method for semantic parsing of natural language according to claim 4, wherein said re-encoding of natural language using Chinese word vector model comprises:
1) constructing a Chinese word vectorization model:
m(x,θ)=f(θTx)
m(·):xi→yi
×{0,1}p→×{0,1}q
wherein:
x is a natural language to be analyzed;
x_i is the Hash code of the ith Chinese word;
f(·) is a pointwise function R^q → ×{0,1}^q; f is taken to be the sigmoid function;
×{0,1}^p is the product space generated by the discrete set, i.e. the Hash-code space;
×{0,1}^q is the product space generated by the interval, i.e. the re-encoded word-vector space;
θ is a p × q matrix and is the parameter of the model;
2) a point of the product space ×{0,1}^p generated by the discrete set is exactly the probability of each Chinese word occurring; denoting a point in ×{0,1}^p by e gives the probability language model:
e = g(β^T y) + ε
wherein:
β is a p × q matrix;
y is a q-dimensional column vector representing the result of word vectorization;
ε is a p-dimensional column vector representing random error;
e is a point in the space ×{0,1}^p, meaning the probability of the following word occurring;
g(·) is a pointwise function R^q → ×{0,1}^q; g(·) is taken to be the sigmoid function;
3) because the order in which Chinese words appear varies with the Chinese text, and that order reflects the probability with which words occur, the invention derives the objective function of the Chinese word-vectorization model and obtains an estimate of the parameter θ by solving it; the objective function is as follows:
Figure FDA0002557121850000021
wherein:
x_i is the Hash code of the ith Chinese word;
e_i is the probability of occurrence of the word immediately following the ith word;
4) using the solved parameter θ, the Chinese word-vectorization model m(x, θ) = f(θ^T x) is applied to re-encode the Hash codes of the natural language.
6. The method for semantic parsing of natural language according to claim 5, wherein the revising the length of the natural language comprises:
to address the problem of differing text lengths, the invention simultaneously sets a length revision value and a word-frequency revision value, thereby avoiding assigning too high a weight to excessively repeated words;
the revision value of the natural-language length is set to ε1 and the word-frequency revision value to ε2; the word-frequency control formula is as follows:
Figure FDA0002557121850000031
wherein:
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
when the natural-language length exceeds the average document length, the denominator grows and T_c becomes smaller, so high-frequency words in long natural-language texts are suppressed.
7. The method for semantic parsing of natural language according to claim 6, wherein the performing of the weight calculation of the word vector by using the modified TFIDF algorithm comprises:
a word-position weight factor ε3 is introduced into the TFIDF algorithm: when extracting keywords from natural language, word vectors in the first and last sentences of the text are given the weight factor ε3, while for words in the remaining positions of the natural language, ε3 takes the value 0;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors, wherein the calculation formula of the word vector weight is as follows:
Figure FDA0002557121850000032
Figure FDA0002557121850000033
wherein:
IV(t) is the information entropy of the word vector;
tf_t is the word frequency of the word vector;
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
n_t is the number of natural-language texts containing the candidate keyword vector t;
ε1 is the revision value of the natural-language length;
ε2 is the word-frequency revision value.
8. The method for semantic parsing of natural language according to claim 7, wherein the active learning training process is:
1) i samples are randomly selected from the candidate sample set U and given correct semantic annotations; the correctly labeled samples form the initial training set T;
2) training an initial semantic parser fc by using an initial training set T;
3) the initial semantic parser fc is used to predict the remaining samples (U − T) of the candidate set; samples meeting the requirements are selected, correctly labeled, and added to the initial training set T to obtain a new training set T';
the strategy of sample selection is as follows: selecting samples having a probability range value of
Figure FDA0002557121850000041
samples whose probability values fall within this range are defined as uncertain samples, where C is the number of semantic categories of the samples and a and b are constants between 0 and 1;
4) training a new semantic parser fc 'by using a new training set T';
5) the new semantic parser fc' is used to predict the remaining samples (U − T − T'); samples meeting the requirements are selected, correctly labeled, and added to T';
6) if the accuracy of the semantic parser reaches the specified accuracy, the method stops and returns the semantic parser fc'; otherwise it jumps to 4) and continues iterating.
CN202010594776.6A 2020-06-28 2020-06-28 Semantic parsing method for natural language Withdrawn CN111753550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010594776.6A CN111753550A (en) 2020-06-28 2020-06-28 Semantic parsing method for natural language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010594776.6A CN111753550A (en) 2020-06-28 2020-06-28 Semantic parsing method for natural language

Publications (1)

Publication Number Publication Date
CN111753550A true CN111753550A (en) 2020-10-09

Family

ID=72677355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010594776.6A Withdrawn CN111753550A (en) 2020-06-28 2020-06-28 Semantic parsing method for natural language

Country Status (1)

Country Link
CN (1) CN111753550A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560497A (en) * 2020-12-10 2021-03-26 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112926340A (en) * 2021-03-25 2021-06-08 东南大学 Semantic matching model for knowledge point positioning
CN113312903A (en) * 2021-05-27 2021-08-27 云南大学 Method and system for constructing word stock of 5G mobile service product
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN113780618A (en) * 2021-06-22 2021-12-10 冶金自动化研究设计院 Special steel production ingot type prediction method based on natural language processing and random forest
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method
CN116756578A (en) * 2023-08-21 2023-09-15 武汉理工大学 Vehicle information security threat aggregation analysis and early warning method and system

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560497B (en) * 2020-12-10 2024-02-13 中国科学技术大学 Semantic understanding method and device, electronic equipment and storage medium
CN112560497A (en) * 2020-12-10 2021-03-26 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN112926340A (en) * 2021-03-25 2021-06-08 东南大学 Semantic matching model for knowledge point positioning
CN112926340B (en) * 2021-03-25 2024-05-07 东南大学 Semantic matching model for knowledge point positioning
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN113312903A (en) * 2021-05-27 2021-08-27 云南大学 Method and system for constructing word stock of 5G mobile service product
CN113312903B (en) * 2021-05-27 2022-04-19 云南大学 Method and system for constructing word stock of 5G mobile service product
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN113780618A (en) * 2021-06-22 2021-12-10 冶金自动化研究设计院 Special steel production ingot type prediction method based on natural language processing and random forest
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method
CN116756578B (en) * 2023-08-21 2023-11-03 武汉理工大学 Vehicle information security threat aggregation analysis and early warning method and system
CN116756578A (en) * 2023-08-21 2023-09-15 武汉理工大学 Vehicle information security threat aggregation analysis and early warning method and system

Similar Documents

Publication Publication Date Title
CN111753550A (en) Semantic parsing method for natural language
CN108984526B (en) Document theme vector extraction method based on deep learning
CN110222160B (en) Intelligent semantic document recommendation method and device and computer readable storage medium
CN110209822B (en) Academic field data correlation prediction method based on deep learning and computer
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN112818694A (en) Named entity recognition method based on rules and improved pre-training model
CN108509521B (en) Image retrieval method for automatically generating text index
JP5710581B2 (en) Question answering apparatus, method, and program
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111966810B (en) Question-answer pair ordering method for question-answer system
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN116304066B (en) Heterogeneous information network node classification method based on prompt learning
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN115495555A (en) Document retrieval method and system based on deep learning
CN111625621A (en) Document retrieval method and device, electronic equipment and storage medium
CN111274829A (en) Sequence labeling method using cross-language information
CN112463944A (en) Retrieval type intelligent question-answering method and device based on multi-model fusion
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN113961666A (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN112131341A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN116662566A (en) Heterogeneous information network link prediction method based on contrast learning mechanism
CN115878847A (en) Video guide method, system, equipment and storage medium based on natural language
Tian et al. Chinese short text multi-classification based on word and part-of-speech tagging embedding
CN116049422A (en) Echinococcosis knowledge graph construction method based on combined extraction model and application thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201009

WW01 Invention patent application withdrawn after publication