CN111753550A - Semantic parsing method for natural language - Google Patents

Semantic parsing method for natural language

Info

Publication number
CN111753550A
CN111753550A
Authority
CN
China
Prior art keywords
word
natural language
words
chinese
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010594776.6A
Other languages
Chinese (zh)
Inventor
汪秀英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202010594776.6A priority Critical patent/CN111753550A/en
Publication of CN111753550A publication Critical patent/CN111753550A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F40/00 Handling natural language data
                    • G06F40/20 Natural language analysis
                        • G06F40/205 Parsing
                            • G06F40/216 Parsing using statistical methods
                        • G06F40/237 Lexical tools
                            • G06F40/242 Dictionaries
                        • G06F40/279 Recognition of textual entities
                            • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
                    • G06F40/30 Semantic analysis
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/044 Recurrent networks, e.g. Hopfield networks
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of semantic parsing and discloses a semantic parsing method for natural language, comprising the following steps: performing word segmentation on the natural language with a string-based forward maximum matching algorithm; removing stop words from the natural language with a preset dictionary; encoding the segmented natural language with a Hash coding scheme; re-encoding the natural language with a Chinese word vector model to obtain vectorized representations of the words; calculating the relative entropy of each word vector in the natural language and revising the natural-language length; computing word-vector weights with an improved TFIDF algorithm and taking the k word vectors with the highest weights as keyword vectors; and feeding the keyword vectors to an LSTM model trained by active learning to perform semantic parsing. The invention thereby realizes semantic parsing of natural language.

Description

Semantic parsing method for natural language
Technical Field
The invention relates to the technical field of semantic parsing, in particular to a semantic parsing method of natural language.
Background
The internet has developed rapidly for many years, and the information and knowledge in the network grow at a geometric rate. In the prior art, users mainly search the internet through general-purpose search engines, which retrieve matching web pages and rank the result set by the relevance between page links and keywords. This kind of search does not answer the user's question directly; it only narrows the search range, and the user must refine the retrieved results a second time to obtain the information actually needed. Accurate parsing of the semantics of natural language has therefore attracted researchers' attention.
In the prior art, one method of semantic parsing extracts keywords from the natural language with the TFIDF algorithm and parses the semantics with a machine learning model. TFIDF-based keyword extraction counts keyword frequency: when a word appears frequently in a text, the word is strongly associated with that text, while the word's inverse document frequency prevents common words from receiving excessive weight. However, the inverse document frequency lowers the weight of all common words in the text and ignores word position information; meanwhile, existing machine learning models need a large amount of high-quality training data, which is time-consuming and labor-intensive to obtain.
In view of this, how to extract keywords from natural language and use the extracted keywords for accurate semantic parsing has become an urgent problem for those skilled in the art.
Disclosure of Invention
The invention provides a semantic parsing method for natural language, which vectorizes the natural language and extracts keywords with an improved keyword extraction algorithm, so that the extracted keywords can be used to parse the semantics accurately.
In order to achieve the above object, the present invention provides a semantic parsing method for natural language, including:
performing word segmentation processing on the natural language by using a forward maximum matching algorithm based on character strings;
performing stop word processing on the natural language by using a preset dictionary;
coding the natural language subjected to word segmentation by utilizing a Hash coding mode;
recoding the natural language by using a Chinese word vector model to obtain vectorization expression of words;
calculating the relative entropy of each word vector in the natural language, and revising the length of the natural language;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors;
and receiving the keyword vector by using an LSTM model obtained based on active learning training, and performing semantic analysis based on the keyword vector.
Optionally, the performing word segmentation processing by using a forward maximum matching algorithm based on a character string includes:
1) Take the first n characters of the string to be processed as the matching field and look it up in a pre-constructed word segmentation dictionary, where n is the number of Chinese characters in the longest dictionary entry; if the dictionary contains the field, the match succeeds and the word is split off;
2) starting from position n+1 of the compared string, take n characters to form a field and match it against the dictionary again;
3) if the match fails, remove the last character of the n-character field and match the remaining (n−1)-character field against the dictionary; continue this process until segmentation succeeds.
Optionally, the performing, by using a preset dictionary, stop word processing on the natural language includes:
traversing and matching the word segmentation result by using a preset stop word dictionary, and deleting the matched words;
the stop word dictionary contains words that occur frequently in natural language but carry little practical meaning, mainly modal particles, auxiliary words, prepositions, and conjunctions; such words generally have no definite meaning of their own and serve a function only when placed into a complete sentence, e.g. common words such as "in", "and", and "then".
Optionally, the encoding the natural language by using a Hash encoding method includes:
converting an input of arbitrary length into a fixed-length output through a hash algorithm, the output being the hash value;
after the Chinese words are converted into hash values of equal length, the decimal hash values are converted into binary numbers, and the converted binary numbers are the final Hash codes.
Optionally, the re-encoding of the natural language by using the Chinese word vector model includes:
1) constructing a Chinese word vectorization model:
m(x, θ) = f(θᵀx)
m(·): x_i → y_i
×{0,1}^p → ×{0,1}^q
wherein:
x is the natural language to be parsed;
x_i is the Hash code of the ith Chinese word;
f(·) is a pointwise function R^q → ×{0,1}^q; f is taken as the sigmoid function;
×{0,1}^p is the product space generated by the discrete set, i.e. the Hash code space;
×{0,1}^q is the product space generated by the interval, i.e. the re-encoded word vector space;
θ is a p×q matrix, the parameter of the model;
2) the product space ×{0,1}^p generated by the discrete set is exactly the probability of each Chinese word occurring; denoting a point of ×{0,1}^p as e gives the probabilistic language model:
e = g(βᵀy) + ε
wherein:
β is a p×q matrix;
y is a q-dimensional column vector representing the result of word vectorization;
ε is a p-dimensional column vector representing the random error;
e is a point of the ×{0,1}^p space and means the probability of the following word occurring;
g(·) is a pointwise function R^q → ×{0,1}^q; g(·) is taken as the sigmoid function;
3) because the order in which Chinese words appear varies with the Chinese text, and the order in which words appear represents their probability of occurring, the invention constructs the objective function of the Chinese word vectorization model and obtains an estimate of the parameter θ by solving it, the objective function being:
[objective function; formula image not reproduced in the source]
wherein:
x_i is the Hash code of the ith Chinese word;
e_i is the probability of occurrence of the word immediately following the ith word;
4) with the solved parameter θ, the Chinese word vectorization model m(x, θ) = f(θᵀx) is used to re-encode the Hash codes of the natural language.
Optionally, the revising the length of the natural language includes:
to address the problem of inconsistent text lengths, the invention sets a revision value: if the actual length of a text is greater than the average length, the revision value constrains the text, so that the negative influence of inconsistent text lengths on keyword screening is suppressed; a word-frequency revision value is also set to prevent words repeated too often from being assigned too high a weight;
the revision value of the natural language length is set as1The word frequency revision value is set to2The word frequency control formula is as follows:
[word-frequency control formula T_c; formula image not reproduced in the source]
wherein:
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
when the natural-language length exceeds the document average, the denominator grows and T_c shrinks, so overly frequent words in long natural-language texts are suppressed.
Optionally, the weight calculation of the word vectors by using the improved TFIDF algorithm includes:
introducing a word-position weight factor ε3 into the TFIDF algorithm: when keywords are extracted from the natural language, word vectors in the first and last sentences of the text are given the weight factor ε3, and when words in the remaining positions are calculated, ε3 takes the value 0;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors, wherein the calculation formula of the word vector weight is as follows:
[word-vector weight formulas; formula images not reproduced in the source]
wherein:
iv(t) is the information entropy of the word vector;
tf_t is the word frequency of the word vector;
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
n_t is the number of natural-language texts containing the candidate keyword vector t;
ε1 is the revision value for the natural-language length;
ε2 is the word-frequency revision value.
Optionally, the active learning training process includes:
1) Randomly select i samples from the candidate sample set U and label them with their correct semantics; the correctly labeled samples constitute the initial training set T;
2) training an initial semantic parser fc by using an initial training set T;
3) predict the remaining samples (U−T) of the candidate set with the initial semantic parser fc, select the samples meeting the requirements, label them correctly, and add them to the initial training set T to obtain a new training set T';
the sample selection strategy is as follows: select samples whose prediction probability falls within the range
[sample probability range; formula image not reproduced in the source]
a sample whose probability value lies within this range is defined as an uncertain sample, where C is the number of semantic classes and a and b are constants between 0 and 1;
4) training a new semantic parser fc 'by using a new training set T';
5) predict the remaining samples (U−T−T') with the new semantic parser fc', select the samples meeting the requirements (specifically, samples whose probability falls within the range and whose prediction result is wrong), label them correctly, and add them to T';
6) if the accuracy of the semantic parser reaches the specified accuracy, the method stops and returns the semantic parser fc'; otherwise jump to 4) and continue iterating.
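The sample-selection step above can be sketched as follows. Since the probability-range formula is an image in the source, the uncertainty band [a, b] used here is an assumed form, and the names `select_uncertain` and `stub_proba` are hypothetical:

```python
def select_uncertain(pool, predict_proba, a=0.3, b=0.7):
    """Pick 'uncertain' samples: those whose highest predicted class
    probability falls inside the band [a, b] (assumed form of the
    patent's probability-range criterion; a, b are constants in (0, 1))."""
    chosen = []
    for x in pool:
        p_max = max(predict_proba(x))
        if a <= p_max <= b:
            chosen.append(x)
    return chosen

# Stub parser: pretend class probabilities over C = 3 semantic classes.
def stub_proba(x):
    table = {
        "s1": [0.90, 0.05, 0.05],  # confident -> skipped
        "s2": [0.50, 0.30, 0.20],  # uncertain -> selected for labeling
        "s3": [0.34, 0.33, 0.33],  # uncertain -> selected for labeling
    }
    return table[x]

print(select_uncertain(["s1", "s2", "s3"], stub_proba))  # → ['s2', 's3']
```

Only the uncertain samples are sent to the annotator, which is what keeps the labeled-training-set size small across iterations.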
Compared with the prior art, the invention provides a semantic parsing method of natural language, which has the following advantages:
Firstly, the existing vectorization method for natural language is one-hot coding: each Chinese word is converted into a vector of length N, where N is the total number of Chinese words to be coded, exactly one position is 1 and all other positions are 0. Under this coding, the code length grows rapidly with the feature dimension (the two are directly proportional), computer memory cannot bear matrix calculations of such large dimensions, and the coded structured data are very sparse, which makes subsequent modeling very difficult. The invention therefore proposes the Chinese word vector model coding, which takes the order in which words appear in natural language as the internal logic and builds the probabilistic model e = g(βᵀy) + ε, where y is a q-dimensional column vector representing the result of word vectorization, ε is a p-dimensional column vector representing the random error, and e is a point of the ×{0,1}^p space meaning the probability of the following word occurring. The invention thus infers the next possible word from the preceding word or words of the natural language, the probability being the code value of the next possible word. Compared with the prior art, the invention considers that the probability of an earlier word is larger, so the code value of each word after coding differs; by building a probabilistic model, the invention predicts the probability of the next word from the code value of the current word, and the predicted result becomes the code value of the next word. The word vector result obtained by the invention therefore makes full use of the internal relations between words, the code length is fixed, and no curse of dimensionality arises.
Secondly, the prior art mainly extracts keywords from natural language with the TFIDF algorithm, i.e. by counting keyword frequency: when a word appears frequently in a text, the word and the text are strongly associated, and the word's inverse document frequency prevents common words from receiving excessive weight. However, the inverse document frequency lowers the weight of all common words in the text and ignores word position information; moreover, if the natural language is too long, the frequency of some words is greatly inflated even though they are not keywords of the natural language. The invention therefore improves the existing TFIDF algorithm by adding the natural-language length revision value ε1 and the word-frequency revision value ε2 to control the word frequency, obtaining the word-frequency control formula (formula image not reproduced in the source), where L_i is the length of the current natural language and L_ave is the average length of the natural language. When the natural-language length exceeds the document average, the denominator grows, i.e. the longer the natural language, the smaller the word frequency T_c, which effectively avoids excessive word frequency in long natural-language texts. Meanwhile, because the traditional TFIDF algorithm ignores word position while the first and last sentences of a text generally summarize the whole text, words in the first and last sentences should receive higher weight; the invention therefore introduces the word-position weight factor ε3 (0 < ε3 < 1): when keywords are extracted, word vectors in the first and last sentences of the text receive the weight factor ε3, and words in the remaining positions take ε3 = 0. In addition, the prior art relies too heavily on word frequency and ignores the influence of different keyword distributions on keyword weights; the information content of a keyword is therefore taken as a control coefficient of its TFIDF value in the final result, so the larger the information content of a keyword, the larger its TFIDF value.
Finally, to address the large amount of effective training data needed to train existing machine learning models, the invention trains the machine learning model by active learning: the semantic parser does not passively receive data provided by the user but actively asks the user to label the sample data it has selected, continuously choosing useful samples according to the sample selection strategy while continuously retraining the parser on those samples until it reaches satisfactory performance. Compared with existing training schemes that require a large amount of effective data, the invention iteratively selects useful sample data from a small amount of unlabeled data, labels the qualifying samples correctly, and thereby effectively reduces the number of samples needed to train the semantic parser.
Drawings
Fig. 1 is a schematic flowchart of a semantic parsing method for natural language according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention vectorizes the natural language and extracts keywords with an improved keyword extraction algorithm, so that the extracted keywords can be used for accurate semantic parsing. Fig. 1 is a schematic diagram of a semantic parsing method for natural language according to an embodiment of the present invention.
In this embodiment, the semantic parsing method for natural language includes:
and S1, performing word segmentation processing on the natural language by using a forward maximum matching algorithm based on character strings, and performing prefix word removal processing on the natural language by using a preset dictionary.
First, the invention obtains the natural language to be parsed and segments it with a string-based forward maximum matching algorithm, whose segmentation process is as follows:
1) Take the first n characters of the string to be processed as the matching field and look it up in a pre-constructed word segmentation dictionary, where n is the number of Chinese characters in the longest dictionary entry; if the dictionary contains the field, the match succeeds and the word is split off;
2) starting from position n+1 of the compared string, take n characters to form a field and match it against the dictionary again;
3) if the match fails, remove the last character of the n-character field and match the remaining (n−1)-character field against the dictionary; continue this process until segmentation succeeds;
in one embodiment of the invention, if the character string to be processed is 'Chinese character is mostly ideographic characters', the step length of comparison with a dictionary is 5, the character string 'Chinese character is mostly table' is taken to be compared with the dictionary, no corresponding words are generated, the 'table' characters are removed, the character segment 'Chinese character is mostly' is used for matching until the character string 'Chinese character' is matched with the dictionary, then the character string 'Chinese character is mostly ideographic characters', and the process is circulated to segment a word 'characters'.
Furthermore, the invention uses the preset dictionary to perform traversal matching operation on the word segmentation result, and deletes the matched words; the preset dictionary is a stop word dictionary.
The stop word dictionary contains words that occur frequently in texts but carry little practical meaning, mainly modal particles, auxiliary words, prepositions, and conjunctions; such words generally have no definite meaning of their own and serve a function only when placed into a complete sentence, e.g. common words such as "in", "and", and "then".
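The traversal-and-delete step can be sketched as follows (the stop-word entries shown are a small hypothetical sample of such a dictionary):

```python
def remove_stop_words(tokens, stop_words):
    # Traverse the segmentation result and drop every token that
    # matches an entry of the preset stop-word dictionary.
    return [t for t in tokens if t not in stop_words]

stop_words = {"的", "在", "和", "接着"}  # hypothetical sample entries
print(remove_stop_words(["我", "在", "学习", "和", "工作"], stop_words))
# → ['我', '学习', '工作']
```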
S2, coding the natural language after word segmentation by using a Hash coding mode, and recoding the natural language by using a Chinese word vector model to obtain the vectorization expression of the word.
Further, the invention encodes the segmented natural language with Hash coding, which converts an input of arbitrary length into a fixed-length output through a hash algorithm; the output is called the hash value. After the Chinese words are converted into hash values of equal length, the invention converts the decimal hash values into binary numbers, and the converted binary numbers are the final Hash codes.
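A minimal sketch of such Hash coding, assuming MD5 as the hash algorithm and a 16-bit code width (the source fixes neither choice):

```python
import hashlib

def hash_code(word, bits=16):
    # Arbitrary-length input -> fixed-length hash value; the decimal
    # value is then converted to a fixed-width binary string, which
    # is the final Hash code.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    decimal_value = int(digest, 16) % (1 << bits)
    return format(decimal_value, "0{}b".format(bits))

code = hash_code("汉字")
print(code)  # a 16-character string of 0s and 1s, identical for equal inputs
```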
Further, because the value space of each position of the Hash code is {0, 1}, this vectorization does not consider the distance between words, i.e. the distance between vectors cannot represent word-sense similarity; the invention therefore re-encodes the Hash codes with the Chinese word vectorization model, whose encoding process is as follows:
1) constructing a Chinese word vectorization model:
m(x, θ) = f(θᵀx)
m(·): x_i → y_i
×{0,1}^p → ×{0,1}^q
wherein:
x is the natural language to be parsed;
x_i is the Hash code of the ith Chinese word;
f(·) is a pointwise function R^q → ×{0,1}^q; f is taken as the sigmoid function;
×{0,1}^p is the product space generated by the discrete set, i.e. the Hash code space;
×{0,1}^q is the product space generated by the interval, i.e. the re-encoded word vector space;
θ is a p×q matrix, the parameter of the model;
2) the product space ×{0,1}^p generated by the discrete set is exactly the probability of each Chinese word occurring; denoting a point of ×{0,1}^p as e gives the probabilistic language model:
e = g(βᵀy) + ε
wherein:
β is a p×q matrix;
y is a q-dimensional column vector representing the result of word vectorization;
ε is a p-dimensional column vector representing the random error;
e is a point of the ×{0,1}^p space and means the probability of the following word occurring;
g(·) is a pointwise function R^q → ×{0,1}^q; g(·) is taken as the sigmoid function;
3) because the order in which Chinese words appear varies with the Chinese text, and the order in which words appear represents their probability of occurring, the invention constructs the objective function of the Chinese word vectorization model and obtains an estimate of the parameter θ by solving it, the objective function being:
[objective function; formula image not reproduced in the source]
wherein:
x_i is the Hash code of the ith Chinese word;
e_i is the probability of occurrence of the word immediately following the ith word.
4) With the solved parameter θ, the Chinese word vectorization model m(x, θ) = f(θᵀx) is used to re-encode the Hash codes of the natural language.
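The re-encoding map m(x, θ) = f(θᵀx) can be sketched as follows, with sigmoid as the pointwise function f as stated above; the tiny 4×2 matrix θ shown is hypothetical (a trained θ would come from solving the objective function):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reencode(hash_code, theta):
    """m(x, θ) = f(θᵀx): map a p-bit Hash code to a q-dimensional
    word vector by a linear map followed by a pointwise sigmoid."""
    x = [int(bit) for bit in hash_code]   # p-dimensional 0/1 vector
    p, q = len(theta), len(theta[0])      # theta is a p×q matrix
    return [sigmoid(sum(theta[i][j] * x[i] for i in range(p)))
            for j in range(q)]

theta = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.7], [0.0, 1.0]]  # hypothetical 4×2
vec = reencode("1010", theta)
print(len(vec))  # → 2 (a q-dimensional vector with entries in (0, 1))
```

Unlike the raw {0, 1}-valued Hash code, the re-encoded vector lives in a continuous space, so distances between vectors can reflect word-sense similarity.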
And S3, calculating the relative entropy of each word vector in the natural language, and revising the length of the natural language.
Further, the invention calculates the relative entropy iv (t) of each word vector t in natural language:
[relative entropy iv(t); formula image not reproduced in the source]
wherein:
[formula image not reproduced in the source] is the frequency with which the word vector t appears in all natural-language texts;
in a specific embodiment, t1 and t2 are two keyword vectors in natural language, t1 occurs in one natural language text as many times as t2 occurs in multiple natural language texts, which cannot simply mark the relative entropy values of the two keyword vectors as equal, and the distribution of t1 is more concentrated than that of t2 according to the information theory, indicating that t1 has higher correlation with the natural language text where it is located. Therefore, the present invention derives the above formula:
[formula image not reproduced in the source]
wherein:
μ(t) is the probability of the word vector t occurring in a natural-language text;
S is the total number of words in the natural language to be parsed;
D is the total number of natural-language texts;
tf(t) is the word frequency of the keyword vector.
Further, to address the problem of inconsistent text lengths, the invention sets a revision value: if the actual length of a text is greater than the average length, the revision value constrains the text, so that the negative influence of inconsistent text lengths on keyword screening is suppressed. A word-frequency revision value is also set to prevent words repeated too often from being assigned too high a weight.
The invention sets the revision value for the natural-language length to ε1 and the word-frequency revision value to ε2, and the word-frequency control formula is:
[word-frequency control formula T_c; formula image not reproduced in the source]
wherein:
L_i is the length of the current natural language;
L_ave is the average length of the natural language.
When the natural-language length exceeds the document average, the denominator grows and T_c shrinks, so overly frequent words in long natural-language texts are suppressed.
S4: introduce a position weight factor and improve the TFIDF algorithm by jointly considering the relative entropy of the word vectors and the natural-language revision values, then use the improved TFIDF algorithm to compute word-vector weights and take the k word vectors with the highest weights as keyword vectors.
The traditional TFIDF algorithm ignores word position information, but the first and last sentences of a text generally summarize the whole text, so words in the first and last sentences should receive higher weight. The invention therefore introduces the word-position weight factor ε3: when keywords are extracted from the natural language, word vectors in the first and last sentences of the text are given the weight factor ε3, and when words in the remaining positions of the natural language are calculated, ε3 takes the value 0.
After jointly considering the relative entropy of the word vectors and the natural-language revision values, the invention improves the traditional TFIDF algorithm, uses the improved TFIDF algorithm to compute word-vector weights, and takes the k word vectors with the highest weights as keyword vectors; the word-vector weight is calculated as follows:
Figure BDA0002557121860000101
Figure BDA0002557121860000102
wherein:
IV(t) is the information entropy of the word vector;
tf_t is the word frequency of the word vector;
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
n_t is the number of natural-language texts containing the candidate keyword vector t;
ε1 is the revision value of the natural-language length;
ε2 is the word-frequency revision value.
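The two weight formulas above are rendered only as images, so the exact way the patent combines the controlled term frequency, the IDF term, the information entropy IV(t), and the position factor ε3 is not visible in the text. The sketch below is one plausible composition that uses every quantity the variable list defines; the multiplicative structure, the IDF smoothing, and all parameter defaults are assumptions.

```python
import math

def improved_tfidf(tf_t, n_docs, n_t, L_i, L_ave, iv_t,
                   in_first_or_last_sentence,
                   eps1=0.5, eps2=1.2, eps3=0.3):
    """Hypothetical sketch of the improved TFIDF weight.

    Combines the length-controlled term frequency T_c, a smoothed IDF
    over n_docs texts, the word vector's information entropy iv_t, and
    a position bonus eps3 for first/last-sentence words. The exact
    combination in the patent is shown only as an image.
    """
    t_c = tf_t / (tf_t + eps2 * (1.0 - eps1 + eps1 * (L_i / L_ave)))
    idf = math.log((n_docs + 1) / (n_t + 1)) + 1.0
    pos = eps3 if in_first_or_last_sentence else 0.0
    return (t_c * idf + pos) * iv_t

def top_k_keywords(weights, k):
    """Indices of the k highest-weighted word vectors (the keyword vectors)."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
```

As the text requires, a word appearing in the first or last sentence receives a strictly larger weight than the same word elsewhere.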
And S5, receiving the keyword vector by using an LSTM model obtained based on active learning training, and performing semantic analysis based on the keyword vector.
Further, the invention receives the keyword vector with an LSTM model obtained through active-learning training and performs semantic parsing based on the keyword vector; the semantic parsing process is as follows:
1) a convolution kernel is used to perform a convolution operation on the input keyword vector x_i:
ci=f(ωxi+b)
wherein:
ω ∈ R^(h×d) is the weight of the convolution kernel;
h is the number of adjacent words the kernel slides over;
b is a bias term;
f is a ReLU activation function;
The invention thus obtains the following feature map based on the keyword vectors:
c={c1,c2,...,cn-h+1}
wherein:
n is the length of the keyword vector;
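Step 1) can be sketched directly from the definitions above: slide a kernel ω over h adjacent keyword vectors, add the bias b, and apply ReLU, producing the feature map c = {c_1, …, c_(n−h+1)}. This is a minimal NumPy illustration of that formula, not the patent's implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def convolve_keywords(X, omega, b):
    """Convolution c_i = f(omega * x_i + b) over keyword vectors.

    X     : (n, d) matrix of n keyword vectors of dimension d
    omega : (h, d) kernel sliding over h adjacent words
    b     : scalar bias term
    Returns the feature map c of length n - h + 1.
    """
    n, _ = X.shape
    h = omega.shape[0]
    return np.array([relu(np.sum(omega * X[i:i + h]) + b)
                     for i in range(n - h + 1)])
```

For n = 5 keyword vectors and a kernel spanning h = 2 words, the feature map has the expected 5 − 2 + 1 = 4 entries.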
2) c is divided evenly into t segments, the maximum c_i value is taken within each segment, and these maxima are concatenated into a vector:
Figure BDA0002557121860000103
In order to capture key features of different structures, the invention adopts segmented pooling: the convolution vector output by the convolution layer is divided into several segments, each of which is itself a small convolution vector; a max-pooling operation is then applied to each small convolution vector to extract its maximum feature, and the maximum features are spliced into a new feature vector;
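The segmented pooling of step 2) reduces to a few lines: split the feature map c into t roughly equal segments and keep each segment's maximum. A minimal sketch, assuming NumPy's `array_split` handles lengths not divisible by t:

```python
import numpy as np

def segmented_max_pool(c, t):
    """Split feature map c into t segments and concatenate the per-segment
    maxima into a new feature vector, as in the segmented pooling step."""
    segments = np.array_split(c, t)
    return np.array([seg.max() for seg in segments])
```

For example, pooling c = [1, 5, 2, 7, 3, 9] with t = 3 keeps one maximum from each pair, yielding a 3-dimensional feature vector.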
3) a softmax classifier is used to complete the parsing of the natural-language semantics, and the value with the maximum probability is output as the parsed content:
Figure BDA0002557121860000111
wherein:
w is a weight matrix;
b is a bias term;
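Step 3) applies a linear layer followed by softmax and outputs the most probable class. A minimal numerically stable sketch (the shapes of W and b are assumptions tied to this example):

```python
import numpy as np

def softmax_predict(v, W, b):
    """Softmax classification of the pooled feature vector v.

    Returns (predicted class index, probability vector); the class with
    the maximum probability is output as the parsed content.
    """
    scores = W @ v + b
    scores -= scores.max()  # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(np.argmax(probs)), probs
```

The probability vector always sums to 1, and the returned index is the argmax the text calls the "value with the maximum probability".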
the active learning training process comprises the following steps:
1) i samples are randomly selected from the candidate sample set U and given correct semantic annotations; the correctly labeled samples form the initial training set T;
2) training an initial semantic parser fc by using an initial training set T;
3) the initial semantic parser fc is used to predict the remaining samples (U − T) of the candidate set; samples meeting the requirements (specifically, those whose probability falls within the selection range and whose prediction results are wrong) are selected, correctly labeled, and added to the initial training set T to obtain a new training set T';
the strategy of sample selection is as follows: selecting samples having a probability range value of
Figure BDA0002557121860000112
Samples whose probability values fall within this range are defined as uncertain samples. If, at the same time, a sample's predicted result is inconsistent with its actual result (that is, the prediction is wrong), it is a sample with great influence on the semantic parser and meets the method's requirements, where C is the number of semantic categories of the samples, and a and b are constants between 0 and 1;
4) training a new semantic parser fc 'by using a new training set T';
5) the new semantic parser fc' is used to predict the remaining samples (U − T − T'); samples meeting the requirements (specifically, those within the probability range whose prediction results are wrong) are selected, correctly labeled, and added to T';
6) if the accuracy of the semantic parser reaches the specified accuracy, the method stops and returns the semantic parser fc'; otherwise it jumps to 4) and continues iterating.
In active learning, the classifier does not passively accept user-provided data; instead it actively asks the user to label the sample data currently selected by the semantic parser. Useful samples are continually selected according to the sample-selection strategy, and the semantic parser is continually trained on them until it achieves satisfactory performance. Compared with the parsing performance achieved with randomly selected samples, the active-learning method proposed by the invention significantly reduces the number of samples required.
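The active-learning loop of steps 1)–6) can be sketched as a skeleton. The concrete probability window derived from C, a, and b appears only as an image in the patent, so the interval (a, b) below is a stand-in assumption; `label_fn` plays the role of the human annotator, and `train_fn`/`predict_fn` abstract the semantic parser.

```python
import random

def active_learning(pool, label_fn, train_fn, predict_fn,
                    i0=10, a=0.5, b=0.9, max_rounds=20):
    """Skeleton of the uncertainty-based active-learning loop.

    1) label i0 random samples to form the initial training set;
    2) train an initial parser; 3)-5) repeatedly pick uncertain,
    mispredicted samples, label them, and retrain; 6) stop when no
    qualifying samples remain (standing in for the accuracy check).
    """
    labelled = [(x, label_fn(x)) for x in random.sample(pool, i0)]
    seen = {x for x, _ in labelled}
    rest = [x for x in pool if x not in seen]
    parser = train_fn(labelled)
    for _ in range(max_rounds):
        picked = []
        for x in rest:
            pred, prob = predict_fn(parser, x)
            # uncertain AND mispredicted samples are the most informative
            if a < prob < b and pred != label_fn(x):
                picked.append((x, label_fn(x)))
        if not picked:
            break
        labelled += picked
        chosen = {x for x, _ in picked}
        rest = [x for x in rest if x not in chosen]
        parser = train_fn(labelled)
    return parser
```

With a perfectly confident and correct `predict_fn`, no sample falls in the uncertainty window, so the loop returns after the initial i0 labels, mirroring the stopping condition of step 6).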
The embodiments of the invention are illustrated below by a simulation experiment that tests the processing method of the invention. The hardware test environment for the algorithm is an Ubuntu 14.04 system; the algorithm runs on an NVIDIA TITAN X GPU server, the deep-learning framework is Caffe, the CPU is an E5-2609 v3 @ 1.90 GHz, and the operating system is Ubuntu 16.04. The comparison algorithms are a CNN model based on TF-IDF, an LSTM model based on the TF-IDF algorithm, and an RNN model based on the TF-IDF algorithm.
According to the experimental results, the TF-IDF-based CNN model completes one semantic parse in 25 s with an accuracy of 65.57%; the LSTM model based on the TF-IDF algorithm completes one semantic parse in 56 s with an accuracy of 78.04%; and the RNN model based on the TF-IDF algorithm completes one semantic parse in 23 s with an accuracy of 80.64%.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A semantic parsing method of a natural language, the method comprising:
performing word segmentation processing on the natural language by using a forward maximum matching algorithm based on character strings;
performing stop word processing on the natural language by using a preset dictionary;
coding the natural language subjected to word segmentation by utilizing a Hash coding mode;
recoding the natural language by using a Chinese word vector model to obtain vectorization expression of words;
calculating the relative entropy of each word vector in the natural language, and revising the length of the natural language;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors;
and receiving the keyword vector by using an LSTM model obtained based on active learning training, and performing semantic analysis based on the keyword vector.
2. The method for semantic parsing of natural language according to claim 1, wherein the performing word segmentation processing by using a forward maximum matching algorithm based on character strings comprises:
1) taking the first n characters of a character string to be processed as matching fields, searching a pre-constructed word segmentation dictionary, wherein the number of Chinese characters contained in the maximum entry in the word segmentation dictionary is n, if the dictionary contains the word, matching is successful, and the word is segmented;
2) starting from n +1 of the compared character string, taking n words to form a field, and matching the field in the dictionary again;
3) if the matching is not successful, the last bit of the field composed of the n words is removed, the field composed of the remaining n-1 words is matched in the dictionary, and the process is carried out until the segmentation is successful.
3. The method for semantic parsing of a natural language according to claim 2, wherein the stop-word processing of the natural language using the preset dictionary comprises:
traversing and matching the word segmentation result by using a preset stop word dictionary, and deleting the matched words;
the stop-word dictionary comprises words that occur with high frequency in natural language but carry little actual meaning, mainly modal particles, auxiliary words, prepositions, and conjunctions; such words usually have no definite meaning of their own and serve a function only when placed in a complete sentence.
4. A method for parsing semantics of a natural language according to claim 3, wherein the encoding of the natural language by using a Hash coding method comprises:
converting the input with any length into the output with fixed length through a hash algorithm, and outputting the output as a hash value;
after the Chinese words are converted into the Hash values with the same length, the decimal Hash values are converted into binary digits, and the converted binary digits are the final Hash code.
5. The method for semantic parsing of natural language according to claim 4, wherein said re-encoding of natural language using Chinese word vector model comprises:
1) constructing a Chinese word vectorization model:
m(x,θ)=f(θTx)
m(·):xi→yi
×{0,1}p→×{0,1}q
wherein:
x is a natural language to be analyzed;
x_i is the Hash code of the ith Chinese word;
f(·) is a pointwise function R^q → ×{0,1}^q; f is taken to be the sigmoid function;
×{0,1}^p is the product space generated by the discrete set, i.e. the Hash-code space;
×{0,1}^q is the product space generated by the interval, i.e. the re-encoded word-vector space;
θ is a p × q matrix and is the parameter of the model;
2) a point of the product space ×{0,1}^p generated by the discrete set is exactly the probability of each Chinese word occurring; denoting a point in ×{0,1}^p by e gives the probability language model:
e = g(β^T y) + ε
wherein:
β is a p × q matrix;
y is a q-dimensional column vector representing the result of word vectorization;
ε is a p-dimensional column vector representing random error;
e is a point in the space ×{0,1}^p, meaning the probability of the following word occurring;
g(·) is a pointwise function R^q → ×{0,1}^q; g(·) is taken to be the sigmoid function;
3) because the order in which Chinese words appear varies with the Chinese text, and that order reflects the probability with which words occur, the invention derives the objective function of the Chinese word-vectorization model and obtains an estimate of the parameter θ by solving it; the objective function is as follows:
Figure FDA0002557121850000021
wherein:
x_i is the Hash code of the ith Chinese word;
e_i is the probability of occurrence of the word immediately following the ith word;
4) using the solved parameter θ, the Chinese word-vectorization model m(x, θ) = f(θ^T x) is applied to re-encode the Hash codes of the natural language.
6. The method for semantic parsing of natural language according to claim 5, wherein the revising the length of the natural language comprises:
to address the problem of differing text lengths, the invention simultaneously sets a length revision value and a word-frequency revision value, thereby avoiding assigning too high a weight to excessively repeated words;
the revision value of the natural-language length is set to ε1 and the word-frequency revision value to ε2; the word-frequency control formula is as follows:
Figure FDA0002557121850000031
wherein:
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
when the natural-language length exceeds the average document length, the denominator grows and T_c becomes smaller, so high-frequency words in long natural-language texts are suppressed.
7. The method for semantic parsing of natural language according to claim 6, wherein the performing of the weight calculation of the word vector by using the modified TFIDF algorithm comprises:
a word-position weight factor ε3 is introduced into the TFIDF algorithm: when extracting keywords from natural language, word vectors in the first and last sentences of the text are given the weight factor ε3, while for words in the remaining positions of the natural language, ε3 takes the value 0;
carrying out weight calculation on word vectors by using an improved TFIDF algorithm, and taking k word vectors with the highest weight as keyword vectors, wherein the calculation formula of the word vector weight is as follows:
Figure FDA0002557121850000032
Figure FDA0002557121850000033
wherein:
IV(t) is the information entropy of the word vector;
tf_t is the word frequency of the word vector;
L_i is the length of the current natural language;
L_ave is the average length of the natural language;
n_t is the number of natural-language texts containing the candidate keyword vector t;
ε1 is the revision value of the natural-language length;
ε2 is the word-frequency revision value.
8. The method for semantic parsing of natural language according to claim 7, wherein the active learning training process is:
1) i samples are randomly selected from the candidate sample set U and given correct semantic annotations; the correctly labeled samples form the initial training set T;
2) training an initial semantic parser fc by using an initial training set T;
3) the initial semantic parser fc is used to predict the remaining samples (U − T) of the candidate set; samples meeting the requirements are selected, correctly labeled, and added to the initial training set T to obtain a new training set T';
the strategy of sample selection is as follows: selecting samples having a probability range value of
Figure FDA0002557121850000041
samples whose probability values fall within this range are defined as uncertain samples, where C is the number of semantic categories of the samples and a and b are constants between 0 and 1;
4) training a new semantic parser fc 'by using a new training set T';
5) the new semantic parser fc' is used to predict the remaining samples (U − T − T'); samples meeting the requirements are selected, correctly labeled, and added to T';
6) if the accuracy of the semantic parser reaches the specified accuracy, the method stops and returns the semantic parser fc'; otherwise it jumps to 4) and continues iterating.
CN202010594776.6A 2020-06-28 2020-06-28 Semantic parsing method for natural language Withdrawn CN111753550A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010594776.6A CN111753550A (en) 2020-06-28 2020-06-28 Semantic parsing method for natural language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010594776.6A CN111753550A (en) 2020-06-28 2020-06-28 Semantic parsing method for natural language

Publications (1)

Publication Number Publication Date
CN111753550A true CN111753550A (en) 2020-10-09

Family

ID=72677355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010594776.6A Withdrawn CN111753550A (en) 2020-06-28 2020-06-28 Semantic parsing method for natural language

Country Status (1)

Country Link
CN (1) CN111753550A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560497A (en) * 2020-12-10 2021-03-26 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN112926340A (en) * 2021-03-25 2021-06-08 东南大学 Semantic matching model for knowledge point positioning
CN113312903A (en) * 2021-05-27 2021-08-27 云南大学 Method and system for constructing word stock of 5G mobile service product
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN113780618A (en) * 2021-06-22 2021-12-10 冶金自动化研究设计院 Special steel production ingot type prediction method based on natural language processing and random forest
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method
CN116756578A (en) * 2023-08-21 2023-09-15 武汉理工大学 Vehicle information security threat aggregation analysis and early warning method and system

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560497B (en) * 2020-12-10 2024-02-13 中国科学技术大学 Semantic understanding method and device, electronic equipment and storage medium
CN112560497A (en) * 2020-12-10 2021-03-26 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN112926340A (en) * 2021-03-25 2021-06-08 东南大学 Semantic matching model for knowledge point positioning
CN112926340B (en) * 2021-03-25 2024-05-07 东南大学 Semantic matching model for knowledge point positioning
CN112906403A (en) * 2021-04-25 2021-06-04 中国平安人寿保险股份有限公司 Semantic analysis model training method and device, terminal equipment and storage medium
CN113312903A (en) * 2021-05-27 2021-08-27 云南大学 Method and system for constructing word stock of 5G mobile service product
CN113312903B (en) * 2021-05-27 2022-04-19 云南大学 Method and system for constructing word stock of 5G mobile service product
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN113780618A (en) * 2021-06-22 2021-12-10 冶金自动化研究设计院 Special steel production ingot type prediction method based on natural language processing and random forest
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method
CN116756578B (en) * 2023-08-21 2023-11-03 武汉理工大学 Vehicle information security threat aggregation analysis and early warning method and system
CN116756578A (en) * 2023-08-21 2023-09-15 武汉理工大学 Vehicle information security threat aggregation analysis and early warning method and system

Similar Documents

Publication Publication Date Title
CN111753550A (en) Semantic parsing method for natural language
CN108984526B (en) Document theme vector extraction method based on deep learning
CN110222160B (en) Intelligent semantic document recommendation method and device and computer readable storage medium
CN110209822B (en) Academic field data correlation prediction method based on deep learning and computer
CN111444320A (en) Text retrieval method and device, computer equipment and storage medium
CN112818694A (en) Named entity recognition method based on rules and improved pre-training model
CN108509521B (en) Image retrieval method for automatically generating text index
JP5710581B2 (en) Question answering apparatus, method, and program
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN111966810B (en) Question-answer pair ordering method for question-answer system
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN116304066B (en) Heterogeneous information network node classification method based on prompt learning
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN115495555A (en) Document retrieval method and system based on deep learning
CN111625621A (en) Document retrieval method and device, electronic equipment and storage medium
CN111274829A (en) Sequence labeling method using cross-language information
CN112463944A (en) Retrieval type intelligent question-answering method and device based on multi-model fusion
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN113961666A (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN112131341A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN116662566A (en) Heterogeneous information network link prediction method based on contrast learning mechanism
CN115878847A (en) Video guide method, system, equipment and storage medium based on natural language
Tian et al. Chinese short text multi-classification based on word and part-of-speech tagging embedding
CN116049422A (en) Echinococcosis knowledge graph construction method based on combined extraction model and application thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20201009

WW01 Invention patent application withdrawn after publication