CN110705296A

CN110705296A - Chinese natural language processing tool system based on machine learning and deep learning

Info

Publication number: CN110705296A
Application number: CN201910867399.6A
Authority: CN
Inventors: 魏巍; 陈志毅; 李恒; 杨佳鑫; 王赞博; 徐晨维; 热克甫; 王振海
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2020-01-17

Abstract

The invention discloses a Chinese natural language processing tool system based on machine learning and deep learning, comprising: a data processing module for acquiring the Chinese text to be processed and the processing task type, and for converting the received Chinese text into a computer-readable data format according to the task type; a task application module for calling the algorithm model library through a unified interface, according to the data acquired by the data processing module and the natural language processing requirement, to complete full-pipeline model training, and for exposing a standard, uniform task calling interface based on the stored natural language processing models so as to complete the corresponding natural language processing task; and an algorithm model library for storing the algorithms of the natural language processing tasks and the models trained with those algorithms. By building a reasonable system architecture with unified training interfaces, a unified training flow, unified calling interfaces and a unified calling flow for all functions, the invention is simpler and more efficient to use as a natural language processing tool.

Description

Chinese natural language processing tool system based on machine learning and deep learning

Technical Field

The invention relates to a natural language processing technology, in particular to a Chinese natural language processing tool system based on machine learning and deep learning.

Background

Conventional natural language processing tools are typically based on classical machine learning algorithms such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). With the advancement of deep learning, many studies based on deep neural network models have been devoted to improving existing natural language processing algorithms; these models typically encode character and word information as distributed representations for input and learn the natural language processing task through end-to-end training. Recently, more and more deep learning algorithms have been shown to perform well on natural language processing tasks, and some well-performing natural language processing tools using the latest technology have been proposed. However, Chinese natural language processing toolkit systems that are based on machine learning and deep learning, cover multiple natural language processing tasks and include the mainstream algorithm models are still very rare.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the defects in the prior art, a Chinese natural language processing tool system based on machine learning and deep learning.

The technical scheme adopted by the invention for solving the technical problems is as follows: a Chinese natural language processing tool system based on machine learning and deep learning, comprising:

the data processing module is used for acquiring the Chinese text to be processed and the processing task type, converting the received Chinese text into a computer-readable data format according to the processing task type, and providing a uniform data processing interface for the task application module;

the task application module is used for calling the algorithm model library through a unified interface, according to the data acquired by the data processing module and the natural language processing requirement, to complete full-pipeline model training; for providing a standard and uniform task calling interface to the outside based on the stored optimal model so as to complete the corresponding natural language processing task; and for using the finally updated model obtained by training to complete the natural language processing task;

the algorithm model library is used for storing the algorithms of the natural language processing tasks and the models obtained by training according to those algorithms;

the algorithms in the algorithm model library include a text classification algorithm based on deep learning, a text classification algorithm based on machine learning, a clustering algorithm based on machine learning, a word segmentation, part-of-speech tagging and named entity recognition algorithm based on a deep sequence model, a syntactic dependency parsing and semantic dependency parsing algorithm based on deep learning and graphs, a similarity algorithm based on probability statistics and deep learning, a special phrase extraction algorithm based on rule parsing, a sentence parsing algorithm based on the dependency tree and sentence structure, and a semantic slot and intent recognition algorithm based on deep learning.

According to the scheme, the data processing module comprises an IO module, a data management module, a data cleaning module and a Token conversion module;

the IO module is used for reading and writing various types of data files; the data files include: txt files, json files, xml files, csv files, Numpy data files, Pickle data files and MySQL database files;

the data management module is used for uniformly processing files of different data formats for different tasks; the processing comprises acquiring text data, constructing a feature mapping table and converting text content into features;

the data cleaning module is used for cleaning the original text data, including removing invalid character strings, removing stop words and converting between traditional and simplified Chinese characters;

and the Token conversion module is used for converting text tokens (words or characters) into corresponding ids by constructing a corresponding vocabulary.

According to the scheme, the task application module comprises a classification application module, a clustering application module, a sequence labeling application module, a dependency parsing application module, a similarity application module, a sentence parsing application module and a semantic slot application module;

the classification application module is used for calling the deep learning and machine learning classification algorithms in the algorithm model library to train and predict on text classification tasks, and for storing the model parameters obtained after training in the algorithm model library;

the clustering application module is used for calling the machine-learning-based clustering algorithm and the LDA topic model in the algorithm model library to group similar texts and label topic words;

the sequence labeling application module is used for carrying out natural language processing including word segmentation, part-of-speech tagging and named entity recognition;

the dependency parsing application module is used for completing syntactic dependency tree parsing and semantic dependency tree parsing;

the similarity application module is used for calculating the similarity (or distance) of two input sentences;

the sentence parsing application module is used for parsing the syntactic features of a sentence; the input of the module is a normal text sequence, and the output is the special phrases contained in it, the corresponding sentence category and the subject-predicate-object structure;

the semantic slot application module is used for recognizing sentence intent and acquiring semantic slots; the model input is a normal text sequence, and the output is the corresponding domain, the intent, and semantic slots such as time, place and flight.

According to the scheme, the word segmentation, part-of-speech tagging and named entity recognition algorithm based on the deep sequence model adopts a Bi-LSTM + CRF architecture, wherein the Bi-LSTM is a bidirectional long short-term memory network, and the forward propagation formulas of an LSTM unit with a forget gate are as follows:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_h(W_c x_t + U_c h_{t-1} + b_c)$$

$$h_t = o_t \circ \sigma_h(c_t)$$

the initial values are $c_0 = 0$ and $h_0 = 0$, and the operator $\circ$ denotes the element-wise product. The subscript t indexes the time step. $x_t$ is the input vector of the LSTM cell, $f_t$ is the activation vector of the forget gate, $i_t$ is the activation vector of the input/update gate, $o_t$ is the activation vector of the output gate, $h_t$ is the hidden state vector and also the output vector, $c_t$ is the cell state vector, and W, U, b are the weight matrices and biases learned during training; $\sigma_g$ is the sigmoid function and $\sigma_h$ is the tanh function;

the CRF in the model is a linear-chain conditional random field, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are both sequences of random variables represented by linear chains. Given the random variable sequence x, the conditional probability distribution P(y|x) of the random variable sequence y constitutes a conditional random field if it satisfies the Markov property:

$$P(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = P(y_i \mid x, y_{i-1}, y_{i+1})$$

P(y|x) is then a linear-chain conditional random field. Let $t_k$ be the transition features and $s_l$ the emission features, with weights $\lambda_k$ and $\mu_l$ respectively; the linear-chain conditional random field is then jointly determined by all $t_k$, $\lambda_k$, $s_l$ and $\mu_l$, and its parameterized form is as follows:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big)$$

wherein Z(x) is a normalization factor:

$$Z(x) = \sum_{y} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big).$$

The invention has the following beneficial effects: it constructs a reasonable system architecture, unifies the data input of the different application functions through the data processing module, and has the task application module call the related functional algorithms in a unified way, thereby achieving unified training interfaces, a unified training flow, unified calling interfaces and a unified calling flow.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a schematic structural diagram of an embodiment of the present invention;

FIG. 2 is a schematic diagram of a task application module call flow according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of syntax parsing according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, a tool system for multiple basic Chinese natural language processing tasks implements a number of mainstream machine learning and deep learning algorithms to support better processing and understanding of Chinese text. Specifically, the proposed tool system comprises main modules, namely a data processing module, an algorithm model module and a task application module, and auxiliary modules such as a Python calling interface and a Web service interface.

The data processing module is used for converting Chinese text into a computer-readable data format. In particular, for a piece of text $x_1, x_2, \ldots, x_n$, the data processing module can convert it into a corresponding id list $i_1, i_2, \ldots, i_n$ or a text feature matrix W. To make this conversion general and efficient, the data processing module of the invention comprises an IO module, a data management module, a data cleaning module, a Token conversion module and the like.

The IO module implements read and write operations on txt files, json files, xml files, csv files, Numpy data files, Pickle data files and other files, and on MySQL and other databases. The IO module also provides basic directory handling and file merging and splitting functions. Other modules call the IO module through a Python API.

The data management module is responsible for uniformly processing files of different data formats for different tasks. It reads the corresponding training and test data according to the task and converts them into a uniform computer-readable data format.

The data management module provides functions such as acquiring text data, constructing a feature mapping table and converting text content into features. It can acquire the data of tasks such as text classification, sequence labeling and dependency parsing in a unified way for use by the corresponding task module.

The data cleaning module is responsible for cleaning the original text data; its functions include removing invalid character strings, removing stop words and converting between traditional and simplified Chinese characters. The module is implemented in Python and can be called by other programs.
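The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function names, the stop-word list and the tiny traditional-to-simplified table are all hypothetical, and a real system would use a complete conversion table.

```python
import re

# Hypothetical stop-word list and traditional-to-simplified table, for illustration only.
STOP_WORDS = {"的", "了", "是"}
T2S = {"機": "机", "學": "学", "習": "习"}


def clean_text(text):
    """Strip invalid characters and map traditional characters to simplified ones."""
    # Assumption: only CJK ideographs, ASCII letters and digits count as valid.
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", text)
    return "".join(T2S.get(ch, ch) for ch in text)


def remove_stop_words(tokens):
    """Drop stop words from an already segmented token list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

For example, `clean_text("機器學習！！  ")` yields `"机器学习"`, and `remove_stop_words` is meant to run after word segmentation, since stop words are defined on tokens rather than on raw characters.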

The Token conversion module is the main module for converting text data into computer-readable data: it builds a vocabulary and converts text tokens into the corresponding ids so that they can be used directly by subsequent tasks. The module maintains the vocabulary and the token-to-id mapping for the data, determines them by word frequency statistics, and converts the original text tokens into the corresponding ids, which serve as the input of the task model. The Token conversion module has a uniform interface, so different tasks can call it in the same way.
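A frequency-based vocabulary of this kind might look like the sketch below. The reserved `<pad>` and `<unk>` tokens, the `min_freq` parameter and the function names are assumptions made for illustration; the source only states that the mapping is determined by word frequency statistics.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical reserved tokens


def build_vocab(token_lists, min_freq=1):
    """Build a token-to-id mapping ordered by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {PAD: 0, UNK: 1}
    # Frequent tokens get small ids; ties are broken by the token itself for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def tokens_to_ids(tokens, vocab):
    """Map tokens to ids, sending unseen tokens to the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]
```

With the corpus `[["我", "爱", "北京"], ["我", "爱", "你"]]`, the tokens 我 and 爱 receive ids 2 and 3, and a token that was never seen, such as 上海, maps to the `<unk>` id 1.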

Algorithm model library

The invention includes algorithms and models for multiple natural language processing tasks. These mainly comprise classical deep learning and machine learning algorithms for the relevant natural language processing tasks, together with the improved algorithms and models provided by the invention. The algorithms in the algorithm model module include a text classification algorithm based on deep learning, a text classification algorithm based on machine learning, a clustering algorithm based on machine learning, a word segmentation, part-of-speech tagging and named entity recognition algorithm based on a deep sequence model, a syntactic dependency parsing and semantic dependency parsing algorithm based on deep learning and graphs, a similarity algorithm based on probability statistics and deep learning, a special phrase extraction algorithm based on rule parsing, a sentence parsing algorithm based on the dependency tree and sentence structure, and a semantic slot and intent recognition algorithm based on deep learning.

The text classification algorithm based on deep learning uses deep neural network models to perform text classification end to end, and includes models such as text convolution (TextCNN), fastText, a bidirectional recurrent attention network and the Transformer. The models take word2vec word embeddings as input and are trained to minimize the cross entropy between the output and the labels. They share a uniform and convenient external calling interface that can be used for text classification tasks.

The text convolution model uses a single convolution layer with multiple kernels. Let $x_i$ denote the word embedding of the i-th word in the text; a document of n words is then represented as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ denotes concatenation.

Given a convolution weight w applied to a window of h words, with bias b and nonlinear activation f, the convolution is computed as $c_i = f(w \cdot x_{i:i+h-1} + b)$. The convolution results are then concatenated into $c = [c_1, c_2, \ldots, c_{n-h+1}]$, max pooling $\hat{c} = \max\{c\}$ is applied, and finally a fully connected network performs the classification.
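As a rough illustration of the convolution and max-pooling steps, the following pure-Python sketch computes one pooled feature per kernel. It assumes ReLU as the activation f, represents each kernel as a hypothetical `(w, b, h)` tuple, and requires the text to contain at least h words; the fully connected classifier is omitted.

```python
def conv_feature_map(x, w, b, h):
    """Return [c_1, ..., c_{n-h+1}] with c_i = relu(w . x_{i:i+h-1} + b)."""
    feature_map = []
    for i in range(len(x) - h + 1):
        # Concatenate the h word embeddings in the current window.
        window = [v for vec in x[i:i + h] for v in vec]
        s = sum(wi * vi for wi, vi in zip(w, window)) + b
        feature_map.append(max(0.0, s))  # ReLU is an assumed choice of f
    return feature_map


def text_cnn_features(x, kernels):
    """Max-pool each kernel's feature map into a single value."""
    return [max(conv_feature_map(x, w, b, h)) for (w, b, h) in kernels]


# Three words with 2-dimensional embeddings, one kernel of width 2 and one of width 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [([1.0, 1.0, 1.0, 1.0], 0.0, 2), ([1.0, -1.0], 0.5, 1)]
pooled = text_cnn_features(x, kernels)  # one pooled value per kernel
```

Using several kernels with different widths h lets the pooled vector capture n-gram patterns of different lengths before it is passed to the classifier.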

The text classification algorithm based on machine learning uses traditional machine learning algorithms to perform the text classification task. The machine learning text classification algorithms in the module include KNN, naive Bayes, decision trees, support vector machines, and ensemble learning methods such as random forest and Bagging. The models take the TF-IDF features of the text as input, and each algorithm has its own optimization objective. The calling interface is the same as that of deep-learning-based text classification, so other models can be called conveniently.

The clustering algorithm based on machine learning mainly implements clustering algorithms such as K-Means and the LDA topic model. K-Means and similar clustering algorithms group the data into clusters. The LDA algorithm starts by learning topic words, and the learned model labels newly encountered texts with the corresponding topic words. Both the clustering algorithms and the LDA algorithm are unsupervised and therefore apply well to unlabeled data.

The word segmentation, part-of-speech tagging and named entity recognition algorithm based on the deep sequence model uses a deep neural network together with a conditional random field to complete the corresponding sequence labeling tasks. The deep neural network mainly extracts text features, and the conditional random field predicts the final labels. The specific model architecture differs among word segmentation, part-of-speech tagging and named entity recognition according to the task. The three models are trained in advance and made available for external use.

The deep sequence model algorithm mainly adopts a Bi-LSTM + CRF architecture, wherein the Bi-LSTM is a bidirectional long short-term memory network, and the forward propagation formulas of an LSTM unit with a forget gate are as follows:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_h(W_c x_t + U_c h_{t-1} + b_c)$$

$$h_t = o_t \circ \sigma_h(c_t)$$

The initial values are $c_0 = 0$ and $h_0 = 0$, and the operator $\circ$ denotes the element-wise product. The subscript t indexes the time step. $x_t$ is the input vector of the LSTM cell, $f_t$ is the activation vector of the forget gate, $i_t$ is the activation vector of the input/update gate, $o_t$ is the activation vector of the output gate, $h_t$ is the hidden state vector and also the output vector, $c_t$ is the cell state vector, and W, U, b are the weight matrices and biases learned during training. $\sigma_g$ is the sigmoid function and $\sigma_h$ is the tanh function.
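A single forward step of such an LSTM cell can be sketched in plain Python as follows. The gate equations follow the formulas given for f_t, i_t and o_t; the cell-state and hidden-state updates use the standard forget-gate LSTM formulation, which is an assumption here, and the `params` layout (a `(W, U, b)` tuple per gate) is hypothetical.

```python
import math


def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh_vec(v):
    return [math.tanh(a) for a in v]


def affine(W, x, U, h, b):
    """Compute W x + U h + b using nested lists as matrices."""
    return [
        sum(wi * xi for wi, xi in zip(W[r], x))
        + sum(ui * hi for ui, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step; params maps 'f', 'i', 'o', 'c' to (W, U, b) tuples."""
    def pre_activation(name):
        W, U, b = params[name]
        return affine(W, x_t, U, h_prev, b)

    f_t = sigmoid_vec(pre_activation("f"))      # forget gate
    i_t = sigmoid_vec(pre_activation("i"))      # input/update gate
    o_t = sigmoid_vec(pre_activation("o"))      # output gate
    c_tilde = tanh_vec(pre_activation("c"))     # candidate cell state
    # Element-wise products implement the cell-state and hidden-state updates.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

A bidirectional layer would run one such cell from left to right and a second one from right to left over the sequence, then concatenate the two hidden states at each position before passing them to the CRF layer.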

The CRF in the model is a linear-chain conditional random field, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are both sequences of random variables represented by linear chains. Given the random variable sequence x, the conditional probability distribution P(y|x) of the random variable sequence y constitutes a conditional random field if it satisfies the Markov property:

$$P(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = P(y_i \mid x, y_{i-1}, y_{i+1})$$

P(y|x) is then a linear-chain conditional random field. Let $t_k$ be the transition features and $s_l$ the emission features, with weights $\lambda_k$ and $\mu_l$ respectively; the linear-chain conditional random field is then jointly determined by all $t_k$, $\lambda_k$, $s_l$ and $\mu_l$. Its parameterized form is as follows:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big)$$

wherein Z(x) is a normalization factor:

$$Z(x) = \sum_{y} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big)$$
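The role of the normalization factor can be illustrated with a small brute-force sketch. Here `emit[i][label]` stands in for the weighted emission features at position i and `trans[a][b]` for the weighted transition features from label a to label b; both tables, and the assumption that transition scores do not depend on the position, are simplifications introduced for this example.

```python
import math
from itertools import product


def score(y, trans, emit):
    """Unnormalised log-score of label sequence y: emission plus transition terms."""
    s = sum(emit[i][y[i]] for i in range(len(y)))
    s += sum(trans[y[i - 1]][y[i]] for i in range(1, len(y)))
    return s


def crf_probability(y, trans, emit):
    """P(y | x) = exp(score(y)) / Z(x), with Z(x) summed over all label sequences."""
    n, num_labels = len(emit), len(trans)
    z = sum(
        math.exp(score(cand, trans, emit))
        for cand in product(range(num_labels), repeat=n)
    )
    return math.exp(score(y, trans, emit)) / z
```

Enumerating every label sequence is exponential in the sentence length and is only meant to make the definition concrete; practical implementations compute Z(x) with the forward algorithm and decode the best sequence with the Viterbi algorithm.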

The syntactic dependency parsing and semantic dependency parsing algorithm based on deep learning and graphs uses a deep neural network to obtain the features of a sentence and a graph algorithm to obtain the final syntactic or semantic dependency tree. A syntactic dependency tree is a tree representation of the syntactic structure of a sentence, and a semantic dependency tree is a tree representation of its semantic structure. The labeling scheme for the syntactic and semantic dependency trees is taken from a common training data set, with modifications. The models are trained in advance and can be called externally.

The similarity calculation module based on probability statistics and deep learning comprises various similarity and distance algorithms based on character strings, probability statistics and deep learning. The basic similarity algorithms include TF cosine similarity, TF-IDF cosine similarity, substring similarity, sentence similarity based on word embeddings, Jaccard coefficient similarity, Dice coefficient similarity and the like. The similarity algorithm based on deep learning uses a neural network architecture to learn the semantic similarity of two sentences. The distance algorithms include edit distance, Euclidean distance, Manhattan distance, Jaro-Winkler distance, chi-square distance, KL divergence, JS distance, cross entropy and the like. Both the distance and the similarity algorithms are exposed through a generic calling interface.
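Three of the string-based measures named above can be sketched as follows, together with a small dispatch table that mimics a generic calling interface. The character-level set representation, the `SIMILARITY` table and the `compare` function are illustrative assumptions, not the interface defined by the invention.

```python
def jaccard(a, b):
    """Jaccard coefficient over the character sets of two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def dice(a, b):
    """Dice coefficient over the character sets of two strings."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 1.0


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Hypothetical generic entry point: choose the algorithm by name.
SIMILARITY = {"jaccard": jaccard, "dice": dice, "edit_distance": edit_distance}


def compare(s1, s2, method):
    return SIMILARITY[method](s1, s2)
```

For the sentences 今天天气好 and 今天天气差, the character sets share three of five distinct characters, so the Jaccard coefficient is 0.6, the Dice coefficient is 0.75 and the edit distance is 1.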

The special phrase extraction algorithm based on rule parsing mainly uses word segmentation, part-of-speech tagging and rule parsing to extract special phrases from sentences, such as coordinate phrases and adverbial phrases. The model provides a unified external interface.

The sentence parsing algorithm based on the dependency tree and sentence structure parses a sentence using methods such as dependency tree parsing and sentence structure matching. Sentence parsing includes functions such as sentence category analysis, question category analysis, negation category analysis and subject-predicate-object extraction. The model provides an external calling interface, which makes it convenient to invoke.

The semantic slot and intent recognition algorithm based on deep learning uses deep learning to recognize sentence intent and to acquire semantic slots. The algorithm currently covers intent recognition and the extraction of important-information semantic slots in application domains such as weather and flight inquiry. The model is efficient and convenient, and provides a unified external calling interface.

The task application module is mainly responsible for the end-to-end invocation of a whole task: it calls the data processing module to acquire and process the input data; it calls the algorithm model module to train the corresponding functional model and to predict on new data; and finally it provides calling services to the outside through the Python calling interface and the Web service interface. The task application module comprises a classification application module, a clustering application module, a sequence labeling application module, a dependency parsing application module, a similarity application module, a sentence parsing application module and a semantic slot application module. The specific training and calling flow is shown in FIG. 2.

The classification application module implements the training and prediction of the text classification task using deep learning and machine learning classification algorithms. The training data are stored one sample per line together with its label, and interfaces are provided for file-based testing and single-sentence prediction. The test file has the same format as the training file, while single-sentence prediction needs only the raw text string as input. The evaluation metrics are mainly accuracy, recall and F1 score. The model parameters obtained after training are stored in the model library for subsequent testing or calling.
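The data format and evaluation metrics described above could be handled along the following lines. The tab separator between text and label is an assumption (the source only says that each line holds a sample and its label), and the function names are hypothetical. Precision is computed internally because the F1 score depends on it.

```python
def parse_line(line, sep="\t"):
    """Split one 'text<sep>label' line; the tab separator is an assumed convention."""
    text, label = line.rstrip("\n").rsplit(sep, 1)
    return text, label


def evaluate(gold, pred, positive):
    """Accuracy over all samples, plus recall and F1 for one class of interest."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}
```

For a multi-class task, `evaluate` can be called once per class and the per-class recall and F1 values averaged.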

The clustering application module includes basic clustering algorithms such as K-Means and an LDA topic model, and can be used to group similar texts and also to label topic words. The training data can be stored in a file, one text per line, with no labels required. The clustering function groups the texts into a predetermined number of clusters. Topic clustering automatically learns the topic words and partitions the data, and each sentence can be represented by a fixed number of topic words.

The sequence labeling application module provides basic natural language processing functions such as word segmentation, part-of-speech tagging and named entity recognition. The training set mainly uses the BIOES labeling scheme and is compatible with data set formats such as CoNLL. The input of word segmentation is a normal text sequence, and the output is the list of segmented words. The input of part-of-speech tagging is the list of segmented words, and the output is the corresponding list of part-of-speech tags. The input of named entity recognition is a normal text sequence, and the output is the recognized named entities and their categories.
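Turning per-character BIOES tags into named entities can be sketched as below. The `PREFIX-LABEL` tag layout (for example `B-ORG`) and the entity categories are assumptions for illustration; the sketch also does not check that the label on an E tag matches the label on the preceding B tag.

```python
def decode_bioes(chars, tags):
    """Return (entity_text, category) pairs from BIOES tags such as B-ORG, E-ORG, S-PER, O."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                      # single-character entity
            entities.append((chars[i], label))
            start = None
        elif prefix == "B":                    # entity begins
            start = i
        elif prefix == "E" and start is not None:
            entities.append(("".join(chars[start:i + 1]), label))
            start = None
        elif prefix == "O":                    # outside any entity
            start = None
        # An "I" tag just continues the open entity.
    return entities


chars = list("小明在北京大学读书")
tags = ["B-PER", "E-PER", "O", "B-ORG", "I-ORG", "I-ORG", "E-ORG", "O", "O"]
entities = decode_bioes(chars, tags)
```

Here `entities` contains 小明 as a `PER` entity and 北京大学 as an `ORG` entity. Word segmentation can reuse the same decoding with unlabeled B, I, E and S tags, treating every decoded span as a word.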

The dependency parsing application module provides functions such as syntactic dependency tree parsing and semantic dependency tree parsing. The training set is mainly a data set in CoNLL or CoNLL-U format. Syntactic dependency tree parsing generates a tree for a sentence that represents the syntactic dependency relationships among its words. Its input is the list of segmented words, and its output is, for each word, its parent node on the syntactic dependency tree and the category of the edge connecting them. Semantic dependency tree parsing generates a tree for a sentence that represents the semantic dependency relationships among its words. Its input is the list of segmented words and their parts of speech, and its output is, for each word, its parent node on the semantic dependency tree and the category of the edge connecting them. Word segmentation and part-of-speech tagging are performed by the sequence labeling module; the complete flow of syntactic parsing is shown in FIG. 3.
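The parser output described above, a parent and an edge category for each word, can be represented as in this sketch. The 1-based head indices with 0 for the root follow the CoNLL convention; the relation labels `SBV`, `HED` and `VOB` and the function names are placeholders chosen for the example.

```python
def heads_to_edges(words, heads, relations):
    """Turn per-word parent indices (1-based, 0 = root) into (parent, child, relation) edges."""
    edges = []
    for i, (head, rel) in enumerate(zip(heads, relations)):
        parent = "ROOT" if head == 0 else words[head - 1]
        edges.append((parent, words[i], rel))
    return edges


def is_tree(heads):
    """Check that there is exactly one root and that every word reaches it without a cycle."""
    if heads.count(0) != 1:
        return False
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


words = ["我", "爱", "北京"]
edges = heads_to_edges(words, [2, 0, 2], ["SBV", "HED", "VOB"])
```

In this example the verb 爱 is the root, with 我 and 北京 attached to it, and `is_tree` confirms that the head list forms a valid tree.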

The similarity application module calculates the similarity (or distance) of two input sentences. The similarity algorithms based on character strings and probability statistics do not require training data. The similarity algorithm based on deep learning requires training data, stored with two sentences and their similarity score on each line. The input of the similarity application module is two sentences and the type of similarity algorithm to use, and the output is the similarity (or distance) of the sentences.

The sentence parsing application module mainly parses the syntactic features of a sentence. This includes extracting special phrases in the sentence, such as coordinate phrases and adverbial phrases, as well as functions such as sentence category analysis, question category analysis, negation category analysis and subject-predicate-object extraction. The input of the model is a normal text sequence, and the output is the special phrases contained in it, the corresponding sentence category, the subject-predicate-object structure and the like.

The semantic slot application module mainly recognizes sentence intent and acquires semantic slots. At present the module mainly covers intent recognition and the extraction of important-information semantic slots in application domains such as weather and flight inquiry. The model input is a normal text sequence, and the output is the corresponding domain, the intent, and semantic slots such as time, place and flight.

Python calling interface and Web service interface

The Python calling interface, the Web service interface and the like mainly provide service calling interfaces to the outside. The user can complete full-pipeline model training and function calls through a uniform interface. To train a model, only the configuration file and the training data path need to be passed in, and the model for the corresponding natural language processing task is trained according to the configuration file. To call a function, only the task name and the text to be processed need to be supplied, and the processing result is returned. All natural language processing tasks have essentially the same calling interface, and the calling process is completely unified. The input of the interface is mainly character strings or key-value pairs, and the output is mainly key-value pairs or json files.
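One way to picture such a unified calling interface is a task registry with a single entry point, as in the sketch below. Every name here (`register`, `call`, the task names and the placeholder handlers) is hypothetical; the handlers merely stand in for trained models so that the dispatch logic can be shown end to end.

```python
import json

_TASKS = {}  # hypothetical registry mapping task names to handler functions


def register(task):
    """Decorator that registers a handler under a task name."""
    def decorator(fn):
        _TASKS[task] = fn
        return fn
    return decorator


@register("clean")
def _clean(payload):
    return {"text": payload["text"].strip()}


@register("segment")
def _segment(payload):
    # Placeholder: a character split stands in for a trained segmentation model.
    return {"words": list(payload["text"])}


def call(task, payload):
    """Single entry point: key-value input, JSON string output, same flow for every task."""
    if task not in _TASKS:
        raise ValueError(f"unknown task: {task}")
    return json.dumps({"task": task, "result": _TASKS[task](payload)}, ensure_ascii=False)
```

A call such as `call("segment", {"text": "你好"})` returns a JSON string with the task name and its result, and a Web service layer could expose the same `call` function behind an HTTP endpoint without changing the handlers.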

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. A Chinese natural language processing tool system based on machine learning and deep learning, comprising:

the data processing module is used for acquiring the Chinese text to be processed and the processing task type and converting the received Chinese text into a computer-readable data format according to the processing task type;

the task application module is used for calling an algorithm model library by using a unified interface according to the data acquired by the data processing module and the natural language processing requirement to finish the training of the full-flow model; providing a standard and uniform task calling interface to the outside according to the stored natural language processing model so as to complete the corresponding natural language processing task;

2. The system of claim 1, wherein the data processing module comprises an IO module, a data management module, a data cleaning module, and a Token conversion module;

3. The system of claim 1, wherein the task application module comprises a classification application module, a clustering application module, a sequence labeling application module, a dependency parsing application module, a similarity application module, a sentence parsing application module, and a semantic slot application module;

4. The system of claim 1, wherein the deep sequence model-based word segmentation, part-of-speech tagging and named entity recognition algorithm adopts a Bi-LSTM + CRF architecture, wherein the Bi-LSTM is a bidirectional long short-term memory network, and the forward propagation formulas of an LSTM unit with a forget gate are as follows:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_h(W_c x_t + U_c h_{t-1} + b_c)$$

$$h_t = o_t \circ \sigma_h(c_t)$$

the initial values are $c_0 = 0$ and $h_0 = 0$, the operator $\circ$ denotes the element-wise product, the subscript t indexes the time step, $x_t$ is the input vector of the LSTM cell, $f_t$ is the activation vector of the forget gate, $i_t$ is the activation vector of the input/update gate, $o_t$ is the activation vector of the output gate, $h_t$ is the hidden state vector and also the output vector, $c_t$ is the cell state vector, and W, U, b are the weight matrices and biases learned during training; $\sigma_g$ is the sigmoid function and $\sigma_h$ is the tanh function;

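the CRF in the model is a linear-chain conditional random field, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are both sequences of random variables represented by linear chains, and the conditional probability distribution P(y|x) satisfies the Markov property:

$$P(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = P(y_i \mid x, y_{i-1}, y_{i+1})$$

P(y|x) is then a linear-chain conditional random field; let $t_k$ be the transition features and $s_l$ the emission features, with weights $\lambda_k$ and $\mu_l$ respectively; the linear-chain conditional random field is then jointly determined by all $t_k$, $\lambda_k$, $s_l$ and $\mu_l$, and its parameterized form is as follows:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big)$$

wherein Z(x) is a normalization factor:

$$Z(x) = \sum_{y} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big).$$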