CN110705296A - Chinese natural language processing tool system based on machine learning and deep learning - Google Patents

Chinese natural language processing tool system based on machine learning and deep learning

Info

Publication number
CN110705296A
Authority
CN
China
Prior art keywords
module
algorithm
data
model
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910867399.6A
Other languages
Chinese (zh)
Inventor
魏巍
陈志毅
李恒
杨佳鑫
王赞博
徐晨维
热克甫
王振海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910867399.6A
Publication of CN110705296A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese natural language processing tool system based on machine learning and deep learning, comprising: a data processing module, which acquires the Chinese text to be processed and the processing task type and converts the received Chinese text into a computer-readable data format according to the task type; a task application module, which calls an algorithm model library through a unified interface, according to the data acquired by the data processing module and the natural language processing requirement, to complete full-flow model training, and which exposes a standard, uniform task calling interface based on the stored natural language processing models so that the corresponding natural language processing task can be completed; and an algorithm model library, which stores the algorithms for the natural language processing tasks and the models trained with them. The invention constructs a reasonable system architecture and, by unifying the training interfaces, training flow, calling interfaces and calling flow of all functions, makes the system simpler and more efficient to use as a natural language processing tool.

Description

Chinese natural language processing tool system based on machine learning and deep learning
Technical Field
The invention relates to a natural language processing technology, in particular to a Chinese natural language processing tool system based on machine learning and deep learning.
Background
Conventional natural language processing tools are typically based on classical machine learning algorithms such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). With the advancement of deep learning, many studies based on deep neural network models have been devoted to improving existing natural language processing algorithms; these models typically encode character and word information as distributed representations and learn the natural language processing task through end-to-end training. Recently, more and more deep learning algorithms have been shown to perform well on natural language processing tasks, and some well-performing tools built on the latest techniques have been proposed. However, Chinese natural language processing toolkit systems that are based on machine learning and deep learning, cover multiple natural language processing tasks and include mainstream algorithm models are still very rare.
Disclosure of Invention
The invention aims to solve the technical problem of providing a Chinese natural language processing tool system based on machine learning and deep learning aiming at the defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: a Chinese natural language processing tool system based on machine learning and deep learning, comprising:
the data processing module is used for acquiring the Chinese text to be processed and the processing task type, converting the received Chinese text into a computer-readable data format according to the processing task type, and providing a uniform data processing interface for the task application module;
the task application module is used for calling an algorithm model library by using a unified interface according to the data acquired by the data processing module and the natural language processing requirement to finish the training of the full-flow model; providing a standard and uniform task calling interface to the outside according to the stored optimal model so as to complete the corresponding natural language processing task; using the finally updated model obtained by training to complete the natural language processing task;
the algorithm model library is used for storing the algorithm of the natural language processing task and a model obtained by training according to the algorithm;
the algorithm model library comprises the following algorithms: a text classification algorithm based on deep learning; a text classification algorithm based on machine learning; a clustering algorithm based on machine learning; word segmentation, part-of-speech tagging and named entity recognition algorithms based on a deep sequence model; syntactic dependency analysis and semantic dependency analysis algorithms based on deep learning and graphs; a similarity algorithm based on probability statistics and deep learning; a special phrase extraction algorithm based on rule analysis; a sentence analysis algorithm based on a dependency tree and sentence structure; and a semantic slot and intention recognition algorithm based on deep learning.
According to the scheme, the data processing module comprises an IO module, a data management module, a data cleaning module and a Token conversion module;
the IO module is used for reading and writing various types of data files; the data file includes: txt files, json files, xml files, csv files, Numpy data files, Pickle data files and MySQL database files;
the data management module is used for uniformly processing files with different data formats of different tasks; the processing comprises the steps of acquiring text data, constructing a feature mapping table and converting text content features;
the data cleaning module is used for cleaning the original text data, including removing invalid character strings, removing stop words and converting between traditional and simplified Chinese characters;
and the Token conversion module is used for converting text tokens (words or characters) into corresponding ids by constructing a corresponding vocabulary.
According to the scheme, the task application module comprises a classification application module, a clustering application module, a sequence labeling application module, a dependency analysis application module, a similarity application module, a sentence analysis application module and a semantic slot application module;
the classification application module is used for calling the deep learning and machine learning classification algorithms in the algorithm model library to train and predict on text classification tasks, and for storing the model parameters obtained after training in the algorithm model library;
the clustering application module is used for calling the machine-learning-based clustering algorithms and the LDA topic model in the algorithm model library to group similar texts and label topic words;
the sequence labeling application module is used for natural language processing functions including word segmentation, part-of-speech tagging and named entity recognition;
the dependency analysis application module is used for completing syntax dependency tree analysis and semantic dependency tree analysis;
the similarity application module is used for finishing the calculation of the similarity (or distance) of two input sentences;
the sentence parsing application module is used for parsing the syntactic characteristics of a sentence; the input of the module is a normal text sequence, and the output is the special phrases contained, the corresponding sentence category and the subject-predicate-object structure;
the semantic slot recognition module is used for recognizing sentence intentions and acquiring semantic slots; the model input is a normal text sequence, and the output is the corresponding domain, the intention, and semantic slots such as time, place and flight.
According to the scheme, the word segmentation, part-of-speech tagging and named entity recognition algorithm based on the deep sequence model adopts a Bi-LSTM + CRF architecture, wherein the Bi-LSTM is a bidirectional long short-term memory network, and the forward propagation formulas of an LSTM unit with a forget gate are as follows:
$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_h(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \circ \sigma_h(c_t)$$
the initial values are $c_0 = 0$ and $h_0 = 0$, and the operator $\circ$ denotes the element-wise product. The subscript $t$ indexes the time step. $x_t$ is the input vector of the LSTM unit, $f_t$ is the activation vector of the forget gate, $i_t$ is the activation vector of the input/update gate, $o_t$ is the activation vector of the output gate, $h_t$ is the hidden state vector, which is also the output vector, $c_t$ is the cell state vector, and $W$, $U$ and $b$ are the weight matrices and biases learned during training; $\sigma_g$ is the sigmoid function and $\sigma_h$ is the tanh function;
the CRF in the model is a linear-chain conditional random field, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are both sequences of random variables represented by linear chains; given the random variable sequence $x$, the conditional probability distribution $P(y \mid x)$ of the random variable sequence $y$ constitutes a conditional random field, i.e. it satisfies the Markov property:
$$P(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = P(y_i \mid x, y_{i-1}, y_{i+1})$$
then $P(y \mid x)$ is a linear-chain conditional random field. Let $t_k$ denote the transition features (transition probabilities) and $s_l$ the emission features (emission probabilities), and let the weight of $t_k$ be $\lambda_k$ and the weight of $s_l$ be $\mu_l$; the linear-chain conditional random field is then jointly determined by all $t_k$, $\lambda_k$, $s_l$ and $\mu_l$. Its parameterized form is as follows:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big)$$
wherein $Z(x)$ is a normalization factor:
$$Z(x) = \sum_{y} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big).$$
the invention has the following beneficial effects: the invention constructs a reasonable system architecture, unifies data input of different application functions through the data processing module, and realizes unified calling of related function algorithms by the task application module, thereby achieving unified training interfaces, unified training flow, unified calling interfaces and unified calling flow.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic structural diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a task application module call flow according to an embodiment of the present invention;
fig. 3 is a schematic diagram of syntax parsing of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, a tool system for multiple basic Chinese natural language processing tasks implements a number of mainstream machine learning and deep learning algorithms to assist Chinese text processing and understanding. Specifically, the proposed tool system comprises main modules such as a data processing module, an algorithm model module and a task application module, and auxiliary modules such as a Python calling interface and a Web service interface.
The data processing module converts Chinese text into a computer-readable data form; in particular, for a piece of text $x_1, x_2, \ldots, x_n$, the data processing module can convert it into a corresponding id list $i_1, i_2, \ldots, i_n$ or a text feature matrix $W$. To make this conversion general and efficient, the data processing module of the invention comprises an IO module, a data management module, a data cleaning module, a Token conversion module and the like.
The IO module implements read and write operations on txt, json, xml and csv files, Numpy data files, Pickle data files and other files, and on databases such as MySQL. It also provides basic directory handling and file merging and splitting. Other modules call the IO module through a Python API.
The data management module is responsible for uniformly processing files with different data formats for different tasks. It reads the training and test data required by each task and converts them into a uniform computer-readable data format.
The data management module comprises functions of acquiring text data, constructing a feature mapping table, converting text content into features and the like. The data management module can acquire data of tasks such as text classification, sequence labeling, dependency analysis and the like in a unified mode for being used by the corresponding task module.
The data cleaning module is responsible for cleaning the original text data; its functions include removing invalid character strings, removing stop words and converting between traditional and simplified Chinese characters. The module is implemented in Python and can be called by other programs.
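The cleaning steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expression that defines an "invalid character string" (HTML tags, URLs, control characters) and the four-entry conversion table are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny traditional-to-simplified table; a real
# system would load a complete conversion table instead.
T2S = {"語": "语", "處": "处", "這": "这", "個": "个"}

# Assumed definition of "invalid character strings": HTML tags, URLs and
# ASCII control characters.
INVALID = re.compile(r"<[^>]+>|https?://\S+|[\x00-\x1f\x7f]")


def clean_text(text):
    """Strip invalid strings and convert traditional characters to simplified."""
    text = INVALID.sub("", text)
    return "".join(T2S.get(ch, ch) for ch in text)


def remove_stop_words(tokens, stop_words):
    """Drop stop words from an already segmented token list."""
    return [tok for tok in tokens if tok not in stop_words]
```

Stop-word removal is written against a token list because it only makes sense after word segmentation, whereas the other two steps operate on the raw string.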
The Token conversion module is a main module for converting text data into computer readable data, and converts text characters into corresponding id through constructing a corresponding word list, so that the text characters can be directly used for subsequent tasks. The Token conversion module maintains a dictionary and an id mapping of data, determines the dictionary and the id mapping by a word frequency statistical method, and can convert original text characters into corresponding ids to be used as the input of a task model. The Token conversion module has uniform interfaces, and different tasks can be called by using the same interface.
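A frequency-based vocabulary and the token-to-id conversion can be sketched as below. The names `build_vocab` and `to_ids`, the reserved `<pad>` and `<unk>` entries and the `min_freq` cutoff are illustrative assumptions; the patent only states that the mapping is determined by word-frequency statistics.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical reserved tokens


def build_vocab(token_lists, min_freq=1):
    """Map each token to an id, most frequent tokens first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {PAD: 0, UNK: 1}
    # Sort by descending frequency, then by token so ids are deterministic.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def to_ids(tokens, vocab):
    """Convert tokens to ids, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]
```

Because every task goes through the same two functions, a classifier and a sequence labeler can share one conversion interface and differ only in the corpus used to build the vocabulary.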
Algorithm model library
The invention includes algorithms and models for multiple natural language processing tasks, mainly classical deep learning and machine learning algorithms for these tasks together with the improved algorithms and models provided by the invention. The algorithm model module comprises the following algorithms: a text classification algorithm based on deep learning; a text classification algorithm based on machine learning; a clustering algorithm based on machine learning; word segmentation, part-of-speech tagging and named entity recognition algorithms based on a deep sequence model; syntactic dependency analysis and semantic dependency analysis algorithms based on deep learning and graphs; a similarity algorithm based on probability statistics and deep learning; a special phrase extraction algorithm based on rule analysis; a sentence analysis algorithm based on a dependency tree and sentence structure; and a semantic slot and intention recognition algorithm based on deep learning.
The text classification algorithm based on deep learning uses a deep neural network model to perform the text classification task end to end; the models include text convolution, fastText, a bidirectional recurrent attention network and Transformer. The models use word2vec word embeddings and are trained to minimize the cross entropy between the output and the labels. They share a uniform, convenient external calling interface for text classification tasks.
The text convolution model adopts single-layer, multi-kernel convolution. Let $x_i$ be the word embedding of the $i$-th word in the text; a document of $n$ words is then represented as the concatenation
$$x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$$
For a convolution kernel with weight $w$ spanning a window of $h$ words, bias $b$ and nonlinear activation $f$, the convolution is computed as $c_i = f(w \cdot x_{i:i+h-1} + b)$. The convolution results are then concatenated into the feature map $c = [c_1, c_2, \ldots, c_{n-h+1}]$, max pooling $\hat{c} = \max\{c\}$ is applied, and finally a fully connected network performs the classification.
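The convolution and pooling for one kernel can be sketched in plain Python. The sketch assumes ReLU as the activation f (the text does not name one) and represents the kernel weight as a flat list of h × d numbers for d-dimensional embeddings; the function names are illustrative.

```python
def conv_feature_map(embeddings, w, b, h):
    """Compute c_i = f(w . x_{i:i+h-1} + b) for every window of h words, f = ReLU."""
    n = len(embeddings)
    feature_map = []
    for i in range(n - h + 1):
        # Concatenate the h word vectors of the window into one flat vector.
        window = [v for vec in embeddings[i:i + h] for v in vec]
        z = sum(wi * xi for wi, xi in zip(w, window)) + b
        feature_map.append(max(0.0, z))
    return feature_map


def max_over_time(feature_map):
    """Max pooling: keep the strongest response of the kernel."""
    return max(feature_map)
```

With several kernels (the "multi-kernel" part), each kernel contributes one pooled value, and the resulting vector is what the fully connected layer classifies.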
The text classification algorithm based on machine learning uses traditional machine learning algorithms to perform the text classification task. The algorithms in this module include KNN, naive Bayes, decision trees, support vector machines, and ensemble learning methods such as random forest and Bagging. The models take the TF-IDF features of the text as input, and each algorithm has its own optimization objective. The calling interface is the same as for deep-learning-based text classification, so the models can be called just as conveniently.
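The TF-IDF input features can be computed as in the sketch below. It assumes the unsmoothed variant tf = count / document length and idf = log(N / df); the patent does not specify which weighting variant is used, and the function name is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

A term that occurs in every document gets weight zero under this variant, which is why smoothed idf formulas are often preferred in practice.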
The clustering algorithm based on machine learning mainly implements clustering algorithms such as K-Means and the LDA topic recognition model. K-Means and similar algorithms cluster and group the data. The LDA algorithm starts from the acquisition of topic words, and the learned model labels newly encountered texts with the corresponding topic words. Both the clustering algorithms and LDA are unsupervised and can therefore be applied directly to unlabeled data.
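K-Means with a fixed number of clusters can be sketched as below, using Lloyd's algorithm on dense feature vectors. The initialization (random sample of the points), the fixed iteration count and the function names are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Cluster points into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids
```

For text, the points would typically be TF-IDF or embedding vectors of the documents; no labels are needed at any step.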
The word segmentation, part-of-speech tagging and named entity recognition algorithms based on the deep sequence model use a deep neural network model and a conditional random field to complete the corresponding sequence labeling tasks. The deep neural network mainly extracts text features, and the conditional random field predicts the final labels. The three tasks (word segmentation, part-of-speech tagging and named entity recognition) use different specific model architectures. The corresponding models are trained in advance and made available for external use.
The deep sequence model mainly adopts a Bi-LSTM + CRF architecture, wherein the Bi-LSTM is a bidirectional long short-term memory network, and the forward propagation formulas of an LSTM unit with a forget gate are as follows:
$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_h(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \circ \sigma_h(c_t)$$
The initial values are $c_0 = 0$ and $h_0 = 0$, and the operator $\circ$ denotes the element-wise product. The subscript $t$ indexes the time step. $x_t$ is the input vector of the LSTM unit, $f_t$ is the activation vector of the forget gate, $i_t$ is the activation vector of the input/update gate, $o_t$ is the activation vector of the output gate, $h_t$ is the hidden state vector, which is also the output vector, $c_t$ is the cell state vector, and $W$, $U$ and $b$ are the weight matrices and biases learned during training. $\sigma_g$ is the sigmoid function and $\sigma_h$ is the tanh function.
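A single forward step of an LSTM unit with a forget gate can be sketched in plain Python. The sketch assumes the standard cell-state and hidden-state updates (the cell state mixes the previous state and a tanh candidate through the forget and input gates, and the hidden state is the output gate times tanh of the cell state); the function names and the layout of `params` are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h_prev):
    """Return W x + U h_prev + b as a list."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h_prev))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One forward step; params maps "f", "i", "o", "c" to (W, U, b)."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]
    candidate = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]
    # Element-wise products implement the cell-state and hidden-state updates.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, candidate)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

A bidirectional layer runs this step once left to right and once right to left over the sequence and concatenates the two hidden states at each position.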
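The CRF in the model is a linear-chain conditional random field, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are both sequences of random variables represented by linear chains. Given the random variable sequence $x$, the conditional probability distribution $P(y \mid x)$ of the random variable sequence $y$ constitutes a conditional random field, i.e. it satisfies the Markov property:
$$P(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = P(y_i \mid x, y_{i-1}, y_{i+1})$$
Then $P(y \mid x)$ is a linear-chain conditional random field. Let $t_k$ denote the transition features (transition probabilities) and $s_l$ the emission features (emission probabilities), and let the weight of $t_k$ be $\lambda_k$ and the weight of $s_l$ be $\mu_l$; the linear-chain conditional random field is then jointly determined by all $t_k$, $\lambda_k$, $s_l$ and $\mu_l$. Its parameterized form is as follows:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big)$$
wherein $Z(x)$ is a normalization factor:
$$Z(x) = \sum_{y} \exp\Big(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\Big)$$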
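The role of the normalization factor can be illustrated with a brute-force sketch for a tiny label set. It assumes the weighted feature sums have already been collapsed into two score tables: `emissions[i][tag]` for the emission terms at position i and `transitions[a][b]` for the transition terms from tag a to tag b. Both table names and the function names are hypothetical.

```python
import math
from itertools import product


def sequence_score(y, emissions, transitions):
    """Unnormalized log-score: sum of emission and transition scores along y."""
    score = sum(emissions[i][tag] for i, tag in enumerate(y))
    score += sum(transitions[a][b] for a, b in zip(y, y[1:]))
    return score


def crf_probability(y, emissions, transitions):
    """P(y | x) = exp(score(y)) / Z(x), with Z(x) summed over all label sequences."""
    n, num_tags = len(emissions), len(emissions[0])
    z = sum(
        math.exp(sequence_score(candidate, emissions, transitions))
        for candidate in product(range(num_tags), repeat=n)
    )
    return math.exp(sequence_score(y, emissions, transitions)) / z
```

Enumerating every label sequence is exponential in the sentence length, so it is only suitable for checking small cases; practical implementations compute Z(x) with the forward algorithm and decode with Viterbi.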
The syntactic dependency analysis and semantic dependency analysis algorithms based on deep learning and graphs use a deep neural network to obtain sentence features and a graph algorithm to obtain the final syntactic or semantic dependency tree. A syntactic dependency tree is a tree representation of the syntactic structure of a sentence, and a semantic dependency tree is a tree representation of its semantic structure. The labeling scheme for the syntactic and semantic dependency trees is taken from a public training data set and modified. The models are trained in advance and made available for external calls.
The similarity calculation module based on probability statistics and deep learning comprises various similarity and distance algorithms based on character strings, probability statistics and deep learning. The basic similarity algorithms include TF cosine similarity, TF-IDF cosine similarity, substring similarity, sentence similarity based on word embeddings, Jaccard coefficient similarity, Dice coefficient similarity and the like. The similarity algorithm based on deep learning uses a neural network architecture to learn the semantic similarity of two sentences. The distance algorithms include edit distance, Euclidean distance, Manhattan distance, Jaro-Winkler distance, chi-square distance, KL divergence, JS distance, cross entropy and the like. Both the distance and the similarity algorithms expose a generic calling interface.
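Three of the string-based measures can be sketched as follows. The sketch assumes character-level sets for the Jaccard and Dice coefficients (word-level sets would work the same way after segmentation) and treats two empty strings as identical; the function names are illustrative.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def dice_similarity(a, b):
    """2 |A ∩ B| / (|A| + |B|) over the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution
        previous = current
    return previous[-1]
```

All three take two strings and return one number, which is what makes a generic "two sentences plus algorithm name" interface straightforward.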
The special phrase extraction algorithm based on rule parsing mainly uses word segmentation, part-of-speech tagging and rule parsing to extract special phrases from sentences, such as parallel phrases and adverbial phrases. The model provides a unified external interface.
The sentence analysis algorithm based on the dependency tree and sentence structure analyzes sentences using methods such as dependency tree analysis and sentence structure matching. The analysis covers sentence category analysis, question category analysis, negation category analysis, subject-predicate-object extraction and the like. The model provides an external calling interface for convenient analysis.
The semantic slot and intention recognition algorithm based on deep learning uses deep learning to recognize sentence intentions and acquire semantic slots. The algorithm covers intention recognition and the extraction of key-information semantic slots in application domains such as weather and flight inquiry. The model is efficient and convenient and provides a unified external calling interface.
The task application module is mainly responsible for the full-flow calling of a task: it calls the data processing module to acquire and process the input data, calls the algorithm model module to train the corresponding function model and predict on new data, and finally connects to the Python calling interface and the Web service interface to provide calling services externally. The task application module comprises a classification application module, a clustering application module, a sequence labeling application module, a dependency analysis application module, a similarity application module, a sentence analysis application module and a semantic slot application module. The specific training and calling flow is shown in FIG. 2.
The classification application module trains and predicts on text classification tasks using the deep learning and machine learning classification algorithms. The training data is stored one sample per line, each with its data and label, and interfaces are provided for file-based testing and single-sentence prediction. The test file has the same format as the training file, while single-sentence prediction only needs the raw text string as input. The evaluation metrics are mainly accuracy, recall and F1. The model parameters obtained after training are stored in the model library for subsequent testing or calling.
The clustering application module includes basic clustering algorithms such as K-Means and the LDA topic model, and can be used to group similar texts and even label topic words. The training data is stored in a file, one text per line, and needs no labels. The clustering function groups the texts into a predetermined number of clusters. Topic clustering automatically learns the topic words and partitions the data, and each sentence can be represented by a fixed number of topic words.
The sequence labeling application module comprises natural language processing basic functions of word segmentation, part of speech labeling, named entity recognition and the like. The training set mainly adopts a BIOES labeling system and is compatible with data set formats such as conll and the like. The input of the segmentation is a normal text sequence and the output of the segmentation is a word list of segmented words. The input of the part of speech tagging is a word list of the divided words, and the output is a corresponding part of speech tag list. The input for the named entity is a normal text sequence and the output is the identified named entity and its category.
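Turning a BIOES tag sequence into named entities and their categories can be sketched as below. The tag format `B-LOC` / `I-LOC` / `E-LOC` / `S-LOC` / `O` and the function name are assumptions for illustration, and spans that are opened but never closed by an `E` tag are dropped.

```python
def decode_bioes(chars, tags):
    """Collect (entity_text, category) pairs from BIOES tags such as "B-LOC"."""
    entities, buffer, category = [], [], None
    for ch, tag in zip(chars, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-character entity.
            entities.append((ch, label))
            buffer, category = [], None
        elif prefix == "B":
            buffer, category = [ch], label
        elif prefix in ("I", "E") and buffer and label == category:
            buffer.append(ch)
            if prefix == "E":
                entities.append(("".join(buffer), category))
                buffer, category = [], None
        else:
            # "O" or an inconsistent tag closes any unfinished span without emitting it.
            buffer, category = [], None
    return entities
```

The same decoding applies to word segmentation if each word is treated as a span with a single dummy category.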
The dependency analysis application module comprises functions such as syntactic dependency tree analysis and semantic dependency tree analysis. The training sets are mainly data sets in conll or conllu format. Syntactic dependency tree parsing generates a tree for a sentence that represents the syntactic dependency relationships between its words. Its input is the list of segmented words, and its output is, for each word, its parent node on the syntactic dependency tree and the category of the edge connecting the word to that parent. Semantic dependency tree parsing generates a tree for a sentence that represents the semantic dependency relationships between its words. Its input is the list of segmented words with their parts of speech, and its output is, for each word, its parent node on the semantic dependency tree and the category of the edge connecting the word to that parent. Word segmentation and part-of-speech tagging are performed by the sequence labeling module, and the complete syntactic parsing flow is shown in FIG. 3.
The similarity application module implements a function of calculating a similarity (or distance) for two input sentences. The similarity algorithm based on string and probability statistics does not require training data. The deep learning based similarity algorithm requires training data and stores in a format of two sentences per line plus their similarity score. The input of the similarity application module is two sentences and the type of similarity algorithm used, and returns the similarity (or distance) of the sentences.
The sentence parsing application module mainly parses the syntactic characteristics of sentences. This includes extracting special phrases, such as parallel phrases and adverbial phrases, as well as sentence category analysis, question category analysis, negation category analysis and subject-predicate-object extraction. The input of the model is a normal text sequence, and the output is the special phrases contained, the corresponding sentence category, the subject-predicate-object structure and the like.
The semantic slot recognition module mainly recognizes sentence intentions and acquires semantic slots. The module currently covers intention recognition and the extraction of key-information semantic slots in application domains such as weather and flight inquiry. The model input is a normal text sequence, and the output is the corresponding domain, the intention, and semantic slots such as time, place and flight.
Python calling interface and Web service interface
The Python calling interface, the Web service interface and similar interfaces provide service calling interfaces to external users. The user can complete full-flow model training and function calling through a uniform interface. To train a model, the user only needs to pass in a configuration file and the training data path, and the corresponding natural language processing task model is trained according to the configuration file. To call a function, the user only needs to specify the task and the text to be processed, and the processing result is returned. All natural language processing tasks share essentially the same calling interface, and the calling process is fully unified. The interface input is mainly character strings or key-value pairs, and the output is mainly key-value pairs or json files.
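One way to realize such a uniform calling interface is a task registry behind a single entry point, sketched below. Everything here is hypothetical: the registry, the decorator, the `call_task` signature and the placeholder `char_count` task are illustrations of the idea, not the interface of the patented system.

```python
import json

TASK_REGISTRY = {}


def register_task(name):
    """Decorator that registers a handler under a task name."""
    def decorator(func):
        TASK_REGISTRY[name] = func
        return func
    return decorator


@register_task("char_count")
def char_count(text):
    # Placeholder task used only to exercise the interface.
    return {"length": len(text)}


def call_task(task, text):
    """Single entry point: task name and text in, JSON string out."""
    if task not in TASK_REGISTRY:
        raise ValueError(f"unknown task: {task}")
    result = TASK_REGISTRY[task](text)
    return json.dumps({"task": task, "result": result}, ensure_ascii=False)
```

A Web service wrapper would then only need to forward the task name and text from the request to `call_task` and return the JSON string, so adding a new task does not change the external interface.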
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (4)

1. A Chinese natural language processing tool system based on machine learning and deep learning, comprising:
a data processing module, configured to acquire the Chinese text to be processed and the processing task type, and to convert the received Chinese text into a computer-readable data format according to the processing task type;
a task application module, configured to call an algorithm model library through a unified interface, according to the data acquired by the data processing module and the natural language processing requirement, so as to complete full-pipeline model training; and to provide a standard, uniform task calling interface externally, based on the stored natural language processing models, so as to complete the corresponding natural language processing task;
an algorithm model library, configured to store the algorithms for natural language processing tasks and the models obtained by training according to those algorithms;
wherein the algorithms in the algorithm model library comprise a text classification algorithm based on deep learning, a text classification algorithm based on machine learning, a clustering algorithm based on machine learning, a word segmentation, part-of-speech tagging and named entity recognition algorithm based on a deep sequence model, a syntactic dependency analysis and semantic dependency analysis algorithm based on deep learning and graphs, a similarity algorithm based on probability statistics and deep learning, a special phrase extraction algorithm based on rule analysis, a sentence analysis algorithm based on a dependency tree and sentence structure, and a semantic slot and intent recognition algorithm based on deep learning.
2. The system of claim 1, wherein the data processing module comprises an IO module, a data management module, a data cleaning module, and a Token conversion module;
the IO module is used for reading and writing various types of data files; the data file includes: txt files, json files, xml files, csv files, Numpy data files, Pickle data files and MySQL database files;
the data management module is used for uniformly processing files with different data formats of different tasks; the processing comprises the steps of acquiring text data, constructing a feature mapping table and converting text content features;
the data cleaning module is used for cleaning the original text data, including removing invalid character strings, removing stop words, and converting between traditional and simplified Chinese;
and the Token conversion module is used for converting text tokens (words or characters) into corresponding ids by constructing a corresponding vocabulary.
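The cleaning and Token conversion steps of claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are single characters, the stop-word set and the `<pad>`/`<unk>` entries are hypothetical, and traditional/simplified conversion is omitted because it needs a mapping table that the standard library does not provide.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

STOP_WORDS = {"的", "了"}        # illustrative stop-word set, not from the patent
PAD, UNK = "<pad>", "<unk>"      # hypothetical reserved vocabulary entries


def clean(text: str) -> str:
    """Collapses whitespace runs to one space and drops control characters."""
    text = re.sub(r"\s+", " ", text)
    return re.sub(r"[\x00-\x1f\x7f]", "", text).strip()


def tokenize(text: str) -> List[str]:
    """Character-level tokens with spaces and stop words removed."""
    return [ch for ch in clean(text) if ch != " " and ch not in STOP_WORDS]


def build_vocab(texts: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Assigns ids by descending frequency, ties broken by code point."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def to_ids(text: str, vocab: Dict[str, int]) -> List[int]:
    """Maps each token to its id; unseen tokens map to the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]
```

A word-level variant would replace `tokenize` with the output of the word segmentation task while keeping `build_vocab` and `to_ids` unchanged.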
3. The system of claim 1, wherein the task application module comprises a classification application module, a clustering application module, a sequence labeling application module, a dependency parsing application module, a similarity application module, a sentence parsing application module, and a semantic slot recognition module;
the classification application module is used for calling deep learning and machine learning classification algorithms in the algorithm model library, so that training and prediction of text classification tasks are realized, and model parameters obtained after training are stored in the algorithm model library;
the clustering application module is used for calling a machine learning-based clustering algorithm and an LDA topic model in the algorithm model library, grouping similar texts and labeling topic words;
the sequence labeling application module is used for carrying out natural language processing including word segmentation, part of speech labeling and named entity identification;
the dependency analysis application module is used for completing syntax dependency tree analysis and semantic dependency tree analysis;
the similarity application module is used for finishing the calculation of the similarity (or distance) of two input sentences;
the sentence parsing application module is used for parsing the syntactic features of a sentence; the input of the module is an ordinary text sequence, and the output is the special phrases it contains, the corresponding sentence category, and the subject-predicate-object structure;
the semantic slot recognition module is used for recognizing sentence intent and extracting semantic slots; the model input is an ordinary text sequence, and the output is the corresponding domain, the intent, and semantic slots including time, place, and flight.
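To make the input and output of the similarity application module concrete, the following sketch scores two sentences with cosine similarity over character bigram counts. This is a deliberately simple stand-in, not the probability-statistics or deep-learning similarity algorithm named in claim 1; the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Counts character n-grams after removing all whitespace."""
    text = "".join(text.split())
    if len(text) < n:
        # Very short inputs are treated as a single gram (or none if empty).
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the two n-gram count vectors, in [0, 1]."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A distance can be derived as `1.0 - cosine_similarity(a, b)`, which matches the "similarity (or distance)" wording of the claim.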
4. The system of claim 1, wherein the deep sequence model-based word segmentation, part-of-speech tagging and named entity recognition algorithm uses a Bi-LSTM + CRF architecture, wherein Bi-LSTM is a bidirectional long short-term memory network, and the forward propagation formulas for an LSTM unit with a forget gate are as follows:
$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$
Figure FDA0002201663680000041
Figure FDA0002201663680000042
the initial values are $c_0 = 0$ and $h_0 = 0$; the operator
Figure FDA0002201663680000043
denotes the element-wise product; the subscript $t$ indexes the time step; $x_t$ is the input vector of the LSTM unit; $f_t$ is the activation vector of the forget gate; $i_t$ is the activation vector of the input/update gate; $o_t$ is the activation vector of the output gate; $h_t$ is the hidden state vector and also the output vector; $c_t$ is the cell state vector; $W$, $U$ and $b$ are the weight matrices and biases to be learned during training; $\sigma_g$ is the sigmoid function and $\sigma_h$ is the tanh function;
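As an illustration of the LSTM forward pass recited above, the following is a minimal pure-Python sketch of a single time step. The gate computations follow the three formulas written out in the claim. The cell-state and hidden-state updates appear in the source only as equation images, so the sketch assumes the commonly used form, c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_c x_t + U_c h_{t-1} + b_c) and h_t = o_t ∘ tanh(c_t); the function name `lstm_step` and the layout of `params` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GateParams = Tuple[Matrix, Matrix, Vector]  # (W, U, b) for one gate


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _affine(W: Matrix, U: Matrix, b: Vector, x: Vector, h: Vector) -> Vector:
    """Computes W x + U h + b row by row."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b_j
        for w_row, u_row, b_j in zip(W, U, b)
    ]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, GateParams]) -> Tuple[Vector, Vector]:
    """One forward step; params has keys 'f', 'i', 'o', 'c' (assumed layout)."""
    pre = {g: _affine(*params[g], x, h_prev) for g in ("f", "i", "o", "c")}
    f = [_sigmoid(z) for z in pre["f"]]          # forget gate
    i = [_sigmoid(z) for z in pre["i"]]          # input/update gate
    o = [_sigmoid(z) for z in pre["o"]]          # output gate
    c_tilde = [math.tanh(z) for z in pre["c"]]   # candidate cell state (assumed form)
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_tilde)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c
```

A bidirectional layer would run this step left to right and right to left over the sequence with separate parameters and concatenate the two hidden states at each position.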
the CRF in the model is a linear-chain conditional random field, where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are both sequences of random variables represented by linear chains; given the random variable sequence $x$, the conditional probability distribution $P(y \mid x)$ of the random variable sequence $y$ constitutes a conditional random field, i.e. it satisfies the Markov property:
$P(y_i \mid x, y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n) = P(y_i \mid x, y_{i-1}, y_{i+1})$
$P(y \mid x)$ is then a linear-chain conditional random field; let $t_k$ be the transition probability and $s_l$ the emission probability, and assume the weight of $t_k$ is $\lambda_k$ and the weight of $s_l$ is $\mu_l$; the linear-chain conditional random field is then jointly determined by all $t_k$, $\lambda_k$, $s_l$ and $\mu_l$; the parameterized form of the linear-chain conditional random field is as follows:
Figure FDA0002201663680000044
wherein $Z(x)$ is a normalization factor:
$Z(x) = \sum_y \exp\left(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\right)$.
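The role of the normalization factor can be checked on a toy example by brute-force enumeration. The sketch below assumes the standard linear-chain form, in which each transition term depends on the adjacent labels (y_{i-1}, y_i), and computes P(y|x) as exp(score) / Z(x). The callables `trans` and `emit` stand for the already weighted sums over the transition and emission terms; all names, labels, and scores are hypothetical. Practical implementations compute Z(x) with the forward algorithm rather than by enumeration.

```python
import math
from itertools import product
from typing import Callable, Sequence

Trans = Callable[[str, str, Sequence, int], float]  # (y_prev, y_cur, x, i) -> score
Emit = Callable[[str, Sequence, int], float]        # (y_cur, x, i) -> score


def score(y: Sequence[str], x: Sequence, trans: Trans, emit: Emit) -> float:
    """Unnormalized log-score: emission terms plus adjacent-label transition terms."""
    total = 0.0
    for i, label in enumerate(y):
        total += emit(label, x, i)
        if i > 0:
            total += trans(y[i - 1], label, x, i)
    return total


def crf_probability(y: Sequence[str], x: Sequence, labels: Sequence[str],
                    trans: Trans, emit: Emit) -> float:
    """P(y | x) with Z(x) summed over every label sequence of length len(x)."""
    z = sum(math.exp(score(cand, x, trans, emit))
            for cand in product(labels, repeat=len(x)))
    return math.exp(score(tuple(y), x, trans, emit)) / z


# Toy scores: favour starting with "B" and the transition B -> I.
def trans(prev: str, cur: str, x: Sequence, i: int) -> float:
    return 1.0 if (prev, cur) == ("B", "I") else 0.0


def emit(cur: str, x: Sequence, i: int) -> float:
    return 0.5 if (i == 0 and cur == "B") else 0.0
```

Enumeration costs |labels|^n evaluations, which is why it is only suitable for checking small cases.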
CN201910867399.6A 2019-09-12 2019-09-12 Chinese natural language processing tool system based on machine learning and deep learning Pending CN110705296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910867399.6A CN110705296A (en) 2019-09-12 2019-09-12 Chinese natural language processing tool system based on machine learning and deep learning

Publications (1)

Publication Number Publication Date
CN110705296A true CN110705296A (en) 2020-01-17

Family

ID=69195422

Country Status (1)

Country Link
CN (1) CN110705296A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140278373A1 (en) * 2013-03-15 2014-09-18 Ask Ziggy, Inc. Natural language processing (nlp) portal for third party applications
CN108920460A (en) * 2018-06-26 2018-11-30 武大吉奥信息技术有限公司 A kind of training method and device of the multitask deep learning model of polymorphic type Entity recognition
CN109213846A (en) * 2018-09-13 2019-01-15 山西卫生健康职业学院 A kind of natural language processing system
CN109492108A (en) * 2018-11-22 2019-03-19 上海唯识律简信息科技有限公司 Multi-level fusion Document Classification Method and system based on deep learning
CN109684395A (en) * 2018-12-14 2019-04-26 浪潮软件集团有限公司 A kind of visualized data Universal joint analytic method based on natural language processing
CN109902298A (en) * 2019-02-13 2019-06-18 东北师范大学 Domain Modeling and know-how estimating and measuring method in a kind of adaptive and learning system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SLAVKO ZITNIK et al.: "nutIE - A modern open source natural language processing toolkit", 25TH TELECOMMUNICATIONS FORUM *
Li Deyi et al.: "Introduction to Artificial Intelligence" (《人工智能导论》), 31 August 2018, China Science and Technology Press (中国科学技术出版社) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination