CN113535961A - System, method and device for realizing multi-language mixed short text classification processing based on small sample learning, memory and storage medium thereof - Google Patents


Info

Publication number
CN113535961A
CN113535961A
Authority
CN
China
Prior art keywords
text
model
data
word
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110886442.0A
Other languages
Chinese (zh)
Inventor
Wang Yongjian (王永剑)
Sun Yaru (孙亚茹)
Yang Ying (杨莹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute of the Ministry of Public Security
Original Assignee
Third Research Institute of the Ministry of Public Security
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute of the Ministry of Public Security filed Critical Third Research Institute of the Ministry of Public Security
Priority to CN202110886442.0A priority Critical patent/CN113535961A/en
Publication of CN113535961A publication Critical patent/CN113535961A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system for realizing multi-language mixed short text classification processing based on small sample learning. The system comprises a data acquisition module for inputting a small number of preset label samples into the system; a data preprocessing module for preprocessing the data of the preset label samples; a model calculation processing module for extracting key features and generating a corresponding model accuracy calculation result; and a model generation and output module for predicting the model prediction result of the current text data and further updating and iterating the output model through sampling and auditing of the model prediction result. The invention also relates to a corresponding method, device, processor and storage medium. With the system, method, device, processor and storage medium, the mining of potential information in large-scale data is completed in a time-saving and labor-saving manner by means of small sample learning, and word formation information and word cross-correlation information are effectively obtained, which is highly innovative.

Description

System, method and device for realizing multi-language mixed short text classification processing based on small sample learning, memory and storage medium thereof
Technical Field
The invention relates to the technical field of deep learning, in particular to the field of natural language processing, and specifically to a system, a method, a device, a memory and a computer-readable storage medium for realizing multi-language mixed short text classification processing based on small sample learning.
Background
Text classification, the task of assigning labels to text, is one of the important and fundamental tasks in natural language processing and supports many downstream tasks such as sentiment classification and topic extraction. Text classification technology is also the key to mining valuable information from text platforms. Posts on such platforms are mostly short texts, characterized by short sentences, multiple languages, diverse content, informality, grammatical errors, colloquialisms, slang and the like, so an effective text classification technology is needed to solve the problem of classifying short texts that mix several languages.
Traditional text classification algorithms focus on linear representations of text, such as support vector machine models that take dictionary or n-gram word vectors as input. Research in recent years has shown that nonlinear models can capture text context information efficiently and produce more accurate predictions than linear models. The convolutional neural network is a typical nonlinear model: it converts local features of the data into low-dimensional vectors while retaining task-related information. This efficient mapping is superior to sequence models for short text representation.
A convolutional neural network acquires the feature information of a data region by max pooling, keeping only the feature with the maximum value in each region during calculation. As the number of convolution layers increases, position information related to the target is gradually lost. Text regions may express more complex concepts, but learning them by extracting only the most significant feature information of each region ignores other useful information. In addition, coupling connections between network layers may add redundancy to the model.
In addition to the performance of the model, the quality of the data features also has a large impact on the results of downstream tasks. For multi-language mixed short texts, existing multilingual models such as Multilingual BERT and LASER cannot represent the features of different languages well in the same feature space. As a result, representations of multiple languages cannot be computed in a common feature space, and semantic deviation occurs.
The attention mechanism is an effective method for focusing on key information in the model input data. An attention model not only attends to salient feature information during training, but also adjusts the parameters of the neural network according to different features, and can thus mine more hidden feature information.
Disclosure of Invention
The present invention is directed to overcoming the above drawbacks of the prior art by providing a system, method, apparatus, memory and computer-readable storage medium for realizing multi-language mixed short text classification processing based on small sample learning, which can effectively obtain word formation information and word cross-correlation information.
To achieve the above object, the system, method, apparatus, memory and computer-readable storage medium for realizing multi-language mixed short text classification processing based on small sample learning according to the present invention are as follows:
The system for realizing the classification processing of the multi-language mixed short text based on small sample learning is mainly characterized by comprising:
the data acquisition module is used for inputting a small number of preset label samples into the system;
the data preprocessing module is connected with the data acquisition module and is used for carrying out data set division, data cleaning and batch processing on the preset label sample;
the model calculation processing module is connected with the data preprocessing module and used for extracting key features according to the text data acquired after preprocessing and generating a corresponding model accuracy calculation result; and
the model generation and output module is connected with the model calculation processing module and used for predicting the model prediction result of the current text data according to the model accuracy calculation result and further updating and iterating the output model through sampling and auditing of the model prediction result.
Preferably, the model calculation processing module specifically includes:
the word information processing unit is connected with the data preprocessing module and is used for performing n-gram segmentation, word embedding and iterative processing of word sets on the small number of labeled text data samples obtained after batch processing;
the text characteristic embedding unit is connected with the word information processing unit and is used for combining the word information subjected to the iterative processing into a text integral characteristic to be used as the input of an effective convolution layer;
the text key region characteristic unit is connected with the text characteristic embedding unit and is used for acquiring text key characteristic information in the text overall characteristic;
the text type judging unit is connected with the text key area characteristic unit and is used for analyzing and calculating the classification type of the current input text; and
the model accuracy calculation unit is connected with the text type judgment unit and is used for calculating the model accuracy of the text information obtained after the text processing.
Preferably, the model generation and output module specifically includes:
the model prediction processing unit is used for inputting multi-language mixed short text data and performing model prediction;
the prediction result output unit is connected with the model prediction processing unit and is used for outputting a model prediction result; and
the sampling and auditing unit is connected with the prediction result output unit and is used for sampling and auditing the model prediction result so as to detect the accuracy of the prediction model.
Preferably, the sampling auditing unit judges whether to perform update calibration against a system preset threshold value according to the following rules:
if the sampling accuracy of the audited text data is greater than the threshold value, new label data are added to the data acquisition module to perform iterative update processing of the model;
if the sampling accuracy of the audited text data is not greater than the threshold value, new label data are added to the data acquisition module after calibration processing to perform iterative update processing of the model.
The method for realizing the classification processing of the multi-language mixed short text based on the small sample learning by using the system is mainly characterized by comprising the following steps:
(1) acquiring text sub-word information from a multi-language mixed short text;
(2) carrying out data set division, data cleaning and batch operation pretreatment on the text sub-word information;
(3) performing text characteristic embedding on the preprocessed text sub-word information to obtain input information of an effective convolution layer;
(4) acquiring adjacent word information and text key area information of the text sub-word information by adopting different kernel convolutions;
(5) judging the category of the text according to probability distribution;
(6) performing classification model prediction and mining new text data information according to the category information, and updating and iterating the model.
Preferably, the step (3) specifically includes the following steps:
(3.1) searching for the word; if it is not found, searching for special sub-words before segmentation and entering step (3.2); if it is found, taking the word as a whole and entering step (3.3);
(3.2) if a special sub-word is found, segmenting according to the special sub-word and segmenting the remaining part according to the n-gram; otherwise segmenting directly according to the n-gram; the formed segments constitute the corresponding sub-word library; entering step (3.3);
(3.3) affine transforming the sub-word library formed after segmentation to the word-level representation, adding the newly represented word as a special sub-word to the sub-word set, and calculating the higher-layer sub-word representation.
Preferably, the sub-word characterization of the higher layer is calculated according to the following formula:
u_{i|g} = Σ_{g∈θ_w} W_{gi} · v_g
wherein g is a sub-word, i is the i-th word in the sentence, W_{gi} is the data transformation matrix, θ_w is the sub-word set, v_g represents the characterization of the sub-word g, and u_{i|g} (1 ≤ i ≤ n) is the higher-layer representation of the sub-word.
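As a non-authoritative illustration of steps (3.1)-(3.3), the following Python sketch shows one way the word lookup, special sub-word search and n-gram segmentation could be organized; the function names and the 3-gram default are assumptions rather than part of the patent:

    def ngram_subwords(word, n=3):
        """Split '<word>' into character n-grams; '<' and '>' mark the affix ends."""
        marked = "<" + word + ">"
        return [marked[i:i + n] for i in range(len(marked) - n + 1)]

    def segment(word, vocab, special_subwords, n=3):
        if word in vocab:                        # word found: keep it whole
            return [word]
        for sp in special_subwords:              # search special sub-words first
            if sp in word:
                head, _, tail = word.partition(sp)
                parts = ngram_subwords(head, n) if head else []
                parts.append(sp)
                if tail:
                    parts += ngram_subwords(tail, n)
                return parts
        return ngram_subwords(word, n)           # otherwise plain n-gram segmentation

    print(ngram_subwords("Ting_h"))
    # ['<Ti', 'Tin', 'ing', 'ng_', 'g_h', '_h>']  -- the 6 sub-words of the example below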
Preferably, the step (4) specifically includes the following steps:
(4.1) combining the affine-transformed higher-layer sub-word representations u_{i|g} into the text overall feature U = (u_1, u_2, …, u_n) as the input of the effective convolution layer;
(4.2) performing one-dimensional convolution on the text features by adopting convolution kernels with different widths and different channel numbers to obtain global features containing different adjacent word information;
and (4.3) performing text global feature control by adopting a self-attention mechanism, thereby calculating and outputting text key region information features.
Preferably, the global characteristics of the different neighboring word information in step (4.2) are calculated according to the following formula:
V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l));
where k is the kernel width, l is the convolution layer index, ReLU is the activation function, U_l represents the input feature data of the l-th kernel convolution, Conv_{1×k} represents a convolution operation with kernel width k, and V_{l+1} represents the global features after the l-th convolution.
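For illustration, a minimal NumPy sketch of this convolution follows, assuming U_l is a d × seq_len feature matrix; the channel count and kernel widths mirror the embodiment example below but are otherwise arbitrary:

    import numpy as np

    def conv1d_relu(U, kernels):
        """V_{l+1} = ReLU(Conv_{1xk}(U)); kernels has shape (channels, d, k)."""
        channels, d, k = kernels.shape
        seq_len = U.shape[1]
        out = np.zeros((channels, seq_len - k + 1))
        for c in range(channels):
            for t in range(seq_len - k + 1):
                # multiply the window by the kernel element-wise and sum
                out[c, t] = np.sum(kernels[c] * U[:, t:t + k])
        return np.maximum(out, 0.0)  # ReLU activation

    U = np.random.randn(5, 10)                     # feature dim 5, sentence length 10
    V1 = conv1d_relu(U, np.random.randn(4, 5, 2))  # 4 channels, kernel width k=2
    V2 = conv1d_relu(U, np.random.randn(4, 5, 4))  # 4 channels, kernel width k=4
    print(V1.shape, V2.shape)                      # (4, 9) (4, 7)

The two widths yield global features over different numbers of adjacent words, which is the point of using unequal kernel widths.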
Preferably, the step (4.3) specifically includes the following steps:
(4.3.1) calculating the coupling coefficient c_{jm} according to the following formula as the salient weight of the text global features:
c_{jm} = exp(b_{jm}) / Σ_m exp(b_{jm});
wherein j is the j-th row of the convolved feature matrix, m is the m-th column of the convolved feature matrix, b_{jm} is a feature value inside the convolved text data, initialized to the pre-convolution feature value u′_{jm}, and c_{jm} is the attention value of the j-th row and m-th column of the convolved feature values;
(4.3.2) calculating the text key region information features v_m according to the global pooling formula, wherein u′_{m|j} is the feature of the j-th row before convolution:
v_m = Σ_j c_{jm} · u′_{m|j};
(4.3.3) iteratively updating the internal coefficients b_{jm} of the effective pooling according to the following formula:
b_{jm} = b_{jm} + v_m · u′_{m|j}.
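The three sub-steps can be read as a routing-style iteration; a minimal NumPy sketch follows, in which the number of iterations and the matrix orientation (rows j, columns m) are assumptions consistent with the formulas above:

    import numpy as np

    def coupled_pooling(u_prime, iters=3):
        """u_prime: (rows j, cols m) pre-convolution features u'_{m|j}."""
        b = u_prime.copy()                         # b_jm initialised to u'_jm
        for _ in range(iters):
            e = np.exp(b - b.max(axis=1, keepdims=True))
            c = e / e.sum(axis=1, keepdims=True)   # c_jm: softmax over m
            v = (c * u_prime).sum(axis=0)          # v_m = sum_j c_jm * u'_{m|j}
            b = b + v[np.newaxis, :] * u_prime     # b_jm += v_m * u'_{m|j}
        return v

    v = coupled_pooling(np.random.randn(9, 4))     # e.g. a 9 x 4 convolved feature map
    print(v.shape)                                 # (4,)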
Preferably, the step (5) specifically comprises the following steps:
(5.1) integrating the text key region features v^(1) and v^(2) output by the 2 kernel convolutions of different widths into a text information feature vector v through the splicing function f; the text information feature vector v is calculated according to the following formula:
v = f(v^(1), v^(2));
(5.2) inputting the text information feature vector v into a feed-forward neural network FFNN(·) to output text category features, and predicting the probability distribution ŷ of the multiple categories of the text with the softmax function; the probability distribution ŷ is calculated according to the following formula:
ŷ = softmax(FFNN(v)).
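The classification in step (5) thus reduces to concatenation, one linear layer and a softmax; a sketch in which the weight shapes and class count are illustrative assumptions:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def classify(v1, v2, W, b):
        v = np.concatenate([v1, v2])   # splicing function f
        return softmax(W @ v + b)      # y_hat = softmax(FFNN(v))

    rng = np.random.default_rng(0)
    v1, v2 = rng.standard_normal(4), rng.standard_normal(4)
    y_hat = classify(v1, v2, rng.standard_normal((4, 8)), np.zeros(4))
    print(round(float(y_hat.sum()), 6), int(y_hat.argmax()))  # probabilities sum to 1; argmax is the class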
Preferably, the step (6) specifically includes the following steps:
(6.1) carrying out category information mining on the unmarked multilingual mixed short text data by using the trained model and carrying out classification model prediction;
(6.2) detecting the accuracy rate of the output result of the model in a sampling auditing mode;
and (6.3) expanding the new category data as a label sample into the sample data set for updating and iterative processing of the model.
Preferably, the step (6.2) specifically comprises the following steps:
the accuracy of the model was checked using the following audit criteria:
(6.2.1) sampling 5% of the predicted data amount;
(6.2.2) manually judging, wherein the score is 0/1;
(6.2.3) calculating the sampling accuracy rate according to the following formula:
Figure BDA0003194308920000055
(6.2.4) setting a check threshold value for comparison.
Preferably, the step (6.2.4) is specifically:
if the sampling accuracy is higher than the threshold value, expanding the new category data as label samples into the sample data set; if the sampling accuracy is lower than the threshold value, enlarging the sampling proportion, calibrating the erroneous labels, and further training the model after they are expanded into the label sample data set.
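One possible shape of the sampling audit of steps (6.2.1)-(6.2.4), with the threshold value, data layout and helper names as assumptions:

    import random

    def audit(predictions, manual_score, sample_ratio=0.05, threshold=0.9):
        """Sample 5% of the predictions, score them 0/1, compare with a threshold."""
        n = max(1, int(len(predictions) * sample_ratio))
        sample = random.sample(predictions, n)
        accuracy = sum(manual_score(p) for p in sample) / n   # sampling accuracy P
        if accuracy >= threshold:
            return "extend"        # add new-category data to the label sample set
        return "recalibrate"       # enlarge the sample, fix wrong labels, retrain

    print(audit(list(range(200)), manual_score=lambda p: 1))  # -> 'extend'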
The device for realizing the classification processing of the multi-language mixed short text based on small sample learning is mainly characterized by comprising:
a processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions that, when executed by the processor, perform the steps of the above-described method for performing a multilingual hybrid short-text classification based on small-sample learning.
The processor for realizing the classification of the multi-language mixed short text based on the small sample learning is mainly characterized in that the processor is configured to execute computer executable instructions, and the computer executable instructions are executed by the processor to realize the steps of the method for realizing the classification of the multi-language mixed short text based on the small sample learning.
The computer-readable storage medium is mainly characterized by having a computer program stored thereon, wherein the computer program can be executed by a processor to implement the steps of the method for implementing the classification processing of the multi-language mixed short text based on the small sample learning.
By adopting the system, method, device, memory and computer-readable storage medium for realizing multi-language mixed short text classification processing based on small sample learning, and aiming at problems in the prior art such as the difficulty of representing out-of-vocabulary words, multi-language semantic deviation, and the time and labor cost of manual labeling, the invention provides a multi-language mixed short text classification model combining a convolutional neural network with an attention mechanism. The model can learn the features of multi-language mixed short texts from a small number of samples, and addresses the difficulty of representing out-of-vocabulary and rare words, the difficulty of capturing short text information, the semantic drift of multi-language features, the redundancy of model parameters and the like.

On one hand, starting from the internal structure and formation of words, the bottom-layer sub-word features are mapped to higher-layer features and shared, which alleviates the influence of out-of-vocabulary and rare words on the model and places multi-language word features in the same feature space.

On the other hand, the local convolutions of a deep convolutional neural network can effectively capture the associations between adjacent words, but its max pooling extracts only the maximum value of each feature region, so other useful information is ignored. The method uses multi-channel convolution kernels of unequal widths to capture information about different numbers of adjacent words, and at the same time adopts a coupling-coefficient calculation to extract the most significant information in the sentence without neglecting other related information.

New data are mined with a model trained on a small amount of labeled data, and the label data set and the trained model are updated through sampled auditing. The method thus uses small sample learning to complete the mining of potential information in large-scale data in a time-saving and labor-saving manner, effectively obtains word formation information and word cross-correlation information, and is highly innovative.
Drawings
FIG. 1 is a schematic diagram of a framework structure of a system for implementing a classification process of a multi-language mixed short text based on small sample learning according to the present invention.
FIG. 2 is a flow diagram of the method for realizing multi-language mixed short text classification processing based on small sample learning in an embodiment of the present invention.
Detailed Description
In order to more clearly describe the technical contents of the present invention, the following further description is given in conjunction with specific embodiments.
Before describing in detail embodiments that are in accordance with the present invention, it should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, the system for implementing a classification process of a multi-language mixed short text based on small sample learning includes:
the data acquisition module is used for inputting a small number of preset label samples into the system;
the data preprocessing module is connected with the data acquisition module and is used for carrying out data set division, data cleaning and batch processing on the preset label sample;
the model calculation processing module is connected with the data preprocessing module and used for extracting key features according to the text data acquired after preprocessing and generating a corresponding model accuracy calculation result; and
the model generation and output module is connected with the model calculation processing module and used for predicting the model prediction result of the current text data according to the model accuracy calculation result and further updating and iterating the output model through sampling and auditing of the model prediction result.
As a preferred embodiment of the present invention, the model calculation processing module specifically includes:
the word information processing unit is connected with the data preprocessing module and is used for performing n-gram segmentation, word embedding and iterative processing of word sets on the small number of labeled text data samples obtained after batch processing;
the text characteristic embedding unit is connected with the word information processing unit and is used for combining the word information subjected to the iterative processing into a text integral characteristic to be used as the input of an effective convolution layer;
the text key region characteristic unit is connected with the text characteristic embedding unit and is used for acquiring text key characteristic information in the text overall characteristic;
the text type judging unit is connected with the text key area characteristic unit and is used for analyzing and calculating the classification type of the current input text; and
the model accuracy calculation unit is connected with the text type judgment unit and is used for calculating the model accuracy of the text information obtained after the text processing.
As a preferred embodiment of the present invention, the model generation and output module specifically includes:
the model prediction processing unit is used for inputting multi-language mixed short text data and performing model prediction;
the prediction result output unit is connected with the model prediction processing unit and is used for outputting a model prediction result; and
the sampling and auditing unit is connected with the prediction result output unit and is used for sampling and auditing the model prediction result so as to detect the accuracy of the prediction model.
As a preferred embodiment of the present invention, the sampling auditing unit judges whether to perform update calibration against a system preset threshold value according to the following rules:
if the sampling accuracy of the audited text data is greater than the threshold value, new label data are added to the data acquisition module to perform iterative update processing of the model;
if the sampling accuracy of the audited text data is not greater than the threshold value, new label data are added to the data acquisition module after calibration processing to perform iterative update processing of the model.
The method for realizing the classification processing of the multi-language mixed short texts based on the small sample learning by using the system comprises the following steps:
(1) acquiring text sub-word information from a multi-language mixed short text;
(2) carrying out data set division, data cleaning and batch operation pretreatment on the text sub-word information;
(3) performing text characteristic embedding on the preprocessed text sub-word information to obtain input information of an effective convolution layer;
(4) acquiring adjacent word information and text key area information of the text sub-word information by adopting different kernel convolutions;
(5) judging the category of the text according to probability distribution;
(6) performing classification model prediction and mining new text data information according to the category information, and updating and iterating the model.
As a preferred embodiment of the present invention, the step (3) specifically comprises the following steps:
(3.1) searching for the word; if it is not found, searching for special sub-words before segmentation and entering step (3.2); if it is found, taking the word as a whole and entering step (3.3);
(3.2) if a special sub-word is found, segmenting according to the special sub-word and segmenting the remaining part according to the n-gram; otherwise segmenting directly according to the n-gram; the formed segments constitute the corresponding sub-word library; entering step (3.3);
(3.3) affine transforming the sub-word library formed after segmentation to the word-level representation, adding the newly represented word as a special sub-word to the sub-word set, and calculating the higher-layer sub-word representation.
As a preferred embodiment of the present invention, the sub-word characterization of the upper layer is calculated according to the following formula:
u_{i|g} = Σ_{g∈θ_w} W_{gi} · v_g
wherein g is a sub-word, i is the i-th word in the sentence, W_{gi} is the data transformation matrix, θ_w is the sub-word set, v_g represents the characterization of the sub-word g, and u_{i|g} (1 ≤ i ≤ n) is the higher-layer representation of the sub-word.
As a preferred embodiment of the present invention, the step (4) specifically comprises the following steps:
(4.1) combining the affine-transformed higher-layer sub-word representations u_{i|g} into the text overall feature U = (u_1, u_2, …, u_n) as the input of the effective convolution layer;
(4.2) performing one-dimensional convolution on the text features by adopting convolution kernels with different widths and different channel numbers to obtain global features containing different adjacent word information;
and (4.3) performing text global feature control by adopting a self-attention mechanism, thereby calculating and outputting text key region information features.
As a preferred embodiment of the present invention, the global features of the different neighboring word information in step (4.2) are calculated according to the following formula:
V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l));
where k is the kernel width, l is the convolution layer index, ReLU is the activation function, U_l represents the input feature data of the l-th kernel convolution, Conv_{1×k} represents a convolution operation with kernel width k, and V_{l+1} represents the global features after the l-th convolution.
As a preferred embodiment of the present invention, the step (4.3) specifically comprises the following steps:
(4.3.1) calculating the coupling coefficient c_{jm} according to the following formula as the salient weight of the text global features:
c_{jm} = exp(b_{jm}) / Σ_m exp(b_{jm});
wherein j is the j-th row of the convolved feature matrix, m is the m-th column of the convolved feature matrix, b_{jm} is a feature value inside the convolved text data, initialized to the pre-convolution feature value u′_{jm}, and c_{jm} is the attention value of the j-th row and m-th column of the convolved feature values;
(4.3.2) calculating the text key region information features v_m according to the global pooling formula, wherein u′_{m|j} is the feature of the j-th row before convolution:
v_m = Σ_j c_{jm} · u′_{m|j};
(4.3.3) iteratively updating the internal coefficients b_{jm} of the effective pooling according to the following formula:
b_{jm} = b_{jm} + v_m · u′_{m|j}.
As a preferred embodiment of the present invention, the step (5) specifically comprises the following steps:
(5.1) integrating the text key region features v^(1) and v^(2) output by the 2 kernel convolutions of different widths into a text information feature vector v through the splicing function f; the text information feature vector v is calculated according to the following formula:
v = f(v^(1), v^(2));
(5.2) inputting the text information feature vector v into a feed-forward neural network FFNN(·) to output text category features, and predicting the probability distribution ŷ of the multiple categories of the text with the softmax function; the probability distribution ŷ is calculated according to the following formula:
ŷ = softmax(FFNN(v)).
As a preferred embodiment of the present invention, the step (6) specifically comprises the following steps:
(6.1) carrying out category information mining on the unmarked multilingual mixed short text data by using the trained model and carrying out classification model prediction;
(6.2) detecting the accuracy rate of the output result of the model in a sampling auditing mode;
and (6.3) expanding the new category data as a label sample into the sample data set for updating and iterative processing of the model.
As a preferred embodiment of the present invention, the step (6.2) specifically comprises the following steps:
the accuracy of the model is checked using the following audit criteria:
(6.2.1) sampling 5% of the predicted data amount;
(6.2.2) manually judging each sampled prediction, with a score of 0/1;
(6.2.3) calculating the sampling accuracy rate according to the following formula:
P = (number of sampled predictions scored 1) / (number of sampled predictions);
(6.2.4) setting a check threshold value for comparison.
As a preferred embodiment of the present invention, the step (6.2.4) is specifically:
if the sampling accuracy is higher than the threshold value, the new category data are expanded as label samples into the sample data set; if the sampling accuracy is lower than the threshold value, the sampling proportion is enlarged, the erroneous labels are calibrated, and the model is further trained after they are expanded into the label sample data set.
The device for realizing the classification processing of the multi-language mixed short text based on small sample learning comprises:
a processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions that, when executed by the processor, perform the steps of the above-described method for performing a multilingual hybrid short-text classification based on small-sample learning.
The processor for implementing the classification processing of the multi-language hybrid short text based on the small sample learning is configured to execute computer executable instructions, and the computer executable instructions, when executed by the processor, implement the steps of the method for implementing the classification processing of the multi-language hybrid short text based on the small sample learning.
The computer readable storage medium has a computer program stored thereon, the computer program being executable by a processor to perform the steps of the above method for performing a multi-lingual hybrid short text classification process based on small sample learning.
In a specific embodiment of the invention, the technical scheme starts from the internal structure and formation of words to construct a sub-word embedding network, which alleviates the influence of out-of-vocabulary words on the model while constructing a multi-language mixed feature space, so that words with the same semantics in different languages are close in the feature space. To solve the generalization problem of global pooling in deep convolutional neural networks, coupling coefficients are adopted to effectively compute the relevant features of local text regions. Multi-channel convolution kernels of unequal widths are used to capture information about different numbers of adjacent words, so that the model does not ignore other related information when extracting the main information in the sentence. The method comprises the following steps:
step one, text sub-word information is obtained. Embedding of the text depends on the sub-word information of each word, segmenting the words by adopting n-element grammar to form a sub-word library, and then carrying out affine transformation to the representation of the word level.
And step two, embedding text features. And combining the affine transformed word characteristics into the text overall characteristics to be used as the input of the effective convolution layer.
And step three, acquiring the information of the text key area. And the text key region characteristics are obtained by calculating different wide convolution kernels and coupling coefficients. And performing one-dimensional convolution on the text features by adopting convolution kernels with different widths and different channel numbers to obtain global features containing different adjacent word information. The convolution window slides on the input feature array in sequence, the data in the window and the data in the convolution kernel are multiplied and summed according to elements to obtain elements of corresponding output positions, and therefore adjacent word information of different distances is captured. And performing text global feature control on different text feature data after convolution by adopting an attention mechanism, and calculating a coupling coefficient to serve as a salient weight of the global feature. Then, the global pooling calculation outputs text key region information characteristics. Finally, effective pooling is performed to capture key information in the text without losing other related information, and the internal coefficients of the effective pooling are iteratively updated.
And step four, judging the type of the text. And integrating the information features of the text key regions output by the cores with different widths into a text information feature vector through a splicing function. And then, outputting text category characteristics to the text information characteristics through a layer of feedforward neural network, predicting the probability distribution of multiple categories of the text by adopting softmax, and calculating the category to which the text belongs through the probability distribution.
Step five: predicting and mining new text data information. Category information mining is carried out on the unmarked multi-language mixed short text data with the trained model, and the accuracy of the model output is detected through sampled auditing. The auditing standard is: (1) sample 5% of the amount of predicted data; (2) judge each sampled prediction manually, with a score of 0/1; (3) calculate the sampling accuracy,
P = (number of sampled predictions scored 1) / (number of sampled predictions);
(4) set a threshold value: if the sampling accuracy is higher than the threshold value, the new category data are expanded into the sample data set as label samples; if it is lower than the threshold value, the sampling proportion is enlarged, the erroneous labels are calibrated, and the model is further trained after they are expanded into the label sample data set; (5) update the iterative model.
Referring to fig. 2, in an embodiment of the present invention, taking the classification of Chinese-English multi-language mixed short texts as an example, the method for classifying multi-language mixed short texts of the present invention includes the following steps:
1. Data preparation. Text data are read from the multi-language mixed short texts. For example, a Chinese-English mixed short text sentence is read, roughly "the psychology course set up by Ting_h is so cool", followed by an emoticon; such sentences contain special symbols such as emoticons.
2. Data preprocessing. Denoising irrelevant to the text classification information, such as removing punctuation marks and emoticons, is carried out on the text. Chinese is separated from the other languages: English is space-delimited, and the Chinese is segmented as a whole. The result after segmentation is: { 'Ting_h' 'sets up' 'psychological' 'course' 'so' 'cool' }.
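A rough sketch of this denoising and language separation follows; the regular expressions are illustrative, the example sentence is a reconstruction, and a real system would use a dedicated Chinese word segmenter rather than keeping the Chinese run whole:

    import re

    def preprocess(text):
        # denoising: drop punctuation, emoticons and other classification-irrelevant marks
        text = re.sub(r"[^0-9A-Za-z_\u4e00-\u9fff ]+", " ", text)
        tokens = []
        for chunk in text.split():  # English is space-delimited
            # separate runs of Chinese characters from runs of other characters
            tokens.extend(re.findall(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", chunk))
        return tokens

    print(preprocess("Ting_h开设的心理课程so cool :)"))
    # ['Ting_h', '开设的心理课程', 'so', 'cool']  -- a segmenter then splits the Chinese run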
3. Sub-word embedding to obtain the text sub-word information. The sub-word set is searched, and a word that is not found is segmented according to the n-gram. For example, the sub-word 'Ting_h' is segmented with a 3-gram. Before the 3-gram segmentation, the special characters '<' and '>' are added to the beginning and the end of the word to distinguish prefix and suffix sub-words. Before segmentation, special sub-words are searched for; if one is found, the word is segmented according to the special sub-word and the remaining part is segmented according to the n-gram; otherwise the word is segmented directly according to the n-gram. After segmentation, 6 sub-words are formed: { '<Ti' 'Tin' 'ing' 'ng_' 'g_h' '_h>' }. The newly characterized word Ting_h is then added as a special sub-word to the sub-word set θ_w, and the higher-layer sub-word representation is calculated according to formula (1), wherein v_g represents the characterization of the sub-word g and u_{i|g} (1 ≤ i ≤ n) is the higher-layer representation of the sub-word:

u_{i|g} = Σ_{g∈θ_w} W_{gi} · v_g    (1)
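Continuing the example, a small numerical illustration of formula (1) as reconstructed above; every value here is a random placeholder, and the summation form of the formula is itself an assumption:

    import numpy as np

    subs = ['<Ti', 'Tin', 'ing', 'ng_', 'g_h', '_h>']  # the 6 sub-words formed above
    rng = np.random.default_rng(1)
    v_g = {g: rng.standard_normal(5) for g in subs}    # sub-word characterizations v_g
    W = {g: rng.standard_normal((5, 5)) for g in subs} # transformation matrices W_gi
    u = sum(W[g] @ v_g[g] for g in subs)               # higher-layer u_{i|g} for 'Ting_h'
    print(u.shape)                                     # (5,) -- one word-level vector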
4. Different kernel convolutions to obtain adjacent word information. The affine-transformed word characterizations u_{i|g} are combined into the text overall feature U = (u_1, u_2, …, u_n) as the input of the effective convolution layer. Convolution kernels of different widths can be used to obtain the correlation of different numbers of adjacent words. One-dimensional convolutions with kernels of different widths k are applied to the text features according to formula (2) to obtain global features containing different adjacent word information:

V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l))    (2)

wherein U_l represents the input feature data of the l-th kernel convolution, Conv_{1×k} represents a convolution operation with kernel width k, and V_{l+1} represents the global features after the l-th convolution. Assume the sentence length is 10 and the feature dimension is 5, so the input dimension is (5 × 10). The number of channels is set to 4, and the convolution kernel widths are 2 and 4 respectively. Two text region features of different widths, V^(1) and V^(2), are thus each obtained by one convolution.
5. Effective pooling to obtain the text key region information. An attention mechanism performs text global feature control, and the coupling coefficients c^(1)_{jm} and c^(2)_{jm} are calculated according to formula (3) as the salient weights of the global features:

c_{jm} = exp(b_{jm}) / Σ_m exp(b_{jm})    (3)

wherein b_{jm} is a feature value inside the convolved data, initialized to u′_{jm}. The text key region information features v^(1)_m and v^(2)_m are output according to the global pooling formula (4):

v_m = Σ_j c_{jm} · u′_{m|j}    (4)

The internal coefficients b_{jm} of the effective pooling are iteratively updated according to formula (5):

b_{jm} = b_{jm} + v_m · u′_{m|j}    (5)
6. Classification calculation. The text key region features v^(1) and v^(2) output by the 2 kernel convolutions of different widths are integrated into the final text information feature vector v through the splicing function f:

v = f(v^(1), v^(2))

The text information features are then passed through a one-layer feed-forward neural network FFNN(·) to output text category features, and softmax predicts the probability distribution ŷ over the multiple categories of the text:

ŷ = softmax(FFNN(v))
7. Model prediction. The model here is a 4-class model, with the labels sports, education, entertainment and music. The probability distribution ŷ calculated and output is {0.01, 0.91, 0.067, 0.013}; the value for the label corresponding to the education category is the largest, i.e. the sentence category result output by the model is education.
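The final label choice amounts to an argmax over this distribution; as a quick check:

    labels = ["sports", "education", "entertainment", "music"]
    probs = [0.01, 0.91, 0.067, 0.013]
    print(labels[probs.index(max(probs))])  # -> education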
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by suitable instruction execution devices.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, and the program may be stored in a computer readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of terms "an embodiment," "some embodiments," "an example," "a specific example," or "an embodiment," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
In this specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (17)

1. A system for implementing multi-language hybrid short text classification processing based on small sample learning, the system comprising:
the data acquisition module is used for inputting a small number of preset label samples into the system;
the data preprocessing module is connected with the data acquisition module and is used for carrying out data set division, data cleaning and batch processing on the preset label sample;
the model calculation processing module is connected with the data preprocessing module and used for extracting key features according to the text data acquired after preprocessing and generating a corresponding model accuracy calculation result; and
the model generation and output module is connected with the model calculation processing module and used for predicting the model prediction result of the current text data according to the model accuracy calculation result and further updating and iterating the output model through sampling and auditing of the model prediction result.
2. The system for implementing classification of multi-lingual hybrid short text based on small sample learning as claimed in claim 1, wherein the model calculation processing module specifically comprises:
the word information processing unit is connected with the data preprocessing module and is used for performing n-gram segmentation, word embedding and iterative processing of word sets on the small number of labeled text data samples obtained after batch processing;
the text characteristic embedding unit is connected with the word information processing unit and is used for combining the word information subjected to the iterative processing into a text integral characteristic to be used as the input of an effective convolution layer;
the text key region characteristic unit is connected with the text characteristic embedding unit and is used for acquiring text key characteristic information in the text overall characteristic;
the text type judging unit is connected with the text key area characteristic unit and is used for analyzing and calculating the classification type of the current input text; and
the model accuracy calculation unit is connected with the text type judgment unit and is used for calculating the model accuracy of the text information obtained after the text processing.
3. The system for implementing multi-lingual hybrid short text classification processing based on small sample learning of claim 2, wherein the model generation and output module specifically comprises:
the model prediction processing unit is used for inputting multi-language mixed short text data and performing model prediction;
the prediction result output unit is connected with the model prediction processing unit and is used for outputting a model prediction result; and
the sampling and auditing unit is connected with the prediction result output unit and is used for sampling and auditing the model prediction result so as to detect the accuracy of the prediction model.
4. The system for implementing classification processing of mixed short text with multiple languages based on small sample learning as claimed in claim 3, wherein the sampling auditing unit determines whether to update and calibrate according to the following rules through a system preset threshold value:
if the sampling accuracy of the audited text data is greater than the threshold value, new label data are added to the data acquisition module to perform iterative update processing of the model;
if the sampling accuracy of the audited text data is not greater than the threshold value, new label data are added to the data acquisition module after calibration processing to perform iterative update processing of the model.
5. A method for implementing multi-language hybrid short text classification processing based on small sample learning by using the system of claim 4, wherein the method comprises the following steps:
(1) acquiring text sub-word information from a multi-language mixed short text;
(2) carrying out data set division, data cleaning and batch operation pretreatment on the text sub-word information;
(3) performing text characteristic embedding on the preprocessed text sub-word information to obtain input information of an effective convolution layer;
(4) acquiring adjacent word information and text key area information of the text sub-word information by adopting different kernel convolutions;
(5) judging the category of the text according to probability distribution;
(6) performing classification model prediction and mining new text data information according to the category information, and updating and iterating the model.
6. The method for implementing a classification process of a multi-lingual mixed short text based on small sample learning as claimed in claim 5, wherein the step (3) comprises the following steps:
(3.1) searching for the word; if it is not found, searching for special sub-words before segmentation and entering step (3.2); if it is found, taking the word as a whole and entering step (3.3);
(3.2) if a special sub-word is found, segmenting according to the special sub-word and segmenting the remaining part according to the n-gram; otherwise segmenting directly according to the n-gram; the formed segments constitute the corresponding sub-word library; entering step (3.3);
(3.3) affine transforming the sub-word library formed after segmentation to the word-level representation, adding the newly represented word as a special sub-word to the sub-word set, and calculating the higher-layer sub-word representation.
7. The method for implementing multi-language mixed short text classification processing based on small sample learning as claimed in claim 6, wherein the higher-layer sub-word characterization is calculated according to the following formula:

u_{i|g} = W_{gi}·û_g, 1 ≤ i ≤ n;

where g is a sub-word, i indexes the i-th word in the sentence, W_{gi} is the data transformation matrix, θ_w is the set of words and phrases, û_g denotes the characterization of sub-word g, and u_{i|g} (1 ≤ i ≤ n) is the higher-layer representation of the sub-word.
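Illustrative note (not part of the claims): a minimal numpy sketch of the affine lift u_{i|g} = W_{gi}·û_g. The dimensions and the final mean-pooling into one word vector are assumptions; the claim only fixes the affine form.

```python
import numpy as np

rng = np.random.default_rng(0)
d_sub, d_word = 64, 128
W = rng.normal(0.0, 0.02, size=(d_word, d_sub))  # data transformation matrix W_gi

subword_reps = rng.normal(size=(5, d_sub))       # û_g for the 5 sub-words of word i
u_i = np.stack([W @ u for u in subword_reps])    # higher-layer representations u_{i|g}
word_rep = u_i.mean(axis=0)                      # assumed pooling into one word-level vector
print(word_rep.shape)                            # (128,)
```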
8. The method for implementing multi-language mixed short text classification processing based on small sample learning according to claim 7, wherein the step (4) comprises the following steps:
(4.1) combining the affine-transformed higher-layer sub-word representations u_{i|g} into the overall text feature U = [u_{1|g}, u_{2|g}, …, u_{n|g}], which serves as the input of the active convolutional layer;
(4.2) performing one-dimensional convolution on the text features with convolution kernels of different widths and different channel numbers to obtain global features containing different adjacent-word information; and
(4.3) controlling the text global features with a self-attention mechanism, thereby calculating and outputting the text key-region information features.
9. The method for implementing multi-language mixed short text classification processing based on small sample learning as claimed in claim 8, wherein the global features containing different adjacent-word information in step (4.2) are calculated according to the following formula:

V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l));

where k is the kernel width, l is the index of the convolution layer, ReLU is the activation function, U_l denotes the input feature data of the l-th layer kernel convolution, Conv_{1×k} denotes a convolution operation with kernel width k, and V_{l+1} denotes the global features after the l-th layer convolution.
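Illustrative note (not part of the claims): a PyTorch sketch of V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l)) with two assumed kernel widths k = 3 and k = 5; the sequence length, model dimension and channel count are illustrative.

```python
import torch
import torch.nn as nn

n_tokens, d_model = 32, 128
U = torch.randn(1, d_model, n_tokens)      # (batch, channels, length): overall text feature

convs = nn.ModuleList(
    nn.Conv1d(d_model, 64, kernel_size=k, padding=k // 2) for k in (3, 5)
)
# One global-feature map per kernel width, each carrying different
# adjacent-word information.
global_feats = [torch.relu(conv(U)) for conv in convs]
for V in global_feats:
    print(V.shape)                         # torch.Size([1, 64, 32])
```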
10. The method for implementing multi-language mixed short text classification processing based on small sample learning as claimed in claim 9, wherein the step (4.3) comprises the following steps:
(4.3.1) calculating the coupling coefficient c_{jm} according to the following formula as the saliency weight of the text global features:

c_{jm} = exp(b_{jm}) / ∑_{j′} exp(b_{j′m});

where j is the j-th row and m the m-th column of the convolved feature matrix, b_{jm} is the internal coupling coefficient over the convolved feature values of the text data, u′_{jm} is the feature value before convolution at initialization, and c_{jm} is the attention value at row j, column m of the convolved feature values;
(4.3.2) calculating the text key-region information features v_m according to the following global pooling formula, where u′_{m|j} is the pre-convolution feature of the j-th row:

v_m = ∑_j c_{jm}·u′_{m|j};

(4.3.3) iteratively updating the internal pooling coefficients b_{jm} according to the following formula:

b_{jm} = b_{jm} + v_m·u′_{m|j}.
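Illustrative note (not part of the claims): a numpy sketch of steps (4.3.1) to (4.3.3). The softmax form of c_{jm} and the use of three routing iterations are assumptions consistent with the iterative update rule for b_{jm}.

```python
import numpy as np

def key_region_pooling(u_prime: np.ndarray, n_iter: int = 3) -> np.ndarray:
    """u_prime: (J, M) matrix of pre-convolution feature values u'_{m|j}."""
    J, M = u_prime.shape
    b = np.zeros((J, M))                                      # internal coefficients b_{jm}
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=0, keepdims=True)  # attention weights c_{jm}
        v = (c * u_prime).sum(axis=0)                         # v_m = sum_j c_{jm} * u'_{m|j}
        b = b + v[None, :] * u_prime                          # b_{jm} += v_m * u'_{m|j}
    return v                                                  # text key-region features

print(key_region_pooling(np.random.default_rng(1).random((8, 16))).shape)  # (16,)
```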
11. The method for implementing multi-language mixed short text classification processing based on small sample learning as claimed in claim 10, wherein the step (5) comprises the following steps:
(5.1) integrating the text key-region features v^{(1)} and v^{(2)}, output by the two kernel convolutions of different widths, into the text information feature vector v through the splicing function f, calculated according to the following formula:

v = f(v^{(1)}, v^{(2)});

(5.2) inputting the text information feature vector v into a feed-forward neural network FFNN(·) to output the text category features, and predicting the multi-category probability distribution ŷ of the text with a softmax function, calculated according to the following formula:

ŷ = softmax(FFNN(v)).
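Illustrative note (not part of the claims): a PyTorch sketch of step (5), with concatenation as the splicing function f, a small feed-forward network as FFNN(·), and a softmax over an assumed five categories; all sizes are illustrative.

```python
import torch
import torch.nn as nn

v1, v2 = torch.randn(64), torch.randn(64)   # key-region features from the two kernel widths
v = torch.cat([v1, v2])                     # splicing function f: v = f(v1, v2)

ffnn = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 5))
probs = torch.softmax(ffnn(v), dim=-1)      # multi-category probability distribution ŷ
print(probs.sum().item())                   # ~1.0: the class probabilities sum to one
```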
12. The method for implementing multi-language mixed short text classification processing based on small sample learning as claimed in claim 11, wherein the step (6) comprises the following steps:
(6.1) mining category information from unlabeled multi-language mixed short text data with the trained model and performing classification model prediction;
(6.2) checking the accuracy of the model output results by sampling and auditing; and
(6.3) expanding the new category data into the sample data set as labeled samples for updating and iterating the model.
13. The method for implementing multi-language mixed short text classification processing based on small sample learning according to claim 12, wherein the step (6.2) checks the accuracy of the model using the following auditing criteria:
(6.2.1) sampling 5% of the predicted data;
(6.2.2) judging each sampled prediction manually, with a score of 0/1;
(6.2.3) calculating the sampling accuracy according to the following formula:

accuracy = (1/N)·∑_{i=1}^{N} s_i;

where s_i is the manual 0/1 score of the i-th sampled prediction and N is the number of sampled predictions;
(6.2.4) setting a check threshold for comparison.
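Illustrative note (not part of the claims): a Python sketch of the audit in steps (6.2.1) to (6.2.3): sample 5% of the predictions, score each sampled item 0/1 by hand, and average the scores; names are hypothetical.

```python
import random

def sample_for_audit(predictions: list, rate: float = 0.05) -> list:
    """Step (6.2.1): draw a 5% random sample of the predicted data."""
    k = max(1, int(len(predictions) * rate))
    return random.sample(predictions, k)

def sampling_accuracy(manual_scores: list[int]) -> float:
    """Step (6.2.3): mean of the manual 0/1 judgments from step (6.2.2)."""
    return sum(manual_scores) / len(manual_scores)

print(sampling_accuracy([1, 1, 0, 1]))  # 0.75
```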
14. The method for implementing multi-language mixed short text classification processing based on small sample learning as claimed in claim 13, wherein the step (6.2.4) is specifically as follows:
if the sampling accuracy is higher than the threshold, expanding the new category data into the sample data set as labeled samples; if the sampling accuracy is lower than the threshold, amplifying the sampling proportion, calibrating the erroneous labels, and further training the model after the calibrated data are expanded into the labeled sample data set.
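Illustrative note (not part of the claims): a sketch of the fallback in claim 14, where a failed audit amplifies the sampling proportion before the erroneous labels are calibrated; the doubling factor is an assumption.

```python
def next_audit_rate(accuracy: float, threshold: float, rate: float) -> float:
    """Amplify the sampling proportion when the sampled accuracy fails the check."""
    if accuracy > threshold:
        return rate              # audit passed: keep the default proportion
    return min(1.0, rate * 2.0)  # audit failed: sample twice as much for calibration

print(next_audit_rate(0.80, 0.90, 0.05))  # 0.1: audit twice as much data
print(next_audit_rate(0.95, 0.90, 0.05))  # 0.05: keep the 5% audit
```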
15. An apparatus for implementing multi-language mixed short text classification processing based on small sample learning, wherein the apparatus comprises:
a processor configured to execute computer-executable instructions; and
a memory storing one or more computer-executable instructions that, when executed by the processor, implement the steps of the method for implementing multi-language mixed short text classification processing based on small sample learning of any one of claims 5 to 14.
16. A processor for implementing multi-language mixed short text classification processing based on small sample learning, wherein the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the method for implementing multi-language mixed short text classification processing based on small sample learning according to any one of claims 5 to 14.
17. A computer-readable storage medium having stored thereon a computer program executable by a processor to implement the steps of the method for implementing multi-language mixed short text classification processing based on small sample learning according to any one of claims 5 to 14.
Priority Applications (1)

Application Number: CN202110886442.0A
Priority Date: 2021-08-03; Filing Date: 2021-08-03
Title: System, method and device for realizing multi-language mixed short text classification processing based on small sample learning, memory and storage medium thereof
Legal Status: Pending

Publications (1)

Publication Number: CN113535961A
Publication Date: 2021-10-22

Family Applications (1)

Family ID: 78090291
CN202110886442.0A (the present application), filed 2021-08-03, status: Pending

Country Status (1)

CN: CN113535961A (en)

Citations (3)

* Cited by examiner, † Cited by third party

US20190034823A1 *, priority 2017-07-27, published 2019-01-31, Getgo, Inc.: Real time learning of text classification models for fast and efficient labeling of training data and customization
CN110134786A *, priority 2019-05-14, published 2019-08-16, 南京大学: A short text classification method based on topic word vectors and convolutional neural networks
CN111428026A *, priority 2020-02-20, published 2020-07-17, 西安电子科技大学: Multi-label text classification processing method and system and information data processing terminal


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination