CN113535961B - System, method, device, memory and storage medium for realizing multilingual mixed short text classification processing based on small sample learning - Google Patents

System, method, device, memory and storage medium for realizing multilingual mixed short text classification processing based on small sample learning

Info

Publication number
CN113535961B
CN113535961B (application number CN202110886442.0A; published as CN113535961A)
Authority
CN
China
Prior art keywords
text
model
data
information
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110886442.0A
Other languages
Chinese (zh)
Other versions
CN113535961A (en)
Inventor
王永剑
孙亚茹
杨莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute of the Ministry of Public Security
Original Assignee
Third Research Institute of the Ministry of Public Security
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute of the Ministry of Public Security filed Critical Third Research Institute of the Ministry of Public Security
Priority to CN202110886442.0A
Publication of CN113535961A
Application granted
Publication of CN113535961B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system for realizing multilingual mixed short text classification processing based on small sample learning. The system comprises a data acquisition module for inputting a small number of preset label samples into the system; a data preprocessing module for preprocessing the preset label samples; a model calculation processing module for extracting key features and generating a corresponding model accuracy calculation result; and a model generation and output module for predicting a model prediction result for the current text data and further updating and iterating the output model through sampling and auditing of the prediction result. The invention also relates to a corresponding method, device, processor and storage medium. With the system, method, device, processor and storage medium, the mining of latent information in large-scale data is completed with time- and labor-saving small sample learning, and word-formation information and word cross-correlation information are effectively obtained, which is highly innovative.

Description

System, method, device, memory and storage medium for realizing multilingual mixed short text classification processing based on small sample learning
Technical Field
The invention relates to the technical field of deep learning, in particular to the technical field of natural language processing, and specifically relates to a system, a method, a device, a memory and a computer readable storage medium for realizing multi-language mixed short text classification processing based on small sample learning.
Background
Text classification is the task of assigning labels to text and is one of the important, fundamental tasks in natural language processing; it supports many downstream tasks such as sentiment classification and topic extraction. Mining the valuable information of text publishing platforms is inseparable from key text classification technology. Such text is mostly short text, characterized by short sentences, multiple languages, diverse content, informal wording, grammatical errors, popular words and slang, so an effective text classification technique is needed to solve the problem of classifying multilingual mixed short text.
Traditional text classification algorithms focus largely on linear expressions of text, such as support vector machine models that take lexicons or n-gram word vectors as inputs. Research in recent years has shown that nonlinear models can effectively capture textual context information and produce more accurate predictions than linear models. The convolutional neural network is a typical nonlinear model that converts local features of the data into low-dimensional vectors while retaining task-relevant information. This efficient mapping performs better on short text than sequence models.
The convolutional neural network uses max pooling to obtain the feature information of a data region, keeping only the feature with the largest value in the region during computation. As the number of convolution layers increases, the position information related to the target is gradually lost. A text region may express more complex concepts, and this learning approach, which relies solely on the region maximum to extract the most prominent feature information, ignores other useful information. In addition, the coupled connections between network layers may increase the redundancy of the model.
Besides the performance of the model, the quality of the data features has a large impact on downstream tasks. Faced with multilingual mixed short text, existing models such as Multi-lingual BERT and LASER cannot adequately characterize the features of different languages in the same feature space, so multiple languages cannot be represented and computed in the same feature space and semantic drift occurs.
The attention mechanism is an effective method for focusing on the key information in the model's input data. An attention model not only attends to feature information during training but also adapts the neural network's parameters to different features, so that more hidden feature information can be mined.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a system, a method, a device, a memory and a computer-readable storage medium that can effectively acquire word-formation information and word cross-correlation information and realize multilingual mixed short text classification processing based on small sample learning.
To achieve the above object, a system, a method, an apparatus, a memory, and a computer-readable storage medium for realizing multilingual mixed short text classification processing based on small sample learning according to the present invention are as follows:
The system for realizing multilingual mixed short text classification processing based on small sample learning is mainly characterized by comprising the following components:
The data acquisition module is used for inputting a small amount of preset label samples into the system;
the data preprocessing module is connected with the data acquisition module and is used for carrying out data set division, data cleaning and batch processing operation on the preset label sample;
The model calculation processing module is connected with the data preprocessing module and is used for extracting key features according to the text data obtained after preprocessing and generating a corresponding model accuracy calculation result; and
The model generation and output module is connected with the model calculation processing module and is used for predicting a model prediction result of the current text data according to the model accuracy calculation result and further updating and iterating the output model through sampling and auditing processing of the model prediction result.
Preferably, the model calculation processing module specifically includes:
The word information processing unit is connected with the data preprocessing module and is used for performing n-gram lexical segmentation, word embedding and iterative word-set processing on the small number of labeled text data samples obtained after batch processing;
the text feature embedding unit is connected with the word information processing unit and is used for combining the word information subjected to the iterative processing into text integral features serving as the input of the effective convolution layer;
The text key region feature unit is connected with the text feature embedding unit and used for acquiring text key feature information in the text integral feature;
The text category judging unit is connected with the text key region characteristic unit and is used for analyzing and calculating the category to which the current input text belongs; and
The model accuracy calculating unit is connected with the text category judging unit and is used for calculating the model accuracy of the text information obtained after the text processing.
Preferably, the model generating and outputting module specifically includes:
The model prediction processing unit is used for inputting multi-language mixed short text data and carrying out model prediction;
The prediction result output unit is connected with the model prediction processing unit and is used for outputting a model prediction result; and
And the sampling auditing unit is connected with the prediction result output unit and is used for sampling and auditing the model prediction result so as to detect the accuracy of the prediction model.
Preferably, the sampling auditing unit judges whether to update and calibrate according to the following rules through a preset threshold value of the system:
if the accuracy of the text data sampled and audited by the sampling auditing unit is greater than the threshold value, new label data are added to the data acquisition module for iterative updating of the model; otherwise,
if the accuracy of the sampled and audited text data is not greater than the threshold value, the new label data are first calibrated and then added to the data acquisition module for iterative updating of the model.
The method for realizing multilingual mixed short text classification processing based on small sample learning by using the system is mainly characterized by comprising the following steps:
(1) Acquiring text subword information from the multilingual mixed short text;
(2) Performing data set division, data cleaning and batch-operation preprocessing on the text subword information;
(3) Performing text feature embedding on the preprocessed text subword information to obtain the input information of the effective convolution layer;
(4) Convolving with different kernels to obtain the neighboring-word information and text key region information of the text subword information;
(5) Judging the category to which the text belongs through the probability distribution;
(6) Predicting with the classification model according to the category information, mining new text data information, and updating and iterating the model.
Preferably, the step (3) specifically includes the following steps:
(3.1) Searching for the word; if it is not found, splitting it by n-gram to form a subword library, searching for special subwords before splitting, and entering step (3.3); otherwise, going to step (3.2);
(3.2) If the word is found, segmenting it by its special subwords and segmenting the remainder by n-gram; otherwise, segmenting directly by n-gram to form the corresponding subword library, and entering step (3.3);
(3.3) Affine-transforming the subword library formed after segmentation into word-level representations, adding the newly represented words to the subword set as special subwords, and calculating the higher-level subword representation.
Preferably, the higher-level subword representation is calculated according to the following formula:
u_{i|g} = W_{g_i} · v̂_g, g ∈ θ_w, 1 ≤ i ≤ n;
where g is a subword, i is the i-th word in the sentence, W_{g_i} is a data conversion matrix, θ_w is the word set, v̂_g denotes the representation of subword g, and u_{i|g} (1 ≤ i ≤ n) is the higher-level subword representation.
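To make steps (3.1) to (3.3) concrete, the following Python sketch segments a word into 3-gram subwords, prefers special subwords when available, and affine-transforms the pieces into a word-level representation. It is a minimal illustration, not the patented implementation: the class and function names, the embedding dimension and the random initialization are assumptions, and the conversion matrix W_{g_i} is realized as a single shared linear map.

```python
import numpy as np

def char_ngrams(word, n=3):
    """Split a word into n-gram subwords, marking prefix/suffix with '<' and '>'."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def segment_word(word, special_subwords, n=3):
    """Steps (3.1)-(3.2): segment by known special subwords first, rest by n-gram."""
    pieces, rest = [], word
    for sw in special_subwords:
        if sw in rest:
            pieces.append(sw)
            rest = rest.replace(sw, "", 1)
    if rest:
        pieces.extend(char_ngrams(rest, n))
    return pieces

class SubwordEmbedder:
    """Step (3.3): affine-transform subword vectors into a word-level representation."""
    def __init__(self, dim=5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dim = dim
        self.table = {}                                       # subword set theta_w
        self.W = self.rng.normal(scale=0.1, size=(dim, dim))  # conversion matrix

    def vec(self, g):
        if g not in self.table:                               # lazily initialize v_g
            self.table[g] = self.rng.normal(scale=0.1, size=self.dim)
        return self.table[g]

    def word_repr(self, word, special_subwords):
        subs = segment_word(word, special_subwords)
        u = sum(self.W @ self.vec(g) for g in subs)           # combine u_{i|g} = W v_g
        self.table[word] = u                                  # new word becomes a special subword
        return u

emb = SubwordEmbedder()
print(segment_word("Ting_h", set()))          # ['<Ti', 'Tin', 'ing', 'ng_', 'g_h', '_h>']
print(emb.word_repr("Ting_h", set()).shape)   # (5,)
```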
Preferably, the step (4) specifically includes the following steps:
(4.1) Combining the affine-transformed higher-level subword representations u_{i|g} into a text overall feature U as the input of the effective convolution layer;
(4.2) Performing one-dimensional convolution on the text features with convolution kernels of different widths and different channel numbers to obtain global features containing different neighboring-word information;
(4.3) Performing text global feature control with a self-attention mechanism, thereby calculating and outputting the text key region information features.
Preferably, the global features of the different neighboring-word information described in step (4.2) are calculated according to the following formula:
V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l));
where k is the kernel width, l indexes the kernel-convolution layer, ReLU is the activation function, U_l represents the input feature data of the l-th kernel-convolution layer, Conv_{1×k} represents the convolution operation with a kernel width of k, and V_{l+1} represents the global feature after convolving the l-th layer.
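A minimal sketch of formula V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l)) in PyTorch follows. The feature dimension 5, sentence length 10, 4 channels and kernel widths 2 and 4 are taken from the worked example later in the description; the variable names and batch size are assumptions.

```python
import torch
import torch.nn as nn

feat_dim, seq_len, channels = 5, 10, 4        # dimensions from the worked example
U = torch.randn(1, feat_dim, seq_len)         # text overall feature: (batch, dim, len)

# one 1-D convolution per kernel width k: V_{l+1}(U_l) = ReLU(Conv_{1xk}(U_l))
convs = nn.ModuleList(nn.Conv1d(feat_dim, channels, kernel_size=k) for k in (2, 4))
for k, conv in zip((2, 4), convs):
    V = torch.relu(conv(U))                   # global feature for this width
    print(f"width {k}: {tuple(V.shape)}")     # (1, 4, seq_len - k + 1)
```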
Preferably, the step (4.3) specifically includes the following steps:
(4.3.1) Calculating the coupling coefficient c_jm according to the following formula and using it as the saliency weight of the text global feature:
c_jm = exp(b_jm) / Σ_{m'} exp(b_{jm'});
where j is the j-th row of the convolved feature matrix, m is the m-th column of the convolved feature matrix, b_jm is a feature value inside the convolved text data, u′_jm is the pre-convolution feature value used to initialize b_jm, and c_jm is the attention value at row j, column m of the convolved features;
(4.3.2) Calculating the text key region information feature v_m according to the global pooling formula, where u′_{m|j} is the pre-convolution feature of the j-th row, specifically:
v_m = Σ_j c_jm · u′_{m|j};
(4.3.3) Iteratively updating the effective-pooling internal coefficients b_jm according to the following formula:
b_jm = b_jm + v_m · u′_{m|j}.
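Steps (4.3.1) to (4.3.3) read like one routing-style iteration between coupling coefficients and a pooled feature. The sketch below implements that reading with NumPy; the softmax form of c_jm, the iteration count and the array names are assumptions.

```python
import numpy as np

def effective_pooling(u_prime, iterations=3):
    """Coupling-coefficient pooling over a convolved feature matrix.

    u_prime: (J, M) array of pre-convolution feature values u'_{m|j}.
    Returns v: (M,) text key-region information features.
    """
    b = u_prime.copy()                            # b_jm initialized to u'_jm
    for _ in range(iterations):
        # (4.3.1) coupling coefficients: softmax of b_jm over the columns m
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # (4.3.2) global pooling: v_m = sum_j c_jm * u'_{m|j}
        v = (c * u_prime).sum(axis=0)
        # (4.3.3) update: b_jm <- b_jm + v_m * u'_{m|j}
        b = b + v[None, :] * u_prime
    return v

features = np.random.default_rng(0).normal(size=(4, 9))  # e.g. 4 channels x 9 positions
print(effective_pooling(features).shape)                  # (9,)
```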
Preferably, the step (5) specifically includes the following steps:
(5.1) Integrating the text key region features v^(1) and v^(2) output by the kernel convolutions of 2 different widths into a text information feature vector v by means of a splicing function f; the text information feature vector v is calculated according to the following formula:
v = f(v^(1), v^(2));
(5.2) Inputting the text information feature vector v into a feed-forward neural network FFNN(·) to output the text category features, and predicting the multi-class probability distribution ŷ of the text with a softmax function; the probability distribution ŷ is calculated according to the following formula:
ŷ = softmax(FFNN(v)).
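A sketch of steps (5.1) and (5.2), assuming the two pooled feature vectors from above and a single linear layer as the feed-forward network; the sizes and the four-class head are illustrative.

```python
import torch
import torch.nn as nn

v1 = torch.randn(9)                       # key-region features from width-2 kernels
v2 = torch.randn(7)                       # key-region features from width-4 kernels
v = torch.cat([v1, v2])                   # (5.1) splicing function f

num_classes = 4                           # e.g. sports/education/entertainment/music
ffnn = nn.Linear(v.numel(), num_classes)  # (5.2) one-layer feed-forward network
y_hat = torch.softmax(ffnn(v), dim=-1)    # multi-class probability distribution
print(y_hat.detach(), y_hat.argmax().item())
```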
Preferably, the step (6) specifically includes the following steps:
(6.1) carrying out category information mining on the untagged multi-language mixed short text data by using a trained model and carrying out classification model prediction;
(6.2) detecting the accuracy rate of the model output result in a sampling auditing mode;
(6.3) expanding the new category data as a label sample into a sample data set for updating and iterative processing of the model.
Preferably, the step (6.2) specifically includes the following steps:
The accuracy of the model is checked by adopting the following auditing standard:
(6.2.1) sampling 5% of the predicted data amount;
(6.2.2) manually judging, wherein the scoring value is 0/1;
(6.2.3) calculating the sampling accuracy P according to the following formula:
P = (Σ_{i=1}^{N} score_i) / N, where N is the number of sampled items and score_i is the 0/1 manual judgment of the i-th item;
(6.2.4) setting a checking threshold value for comparison.
Preferably, the step (6.2.4) is specifically:
If the sampling accuracy is higher than the threshold value, the new category data are extended into the sample data set as labeled samples; if it is lower than the threshold value, the sampling ratio is enlarged, the erroneous labels are calibrated, and the model is further trained after they are extended into the labeled sample data set.
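The sampling audit of step (6.2) can be sketched as below; the 5% sampling ratio and 0/1 scoring follow the text, while the threshold value, the data layout and the stand-in for manual judgment are assumptions.

```python
import random

def sampling_audit(predictions, manual_score, sample_ratio=0.05, threshold=0.9):
    """Sample predictions, score them 0/1, and compare accuracy to the threshold."""
    n = max(1, int(len(predictions) * sample_ratio))   # 5% of the predicted data
    sample = random.sample(predictions, n)
    accuracy = sum(manual_score(p) for p in sample) / n
    return accuracy, accuracy > threshold

preds = [{"text": f"t{i}", "label": "education"} for i in range(200)]
acc, ok = sampling_audit(preds, manual_score=lambda p: 1)  # stand-in: always correct
if ok:
    print(f"accuracy {acc:.2f}: extend the new labels into the sample data set")
else:
    print(f"accuracy {acc:.2f}: enlarge the sample, calibrate wrong labels, retrain")
```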
The device for realizing the multi-language mixed short text classification processing based on the small sample learning is mainly characterized by comprising the following components:
A processor configured to execute computer-executable instructions;
A memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the method for performing multilingual mixed short text classification based on small sample learning described above.
The processor for realizing the multi-language mixed short text classification processing based on the small sample learning is mainly characterized in that the processor is configured to execute computer executable instructions, and when the computer executable instructions are executed by the processor, the steps of the method for realizing the multi-language mixed short text classification processing based on the small sample learning are realized.
The computer readable storage medium is characterized in that the computer program is stored thereon, and the computer program can be executed by a processor to implement the steps of the method for implementing multilingual mixed short text classification processing based on small sample learning.
With the system, method, device, memory and computer-readable storage medium for realizing multilingual mixed short text classification processing based on small sample learning, a multilingual mixed short text classification model combining a convolutional neural network with an attention mechanism is provided to solve problems in the prior art such as the difficulty of representing out-of-vocabulary and rare words, the difficulty of capturing short text information, semantic drift of multilingual features, and model parameter redundancy. On the one hand, starting from the internal structure and formation of words, bottom-layer subword features are mapped to higher-layer representations, and the subword features are shared to mitigate the influence of out-of-vocabulary and rare words on the model and the problem of representing multilingual word features in the same space. On the other hand, deep convolutional neural networks that attend to local convolution can effectively capture the associations between neighboring words, but their max pooling relies on the maximum of the feature region to extract the most significant information, ignoring other useful information. The invention captures different numbers of neighboring words with multi-channel convolution kernels of unequal widths, and uses a coupling-coefficient calculation to extract the most significant information in a sentence without neglecting other related information. New data are mined with a model learned from a small amount of labeled data, and the labeled data set and the training model are updated with sampling discrimination. The method uses small sample learning to complete the mining of the latent information of large-scale data in a time- and labor-saving manner, effectively obtains word-formation information and word cross-correlation information, and is highly innovative.
Drawings
Fig. 1 is a schematic diagram of a framework structure of a system for realizing multilingual mixed short text classification processing based on small sample learning according to the present invention.
Fig. 2 is a schematic diagram of a specific embodiment of a method for implementing multilingual mixed short text classification processing based on small sample learning according to the present invention.
Detailed Description
In order to more clearly describe the technical contents of the present invention, a further description will be made below in connection with specific embodiments.
Before describing in detail the embodiments in accordance with the present invention, it should be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus.
Referring to fig. 1, the system for implementing multilingual mixed short text classification based on small sample learning includes:
The data acquisition module is used for inputting a small amount of preset label samples into the system;
the data preprocessing module is connected with the data acquisition module and is used for carrying out data set division, data cleaning and batch processing operation on the preset label sample;
The model calculation processing module is connected with the data preprocessing module and is used for extracting key features according to the text data obtained after preprocessing and generating a corresponding model accuracy calculation result; and
The model generation and output module is connected with the model calculation processing module and is used for predicting a model prediction result of the current text data according to the model accuracy calculation result and further updating and iterating the output model through sampling and auditing processing of the model prediction result.
As a preferred embodiment of the present invention, the model calculation processing module specifically includes:
The word information processing unit is connected with the data preprocessing module and is used for performing n-gram lexical segmentation, word embedding and iterative word-set processing on the small number of labeled text data samples obtained after batch processing;
the text feature embedding unit is connected with the word information processing unit and is used for combining the word information subjected to the iterative processing into text integral features serving as the input of the effective convolution layer;
The text key region feature unit is connected with the text feature embedding unit and used for acquiring text key feature information in the text integral feature;
The text category judging unit is connected with the text key region characteristic unit and is used for analyzing and calculating the category to which the current input text belongs; and
The model accuracy calculating unit is connected with the text category judging unit and is used for calculating the model accuracy of the text information obtained after the text processing.
As a preferred embodiment of the present invention, the model generating and outputting module specifically includes:
The model prediction processing unit is used for inputting multi-language mixed short text data and carrying out model prediction;
The prediction result output unit is connected with the model prediction processing unit and is used for outputting a model prediction result; and
And the sampling auditing unit is connected with the prediction result output unit and is used for sampling and auditing the model prediction result so as to detect the accuracy of the prediction model.
As a preferred embodiment of the present invention, the sampling auditing unit judges whether to perform update calibration according to the following rules through a preset threshold value of the system:
if the accuracy of the text data sampled and audited by the sampling auditing unit is greater than the threshold value, new label data are added to the data acquisition module for iterative updating of the model; otherwise,
if the accuracy of the sampled and audited text data is not greater than the threshold value, the new label data are first calibrated and then added to the data acquisition module for iterative updating of the model.
The method for realizing multilingual mixed short text classification processing based on small sample learning by using the system comprises the following steps:
(1) Acquiring text subword information from the multilingual mixed short text;
(2) Performing data set division, data cleaning and batch-operation preprocessing on the text subword information;
(3) Performing text feature embedding on the preprocessed text subword information to obtain the input information of the effective convolution layer;
(4) Convolving with different kernels to obtain the neighboring-word information and text key region information of the text subword information;
(5) Judging the category to which the text belongs through the probability distribution;
(6) Predicting with the classification model according to the category information, mining new text data information, and updating and iterating the model.
As a preferred embodiment of the present invention, the step (3) specifically includes the following steps:
(3.1) Searching for the word; if it is not found, splitting it by n-gram to form a subword library, searching for special subwords before splitting, and entering step (3.3); otherwise, going to step (3.2);
(3.2) If the word is found, segmenting it by its special subwords and segmenting the remainder by n-gram; otherwise, segmenting directly by n-gram to form the corresponding subword library, and entering step (3.3);
(3.3) Affine-transforming the subword library formed after segmentation into word-level representations, adding the newly represented words to the subword set as special subwords, and calculating the higher-level subword representation.
As a preferred embodiment of the present invention, the higher-level subword representation is calculated according to the following formula:
u_{i|g} = W_{g_i} · v̂_g, g ∈ θ_w, 1 ≤ i ≤ n;
where g is a subword, i is the i-th word in the sentence, W_{g_i} is a data conversion matrix, θ_w is the word set, v̂_g denotes the representation of subword g, and u_{i|g} (1 ≤ i ≤ n) is the higher-level subword representation.
As a preferred embodiment of the present invention, the step (4) specifically includes the following steps:
(4.1) Combining the affine-transformed higher-level subword representations u_{i|g} into a text overall feature U as the input of the effective convolution layer;
(4.2) Performing one-dimensional convolution on the text features with convolution kernels of different widths and different channel numbers to obtain global features containing different neighboring-word information;
(4.3) Performing text global feature control with a self-attention mechanism, thereby calculating and outputting the text key region information features.
As a preferred embodiment of the present invention, the global features of the different neighboring-word information described in step (4.2) are calculated according to the following formula:
V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l));
where k is the kernel width, l indexes the kernel-convolution layer, ReLU is the activation function, U_l represents the input feature data of the l-th kernel-convolution layer, Conv_{1×k} represents the convolution operation with a kernel width of k, and V_{l+1} represents the global feature after convolving the l-th layer.
As a preferred embodiment of the present invention, the step (4.3) specifically includes the steps of:
(4.3.1) Calculating the coupling coefficient c_jm according to the following formula and using it as the saliency weight of the text global feature:
c_jm = exp(b_jm) / Σ_{m'} exp(b_{jm'});
where j is the j-th row of the convolved feature matrix, m is the m-th column of the convolved feature matrix, b_jm is a feature value inside the convolved text data, u′_jm is the pre-convolution feature value used to initialize b_jm, and c_jm is the attention value at row j, column m of the convolved features;
(4.3.2) Calculating the text key region information feature v_m according to the global pooling formula, where u′_{m|j} is the pre-convolution feature of the j-th row, specifically:
v_m = Σ_j c_jm · u′_{m|j};
(4.3.3) Iteratively updating the effective-pooling internal coefficients b_jm according to the following formula:
b_jm = b_jm + v_m · u′_{m|j}.
As a preferred embodiment of the present invention, the step (5) specifically includes the steps of:
(5.1) Integrating the text key region features v^(1) and v^(2) output by the kernel convolutions of 2 different widths into a text information feature vector v by means of a splicing function f; the text information feature vector v is calculated according to the following formula:
v = f(v^(1), v^(2));
(5.2) Inputting the text information feature vector v into a feed-forward neural network FFNN(·) to output the text category features, and predicting the multi-class probability distribution ŷ of the text with a softmax function; the probability distribution ŷ is calculated according to the following formula:
ŷ = softmax(FFNN(v)).
As a preferred embodiment of the present invention, the step (6) specifically includes the steps of:
(6.1) carrying out category information mining on the untagged multi-language mixed short text data by using a trained model and carrying out classification model prediction;
(6.2) detecting the accuracy rate of the model output result in a sampling auditing mode;
(6.3) expanding the new category data as a label sample into a sample data set for updating and iterative processing of the model.
As a preferred embodiment of the present invention, the step (6.2) specifically includes the steps of:
The accuracy of the model is checked by adopting the following auditing standard:
(6.2.1) sampling 5% of the predicted data amount;
(6.2.2) manually judging, wherein the scoring value is 0/1;
(6.2.3) calculating the sampling accuracy P according to the following formula:
P = (Σ_{i=1}^{N} score_i) / N, where N is the number of sampled items and score_i is the 0/1 manual judgment of the i-th item;
(6.2.4) setting a checking threshold value for comparison.
As a preferred embodiment of the present invention, the step (6.2.4) specifically comprises:
If the sampling accuracy is higher than the threshold value, the new category data are extended into the sample data set as labeled samples; if it is lower than the threshold value, the sampling ratio is enlarged, the erroneous labels are calibrated, and the model is further trained after they are extended into the labeled sample data set.
The device for realizing the multi-language mixed short text classification processing based on the small sample learning comprises:
A processor configured to execute computer-executable instructions;
A memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the method for performing multilingual mixed short text classification based on small sample learning described above.
The processor for realizing the multi-language mixed short text classification processing based on the small sample learning is configured to execute computer executable instructions, and the computer executable instructions realize the steps of the method for realizing the multi-language mixed short text classification processing based on the small sample learning when being executed by the processor.
The computer readable storage medium having stored thereon a computer program executable by a processor to perform the steps of the method for performing multilingual mixed short text classification based on small sample learning described above.
In a specific embodiment of the invention, the technical scheme builds a subword embedding network from the internal structure and formation of words, and constructs a multilingual mixed feature space while mitigating the influence of out-of-vocabulary words on the model, so that words with the same semantics in different languages are close to each other in the feature space. To solve the generalization problem of the global pooling of deep convolutional neural networks, coupling coefficients are used to effectively calculate the related features of local text regions. Multi-channel convolution kernels of unequal widths capture different numbers of neighboring words, so that the model does not ignore other related information when extracting the main information of a sentence. The method comprises the following steps:
Step one: obtain text subword information. Text word embedding relies on the subword information of each word: each word is segmented by n-gram to form a subword library, which is affine-transformed into a word-level representation.
Step two: embed text features. The affine-transformed word representations are combined into a text overall feature as the input of the effective convolution layer.
Step three: acquire text key region information. The text key region features are calculated with convolution kernels of different widths and with coupling coefficients. One-dimensional convolutions with kernels of different widths and channel numbers are applied to the text features to obtain global features containing different neighboring-word information. The convolution window slides over the input feature array in turn, and the data inside the window are multiplied element-wise with the kernel and summed to produce the element at the corresponding output position, thereby capturing neighboring-word information at different distances. A self-attention mechanism performs text global feature control on the differently convolved text feature data, with the calculated coupling coefficients serving as the saliency weights of the global features. Then the text key region information features are calculated and output by global pooling. Finally, effective pooling captures the key information in the text without losing other related information, and the internal coefficients of the effective pooling are updated iteratively.
Step four: judge the category to which the text belongs. The text key region information features output by kernels of different widths are integrated into a text information feature vector by a splicing function. The text information features pass through one layer of feed-forward neural network to output text category features, softmax predicts the multi-class probability distribution of the text, and the category to which the text belongs is calculated from the probability distribution.
Step five: predict and mine new text data information. Category information mining is performed on unlabeled multilingual mixed short text data with the trained model, and the accuracy of the model output is checked by sampling audit. The audit standard comprises: (1) sampling 5% of the predicted data amount; (2) manual judgment, with a score of 0/1; (3) calculating the sampling accuracy P = (Σ_i score_i) / N; (4) setting a threshold: if the sampling accuracy is above the threshold, the new category data are extended into the sample data set as labeled samples; if it is below the threshold, the sampling ratio is enlarged, the erroneous labels are calibrated, and the model is further trained after they are extended into the labeled sample data set; (5) updating the iterative model.
Referring to fig. 2, in an embodiment of the invention, taking a Chinese-English multilingual mixed short text as an example, the multilingual mixed short text classification method of the invention comprises the following steps:
1. Data preparation. Text data are read from the multilingual mixed short text. For example, a Chinese-English mixed short sentence is read: Ting_h opened a "psychology" course, so cool, followed by an emoticon; the sentence contains special symbols such as emoticons.
2. Data preprocessing. Punctuation, emoticons and other symbols irrelevant to the text classification information are removed. Chinese is separated from the other languages; English is segmented by spaces, and Chinese is segmented as whole words. The segmentation result is: {'Ting_h', 'opened', 'psychology', 'course', 'so', 'cool'}.
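A rough sketch of this preprocessing step, assuming a regex-based split between CJK and Latin runs; a production system would use a proper Chinese segmenter, so this is illustrative only.

```python
import re

def preprocess(text):
    """Drop classification-irrelevant symbols, then split Chinese from other runs."""
    text = re.sub(r"[^\w\u4e00-\u9fff\s]", " ", text)   # remove punctuation / emoji
    # Latin words are split on spaces (regex runs); Chinese runs are kept whole here
    return re.findall(r"[\u4e00-\u9fff]+|[A-Za-z_]+", text)

print(preprocess("Ting_h opened a 'psychology' course, so cool"))
```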
3. Subword embedding to obtain text subword information. Search for the subword; if it is not found, split by n-gram. For example, the word 'Ting_h' is segmented with a 3-gram. Before 3-gram segmentation, the special characters '<' and '>' are added to the head and tail of the word so that prefix and suffix subwords can be distinguished. Before segmentation, special subwords are searched for; if found, the word is segmented by the special subwords and the remainder by n-gram; otherwise it is segmented directly by n-gram. After segmentation there are 6 subwords: {'<Ti', 'Tin', 'ing', 'ng_', 'g_h', '_h>'}. Then the newly represented word 'Ting_h' is added to the subword set θ_w as a special subword, and the higher-level subword representation is calculated according to formula (1), where v̂_g denotes the representation of subword g and u_{i|g} (1 ≤ i ≤ n) is the higher-level subword representation.
4. Convolution with different kernels to acquire neighboring-word information. The affine-transformed word representations u_{i|g} are combined into a text overall feature U as the input of the effective convolution layer. Convolution kernels of different widths can be used to obtain the correlations of different numbers of neighboring words. According to formula (2), one-dimensional convolutions with kernels of different widths k and t channels are applied to the text features to obtain global features containing different neighboring-word information:
V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l)) (2)
where U_l represents the input feature data of the l-th kernel-convolution layer, Conv_{1×k} represents the convolution operation with a kernel width of k, and V_{l+1} represents the global feature after convolving the l-th layer. Here, assume the sentence length is 10 and the feature dimension is 5, so the input dimension is (5×10). The number of channels is set to 4, and the convolution-kernel widths are 2 and 4 respectively. One convolution thus yields text region features of two different widths, V^(1) and V^(2).
5. Effective pooling to acquire text key region information. Text global feature control is performed with a self-attention mechanism, and coupling coefficients c_jm^(1) and c_jm^(2) are calculated according to formula (3) as the saliency weights of the global features:
c_jm = exp(b_jm) / Σ_{m'} exp(b_{jm'}) (3)
where b_jm is a feature value inside the convolved data, initialized to u′_jm. The text key region information features v^(1) and v^(2) are output according to the global pooling formula (4):
v_m = Σ_j c_jm · u′_{m|j} (4)
The effective-pooling internal coefficients b_jm are iteratively updated according to formula (5):
b_jm = b_jm + v_m · u′_{m|j} (5)
6. Classification calculation. The text key region features v^(1) and v^(2) output by the 2 kernel convolutions of different widths are integrated into a text information feature vector by the splicing function f; the final text information feature vector is v = f(v^(1), v^(2)).
Then the text information features pass through one layer of feed-forward neural network FFNN(·) to output text category features, and softmax predicts the multi-class probability distribution ŷ of the text: ŷ = softmax(FFNN(v)).
7. Model prediction. The model is a 4-class model with the labels sports, education, entertainment and music. The probability distribution ŷ computed by the model is {0.01, 0.91, 0.067, 0.013}; the value corresponding to the "education" category label is the largest, i.e., the sentence category output by the model is "education".
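The final prediction is simply the label with the largest probability; using the distribution from the example:

```python
labels = ["sports", "education", "entertainment", "music"]
probs = [0.01, 0.91, 0.067, 0.013]
print(labels[max(range(len(probs)), key=probs.__getitem__)])   # -> education
```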
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution device.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented as software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.
In this specification, the invention has been described with reference to specific embodiments thereof. It will be apparent that various modifications and variations can be made without departing from the spirit and scope of the invention. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (12)

1. A method for implementing multilingual mixed short text classification processing based on small sample learning by using a system for implementing multilingual mixed short text classification processing based on small sample learning, the system comprising:
The data acquisition module is used for inputting a small amount of preset label samples into the system;
the data preprocessing module is connected with the data acquisition module and is used for carrying out data set division, data cleaning and batch processing operation on the preset label sample;
The model calculation processing module is connected with the data preprocessing module and is used for extracting key features according to the text data obtained after preprocessing and generating a corresponding model accuracy calculation result; and
The model generation and output module is connected with the model calculation processing module and is used for predicting a model prediction result of the current text data according to the model accuracy calculation result and further updating and iterating the model generation and output module through sampling and auditing processing of the model prediction result;
The model calculation processing module specifically comprises:
The word information processing unit is connected with the data preprocessing module and is used for performing n-gram lexical segmentation, word embedding and iterative processing on the small number of preset label samples obtained after batch processing;
the text feature embedding unit is connected with the word information processing unit and is used for combining the word information subjected to the iterative processing into text integral features serving as the input of the effective convolution layer;
The text key region feature unit is connected with the text feature embedding unit and used for acquiring text key feature information in the text integral feature;
The text category judging unit is connected with the text key region characteristic unit and is used for analyzing and calculating the category to which the current input text belongs; and
The model accuracy calculating unit is connected with the text category judging unit and is used for calculating the model accuracy of the text information obtained after the text processing;
the model generation and output module specifically comprises:
The model prediction processing unit is used for inputting multi-language mixed short text data and carrying out model prediction;
The prediction result output unit is connected with the model prediction processing unit and is used for outputting a model prediction result; and
The sampling auditing unit is connected with the prediction result output unit and is used for sampling auditing the model prediction result so as to detect the accuracy of the prediction model;
The sampling auditing unit judges whether to update and calibrate according to the following rules through a system preset threshold value:
if the accuracy of the text data sampled and audited by the sampling auditing unit is greater than the threshold value, new label data are added to the data acquisition module for iterative updating of the model; otherwise,
if the accuracy of the sampled and audited text data is not greater than the threshold value, the new label data are first calibrated and then added to the data acquisition module for iterative updating of the model;
the method is characterized by comprising the following steps:
(1) Acquiring text subword information from the multilingual mixed short text;
(2) Performing data set division, data cleaning and batch-operation preprocessing on the text subword information;
(3) Performing text feature embedding on the preprocessed text subword information to obtain the input information of the effective convolution layer;
(4) Convolving with different kernels to obtain the neighboring-word information and text key region information of the text subword information;
(5) Judging the category to which the text belongs through the probability distribution;
(6) Predicting with the classification model according to the category information, mining new text data information, and updating and iterating the model;
The step (3) specifically comprises the following steps:
(3.1) Searching for the word; if it is not found, splitting it by n-gram to form a subword library, searching for special subwords before splitting, and entering step (3.3); otherwise, going to step (3.2);
(3.2) If the word is found, segmenting it by its special subwords and segmenting the remainder by n-gram; otherwise, segmenting directly by n-gram to form the corresponding subword library, and entering step (3.3);
(3.3) Affine-transforming the subword library formed after segmentation into word-level representations, adding the newly represented words to the subword set as special subwords, and calculating the higher-level subword representation.
2. The method for realizing multilingual mixed short text classification processing based on small sample learning according to claim 1, wherein the higher-level subword representation is calculated according to the following formula:
u_{i|g} = W_{g_i} · v̂_g, g ∈ θ_w, 1 ≤ i ≤ n;
wherein g is a subword, i is the i-th word in the sentence, W_{g_i} is a data conversion matrix, θ_w is the word set, v̂_g denotes the representation of subword g, and u_{i|g} (1 ≤ i ≤ n) is the higher-level subword representation.
3. The method for realizing multilingual mixed short text classification processing based on small sample learning according to claim 2, wherein the step (4) specifically comprises the steps of:
(4.1) Combining the affine-transformed higher-level subword representations u_{i|g} into a text overall feature U as the input of the effective convolution layer;
(4.2) Performing one-dimensional convolution on the text features with convolution kernels of different widths and different channel numbers to obtain global features containing different neighboring-word information;
(4.3) Performing text global feature control with a self-attention mechanism, thereby calculating and outputting the text key region information features.
4. The method for implementing multilingual mixed short text classification processing based on small sample learning as claimed in claim 3, wherein the global features of the different neighboring-word information described in step (4.2) are calculated according to the following formula:
V_{l+1}(U_l) = ReLU(Conv_{1×k}(U_l));
where k is the kernel width, l indexes the kernel-convolution layer, ReLU is the activation function, U_l represents the input feature data of the l-th kernel-convolution layer, Conv_{1×k} represents the convolution operation with a kernel width of k, and V_{l+1} represents the global feature after convolving the l-th layer.
5. The method for realizing multilingual mixed short text classification processing based on small sample learning according to claim 4, wherein said step (4.3) specifically comprises the steps of:
(4.3.1) calculating the coupling coefficient c_jm according to the following formula and using it as the saliency weight of the text global feature:
c_jm = exp(b_jm) / Σ_{m'} exp(b_{jm'});
wherein j is the j-th row of the convolved feature matrix, m is the m-th column of the convolved feature matrix, b_jm is a feature value inside the convolved text data, and c_jm is the attention value at row j, column m of the convolved features;
(4.3.2) calculating the text key region information feature v_m according to the global pooling formula, wherein u′_{m|j} is the pre-convolution feature of the j-th row, specifically:
v_m = Σ_j c_jm · u′_{m|j};
(4.3.3) iteratively updating the effective-pooling internal coefficients b_jm according to the following formula:
b_jm = b_jm + v_m · u′_{m|j}.
6. The method for realizing multilingual mixed short text classification based on small sample learning according to claim 5, wherein said step (5) specifically comprises the steps of:
(5.1) integrating the text key-region features v^(1) and v^(2) output by the convolution kernels of the 2 different widths into a text information feature vector v by means of a stitching function f, calculated according to the following formula:

v = f(v^(1), v^(2));

wherein v^(i) is the information feature of the i-th text key region;
(5.2) inputting the text information feature vector v into a feed-forward neural network FFNN to output the text category features, and predicting the probability distribution ŷ over the multiple text categories with a softmax function, the probability distribution ŷ being calculated according to the following formula:

ŷ = softmax(FFNN(v)).
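Step (5) is thus concatenation followed by a small classifier head. A NumPy sketch with an assumed single hidden layer and an illustrative class count of 5:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=8), rng.normal(size=8)  # key-region features v^(1), v^(2)
v = np.concatenate([v1, v2])                     # stitching function f

W1, b1 = rng.normal(size=(32, 16)) * 0.1, np.zeros(32)  # assumed hidden FFNN layer
W2, b2 = rng.normal(size=(5, 32)) * 0.1, np.zeros(5)    # 5 illustrative categories
h = np.tanh(W1 @ v + b1)
y_hat = softmax(W2 @ h + b2)                     # predicted distribution ŷ
print(y_hat.sum())                               # ≈ 1.0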
7. The method for realizing multilingual mixed short text classification based on small sample learning according to claim 6, wherein said step (6) specifically comprises the steps of:
(6.1) mining category information from the unlabeled multilingual mixed short text data with the trained model and performing classification-model prediction;
(6.2) checking the accuracy of the model output by means of a sampling audit;
(6.3) expanding the new category data, as labeled samples, into the sample dataset for model updating and iteration.
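Steps (6.1)-(6.3) describe one round of self-training. A hedged Python sketch of that loop; the model API (predict/fit), the audit callback, and the threshold are all illustrative assumptions:

def self_training_round(model, unlabeled, labeled, audit_fn, threshold=0.9):
    predicted = [(x, model.predict(x)) for x in unlabeled]  # (6.1) mine categories
    accuracy = audit_fn(predicted)                          # (6.2) sampled audit
    if accuracy >= threshold:
        labeled.extend(predicted)                           # (6.3) expand dataset
        model.fit(labeled)                                  # update and iterate
    return accuracy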
8. The method for realizing multilingual mixed short text classification processing based on small sample learning according to claim 7, wherein said step (6.2) specifically comprises checking the accuracy of the model against the following audit criteria:
(6.2.1) sampling 5% of the predicted data volume;
(6.2.2) judging each sampled prediction manually, with a score of 0 or 1;
(6.2.3) calculating the sampling accuracy according to the following formula:

sampling accuracy = Σ scores / number of sampled items;

(6.2.4) setting a check threshold for comparison.
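A small Python sketch of the audit in steps (6.2.1)-(6.2.4); the judge callback stands in for the manual 0/1 scoring, and the 0.9 threshold is illustrative:

import random

def sampled_accuracy(predictions, judge, ratio=0.05, seed=0):
    # (6.2.1) sample 5% of the predicted data
    n = max(1, int(len(predictions) * ratio))
    sample = random.Random(seed).sample(predictions, n)
    # (6.2.2)-(6.2.3) average the manual 0/1 scores into the sampling accuracy
    scores = [judge(p) for p in sample]
    return sum(scores) / len(scores)

acc = sampled_accuracy(list(range(1000)), judge=lambda p: 1)
print(acc >= 0.9)   # (6.2.4) compare against the check threshold -> True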
9. The method for realizing multilingual mixed short text classification processing based on small sample learning according to claim 8, wherein the step (6.2.4) is specifically:
if the sampling accuracy is higher than the threshold, expanding the new category data as labeled samples into the sample dataset; if it is lower than the threshold, enlarging the sampling ratio, correcting the erroneous labels, and further training the model after expanding the corrected data into the labeled sample dataset.
10. An apparatus for implementing multilingual mixed short text classification processing based on small sample learning, the apparatus comprising:
A processor configured to execute computer-executable instructions;
a memory storing one or more computer-executable instructions which, when executed by the processor, perform the steps of the method of performing multilingual mixed short text classification processing based on small sample learning of any one of claims 1 to 9.
11. A processor for performing multilingual mixed short text classification based on small sample learning, wherein the processor is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the method for performing multilingual mixed short text classification based on small sample learning as recited in any one of claims 1 to 9.
12. A computer-readable storage medium, having stored thereon a computer program executable by a processor to perform the steps of the method of any one of claims 1 to 9 for performing multilingual mixed short text classification based on small sample learning.
CN202110886442.0A 2021-08-03 2021-08-03 System, method, device, memory and storage medium for realizing multilingual mixed short text classification processing based on small sample learning Active CN113535961B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110886442.0A CN113535961B (en) 2021-08-03 2021-08-03 System, method, device, memory and storage medium for realizing multilingual mixed short text classification processing based on small sample learning

Publications (2)

Publication Number Publication Date
CN113535961A CN113535961A (en) 2021-10-22
CN113535961B true CN113535961B (en) 2024-06-07

Family

ID=78090291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110886442.0A Active CN113535961B (en) 2021-08-03 2021-08-03 System, method, device, memory and storage medium for realizing multilingual mixed short text classification processing based on small sample learning

Country Status (1)

Country Link
CN (1) CN113535961B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190034823A1 (en) * 2017-07-27 2019-01-31 Getgo, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization
CN110134786A (en) * 2019-05-14 2019-08-16 南京大学 A kind of short text classification method based on theme term vector and convolutional neural networks
CN111428026A (en) * 2020-02-20 2020-07-17 西安电子科技大学 Multi-label text classification processing method and system and information data processing terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant