CN114492661A - Text data classification method and device, computer equipment and storage medium - Google Patents

Text data classification method and device, computer equipment and storage medium

Info

Publication number
CN114492661A
Authority
CN
China
Prior art keywords
text data
similarity
data
label
pinyin
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210134070.0A
Other languages
Chinese (zh)
Inventor
胡天瑞
侯晓龙
江炼鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210134070.0A priority Critical patent/CN114492661A/en
Publication of CN114492661A publication Critical patent/CN114492661A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The embodiments provide a text data classification method and device, a computer device, and a storage medium, and belong to the technical field of artificial intelligence. The method includes the following steps: acquiring original text data and label text data, and inputting the original text data and the label text data into a pre-training model for similarity prediction to obtain a first relevant label; performing pinyin conversion on the original text data and the label text data respectively to obtain first pinyin data and second pinyin data; inputting the first pinyin data and the second pinyin data into the pre-training model for similarity prediction to obtain a second relevant label; and obtaining a target related label from the first relevant label and the second relevant label to obtain a target classification result. By additionally providing the label text data, the pre-training model learns the associations among the label meanings, and by converting the text to pinyin it obtains phonetic information beyond the semantics, so that the accuracy of the similarity prediction of the pre-training model is improved, and the accuracy of text data classification is improved in turn.

Description

Text data classification method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text data classification method and apparatus, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, robots are used more and more widely. In the natural language processing part of machine learning, the robot needs to understand the text data input by a user and classify its intent. Generally, a classification model classifies the text data directly, but this may prevent the classification model from effectively understanding the meaning of each label, which in turn may lead to low classification accuracy.
Disclosure of Invention
The embodiment of the disclosure mainly aims to provide a text data classification method and device, a computer device and a storage medium, which can improve the accuracy of text data classification.
In order to achieve the above object, a first aspect of the embodiments of the present disclosure provides a text data classification method, where the method includes:
acquiring original text data and corresponding label text data;
inputting the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first related labels;
performing pinyin conversion on the original text data to obtain first pinyin data, and performing pinyin conversion on the label text data to obtain second pinyin data;
inputting the first pinyin data and the second pinyin data into the pre-training model for similarity prediction processing to obtain a plurality of second relevant labels;
and obtaining target related labels from the plurality of first related labels and the plurality of second related labels to obtain a target classification result.
In some embodiments, the inputting the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels includes:
acquiring a plurality of preset label sample data;
calculating the similarity between the original text data and each label sample data to obtain a similarity matrix;
coding the similarity matrix to obtain a plurality of first similarity vectors;
and performing pooling processing on the plurality of first similarity vectors to obtain a plurality of first related labels.
In some embodiments, the calculating a similarity between the original text data and each of the tag sample data to obtain a similarity matrix includes:
performing word embedding processing on the original text data to obtain a first embedded vector, and performing word embedding processing on each tag sample data to obtain a second embedded vector;
performing dot product operation on each first embedding vector and the corresponding second embedding vector to obtain a plurality of similarity values;
and constructing the similarity matrix according to the plurality of similarity values.
In some embodiments, the pinyin conversion of the original text data to obtain first pinyin data includes:
acquiring a plurality of pre-stored preset fields; each preset field corresponds to a preset pinyin;
splitting the original text data to obtain a plurality of original fields;
screening a plurality of target fields from the plurality of preset fields according to the plurality of original fields;
acquiring a target pinyin corresponding to each target field from the preset pinyins to obtain a plurality of target pinyins;
and splicing the target pinyins to obtain the first pinyin data.
In some embodiments, the obtaining the target related label from the plurality of first related labels and the plurality of second related labels to obtain the target classification result includes:
calculating the similarity between each first related label and the label text data to obtain a plurality of first similarity values;
calculating the similarity between each second related label and the second pinyin data to obtain a plurality of second similarity values;
and acquiring target related labels from the plurality of first related labels and the plurality of second related labels according to the plurality of first similarity values and the plurality of second similarity values, to obtain the target classification result.
In some embodiments, the obtaining, according to the plurality of first similarity values and the plurality of second similarity values, a target related label from the plurality of first related labels and the plurality of second related labels to obtain the target classification result includes:
adding each first similarity value and the corresponding second similarity value to obtain a plurality of preliminary similarity values;
acquiring the maximum preliminary similarity value as a target similarity value;
and according to the target similarity value, acquiring the corresponding target related label from the plurality of first related labels and the plurality of second related labels to obtain the target classification result.
In some embodiments, after obtaining the target related label from the plurality of first related labels and the plurality of second related labels, and obtaining a target classification result, the method further includes:
and optimizing a loss function of the pre-training model according to the target classification result and the label text data so as to update the pre-training model.
A second aspect of the embodiments of the present disclosure provides a text data classification apparatus, including:
an acquisition module, configured to acquire original text data and corresponding label text data;
a first prediction module, configured to input the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels;
a conversion module, configured to perform pinyin conversion on the original text data to obtain first pinyin data, and to perform pinyin conversion on the label text data to obtain second pinyin data;
a second prediction module, configured to input the first pinyin data and the second pinyin data into the pre-training model for similarity prediction processing to obtain a plurality of second relevant labels;
a classification module, configured to obtain target related labels from the plurality of first relevant labels and the plurality of second relevant labels to obtain a target classification result.
A third aspect of the embodiments of the present disclosure provides a computer device, which includes a memory and a processor, where the memory stores therein a computer program, and when the computer program is executed by the processor, the processor is configured to perform the method according to any one of the embodiments of the first aspect of the present disclosure.
A fourth aspect of the embodiments of the present disclosure provides a storage medium, which is a computer-readable storage medium, and the storage medium stores computer-executable instructions, where the computer-executable instructions are configured to cause a computer to perform the method according to any one of the embodiments of the first aspect of the present disclosure.
According to the text data classification method and device, the computer device, and the storage medium provided by the embodiments of the present disclosure, original text data and corresponding label text data are acquired and input into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels; pinyin conversion is performed on the original text data to obtain first pinyin data, and pinyin conversion is performed on the label text data to obtain second pinyin data; the first pinyin data and the second pinyin data are input into the pre-training model for similarity prediction processing to obtain a plurality of second relevant labels; and target related labels are obtained from the plurality of first relevant labels and the plurality of second relevant labels to obtain a target classification result. In the embodiments of the present disclosure, the label text data is additionally provided to the pre-training model, so that the pre-training model can fully understand the associations between the actual meanings of the labels, and the original text data and the label text data are converted to pinyin so that the pre-training model can acquire phonetic information beyond the sentence semantics, thereby improving the accuracy of the similarity prediction of the pre-training model and, in turn, the accuracy of text data classification.
Drawings
Fig. 1 is a flowchart of a text data classification method provided by an embodiment of the present disclosure;
FIG. 2 is a flowchart of step S200 in FIG. 1;
FIG. 3 is a flowchart of step S220 in FIG. 2;
FIG. 4 is a flowchart of step S300 in FIG. 1;
FIG. 5 is a flowchart of step S500 in FIG. 1;
FIG. 6 is a flowchart of step S530 in FIG. 5;
fig. 7 is a block diagram of a module structure of a text data classification apparatus according to an embodiment of the present disclosure;
fig. 8 is a hardware structure diagram of a computer device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand, and apply human languages (such as Chinese and English), and is a branch of artificial intelligence; it is an interdisciplinary field between computer science and linguistics, often called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.
Chat robot (Chatterbot): a computer program that conducts a conversation via speech or text. It can simulate human dialogue and may pass the Turing test. Chat robots can be used for practical purposes such as customer service or information acquisition. Some chat robots carry natural language processing systems, but many simple systems only capture input keywords and then search a database for the most appropriate answer sentence. Chat robots form part of virtual assistants (e.g., Google Assistant) and can interface with the applications, websites, and instant messaging platforms (e.g., Facebook Messenger) of many organizations. Non-assistant applications include chat rooms for entertainment purposes, research, product-specific promotion, and social bots.
Beautiful Soup: a Python library whose most important function is to extract data from web pages. Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers.
PyPinyin: a Chinese-character-to-pinyin library provided in Python, which can be used for phonetic annotation, sorting, retrieval, and similar scenarios. Its main features are intelligently matching the most correct pinyin for phrases, support for polyphonic characters, support for simplified and traditional Chinese, and support for a variety of pinyin and zhuyin styles.
One-Hot Encoding: also known as one-bit effective encoding, it encodes each state with a state register in which each state has its own independent register bit and only one bit is active at any time.
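As a brief illustrative example (not part of the claimed method), one-hot encoding of a small label set such as the "premium"/"term"/"product" labels mentioned later in this application might look like the following Python snippet; the label names are only placeholders:

```python
# Illustrative one-hot encoding: each state gets its own bit, and only one bit is active.
states = ["premium", "term", "product"]
one_hot = {s: [1 if i == j else 0 for j in range(len(states))] for i, s in enumerate(states)}
print(one_hot)  # {'premium': [1, 0, 0], 'term': [0, 1, 0], 'product': [0, 0, 1]}
```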
Encoder: converts the input sequence into a fixed-length vector.
Decoder: converts the previously generated fixed-length vector into an output sequence; the input sequence may be text, speech, images, or video, and the output sequence may be text or images.
BERT (Bidirectional Encoder Representations from Transformers) model: the BERT model further increases the generalization ability of word vector models and fully captures character-level, word-level, sentence-level, and even inter-sentence relational features; it is built on the Transformer. There are three embeddings in BERT, namely Token Embeddings, Segment Embeddings, and Position Embeddings. Token Embeddings are word vectors; the first token is the [CLS] marker, which can be used for downstream classification tasks. Segment Embeddings are used to distinguish two sentences, because pre-training performs not only language modeling but also classification tasks that take two sentences as input. For Position Embeddings, the position representation is not the trigonometric (sinusoidal) function used in the Transformer but is learned during BERT training: BERT directly trains a position embedding to preserve position information, randomly initializes a vector at each position, adds it to model training, and finally obtains an embedding containing position information; BERT combines the position embedding and the word embedding directly.
Embedding: an embedding is a vector representation, meaning that an object (a word, a product, a movie, etc.) is represented by a low-dimensional vector. The embedding vector has the property that objects whose vectors are close to each other have similar meanings; for example, the embedding of "The Avengers" is very close to the embedding of "Iron Man", but far from the embedding of an unrelated film. In essence, embedding is a mapping from a semantic space to a vector space that preserves, as much as possible, the relationships of the original samples in the semantic space; for example, two words with similar semantics are also close to each other in the vector space. An embedding can encode an object with a low-dimensional vector while preserving its meaning. It is widely applied in machine learning: when building a machine learning model, the object is encoded into a low-dimensional dense vector and then passed to a DNN, which improves efficiency.
Word embedding model (word embedding model): the content of the request text can be converted into a vector representation.
Short text matching model (Enhanced LSTM for Natural Language Inference, ESIM): a text similarity calculation model, specifically an enhanced LSTM designed for natural language inference, which mainly consists of input encoding, local inference modeling, and inference composition.
Bi-directional Long Short-Term Memory (BiLSTM): consists of two independent LSTMs; the input sequence is fed into the two LSTM neural networks in forward order and reverse order respectively for feature extraction, and the word vector formed by concatenating the two output vectors (i.e., the extracted feature vectors) is used as the final feature representation of the word. The design idea of BiLSTM is that the features obtained at time t should contain information from both the past and the future; the efficiency and performance of this structure for extracting text features are superior to those of a single LSTM. The parameters of the two LSTM neural networks in BiLSTM are independent of each other; they share only the word-embedding vector table.
Cross Entropy: mainly used to measure the difference between two probability distributions. The performance of a language model is typically measured in terms of cross entropy and perplexity. Cross entropy expresses how difficult it is for the model to recognize the text or, from a compression point of view, how many bits on average are needed to encode each word. Perplexity expresses the average number of branches the model assigns to the text; its reciprocal can be regarded as the average probability of each word. Smoothing assigns a probability value to unobserved N-gram combinations, ensuring that a word sequence can always obtain a probability value from the language model; commonly used smoothing techniques include Good-Turing estimation, interpolation smoothing, Katz smoothing, and Kneser-Ney smoothing. Cross entropy can be used as a loss function in neural networks (machine learning): p denotes the distribution of the true labels, q denotes the distribution of the labels predicted by the trained model, and the cross-entropy loss function measures the similarity between p and q. An advantage of cross entropy as a loss function is that, when using the sigmoid function in gradient descent, it avoids the decreasing learning rate problem of the mean-squared-error loss function, because the learning rate can be controlled by the output error. In feature engineering, it can be used to measure the similarity between two random variables. In language modeling (NLP), since the true distribution p is unknown, the model is obtained from a training set, and cross entropy measures the accuracy of the model on a test set.
Attention Mechanism: an attention mechanism enables a neural network to focus on a subset of its inputs (or features) by selecting particular inputs, and it can be applied to any type of input regardless of its shape. In situations where computing power is limited, the attention mechanism is a resource allocation scheme and the primary means of solving the information overload problem, allocating computing resources to more important tasks.
Pooling: an important concept in convolutional neural networks; it is in fact a form of downsampling. There are many different nonlinear pooling functions, of which max pooling is the most common. It partitions the input image into several rectangular regions and outputs the maximum value for each sub-region. Intuitively, this mechanism works because, once a feature has been found, its exact location is far less important than its position relative to other features. The pooling layer continually reduces the spatial size of the data, so the number of parameters and the amount of computation also decrease, which to some extent also controls overfitting. Typically, pooling layers are inserted periodically between the convolutional layers of a CNN. A pooling layer usually acts on each input feature map separately and reduces its size. The most common form of pooling layer partitions the image into 2x2 blocks and takes the maximum of the 4 numbers in each block, which reduces the amount of data by 75%. Besides max pooling, the pooling layer may use other pooling functions, such as average pooling or even L2-norm pooling.
Back propagation: the principle of back propagation is that the training set data is fed into the input layer of a neural network, passes through the hidden layers, and finally reaches the output layer, which outputs a result; because there is an error between the output of the neural network and the actual result, the error between the estimated value and the actual value is calculated and propagated backwards from the output layer through the hidden layers until it reaches the input layer; during back propagation, the values of the parameters are adjusted according to the error; this process is iterated until convergence.
Recurrent Neural Network (RNN): a class of neural networks that take sequence data as input, recurse along the evolution direction of the sequence, and connect all nodes (recurrent units) in a chain; the Bidirectional Recurrent Neural Network (Bi-RNN) and the Long Short-Term Memory network (LSTM) are common recurrent neural networks. Recurrent neural networks have memory, share parameters, and are Turing complete, and therefore have certain advantages in learning the nonlinear characteristics of sequences. Recurrent neural networks are applied in Natural Language Processing (NLP), for example in speech recognition, language modeling, and machine translation, and are also used for various time-series predictions. Recurrent neural networks built by introducing Convolutional Neural Networks (CNNs) can handle computer vision problems involving sequence input.
With the development of artificial intelligence technology, the usage rate of robots, such as chat robots, is gradually increasing. In the natural language processing part of machine learning, the robot needs to understand the text data input by a user and classify its intent. Generally, a classification model classifies the text data directly, but this may prevent the classification model from effectively understanding the meaning of each label, which in turn may lead to low classification accuracy. The specific problems are as follows:
the traditional method does not consider the content of the label, and directly carries out multi-label classification on the text data. However, such a method may cause that the classification model cannot effectively understand the meaning between the labels and the corresponding text information, thereby causing low learning efficiency and poor classification effect of the classification model. In particular, in the insurance field, there is a relationship between the meanings of the labels, and if the classification model cannot understand the relationship between the labels in the insurance field, confusion is easy to occur, resulting in low classification accuracy.
In addition, the traditional classification model understands text only at the sentence-meaning level and does not consider other information such as dialogue speech. However, actual dialogue text carries phonetic information beyond the sentence meaning, such as homophonic vocabulary, misreadings, and mispronunciations. Such information present in a spoken dialogue cannot be effectively taken into account by traditional classification models, which affects the actual classification effect.
Based on this, the embodiment of the present disclosure provides a text data classification method and apparatus, a computer device, and a storage medium, which can improve the accuracy of text data classification.
The embodiment of the present disclosure provides a text data classification method and apparatus, a computer device, and a storage medium, which are specifically described with reference to the following embodiments, and first, a text data classification method in the embodiment of the present disclosure is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the disclosure provides a text data classification method, and relates to the field of artificial intelligence. The text data classification method provided by the embodiment of the disclosure can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, smart watch, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network) and big data and artificial intelligence platforms; the software may be an application or the like that implements a text data classification method, but is not limited to the above form.
The disclosed embodiments are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Referring to fig. 1, a text data classification method according to an embodiment of the first aspect of the embodiment of the present disclosure includes, but is not limited to, steps S100 to S500.
Step S100, acquiring original text data and corresponding label text data;
step S200, inputting original text data and label text data into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels;
step S300, performing pinyin conversion on the original text data to obtain first pinyin data, and performing pinyin conversion on the label text data to obtain second pinyin data;
s400, inputting the first pinyin data and the second pinyin data into a pre-training model for similarity prediction processing to obtain a plurality of second related labels;
step S500, obtaining a target related label from the plurality of first related labels and the plurality of second related labels, and obtaining a target classification result.
In step S100 of some embodiments, original text data and corresponding label text data are acquired. The original text data is the text to be classified; for example, original text data in the insurance field may be the names of insurance products, such as "Golden Life". Considering that various homophonic words, common words, and uncommon words may appear when insurance products are named, these aspects need to be taken into account when the corresponding original text data is acquired in the embodiments of the present disclosure. The label text data is the basis for classifying the original text data. In the insurance field, each category has a definite label text that defines it; for example, the label texts may include "premium", "term", and "product", and different label text data are used to distinguish the categories of the original text data, thereby improving the classification effect.
It should be noted that, in practical applications, the original text data may be obtained by crawling data from relevant web pages with a web crawler or other technical means; for example, after the data source is set, a web crawler is written to crawl the data in a targeted manner to obtain the original text data.
After the original text data is acquired, some preprocessing operations may be performed on it to delete unnecessary characters so that the classification result is more accurate. For example, original text data captured from web pages often contains unnecessary content such as HTML tags, which contributes little to the analysis of the text; the Beautiful Soup library can be used to delete these redundant tags. As another example, in commonly used text corpora, particularly when processing English, accented characters often need to be handled; it is therefore necessary to ensure that these characters are converted and normalized into ASCII characters (for example, é is converted to e), so that the accents are removed. In addition, tokenization, removal of extra spaces, lower-casing, spelling correction, grammar correction, removal of duplicate characters, and similar processing may be performed to ensure the correctness of the original text data, thereby improving the accuracy of text classification.
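A minimal Python sketch of the preprocessing described above is given below for illustration only; it assumes the Beautiful Soup library mentioned in this application, and the function name and exact cleaning steps are assumptions rather than the claimed implementation:

```python
import re
import unicodedata

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def preprocess_text(raw_html: str) -> str:
    """Illustrative cleanup of crawled text: strip HTML tags, remove accents,
    collapse extra whitespace, and lower-case the result."""
    # Remove HTML tags left over from web crawling.
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    # Strip combining accent marks (e.g. é -> e) without touching other characters.
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    # Collapse repeated whitespace and lower-case (lower-casing mainly matters for English).
    return re.sub(r"\s+", " ", text).strip().lower()


if __name__ == "__main__":
    print(preprocess_text("<p>Claim   amount:&nbsp;<b>café</b></p>"))  # -> "claim amount: cafe"
```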
In step S200 of some embodiments, the original text data and the label text data are input to a pre-training model, and the pre-training model performs text similarity matching on the original text data and the label text data to obtain a plurality of predicted first relevant labels.
It should be noted that, considering the cost and overall efficiency of model training, the pre-training model may adopt a BERT model or a Chinese BERT model, or a pre-training model specific to the insurance field. The training process of the pre-training model is as follows: a source text, i.e., a text to be classified, is obtained; the source text and its pinyin-converted counterpart are each matched against the texts corresponding to the labels; and similarity matching, fusion, and loss function calculation are performed using a BERT model for learning. Specifically, the input text is converted into a corresponding text vector by the pre-training model; the text vectors are matched for similarity by a short text matching model to obtain a similarity matching result; prediction processing is performed according to the similarity matching result, and a first predicted value is calculated; the text pinyin and the label text pinyin are processed in the same way, and a second predicted value is calculated; the first predicted value and the second predicted value are averaged to obtain a final classification predicted value, and a loss function, such as a multi-class cross-entropy loss function, is calculated according to the final classification predicted value to update the model parameters, yielding a preliminarily trained pre-training model.
In step S300 of some embodiments, pinyin conversion is performed on the original text data to obtain first pinyin data, and pinyin conversion is performed on the label text data to obtain second pinyin data, where pinyin conversion means converting the text into a pinyin representation. For example, if the original text data or the label text data is "claim amount", the corresponding pinyin data is "['li'], ['pei'], ['e'], ['du']", and the corresponding tone is also recorded for each pinyin. In the disclosed embodiment, the tone corresponding to ['li'] is the third tone, the tone corresponding to ['pei'] is the second tone, the tone corresponding to ['e'] is the second tone, and the tone corresponding to ['du'] is the fourth tone.
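For illustration, the pinyin conversion of the "claim amount" example could be reproduced with the open-source PyPinyin library described earlier; the small wrapper below is only a sketch, not the claimed implementation:

```python
from pypinyin import Style, pinyin  # pip install pypinyin

# Convert the example text "claim amount" (in Chinese) to pinyin with tone numbers.
# Style.TONE3 appends the tone digit to each syllable, e.g. 'li3'.
result = pinyin("理赔额度", style=Style.TONE3)
print(result)  # [['li3'], ['pei2'], ['e2'], ['du4']]

# Splice the syllables into a single string to serve as the first pinyin data.
first_pinyin_data = " ".join(item[0] for item in result)
print(first_pinyin_data)  # "li3 pei2 e2 du4"
```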
In step S400 of some embodiments, the first pinyin data and the second pinyin data are input to a pre-training model, and the pre-training model performs pinyin similarity matching on the first pinyin data and the second pinyin data to obtain a plurality of predicted second related labels. It should be noted that the model structure of the pre-trained model in step S200 is the same as that of the pre-trained model in step S400, and the difference is that the input data type is different, the data type input in step S200 is text data, and the data input in step S400 is pinyin data.
In step S500 of some embodiments, a target related label is obtained from the plurality of first relevant labels and the plurality of second relevant labels, and a target classification result is obtained. It should be noted that, after the plurality of first relevant labels and the plurality of second relevant labels are obtained in steps S200 and S400, the required target related label still needs to be determined according to similarity, yielding the target classification result.
In some embodiments, as shown in fig. 2, step S200 includes, but is not limited to, steps S210 through S240.
Step S210, obtaining a plurality of preset label sample data;
step S220, calculating the similarity of the original text data and the sample data of each label to obtain a similarity matrix;
step S230, encoding the similarity matrix to obtain a plurality of first similarity vectors;
step S240, performing pooling processing on the plurality of first similar vectors to obtain a plurality of first related labels.
In step S210 of some embodiments, a plurality of preset tag sample data are obtained, where the tag sample data and the tag text data are also bases for classifying the original text data, but the tag sample data is a preset tag text in all the classifications, and the tag text data is a tag text corresponding to the original text data.
In step S220 of some embodiments, the similarity between the original text data and the sample data of each tag is calculated to obtain a similarity matrix. Specifically, the original text data and the tag sample data are encoded to obtain text sequences, the similarity between the text sequences is calculated, and a similarity matrix can be constructed according to the similarity value of each original text data and each tag sample data.
In step S230 of some embodiments, the similarity matrix is encoded to obtain a plurality of first similarity vectors, and specifically, the similarity matrix is input to a preset neural network for encoding to obtain a plurality of first similarity vectors used for characterizing the similarity value.
In step S240 of some embodiments, pooling processing, such as average pooling or max pooling, is performed on the plurality of first similarity vectors to obtain a plurality of first related labels. In practical applications, max pooling may be applied to each first similarity vector through a pooling layer, so that a plurality of pooled feature vectors are obtained. Note that since the sizes of the features (feature maps) obtained by convolution kernels of different sizes also differ, a pooling function is applied to each feature map so that the feature maps end up with the same dimensions. The most common choice is max pooling, in which each convolution kernel yields a single feature value; applying max pooling to all convolution kernels yields the plurality of pooled feature vectors.
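The following PyTorch snippet is an illustrative sketch of max pooling over feature maps of different lengths, as described above; the tensor shapes are made up for the example and are not part of the claimed method:

```python
import torch
import torch.nn.functional as F

# Suppose three convolution kernels of different sizes produced feature maps of
# different lengths for the same input (batch=1, channels=4).
feature_maps = [torch.randn(1, 4, n) for n in (18, 17, 16)]

# Max pooling over the full length of each feature map yields one value per channel,
# so every feature map ends up with the same dimensionality regardless of its length.
pooled = [F.max_pool1d(fm, kernel_size=fm.size(-1)).squeeze(-1) for fm in feature_maps]

# Concatenate the pooled features into a single fixed-size vector.
pooled_vector = torch.cat(pooled, dim=-1)
print(pooled_vector.shape)  # torch.Size([1, 12])
```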
In some embodiments, as shown in fig. 3, step S220 specifically includes, but is not limited to, step S221 to step S223.
Step S221, carrying out word embedding processing on original text data to obtain a first embedded vector, and carrying out word embedding processing on each label sample data to obtain a second embedded vector;
step S222, performing dot product operation on each first embedded vector and the corresponding second embedded vector to obtain a plurality of similarity values;
step S223, a similarity matrix is constructed according to the plurality of similarity values.
In step S221 of some embodiments, a word embedding model, for example, a BERT model, is used to encode the original text data and the tag sample data, and the original text data and the tag sample data are converted into corresponding encoding vectors, i.e., a first embedding vector and a second embedding vector are obtained.
In step S222 of some embodiments, a dot product operation is performed on each first embedding vector and the corresponding second embedding vector to obtain a plurality of similarity values. The dot product here refers to the cosine distance between a first embedding vector and the corresponding second embedding vector: performing the dot product operation on the first embedding vector and the second embedding vector in effect calculates the cosine similarity of the two vectors, and the larger the cosine similarity value, the more similar the first embedding vector and the second embedding vector are and the higher the similarity value. It should be noted that, in the embodiments of the present disclosure, the cosine similarity value may be used directly as the similarity value.
In step S223 of some embodiments, a similarity matrix is constructed according to a plurality of similarity values, specifically, the number of rows and the number of columns of the similarity matrix are determined according to the number of the first embedded vectors and the number of the second embedded vectors, and then the similarity values are filled in the similarity matrix, so as to obtain the constructed similarity matrix.
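As an illustrative sketch of steps S221 to S223, the snippet below builds a similarity matrix from dot products of L2-normalized embedding vectors (i.e., cosine similarities); the random vectors stand in for embeddings that would come from a BERT-style encoder, and the dimensions are assumptions:

```python
import numpy as np

# Stand-ins for embeddings produced by a BERT-style encoder.
rng = np.random.default_rng(0)
first_embeddings = rng.normal(size=(3, 768))   # 3 pieces of original text data
second_embeddings = rng.normal(size=(5, 768))  # 5 label sample data


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Dot product of L2-normalized vectors, i.e. the cosine similarity."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


similarity_matrix = cosine_similarity_matrix(first_embeddings, second_embeddings)
print(similarity_matrix.shape)  # (3, 5): rows = original texts, columns = label samples
```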
In practical applications, the word embedding process may be as follows: denote the n label sample data under all classifications as l1, ..., ln; denote the original text data as wi; and denote the label text data corresponding to the original text data as li. Taking w as the input text (which may be the original text data or its corresponding pinyin data), the input text is converted into the corresponding text vectors, i.e., the word embeddings (the first embedding vector and the second embedding vectors), by the BERT model.
In practical applications, after the first embedding vector and the second embedding vectors are obtained, the similarity values between them need to be calculated. The process may be as follows: the first embedding vector and the second embedding vectors are input into the ESIM module for similarity matching to obtain y1', ..., yn', as shown in formula (1):

yj' = ESIM(w, lj) = (n, c, e)   (1)

where yj' denotes one of the similarity values y1', ..., yn' between the first embedding vector and the corresponding second embedding vector; ESIM(w, lj) denotes inputting the original text data w and the label text data lj into the ESIM module; and (n, c, e) denotes the ternary similarity relationship between w and lj, which also serves as the output for the label.
To facilitate understanding of the similarity matching process of the ESIM model, a brief description based on the ESIM requirements is given first: ESIM is an interactive text matching network, and its usual processing flow is: encode the two input text sequences; perform local inference, which mainly computes the similarity interactions between the two text sequences; perform sum and product operations and splice the results to form a new sequence; and then encode through a BiLSTM to obtain the output result.
Specifically, the process of similarity matching by the ESIM module is as follows: the original text data and the corresponding label text data are input into the pre-training model for word embedding, or the first pinyin data and the corresponding second pinyin data are input into the pre-training model for word embedding, to obtain word embedding data. Then, the word embedding data is input into a first BiLSTM network preset in the pre-training model for encoding to obtain a text vector. Next, the similarity between the text vector and the corresponding label text data is calculated (including multiplication, addition, and other calculations) to obtain a similarity matrix, and the similarity matrix is input into a second preset BiLSTM network for encoding to obtain an output vector. Finally, pooling processing is performed on the output vector to obtain the final ternary output (n, c, e).
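A much-simplified, illustrative PyTorch sketch of such an ESIM-style matching pipeline (embedding, BiLSTM encoding, cross attention, local inference enhancement, BiLSTM composition, pooling, ternary output) is shown below; the layer sizes, vocabulary size, and the omission of masking and other details are assumptions, and it is not the claimed implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyESIM(nn.Module):
    """Simplified ESIM-style matcher: embed -> BiLSTM encode -> cross attention
    -> local inference enhancement -> BiLSTM composition -> pooling -> 3-way output."""

    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.composer = nn.LSTM(8 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(8 * hidden_dim, num_classes)

    def forward(self, text_ids, label_ids):
        a, _ = self.encoder(self.embed(text_ids))    # (B, La, 2H)
        b, _ = self.encoder(self.embed(label_ids))   # (B, Lb, 2H)

        # Cross attention: similarity matrix between the two encoded sequences.
        scores = torch.bmm(a, b.transpose(1, 2))                         # (B, La, Lb)
        a_hat = torch.bmm(F.softmax(scores, dim=2), b)                   # b aligned to a
        b_hat = torch.bmm(F.softmax(scores, dim=1).transpose(1, 2), a)   # a aligned to b

        # Local inference enhancement: concatenation, difference, element-wise product.
        m_a = torch.cat([a, a_hat, a - a_hat, a * a_hat], dim=-1)
        m_b = torch.cat([b, b_hat, b - b_hat, b * b_hat], dim=-1)

        v_a, _ = self.composer(m_a)
        v_b, _ = self.composer(m_b)

        # Average and max pooling over time, then classify into the ternary output.
        v = torch.cat([v_a.mean(1), v_a.max(1).values, v_b.mean(1), v_b.max(1).values], dim=-1)
        return self.classifier(v)


if __name__ == "__main__":
    model = TinyESIM()
    logits = model(torch.randint(0, 10000, (2, 12)), torch.randint(0, 10000, (2, 4)))
    print(logits.shape)  # torch.Size([2, 3]) -- one (n, c, e)-style score triple per pair
```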
In practical applications, the similarity can also be calculated based on an attention mechanism: in short, a normalized weight is calculated between a word in w (e.g., "claim") and each word in l, and the similarity value between w and l is obtained according to the weight values.
After similarity matching is performed by the ESIM model, a first relevant label is generated according to the label text data, or a second relevant label is generated according to the second pinyin data. For example, if the label text data corresponding to the original text data is li, the true label value yi is set to (0, 0, 1), meaning the label is relevant; in the label sample data, the true values of all labels other than the relevant label are set to (0, 1, 0), meaning those labels are not relevant. The association between the original text data and the label text data can thus be known from the true label values.
In some embodiments, as shown in fig. 4, step S300 specifically includes, but is not limited to, step S310 to step S350.
Step S310, acquiring a plurality of pre-stored preset fields;
step S320, splitting the original text data to obtain a plurality of original fields;
s330, screening a plurality of target fields from a plurality of preset fields according to a plurality of original fields;
step S340, obtaining target pinyin corresponding to each target field from a plurality of preset pinyins to obtain a plurality of target pinyins;
step S350, splicing a plurality of target pinyins to obtain first pinyin data.
In step S310 of some embodiments, a plurality of pre-stored preset fields are acquired, where each preset field corresponds to a preset pinyin. The preset fields are pre-set or pre-stored Chinese characters or words, and each preset field corresponds to a preset pinyin obtained in advance by pinyin processing.
In step S320 of some embodiments, the original text data is split into a plurality of original fields. Specifically, the original text data may be split by words, or each character of the original text may be split out as an original field; for example, the original text data "claim amount" may be split into the four original fields "reason", "claim", "amount", and "degree" (one field per character).
In step S330 of some embodiments, a plurality of target fields are screened from a plurality of preset fields according to a plurality of original fields, and specifically, an original field identical to the preset field is obtained as the target field.
In step S340 of some embodiments, after the target fields are determined, since each target field corresponds to one preset pinyin, the target pinyin corresponding to the target field is obtained from the preset pinyins, so as to obtain multiple target pinyins.
In step S350 in some embodiments, the obtained target pinyins are sequentially spliced to obtain first pinyin data.
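An illustrative sketch of the split/screen/lookup/splice procedure of steps S310 to S350 follows; the preset field dictionary is a hypothetical stand-in for the pre-stored preset fields and their preset pinyins:

```python
# Hypothetical pre-stored preset fields and their preset pinyin (with tone numbers).
PRESET_PINYIN = {"理": "li3", "赔": "pei2", "额": "e2", "度": "du4"}


def to_pinyin(original_text: str) -> str:
    """Split the text into per-character fields, keep those found among the preset
    fields, look up each target pinyin, and splice them into the pinyin data."""
    original_fields = list(original_text)                                # step S320
    target_fields = [f for f in original_fields if f in PRESET_PINYIN]   # step S330
    target_pinyins = [PRESET_PINYIN[f] for f in target_fields]           # step S340
    return " ".join(target_pinyins)                                      # step S350


print(to_pinyin("理赔额度"))  # "li3 pei2 e2 du4"
```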
Note that the process of performing the spelling processing on the tag text data is the same as the process of performing the spelling processing on the original text data, and specific reference may be made to steps S310 to S350, which will not be described in detail herein.
In practical applications, an open-source pinyin library such as PyPinyin may be used to perform pinyin processing on the original text data and the label text data. Specifically, each Chinese character in the original text data and the label text data is converted into pinyin using the many Chinese characters and corresponding pinyin information recorded in the pinyin library, so that polyphonic characters and rare characters can be effectively identified, the accuracy of the pinyin is ensured, and the classification accuracy is improved.
In some embodiments, as shown in fig. 5, step S500 specifically includes, but is not limited to, step S510 to step S530.
Step S510, calculating the similarity between each first relevant label and the label text data to obtain a plurality of first similarity values;
step S520, calculating the similarity between each second related label and the second pinyin data to obtain a plurality of second similarity values;
step S530, obtaining the target related label from the plurality of first related labels and the plurality of second related labels according to the plurality of first similar values and the plurality of second similar values, and obtaining a target classification result.
In steps S510 and S520 of some embodiments, the similarity between each first relevant label and the label text data is calculated to obtain a plurality of first similarity values, and the similarity between each second relevant label and the second pinyin data is calculated to obtain a plurality of second similarity values. Specifically, the first similarity values and the second similarity values may be calculated using similarity measures such as dot-product similarity, cosine similarity, or Euclidean similarity, which is not specifically limited in this application.
In step S530 of some embodiments, according to the first similarity values and the second similarity values, the target related label with the maximum similarity is obtained from the first related labels and the second related labels, and a target classification result is obtained.
In some embodiments, as shown in fig. 6, step S530 specifically includes, but is not limited to, step S531 to step S533.
Step S531, adding each first similarity value and the corresponding second similarity value to obtain a plurality of preliminary similarity values;
step S532, acquiring a maximum primary similarity value as a target similarity value;
step S533, according to the target similarity value, obtaining a corresponding target related label from the plurality of first related labels and the plurality of second related labels, and obtaining a target classification result.
In step S531 of some embodiments, each first similarity value and the corresponding second similarity value are added to obtain a plurality of preliminary similarity values. In other words, the similarity between the text and the corresponding classification label is used as the first similarity value, the similarity between the pinyin of the text and the pinyin of the corresponding classification label is used as the second similarity value, and the two are combined, for example by addition, to obtain a preliminary similarity value. Combining the first and second similarity values allows the similarity relation between the text and the corresponding classification label to be represented more accurately, which enhances the classification effect and the robustness of the pre-training model.
In step S532 of some embodiments, the one having the greatest degree of similarity is selected from the plurality of preliminary similarity values as the target similarity value.
In step S533 of some embodiments, the first relevant label and the second relevant label corresponding to the target similarity value are obtained as the target related labels, and the target classification result is obtained from the target related labels. For example, label A and label B may determine that the classification result is a, and label C may determine that the classification result is b; the target classification result can thus be determined from the target related label and this mapping relationship.
In practical application, when the pre-training model learns Chinese text, it fuses the original text data with the corresponding first pinyin data, and the classification result with the highest relevance output by the pre-training model is taken as the target classification result.
Specifically, the first relation values produced by the ESIM output for the original text data w against all label sample data are e_w1, ..., e_wn, and the second relation values produced for the first pinyin data wp against the pinyin data corresponding to all label sample data are e_p1, ..., e_pn. The first relation values and the second relation values are added to obtain the target relation values e_1, ..., e_n. The calculation of a given target relation value in e_1, ..., e_n is shown in formula (2):

e_j = e_wj + e_pj    (2)

where e_j denotes a target relation value in e_1, ..., e_n, e_wj denotes the corresponding first relation value in e_w1, ..., e_wn, and e_pj denotes the corresponding second relation value in e_p1, ..., e_pn.
Then, the label k with the maximum target relation value is taken as the target classification result. After training of the pre-training model, label k has the lowest loss value, meaning that the similarity between the original text data and that label is the greatest and, therefore, that the original text data belongs to label k.
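For illustration, a minimal sketch of formula (2) and of the subsequent selection of label k is given below; the function and variable names, and the use of NumPy, are assumptions made here and are not part of the original disclosure.

```python
# Minimal sketch: first relation values (text vs. labels) and second relation values
# (pinyin vs. label pinyin) are added element-wise, and the label with the largest
# combined target relation value is returned as the target classification result.
import numpy as np

def select_target_label(e_w: np.ndarray, e_p: np.ndarray, labels: list) -> str:
    e = e_w + e_p                 # e_j = e_wj + e_pj, formula (2)
    k = int(np.argmax(e))         # index of the maximum target relation value
    return labels[k]              # label k is taken as the target classification result
```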
In the method and device of the embodiments of the present disclosure, the representations of the input text sequence and of the label text sequence are obtained through the pre-training model; knowledge is introduced by using the label text content, and further knowledge is introduced by using the pinyin of the input text sequence and of the label text sequence, so that polyphonic characters and rare characters can be identified effectively, enhancing the effect and robustness of model classification.
In some embodiments, after step S500, the text data classification method of the embodiments of the present disclosure further includes, but is not limited to, the following step: optimizing a loss function of the pre-training model according to the target classification result and the label text data so as to update the pre-training model. Specifically, the loss value of the loss function of the pre-training model is taken as the back-propagated quantity, and the model parameters of the pre-training model are adjusted to update it. In the embodiments of the present disclosure, the loss function of the pre-training model may be a cross-entropy loss function.
In practical application, the original text data and the label text data are input into the pre-training model for classification prediction processing to obtain a first predicted value; the first pinyin data and the second pinyin data are input into the pre-training model for the same classification prediction processing to obtain a second predicted value; the average of the first predicted value and the second predicted value is then calculated, and the model parameters of the pre-training model are updated according to this average and the cross-entropy loss function.
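As a hedged illustration of the update step described above, the following PyTorch-style sketch averages the two predictions and back-propagates a cross-entropy loss; the model interface, optimizer and tensor shapes are assumptions, not the patent's reference implementation.

```python
# Hedged sketch of the update step: average the two classification predictions and
# optimize a cross-entropy loss by back-propagation. "model", "optimizer" and the
# batch structures are assumptions made for illustration.
import torch
import torch.nn.functional as F

def update_step(model, optimizer, text_batch, pinyin_batch, label_ids):
    logits_text = model(*text_batch)       # first predicted values (text vs. label text)
    logits_pinyin = model(*pinyin_batch)   # second predicted values (pinyin vs. label pinyin)
    logits = (logits_text + logits_pinyin) / 2.0    # average of the two predictions
    loss = F.cross_entropy(logits, label_ids)       # cross-entropy loss function
    optimizer.zero_grad()
    loss.backward()                        # loss value as the back-propagated quantity
    optimizer.step()                       # adjust the model parameters of the pre-training model
    return loss.item()
```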
It should be noted that, in consideration of the training cost and training efficiency of the model as well as the characteristics of the insurance field, the pre-training model may also be a pre-training model tailored to the insurance field instead of the BERT model, and the embodiments of the present disclosure are not particularly limited in this respect.
The text data classification method provided by the embodiments of the present disclosure obtains original text data and corresponding label text data, and inputs the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels; performs pinyin conversion on the original text data to obtain first pinyin data, and performs pinyin conversion on the label text data to obtain second pinyin data; inputs the first pinyin data and the second pinyin data into the pre-training model for similarity prediction processing to obtain a plurality of second relevant labels; and obtains a target related label from the plurality of first relevant labels and the plurality of second relevant labels to obtain a target classification result. By additionally adding the label text data to the pre-training model, the embodiments of the present disclosure enable the model to fully understand the association between the actual meanings of the labels, and by converting the original text data and the label text data into pinyin, the model acquires phonetic information beyond the semantics, thereby improving the accuracy of similarity prediction of the pre-training model and further improving the accuracy of text data classification.
In addition, the characteristics of text classification in the insurance field are comprehensively considered: text label information and text phonetic information are additionally fed into the pre-training model for processing, the information generated by dialogues in the insurance field is fully utilized, information loss is effectively avoided, and the actual performance of the model is improved, so that it performs better in the insurance field than traditional models. The embodiments of the present disclosure combine a self-attention model such as BERT with an RNN such as the LSTM used in the ESIM model to classify texts, comparing the text against each label separately; in this way all of the additional information introduced by the method can be fully processed, the performance of the model is effectively improved, and problems such as over-fitting and under-fitting are avoided.
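As a rough, non-authoritative sketch of such a combination, the following code pairs a BERT encoder with an ESIM-style BiLSTM comparison against a single label; the checkpoint name "bert-base-chinese", the layer sizes and the pooling scheme are illustrative assumptions only.

```python
# Sketch (assumptions throughout): BERT representations of the text and of one label
# are compared with an ESIM-style BiLSTM plus soft alignment to produce a relevance score.
import torch
import torch.nn as nn
from transformers import BertModel

class BertEsimScorer(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        dim = self.bert.config.hidden_size
        self.encoder = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(8 * hidden, 1)

    def forward(self, text_ids, text_mask, label_ids, label_mask):
        a = self.bert(text_ids, attention_mask=text_mask).last_hidden_state
        b = self.bert(label_ids, attention_mask=label_mask).last_hidden_state
        a, _ = self.encoder(a)                      # BiLSTM over text tokens
        b, _ = self.encoder(b)                      # BiLSTM over label tokens
        att = torch.softmax(a @ b.transpose(1, 2), dim=-1)   # ESIM-style soft alignment
        a_aligned = att @ b                         # label-aware text representation
        pooled = torch.cat([a.mean(1), a.max(1).values,
                            a_aligned.mean(1), a_aligned.max(1).values], dim=-1)
        return self.scorer(pooled).squeeze(-1)      # relevance score of the text to this label
```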
An embodiment of the present disclosure further provides a text data classification device, as shown in fig. 7, which can implement the text data classification method described above. The text data classification device includes: an obtaining module 610, a first prediction module 620, a conversion module 630, a second prediction module 640 and a classification module 650. The obtaining module 610 is configured to obtain original text data and corresponding label text data; the first prediction module 620 is configured to input the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels; the conversion module 630 is configured to perform pinyin conversion on the original text data to obtain first pinyin data, and to perform pinyin conversion on the label text data to obtain second pinyin data; the second prediction module 640 is configured to input the first pinyin data and the second pinyin data into the pre-training model for similarity prediction processing to obtain a plurality of second relevant labels; and the classification module 650 is configured to obtain a target related label from the plurality of first relevant labels and the plurality of second relevant labels to obtain a target classification result.
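Purely for illustration, the composition of the five modules could be sketched as follows; the class and method names are assumptions chosen for readability rather than the device's actual interfaces.

```python
# Illustrative composition of the five modules of the text data classification device.
class TextDataClassificationDevice:
    def __init__(self, obtaining, first_prediction, conversion, second_prediction, classification):
        self.obtaining = obtaining                  # obtaining module 610
        self.first_prediction = first_prediction    # first prediction module 620
        self.conversion = conversion                # conversion module 630
        self.second_prediction = second_prediction  # second prediction module 640
        self.classification = classification        # classification module 650

    def classify(self):
        text, labels = self.obtaining()
        first_related = self.first_prediction(text, labels)
        text_pinyin, label_pinyin = self.conversion(text, labels)
        second_related = self.second_prediction(text_pinyin, label_pinyin)
        return self.classification(first_related, second_related)
```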
The text data classification device in the embodiment of the present disclosure is configured to execute the text data classification method in the above embodiment, and a specific processing procedure of the text data classification device is the same as the text data classification method in the above embodiment, which is not described in detail here.
An embodiment of the present disclosure further provides a computer device, including:
at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions for execution by the at least one processor to cause the at least one processor, when executing the instructions, to implement a method as in any one of the embodiments of the first aspect of the application.
The hardware structure of the computer apparatus will be described in detail with reference to fig. 8. The computer device includes: a processor 710, a memory 720, an input/output interface 730, a communication interface 740, and a bus 750.
The processor 710 may be implemented by a general Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the embodiments of the present disclosure;
the Memory 720 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 720 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program codes are stored in the memory 720 and called by the processor 710 to execute the text data classification method of the embodiments of the present disclosure;
an input/output interface 730 for implementing information input and output;
the communication interface 740 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g., USB, network cable, etc.) or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.); and
a bus 750 that transfers information between various components of the device (e.g., processor 710, memory 720, input/output interface 730, and communication interface 740);
wherein processor 710, memory 720, input/output interface 730, and communication interface 740 are communicatively coupled to each other within the device via bus 750.
The disclosed embodiment also provides a storage medium which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the text data classification method of the disclosed embodiment.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the text data classification method and device, the computer equipment and the storage medium provided by the embodiments of the present disclosure, original text data and corresponding label text data are obtained and input into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels; pinyin conversion is performed on the original text data to obtain first pinyin data, and pinyin conversion is performed on the label text data to obtain second pinyin data; the first pinyin data and the second pinyin data are input into the pre-training model for similarity prediction processing to obtain a plurality of second relevant labels; and a target related label is obtained from the plurality of first relevant labels and the plurality of second relevant labels to obtain a target classification result. By additionally adding the label text data to the pre-training model, the embodiments of the present disclosure enable the model to fully understand the association between the actual meanings of the labels, and by converting the original text data and the label text data into pinyin, the model acquires phonetic information beyond the semantics, thereby improving the accuracy of similarity prediction of the pre-training model and further improving the accuracy of text data classification.
The embodiments described in the embodiments of the present disclosure are for more clearly illustrating the technical solutions of the embodiments of the present disclosure, and do not constitute a limitation to the technical solutions provided in the embodiments of the present disclosure, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present disclosure are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.
Those skilled in the art will appreciate that the solutions shown in fig. 1-6 are not meant to limit the embodiments of the present disclosure, and may include more or fewer steps than those shown, may combine certain steps, or may use different steps.
The above-described embodiments of the apparatus are merely illustrative, and the units illustrated as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present disclosure have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present disclosure. Any modifications, equivalent substitutions and improvements made by those skilled in the art within the scope and spirit of the embodiments of the present disclosure shall fall within the scope of the claims of the embodiments of the present disclosure.

Claims (10)

1. A method of classifying text data, comprising:
acquiring original text data and corresponding label text data;
inputting the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first related labels;
performing pinyin conversion on the original text data to obtain first pinyin data, and performing pinyin conversion on the label text data to obtain second pinyin data;
inputting the first pinyin data and the second pinyin data into the pre-training model for similarity prediction processing to obtain a plurality of second relevant labels;
and obtaining target related labels from the plurality of first related labels and the plurality of second related labels to obtain a target classification result.
2. The method of claim 1, wherein the inputting the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first relevant labels comprises:
acquiring a plurality of preset label sample data;
calculating the similarity between the original text data and each label sample data to obtain a similarity matrix;
coding the similarity matrix to obtain a plurality of first similarity vectors;
and performing pooling processing on the plurality of first similar vectors to obtain a plurality of first related labels.
3. The method of claim 2, wherein the calculating the similarity between the original text data and each of the tag sample data to obtain a similarity matrix comprises:
performing word embedding processing on the original text data to obtain a first embedded vector, and performing word embedding processing on each tag sample data to obtain a second embedded vector;
performing dot product operation on each first embedding vector and the corresponding second embedding vector to obtain a plurality of similarity values;
and constructing the similarity matrix according to the plurality of similarity values.
4. The method of claim 1, wherein the pinyin converting the original text data to obtain first pinyin data, includes:
acquiring a plurality of pre-stored preset fields; each preset field corresponds to a preset pinyin;
splitting the original text data to obtain a plurality of original fields;
screening a plurality of target fields from the plurality of preset fields according to the plurality of original fields;
acquiring a target pinyin corresponding to each target field from the preset pinyins to obtain a plurality of target pinyins;
and splicing the target pinyins to obtain the first pinyin data.
5. The method of claim 1, wherein obtaining the target related label from the first related labels and the second related labels to obtain the target classification result comprises:
calculating the similarity between each first related label and the label text data to obtain a plurality of first similarity values;
calculating the similarity between each second related label and the second pinyin data to obtain a plurality of second similarity values;
and acquiring the target related label from the plurality of first related labels and the plurality of second related labels according to the plurality of first similarity values and the plurality of second similarity values to obtain the target classification result.
6. The method according to claim 5, wherein the acquiring the target related label from the plurality of first related labels and the plurality of second related labels according to the plurality of first similarity values and the plurality of second similarity values to obtain the target classification result comprises:
adding each first similarity value and the corresponding second similarity value to obtain a plurality of preliminary similarity values;
acquiring the maximum preliminary similarity value as a target similarity value;
and according to the target similarity value, acquiring the corresponding target related label from the plurality of first related labels and the plurality of second related labels to obtain the target classification result.
7. The method according to any one of claims 1 to 6, wherein after obtaining the target related label from the plurality of first related labels and the plurality of second related labels to obtain the target classification result, the method further comprises:
and optimizing a loss function of the pre-training model according to the target classification result and the label text data so as to update the pre-training model.
8. A text data classification apparatus, comprising:
an acquisition module, configured to acquire original text data and corresponding label text data;
a first prediction module, configured to input the original text data and the label text data into a pre-training model for similarity prediction processing to obtain a plurality of first related labels;
a conversion module, configured to perform pinyin conversion on the original text data to obtain first pinyin data, and to perform pinyin conversion on the label text data to obtain second pinyin data;
a second prediction module, configured to input the first pinyin data and the second pinyin data into the pre-training model for similarity prediction processing to obtain a plurality of second related labels; and
a classification module, configured to obtain a target related label from the plurality of first related labels and the plurality of second related labels to obtain a target classification result.
9. A computer device, characterized in that the computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, is configured to perform: the method of any one of claims 1 to 7.
10. A storage medium, the storage medium being a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a computer, the computer is configured to perform: the method of any one of claims 1 to 7.
CN202210134070.0A 2022-02-14 2022-02-14 Text data classification method and device, computer equipment and storage medium Pending CN114492661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210134070.0A CN114492661A (en) 2022-02-14 2022-02-14 Text data classification method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210134070.0A CN114492661A (en) 2022-02-14 2022-02-14 Text data classification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114492661A true CN114492661A (en) 2022-05-13

Family

ID=81479841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210134070.0A Pending CN114492661A (en) 2022-02-14 2022-02-14 Text data classification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114492661A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115048523A (en) * 2022-06-27 2022-09-13 北京百度网讯科技有限公司 Text classification method, device, equipment and storage medium
CN115048523B (en) * 2022-06-27 2023-07-18 北京百度网讯科技有限公司 Text classification method, device, equipment and storage medium
CN116796723A (en) * 2023-03-15 2023-09-22 华院计算技术(上海)股份有限公司 Text set matching method and device, electronic equipment and storage medium
CN116796723B (en) * 2023-03-15 2024-02-06 华院计算技术(上海)股份有限公司 Text set matching method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110688854B (en) Named entity recognition method, device and computer readable storage medium
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
WO2023134083A1 (en) Text-based sentiment classification method and apparatus, and computer device and storage medium
CN114722826B (en) Model training method and device, electronic equipment and storage medium
CN113887215A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN113901191A (en) Question-answer model training method and device
CN114492661A (en) Text data classification method and device, computer equipment and storage medium
CN114298287A (en) Knowledge distillation-based prediction method and device, electronic equipment and storage medium
CN108268629B (en) Image description method and device based on keywords, equipment and medium
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN116258137A (en) Text error correction method, device, equipment and storage medium
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
WO2023159759A1 (en) Model training method and apparatus, emotion message generation method and apparatus, device and medium
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium
CN114998041A (en) Method and device for training claim settlement prediction model, electronic equipment and storage medium
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN114625877A (en) Text classification method and device, electronic equipment and storage medium
CN114490949A (en) Document retrieval method, device, equipment and medium based on BM25 algorithm
CN114936274A (en) Model training method, dialogue generating device, dialogue training equipment and storage medium
CN114416995A (en) Information recommendation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination