CN114942994A - Text classification method, text classification device, electronic equipment and storage medium

Text classification method, text classification device, electronic equipment and storage medium

Info

Publication number
CN114942994A
CN114942994A (application CN202210687739.9A)
Authority
CN
China
Prior art keywords
text
classification
word
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210687739.9A
Other languages
Chinese (zh)
Inventor
周敏芳
任彧
王建明
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210687739.9A
Publication of CN114942994A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a text classification method, a text classification device, electronic equipment and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring an original text to be classified; performing subject term recognition on an original text through a preset subject term recognition model to obtain an entity subject term; splicing the original text and the entity subject term to obtain a target embedded feature vector; performing classification probability calculation on the target embedded feature vectors through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label; and screening the reference classification labels according to the classification probability values to obtain target classification labels of the original text. The text classification method and device can improve the accuracy of text classification.

Description

Text classification method, text classification device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a text classification method, a text classification device, an electronic device, and a storage medium.
Background
In a common classification scenario, materials (such as text materials or image materials) are often classified by manual judgment. Because this approach involves considerable human subjectivity, it affects classification accuracy, so how to improve the accuracy of text classification has become an urgent technical problem.
Disclosure of Invention
The embodiment of the application mainly aims to provide a text classification method, a text classification device, an electronic device and a storage medium, and aims to improve the accuracy of text classification.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a text classification method, where the method includes:
acquiring an original text to be classified;
performing subject term recognition on the original text through a preset subject term recognition model to obtain an entity subject term;
splicing the original text and the entity subject term to obtain a target embedded feature vector;
performing classification probability calculation on the target embedded feature vectors through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label;
and screening the reference classification labels according to the classification probability value to obtain the target classification labels of the original text.
In some embodiments, the step of performing subject word recognition on the original text through a preset subject word recognition model to obtain an entity subject word includes:
performing word segmentation processing on the original text through the subject word recognition model to obtain a target text word segment;
extracting keywords of the target text word segment according to preset weight parameters to obtain a text keyword set;
and combining the text keywords in the text keyword set to obtain the entity subject term.
In some embodiments, the step of performing word segmentation processing on the original text through the topic word recognition model to obtain a target text word segment includes:
performing vocabulary recognition on the original text through a preset word segmentation device to obtain word segment entity characteristics;
performing word segmentation processing on the original text according to the entity characteristics of the word segments to obtain initial text word segments;
and filtering the initial text word segment to obtain the target text word segment.
In some embodiments, the step of extracting keywords from the target text field according to a preset weight parameter to obtain a text keyword set includes:
ordering the importance of the target text word segments to obtain a text word segment sequence;
carrying out weight distribution on the text word segment sequence according to the weight parameters to obtain a weighted word segment sequence;
and carrying out word segment screening on the weighted word segment sequence to obtain the text keyword set.
In some embodiments, the step of performing concatenation processing on the original text and the entity topic word to obtain a target embedded feature vector includes:
performing word embedding processing on the original text to obtain a text embedding characteristic vector;
performing word embedding processing on the entity subject term to obtain a subject term embedding feature vector;
and splicing the text embedded feature vector and the subject word embedded feature vector according to a preset splicing sequence to obtain the target embedded feature vector.
In some embodiments, the step of performing classification probability calculation on the target embedded feature vector through a preset text classification model and reference classification tags to obtain a classification probability value corresponding to each reference classification tag includes:
acquiring the reference classification label;
performing word embedding processing on the reference classification label to obtain a reference classification label vector;
and performing vector similarity calculation on the reference classification label vector and the target embedded characteristic vector through the text classification model to obtain the classification probability value.
In some embodiments, before the step of performing classification probability calculation on the target embedded feature vector through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label, the method includes pre-training the text classification model, and specifically includes:
acquiring a label text and a reference classification label;
performing subject term identification on the label text to obtain a sample subject term;
splicing the label text and the sample subject term to obtain a sample embedded feature vector;
carrying out classification probability calculation on the sample embedded feature vectors through the text classification model and the reference classification labels to obtain a sample classification predicted value;
calculating a loss value through the loss function of the text classification model and the sample classification predicted value to obtain a model loss value;
and carrying out gridding parameter adjustment on the text classification model according to the model loss value so as to optimize the text classification model.
In order to achieve the above object, a second aspect of an embodiment of the present application provides a text classification apparatus, the apparatus comprising:
the text acquisition module is used for acquiring original texts to be classified;
the subject term identification module is used for carrying out subject term identification on the original text through a preset subject term identification model to obtain an entity subject term;
the splicing module is used for splicing the original text and the entity subject term to obtain a target embedded characteristic vector;
the probability calculation module is used for performing classification probability calculation on the target embedded feature vector through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label;
and the screening module is used for screening the reference classification label according to the classification probability value to obtain a target classification label of the original text.
In order to achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium for computer-readable storage, and stores one or more programs, which are executable by one or more processors to implement the method of the first aspect.
The text classification method, the text classification device, the electronic equipment and the storage medium provided by the application are used for classifying the original text to be classified; and performing subject word recognition on the original text through a preset subject word recognition model to obtain an entity subject word, conveniently obtaining the entity subject word capable of representing the text content of the original text, and visually reflecting the subject content of the original text through the entity subject word. Further, splicing the original text and the entity subject term to obtain a target embedded feature vector; the classification probability calculation is carried out on the target embedded feature vectors through the preset text classification model and the reference classification labels to obtain the classification probability value corresponding to each reference classification label, the correlation degree of the target embedded feature vectors and each reference classification label can be reflected through the classification probability value, and the classification labels to which the target embedded feature vectors belong can be favorably determined.
Drawings
Fig. 1 is a flowchart of a text classification method provided in an embodiment of the present application;
Fig. 2 is a flowchart of step S102 in Fig. 1;
Fig. 3 is a flowchart of step S201 in Fig. 2;
Fig. 4 is a flowchart of step S202 in Fig. 2;
Fig. 5 is a flowchart of step S103 in Fig. 1;
Fig. 6 is another flowchart of a text classification method provided in an embodiment of the present application;
Fig. 7 is a flowchart of step S104 in Fig. 1;
Fig. 8 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present application;
Fig. 9 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science, which attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, expert systems, and the like. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand and use human languages (such as Chinese, English, etc.); it is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, public opinion analysis and viewpoint mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and other fields related to language processing.
Information Extraction: a text processing technology that extracts specified types of fact information, such as entities, relations and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is composed of specific units, such as sentences, paragraphs and chapters, and text information is composed of smaller specific units, such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, names of people, names of places, etc. from text data is text information extraction; of course, the information extracted by a text information extraction technology may be of various types.
Latent Dirichlet Allocation (LDA): a topic model that gives the topic of each document in a document set in the form of a probability distribution. LDA is a typical bag-of-words model: it treats a document as a collection of words with no sequential or chronological relationship between them. A document may contain multiple topics, and each word in the document is generated from one of those topics.
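By way of illustration only, the following minimal sketch shows how such a bag-of-words topic model could be fit with the gensim library; the toy corpus, the number of topics, and all names here are assumptions for illustration rather than part of this application.

```python
# A minimal LDA sketch using gensim (an assumed library choice; this
# application does not prescribe an implementation). Each document is a bag
# of words, and each word is generated from one of the latent topics.
from gensim import corpora
from gensim.models import LdaModel

docs = [["sweeper", "robot", "market", "home"],       # pre-tokenized toy documents
        ["vacuum", "cleaner", "suction", "home"],
        ["drone", "aerial", "camera", "flight"]]

dictionary = corpora.Dictionary(docs)                 # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]        # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Topic distribution of the first document, given as a probability distribution.
print(lda.get_document_topics(corpus[0]))
```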
Softmax function: the Softmax function is a normalized exponential function that "compresses" a K-dimensional vector z containing arbitrary real numbers into another K-dimensional real vector such that each element ranges between (0,1) and the sum of all elements is 1, which is commonly used in multi-classification problems.
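For concreteness, a minimal NumPy sketch of this function (the input scores are made-up values):

```python
import numpy as np

def softmax(z):
    """Compress a K-dimensional real vector into a vector whose elements
    each lie in (0, 1) and sum to 1."""
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # e.g. raw scores for three labels
print(softmax(scores))               # -> approx. [0.659, 0.242, 0.099]
```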
A word segmentation device: the word segmenter accepts a character string as input, splits the character string into independent words or vocabulary units (tokens), and then outputs a vocabulary unit stream (token stream).
Embedding (embedding): an embedding is a vector representation, meaning that a low-dimensional vector represents an object; the object can be a word, a commodity, a movie, etc. The embedding vector has the property that objects whose vectors are close together have similar meanings: for example, embedding("The Avengers") and embedding("Iron Man") are very close, while embedding("The Avengers") and embedding("dinner") are far apart. Embedding is essentially a mapping from a semantic space to a vector space that preserves, as far as possible, the relations the original samples have in the semantic space; for example, two words with similar semantics also lie close together in the vector space. An embedding can encode an object with a low-dimensional vector while preserving its meaning. It is often applied in machine learning: in the process of constructing a machine learning model, an object is encoded into a low-dimensional dense vector and then fed to a DNN (deep neural network) to improve efficiency.
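A toy sketch of this distance property, with made-up 3-dimensional vectors standing in for real learned embeddings:

```python
import numpy as np

# Made-up 3-d vectors standing in for learned embeddings.
emb = {"The Avengers": np.array([0.9, 0.8, 0.1]),
       "Iron Man":     np.array([0.85, 0.75, 0.2]),
       "dinner":       np.array([0.1, 0.2, 0.9])}

def distance(a, b):
    return float(np.linalg.norm(emb[a] - emb[b]))

print(distance("The Avengers", "Iron Man"))  # small distance: similar meaning
print(distance("The Avengers", "dinner"))    # large distance: unrelated meaning
```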
BERT (Bidirectional Encoder Representations from Transformers) model: the BERT model further improves the generalization ability of word vector models, fully describing character-level, word-level, sentence-level and even inter-sentence relational features, and is built on the Transformer. BERT uses three embeddings, namely Token Embeddings, Segment Embeddings and Position Embeddings. Token Embeddings are the word vectors; the first token is the [CLS] mark, which can be used for subsequent classification tasks. Segment Embeddings are used to distinguish two sentences, because pre-training involves not only language modeling but also classification tasks that take two sentences as input. Position Embeddings are not the trigonometric-function positional encodings of the Transformer but are learned during BERT training: BERT directly trains a position embedding to preserve position information, randomly initializes a vector at each position, includes it in model training, and finally obtains an embedding containing position information. BERT combines the position embeddings and word embeddings by direct addition.
In a common classification scenario, materials (such as text materials or image materials) are often classified by manual judgment. Because this approach involves considerable human subjectivity, it affects classification accuracy, so how to improve the accuracy of text classification has become an urgent technical problem.
Based on this, embodiments of the present application provide a text classification method, a text classification apparatus, an electronic device, and a storage medium, which aim to improve accuracy of text classification.
The text classification method, the text classification device, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described in the following embodiments, and first, the text classification method in the embodiments of the present application is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a text classification method, and relates to the technical field of artificial intelligence. The text classification method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network) and big data and artificial intelligence platforms; the software may be an application or the like that implements a text classification method, but is not limited to the above form.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a text classification method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S105.
Step S101, obtaining an original text to be classified;
step S102, performing subject term recognition on an original text through a preset subject term recognition model to obtain an entity subject term;
step S103, splicing the original text and the entity subject term to obtain a target embedded feature vector;
step S104, performing classification probability calculation on the target embedded feature vectors through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label;
and S105, screening the reference classification labels according to the classification probability values to obtain target classification labels of the original text.
The steps S101 to S105 illustrated in the embodiment of the present application are performed by acquiring an original text to be classified; and performing subject word recognition on the original text through a preset subject word recognition model to obtain an entity subject word, conveniently obtaining the entity subject word capable of representing the text content of the original text, and visually reflecting the subject content of the original text through the entity subject word. Further, splicing the original text and the entity subject term to obtain a target embedded feature vector; the classification probability calculation is carried out on the target embedded characteristic vectors through the preset text classification model and the reference classification labels to obtain the classification probability value corresponding to each reference classification label, the correlation degree of the target embedded characteristic vectors and each reference classification label can be reflected through the classification probability value, and the classification labels to which the target embedded characteristic vectors belong can be favorably determined.
In step S101 of some embodiments, a web crawler may be written and a data source set, after which data is crawled in a targeted manner to obtain the original text data to be classified. The original text to be classified may also be obtained by other means, without limitation. The original text to be classified may be a commodity description text, a news text, or another type of text material, without limitation.
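As a hedged sketch of such targeted crawling (the URL, the selector, and the choice of the requests and BeautifulSoup libraries are illustrative assumptions, not part of this application):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/news", timeout=10)  # hypothetical data source
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Collect paragraph texts as candidate original texts to be classified.
original_texts = [p.get_text(strip=True) for p in soup.select("p")]
```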
Referring to fig. 2, in some embodiments, step S102 may include, but is not limited to, step S201 to step S203:
step S201, performing word segmentation processing on an original text through a subject word recognition model to obtain a target text word segment;
step S202, extracting keywords from the target text word segment according to preset weight parameters to obtain a text keyword set;
step S203, combining the text keywords in the text keyword set to obtain an entity subject word.
In step S201 of some embodiments, the topic word recognition model may be constructed based on the LDA algorithm. The original text is divided into a plurality of initial text word segments by the word segmenter of the topic word recognition model according to preset part-of-speech types, and the initial text word segments are then screened to obtain the target text word segments, where the preset part-of-speech types include nouns, verbs, adjectives, adverbs and the like.
In step S202 of some embodiments, the weight parameters may be set according to actual service requirements, or according to the importance of the target text word segments, without limitation. When extracting keywords from the target text word segments, the target text word segments can first be ranked by importance to obtain an importance sequence, and the preset weight parameters are then assigned, from largest to smallest, to each target text word segment in the importance sequence, so as to obtain the text keyword set.
In step S203 of some embodiments, the entity topic word may be formed by combining one or more text keywords from the text keyword set; when multiple text keywords are combined, they may be combined in any manner according to service requirements, without limitation. For example, when multiple text keywords are involved, three text keywords may be selected and combined to form the entity topic word.
Referring to fig. 3, in some embodiments, step S201 may include, but is not limited to, step S301 to step S303:
step S301, carrying out vocabulary recognition on an original text through a preset word segmentation device to obtain word segment entity characteristics;
step S302, performing word segmentation processing on the original text according to the entity characteristics of the word segments to obtain initial text word segments;
step S303, filtering the initial text word segment to obtain a target text word segment.
In step S301 and step S302 of some embodiments, the preset word segmenter may be a Jieba word segmenter or the like, without limitation. Taking the Jieba word segmenter as an example, vocabulary recognition is performed on the original text as follows: first, the dictionary file of the Jieba word segmenter is loaded to obtain each word and its occurrence frequency; the dictionary file is then traversed, and a directed acyclic graph of all possible segmentations of the original sentence data is constructed by matching all character strings; the maximum probability over all paths from each character node in the original text to the end of the sentence is calculated, and the end position of the corresponding word segment in the directed acyclic graph is recorded at the same time; the node paths and node information are taken as the word segment entity features, and the original text is segmented along the node paths to obtain the initial text word segments.
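A minimal sketch of such POS-filtered segmentation with the Jieba segmenter; the part-of-speech whitelist mirrors the preset types above (nouns, verbs, adjectives, adverbs), while the exact tag prefixes and the sample sentence are illustrative assumptions:

```python
import jieba.posseg as pseg

KEPT_POS = {"n", "v", "a", "d"}   # noun, verb, adjective, adverb tag prefixes

def segment(original_text):
    # Keep only word segments whose POS tag falls in the whitelist.
    return [word for word, flag in pseg.cut(original_text)
            if flag and flag[0] in KEPT_POS]

print(segment("某品牌扫地机器人可实现扫拖一体"))
```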
In step S303 in some embodiments, the initial text word segments may be filtered according to their word length, removing initial text word segments whose length exceeds a preset word length threshold, so as to obtain target text word segments that meet the requirement.
Through the steps S301 to S303, the original text can be conveniently split into a plurality of individual target text word segments, and semantic content information of the original text is represented by the target text word segments, so that the efficiency of text classification is improved.
Referring to fig. 4, in some embodiments, step S202 may include, but is not limited to, step S401 to step S403:
s401, performing importance sequencing on target text word segments to obtain a text word segment sequence;
step S402, carrying out weight distribution on the text segment sequence according to the weight parameters to obtain a weighted segment sequence;
and step S403, performing word segment screening on the weighted word segment sequence to obtain a text keyword set.
In step S401 in some embodiments, importance ranking is performed on the target text word segments according to the part-of-speech category of the target text word segments and the service scenario requirement, so as to obtain a text word segment sequence, for example, in the importance ranking process of some embodiments, the importance of a noun is higher than an adjective, the importance of an adjective is higher than a verb, and the importance of a verb is higher than an adverb.
In step S402 of some embodiments, the preset weight parameters are assigned in order of magnitude to each target text word segment in the text word segment sequence to obtain a weighted word segment sequence. For example, if the target text word segments of the sequence are "hair dryer", "home", "black" and "foldable", and the weight parameters include 0.5, 0.21, 0.13 and 0.08, then the weight 0.5 is matched to the target text word segment "hair dryer", the weight 0.21 to "home", the weight 0.13 to "black", and the weight 0.08 to "foldable". The weighted word segment sequence can be expressed as "hair dryer (0.5)", "home (0.21)", "black (0.13)", "foldable (0.08)".
In step S403 of some embodiments, word segment screening is performed on the weighted word segment sequence according to the part-of-speech category and the weight parameter of the target text word segment, so as to filter out the target text word segments whose part-of-speech category does not meet the current service requirement and the target text word segments whose weight parameter is less than or equal to the preset threshold, thereby reducing the number of word segments of the target text word segment and obtaining a text keyword set meeting the current service requirement.
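Taken together, steps S401 to S403 and the keyword combination of step S203 could look like the following sketch, reusing the hair-dryer example above; the screening threshold and the join rule for the entity topic word are illustrative assumptions:

```python
segments = ["hair dryer", "home", "black", "foldable"]   # already ranked by importance
weights = [0.5, 0.21, 0.13, 0.08]                        # preset weight parameters

weighted = list(zip(segments, weights))                  # weighted word segment sequence
keywords = [w for w, p in weighted if p > 0.1]           # screen: drop weights <= 0.1

entity_topic_word = " ".join(keywords[:3])               # combine up to three keywords

print(weighted)            # [('hair dryer', 0.5), ('home', 0.21), ('black', 0.13), ...]
print(entity_topic_word)   # 'hair dryer home black'
```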
Referring to fig. 5, in some embodiments, step S103 may include, but is not limited to, step S501 to step S503:
step S501, performing word embedding processing on an original text to obtain a text embedding characteristic vector;
step S502, word embedding processing is carried out on the entity subject word to obtain a subject word embedding characteristic vector;
and S503, splicing the text embedded characteristic vector and the subject word embedded characteristic vector according to a preset splicing sequence to obtain a target embedded characteristic vector.
In step S501 of some embodiments, the original text is subjected to word embedding processing by a BERT model or the like, and the original text is embedded into a low-dimensional continuous vector space from a high-dimensional space, so that each word or phrase in the original text is mapped to a vector on a real number domain, and a text embedding feature vector is obtained.
In step S502 of some embodiments, the entity subject word is subjected to word embedding processing by a BERT model or the like, and the entity subject word is embedded into a low-dimensional continuous vector space from a high-dimensional space, so that each word or phrase in the entity subject word is mapped to a vector in a real number domain, and a subject word embedding feature vector is obtained.
In step S503 of some embodiments, in order to emphasize the subject content of the original text, the preset splicing order may place the subject word first and the original text after it. Thus, when the text embedding feature vector and the subject word embedding feature vector are spliced, the subject word embedding feature vector is placed in the front position, the text embedding feature vector is placed after it, and the two vectors are concatenated to obtain the target embedded feature vector.
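A sketch of steps S501 to S503 using the Hugging Face transformers library as an assumed BERT implementation (the application names BERT but no specific library or checkpoint); the [CLS] vector stands in for each text's embedding, and the concatenation follows the "subject word first, text after" order described above:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]     # [CLS] embedding, shape (1, 768)

topic_vec = embed("扫地 商用")               # subject word embedding feature vector
text_vec = embed("扫地机器人市场蓬勃发展")      # text embedding feature vector

# Splice with the subject word in front and the text after it.
target_vec = torch.cat([topic_vec, text_vec], dim=-1)   # shape (1, 1536)
```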
Referring to fig. 6, before step S104 in some embodiments, the text classification method further includes pre-training a text classification model, specifically including but not limited to steps S601 to S606:
step S601, acquiring a label text and a reference classification label;
step S602, performing subject term identification on the label text to obtain a sample subject term;
step S603, splicing the label text and the sample subject term to obtain a sample embedded feature vector;
step S604, carrying out classification probability calculation on the sample embedded characteristic vectors through a text classification model and a reference classification label to obtain a sample classification predicted value;
step S605, calculating a loss value through a loss function of the text classification model and a sample classification predicted value to obtain a model loss value;
and step S606, performing gridding parameter adjustment on the text classification model according to the model loss value so as to optimize the text classification model.
In step S601 of some embodiments, a web crawler may be written, and a data source is set and then data is crawled with a target to obtain a tag text. The label text may be obtained in other ways, but is not limited thereto. The label text can be a commodity description text with a product label, a news text or other types of text materials, and is not limited; the reference classification label can be set according to actual business requirements, and the reference classification label can include specific products in various fields, such as a sweeper, a dust collector, an unmanned aerial vehicle and the like, without limitation.
In step S602 in some embodiments, word segmentation processing is performed on the label text through the topic word recognition model to obtain label text word segments; keywords are extracted from the label text word segments according to preset weight parameters to obtain a sample keyword set; and finally, the sample keywords in the sample keyword set are combined to obtain the sample subject term. The process is substantially the same as the above steps S201 to S203, and is not described herein again.
In step S603 of some embodiments, word embedding processing is performed on the label text to obtain a label text embedding vector; word embedding processing is performed on the sample subject term to obtain a sample subject word embedding vector; and the sample subject word embedding vector and the label text embedding vector are spliced, with the sample subject word embedding vector in front and the label text embedding vector behind, to obtain the sample embedded feature vector.
In step S604 of some embodiments, a classification probability calculation is performed on the sample-embedded feature vectors through a classification function of the text classification model and the reference classification tags, where the classification function may be a softmax function, and a probability distribution of the sample-embedded feature vectors is created on each reference classification tag through the softmax function, resulting in a sample classification prediction value that may characterize a probability that the sample-embedded feature vectors belong to each reference classification tag.
In step S605 of some embodiments, the loss function is a cross-entropy loss to be minimized, which reflects how close the text classification model is to the optimum. It can be expressed as Loss = -Σ_i [ y_i log y'_i + (1 - y_i) log(1 - y'_i) ], where y_i is the true label of the label text and y'_i is the sample classification predicted value. The model loss value Loss of the text classification model can be conveniently calculated from this loss function and the sample classification predicted values.
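A quick numeric check of the loss above, with illustrative labels and predicted values:

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])        # true labels y_i of the label texts
y_pred = np.array([0.9, 0.2, 0.7])   # sample classification predicted values y'_i

loss = -np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
print(loss)   # approx. 0.685
```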
In step S606 of some embodiments, performing gridding parameter adjustment on the text classification model according to the model loss value includes adjusting the maximum text vector length, the learning rate, the batch_size and the like, and training and verifying the text classification model with preset candidate values to find the parameters giving the best result, where the main evaluation indicators include AUC, precision, recall and the like. For example, in the gridding parameter adjustment process, a maximum value and a minimum value are set for each parameter, and parameter adjustment is performed in a loop until the text classification model achieves the best result or the number of training runs exceeds a set threshold, at which point the adjustment stops and the final text classification model is obtained.
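A hedged sketch of such gridded parameter adjustment; train_and_eval is a hypothetical stand-in for training and validating the text classification model, and the candidate values and training budget are illustrative assumptions:

```python
import random
from itertools import product

def train_and_eval(max_len, learning_rate, batch_size):
    # Hypothetical stand-in: a real implementation would train the text
    # classification model with these parameters and return its validation AUC.
    return random.random()

grid = {"max_len": [128, 256, 512],
        "learning_rate": [1e-5, 3e-5, 5e-5],
        "batch_size": [16, 32]}

best_auc, best_params, runs = 0.0, None, 0
for max_len, lr, bs in product(grid["max_len"], grid["learning_rate"], grid["batch_size"]):
    auc = train_and_eval(max_len, lr, bs)
    runs += 1
    if auc > best_auc:
        best_auc, best_params = auc, (max_len, lr, bs)
    if runs >= 12:     # stop once the training count exceeds a set threshold
        break

print(best_params, best_auc)
```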
Through the steps S601 to S606, the text classification model can be trained and optimized more accurately, the optimal text classification model is obtained, and the training effect and the model performance of the text classification model are improved.
Referring to fig. 7, in some embodiments, step S104 may include, but is not limited to, step S701 to step S703:
step S701, acquiring a reference classification label;
step S702, performing word embedding processing on the reference classification label to obtain a reference classification label vector;
and step S703, calculating the vector similarity of the reference classification label vector and the target embedded characteristic vector through a text classification model to obtain a classification probability value.
In step S701 of some embodiments, the reference classification label may be set according to actual business requirements, and the reference classification label may include specific products in various fields, such as a sweeper, a vacuum cleaner, an unmanned aerial vehicle, and the like, without limitation.
In step S702 of some embodiments, word embedding processing is performed on the reference classification tags, so as to implement mapping of the reference classification tags from a semantic space to a vector space, and obtain classification tags in a vector form, that is, a reference classification tag vector.
In step S703 of some embodiments, when the text classification model calculates the vector similarity between the reference classification label vector and the target embedded feature vector, cosine similarity or another collaborative filtering algorithm may be used, and the obtained similarity value is taken as the classification probability value. This value represents the degree of correlation and approximation between the reference classification label vector and the target embedded feature vector, so the classification probability value conveniently and intuitively reflects how close the target embedded feature vector is to each reference classification label vector.
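A minimal sketch of this similarity-based scoring, with made-up low-dimensional vectors; step S105's screening then simply keeps the highest-scoring label:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = np.array([0.2, 0.8, 0.1])                     # target embedded feature vector
label_vecs = {"sweeper":        np.array([0.3, 0.7, 0.2]),
              "vacuum cleaner": np.array([0.9, 0.1, 0.4]),
              "drone":          np.array([0.1, 0.2, 0.9])}

probs = {label: cosine(vec, target) for label, vec in label_vecs.items()}
best = max(probs, key=probs.get)   # screening: keep the highest classification probability
print(probs, best)
```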
In step S105 of some embodiments, since the classification probability value may represent a possibility that the original text belongs to each reference classification tag, and the larger the classification probability value is, the higher the possibility that the original text belongs to a category corresponding to the reference classification tag is, when the reference classification tag is screened according to the classification probability value, the reference classification tag with the highest classification probability value is selected as a target classification tag of the original text, and the target classification tag may be used to represent the category to which the original text belongs.
In one embodiment, original text 1 is a news description text whose content reads: "The sweeping robot market is developing vigorously; sweepers in the household field are well known, and the commercial field is gradually being promoted. At present, robot sanitation workers are deployed on squares to keep the squares clean." Original text 2 is a commodity description text whose content reads: "A certain brand of sweeping robot integrates sweeping and mopping and has a large 2500 Pa suction force." With the text classification method of the embodiment of the present application, the entity subject words of original text 1 can conveniently be extracted as "sweeping, commercial", and those of original text 2 as "sweeping, suction, household". Original text 1 and its entity subject words are then classified by the text classification model, which assigns original text 1 to the reference classification label "sweeper"; likewise, original text 2 and its entity subject words are classified and assigned to the reference classification label "sweeper". In this way, different information about the same type of product can be classified together: information, news, commodity descriptions and the like concerning the same type of product are placed in the same category, effectively realizing classification of mixed categories and improving the classification accuracy of the materials.
The text classification method comprises the steps of obtaining original texts to be classified; and performing subject word recognition on the original text through a preset subject word recognition model to obtain an entity subject word, conveniently obtaining the entity subject word capable of representing the text content of the original text, and visually reflecting the subject content of the original text through the entity subject word. Further, splicing the original text and the entity subject term to obtain a target embedded feature vector; the classification probability calculation is carried out on the target embedded feature vectors through the preset text classification model and the reference classification labels to obtain the classification probability value corresponding to each reference classification label, the correlation degree of the target embedded feature vectors and each reference classification label can be reflected through the classification probability value, and the classification labels to which the target embedded feature vectors belong can be favorably determined.
Referring to fig. 8, an embodiment of the present application further provides a text classification apparatus, which can implement the text classification method, and the apparatus includes:
a text obtaining module 801, configured to obtain an original text to be classified;
the subject term recognition module 802 is configured to perform subject term recognition on an original text through a preset subject term recognition model to obtain an entity subject term;
the splicing module 803 is configured to splice the original text and the entity topic word to obtain a target embedded feature vector;
the probability calculation module 804 is used for performing classification probability calculation on the target embedded feature vectors through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label;
and the screening module 805 is configured to perform screening processing on the reference classification tags according to the classification probability values to obtain target classification tags of the original text.
In some embodiments, the topic word recognition module 802 includes:
the word segmentation unit is used for performing word segmentation processing on the original text through the subject word recognition model to obtain a target text word segment;
the keyword extraction unit is used for extracting keywords from the target text word segment according to preset weight parameters to obtain a text keyword set;
and the combining unit is used for combining the text keywords in the text keyword set to obtain the entity subject words.
In some embodiments, the word segmentation unit comprises:
the vocabulary recognition subunit is used for performing vocabulary recognition on the original text through a preset word segmentation device to obtain word segment entity characteristics;
the word segmentation subunit is used for carrying out word segmentation processing on the original text according to the entity characteristics of the word segments to obtain initial text word segments;
and the filtering subunit is used for filtering the initial text word segment to obtain a target text word segment.
In some embodiments, the keyword extraction unit includes:
the sequencing subunit is used for sequencing the importance of the target text word segment to obtain a text word segment sequence;
the distribution subunit is used for carrying out weight distribution on the text word segment sequence according to the weight parameters to obtain a weighted word segment sequence;
and the screening subunit is used for screening the word segments of the weighted word segment sequence to obtain a text keyword set.
In some embodiments, the splicing module 803 comprises:
the first word embedding unit is used for carrying out word embedding processing on the original text to obtain a text embedding characteristic vector;
the second word embedding unit is used for carrying out word embedding processing on the entity subject word to obtain a subject word embedding feature vector;
and the first splicing unit is used for splicing the text embedded characteristic vector and the subject word embedded characteristic vector according to a preset splicing sequence to obtain a target embedded characteristic vector.
In some embodiments, the probability calculation module 804 includes:
a tag acquisition unit configured to acquire a reference classification tag;
the third word embedding unit is used for carrying out word embedding processing on the reference classification label to obtain a reference classification label vector;
and the first calculating unit is used for calculating the vector similarity of the reference classification label vector and the target embedded characteristic vector through a text classification model to obtain a classification probability value.
In some embodiments, the text classification device further includes a model training module, which specifically includes:
the acquisition unit is used for acquiring the label text and the reference classification label;
the identification unit is used for identifying the subject term of the label text to obtain a sample subject term;
the second splicing unit is used for splicing the label text and the sample subject term to obtain a sample embedded feature vector;
the second calculation unit is used for performing classification probability calculation on the sample embedded feature vectors through the text classification model and the reference classification labels to obtain a sample classification predicted value;
the third calculation unit is used for calculating a loss value through a loss function of the text classification model and the sample classification predicted value to obtain a model loss value;
and the optimization unit is used for carrying out gridding parameter adjustment on the text classification model according to the model loss value so as to optimize the text classification model.
The specific implementation of the text classification apparatus is substantially the same as the specific implementation of the text classification method, and is not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device includes: the text classification system comprises a memory, a processor, a program stored on the memory and capable of running on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program realizes the text classification method when being executed by the processor. The electronic equipment can be any intelligent terminal including a tablet computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 902 and called by the processor 901 to execute the text classification method according to the embodiments of the present application;
an input/output interface 903 for implementing information input and output;
a communication interface 904, configured to implement communication interaction between the device and another device, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.), or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 enable a communication connection within the device with each other through a bus 905.
Embodiments of the present application further provide a storage medium, which is a computer-readable storage medium for computer-readable storage, and the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the foregoing text classification method.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer-executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the text classification method, text classification apparatus, electronic device, and storage medium provided by the embodiments of the present application, the original text to be classified is obtained, and subject term recognition is performed on the original text through a preset subject term recognition model to obtain an entity subject term, so that a subject term representing the text content of the original text is conveniently obtained and the subject matter of the original text is reflected intuitively. Further, the original text and the entity subject term are spliced to obtain a target embedded feature vector, and classification probability calculation is performed on the target embedded feature vector through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label. The classification probability values reflect the degree of correlation between the target embedded feature vector and each reference classification label, which facilitates determining the classification label to which the original text belongs.
The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute a limitation to the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 are not intended to limit the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described apparatus embodiments are merely illustrative, where the units illustrated as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A text classification method, the method comprising:
acquiring an original text to be classified;
performing subject term recognition on the original text through a preset subject term recognition model to obtain an entity subject term;
splicing the original text and the entity subject term to obtain a target embedded feature vector;
performing classification probability calculation on the target embedded feature vector through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label;
and screening the reference classification labels according to the classification probability values to obtain a target classification label of the original text.
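By way of a non-limiting illustration of claim 1, the following Python sketch traces the claimed steps end to end; recognize_subject_terms, embed, and score are hypothetical stand-ins for the preset subject term recognition model and text classification model, and none of these names come from this application:

def classify_text(original_text, reference_labels, recognize_subject_terms, embed, score):
    # Subject term recognition on the original text (hypothetical model call)
    entity_subject_terms = recognize_subject_terms(original_text)
    # Splice the original text with the entity subject terms, then embed
    target_vector = embed(original_text + " " + " ".join(entity_subject_terms))
    # Classification probability value for each reference classification label
    probabilities = {label: score(target_vector, label) for label in reference_labels}
    # Screen the reference labels by probability to obtain the target label
    return max(probabilities, key=probabilities.get)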
2. The text classification method according to claim 1, wherein the step of performing subject term recognition on the original text through a preset subject term recognition model to obtain an entity subject term comprises:
performing word segmentation processing on the original text through the subject term recognition model to obtain a target text word segment;
extracting keywords from the target text word segment according to preset weight parameters to obtain a text keyword set;
and combining the text keywords in the text keyword set to obtain the entity subject term.
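A minimal sketch of these three sub-steps, assuming whitespace segmentation and raw frequency as the preset weight parameter; a production system would use a trained segmenter and learned weights, so treat top_k and stopwords as assumed parameters:

from collections import Counter

def entity_subject_term(original_text, stopwords=frozenset(), top_k=5):
    # Word segmentation processing (stand-in for the recognition model's segmenter)
    segments = [w for w in original_text.split() if w and w not in stopwords]
    # Keyword extraction according to a weight parameter (frequency here)
    keyword_set = [w for w, _ in Counter(segments).most_common(top_k)]
    # Combine the text keywords in the keyword set into the entity subject term
    return " ".join(keyword_set)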
3. The text classification method according to claim 2, wherein the step of performing word segmentation processing on the original text through the subject term recognition model to obtain a target text word segment comprises:
performing vocabulary recognition on the original text through a preset word segmenter to obtain word segment entity features;
performing word segmentation processing on the original text according to the word segment entity features to obtain initial text word segments;
and filtering the initial text word segments to obtain the target text word segment.
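As one concrete but non-binding realization of the preset word segmenter, the sketch below uses the open-source jieba part-of-speech tagger; the noun-flag filter and the minimum length of 2 are assumptions, not requirements of this application:

import jieba.posseg as pseg

def target_text_segments(original_text, keep_flags=("n", "nr", "ns", "nt", "nz")):
    # Vocabulary recognition: each segment carries an entity feature (its POS flag)
    pairs = pseg.cut(original_text)
    # Segmentation plus filtering: keep noun-like segments longer than one character
    return [p.word for p in pairs if p.flag in keep_flags and len(p.word) > 1]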
4. The text classification method according to claim 2, wherein the step of extracting keywords from the target text word segment according to a preset weight parameter to obtain a text keyword set comprises:
ranking the target text word segments by importance to obtain a text word segment sequence;
assigning weights to the text word segment sequence according to the weight parameters to obtain a weighted word segment sequence;
and screening the weighted word segment sequence to obtain the text keyword set.
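A hedged sketch of the ranking, weighting, and screening sub-steps, with raw frequency standing in for the importance score and a caller-supplied weight table standing in for the preset weight parameters:

from collections import Counter

def text_keyword_set(target_segments, weight_params, top_k=5):
    # Importance ranking: score each segment by a base importance measure (frequency)
    freq = Counter(target_segments)
    # Weight assignment: scale each score by its preset weight parameter
    weighted = [(seg, count * weight_params.get(seg, 1.0)) for seg, count in freq.items()]
    # Word segment screening: keep the top-k weighted segments as the keyword set
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return [seg for seg, _ in weighted[:top_k]]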
5. The text classification method according to claim 1, wherein the step of splicing the original text and the entity subject term to obtain the target embedded feature vector comprises:
performing word embedding processing on the original text to obtain a text embedding feature vector;
performing word embedding processing on the entity subject term to obtain a subject term embedding feature vector;
and splicing the text embedding feature vector and the subject term embedding feature vector according to a preset splicing order to obtain the target embedded feature vector.
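For illustration, a numpy sketch using mean-pooled lookup embeddings; the pooling choice and the text-first, subject-term-second concatenation are assumptions about the preset splicing order:

import numpy as np

def target_embedded_vector(text_tokens, subject_tokens, embedding_table):
    # Word embedding processing on the original text (mean pooling assumed)
    text_vec = np.mean([embedding_table[t] for t in text_tokens], axis=0)
    # Word embedding processing on the entity subject term
    subject_vec = np.mean([embedding_table[t] for t in subject_tokens], axis=0)
    # Splice in the preset order: text embedding first, then subject embedding
    return np.concatenate([text_vec, subject_vec])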
6. The text classification method according to claim 1, wherein the step of performing classification probability calculation on the target embedded feature vector through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label comprises:
acquiring the reference classification label;
performing word embedding processing on the reference classification label to obtain a reference classification label vector;
and calculating, through the text classification model, the vector similarity between the reference classification label vector and the target embedded feature vector to obtain the classification probability value.
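One plausible reading of "vector similarity ... to obtain the classification probability value" is cosine similarity followed by a softmax; the sketch below assumes the label vectors share the target vector's dimensionality, which the claim does not itself specify:

import numpy as np

def classification_probabilities(target_vector, label_vectors):
    # Vector similarity between the target vector and each reference label vector
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    similarities = np.array([cosine(target_vector, v) for v in label_vectors])
    # Softmax turns similarities into one probability value per reference label
    exp = np.exp(similarities - similarities.max())
    return exp / exp.sum()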
7. The text classification method according to any one of claims 1 to 6, wherein before the step of performing classification probability calculation on the target embedded feature vector through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label, the method further comprises pre-training the text classification model, specifically comprising:
acquiring a label text and a reference classification label;
performing subject term recognition on the label text to obtain a sample subject term;
splicing the label text and the sample subject term to obtain a sample embedded feature vector;
performing classification probability calculation on the sample embedded feature vector through the text classification model and the reference classification label to obtain a sample classification predicted value;
calculating a loss value from the sample classification predicted value through the loss function of the text classification model to obtain a model loss value;
and performing grid-search parameter tuning on the text classification model according to the model loss value to optimize the text classification model.
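A minimal sketch of the pre-training loop with grid-search parameter tuning; train_fn and eval_loss_fn are hypothetical callables wrapping the text classification model and its loss function, and the hyperparameter grid is an assumed example:

import itertools

def pretrain_with_grid_search(train_fn, eval_loss_fn, learning_rates, batch_sizes):
    # Grid-search parameter tuning: train under each hyperparameter combination
    # and keep the model whose loss value on the labelled texts is lowest
    best_model, best_loss = None, float("inf")
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        model = train_fn(lr=lr, batch_size=bs)  # fits on (label text, reference label) pairs
        loss = eval_loss_fn(model)              # model loss value from the loss function
        if loss < best_loss:
            best_model, best_loss = model, loss
    return best_model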
8. A text classification apparatus, the apparatus comprising:
a text acquisition module, configured to acquire an original text to be classified;
a subject term recognition module, configured to perform subject term recognition on the original text through a preset subject term recognition model to obtain an entity subject term;
a splicing module, configured to splice the original text and the entity subject term to obtain a target embedded feature vector;
a probability calculation module, configured to perform classification probability calculation on the target embedded feature vector through a preset text classification model and reference classification labels to obtain a classification probability value corresponding to each reference classification label;
and a screening module, configured to screen the reference classification labels according to the classification probability values to obtain a target classification label of the original text.
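The apparatus mirrors the method of claim 1 as modules; purely for illustration, the sketch below groups the five modules into one Python class, with every injected callable a hypothetical stand-in rather than an interface defined by this application:

class TextClassificationApparatus:
    def __init__(self, recognize, embed, score, reference_labels):
        self.recognize = recognize                # subject term recognition module
        self.embed = embed                        # splicing module (text + subject terms)
        self.score = score                        # probability calculation module
        self.reference_labels = reference_labels  # reference classification labels

    def classify(self, original_text):
        # Text acquisition, recognition, splicing, scoring, and screening in turn
        subject_terms = self.recognize(original_text)
        target_vector = self.embed(original_text, subject_terms)
        probabilities = {lbl: self.score(target_vector, lbl) for lbl in self.reference_labels}
        return max(probabilities, key=probabilities.get)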
9. An electronic device, characterized in that the electronic device comprises a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the text classification method according to any one of claims 1 to 7.
10. A storage medium, the storage medium being a computer-readable storage medium, characterized in that the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the steps of the text classification method according to any one of claims 1 to 7.
CN202210687739.9A 2022-06-17 2022-06-17 Text classification method, text classification device, electronic equipment and storage medium Pending CN114942994A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210687739.9A CN114942994A (en) 2022-06-17 2022-06-17 Text classification method, text classification device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114942994A true CN114942994A (en) 2022-08-26

Family

ID=82911391

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717038A (en) * 2019-09-17 2020-01-21 腾讯科技(深圳)有限公司 Object classification method and device
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN114239576A (en) * 2021-12-20 2022-03-25 南京邮电大学 Issue label classification method based on topic model and convolutional neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150697A (en) * 2023-04-19 2023-05-23 上海钐昆网络科技有限公司 Abnormal application identification method, device, equipment, storage medium and product
CN116738298A (en) * 2023-08-16 2023-09-12 杭州同花顺数据开发有限公司 Text classification method, system and storage medium
CN116738298B (en) * 2023-08-16 2023-11-24 杭州同花顺数据开发有限公司 Text classification method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination