CN110781296A - Data classification method based on deep learning and related equipment thereof - Google Patents

Publication number
CN110781296A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910871231.2A
Other languages
Chinese (zh)
Inventor
唐亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN201910871231.2A
Publication of CN110781296A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and provides a data classification method based on deep learning and related equipment thereof, wherein the data classification method based on deep learning comprises the following steps: performing text processing on the label name corresponding to the acquired data to be classified to obtain a target name; performing text word segmentation on the target name, and extracting a first class of feature word segmentation, a second class of feature word segmentation and a third class of feature word segmentation; respectively carrying out word vector conversion processing on the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation to obtain a first class word vector, a second class word vector and a third class word vector; and importing the first class word vector, the second class word vector and the third class word vector into a pre-trained target classification model for recognition, outputting a recognition result, and taking the recognition result as a classification result of the data to be classified. According to the technical scheme, the efficiency and the accuracy of classifying the data to be classified are improved, and the working efficiency of a user is further improved.

Description

Data classification method based on deep learning and related equipment thereof
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data classification method based on deep learning and related equipment thereof.
Background
With the development of society, new kinds of data emerge endlessly, and data must be classified so that users can identify it. Traditional data classification generally relies on template matching and labeling, but this approach is not flexible enough: templates are not updated in time during template matching, so they cannot be used to match and label data accurately. As a result, data classification is inefficient and its accuracy suffers.
Disclosure of Invention
The embodiment of the invention provides a data classification method based on deep learning and related equipment thereof, which are used for solving the problems of low data classification efficiency and low accuracy.
A data classification method based on deep learning comprises the following steps:
acquiring a label name corresponding to data to be classified from a label database;
performing text processing on the label name to obtain a target name;
performing text word segmentation on the target name, and extracting a first class of feature word segmentation, a second class of feature word segmentation and a third class of feature word segmentation;
respectively carrying out word vector conversion processing on the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation to obtain a first class word vector, a second class word vector and a third class word vector;
and importing the first class word vector, the second class word vector and the third class word vector into a pre-trained target classification model for recognition, outputting a recognition result corresponding to the label name, and taking the recognition result as a classification result of the data to be classified.
A deep learning-based data classification apparatus comprising:
the first acquisition module is used for acquiring a label name corresponding to the data to be classified from the label database;
the text processing module is used for performing text processing on the label name to obtain a target name;
the feature word segmentation acquisition module is used for performing text word segmentation on the target name and extracting a first class of feature word segmentation, a second class of feature word segmentation and a third class of feature word segmentation;
the word vector conversion module is used for respectively carrying out word vector conversion processing on the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation to obtain a first class word vector, a second class word vector and a third class word vector;
and the recognition module is used for importing the first class word vector, the second class word vector and the third class word vector into a pre-trained target classification model for recognition, outputting a recognition result corresponding to the label name and taking the recognition result as a classification result of the data to be classified.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above deep learning based data classification method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned deep learning-based data classification method.
According to the data classification method based on deep learning and the related equipment thereof, text processing is performed on the obtained label name to obtain a corresponding target name; text word segmentation is performed on the target name to extract the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation; word vector conversion is performed on each of them to obtain a first class word vector, a second class word vector and a third class word vector; finally, the three word vectors are imported into a target classification model for recognition to obtain a recognition result corresponding to the label name, and the recognition result is used as the classification result of the data to be classified. The data to be classified is thereby classified automatically. Segmenting the label name and recognizing it with the target classification model improves the efficiency and accuracy of classification, which in turn improves the efficiency with which users query data by classification result.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive labor.
FIG. 1 is a flowchart of a deep learning-based data classification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of step S2 in the deep learning-based data classification method according to the embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S24 of the deep learning-based data classification method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating step S3 of the deep learning-based data classification method according to the embodiment of the present invention;
FIG. 5 is a flowchart illustrating step S33 of the deep learning-based data classification method according to the embodiment of the present invention;
FIG. 6 is a flowchart of a target classification model obtained by training with training samples in the deep learning-based data classification method according to the embodiment of the present invention;
FIG. 7 is a flowchart of training a convolutional neural network model in a deep learning-based data classification method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a deep learning-based data classification apparatus according to an embodiment of the present invention;
FIG. 9 is a block diagram of the basic structure of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data classification method based on deep learning is applied to the server side, and the server side can be specifically realized by an independent server or a server cluster consisting of a plurality of servers. In one embodiment, as shown in fig. 1, a deep learning-based data classification method is provided, which includes the following steps:
S1: Acquiring the label name corresponding to the data to be classified from the label database.
In the embodiment of the invention, the label database is monitored. When a label name corresponding to the data to be classified is detected in the label database, the label name is extracted directly and is deleted from the label database after extraction.
The label database is a database which is specially used for storing label names corresponding to the data to be classified.
The data to be classified refers to data which needs to be classified.
It should be noted that the label names in the label database are usually in sentence form, such as "multifunctional pen for myopia prevention by vision correction with small sagitty intelligent posture correcting pen", "super large beach scarf and gauze scarf gift box for autumn air-conditioning warm scarf seaside", "[collar coupon found 30] child thermos cup and straw dual-purpose kindergarten baby water bottle, male and female pupils drop-proof portable water cup", and the like.
S2: Performing text processing on the label name to obtain a target name.
In the embodiment of the present invention, the text processing refers to processing for modifying the label name according to a rule set by a user, where the rule set by the user specifically may be removing punctuation marks, letter case conversion, and the like.
Specifically, the label name is imported into a preset modification port for text processing, and the label name after text processing is determined as the target name. The preset modification port is a processing port dedicated to performing text processing on the label name.
For example, given the label name "the multi-functional intelligent pen of the star of tomorrow", importing the label name into the preset modification port for text processing yields the target name "a multifunctional intelligent pen for the tomorrow".
S3: Performing text word segmentation on the target name, and extracting a first class of feature word segmentation, a second class of feature word segmentation and a third class of feature word segmentation.
In the embodiment of the invention, the first class feature word segmentation refers to the word segmentation obtained by performing text word segmentation on the target name according to a first word segmentation rule; the second class feature word segmentation refers to the word segmentation obtained according to a second word segmentation rule; the third class feature word segmentation refers to the word segmentation obtained according to a third word segmentation rule; and the first, second and third word segmentation rules are different from one another.
Specifically, the target name is imported into a preset word segmentation port, and the first, second and third word segmentation rules are each selected to perform word segmentation processing on the target name, yielding the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation respectively. The preset word segmentation port is a processing port dedicated to performing word segmentation processing on the target name, and it contains the first word segmentation rule, the second word segmentation rule and the third word segmentation rule.
S4: Performing word vector conversion processing on the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation respectively to obtain a first class word vector, a second class word vector and a third class word vector.
Specifically, the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation are each imported directly into a preset processing library for word vector conversion processing, and the converted first class word vector, second class word vector and third class word vector are output.
The preset processing library is a database specially used for converting the feature word segmentation into the word vector, and specifically, word vector conversion processing is performed by using a word2vec model.
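The shape of the conversion in step S4 can be sketched as follows. This is only an illustration: it does not train or load a word2vec model, and instead assumes a hypothetical pre-built lookup table `toy_embeddings` that maps each feature word segment to a vector. The function name `to_word_vectors`, the table contents, and the choice to map unknown segments to a zero vector are all assumptions, not part of the source.

```python
from typing import Dict, List

Vector = List[float]


def to_word_vectors(segments: List[str],
                    embeddings: Dict[str, Vector],
                    dim: int) -> List[Vector]:
    """Look up one vector per feature word segment.

    Segments missing from the table fall back to a zero vector
    (an assumption made for this sketch only).
    """
    zero = [0.0] * dim
    return [list(embeddings.get(seg, zero)) for seg in segments]


# Hypothetical table standing in for a trained word2vec model.
toy_embeddings = {"beach": [0.1, 0.3], "towel": [0.7, 0.2]}

first_class_vectors = to_word_vectors(
    ["beach", "towel", "unknown"], toy_embeddings, dim=2
)
```

The same function would be called once per class of feature word segmentation, giving three separate lists of vectors to pass to the classification model.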
S5: Importing the first class word vector, the second class word vector and the third class word vector into a pre-trained target classification model for recognition, outputting a recognition result corresponding to the label name, and taking the recognition result as the classification result of the data to be classified.
In the embodiment of the invention, the target classification model is dedicated to recognizing the label name to which the first class word vector, the second class word vector and the third class word vector correspond.
Specifically, the first class word vector, the second class word vector and the third class word vector obtained for the label name in step S4 are all imported into the pre-trained target classification model. When the target classification model detects the three word vectors, it automatically recognizes the label name to which they correspond and outputs the recognition result, and the recognition result is used as the classification result of the data to be classified.
In this embodiment, text processing is performed on the obtained label name to obtain a corresponding target name; text word segmentation is performed on the target name to extract the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation; word vector conversion is performed on each of them to obtain a first class word vector, a second class word vector and a third class word vector; finally, the three word vectors are imported into the target classification model for recognition to obtain a recognition result corresponding to the label name, and the recognition result is used as the classification result of the data to be classified. The data to be classified is thereby classified automatically. Segmenting the label name and recognizing it with the target classification model improves the efficiency and accuracy of classification, which in turn improves the efficiency with which users query data by classification result.
In an embodiment, as shown in fig. 2, the step S2 of performing text processing on the tag name to obtain the target name includes the following steps:
S21: Performing punctuation removal processing on the label name to obtain a first name.
Specifically, the label name is matched using a symbol expression; when a punctuation mark is matched in the label name, the punctuation mark is deleted, and the label name after deletion is determined as the first name. The symbol expression refers to a regular expression dedicated to matching punctuation marks in the label name; the specific regular expression may be "[\\pP+~$`^=|<>～｀＄＾＋＝｜＜＞￥×]".
A regular expression describes a pattern of characters in a character string by means of certain special characters, so that character strings conforming to the pattern can be matched, extracted or replaced. It can therefore be used to search, delete and replace within character strings quickly and accurately.
It should be noted that when no punctuation mark is matched in the label name, no processing is performed and the label name is directly used as the first name.
For example, if the label name is "[Tie Tijia 30] the anti-falling portable water cup for the children thermos cup and straw dual-purpose kindergarten baby kettle for the students in the primary schools", the label name is matched using the symbol expression "[\\pP+~$`^=|<>～｀＄＾＋＝｜＜＞￥×]" to find the symbols "[" and "]" in the label name; these symbols are then deleted, and the first name after processing is "Tie Tijia 30 the anti-falling portable water cup for the children thermos cup and straw dual-purpose kindergarten baby kettle for the students in the primary schools".
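A minimal sketch of the punctuation removal in step S21 follows. Python's built-in `re` module does not support the `\pP` Unicode property class, so this sketch swaps in a different technique: it drops every character whose Unicode general category is punctuation (P*) or symbol (S*). The function name `remove_punctuation` and the sample label are assumptions for illustration.

```python
import unicodedata


def remove_punctuation(label_name: str) -> str:
    """Return label_name without punctuation (P*) or symbol (S*) characters."""
    return "".join(
        ch for ch in label_name
        if unicodedata.category(ch)[0] not in ("P", "S")
    )


# Hypothetical label name containing full-width brackets and ASCII punctuation.
first_name = remove_punctuation("\u3010sale 30\u3011cup, straw!")
```

A label name with no punctuation passes through unchanged, which matches the note that such a label name is used directly as the first name.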
S22: Converting capital letters in the first name into lower-case letters using regular matching to obtain a second name.
Specifically, each character in the first name obtained in step S21 is traversed and matched using a letter conversion expression. If the matched character is an upper-case letter, it is converted into the corresponding lower-case letter; after all characters have been matched, the resulting first name is used as the second name.
The letter conversion expression refers to a regular expression used to match capital letters in the first name and convert them into lower-case letters; the specific regular expression may be $reg = '/(\w+)/e'.
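The case conversion in step S22 can be sketched with a regular-expression substitution whose replacement is computed by a callback. The pattern `[A-Z]+` and the function name `to_lowercase` are assumptions chosen for this sketch; they are not the expression given in the source.

```python
import re


def to_lowercase(first_name: str) -> str:
    """Replace each run of ASCII capital letters with its lower-case form."""
    return re.sub(r"[A-Z]+", lambda m: m.group(0).lower(), first_name)


second_name = to_lowercase("USB Type-C Cable 2M")
```

In Python, `str.lower()` would give the same result for ASCII input; the regular-expression form is shown only because the step is described in terms of regular matching.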
S23: Performing full-width to half-width (full-angle to half-angle) conversion on the second name to obtain a third name.
In the embodiment of the present invention, the second name obtained in step S22 is directly imported into a preset conversion library for half-width conversion processing to obtain the converted third name. The preset conversion library is a database dedicated to identifying the full-width characters in the second name and converting them into half-width characters; it may perform the conversion using regular matching or using a preset script.
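One way a preset script for step S23 might work is sketched below. It relies on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII U+0021 to U+007E, and the ideographic space U+3000 corresponds to the ASCII space. The function name `to_half_width` is an assumption.

```python
def to_half_width(second_name: str) -> str:
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in second_name:
        code = ord(ch)
        if code == 0x3000:               # ideographic (full-width) space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:   # full-width forms of '!' through '~'
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)


# Full-width "ABC123", an ideographic space, and a full-width "x".
third_name = to_half_width("\uff21\uff22\uff23\uff11\uff12\uff13\u3000\uff58")
```

Characters outside those two ranges, including Chinese characters, are left untouched.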
S24: Filtering the third name according to a preset rule to obtain a target name.
Specifically, the preset rule refers to a rule set by the user for filtering the third name, and may specifically be filtering sensitive characters, filtering characters preset by the user, and the like. And filtering the third name by directly utilizing a preset rule, and determining the filtered third name as a target name.
In this embodiment, the target name is obtained by removing punctuation from the label name, converting capital letters into lower-case letters, converting full-width (full-angle) characters into half-width (half-angle) characters, and filtering. The target name can thus be obtained accurately, which improves the accuracy of subsequent word segmentation on the target name and, in turn, the accuracy of recognition by the target classification model.
In an embodiment, as shown in fig. 3, the step S24 of filtering the third name according to the preset rule to obtain the target name includes the following steps:
S241: Acquiring stop words from a preset stop word library.
In the embodiment of the invention, stop words are characters or words that are automatically filtered out before or after natural language data is processed during information retrieval, in order to save storage space and improve search efficiency.
Specifically, the stop words are obtained directly from the preset stop word library. The preset stop word library refers to a database dedicated to storing stop words.
S242: Matching the third name with the stop words; if the third name contains a stop word, deleting the vocabulary in the third name that is the same as the stop word, and taking the third name after deletion as the target name.
Specifically, the stop words obtained in step S241 are matched with the third name. When a vocabulary identical to a stop word exists in the third name, that vocabulary needs to be filtered out: it is deleted, and the third name after deletion is determined as the target name.
If the matched third name does not contain the same vocabulary as the stop word, the third name is directly determined as the target name.
For example, suppose the third name is "baby's toy". If the stop word is the possessive particle (rendered here as "'s"), matching the stop word with the third name shows that a vocabulary identical to the stop word exists in the third name; that vocabulary is deleted, and the target name obtained is "baby toy". If the stop word is "good", matching the stop word with the third name shows that no vocabulary identical to the stop word exists in the third name, so the third name is determined as the target name, that is, the target name is "baby's toy".
In this embodiment, the target name is obtained by deleting the vocabulary in the third name that is the same as the stop word in the manner that the obtained stop word is matched with the third name. By means of the mode that stop words are matched with the third name, words which do not meet the requirements of the user in the third name can be deleted, and accuracy of the target name is further improved.
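The stop word filtering of steps S241 and S242 can be sketched as plain substring deletion. The function name `remove_stop_words`, the longest-first ordering, and the sample strings are assumptions for illustration; the source does not specify how the matching is implemented.

```python
from typing import Iterable


def remove_stop_words(third_name: str, stop_words: Iterable[str]) -> str:
    """Delete every occurrence of each stop word from third_name."""
    target = third_name
    # Longer stop words go first so a short one does not split a longer match.
    for word in sorted(stop_words, key=len, reverse=True):
        if word and word in target:
            target = target.replace(word, "")
    return target


# Hypothetical third name and stop words, not taken from the source.
target_name = remove_stop_words("\u5b9d\u5b9d\u7684\u73a9\u5177", {"\u7684"})
```

When no stop word occurs in the third name, the function returns it unchanged, so the third name itself becomes the target name.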
In an embodiment, as shown in fig. 4, in S3, performing text segmentation on the target name, and extracting the first type feature segmentation, the second type feature segmentation, and the third type feature segmentation includes the following steps:
S31: Performing word segmentation processing on the target name using a preset word segmenter to obtain a first class word segmentation, a second class word segmentation and a third class word segmentation, each of which comprises at least two words.
In the embodiment of the invention, the target name is directly imported into the preset word segmenter, and three different word segmentation parameters are set to perform word segmentation processing on the target name, yielding the first class word segmentation, the second class word segmentation and the third class word segmentation respectively, each of which comprises at least two words.
The preset word segmenter is a word segmentation tool used for performing word segmentation processing on the target name, and may specifically be an N-Gram word segmenter or another word segmenter.
Preferably, the embodiment of the invention mainly adopts an N-Gram word segmenter.
It should be noted that the target name is imported into the N-Gram word segmenter and the value of N is set to 1, 2 and 3 in turn, yielding the 1-gram, 2-gram and 3-gram word segmentations respectively. The 1-gram word segmentation represents the first class word segmentation, the 2-gram word segmentation represents the second class word segmentation, and the 3-gram word segmentation represents the third class word segmentation.
For example, if the target name is "seaside super large beach towel female" and word segmentation is performed by the N-Gram word segmenter: when N is 1, the 1-gram word segmentation is "seaside", "super large", "beach", "towel" and "female"; when N is 2, the 2-gram word segmentation is "seaside super large", "super large beach", "beach towel" and "towel female"; when N is 3, the 3-gram word segmentation is "seaside super large beach", "super large beach towel" and "beach towel female".
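The N-Gram segmentation in the example above can be sketched as a sliding window over base words. The sketch assumes the target name has already been split into the base words listed for N = 1; how that base split is produced is not specified in the source. The function name `word_ngrams` is an assumption.

```python
from typing import List


def word_ngrams(tokens: List[str], n: int) -> List[str]:
    """Join every run of n consecutive tokens into one segment."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Base words assumed to come from an earlier split of the target name.
tokens = ["seaside", "super large", "beach", "towel", "female"]
first_class = word_ngrams(tokens, 1)
second_class = word_ngrams(tokens, 2)
third_class = word_ngrams(tokens, 3)
```

With five base words, the window yields five 1-gram, four 2-gram and three 3-gram segments; a name shorter than N yields an empty list.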
S32: Acquiring, from a preset word frequency database, the word frequency corresponding to each vocabulary in the first class word segmentation, the second class word segmentation and the third class word segmentation.
In the embodiment of the present invention, the word frequency refers to how often a vocabulary occurs in a document or in a domain document set within a corpus. For the first class word segmentation, the second class word segmentation and the third class word segmentation obtained in step S31, the word frequency corresponding to each vocabulary is obtained from the preset word frequency database.
The preset word frequency database is dedicated to storing different vocabularies and their corresponding word frequencies. For example, the word frequency of the vocabulary "beach" is 100.
S33: Comparing the word frequency with a preset threshold, and determining the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation according to a preset condition.
Specifically, the word frequency of each vocabulary contained in the first class word segmentation is compared with the preset threshold, and the first class feature word segmentation is determined according to the preset condition; the second class feature word segmentation and the third class feature word segmentation are obtained in the same way.
The preset threshold may be 10 or 100, and the specific value range is set according to the actual requirement of the user, which is not limited herein.
The preset condition is a condition set by a user for determining corresponding characteristic word segmentation according to a comparison result obtained by comparing the word frequency with a preset threshold value.
In this embodiment, a preset word segmenter is used to perform word segmentation processing on the target name to obtain the first class word segmentation, the second class word segmentation and the third class word segmentation; the word frequency corresponding to each vocabulary in them is obtained; and the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation are determined based on the preset condition by comparing the word frequency with the preset threshold. The three classes of feature word segmentation are thus obtained accurately, which further improves the accuracy of the subsequent word vector conversion performed on them.
In one embodiment, as shown in fig. 5, in S33, the step of comparing the word frequency with the preset threshold and determining the first class, the second class and the third class of feature segmentation according to the preset condition includes the following steps:
S331: Compare the word frequency with the preset threshold. If a word whose word frequency is greater than or equal to the preset threshold exists in the first class participle, determine that word as a first class characteristic participle; if such a word exists in the second class participle, determine it as a second class characteristic participle; and if such a word exists in the third class participle, determine it as a third class characteristic participle.
In the embodiment of the present invention, the word frequency corresponding to each word in the first class participle, the second class participle and the third class participle is obtained according to step S32, the word frequency corresponding to each word in the first class participle is compared with a preset threshold, and if the word frequency is greater than or equal to the preset threshold, the word corresponding to the word frequency is determined as the first class characteristic participle; and similarly, obtaining second-class characteristic participles and third-class characteristic participles.
For example, suppose vocabulary A and vocabulary B exist in the first class participle with word frequencies of 50 and 60 respectively, and the preset threshold is 50. The word frequencies of vocabulary A and vocabulary B are each compared with the preset threshold; because both are greater than or equal to the preset threshold, vocabulary A and vocabulary B are both determined as first class characteristic participles.
S332: If a word whose word frequency is smaller than the preset threshold exists in the first class participle, replace the word with the preset vocabulary and determine the replacement as a first class characteristic participle; if such a word exists in the second class participle, replace it with the preset vocabulary and determine the replacement as a second class characteristic participle; and if such a word exists in the third class participle, replace it with the preset vocabulary and determine the replacement as a third class characteristic participle.
Specifically, the word frequency corresponding to each word in the first class participle is compared with a preset threshold, if the word frequency is smaller than the preset threshold, the word corresponding to the word frequency is replaced by the preset word, and the word after being replaced by the preset word is determined as the first class characteristic participle; and similarly, obtaining second-class characteristic participles and third-class characteristic participles. The preset vocabulary refers to vocabulary set according to actual requirements of a user, and preferably, the preset vocabulary in the embodiment of the invention is UNK.
For example, suppose the preset vocabulary is UNK, vocabulary C exists in the first class participle with a word frequency of 50, and the preset threshold is 80. The word frequency of vocabulary C is compared with the preset threshold; because it is smaller than the preset threshold, vocabulary C is replaced with UNK, and UNK is determined as a first class characteristic participle.
In the embodiment, words greater than or equal to a preset threshold value in the first class of participles are determined as first class characteristic participles in a mode of comparing the word frequency with the preset threshold value, and similarly, second class characteristic participles and third class characteristic participles are determined; and replacing the vocabulary smaller than the preset threshold value in the first class of participles with the preset vocabulary and determining the vocabulary as the first class of characteristic participles, and similarly determining the second class of characteristic participles and the third class of characteristic participles. Therefore, the first-class feature word segmentation, the second-class feature word segmentation and the third-class feature word segmentation are accurately obtained, and the accuracy of word vector conversion by subsequently utilizing the first-class feature word segmentation, the second-class feature word segmentation and the third-class feature word segmentation is further improved.
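Steps S331 and S332 can be illustrated with a minimal Python sketch. The function name `select_feature_words`, the dictionary standing in for the preset word frequency database, and the treatment of a word missing from that database as having frequency 0 are all assumptions made for illustration; the text only specifies the comparison against the threshold and the replacement with UNK.

```python
from typing import Dict, List

UNK = "UNK"  # preset replacement vocabulary named in the text


def select_feature_words(words: List[str],
                         word_freq: Dict[str, int],
                         threshold: int,
                         placeholder: str = UNK) -> List[str]:
    """Keep words whose frequency is >= threshold; replace the others."""
    result = []
    for word in words:
        # Assumption: a word absent from the frequency database has frequency 0.
        freq = word_freq.get(word, 0)
        result.append(word if freq >= threshold else placeholder)
    return result


# Frequencies taken from the worked examples in the text (A=50, B=60, C=50).
freq_db = {"A": 50, "B": 60, "C": 50}
kept = select_feature_words(["A", "B"], freq_db, threshold=50)
replaced = select_feature_words(["C"], freq_db, threshold=80)
print(kept, replaced)
```

The same function would be applied separately to the first class, second class and third class participles.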
In an embodiment, as shown in fig. 6, after step S4 and before step S5, the method for classifying data based on deep learning further includes the following steps:
S6: Acquire a training sample from a preset sample library.
In the embodiment of the present invention, the training sample refers to sample data specially used for training the convolutional neural network model to obtain the target classification model. The training samples are directly obtained from a preset sample library, wherein the preset sample library is a database specially used for storing the training samples.
S7: Import the training sample into a convolutional neural network for training to obtain a target classification model.
Specifically, the training samples obtained in step S6 are imported into a convolutional neural network model for training, and the model that meets the user setting requirements after training is determined as the target classification model.
In this embodiment, the target classification model is obtained by obtaining a training sample and training the convolutional neural network by using the training sample. Therefore, accurate training of the target classification model is achieved, and accuracy of classification and identification of the label name by the target classification model in the follow-up process is guaranteed.
In an embodiment, as shown in fig. 7, the step S7 of importing the training samples into a convolutional neural network for training to obtain the target classification model includes the following steps:
S71: Initialize the convolutional neural network model to obtain an initial model.
In the embodiment of the invention, the server initializes the model parameters of the convolutional neural network model by assigning an initial parameter to the weight and bias of each network layer, so that the convolutional neural network model can extract and calculate the features of the training sample according to the initial parameters. The weights and biases are the model parameters used to perform affine transformation calculations on the input data in the network, so that the result output by the network after calculation can be consistent with the actual situation.
As an analogy, consider how a person receives information: after the information is received, it is judged and transmitted by neurons in the brain, and the person arrives at a certain result or cognition. This is a process of acquiring cognition from information. Training the convolutional neural network model likewise optimizes the weights and biases of the neuron connections in the network, so that the trained model's recognition result for the data to be recognized is consistent with the real situation.
Optionally, the server may select any value in the interval [-0.30, +0.30] as the initial weight. Setting the initial parameters within a small interval whose mean is 0 helps improve the convergence speed of the model and thus the efficiency of model construction.
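The initialization in step S71 can be sketched as follows. The text only states that an initial value is taken from the interval [-0.30, +0.30]; drawing from a uniform distribution, initializing the biases from the same interval, the fixed seed, and the name `init_layer` are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def init_layer(n_inputs: int, n_outputs: int,
               low: float = -0.30, high: float = 0.30,
               seed: int = 0) -> Tuple[List[List[float]], List[float]]:
    """Draw initial weights and biases from [low, high], an interval centred on 0."""
    rng = random.Random(seed)  # seeded only so that the sketch is reproducible
    weights = [[rng.uniform(low, high) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    # Assumption: biases are drawn from the same interval as the weights.
    biases = [rng.uniform(low, high) for _ in range(n_outputs)]
    return weights, biases


weights, biases = init_layer(n_inputs=4, n_outputs=3)
print(len(weights), len(weights[0]), len(biases))
```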
S72: Import the training sample into the initial model for convolution operation, and output a first output value corresponding to the first class word vector, a second output value corresponding to the second class word vector and a third output value corresponding to the third class word vector, wherein the training sample comprises the first class word vector, the second class word vector and the third class word vector.
In the embodiment of the invention, because the training sample comprises the first class word vector, the second class word vector and the third class word vector, the training sample is passed through the input layer, the convolution layer, the pooling layer, the splicing (concatenation) layer, the fully connected layer and the softmax layer of the initial model for the convolution operation. The result is then output to obtain the first output value corresponding to the first class word vector, the second output value corresponding to the second class word vector and the third output value corresponding to the third class word vector.
It should be noted that the input layer, the convolution layer, the pooling layer, the splicing layer, the fully connected layer and the softmax layer all have preset convolution kernels; when input data is imported into each layer, the convolution operation is performed according to the preset convolution kernel to obtain the corresponding output result.
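The layer sequence named above can be sketched in plain Python for a single input sequence of word vectors. This is a simplified illustration, not the patented model: the ReLU activation, max-over-time pooling, the kernel shapes, all numeric values and all function names are assumptions, and the sequence must be at least as long as the kernel width.

```python
import math
from typing import List

Vector = List[float]


def conv1d(seq: List[Vector], kernel: List[Vector], bias: float) -> List[float]:
    """Slide a kernel spanning len(kernel) consecutive word vectors over the sequence."""
    width = len(kernel)
    out = []
    for start in range(len(seq) - width + 1):
        total = bias
        for k in range(width):
            total += sum(x * w for x, w in zip(seq[start + k], kernel[k]))
        out.append(max(0.0, total))  # ReLU activation (an assumption)
    return out


def dense(x: Vector, weights: List[Vector], biases: Vector) -> Vector:
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, biases)]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(seq, kernels, kernel_biases, fc_w, fc_b) -> Vector:
    # Convolution, then max pooling per kernel; the list collects (splices) the pooled values.
    pooled = [max(conv1d(seq, k, b)) for k, b in zip(kernels, kernel_biases)]
    return softmax(dense(pooled, fc_w, fc_b))


# Three 2-dimensional word vectors, two kernels of width 2, two output classes.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[[0.1, 0.2], [0.3, -0.1]], [[-0.2, 0.1], [0.2, 0.2]]]
probs = forward(seq, kernels, [0.0, 0.1], [[0.3, -0.3], [-0.1, 0.2]], [0.0, 0.0])
print(probs)
```

In the described method, such a forward pass would be evaluated for each of the three classes of word vectors to produce the first, second and third output values.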
S73: Calculate a comprehensive output value by weighted summation based on the first output value, the second output value, the third output value and preset weight values.
Specifically, based on the first output value, the second output value, and the third output value obtained in step S72, a comprehensive output value is calculated according to formula (1):
$$y = a_1 y_1 + a_2 y_2 + a_3 y_3 \qquad \text{formula (1)}$$
where $y$ is the comprehensive output value, $y_1$ is the first output value, $y_2$ is the second output value, $y_3$ is the third output value, $a_1$, $a_2$ and $a_3$ are the preset weight values, and $a_1 + a_2 + a_3 = 1$.
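Formula (1) amounts to a three-term weighted sum. The sketch below uses illustrative values only; the function name `combined_output`, the sample outputs and the sample weights are assumptions, and the check that the weights sum to 1 mirrors the constraint stated with the formula.

```python
def combined_output(y1: float, y2: float, y3: float,
                    a1: float, a2: float, a3: float) -> float:
    """Weighted sum of the three output values, as in formula (1)."""
    if abs((a1 + a2 + a3) - 1.0) > 1e-9:
        raise ValueError("the preset weights must sum to 1")
    return a1 * y1 + a2 * y2 + a3 * y3


# Illustrative values only: outputs 0.6, 0.9, 0.3 with weights 0.5, 0.3, 0.2.
y = combined_output(0.6, 0.9, 0.3, a1=0.5, a2=0.3, a3=0.2)
print(round(y, 4))
```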
S74: Compare the comprehensive output value with a preset precision value, and if a preset number of consecutive comprehensive output values are less than or equal to the preset precision value, determine the initial model corresponding to those comprehensive output values as the target classification model.
Specifically, the comprehensive output value obtained in step S73 is compared with the preset precision value. If the comprehensive output value is less than or equal to the preset precision value, the most recent preset number of historical comprehensive output values of the initial model corresponding to that comprehensive output value are obtained from a preset historical database. If all of these historical comprehensive output values are also less than or equal to the preset precision value, the initial model corresponding to the comprehensive output value is determined as the target classification model.
The preset precision value is a numerical value specially used for judging whether the initial model can reach the user standard, and specifically may be 0.8, or may be set according to the actual requirements of the user. The preset number may be 10 specifically, or may be set according to the actual requirement of the user.
In this embodiment, an initial model is obtained by initializing a convolutional neural network model, a first output value, a second output value and a third output value are obtained by calculation according to a training sample, and a comprehensive output value is obtained by calculation according to formula (1). Finally, by comparing the comprehensive output value with the preset precision value, the initial model for which a preset number of consecutive comprehensive output values are less than or equal to the preset precision value is determined as the target classification model. Therefore, training and tuning of the initial model are achieved, and the accuracy of the target classification model in recognizing the training samples is improved.
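The stopping rule of step S74 can be sketched as a run-length check over successive comprehensive output values. The default values 0.8 and 10 are the examples given in the text; the function name `first_converged_step` and the choice to count the current value as part of the consecutive run are assumptions, since the text can also be read as requiring the preset number of historical values in addition to the current one.

```python
from typing import Iterable, Optional


def first_converged_step(outputs: Iterable[float],
                         precision: float = 0.8,
                         required_run: int = 10) -> Optional[int]:
    """Return the index at which `required_run` consecutive outputs are <= precision."""
    run = 0
    for step, value in enumerate(outputs):
        # Any value above the precision threshold resets the consecutive count.
        run = run + 1 if value <= precision else 0
        if run >= required_run:
            return step
    return None  # the criterion was never met


history = [0.9, 0.7, 0.6, 0.5]
print(first_converged_step(history, precision=0.8, required_run=3))
```

The model state at the returned step would be the one taken as the target classification model; a return value of `None` means training should continue.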
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a deep learning based data classification apparatus is provided, and the deep learning based data classification apparatus corresponds to the deep learning based data classification method in the above embodiments one to one. As shown in fig. 8, the deep learning-based data classification apparatus includes a first obtaining module 81, a text processing module 82, a feature segmentation obtaining module 83, a word vector conversion module 84, and a recognition module 85. The functional modules are explained in detail as follows:
the first obtaining module 81 is configured to obtain a tag name corresponding to data to be classified from a tag database;
the text processing module 82 is used for performing text processing on the label name to obtain a target name;
the feature segmentation obtaining module 83 is configured to perform text segmentation on the target name, and extract a first type of feature segmentation, a second type of feature segmentation, and a third type of feature segmentation;
a word vector conversion module 84, configured to perform word vector conversion processing on the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation respectively to obtain a first class word vector, a second class word vector and a third class word vector;
and the recognition module 85 is configured to import the first class word vector, the second class word vector and the third class word vector into a pre-trained target classification model for recognition, output a recognition result corresponding to the label name, and use the recognition result as a classification result of the data to be classified.
Further, the text processing module 82 includes:
the symbol removal submodule is used for performing punctuation symbol removal processing on the label name to obtain a first name;
the case conversion sub-module is used for converting capital letters in the first name into lowercase letters by utilizing regular matching to obtain a second name;
the full-half-angle conversion submodule is used for performing full-angle to half-angle processing on the second name to obtain a third name;
and the filtering submodule is used for filtering the third name according to a preset rule to obtain a target name.
Further, the filtering submodule includes:
the second acquisition unit is used for acquiring stop words from a preset stop word bank;
and the deleting unit is used for matching the third name with the stop word, and if the third name contains the stop word, deleting the vocabulary in the third name which is the same as the stop word, and taking the deleted third name as the target name.
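The text processing chain performed by these submodules (punctuation removal, regular-expression case conversion, full-width to half-width conversion, stop-word deletion) can be sketched as below. The function names, the use of Unicode punctuation categories, the whitespace-based matching of stop words and the sample stop-word set are assumptions for illustration; the text does not specify them.

```python
import re
import unicodedata
from typing import Set


def remove_punctuation(name: str) -> str:
    """Drop every character whose Unicode category is a punctuation class (P*)."""
    return "".join(ch for ch in name
                   if not unicodedata.category(ch).startswith("P"))


def to_lowercase(name: str) -> str:
    """Lowercase ASCII capitals via regular-expression matching."""
    return re.sub(r"[A-Z]", lambda m: m.group().lower(), name)


def full_to_half(name: str) -> str:
    """Map full-width forms (U+FF01..U+FF5E, U+3000) to their half-width equivalents."""
    out = []
    for ch in name:
        code = ord(ch)
        if code == 0x3000:               # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:   # full-width ASCII variants
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def remove_stop_words(name: str, stop_words: Set[str]) -> str:
    # Assumption: words are delimited by whitespace for matching purposes.
    return " ".join(w for w in name.split() if w not in stop_words)


def preprocess(label_name: str, stop_words: Set[str]) -> str:
    first = remove_punctuation(label_name)
    second = to_lowercase(first)
    third = full_to_half(second)
    return remove_stop_words(third, stop_words)


# "\uff0c" is a full-width comma; "\uff56\uff49\uff50" is full-width "vip".
target = preprocess("The Beach Holiday\uff0c \uff56\uff49\uff50 Plan!", {"the"})
print(target)
```

Because the case conversion precedes the width conversion, as in the described order, full-width capital letters would remain capitals after this chain; an implementation could reorder the two steps if that matters.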
Further, the feature segmentation obtaining module 83 includes:
the word segmentation processing sub-module is used for carrying out word segmentation processing on the target name by utilizing a preset word segmentation device to obtain a first class word segmentation, a second class word segmentation and a third class word segmentation, wherein the first class word segmentation, the second class word segmentation and the third class word segmentation all comprise at least two words;
the third obtaining sub-module is used for respectively obtaining the word frequency corresponding to each vocabulary in the first class participle, the second class participle and the third class participle according to a preset word frequency database;
and the comparison submodule is used for comparing the word frequency with a preset threshold value and determining the first class of characteristic participles, the second class of characteristic participles and the third class of characteristic participles according to preset conditions.
Further, the comparison submodule includes:
the first comparison unit is used for comparing the word frequency with a preset threshold, if the word frequency is larger than or equal to the preset threshold in the first class of participles, the word is determined as a first class characteristic participle, if the word frequency is larger than or equal to the preset threshold in the second class participle, the word is determined as a second class characteristic participle, and if the word frequency is larger than or equal to the preset threshold in the third class participle, the word is determined as a third class characteristic participle;
and the second comparison unit is used for replacing the vocabulary with the preset vocabulary and determining the vocabulary as the first class characteristic participle if the vocabulary with the word frequency smaller than the preset threshold exists in the first class participle, replacing the vocabulary with the preset vocabulary and determining the vocabulary as the second class characteristic participle if the vocabulary with the word frequency smaller than the preset threshold exists in the second class participle, and replacing the vocabulary with the preset vocabulary and determining the vocabulary as the third class characteristic participle if the vocabulary with the word frequency smaller than the preset threshold exists in the third class participle.
Further, the data classification device based on deep learning further comprises:
the fourth acquisition module is used for acquiring the training samples from the preset sample library;
and the training module is used for importing the training samples into the convolutional neural network for training to obtain a target classification model.
Further, the training module comprises:
the initialization submodule is used for initializing the convolutional neural network model to obtain an initial model;
the output sub-module is used for importing the training sample into the initial model for convolution operation, and outputting a first output value corresponding to the first word vector, a second output value corresponding to the second word vector and a third output value corresponding to the third word vector, wherein the training sample comprises the first word vector, the second word vector and the third word vector;
the calculation submodule is used for calculating to obtain a comprehensive output value according to a weighting summation mode based on the first output value, the second output value, the third output value and a preset weight value;
and the model determining submodule is used for comparing the comprehensive output values with preset precision values, and if the comprehensive output values of the continuous preset number are less than or equal to the preset precision values, determining the initial model corresponding to the comprehensive output values as a target classification model.
Some embodiments of the present application disclose a computer device. Referring specifically to fig. 9, a basic structure block diagram of a computer device 90 according to an embodiment of the present application is shown.
As illustrated in fig. 9, the computer device 90 includes a memory 91, a processor 92, and a network interface 93 communicatively connected to each other through a system bus. It is noted that only a computer device 90 having components 91-93 is shown in FIG. 9, but it is understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 91 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 91 may be an internal storage unit of the computer device 90, such as a hard disk or a memory of the computer device 90. In other embodiments, the memory 91 may also be an external storage device of the computer device 90, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 90. Of course, the memory 91 may also include both internal and external storage units of the computer device 90. In this embodiment, the memory 91 is generally used for storing an operating system installed on the computer device 90 and various types of application software, such as program codes of the deep learning-based data classification method. Further, the memory 91 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 92 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 92 is typically used to control the overall operation of the computer device 90. In this embodiment, the processor 92 is configured to execute the program code stored in the memory 91 or process data, for example, execute the program code of the deep learning-based data classification method.
The network interface 93 may include a wireless network interface or a wired network interface, and the network interface 93 is generally used to establish a communication connection between the computer device 90 and other electronic devices.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing a tag name information entry program executable by at least one processor to cause the at least one processor to perform the steps of any one of the deep learning based data classification methods described above.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a computer device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
Finally, it should be noted that the above-mentioned embodiments illustrate only some of the embodiments of the present application and are therefore not to be considered limiting of its scope, for the invention may admit of other equally effective embodiments. This application may be embodied in many different forms, and the embodiments are provided so that the disclosure of the application will be thoroughly understood. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features may be replaced with equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, are within the protection scope of the present application.

Claims (10)

1. A deep learning-based data classification method is characterized by comprising the following steps:
acquiring a label name corresponding to data to be classified from a label database;
performing text processing on the label name to obtain a target name;
performing text word segmentation on the target name, and extracting a first class of feature word segmentation, a second class of feature word segmentation and a third class of feature word segmentation;
respectively carrying out word vector conversion processing on the first class feature word segmentation, the second class feature word segmentation and the third class feature word segmentation to obtain a first class word vector, a second class word vector and a third class word vector;
and importing the first class word vector, the second class word vector and the third class word vector into a pre-trained target classification model for recognition, outputting a recognition result corresponding to the label name, and taking the recognition result as a classification result of the data to be classified.
2. The deep learning-based data classification method according to claim 1, wherein the step of performing text processing on the tag name to obtain the target name comprises:
performing punctuation mark removal processing on the label name to obtain a first name;
converting capital letters in the first name into lowercase letters by utilizing regular matching to obtain a second name;
performing full-angle to half-angle processing on the second name to obtain a third name;
and filtering the third name according to a preset rule to obtain the target name.
3. The deep learning-based data classification method according to claim 2, wherein the step of filtering the third name according to a preset rule to obtain the target name comprises:
acquiring stop words from a preset stop word bank;
and matching the third name with the stop word, and if the stop word is contained in the third name, deleting the vocabulary in the third name which is the same as the stop word, and taking the deleted third name as the target name.
4. The data classification method based on deep learning of claim 1, wherein the step of performing text segmentation on the target name and extracting a first class feature segmentation, a second class feature segmentation and a third class feature segmentation comprises:
performing word segmentation processing on the target name by using a preset word segmentation device to obtain a first class word segmentation, a second class word segmentation and a third class word segmentation, wherein the first class word segmentation, the second class word segmentation and the third class word segmentation all comprise at least two words;
respectively acquiring the word frequency corresponding to each vocabulary in the first category participles, the second category participles and the third category participles according to a preset word frequency database;
and comparing the word frequency with a preset threshold value, and determining the first class feature participle, the second class feature participle and the third class feature participle according to a preset condition.
5. The deep learning-based data classification method according to claim 4, wherein the step of comparing the word frequency with a preset threshold and determining the first class feature segmentation, the second class feature segmentation and the third class feature segmentation according to a preset condition comprises:
comparing the word frequency with a preset threshold, if the word frequency is larger than or equal to the preset threshold in the first class of participles, determining the word as the first class of characteristic participles, if the word frequency is larger than or equal to the preset threshold in the second class of participles, determining the word as the second class of characteristic participles, and if the word frequency is larger than or equal to the preset threshold in the third class of participles, determining the word as the third class of characteristic participles;
if the vocabulary with the word frequency smaller than the preset threshold exists in the first class participle, replacing the vocabulary with the preset vocabulary and determining the vocabulary as the first class characteristic participle, if the vocabulary with the word frequency smaller than the preset threshold exists in the second class participle, replacing the vocabulary with the preset vocabulary and determining the vocabulary as the second class characteristic participle, and if the vocabulary with the word frequency smaller than the preset threshold exists in the third class participle, replacing the vocabulary with the preset vocabulary and determining the vocabulary as the third class characteristic participle.
6. The data classification method based on deep learning of claim 1, wherein after the step of performing word vector transformation on the first class of feature participles, the second class of feature participles and the third class of feature participles respectively to obtain a first class word vector, a second class word vector and a third class word vector, the step of importing the first class word vector, the second class word vector and the third class word vector into a pre-trained target classification model for recognition, outputting a recognition result corresponding to the tag name, and before the step of taking the recognition result as the classification result of the data to be classified, the data classification method based on deep learning further comprises:
acquiring a training sample from a preset sample library;
and importing the training sample into a convolutional neural network for training to obtain the target classification model.
7. The deep learning-based data classification method according to claim 6, wherein the step of introducing the training samples into a convolutional neural network for training to obtain the target classification model comprises:
initializing the convolutional neural network model to obtain an initial model;
importing the training sample into the initial model to perform convolution operation, and outputting a first output value corresponding to the first word vector, a second output value corresponding to the second word vector and a third output value corresponding to the third word vector, wherein the training sample comprises the first word vector, the second word vector and the third word vector;
calculating to obtain a comprehensive output value according to a weighting summation mode based on the first output value, the second output value, the third output value and a preset weight value;
and comparing the comprehensive output value with a preset precision value, and if the comprehensive output values of the continuous preset number are less than or equal to the preset precision value, determining an initial model corresponding to the comprehensive output value as the target classification model.
8. A data classification apparatus based on deep learning, characterized in that the apparatus comprises:
a first acquisition module, configured to acquire, from a label database, a label name corresponding to data to be classified;
a text processing module, configured to perform text processing on the label name to obtain a target name;
a feature word segment acquisition module, configured to perform text word segmentation on the target name and extract first-class feature word segments, second-class feature word segments and third-class feature word segments;
a word vector conversion module, configured to perform word vector conversion on the first-class feature word segments, the second-class feature word segments and the third-class feature word segments respectively to obtain a first-class word vector, a second-class word vector and a third-class word vector; and
a recognition module, configured to import the first-class word vector, the second-class word vector and the third-class word vector into a pre-trained target classification model for recognition, output a recognition result corresponding to the label name, and take the recognition result as the classification result of the data to be classified.
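The module chain of claim 8 can be sketched as a simple pipeline in Python. Everything below is an assumption made for illustration: the class name, the callables, and in particular the toy segmentation, vectorization and recognition rules are placeholders, since the claim does not define how the three classes of word segments are chosen or how the model scores them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Segments = Tuple[List[str], List[str], List[str]]


@dataclass
class ClassificationApparatus:
    """Hypothetical wiring of the five modules named in the claim."""
    label_db: Dict[str, str]                        # data id -> label name
    process_text: Callable[[str], str]              # label name -> target name
    segment: Callable[[str], Segments]              # target name -> three segment classes
    vectorize: Callable[[List[str]], List[Vector]]  # segments -> word vectors
    recognize: Callable[[List[Vector], List[Vector], List[Vector]], str]

    def classify(self, data_id: str) -> str:
        label_name = self.label_db[data_id]          # first acquisition module
        target_name = self.process_text(label_name)  # text processing module
        first, second, third = self.segment(target_name)
        vectors = [self.vectorize(s) for s in (first, second, third)]
        return self.recognize(*vectors)              # recognition module


def toy_segment(name: str) -> Segments:
    # Placeholder rule: first token, middle tokens, last token.
    tokens = name.split()
    return tokens[:1], tokens[1:-1], (tokens[-1:] if len(tokens) > 1 else [])


def toy_vectorize(words: List[str]) -> List[Vector]:
    # Placeholder "embedding": a one-dimensional vector holding the word length.
    return [[float(len(w))] for w in words]


def toy_recognize(a: List[Vector], b: List[Vector], c: List[Vector]) -> str:
    # Placeholder for the pre-trained model: threshold on the summed components.
    total = sum(v[0] for group in (a, b, c) for v in group)
    return "category_a" if total >= 10 else "category_b"


apparatus = ClassificationApparatus(
    label_db={"d1": "  Term Life Insurance  "},
    process_text=lambda s: s.strip().lower(),
    segment=toy_segment,
    vectorize=toy_vectorize,
    recognize=toy_recognize,
)
result = apparatus.classify("d1")
print(result)
```

Passing each module in as a callable keeps the sketch independent of any particular tokenizer, embedding or network, which the claim leaves open.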
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the data classification method based on deep learning according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the data classification method based on deep learning according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910871231.2A CN110781296A (en) 2019-09-16 2019-09-16 Data classification method based on deep learning and related equipment thereof

Publications (1)

Publication Number Publication Date
CN110781296A true CN110781296A (en) 2020-02-11

Family

ID=69383500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910871231.2A Pending CN110781296A (en) 2019-09-16 2019-09-16 Data classification method based on deep learning and related equipment thereof

Country Status (1)

Country Link
CN (1) CN110781296A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354194A (en) * 2014-08-19 2016-02-24 上海中怡通信息科技有限公司 Intelligent commodity classifying method and system
CN106339495A (en) * 2016-08-31 2017-01-18 广州智索信息科技有限公司 Topic detection method and system based on hierarchical incremental clustering
CN106372063A (en) * 2016-11-01 2017-02-01 上海智臻智能网络科技股份有限公司 Information processing method and device and terminal
CN109388712A (en) * 2018-09-21 2019-02-26 平安科技(深圳)有限公司 A kind of trade classification method and terminal device based on machine learning
CN110020420A (en) * 2018-01-10 2019-07-16 腾讯科技(深圳)有限公司 Text handling method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200211
