CN115795030A - Text classification method and device, computer equipment and storage medium - Google Patents


Publication number
CN115795030A
CN115795030A
Authority
CN
China
Prior art keywords: target, theme, text, feature, word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211319702.7A
Other languages
Chinese (zh)
Inventor
郑子彬
刘小慧
赵山河
蔡倬
邬稳
朱煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merchants Union Consumer Finance Co Ltd
Sun Yat Sen University
Original Assignee
Merchants Union Consumer Finance Co Ltd
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merchants Union Consumer Finance Co Ltd and Sun Yat Sen University
Priority to CN202211319702.7A
Publication of CN115795030A
Legal status: Pending

Abstract

The application relates to a text classification method and apparatus, a computer device, a storage medium, and a computer program product. The method comprises the following steps: acquiring a text to be classified and a target topic feature vocabulary; calculating a text feature vector corresponding to the text to be classified; obtaining a topic feature vector corresponding to each candidate topic, the topic feature vector being calculated from the feature word set corresponding to that candidate topic; calculating the similarity between the text feature vector and each topic feature vector, and acquiring a target topic feature vector for the text based on the similarities; and taking the candidate topic corresponding to the target topic feature vector as the target topic of the text to be classified. The method can improve the accuracy of short text classification.

Description

Text classification method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text classification method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of domain ontology technology, text classification techniques based on a domain ontology have emerged. Such a technique expands the text's topic feature vector and corrects its weights by computing the cosine similarity between the feature words of the topic feature vector and the domain ontology and extracting the words whose similarity exceeds a threshold, thereby expanding the text's semantic features.
The traditional text classification process mainly performs classification based on keyword matching. Keyword matching offers poor support for semantic matching, and its performance depends on the user's understanding of the method, so it has significant limitations.
Moreover, because short texts are short, carry little information, and express topic features sparsely, conventional text classification methods do not transfer well to short text classification, so their accuracy on short texts is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text classification method, apparatus, computer device, computer-readable storage medium, and computer program product that can classify short texts by topic and improve the accuracy of short text classification.
A method of text classification, the method comprising:
acquiring a text to be classified and a target topic feature vocabulary, wherein the target topic feature vocabulary stores each candidate topic and the feature word set corresponding to that candidate topic, one feature word set comprises the feature words corresponding to the same candidate topic, the target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is used to expand the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is divided from historical texts;
calculating a text feature vector corresponding to the text to be classified;
obtaining a topic feature vector corresponding to each candidate topic, wherein the topic feature vector is calculated based on the feature word set corresponding to the candidate topic;
calculating the similarity between the text feature vector and each topic feature vector, and acquiring a target topic feature vector corresponding to the text to be classified based on the similarities;
and taking the candidate topic corresponding to the target topic feature vector as the target topic of the text to be classified.
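For illustration, the claimed steps can be sketched as a small pipeline. The function names, the bag-of-words vectorizer, and the overlap similarity below are illustrative assumptions, not part of the application:

```python
def classify_text(text_tokens, target_topic_vocab, vectorize, similarity):
    """Sketch of the claimed steps. text_tokens is the segmented text to be
    classified; target_topic_vocab maps each candidate topic to its feature
    word set ({word: weight}); vectorize turns tokens into a {word: weight}
    vector; similarity compares two such vectors."""
    text_vec = vectorize(text_tokens)                                   # step 2
    topic_vecs = {t: dict(ws) for t, ws in target_topic_vocab.items()}  # step 3
    sims = {t: similarity(text_vec, v) for t, v in topic_vecs.items()}  # step 4
    target = max(sims, key=sims.get)                                    # step 5
    return target, sims[target]

# Toy run with a bag-of-words vectorizer and an overlap-count similarity.
vocab = {"billing": {"fee": 1.0, "charge": 1.0}, "service": {"agent": 1.0}}
vectorize = lambda toks: {t: 1.0 for t in toks}
overlap = lambda a, b: sum(min(a.get(t, 0), b.get(t, 0)) for t in a)
print(classify_text(["charge", "fee", "now"], vocab, vectorize, overlap))
# → ('billing', 2.0)
```

The application's own similarity is the HowNet-based cosine of formula (1); any function over two weight dictionaries can be plugged in here.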
In one embodiment, before obtaining the target topic feature vocabulary, the method further includes:
acquiring a target domain descriptor list (thesaurus) and determining the application purpose of the domain ontology;
converting the descriptors in the target domain descriptor list into concepts of the domain ontology based on that purpose, to obtain target concepts;
determining the hierarchy among the target concepts based on the hierarchical relationships among the descriptors, to obtain a target hierarchy;
adding attributes to the target concepts based on the qualifiers and annotations of the descriptors, to obtain target attributes;
adding inter-word relationships to the target concepts based on the inter-word relationships among the descriptors, to obtain target inter-word relationships;
forming a target concept model from the target concepts, the target hierarchy, the target attributes, and the target inter-word relationships;
creating instances corresponding to the target concepts based on the target concept model, to obtain target instances;
and constructing the target domain ontology from the target concept model and the target instances.
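The concept/hierarchy/attribute/relation/instance structure built by the steps above can be sketched as a plain data structure. The class and example entries below are hypothetical illustrations, not terms from the application:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Concept:
    # One concept converted from a descriptor in the domain descriptor list.
    name: str
    parent: Optional[str] = None                     # target hierarchy relation
    attributes: dict = field(default_factory=dict)   # from qualifiers and annotations
    related: set = field(default_factory=set)        # target inter-word relations
    instances: list = field(default_factory=list)    # target instances

ontology = {}

def add_concept(name, parent=None, **attrs):
    # Register a concept in this toy target domain ontology.
    ontology[name] = Concept(name, parent, attributes=attrs)
    return ontology[name]

add_concept("complaint")
add_concept("billing complaint", parent="complaint", severity="negative")
ontology["billing complaint"].related.add("overcharge")
ontology["billing complaint"].instances.append("duplicate fee deduction")
```

A real construction would populate these fields from the descriptor list rather than by hand.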
In one embodiment, before obtaining the target topic feature vocabulary, the method further includes:
acquiring the application scope of the domain ontology, acquiring a term list corresponding to the domain ontology based on that scope, and establishing the concept structure of the domain ontology from the term list, to obtain a target concept structure;
defining the concept attributes and class constraints of the domain ontology based on the target concept structure, to obtain target concept attributes and target class constraints, and forming a target concept model of the domain ontology from the target concept structure, the target concept attributes, and the target class constraints;
and establishing instances of each class in the domain ontology based on the target concept model, to obtain target instances, and forming the target domain ontology from the target concept model and the target instances.
In one embodiment, after the target domain ontology is constructed based on the target conceptual model and the target instance, the method further includes:
acquiring the data volume of the historical texts;
when the data volume is greater than a first threshold, taking two thirds of the historical texts as the training set and the remaining third as the test set;
when the data volume is less than or equal to the first threshold and greater than a second threshold, dividing the historical texts into a preset number of equally sized sample sets, taking each sample set in turn as the test set, and combining the remaining sample sets into the training set;
when the data volume is less than the second threshold, repeatedly and randomly sampling from the historical texts, with replacement, a sample set of the same size as the historical texts to serve as the training set, and taking the data that do not appear in the training set as the test set, wherein the first threshold is greater than the second threshold;
the test set is used for testing the accuracy of the target topic assigned to the text to be classified.
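The three branches above correspond to the standard hold-out, k-fold cross-validation, and bootstrap splits. A minimal sketch follows; the threshold values, function name, fold count, and seed are illustrative assumptions:

```python
import random

def split_dataset(texts, first_threshold=10000, second_threshold=1000, k=5, seed=0):
    """Return a list of (train, test) pairs, choosing the strategy by data volume:
    hold-out (2/3 : 1/3), k-fold cross-validation, or bootstrap sampling."""
    rng = random.Random(seed)
    n = len(texts)
    if n > first_threshold:
        # Hold-out: two thirds for training, one third for testing.
        shuffled = texts[:]
        rng.shuffle(shuffled)
        cut = (2 * n) // 3
        return [(shuffled[:cut], shuffled[cut:])]
    if n > second_threshold:
        # k-fold: each equally sized fold serves once as the test set.
        shuffled = texts[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        return [
            ([t for j, f in enumerate(folds) if j != i for t in f], folds[i])
            for i in range(k)
        ]
    # Bootstrap: draw n samples with replacement; out-of-bag items form the test set.
    train = [rng.choice(texts) for _ in range(n)]
    test = [t for t in texts if t not in train]
    return [(train, test)]
```

Each pair can then be fed to the topic-vocabulary construction and accuracy test described below in the embodiment.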
In one embodiment, after obtaining the training set and the test set, the method further includes:
performing word segmentation and stop-word removal on the training set to obtain target historical texts;
calculating the weight of each word in the target historical texts, and representing the texts based on those weights;
selecting each candidate topic of the target historical texts in turn; for the selected topic, calculating the total weight and the mean weight of each word, sorting the words in descending order of total weight, taking a preset number of the top-ranked words as the feature words of the selected topic, and using each feature word's mean weight as its weight, thereby obtaining the topic feature vector of the selected topic; repeating until every candidate topic has a corresponding topic feature vector, and taking these as the first topic feature vectors;
and performing feature filtering on each first topic feature vector to obtain a second topic feature vector for each candidate topic, and obtaining the topic feature vocabulary from the second topic feature vectors, wherein the topic feature vocabulary stores each candidate topic of the target historical texts and its corresponding feature word set, and the feature word set comprises the feature words and their weights.
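The weighting and top-N selection above can be sketched with TF-IDF weights (the application does not name its weighting scheme, so TF-IDF is an assumption here, as are the function names and `top_n` value):

```python
import math
from collections import Counter, defaultdict

def tfidf(docs):
    # docs: list of token lists (already segmented, stop words removed).
    # Returns one {word: weight} vector per document.
    df = Counter(t for doc in docs for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def topic_vector(doc_vectors, top_n=3):
    # Sum and average each word's weight over a topic's documents, keep the
    # top_n words by total weight, and use the mean as each word's weight.
    total, count = defaultdict(float), defaultdict(int)
    for vec in doc_vectors:
        for t, w in vec.items():
            total[t] += w
            count[t] += 1
    top = sorted(total, key=total.get, reverse=True)[:top_n]
    return {t: total[t] / count[t] for t in top}
```

Running `topic_vector` over the document vectors of each candidate topic, then filtering, yields the first and second topic feature vectors of the embodiment.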
In one embodiment, after obtaining the topic feature vocabulary based on the second topic feature vector, the method further includes:
acquiring a threshold, and analyzing the target domain ontology to obtain the instance words that carry negative information;
selecting each candidate topic of the topic feature vocabulary in turn, calculating the similarity between every feature word in the selected topic's feature word set and every instance word, and selecting, for each instance word, its maximum similarity;
comparing each instance word's maximum similarity with the threshold, taking the instance words whose maximum similarity exceeds the threshold as feature-related words, and adding each feature-related word to the selected topic's feature word set to obtain an updated feature word set, wherein each feature-related word's weight equals the weight of the feature word against which its maximum similarity was computed; and obtaining a first target topic feature vector for the selected topic from the updated feature word set;
and, when every candidate topic in the topic feature vocabulary has been selected and has a first target topic feature vector, performing feature filtering on each first target topic feature vector to obtain a target topic feature vector for each candidate topic, and obtaining the target topic feature vocabulary from the target topic feature vectors.
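The expansion step above can be sketched as follows. The toy synonym-based similarity stands in for the HowNet word similarity; the threshold, example words, and names are illustrative assumptions:

```python
def expand_topic(feature_words, instance_words, similarity, threshold=0.6):
    """Add ontology instance words whose best similarity to a topic's feature
    words exceeds the threshold; each new word inherits the weight of its
    closest feature word. feature_words: {word: weight};
    similarity(a, b) -> float in [0, 1]."""
    expanded = dict(feature_words)
    for inst in instance_words:
        best_word, best_sim = None, 0.0
        for w in feature_words:
            s = similarity(inst, w)
            if s > best_sim:
                best_word, best_sim = w, s
        if best_sim > threshold:
            # The feature-related word inherits its closest feature word's weight.
            expanded[inst] = feature_words[best_word]
    return expanded

# Toy similarity: 1.0 for identical words, 0.8 for a hand-listed synonym pair.
synonyms = {("overcharge", "charge"), ("charge", "overcharge")}
sim = lambda a, b: 1.0 if a == b else (0.8 if (a, b) in synonyms else 0.0)

topic = {"charge": 0.9, "refund": 0.5}
print(expand_topic(topic, ["overcharge", "weather"], sim))
# "overcharge" joins with weight 0.9; "weather" falls below the threshold.
```

Applying this per candidate topic and then feature-filtering yields the target topic feature vocabulary.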
In one embodiment, after obtaining the candidate topic corresponding to the target topic feature vector as the target topic corresponding to the text to be classified, the method further includes:
obtaining a topic basis corresponding to a target object;
and acquiring target initial historical texts based on the topic basis, wherein the target topic of each target initial historical text is consistent with the topic basis, and sending the target initial historical texts to the terminal corresponding to the target object.
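The selection step above amounts to filtering classified historical texts by the topic basis; a minimal sketch (the pair representation and function name are assumptions):

```python
def select_texts_for_push(history, topic_basis):
    # history: list of (text, target_topic) pairs produced by the classifier.
    # Return the texts whose target topic matches the object's topic basis.
    return [text for text, topic in history if topic == topic_basis]
```

The returned texts would then be sent to the terminal corresponding to the target object.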
An apparatus for text classification, the apparatus comprising:
a data acquisition module, configured to acquire a text to be classified and a target topic feature vocabulary, wherein the target topic feature vocabulary stores each candidate topic and the feature word set corresponding to that candidate topic, one feature word set comprises the feature words corresponding to the same candidate topic, the target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is used to expand the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is divided from historical texts;
a text feature vector generation module, configured to calculate a text feature vector corresponding to the text to be classified;
a topic feature vector obtaining module, configured to obtain a topic feature vector corresponding to each candidate topic, wherein the topic feature vector is calculated based on the feature word set corresponding to the candidate topic;
a target topic feature vector determining module, configured to calculate the similarity between the text feature vector and each topic feature vector, and to acquire a target topic feature vector corresponding to the text to be classified based on the similarities;
and a target topic determining module, configured to take the candidate topic corresponding to the target topic feature vector as the target topic of the text to be classified.
A computer device. The computer device comprises a memory storing a computer program and a processor that implements the following steps when executing the computer program:
acquiring a text to be classified and a target topic feature vocabulary, wherein the target topic feature vocabulary stores each candidate topic and the feature word set corresponding to that candidate topic, one feature word set comprises the feature words corresponding to the same candidate topic, the target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is used to expand the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is divided from historical texts;
calculating a text feature vector corresponding to the text to be classified;
obtaining a topic feature vector corresponding to each candidate topic, wherein the topic feature vector is calculated based on the feature word set corresponding to the candidate topic;
calculating the similarity between the text feature vector and each topic feature vector, and acquiring a target topic feature vector corresponding to the text to be classified based on the similarities;
and taking the candidate topic corresponding to the target topic feature vector as the target topic of the text to be classified.
A computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, performs the following steps:
acquiring a text to be classified and a target topic feature vocabulary, wherein the target topic feature vocabulary stores each candidate topic and the feature word set corresponding to that candidate topic, one feature word set comprises the feature words corresponding to the same candidate topic, the target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is used to expand the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is divided from historical texts;
calculating a text feature vector corresponding to the text to be classified;
obtaining a topic feature vector corresponding to each candidate topic, wherein the topic feature vector is calculated based on the feature word set corresponding to the candidate topic;
calculating the similarity between the text feature vector and each topic feature vector, and acquiring a target topic feature vector corresponding to the text to be classified based on the similarities;
and taking the candidate topic corresponding to the target topic feature vector as the target topic of the text to be classified.
A computer program product. The computer program product comprises a computer program that, when executed by a processor, performs the following steps:
acquiring a text to be classified and a target topic feature vocabulary, wherein the target topic feature vocabulary stores each candidate topic and the feature word set corresponding to that candidate topic, one feature word set comprises the feature words corresponding to the same candidate topic, the target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is used to expand the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is divided from historical texts;
calculating a text feature vector corresponding to the text to be classified;
obtaining a topic feature vector corresponding to each candidate topic, wherein the topic feature vector is calculated based on the feature word set corresponding to the candidate topic;
calculating the similarity between the text feature vector and each topic feature vector, and acquiring a target topic feature vector corresponding to the text to be classified based on the similarities;
and taking the candidate topic corresponding to the target topic feature vector as the target topic of the text to be classified.
According to the text classification method, apparatus, computer device, storage medium, and computer program product, a text to be classified and a target topic feature vocabulary are acquired, wherein the target topic feature vocabulary stores each candidate topic and its corresponding feature word set, one feature word set comprises the feature words corresponding to the same candidate topic, the target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is used to expand the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is divided from historical texts. A text feature vector corresponding to the text to be classified is calculated; a topic feature vector, calculated from the corresponding feature word set, is obtained for each candidate topic; the similarity between the text feature vector and each topic feature vector is calculated; a target topic feature vector is acquired based on the similarities; and the candidate topic corresponding to the target topic feature vector is taken as the target topic of the text to be classified.
By constructing a target domain ontology, obtaining a target topic feature vocabulary from the target domain ontology and a training set, calculating the text feature vector of the text to be classified, calculating its similarity to the topic feature vector of each candidate topic in the target topic feature vocabulary, obtaining the target topic feature vector based on the similarities, and obtaining the target topic from that vector, the method expands the semantic features of the text and thereby improves the accuracy of short text classification.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a text classification method may be implemented;
FIG. 2 is a flow diagram that illustrates a method for text classification in one embodiment;
FIG. 3 is a flow chart illustrating the formation of a target domain ontology according to an embodiment;
FIG. 4 is a schematic flow chart of domain ontology construction based on a descriptor list in one embodiment;
FIG. 5 is a flow chart illustrating the formation of a target domain ontology according to an embodiment;
FIG. 6 is a schematic flow chart of training set and test set determination in one embodiment;
FIG. 7 is a flow diagram that illustrates the generation of a topic feature vocabulary in one embodiment;
FIG. 8 is a flow diagram that illustrates the determination of a vocabulary of target topic characteristics in one embodiment;
FIG. 9 is a schematic diagram of a process for target initial history text push in one embodiment;
FIG. 10 is a block diagram showing the structure of a text classification device in one embodiment;
FIG. 11 is a diagram of the internal structure of a computer device in one embodiment;
FIG. 12 is a diagram of an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the application and do not limit it.
The text classification method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be placed on the cloud or other network server. The terminal 102 is configured to obtain and display a target initial history text, where a target theme corresponding to the target initial history text is consistent with a theme basis corresponding to a target object. The server 104 is configured to obtain a text to be classified and a target topic feature vocabulary, calculate a text feature vector to be classified corresponding to the text to be classified, obtain a target topic feature vector based on the text feature vector to be classified and a topic feature vector corresponding to each candidate topic in the target topic feature vocabulary, and use a candidate topic corresponding to the target topic feature vector as a target topic corresponding to the text to be classified. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
In one embodiment, as shown in fig. 2, a text classification method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
Step S200, a text to be classified and a target topic feature vocabulary are acquired, wherein the target topic feature vocabulary stores each candidate topic and the feature word set corresponding to that candidate topic, one feature word set comprises the feature words corresponding to the same candidate topic, the target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is used to expand the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is divided from historical texts.
The text to be classified is a short text meeting preset conditions. The target topic feature vocabulary is a vocabulary that contains topics, the feature words corresponding to each topic, and the weight of each feature word. A feature word set is a set of words that can express the same topic, each word having a corresponding weight. The target domain ontology describes the concepts of the target domain and the relationships between them, providing a vocabulary of the domain's concepts and relations; it covers both Chinese and English terms. The topic feature vocabulary is the vocabulary whose feature words have not yet been expanded with the target domain ontology. Feature words are words that can express a certain class of topic. The training set is the portion of the historical texts allocated for training. Historical texts are short texts meeting the preset conditions that have been collected and stored in a database.
Specifically, a target domain ontology must be constructed before the target topic feature vocabulary is obtained. Two construction methods can be used: building the target domain ontology from scratch, or building it from an existing descriptor list. In addition, a topic feature vocabulary is first obtained from the training set; the target domain ontology then provides instance words, the similarity between each instance word and the feature words of each candidate topic is calculated, each instance word is assigned to a topic based on that similarity and added to it as a feature-related word, and the topic feature vocabulary thus expanded becomes the target topic feature vocabulary.
Step S202, calculating the characteristic vector of the text to be classified corresponding to the text to be classified.
The text feature vector represents the text to be classified as a mathematical vector whose components are the weights of the words in the text.
Specifically, to calculate the similarity between the text feature vector and the topic feature vector of each candidate topic, the word weights of the text to be classified must first be computed; a vector space model is used to represent the text.
Step S204, a topic feature vector corresponding to each candidate topic is obtained, the topic feature vector being calculated based on the feature word set corresponding to the candidate topic.
A candidate topic is a topic contained in the target topic feature vocabulary. A topic feature vector represents a topic as a mathematical vector comprising the topic's feature words and their weights.
Specifically, before the similarity between the text feature vector and each topic feature vector can be calculated, the data required for the calculation must be acquired; this step mainly prepares the topic feature vectors for the subsequent similarity calculation.
Step S206, calculating the similarity between the text feature vector and each topic feature vector, and acquiring a target topic feature vector corresponding to the text to be classified based on the similarities.
The similarity is a measure that comprehensively evaluates how close in meaning the text to be classified is to each topic. The target topic feature vector is the topic feature vector of the topic whose meaning is closest to the text to be classified.
Specifically, the similarity between the text feature vector and each topic feature vector is calculated, and the topic feature vector with the largest similarity is selected as the target topic feature vector for the text to be classified. The similarity is computed with the word similarity method provided by HowNet, a common-sense knowledge base that takes the concepts denoted by Chinese and English words as its description objects and the relationships between concepts and their attributes as its basic content. The calculation is shown in formula (1), where C_i is the feature vector of topic i, T_j is the feature vector of complaint document j, w_ik is the weight of a feature word in the topic feature vector, and v_jk is the corresponding weight in the text feature vector:

sim(C_i, T_j) = ( Σ_k w_ik · v_jk ) / ( √(Σ_k w_ik²) · √(Σ_k v_jk²) )    (1)
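Formula (1) is the standard cosine similarity over sparse weight vectors. A minimal sketch, with vectors stored as {word: weight} dictionaries (the function names are illustrative):

```python
import math

def cosine_similarity(topic_vec, text_vec):
    # Formula (1): dot product of matching feature-word weights divided by
    # the product of the two vectors' Euclidean norms.
    dot = sum(w * text_vec.get(t, 0.0) for t, w in topic_vec.items())
    norm_topic = math.sqrt(sum(w * w for w in topic_vec.values()))
    norm_text = math.sqrt(sum(w * w for w in text_vec.values()))
    return dot / (norm_topic * norm_text) if norm_topic and norm_text else 0.0

def target_topic(text_vec, topic_vecs):
    # The candidate topic whose feature vector has the largest similarity
    # to the text feature vector is the target topic.
    return max(topic_vecs, key=lambda t: cosine_similarity(topic_vecs[t], text_vec))
```

The HowNet word similarity enters earlier, when building the expanded feature word sets; at this step only the vector weights are compared.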
Step S208, acquiring the candidate topic corresponding to the target topic feature vector as the target topic of the text to be classified.
The target topic is the topic whose meaning is most similar to the text to be classified. It can be used to extract semantic information from the text, and corresponding historical texts can also be sent to a terminal based on it.
In the text classification method, a text to be classified is acquired and a target topic feature vocabulary is acquired, wherein each candidate topic and its corresponding feature word set are stored in the target topic feature vocabulary, and one feature word set comprises the feature words corresponding to the same candidate topic. The target topic feature vocabulary is obtained based on a target field ontology and a topic feature vocabulary; the target field ontology is used for expanding the feature words of the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is obtained by dividing historical texts. A text feature vector to be classified corresponding to the text to be classified is calculated, and a topic feature vector corresponding to each candidate topic is obtained, the topic feature vector being calculated based on the feature word set corresponding to that candidate topic. The similarity between the text feature vector to be classified and each topic feature vector is calculated, the target topic feature vector corresponding to the text to be classified is obtained based on the similarity, and the candidate topic corresponding to the target topic feature vector is obtained as the target topic corresponding to the text to be classified.
The method comprises the steps of constructing a target field ontology and obtaining a target topic feature vocabulary based on the target field ontology and a training set. The text feature vector to be classified corresponding to the text to be classified is calculated; based on this vector and the topic feature vector corresponding to each candidate topic in the target topic feature vocabulary, the similarity between the text feature vector to be classified and each topic feature vector is calculated. The target topic feature vector corresponding to the text to be classified is obtained based on the similarity, and the target topic corresponding to the text to be classified is obtained based on that vector. In this way the semantic features of the text are expanded, so the accuracy of short text classification is improved.
In one embodiment, as shown in fig. 3, before step S200, the method further includes:
Step S300, acquiring a target field narrative word list, and determining the application purpose of the field ontology.
Wherein, the narrative word list refers to a vocabulary-control tool used to collect the words that can represent a specific subject field and to arrange them in a specific structure that displays the relationships between them. The field ontology refers to a theory that can be used to describe the concepts in a specific field and the relationships between those concepts, providing word lists of the concepts, and the inter-concept relationships, that dominate in a certain professional subject field. The application purpose refers to the range and scenarios in which the field ontology is applied and the goal to be achieved.
Step S302, converting the narrative words in the target field narrative word list into concepts in the field ontology based on the application purpose of the field ontology, so as to obtain target concepts.
Wherein, a target concept refers to a concept, converted from a narrative word, that is related to the target field.
Specifically, the original narrative word list of the target field may contain narrative words that do not conform to the related concepts of the target field. To ensure that all concepts contained in the constructed field ontology conform to the application range of the target field ontology, the narrative words in the target field narrative word list need to be converted into concepts in the field ontology.
Step S304, determining the level relations among the target concepts based on the hierarchical relations among the narrative words in the target field narrative word list, so as to obtain the target level relations.
Wherein, a hierarchical relation reflects the degree of breadth of the narrative-word concepts, similar to an inclusion relationship. A level relation refers to a relationship between target concepts divided according to the degree of breadth of the concepts. The target level relations refer to the level relations among the target concepts corresponding to the target field.
Specifically, determining and refining the important concepts related to the target field in the narrative word list and finding the relationships between those concepts are important steps in constructing the target field ontology, and the hierarchical relations between the target concepts are the basis for clarifying the concept structure.
Step S306, adding attributes to the target concepts based on the meaning-limited words and annotations of the narrative words in the target field narrative word list, so as to obtain target attributes.
Wherein, a meaning-limited word refers to a word that can be used to limit the meaning of an expression, such as by category, quantity, or non-quantity. Annotations refer to information that interprets or supplements a word. Target attributes refer to attributes that can describe the internal structure between the target concepts of the target field ontology.
Specifically, when describing the target field, it is not enough to describe only its concepts; the internal structure of those concepts must also be described. The target attributes are used to describe the internal structure of the target field concepts, so concept attributes are added to the target concepts.
Step S308, adding inter-word relations to the target concepts based on the inter-word relations among the narrative words in the target field narrative word list, so as to obtain target inter-word relations.
Wherein an inter-word relationship refers to a relationship that exists between narrative words. The relationship among the target words refers to the relationship among related narrative words in a narrative word list obtained according to target field analysis.
Specifically, in order to describe the target field more fully, the relationships between words are also analyzed, and the constraint relationships between words (or the classes corresponding to the words) are found, so as to better construct the target field ontology.
Step S310, forming a target concept model based on the target concepts, the target level relations, the target attributes, and the target inter-word relations.
Wherein the target conceptual model refers to an initial domain ontology to which no instances have been added.
Step S312, creating an instance corresponding to the target concept based on the target concept model, so as to obtain a target instance.
Wherein, the target instances refer to the most accurately expressive utterances created for each narrative word under the application scenario of the target field.
Step S314, forming a target field ontology based on the target concept model and the target instances.
Specifically, a flowchart for building a domain ontology based on a narrative word list is shown in FIG. 4.
In this embodiment, the target field ontology is constructed from the original narrative word list of the target field, which greatly reduces the workload of constructing the target field ontology, since the important terms of the target field do not need to be collected again. In addition, the narrative word list helps clarify the relationships between the important terms, thereby improving the usefulness of the target field ontology.
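The elements assembled in steps S300 to S314 can be sketched as a small data model (the class and field names below are our own illustrative choices, not terminology fixed by the application): concepts converted from narrative words, a target level relation (parent links), target attributes, target inter-word relations, and instances attached to each concept.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Concept:
    """One target concept converted from a narrative word."""
    name: str
    parent: Optional[str] = None                    # target level relation
    attributes: dict = field(default_factory=dict)  # from meaning-limited words/annotations
    related: set = field(default_factory=set)       # target inter-word relations
    instances: list = field(default_factory=list)   # target instances (utterances)

class DomainOntology:
    """Target concept model; once instances are attached it plays the
    role of the target field ontology."""
    def __init__(self):
        self.concepts = {}

    def add_concept(self, name, parent=None):
        self.concepts[name] = Concept(name, parent=parent)
        return self.concepts[name]

    def children(self, name):
        """Concepts directly below `name` in the target level relation."""
        return [c.name for c in self.concepts.values() if c.parent == name]
```

Usage, under the assumed credit-complaint scenario:

```python
ont = DomainOntology()
ont.add_concept("complaint")
fee = ont.add_concept("fee complaint", parent="complaint")
fee.instances.append("charged an unexpected service fee")
```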
In one embodiment, as shown in fig. 5, before step S200, the method further includes:
Step S500, acquiring an application range of the field ontology, acquiring a term list corresponding to the field ontology based on the application range, and establishing a concept structure of the field ontology based on the term list, so as to obtain a target concept structure of the field ontology.
The application range refers to a range of fields in which the target domain ontology to be constructed can be used. The term manifest refers to a listing of words that have a relationship to the target domain ontology. The concept structure refers to a relationship structure constructed by subdividing each term into categories according to the broad degree of the term concept through a top-down method. The target concept structure refers to a relationship structure constructed by performing category subdivision on each term in the target field according to the concept broad degree of each term in the target field through a top-down method.
Specifically, before the target field ontology is constructed, its application range needs to be determined, and then the important terms related to the target field ontology are collected according to that application range. Because this application constructs a credit field ontology, the collected terms mainly focus on terms related to credit complaint texts, and the terms are mainly collected from channels such as documents and patents related to credit, standard documents of countries and institutions, consultation with financial experts, and messages and complaints on related websites. Then, a corresponding term list is drawn up from the collected important terms; the term list is analyzed to clarify the relationships among the terms, terms whose concepts are irrelevant to the target field are removed, and the determined concept relationships among the terms are formally encoded. Finally, each term is subdivided into categories, starting from the most general concept, using a top-down method, thereby constructing the target concept structure.
Step S502, defining the concept attribute and the setting class constraint of the domain ontology based on the target concept structure to obtain the target concept attribute and the target class constraint of the domain ontology, and forming a target concept model of the domain ontology based on the target concept structure, the target concept attribute and the target class constraint.
Wherein, the concept attributes refer to attributes capable of describing the internal structure between the field ontology concepts; attributes are divided into data attributes and object attributes, where data attributes refer to the characteristics of the corresponding data and object attributes refer to the characteristics that distinguish the current object from other objects. Class constraints refer to constraints imposed on a class through its attributes. The target concept attributes refer to attributes capable of describing the internal structure between the concepts of the target field ontology. Target class constraints refer to constraints imposed on the individual classes in the target field ontology through their attributes. The target concept model refers to the initial field ontology to which no instances have been added.
Specifically, the attributes of the target field ontology can describe the internal structure between concepts, so to create the target field ontology, corresponding attributes need to be defined for the target concepts. After a general attribute is defined on a parent concept, the child concepts inherit all the attributes of the parent concept; a concept may also define attributes specific to itself, that is, its private attributes. Constraints are then set on classes through the defined attributes, that is, for some classes the requirement to possess certain attributes is set as a class constraint, and the corresponding child concepts inherit the constraints of their parent concepts. In this way, the target concept attributes and target class constraints corresponding to the target field ontology are obtained, and a preliminary target field ontology, that is, a target concept model, is formed based on the target concept structure, the target concept attributes, and the target class constraints.
Step S504, establishing corresponding examples of each class in the domain ontology based on the target conceptual model, obtaining target examples of the domain ontology, and forming the target domain ontology based on the target conceptual model and the target examples.
Wherein, an instance refers to an expressive utterance created for a class according to a specific application scenario, one that expresses the concept most accurately. The target instances refer to the most accurately expressive utterances created for the classes of the target field based on the target field's application scenario.
Specifically, the possible associations between classes and instances are also among the elements of the field ontology. The corresponding instances of each class need to be created according to the specific application to obtain the most accurately expressive instances, so that the target field ontology can be constructed from the instances and the target concept model.
In this embodiment, by autonomously collecting the important terms of the target field, the relevance of the collected terms to the application range of the target field can be better ensured. Everything used in the construction of the target field ontology is built from scratch, so the created target field ontology has a certain novelty.
In one embodiment, as shown in fig. 6, after step S314 or S504, the method further includes:
step S600, acquiring a data amount of the history text.
Step S602, when the data volume is larger than a first threshold value, taking two thirds of texts of the historical texts as a training set, and taking one third of texts of the historical texts as a test set to obtain the training set and the test set.
Wherein the first threshold refers to a boundary for judging the historical text to be large data.
Step S604, when the data volume is smaller than or equal to a first threshold value and larger than a second threshold value, dividing the historical text into a preset number of sample sets with the same size, taking each sample set as a test set in sequence, and combining the sets of the sample sets except the test set into a training set to obtain the training set and the test set.
Wherein, the second threshold refers to the boundary for judging the historical text to be small data; when the data volume lies between the first threshold and the second threshold, the historical text is treated as medium data. The preset number refers to a manually specified number.
Step S606, when the data amount is smaller than a second threshold value, a sample set with the size consistent with that of the historical text is repeatedly and randomly extracted from the historical text to serve as a training set, data which do not appear in the training set in the historical text serve as a test set, the training set and the test set are obtained, and the first threshold value is larger than the second threshold value.
Wherein a sample set refers to a set containing samples repeatedly randomly drawn from a historical text.
Step S608, the test set is used for testing the accuracy of the target topic corresponding to the text to be classified.
In this embodiment, by specifying the method for dividing the training set and the test set according to the size of the data volume of the historical text, it is possible to avoid a large influence on the classification result due to too many or too few historical texts, which is beneficial to improving the accuracy of text classification.
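The three data-volume-dependent splitting strategies of steps S600 to S606 (hold-out for large data, k-fold style partitioning for medium data, bootstrap sampling for small data) can be sketched as follows; the threshold values, the fold count `k`, and the function name are illustrative assumptions, not values fixed by the application:

```python
import random

def split_dataset(texts, first_threshold, second_threshold, k=5, seed=0):
    """Return a list of (training set, test set) pairs.

    - data volume > first_threshold: 2/3 hold-out train, 1/3 test (S602)
    - between the two thresholds: k same-size sample sets, each used
      as the test set in turn (S604)
    - data volume < second_threshold: bootstrap sample of the same size
      as the historical text; unseen texts form the test set (S606)
    """
    n = len(texts)
    rng = random.Random(seed)
    if n > first_threshold:
        cut = (2 * n) // 3
        return [(texts[:cut], texts[cut:])]
    if n > second_threshold:
        folds = [texts[i::k] for i in range(k)]
        return [(sum(folds[:i] + folds[i + 1:], []), folds[i]) for i in range(k)]
    train = [rng.choice(texts) for _ in range(n)]   # draw with replacement
    test = [t for t in texts if t not in train]
    return [(train, test)]
```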
In one embodiment, as shown in fig. 7, after step S602 or S604 or S606, the method further includes:
and S700, carrying out preprocessing operation of word segmentation and word stop removal on the training set to obtain a target historical text.
Wherein, word segmentation refers to the operations of scanning out the words in a sentence and re-segmenting long words. Stop words refer to certain words or phrases that are automatically filtered out before or after processing text, in order to save storage space and improve search efficiency in information retrieval. Stop-word removal is the operation of removing the stop words contained in the historical text after the word segmentation operation has been performed on it. The target historical text refers to the text obtained after the word segmentation and stop-word removal operations are performed on the training set.
Specifically, word segmentation is carried out with the ICTCLAS segmentation tool developed by the Chinese Academy of Sciences, and after segmentation, words that occur frequently but carry little practical meaning are removed from the training set according to a Chinese stop-word list recognized on the network; these words mainly comprise adverbs, function words, and modal particles.
Step S702, calculating the weight of each word in the target historical text, and performing text representation on the target historical text based on the weight.
Wherein, the weight value refers to the TF-IDF value of each document in a certain theme of each word. Text representation refers to the operation of representing words in text in the form of mathematical vectors.
Specifically, the calculation is performed using the TF-IDF method. The TF-IDF calculation formula is shown in formula (2), where $w_{ij}$ represents the weight of the $j$-th word of the $i$-th text, $\mathrm{tf}_{ij}$ indicates the number of times the $j$-th word of the $i$-th text appears in the $i$-th text, $\mathrm{df}_{ij}$ is the number of texts, among all the texts corresponding to the selected topic, in which the $j$-th word of the $i$-th text appears, and $N$ is the total number of texts in the target historical text corresponding to the selected topic:

$$w_{ij} = \mathrm{tf}_{ij} \cdot \log\frac{N}{\mathrm{df}_{ij}} \tag{2}$$

After the weight of each word is calculated by the TF-IDF method, a vector space model is adopted to represent the target historical text, with each word represented in the form of a mathematical vector. In addition, considering that words in a text title highlight the text's topic more than words in the text body, when the weights are calculated, words in the text body are counted normally according to the number of times they appear in the text, while words in the title are counted $\lambda$ times per occurrence ($\lambda$ is a manually set parameter). The value of $\lambda$ can be determined by first moderately increasing the weight of the title words, analyzing the classification effect of the current $\lambda$ value on part of the data set, and then adjusting $\lambda$ according to the classification effect.
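A minimal sketch of this TF-IDF weighting with the title factor (the function name, document layout as `(title_words, body_words)` pairs, and the placeholder value `lam=2.0` are our own assumptions; the application determines λ empirically):

```python
import math
from collections import Counter

def tfidf_weights(docs, lam=2.0):
    """Compute w_ij = tf_ij * log(N / df_ij) for every word of every
    document under one selected topic.  Each doc is a pair
    (title_words, body_words); each title occurrence counts lam times."""
    N = len(docs)
    df = Counter()                       # document frequency per word
    for title, body in docs:
        df.update(set(title) | set(body))
    weights = []
    for title, body in docs:
        tf = Counter(body)               # body words counted normally
        for w in title:                  # title words counted lambda times
            tf[w] += lam
        weights.append({w: tf[w] * math.log(N / df[w]) for w in tf})
    return weights
```

Note that a word appearing in every document of the topic gets weight 0, since log(N/N) = 0; smoothing variants exist but are not described in the application.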
Step S704, sequentially selecting a topic from the candidate topics corresponding to the target historical text; calculating the total weight and the weight average value corresponding to each word under the selected topic; sorting the words in descending order of total weight; selecting, in order, a preset number of the sorted words as the feature words of the selected topic, with the weight average value corresponding to each feature word taken as that feature word's weight; and obtaining the topic feature vector of the selected topic based on the feature words and their weights, until each candidate topic of the target historical text has obtained a corresponding topic feature vector, each of which is taken as a first topic feature vector.
Wherein, the total weight refers to the sum of the weights of a word across the documents in the target historical text corresponding to the selected topic. The weight average value refers to the average of the weights of a word across those documents. A first topic feature vector is a topic feature vector that still contains many feature items unrelated to the topic, since those feature items have not yet been filtered out.
Specifically, the weight corresponding to each word needs to be calculated in order to construct the topic feature vector. The feature words corresponding to each topic are screened out based on the words' weights: which words serve as feature words is determined by their total weights, and the average of a feature word's weights across the documents is used as that feature word's weight. The topic feature vector is constructed from the feature words and their weights, but it still contains many feature items irrelevant to the topic, so it cannot yet serve as the topic feature vector corresponding to each candidate topic in the topic feature vocabulary; feature filtering processing must first be performed on the first topic feature vectors.
Step S706, performing feature filtering processing on each first topic feature vector to obtain a second topic feature vector corresponding to each candidate topic, and obtaining the topic feature vocabulary based on the second topic feature vector, wherein feature word sets corresponding to each candidate topic and candidate topic in the target historical text are stored in the topic feature vocabulary, and each feature word set comprises feature words and weight values corresponding to the feature words.
The second theme feature vector is a theme feature vector obtained after feature filtering is carried out on the first theme feature vector.
Specifically, in order to avoid the influence on the classification result of the many irrelevant feature items present in the first topic feature vectors, feature filtering processing needs to be performed on them. For the constructed first topic feature vectors, feature words appearing in the first topic feature vector of the currently selected topic that also appear in the first topic feature vectors corresponding to any other three candidate topics are discarded, so that the feature items of each first topic feature vector are independent of and non-overlapping with the feature items of every other first topic feature vector; the feature items of the second topic feature vectors are thus mutually independent, achieving the purpose of feature dimensionality reduction. A feature item comprises a feature word and the weight corresponding to that feature word.
In this embodiment, a training set is first subjected to word segmentation and word deactivation preprocessing to obtain a target historical text, then a weight of each word in the target historical text is calculated and text representation is performed, feature words are screened out based on the weight of each word, and finally a topic feature vector is constructed based on the feature words and the weights of the feature words to obtain a topic feature vocabulary table.
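The feature-word selection of step S704 (rank by total weight, keep a preset number, weight each kept word by its mean weight) can be sketched as below; `top_n` stands in for the preset number, which the application leaves unspecified, and the mean is taken over the documents of the topic that contain the word, which is one plausible reading of "weight average value":

```python
from collections import defaultdict

def build_topic_vector(doc_weights, top_n=20):
    """Given the TF-IDF weight dicts of every document under one candidate
    topic, rank words by total weight, keep the top_n as feature words,
    and use each word's mean weight as its weight in the topic vector."""
    total = defaultdict(float)
    count = defaultdict(int)
    for wdict in doc_weights:
        for word, w in wdict.items():
            total[word] += w
            count[word] += 1
    ranked = sorted(total, key=total.get, reverse=True)[:top_n]
    return {word: total[word] / count[word] for word in ranked}
```

The resulting first topic feature vectors would then still need the feature filtering of step S706 before entering the topic feature vocabulary.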
In one embodiment, as shown in fig. 8, after step S706, the method further includes:
and step S800, acquiring a threshold, analyzing the target field body, and analyzing to obtain the example words with negative information in the target field body.
Wherein, the threshold value refers to the lowest value of the corresponding similarity when determining the similarity between the words. Example words refer to words in an example of a target domain ontology.
Specifically, expanding the topic feature vocabulary requires feature-related words provided by the target field ontology, and since the method is directed at short complaint texts, example words carrying negative information are selected.
Step S802, a theme is selected from the candidate themes of the theme characteristic vocabulary in sequence, the similarity between each characteristic word and each example word in the characteristic word set corresponding to the selected theme is calculated, and the maximum similarity corresponding to each example word is selected and obtained based on each similarity.
Specifically, each example word and each feature word are subjected to similarity calculation, wherein the similarity calculation method adopts a word similarity method provided by HowNet, and a comparison result of the similarity is used for generating example words similar to the feature words as feature related words.
Step S804, based on a comparison result between the maximum similarity corresponding to each example word and the threshold, taking the example word with the maximum similarity greater than the threshold in each example word as a feature related word, adding each feature related word to the feature word set corresponding to the selected topic to obtain an updated feature word set, where a weight of each feature related word is equal to a weight of the feature word corresponding to the corresponding example word, and obtaining a first target topic feature vector corresponding to the selected topic based on the updated feature word set.
Wherein, the characteristic related words refer to example words most similar to a certain characteristic word in the selected subject. The first target theme feature vector is a theme feature vector calculated after the corresponding feature words and the weight values of the feature words are updated according to the themes.
Specifically, according to the definition of the threshold, the example words similar to the feature words in the original subjects are selected to expand the feature word set corresponding to each subject, and the classification of the short text is more accurate and efficient by expanding the subject feature vocabulary.
Step S806, when each candidate topic in the topic feature vocabulary has been selected, obtaining a first target topic feature vector corresponding to each candidate topic, performing feature filtering processing on each first target topic feature vector to obtain a target topic feature vector corresponding to each candidate topic, and obtaining a target topic feature vocabulary based on each target topic feature vector.
The target topic feature vector is a topic feature vector obtained after feature filtering is performed on the first target topic feature vector, and the target topic feature vector is used for calculating the similarity with the text feature vector to be classified.
Specifically, feature filtering processing needs to be performed on the first target topic feature vectors to prevent the many irrelevant feature items in them from affecting the classification result. Feature words appearing in a first target topic feature vector that also appear in any other three first target topic feature vectors are likewise discarded, ensuring the mutual independence of the feature items among the target topic feature vectors and improving the accuracy of text classification.
In this embodiment, the example words of the extended topic feature vocabulary are provided by the target domain ontology, then the example words meeting the preset conditions among the example words are screened out as feature related words based on a word similarity method provided by HowNet, and the feature related words are added into the corresponding topic feature vectors, so that the target topic feature vocabulary is obtained, which makes the calculation of the similarity more accurate, and improves the accuracy of short text classification.
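The threshold-gated expansion of steps S800 to S804 can be sketched as follows; the function name and toy data are our own, and `similarity` stands in for the HowNet word-similarity method, which is not reimplemented here:

```python
def expand_feature_set(feature_words, instance_words, similarity, threshold):
    """For each negative-information example word from the ontology, find
    its best-matching feature word under the selected topic; if that
    maximum similarity exceeds the threshold, add the example word as a
    feature-related word, reusing the matched feature word's weight."""
    expanded = dict(feature_words)
    for inst in instance_words:
        best = max(feature_words, key=lambda fw: similarity(fw, inst))
        if similarity(best, inst) > threshold and inst not in expanded:
            expanded[inst] = feature_words[best]   # weight copied over
    return expanded
```

With a toy similarity function that only relates "credit" to "loan", the feature set gains "credit" at the weight of "loan" while unrelated example words are left out.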
In one embodiment, as shown in fig. 9, after step S206, the method further includes:
step S900, obtaining a theme basis corresponding to the target object;
Wherein, the theme basis refers to the theme of the texts that the target object needs.
Step S902, acquiring a target initial historical text based on the theme basis corresponding to the target object, wherein the target theme corresponding to the target initial historical text is consistent with the theme basis, and sending the target initial historical text to the terminal corresponding to the target object.
Wherein the target initial history text refers to text stored in the database without any processing of the text.
Specifically, the database stores data information for performing topic classification on the historical texts, the server can search a target topic consistent with the topic basis based on the topic basis, and then the historical texts corresponding to the target topic are sent to the terminal.
In the embodiment, the historical text corresponding to the target theme consistent with the theme basis is acquired according to the theme basis, so that the target object can acquire the information which the target object wants to know more quickly and conveniently, and the experience of the target object is improved.
In one embodiment, instead of the HowNet generic ontology mentioned in the methods of the present application, it is possible to use the WordNet, EuroWordNet, or CoreNet generic ontologies, whose semantic relationships serve as a bridge providing a function analogous to knowledge in the human brain, linking classes and unlabeled documents and thus achieving automatic text classification.
In one embodiment, the important terms required for constructing the field ontology are collected through channels such as documents and patents related to the application field of the ontology, standard documents of countries and institutions, consultation with financial experts, and messages and complaints on related websites. A term list is drawn up from these important terms, the relationships among the terms are analyzed and clarified, and the concept structure of the field ontology is constructed. In addition, the concept attributes of the field ontology are defined and class constraints are set, corresponding instances of each class are created based on the concept structure, the concept attributes, and the class constraints, and finally the target field ontology, namely a credit field ontology, is constructed. Then, complaint texts collected by merchants are used as historical texts, which are preprocessed by word segmentation and stop-word removal to obtain the target historical text. A topic feature vocabulary related to credit is constructed by training on the historical texts, and the target topic feature vocabulary is obtained based on the topic feature vocabulary and the credit field ontology, where the credit field ontology provides the feature-related words for expanding the topic feature vocabulary. When a new text is obtained and the topic to which it belongs needs to be known, the feature vector of the new text can be computed and used as the text feature vector to be classified; the similarities between this vector and the topic feature vectors corresponding to the candidate topics in the target topic feature vocabulary are then calculated, and the topic corresponding to the topic feature vector with the maximum similarity is selected as the target topic of the new text, so that important information can be obtained from the new text more quickly.
Once the important information has been distinguished from the complaint texts and its corresponding topics are known, problems can be found in the complaint texts, making it convenient for merchants to devise corresponding solutions, and potential vulnerabilities of credit products can be discovered from a large number of complaint texts, which can also serve as an early warning for financial credit. In addition, the complaint texts related to a topic that customers want to know about can be retrieved according to that topic, so that the desired information is obtained from the complaint texts and the merchants' responses to them, reducing the time users spend searching through massive amounts of information and improving the user experience. The method adopts the field ontology to expand the semantic features of the text and overcomes the shortage of inherent features in short texts, thereby improving short-text classification performance and greatly improving the accuracy of short-text classification.
Based on the same inventive concept, an embodiment of the present application further provides a text classification apparatus for implementing the above text classification method. The solution provided by the apparatus is similar to that described for the method, so for the specific limitations in the one or more embodiments of the text classification apparatus provided below, reference can be made to the limitations of the text classification method above, which are not repeated here.
In one embodiment, as shown in fig. 10, a text classification apparatus is provided, including: a data acquisition module 1000, a to-be-classified text feature vector generation module 1002, a topic feature vector acquisition module 1004, a target topic feature vector determination module 1006 and a target topic determination module 1008, wherein:
the data acquisition module 1000 is configured to acquire a text to be classified and acquire a target topic feature vocabulary, where the target topic feature vocabulary stores each candidate topic and the feature word set corresponding to that candidate topic, and one feature word set includes the feature words corresponding to the same candidate topic. The target topic feature vocabulary is obtained based on a target domain ontology and a topic feature vocabulary, the target domain ontology is configured to provide feature words for expanding the topic feature vocabulary, the topic feature vocabulary is obtained based on a training set, and the training set is obtained by dividing historical texts.
The to-be-classified text feature vector generation module 1002 is configured to calculate the text feature vector to be classified corresponding to the text to be classified.
A topic feature vector obtaining module 1004, configured to obtain a topic feature vector corresponding to each candidate topic, where the topic feature vector is obtained by calculation based on a feature word set corresponding to the candidate topic.
A target topic feature vector determination module 1006, configured to calculate similarity between the feature vector of the text to be classified and each topic feature vector, and obtain a target topic feature vector corresponding to the text to be classified based on the similarity.
A target topic determining module 1008, configured to obtain a candidate topic corresponding to the target topic feature vector as a target topic corresponding to the text to be classified.
In one embodiment, the text classification apparatus further includes a first target domain ontology generating module 1010, where the first target domain ontology generating module 1010 is configured to: obtain a target domain narrative word list and determine the application purpose of a domain ontology; convert the narrative words in the target domain narrative word list into concepts in the domain ontology based on the application purpose of the domain ontology to obtain target concepts; determine the hierarchical relationship among the target concepts based on the hierarchical relationship among the narrative words in the target domain narrative word list to obtain a target hierarchical relationship; add attributes to the target concepts based on the meaning-limiting words and annotations of the narrative words in the target domain narrative word list to obtain target attributes; add inter-word relationships to the target concepts based on the inter-word relationships among the narrative words in the target domain narrative word list to obtain target inter-word relationships; form a target concept model based on the target concepts, the target hierarchical relationship, the target attributes and the target inter-word relationships; create instances corresponding to the target concepts based on the target concept model to obtain target instances; and construct a target domain ontology based on the target concept model and the target instances.
In one embodiment, the text classification apparatus further includes a second target domain ontology generating module 1012, where the second target domain ontology generating module 1012 is configured to: obtain the application range of a domain ontology, obtain a term list corresponding to the domain ontology based on the application range, and establish the concept structure of the domain ontology based on the term list to obtain a target concept structure; define the concept attributes and set the class constraints of the domain ontology based on the target concept structure to obtain target concept attributes and target class constraints, and form a target concept model of the domain ontology based on the target concept structure, the target concept attributes and the target class constraints; and create corresponding instances of each class in the domain ontology based on the target concept model to obtain target instances, and form the target domain ontology based on the target concept model and the target instances.
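The ontology construction described by the two modules above — concepts with hierarchical relationships, attributes, class constraints, and per-class instances — can be captured with a minimal data structure. The sketch below is illustrative only: the class names and the sample credit-domain concepts and instance words are assumptions, not part of the described method.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Concept:
    name: str
    parent: Optional[str] = None                     # target hierarchical relationship
    attributes: dict = field(default_factory=dict)   # target concept attributes
    constraints: list = field(default_factory=list)  # target class constraints
    instances: list = field(default_factory=list)    # instances created for this class

class DomainOntology:
    def __init__(self):
        self.concepts = {}

    def add_concept(self, concept):
        self.concepts[concept.name] = concept

    def add_instance(self, concept_name, instance_word):
        # Create an instance corresponding to a target concept.
        self.concepts[concept_name].instances.append(instance_word)

    def all_instance_words(self):
        # Instance words later matched against feature words to expand the vocabulary.
        return [w for c in self.concepts.values() for w in c.instances]

# Hypothetical fragment of a credit domain ontology.
onto = DomainOntology()
onto.add_concept(Concept("complaint"))
onto.add_concept(Concept("collection_behavior", parent="complaint",
                         attributes={"severity": "string"}))
onto.add_instance("collection_behavior", "violent collection")
onto.add_instance("collection_behavior", "harassing calls")
print(onto.all_instance_words())  # → ['violent collection', 'harassing calls']
```

In practice such an ontology would be built in a dedicated editor (e.g. Protégé/OWL), but the same information — concepts, hierarchy, attributes, constraints, instances — is what the modules above assemble.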
In one embodiment, the text classification apparatus further includes a training set and test set determination module 1014, where the training set and test set determination module 1014 is configured to: obtain the data volume of the historical texts; when the data volume is larger than a first threshold, take two thirds of the historical texts as the training set and the remaining one third as the test set; when the data volume is smaller than or equal to the first threshold and larger than a second threshold, divide the historical texts into a preset number of equally sized sample sets, take each sample set in turn as the test set, and combine the sample sets other than the test set into the training set; and when the data volume is smaller than the second threshold, repeatedly and randomly draw samples from the historical texts to form a training set of the same size as the historical texts, and take the data of the historical texts that do not appear in the training set as the test set, where the first threshold is larger than the second threshold. The test set is used for testing the accuracy of the target topic corresponding to the text to be classified.
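The three-way split of module 1014 — hold-out for large data volumes, k-fold cross-validation for medium volumes, and bootstrap sampling for small volumes — can be sketched as follows. The threshold values, the number of folds, and the `split_dataset` name are illustrative assumptions; reading the repeated random extraction as sampling with replacement is also an assumption, consistent with the "data not appearing in the training set" wording.

```python
import random

def split_dataset(texts, first_threshold=10000, second_threshold=1000, k=5, seed=0):
    """Return a list of (training_set, test_set) pairs per the embodiment's rules."""
    rng = random.Random(seed)
    n = len(texts)
    if n > first_threshold:
        # Hold-out: two thirds for training, the remaining third for testing.
        cut = 2 * n // 3
        return [(texts[:cut], texts[cut:])]
    if n > second_threshold:
        # k-fold: each equal-sized sample set serves as the test set in turn,
        # with the remaining sample sets combined into the training set.
        folds = [texts[i::k] for i in range(k)]
        return [(sum(folds[:i] + folds[i + 1:], []), folds[i]) for i in range(k)]
    # Bootstrap: draw n samples with replacement as the training set;
    # texts never drawn form the test set.
    train = [rng.choice(texts) for _ in range(n)]
    chosen = set(train)
    test = [t for t in texts if t not in chosen]
    return [(train, test)]
```

For example, 15,000 texts produce a single 10,000/5,000 hold-out split, 2,000 texts produce five cross-validation folds, and 30 texts produce one bootstrap split whose test set contains only texts absent from the training set.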
In one embodiment, the text classification apparatus further includes a topic feature vocabulary generating module 1016, where the topic feature vocabulary generating module 1016 is configured to: perform preprocessing operations of word segmentation and stop word removal on the training set to obtain target historical texts; calculate the weight of each word in the target historical texts and represent the target historical texts as text vectors based on the weights; select a topic in turn from the candidate topics corresponding to the target historical texts, calculate the total weight and the average weight corresponding to each word under the selected topic, sort the words in descending order of total weight, select a preset number of the sorted words in turn as the feature words of the selected topic, take the average weight corresponding to each feature word as the weight of that feature word, and obtain the topic feature vector of the selected topic based on the feature words and their weights, until each candidate topic of the target historical texts has obtained a corresponding topic feature vector, each topic feature vector being taken as a first topic feature vector; and perform feature filtering on each first topic feature vector to obtain a second topic feature vector corresponding to each candidate topic, and obtain the topic feature vocabulary based on the second topic feature vectors, where the topic feature vocabulary stores each candidate topic in the target historical texts and the feature word set corresponding to that candidate topic, and the feature word set includes the feature words and the weight values corresponding to the feature words.
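The weight calculation and per-topic feature-word selection of module 1016 can be sketched as follows. TF-IDF is assumed as the weighting scheme (the embodiment only speaks of "weights"), and the top-N cut-off value and function names are illustrative; feature filtering is omitted for brevity.

```python
import math
from collections import Counter, defaultdict

def tfidf_weights(docs):
    # docs: list of token lists; returns one {word: tf-idf weight} dict per document.
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({w: (c / len(d)) * math.log(n / df[w]) for w, c in tf.items()})
    return out

def topic_feature_vectors(docs, labels, top_n=3):
    # Sum and average each word's weight per topic, sort by total weight in
    # descending order, keep the top_n words as feature words, and use the
    # average weight as each feature word's weight.
    weights = tfidf_weights(docs)
    totals, counts = defaultdict(Counter), defaultdict(Counter)
    for w_map, label in zip(weights, labels):
        for word, w in w_map.items():
            totals[label][word] += w
            counts[label][word] += 1
    return {
        topic: {word: totals[topic][word] / counts[topic][word]
                for word, _ in totals[topic].most_common(top_n)}
        for topic in totals
    }

docs = [["rate", "high", "interest"], ["interest", "rate"], ["call", "harass", "call"]]
labels = ["interest", "interest", "collection"]
print(topic_feature_vectors(docs, labels))
```

Each resulting {feature word: weight} dict plays the role of a first topic feature vector; the feature filtering step would then prune it into the second topic feature vector stored in the topic feature vocabulary.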
In one embodiment, the text classification apparatus further includes a target topic feature vocabulary determining module 1018, where the target topic feature vocabulary determining module 1018 is configured to: obtain a threshold and analyze the target domain ontology to obtain example words carrying negative information; select a topic in turn from the candidate topics of the topic feature vocabulary, calculate the similarity between each feature word in the feature word set corresponding to the selected topic and each example word, and select the maximum similarity corresponding to each example word from these similarities; based on the comparison between the maximum similarity of each example word and the threshold, take the example words whose maximum similarity is larger than the threshold as feature-related words, add each feature-related word to the feature word set corresponding to the selected topic to obtain an updated feature word set, where the weight of each feature-related word equals the weight of the feature word corresponding to that example word, and obtain a first target topic feature vector of the selected topic based on the updated feature word set; and when every candidate topic in the topic feature vocabulary has been selected, obtain the first target topic feature vector corresponding to each candidate topic, perform feature filtering on each first target topic feature vector to obtain the target topic feature vector corresponding to each candidate topic, and obtain the target topic feature vocabulary based on the target topic feature vectors.
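The vocabulary expansion of module 1018 — keep each ontology example word whose maximum similarity to a topic's feature words exceeds the threshold, and let it inherit the weight of its best-matching feature word — could be sketched as follows. The character-level Jaccard similarity is a stand-in assumption; the embodiment does not prescribe a specific word-similarity measure (an embedding-based similarity would fit equally well).

```python
def word_similarity(a, b):
    # Stand-in similarity: Jaccard overlap of character sets (an assumption).
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def expand_feature_words(feature_words, example_words, threshold):
    # feature_words: {word: weight} for one topic; example_words: ontology instances.
    updated = dict(feature_words)
    for ex in example_words:
        best_word, best_sim = None, 0.0
        for fw in feature_words:
            sim = word_similarity(ex, fw)
            if sim > best_sim:
                best_word, best_sim = fw, sim
        if best_sim > threshold and ex not in updated:
            # The feature-related word inherits the weight of its best match.
            updated[ex] = feature_words[best_word]
    return updated

feature_words = {"harass": 0.9, "repay": 0.4}
print(expand_feature_words(feature_words, ["harassment", "fee"], threshold=0.4))
# → {'harass': 0.9, 'repay': 0.4, 'harassment': 0.9}
```

Here "harassment" clears the threshold against the feature word "harass" and joins the updated feature word set with weight 0.9, while "fee" matches nothing closely enough and is discarded — mirroring how the ontology supplements the topic feature vocabulary only with sufficiently related words.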
In one embodiment, the text classification apparatus further includes a history text pushing module 1020, where the history text pushing module 1020 is configured to obtain a theme basis corresponding to the target object; and acquiring a target initial history text based on a theme basis corresponding to the target object, wherein a target theme corresponding to the target initial history text is consistent with the theme basis, and sending the target initial history text to a terminal corresponding to the target object.
Each module in the above text classification apparatus can be implemented wholly or partially by software, hardware or a combination thereof. The modules can be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 11. The computer device includes a processor, a memory, an input/output (I/O) interface and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store historical text data. The input/output interface of the computer device is used for exchanging information between the processor and external devices. The communication interface of the computer device is used for connecting and communicating with external terminals through a network. The computer program, when executed by the processor, implements a text classification method.
In one embodiment, a computer device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 12. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and external devices. The communication interface of the computer device is used for communicating with external terminals in a wired or wireless manner, and the wireless manner can be realized through Wi-Fi, a mobile cellular network, NFC (Near Field Communication) or other technologies. The computer program, when executed by the processor, implements a text classification method. The display unit of the computer device is used for forming a visual picture and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen; the input device of the computer device can be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
It will be appreciated by those skilled in the art that the configurations shown in fig. 11 and 12 are block diagrams of only some of the configurations relevant to the present application, and do not constitute a limitation on the computing devices to which the present application may be applied, and that a particular computing device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of the above-described method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, databases, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, etc., without limitation.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method of text classification, the method comprising:
the method comprises the steps of obtaining a text to be classified and obtaining a target theme characteristic vocabulary, wherein each candidate theme and a characteristic word set corresponding to the candidate theme are stored in the target theme characteristic vocabulary, one characteristic word set comprises each characteristic word corresponding to the same candidate theme, the target theme characteristic vocabulary is obtained on the basis of a target field body and a theme characteristic vocabulary, the target field body is used for providing characteristic words for expanding the theme characteristic vocabulary, the theme characteristic vocabulary is obtained on the basis of a training set, and the training set is obtained by dividing historical texts;
calculating a text feature vector to be classified corresponding to the text to be classified;
obtaining a topic feature vector corresponding to each candidate topic, wherein the topic feature vector is obtained by calculation based on a feature word set corresponding to the candidate topic;
calculating the similarity between the text feature vector to be classified and each topic feature vector, and acquiring a target topic feature vector corresponding to the text to be classified based on the similarity;
and obtaining a candidate theme corresponding to the target theme characteristic vector as a target theme corresponding to the text to be classified.
2. The method of claim 1, wherein prior to obtaining the target topic feature vocabulary, further comprising:
acquiring a target field narrative word list, and determining the application purpose of a field ontology;
converting the narrative words in the target field narrative word list into concepts in the field ontology based on the application purpose of the field ontology to obtain target concepts;
determining a hierarchical relationship among the target concepts based on the hierarchical relationship among the narrative words in the target field narrative word list to obtain a target hierarchical relationship;
adding attributes to the target concept based on the meaning-limited words and the annotations of the narrative words in the target field narrative word list to obtain target attributes;
adding an inter-word relation for the target concept based on the inter-word relation among the narratives in the target field narrative word list to obtain a target inter-word relation;
forming a target concept model based on the target concept, the target level relation, the target attribute and the target inter-word relation;
creating an instance corresponding to the target concept based on the target concept model to obtain a target instance;
and constructing a target domain ontology based on the target conceptual model and the target instance.
3. The method of claim 1, wherein prior to obtaining the target topic feature vocabulary, further comprising:
acquiring an application range of a field ontology, acquiring a term list corresponding to the field ontology based on the application range, and establishing a concept structure of the field ontology based on the term list to acquire a target concept structure of the field ontology;
defining concept attributes and setting class constraints of a field ontology based on the target concept structure to obtain target concept attributes and target class constraints of the field ontology, and forming a target concept model of the field ontology based on the target concept structure, the target concept attributes and the target class constraints;
and establishing corresponding examples of each class in the domain ontology based on the target conceptual model to obtain target examples of the domain ontology, and forming the target domain ontology based on the target conceptual model and the target examples.
4. The method according to claim 2 or claim 3, wherein after constructing the target domain ontology based on the target conceptual model and the target instance, further comprising:
acquiring the data volume of the historical text;
when the data volume is larger than a first threshold value, taking two thirds of texts of the historical texts as a training set, and taking one third of texts of the historical texts as a test set to obtain the training set and the test set;
when the data volume is smaller than or equal to a first threshold value and larger than a second threshold value, dividing the historical text into a preset number of sample sets with the same size, taking each sample set as a test set in sequence, and combining the sets of the sample sets except the test set into a training set to obtain the training set and the test set;
when the data volume is smaller than a second threshold value, repeatedly and randomly extracting a sample set with the size consistent with that of the historical text from the historical text as a training set, taking data which do not appear in the training set in the historical text as a test set, and obtaining the training set and the test set, wherein the first threshold value is larger than the second threshold value;
the test set is used for testing the accuracy of the target theme corresponding to the text to be classified.
5. The method of claim 4, wherein after obtaining the training set and the test set, further comprising:
carrying out preprocessing operations of word segmentation and stop word removal on the training set to obtain a target historical text;
calculating the weight of each word in the target historical text, and performing text representation on the target historical text based on the weight;
sequentially selecting a theme from candidate themes corresponding to the target historical text, calculating a total weight and a weight average value corresponding to each word in the selected theme, sequencing each word in descending order based on the total weight, sequentially selecting a preset number of sequenced words as feature words of the selected theme, taking the weight average value corresponding to each feature word as a weight corresponding to the feature words, obtaining a theme feature vector of the selected theme based on each feature word and the weight corresponding to each feature word until each candidate theme of the target historical text obtains a corresponding theme feature vector, and taking each theme feature vector as a first theme feature vector;
and performing feature filtering processing on each first theme feature vector to obtain a second theme feature vector corresponding to each candidate theme, and obtaining the theme feature vocabulary based on the second theme feature vector, wherein each candidate theme in the target historical text and a feature word set corresponding to the candidate theme are stored in the theme feature vocabulary, and the feature word set comprises feature words and weight values corresponding to the feature words.
6. The method of claim 5, wherein after obtaining the topic feature vocabulary based on the second topic feature vector, further comprising:
acquiring a threshold, and analyzing the target field body to obtain example words with negative information in the target field body;
sequentially selecting a theme from the candidate themes of the theme feature vocabulary table, calculating the similarity between each feature word and each example word in a feature word set corresponding to the selected theme, and selecting and obtaining the maximum similarity corresponding to each example word based on each similarity;
based on the comparison result of the maximum similarity corresponding to each example word and the threshold, taking the example words with the maximum similarity larger than the threshold in each example word as feature related words, adding each feature related word into the feature word set corresponding to the selected theme to obtain an updated feature word set, wherein the weight of each feature related word is equal to the weight of the feature word corresponding to the corresponding example word, and obtaining a first target theme feature vector corresponding to the selected theme based on the updated feature word set;
when each candidate topic in the topic feature vocabulary is selected, obtaining a first target topic feature vector corresponding to each candidate topic, performing feature filtering processing on each first target topic feature vector to obtain a target topic feature vector corresponding to each candidate topic, and obtaining a target topic feature vocabulary based on each target topic feature vector.
7. The method according to claim 1, wherein after the candidate topic corresponding to the target topic feature vector is obtained as the target topic corresponding to the text to be classified, the method further comprises:
obtaining a theme basis corresponding to a target object;
and acquiring a target initial history text based on a theme basis corresponding to the target object, wherein a target theme corresponding to the target initial history text is consistent with the theme basis, and sending the target initial history text to a terminal corresponding to the target object.
8. An apparatus for classifying text, the apparatus comprising:
the data acquisition module is used for acquiring a text to be classified and acquiring a target theme characteristic vocabulary, wherein each candidate theme and a characteristic word set corresponding to the candidate theme are stored in the target theme characteristic vocabulary, one characteristic word set comprises each characteristic word corresponding to the same candidate theme, the target theme characteristic vocabulary is obtained on the basis of a target field body and a theme characteristic vocabulary, the target field body is used for providing characteristic words for expanding the theme characteristic vocabulary, the theme characteristic vocabulary is obtained on the basis of a training set, and the training set is obtained by dividing historical texts;
the text feature vector generation module is used for calculating a text feature vector to be classified corresponding to the text to be classified;
a topic feature vector obtaining module, configured to obtain a topic feature vector corresponding to each candidate topic, where the topic feature vector is obtained by calculation based on a feature word set corresponding to the candidate topic;
the target topic feature vector determining module is used for calculating the similarity between the text feature vector to be classified and each topic feature vector and acquiring a target topic feature vector corresponding to the text to be classified based on the similarity;
and the target theme determining module is used for acquiring a candidate theme corresponding to the target theme characteristic vector as a target theme corresponding to the text to be classified.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202211319702.7A 2022-10-26 2022-10-26 Text classification method and device, computer equipment and storage medium Pending CN115795030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211319702.7A CN115795030A (en) 2022-10-26 2022-10-26 Text classification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115795030A true CN115795030A (en) 2023-03-14

Family

ID=85433910

Country Status (1)

Country Link
CN (1) CN115795030A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116723059A (en) * 2023-08-10 2023-09-08 湖南润科通信科技有限公司 Security analysis system for network information
CN116723059B (en) * 2023-08-10 2023-10-20 湖南润科通信科技有限公司 Security analysis system for network information
CN116821349A (en) * 2023-08-29 2023-09-29 中国标准化研究院 Literature analysis method and management system based on big data
CN116821349B (en) * 2023-08-29 2023-10-31 中国标准化研究院 Literature analysis method and management system based on big data
CN117371440A (en) * 2023-12-05 2024-01-09 广州阿凡提电子科技有限公司 Topic text big data analysis method and system based on AIGC
CN117371440B (en) * 2023-12-05 2024-03-12 广州阿凡提电子科技有限公司 Topic text big data analysis method and system based on AIGC

Similar Documents

Publication Publication Date Title
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US9613024B1 (en) System and methods for creating datasets representing words and objects
WO2022116537A1 (en) News recommendation method and apparatus, and electronic device and storage medium
CN111753198A (en) Information recommendation method and device, electronic equipment and readable storage medium
WO2018184518A1 (en) Microblog data processing method and device, computer device and storage medium
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN111539197A (en) Text matching method and device, computer system and readable storage medium
KR20200007713A (en) Method and Apparatus for determining a topic based on sentiment analysis
CN111090771B (en) Song searching method, device and computer storage medium
CN115374781A (en) Text data information mining method, device and equipment
CN110969005B (en) Method and device for determining similarity between entity corpora
CN114547303A (en) Text multi-feature classification method and device based on Bert-LSTM
CN111859955A (en) Public opinion data analysis model based on deep learning
CN110198291B (en) Webpage backdoor detection method, device, terminal and storage medium
CN114547257B (en) Class matching method and device, computer equipment and storage medium
CN115129864A (en) Text classification method and device, computer equipment and storage medium
CN113434639A (en) Audit data processing method and device
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium
CN115878761B (en) Event context generation method, device and medium
TW202018535A (en) Apparatus and method for predicting response of an article
KR102649622B1 (en) Method, computer device, and computer program for providing brand reputation analysis service
Khan Processing big data with natural semantics and natural language understanding using brain-like approach
CN113807429B (en) Enterprise classification method, enterprise classification device, computer equipment and storage medium
Balasundaram et al. Social Media Monitoring Of Airbnb Reviews Using AI: A Sentiment Analysis Approach For Immigrant Perspectives In The UK

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Zhaolian Consumer Finance Co.,Ltd.

Applicant after: SUN YAT-SEN University

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.

Country or region before: China

Applicant before: SUN YAT-SEN University