CN111611393A - Text classification method, device and equipment - Google Patents

Text classification method, device and equipment

Info

Publication number: CN111611393A
Authority: CN (China)
Prior art keywords: text, word, classified, syntactic, word segmentation
Prior art date
Legal status (assumed; not a legal conclusion): Pending
Application number: CN202010607094.4A
Other languages: Chinese (zh)
Inventor: 孙宝林
Current Assignee (listing may be inaccurate): Alipay Hangzhou Information Technology Co Ltd
Original Assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010607094.4A, priority-critical, patent CN111611393A
Publication of CN111611393A, publication-critical
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/355: Class or cluster creation or modification
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/30: Semantic analysis
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks


Abstract

An embodiment of the present specification discloses a text classification method, apparatus, and device. The scheme comprises the following steps: acquiring a text to be classified; performing word segmentation on the text to be classified to obtain a word segmentation set; determining the word feature of each word segment in the set to obtain a word feature set of the text to be classified; determining the syntactic relations among the word segments in the set; determining the syntactic features of the text to be classified according to those syntactic relations; performing feature fusion calculation on the word feature set and the syntactic features to obtain fused features; and performing text classification based on the fused features.

Description

Text classification method, device and equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text classification method, apparatus, and device.
Background
With the development of computer technology, people can obtain more and more information through networks. To provide users with information that is safer and better matches their needs, information is generally classified. Text classification here refers to the process by which a computer automatically assigns an input text to one of a set of predefined categories by means of an algorithm.
With the development of artificial intelligence, text classification techniques have been widely applied in fields such as text review, advertisement filtering, sentiment analysis, public opinion analysis, and pornographic-content detection. How to improve the accuracy of text classification is therefore a technical problem to be solved in this field.
Disclosure of Invention
The embodiments of the present specification provide a text classification method and apparatus to improve the accuracy of text classification.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
the text classification method provided by the embodiments of the present specification comprises the following steps:
acquiring a text to be classified;
performing word segmentation on the text to be classified to obtain a word segmentation set;
determining the word feature of each word segment in the word segmentation set to obtain a word feature set of the text to be classified;
determining the syntactic relations among the word segments in the word segmentation set;
determining the syntactic features of the text to be classified according to the syntactic relations;
performing feature fusion calculation on the word feature set and the syntactic features of the text to be classified to obtain fused features;
and performing text classification based on the fused features.
An embodiment of the present specification provides a text classification apparatus, comprising:
a text acquisition module, configured to acquire a text to be classified;
a word segmentation module, configured to perform word segmentation on the text to be classified to obtain a word segmentation set;
a word feature determination module, configured to determine the word feature of each word segment in the word segmentation set to obtain a word feature set of the text to be classified;
a syntactic relation determination module, configured to determine the syntactic relations among the word segments in the word segmentation set;
a syntactic feature determination module, configured to determine the syntactic features of the text to be classified according to the syntactic relations;
a fusion calculation module, configured to perform feature fusion calculation on the word feature set and the syntactic features of the text to be classified to obtain fused features;
and a text classification module, configured to perform text classification based on the fused features.
An embodiment of the present specification provides a text classification device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquire a text to be classified;
perform word segmentation on the text to be classified to obtain a word segmentation set;
determine the word feature of each word segment in the word segmentation set to obtain a word feature set of the text to be classified;
determine the syntactic relations among the word segments in the word segmentation set;
determine the syntactic features of the text to be classified according to the syntactic relations;
perform feature fusion calculation on the word feature set and the syntactic features of the text to be classified to obtain fused features;
and perform text classification based on the fused features.
An embodiment of the present specification provides a computer-readable medium having computer-readable instructions stored thereon, the instructions being executable by a processor to implement the text classification method described above.
One embodiment of the present specification achieves the following advantageous effects: the word feature set and the syntactic features of the text to be classified are fused to obtain fused features, and the text is classified using those fused features. Because the syntactic structure of the text is thereby brought into the classification, the classifier does not rely only on relations between adjacent words but also takes the syntactic structure of the whole text into account. The text is classified from a global perspective, its overall meaning can be captured more accurately, and the accuracy of text classification is improved.
Drawings
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described here are obviously only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a scene diagram of the overall scheme of a text classification method in an embodiment of the present specification;
Fig. 2 is a schematic flowchart of a text classification method provided in an embodiment of the present specification;
Fig. 3 is a schematic flowchart of a text classification method provided in an embodiment of the present specification;
Fig. 4 is a schematic structural diagram of a text classification apparatus corresponding to fig. 2 provided in an embodiment of the present specification;
Fig. 5 is a schematic structural diagram of a text classification device corresponding to fig. 2 provided in an embodiment of the present specification.
Detailed Description
To make the objects, technical solutions, and advantages of one or more embodiments of the present disclosure clearer, the technical solutions are described below in detail and completely with reference to specific embodiments and the accompanying drawings. It should be understood that the described embodiments are only a few of the embodiments of the present specification, not all of them. All other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of one or more embodiments of the present disclosure.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
In the prior art, a traditional text classification model usually associates only adjacent words and/or terms. In practice, however, adjacent words and/or terms do not necessarily carry the semantic associations that matter for the text as a whole, so such a model fails to grasp the true meaning of the text.
W. Li et al. propose a text classification method based on recursive graphical neural networks (Recursive Graphical Networks for Text Classification), in which each word serves as a node and the nodes of adjacent words are connected by edges. A global node g, connected to all word nodes, represents the article. The construction of node relations in this scheme is somewhat crude.
C. Sun et al. propose joint inference over entities and relations through a graph convolutional network for classification and information extraction. Entity extraction uses a classical BiLSTM (bidirectional long short-term memory network); relation embedding uses a CNN (convolutional neural network) plus MLP (multilayer perceptron). The node vectors and relation vectors corresponding to the entities are convolved to obtain final representations of entities and relations, over which a loss function is computed. In this method, however, the node relations are produced by model computation, which entails some loss of precision.
In order to solve the defects in the prior art, the scheme provides the following embodiments:
Fig. 1 is a scene schematic diagram of the overall scheme of a text classification method in an embodiment of the present specification. As shown in fig. 1, the method may classify collected web page information: web page information 1 is obtained through a web page collection tool, text information 2 is extracted from it, the extracted text is input as the text to be classified into a server 3 for classification, and the text is finally assigned to a predetermined category 4. The server 3 may perform word segmentation, feature extraction, syntactic relation extraction, fusion calculation, text classification, and similar operations on the text to be classified.
Next, a text classification method provided in an embodiment of the specification will be specifically described with reference to the accompanying drawings:
fig. 2 is a schematic flowchart of a text classification method provided in an embodiment of the present specification. From the viewpoint of a program, the execution subject of the flow may be a program installed in an application server or an application client.
As shown in fig. 2, the process may include the following steps:
step 202: and acquiring the text to be classified.
In the embodiment of the present specification, text information recorded in a web page, a book, a newspaper, a journal, and the like may be used as a text to be classified, or a text provided by a user may be used as the text to be classified, and a source of the text to be classified is not limited herein. For example, due to the development of computer technology, many news messages are spread through the network, a company needs to acquire news messages related to the company in time, especially some negative news, and the news messages related to the company in the network can be used as texts to be classified according to the needs of the company; for another example, a news publishing organization needs to publish edited news in a classified manner, such as sports, entertainment, and life, and the news to be classified provided by the news publishing organization can be used as the text to be classified.
Step 204: performing word segmentation on the text to be classified to obtain a word segmentation set.
In the embodiments of the present specification, the text to be classified may be Chinese text, usually a sentence or an article consisting of many characters. The text is decomposed into a number of words and/or characters through word segmentation; the set of the resulting segments is called the word segmentation set, and it may contain all the words and/or characters that make up the text to be classified.
Step 206: determining the word feature of each word segment in the word segmentation set to obtain a word feature set of the text to be classified.
In the embodiments of the present specification, the word segments in the word segmentation set may be represented as word features: numerical representations of the segments' characteristics. The word features of all segments together form the word feature set of the text to be classified. In practice, the word feature of a segment can be determined from factors such as its part of speech, its word-formation habits, its frequency of use, and the characteristics of adjacent segments.
Step 208: determining the syntactic relations among the word segments in the word segmentation set based on the word segmentation set.
The syntactic relations in the embodiments of the present specification can be understood as the dependency relations between word segments derived from the sentence structure of the text to be classified, and may include relations found in language expression such as subject-predicate, verb-object, attributive, and preposition-object relations.
Step 210: determining the syntactic features of the text to be classified according to the syntactic relations.
In the embodiments of the present specification, the syntactic relations may be expressed in the form of syntactic features, numerical representations of the relations that can subsequently be recognized by a computer model.
Step 212: performing feature fusion calculation on the word feature set and the syntactic features of the text to be classified to obtain fused features.
In the embodiments of the present specification, feature fusion calculation may be performed on the word feature set and the syntactic features of the text to be classified to obtain fused features. Fusion can be understood as folding the syntactic features into the word feature set, yielding a word feature set that carries the overall syntactic relations of the text. The fused features reflect the characteristics of each word segment at the level of the whole text, so the overall meaning of the text can be captured more accurately and classification accuracy can be improved.
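Step 212 does not fix a concrete fusion formula. One plausible realization, in line with the graph-network prior art discussed earlier and with an adjacency-matrix encoding of the syntactic relations, is a single graph-convolution-style propagation step over the syntactic graph. The function name `fuse_features` and the toy inputs below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def fuse_features(word_features: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """Fuse syntactic structure into word features via one graph-convolution-style
    propagation step: each word segment's feature becomes a degree-normalized
    average of itself and its syntactic neighbours. (The patent does not specify
    the exact fusion formula; this is one plausible realization.)"""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)          # node degrees
    return (a_hat / deg) @ word_features            # normalized propagation

# Toy example: 3 word segments, 4-dimensional features, one syntactic edge 0-1.
X = np.arange(12, dtype=float).reshape(3, 4)
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
fused = fuse_features(X, A)
print(fused.shape)  # (3, 4)
```

Each row of the result is the average of a segment's own feature vector and those of its syntactic neighbours, so an isolated segment keeps its original features while connected segments exchange information along the syntax graph.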
Step 214: performing text classification based on the fused features.
In the embodiments of the present specification, the text to be classified may be classified according to the fused features, for example using a pre-trained classification model such as a binary or multi-class classifier.
In the text classification method of the embodiments of the present specification, the word feature set and the syntactic features of the text to be classified are fused into fused features, and the text is classified using them. Because the syntactic structure of the text enters the classification, the classifier relies not only on relations between adjacent words but also on the syntactic structure of the whole text; the text is classified from a global perspective, its overall meaning is captured more accurately, and the accuracy of text classification improves.
To clarify the advantages of the scheme in the embodiments of the present specification, suppose that news messages about Company A need to be classified, and the message "Company A sues Company B for illegal fundraising" has been obtained.
Common text classification methods in the prior art classify a message according to certain keywords without considering the meaning of the whole sentence. Because the message contains the words "illegal fundraising", such a method would classify it as a negative message. But the illegal fundraising in the message is committed by Company B, not Company A, so the message is not negative news about Company A, and the existing classification result is unreasonable.
In the embodiments of the present specification, the sentence "Company A sues Company B for illegal fundraising" may be segmented into four word segments: "Company A", "sues", "Company B", and "illegal fundraising". The syntactic relations among them are: "Company A" and "sues" stand in a subject-predicate relation, "sues" and "illegal fundraising" in a verb-object relation, and "Company B" and "illegal fundraising" in a subject-predicate relation. The syntactic relations and the word segment features are fused into fused features, classification is performed on the fused features, and the result is that the message is neutral with respect to Company A. By taking global syntactic information into account, the text classification method of the embodiments of the present specification captures the overall meaning of the text more accurately and improves the accuracy of text classification.
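The worked example above can be encoded directly as dependency triples. The small sketch below (the English relation labels and the helper `agent_of` are illustrative) shows why a classifier with access to these relations can tell that Company B, not Company A, is the agent of the illegal fundraising:

```python
# Dependency relations for "Company A sues Company B for illegal fundraising"
# (an illustrative encoding of the relations described in the text).
relations = [
    ("Company A", "subject-predicate", "sues"),
    ("sues", "verb-object", "illegal fundraising"),
    ("Company B", "subject-predicate", "illegal fundraising"),
]

def agent_of(predicate):
    """Return the words standing in a subject-predicate relation to `predicate`."""
    return [h for h, rel, d in relations
            if rel == "subject-predicate" and d == predicate]

# The agent of "illegal fundraising" is Company B, not Company A, so the
# message is neutral with respect to Company A.
print(agent_of("illegal fundraising"))  # ['Company B']
```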
It should be understood that the order of some steps in the method described in one or more embodiments of the present disclosure may be interchanged according to actual needs, or some steps may be omitted or deleted.
Based on the method of fig. 2, the present specification also provides some specific embodiments of the method, which are described below.
Optionally, in step 204, performing word segmentation on the text to be classified to obtain a word segmentation set may specifically include:
performing word segmentation on the text to be classified according to a preset segmentation granularity to obtain at least one word segment;
and obtaining, based on the at least one word segment, the word segmentation set, which comprises at least one word segment.
In practice, the words in an English sentence are separated by spaces, so the sentence can be segmented at the spaces; the characters in a Chinese sentence, by contrast, run together, and in many cases several adjacent characters jointly express one word. Word segmentation in the embodiments of the present specification refers to cutting the character sequence of the text to be classified into meaningful words.
In the embodiments of the present specification, word segmentation may be performed according to a user-defined granularity: the larger the granularity, the more characters each resulting segment may contain. For example, depending on the granularity, "Shanghai People's Court" may be segmented into the two segments "Shanghai" and "People's Court", into the three segments "Shanghai", "People", and "Court", or even kept as the single segment "Shanghai People's Court". In practice, different granularities can be set for different requirements.
In practice, the text to be classified can be matched against the entries of a machine dictionary according to some strategy; if a character string in the text is found in the dictionary, the corresponding word segment is produced according to the dictionary entry.
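The dictionary matching and granularity control described above can be sketched with greedy forward maximum matching; the function and the tiny dictionaries are illustrative assumptions, and production tools such as HanLP are far more elaborate:

```python
def segment(text, dictionary, max_len=6):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry; fall back to a single character. (A minimal sketch of
    dictionary-based word segmentation.)"""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary or size == 1:
                words.append(text[i:i + size])
                i += size
                break
    return words

sentence = "上海人民法院"                 # "Shanghai People's Court"
coarse = {"上海", "人民法院"}             # coarser segmentation granularity
fine = {"上海", "人民", "法院"}           # finer segmentation granularity
print(segment(sentence, coarse))  # ['上海', '人民法院']
print(segment(sentence, fine))    # ['上海', '人民', '法院']
```

The dictionary itself determines the granularity: a dictionary containing the full string "上海人民法院" would return the sentence as a single segment.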
In practice, the frequency or probability with which characters appear adjacently reflects how credibly those adjacent characters form a word. One can therefore count the adjacent-character combinations in a corpus and compute the probability that two characters appear next to each other; if the probability exceeds a threshold, the two characters are taken to form a word. On top of an adjacent pair, the combinations of that pair with its preceding or following characters can be examined in turn to decide whether a longer segment is formed; to some extent, the more characters a segment contains, the clearer the meaning it expresses.
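The adjacent-character statistics described above can be sketched as follows. The toy corpus, the 0.9 threshold, and the function name are illustrative choices; real systems typically also use mutual information and spans longer than two characters:

```python
from collections import Counter

def word_candidates(corpus, threshold=0.9):
    """Count adjacent-character pairs in a corpus and keep pairs whose
    conditional co-occurrence probability P(b | a) exceeds a threshold,
    the statistical word-formation idea described in the text."""
    pair_counts = Counter()
    char_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    return {a + b: n / char_counts[a]
            for (a, b), n in pair_counts.items()
            if n / char_counts[a] >= threshold}

corpus = ["法院开庭", "法院判决", "开庭审理"]
candidates = word_candidates(corpus)
print(sorted(candidates))
```

In this toy corpus, "法院" (court) and "开庭" (open a court session) always co-occur and are kept, while incidental boundary pairs such as "院开" fall below the threshold and are discarded.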
In the embodiments of the present specification, a custom database may also be provided, containing vocabulary that is not in the machine dictionary but is common in practice, such as network slang and popular expressions that have emerged with the development of the network. Using the custom database can improve the accuracy of word segmentation.
In practice, word segmentation may be performed with a general natural language processing (NLP) tool, for example a Chinese-language processing tool such as HanLP (Han Language Processing).
When a computer processes such a task, the input generally needs to be converted into numerical form so that the computer can perform calculations on it. Determining the word features of the word segments in the word segmentation set in step 206 of the embodiments of the present specification may therefore specifically include:
converting a word segment in the word segmentation set through a word vectorization model to obtain the word vector corresponding to that segment, wherein the word vectorization model is obtained by training a self-learning model and is used to represent word segments in numerical form.
In the embodiments of the present specification, a pre-trained word vectorization model may be used to vectorize the obtained word segments, converting them into a numerical form that a machine model can process.
In practice, each character in the text may be represented numerically, for example by the numbers 0, 1, 2, and so on; adjacent characters may further be associated according to their relevance within a segment, for example in n-gram form; and the segments may also be represented in digital form in combination with an attention mechanism.
In practice, a word vectorization model can be obtained by training a self-learning model on a corpus according to preset logic, and word segments can be converted into word vectors using one-hot encoding, an n-gram model, a word2vec model, the FastText tool, and the like. The embodiments of the present specification do not restrict the specific word vectorization model; it can be chosen according to actual requirements. For example, when an existing word2vec model meets the requirements, it can be used directly to save time, labor, and material resources; when no existing model is adequate, or there are specific requirements on the model, a word vectorization model can be trained from a corpus with a machine self-learning model.
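As the simplest of the vectorization options listed above, one-hot encoding can be sketched with the standard library alone. The helper below is an illustrative stand-in; in practice a dense embedding model such as word2vec or FastText would be trained instead:

```python
def one_hot_vectors(vocab):
    """Map each word segment to a one-hot vector, the simplest form of
    'word vectorization model'. Dense embeddings (word2vec, FastText)
    would replace this lookup table in practice."""
    index = {w: i for i, w in enumerate(sorted(vocab))}
    dim = len(index)
    return {w: [1 if j == index[w] else 0 for j in range(dim)] for w in index}

segments = ["Company A", "sues", "Company B", "illegal fundraising"]
vectors = one_hot_vectors(segments)
print(vectors["sues"])  # [0, 0, 0, 1]
```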
In the embodiments of the present specification, the word feature set of the text to be classified may be obtained from the word vectors corresponding to the word segments. Obtaining the word feature set of the text to be classified in step 206 may specifically include:
constructing a feature matrix from the word vectors, the feature matrix having m rows and n columns, where m is the total number of word segments and n is the feature dimension of each segment.
In the embodiments of the present specification, the text to be classified is composed of word segments, and it can be converted into numerical form according to the word vectors of those segments, yielding the word feature set of the text. The word feature set is the set of the segments' word features, and a feature matrix representing it can be constructed from the obtained word vectors.
It should be noted that different segmentation tools may produce different segmentation results, and hence different total numbers of segments; likewise, different vectorization tools may produce different feature dimensions and values, which can also be set as required. The embodiments of the present specification do not restrict the specific number of segments or feature dimensions, as long as the requirements are met. For example, when high classification quality is required, higher-dimensional word vectors may be used; to reduce computation and improve efficiency, lower-dimensional word vectors may be used.
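Stacking the per-segment word vectors row by row yields the m x n feature matrix described above. The vectors below are illustrative values, not the output of any particular model:

```python
import numpy as np

# Stack per-segment word vectors into the m x n feature matrix:
# m = number of word segments, n = feature dimension of each segment.
word_vectors = {
    "Company A": [0.2, 0.1, 0.7],
    "sues": [0.9, 0.3, 0.1],
    "Company B": [0.3, 0.1, 0.6],
}
segments = ["Company A", "sues", "Company B"]  # order in the text
X = np.array([word_vectors[w] for w in segments])
print(X.shape)  # (3, 3): m=3 word segments, n=3 feature dimensions
```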
To improve the accuracy of text classification, the method in the embodiments of the present specification further analyzes the syntactic information of each word segment in the text. In step 208, determining the syntactic relations among the word segments based on the word segmentation set may specifically include:
determining the part of speech of each word segment in the word segmentation set;
determining the dependency relations among the word segments based on the parts of speech, a dependency relation representing the syntactic collocation of two word segments in the text to be classified.
In practice, a sentence is composed of words in a certain order, and each word plays a specific part of speech in the sentence, such as noun, verb, or preposition. To express the meaning of a sentence clearly, syntactic collocation also follows certain rules; for example, noun + verb can form a subject-predicate structure. Hence, in practice, certain dependency relations hold between the word segments of a text.
In the embodiments of the present specification, a trained syntactic analysis model can determine the part of speech of each word segment in the word segmentation set; the dependency relations among the segments are then determined from those parts of speech, extracting the syntactic collocations contained in the text to be classified. The syntactic analysis model may be trained from a self-learning model as required, or an existing model with syntactic analysis capability may be used; for example, the HanLP Chinese-language processing tool mentioned above can also extract the dependency syntax of Chinese text. The embodiments of the present specification do not restrict the specific machine model, as long as the dependency relations among the word segments can be obtained.
In this embodiment of the present specification, the dependency relationship between the segmented words may be converted into a numerical value, and the determining the syntactic characteristics of the text to be classified according to the syntactic relationship may specifically include:
constructing a dependency syntax tree based on the dependency relationship;
and obtaining an adjacency matrix based on the dependency syntax tree, wherein the adjacency matrix is used for representing the syntactic characteristics of the text to be classified.
In the embodiment of the present specification, the dependency syntax tree may be a tree-shaped graph structure used to represent the modification or collocation relationships between the words in a sentence, thereby describing the syntactic structure of the sentence. The dependency syntax tree can further be converted into matrix form for calculation by the model.
Wherein, the dependency syntax tree may include:
a root node word; the root node words comprise predicates in the texts to be classified;
hierarchy node terms of at least one hierarchy;
for any hierarchy, one node word of the any hierarchy has the dependency relationship with one node word of a hierarchy of a level above the any hierarchy.
In practical applications, the root node word of the dependency syntax tree may be set manually, or a segmented word may be automatically selected as the root node word by a machine learning model. The root node word may be the most important word in the sentence; usually the main predicate of the sentence can serve as the root node word. The main predicate may be understood as the predicate of the top-level sentence, rather than a predicate in a clause. In general, a predicate is a verb, although in some specific cases it may be an adjective, a noun, a preposition, or the like; in practice, a phrase or short sentence that itself has a subject-predicate structure may serve as the predicate of the complete sentence in which it appears. In practical applications, the root node word of a sentence can be determined by utilizing the self-learning capability of a machine model, or the root node can be selected according to preset rules.
After the root node word is determined, a segmented word having a dependency relationship with it may be used as a node of the next level, where the root node word may be called the parent node, the node of the next level may be called a first child node, and the parent node and the first child node are connected. Second child nodes having dependency relationships with the first child nodes are then determined, where a first child node may be referred to as the parent node of a second child node, and the two are connected. By analogy, the dependency syntax tree corresponding to the text to be classified is finally obtained, and the nodes of the dependency syntax tree correspond to the segmented words of the text to be classified.
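The parent-child construction described above can be sketched as follows. The dictionary shapes are illustrative assumptions, and the example head links are for a hypothetical four-word sentence ("A" / "reports" / "B" / "illegal fundraising").

```python
def build_dependency_tree(heads):
    """Given {word: head_word} with the root mapped to None, return
    (root_word, {parent: [children]}) realizing the parent/child
    construction described above."""
    tree = {}
    root = None
    for word, head in heads.items():
        if head is None:
            root = word
        else:
            tree.setdefault(head, []).append(word)
    return root, tree

# Hypothetical head links for a four-word sentence.
heads = {
    "reports": None,                    # root node word (main predicate)
    "A": "reports",                     # first child node
    "illegal fundraising": "reports",   # first child node
    "B": "illegal fundraising",         # second child node
}
root, tree = build_dependency_tree(heads)
```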
In an embodiment of the present specification, the obtaining an adjacency matrix based on the dependency syntax tree may specifically include:
converting the dependency syntax tree into an adjacency matrix according to a preset relation table;
the adjacency matrix is a matrix with m rows and m columns, where m is the total number of segmented words; the segmented words are ordered according to the character sequence of the text to be classified, and the element Ai,j in the adjacency matrix represents the dependency relationship between the ith segmented word and the jth segmented word, where i is less than or equal to m and j is less than or equal to m.
In general, when Ai,j is 0, it may indicate that there is no direct dependency relationship between the ith segmented word and the jth segmented word, that is, no direct syntactic relationship between them in the sentence of the text to be classified; if Ai,j is not 0, it may indicate that a dependency relationship exists between the ith and jth segmented words, and different values may indicate different dependency relationships, that is, a specific syntactic relationship exists between the two.
Whether the dependency relationship exists between the participles can be determined by analyzing the connection condition between the nodes in the dependency syntax tree, and the dependency syntax tree can be converted into an adjacency matrix by representing the specific dependency relationship into a preset numerical value form according to a preset relationship table.
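As a minimal sketch of this conversion, the dependency edges can be mapped through a relation table of numeric codes into an m × m adjacency matrix, filled symmetrically since the relation is shared by both words. The codes and edge list below are illustrative assumptions, not values fixed by the specification.

```python
# Hypothetical numeric codes for syntactic relations (a preset
# relation table in the style of Table 2 below).
RELATION_CODES = {"subject-predicate": 1, "verb-object": 2}

def to_adjacency_matrix(m, edges):
    """edges: list of (i, j, relation) with 1-based word indices.
    Returns an m x m matrix in which a nonzero A[i][j] encodes the
    dependency between the ith and jth segmented words; the matrix is
    filled symmetrically because the relation is shared by both words."""
    A = [[0] * m for _ in range(m)]
    for i, j, rel in edges:
        code = RELATION_CODES[rel]
        A[i - 1][j - 1] = code
        A[j - 1][i - 1] = code
    return A

# Illustrative edges for a four-word sentence.
edges = [
    (1, 2, "subject-predicate"),
    (2, 4, "verb-object"),
    (3, 4, "subject-predicate"),
]
A = to_adjacency_matrix(4, edges)
```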
To explain the method in the embodiment of the present specification more clearly, it is described with reference to the flowchart illustrated in fig. 3; fig. 3 is a flowchart of a text classification method provided in an embodiment of the present specification.
As shown in fig. 3, assuming that the text to be classified is "A reports B for illegal fundraising" 302, the text is subjected to word segmentation processing, and the obtained word segmentation result 304 consists of the four segmented words "A", "reports", "B", and "illegal fundraising". The segmented words are vectorized by a word vectorization tool to obtain a feature matrix 306, which here is a 4 × 756-dimensional matrix. Through the syntactic relationships among the segmented words, the result can be converted into a dependency syntax tree 308, from which a 4 × 4-dimensional adjacency matrix 310 is obtained; the numbers in the adjacency matrix represent the syntactic relationships among the segmented words, and the specific correspondence between numbers and syntax in practical applications can be set according to practical requirements. The obtained adjacency matrix 310 and feature matrix 306 are subjected to fusion calculation to obtain a fusion feature 312; in practical applications, the fusion feature 312 can be represented as a matrix of the same dimensions as the feature matrix 306, after which text classification can be performed by a classification model 314 according to the fusion feature 312.
The text classification method in the embodiment of the specification can fuse the syntactic features of a text into the classification, better understand the meaning of the text, classify from the overall perspective of the text, and thereby improve the accuracy and precision of classification.
In practical applications, before the dependency syntax tree 308 is built, the part of speech of each segmented word and the dependency relationships between the segmented words, shown in Table 1 below, can be obtained through analysis and information extraction:

Word No.  Segmented word       Part of speech  Associated word No.  Syntactic relation
1         A                    noun            2                    subject-predicate
2         reports              verb            0                    core word (root)
3         B                    noun            4                    subject-predicate
4         illegal fundraising  verb            2                    verb-object

TABLE 1
As shown in Table 1, the text to be classified is divided into 4 segmented words, which are assigned word numbers 1 to 4 according to the character sequence of the text to be classified; the table lists the part of speech of each segmented word, the word number of the segmented word associated with it, and the syntactic relation between the two. The core word "reports" can be used as the root node word; the segmented words associated with "reports" are "A" and "illegal fundraising", which can each be connected to "reports" as its child nodes. Then "A" and "illegal fundraising" are each taken as parent nodes and their corresponding child nodes determined: as can be seen from Table 1, a subject-predicate relation exists between "B" and "illegal fundraising", so "B" can be used as a child node of "illegal fundraising". In this way, a dependency syntax tree containing each node and the corresponding node relationships can be constructed.
The specification may also convert the dependency syntax tree into an adjacency matrix form according to a preset relationship table, where the preset relationship table may be a correspondence table of syntax relationships and numerical value representations. Table 2 is a preset relationship table provided in the embodiments of the present specification:
Numeric code  Syntactic relation
1             subject-predicate
2             verb-object
3             preposed object
4             coordination

TABLE 2
As shown in Table 2, the preset relation table may include syntactic relations and their corresponding values, where the values are used to represent the specific syntactic relations in the adjacency matrix. In practical applications, the preset relation table may contain many syntactic relations; for example, the 15 relations common in Chinese dependency syntax may be included. Besides the relations shown above, these may include the indirect-object relation, the attribute-center relation, the adverbial-center relation, the verb-complement relation, the preposition-object relation, the appended relation, and the like; correspondences for punctuation marks may also be included, and a corresponding value may be set for the core word. The specific content of the preset relation table is not limited in this specification, as long as different syntactic relations can be represented.
In practical applications, the adjacency matrix may be a matrix with m rows and m columns, where m is the total number of segmented words; the segmented words are ordered according to the character sequence of the text to be classified, and the element Ai,j in the adjacency matrix represents the dependency relationship between the ith and jth segmented words, where i is less than or equal to m and j is less than or equal to m. As shown in Table 2, when a subject-predicate relation exists between the 1st and 2nd segmented words, the element A1,2 in row 1, column 2 of the adjacency matrix is 1; similarly, since the subject-predicate relation also holds between the 2nd and 1st segmented words, the element A2,1 in row 2, column 1 may also be 1. By analogy, the adjacency matrix corresponding to the text to be classified is obtained according to the preset relation table.
It should be noted that the above contents are only for clearly explaining the method in the embodiment of the present specification, the contents of tables 1 and 2 are only schematic representations, and may be set according to actual requirements in practical applications, and the relation expression manner between the participles and the specific contents of the preset relation table are not limited herein as long as the requirements can be met.
In this embodiment of the present specification, a convolution operation may be performed on an adjacency matrix and a feature matrix to obtain a fusion feature, and in step 212, the performing feature fusion calculation based on the word feature set of the text to be classified and the syntactic feature to obtain the fusion feature specifically may include:
and inputting the word feature set and the syntactic features of the text to be classified into a convolutional neural network model to obtain an output result of the convolutional neural network model.
In the embodiment of the present specification, the feature matrix representing the word feature set of the text to be classified and the adjacency matrix representing the syntactic features may be input to a convolutional neural network model for fusion calculation, and the obtained output result is used as the fusion feature for text classification.
The neural network model can be any one of a graph convolution neural network model and a graph attention network model.
A Graph Convolutional Network (GCN) model is a deep learning method that performs encoding by graph convolution; because it combines feature information with graph structure, it can characterize the original information better.
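A single graph-convolution step can be sketched in pure Python as H = ReLU(A · X · W), where A is the adjacency matrix, X the word-feature matrix, and W a learned weight matrix. This is a simplified illustration: real GCNs normalize A (with self-loops and degree scaling), and the toy matrices below are assumptions, not data from the specification.

```python
def gcn_layer(A, X, W):
    """One graph-convolution step: H = ReLU(A . X . W). A is the
    adjacency matrix (taken as-is here; real GCNs normalize it),
    X the word-feature matrix, W a learned weight matrix."""
    def matmul(P, Q):
        return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
                for row in P]
    return [[max(0.0, v) for v in row] for row in matmul(matmul(A, X), W)]

# Toy sizes: 4 segmented words, 3-dim word features, 2-dim output.
A = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 1],
     [0, 1, 1, 1]]              # adjacency with self-loops on the diagonal
X = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 1, 0]]                 # one feature row per segmented word
W = [[0.5, -0.5],
     [0.5, 0.5],
     [-0.5, 0.5]]               # untrained illustrative weights
H = gcn_layer(A, X, W)          # 4 x 2 fused feature matrix
```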
A Graph Attention Network (GAT) model additionally takes into account the interaction information among nodes, building on the graph convolutional network model.
In this embodiment of the present specification, the text to be classified may be classified based on the obtained fusion result, and in step 214, the classifying the text based on the fusion feature may specifically include:
and inputting the fusion characteristics into a text classification model, wherein the text classification model is obtained by training a self-learning model.
With the development of computer technology, technicians can achieve specific effects by training machine models; in particular, the self-learning function of a machine model can simplify manual intervention and improve model performance. In this embodiment of the specification, classification can be performed with a text classification model obtained by training a self-learning model.
In practical application, classifying texts based on the fusion features may specifically include:
calculating the probability value of the text to be classified belonging to each preset text type according to the fusion characteristics;
and determining the preset text type with the maximum probability value as the type of the text to be classified.
In the embodiment of the present specification, a probability value that a text to be classified belongs to each preset text type can be obtained, and the larger the probability value is, the higher the probability that the text belongs to the classification type is. And a corresponding preset probability value can be set for the preset text type, and the text to be classified which is greater than or equal to the preset probability value is determined as the preset text type.
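The two steps above (computing a probability value for each preset text type and taking the type with the largest value) can be sketched with a softmax over raw classifier scores; the label names and score values below are hypothetical.

```python
import math

def classify(logits, labels):
    """Softmax the raw classifier scores into probability values and
    return the preset text type with the largest probability."""
    exp = [math.exp(v) for v in logits]
    total = sum(exp)
    probs = [v / total for v in exp]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

label, p = classify([2.0, 0.5, 0.1], ["negative", "neutral", "positive"])
```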
In practical applications, texts whose probability values are close to the preset probability value, or that have a medium probability value, can be submitted to experts for review; the review results are then used as training samples to further train the classification model, which can improve its classification accuracy.
A medium probability value can be understood as a probability value located in the middle of the overall range; for example, if the overall range is 0 to 100, values from 45 to 55 may be called medium probability values. It should be noted that, in practical applications, the preset probability value and the medium probability value may be set according to actual needs, and specific values are not limited in the embodiments of the present specification, as long as the requirements can be met.
To improve the accuracy of classification, the method in the embodiment of the present specification may further include: constructing a loss function; calculating a loss value of the text classification model according to the loss function; the loss value represents the degree of difference between the prediction and the actual data; and when the loss value is larger than a preset loss value, training the text classification model.
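As one concrete choice (an assumption; the specification does not fix the form of the loss function), a cross-entropy loss can be computed from the predicted probability of the true class and compared against a preset loss value to decide whether further training is needed:

```python
import math

def cross_entropy(pred_probs, true_index):
    """Negative log-probability assigned to the true class: a standard
    measure of the gap between prediction and actual data."""
    return -math.log(pred_probs[true_index])

loss = cross_entropy([0.7, 0.2, 0.1], 0)   # model is fairly confident
PRESET_LOSS = 0.5                          # hypothetical threshold
needs_training = loss > PRESET_LOSS
```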
In practical application, different classification models can be trained according to different requirements, for example, when a text needs to be classified into categories such as sports, entertainment, humanity, animation, food and the like, the model can be trained into a multi-classification model capable of being classified into corresponding categories.
For another example, some companies or enterprises need to monitor in daily operation whether there are negative or positive messages about them. They can then take countermeasures in time for the corresponding messages, reducing as far as possible the adverse effects or losses caused by negative messages, while making full use of positive messages to improve the enterprise's reputation or benefits. For this requirement, the model can be trained as a public opinion classification model.
In step 214 in this embodiment of the present specification, after the inputting the fusion feature into the text classification model, the method may further include:
judging whether the classification result of the text classification model is a negative result;
if yes, starting an early warning process.
When public opinion classification is performed and the classification result is negative, an early warning flow can be started to prompt corresponding handling. For example, classification is performed on "A reports B for illegal fundraising"; the obtained result is a negative message with respect to B, so an early warning process can be started to prompt that there is a negative message about B, and B can then decide whether further handling is needed. With respect to A, the obtained classification result is neutral, so no early warning process needs to be started.
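The early-warning branch described above can be sketched as a small dispatch function; the function name, message format, and label values are assumptions for illustration.

```python
def handle_result(entity, label):
    """Start the early-warning flow only when the classification result
    is negative; return the warning message issued, or None."""
    if label == "negative":
        return "early warning: negative message detected for " + entity
    return None

warning = handle_result("B", "negative")    # warning flow started
no_warning = handle_result("A", "neutral")  # no early warning needed
```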
The text classification method in the embodiment of the specification can fuse the syntactic relation into the classification characteristics, and can improve the accuracy of classification.
The text classification method in this embodiment of the present description may be applied to document classification, and may also be applied to web page classification, and when applied to web page classification, before the obtaining the text to be classified in step 202, the method may further include:
capturing webpage information;
extracting text information in the webpage information;
and cleaning the text information to obtain the text to be classified.
In practical applications, a web page capturing tool can be used to acquire web page information. Besides text, the acquired web page information typically contains content such as toolbars, pictures, and links; the text information can therefore be extracted from the web page information and cleaned, for example by removing stop words, to obtain the text to be classified.
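A minimal sketch of the extraction-and-cleaning step, using only the standard library's HTML parser as a stand-in for a real web-capture tool (a production system would also remove stop words and other noise):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from captured page HTML, skipping script
    and style blocks; tags such as toolbars and links contribute no text."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

text = extract_text("<html><script>x=1;</script><p>A reports B</p></html>")
```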
In the embodiment of the present specification, the text to be classified may include at least one sentence, when the text to be classified includes a plurality of sentences, the text to be classified may be decomposed into a plurality of single sentences according to punctuations in the text, and classification processing may be performed on the single sentence as a unit to obtain a classification result corresponding to the single sentence. In practical application, a weight can be set for each single sentence, the comprehensive proportion of each classification is counted by using a preset weight and the classification result of the single sentence, and the classification with the highest comprehensive proportion is determined as the classification corresponding to the text to be classified.
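The single-sentence decomposition and weighted aggregation described above can be sketched as follows; the punctuation set, the per-sentence weight (sentence length here), and the toy sentence classifier are assumptions for illustration.

```python
import re
from collections import defaultdict

def classify_document(text, classify_sentence, weight=len):
    """Decompose a multi-sentence text on punctuation, classify each
    single sentence, and return the class with the largest weighted
    share (sentence length is used as the weight here)."""
    sentences = [s.strip() for s in re.split(r"[.!?;]", text) if s.strip()]
    share = defaultdict(float)
    for s in sentences:
        share[classify_sentence(s)] += weight(s)
    return max(share, key=share.get)

# Toy single-sentence classifier standing in for the trained model.
label = classify_document(
    "Great product. Terrible support! Great price.",
    lambda s: "negative" if "Terrible" in s else "positive",
)
```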
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method. Fig. 4 is a schematic structural diagram of a text classification device corresponding to fig. 2 provided in an embodiment of the present disclosure. As shown in fig. 4, the apparatus may include:
a text obtaining module 402, configured to obtain a text to be classified;
a word segmentation processing module 404, configured to perform word segmentation processing on the text to be classified to obtain a word segmentation set;
a word feature determining module 406, configured to determine, based on the word segmentation set, word features of the segments in the word segmentation set, so as to obtain a word feature set of the text to be classified;
a syntactic relation determining module 408, configured to determine, based on the set of segmented words, syntactic relations between segmented words in the set of segmented words;
a syntactic characteristic determining module 410, configured to determine syntactic characteristics of the text to be classified according to the syntactic relationship;
a fusion calculation module 412, configured to perform feature fusion calculation based on the word feature set of the text to be classified and the syntactic feature, so as to obtain a fusion feature;
and a text classification module 414, configured to perform text classification based on the fusion features.
The examples of this specification also provide some specific embodiments of the apparatus based on the apparatus of fig. 4, which is described below.
Optionally, the word segmentation processing module 404 in this embodiment of the present specification may be specifically configured to:
performing word segmentation processing on the text to be classified according to preset word segmentation granularity to obtain a word segmentation set;
the word segmentation set comprises at least one word segmentation.
Optionally, the word feature determining module 406 in this embodiment of the present specification may be specifically configured to:
converting a word in the word segmentation set through a word vectorization model to obtain a word vector corresponding to the word; the word vectorization model is obtained through self-learning model training and is used for representing word segmentation into a numerical form.
The word feature determination module 406 may be further configured to:
constructing a feature matrix according to the word vectors; the characteristic matrix is a matrix with m rows and n columns, wherein m is the total number of the participles, and n is the characteristic dimension of each participle.
Optionally, in this embodiment of the present specification, the syntactic relation determining module 408 may be specifically configured to:
determining the part of speech of each participle in the participle set;
determining the dependency relationship among the participles based on the parts of speech; and the dependency relationship represents the syntactic collocation relationship of the two participles in the text to be classified.
Optionally, in this embodiment of the present specification, the syntactic characteristic determining module 410 may be specifically configured to:
constructing a dependency syntax tree based on the dependency relationship;
and obtaining an adjacency matrix based on the dependency syntax tree, wherein the adjacency matrix is used for representing the syntactic characteristics of the text to be classified.
The dependency syntax tree may specifically include:
a root node word; the root node words comprise predicates in the texts to be classified;
hierarchy node terms of at least one hierarchy;
for any hierarchy, one node word of the any hierarchy has the dependency relationship with one node word of a hierarchy of a level above the any hierarchy.
Optionally, the obtaining an adjacency matrix based on the dependency syntax tree may specifically include:
converting the dependency syntax tree into an adjacency matrix according to a preset relation table;
the adjacency matrix is a matrix with m rows and m columns, where m is the total number of segmented words; the segmented words are ordered according to the character sequence of the text to be classified, and the element Ai,j in the adjacency matrix represents the dependency relationship between the ith segmented word and the jth segmented word, where i is less than or equal to m and j is less than or equal to m.
Based on the same idea, the embodiment of the present specification further provides a device corresponding to the above method.
Fig. 5 is a schematic structural diagram of a text classification device corresponding to fig. 2 provided in an embodiment of the present specification. As shown in fig. 5, the apparatus 500 may include:
at least one processor 510; and
a memory 530 communicatively coupled to the at least one processor; wherein
the memory 530 stores instructions 520 executable by the at least one processor 510 to enable the at least one processor 510 to:
acquiring a text to be classified;
performing word segmentation processing on the text to be classified to obtain a word segmentation set;
determining word features of the participles in the participle set based on the participle set to obtain a word feature set of the text to be classified;
determining a syntactic relation among participles in the participle set based on the participle set;
determining the syntactic characteristics of the text to be classified according to the syntactic relation;
performing feature fusion calculation based on the word feature set of the text to be classified and the syntactic features to obtain fusion features;
and performing text classification based on the fusion features.
Based on the same idea, the embodiment of the present specification further provides a computer-readable medium corresponding to the above method. The computer-readable medium has computer-readable instructions stored thereon, which are executable by a processor to implement the above text classification method.
the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the text classification device shown in fig. 4, since it is basically similar to the method embodiment, the description is simple, and the relevant points can be referred to the partial description of the method embodiment.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (for example, an improvement in circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement in a method flow). However, as technology advances, many of today's method-flow improvements can be regarded as direct improvements in hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement in a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system on a PLD by his own programming, without asking a chip manufacturer to design and fabricate a dedicated integrated-circuit chip. Moreover, nowadays, instead of manually making integrated-circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); at present, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are most commonly used.
It will also be apparent to those skilled in the art that hardware circuitry that implements the logical method flows can be readily obtained by merely slightly programming the method flows into an integrated circuit using the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be implemented by logically programming the method steps such that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component. Or even the means for performing the functions may be regarded both as software modules for performing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (22)

1. A method of text classification, comprising:
acquiring a text to be classified;
performing word segmentation processing on the text to be classified to obtain a word segmentation set;
determining word features of the participles in the participle set based on the participle set to obtain a word feature set of the text to be classified;
determining a syntactic relation among participles in the participle set based on the participle set;
determining the syntactic characteristics of the text to be classified according to the syntactic relation;
performing feature fusion calculation based on the word feature set of the text to be classified and the syntactic features to obtain fusion features;
and performing text classification based on the fusion characteristics.
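The pipeline of claim 1 can be illustrated with a minimal sketch. All function bodies below are toy stand-ins for the trained segmenter, dependency parser, fusion network, and classifier described in the patent; the names are illustrative and not from the patent itself.

```python
# Toy end-to-end sketch of claim 1: segment -> word features ->
# syntactic features -> feature fusion -> classification.
import numpy as np

def segment(text):
    # Toy whitespace segmenter; a real system would use a trained
    # word segmenter (especially for Chinese text).
    return text.split()

def word_features(tokens, dim=4):
    # Toy deterministic "embedding": hash each token into a dim-sized
    # vector in [0, 1); stands in for a trained word-vectorization model.
    return np.array([[hash((t, k)) % 97 / 97.0 for k in range(dim)]
                     for t in tokens])

def syntactic_adjacency(tokens):
    # Toy "syntactic relation": link each token to its neighbor.
    # A real system would derive this from a dependency parse.
    m = len(tokens)
    A = np.eye(m)
    for i in range(m - 1):
        A[i, i + 1] = A[i + 1, i] = 1.0
    return A

def fuse(features, adjacency):
    # One graph-convolution-style step: aggregate each token's neighbors.
    return adjacency @ features

def classify(fused):
    # Toy classifier: threshold on the mean fused feature.
    return "positive" if fused.mean() > 0.5 else "negative"

tokens = segment("service was very slow today")
X = word_features(tokens)          # word feature set
A = syntactic_adjacency(tokens)    # syntactic features
label = classify(fuse(X, A))       # classification on fused features
```

The key design point of the claim is that the word features and the syntactic features are computed from the same participle set and only combined at the fusion step.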
2. The method according to claim 1, wherein performing word segmentation processing on the text to be classified to obtain a word segmentation set specifically comprises:
performing word segmentation processing on the text to be classified according to preset word segmentation granularity to obtain a word segmentation set; the word segmentation set comprises at least one word segmentation.
3. The method according to claim 1, wherein determining word features of the participles in the participle set based on the participle set specifically includes:
converting a word in the word segmentation set through a word vectorization model to obtain a word vector corresponding to the word; the word vectorization model is obtained through self-learning model training.
4. The method according to claim 3, wherein the obtaining of the word feature set of the text to be classified specifically includes:
constructing a feature matrix according to the word vectors; the feature matrix is a matrix with m rows and n columns, where m is the total number of participles and n is the feature dimension of each participle.
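Claims 3 and 4 can be sketched as a lookup from participle to word vector, with the vectors stacked into the m-by-n feature matrix. The tiny embedding table below is illustrative, not a trained word-vectorization model.

```python
# Sketch of claims 3-4: map each participle to an n-dimensional word
# vector and stack the vectors into an m-by-n feature matrix.
import numpy as np

# Hypothetical embedding table (a trained model would supply these).
embedding = {
    "delivery": [0.1, 0.3, 0.5],
    "was":      [0.2, 0.2, 0.2],
    "late":     [0.9, 0.1, 0.4],
}
participles = ["delivery", "was", "late"]

# m rows (one per participle), n columns (feature dimension).
feature_matrix = np.array([embedding[w] for w in participles])
```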
5. The method according to claim 1, wherein determining a syntactic relation between the participles in the participle set based on the participle set specifically comprises:
determining the part of speech of each participle in the participle set;
determining the dependency relationship among the participles based on the parts of speech; and the dependency relationship represents the syntactic collocation relationship of the two participles in the text to be classified.
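The two steps of claim 5 — tag parts of speech, then derive pairwise dependencies from them — can be sketched as follows. The POS tags and the head-finding rule are hand-written stand-ins for a trained tagger and parser.

```python
# Sketch of claim 5: determine the part of speech of each participle,
# then determine dependency relations between participles from the tags.
pos = {"cat": "NOUN", "chased": "VERB", "mouse": "NOUN"}
tokens = ["cat", "chased", "mouse"]

def dependencies(tokens, pos):
    # Toy rule: every non-verb participle depends on the first verb
    # (the predicate); a real parser uses far richer rules or a model.
    head = next(t for t in tokens if pos[t] == "VERB")
    return [(t, head) for t in tokens if t != head]

# Each pair (dependent, head) is one syntactic collocation relation.
deps = dependencies(tokens, pos)
```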
6. The method according to claim 5, wherein the determining the syntactic characteristics of the text to be classified according to the syntactic relation specifically includes:
constructing a dependency syntax tree based on the dependency relationship;
and obtaining an adjacency matrix based on the dependency syntax tree, wherein the adjacency matrix is used for representing the syntactic characteristics of the text to be classified.
7. The method according to claim 6, wherein the dependency syntax tree specifically comprises:
a root node word, the root node word comprising a predicate in the text to be classified;
hierarchy node words of at least one hierarchy;
wherein, for any hierarchy, a node word of that hierarchy has the dependency relationship with a node word of the hierarchy one level above it.
8. The method according to claim 6, wherein obtaining the adjacency matrix based on the dependency syntax tree specifically includes:
converting the dependency syntax tree into an adjacency matrix according to a preset relation table;
the adjacency matrix is a matrix with m rows and m columns, where m is the total number of participles; the participles are ordered according to the character order of the text to be classified, and element A(i, j) of the adjacency matrix represents the dependency relationship between the i-th participle and the j-th participle, where i ≤ m and j ≤ m.
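The conversion in claim 8 — dependency tree to m-by-m adjacency matrix via a preset relation table — can be sketched as below. The relation codes in the table are hypothetical; the patent only says such a table is preset.

```python
# Sketch of claim 8: encode the edges of a dependency syntax tree into
# an m-by-m adjacency matrix A, where A[i, j] holds the relation code
# between the i-th and j-th participles (text order).
import numpy as np

tokens = ["cat", "chased", "mouse"]              # ordered as in the text
edges = [("cat", "chased", "nsubj"),             # (dependent, head, relation)
         ("mouse", "chased", "dobj")]
relation_table = {"nsubj": 1.0, "dobj": 2.0}     # hypothetical preset table

idx = {t: i for i, t in enumerate(tokens)}
A = np.zeros((len(tokens), len(tokens)))
for dep, head, rel in edges:
    code = relation_table[rel]
    A[idx[dep], idx[head]] = code
    A[idx[head], idx[dep]] = code                # symmetric, undirected view
```

Whether the matrix should be symmetric, and whether relation types get distinct codes or a shared 1, are design choices the claim leaves open.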
9. The method according to claim 1, wherein the performing feature fusion calculation based on the word feature set of the text to be classified and the syntactic feature to obtain a fusion feature specifically comprises:
and inputting the word feature set and the syntactic features of the text to be classified into a convolutional neural network model to obtain an output result of the convolutional neural network model.
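The fusion step of claim 9 — feeding the word feature set and the syntactic features into a convolutional network — is commonly realized as a graph-convolution layer, H' = ReLU(Â · H · W). The sketch below uses random weights; in the patent these come from a trained model.

```python
# Sketch of claim 9: one graph-convolution-style fusion step over the
# word feature matrix H (claim 4) and the adjacency matrix A (claim 8).
import numpy as np

rng = np.random.default_rng(0)
m, n, hidden = 4, 8, 16
H = rng.normal(size=(m, n))            # word features, m participles x n dims
A = np.eye(m)                          # adjacency with self-loops
A[0, 1] = A[1, 0] = 1.0                # one syntactic edge, for illustration

deg = A.sum(axis=1)
A_hat = A / deg[:, None]               # simple row normalization
W = rng.normal(size=(n, hidden))       # layer weights (trained in practice)
fused = np.maximum(A_hat @ H @ W, 0)   # ReLU(A_hat @ H @ W): fusion features
```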
10. The method according to claim 1, wherein the text classification based on the fusion features specifically comprises:
and inputting the fusion characteristics into a text classification model, wherein the text classification model is obtained by training a self-learning model.
11. The method of claim 9, wherein the convolutional neural network model is any one of a graph convolutional neural network model and a graph attention network model.
12. The method of claim 1, before obtaining the text to be classified, further comprising:
capturing webpage information;
extracting text information in the webpage information;
and cleaning the text information to obtain the text to be classified.
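The preprocessing of claim 12 — fetch a page, extract its text, and clean it into the text to be classified — can be sketched with a simple tag-stripping routine. The regex-based cleaning below is a toy stand-in for a real HTML parser and cleaning pipeline.

```python
# Sketch of claim 12: extract and clean text from captured web-page markup.
import re

def clean_page(html):
    # Drop script blocks, then all remaining tags, then normalize whitespace.
    text = re.sub(r"<script.*?</script>", " ", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

page = "<html><body><h1>News</h1><p>Service was slow.</p></body></html>"
cleaned = clean_page(page)   # the text to be classified
```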
13. The method of claim 1, wherein the text classification model comprises a public sentiment classification model, and wherein, after inputting the fusion features into the text classification model, the method further comprises:
judging whether the classification result of the text classification model is a negative result;
if yes, starting an early warning process.
14. A text classification apparatus comprising:
the text acquisition module is used for acquiring texts to be classified;
the word segmentation processing module is used for carrying out word segmentation processing on the text to be classified to obtain a word segmentation set;
the word feature determination module is used for determining word features of the participles in the participle set based on the participle set to obtain a word feature set of the text to be classified;
the syntactic relation determining module is used for determining the syntactic relation among the participles in the participle set based on the participle set;
the syntactic characteristic determining module is used for determining the syntactic characteristics of the text to be classified according to the syntactic relation;
the fusion calculation module is used for performing feature fusion calculation on the basis of the word feature set of the text to be classified and the syntactic features to obtain fusion features;
and the text classification module is used for classifying the texts based on the fusion characteristics.
15. The apparatus according to claim 14, wherein the segmentation processing module is specifically configured to:
performing word segmentation processing on the text to be classified according to preset word segmentation granularity to obtain a word segmentation set;
the word segmentation set comprises at least one word segmentation.
16. The apparatus of claim 14, wherein the word feature determination module is specifically configured to:
converting a word in the word segmentation set through a word vectorization model to obtain a word vector corresponding to the word; the word vectorization model is obtained through self-learning model training.
17. The apparatus of claim 16, the word feature determination module further configured to:
constructing a feature matrix according to the word vectors; the feature matrix is a matrix with m rows and n columns, where m is the total number of participles and n is the feature dimension of each participle.
18. The apparatus of claim 14, wherein the syntactic relationship determining module is specifically configured to:
determining the part of speech of each participle in the participle set;
determining the dependency relationship among the participles based on the parts of speech; and the dependency relationship represents the syntactic collocation relationship of the two participles in the text to be classified.
19. The apparatus of claim 18, wherein the syntactic feature determining module is specifically configured to:
constructing a dependency syntax tree based on the dependency relationship;
and obtaining an adjacency matrix based on the dependency syntax tree, wherein the adjacency matrix is used for representing the syntactic characteristics of the text to be classified.
20. The apparatus according to claim 19, wherein the dependency syntax tree further comprises:
a root node word, the root node word comprising a predicate in the text to be classified;
hierarchy node words of at least one hierarchy;
wherein, for any hierarchy, a node word of that hierarchy has the dependency relationship with a node word of the hierarchy one level above it.
21. The apparatus according to claim 19, wherein the deriving an adjacency matrix based on the dependency syntax tree specifically includes:
converting the dependency syntax tree into an adjacency matrix according to a preset relation table;
the adjacency matrix is a matrix with m rows and m columns, where m is the total number of participles; the participles are ordered according to the character order of the text to be classified, and element A(i, j) of the adjacency matrix represents the dependency relationship between the i-th participle and the j-th participle, where i ≤ m and j ≤ m.
22. A text classification apparatus comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to:
acquiring a text to be classified;
performing word segmentation processing on the text to be classified to obtain a word segmentation set;
determining word features of the participles in the participle set based on the participle set to obtain a word feature set of the text to be classified;
determining a syntactic relation among participles in the participle set based on the participle set;
determining the syntactic characteristics of the text to be classified according to the syntactic relation;
performing feature fusion calculation based on the word feature set of the text to be classified and the syntactic features to obtain fusion features;
and performing text classification based on the fusion characteristics.
CN202010607094.4A 2020-06-29 2020-06-29 Text classification method, device and equipment Pending CN111611393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010607094.4A CN111611393A (en) 2020-06-29 2020-06-29 Text classification method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010607094.4A CN111611393A (en) 2020-06-29 2020-06-29 Text classification method, device and equipment

Publications (1)

Publication Number Publication Date
CN111611393A true CN111611393A (en) 2020-09-01

Family

ID=72201110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010607094.4A Pending CN111611393A (en) 2020-06-29 2020-06-29 Text classification method, device and equipment

Country Status (1)

Country Link
CN (1) CN111611393A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180023A * 2016-03-11 2017-09-19 科大讯飞股份有限公司 Text classification method and system
CN109918500A * 2019-01-17 2019-06-21 平安科技(深圳)有限公司 Text classification method and related device based on convolutional neural network
CN111160008A (en) * 2019-12-18 2020-05-15 华南理工大学 Entity relationship joint extraction method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cheng Bangwen et al., "Quality Control of Statistical Survey Data: Theory, Methods and Practice of Data Auditing and Evaluation", Science and Technology Literature Press, page 75 *
Xu Jinghang et al., "Causal Relation Extraction Based on Graph Attention Networks", Journal of Computer Research and Development, no. 01 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131506A (en) * 2020-09-24 2020-12-25 厦门市美亚柏科信息股份有限公司 Webpage classification method, terminal equipment and storage medium
CN112182230A (en) * 2020-11-27 2021-01-05 北京健康有益科技有限公司 Text data classification method and device based on deep learning
CN112861517A (en) * 2020-12-24 2021-05-28 杭州电子科技大学 Chinese spelling error correction model
CN112926337A (en) * 2021-02-05 2021-06-08 昆明理工大学 End-to-end aspect level emotion analysis method combined with reconstructed syntax information
CN116402019A (en) * 2023-04-21 2023-07-07 华中农业大学 Entity relationship joint extraction method and device based on multi-feature fusion
CN116402019B (en) * 2023-04-21 2024-02-02 华中农业大学 Entity relationship joint extraction method and device based on multi-feature fusion

Similar Documents

Publication Publication Date Title
Lu et al. VGCN-BERT: augmenting BERT with graph embedding for text classification
Arora et al. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis
Neelakandan et al. A gradient boosted decision tree-based sentiment classification of twitter data
CN108304468B (en) Text classification method and text classification device
CN111967242B (en) Text information extraction method, device and equipment
US11501082B2 (en) Sentence generation method, sentence generation apparatus, and smart device
CN111611393A (en) Text classification method, device and equipment
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
US20150199333A1 (en) Automatic extraction of named entities from texts
US20190392035A1 (en) Information object extraction using combination of classifiers analyzing local and non-local features
Banik et al. Evaluation of naïve bayes and support vector machines on bangla textual movie reviews
Sanyal et al. Resume parser with natural language processing
CN112395395B (en) Text keyword extraction method, device, equipment and storage medium
CN115048944B (en) Open domain dialogue reply method and system based on theme enhancement
CN110188349A (en) A kind of automation writing method based on extraction-type multiple file summarization method
CN111950287A (en) Text-based entity identification method and related device
CN111159412A (en) Classification method and device, electronic equipment and readable storage medium
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
Fernandes et al. Appellate court modifications extraction for portuguese
Fkih et al. Hidden data states-based complex terminology extraction from textual web data model
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
Tapsai et al. Thai Natural Language Processing: Word Segmentation, Semantic Analysis, and Application
Shahbazi et al. Toward representing automatic knowledge discovery from social media contents based on document classification
Singh et al. Words are not equal: Graded weighting model for building composite document vectors

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40036405

Country of ref document: HK

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200901