CN108664512B - Text object classification method and device

Text object classification method and device

Info

Publication number: CN108664512B
Authority: CN (China)
Prior art keywords: node, layer, neural network, nodes, type
Legal status: Active
Application number: CN201710209192.0A
Other languages: Chinese (zh)
Other versions: CN108664512A
Inventors: 刘晓华, 李航, 涂兆鹏, 严哲峰
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN201710209192.0A
Publication of CN108664512A
Application granted
Publication of CN108664512B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The application discloses a text object classification method and device, belonging to the technical field of the internet. The method comprises the following steps: acquiring the node features of the i-th layer nodes, where a node feature is either a splicing feature obtained by splicing a first-type feature and a second-type feature, or a second-type feature alone; first-type features are extracted through artificial feature engineering, second-type features are learned by a machine, and each (i+1)-th layer node corresponds to its own neural network; inputting the node features of the i-th layer nodes belonging to the same (i+1)-th layer node into the neural network corresponding to that (i+1)-th layer node to obtain the second-type feature of the (i+1)-th layer node; and, when the (i+1)-th layer node is the root node, inputting the node feature of the root node into a classifier, which outputs the category label of the text object. The method and device solve the problem that features extracted by a neural network alone are relatively one-sided and affect the accuracy of the classification result; they extract more comprehensive features and improve the accuracy of the classification result.

Description

Text object classification method and device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a text object classification method and apparatus.
Background
Web page classification divides web pages into different categories according to the content of the web page documents and adds corresponding category labels to them.
A web page document is text described using the HyperText Markup Language (HTML), and the names of the category labels usually correspond to a specific application scenario. For example, the category labels of news pages may be "domestic", "international", "political", "economic", and so on, while the category labels of sensitive pages may be "contains sensitive information" and "does not contain sensitive information". When web pages are classified, a web page is generally regarded as one piece of text; each word in the text is represented by a word vector, the text is encoded by a neural network based on this word-vector representation to obtain the features of the text, and the features are then input into a classifier to obtain the category label to which the text belongs.
However, the features extracted by the neural network alone are relatively one-sided: they cannot fully describe the web page, which affects the accuracy of the classification result.
Disclosure of Invention
To solve the problem in the prior art that the features extracted through a neural network are relatively one-sided and cannot fully describe the text object, which in turn affects the accuracy of the classification result, the embodiments of the present application provide a text object classification method and device. The technical scheme is as follows:
In a first aspect, a text object classification method is provided. The text object has a tree structure comprising n layers of nodes, where the text content of an i-th layer node is a part of the text content of its (i+1)-th layer node, n and i are positive integers, and 1 ≤ i < n. The nodes at the bottommost layer are called first-layer nodes; the node at the topmost layer is called the root node, and there is exactly one root node. Features extracted through artificial feature engineering are called first-type features, and features learned by a machine are called second-type features. When node features are extracted for an (i+1)-th layer node, the node features of all i-th layer nodes belonging to that (i+1)-th layer node are obtained first; a node feature is either the second-type feature of the corresponding node, or the splicing feature obtained by splicing the first-type and second-type features of the corresponding node. The node features of these i-th layer nodes are then input into the neural network corresponding to the (i+1)-th layer node (each (i+1)-th layer node corresponds to its own neural network), and the neural network outputs the second-type feature of the (i+1)-th layer node. The node features are calculated layer by layer from the bottommost layer of the tree structure until the node feature of the root node is obtained; the node feature of the root node is input into a classifier, which outputs the category label of the text object. In the process of determining the node features, the node feature of at least one node is a splicing feature obtained by splicing first-type and second-type features. In addition, an (i+1)-th layer node is also called an intermediate node, and an i-th layer node is also called a child node of its (i+1)-th layer node.
When the node features of the text object are extracted, the node feature of at least one node in the tree structure is obtained by splicing first-type and second-type features. The node features of the text object therefore contain not only the second-type features obtained through a neural network but also the first-type features extracted through artificial feature engineering, so that the node feature output at the root node expresses the features of the text object more comprehensively. This solves the problem in the related art that features extracted through a neural network are relatively one-sided and cannot fully describe the text object, which affects the accuracy of the classification result, and achieves the effects of extracting more comprehensive features and improving the accuracy of the classification result.
In addition, the node features of the i-th layer nodes belonging to the same (i+1)-th layer node are input into the neural network corresponding to that (i+1)-th layer node to obtain its second-type feature, so that a text object with a tree structure can have the node features of each of its nodes calculated layer by layer.
In addition, each (i+1)-th layer node corresponds to its own neural network, which makes the design of the neural networks more flexible and allows first-type and second-type features to be spliced at any node of the tree structure.
In a first possible implementation of the first aspect, when the node features are calculated, the node feature of at least one node is a splicing feature obtained by splicing first-type and second-type features; or the node features of the nodes on at least one path are splicing features; or the node features of at least one layer of nodes are splicing features; or the node features of all nodes in the tree structure are splicing features.
When only some node features are splicing features, the neural network corresponding to each node can be designed flexibly, making the structure of the neural networks more widely applicable; when the node features of all nodes are splicing features, the neural networks splice the first-type and second-type features of every node, so that the final node feature of the root node represents the features of the text object as comprehensively as possible.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation, nodes of the same type in the tree structure correspond to the same neural network, and nodes of different types correspond to the same or different neural networks.
Designing the neural networks corresponding to nodes of the same type to be the same allows nodes of the same type to reuse one neural network, reducing the training cost. Using the same neural network for nodes of different types also reduces the training cost, while using different neural networks for different node types makes the design more flexible and allows different functions to be realized for different node types.
With reference to the first aspect or the first or second possible implementation of the first aspect, in a third possible implementation, when the neural network corresponding to the (i+1)-th layer node is a first-type neural network, the second-type feature of the (i+1)-th layer node is calculated by inputting the node features of the i-th layer nodes belonging to it into the first-type neural network, splicing those node features within the first-type neural network, and taking the spliced feature as the second-type feature of the (i+1)-th layer node.
The first type of neural network performs multi-stage splicing of the node features of the i-th layer nodes and is suitable for simple processing of node features.
With reference to the first aspect or any of the first to third possible implementations of the first aspect, in a fourth possible implementation, when the neural network corresponding to the (i+1)-th layer node is a second-type neural network, assume the (i+1)-th layer node contains m i-th layer nodes, where m is a positive integer and m > 1, the leftmost one being the 1st i-th layer node and the rightmost one the m-th i-th layer node. The second-type feature of the (i+1)-th layer node is calculated as follows: the node features of the m i-th layer nodes are input into the second-type neural network in order from left to right, and the network outputs a first feature; the node features of the m i-th layer nodes are then input into the second-type neural network in order from right to left, and the network outputs a second feature; finally, the first feature and the second feature are spliced to obtain the second-type feature of the (i+1)-th layer node. In the left-to-right pass, after the node feature of the j-th i-th layer node has been processed by the second-type neural network, the node feature of the (j+1)-th i-th layer node is input for processing; in the right-to-left pass, after the node feature of the j-th i-th layer node has been processed, the node feature of the (j-1)-th i-th layer node is input, where j is a positive integer and 1 < j < m. That is, after one i-th layer node has been processed by the second-type neural network, the next i-th layer node is input for processing.
Because the first feature output by the left-to-right pass and the second feature output by the right-to-left pass differ, splicing them allows the second-type feature output by the neural network to be expressed more comprehensively.
With reference to the first aspect or any of the first to fourth possible implementations of the first aspect, in a fifth possible implementation, after the node feature of the root node is input into the classifier, the classifier outputs a probability value for each category label the text object may correspond to, and the category label with the highest probability value is determined to be the category label of the text object.
Taking the category label with the highest probability among those determined by the classifier as the category label of the text object makes the classification result more accurate.
With reference to the first aspect or any of the first to fifth possible implementations of the first aspect, in a sixth possible implementation, before the neural networks and the classifier are used to classify text objects, their parameters need to be determined through training. The training process performs supervised learning on the neural networks and the classifier using training samples, each comprising sample content and a sample label. When training starts, the training parameters of the neural networks and the classifier are initialized, either randomly or by determining a set of initial parameters through a supervised training method. The sample content of each training sample is then input into the neural networks and, following the same flow as in actual use, the node features of each node are calculated layer by layer from the bottom up until the node feature of the root node is obtained. The node feature of the root node is input into the classifier, which outputs a probability value for each label; the probability value corresponding to the sample label is taken, and the probability values of the sample labels calculated for all the sample contents are added to obtain a sum of probability values. Adjusted training parameters are then calculated with a gradient ascent algorithm according to this sum, and it is judged whether an iteration-ending condition has been reached. If not, the steps from calculating the sum of probability values onward are iterated; if so, the training parameters calculated in the last iteration are taken as the trained parameters, that is, the parameters with which the classifier and the neural networks are finally put into use.
Performing supervised training of the neural networks and the classifier with labelled training samples ensures that the finally trained parameters classify effectively and accurately according to the content of the text object.
With reference to the first aspect or any of the first to sixth possible implementations of the first aspect, in a seventh possible implementation, when the training parameters of the neural networks and the classifier are trained, the iteration-ending condition is that the number of iterations reaches a first number, or that the change in the sum of probability values over a consecutive second number of iterations falls within a predetermined range, where the first number, the second number, and the predetermined range are preset manually according to the actual training requirements.
Because the iteration-ending condition is set manually, the training process stops automatically once the iteration count reaches the preset first number; moreover, if the sum of probability values changes little over a consecutive second number of iterations, the required training effect can be judged to have been reached, the training process ends early, and computing resources are saved.
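As an illustration only, the following is a minimal Python/PyTorch sketch of the training loop described above. The names (train, label_index, window, tolerance) are hypothetical, the model is assumed to be any module that maps a sample's content to per-label probabilities, and the patent does not prescribe a particular framework.

```python
import torch

def train(samples, model, label_index, lr=0.01,
          max_iters=1000, window=5, tolerance=1e-4):
    """Sketch: gradient ascent on the summed probability of the correct
    sample labels, stopping on either iteration-ending condition.

    samples: list of (sample_content, sample_label) training pairs.
    model: hypothetical module encoding a sample's tree bottom-up and
    returning label probabilities for its root feature.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    history = []
    for it in range(max_iters):                    # first condition: iteration count
        total = sum(model(content)[label_index[label]]
                    for content, label in samples)  # sum of correct-label probabilities
        optimizer.zero_grad()
        (-total).backward()                         # ascend by descending the negation
        optimizer.step()
        history.append(total.item())
        if len(history) >= window and \
           max(history[-window:]) - min(history[-window:]) < tolerance:
            break                                   # second condition: sum has stabilized
    return model
```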
In a second aspect, a text object classification apparatus is provided; the apparatus comprises at least one unit, each unit implementing a corresponding step of the text object classification method of the first aspect.
In a third aspect, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the text object classification method of the first aspect is implemented.
In a fourth aspect, a computer-readable medium is provided, which stores instructions for implementing the text object classification method provided in the first aspect.
Drawings
FIG. 1 is a schematic diagram of a tree structure provided by one embodiment of the present application;
FIG. 2 is a diagram illustrating a structure of a web document according to an embodiment of the present application;
FIG. 3 is a flow diagram of a neural network classifying web documents according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for classifying text objects according to an embodiment of the present application;
FIG. 5 is a flowchart of a text object classification method according to another embodiment of the present application;
FIG. 6 is a schematic diagram of a neural network and a classifier provided in one embodiment of the present application;
FIG. 7 is a schematic diagram of a neural network provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a neural network provided in another embodiment of the present application;
FIG. 9 is a schematic diagram of a neural network and classifier provided in accordance with another embodiment of the present application;
FIG. 10 is a schematic diagram of a text object classification method according to an embodiment of the present application;
FIG. 11 is a flowchart of a method for training a neural network and a classifier according to an embodiment of the present application;
FIG. 12 is a block diagram of a text object classification apparatus according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a server provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
A tree structure organizes data elements according to branching relationships and is generally used to describe objects with a hierarchical relationship. The data elements in a tree structure are called nodes. The single node at the top of the tree structure is called the root node, and the node at the bottom of each branch is called a leaf node; all nodes other than leaf nodes are called intermediate nodes (the root node included). Each intermediate node contains at least one child node, while leaf nodes contain none; a child node is a node of the next layer down belonging to an intermediate node (and may itself be an intermediate node or a leaf node). The node content of a child node is part of the node content of its intermediate node; the node contents of the child nodes of one intermediate node do not intersect, and their union equals the node content of the intermediate node. The composition of a tree structure is described with reference to fig. 1, which schematically shows a tree structure with 3 layers of nodes. As shown in fig. 1, node 101 is the root node, nodes 101 to 103 are intermediate nodes, and nodes 104 to 108 are leaf nodes; the child nodes of node 101 are node 102 and node 103, the child nodes of node 102 are node 104 and node 105, and the child nodes of node 103 are nodes 106, 107, and 108. A branch from the root node to a leaf node is a path, for example node 101 - node 102 - node 105, or node 101 - node 103 - node 108.
A text object has a tree structure. Assume the tree structure has n layers of nodes: the text content of an i-th layer node is part of the text content of its (i+1)-th layer node, the text contents of the i-th layer nodes belonging to the same (i+1)-th layer node do not intersect, and their union equals the text content of the (i+1)-th layer node. Correspondingly, the (i+1)-th layer node is the intermediate node described above and the i-th layer node is its child node; n and i are positive integers and 1 ≤ i < n. Common text objects include web page documents, electronic cases, news comments, and the like.
Taking a web page document as an example, fig. 2 shows its structural composition. The HTML file node 201 is the root node; it contains two child nodes, a title node 202 and a body node 203. The child nodes of the body node 203 are paragraph nodes 204, the child nodes of a paragraph node 204 are sentence nodes 205, the child nodes of a sentence node 205 are word nodes 206, and the child nodes of the title node 202 are also word nodes 206. The word nodes 206 are leaf nodes, while the HTML file node 201, title node 202, body node 203, paragraph nodes 204, and sentence nodes 205 are intermediate nodes.
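The tree structure just described can be illustrated with a small sketch. This is only a hypothetical Python representation (the Node class and its field names are not from the patent), mirroring the hierarchy of fig. 2 with a toy two-word body:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of the text object's tree structure."""
    node_type: str   # e.g. "html", "title", "body", "paragraph", "sentence", "word"
    text: str        # text content; a child's text is part of its parent's text
    children: list = field(default_factory=list)  # empty for leaf (word) nodes

# The web page document of fig. 2, reduced to a tiny example.
doc = Node("html", "News Breaking story.", children=[
    Node("title", "News", children=[Node("word", "News")]),
    Node("body", "Breaking story.", children=[
        Node("paragraph", "Breaking story.", children=[
            Node("sentence", "Breaking story.", children=[
                Node("word", "Breaking"), Node("word", "story."),
            ]),
        ]),
    ]),
])
```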
Illustratively, the structure of an electronic case is: the root node is a clinical document node, the child nodes of the clinical document node are document segment nodes, the child nodes of the document segment node are dataset nodes, and the child nodes of the dataset node are data element nodes. The clinical document nodes, the document segment nodes and the data group nodes are intermediate nodes, and the data element nodes are leaf nodes.
Illustratively, the structure of a news comment is: the root node is a comment content node, the child nodes of the comment content node are paragraph nodes, the child nodes of the paragraph nodes are sentence nodes, and the child nodes of the sentence nodes are word nodes. The comment content node, the paragraph node and the sentence node are intermediate nodes, and the word node is a leaf node.
To facilitate information retrieval, text objects such as web page documents need to be divided into different categories according to their content, with corresponding category labels added; the names of the category labels usually correspond to a specific application scenario, for example the category labels of news web pages may be "domestic", "international", "political", "economic", and so on. Because text objects such as web page documents have complex tree structures, early classification techniques mainly relied on manual classification; obtaining the features of a text object according to manually designed rules is called artificial feature engineering. The classification results obtained by manual classification are highly accurate, but they come at a high labor cost, and with the rapid growth in the number of text objects, meeting the classification demand manually requires ever greater labor.
To save labor cost, the prior art developed automatic classification of web page documents by neural networks. Referring to fig. 3, the process of automatically classifying a web page document by a neural network generally includes: 301, inputting the web page document; 302, extracting features with a neural network; 303, classifying with a classifier according to the features; and 304, outputting the category label from the classifier. A typical implementation of neural network feature extraction is as follows: the neural network regards the web page document as one text consisting of a character sequence, decomposes the text into words, represents each word by a word vector, forms the word vectors of the words in the text into a word-vector sequence, and generates a semantic vector from the word-vector sequence; the semantic vector is the feature extracted by the neural network and serves as the input of the classifier. In this feature-extraction process, both the word vectors and the semantic vector are features learned by a machine. Word vectors are a representation, obtained through training, that maps each word of a language into a real-valued vector; the word vectors form a word-vector space in which each word vector is a point, and the lexical or semantic similarity of words can be judged from the distance between their word vectors.
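As a minimal illustration of that last point, the following Python sketch judges the similarity of words from the angle between their word vectors; the vocabulary and vector values here are made-up toy numbers, not trained embeddings:

```python
import numpy as np

# Hypothetical word-vector table; a real system would load embeddings
# trained on a large corpus.
word_vectors = {
    "economy": np.array([0.8, 0.1, 0.3]),
    "finance": np.array([0.7, 0.2, 0.4]),
    "football": np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Judge lexical/semantic closeness by the angle between word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "economy" lies closer to "finance" than to "football" in the vector space.
print(cosine_similarity(word_vectors["economy"], word_vectors["finance"]))
print(cosine_similarity(word_vectors["economy"], word_vectors["football"]))
```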
In the embodiments of the present application, features extracted through artificial feature engineering are called first-type features, and features learned by a machine are called second-type features. First-type features are extracted by manually designed rules and are generally accurate; second-type features are learned by a machine and can handle large amounts of data. However, because the neural network ignores the tree structure of the web page document when extracting features, the features it extracts are relatively one-sided and cannot fully describe the web page document, which affects the accuracy of the classification result.
Because first-type features classify text objects with tree structures more accurately, while second-type features can handle large-scale data, the embodiments of the present application provide a text object classification method for tree-structured text objects that retains the respective advantages of both. The tree structure of the text object comprises n layers of nodes; the bottommost nodes are the first-layer nodes and the topmost node is the root node. The text content of an i-th layer node is part of the text content of its (i+1)-th layer node, n and i are positive integers, and 1 ≤ i < n. Optionally, the text contents of the i-th layer nodes belonging to the same (i+1)-th layer node do not intersect, and their union equals the text content of the (i+1)-th layer node.
Fig. 4 is a flowchart of a text object classification method provided by an embodiment of the present application. The method is described taking its application in a server as an example; as shown in fig. 4, the text object classification method includes:
401: obtain the node feature of an i-th layer node, where the node feature is a splicing feature obtained by splicing a first-type feature and a second-type feature, or a second-type feature alone.
First-type features are extracted through artificial feature engineering. Optionally, a first-type feature is usually represented by a sparse vector: if the i-th bit of the sparse vector is 1, the i-th first-type feature is present, and if it is 0, the i-th first-type feature is absent. A sparse vector comprises two parallel vectors, an index vector and a value vector. For example, the vector (1,0,1,3) is represented as the sparse vector (4, [0,2,3], [1,1,3]): 4 is the number of elements of the vector, [0,2,3] is the index vector, and [1,1,3] is the value vector, meaning the value at position 0 is 1, the value at position 2 is 1, the value at position 3 is 3, and the values at all other positions are 0.
Second-type features are learned by a machine and are usually represented by a dense vector. For example, the vector (1,0,1,3) is represented as the dense vector [1,0,1,3].
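A small Python sketch of the two representations, using the example vector above; the helper names to_sparse and to_dense are hypothetical:

```python
# The dense vector (1, 0, 1, 3) and its sparse form
# (number of elements, index vector, value vector), as in the example above.
dense = [1, 0, 1, 3]

def to_sparse(vec):
    """Keep only non-zero positions: (size, indices, values)."""
    indices = [i for i, v in enumerate(vec) if v != 0]
    values = [vec[i] for i in indices]
    return (len(vec), indices, values)

def to_dense(sparse):
    size, indices, values = sparse
    vec = [0] * size
    for i, v in zip(indices, values):
        vec[i] = v
    return vec

assert to_sparse(dense) == (4, [0, 2, 3], [1, 1, 3])
assert to_dense((4, [0, 2, 3], [1, 1, 3])) == dense
```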
Since the tree structure contains multiple layers of nodes, from the root node downward each (i+1)-th layer node generally contains one or more i-th layer nodes. The (i+1)-th layer nodes are also called intermediate nodes, and the i-th layer nodes are also called child nodes of the intermediate node to which they belong.
Splicing means the following: given two input vectors, the first of dimension m and the second of dimension n, output a new vector of dimension m + n whose first m dimensions are copied from the first vector and whose last n dimensions are copied from the second vector. For example, if the first vector representing the first-type feature is [1,2,3] (dimension 3) and the second vector representing the second-type feature is [4,5] (dimension 2), the splicing feature obtained by splicing them is [1,2,3,4,5], a vector of dimension 5.
The node feature of a node may be a splicing feature obtained by splicing first-type and second-type features, or it may be a second-type feature learned by a machine. For example, in a tree-structured web page document, the node feature of a word node can be the splicing of a part of speech and a word vector, where the part of speech is a manually designed feature and the word vector is a feature learned by a machine; alternatively, the node feature of a word node may be only the second-type feature, the word vector.
402: input the node features of the i-th layer nodes belonging to the same (i+1)-th layer node into the neural network corresponding to that (i+1)-th layer node to obtain the second-type feature of the (i+1)-th layer node.
Each (i+1)-th layer node corresponds to its own neural network, and the node feature of the (i+1)-th layer node is determined from the node features of the i-th layer nodes. The process of obtaining node features is called encoding, and there are three encoding modes: artificial encoding, neural network encoding, and both combined. Correspondingly, the artificial features obtained by artificial encoding are the first-type features, the features extracted by neural network encoding are the second-type features, and splicing the first-type and second-type features yields the splicing feature.
Because the node feature of an i-th layer node may be its second-type feature or the splicing feature of its first-type and second-type features, the node feature input into the neural network corresponding to the (i+1)-th layer node may be either; the neural network, however, outputs only the second-type feature of the (i+1)-th layer node. The node feature of the (i+1)-th layer node itself may be this second-type feature; if the node feature of the (i+1)-th layer node is to be a splicing feature, the neural network corresponding to the (i+1)-th layer node also needs to splice the first-type and second-type features.
When a tree-structured text object is encoded, encoding starts from the leaf nodes. After the node features of the leaf nodes are obtained, they are input as the encoding of the nodes one layer up, which are encoded in turn; the resulting node features are passed upward layer by layer until the root node is encoded and its node feature obtained. For example, a web page document with a 4-layer tree structure contains, from the root node down, web page nodes, paragraph nodes, sentence nodes, and word nodes. The word nodes are encoded first to obtain their node features; these are input as the encoding of the sentence nodes, which are encoded to obtain the sentence node features; those in turn are input as the encoding of the paragraph nodes, which are encoded to obtain the paragraph node features; finally the paragraph node features are input as the encoding of the web page node, which is encoded to obtain its node feature, that is, the node feature of the root node of the web page document. This process of encoding each node of a tree-structured text object in turn from bottom to top until the root node is encoded is called cascaded encoding.
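Assuming the hypothetical Node class sketched earlier, cascaded encoding can be illustrated as a bottom-up recursion. This is only a Python/PyTorch sketch: the encoders and manual_features maps are assumptions, and dimension bookkeeping is omitted:

```python
import torch

def encode(node, encoders, manual_features, word_vectors):
    """Cascaded encoding: compute node features bottom-up until the root.

    encoders: assumed map from node type to the neural network of that
    node (first-type or second-type), taking the stacked child features.
    manual_features: assumed map from node id to its first-type feature
    tensor; absent when the node has no hand-engineered feature.
    word_vectors: assumed map from word text to its word-vector tensor.
    """
    if node.node_type == "word":
        second = word_vectors[node.text]      # second-type feature: the word vector
    else:
        children = [encode(c, encoders, manual_features, word_vectors)
                    for c in node.children]
        second = encoders[node.node_type](torch.stack(children))
    first = manual_features.get(id(node))     # first-type feature, if any
    if first is None:
        return second                          # second-type feature alone
    return torch.cat([first, second])          # splicing feature
```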
403: when the (i+1)-th layer node is the root node, input the node feature of the root node into a classifier and output the category label of the text object through the classifier.
When cascaded encoding reaches the root node and its node feature has been calculated, the node feature of the root node is input into the classifier, which calculates the category label of the text object from it.
It should be noted that among the nodes of a tree-structured text object, the node feature of at least one node is a splicing feature. So that the node features describe the text object more comprehensively, at least one node must be determined in the tree structure whose first-type and second-type features are spliced by the neural network corresponding to that node when its node feature is obtained, the resulting spliced feature serving as the node feature of that node.
To sum up, in the text object classification method provided by this embodiment, when the node features of the text object are extracted, the node feature of at least one node in the tree structure is obtained by splicing first-type and second-type features. The node features therefore contain not only the second-type features obtained through a neural network but also the first-type features extracted through artificial feature engineering, so that the node feature output at the root node expresses the features of the text object more comprehensively. This solves the problem in the related art that the features extracted through a neural network are relatively one-sided and cannot fully describe the web page, which affects the accuracy of the classification result, and achieves the effects of extracting more comprehensive features and improving the accuracy of the classification result.
In addition, in step 402, the node features of the i-th layer nodes of the same (i+1)-th layer node are input into the neural network corresponding to the (i+1)-th layer node to obtain its second-type feature. Starting from the node features of the leaf nodes, the node features of each lower layer are used, layer by layer from the bottom up, as the input of the neural network corresponding to the layer above to obtain the second-type feature of the upper-layer node; the second-type feature of the upper-layer node, or the splicing feature of its first-type and second-type features, then serves as input for the next layer up, and node features continue to be extracted until the node feature of the root node is obtained. A text object with a tree structure can thus have the node features of each of its nodes calculated layer by layer.
In addition, through step 402, because each (i+1)-th layer node corresponds to its own neural network, different neural networks can be designed for different node types and combined according to the node types of different text objects; this makes the design of the neural networks more flexible and allows first-type and second-type features to be spliced at any node of the tree structure.
Fig. 5 is a flowchart of a text object classification method according to another embodiment of the present application. The method is described taking its application in a server as an example; as shown in fig. 5, the text object classification method includes:
501: obtain the node feature of an i-th layer node, where the node feature is a splicing feature obtained by splicing a first-type feature and a second-type feature, or a second-type feature alone.
Taking a web page document as an example, its nodes include HTML file nodes, title nodes, body nodes, paragraph nodes, sentence nodes, and word nodes.
First-type features are extracted through artificial feature engineering and are usually represented by sparse vectors.
Second-type features are learned by a machine and are usually represented by dense vectors.
Splicing means: given two input vectors, the first of dimension m and the second of dimension n, output a new vector of dimension m + n whose first m dimensions are copied from the first vector and whose last n dimensions are copied from the second vector.
The node feature of a node may be the splicing feature obtained by splicing first-type and second-type features, or only the second-type feature output by the neural network. As shown in fig. 6, this embodiment takes the case where the node feature of every node is a splicing feature as an example. The text object 600 shown in fig. 6 contains 3 layers of nodes: first-layer nodes 610, second-layer nodes 620, and a root node 630. Apart from the first-layer nodes 610, the second-layer nodes 620 and the root node 630 may be called intermediate nodes or (i+1)-th layer nodes. When the node feature of a second-layer node 620 is calculated, the node features of the first-layer nodes 610 contained in that second-layer node 620 are obtained; when the node feature of the root node 630 is calculated, the node features of the second-layer nodes 620 contained in the root node are obtained.
Each (i+1)-th layer node corresponds to its own neural network, while the bottommost nodes may have no corresponding neural network, for example when the bottommost node is a word.
Optionally, when a word node has a corresponding first-type feature, for example a manually extracted part of speech, the word node needs a corresponding neural network to splice its first-type and second-type features.
Optionally, when a word node has no corresponding first-type feature, that is, it corresponds only to a second-type feature, no neural network need be designed for it. The second-type feature of a word node is usually a word vector, and a word vector is a pre-trained representation that maps each word of a language into a real-valued vector; since mature training systems usually exist for the word-to-word-vector correspondence of a language, the word node can be mapped directly to a word vector according to this correspondence. In this case the word node has no corresponding neural network.
In the tree structure, nodes of the same type correspond to the same neural network, and nodes of different types correspond to the same or different neural networks.
Taking a web page document as an example, the child nodes of its HTML file node generally comprise only a title node and a body node. Because there are few child nodes, the neural network that extracts the node feature of the HTML file node from the node features of the title node and the body node can be designed simply, for example involving only a single operation such as splicing. A paragraph, by contrast, usually contains several sentences, so many sentence nodes belong to the same paragraph node and many sentence-node features are input to the neural network corresponding to the paragraph node; a more complex design capable of handling multiple inputs, such as a recurrent neural network, is then required.
With reference to fig. 6, each node of the text object 600 corresponds to a respective neural network within the overall network 601: the first-layer node 610 corresponds to the first neural network 611, the second-layer node 620 to the second neural network 621, and the root node 630 to the third neural network 631.
Illustratively, two different types of neural networks are provided in this embodiment: a first-type neural network and a second-type neural network. When the second-type feature of an (i+1)-th layer node is calculated, refer to steps 502 to 503 when the neural network corresponding to the (i+1)-th layer node is a first-type neural network, and to steps 504 to 506 when it is a second-type neural network.
502: when the neural network corresponding to the (i+1)-th layer node is a first-type neural network, input the node features of the i-th layer nodes belonging to that (i+1)-th layer node into the first-type neural network corresponding to the (i+1)-th layer node.
Since the neural networks corresponding to different node types may differ, the specific function a neural network performs depends on its type.
503: perform multi-stage splicing of the node features of the i-th layer nodes through the first-type neural network to obtain the second-type feature of the (i+1)-th layer node.
The first-type neural network performs multi-stage splicing of the node features of the i-th layer nodes, that is, a splicing operation over at least two node features. For example, when there are 2 i-th layer nodes, the first-type neural network splices the node feature of the first i-th layer node with that of the second, and the resulting splicing feature is the second-type feature of the (i+1)-th layer node. When there are 3 i-th layer nodes, the first-type neural network first splices the node features of the first and second i-th layer nodes to obtain a first intermediate feature, then splices this with the node feature of the third i-th layer node to obtain a second intermediate feature, which serves as the second-type feature of the (i+1)-th layer node.
Taking a web page document as an example, as shown in fig. 7, the page node 710 of the web page document contains two child nodes, a title node 720 and a body node 730; the first-type neural network 740 corresponding to the page node 710 splices the node features of the title node 720 and the body node 730. Illustratively, if the node feature of the title node 720 is [1.8, 2.6, …] and the node feature of the body node 730 is [3.8, 4.2, 8.1, …], the second-type feature of the page node 710 output by the first-type neural network 740 is [1.8, 2.6, …, 3.8, 4.2, 8.1, …].
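A minimal PyTorch sketch of such a first-type neural network, reproducing the fig. 7 example with the feature values given above; the class name ConcatNetwork is hypothetical:

```python
import torch

class ConcatNetwork(torch.nn.Module):
    """Sketch of a first-type neural network: splice the node features
    of the i-th layer nodes in multiple stages."""
    def forward(self, child_features):
        # Multi-stage splicing collapses into one ordered concatenation:
        # concat(concat(f1, f2), f3) == concat(f1, f2, f3).
        return torch.cat(list(child_features), dim=-1)

# The page-node example of fig. 7 (values taken from the text).
title = torch.tensor([1.8, 2.6])
body = torch.tensor([3.8, 4.2, 8.1])
page_feature = ConcatNetwork()([title, body])  # tensor([1.8, 2.6, 3.8, 4.2, 8.1])
```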
504: when the neural network corresponding to the (i+1)-th layer node is a second-type neural network, input the node features of the i-th layer nodes into the second-type neural network in order from left to right to obtain the first feature.
In practical applications, the second-type neural network may be a Gated Recurrent Unit (GRU).
Taking the web page document as an example, as shown in fig. 8, the body node of the web page document contains n paragraph nodes (n is a positive integer, n > 1): p1, p2, …, pn. The first GRU 810 processes each paragraph node in turn from left to right, that is, it processes p1 through pn in order: after p1 has been processed by the first GRU 810, the first GRU 810 continues with p2, and after pn has been processed, the first GRU 810 outputs s1, the first feature obtained by the first GRU 810. Here p1, p2, …, pn are independent of each other and not linked: after p1 is processed, the first GRU 810 outputs the 1st feature; after p2 is processed, it outputs the 2nd feature; and so on until pn is processed and the n-th feature is output. The final output s1 of the first GRU 810 is obtained by splicing the 1st feature, the 2nd feature, …, and the n-th feature in multiple stages.
505: input the node features of the i-th layer nodes into the second-type neural network in order from right to left to obtain the second feature.
Referring to fig. 8, the second GRU 820 processes each paragraph node in turn from right to left, that is, it processes pn through p1 in order: after pn has been processed by the second GRU 820, the second GRU 820 continues with pn-1, and after p1 has been processed, the second GRU 820 outputs s2, the second feature obtained by the second GRU 820. Since the second GRU 820 processes the paragraph nodes from right to left, its final output s2 is obtained by splicing the n-th feature, the (n-1)-th feature, …, and the 1st feature in multiple stages.
506: splice the first feature and the second feature to obtain the second-type feature of the (i+1)-th layer node.
Referring to fig. 8, the output s1 of the first GRU 810 and the output s2 of the second GRU 820 undergo a splicing operation 830 to obtain the second-type feature b of the body node.
Optionally, p1, p2, …, pn are processed serially within the left-to-right pass and serially within the right-to-left pass, but the processing of p1, p2, …, pn by the first GRU 810 and by the second GRU 820 may be performed in parallel (the first GRU 810 and the second GRU 820 process simultaneously) or in series (the first GRU 810 processes first and the second GRU 820 afterwards, or vice versa).
It should be noted that the arrows between the first GRUs 810 indicate the order of execution (from left to right), and the first GRUs 810 are not associated with each other when processing the respective node features; likewise, the arrows between the second GRUs 820 indicate the order of execution (from right to left), and the second GRUs 820 are not associated with each other when processing the respective node features. In addition, the arrow between the first GRU 810 and s1 indicates that the final output of the first GRU 810 is s1, and the arrow between the second GRU 820 and s2 indicates that the final output of the second GRU 820 is s2.
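A minimal PyTorch sketch of such a second-type neural network; the class name BiGRUEncoder and the dimensions are hypothetical. Following fig. 8, it splices the per-step outputs of the left-to-right pass into s1 and those of the right-to-left pass into s2, then splices s1 and s2. Note the resulting dimension grows with m; a fixed-size variant could instead keep only the final hidden state of each direction.

```python
import torch

class BiGRUEncoder(torch.nn.Module):
    """Sketch of a second-type neural network: a bidirectional GRU over
    the child node features, spliced as in fig. 8 (hypothetical design)."""
    def __init__(self, feature_dim, hidden_dim):
        super().__init__()
        self.gru = torch.nn.GRU(feature_dim, hidden_dim,
                                bidirectional=True, batch_first=True)

    def forward(self, child_features):                  # (m, feature_dim), m > 1
        out, _ = self.gru(child_features.unsqueeze(0))  # (1, m, 2*hidden_dim)
        h = self.gru.hidden_size
        s1 = out[0, :, :h].reshape(-1)          # left-to-right step features, spliced
        s2 = out[0, :, h:].flip(0).reshape(-1)  # right-to-left, in processing order
        return torch.cat([s1, s2])              # second-type feature of the i+1 node

# Five paragraph-node features of dimension 8 -> body-node feature b.
paragraphs = torch.randn(5, 8)
b = BiGRUEncoder(8, 16)(paragraphs)             # shape: (2 * 5 * 16,) = (160,)
```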
507: when the (i+1)-th layer node is the root node, input the node feature of the root node into the classifier to obtain the probability value of the text object belonging to each category label.
With reference to fig. 6, after the text object 600 is input into the neural network 601, the node features are calculated bottom-up. First the node features of the first-layer nodes 610 are calculated: a first-layer node 610 corresponds to the first neural network 611, and after the first-type feature 612 and the second-type feature 613 of the first-layer node 610 are input into the first neural network 611, it outputs the node feature of the first-layer node 610; when the first neural network 611 is a first-type neural network, it splices the first-type feature 612 and the second-type feature 613. Next the node features of the second-layer nodes 620 are calculated: a second-layer node 620 corresponds to the second neural network 621; the obtained node features of the first-layer nodes 610 belonging to the same second-layer node 620 are input into the second neural network 621, which outputs the second-type feature of the second-layer node 620, and the first-type feature 622 of the second-layer node 620 is spliced with this second-type feature to obtain the node feature of the second-layer node 620. The second neural network 621 may be a first-type or a second-type neural network. Finally the node feature of the root node 630 is calculated: the root node 630 corresponds to the third neural network 631; the obtained node features of the second-layer nodes 620 belonging to the root node 630 are input into the third neural network 631, which outputs the second-type feature of the root node 630, and the first-type feature 632 of the root node 630 is spliced with this second-type feature to obtain the node feature of the root node 630. The third neural network 631 may likewise be a first-type or a second-type neural network.
After the node feature of the root node 630 has been calculated, it is input into the classifier 640, which calculates the probability values of the text object belonging to the various category labels.
When cascaded encoding reaches the root node and its node feature has been calculated, the node feature of the root node is input into the classifier, which calculates from it the probability value of each category label the text object may correspond to. For example, the classification result of the classifier for a sensitive web page may be: contains sensitive information (0.8), does not contain sensitive information (0.2).
508: determine the category label with the maximum probability value as the category label corresponding to the text object.
For example, when the classification result of the classifier for a sensitive web page is: contains sensitive information (0.8), does not contain sensitive information (0.2), the category label of the sensitive web page is "contains sensitive information".
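A minimal PyTorch sketch of the classifier step and of step 508; the class name LabelClassifier and the dimensions are hypothetical, and with random weights the printed probabilities will of course differ from the 0.8/0.2 example:

```python
import torch

class LabelClassifier(torch.nn.Module):
    """Sketch of the classifier: map the root node's feature to a
    probability per category label via softmax."""
    def __init__(self, root_dim, num_labels):
        super().__init__()
        self.linear = torch.nn.Linear(root_dim, num_labels)

    def forward(self, root_feature):
        return torch.softmax(self.linear(root_feature), dim=-1)

labels = ["contains sensitive information", "does not contain sensitive information"]
classifier = LabelClassifier(root_dim=32, num_labels=len(labels))
probs = classifier(torch.randn(32))       # e.g. tensor([0.8, 0.2]) after training
print(labels[int(torch.argmax(probs))])   # label with the maximum probability
```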
Optionally, when the node features of the nodes of the text object are calculated, the node feature of at least one node is a splicing feature.
Optionally, when the node features of the nodes of the text object are calculated, the node features of the nodes on at least one path are splicing features.
Optionally, when the node features of the nodes of the text object are calculated, the node features of at least one layer of nodes are splicing features.
Optionally, when the node features of the nodes of the text object are calculated, the node features of all the nodes are splicing features.
In fig. 6 the node features of all nodes are splicing features. In other possible implementations, only the nodes of one layer may have splicing features as node features; or the nodes of some layers of the tree structure have splicing features while the remaining nodes keep second-type features as node features; or only some individual nodes of the tree structure have splicing features; or the nodes of one path, or of several paths, of the tree structure have splicing features. This embodiment does not limit which node positions carry splicing features.
For example, as shown in fig. 9, a text object 900 is input into a neural network 901. Assuming the first-layer nodes 910 are word nodes, the second-type feature 912 of a first-layer node 910 is a word vector; since a first-layer node 910 has no first-type feature, it also has no corresponding neural network, and the second-type feature 912 is used directly as the node feature of the first-layer node 910. Next, the node features of the second-layer nodes 920 are calculated. A second-layer node 920 corresponds to a second neural network 921; after the node features of the first-layer nodes 910 belonging to the same second-layer node 920 are input into the second neural network 921, the second neural network 921 outputs the second-type feature of the second-layer node 920, which is spliced with the first-type feature 922 of the second-layer node 920 to obtain its node feature. Finally, the node feature of the root node 930 is calculated. The root node 930 corresponds to a third neural network 931; after the node features of the second-layer nodes 920 belonging to the root node 930 are input into the third neural network 931, the third neural network 931 outputs the second-type feature of the root node 930. Once the node feature of the root node 930 is calculated, it is input into the classifier 940, which calculates the probability values of the text object belonging to the various category labels.
Optionally, the text object classification method in this embodiment may also be represented by the schematic diagram shown in fig. 10. As shown in fig. 10, step 1010 is executed first: a web page document tree is input. Step 1020 is then executed: the first-type features are extracted by manual feature engineering, yielding a web page document tree 1030 carrying first-type features. Step 1040 is executed next: the second-type features are learned by machine, and the first-type and second-type features are spliced to obtain the encoding 1050 of the web page document tree. Finally, step 1060 is executed: the classifier performs classification and the category label with the largest probability value is selected, yielding the best category label 1070.
To sum up, in the text object classification method provided in this embodiment of the present application, when the node features of the text object are extracted, the node feature of at least one node in the tree structure is obtained by splicing the first class features and the second class features. The extracted node features therefore contain not only the second class features obtained through the neural network but also the first class features extracted through manual feature engineering, so that the node feature output at the root node expresses the features of the text object more comprehensively. This solves the problem in the related art that features extracted only through a neural network are one-sided, cannot fully describe the web page, and thus impair the accuracy of the classification result; more comprehensive features are extracted and the accuracy of the classification result is improved.
Through step 501, the node features of the ith layer nodes belonging to the same (i + 1) th layer node are input into the neural network corresponding to that (i + 1) th layer node to obtain its second class features, so that for a text object with a tree structure the node features of every node can be calculated layer by layer.
In step 501, each (i + 1) th layer node corresponds to its own neural network, which makes the design of the neural network more flexible: the first class features and the second class features can be spliced at any node of the tree structure.
For step 501, when the node features of only some nodes are splicing features, the neural network corresponding to each node can be designed flexibly, giving the neural network structure wider applicability; when the node features of all nodes are splicing features, the neural networks splice the first class and second class features at every node, so that the node feature finally obtained at the root node represents the features of the text object as comprehensively as possible.
For step 501, if the neural networks corresponding to nodes of the same type are designed to be identical, nodes of the same type can reuse one neural network, which reduces the training cost. If the neural networks corresponding to nodes of different types are also identical, the training cost is reduced further; if they differ, the neural network design is more flexible, and different functions can be realized for different types of nodes.
Through steps 502 and 503, the first-type neural network realizes multi-stage splicing of the node features of the ith layer nodes, which is suited to simple processing of node features.
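For illustration only, one possible reading of the first-type neural network's multi-stage splicing is stage-by-stage concatenation, sketched below in Python; first_type_network is a hypothetical name, and the reduction to plain concatenation is an assumption.

```python
import numpy as np

def first_type_network(child_feats):
    """Multi-stage splicing: concatenate the ith layer node features in order."""
    out = child_feats[0]
    for feat in child_feats[1:]:
        out = np.concatenate([out, feat])   # one splicing stage per child
    return out
```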
Through steps 504 to 506, because of the separate left-to-right and right-to-left passes, the first feature and the second feature output by the neural network differ, and splicing them allows the second class feature output by the neural network to be expressed more comprehensively.
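For illustration only, the left-to-right and right-to-left passes of steps 504 to 506 can be sketched with a plain tanh recurrence. The embodiment does not fix a particular recurrent cell here, and rnn_pass, Wf, Uf, Wb, Ub and h0 are hypothetical toy names and parameters.

```python
import numpy as np

def rnn_pass(feats, W, U, h0):
    h = h0
    for x in feats:                    # scan the m node features in order
        h = np.tanh(W @ x + U @ h)
    return h

def second_type_feature(child_feats, Wf, Uf, Wb, Ub, h0):
    first = rnn_pass(child_feats, Wf, Uf, h0)            # left-to-right pass
    second = rnn_pass(child_feats[::-1], Wb, Ub, h0)     # right-to-left pass
    return np.concatenate([first, second])               # splice the two outputs

# Toy usage with 4-dimensional node features and a 3-dimensional hidden state.
rng = np.random.default_rng(0)
Wf, Wb = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
Uf, Ub = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
feats = [rng.normal(size=4) for _ in range(5)]           # m = 5 ith layer nodes
out = second_type_feature(feats, Wf, Uf, Wb, Ub, np.zeros(3))  # 6-dim result
```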
Through step 508, the category label with the largest probability value among the labels evaluated by the classifier is determined as the category label of the text object, which makes the classification result more accurate.
Fig. 11 is a flowchart of a method for training the neural network and the classifier according to an embodiment of the present application. The method is described as applied in a server. As shown in fig. 11, the method includes:
1101, training parameters of the classifier and the neural network are initialized.
Optionally, the initialization is random, or a set of initial parameters is obtained in advance by a supervised training method.
1102, a training sample is acquired, the training sample including sample content and a sample label corresponding to the sample content, the sample content having a tree structure.
The training sample includes sample content and a sample label corresponding to the sample content. For example: the sample content is a news web page and the sample labels are 'international' and 'economic'; or the sample content is a sensitive web page and the sample label is 'contains sensitive information'.
Because the training samples include sample contents and their corresponding sample labels, the sample contents are classified by the neural network and the classifier, and the neural network and the classifier are then trained by supervised learning against the sample labels.
The probability value of the sample label in the training sample defaults to 1.
1103, the sample content is input into the neural network, and the node feature corresponding to the root node of the sample content is calculated.
The neural network includes a neural network corresponding to each (i + 1) th layer node. For example: the sample content is a news web page comprising web page nodes, title nodes, main body nodes, paragraph nodes, sentence nodes and word nodes. The web page nodes, title nodes, main body nodes, paragraph nodes and sentence nodes each correspond to a neural network, while the word nodes are leaf nodes and therefore have no corresponding neural network.
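For illustration only, the arrangement in this example, where nodes of the same type reuse one network and word nodes have none, can be pictured as a type-to-network table; make_toy_net is a hypothetical stand-in for a trained per-type network.

```python
import numpy as np

def make_toy_net():
    # Toy stand-in for a trained per-type network: average then squash.
    return lambda child_feats: np.tanh(np.mean(child_feats, axis=0))

# Nodes of the same type share the same network object; word nodes are
# leaves whose word vectors are used directly, so they have no entry.
networks = {
    "web_page": make_toy_net(),
    "title": make_toy_net(),
    "main_body": make_toy_net(),
    "paragraph": make_toy_net(),
    "sentence": make_toy_net(),
}
```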
The node feature corresponding to the root node of the sample content is calculated by the neural network under its current training parameters.
1104, the node feature is input into the classifier, and the probability value of the sample content belonging to its sample label is calculated.
Mirroring the classification of a text object by the neural network and the classifier in steps 401 to 403, after the node feature of the root node of the sample content has been calculated, it is input into the classifier; the classifier calculates the probability value of the sample content belonging to each category label, from which the probability value of the corresponding sample label is obtained.
In an actual training process there is usually more than one training sample, and the neural network and the classifier need to calculate, for the sample content of each training sample, the probability value of its corresponding sample label.
1105, the probability values of the sample contents belonging to their corresponding sample labels are added to obtain the sum of the probability values.
The sum of the probability values can be calculated by formula (1):

L(θ) = Σ_{n=1}^{N} log P(Y_n | X_n, θ)    (1)

where θ denotes the training parameters, N denotes the number of training samples, and (X_n, Y_n) denotes the n-th training sample, with X_n the sample content and Y_n the sample label. P(Y_n | X_n, θ) denotes the probability, calculated by the neural network and the classifier under training parameters θ, that the sample content X_n belongs to the sample label Y_n, and L(θ) denotes the sum of the probability values. Formula (1) therefore gives, under the current training parameters of the neural network, the sum over all training samples of the probability values of the corresponding sample labels; the log is a monotone operation on the probability values and does not change which parameters maximize the sum.
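For illustration only, formula (1) can be computed as below, where model_prob is a hypothetical stand-in for the probability produced by the neural network and the classifier under parameters theta.

```python
import math

def log_likelihood(samples, model_prob, theta):
    """Formula (1): sum of log P(Y_n | X_n, theta) over the training samples.

    samples: list of (X_n, Y_n) pairs, i.e. (sample content, sample label).
    """
    return sum(math.log(model_prob(x, y, theta)) for x, y in samples)
```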
1106, the adjusted training parameters are calculated by using a gradient ascent algorithm according to the sum of the probability values.
The gradient ascent algorithm treats the objective over the parameters of the neural network as a multi-dimensional surface; by adjusting the training parameters, the training seeks a maximum of this surface for the given sample contents, and that maximum is the training target. Optionally, if a minus sign is placed in front of the log in formula (1), a gradient descent algorithm is used to adjust the training parameters instead, and the training target is a minimum of the surface.
The training parameters of the neural network and the classifier are adjusted with the gradient ascent algorithm, and iterative computation then proceeds with the new training parameters; in each iteration the training parameters are updated according to formula (2):

θ' = θ + lr · ∂L(θ)/∂θ    (2)

where θ' denotes the adjusted training parameters and lr denotes the learning rate, a constant whose magnitude determines how fast the training parameters are adjusted.
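For illustration only, one iteration of formula (2) is a single gradient ascent step; in the toy usage below the gradient of a simple concave objective is supplied by hand, and gradient_ascent_step and grad_L are hypothetical names.

```python
import numpy as np

def gradient_ascent_step(theta, grad_L, lr=0.1):
    """Formula (2): theta' = theta + lr * dL/dtheta."""
    return theta + lr * grad_L(theta)

# Toy usage: maximize L(theta) = -(theta - 3)^2, whose gradient is -2*(theta - 3).
theta = np.array([0.0])
for _ in range(100):
    theta = gradient_ascent_step(theta, lambda t: -2.0 * (t - 3.0))
# theta is now very close to the maximizer 3.0
```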
Each execution of formula (2) completes one iteration; correspondingly, steps 1103 to 1106 above are executed once per loop. After each iteration of the gradient ascent algorithm, the sum of the probability values usually increases or stays unchanged.
1107, it is judged whether the iteration end condition is reached.
Optionally, the iteration end condition includes: the number of iterations reaches a first number, or the change in the sum of the probability values calculated over a second number of consecutive iterations falls within a preset range.
If the sum of the probability values keeps increasing markedly after each iteration, a first number of iterations is preset manually, and the iterative computation stops once the iteration count reaches that first number.
If the iteration count has not reached the first number but the change in the sum of the probability values over a second number of consecutive iterations lies within the preset range, that is, the sum no longer changes noticeably as the training parameters are adjusted, the iterative computation can also stop: the current training parameters of the neural network and the classifier already meet the requirement.
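For illustration only, the two iteration end conditions of step 1107 can be checked as below; max_iters corresponds to the first number, while patience and tol realize the "change over a second number of consecutive iterations within a preset range" condition. All names are hypothetical.

```python
def should_stop(history, iteration, max_iters=1000, patience=5, tol=1e-4):
    """history: the sum of probability values recorded after each iteration."""
    if iteration >= max_iters:                 # first number of iterations reached
        return True
    if len(history) > patience:                # change over recent iterations small
        recent = history[-(patience + 1):]
        return max(recent) - min(recent) < tol
    return False
```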
1108, when the iteration end condition is not reached, steps 1103 to 1106 are executed iteratively.
When the iteration end condition is not reached, the training parameters continue to be adjusted according to the gradient ascent algorithm, and with the new training parameters the sum of the probability values of the corresponding sample labels is calculated again from the sample contents.
1109, when the iteration end condition is reached, determining the training parameter calculated in the last iteration as the trained training parameter.
In summary, in the training method for the neural network and the classifier provided in this embodiment of the present application, the neural network and the classifier are trained in a supervised manner using training samples with sample labels, so that with the finally trained parameters the neural network and the classifier can classify text objects effectively and accurately according to their content.
In addition, because the iteration end conditions are set manually, the training process can stop automatically once the iterative computation reaches the preset first number of iterations; moreover, if the sum of the probability values changes little over a second number of consecutive iterations, it can be judged that the required training effect has been reached, and the training process ends early, saving computing resources.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 12 is a block diagram illustrating the structure of a text object classification apparatus according to an embodiment of the present application; the apparatus may be implemented as part or all of a server by software, hardware, or a combination of the two. The text object classification apparatus may include: an acquisition unit 1210, a processing unit 1220 and a determination unit 1230.
An obtaining unit 1210, configured to implement the above step 401, step 501, step 1102, and any other implicit or disclosed obtaining related functions.
A processing unit 1220, configured to implement the above-mentioned step 402, step 502, step 503, step 504, step 505, step 506, step 1101, step 1103, step 1104, step 1105, step 1106, step 1107, step 1108, and any other implicit or disclosed processing-related functions.
A determining unit 1230, configured to implement the above-mentioned steps 403, 507, 508, 1109, and any other implicit or disclosed determining related functions.
To sum up, in the text object classification apparatus provided in this embodiment of the present application, when the node features of the text object are extracted, the node feature of at least one node in the tree structure is obtained by splicing the first class features and the second class features. The extracted node features therefore contain not only the second class features obtained through the neural network but also the first class features extracted through manual feature engineering, so that the node feature output at the root node expresses the features of the text object more comprehensively. This solves the problem in the related art that features extracted only through a neural network are one-sided, cannot fully describe the web page, and thus impair the accuracy of the classification result; more comprehensive features are extracted and the accuracy of the classification result is improved.
In addition, the node features of the ith layer nodes belonging to the same (i + 1) th layer node are input into the neural network corresponding to that (i + 1) th layer node to obtain its second class features, so that for a text object with a tree structure the node features of every node can be calculated layer by layer.
In addition, each (i + 1) th layer node corresponds to its own neural network, which makes the design of the neural network more flexible: the first class features and the second class features can be spliced at any node of the tree structure.
In addition, when the node features of only some nodes are splicing features, the neural network corresponding to each node can be designed flexibly, giving the neural network structure wider applicability; when the node features of all nodes are splicing features, the neural networks splice the first class and second class features at every node, so that the node feature finally obtained at the root node represents the features of the text object as comprehensively as possible.
In addition, if the neural networks corresponding to nodes of the same type are designed to be identical, nodes of the same type can reuse one neural network, which reduces the training cost. If the neural networks corresponding to nodes of different types are also identical, the training cost is reduced further; if they differ, the neural network design is more flexible, and different functions can be realized for different types of nodes.
In addition, the first-type neural network realizes multi-stage splicing of the node features of the ith layer nodes, which is suited to simple processing of node features.
In addition, because of the separate left-to-right and right-to-left passes, the first feature and the second feature output by the neural network differ, and splicing them allows the second class feature output by the neural network to be expressed more comprehensively.
In addition, the category label with the largest probability value among the labels evaluated by the classifier is determined as the category label of the text object, which makes the classification result more accurate.
In addition, the neural network and the classifier are trained in a supervised manner using training samples with sample labels, so that with the finally trained parameters the neural network and the classifier can classify text objects effectively and accurately according to their content.
In addition, because the iteration end conditions are set manually, the training process can stop automatically once the iterative computation reaches the preset first number of iterations; moreover, if the sum of the probability values changes little over a second number of consecutive iterations, it can be judged that the required training effect has been reached, and the training process ends early, saving computing resources.
Fig. 13 is a schematic structural diagram of a server provided in an embodiment of the present application. Specifically, the server 1300 includes a Central Processing Unit (CPU) 1301, a system memory 1304 including a Random Access Memory (RAM) 1302 and a Read-Only Memory (ROM) 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301. The server 1300 also includes a basic input/output system (I/O system) 1306, which facilitates the transfer of information between devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1308 and input device 1309 are connected to the central processing unit 1301 through an input/output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include an input/output controller 1310 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, an input/output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the server 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1304 and mass storage device 1307 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1300 may also operate by way of a remote computer connected through a network, such as the Internet. That is, the server 1300 may connect to the network 1312 through the network interface unit 1311 connected to the system bus 1305, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1311.
It should be noted that: when the text object classification apparatus and the server provided in the above embodiments classify text objects, the division into the above functional modules is only an example; in practical applications, the above functions may be assigned to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the text object classification apparatus and server provided in the above embodiments belong to the same concept as the embodiments of the text object classification method; their specific implementation is detailed in the method embodiments and is not repeated here.
It should be understood that, as used herein, the singular forms "a," "an," "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Claims (12)

1. A text object classification method is characterized in that the text object has a tree structure with n layers of nodes, the text content of the ith layer of node in the tree structure is a part of the text content of the (i + 1) th layer of node, n and i are positive integers, i is more than or equal to 1 and less than n, and the method comprises the following steps:
acquiring node characteristics of an ith layer of nodes, wherein the node characteristics are as follows: splicing characteristics obtained after the first type of characteristics and the second type of characteristics are spliced, or the second type of characteristics;
inputting the node characteristics of each ith layer node belonging to the same (i + 1) th layer node into a neural network corresponding to the (i + 1) th layer node to obtain the second type characteristics of the (i + 1) th layer node;
when the (i + 1) th layer node is a root node, inputting the node characteristics of the root node into a classifier, and outputting a class label of the text object through the classifier, wherein the node at the topmost layer in the tree structure is the root node;
the first class of features are features extracted through artificial feature engineering, the second class of features are features learned through a machine, each (i + 1) th layer node corresponds to the neural network, and the node features with at least one node are the splicing features.
2. The method according to claim 1, wherein the node features of nodes located on at least one path and/or in at least one layer are the splicing features.
3. The method of claim 1, wherein the neural networks corresponding to the same type of nodes are the same, and the neural networks corresponding to different types of nodes are different or the same.
4. The method according to any one of claims 1 to 3, wherein the neural network corresponding to the (i + 1) th layer node is a first type neural network;
the inputting the node characteristics of each ith layer node belonging to the same (i + 1) th layer node into the neural network corresponding to the (i + 1) th layer node to obtain the second type characteristics of the (i + 1) th layer node includes:
inputting the node characteristics of each ith layer node belonging to the same (i + 1) th layer node into the first type neural network corresponding to the (i + 1) th layer node;
and performing multi-stage splicing on the node characteristics of the ith layer of nodes through the first type of neural network to obtain second type characteristics of the (i + 1) th layer of nodes.
5. The method according to any one of claims 1 to 3, wherein the neural network corresponding to the (i + 1) th layer node is a second type of neural network; the ith layer nodes belonging to the same (i + 1) th layer node comprise m ith layer nodes, the leftmost being the 1st ith layer node and the rightmost being the mth ith layer node, m being a positive integer greater than 1;
the inputting the node characteristics of each ith layer node belonging to the same (i + 1) th layer node into the neural network corresponding to the (i + 1) th layer node to obtain the second type characteristics of the (i + 1) th layer node includes:
sequentially inputting, from left to right, the node characteristics of the m ith layer nodes into the second type of neural network to obtain a first characteristic;
sequentially inputting, from right to left, the node characteristics of the m ith layer nodes into the second type of neural network to obtain a second characteristic;
and splicing the first characteristic and the second characteristic to obtain the second type characteristic of the (i + 1) th layer node.
6. The method according to any one of claims 1 to 3, wherein, when the (i + 1) th layer node is the root node, inputting the node feature of the root node into a classifier and outputting a class label of the text object through the classifier comprises:
when the (i + 1) th layer node is the root node, inputting the node characteristics of the root node into the classifier to obtain probability values of the text objects belonging to various class labels;
and determining the category label with the maximum probability value as the category label corresponding to the text object.
7. The method according to any one of claims 1 to 3, wherein the training process of the classifier and the neural network comprises the steps of:
step 1: initializing training parameters of the classifier and the neural network;
step 2: obtaining a training sample, wherein the training sample comprises sample content and a sample label corresponding to the sample content, and the sample content has the tree structure;
and step 3: inputting the sample content into the neural network, and calculating to obtain node characteristics corresponding to a root node of the sample content;
and 4, step 4: inputting the node features into the classifier, and calculating to obtain a probability value of the sample label belonging to the sample label;
and 5: adding probability values of the sample contents belonging to the corresponding sample labels, and calculating to obtain the sum of the probability values;
step 6: calculating the adjusted training parameters by using a gradient ascent algorithm according to the sum of the probability values;
when the iteration end condition is not reached, iteratively performing steps 3 to 6;
and when the iteration ending condition is reached, determining the training parameters calculated by the last iteration as the trained training parameters.
8. The method of claim 7, wherein the iteration end condition comprises:
the number of iterations reaches a first number of iterations,
or,
the variation of the sum of the probability values calculated over a second number of consecutive iterations falls within a predetermined range.
9. A text object classification device is characterized in that the text object has a tree structure with n layers of nodes, the text content of the ith layer of node in the tree structure is a part of the text content of the (i + 1) th layer of node, n and i are positive integers, i is more than or equal to 1 and less than n, and the device comprises:
an obtaining unit, configured to obtain a node characteristic of an i-th layer node, where the node characteristic is: splicing characteristics obtained after the first type of characteristics and the second type of characteristics are spliced, or the second type of characteristics;
the processing unit is used for inputting the node characteristics of each ith layer node belonging to the same (i + 1) th layer node into a neural network corresponding to the (i + 1) th layer node to obtain the second type characteristics of the (i + 1) th layer node;
a determining unit, configured to, when the (i + 1) th layer node is a root node, input the node feature of the root node into a classifier, and output a category label of the text object through the classifier, wherein the top-most node in the tree structure is the root node;
the first class of features are features extracted through artificial feature engineering, the second class of features are features learned through a machine, each (i + 1) th layer node corresponds to the neural network, and the node features with at least one node are the splicing features.
10. The apparatus of claim 9, wherein the node features of nodes located on at least one path and/or in at least one layer are the splicing features.
11. The apparatus of claim 9, wherein the neural networks corresponding to the same type of the nodes are the same, and the neural networks corresponding to different types of the nodes are different or the same.
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 8 when executing the program.