CN112597298A - Deep learning text classification method fusing knowledge graphs


Info

Publication number
CN112597298A
CN112597298A
Authority
CN
China
Prior art keywords
text
deep learning
word
entity
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011097951.7A
Other languages
Chinese (zh)
Inventor
刘星辰
麻沁甜
陈晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bochi Information Technology Co ltd
Original Assignee
Shanghai Bochi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bochi Information Technology Co ltd filed Critical Shanghai Bochi Information Technology Co ltd
Priority to CN202011097951.7A priority Critical patent/CN112597298A/en
Publication of CN112597298A publication Critical patent/CN112597298A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The deep learning text classification method fusing a knowledge graph comprises the following steps: extracting entities from the text to be classified; acquiring implicit information related to the entities by using an established knowledge graph; converting the entity names and the implicit information into formatted text and appending it to the tail of the original text to form a supplemented text; performing word segmentation on the supplemented text and preprocessing it to obtain the word segmentation sequence of the text; querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word; and inputting the word embedding matrix of the segmentation sequence into a deep learning text classification algorithm for training or prediction. The method overcomes the defects of the prior art by introducing a knowledge graph into deep learning text classification: hidden information is queried from the knowledge graph, converted into formatted text, and used to supplement the original text, thereby improving the accuracy of deep learning text classification.

Description

Deep learning text classification method fusing knowledge graphs
Technical Field
The invention relates to the technical field of deep learning and text classification, and in particular to a deep learning text classification method fusing a knowledge graph.
Background
Text classification has wide application in fields such as the internet and finance. Current deep learning and machine learning text classification models are mostly based on information from the text itself, such as its word segmentation. However, text usually contains many entities such as person names, place names, and organization names, and these entities often carry important hidden information that is not contained in the entity names themselves. Missing this hidden entity information reduces the accuracy of text classification.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a deep learning text classification method fusing a knowledge graph. It introduces the knowledge graph into deep learning text classification: hidden information is queried from the knowledge graph, converted into formatted text, and used to supplement the original text, thereby improving the accuracy of deep learning text classification.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the deep learning text classification method fusing a knowledge graph comprises the following steps:
s1: extracting entities from the text to be classified;
s2: acquiring implicit information related to the entities by using an established knowledge graph;
s3: converting the entity names and the implicit information into formatted text, and appending the formatted text to the tail of the original text to form a supplemented text;
s4: performing word segmentation on the supplemented text, and preprocessing it to obtain the word segmentation sequence of the text;
s5: querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the word segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word;
s6: inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction.
Further, in step S1, the entities in the text to be classified are extracted from the original text by a named entity recognition method, where the entities include names of people, places, organizations, and proper nouns.
Further, step S2 specifically includes: the implicit information of the extracted entities is obtained by querying the existing knowledge graph, where the query includes both directly obtaining attribute values of the entities and indirectly obtaining entity-related information through knowledge reasoning.
Further, step S3 specifically includes: the implicit entity information obtained by querying the knowledge graph is converted into natural language and appended to the end of the original text in the form {entity name: entity information}; multiple entries are appended in the order in which the entities appear in the original text, forming a supplemented text containing the hidden entity information.
Further, each word in the word segmentation sequence in step S4 is obtained by performing word segmentation on the supplemented text, followed by preprocessing that filters out special symbols and stop words.
Further, the word embedding matrix of the word segmentation sequence in step S5 is obtained through the mapping of a preset or randomly initialized word embedding model, where each row of the word embedding matrix is the embedding vector corresponding to one word of the segmentation sequence.
Further, step S6, inputting the word embedding matrix of the segmentation sequence into a deep learning text classification algorithm for training or prediction, specifically includes: the word segmentation sequence matrix obtained in step S5 is input into a convolutional neural network, recurrent neural network, or Transformer deep learning model, and combined with the labels of the samples to train the deep learning text classifier or test its classification accuracy.
The invention provides a deep learning text classification method fusing a knowledge graph, with the following beneficial effects: by extracting the entities of the original text to be classified, obtaining their implicit information through the knowledge graph, and supplementing that information into the original text, the classification accuracy of deep learning text classification can be effectively improved.
Drawings
In order to more clearly illustrate the present invention and the prior art solutions, the drawings needed in their description are briefly introduced below.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 provides an example of a knowledge graph of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in fig. 1, the deep learning text classification method fusing a knowledge graph comprises the following steps:
s1: extracting entities from the text to be classified;
s2: acquiring implicit information related to the entities by using an established knowledge graph;
s3: converting the entity names and the implicit information into formatted text, and appending the formatted text to the tail of the original text to form a supplemented text;
s4: performing word segmentation on the supplemented text, and preprocessing it to obtain the word segmentation sequence of the text;
s5: querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the word segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word;
s6: inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction.
Specifically, in step S1, entities in the text to be classified are extracted. Entities in the original text are obtained using a Named Entity Recognition (NER) method, and each extracted entity corresponds to one entity type. The entities include names of people, places, organizations, proper nouns, and the like.
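Step S1 can be sketched as follows. A production system would use a trained NER model; here a hypothetical gazetteer (entity name to entity type) stands in, and every name and type below is invented purely for illustration:

```python
# Sketch of step S1: extract typed entities from the text to be classified.
# The gazetteer below is a stand-in for a trained NER model; all entries
# are invented for the example.
GAZETTEER = {
    "Beijing": "place name",
    "Zhang San": "person name",
    "Acme Corp": "organization name",
}

def extract_entities(text, gazetteer=GAZETTEER):
    """Return (entity, entity type) pairs for gazetteer entries found in the text."""
    return [(name, etype) for name, etype in gazetteer.items() if name in text]

found = extract_entities("Zhang San joined Acme Corp in Beijing.")
```

Each returned pair corresponds to one entity and its type, matching the {entity name: entity type} pairs described in the worked example.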
In step S2, the established knowledge graph is used to obtain implicit information related to the entities, combining the entities with reasoning over the knowledge graph. Specifically, the named entities acquired in step S1, such as person names, place names, and organization names, are used to query the constructed knowledge graph for the implicit information of the entities. The query includes both directly obtaining attribute values of an entity and indirectly obtaining entity-related information through knowledge reasoning.
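The two query modes of step S2, direct attribute lookup and indirect retrieval through knowledge reasoning, can be sketched over a toy triple store (every triple below is invented for illustration):

```python
# Sketch of step S2: query a small knowledge graph for implicit entity
# information, stored as (subject, predicate, object) triples.
TRIPLES = [
    ("Acme Corp", "industry", "manufacturing"),
    ("Acme Corp", "headquartered_in", "Beijing"),
    ("Beijing", "is_a", "city"),
]

def attributes(entity, triples=TRIPLES):
    """Direct query: attribute values attached to the entity itself."""
    return {p: o for s, p, o in triples if s == entity}

def infer_location_type(entity, triples=TRIPLES):
    """Indirect query via one-hop knowledge reasoning: follow the
    headquartered_in edge, then read the type of the resulting place."""
    place = attributes(entity, triples).get("headquartered_in")
    return attributes(place, triples).get("is_a") if place else None
```

A real knowledge graph would live in a graph store with a query language, but the direct-versus-inferred distinction is the same.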
In step S3, the entity names and the hidden information are converted into formatted text and appended to the tail of the original text to form a supplemented text. Specifically, each entity in the original text and the corresponding implicit information obtained by querying the knowledge graph are converted into text and appended to the end of the original text in the order in which the entities appear.
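A minimal sketch of the supplementation step, using the {entity name: entity information} format from the claims (the sample text and entity descriptions are invented):

```python
def supplement_text(original, entity_info):
    """Sketch of step S3: append {entity name: entity information} blocks
    to the tail of the original text, in the order the entities appear."""
    if not entity_info:
        return original
    blocks = ["{%s: %s}" % (name, info) for name, info in entity_info]
    return original + " " + "; ".join(blocks)

supplemented = supplement_text(
    "Zhang San joined Acme Corp.",
    [("Zhang San", "a person"), ("Acme Corp", "a manufacturing company")],
)
```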
In step S4, the supplemented text is subjected to word segmentation, and preprocessing such as stop-word and special-symbol filtering yields the word segmentation sequence of the text. Specifically, the text with the hidden entity information added is segmented into words, and stop words and punctuation marks are removed.
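A sketch of the segmentation and filtering step. Real Chinese text would need a dedicated segmentation tool; simple regex tokenization on word characters stands in here, and the stop-word list is invented for the example:

```python
import re

# Illustrative stop-word list; a real pipeline would load a full list
# appropriate to the language being segmented.
STOP_WORDS = {"the", "a", "of", "is"}

def tokenize(text):
    """Sketch of step S4: split the supplemented text into tokens
    (the \\w+ pattern drops punctuation) and filter out stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```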
In step S5, the preset or randomly initialized word embedding model is queried to obtain the embedding matrix of the word segmentation sequence, where each row of the matrix is the embedding vector of one segmented word. Specifically, each word in the segmentation sequence is mapped to a vector of equal dimension through the word embedding model, and the vectors of all words form the word embedding matrix. The word embedding model can be pre-trained (for example, a pre-trained word2vec model) or randomly initialized (for example, with the uniform distribution U(0, 1)).
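The embedding lookup can be sketched as follows, assuming the randomly initialized variant with U(0, 1) vectors (dimension 4 here instead of 200, to keep the illustration small):

```python
import numpy as np

def embedding_matrix(tokens, dim=200, table=None, rng=None):
    """Sketch of step S5: build the word embedding matrix of a
    segmentation sequence. Each row is the embedding vector of one
    segmented word. Unseen words get a vector drawn from U(0, 1),
    cached in `table` so repeated words map to the same vector,
    mimicking a randomly initialized word embedding model."""
    table = {} if table is None else table
    rng = np.random.default_rng(0) if rng is None else rng
    rows = []
    for tok in tokens:
        if tok not in table:
            table[tok] = rng.uniform(0.0, 1.0, size=dim)
        rows.append(table[tok])
    return np.vstack(rows)

# A 3-token sequence with a repeated token yields a 3 x 4 matrix whose
# first and last rows are identical.
M = embedding_matrix(["wool", "score", "wool"], dim=4)
```

With a pre-trained model, `table` would instead be loaded from trained word2vec vectors.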
In step S6, the embedding matrix of the word segmentation sequence is input into the deep learning text classification algorithm for training or classification testing. Specifically, before a text is input to the deep learning model for training or prediction, the same processing is applied: entities are extracted from the original text, their implicit information is queried through the knowledge graph, the entities and implicit information are converted into formatted text, and the formatted text is appended to the end of the text in order. The supplemented text is then segmented, filtered, and converted into a word embedding matrix. The deep learning text classification algorithm is not limited to convolutional neural networks, recurrent neural networks, Transformers, and the like; it can be any deep learning text classification method that takes a word embedding matrix as input.
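To illustrate the interface, here is the forward pass of a deliberately minimal classifier over a word embedding matrix: mean-pool the rows, apply a linear layer, and normalize with softmax. This is a stand-in for the CNN / RNN / Transformer models named above, since any model that accepts a word embedding matrix as input fits the same interface; the weights are random placeholders, not trained parameters:

```python
import numpy as np

def classify(embed_matrix, W, b):
    """Minimal forward pass: mean-pool the word embedding matrix into
    one vector, apply a linear layer, softmax over the classes."""
    pooled = embed_matrix.mean(axis=0)       # (dim,)
    logits = pooled @ W + b                  # (n_classes,)
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 200))   # 36 tokens x 200 dims, as in the worked example
W = rng.normal(size=(200, 3))    # hypothetical weights for 3 classes
b = np.zeros(3)
probs = classify(X, W, b)
```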
The invention extracts entities from the original text, uses the extracted entities to query implicit entity information from an existing knowledge graph, and merges that information into the original text. Deep learning text classification performance is improved by increasing the information content of the original text.
Examples
As shown in fig. 2, take "Sheep Beibei updates the scoring system of the wool score" as the text to be classified;
step S1: two entities, a "beijiao" (dummy entity, used as an example only) and a "wool score" (dummy entity, used as an example only), are extracted from the text, wherein the beijiao is a consumption credit product pushed by a sheep financial service group (dummy entity, used as an example only), so the entity type is a product name, and the wool score is a score of personal credit, and the entity type is a proper noun. Therefore, the entity information that can be extracted from the original text is: { bubei product name of sheep } and { wool fraction: proper noun }. These extracted entity information are input as a query for the next step.
Step S2: for the two entities "Sheep Beibei" and "wool score" extracted in step S1, using the example knowledge graph shown in fig. 2, the following information can be obtained: Sheep Beibei is a consumer credit product, belongs to the Sheep Financial Services Group, and uses the wool score as its credit scoring system; the wool score is a credit scoring system with a scoring range of 150-750, introduced in 2020.
Step S3: the information on both Sheep Beibei and the wool score from step S2 is appended to the end of the original text to form the following supplemented text: "Sheep Beibei updates the scoring system of the wool score. {Sheep Beibei: a consumer credit product, belongs to the Sheep Financial Services Group, uses the wool score as its scoring system}; {wool score: a credit scoring system with a scoring range of 150-750, introduced in 2020}". The amount of information in the supplemented text is clearly much higher than in the original text.
Step S4: the supplemented text obtained in step S3 is segmented, and stop words and punctuation marks are removed, yielding the following token list: ["Sheep Beibei", "update", "wool score", "scoring", "system", "{", "Sheep Beibei", ":", "alias", "Beibei", "one", "consumer credit", "product", "belongs to", "Sheep Financial Services Group", "use", "wool score", "as", "scoring", "system", "}", "{", "wool score", ":", "one", "credit", "scoring", "system", "scoring", "range", "<digit>", "<digit>", "<digit>", "year", "introduce", "}"], where each number in the segmentation is converted to the <digit> identifier.
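The number-to-&lt;digit&gt; conversion at the end of this step can be sketched as:

```python
import re

def mask_digits(tokens):
    """Replace purely numeric tokens with the <digit> identifier, as done
    for the tokens 150, 750, and 2020 in the example token list above."""
    return ["<digit>" if re.fullmatch(r"\d+", t) else t for t in tokens]
```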
Step S5: each word in the segmented sequence obtained in step S4 is converted into a 200-dimensional vector. The sequence contains 36 segmented words in total, so the original segmentation sequence is converted into a 36 × 200 word embedding matrix.
Step S6: the 36 × 200 word embedding matrix obtained in step S5 is input into a convolutional neural network, recurrent neural network, or Transformer deep learning model, and combined with the labels of the samples to train the deep learning text classifier or test its classification accuracy.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A deep learning text classification method fusing a knowledge graph, characterized by comprising the following steps:
s1: extracting entities from the text to be classified;
s2: acquiring implicit information related to the entities by using an established knowledge graph;
s3: converting the entity names and the implicit information into formatted text, and appending the formatted text to the tail of the original text to form a supplemented text;
s4: performing word segmentation on the supplemented text, and preprocessing it to obtain the word segmentation sequence of the text;
s5: querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the word segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word;
s6: inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction.
2. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: in step S1, the entities in the text to be classified are extracted from the original text by a named entity recognition method, where the entities include names of people, places, organizations, and proper nouns.
3. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: step S2 specifically includes: the implicit information of the extracted entities is obtained by querying the existing knowledge graph, where the query includes both directly obtaining attribute values of the entities and indirectly obtaining entity-related information through knowledge reasoning.
4. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that step S3 specifically includes: the implicit entity information obtained by querying the knowledge graph is converted into natural language and appended to the end of the original text in the form {entity name: entity information}; multiple entries are appended in the order in which the entities appear in the original text, forming a supplemented text containing the hidden entity information.
5. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: in step S4, each word in the word segmentation sequence is obtained by performing word segmentation on the supplemented text, followed by preprocessing that filters out special symbols and stop words.
6. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: the word embedding matrix of the word segmentation sequence in step S5 is obtained through the mapping of a preset or randomly initialized word embedding model, where each row of the word embedding matrix is the embedding vector corresponding to one word of the segmentation sequence.
7. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that step S6, inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction, specifically includes: the word segmentation sequence matrix obtained in step S5 is input into a convolutional neural network, recurrent neural network, or Transformer deep learning model, and combined with the labels of the samples to train the deep learning text classifier or test its classification accuracy.
CN202011097951.7A 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps Pending CN112597298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011097951.7A CN112597298A (en) 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011097951.7A CN112597298A (en) 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps

Publications (1)

Publication Number Publication Date
CN112597298A true CN112597298A (en) 2021-04-02

Family

ID=75180672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011097951.7A Pending CN112597298A (en) 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps

Country Status (1)

Country Link
CN (1) CN112597298A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860905A * 2021-04-08 2021-05-28 深圳壹账通智能科技有限公司 Text information extraction method, device and equipment and readable storage medium
CN112926309A * 2021-05-11 2021-06-08 北京智源人工智能研究院 Safety information distinguishing method and device and electronic equipment
CN114519399A * 2022-02-22 2022-05-20 平安科技(深圳)有限公司 Text classification method, device, equipment and storage medium based on artificial intelligence
WO2023159762A1 * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Text classification method and apparatus based on artificial intelligence, device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595708A (en) * 2018-05-10 2018-09-28 北京航空航天大学 A kind of exception information file classification method of knowledge based collection of illustrative plates
CN111061843A (en) * 2019-12-26 2020-04-24 武汉大学 Knowledge graph guided false news detection method
CN111259666A (en) * 2020-01-15 2020-06-09 上海勃池信息技术有限公司 CNN text classification method combined with multi-head self-attention mechanism




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination