CN112597298A - Deep learning text classification method fusing knowledge graphs


Info

Publication number
CN112597298A
CN112597298A
Authority
CN
China
Prior art keywords
text
deep learning
word
entity
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011097951.7A
Other languages
Chinese (zh)
Inventor
刘星辰
麻沁甜
陈晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bochi Information Technology Co ltd
Original Assignee
Shanghai Bochi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bochi Information Technology Co ltd filed Critical Shanghai Bochi Information Technology Co ltd
Priority to CN202011097951.7A priority Critical patent/CN112597298A/en
Publication of CN112597298A publication Critical patent/CN112597298A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The deep learning text classification method fusing a knowledge graph comprises the following steps: extracting entities from the text to be classified; acquiring implicit information related to the entities by using an established knowledge graph; converting the entity names and the implicit information into formatted text and appending it to the tail of the original text to form a supplemented text; performing word segmentation on the supplemented text and preprocessing it to obtain the word segmentation sequence of the text; querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word; and inputting the word embedding matrix of the segmentation sequence into a deep learning text classification algorithm for training or prediction. The method overcomes the defects of the prior art by introducing a knowledge graph into deep learning text classification: hidden information is queried from the knowledge graph, converted into formatted text, and used to supplement the original text, thereby improving the accuracy of deep learning text classification.

Description

Deep learning text classification method fusing knowledge graphs
Technical Field
The invention relates to the technical field of deep learning and text classification, and in particular to a deep learning text classification method fusing a knowledge graph.
Background
Text classification has wide application in fields such as the internet and finance. Current deep learning and machine learning text classification models are mostly based on information from the text itself, such as its word segmentation. However, text usually contains many entities such as person names, place names, and organization names, and these entities often carry important hidden information that is not contained in the entity names themselves. Missing this hidden entity information reduces the accuracy of text classification.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a deep learning text classification method fusing a knowledge graph. It introduces the knowledge graph into deep learning text classification: hidden information is queried from the knowledge graph, converted into formatted text, and used to supplement the original text, thereby improving the accuracy of deep learning text classification.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the deep learning text classification method fusing a knowledge graph comprises the following steps:
s1: extracting entities from the text to be classified;
s2: acquiring implicit information related to the entities by using an established knowledge graph;
s3: converting the entity names and the implicit information into formatted text, and appending the formatted text to the tail of the original text to form a supplemented text;
s4: performing word segmentation on the supplemented text, and preprocessing it to obtain the word segmentation sequence of the text;
s5: querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the word segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word;
s6: inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction.
Further, in step S1, the entities in the text to be classified are extracted from the original text by a named entity recognition method, where the entities include names of people, places, organizations, and proper nouns.
Further, step S2 specifically includes: the implicit information of the extracted entities is obtained by querying the existing knowledge graph, where the query includes both directly obtaining attribute values of the entities and indirectly obtaining entity-related information through knowledge reasoning.
Further, step S3 specifically includes: the implicit entity information obtained by querying the knowledge graph is converted into natural language and appended to the end of the original text in the form {entity name: entity information}; multiple entries are appended in the order in which the entities appear in the original text, forming a supplemented text containing the hidden entity information.
Further, each word in the word segmentation sequence in step S4 is obtained by performing word segmentation on the supplemented text, followed by preprocessing that filters out special symbols and stop words.
Further, the word embedding matrix of the word segmentation sequence in step S5 is obtained through the mapping of a preset or randomly initialized word embedding model, where each row of the word embedding matrix is the embedding vector corresponding to one word of the segmentation sequence.
Further, step S6, inputting the word embedding matrix of the segmentation sequence into a deep learning text classification algorithm for training or prediction, specifically includes: the word segmentation sequence matrix obtained in step S5 is input into a convolutional neural network, recurrent neural network, or Transformer deep learning model, and combined with the labels of the samples to train the deep learning text classifier or test its classification accuracy.
The invention provides a deep learning text classification method fusing a knowledge graph, with the following beneficial effects: by extracting the entities of the original text to be classified, obtaining their implicit information through the knowledge graph, and supplementing that information into the original text, the classification accuracy of deep learning text classification can be effectively improved.
Drawings
In order to more clearly illustrate the present invention and the prior art solutions, the drawings needed in their description are briefly introduced below.
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 provides an example of a knowledge graph of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings.
As shown in fig. 1, the deep learning text classification method fusing a knowledge graph comprises the following steps:
s1: extracting entities from the text to be classified;
s2: acquiring implicit information related to the entities by using an established knowledge graph;
s3: converting the entity names and the implicit information into formatted text, and appending the formatted text to the tail of the original text to form a supplemented text;
s4: performing word segmentation on the supplemented text, and preprocessing it to obtain the word segmentation sequence of the text;
s5: querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the word segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word;
s6: inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction.
Specifically, in step S1, entities in the text to be classified are extracted. Entities in the original text are obtained using a Named Entity Recognition (NER) method, and each extracted entity corresponds to one entity type. The entities include names of people, places, organizations, proper nouns, and the like.
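Step S1 can be sketched as follows. A production system would use a trained NER model; here a hypothetical gazetteer (entity name to entity type) stands in, and every name and type below is invented purely for illustration:

```python
# Sketch of step S1: extract typed entities from the text to be classified.
# The gazetteer below is a stand-in for a trained NER model; all entries
# are invented for the example.
GAZETTEER = {
    "Beijing": "place name",
    "Zhang San": "person name",
    "Acme Corp": "organization name",
}

def extract_entities(text, gazetteer=GAZETTEER):
    """Return (entity, entity type) pairs for gazetteer entries found in the text."""
    return [(name, etype) for name, etype in gazetteer.items() if name in text]

found = extract_entities("Zhang San joined Acme Corp in Beijing.")
```

Each returned pair corresponds to one entity and its type, matching the {entity name: entity type} pairs described in the worked example.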
In step S2, the established knowledge graph is used to obtain implicit information related to the entities, combining the entities with reasoning over the knowledge graph. Specifically, the named entities acquired in step S1, such as person names, place names, and organization names, are used to query the constructed knowledge graph for the implicit information of the entities. The query includes both directly obtaining attribute values of an entity and indirectly obtaining entity-related information through knowledge reasoning.
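The two query modes of step S2, direct attribute lookup and indirect retrieval through knowledge reasoning, can be sketched over a toy triple store (every triple below is invented for illustration):

```python
# Sketch of step S2: query a small knowledge graph for implicit entity
# information, stored as (subject, predicate, object) triples.
TRIPLES = [
    ("Acme Corp", "industry", "manufacturing"),
    ("Acme Corp", "headquartered_in", "Beijing"),
    ("Beijing", "is_a", "city"),
]

def attributes(entity, triples=TRIPLES):
    """Direct query: attribute values attached to the entity itself."""
    return {p: o for s, p, o in triples if s == entity}

def infer_location_type(entity, triples=TRIPLES):
    """Indirect query via one-hop knowledge reasoning: follow the
    headquartered_in edge, then read the type of the resulting place."""
    place = attributes(entity, triples).get("headquartered_in")
    return attributes(place, triples).get("is_a") if place else None
```

A real knowledge graph would live in a graph store with a query language, but the direct-versus-inferred distinction is the same.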
In step S3, the entity names and the hidden information are converted into formatted text and appended to the tail of the original text to form a supplemented text. Specifically, each entity in the original text and the corresponding implicit information obtained by querying the knowledge graph are converted into text and appended to the end of the original text in the order in which the entities appear.
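A minimal sketch of the supplementation step, using the {entity name: entity information} format from the claims (the sample text and entity descriptions are invented):

```python
def supplement_text(original, entity_info):
    """Sketch of step S3: append {entity name: entity information} blocks
    to the tail of the original text, in the order the entities appear."""
    if not entity_info:
        return original
    blocks = ["{%s: %s}" % (name, info) for name, info in entity_info]
    return original + " " + "; ".join(blocks)

supplemented = supplement_text(
    "Zhang San joined Acme Corp.",
    [("Zhang San", "a person"), ("Acme Corp", "a manufacturing company")],
)
```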
In step S4, the supplemented text is subjected to word segmentation, and preprocessing such as stop-word and special-symbol filtering yields the word segmentation sequence of the text. Specifically, the text with the hidden entity information added is segmented into words, and stop words and punctuation marks are removed.
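A sketch of the segmentation and filtering step. Real Chinese text would need a dedicated segmentation tool; simple regex tokenization on word characters stands in here, and the stop-word list is invented for the example:

```python
import re

# Illustrative stop-word list; a real pipeline would load a full list
# appropriate to the language being segmented.
STOP_WORDS = {"the", "a", "of", "is"}

def tokenize(text):
    """Sketch of step S4: split the supplemented text into tokens
    (the \\w+ pattern drops punctuation) and filter out stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```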
In step S5, the preset or randomly initialized word embedding model is queried to obtain the embedding matrix of the word segmentation sequence, where each row of the matrix is the embedding vector of one segmented word. Specifically, each word in the segmentation sequence is mapped to a vector of equal dimension through the word embedding model, and the vectors of all words form the word embedding matrix. The word embedding model can be pre-trained (for example, a pre-trained word2vec model) or randomly initialized (for example, with the uniform distribution U(0, 1)).
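The embedding lookup can be sketched as follows, assuming the randomly initialized variant with U(0, 1) vectors (dimension 4 here instead of 200, to keep the illustration small):

```python
import numpy as np

def embedding_matrix(tokens, dim=200, table=None, rng=None):
    """Sketch of step S5: build the word embedding matrix of a
    segmentation sequence. Each row is the embedding vector of one
    segmented word. Unseen words get a vector drawn from U(0, 1),
    cached in `table` so repeated words map to the same vector,
    mimicking a randomly initialized word embedding model."""
    table = {} if table is None else table
    rng = np.random.default_rng(0) if rng is None else rng
    rows = []
    for tok in tokens:
        if tok not in table:
            table[tok] = rng.uniform(0.0, 1.0, size=dim)
        rows.append(table[tok])
    return np.vstack(rows)

# A 3-token sequence with a repeated token yields a 3 x 4 matrix whose
# first and last rows are identical.
M = embedding_matrix(["wool", "score", "wool"], dim=4)
```

With a pre-trained model, `table` would instead be loaded from trained word2vec vectors.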
In step S6, the embedding matrix of the word segmentation sequence is input into the deep learning text classification algorithm for training or classification testing. Specifically, before a text is input to the deep learning model for training or prediction, the same processing is applied: entities are extracted from the original text, their implicit information is queried through the knowledge graph, the entities and implicit information are converted into formatted text, and the formatted text is appended to the end of the text in order. The supplemented text is then segmented, filtered, and converted into a word embedding matrix. The deep learning text classification algorithm is not limited to convolutional neural networks, recurrent neural networks, Transformers, and the like; it can be any deep learning text classification method that takes a word embedding matrix as input.
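To illustrate the interface, here is the forward pass of a deliberately minimal classifier over a word embedding matrix: mean-pool the rows, apply a linear layer, and normalize with softmax. This is a stand-in for the CNN / RNN / Transformer models named above, since any model that accepts a word embedding matrix as input fits the same interface; the weights are random placeholders, not trained parameters:

```python
import numpy as np

def classify(embed_matrix, W, b):
    """Minimal forward pass: mean-pool the word embedding matrix into
    one vector, apply a linear layer, softmax over the classes."""
    pooled = embed_matrix.mean(axis=0)       # (dim,)
    logits = pooled @ W + b                  # (n_classes,)
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 200))   # 36 tokens x 200 dims, as in the worked example
W = rng.normal(size=(200, 3))    # hypothetical weights for 3 classes
b = np.zeros(3)
probs = classify(X, W, b)
```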
The invention extracts entities from the original text, uses the extracted entities to query implicit entity information from an existing knowledge graph, and merges that information into the original text. Deep learning text classification performance is improved by increasing the information content of the original text.
Examples
As shown in fig. 2, take "Sheep Beibei updates the scoring system of the wool score" as the text to be classified;
step S1: two entities, a "beijiao" (dummy entity, used as an example only) and a "wool score" (dummy entity, used as an example only), are extracted from the text, wherein the beijiao is a consumption credit product pushed by a sheep financial service group (dummy entity, used as an example only), so the entity type is a product name, and the wool score is a score of personal credit, and the entity type is a proper noun. Therefore, the entity information that can be extracted from the original text is: { bubei product name of sheep } and { wool fraction: proper noun }. These extracted entity information are input as a query for the next step.
Step S2: for the two entities "Sheep Beibei" and "wool score" extracted in step S1, using the example knowledge graph shown in fig. 2, the following information can be obtained: Sheep Beibei is a consumer credit product, belongs to the Sheep Financial Services Group, and uses the wool score as its credit scoring system; the wool score is a credit scoring system with a scoring range of 150-750, introduced in 2020.
Step S3: the information on both Sheep Beibei and the wool score from step S2 is appended to the end of the original text to form the following supplemented text: "Sheep Beibei updates the scoring system of the wool score. {Sheep Beibei: a consumer credit product, belongs to the Sheep Financial Services Group, uses the wool score as its scoring system}; {wool score: a credit scoring system with a scoring range of 150-750, introduced in 2020}". The amount of information in the supplemented text is clearly much higher than in the original text.
Step S4: the supplemented text obtained in step S3 is segmented, and stop words and punctuation marks are removed, yielding the following token list: ["Sheep Beibei", "update", "wool score", "scoring", "system", "{", "Sheep Beibei", ":", "alias", "Beibei", "one", "consumer credit", "product", "belongs to", "Sheep Financial Services Group", "use", "wool score", "as", "scoring", "system", "}", "{", "wool score", ":", "one", "credit", "scoring", "system", "scoring", "range", "<digit>", "<digit>", "<digit>", "year", "introduce", "}"], where each number in the segmentation is converted to the <digit> identifier.
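The number-to-&lt;digit&gt; conversion at the end of this step can be sketched as:

```python
import re

def mask_digits(tokens):
    """Replace purely numeric tokens with the <digit> identifier, as done
    for the tokens 150, 750, and 2020 in the example token list above."""
    return ["<digit>" if re.fullmatch(r"\d+", t) else t for t in tokens]
```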
Step S5: each word in the segmented sequence obtained in step S4 is converted into a 200-dimensional vector. The sequence contains 36 segmented words in total, so the original segmentation sequence is converted into a 36 × 200 word embedding matrix.
Step S6: the 36 × 200 word embedding matrix obtained in step S5 is input into a convolutional neural network, recurrent neural network, or Transformer deep learning model, and combined with the labels of the samples to train the deep learning text classifier or test its classification accuracy.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A deep learning text classification method fusing a knowledge graph, characterized by comprising the following steps:
s1: extracting entities from the text to be classified;
s2: acquiring implicit information related to the entities by using an established knowledge graph;
s3: converting the entity names and the implicit information into formatted text, and appending the formatted text to the tail of the original text to form a supplemented text;
s4: performing word segmentation on the supplemented text, and preprocessing it to obtain the word segmentation sequence of the text;
s5: querying a preset or randomly initialized word embedding model to obtain the word embedding matrix of the word segmentation sequence, wherein each row of the matrix is the embedding vector of one segmented word;
s6: inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction.
2. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: in step S1, the entities in the text to be classified are extracted from the original text by a named entity recognition method, where the entities include names of people, places, organizations, and proper nouns.
3. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: step S2 specifically includes: the implicit information of the extracted entities is obtained by querying the existing knowledge graph, where the query includes both directly obtaining attribute values of the entities and indirectly obtaining entity-related information through knowledge reasoning.
4. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that step S3 specifically includes: the implicit entity information obtained by querying the knowledge graph is converted into natural language and appended to the end of the original text in the form {entity name: entity information}; multiple entries are appended in the order in which the entities appear in the original text, forming a supplemented text containing the hidden entity information.
5. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: in step S4, each word in the word segmentation sequence is obtained by performing word segmentation on the supplemented text, followed by preprocessing that filters out special symbols and stop words.
6. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that: the word embedding matrix of the word segmentation sequence in step S5 is obtained through the mapping of a preset or randomly initialized word embedding model, where each row of the word embedding matrix is the embedding vector corresponding to one word of the segmentation sequence.
7. The deep learning text classification method fusing a knowledge graph according to claim 1, characterized in that step S6, inputting the word embedding matrix of the word segmentation sequence into a deep learning text classification algorithm for training or prediction, specifically includes: the word segmentation sequence matrix obtained in step S5 is input into a convolutional neural network, recurrent neural network, or Transformer deep learning model, and combined with the labels of the samples to train the deep learning text classifier or test its classification accuracy.
CN202011097951.7A 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps Pending CN112597298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011097951.7A CN112597298A (en) 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011097951.7A CN112597298A (en) 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps

Publications (1)

Publication Number Publication Date
CN112597298A true CN112597298A (en) 2021-04-02

Family

ID=75180672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011097951.7A Pending CN112597298A (en) 2020-10-14 2020-10-14 Deep learning text classification method fusing knowledge maps

Country Status (1)

Country Link
CN (1) CN112597298A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860905A * 2021-04-08 2021-05-28 深圳壹账通智能科技有限公司 Text information extraction method, device and equipment and readable storage medium
CN112926309A * 2021-05-11 2021-06-08 北京智源人工智能研究院 Safety information distinguishing method and device and electronic equipment
CN114519399A * 2022-02-22 2022-05-20 平安科技(深圳)有限公司 Text classification method, device, equipment and storage medium based on artificial intelligence
WO2023159762A1 * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Text classification method and apparatus based on artificial intelligence, device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595708A (en) * 2018-05-10 2018-09-28 北京航空航天大学 A kind of exception information file classification method of knowledge based collection of illustrative plates
CN111061843A (en) * 2019-12-26 2020-04-24 武汉大学 Knowledge graph guided false news detection method
CN111259666A (en) * 2020-01-15 2020-06-09 上海勃池信息技术有限公司 CNN text classification method combined with multi-head self-attention mechanism




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination