CN112906379B - Method for researching and developing natural language processing technology based on graph theory - Google Patents

Info

Publication number
CN112906379B
CN112906379B CN202011435391.1A CN202011435391A CN112906379B CN 112906379 B CN112906379 B CN 112906379B CN 202011435391 A CN202011435391 A CN 202011435391A CN 112906379 B CN112906379 B CN 112906379B
Authority
CN
China
Prior art keywords
chinese character
data
sentence
chinese
sentences
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011435391.1A
Other languages
Chinese (zh)
Other versions
CN112906379A (en)
Inventor
杜爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yingte Leizhen Intelligent Technology Co ltd
Original Assignee
Suzhou Yingte Leizhen Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-12-10
Filing date
2020-12-10
Publication date
2023-12-22
Application filed by Suzhou Yingte Leizhen Intelligent Technology Co ltd
Priority to CN202011435391.1A
Publication of CN112906379A
Application granted
Publication of CN112906379B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for researching and developing natural language processing technology based on graph theory, comprising the following steps: 1) storing natural dialogue in a graph database in Chinese-character order, and forming the word association statistics of an N-Gram model in real time; 2) processing the sentences of the dialogue according to a sentence-breaking rule to form a word connection chain and, once the data reaches a certain volume, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; and 3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese-character data, forming through graph theory a processing method in which the data itself is the processing unit, and applying intervention calibration to the resulting rules through semi-supervised learning. With this method, the relations among the elements of natural language can be extracted more quickly and intuitively, and the extracted natural language rules are more accurate, more concise, and easier to maintain.

Description

Method for researching and developing natural language processing technology based on graph theory
Technical Field
The invention relates to the technical fields of artificial intelligence, graph theory, and natural language processing, and in particular to a method for researching and developing natural language processing technology based on graph theory.
Background
Natural language processing (NLP) is the discipline that studies the language problems arising in interaction between humans and computers. It aims to enable computers to understand language so that humans and computers can communicate in natural language.
Natural language processing technology is generally developed in the direction of classifiers based on deep learning or statistical analysis. Depending on the difficulty of implementation, such systems can be divided into three types: simple matching, fuzzy matching, and paragraph decomposition. Each of these types is a form of keyword matching. Chinese vocabulary is composed of Chinese characters and is loosely structured, and the true semantic intent often depends strongly on context, so matching simple keywords does not raise the accuracy of intent recognition sufficiently.
At present, natural language processing is developed mainly in the direction of deep-learning classifiers, while combining graph theory with natural language processing can solve some of its problems. The development of knowledge graphs has gradually matured the application of graph theory and of graph databases, and combining graph theory with natural language processing is therefore an innovative approach.
Disclosure of Invention
The invention aims to provide a method for researching and developing natural language processing technology based on graph theory that allows the relations among the elements of natural language to be extracted more quickly and intuitively, and makes the extracted natural language rules more accurate, more concise, and easier to maintain.
To solve the above technical problem, the invention adopts the following technical solution: a method for researching and developing natural language processing technology based on graph theory, comprising the following steps:
1) Storing natural dialogue in a graph database in Chinese-character order, and forming the word association statistics of an N-Gram model in real time;
2) Processing the sentences of the dialogue according to a sentence-breaking rule to form a word connection chain and, once the data reaches a certain volume, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association frequency data among words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary, so as to form dialogue rules;
3) Based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese-character data, forming through graph theory a processing method in which the data itself is the processing unit, and applying intervention calibration to the resulting rules through semi-supervised learning.
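Step 1) can be illustrated with a small sketch of N-Gram statistics that are updated as each sentence arrives. This is only an illustration under stated assumptions: the class name IncrementalNGram, its methods, and the sample sentences are hypothetical, and an in-memory counter stands in for the graph database, which the patent does not name.

```python
from collections import Counter


class IncrementalNGram:
    """Character N-Gram counts updated sentence by sentence (illustrative only)."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = Counter()     # n-character tuple -> count
        self.contexts = Counter()   # leading (n-1)-character tuple -> count

    def update(self, sentence):
        """Add the N-Grams of one new sentence to the running counts."""
        chars = list(sentence)
        for i in range(len(chars) - self.n + 1):
            gram = tuple(chars[i:i + self.n])
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1

    def probability(self, *gram):
        """Estimate P(last char | preceding chars) from the counts so far."""
        context = self.contexts[tuple(gram[:-1])]
        return self.ngrams[tuple(gram)] / context if context else 0.0


model = IncrementalNGram(n=2)
for s in ["你好吗", "你好", "你们好"]:   # hypothetical dialogue sentences
    model.update(s)
```

After these three sentences, model.probability("你", "好") evaluates to 2/3, because "你" is followed by "好" in two of its three occurrences as a context.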
Further, the specific procedure of step 2) is as follows:
21) Generating a dialogue code, splitting the dialogue into sentences, and attaching each punctuation mark to the sentence that precedes it;
22) Splitting the sentences of the daily dialogue into Chinese characters, recording the characters in the graph database in their order of occurrence, and marking the sequence relations sentence by sentence;
23) Counting the Chinese characters, extracting words from closely adjacent characters, labeling parts of speech, and counting the three-level word frequency;
24) Counting the vocabulary sequence relations;
25) Extracting noun keywords from the whole data set;
26) Continuously revising the sentence-pattern labeling method and the data through supervised training until a lexical analyzer is formed, and using the lexical analyzer to extract the subject of new data.
Further, step 24) specifically consists of counting the vocabulary sequence relations and abstracting them into sentence-pattern labels, and labeling the association relations between contexts within the scope of the session.
Further, step 25) specifically consists of extracting noun keywords from the whole data set, classifying and labeling sentences with the nouns as subjects, and analyzing the vocabulary-coding similarity among the sentences.
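Step 21) can be sketched as follows. The set of sentence-ending punctuation marks, the use of a random UUID as the dialogue code, and the function names are assumptions made for illustration; the patent does not specify any of them.

```python
import re
import uuid

# Sentence-ending punctuation considered here (an assumed set, Chinese and ASCII).
SENTENCE_END = "。！？!?"


def split_dialogue(text):
    """Split text into sentences, keeping each end mark with the sentence before it."""
    pattern = "[^" + SENTENCE_END + "]+[" + SENTENCE_END + "]*"
    pieces = (m.group().strip() for m in re.finditer(pattern, text))
    return [p for p in pieces if p]


def encode_dialogue(text):
    """Return a record holding a generated dialogue code and the split sentences."""
    return {"dialogue_id": uuid.uuid4().hex, "sentences": split_dialogue(text)}


record = encode_dialogue("你好！今天天气怎么样？我想出去走走。")
```

Here record["sentences"] holds three sentences, each ending with its own punctuation mark, and record["dialogue_id"] is the code under which the whole dialogue would be stored.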
Further, the specific flow of the sentence-breaking rule in step 2) is as follows:
(1) defining a natural dialogue as a scene and encoding it;
(2) determining whether there is a next sentence and, if so, splitting the sentence into Chinese characters;
(3) determining whether the Chinese character already exists; if it does not, creating a Chinese-character node with its frequency set to 1; if it does, increasing the frequency of the character by 1;
(4) determining whether the character is the first Chinese character of the sentence; if it is not, establishing a sequence relation with the previous Chinese character; if it is, marking it as the initial Chinese character of the sentence;
(5) determining whether the character is the last Chinese character of the sentence; if it is not, processing the next Chinese character; if it is, marking it as the ending Chinese character of the sentence.
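The flow in (2) to (5) maps directly onto a loop over sentences and characters. The sketch below keeps the graph in plain dictionaries; the field names freq, starts, and ends and the function name are hypothetical, and a real implementation would write to the graph database instead.

```python
def build_char_graph(sentences):
    """Apply flow steps (2)-(5) to sentences given as strings of characters."""
    nodes = {}   # char -> {"freq": int, "starts": int, "ends": int}
    edges = {}   # (prev_char, char) -> count of the sequence relation
    for sentence in sentences:                  # (2) take the next sentence, if any
        prev = None
        last = len(sentence) - 1
        for i, ch in enumerate(sentence):
            node = nodes.get(ch)
            if node is None:                    # (3) new character: create node
                node = nodes[ch] = {"freq": 1, "starts": 0, "ends": 0}
            else:                               # (3) known character: frequency + 1
                node["freq"] += 1
            if i == 0:                          # (4) mark initial character
                node["starts"] += 1
            else:                               # (4) link to the previous character
                edges[(prev, ch)] = edges.get((prev, ch), 0) + 1
            if i == last:                       # (5) mark ending character
                node["ends"] += 1
            prev = ch
    return nodes, edges


nodes, edges = build_char_graph(["我爱你", "我想你"])
```

For these two sample sentences, the node for "我" has frequency 2 and is marked twice as an initial character, and the node for "你" is marked twice as an ending character.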
The beneficial effects of the invention are as follows: the method can extract semantic intent from natural dialogue and makes natural language processing tasks such as word segmentation and syntactic analysis concrete and visible, which improves the capacity to research and develop natural language processing. With the same amount of corpus data, it can raise accuracy past the threshold and improve the accuracy of intent analysis.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described here show only some embodiments of the invention; those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of the method for researching and developing natural language processing technology based on graph theory according to the present invention;
FIG. 2 is an exemplary diagram of the method for researching and developing natural language processing technology based on graph theory according to the present invention;
FIG. 3 is a schematic diagram of an implementation of the method for researching and developing natural language processing technology based on graph theory according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following detailed description of the embodiments of the present invention will be given with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are merely exemplary and the invention is not limited to these embodiments.
It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.
In addition, in the description of the present invention, the directions or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the drawings. They are used only to simplify the description of the present invention, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Referring to FIGS. 1 to 3, an embodiment of the present invention is a method for researching and developing natural language processing technology based on graph theory, comprising the following steps:
1) Storing natural dialogue in a graph database in Chinese-character order, and forming the word association statistics of an N-Gram model in real time;
2) Processing the sentences of the dialogue according to a sentence-breaking rule to form a word connection chain and, once the data reaches a certain volume, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association frequency data among words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary, so as to form dialogue rules;
3) Based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese-character data, forming through graph theory a processing method in which the data itself is the processing unit, and applying intervention calibration to the resulting rules through semi-supervised learning.
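The second half of step 2), counting adjacent characters once enough data has accumulated and then deriving word candidates, might look like the sketch below. The thresholds min_total and min_count, the function names, and the sample chains are assumptions; the patent says only that counting starts "when the data reach a certain level".

```python
from collections import Counter


def adjacent_counts(chains):
    """Count each pair of adjacent characters on the same chain."""
    counts = Counter()
    for chain in chains:
        counts.update(zip(chain, chain[1:]))
    return counts


def candidate_words(chains, min_total=10, min_count=3):
    """Return frequent adjacent pairs once enough pairs have been observed."""
    counts = adjacent_counts(chains)
    if sum(counts.values()) < min_total:    # data has not reached the level yet
        return {}
    return {a + b: n for (a, b), n in counts.items() if n >= min_count}


chains = ["今天天气好", "今天下雨", "明天天气好", "今天天气冷"]   # hypothetical data
words = candidate_words(chains)
```

With this sample, the candidates are "今天", "天气", and also "天天", which is only an accident of "今天" being followed by "天气". Pure frequency therefore over-generates, which is one reason the part-of-speech labeling and the intervention calibration described in the text are needed.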
As shown in FIG. 2, the specific procedure of step 2) of the present invention is as follows:
21) Generating a dialogue code, splitting the dialogue into sentences, and attaching each punctuation mark to the sentence that precedes it;
22) Splitting the sentences of the daily dialogue into Chinese characters, recording the characters in the graph database in their order of occurrence, and marking the sequence relations sentence by sentence;
23) Counting the Chinese characters, extracting words from closely adjacent characters, labeling parts of speech, and counting the three-level word frequency;
24) Counting the vocabulary sequence relations;
25) Extracting noun keywords from the whole data set;
26) Continuously revising the sentence-pattern labeling method and the data through supervised training until a lexical analyzer is formed, and using the lexical analyzer to extract the subject of new data.
Further, step 24) specifically consists of counting the vocabulary sequence relations and abstracting them into sentence-pattern labels, and labeling the association relations between contexts within the scope of the session.
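One way to read step 24) is that each sentence is reduced to the order of its parts of speech, and consecutive sentences of a session are then linked. The sketch below assumes the sentences are already segmented and tagged; the tag letters, the label format, and the function names are hypothetical.

```python
from collections import Counter


def pattern_label(tagged_sentence):
    """Abstract a tagged sentence into its part-of-speech order, e.g. 'r-v-n'."""
    return "-".join(tag for _, tag in tagged_sentence)


def count_patterns(tagged_sentences):
    """Count how often each sentence-pattern label occurs."""
    return Counter(pattern_label(s) for s in tagged_sentences)


def link_context(labels):
    """Pair each sentence label with the label of the next sentence in the session."""
    return list(zip(labels, labels[1:]))


# Hypothetical hand-tagged session: r = pronoun, v = verb, n = noun, y = particle.
session = [
    [("你", "r"), ("喜欢", "v"), ("茶", "n"), ("吗", "y")],
    [("我", "r"), ("喜欢", "v"), ("茶", "n")],
    [("他", "r"), ("喝", "v"), ("水", "n")],
]
labels = [pattern_label(s) for s in session]
patterns = count_patterns(session)
context_links = link_context(labels)
```

The pattern counts give the statistics of vocabulary order, and context_links records which pattern follows which inside the session, for example a question pattern followed by a statement pattern.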
Further, step 25) specifically consists of extracting noun keywords from the whole data set, classifying and labeling sentences with the nouns as subjects, and analyzing the vocabulary-coding similarity among the sentences.
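Step 25) can be sketched under some loud assumptions: the first noun of a sentence is treated as its subject, each word receives an integer code on first sight, and "vocabulary-coding similarity" is approximated by the Jaccard similarity of the code sets. None of these choices is stated in the patent; they are stand-ins for illustration.

```python
def noun_keywords(tagged_sentence):
    """Return the words tagged as nouns ('n' is an assumed tag)."""
    return [w for w, tag in tagged_sentence if tag == "n"]


def group_by_subject(tagged_sentences):
    """Group sentence indexes under their first noun (assumed to be the subject)."""
    groups = {}
    for idx, sent in enumerate(tagged_sentences):
        nouns = noun_keywords(sent)
        if nouns:
            groups.setdefault(nouns[0], []).append(idx)
    return groups


def encode(tagged_sentence, vocab):
    """Map each word to an integer code, assigning new codes as needed."""
    return {vocab.setdefault(w, len(vocab)) for w, _ in tagged_sentence}


def jaccard(a, b):
    """Jaccard similarity of two code sets (an assumed similarity measure)."""
    return len(a & b) / len(a | b) if a | b else 0.0


corpus = [
    [("天气", "n"), ("很", "d"), ("好", "a")],
    [("天气", "n"), ("很", "d"), ("冷", "a")],
    [("咖啡", "n"), ("很", "d"), ("苦", "a")],
]
vocab = {}
codes = [encode(s, vocab) for s in corpus]
groups = group_by_subject(corpus)
similarity = jaccard(codes[0], codes[1])
```

The two sentences about "天气" fall into the same subject group and share two of four distinct word codes, so their similarity is 0.5, while the sentence about "咖啡" shares only the adverb with them.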
As shown in FIG. 2, the specific flow of the sentence-breaking rule in step 2) of the present invention is as follows:
(1) defining a natural dialogue as a scene and encoding it; if there is a next sentence, proceeding directly to the next operation; if there is no next sentence, ending the flow;
(2) splitting the sentence into Chinese characters;
(3) determining whether the Chinese character already exists; if it does not, creating a Chinese-character node with its frequency set to 1; if it does, increasing the frequency of the character by 1;
(4) determining whether the character is the first Chinese character of the sentence; if it is not, establishing a sequence relation with the previous Chinese character; if it is, marking it as the initial Chinese character of the sentence;
(5) determining whether the character is the last Chinese character of the sentence; if it is not, processing the next Chinese character; if it is, marking it as the ending Chinese character of the sentence;
(6) repeating the above process of splitting sentences into Chinese characters.
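If the graph database happens to accept Cypher queries, the flow above could be turned into parameterized statements as sketched below. This is an assumption throughout: the patent names neither a database product nor a query language, and the Char label, the NEXT relationship, and the freq, starts, and ends properties are hypothetical. The code only builds the statements; executing them would require a database driver, which is left out.

```python
# Hypothetical schema: (:Char {text, freq, starts, ends})-[:NEXT {freq}]->(:Char)
UPSERT_CHAR = (
    "MERGE (c:Char {text: $ch}) "
    "ON CREATE SET c.freq = 1 "
    "ON MATCH SET c.freq = c.freq + 1"
)
LINK_CHARS = (
    "MATCH (a:Char {text: $prev}), (b:Char {text: $ch}) "
    "MERGE (a)-[r:NEXT]->(b) "
    "ON CREATE SET r.freq = 1 "
    "ON MATCH SET r.freq = r.freq + 1"
)
MARK_START = "MATCH (c:Char {text: $ch}) SET c.starts = coalesce(c.starts, 0) + 1"
MARK_END = "MATCH (c:Char {text: $ch}) SET c.ends = coalesce(c.ends, 0) + 1"


def statements_for(sentence):
    """Return (query, parameters) pairs for one sentence, in execution order."""
    out = []
    prev = None
    for i, ch in enumerate(sentence):
        out.append((UPSERT_CHAR, {"ch": ch}))          # flow step (3)
        if prev is None:
            out.append((MARK_START, {"ch": ch}))       # flow step (4), first char
        else:
            out.append((LINK_CHARS, {"prev": prev, "ch": ch}))
        if i == len(sentence) - 1:
            out.append((MARK_END, {"ch": ch}))         # flow step (5), last char
        prev = ch
    return out


plan = statements_for("你好")
```

Using MERGE with ON CREATE and ON MATCH covers both branches of step (3) in a single statement, so the caller does not need a separate existence check.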
As shown in FIG. 3, based on the above rules, the present invention forms a pyramid structure from Chinese characters to words to phrases to sentences; the rules are tightly bound to the Chinese-character data, and graph theory is used to form a processing method in which the data itself is the processing unit.
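The pyramid from characters to words to phrases to sentences can be pictured as a small tree. The sketch below is one possible representation, not the structure used in FIG. 3; the Node class, the level names, and the hand-supplied segmentation are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: str                      # "char", "word", "phrase" or "sentence"
    children: list = field(default_factory=list)


def build_pyramid(phrases):
    """Build sentence -> phrase -> word -> char nodes from nested lists of words."""
    phrase_nodes = []
    for words in phrases:
        word_nodes = [
            Node(w, "word", [Node(c, "char") for c in w]) for w in words
        ]
        phrase_nodes.append(Node("".join(words), "phrase", word_nodes))
    text = "".join(p.text for p in phrase_nodes)
    return Node(text, "sentence", phrase_nodes)


def leaves(node):
    """Collect the character nodes at the base of the pyramid, left to right."""
    if not node.children:
        return [node.text]
    return [c for child in node.children for c in leaves(child)]


# Hypothetical segmentation supplied by hand for illustration.
sentence = build_pyramid([["今天", "天气"], ["很", "好"]])
```

Reading the leaves of the tree from left to right returns the original character sequence, so every higher level stays bound to the character data beneath it.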
The invention discloses a method for extracting and maintaining grammar rules during natural language processing. It labels words according to their structural characteristics in natural language and continuously raises the dimensionality of the data as natural dialogue accumulates, so as to extract grammar rules. The method can be used to extract semantic intent from natural dialogue, and it makes natural language processing tasks such as word segmentation and syntactic analysis visible, which improves the capacity to research and develop natural language processing. In the prior art, syntactic analysis classifies semantic intent through statistics on word combinations; the result reflects the statistics of a local sample, is a probability model based on word hits, and the curve of its accuracy against the amount of corpus data has a threshold. The present invention is an algorithmic model based on the relational data among words; with the same amount of corpus data, it can raise accuracy past that threshold and improve the accuracy of intent analysis.
Other characteristic data, such as the dialogue characteristics of different roles, can also be derived on the basis of the model.
Furthermore, it should be noted that, in this specification, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
It should be understood that although this specification is organized by embodiment, not every embodiment contains only a single independent technical solution. The description is arranged this way for clarity only; the disclosure is not limited to the specific embodiments, and the embodiments may be combined as appropriate to form other embodiments that those skilled in the art will understand.

Claims (3)

1. A method for researching and developing natural language processing technology based on graph theory, characterized in that the method comprises the following steps:
1) storing dialogues in a graph database in Chinese-character order, and forming the word association statistics of an N-Gram model in real time;
2) processing the sentences of the dialogue according to a sentence-breaking rule to form a word connection chain and, once the data reaches a certain volume, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association frequency data among words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary, so as to form dialogue rules;
3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese-character data, forming through graph theory a processing method in which the data itself is the processing unit, and applying intervention calibration to the resulting rules through semi-supervised learning;
the specific procedure of step 2) is as follows:
21) generating a dialogue code, splitting the dialogue into sentences, and attaching each punctuation mark to the sentence that precedes it;
22) splitting the dialogue sentences into Chinese characters, recording the characters in the graph database in their order of occurrence, and marking the sequence relations sentence by sentence;
23) counting adjacent Chinese characters, extracting words, labeling parts of speech, and counting the three-level word frequency of adjacent Chinese characters;
24) counting the vocabulary sequence relations;
25) extracting noun keywords from the whole data set;
26) continuously revising the sentence-pattern labeling method and the data through supervised training until a grammar analyzer is formed, and extracting new data through the grammar analyzer;
the specific flow of the sentence-breaking rule in step 2) is as follows:
(1) defining a natural dialogue as a scene and encoding it;
(2) determining whether there is a next sentence and, if so, splitting the sentence into Chinese characters;
(3) determining whether the Chinese character already exists; if it does not, creating a Chinese-character node with its frequency set to 1; if it does, increasing the frequency of the character by 1;
(4) determining whether the character is the first Chinese character of the sentence; if it is not, establishing a sequence relation with the previous Chinese character; if it is, marking it as the initial Chinese character of the sentence;
(5) determining whether the character is the last Chinese character of the sentence; if it is not, processing the next Chinese character; if it is, marking it as the ending Chinese character of the sentence.
2. The method for researching and developing natural language processing technology based on graph theory according to claim 1, characterized in that step 24) specifically consists of counting the vocabulary sequence relations and abstracting them into sentence-pattern labels, and labeling the association relations between contexts within the scope of the session.
3. The method for researching and developing natural language processing technology based on graph theory according to claim 1, characterized in that step 25) specifically consists of extracting noun keywords from the whole data set, classifying and labeling sentences with the nouns as subjects, and analyzing the vocabulary-coding similarity among the sentences.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011435391.1A CN112906379B (en) 2020-12-10 2020-12-10 Method for researching and developing natural language processing technology based on graph theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011435391.1A CN112906379B (en) 2020-12-10 2020-12-10 Method for researching and developing natural language processing technology based on graph theory

Publications (2)

Publication Number Publication Date
CN112906379A CN112906379A (en) 2021-06-04
CN112906379B CN112906379B (en) 2023-12-22

Family

ID=76111536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011435391.1A Active CN112906379B (en) 2020-12-10 2020-12-10 Method for researching and developing natural language processing technology based on graph theory

Country Status (1)

Country Link
CN (1) CN112906379B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07325825A (en) * 1994-06-01 1995-12-12 Mitsubishi Electric Corp English grammar checking system device
CN108388553A (en) * 2017-12-28 2018-08-10 广州索答信息科技有限公司 Talk with method, electronic equipment and the conversational system towards kitchen of disambiguation
US10387575B1 (en) * 2019-01-30 2019-08-20 Babylon Partners Limited Semantic graph traversal for recognition of inferred clauses within natural language inputs
CN111241299A (en) * 2020-01-09 2020-06-05 重庆理工大学 Knowledge graph automatic construction method for legal consultation and retrieval system thereof
CN111723215A (en) * 2020-06-19 2020-09-29 国家计算机网络与信息安全管理中心 Device and method for establishing biotechnological information knowledge graph based on text mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
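Architecture and algorithms of intelligent dialogue systems; 黄毅; 冯俊兰; 胡珉; 吴晓婷; 杜晓宇; Journal of Beijing University of Posts and Telecommunications (No. 06); 14-23-23 *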

Also Published As

Publication number Publication date
CN112906379A (en) 2021-06-04

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant