CN112906379A - Natural language processing technology research and development method based on graph theory - Google Patents
Natural language processing technology research and development method based on graph theory
- Publication number
- CN112906379A
- Authority
- CN
- China
- Prior art keywords
- chinese character
- sentence
- chinese
- data
- chinese characters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/35—Discourse or dialogue representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
Abstract
The invention discloses a graph-theory-based method for researching and developing natural language processing technology, comprising the following steps: 1) storing natural conversation in a graph database in the order of its Chinese characters, and forming the word-association statistics of an N-Gram model in real time; 2) processing the sentences according to sentence-breaking rules to form character connection chains and, once the data reaches a certain magnitude, counting how often adjacent characters occur on the same chain to form N-Gram data; and 3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming by graph-theory methods a processing unit that embodies the data-processing method, and calibrating the resulting rules through intervention by semi-supervised learning. In this way, the relations between the elements of natural language can be extracted more quickly and intuitively, and the extracted natural language rules are more accurate, more concise, and easier to maintain.
Description
Technical Field
The invention relates to the technical fields of artificial intelligence, graph theory, and natural language processing, and in particular to a graph-theory-based method for researching and developing natural language processing technology.
Background
Natural Language Processing (NLP) is the discipline that studies the language problems of human-computer interaction. It aims to have computers understand language so that humans and computers can communicate in natural language.
Natural language processing technology is generally developed around classifiers based on deep learning or statistical analysis. By difficulty of technical implementation, such systems fall into three types: simple matching, fuzzy matching, and paragraph understanding. In every case, the approach comes down to keyword matching. Chinese vocabulary, however, is formed by combining Chinese characters and is loosely structured, and the true semantic intent often depends strongly on context, so simple keyword matching cannot improve the accuracy of intent recognition.
At present, natural language processing is researched and developed mainly around deep learning classifiers. No available material indicates that any organization or individual has combined graph theory with natural language processing to solve some of the problems in the field. The development of knowledge graphs has made the application of graph theory and graph databases gradually mature, and combining graph theory with natural language processing is an innovation.
Disclosure of Invention
The invention aims to provide a graph-theory-based method for researching and developing natural language processing technology, which makes the extraction of relations between the elements of natural language quicker and more intuitive, and makes the extracted natural language rules more accurate, more concise, and easier to maintain.
To solve the above technical problem, the invention adopts the following technical solution: a graph-theory-based method for researching and developing natural language processing technology, comprising the following steps:
1) storing the natural conversation in a graph database in the order of its Chinese characters, and forming the word-association statistics of an N-Gram model in real time;
2) processing the sentences of the conversation according to sentence-breaking rules to form character connection chains; once the data reaches a certain magnitude, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; then, based on the N-Gram data, extracting the association frequencies between words and labeling the parts of speech of the Chinese vocabulary, thereby forming conversation rules;
3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming by graph-theory methods a processing unit that embodies the data-processing method, and calibrating the resulting rules through intervention by semi-supervised learning.
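The text does not state how the adjacency counts in step 2) are turned into N-Gram statistics. Under the usual maximum-likelihood convention (an assumption here, not a formula given in the patent), a bigram estimate over adjacent characters $c_{i-1}$ and $c_i$ on the same chain might be written as:

```latex
P(c_i \mid c_{i-1}) \approx \frac{\operatorname{count}(c_{i-1}\, c_i)}{\operatorname{count}(c_{i-1})}
```

Here $\operatorname{count}(c_{i-1}\, c_i)$ would correspond to the frequency stored on the sequence relation between two character nodes, and $\operatorname{count}(c_{i-1})$ to the frequency stored on the node itself.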
Further, the specific process of step 2) is as follows:
21) generating a conversation code, splitting the conversation into sentences, and attaching each punctuation mark to the sentence that precedes it;
22) splitting the daily conversation into Chinese characters, recording the characters in the graph database in their original order, and marking the sequence relations sentence by sentence;
23) counting the Chinese characters, extracting the vocabulary closely associated with those characters, labeling the parts of speech, and counting the frequency of three levels of speech;
24) counting the sequential relations among the vocabulary;
25) extracting nominal keywords across all the data;
26) continuously revising the sentence-pattern labeling method and the data through supervised training until a grammar device is formed, and using the grammar device to extract the topic of new data.
Further, step 24) specifically comprises counting the sequential relations among the vocabulary, abstracting them into sentence-pattern labels, and labeling the contextual associations within the scope of the session.
Further, step 25) specifically comprises extracting nominal keywords across all the data, classifying and labeling sentences using nouns as subject words, and analyzing the similarity of vocabulary codes between sentences.
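Step 21) can be illustrated with a short sketch. The patent does not say which punctuation marks end a sentence, so the set used below is an assumption, and the function name `split_sentences` is hypothetical.

```python
import re
from typing import List

# Assumed sentence-ending marks; the patent does not enumerate them.
_SENTENCE_PATTERN = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(conversation: str) -> List[str]:
    """Split a conversation into sentences, keeping each run of ending
    punctuation attached to the sentence that precedes it."""
    pieces = _SENTENCE_PATTERN.findall(conversation)
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("你好！今天天气怎么样？我很好。"):
        print(sentence)
```

Keeping the punctuation with the preceding sentence preserves the cue (question, exclamation, statement) that later sentence-pattern labeling might rely on.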
Further, the specific flow of the sentence-breaking rule in step 2) is as follows:
firstly, defining the natural conversation as a scene and encoding it;
secondly, judging whether there is a next sentence and, if so, splitting that sentence into Chinese characters;
thirdly, judging whether each Chinese character already exists: if not, creating a node for the character with its frequency set to 1; if so, increasing the character's frequency by 1;
fourthly, judging whether the character is the first character of the sentence: if not, establishing a sequence relation with the preceding character; if so, marking it as the first character of the sentence;
fifthly, judging whether the character is the last character of the sentence: if not, processing the next character; if so, marking it as the last character of the sentence.
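The flow above can be sketched with an in-memory structure standing in for the graph database. The class `CharGraph`, its field names, and the choice to keep only CJK ideographs are assumptions made for illustration; the patent does not specify a schema or a database product.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, Tuple


@dataclass
class CharGraph:
    """In-memory stand-in for the graph database (hypothetical schema)."""
    freq: DefaultDict[str, int] = field(default_factory=lambda: defaultdict(int))
    next_edges: DefaultDict[Tuple[str, str], int] = field(
        default_factory=lambda: defaultdict(int))
    first_marks: DefaultDict[str, int] = field(default_factory=lambda: defaultdict(int))
    last_marks: DefaultDict[str, int] = field(default_factory=lambda: defaultdict(int))

    def add_sentence(self, sentence: str) -> None:
        # Assumption: only CJK ideographs become nodes; punctuation is skipped.
        chars = [c for c in sentence if "\u4e00" <= c <= "\u9fff"]
        for i, c in enumerate(chars):
            # A missing node is created with frequency 1, otherwise incremented.
            self.freq[c] += 1
            if i == 0:
                self.first_marks[c] += 1  # first character of the sentence
            else:
                self.next_edges[(chars[i - 1], c)] += 1  # sequence relation
            if i == len(chars) - 1:
                self.last_marks[c] += 1  # last character of the sentence


if __name__ == "__main__":
    graph = CharGraph()
    graph.add_sentence("今天天气好。")
    print(dict(graph.freq))
    print(dict(graph.next_edges))
```

Because each sentence is processed independently, no sequence relation is created across a sentence boundary, which matches the requirement that adjacency be counted only on the same chain.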
The beneficial effects of the invention are as follows: the graph-theory-based method for researching and developing natural language processing technology can extract semantic intent from natural conversation, makes natural language processing tasks such as word segmentation and syntactic analysis visualizable, improves the capability to research and develop natural language processing approaches, and improves the accuracy of intent analysis for the same volume of corpus data.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for developing natural language processing based on graph theory according to the present invention;
FIG. 2 is a diagram illustrating an example of a method for developing a natural language processing technology based on graph theory according to the present invention.
FIG. 3 is a schematic diagram of an implementation of the method for developing a natural language processing technology based on graph theory according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are exemplary only, and the invention is not limited to these embodiments.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
Also, in the description of the present invention, the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, only for convenience of description and simplification of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Referring to FIG. 1 to FIG. 3, an embodiment of the invention comprises a graph-theory-based method for researching and developing natural language processing technology, comprising the following steps:
1) storing the natural conversation in a graph database in the order of its Chinese characters, and forming the word-association statistics of an N-Gram model in real time;
2) processing the sentences of the conversation according to sentence-breaking rules to form character connection chains; once the data reaches a certain magnitude, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; then, based on the N-Gram data, extracting the association frequencies between words and labeling the parts of speech of the Chinese vocabulary, thereby forming conversation rules;
3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming by graph-theory methods a processing unit that embodies the data-processing method, and calibrating the resulting rules through intervention by semi-supervised learning.
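As a rough illustration of step 2), the sketch below counts adjacent-character pairs within each sentence and proposes frequent pairs as candidate words once enough data has accumulated. The thresholds `min_total` and `min_count` are hypothetical stand-ins for the "certain magnitude" the text mentions, and the frequency cut-off is a simplification, not the patent's stated extraction rule.

```python
from collections import Counter
from typing import Iterable, List


def bigram_counts(sentences: Iterable[str]) -> Counter:
    """Count adjacent character pairs, never crossing a sentence boundary."""
    counts: Counter = Counter()
    for s in sentences:
        counts.update(zip(s, s[1:]))
    return counts


def candidate_words(sentences: Iterable[str],
                    min_total: int = 1000,
                    min_count: int = 5) -> List[str]:
    """Return two-character candidates whose adjacency count reaches
    min_count, but only after min_total pairs have been observed."""
    counts = bigram_counts(sentences)
    if sum(counts.values()) < min_total:
        return []  # not enough data yet to trust the statistics
    return sorted(a + b for (a, b), n in counts.items() if n >= min_count)


if __name__ == "__main__":
    corpus = ["今天天气好", "今天很好", "天气不错"]
    print(candidate_words(corpus, min_total=5, min_count=2))
```

On the toy corpus, only the pairs 今天 and 天气 occur twice, so they are the only candidates returned; with the default thresholds the function returns nothing because the corpus is too small.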
As shown in FIG. 2, the specific process of step 2) of the invention is as follows:
21) generating a conversation code, splitting the conversation into sentences, and attaching each punctuation mark to the sentence that precedes it;
22) splitting the daily conversation into Chinese characters, recording the characters in the graph database in their original order, and marking the sequence relations sentence by sentence;
23) counting the Chinese characters, extracting the vocabulary closely associated with those characters, labeling the parts of speech, and counting the frequency of three levels of speech;
24) counting the sequential relations among the vocabulary;
25) extracting nominal keywords across all the data;
26) continuously revising the sentence-pattern labeling method and the data through supervised training until a grammar device is formed, and using the grammar device to extract the topic of new data.
Further, step 24) specifically comprises counting the sequential relations among the vocabulary, abstracting them into sentence-pattern labels, and labeling the contextual associations within the scope of the session.
Further, step 25) specifically comprises extracting nominal keywords across all the data, classifying and labeling sentences using nouns as subject words, and analyzing the similarity of vocabulary codes between sentences.
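Steps 24) and 25) might be sketched as follows, assuming sentences have already been segmented and tagged as `(word, part_of_speech)` pairs. The tag names, the function names, and the use of Jaccard overlap as the similarity measure are all assumptions; the patent does not define how "vocabulary coding similarity" is computed.

```python
from typing import List, Set, Tuple

Tagged = List[Tuple[str, str]]  # (word, part-of-speech tag); tag set is hypothetical


def sentence_pattern(tagged: Tagged) -> str:
    """Abstract the word order of a sentence into a pattern label, e.g. 'r-v-n'."""
    return "-".join(pos for _, pos in tagged)


def noun_keywords(tagged: Tagged) -> Set[str]:
    """Collect the nominal keywords of a sentence (tag 'n' assumed to mean noun)."""
    return {word for word, pos in tagged if pos == "n"}


def keyword_similarity(a: Tagged, b: Tagged) -> float:
    """Jaccard overlap of noun keywords, used here as a stand-in similarity."""
    ka, kb = noun_keywords(a), noun_keywords(b)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    s1 = [("我", "r"), ("喜欢", "v"), ("苹果", "n")]
    s2 = [("他", "r"), ("买", "v"), ("苹果", "n"), ("香蕉", "n")]
    print(sentence_pattern(s1), keyword_similarity(s1, s2))
```

Sentences that share a pattern label and a high keyword overlap could then be grouped for the classification and labeling that step 25) describes.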
As shown in FIG. 2, the specific flow of the sentence-breaking rule in step 2) of the invention is as follows:
firstly, defining the natural conversation as a scene and encoding it; if there is a next sentence, proceeding directly to the next operation; if there is no next sentence, ending the flow;
secondly, splitting the sentence into Chinese characters;
thirdly, judging whether each Chinese character already exists: if not, creating a node for the character with its frequency set to 1; if so, increasing the character's frequency by 1;
fourthly, judging whether the character is the first character of the sentence: if not, establishing a sequence relation with the preceding character; if so, marking it as the first character of the sentence;
fifthly, judging whether the character is the last character of the sentence: if not, processing the next character; if so, marking it as the last character of the sentence;
sixthly, repeating the process of splitting sentences into Chinese characters for each remaining sentence.
As shown in FIG. 3, based on the above rules the invention forms a pyramid structure from Chinese characters to words to phrases to sentences, binds the rules tightly to the Chinese character data, and forms by graph-theory methods a processing unit that embodies the data-processing method.
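One way to picture the pyramid is as nested records in which each level is built from the level below and carries its own rule labels. The class and field names below are hypothetical; the patent describes the structure only in general terms and stores it in a graph database rather than in objects like these.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    chars: List[str]   # bottom level: individual Chinese characters
    pos: str = ""      # part-of-speech label bound to the word

    @property
    def text(self) -> str:
        return "".join(self.chars)


@dataclass
class Phrase:
    words: List[Word]

    @property
    def text(self) -> str:
        return "".join(w.text for w in self.words)


@dataclass
class Sentence:
    phrases: List[Phrase]
    pattern: str = ""  # sentence-pattern label bound to the sentence

    @property
    def text(self) -> str:
        return "".join(p.text for p in self.phrases)

    def levels(self) -> Dict[str, object]:
        """Expose all four levels of the pyramid for one sentence."""
        words = [w for p in self.phrases for w in p.words]
        return {
            "characters": [c for w in words for c in w.chars],
            "words": [w.text for w in words],
            "phrases": [p.text for p in self.phrases],
            "sentence": self.text,
        }


if __name__ == "__main__":
    s = Sentence(
        [Phrase([Word(["今", "天"], "t")]),
         Phrase([Word(["天", "气"], "n"), Word(["好"], "a")])],
        pattern="t-n-a",
    )
    print(s.levels())
```

Keeping the labels on the same records as the characters mirrors the idea that the rules are bound to the data rather than held in a separate rule base.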
The invention discloses a method for extracting and maintaining grammar rules during natural language processing. It labels characters according to their structural features in natural language and continuously adds dimensions to the data as natural conversation accumulates, thereby extracting grammar rules. The method can be used to extract semantic intent from natural conversation, makes natural language processing tasks such as word segmentation and syntactic analysis visualizable, and improves the capability to research and develop natural language processing approaches. Conventional syntactic analysis in natural language processing classifies semantic intent through word-combination statistics. It reflects the statistics of local samples and rests on a word-hit probability model, so its accuracy reaches a threshold on the curve of accuracy against corpus data volume. The invention is an algorithm model based on the relational data between words; for the same volume of corpus data, it can raise accuracy, break through that threshold, and improve the accuracy of intent analysis.
On the basis of this model, other feature data can also be derived, such as the dialogue feature data of different roles.
Furthermore, it should be noted that in the present specification, "include" or any other variation thereof is intended to cover a non-exclusive inclusion, so that a process, a method, an article or an apparatus including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It should be understood that although the present description refers to embodiments, not every embodiment contains only a single technical solution, and such description is for clarity only, and those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments understood by those skilled in the art.
Claims (5)
1. A natural language processing technology research and development method based on graph theory, characterized in that the method is as follows:
1) storing the natural conversation in a graph database in the order of its Chinese characters, and forming the word-association statistics of an N-Gram model in real time;
2) processing the sentences of the conversation according to sentence-breaking rules to form character connection chains; once the data reaches a certain magnitude, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; then, based on the N-Gram data, extracting the association frequencies between words and labeling the parts of speech of the Chinese vocabulary, thereby forming conversation rules;
3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming by graph-theory methods a processing unit that embodies the data-processing method, and calibrating the resulting rules through intervention by semi-supervised learning.
2. The natural language processing technology research and development method based on graph theory as claimed in claim 1, characterized in that the specific process of step 2) is as follows:
21) generating a conversation code, splitting the conversation into sentences, and attaching each punctuation mark to the sentence that precedes it;
22) splitting the daily conversation into Chinese characters, recording the characters in the graph database in their original order, and marking the sequence relations sentence by sentence;
23) counting the Chinese characters, extracting the vocabulary closely associated with those characters, labeling the parts of speech, and counting the frequency of three levels of speech;
24) counting the sequential relations among the vocabulary;
25) extracting nominal keywords across all the data;
26) continuously revising the sentence-pattern labeling method and the data through supervised training until a grammar device is formed, and using the grammar device to extract the topic of new data.
3. The natural language processing technology research and development method based on graph theory as claimed in claim 2, characterized in that step 24) specifically comprises counting the sequential relations among the vocabulary, abstracting them into sentence-pattern labels, and labeling the contextual associations within the scope of the session.
4. The natural language processing technology research and development method based on graph theory as claimed in claim 2, characterized in that step 25) specifically comprises extracting nominal keywords across all the data, classifying and labeling sentences using nouns as subject words, and analyzing the similarity of vocabulary codes between sentences.
5. The natural language processing technology research and development method based on graph theory as claimed in claim 1, characterized in that the specific flow of the sentence-breaking rule in step 2) is as follows:
firstly, defining the natural conversation as a scene and encoding it;
secondly, judging whether there is a next sentence and, if so, splitting that sentence into Chinese characters;
thirdly, judging whether each Chinese character already exists: if not, creating a node for the character with its frequency set to 1; if so, increasing the character's frequency by 1;
fourthly, judging whether the character is the first character of the sentence: if not, establishing a sequence relation with the preceding character; if so, marking it as the first character of the sentence;
fifthly, judging whether the character is the last character of the sentence: if not, processing the next character; if so, marking it as the last character of the sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011435391.1A CN112906379B (en) | 2020-12-10 | 2020-12-10 | Method for researching and developing natural language processing technology based on graph theory |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011435391.1A CN112906379B (en) | 2020-12-10 | 2020-12-10 | Method for researching and developing natural language processing technology based on graph theory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112906379A (en) | 2021-06-04 |
CN112906379B (en) | 2023-12-22 |
Family ID=76111536
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011435391.1A Active CN112906379B (en) | 2020-12-10 | 2020-12-10 | Method for researching and developing natural language processing technology based on graph theory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112906379B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07325825A (en) * | 1994-06-01 | 1995-12-12 | Mitsubishi Electric Corp | English grammar checking system device |
CN108388553A (en) * | 2017-12-28 | 2018-08-10 | 广州索答信息科技有限公司 | Talk with method, electronic equipment and the conversational system towards kitchen of disambiguation |
US10387575B1 (en) * | 2019-01-30 | 2019-08-20 | Babylon Partners Limited | Semantic graph traversal for recognition of inferred clauses within natural language inputs |
CN111241299A (en) * | 2020-01-09 | 2020-06-05 | 重庆理工大学 | Knowledge graph automatic construction method for legal consultation and retrieval system thereof |
CN111723215A (en) * | 2020-06-19 | 2020-09-29 | 国家计算机网络与信息安全管理中心 | Device and method for establishing biotechnological information knowledge graph based on text mining |
Non-Patent Citations (1)
Title |
---|
黄毅;冯俊兰;胡珉;吴晓婷;杜晓宇;: "智能对话系统架构及算法", 北京邮电大学学报, no. 06, pages 14 - 23 * |
Also Published As
Publication number | Publication date |
---|---|
CN112906379B (en) | 2023-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109388795B (en) | Named entity recognition method, language recognition method and system | |
US8131539B2 (en) | Search-based word segmentation method and device for language without word boundary tag | |
CN111737496A (en) | Power equipment fault knowledge map construction method | |
CN112101041B (en) | Entity relationship extraction method, device, equipment and medium based on semantic similarity | |
US20100318348A1 (en) | Applying a structured language model to information extraction | |
CN111061882A (en) | Knowledge graph construction method | |
CN109271524B (en) | Entity linking method in knowledge base question-answering system | |
CN112541337B (en) | Document template automatic generation method and system based on recurrent neural network language model | |
CN111274804A (en) | Case information extraction method based on named entity recognition | |
CN111046660B (en) | Method and device for identifying text professional terms | |
CN113268576B (en) | Deep learning-based department semantic information extraction method and device | |
CN110717045A (en) | Letter element automatic extraction method based on letter overview | |
CN110019698A (en) | A kind of intelligent Service method and system of medicine question and answer | |
CN108763192B (en) | Entity relation extraction method and device for text processing | |
CN115292461A (en) | Man-machine interaction learning method and system based on voice recognition | |
CN115858750A (en) | Power grid technical standard intelligent question-answering method and system based on natural language processing | |
CN115713072A (en) | Relation category inference system and method based on prompt learning and context awareness | |
Zimmermann et al. | N-gram language models for offline handwritten text recognition | |
CN109213988B (en) | Barrage theme extraction method, medium, equipment and system based on N-gram model | |
CN112906379B (en) | Method for researching and developing natural language processing technology based on graph theory | |
CN113919339A (en) | Artificial intelligence auxiliary writing method | |
Jafar Tafreshi et al. | A novel approach to conditional random field-based named entity recognition using Persian specific features | |
CN112328811A (en) | Word spectrum clustering intelligent generation method based on same type of phrases | |
CN109960720B (en) | Information extraction method for semi-structured text | |
CN113761919A (en) | Entity attribute extraction method of spoken short text and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||