CN112906379A - Natural language processing technology research and development method based on graph theory - Google Patents

Info

Publication number
CN112906379A
Authority
CN
China
Prior art keywords
chinese character
sentence
chinese
data
chinese characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011435391.1A
Other languages
Chinese (zh)
Other versions
CN112906379B (en)
Inventor
杜爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yingte Leizhen Intelligent Technology Co ltd
Original Assignee
Suzhou Yingte Leizhen Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Yingte Leizhen Intelligent Technology Co ltd
Priority to CN202011435391.1A
Publication of CN112906379A
Application granted
Publication of CN112906379B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models

Abstract

The invention discloses a graph-theory-based method for researching and developing natural language processing technology, comprising the following steps: 1) storing natural conversations in a graph database in the order of their Chinese characters, and forming N-Gram word-association statistics in real time; 2) processing the sentences according to sentence-breaking rules to form character connection chains, and, once the data reaches a certain magnitude, counting how often adjacent characters occur on the same chain to form N-Gram data; and 3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming a data processing method (that is, a processing unit) by graph-theoretic methods, and calibrating the formed rules through intervention by a semi-supervised learning method. In this way, the relations between the elements of natural language can be extracted more quickly and intuitively, and the extracted natural language rules are more accurate, concise and easy to maintain.

Description

Natural language processing technology research and development method based on graph theory
Technical Field
The invention relates to the technical fields of artificial intelligence, graph theory and natural language processing, and in particular to a graph-theory-based method for researching and developing natural language processing technology.
Background
Natural Language Processing (NLP) is the discipline that studies the language problems of human-computer interaction; it aims to have computers understand language so that humans and computers can communicate in natural language.
In general, natural language processing technology is developed around classifiers based on deep learning or statistical analysis. Depending on implementation difficulty, such systems fall into three types: simple matching, fuzzy matching and paragraph understanding. In every case, the approach comes down to keyword matching. Chinese vocabulary, however, is formed by combining Chinese characters and is loosely structured; the true semantic intent is often strongly tied to context, so simple keyword matching cannot improve the accuracy of intent recognition.
At present, natural language processing is researched and developed mainly around deep learning classifiers. Combining graph theory with natural language processing could solve some of the problems in the field, yet no available material indicates that any organization or individual has pursued such research and development. The growth of knowledge graphs has gradually matured the application of graph theory and graph databases, so combining graph theory with natural language processing is an innovation.
Disclosure of Invention
The invention aims to provide a graph-theory-based method for researching and developing natural language processing technology, which makes relation extraction between the elements of natural language quicker and more intuitive, and makes the extracted natural language rules more accurate, simpler and easier to maintain.
In order to solve the technical problems, the invention adopts a technical scheme that: a research and development method of a natural language processing technology based on a graph theory is provided, and the research method comprises the following steps:
1) storing natural conversations in a graph database in the order of their Chinese characters, and forming N-Gram word-association statistics in real time;
2) processing the sentences of a conversation according to sentence-breaking rules to form character connection chains; once the data reaches a certain magnitude, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association-frequency data between words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary, thereby forming conversation rules;
3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming a data processing method (that is, a processing unit) by graph-theoretic methods, and calibrating the formed rules through intervention by a semi-supervised learning method.
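The counting in step 2) can be illustrated with a minimal Python sketch. It assumes each character connection chain is simply the string of characters of one sentence, and that "adjacent" means immediately consecutive (a bigram); the function name `count_adjacent_pairs` and the sample chains are hypothetical and do not come from the patent.

```python
from collections import Counter

def count_adjacent_pairs(chains):
    """Count how often each ordered pair of adjacent characters occurs on the same chain."""
    pair_counts = Counter()
    for chain in chains:
        # zip(chain, chain[1:]) pairs every character with its successor,
        # so pairs never span two different chains.
        for left, right in zip(chain, chain[1:]):
            pair_counts[(left, right)] += 1
    return pair_counts

# Hypothetical chains, one per sentence.
chains = ["你好吗", "你好", "好的"]
pairs = count_adjacent_pairs(chains)
```

Because the pairs are counted per chain, the last character of one sentence is never linked to the first character of the next.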
Further, the specific research process of step 2) is as follows:
21) generating a conversation code, splitting the conversation into sentences, and attaching each punctuation mark to the preceding sentence;
22) splitting the daily conversation into Chinese characters, recording the split characters in a graph database in their original order, and marking the sequence relations sentence by sentence;
23) counting the Chinese characters, extracting the vocabulary adjacent to each character, labeling the parts of speech, and counting the frequency of three levels of speech;
24) counting the sequential relations of the vocabulary;
25) extracting nominal keywords across all the data;
26) continuously revising the sentence-pattern labeling methods and data through supervised training to finally form a grammar parser, and extracting the topic of new data with that parser.
Further, step 24) specifically comprises counting the word-order relations, abstracting them into sentence-pattern labels, and labeling the contextual associations within the session.
Further, step 25) specifically comprises extracting nominal keywords across all the data, classifying and labeling sentences whose subject words are nouns, and analyzing the similarity of vocabulary coding between sentences.
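Step 21) above, in which punctuation is collected into the preceding sentence, might be sketched as follows. The set of sentence-ending marks is an assumption (the patent does not list them), and `split_sentences` is a hypothetical helper name.

```python
import re

# Assumed sentence-ending marks; the source does not enumerate them.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")

def split_sentences(conversation):
    """Split text into sentences, keeping trailing punctuation with the sentence it ends."""
    sentences = []
    for match in _SENTENCE.finditer(conversation):
        sentence = match.group().strip()
        if sentence:  # skip pieces that were only whitespace
            sentences.append(sentence)
    return sentences

result = split_sentences("你好吗？我很好。谢谢")
```

Each match consumes a run of ordinary characters plus any terminators that follow it, so a mark such as "？" stays with the sentence it closes rather than starting the next one.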
Further, the sentence-breaking rule in step 2) has the following specific flow:
firstly, defining the natural conversation as a scene and encoding it;
secondly, judging whether there is a next sentence, and if so, splitting the sentence into Chinese characters;
thirdly, judging whether the Chinese character already exists; if it does not, creating a Chinese character node with its frequency set to 1; if it does, incrementing the frequency of the Chinese character by 1;
fourthly, judging whether the current Chinese character is the first character of the sentence; if it is not, establishing a sequence relation with the previous Chinese character; if it is, marking it as the first Chinese character of the sentence;
fifthly, judging whether the current Chinese character is the last character of the sentence; if it is not, processing the next Chinese character; if it is, marking it as the last Chinese character of the sentence.
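The flow above can be sketched with plain in-memory dictionaries standing in for the graph database. This is only an illustration under stated assumptions: the class name `CharGraph` and its attributes are hypothetical, and restricting nodes to the CJK Unified Ideographs block (U+4E00 to U+9FFF) is an assumed definition of "Chinese character".

```python
from collections import defaultdict

class CharGraph:
    """In-memory stand-in for the graph database (hypothetical structure)."""

    def __init__(self):
        self.freq = defaultdict(int)            # character node -> frequency
        self.next_edges = defaultdict(int)      # (previous, current) -> count
        self.sentence_first = defaultdict(int)  # times a character opened a sentence
        self.sentence_last = defaultdict(int)   # times a character closed a sentence

    def add_sentence(self, sentence):
        # Assumption: only characters in the basic CJK block become nodes.
        chars = [c for c in sentence if "\u4e00" <= c <= "\u9fff"]
        previous = None
        for index, char in enumerate(chars):
            # New node starts at frequency 1; an existing node is incremented.
            self.freq[char] += 1
            if index == 0:
                self.sentence_first[char] += 1
            else:
                # Sequence relation from the previous character to this one.
                self.next_edges[(previous, char)] += 1
            if index == len(chars) - 1:
                self.sentence_last[char] += 1
            previous = char

graph = CharGraph()
for s in ["你好吗？", "你好。"]:
    graph.add_sentence(s)
```

In a real graph database the nodes and edges would be persisted records; the dictionaries here only mirror the bookkeeping that the flow describes.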
The beneficial effects of the invention are as follows: the graph-theory-based method can extract semantic intent from natural conversation, makes natural language processing tasks such as word segmentation and syntactic analysis visual, improves the capability to research and develop natural language processing approaches, and, for the same volume of corpus data, improves the accuracy of intent analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a method for developing natural language processing based on graph theory according to the present invention;
FIG. 2 is a diagram illustrating an example of a method for developing a natural language processing technology based on graph theory according to the present invention.
FIG. 3 is a schematic diagram of an implementation of the method for developing a natural language processing technology based on graph theory according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are exemplary only, and the invention is not limited to these embodiments.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
Also, in the description of the present invention, the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, only for convenience of description and simplification of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Referring to fig. 1 to 3, an embodiment of the present invention includes: a research and development method of natural language processing technology based on graph theory comprises the following steps:
1) storing natural conversations in a graph database in the order of their Chinese characters, and forming N-Gram word-association statistics in real time;
2) processing the sentences of a conversation according to sentence-breaking rules to form character connection chains; once the data reaches a certain magnitude, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association-frequency data between words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary, thereby forming conversation rules;
3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming a data processing method (that is, a processing unit) by graph-theoretic methods, and calibrating the formed rules through intervention by a semi-supervised learning method.
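One plausible reading of the "association frequency" derived from N-Gram data in step 2) is the relative frequency with which one character follows another. The sketch below takes that reading as an assumption; the patent does not give a formula, and `transition_probabilities` and the sample counts are hypothetical.

```python
from collections import Counter

def transition_probabilities(pair_counts):
    """Turn adjacent-pair counts into relative frequencies P(right | left)."""
    left_totals = Counter()
    for (left, _right), count in pair_counts.items():
        left_totals[left] += count
    return {
        pair: count / left_totals[pair[0]]
        for pair, count in pair_counts.items()
    }

# Hypothetical adjacent-character counts.
pair_counts = Counter({("你", "好"): 3, ("你", "们"): 1, ("好", "的"): 2})
probs = transition_probabilities(pair_counts)
```

Pairs with a high relative frequency are natural candidates for being treated as words, which is one way the character level could feed the word level.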
As shown in FIG. 2, the specific study procedure of step 2) of the present invention is as follows:
21) generating a conversation code, splitting the conversation into sentences, and attaching each punctuation mark to the preceding sentence;
22) splitting the daily conversation into Chinese characters, recording the split characters in a graph database in their original order, and marking the sequence relations sentence by sentence;
23) counting the Chinese characters, extracting the vocabulary adjacent to each character, labeling the parts of speech, and counting the frequency of three levels of speech;
24) counting the sequential relations of the vocabulary;
25) extracting nominal keywords across all the data;
26) continuously revising the sentence-pattern labeling methods and data through supervised training to finally form a grammar parser, and extracting the topic of new data with that parser.
Further, step 24) specifically comprises counting the word-order relations, abstracting them into sentence-pattern labels, and labeling the contextual associations within the session.
Further, step 25) specifically comprises extracting nominal keywords across all the data, classifying and labeling sentences whose subject words are nouns, and analyzing the similarity of vocabulary coding between sentences.
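The patent does not say how the similarity between sentences in step 25) is measured. As one possible illustration, the sketch below swaps in Jaccard similarity over each sentence's set of nominal keywords; both the choice of measure and the name `keyword_similarity` are assumptions.

```python
def keyword_similarity(keywords_a, keywords_b):
    """Jaccard similarity between two keyword collections (assumed measure)."""
    a, b = set(keywords_a), set(keywords_b)
    if not a and not b:
        return 0.0  # two empty keyword sets are treated as unrelated
    return len(a & b) / len(a | b)

# Hypothetical nominal keywords extracted from two sentences.
score = keyword_similarity(["天气", "北京"], ["天气", "上海"])
```

Here the two sentences share one keyword out of three distinct ones, so the score is one third.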
As shown in fig. 2, the sentence-breaking rule in step 2) of the present invention has the following specific flow:
firstly, defining the natural conversation as a scene and encoding it; if there is a next sentence, proceeding to the next operation; if there is none, ending the process;
secondly, splitting the sentence into Chinese characters;
thirdly, judging whether the Chinese character already exists; if it does not, creating a Chinese character node with its frequency set to 1; if it does, incrementing the frequency of the Chinese character by 1;
fourthly, judging whether the current Chinese character is the first character of the sentence; if it is not, establishing a sequence relation with the previous Chinese character; if it is, marking it as the first Chinese character of the sentence;
fifthly, judging whether the current Chinese character is the last character of the sentence; if it is not, processing the next Chinese character; if it is, marking it as the last Chinese character of the sentence;
sixthly, repeating the process of splitting Chinese characters sentence by sentence.
As shown in FIG. 3, based on the above rules the invention forms a pyramid structure from Chinese characters to words to phrases to sentences; the rules are tightly bound to the Chinese character data, and FIG. 3 shows an embodiment of the data processing method, that is, the processing unit, formed by graph-theoretic methods.
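The pyramid from characters to words to phrases to sentences can be pictured as nested containers. The sketch below is only an assumed data layout for illustration (FIG. 3 itself is not reproduced in the text); the class names and the `pos` field are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:
    chars: List[str]   # bottom level: the Chinese characters of the word
    pos: str = ""      # part-of-speech label attached at the word level

@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)

@dataclass
class Sentence:
    phrases: List[Phrase] = field(default_factory=list)

    def text(self):
        """Flatten the pyramid back into the original character sequence."""
        return "".join(
            ch for phrase in self.phrases for word in phrase.words for ch in word.chars
        )

sentence = Sentence(phrases=[
    Phrase(words=[Word(["今", "天"], pos="noun")]),
    Phrase(words=[Word(["天", "气"], pos="noun"), Word(["好"], pos="adjective")]),
])
```

Keeping the characters at the leaves means every higher-level rule stays tied to the underlying character data, which matches the binding the description emphasizes.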
The invention discloses a method for extracting and maintaining grammar rules during natural language processing: it labels characters according to their structural features in natural language and continuously raises the dimensionality of the data gathered from natural conversation, thereby extracting the grammar rules. The method can be used to extract semantic intent from natural conversation, makes natural language processing tasks such as word segmentation and syntactic analysis visual, and improves the capability to research and develop natural language processing approaches. Conventional syntactic analysis in natural language processing classifies semantic intent through word-combination statistics; it reflects the statistics of local samples, rests on a word-hit probability model, and its accuracy reaches a threshold on the curve of accuracy against corpus data volume. The invention is an algorithm model based on relational data between words; for the same corpus data volume it can raise accuracy, break through that threshold, and improve the accuracy of intent analysis.
Other feature data, such as the dialogue feature data of different speakers (roles), can be derived on the basis of the model.
Furthermore, it should be noted that in the present specification, "include" or any other variation thereof is intended to cover a non-exclusive inclusion, so that a process, a method, an article or an apparatus including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It should be understood that although the present description refers to embodiments, not every embodiment contains only a single technical solution, and such description is for clarity only, and those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments understood by those skilled in the art.

Claims (5)

1. A natural language processing technology research and development method based on graph theory, characterized in that the method comprises the following steps:
1) storing natural conversations in a graph database in the order of their Chinese characters, and forming N-Gram word-association statistics in real time;
2) processing the sentences of a conversation according to sentence-breaking rules to form character connection chains; once the data reaches a certain magnitude, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association-frequency data between words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary, thereby forming conversation rules;
3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, forming a data processing method (that is, a processing unit) by graph-theoretic methods, and calibrating the formed rules through intervention by a semi-supervised learning method.
2. The method for developing a natural language processing technology based on graph theory as claimed in claim 1, wherein the specific research process of step 2) is as follows:
21) generating a conversation code, splitting the conversation into sentences, and attaching each punctuation mark to the preceding sentence;
22) splitting the daily conversation into Chinese characters, recording the split characters in a graph database in their original order, and marking the sequence relations sentence by sentence;
23) counting the Chinese characters, extracting the vocabulary adjacent to each character, labeling the parts of speech, and counting the frequency of three levels of speech;
24) counting the sequential relations of the vocabulary;
25) extracting nominal keywords across all the data;
26) continuously revising the sentence-pattern labeling methods and data through supervised training to finally form a grammar parser, and extracting the topic of new data with that parser.
3. The method for developing a natural language processing technology based on graph theory as claimed in claim 2, wherein step 24) specifically comprises counting the word-order relations, abstracting them into sentence-pattern labels, and labeling the contextual associations within the session.
4. The method for developing a natural language processing technology based on graph theory as claimed in claim 2, wherein step 25) specifically comprises extracting nominal keywords across all the data, classifying and labeling sentences whose subject words are nouns, and analyzing the similarity of vocabulary coding between sentences.
5. The method for developing a natural language processing technology based on graph theory as claimed in claim 1, wherein the sentence-breaking rule in step 2) has the following specific flow:
firstly, defining the natural conversation as a scene and encoding it;
secondly, judging whether there is a next sentence, and if so, splitting the sentence into Chinese characters;
thirdly, judging whether the Chinese character already exists; if it does not, creating a Chinese character node with its frequency set to 1; if it does, incrementing the frequency of the Chinese character by 1;
fourthly, judging whether the current Chinese character is the first character of the sentence; if it is not, establishing a sequence relation with the previous Chinese character; if it is, marking it as the first Chinese character of the sentence;
fifthly, judging whether the current Chinese character is the last character of the sentence; if it is not, processing the next Chinese character; if it is, marking it as the last Chinese character of the sentence.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011435391.1A CN112906379B (en) 2020-12-10 2020-12-10 Method for researching and developing natural language processing technology based on graph theory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011435391.1A CN112906379B (en) 2020-12-10 2020-12-10 Method for researching and developing natural language processing technology based on graph theory

Publications (2)

Publication Number Publication Date
CN112906379A (en) 2021-06-04
CN112906379B (en) 2023-12-22

Family

ID=76111536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011435391.1A Active CN112906379B (en) 2020-12-10 2020-12-10 Method for researching and developing natural language processing technology based on graph theory

Country Status (1)

Country Link
CN (1) CN112906379B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07325825A (en) * 1994-06-01 1995-12-12 Mitsubishi Electric Corp English grammar checking system device
CN108388553A (en) * 2017-12-28 2018-08-10 广州索答信息科技有限公司 Talk with method, electronic equipment and the conversational system towards kitchen of disambiguation
US10387575B1 (en) * 2019-01-30 2019-08-20 Babylon Partners Limited Semantic graph traversal for recognition of inferred clauses within natural language inputs
CN111241299A (en) * 2020-01-09 2020-06-05 重庆理工大学 Knowledge graph automatic construction method for legal consultation and retrieval system thereof
CN111723215A (en) * 2020-06-19 2020-09-29 国家计算机网络与信息安全管理中心 Device and method for establishing biotechnological information knowledge graph based on text mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
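黄毅; 冯俊兰; 胡珉; 吴晓婷; 杜晓宇: "Architecture and Algorithms of Intelligent Dialogue Systems" (智能对话系统架构及算法), Journal of Beijing University of Posts and Telecommunications (北京邮电大学学报), no. 06, pages 14-23 *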

Also Published As

Publication number Publication date
CN112906379B (en) 2023-12-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant