CN112906379A

CN112906379A - Natural language processing technology research and development method based on graph theory

Info

Publication number: CN112906379A
Application number: CN202011435391.1A
Authority: CN
Inventors: 杜爽
Original assignee: Suzhou Yingte Leizhen Intelligent Technology Co ltd
Current assignee: Suzhou Yingte Leizhen Intelligent Technology Co ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-06-04
Anticipated expiration: 2040-12-10
Also published as: CN112906379B

Abstract

The invention discloses a method for researching and developing natural language processing technology based on graph theory, comprising the following steps: 1) storing natural conversation in a graph database in Chinese-character order and forming the word-association statistics of an N-Gram model in real time; 2) splitting sentences according to sentence-breaking rules to form character connection chains and, once the data reaches a sufficient volume, counting how often adjacent characters occur on the same chain to form N-Gram data; and 3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese-character data, forming a data processing method (a processing unit) by graph-theoretic means, and calibrating the resulting rules through intervention by semi-supervised learning. In this way, the relations between the elements of natural language can be extracted more quickly and intuitively, and the extracted natural language rules are more accurate, more concise and easier to maintain.

Description

Natural language processing technology research and development method based on graph theory

Technical Field

The invention relates to the technical fields of artificial intelligence, graph theory and natural language processing, and in particular to a method for researching and developing natural language processing technology based on graph theory.

Background

Natural Language Processing (NLP) is the discipline that studies language problems in human-computer interaction. It aims to have computers understand language so that humans and computers can communicate in natural language.

Natural language processing technology is generally developed around deep-learning or statistical classifiers. By implementation difficulty, such systems fall into three types: simple matching, fuzzy matching and paragraph understanding. In every case the approach comes down to keyword matching. Chinese vocabulary, however, is formed by combining Chinese characters and is loosely structured; the real semantic intent often depends strongly on context, so simple keyword matching cannot improve the accuracy of intent recognition.

At present, research and development of natural language processing is carried out around deep-learning classifiers. No available data indicates that any organization or individual has combined graph theory with natural language processing to solve some of the problems in the field. The development of knowledge graphs has gradually matured the application of graph theory and graph databases, and combining graph theory with natural language processing is an innovation.

Disclosure of Invention

The invention aims to provide a method for researching and developing natural language processing technology based on graph theory that makes the extraction of relations between the elements of natural language quicker and more intuitive, and makes the extracted natural language rules more accurate, simpler and easier to maintain.

To solve the above technical problem, the invention adopts the following technical scheme: a method for researching and developing natural language processing technology based on graph theory, comprising the following steps:

1) storing the natural conversation in a graph database in the order of its Chinese characters, and forming the word-association statistics of an N-Gram model in real time;

2) splitting the sentences of the conversation according to sentence-breaking rules to form character connection chains and, once the data reaches a sufficient volume, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association-frequency data between words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary, so as to form conversation rules;

3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese-character data, forming a data processing method (a processing unit) by graph-theoretic means, and calibrating the resulting rules through intervention by semi-supervised learning.
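The counting in step 2) can be illustrated with a minimal Python sketch. It assumes the character connection chains are already available as plain strings and uses an in-memory `Counter` in place of a graph database; the function names, the sample chains and the `min_count` threshold are hypothetical and are not taken from the patent.

```python
from collections import Counter


def adjacent_char_counts(chains):
    """Count how often each ordered pair of adjacent characters occurs.

    `chains` is an iterable of strings, each one a character connection
    chain (assumed here to be a sentence with punctuation removed).
    """
    counts = Counter()
    for chain in chains:
        # zip pairs every character with its successor on the same chain
        for left, right in zip(chain, chain[1:]):
            counts[(left, right)] += 1
    return counts


def frequent_pairs(counts, min_count):
    """Keep pairs seen at least `min_count` times as word candidates."""
    return {a + b: n for (a, b), n in counts.items() if n >= min_count}


# Hypothetical sample data, for illustration only.
chains = ["今天天气好", "今天下雨", "明天天气好"]
counts = adjacent_char_counts(chains)
candidates = frequent_pairs(counts, min_count=2)
```

With more data, pairs that recur across many chains (such as 天气 in the sample) stand out from incidental neighbours, which is the sense in which the bigram counts can feed word extraction.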

Further, the specific procedure of step 2) is as follows:

21) generating a conversation code, splitting the conversation into sentences, and attaching each punctuation mark to the sentence that precedes it;

22) splitting the daily conversation into Chinese characters, recording the characters in the graph database in their original order, and marking the sequence relations with the sentence as the unit;

23) counting the Chinese characters, extracting the words formed by closely adjacent characters, labeling their parts of speech, and counting frequencies at three levels;

24) counting the order relations between words;

25) extracting nominal keywords across all of the data;

26) continuously revising the sentence-pattern labeling method and the data through supervised training to finally form a grammar device, and extracting the topic of new data with the grammar device.
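Step 21) can be sketched as follows. This is a minimal illustration under stated assumptions: the set of sentence-ending marks, the use of a random UUID as the conversation code, and the names `split_sentences` and `encode_conversation` are all hypothetical, since the source does not specify them.

```python
import re
import uuid

# Assumed sentence-ending marks; the source does not enumerate them.
_SENTENCE = re.compile(r"[^。！？!?；;]+[。！？!?；;]*")


def split_sentences(text):
    """Split text into sentences, keeping end marks with the preceding sentence."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def encode_conversation(text):
    """Attach a conversation code (here a random UUID) to the split sentences."""
    return {"code": uuid.uuid4().hex, "sentences": split_sentences(text)}


# Hypothetical sample conversation, for illustration only.
conversation = encode_conversation("你好！今天天气怎么样？我想出去走走。")
```

Keeping the punctuation with the preceding sentence preserves cues such as question marks for later sentence-pattern labeling.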

Further, step 24) specifically comprises counting the order relations between words, abstracting them into sentence-pattern labels, and labeling the context association relations within the scope of the session.
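One possible reading of step 24) is sketched below in Python. It assumes that sentences are already segmented and part-of-speech tagged as lists of `(word, tag)` pairs, and that a sentence-pattern label is simply the sequence of tags; both are assumptions made for illustration, and the tag letters and sample sentences are hypothetical.

```python
from collections import Counter


def sentence_pattern(tagged):
    """Abstract a tagged sentence into a pattern label (assumed: its tag sequence)."""
    return "-".join(pos for _, pos in tagged)


def word_order_stats(tagged_sentences):
    """Count ordered adjacent word pairs and sentence-pattern labels."""
    order = Counter()
    patterns = Counter()
    for tagged in tagged_sentences:
        words = [word for word, _ in tagged]
        # each element of zip(...) is an ordered (previous word, next word) pair
        order.update(zip(words, words[1:]))
        patterns[sentence_pattern(tagged)] += 1
    return order, patterns


# Hypothetical tagged sentences: r = pronoun, v = verb, n = noun, t = time word.
tagged_sentences = [
    [("我", "r"), ("喜欢", "v"), ("苹果", "n")],
    [("他", "r"), ("喜欢", "v"), ("香蕉", "n")],
    [("今天", "t"), ("下雨", "v")],
]
order, patterns = word_order_stats(tagged_sentences)
```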

Further, step 25) specifically comprises extracting nominal keywords across all of the data, classifying and labeling sentences with the nouns as subject words, and analyzing the similarity of vocabulary coding between sentences.
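The source does not define how the similarity of vocabulary coding is measured. As a hedged illustration only, the Python sketch below groups sentences by their noun keywords and compares two sentences by the Jaccard similarity of their word sets; the Jaccard measure, the tag `"n"` for nouns and all names are assumptions swapped in for the unspecified method.

```python
def noun_keywords(tagged):
    """Return the words tagged as nouns (assumed tag: "n")."""
    return [word for word, pos in tagged if pos == "n"]


def group_by_subject(tagged_sentences):
    """Map each noun keyword to the indices of the sentences that contain it."""
    groups = {}
    for index, tagged in enumerate(tagged_sentences):
        for noun in noun_keywords(tagged):
            groups.setdefault(noun, []).append(index)
    return groups


def jaccard(words_a, words_b):
    """Jaccard similarity of two word lists, used here as a stand-in measure."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical tagged sentences, for illustration only.
tagged_sentences = [
    [("我", "r"), ("喜欢", "v"), ("苹果", "n")],
    [("他", "r"), ("喜欢", "v"), ("苹果", "n")],
]
groups = group_by_subject(tagged_sentences)
similarity = jaccard(
    [w for w, _ in tagged_sentences[0]],
    [w for w, _ in tagged_sentences[1]],
)
```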

Further, the sentence-breaking rule in step 2) has the following specific flow:

firstly, defining the natural conversation as a scene and encoding it;

secondly, judging whether a next sentence exists, and splitting the sentence into Chinese characters;

thirdly, judging whether the Chinese character already exists; if it does not, creating a Chinese-character node with its frequency set to 1; if it does, increasing its frequency by 1;

fourthly, judging whether the Chinese character is the first Chinese character of the sentence; if it is not, establishing a sequence relation with the previous Chinese character; if it is, marking it as the first Chinese character of the sentence;

fifthly, judging whether the Chinese character is the last Chinese character of the sentence; if it is not, processing the next Chinese character; if it is, marking it as the last Chinese character of the sentence.
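The flow above can be sketched in Python with plain dictionaries standing in for the graph database. The class name `CharGraph`, its attribute names, and the decision to keep only characters in the basic CJK Unified Ideographs range are assumptions made for this illustration, not details given in the source.

```python
from collections import defaultdict


class CharGraph:
    """In-memory stand-in for the graph database (hypothetical structure)."""

    def __init__(self):
        self.freq = {}                      # character node -> frequency
        self.next_edges = defaultdict(int)  # (previous, current) -> count
        self.first = defaultdict(int)       # times a character opened a sentence
        self.last = defaultdict(int)        # times a character closed a sentence

    def add_sentence(self, sentence):
        # Assumption: keep only characters in the basic CJK ideograph block.
        chars = [c for c in sentence if "\u4e00" <= c <= "\u9fff"]
        prev = None
        for i, ch in enumerate(chars):
            # Create the node with frequency 1, or increase its frequency by 1.
            self.freq[ch] = self.freq.get(ch, 0) + 1
            if i == 0:
                self.first[ch] += 1              # mark as sentence-initial
            else:
                self.next_edges[(prev, ch)] += 1  # sequence relation
            if i == len(chars) - 1:
                self.last[ch] += 1               # mark as sentence-final
            prev = ch

    def add_conversation(self, sentences):
        # Repeat the per-sentence process while a next sentence exists.
        for sentence in sentences:
            self.add_sentence(sentence)


# Hypothetical sample conversation, for illustration only.
graph = CharGraph()
graph.add_conversation(["你好！", "你好吗？"])
```

In a real graph database the same checks would typically become "create the node if absent, otherwise update it" operations; the dictionaries here only mirror that logic.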

The beneficial effects of the invention are as follows: the method for researching and developing natural language processing technology based on graph theory can extract semantic intent from natural conversation, make natural language processing tasks such as word segmentation and syntactic analysis visualizable, improve the capacity to research and develop natural language processing approaches, and, for the same volume of corpus data, improve the accuracy of intent analysis.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of a method for developing natural language processing based on graph theory according to the present invention;

FIG. 2 is a diagram illustrating an example of a method for developing a natural language processing technology based on graph theory according to the present invention.

FIG. 3 is a schematic diagram of an implementation of the method for developing a natural language processing technology based on graph theory according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are exemplary only, and the invention is not limited to these embodiments.

It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.

Also, in the description of the present invention, the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, only for convenience of description and simplification of description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Referring to FIGS. 1 to 3, an embodiment of the invention comprises a method for researching and developing natural language processing technology based on graph theory, with the following steps:

As shown in FIG. 2, the specific procedure of step 2) of the invention is as follows:

24) counting the order relations between words;

As shown in FIG. 2, the sentence-breaking rule in step 2) of the invention has the following specific flow:

firstly, defining the natural conversation as a scene and encoding it; if a next sentence exists, proceeding directly to the next operation, and if no next sentence exists, ending the process;

secondly, splitting the sentence into Chinese characters;

sixthly, repeating the process of splitting sentences into Chinese characters.

FIG. 3 shows an implementation in which, based on the above rules, a pyramid structure is formed from Chinese characters to words to phrases to sentences, the rules are tightly bound to the Chinese-character data, and the data processing method (the processing unit) is formed by graph-theoretic means.
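The pyramid from characters to words to phrases to sentences can be pictured as layered "part of" edges in a graph. The Python sketch below is one hypothetical encoding: the edge tuple layout `(child_level, child, parent_level, parent)`, the function names and the sample segmentation are all assumptions made for illustration and are not described in the source.

```python
def build_pyramid(sentence_id, phrases):
    """Build "part of" edges for one sentence.

    `phrases` is a list of phrases, each given as a list of words.
    Each edge is a tuple (child_level, child, parent_level, parent).
    """
    edges = set()
    for words in phrases:
        phrase = "".join(words)
        edges.add(("phrase", phrase, "sentence", sentence_id))
        for word in words:
            edges.add(("word", word, "phrase", phrase))
            for ch in word:
                edges.add(("char", ch, "word", word))
    return edges


def words_containing(edges, ch):
    """Walk one level up the pyramid: the words that a character belongs to."""
    return {parent for level, child, _, parent in edges
            if level == "char" and child == ch}


# Hypothetical segmentation of one sentence, for illustration only.
edges = build_pyramid("s1", [["今天"], ["天气", "很", "好"]])
```

Because every layer is stored as edges, a rule attached to a word or phrase remains reachable from the characters beneath it, which is one way to read "binding the rules tightly to the Chinese-character data".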

The invention discloses a method for extracting and maintaining grammar rules in natural language processing. It labels characters according to their structural features in natural language and continuously adds data dimensions as natural conversation accumulates, so as to extract the grammar rules. The method can be used to extract semantic intent from natural conversation, to make natural language processing tasks such as word segmentation and syntactic analysis visualizable, and to improve the capacity to research and develop natural language processing approaches. Conventional syntactic analysis in natural language processing classifies semantic intent through word-combination statistics. It reflects the statistics of local samples and rests on a word-hit probability model, so the curve of accuracy against corpus data volume levels off at a threshold. The invention is an algorithm model based on relational data between words; for the same volume of corpus data it can break through that threshold and improve the accuracy of intent analysis.

Other feature data, such as the dialogue feature data of different speakers, can be derived on the basis of the model.

Furthermore, it should be noted that in this specification the term "include" or any variation thereof is intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.

It should be understood that although the present description refers to embodiments, not every embodiment contains only a single technical solution, and such description is for clarity only, and those skilled in the art should take the description as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments understood by those skilled in the art.

Claims

1. A method for researching and developing natural language processing technology based on graph theory, characterized in that the method comprises the following steps:

2. The method for researching and developing natural language processing technology based on graph theory as claimed in claim 1, wherein the specific procedure of step 2) is as follows:

24) counting the order relations between words;

3. The method for researching and developing natural language processing technology based on graph theory as claimed in claim 2, wherein step 24) specifically comprises counting the order relations between words, abstracting them into sentence-pattern labels, and labeling the context association relations within the scope of the session.

4. The method for researching and developing natural language processing technology based on graph theory as claimed in claim 2, wherein step 25) specifically comprises extracting nominal keywords across all of the data, classifying and labeling sentences with the nouns as subject words, and analyzing the similarity of vocabulary coding between sentences.

5. The method for researching and developing natural language processing technology based on graph theory as claimed in claim 1, wherein the sentence-breaking rule in step 2) has the following specific flow:

defining natural conversation as a scene for coding;