CN112906379B

CN112906379B - Method for researching and developing natural language processing technology based on graph theory

Info

Publication number: CN112906379B
Application number: CN202011435391.1A
Authority: CN
Inventors: 杜爽
Original assignee: Suzhou Yingte Leizhen Intelligent Technology Co ltd
Current assignee: Suzhou Yingte Leizhen Intelligent Technology Co ltd
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2023-12-22
Anticipated expiration: 2040-12-10
Also published as: CN112906379A

Abstract

The invention discloses a research and development method for natural language processing technology based on graph theory, comprising the following steps: 1) storing natural dialogue in a graph database in Chinese character order, forming the word association statistics of an N-Gram model in real time; 2) processing the sentences of the dialogue according to a sentence-breaking rule to form word-connection chains and, once the data reach a certain volume, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; and 3) based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, using graph-theoretic methods to form a processing method in which the data itself is the processing unit, and applying intervention calibration to the resulting rules through semi-supervised learning. With this method, the relations among the elements of natural language can be extracted more quickly and intuitively, and the extracted natural language rules are more accurate, more concise, and easier to maintain.

Description

Method for researching and developing natural language processing technology based on graph theory

Technical Field

The invention relates to the technical fields of artificial intelligence, graph theory, and natural language processing, and in particular to a method for researching and developing natural language processing technology based on graph theory.

Background

Natural language processing (NLP) is the discipline that studies the language problems of human-computer interaction. It aims to let computers understand language, so that humans and computers can communicate in natural language.

In general, natural language processing technology is developed along the lines of classifiers based on deep learning or statistical analysis. Depending on the difficulty of implementation, such systems fall into three types: simple matching, fuzzy matching, and paragraph decomposition. Each type is, in essence, a form of keyword matching. Chinese vocabulary is composed of Chinese characters and is loosely structured, and the true semantic intention often depends strongly on context, so matching on simple keywords limits how far the accuracy of intention recognition can be improved.

At present, natural language processing is mostly developed in the direction of deep-learning classifiers, whereas combining graph theory with natural language processing can solve some of the problems in natural language processing. The development of knowledge graphs has gradually matured the application of graph theory and of graph databases, and incorporating graph theory into natural language processing is therefore innovative.

Disclosure of Invention

The invention aims to provide a research and development method for natural language processing technology based on graph theory, which allows the relations among the elements of natural language to be extracted more quickly and intuitively and makes the extracted natural language rules more accurate, more concise, and easier to maintain.

To solve the above technical problems, the invention adopts the following technical scheme: a research and development method for natural language processing technology based on graph theory is provided, comprising the following steps:

1) Storing natural dialogue in a graph database in Chinese character order, and forming the word association statistics of an N-Gram model in real time;

2) Processing the sentences of the dialogue according to the sentence-breaking rule to form word-connection chains, and, once the data reach a certain volume, counting how often adjacent Chinese characters occur on the same chain to form N-Gram data; extracting association frequency data among words from the N-Gram data, and labeling the parts of speech of the Chinese vocabulary so as to form dialogue rules;

3) Based on the rules, forming a pyramid structure from Chinese characters to words to phrases to sentences, binding the rules tightly to the Chinese character data, using graph-theoretic methods to form a processing method in which the data itself is the processing unit, and applying intervention calibration to the resulting rules through semi-supervised learning.
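The counting in steps 1) and 2) can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain `Counter` stands in for the graph database, "association frequency" is assumed to mean a raw count of adjacent characters, and the names `count_ngrams`, `frequent_pairs`, and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def count_ngrams(chains, max_n=3):
    """Count character n-grams (n = 1..max_n) within each chain.

    Each chain is one sentence; n-grams never cross a chain boundary.
    """
    counts = Counter()
    for chain in chains:
        for n in range(1, max_n + 1):
            for i in range(len(chain) - n + 1):
                counts[chain[i:i + n]] += 1
    return counts


def frequent_pairs(counts, min_count=2):
    """Return adjacent character pairs seen at least min_count times.

    Treating frequent pairs as word candidates is an assumption of this sketch.
    """
    return sorted(g for g, c in counts.items() if len(g) == 2 and c >= min_count)


chains = ["今天天气好", "今天下雨", "明天天气好"]
counts = count_ngrams(chains)
candidates = frequent_pairs(counts)
```

With these three chains, the pair 天天 reaches the threshold only because 今天 and 明天 are each followed by 天气. Raw adjacency counts alone do not separate such pairs from genuine words, which is presumably where the part-of-speech labeling and intervention calibration of the later steps come in.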

Further, the specific process of step 2) is as follows:

21) Generating a conversation code, splitting the conversation into sentences, and attaching punctuation marks to the preceding sentence;

22) Splitting the daily dialogue into Chinese characters, recording them in the graph database in character order, and marking the sequence relations sentence by sentence;

23) Counting adjacent Chinese characters, extracting words, labeling parts of speech, and counting the three-level word frequency of adjacent Chinese characters;

24) Counting the vocabulary sequence relation;

25) Extracting noun keywords for the whole data;

26) Continuously modifying the sentence pattern labeling method and the data through supervised training, finally forming a lexical analyzer, and performing subject extraction on new data through the lexical analyzer.
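Step 21) can be sketched as follows, assuming that sentence boundaries are marked by a small fixed set of punctuation marks and that the conversation code is simply a random identifier. Both choices, and the names `split_sentences` and `encode_dialogue`, are assumptions made for illustration; the text does not specify them.

```python
import re
import uuid

# Sentence-ending punctuation assumed for this sketch (Chinese and ASCII).
_SENTENCE_END = "。！？!?；;"


def split_sentences(dialogue):
    """Split text after each run of end punctuation.

    The punctuation stays attached to the sentence that precedes it.
    """
    pattern = "[^" + _SENTENCE_END + "]+[" + _SENTENCE_END + "]*"
    return [s.strip() for s in re.findall(pattern, dialogue) if s.strip()]


def encode_dialogue(dialogue):
    """Return (conversation_code, sentences); the code is a random hex string."""
    return uuid.uuid4().hex, split_sentences(dialogue)


sentences = split_sentences("你好！今天天气怎么样？我想出去走走")
```

Here `sentences` holds three items, and each punctuation mark travels with the sentence before it rather than starting the next one.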

Further, step 24) specifically comprises counting the vocabulary sequence relations, abstracting them into sentence pattern labels, and labeling the association relations between contexts within the session.
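One possible reading of step 24) is sketched below: each sentence is abstracted to its sequence of part-of-speech tags, which serves as the sentence pattern label, and consecutive sentences in a session are paired as context. The input is assumed to be already segmented and tagged, the single-letter tags are placeholders chosen for this sketch, and the function names are hypothetical.

```python
from collections import Counter


def sentence_pattern(tagged_sentence):
    """Abstract a tagged sentence into its sequence of part-of-speech tags."""
    return tuple(pos for _, pos in tagged_sentence)


def pattern_statistics(session):
    """Count sentence patterns and pattern-to-pattern transitions in one session."""
    patterns = [sentence_pattern(s) for s in session]
    pattern_counts = Counter(patterns)
    # Consecutive sentences in the same session are treated as context pairs.
    context_counts = Counter(zip(patterns, patterns[1:]))
    return pattern_counts, context_counts


session = [
    [("你", "r"), ("吃", "v"), ("饭", "n"), ("吗", "y")],
    [("我", "r"), ("吃", "v"), ("面", "n")],
    [("他", "r"), ("喝", "v"), ("茶", "n")],
]
pattern_counts, context_counts = pattern_statistics(session)
```

The second and third sentences share the pattern `("r", "v", "n")`, so that label is counted twice even though the sentences have only one word in common.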

Further, step 25) specifically comprises extracting noun keywords for the whole data, classifying and labeling sentences that have nouns as subjects, and analyzing the vocabulary coding similarity among the sentences.
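Step 25) might look like the following sketch. It makes two loud assumptions: the first noun of a sentence is used as a stand-in for its subject, and "vocabulary coding similarity" is approximated by Jaccard similarity over word sets. Neither choice is stated in the text, and the tags and function names are hypothetical.

```python
from collections import defaultdict


def noun_keywords(tagged_sentence):
    """Return the words tagged as nouns ("n" is a placeholder tag)."""
    return [w for w, pos in tagged_sentence if pos == "n"]


def group_by_subject(tagged_sentences):
    """Group sentence indices under the first noun of each sentence."""
    groups = defaultdict(list)
    for idx, sent in enumerate(tagged_sentences):
        nouns = noun_keywords(sent)
        if nouns:
            groups[nouns[0]].append(idx)
    return dict(groups)


def vocabulary_similarity(a, b):
    """Jaccard similarity between the word sets of two tagged sentences."""
    wa, wb = {w for w, _ in a}, {w for w, _ in b}
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


s0 = [("天气", "n"), ("很", "d"), ("好", "a")]
s1 = [("天气", "n"), ("不", "d"), ("好", "a")]
s2 = [("电影", "n"), ("很", "d"), ("好看", "a")]
groups = group_by_subject([s0, s1, s2])
```

Sentences `s0` and `s1` land in the same group under 天气 and share two of four distinct words, while `s2` forms its own group under 电影.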

Further, the concrete flow of the sentence-breaking rule in step 2) is as follows:

(1) defining a natural dialogue as a scene to encode;

(2) judging whether a next sentence exists, and splitting the sentence into Chinese characters;

(3) judging whether the Chinese character already exists; when it does not exist, establishing a Chinese character node and setting its frequency to 1; when it exists, increasing its frequency by 1;

(4) judging whether the character is the first Chinese character of the sentence; when it is not the first Chinese character, establishing a sequence relation with the previous Chinese character; when it is the first Chinese character, marking it as the initial Chinese character of the sentence;

(5) judging whether the character is the last Chinese character of the sentence; when it is not, processing the next Chinese character; when it is the last Chinese character, marking it as the ending Chinese character of the sentence.
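The flow (1) to (5) can be traced in a short sketch. The class below is an in-memory stand-in for the graph database, which the text does not name; `CharGraph`, its attributes, and the scene identifier are all hypothetical. Punctuation is not filtered here, so input is assumed to contain Chinese characters only.

```python
from collections import defaultdict


class CharGraph:
    """In-memory stand-in for the graph database (an assumption of this sketch)."""

    def __init__(self):
        self.freq = {}                 # character node -> frequency
        self.edges = defaultdict(int)  # (previous, current) -> count
        self.starts = []               # (scene, sentence index, character)
        self.ends = []                 # (scene, sentence index, character)

    def add_dialogue(self, scene_id, sentences):
        # (1) The dialogue is encoded as a scene identified by scene_id.
        # (2) Loop while a next sentence exists; split it into characters.
        for s_idx, sentence in enumerate(sentences):
            chars = list(sentence)
            prev = None
            for c_idx, ch in enumerate(chars):
                # (3) Create the node with frequency 1, or increment it.
                self.freq[ch] = self.freq.get(ch, 0) + 1
                if c_idx == 0:
                    # (4) First character: mark it as the sentence start.
                    self.starts.append((scene_id, s_idx, ch))
                else:
                    # (4) Otherwise link it to the previous character.
                    self.edges[(prev, ch)] += 1
                if c_idx == len(chars) - 1:
                    # (5) Last character: mark it as the sentence end.
                    self.ends.append((scene_id, s_idx, ch))
                prev = ch


graph = CharGraph()
graph.add_dialogue("scene-1", ["你好吗", "我很好"])
```

After this call, 好 has frequency 2 because it occurs in both sentences, and no edge links 吗 to 我, since sequence relations are only established within a sentence.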

The beneficial effects of the invention are as follows: the method for researching and developing natural language processing technology based on graph theory can extract the semantic intention in natural dialogue and makes natural language processing tasks such as word segmentation and syntactic analysis concrete and visible, which improves the capability to research and develop natural language processing. For the same amount of corpus data, it can raise accuracy beyond the threshold and improve the accuracy of intention analysis.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.

FIG. 1 is a flow chart of a method of developing a natural language processing technique based on graph theory of the present invention;

FIG. 2 is an exemplary diagram of a method of developing a natural language processing technique based on graph theory in accordance with the present invention;

FIG. 3 is a schematic diagram of an implementation of a method for developing a natural language processing technique based on graph theory according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following detailed description of the embodiments of the present invention will be given with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are merely exemplary and the invention is not limited to these embodiments.

It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.

And, in the description of the present invention, the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., are directions or positional relationships based on the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Referring to FIGS. 1 to 3, an embodiment of the present invention includes a research and development method for natural language processing technology based on graph theory, comprising the following steps:

As shown in FIG. 2, the specific process of step 2) of the present invention is as follows:

24) Counting the vocabulary sequence relation;

25) Extracting noun keywords for the whole data;

As shown in FIG. 2, the concrete flow of the sentence-breaking rule in step 2) of the present invention is as follows:

(1) defining a natural dialogue as a scene to encode; if there is a next sentence, proceeding directly to the next operation, and if there is no next sentence, ending the process directly;

(2) splitting Chinese characters according to sentences;

(6) repeating the process of splitting Chinese characters according to sentences.

As shown in FIG. 3, based on the above rules, the present invention forms a pyramid structure from Chinese characters to words to phrases to sentences, binds the rules tightly to the Chinese character data, and uses graph-theoretic methods to form a processing method in which the data itself is the processing unit.
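The containment hierarchy of the pyramid can be pictured with a small tree. This sketch shows only the four levels and how each unit is composed of units from the level below; it does not model how rules are bound to the data. The segmentation into words and phrases is supplied by hand, and `Node`, `build_pyramid`, and `leaves` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: str  # "character", "word", "phrase", or "sentence"
    text: str
    children: list = field(default_factory=list)


def build_pyramid(phrases):
    """Build a sentence node from phrases, each given as a list of words."""
    phrase_nodes = []
    for words in phrases:
        word_nodes = [
            Node("word", w, [Node("character", ch) for ch in w]) for w in words
        ]
        phrase_nodes.append(Node("phrase", "".join(words), word_nodes))
    text = "".join(p.text for p in phrase_nodes)
    return Node("sentence", text, phrase_nodes)


def leaves(node):
    """Return the character-level texts at the base of the pyramid, in order."""
    if not node.children:
        return [node.text]
    return [ch for child in node.children for ch in leaves(child)]


root = build_pyramid([["今天"], ["天气", "很", "好"]])
```

Walking from `root` down to its leaves recovers the original character sequence, so every higher-level unit remains traceable to the character data it was built from.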

The invention discloses a method for extracting and maintaining grammar rules during natural language processing: it labels words according to the structural characteristics of words in natural language and continuously raises the dimensionality of the data over natural dialogue, thereby extracting the grammar rules. The method can be used to extract the semantic intention in natural dialogue and makes natural language processing tasks such as word segmentation and syntactic analysis visual, which improves the capability to research and develop natural language processing. In the prior art, the syntactic analysis of natural language processing classifies semantic intention through word-combination statistics; it reflects the statistics of a local sample, is a probability model based on word hits, and the curve of accuracy against the amount of corpus data has a threshold. The invention is an algorithm model based on the relationship data among words; for the same amount of corpus data, it can raise accuracy beyond that threshold and improve the accuracy of intention analysis.

Other characteristic data, such as dialogue characteristic data of different characters, and the like, can be derived on the basis of the model.

Furthermore, it should be noted that, in this specification, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; the invention is not limited to specific embodiments, and the embodiments may be combined appropriately to form other embodiments that can be understood by those skilled in the art.

Claims

1. A method for researching and developing natural language processing technology based on graph theory, characterized in that the method comprises the following steps:

1) The dialogs are stored according to the Chinese character sequence through a graph database, and word association statistics of an N-Gram model are formed in real time;

2) Processing sentences of the dialogue according to sentence breaking rules to form a word connection chain, and counting the occurrence frequency of adjacent Chinese characters on the same chain when the data reach a certain level to form N-Gram data; extracting association frequency data among words based on the data of the N-Gram, and marking the parts of speech of the Chinese vocabulary so as to form a dialogue rule;

3) Forming a pyramid structure from Chinese characters to words to phrases to sentences based on the rules, binding the rules tightly to the Chinese character data, using graph-theoretic methods to form a processing method in which the data itself is the processing unit, and applying intervention calibration to the resulting rules through semi-supervised learning;

the specific process of step 2) is as follows:

21) Generating a dialogue code, splitting the dialogue into sentences, and attaching punctuation marks to the preceding sentence;

22) Splitting the dialogue sentences into Chinese characters, recording the Chinese characters in the graph database in character order, and marking the sequence relations sentence by sentence;

23) Counting adjacent Chinese characters, extracting words, labeling parts of speech, and counting the three-level word frequency of adjacent Chinese characters;

24) Counting the vocabulary sequence relation;

25) Extracting noun keywords for the whole data;

26) Continuously modifying the sentence pattern labeling method and the data through supervised training to finally form a grammar device, and performing extraction on new data through the grammar device;

the concrete flow of the sentence-breaking rule in step 2) is as follows:

(1) defining a natural dialogue as a scene to encode;

(4) judging whether the character is the first Chinese character of the sentence; when it is not the first Chinese character, establishing a sequence relation with the previous Chinese character; when it is the first Chinese character, marking it as the initial Chinese character of the sentence;

(5) judging whether the character is the last Chinese character of the sentence; when it is not, processing the next Chinese character; when it is the last Chinese character, marking it as the ending Chinese character of the sentence.

2. The method for researching and developing natural language processing technology based on graph theory according to claim 1, characterized in that step 24) specifically comprises counting the vocabulary sequence relations, abstracting them into sentence pattern labels, and labeling the association relations between contexts within the conversation.

3. The method for researching and developing natural language processing technology based on graph theory according to claim 1, characterized in that step 25) specifically comprises extracting noun keywords for the whole data, classifying and labeling sentences that have nouns as subjects, and analyzing the vocabulary coding similarity among the sentences.