CN116561251A - Natural language processing method - Google Patents

Natural language processing method

Info

Publication number
CN116561251A
CN116561251A (application CN202310449583.5A)
Authority
CN
China
Prior art keywords
semantic
natural
model
language
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310449583.5A
Other languages
Chinese (zh)
Inventor
裴正奇
王树徽
张安然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinshui Technology Co ltd
Original Assignee
Beijing Xinshui Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xinshui Technology Co ltd
Priority to CN202310449583.5A
Publication of CN116561251A
Legal status: Pending

Classifications

    • G06F16/3344 Information retrieval of unstructured textual data; query execution using natural language analysis
    • G06F16/316 Information retrieval of unstructured textual data; indexing structures
    • G06F40/126 Handling natural language data; character encoding
    • G06F40/194 Handling natural language data; calculation of difference between files
    • G06F40/211 Handling natural language data; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30 Handling natural language data; semantic analysis
    • G06N3/0442 Neural networks; recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/084 Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06N3/092 Neural networks; learning methods; reinforcement learning
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a natural language processing method, which comprises the following steps: obtaining keywords to be processed, permuting and combining them to obtain prompt outlines; generating natural sentences from the prompt outlines with a generative language model, filtering them by perplexity, and constructing a fact library; obtaining the semantic paths corresponding to each natural sentence with pre-trained language parsing models; based on the resulting semantic structures, generating the set of semantic paths between any two tokens in a natural sentence to construct a semantic path library, and storing it in a semantic field database; calculating the similarity of two semantic fields; evaluating the degree of fact deviation caused by replacing a token in a natural sentence; and training an initialized semantic coding model based on the fact deviation degree, iteratively optimizing it to obtain a semantic parsing model. The method offers markedly better accuracy, stronger model interpretability and lower computational complexity, and can reduce the computational complexity and cost of deep learning language models.

Description

Natural language processing method
Technical Field
The invention relates to the technical field of computers, in particular to a natural language processing method.
Background
Knowledge-intensive reasoning uses factual statements, i.e. natural sentences describing facts, to retrieve from a knowledge base in order to reason and make decisions. The most basic form of knowledge-intensive reasoning is common sense reasoning, which involves building basic assumptions about everyday situations. Common sense reasoning capability is critical to how humans think and interact with the world. Giving machines common sense reasoning capability in practical forms (e.g., question answering, reading comprehension) is therefore a foundation of general artificial intelligence.
Language generation models are a class of models that generate text from input data. These models are typically based on neural networks such as LSTMs or Transformers. They can be used for tasks such as text summarization, dialogue agents and translation. A generative model requires a large amount of training data, after which it can produce a corresponding output (e.g., a complete article or dialogue) from a given input (e.g., an input sentence or abstract). Typically, these models combine an encoder-decoder architecture with an attention mechanism to generate text.
Large-scale pre-trained language models (LLMs) have very strong natural language understanding capabilities and are therefore used as the basis for common sense reasoning. LLMs, however, require an explicit mechanism to handle knowledge-intensive information. As a viable solution for interpreting knowledge-centric data, knowledge graphs (KGs) have been successful at topologically encoding the relations between entities. KGs are indispensable for supplying LLMs with the entities and relations relevant to deriving an answer. The mainstream common sense reasoning methods couple an LLM with a KG, and include KG-BERT, KagNet, QA-GNN and GreaseLM. They improve accuracy by combining the advantages of natural language understanding and structural knowledge guidance. However, there is still room for improvement in the performance, interpretability and sustainability of common sense reasoning.
Reinforcement learning from human feedback (RLHF) is used to optimize a language model directly with human feedback. RLHF makes it possible to align language models trained on generic text corpora with complex human values, guiding the training of intelligent agents through human preferences. Specifically, humans are asked to rate the merits of a range of different strategies, and these ratings are then used as training data to train the agent's deep neural network. In this way, the agent can learn more desirable strategies under the guidance of human preferences. Besides reducing training time and improving agent performance, RLHF is also useful in many real-world scenarios, such as game design and autonomous driving. By using human preferences to guide training, agents can better meet user needs and enable more intelligent and human-centered applications.
Syntactic analysis is an important tool for analyzing the semantic components of sentences, and it provides supporting auxiliary features for natural language processing tasks. Syntactic analysis is largely divided into two categories: constituency parsing and dependency parsing.
Constituency parsing identifies the phrase structure of a sentence and the hierarchical syntactic relationships between phrases. It mainly proceeds as follows: the words in the sentence are first tagged with their parts of speech, adjacent words are then merged into longer phrases, and this recursion continues until the complete sentence is recovered. The final output of constituency parsing is typically a tree structure (a constituency tree) that decomposes a piece of text into phrases, where non-leaf nodes represent phrase types and leaf nodes represent the words of the sentence.
Dependency parsing automatically analyzes the dependency structure of a text to achieve an accurate understanding of natural language. It uses the dependency relations between the words of a sentence to represent their syntactic structure (such as subject-predicate, verb-object and attribute-head relations) and uses a tree structure to represent the structure of the whole sentence (such as subject-verb-object and attribute-adverbial-complement patterns). A small illustration follows.
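As an illustration only (not part of the claimed method), the sketch below prints the dependency relations of an English sentence with spaCy; the pipeline name en_core_web_sm is an assumed, commonly available model, and the sentence is an arbitrary example:

```python
# Illustrative sketch of dependency parsing with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse into the garden.")

for token in doc:
    # token.dep_ is the dependency label; token.head is the governing word.
    print(f"{token.text:10s} --{token.dep_}--> {token.head.text}")
```

Each printed line is one dependency edge, i.e. exactly the kind of "token-relation-token" element used later when semantic paths are assembled.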
Part-of-speech tagging determines and labels the part of speech of each word in a given sentence. Common part-of-speech tagging methods include the following:
The basic idea of rule-based part-of-speech tagging is to build part-of-speech disambiguation rules from word collocations and context. Early tagging rules were typically constructed by hand.
Part-of-speech tagging methods based on statistical models mainly treat tagging as a sequence labeling problem: given a sequence, such a method predicts the part of speech of a word from the labels of all words preceding it in the sentence. The most common such models are hidden Markov models (HMMs) and conditional random fields (CRFs).
Part-of-speech tagging methods based on deep learning, such as BiLSTM+CRF. A toy tagging example is sketched below.
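As a quick illustration of an off-the-shelf statistical tagger (not the patent's own tooling), NLTK's default perceptron tagger can label a sentence; the resource names in the download calls are assumptions about a standard NLTK installation:

```python
# Illustrative sketch of part-of-speech tagging with NLTK.
import nltk

# One-time downloads of the tokenizer and the averaged-perceptron tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```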
For knowledge reasoning technology, the prior art focuses on combining knowledge with semantic features, which are mainly obtained from knowledge graphs and pre-trained neural language models. However, improvements in performance, interpretability and sustainability remain desirable. For example, although large-scale pre-trained language models have very strong natural language processing capabilities, they still require an explicit mechanism to process knowledge-intensive information. Existing methods adopt a "knowledge as embedding" strategy that aggregates the pre-trained language model's output and knowledge triples into a fixed-dimensional representation, which is imperfect and prone to losing information in common sense reasoning tasks.
Disclosure of Invention
The invention aims to solve these problems and accordingly provides a natural language processing method.
To achieve this aim, the technical scheme of the invention is a natural language processing method comprising the following steps:
obtaining keywords to be processed, permuting and combining the keywords to obtain prompt outlines, generating natural sentences from the prompt outlines with a generative language model filtered by perplexity, and constructing a fact library;
obtaining the semantic paths corresponding to the natural sentence with pre-trained language parsing models;
based on the semantic structure, generating the set of semantic paths between any two tokens in the natural sentence to construct a semantic path library, and storing the semantic path library in a semantic field database;
calculating the similarity of two semantic fields, wherein a semantic field is a set of semantic paths;
evaluating the degree of fact deviation caused by replacing a token in the natural sentence;
and training an initialized semantic coding model based on the fact deviation degree, and iteratively optimizing it to obtain a semantic parsing model.
Further, in the above natural language processing method, the obtaining keywords to be processed, permuting and combining them to obtain a prompt outline, generating natural sentences from the prompt outline with a generative language model filtered by perplexity, and constructing a fact library, includes:
obtaining a plurality of groups of keywords to be processed, permuting and combining them to obtain prompt outlines, and expanding each prompt outline into a complete natural sentence with a generative language model;
and calculating the perplexity of the natural sentences with a pre-trained language model, and rejecting the natural sentences whose perplexity exceeds a preset threshold.
Further, in the above natural language processing method, after rejecting the natural sentences whose perplexity exceeds a preset threshold, the method further includes:
obtaining the rejected natural sentences, and manually ranking them based on a reinforcement learning method to obtain a ranking result;
refining the generative language model according to the RLHF method based on the ranking result;
obtaining a prompt outline, and inputting it into the refined generative language model to obtain a natural sentence;
and constructing a fact library from the prompt outlines and the corresponding natural sentences, and storing the fact library.
Further, in the above natural language processing method, the obtaining the semantic paths corresponding to the natural sentence with pre-trained language parsing models includes:
obtaining pre-trained language parsing models, wherein the language parsing models are a constituency parsing model, a dependency parsing model and a part-of-speech tagging model;
inputting the natural sentence into the pre-trained language parsing models to obtain parsing results, and collating the parsing results to obtain a semantic structure.
Further, in the above natural language processing method, the generating, based on the semantic structure, the set of semantic paths between any two tokens in the natural sentence to construct a semantic path library, and storing the semantic path library in a semantic field database, includes:
letting there be a set of N morphemes w_1, w_2, ..., w_N, and denoting a natural sentence of length k composed of these morphemes in a specific order as S = (w_c1, w_c2, ..., w_ck);
reading the semantic path library and judging whether a natural sentence S^(i) in the fact library contains the token w_a;
if yes, reading the index keys associated with that sentence, which include "w_a-w_t-S^(i)" for every other token w_t in S^(i);
inputting each index key into the language parsing models to obtain its semantic path set, and storing the set in the semantic field database under the index key "w_a-w_t-S^(i)".
Further, in the above natural language processing method, the calculating the similarity of two semantic fields includes:
obtaining the semantic field corresponding to index key "w_a-w_t-S^(i)" and the semantic field corresponding to index key "w_b-w_t-S^(i)", wherein a semantic field is a set of semantic paths;
calculating the similarity of the two semantic fields with a similarity matching algorithm that evaluates the similarity of two constituent elements with a trainable metric function.
Further, in the above natural language processing method, the evaluating the degree of fact deviation caused by replacing a token in the natural sentence includes:
calculating, for every natural sentence S^(i) in the fact library, the similarity of the corresponding semantic fields, sim(F(w_a, w_t, S^(i)), F(w_b, w_t, S^(i))), and obtaining a final value with an evaluation mechanism, wherein the evaluation mechanism takes the maximum or the average.
Further, in the above natural language processing method, the training an initialized semantic coding model based on the fact deviation degree and iteratively optimizing it to obtain a semantic parsing model includes:
invoking an initialized semantic coding model, wherein the initialized semantic coding model is an LSTM- or Transformer-based semantic coding model;
inputting a natural sentence, training the initialized semantic coding model, and outputting the relations between the tokens of the natural sentence, wherein each relation is expressed as a vector of a specific dimension;
performing backpropagation training on the trained model with a deep learning framework, wherein a token in the natural sentence is replaced by another token and the resulting degree of deviation of the sentence is determined, and the larger the deviation, the higher the value of the corresponding loss function;
determining the loss value from the degree of deviation of the natural sentence, and iteratively optimizing the randomly initialized semantic coding model with a backpropagation algorithm to obtain the semantic parsing model.
The method has the beneficial effects that keywords to be processed are obtained and permuted and combined to obtain prompt outlines; natural sentences are generated from the prompt outlines with a generative language model filtered by perplexity, and a fact library is constructed; the semantic paths corresponding to the natural sentences are obtained with pre-trained language parsing models; based on the semantic structure, the set of semantic paths between any two tokens in a natural sentence is generated to construct a semantic path library, which is stored in a semantic field database; the similarity of two semantic fields, each a set of semantic paths, is calculated; the degree of fact deviation caused by replacing a token in a natural sentence is evaluated; and an initialized semantic coding model is trained based on the fact deviation degree and iteratively optimized to obtain a semantic parsing model. The method offers markedly better accuracy, stronger model interpretability and lower computational complexity; it can reduce the computational complexity and cost of large-scale deep learning language models, and it supports users in adjusting the content of the fact library to customize the model for a specific scenario.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention.
FIG. 1 is a diagram illustrating a natural language processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a first embodiment of constructing a semantic path library based on pre-trained models according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a second embodiment of constructing a semantic path library based on pre-trained models according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an embodiment of a dynamic reasoning mechanism based on pre-trained models according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As used herein and as understood by those skilled in the art, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The present invention will now be described in detail with reference to the accompanying drawings. As shown in FIG. 1, a natural language processing method comprises the following steps:
Step 101, obtaining keywords to be processed, permuting and combining them to obtain prompt outlines, generating natural sentences from the prompt outlines with a generative language model filtered by perplexity, and constructing a fact library.
In this embodiment, a plurality of groups of keywords to be processed are obtained, the keywords are permuted and combined to obtain prompt outlines, and each prompt outline is expanded into a complete natural sentence with a generative language model; the perplexity of the natural sentences is then calculated with a pre-trained language model, and natural sentences whose perplexity exceeds a preset threshold are rejected.
In this embodiment, the rejected natural sentences are obtained and manually ranked based on a reinforcement learning method to obtain a ranking result; the generative language model is refined according to the RLHF method based on the ranking result; a prompt outline is obtained and input into the refined generative language model to obtain a natural sentence; and a fact library is constructed from the prompt outlines and the corresponding natural sentences and stored.
In this embodiment, the method can generate natural sentences that conform to common sense from the given keywords.
Step one: a plurality of keywords are permuted and combined to generate a prompt outline, and the prompt outline is expanded into a complete natural sentence by means of a generative language model (e.g. the GENIUS model).
Step two: the perplexity of each natural sentence generated in the previous step is calculated with a pre-trained language model (e.g. a BERT model), and natural sentences whose perplexity exceeds a specific threshold (e.g. 80) are rejected.
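A minimal sketch of this filter is given below, assuming the Hugging Face transformers library. It scores sentences with GPT-2's causal-LM perplexity as a stand-in (for the BERT model named in step two, a masked-LM pseudo-perplexity would play the same role), and the threshold of 80 is the example value from step two:

```python
# Hedged sketch: perplexity-based filtering of generated sentences.
# Assumes: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean token NLL.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

candidates = ["A dog is barking in the yard.", "Yard barking a in dog the is."]
kept = [s for s in candidates if perplexity(s) <= 80.0]  # threshold from step two
print(kept)
```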
Step three: if the natural sentences generated by the various permutations and combinations of a certain group of keywords generally have high perplexity, reinforcement learning from human feedback (RLHF) is introduced: the high-perplexity sentences are first manually ranked by common sense plausibility, the ranking result is then used as the reward signal, and the generative language model is refined with the RLHF method so that it generates natural sentences of higher common sense plausibility.
Step four: the prompt outline is used as the index key, and a natural sentence of higher common sense plausibility generated from that prompt outline is stored, as the content (value) corresponding to the index, in the fact library for subsequent training and retrieval. A storage sketch follows.
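Concretely, the fact library of step four can be held as a plain key-value mapping. The sketch below is an illustrative assumption about the storage layout; generate and perplexity are hypothetical hooks onto the generative and scoring models above:

```python
# Hedged sketch: building the fact library as a key-value store.
from itertools import permutations

def build_fact_library(keyword_groups, generate, perplexity, threshold=80.0):
    """generate(outline) -> sentence and perplexity(sentence) -> float are
    hypothetical hooks onto the generative and scoring models."""
    fact_library = {}
    for keywords in keyword_groups:
        for combo in permutations(keywords):
            outline = " ".join(combo)         # prompt outline = index key
            sentence = generate(outline)      # expand the outline into a sentence
            if perplexity(sentence) <= threshold:
                fact_library[outline] = sentence  # value = plausible sentence
    return fact_library
```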
Step 102, obtaining the semantic paths corresponding to a natural sentence with pre-trained language parsing models.
In this embodiment, pre-trained language parsing models are obtained, namely a constituency parsing model, a dependency parsing model and a part-of-speech tagging model; the natural sentence is input into the pre-trained language parsing models to obtain parsing results, and the parsing results are collated to obtain a semantic structure.
In this embodiment, the function in brief: the natural sentence is converted into a multi-level semantic structure graph.
Step one: pre-trained language parsing models, including but not limited to a constituency parsing model (Constituency Parsing), a dependency parsing model (Dependency Parsing) and a part-of-speech tagging model (Part-of-Speech Tagging), are used to perform semantic analysis on the current natural sentence.
Step two: the parsing results are collated. For parsing models that produce triples (e.g. the constituency parsing model and the dependency parsing model), the model yields the relation between every two different tokens, from which basic parsing elements of the form "token-relation-token" are constructed (e.g. Structure1 and Structure2 in FIG. 2).
Step three: if it cannot be guaranteed that paths exist between all tokens (e.g. a part-of-speech tagging model can only link each token to one relation), a ROOT node can be constructed and all relations linked to it (e.g. Structure3 in FIG. 2).
Step four: from the semantic structure obtained in the previous step, the set of semantic paths between any two tokens in the current natural sentence can be obtained; for example, three different semantic paths exist between the token "who" and another token in FIG. 3, corresponding respectively to the three different language parsing models (the constituency parsing model, the dependency parsing model and the part-of-speech tagging model).
Step five: "token1-token2-sentenceID" is used as the index key, the language parsing model ID is used as the sub-index key, and the semantic paths between token1 and token2 are stored as a list under the index and sub-index, thereby constructing the semantic path library.
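A minimal sketch of that two-level index, assuming a nested-dictionary layout; the token names, parser IDs and path values shown are placeholders, not the patent's actual structures:

```python
# Hedged sketch: semantic path library keyed by "token1-token2-sentenceID"
# with the language parsing model ID as the sub-index key.
from collections import defaultdict

# index key -> sub-index key (parser ID) -> list of semantic paths
path_library: dict[str, dict[str, list[list[str]]]] = defaultdict(dict)

def store_paths(token1, token2, sentence_id, parser_id, paths):
    key = f"{token1}-{token2}-{sentence_id}"
    path_library[key][parser_id] = paths  # paths stored in list form

# Placeholder entries for illustration only.
store_paths("who", "dog", "S1", "dependency",
            [["who", "nsubj", "chased", "obj", "dog"]])
store_paths("who", "dog", "S1", "constituency",
            [["who", "NP", "S", "VP", "dog"]])
print(path_library["who-dog-S1"])
```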
Step 103, based on the semantic structure, generating the set of semantic paths between any two tokens in the natural sentence to construct a semantic path library, and storing the semantic path library in a semantic field database.
In this embodiment, let there be a set of N morphemes w_1, w_2, ..., w_N, and denote a natural sentence of length k composed of these morphemes in a specific order as S = (w_c1, w_c2, ..., w_ck). The semantic path library is read, and it is judged whether a natural sentence S^(i) in the fact library contains the token w_a; if yes, the index keys associated with that sentence are read, which include "w_a-w_t-S^(i)" for every other token w_t in S^(i); each index key is input into the language parsing models to obtain its semantic path set, which is stored in the semantic field database under the index key "w_a-w_t-S^(i)".
In this embodiment, the function in brief: judging whether a token w_a in a natural sentence S^(j) and another token w_b, which may not occur in S^(j), are equivalent at the knowledge level with respect to a particular fact library.
Step one: the knowledge base of the scenario is first constructed, i.e. a specific fact library (denoted K) containing M natural sentences is generated from the given keywords (corresponding to the common sense that is to be taken as true propositions) according to the fact library construction above.
Step two: a semantic path library is constructed based on the pre-trained models: the natural sentences in the fact library generated in step one, together with S^(j), are converted into semantic fields that form the semantic path library.
Step three: all index keys in the semantic path library whose first token is w_a, i.e. keys of the shape "w_a-token2-sentenceID", are read out together with their stored contents. Specifically, if a natural sentence S^(i) in the fact library contains w_a, the index keys associated with that sentence that need to be read include "w_a-w_t-S^(i)" for every other token w_t in S^(i).
Step four: for convenience of explanation, a semantic field database is constructed: for each index key "w_a-w_t-S^(i)" of step three, the set of semantic paths obtained from the different language parsing models (i.e. the semantic field, denoted F(w_a, w_t, S^(i))) is stored in the semantic field database under that index key.
Step 104, calculating the similarity of two semantic fields, wherein a semantic field is a set of semantic paths.
In this embodiment, the semantic field corresponding to index key "w_a-w_t-S^(i)" and the semantic field corresponding to index key "w_b-w_t-S^(i)" are obtained, wherein a semantic field is a set of semantic paths;
the similarity of the two semantic fields is calculated with a similarity matching algorithm that evaluates the similarity of two constituent elements with a trainable metric function.
In this embodiment, step five: the similarity between the semantic field corresponding to index key "w_a-w_t-S^(i)" and the semantic field corresponding to index key "w_b-w_t-S^(i)", written sim(F(w_a, w_t, S^(i)), F(w_b, w_t, S^(i))), is calculated with a specific similarity matching algorithm, including but not limited to a dynamic programming algorithm. The example in FIG. 4 is modified from the LCS (Longest Common Subsequence) algorithm for computing the similarity of two sequence paths; unlike plain LCS, it also considers the similarity of the constituent elements of the paths being matched, i.e. it evaluates the similarity of two constituent elements with a trainable metric function g.
Step 105, evaluating the fact deviation degree of replacing the logograms in the natural sentences;
in this embodiment, the natural sentence sum in the fact repository is calculatedAnd obtaining a final value by utilizing an evaluation mechanism, wherein the evaluation mechanism takes the maximum value or takes the average value, and the similarity of the corresponding semantic fields is obtained.
In this embodiment, step six: utilizing the fifth step to make all sentences S in the fact base (i) And (3) withIs calculated, the obtained, < + >> The final value, which is representative of the natural sentence S, is obtained by reusing the evaluation mechanism (including but not limited to, taking the maximum value and taking the average value) (j) The word->And ++>The higher the value, the more equivalent the knowledge in the fact repository K, indicating both.
Step 106, training an initialized semantic coding model based on the fact deviation degree, and iteratively optimizing it to obtain a semantic parsing model.
In this embodiment, an initialized semantic coding model is invoked, the initialized semantic coding model being an LSTM- or Transformer-based semantic coding model; a natural sentence is input, the initialized semantic coding model is trained, and the relations between the tokens of the natural sentence are output, each relation being expressed as a vector of a specific dimension; backpropagation training is performed on the trained model with a deep learning framework: a token in the natural sentence is replaced by another token and the resulting degree of deviation of the sentence is determined; the larger the deviation, the higher the value of the corresponding loss function; the loss value is determined from the degree of deviation of the natural sentence, and the randomly initialized semantic coding model is iteratively optimized with a backpropagation algorithm to obtain the semantic parsing model.
In this embodiment, step one: here the semantic relations between the tokens of a natural sentence need not be obtained from pre-trained language parsing models. Instead, several semantic coding models (input: a natural sentence; output: the relations between the tokens of the sentence, expressed as vectors of a specific dimension), such as LSTMs or Transformers, are randomly initialized, each corresponding to one semantic structure. The subsequent operations are identical to the method above.
Step two: backpropagation training is performed on the current model with a deep learning framework (for example, PyTorch). The training mechanism replaces a token in a fact sentence with another token: the more the replaced sentence deviates from common sense, the higher the value of the corresponding loss function, and this loss value is used to iteratively optimize the randomly initialized semantic coding models with the backpropagation algorithm.
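A hedged sketch of that loop follows, reusing the SoftLCSSimilarity module sketched earlier as the trainable component and random vectors as placeholder token embeddings; the loss form (deviation as one minus a soft similarity between the original and the token-replaced sentence) is one illustrative reading of the mechanism, not the patent's exact loss:

```python
# Hedged sketch: backpropagation training driven by a deviation-based loss.
import torch

model = SoftLCSSimilarity(dim=16)   # trainable component from the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def deviation(original_emb, replaced_emb):
    # Hypothetical: deviation = 1 - soft path similarity between the original
    # sentence and the sentence with one token replaced.
    return 1.0 - model(original_emb, replaced_emb)

for step in range(100):                      # illustrative iteration count
    original_emb = torch.randn(6, 16)        # placeholder token embeddings
    replaced_emb = original_emb.clone()
    replaced_emb[2] = torch.randn(16)        # replace one token's embedding
    loss = deviation(original_emb, replaced_emb)  # larger deviation -> larger loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```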
According to the embodiment of the invention, keywords to be processed are obtained and permuted and combined to obtain prompt outlines; natural sentences are generated from the prompt outlines with a generative language model filtered by perplexity, and a fact library is constructed; the semantic paths corresponding to the natural sentences are obtained with pre-trained language parsing models; based on the semantic structure, the set of semantic paths between any two tokens in a natural sentence is generated to construct a semantic path library, which is stored in a semantic field database; the similarity of two semantic fields, each a set of semantic paths, is calculated; the degree of fact deviation caused by replacing a token in a natural sentence is evaluated; and an initialized semantic coding model is trained based on the fact deviation degree and iteratively optimized to obtain a semantic parsing model. The method offers markedly better accuracy, stronger model interpretability and lower computational complexity; it can reduce the computational complexity and cost of large-scale deep learning language models, and it supports users in adjusting the content of the fact library to customize the model for a specific scenario.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the above-described embodiments, and that the above-described embodiments and descriptions are only preferred embodiments of the present invention, and are not intended to limit the invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (8)

1. A natural language processing method, characterized in that the natural language processing method comprises the following steps:
obtaining keywords to be processed, permuting and combining the keywords to obtain prompt outlines, generating natural sentences from the prompt outlines with a generative language model filtered by perplexity, and constructing a fact library;
obtaining the semantic paths corresponding to the natural sentence with pre-trained language parsing models;
based on the semantic structure, generating the set of semantic paths between any two tokens in the natural sentence to construct a semantic path library, and storing the semantic path library in a semantic field database;
calculating the similarity of two semantic fields, wherein a semantic field is a set of semantic paths;
evaluating the degree of fact deviation caused by replacing a token in the natural sentence;
and training an initialized semantic coding model based on the fact deviation degree, and iteratively optimizing it to obtain a semantic parsing model.
2. The natural language processing method according to claim 1, wherein the obtaining keywords to be processed, permuting and combining them to obtain prompt outlines, generating natural sentences from the prompt outlines with a generative language model filtered by perplexity, and constructing a fact library, comprises:
obtaining a plurality of groups of keywords to be processed, permuting and combining them to obtain prompt outlines, and expanding each prompt outline into a complete natural sentence with a generative language model;
and calculating the perplexity of the natural sentences with a pre-trained language model, and rejecting the natural sentences whose perplexity exceeds a preset threshold.
3. The natural language processing method according to claim 2, wherein after rejecting the natural sentences whose perplexity exceeds a preset threshold, the method further comprises:
obtaining the rejected natural sentences, and manually ranking them based on a reinforcement learning method to obtain a ranking result;
refining the generative language model according to the RLHF method based on the ranking result;
obtaining a prompt outline, and inputting it into the refined generative language model to obtain a natural sentence;
and constructing a fact library from the prompt outlines and the corresponding natural sentences, and storing the fact library.
4. The natural language processing method according to claim 1, wherein the obtaining the semantic paths corresponding to the natural sentence with pre-trained language parsing models comprises:
obtaining pre-trained language parsing models, wherein the language parsing models are a constituency parsing model, a dependency parsing model and a part-of-speech tagging model;
inputting the natural sentence into the pre-trained language parsing models to obtain parsing results, and collating the parsing results to obtain a semantic structure.
5. The natural language processing method according to claim 1, wherein the generating, based on the semantic structure, the set of semantic paths between any two tokens in the natural sentence to construct a semantic path library, and storing the semantic path library in a semantic field database, comprises:
letting there be a set of N morphemes w_1, w_2, ..., w_N, and denoting a natural sentence of length k composed of these morphemes in a specific order as S = (w_c1, w_c2, ..., w_ck);
reading the semantic path library and judging whether a natural sentence S^(i) in the fact library contains the token w_a;
if yes, reading the index keys associated with that sentence, which include "w_a-w_t-S^(i)" for every other token w_t in S^(i), inputting each index key into the language parsing models to obtain its semantic path set, and storing the set in the semantic field database under the index key "w_a-w_t-S^(i)".
6. The natural language processing method according to claim 1, wherein the calculating the similarity of two semantic fields comprises:
obtaining the semantic field corresponding to index key "w_a-w_t-S^(i)" and the semantic field corresponding to index key "w_b-w_t-S^(i)", wherein a semantic field is a set of semantic paths;
calculating the similarity of the two semantic fields with a similarity matching algorithm that evaluates the similarity of two constituent elements with a trainable metric function.
7. The natural language processing method according to claim 1, wherein the evaluating the degree of fact deviation caused by replacing a token in the natural sentence comprises:
calculating, for every natural sentence S^(i) in the fact library, the similarity of the corresponding semantic fields, sim(F(w_a, w_t, S^(i)), F(w_b, w_t, S^(i))), and obtaining a final value with an evaluation mechanism, wherein the evaluation mechanism takes the maximum or the average.
8. The natural language processing method according to claim 1, wherein the training an initialized semantic coding model based on the fact deviation degree and iteratively optimizing it to obtain a semantic parsing model comprises:
invoking an initialized semantic coding model, wherein the initialized semantic coding model is an LSTM- or Transformer-based semantic coding model;
inputting a natural sentence, training the initialized semantic coding model, and outputting the relations between the tokens of the natural sentence, wherein each relation is expressed as a vector of a specific dimension;
performing backpropagation training on the trained model with a deep learning framework, wherein a token in the natural sentence is replaced by another token and the resulting degree of deviation of the sentence is determined, and the larger the deviation, the higher the value of the corresponding loss function;
determining the loss value from the degree of deviation of the natural sentence, and iteratively optimizing the randomly initialized semantic coding model with a backpropagation algorithm to obtain the semantic parsing model.
CN202310449583.5A 2023-04-24 2023-04-24 Natural language processing method Pending CN116561251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310449583.5A CN116561251A (en) 2023-04-24 2023-04-24 Natural language processing method


Publications (1)

Publication Number Publication Date
CN116561251A true CN116561251A (en) 2023-08-08

Family

ID=87487150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310449583.5A Pending CN116561251A (en) 2023-04-24 2023-04-24 Natural language processing method

Country Status (1)

Country Link
CN (1) CN116561251A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592483A (en) * 2023-11-21 2024-02-23 合肥工业大学 Implicit emotion analysis method and device based on thinking tree
CN117592483B (en) * 2023-11-21 2024-05-28 合肥工业大学 Implicit emotion analysis method and device based on thinking tree
CN117521673A (en) * 2024-01-08 2024-02-06 安徽大学 Natural language processing system with analysis training performance
CN117521673B (en) * 2024-01-08 2024-03-22 安徽大学 Natural language processing system with analysis training performance


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination