CN116244277A

CN116244277A - NLP (natural language processing) recognition and knowledge base construction method and system

Info

Publication number: CN116244277A
Application number: CN202310205842.XA
Authority: CN
Inventors: 杨超; 高文飞; 田野; 李群; 张辉; 赵雪松; 焦键; 张�荣; 张天浩; 贾玉谦
Original assignee: Beijing Wucoded Technology Co ltd
Current assignee: Beijing Wucoded Technology Co ltd
Priority date: 2023-03-03
Filing date: 2023-03-03
Publication date: 2023-06-09

Abstract

The invention discloses a method and a system for NLP (natural language processing) recognition and knowledge base construction. Original text data is acquired, and the knowledge expressions contained in the text are recognized and extracted; the extracted knowledge expressions are matched against the knowledge expressions in a knowledge base to determine whether they already exist there, and the knowledge base is updated and the knowledge stored according to the result. The knowledge is processed and preprocessed, and its integrity and consistency are checked. Knowledge dimension reduction/conversion reprocessing is performed on the preprocessed knowledge, reducing the amount of knowledge through operations comprising text feature extraction, knowledge base mapping, semantic representation and graphic data representation. Knowledge is mined from the knowledge by a preset algorithm and expressed in a specific mode, the accuracy and reliability of the mining result are evaluated, and the mining result is output to the user in a form the user can understand. The knowledge base is thus expanded without manual searching and labeling, and the working efficiency is high.

Description

NLP (natural language processing) recognition and knowledge base construction method and system

Technical Field

The invention relates to the technical field of knowledge base construction, in particular to an NLP (natural language processing) recognition and knowledge base construction method and system.

Background

A knowledge base, also called an intelligent database or artificial intelligence database, is a structured, easy-to-operate, easy-to-use, comprehensive and organized collection of knowledge built in knowledge engineering to solve problems in one or more fields. It is a set of interrelated knowledge pieces that are stored, organized, managed and used in a computer by means of one or more knowledge representation modes. A knowledge base is generally composed of knowledge points, standard questions and the corresponding extended questions. To expand the extended questions of the standard questions in a knowledge base, the traditional approach is to search a massive corpus for the extended questions that may correspond to each standard question, confirm them, manually label the confirmed extended questions in combination with the actual scenario, and then add the labeled extended questions to the standard questions of the corresponding knowledge points. This process requires a great amount of manual searching and labeling, consumes substantial manpower and financial resources, and is inefficient.

Disclosure of Invention

Therefore, the invention provides an NLP recognition and knowledge base construction method and system, which solve the problems that expanding a traditional knowledge base requires a large amount of manual searching and labeling work, consumes substantial manpower and financial resources, and is inefficient.

In order to achieve the above object, the present invention provides the following technical solutions:

according to a first aspect of an embodiment of the present invention, a method for NLP recognition and knowledge base construction is provided, the method comprising:

acquiring original text data, identifying knowledge expressions contained in a text, extracting, matching the extracted knowledge expressions with knowledge expressions in a knowledge base to determine whether the knowledge expressions in the text exist in the knowledge base, and updating the knowledge base and storing the knowledge according to the result;

processing and preprocessing knowledge, checking the integrity and consistency of the knowledge, processing noise knowledge in the knowledge, and filling the missing knowledge by using a statistical method;

carrying out knowledge dimension reduction/conversion reprocessing on the preprocessed knowledge, and reducing the knowledge quantity through operations comprising text feature extraction, knowledge base mapping, semantic representation and graphic data representation;

and carrying out knowledge mining from the knowledge by adopting a preset algorithm, expressing according to a specific mode, evaluating the accuracy and reliability of the mining result, and outputting the mining result to a user in a mode which can be understood by the user.

Further, acquiring the original text data and identifying and extracting the knowledge expressions contained in the text specifically comprises:

collecting original documents and texts, including news stories, web page content, academic papers and blog articles;

semantic analysis is performed on the documents and text to extract valuable information;

abstracting information into knowledge including vocabulary, syntactic relation and semantic structure by utilizing a semantic analysis technology, classifying the knowledge according to different topics, checking whether the extracted knowledge accords with expectations, and ensuring the accuracy of the knowledge;

the extracted knowledge is presented using visualization tools for better understanding.

Further, the knowledge dimension reduction/conversion reprocessing is performed on the preprocessed knowledge, and specifically includes:

extracting keywords in a question, performing similarity calculation on the keywords, performing synonym expansion on the knowledge in the knowledge base, and constructing a plurality of response modes by randomly replacing synonyms while the response topic remains unchanged.

Further, the similarity calculation is performed on the keywords, which specifically includes:

cosine similarity is adopted to measure the similarity between texts; the word frequency vectors of the two texts are computed, and the cosine of the angle between these vectors is taken as the similarity between the texts;

for keyword processing, TF-IDF is adopted to identify important keywords, feature vectors with discrimination are extracted, and the feature vectors are converted into standard word frequency vectors.

Further, knowledge mining is performed from knowledge by adopting a preset algorithm, and the method specifically comprises the following steps:

data preprocessing: extracting useful information from the raw data and formatting it into a format usable by a machine learning algorithm;

model construction: selecting an appropriate algorithm for mining valuable knowledge from the formatted data; the algorithm comprises decision tree, cluster analysis, neural network and K nearest neighbor algorithm;

evaluation of results: evaluating the effect of the model through the confusion matrix, and checking the accuracy of the model;

model optimization: by improving the algorithm and the model parameters, the accuracy and the efficiency of the model are improved.

adopting a deep learning algorithm to carry out knowledge mining, and carrying out adaptability improvement on the deep learning algorithm in a mode of changing a network structure, adding new neurons, adjusting parameters and adjusting learning rate;

by changing the network structure, the number of layers of the neural network is changed or the number of neurons of each layer is changed;

adding new neurons increases the complexity of the network to capture more features;

adjusting parameters refers to changing the weights of the neural network so that the network can better fit the data;

adjusting the learning rate refers to changing the step size of the updating weights of each iteration of the network to change the convergence speed of the network.

Further, the method for evaluating the accuracy and the reliability of the mining result specifically comprises the following steps:

firstly, defining evaluation indexes including accuracy, recall rate and F value so as to evaluate the result;

then a certain amount of useful data is collected, which will be used as training and testing data sets;

training and testing the data using corresponding NLP techniques to obtain a set of reliable model parameters;

and finally, evaluating the model by using the defined evaluation index to judge the accuracy and reliability of the model.

According to a second aspect of an embodiment of the present invention, there is provided an NLP recognition and knowledge base construction system, the system comprising:

the knowledge selection module is used for acquiring original text data, identifying knowledge expressions contained in the text and extracting, matching the extracted knowledge expressions with knowledge expressions in a knowledge base to determine whether the knowledge expressions in the text exist in the knowledge base or not, and updating the knowledge base and storing the knowledge according to the result;

the knowledge cleaning and preprocessing module is used for processing and preprocessing knowledge, checking the integrity and consistency of the knowledge, processing noise knowledge in the knowledge, and filling the missing knowledge by using a statistical method;

the knowledge dimension reduction/conversion module is used for carrying out knowledge dimension reduction/conversion reprocessing on the preprocessed knowledge, and reducing the knowledge quantity through operations comprising text feature extraction, knowledge base mapping, semantic representation and graphic data representation;

the knowledge mining module is used for mining knowledge from the knowledge by adopting a preset algorithm and expressing the knowledge in a specific mode;

the knowledge evaluation module is used for evaluating the accuracy and reliability of the mining result;

and the knowledge output module is used for outputting the knowledge to the user in a mode which can be understood by the user.

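Further, the knowledge dimension reduction/conversion module is specifically configured to:

extract keywords in a question, perform similarity calculation on the keywords, perform synonym expansion on the knowledge in the knowledge base, and construct a plurality of response modes by randomly replacing synonyms while the response topic remains unchanged.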

According to a third aspect of an embodiment of the present invention, there is provided an electronic device including:

one or more processors;

a memory;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of the above.

According to a fourth aspect of embodiments of the present invention, a computer storage medium is presented, the computer storage medium having one or more program instructions embodied therein for performing the method of any of the above by an NLP recognition and knowledge base construction system.

The invention has the following advantages:

according to the NLP recognition and knowledge base construction method and system, original text data are obtained, knowledge expressions contained in a text are recognized and extracted, the extracted knowledge expressions are matched by using knowledge expressions in the knowledge base, so that whether the knowledge expressions in the text exist in the knowledge base or not is determined, and the knowledge base is updated and stored according to results; processing and preprocessing knowledge, checking the integrity and consistency of the knowledge, processing noise knowledge in the knowledge, and filling the missing knowledge by using a statistical method; carrying out knowledge dimension reduction/conversion reprocessing on the preprocessed knowledge, and reducing the knowledge quantity through operations comprising text feature extraction, knowledge base mapping, semantic representation and graphic data representation; and carrying out knowledge mining from the knowledge by adopting a preset algorithm, expressing according to a specific mode, evaluating the accuracy and reliability of the mining result, and outputting the mining result to a user in a mode which can be understood by the user. The knowledge base is expanded without manual searching and labeling, and the working efficiency is high.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.

Fig. 1 is a flowchart of a method for NLP recognition and knowledge base construction provided in embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of text data obtained by global retrieval in the NLP recognition and knowledge base construction method provided in embodiment 1 of the present invention;

Fig. 3 is a global search result in the NLP recognition and knowledge base construction method provided in embodiment 1 of the present invention;

Fig. 4 is a diagram of the knowledge acquisition architecture in the NLP recognition and knowledge base construction method provided in embodiment 1 of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following detailed description. The embodiments described are some, but not all, of the embodiments of the invention. All other embodiments that can be obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to be within the scope of the invention.

Example 1

As shown in fig. 1, this embodiment proposes a method for NLP recognition and knowledge base construction, where the method includes:

S100, acquiring original text data, identifying and extracting the knowledge expressions contained in the text, matching the extracted knowledge expressions against the knowledge expressions in a knowledge base to determine whether they already exist in the knowledge base, and updating the knowledge base and storing the knowledge according to the result.

1. Relevant documents and text are collected first; these may be news stories, web page content, academic papers, blog articles, etc. Fig. 2 is a schematic diagram of a page for global search at a policy-related content website, and Fig. 3 is a global search result.

2. Semantic analysis is performed on documents and text to extract valuable information.

3. Using semantic analysis techniques, information is abstracted into knowledge, which can be categorized according to different topics.

4. The extracted knowledge is presented with visualization tools for better understanding.

The extracted knowledge must meet the following requirements:

1. Knowledge is presented in a visual representation for better understanding and analysis.

2. Knowledge is clear and unambiguous.

3. Knowledge is updated in time to ensure its availability and usability.

The specific steps are as follows:

1. Language analysis: the provided text is analyzed to extract useful knowledge and information such as vocabulary, syntactic relationships, semantic structures, etc.

2. Knowledge extraction: based on the analyzed structure and semantics, knowledge structures such as entities, relationships, and relationship features are extracted.

3. Knowledge validation: whether the extracted knowledge meets expectations is checked to ensure the accuracy of the knowledge.

For example, if relationships between persons need to be extracted from text, we can first carry out language analysis to extract vocabulary and syntactic relations, then extract person entities and relations based on semantic analysis, and finally confirm whether the extracted knowledge is correct according to the text context and background knowledge.
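The three steps in the person-relationship example can be sketched as follows. This is only a minimal illustration, assuming a single hand-written sentence pattern in place of real language analysis; the pattern, the function names and the sample sentences are hypothetical and do not come from the patent.

```python
import re

# Hypothetical pattern: "<Person> is the <relation> of <Person>".
# A real system would rely on proper language analysis (NER, parsing) instead.
PATTERN = re.compile(
    r"(?P<head>[A-Z][a-z]+) is the (?P<relation>[a-z]+) of (?P<tail>[A-Z][a-z]+)"
)

def extract_person_relations(text):
    """Knowledge extraction: return (head, relation, tail) triples."""
    return [(m["head"], m["relation"], m["tail"]) for m in PATTERN.finditer(text)]

def validate(triples, allowed_relations):
    """Knowledge validation: keep only triples whose relation is expected."""
    return [t for t in triples if t[1] in allowed_relations]

text = (
    "Alice is the mother of Bob. "
    "Bob is the brother of Carol. "
    "Dave is the owner of Acme."
)
triples = extract_person_relations(text)
valid = validate(triples, {"mother", "father", "brother", "sister"})
print(valid)
```

The validation step here only filters by an allowed relation list; checking against text context and background knowledge, as the description suggests, would need additional resources.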

The knowledge selection step specifically includes:

1. Identifying knowledge expressions: the knowledge expressions contained in the text, such as entities and relationships, are identified;

2. Extracting knowledge: the knowledge expressions in the text are extracted for subsequent processing;

3. Matching knowledge: the extracted expressions are matched against the knowledge expressions in the knowledge base to determine whether they already exist in the knowledge base;

4. Updating knowledge: if a knowledge expression in the text does not appear in the knowledge base, it is added to the knowledge base, or the corresponding expression in the knowledge base is updated;

5. Storing knowledge: the extracted knowledge expressions are stored in the knowledge base for later use.
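The matching, updating and storing steps above can be illustrated with a small sketch. It assumes that knowledge expressions are stored as (head, relation, tail) triples in an in-memory set and that matching is an exact comparison; both are simplifying assumptions, and the function name is hypothetical.

```python
def update_knowledge_base(kb, extracted):
    """Match extracted triples against the knowledge base and add new ones.

    kb        -- set of (head, relation, tail) triples already stored
    extracted -- iterable of triples extracted from the text
    Returns the list of triples that were newly added.
    """
    added = []
    for triple in extracted:
        if triple not in kb:   # matching step: exact match only
            kb.add(triple)     # updating and storing step
            added.append(triple)
    return added

kb = {("Alice", "mother", "Bob")}
extracted = [("Alice", "mother", "Bob"), ("Bob", "brother", "Carol")]
added = update_knowledge_base(kb, extracted)
print(added)
print(len(kb))
```

A production knowledge base would normally use fuzzy or semantic matching and persistent storage rather than an exact-match set.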

Knowledge base operations may be processed using a concept knowledge base, entity knowledge base, relational knowledge base, and the like.

S200, processing and preprocessing the knowledge, checking the integrity and consistency of the knowledge, processing noise knowledge in the knowledge, and filling the missing knowledge by using a statistical method.

The knowledge cleaning and preprocessing steps comprise: data collection, natural language processing, feature extraction, data cleansing and preprocessing, and final data visualization.

Whether knowledge is missing can be determined by comparing the content in the knowledge base with real world conditions.

Noise knowledge refers to invalid or irrelevant knowledge that may lead to errors in the system.
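As a rough illustration of this step, the sketch below treats a few placeholder values as noise and fills missing fields with the most frequent observed value (the mode). The choice of the mode as the "statistical method", the noise markers and the record layout are all assumptions made for the example.

```python
from collections import Counter

def clean_records(records, required_fields, noise_values=("", "N/A")):
    """Treat noise values as missing, then fill gaps with each field's mode."""
    # Step 1: integrity check; noise values and absent fields become None.
    cleaned = [
        {f: (None if r.get(f) in noise_values else r.get(f)) for f in required_fields}
        for r in records
    ]
    # Step 2: fill each missing value with the most frequent observed value.
    for f in required_fields:
        observed = [r[f] for r in cleaned if r[f] is not None]
        if not observed:
            continue  # nothing to infer from; leave the gap
        mode = Counter(observed).most_common(1)[0][0]
        for r in cleaned:
            if r[f] is None:
                r[f] = mode
    return cleaned

records = [
    {"entity": "Beijing", "type": "city"},
    {"entity": "Shanghai", "type": "N/A"},   # noise value
    {"entity": "Shenzhen", "type": "city"},
    {"entity": "Tianjin"},                   # missing field
]
cleaned = clean_records(records, ["entity", "type"])
print(cleaned)
```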

S300, carrying out knowledge dimension reduction/conversion reprocessing on the preprocessed knowledge, and reducing the knowledge quantity through operations comprising text feature extraction, knowledge base mapping, semantic representation and graphic data representation.

Specifically, in knowledge dimension reduction and knowledge mining, keywords are extracted from the question, similarity calculation is performed on the keywords, synonyms are expanded, and the synonyms in the response wording are flexibly and randomly replaced. In this way a knowledge base can be constructed in which the content of the response topic stays unchanged while the response modes are diversified.

In this embodiment, cosine similarity is used to measure the similarity between texts; the word frequency vectors of the two texts are computed, and the cosine of the angle between these vectors is taken as the similarity between the texts;

for keyword processing, TF-IDF (Term Frequency-Inverse Document Frequency) is used to identify important keywords, extract discriminative feature vectors, and convert the feature vectors into standard word frequency vectors.

Flexible random replacement of synonyms in the response wording means that a knowledge base of synonyms usable for answering a question is built, and that these synonyms can be randomly substituted for one another, so that the knowledge base offers multiple response modes while the content of the response topic remains unchanged.
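The combination of TF-IDF weighting and cosine similarity can be sketched in plain Python. The sketch assumes whitespace tokenization and one common smoothed IDF variant; the patent does not specify either, and the sample sentences are illustrative only.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return the vocabulary and one TF-IDF vector per tokenized document."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # Smoothed IDF (an assumed variant) so that words present in every
    # document still keep a non-zero weight.
    idf = {
        w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1
        for w in vocab
    }
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append([counts[w] / len(d) * idf[w] for w in vocab])
    return vocab, vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = [
    "how to build a knowledge base".split(),
    "how to create a knowledge base".split(),
    "weather forecast for tomorrow".split(),
]
_, vecs = tf_idf_vectors(docs)
sim_close = cosine_similarity(vecs[0], vecs[1])
sim_far = cosine_similarity(vecs[0], vecs[2])
print(sim_close > sim_far)
```

The two questions that differ only in one verb score much closer to each other than to the unrelated sentence, which shares no words with them and therefore has similarity zero.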

For example, suppose the question is: "Can you tell me how to build a knowledge base using NLP?" We can create a knowledge base that contains the following synonyms:

1. Build: construct, establish, create

2. Knowledge base: database, information base, material base

Thus, we can use the synonyms above for random replacement and construct a series of different phrasings, such as:

1. How is a database built with NLP?

2. How is a library created?

3. What is the method for building a library of materials?

4. How is a database constructed?
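The synonym replacement idea can be sketched as follows. The template, the slot names `verb` and `noun`, and the exact synonym lists are illustrative assumptions; only the general approach of substituting synonyms while keeping the topic fixed comes from the description.

```python
import itertools
import random

# Illustrative synonym table; the slot names are hypothetical.
SYNONYMS = {
    "verb": ["build", "construct", "establish", "create"],
    "noun": ["knowledge base", "database", "information base", "material base"],
}

def expand(template, synonyms):
    """Yield every phrasing obtained by substituting synonyms into the template."""
    slots = list(synonyms)
    for combo in itertools.product(*(synonyms[s] for s in slots)):
        yield template.format(**dict(zip(slots, combo)))

template = "How do I {verb} a {noun} using NLP?"
variants = list(expand(template, SYNONYMS))

# Random replacement: pick one phrasing at random for a given response.
pick = random.choice(variants)
print(len(variants))
print(pick)
```

Enumerating all combinations first and then sampling keeps the topic fixed by construction, since only the slots listed in the synonym table ever change.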

The preprocessed knowledge is reprocessed to reduce the amount of knowledge, mainly through knowledge base mapping and the following operations:

1. Text feature extraction: text feature extraction refers to extracting features from the original text to form text feature vectors, which is a dimension reduction step that converts the original text information into smaller feature vectors for subsequent analysis. For example, a bag-of-words model may be used to convert text into word frequency vectors, which are then converted into feature vectors using the TF-IDF algorithm.

2. Knowledge base mapping: knowledge base mapping refers to mapping words, entities, relationships, and concepts in text to concepts in a knowledge base to reduce the amount of knowledge in the text. For example, words in text can be mapped to WordNet concepts, and the semantics of the text can then be described using the relationships and concepts of WordNet.

3. Semantic representation: semantic representation refers to converting text into a more abstract semantic representation to reduce the amount of knowledge of the text. For example, Word2Vec may be used to convert words in text into semantic vectors to reduce the amount of knowledge of the text.

4. Graphic data representation: graphic data representation refers to converting text into a graph-structured data representation to reduce the amount of knowledge of the text. For example, a graph neural network may be used to convert text into a graph representation to reduce the amount of knowledge of the text.
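To make the reduction effect of knowledge base mapping concrete, the sketch below maps surface words to concept identifiers and compares vocabulary sizes before and after. The small `CONCEPT_MAP` table is a hypothetical stand-in for a real lexical resource such as WordNet, and the concept names are invented for the example.

```python
# Hypothetical word-to-concept table standing in for a real lexical resource.
CONCEPT_MAP = {
    "build": "CREATE",
    "construct": "CREATE",
    "establish": "CREATE",
    "create": "CREATE",
    "database": "KNOWLEDGE_STORE",
    "repository": "KNOWLEDGE_STORE",
}

def map_to_concepts(tokens, concept_map):
    """Replace each token by its concept ID; unknown tokens are kept unchanged."""
    return [concept_map.get(t, t) for t in tokens]

tokens = "build database construct repository create database".split()
mapped = map_to_concepts(tokens, CONCEPT_MAP)

# The number of distinct symbols shrinks once synonyms share a concept.
print(len(set(tokens)), "->", len(set(mapped)))
```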

S400, knowledge mining is carried out from knowledge by adopting a preset algorithm, the knowledge is expressed according to a specific mode, the accuracy and the reliability of the mining result are evaluated, and the mining result is output to a user in a mode which can be understood by the user.

This step mainly uses machine learning, deep learning and natural language processing techniques, with algorithms such as support vector machines, decision trees and neural networks. Common expression modes include the bag-of-words model, topic models, semantic analysis and the like.

The support vector machine (SVM) algorithm is a classification and regression algorithm based on kernel functions. The idea is to map data points in the feature space to a higher-dimensional space and to find there the best hyperplane, that is, the one that separates the classes with the largest possible margin.

For example, given a set of two-dimensional data in which two classes are shown in red and blue, the two classes can be separated by a straight line; in two dimensions, this line is the hyperplane of the SVM.

The decision tree algorithm is a tree-structure-based algorithm that simulates human thinking behavior in a sense, representing the decision process in a tree structure. Each internal node represents a feature, each branch represents a feature value, and each leaf node represents a result.

For example, determining a person's gender based on their hobbies may be implemented using a decision tree algorithm.

The root node of this tree represents a "hobby" and has two branches, one is "sports and outdoor activities" and the other is "watch tv and play games". Then, on the "sports and outdoor activities" branch, there are two leaf nodes, representing "male" and "female", respectively; on the "watch tv and play game" branch, there are also two leaf nodes, representing "male" and "female", respectively.

The neural network algorithm is an artificial intelligence technology which mimics the neural network structure of the human brain, uses mathematical models to identify patterns, and builds models to solve practical problems. The neural network algorithm can be used in the fields of voice recognition, image recognition, automatic driving, language translation and the like.

For example, in speech recognition, a neural network algorithm may analyze the frequency and energy of sound and convert it to text. In image recognition, algorithms can use the pixel information of an image to identify different objects and make accurate decisions. In autonomous driving, neural network algorithms analyze obstacles on the road surface and assist the vehicle in making the correct decisions.

Knowledge mining is a machine learning technique that uses different algorithms to find valuable knowledge from a large volume of data. The knowledge mining process mainly comprises four steps of data preprocessing, model construction, result evaluation and model optimization.

Data preprocessing refers to extracting useful information from raw data and formatting it into a format that can be used by machine learning algorithms.

Model building refers to selecting the most appropriate algorithm for mining valuable knowledge from the formatted data. Common algorithms include decision trees, cluster analysis, neural networks, K-nearest neighbor algorithms, and the like.

Result evaluation refers to evaluating the effect of the model through the confusion matrix and checking the accuracy of the model.

Model optimization refers to improving the accuracy and efficiency of the model by improving the algorithm and model parameters. Among them, deep learning is one of the algorithms with the best technical effect, which is implemented by using a multi-layer neural network, and can simulate the thinking process of the human brain to extract valuable patterns and features in the data. The improvement of deep learning can be realized by changing the network structure, adding new neurons, adjusting parameters, adjusting learning rate and the like.

Changing the network structure is to change the number of layers of the neural network or to change the number of neurons per layer.

Adding new neurons increases the complexity of the network to capture more features.

Adjusting parameters refers to changing the weights of the neural network so that the network can better fit the data.
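The effect of the learning rate on convergence speed can be shown with a toy example. The sketch below runs plain gradient descent on the one-parameter function f(w) = (w - target)^2 rather than on a neural network; this simplification, the function name and the chosen rates are assumptions made purely for illustration.

```python
def steps_to_converge(learning_rate, start=0.0, target=3.0, tol=1e-6, max_steps=10_000):
    """Gradient descent on f(w) = (w - target)^2; return the number of update steps."""
    w = start
    for step in range(max_steps):
        if abs(w - target) < tol:
            return step
        gradient = 2.0 * (w - target)
        w -= learning_rate * gradient  # step size is set by the learning rate
    return max_steps  # did not converge within the budget

slow = steps_to_converge(0.1)
fast = steps_to_converge(0.4)
diverged = steps_to_converge(1.1, max_steps=200)
print(slow, fast, diverged)
```

A larger learning rate reaches the tolerance in fewer steps here, while a rate that is too large overshoots on every update and never converges, which is why the rate is usually tuned rather than simply maximized.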

The specific steps of the result evaluation are as follows:

1. Defining evaluation indexes: first, evaluation indexes such as accuracy, recall and F-value are defined to evaluate the result.

2. Data collection: a certain amount of useful data is then collected, which will serve as the training and testing data sets.

3. Training and testing: the model is then trained and tested on these data using the corresponding NLP techniques to obtain a reliable set of model parameters.

4. Evaluation: finally, the model is evaluated using the defined evaluation indexes to judge its accuracy and reliability.

Example:

If the accuracy of an NLP model needs to be evaluated, a certain amount of corpus data can be collected and used to train and test the model; once the model parameters are obtained, accuracy is used as the evaluation index to assess the model.
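The evaluation indexes named above can be computed from a confusion matrix as in the following sketch. It assumes a binary classification setting with label 1 as the positive class and reports precision alongside accuracy, recall and F-value (F1); the label vectors are made-up test data, not results from the patent.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up gold labels and predictions for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```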

Fig. 4 is a schematic diagram of the knowledge acquisition architecture of the present invention, and its main functional contents are:

1. Data cleaning: valid information is extracted from the raw data and converted into a format that is convenient for analysis.

2. Integrated extraction: knowledge is extracted from the raw data using techniques such as machine learning, natural language processing, data mining, and the like.

3. Conversion: the extracted knowledge is converted into a format usable by the application.

4. Data mining processing: knowledge is extracted from the raw data using techniques such as machine learning, natural language processing, data mining, and the like.

5. Mining result evaluation: the accuracy and reliability of the mining result are evaluated.

6. Knowledge output: the extracted knowledge is presented to the user in an understandable manner.

Example 2

Corresponding to the above embodiment 1, this embodiment proposes an NLP recognition and knowledge base construction system, which includes:

extracting keywords in the question, then performing similarity calculation on the keywords, expanding synonyms, flexibly and randomly replacing the synonyms in the response wording, and constructing a knowledge base in which the content of the response topic is unchanged and the response modes are diversified.

The functions performed by each component in the NLP recognition and knowledge base construction system provided in the embodiment of the present invention are described in detail in the above embodiment 1, so that redundant description is omitted here.

Example 3

An embodiment of the present invention proposes an electronic device, and fig. 5 is a schematic entity structure diagram of the electronic device provided by the present invention, where the electronic device may include: processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and communication bus 1050, wherein processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 communicate with each other via communication bus 1050. One or more programs are stored in the memory 1020 and configured to be executed by the one or more processors 1010, the one or more programs configured to perform the NLP recognition and knowledge base construction method described in the above embodiments.

Example 4

In correspondence with the above-described embodiments, the present embodiment proposes a computer storage medium containing one or more program instructions for executing the method as in embodiment 1 by an NLP recognition and knowledge base construction system.

While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims

1. An NLP recognition and knowledge base construction method, comprising:

2. The method for constructing the NLP recognition and knowledge base according to claim 1, wherein the steps of obtaining the original text data, recognizing the knowledge expression contained in the text and extracting the knowledge expression comprise:

3. The method for constructing the NLP recognition and knowledge base according to claim 1, wherein the preprocessing knowledge is subjected to knowledge dimension reduction/conversion reprocessing, and the method specifically comprises the following steps:

4. The method for constructing the NLP recognition and knowledge base according to claim 1, wherein the similarity calculation is performed on the keywords, specifically comprising:

5. The method for constructing the NLP recognition and knowledge base according to claim 1, wherein the knowledge mining is performed from the knowledge by adopting a preset algorithm, specifically comprising:

6. The method for constructing the NLP recognition and knowledge base according to claim 1, wherein the knowledge mining is performed from the knowledge by adopting a preset algorithm, specifically comprising:

7. The method for constructing the NLP recognition and knowledge base according to claim 1, wherein the evaluation of the accuracy and reliability of the mining result comprises the following steps:

8. An NLP recognition and knowledge base construction system, the system comprising:

9. An electronic device, the electronic device comprising:

one or more processors;

a memory;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-7.

10. A computer storage medium having one or more program instructions embodied therein for performing the method of any of claims 1-7 by an NLP recognition and knowledge base construction system.