CN111930909B

CN111930909B - Geological intelligent question-answering oriented data automation sequence labeling identification method

Info

Publication number: CN111930909B
Application number: CN202010804098.1A
Authority: CN
Inventors: 贺金龙; 付立军; 黄徐胜; 唐珂珂; 朱月琴; 刘晓娟
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-08-11
Filing date: 2020-08-11
Publication date: 2023-09-12
Anticipated expiration: 2040-08-11
Also published as: CN111930909A

Abstract

The application relates to the field of information technology and provides an automated data sequence labeling and identification method for geological intelligent question answering. The application aims to improve the accuracy of user question-answer interaction in intelligent question answering over gold mine data. The main scheme is as follows: gold mine literature and knowledge graph data are organized and cleaned to obtain batch literature data; the characters of the literature data are automatically labeled by machine with BIOES tags to obtain a gold mine data labeling result, which is fed into deep learning training to obtain a training result for the gold mine literature data; the training result is applied to the recognition of user query sentences to obtain their labeling result, and attribute classification is then performed to obtain the classification of the user query sentences; the labeling result and the classification are combined and packaged in a set to obtain the gold mine data labels in the user query sentence together with the semantic attribute of the query sentence, and this result is mapped to a gold mine knowledge graph to obtain the knowledge result for the user query.

Description

Geological intelligent question-answering oriented data automation sequence labeling identification method

Technical Field

The application relates to the technical field of knowledge graph application in the deep learning knowledge mining process, and provides a gold mine data automatic sequence labeling method for realizing an intelligent question-answering platform.

Background

Currently, intelligent question-answering services are an important application in the development of artificial intelligence, and they offer greater cognitive ability than traditional rule matching and co-occurrence search matching. In implementation, the concepts and relations of knowledge are associated by introducing a knowledge graph, and domain recognition and intent recognition are then carried out during user question answering with a deep learning automatic sequence labeling method, so that an intelligent question-answering platform is realized.

At present, question-answering systems mostly rely on rule template matching and Elasticsearch retrieval matching, and most of them target the general domain; intelligent question-answering services in specific domains remain challenging because deep semantic knowledge analysis is lacking. When a traditional question-answering system processes Chinese text, sentences are generally converted into word representations by word segmentation, and the sentences are then matched against a knowledge base by semantic similarity calculation (edit distance, or cosine similarity of TF-IDF vectors) to answer user queries. Word segmentation technology has passed through three development stages: rule dictionary matching, statistical machine learning and deep learning. Rule dictionary matching comprises forward maximum matching, reverse maximum matching and bidirectional maximum matching; statistical machine learning comprises the n-gram language model, the maximum entropy model, the conditional random field and the like; with the massive data generated in the transition from Web 2.0 to Web 3.0, word segmentation methods based on deep learning have steadily emerged, including convolutional neural networks, recurrent neural networks, long short-term memory networks, and combinations of these with a conditional random field, where the label scheme adopted in the identification process is BIO or BIOES.
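To make the dictionary dependence of rule-based segmentation concrete, the following is a minimal sketch of forward maximum matching. The function name, the `max_len` parameter and the two toy dictionaries are illustrative assumptions, not part of the patent.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionaries; real segmenters ship far larger ones.
general_dict = {"青藏", "高原", "是", "什么"}
domain_dict = general_dict | {"青藏高原"}

print(forward_max_match("青藏高原是什么", general_dict))
print(forward_max_match("青藏高原是什么", domain_dict))
```

With the general dictionary the domain term is split into two words, and it is only kept whole once the dictionary contains it, which is the dependence that defect (2) below describes.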

The existing labeling method has the defects that:

(1) In the gold mine knowledge mining discovery process, manual processing of a large amount of data information is time-consuming and labor-consuming, and the processing efficiency is low.

(2) Word segmentation tools depend heavily on the construction of a dictionary; they work well in the general domain, but they do not achieve the desired effect when used to process gold mine information.

(3) For sequence labeling of massive gold data, structured information of specific domain knowledge categories is also needed on the basis of the prior art method.

Disclosure of Invention

The application aims to improve the accuracy of user question-answer interaction in intelligent question answering over gold mine data. It constructs a deep learning identification method based on automatic sequence labeling, built by combining gold mine domain literature with a knowledge graph.

In order to solve the technical problems, the application adopts the following technical scheme:

a data automation sequence labeling identification method for geological intelligent question and answer comprises the following steps:

step 1: the gold mine literature map data are arranged to obtain domain entity classification description tags (including entities) which are used as labeling tags for domain knowledge entity identification;

step 2: performing machine automatic cleaning on document data content, including filtering English letters, punctuation marks and nonsensical marks to obtain effective Chinese text content;

step 3: storing the cleaned text content in an independent txt file to obtain a storage root path of batch document data;

step 4: performing machine automatic labeling of character data by using BIOES labels for the literature data obtained in the step 3, and performing label combination by combining the sorted map entity classification description data to obtain a gold ore data labeling result beginning with B, I, O, E, S;

step 5: inputting and training character sequence data of the labeling result of the gold mine data in the step 4 by adopting a mode of combining a bidirectional LSTM model and a conditional random field CRF in deep learning, and adding the tidied gold mine map entity data by adjusting the structure and the integral parameters of the memory cells in the LSTM model to obtain a training result of gold mine literature data;

step 6: applying the training result of the literature data to the recognition of the platform user inquiry statement to obtain the labeling result of the user inquiry statement;

step 7: inputting the rest sentences obtained by subtracting the content of the gold ore data labeling result from the content of the user inquiry sentences into a convolutional neural network for attribute classification to obtain the classification of the user inquiry sentences;

step 8: combining and packaging the gold data identification result and the classification of the user inquiry statement through a Map set to obtain the label of the gold data in the user inquiry statement and the result of the semantic attribute of the inquiry statement, for example { Qinghai-Tibet plateau=what of geological entity GENT };

step 9: mapping the gold mine data labels and the semantic attribute of the query sentence obtained in step 8 to a gold mine knowledge graph to obtain a user query knowledge result, thereby realizing intelligent question and answer.
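Steps 8 and 9 above can be sketched as a dictionary-based packaging followed by a lookup. This is only an illustration under stated assumptions: the function names, the nested dictionary layout and the toy graph with placeholder values are hypothetical, and a real system would query an actual knowledge graph store.

```python
def package_result(entities, attribute):
    """Step 8 (sketch): combine entity labels and the query attribute in one mapping."""
    return {
        word: {"category": category, "attribute": attribute}
        for word, category in entities
    }


def query_graph(graph, packaged):
    """Step 9 (sketch): look up each (entity, attribute) pair in the graph."""
    answers = {}
    for word, info in packaged.items():
        # Unknown entities or attributes yield None instead of raising.
        answers[word] = graph.get(word, {}).get(info["attribute"])
    return answers


# Toy graph with placeholder content standing in for a real knowledge graph.
toy_graph = {
    "青藏高原": {
        "brief introduction": "<introduction text>",
        "size": "<area value>",
    }
}

packaged = package_result([("青藏高原", "GENT")], "brief introduction")
answers = query_graph(toy_graph, packaged)
print(packaged)
print(answers)
```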

In the above technical scheme, the arrangement of the gold mine literature map data comprises:

the gold mine document data is collected by manually curating a geological encyclopedia and the Sogou corpus, and classification description tags are constructed from gold mine domain knowledge, wherein the classification description tags comprise geological entities GENT, geological actions GEFF, geochemistry GEHE and geological methods GMET.

In the above technical solution, the label combination in step 4 includes the steps of:

firstly, splitting the BIOES label into its individual characters to obtain the single letters B, I, O, E, S;

then automatically labeling the txt file content from step 3 with the single letters to obtain a gold mine data labeling result beginning with B, I, O, E, S.
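The label combination described above can be sketched as follows. The `B-GENT` style tag format, the function name and the longest-match-first dictionary lookup are assumptions made for illustration; the source only states that the labels begin with B, I, O, E, S and are combined with the classification description tags.

```python
def bioes_label(text, entity_dict):
    """Label each character with a combined BIOES + category tag.

    entity_dict maps entity strings to category tags such as GENT or GMET.
    Longer entities are matched first; unmatched characters get "O".
    """
    labels = ["O"] * len(text)
    for entity in sorted(entity_dict, key=len, reverse=True):
        if not entity:
            continue  # ignore empty dictionary entries
        category = entity_dict[entity]
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            # Only label spans that are still entirely unlabeled.
            if all(tag == "O" for tag in labels[start:end]):
                if len(entity) == 1:
                    labels[start] = f"S-{category}"
                else:
                    labels[start] = f"B-{category}"
                    for k in range(start + 1, end - 1):
                        labels[k] = f"I-{category}"
                    labels[end - 1] = f"E-{category}"
            start = text.find(entity, start + 1)
    return list(zip(text, labels))


# Hypothetical entity dictionary built from the knowledge graph entities.
entities = {"青藏高原": "GENT", "火山机构": "GENT"}
labeled = bioes_label("青藏高原的火山机构", entities)
for char, tag in labeled:
    print(char, tag)
```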

According to the technical scheme, automatic labeling is performed on the basis of the gold mine data labeling: the gold mine data is first used to train character vectors with Word2vec, the gold mine data labeling results are then trained by combining a bidirectional LSTM network with a conditional random field (CRF) in deep learning, and the training result for the gold mine data is obtained by adjusting the model parameters.
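In a BiLSTM-CRF tagger of this kind, the CRF layer typically selects the tag sequence at inference time by Viterbi decoding over per-character emission scores and tag-to-tag transition scores. The sketch below shows only that decoding step in plain Python; the scores are invented for illustration, and the network that would produce them is not shown.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list of dicts, one per character, mapping tag -> score
               (in a real model these come from the bidirectional LSTM).
    transitions: dict mapping (prev_tag, tag) -> score (learned by the CRF);
                 missing pairs score 0.
    """
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            # Best previous tag for reaching `tag` at position t.
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, tag), 0.0))
            scores[tag] = best[-1][prev] + transitions.get((prev, tag), 0.0) + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["B", "I", "E", "O"]
# Invented scores: taken alone, the middle character slightly prefers "O".
emissions = [
    {"B": 2.0, "I": 0.0, "E": 0.0, "O": 1.0},
    {"B": 0.0, "I": 1.0, "E": 0.0, "O": 1.2},
    {"B": 0.0, "I": 0.0, "E": 2.0, "O": 0.5},
]
# Transitions reward legal BIOES moves and penalize illegal ones.
transitions = {
    ("B", "I"): 1.0, ("I", "E"): 1.0,
    ("B", "O"): -5.0, ("O", "I"): -5.0, ("O", "E"): -5.0,
}
print(viterbi_decode(emissions, transitions, tags))
```

Taking the per-character maximum would give B, O, E, which is not a valid BIOES sequence; the transition scores steer the decoder to B, I, E instead.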

In the technical scheme, the user inquiry statement identification is carried out, the sequence of the user inquiry statement information is automatically identified by inputting the user inquiry statement into the model and using the training result model, and the labeling result of the user inquiry statement is obtained;

in the above technical solution, the user inquiry sentence identification includes the following steps:

(1) Inputting a user inquiry sentence into the platform through an http interface, and first obtaining the character index of the user sentence (for example 青: 15, 藏: 23, 高: 54, 原: 113, etc.);

(2) Passing the character index of the user sentence to the combined LSTM and CRF model trained in step 5 and outputting the words formed by combining the characters, namely the labeling result of the user inquiry sentence.
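The two sub-steps above can be sketched as an index lookup followed by merging tagged characters into words. The index values repeat the example in the text; the vocabulary layout, the unknown-character index 0, the `B-GENT` tag format and the hard-coded tag sequence (standing in for the trained model's output) are all assumptions.

```python
def to_indices(sentence, vocab, unk=0):
    """Map each character to its integer index; unseen characters map to unk."""
    return [vocab.get(ch, unk) for ch in sentence]


def merge_entities(chars, tags):
    """Merge BIOES-tagged characters into (word, category) pairs."""
    entities, buffer = [], []
    for ch, tag in zip(chars, tags):
        if tag == "O":
            buffer = []
            continue
        prefix, category = tag.split("-", 1)
        if prefix == "S":
            entities.append((ch, category))
            buffer = []
        elif prefix == "B":
            buffer = [ch]
        elif prefix == "I" and buffer:
            buffer.append(ch)
        elif prefix == "E" and buffer:
            buffer.append(ch)
            entities.append(("".join(buffer), category))
            buffer = []
    return entities


vocab = {"青": 15, "藏": 23, "高": 54, "原": 113}
query = "青藏高原是什么"
indices = to_indices(query, vocab)

# Stand-in for the model output; a trained BiLSTM-CRF would predict these tags.
predicted = ["B-GENT", "I-GENT", "I-GENT", "E-GENT", "O", "O", "O"]
recognized = merge_entities(query, predicted)
print(indices)
print(recognized)
```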

In the technical scheme, user statement classification inputs the parts of the sentence that the sequence recognition model did not identify into a convolutional neural network to classify their attributes; the classification is learned automatically by machine training on labeled data, and the classification of the user inquiry statement is thereby obtained.
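The input to the classifier is the query with the recognized entities removed. The sketch below shows that subtraction and then uses a keyword lookup as a stand-in for the trained convolutional neural network, which is not reproduced here. The keywords are hypothetical; the attribute names follow the categories listed in the embodiment (brief introduction, category, size, relationship, regional scope).

```python
def residual_sentence(query, entities):
    """Remove every recognized entity string from the query."""
    for word, _category in entities:
        query = query.replace(word, "")
    return query


# Stand-in for the trained CNN classifier: a hypothetical keyword lookup.
KEYWORD_TO_ATTRIBUTE = {
    "是什么": "brief introduction",
    "多大": "size",
    "属于": "category",
}


def classify_attribute(residual, default="brief introduction"):
    """Return the attribute of the first keyword found in the residual text."""
    for keyword, attribute in KEYWORD_TO_ATTRIBUTE.items():
        if keyword in residual:
            return attribute
    return default


rest = residual_sentence("青藏高原有多大", [("青藏高原", "GENT")])
print(rest, classify_attribute(rest))
```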

By adopting the technical scheme, the application has the following beneficial effects:

1. Processing and applying gold mine literature data requires professional knowledge and skills. Adopting automatic machine sequence labeling and identification reduces the complexity of manual processing on the one hand; on the other hand, the domain knowledge is encapsulated internally, so users can extend the system quickly during use without attending to the underlying internals.

2. The automatic sequence labeling identification method based on the map gold mine data provides a convenient interaction mode for users in the intelligent question-answering process, only the query sentences need to be input, and the convenience of gold mine field knowledge in the application process is greatly improved.

3. The automatic sequence labeling recognition process does not depend on word segmentation tools and relies only on automatic model training, which greatly reduces the manpower required; in addition, the model only needs to be trained once and is simply called during use without retraining.

4. The migration of the model technology only depends on the provided literature data, the model can be conveniently and rapidly customized and trained according to different data, and the model migration risk is reduced.

5. The adoption of the automatic sequence labeling and recognition method of the map gold mine data enables the intelligent question-answer to have more generalization capability compared with the matching based on the regular template and the matching based on the retrieval.

Drawings

FIG. 1 is a flow chart of an intelligent question-answering service;

FIG. 2 is a sequence annotation diagram of a tag combination described based on BIOES and gold data classification;

FIG. 3 is a flowchart of a labeling process based on a word segmentation tool;

FIG. 4 is a flowchart of automated sequence annotation recognition.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the particular embodiments described herein are illustrative only and are not intended to limit the application, i.e., the embodiments described are merely some, but not all, of the embodiments of the application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.

It is noted that relational terms such as "first" and "second", and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The gold mine intelligent question-answering platform adopts an automatic sequence labeling and identification method based on the knowledge graph and gold mine data, which answers user queries promptly and accurately by drawing on domain feature knowledge. First, gold mine documents are collected, and Chinese text content is obtained by removing invalid symbols and meaningless tags. Next, knowledge classification description information is constructed from the classification structure of the domain knowledge. Then, the text content is automatically labeled character by character, and each character receives a combined label formed from the sequence label and the domain knowledge classification description tag. The gold mine text data is then trained with a bidirectional neural network from deep learning, and the parameters are adjusted until the model reaches the threshold required for automatic sequence labeling and recognition. The resulting model is used to perform sequence recognition on the user sentence, intent classification is performed on the text that remains after the recognized sequence is removed, and the classification result and the sequence recognition result are mapped to the gold mine knowledge graph to execute the user query and return feedback to the user. The question-answering service is shown in fig. 1.
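The first stage of this flow, obtaining Chinese text by removing invalid symbols, can be sketched with a regular expression. The character range below (CJK Unified Ideographs, U+4E00 to U+9FFF) is an assumption about what counts as effective Chinese text; it also drops digits, which the source does not mention explicitly.

```python
import re

# Anything outside U+4E00..U+9FFF is treated as noise: English letters,
# ASCII and full-width punctuation, markup remnants, whitespace, digits.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def clean_text(raw):
    """Return only the Chinese characters of raw, in their original order."""
    return NON_CHINESE.sub("", raw)


cleaned = clean_text("青藏高原 (Qinghai-Tibet Plateau), <br> 金矿！")
print(cleaned)
```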

The steps are as follows:

(1) Data arrangement. The gold mine document data are collected and organized, and classification description tags such as geological entities GENT, geological actions GEFF, geochemistry GEHE and geological methods GMET are constructed from gold mine domain knowledge.

(2) Data cleaning. The organized document data are processed in batches to extract the text content, and the text format is cleaned with regular expression matching to obtain effective Chinese text.

(3) Batch data storage. Using Python, the batch text contents are stored uniformly under a fixed root directory, numbered by article, as UTF-8 encoded txt files.

(4) Automated labeling of combined labels. The organized gold mine domain knowledge classification description tags are combined with the traditional BIOES labels, and the gold mine text content is labeled document by document and character by character to obtain a gold mine data character labeling result beginning with B, I, O, E, S, as shown in fig. 2.

(5) Automated sequence recognition with deep learning. On the basis of the data annotation, the gold mine data is first used to train character vectors with Word2vec, the annotated data is then trained by combining a bidirectional LSTM network with a conditional random field (CRF) in deep learning, and the training result for the gold mine data (saved as a checkpoint file) is obtained by adjusting the model parameters. The recognition approach of word segmentation tools, which scores the weights of word features, is not used here, as shown in fig. 3.

(6) User query sentence sequence identification. The user inquiry statement is input into the model, and the trained model automatically identifies the sequence in the user statement to obtain the labeling result of the user data, as shown in fig. 4.

(7) User statement classification. The parts of the sentence that the sequence recognition model did not identify are input into a convolutional neural network for attribute classification; the classification is learned automatically by machine training on labeled data, and the user statement classification is obtained.

(8) Combining the user sequence labeling result and the sentence attribute classification. The results of step (6) and step (7) are combined to obtain a combined result that captures the meaning of the user statement.

(9) Knowledge graph mapping and query. The combined result of step (8) is mapped into the gold mine knowledge graph, and feedback information is obtained by querying the knowledge graph.
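The batch storage step in the list above, writing each cleaned text to its own UTF-8 txt file under a fixed root directory, can be sketched as follows. The four-digit file naming and the use of a temporary directory are assumptions made for a self-contained example.

```python
import tempfile
from pathlib import Path


def store_documents(texts, root):
    """Write each cleaned text to <root>/<number>.txt as UTF-8; return the root path."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for number, text in enumerate(texts, start=1):
        (root / f"{number:04d}.txt").write_text(text, encoding="utf-8")
    return root


# A temporary directory stands in for the fixed root directory.
with tempfile.TemporaryDirectory() as tmp:
    root = store_documents(["青藏高原金矿", "火山机构"], Path(tmp) / "corpus")
    names = sorted(p.name for p in root.iterdir())
    first = (root / "0001.txt").read_text(encoding="utf-8")

print(names)
print(first)
```

The returned root path is what later steps would iterate over to label the documents one by one.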

Examples

The application provides a data automation sequence labeling identification method for geological intelligent question and answer, which comprises the following steps:

step 5: inputting and training the character sequence data of the labeling result of the gold mine data in the step 4 by adopting a mode of combining a bidirectional LSTM model and a conditional random field CRF in deep learning, and adding the tidied gold mine map entity data by adjusting the structure and the integral parameters of the memory cells in the LSTM model to obtain the training result of the gold mine literature data (storing a checkpoint file).

step 8: combining the gold mine data labeling result with the classification of the user query statement to obtain the gold mine data labels in the user query statement together with the semantic attribute of the query statement; the gold mine data labeling result refers to the entity parts in the gold mine literature, such as geological entities (for example the Qinghai-Tibet Plateau and volcanic edifices), geological actions, geochemistry and geological methods; the classification of user query statements refers to the attribute categories that the user asks about for the entity part, such as brief introduction, category, size, relationship and regional scope;

In the scheme, the arrangement of the gold mine literature map data comprises the following steps:

In the above scheme, the label combination in step 4 includes the steps of:

In the scheme, automatic labeling is performed on the basis of the gold mine data labeling: the gold mine data is first used to train character vectors with Word2vec, the gold mine data labeling results are then trained by combining a bidirectional LSTM network with a conditional random field (CRF) in deep learning, and the training result for the gold mine data is obtained by adjusting the model parameters.

In the above scheme, the user inquiry sentence identification includes the following steps:

inputting a user inquiry sentence into the platform through an http interface, and first obtaining the character index of the user sentence (for example 青: 15, 藏: 23, 高: 54, 原: 113, etc.);

passing the character index of the user sentence to the combined LSTM and CRF model trained in step 5 and outputting the words formed by combining the characters, namely the labeling result of the user inquiry sentence.

In the scheme, user statement classification inputs the parts of the sentence that the sequence recognition model did not identify into a convolutional neural network to classify their attributes; the classification is learned automatically by machine training on labeled data, and the classification of the user inquiry statement is thereby obtained.

Claims

1. A data automation sequence labeling identification method for geological intelligent question and answer is characterized in that: the method comprises the following steps:

step 1: the gold mine literature map data are arranged to obtain domain entity classification description tags which are used as labeling tags for domain knowledge entity identification;

step 7: subtracting the identification content of the model on the gold data in the user statement from the content of the user query statement, and inputting the obtained residual statement into a convolutional neural network for attribute classification to obtain the classification of the user query statement;

step 8: combining and packaging the gold data identification result and the classification of the user inquiry statement through a Map set to obtain the label of the gold data in the user inquiry statement and the result of the semantic attribute of the inquiry statement;

step 9: mapping the labeling of the gold data in the step 8 and the result of the semantic attribute of the query sentence to a gold knowledge graph to obtain a user query knowledge result, thereby realizing intelligent question and answer;

the user query sentence identification includes the steps of:

inputting a user inquiry sentence into a platform through an http interface, and firstly obtaining a word index of the user sentence;

the word index of the user sentence is further invoked and output through the combined model training result of the LSTM and the CRF in the step 5, and the word combined by the characters, namely the labeling result of the user inquiry sentence, is obtained;

the user statement classification, namely inputting other unidentified parts input into the sequence recognition model into a convolutional neural network to classify the attributes of the unidentified parts, wherein the classification is automatically realized through machine training of labeling data, and the user inquiry statement classification is obtained.

2. The method for identifying the automatic sequence labels of the geological intelligent question-answering oriented data according to claim 1, wherein the arrangement of the gold mine literature map data comprises the following steps:

3. The method for identifying the automatic sequence labels of the data for intelligent geological questions and answers according to claim 1, wherein the label combination in the step 4 comprises the following steps:

4. The geological intelligent question-answering oriented data automatic sequence labeling identification method according to claim 3 is characterized in that automatic labeling is carried out on the basis of gold mine data labeling, the gold mine data is first used for training character vectors based on Word2vec, the gold mine data labeling results are then trained by combining a bidirectional LSTM neural network and a conditional random field (CRF) in deep learning, and the training results of the gold mine data are obtained by adjusting model parameters.