CN111930909A

CN111930909A - Automated data sequence labeling and recognition method for geological intelligent question answering

Info

Publication number: CN111930909A
Application number: CN202010804098.1A
Authority: CN
Inventors: 贺金龙; 付立军; 黄徐胜; 唐珂珂; 朱月琴; 刘晓娟
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-08-11
Filing date: 2020-08-11
Publication date: 2020-11-13
Anticipated expiration: 2040-08-11
Also published as: CN111930909B

Abstract

The invention relates to the field of information technology and provides an automated data sequence labeling and recognition method for geological intelligent question answering. The invention aims to improve the accuracy of question-answer interaction for users in the intelligent question-answering process over gold mine data. The main scheme comprises: organizing and cleaning the knowledge graph data of the gold mine literature to obtain batch literature data; automatically labeling the character data of the literature with BIOES labels to obtain a gold mine data labeling result, and training on it with deep learning to obtain a training result for the gold mine literature data; applying the training result to user query sentence recognition to obtain a labeling result for the user query sentence, and then performing attribute classification to obtain the classification of the user query sentence; and combining and packaging the labeling result and the classification through a Map collection to obtain the gold mine data labels in the user query sentence together with the semantic attribute of the query sentence, which are mapped onto the gold mine knowledge graph to obtain the knowledge result for the user query.

Description

Automated data sequence labeling and recognition method for geological intelligent question answering

Technical Field

The invention relates to the technical field of knowledge graph applications in deep learning knowledge mining, and provides an automated sequence labeling method for gold mine data, used to implement an intelligent question-answering platform.

Background

Currently, intelligent question-answering services are an important application of artificial intelligence and offer stronger cognitive ability than traditional rule matching and co-occurrence retrieval matching. In implementation, a knowledge graph is introduced to associate the concepts and relationships of knowledge, and a deep learning automated sequence labeling method is then used during the user's question-answering process to perform domain recognition and intent recognition, thereby realizing an intelligent question-answering platform.

At present, question-answering systems mostly rely on rule template matching and Elasticsearch retrieval matching, and most of them target general-domain questions and answers. Because deep semantic knowledge analysis is lacking, implementing an intelligent question-answering service for a specific domain is challenging. When an existing question-answering system processes Chinese text, sentences are generally converted into word representations through word segmentation, and the sentences are then matched against a knowledge base through semantic similarity calculation (edit distance, cosine similarity of TF-IDF vectors) in order to answer the user's query. Word segmentation technology has passed through three development stages: rule dictionary matching, statistical machine learning, and deep learning. Rule-dictionary-based matching comprises forward maximum matching, reverse maximum matching and bidirectional maximum matching. Statistical machine learning methods comprise the n-gram language model, the maximum entropy model, the conditional random field and the like. With the massive data generated as Web 2.0 advances toward Web 3.0, deep-learning-based word segmentation methods have continued to emerge, including convolutional neural networks, recurrent neural networks, long short-term memory networks, and combinations of these with conditional random fields; the labeling scheme used during recognition is BIO or BIOES.
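As an illustration of the rule-dictionary stage described above, the following is a minimal sketch of forward maximum matching. The function name, the `max_len` parameter and the toy dictionary are assumptions made for this example; they are not taken from the patent.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        # A single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical two-word dictionary, for illustration only.
toy_dictionary = {"青藏高原", "简介"}
print(forward_max_match("青藏高原的简介", toy_dictionary))
# → ['青藏高原', '的', '简介']
```

The result depends entirely on what the dictionary contains, which is the dictionary dependence the defects listed below refer to.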

The existing labeling method has the following defects:

(1) In gold mine knowledge mining and discovery, manually processing large amounts of data is time-consuming and labor-intensive, and processing efficiency is low.

(2) Word segmentation tools depend heavily on dictionary construction; they work well in the general domain but do not achieve the desired effect when processing gold mine information.

(3) Sequence labeling of massive gold mine data requires, beyond the prior-art methods, structured information based on domain-specific knowledge categories.

Disclosure of Invention

The invention aims to improve the accuracy of question-answer interaction for users in the intelligent question-answering process over gold mine data, by constructing a deep learning recognition method based on automated sequence labeling that combines gold mine domain literature with the knowledge graph.

In order to solve the technical problems, the invention adopts the following technical scheme:

An automated data sequence labeling and recognition method for geological intelligent question answering comprises the following steps:

Step 1: organizing the knowledge graph data of the gold mine literature to obtain domain entity classification description labels (including the entities), which serve as the labeling labels for recognizing domain knowledge entities;

Step 2: automatically cleaning the literature data content by machine, wherein English letters, punctuation and meaningless symbols are filtered out to obtain valid Chinese text content;

Step 3: storing the cleaned text content in separate txt files to obtain a storage root path for the batch literature data;

Step 4: automatically labeling the character data of the literature data obtained in step 3 with BIOES labels, wherein the labels are combined with the organized knowledge graph entity classification description data to obtain a gold mine data labeling result beginning with B, I, O, E or S;

Step 5: inputting the character sequence data of the gold mine data labeling result from step 4 for training, using a bidirectional LSTM model combined with a conditional random field (CRF) in deep learning, wherein the structure of the memory cells and the overall parameters of the LSTM model are adjusted and the organized gold mine knowledge graph entity data is added, to obtain a training result for the gold mine literature data;

Step 6: applying the training result of the literature data to recognition of the platform user's query sentence to obtain a labeling result for the user query sentence;

Step 7: subtracting the content of the gold mine data labeling result from the content of the user query sentence, and inputting the resulting residual sentence into a convolutional neural network for attribute classification to obtain the classification of the user query sentence;

Step 8: combining and packaging the gold mine data recognition result and the classification of the user query sentence through a Map collection to obtain the gold mine data labels in the user query sentence together with the semantic attribute of the query sentence; for example, in "What is the profile of the Qinghai-Tibet Plateau?", the Qinghai-Tibet Plateau is a geological entity (GENT) and the attribute is "profile";

Step 9: mapping the gold mine data labels and the query sentence semantic attribute from step 8 onto the gold mine knowledge graph to obtain the knowledge result for the user query, thereby realizing intelligent question answering.
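Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, assuming that "valid Chinese text" means characters in the basic CJK Unified Ideographs range (so digits are dropped as well) and that each article is written to a numbered txt file. The function names and file naming scheme are hypothetical.

```python
import re
from pathlib import Path

# Any run of characters outside the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def clean_text(raw):
    """Drop letters, digits, punctuation and symbols; keep Chinese characters."""
    return NON_CHINESE.sub("", raw)


def save_documents(texts, root):
    """Write each cleaned article to <root>/<n>.txt as UTF-8; return the root path."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for number, text in enumerate(texts, start=1):
        (root / f"{number}.txt").write_text(clean_text(text), encoding="utf-8")
    return root


print(clean_text("青藏高原 (Qinghai-Tibet Plateau) 是金矿产地!!"))
# → 青藏高原是金矿产地
```

A stricter or looser character range may be appropriate depending on the corpus; the patent text does not specify the exact expression used.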

In the above technical scheme, organizing the knowledge graph data of the gold mine literature comprises the following:

gold mine literature data is collected by manually organizing geological encyclopedia entries and the Sogou corpus, and classification description labels are constructed from gold mine domain knowledge, the labels comprising geological entity GENT, geological action GEFF, geochemistry GEHE and geological method GMET.

In the above technical scheme, the label combination in step 4 comprises the following steps:

first, the BIOES label is split by character to obtain the single-character letters B, I, O, E and S;

then, the single-character letters are used to automatically label the txt file content from step 3, to obtain a gold mine data labeling result beginning with B, I, O, E or S.
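The label combination can be sketched as below. This is an illustrative sketch only: the hyphenated tag format (`B-GENT`), the dictionary-lookup strategy and the example entity "成矿作用" are assumptions; the patent states only that BIOES letters are combined with the classification description labels (GENT, GEFF, GEHE, GMET).

```python
def bioes_tags(length, category):
    """Return the composite BIOES tags for an entity of the given length."""
    if length == 1:
        return [f"S-{category}"]
    return [f"B-{category}"] + [f"I-{category}"] * (length - 2) + [f"E-{category}"]


def label_text(text, entities):
    """Tag every character; `entities` maps an entity string to its category."""
    tags = ["O"] * len(text)
    # Longer entities are matched first so that they win over their substrings.
    for entity in sorted(entities, key=len, reverse=True):
        start = text.find(entity)
        while start != -1:
            end = start + len(entity)
            # Only label spans that no earlier entity has claimed.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start:end] = bioes_tags(len(entity), entities[entity])
            start = text.find(entity, start + 1)
    return list(zip(text, tags))


# Hypothetical entity dictionary derived from the knowledge graph.
entities = {"青藏高原": "GENT", "成矿作用": "GEFF"}
for char, tag in label_text("青藏高原的成矿作用", entities):
    print(char, tag)
```

Each output line pairs one character with one composite tag, which is the character-level format a sequence labeling model is typically trained on.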

In the above technical scheme, the automated labeling builds on the gold mine data labeling: character vectors are first trained on the gold mine data with Word2vec; the gold mine data labeling result is then learned using a bidirectional LSTM neural network combined with a conditional random field (CRF) in deep learning, and the model parameters are adjusted to obtain the training result for the gold mine data.
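The patent does not spell out how the CRF layer turns the bidirectional LSTM outputs into a tag sequence. A linear-chain CRF commonly uses Viterbi decoding for this, sketched below in plain Python. The three-tag set and all scores are hand-picked for illustration and do not come from a trained model.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at position t (e.g. from the BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for arriving at tag j.
            best_prev = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + step[j])
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tag.
    best = max(range(num_tags), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]


TAGS = ["O", "B", "E"]  # simplified tag set, for illustration
# Transitions that make O->E, B->O, B->B and E->E very unlikely.
transitions = [[0, 0, -10], [-10, -10, 0], [0, 0, -10]]
emissions = [[0.1, 1.0, 0.0], [0.6, 0.0, 0.5], [1.0, 0.0, 0.0]]
print([TAGS[i] for i in viterbi_decode(emissions, transitions)])
# → ['B', 'E', 'O']
```

Choosing the best tag per position independently would give B, O, O here; the transition scores are what steer the decoder to the consistent sequence B, E, O.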

In the above technical scheme, for user query sentence recognition, the user query sentence is input into the model, and the trained model automatically performs sequence recognition on the user sentence to obtain the labeling result of the user query sentence.

In the above technical scheme, user query sentence recognition comprises the following steps:

(1) the user query sentence is input to the platform through an HTTP interface, and the character index of the user sentence is obtained first (for example, for the characters of 青藏高原, the Qinghai-Tibet Plateau: 青: 15, 藏: 23, 高: 54, 原: 113);

(2) the character indices of the user sentence are then passed to the trained LSTM-CRF combined model from step 5, whose output yields the words formed by combining characters, namely the labeling result of the user query sentence.
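The two recognition steps can be sketched as follows. The index values 15, 23, 54 and 113 come from the example in the text, assuming it refers to the four characters of 青藏高原 (Qinghai-Tibet Plateau). The tag sequence stands in for the output of the trained LSTM-CRF model, and the function names and hyphenated tag format are hypothetical.

```python
def to_indices(sentence, char_index, unknown=0):
    """Map each character to its vocabulary index (0 for unseen characters)."""
    return [char_index.get(ch, unknown) for ch in sentence]


def merge_entities(chars, tags):
    """Join characters tagged B/I/E (or S) into (entity, category) pairs."""
    entities, buffer = [], []
    for ch, tag in zip(chars, tags):
        prefix, _, category = tag.partition("-")
        if prefix == "S":
            entities.append((ch, category))
            buffer = []
        elif prefix == "B":
            buffer = [ch]
        elif prefix == "I" and buffer:
            buffer.append(ch)
        elif prefix == "E" and buffer:
            buffer.append(ch)
            entities.append(("".join(buffer), category))
            buffer = []
        else:
            # "O" or an inconsistent tag: discard any partial entity.
            buffer = []
    return entities


char_index = {"青": 15, "藏": 23, "高": 54, "原": 113}
print(to_indices("青藏高原", char_index))
# → [15, 23, 54, 113]

# Assumed model output for the query "青藏高原的简介".
predicted = ["B-GENT", "I-GENT", "I-GENT", "E-GENT", "O", "O", "O"]
print(merge_entities("青藏高原的简介", predicted))
# → [('青藏高原', 'GENT')]
```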

In the above technical scheme, for user sentence classification, the remaining parts of the input that the sequence recognition model did not recognize are fed into a convolutional neural network for attribute classification; the classification of the user's query sentence is realized automatically by training on labeled data.
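The input to the attribute classifier is the part of the query left over after the recognized entities are removed. A minimal sketch of that subtraction is shown below; the convolutional classifier itself is not shown, and the function name is hypothetical.

```python
def residual_sentence(query, recognized_entities):
    """Remove every recognized entity string from the query."""
    # Longer entities are removed first so that substrings do not break them up.
    for entity in sorted(recognized_entities, key=len, reverse=True):
        query = query.replace(entity, "")
    return query


print(residual_sentence("青藏高原的简介是什么", ["青藏高原"]))
# → 的简介是什么
```

The residual "的简介是什么" ("what is the profile of") carries the attribute being asked about, which is what the classifier is expected to map to a category such as "profile".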

Due to the adoption of the technical scheme, the invention has the following beneficial effects:

1. Processing and applying gold mine literature data requires professional knowledge and skills; automated machine sequence labeling and recognition reduces the complexity of manual processing. In addition, the domain knowledge is encapsulated internally, so users can get started quickly without attending to the underlying details.

2. The automated sequence labeling recognition method based on knowledge graph gold mine data gives users a convenient way to interact during intelligent question answering: only a query sentence needs to be entered, which greatly improves the convenience of applying gold mine domain knowledge.

3. The automated sequence labeling recognition process does not depend on a word segmentation tool, only on automated model training, which greatly reduces the human resources required. The model needs to be trained only once; at use time it is simply called, with no retraining.

4. Migrating the model depends only on the literature data provided; the model can be customized and trained quickly and conveniently for different data, which reduces the risk of model migration.

5. By adopting the automated sequence labeling recognition method for knowledge graph gold mine data, the intelligent question answering generalizes better than question answering based on rule template matching and retrieval matching.

Drawings

FIG. 1 is a flow diagram of an intelligent question and answer service;

FIG. 2 is a sequence labeling diagram based on combinations of BIOES labels and gold mine data classification description labels;

FIG. 3 illustrates a process flow of annotation based on a segmentation tool;

FIG. 4 is a flow diagram of automated sequence annotation recognition.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The automated sequence labeling recognition method based on knowledge graph gold mine data, as used by the intelligent gold mine question-answering platform, combines domain-specific knowledge to respond to user queries promptly and accurately. First, gold mine literature is collected, and invalid symbols and meaningless marks are removed to obtain the Chinese text content. Next, knowledge description classification information is established from the classification structure of the domain knowledge. Then, the text content is automatically sequence-labeled with character labels, which are combined with the domain knowledge classification description labels to form composite character labels. After that, the gold mine text data is trained on with a bidirectional neural network in a deep learning model, and the parameters are adjusted until the model meets the threshold required for automated sequence labeling recognition. Finally, the resulting model performs sequence recognition on user sentences, the part of each sentence not covered by sequence recognition undergoes intent classification, and the classification result and the sequence recognition result are mapped onto the gold mine knowledge graph to execute the user's query and return feedback. The question-answering service is shown in FIG. 1.

The steps are as follows:

(1) Data organization. Collect and organize the gold mine literature data, and use gold mine domain knowledge to construct classification description labels such as geological entity GENT, geological action GEFF, geochemistry GEHE and geological method GMET.

(2) Data cleaning. Batch-process the organized literature data to obtain the text content, and clean the format of the text content with regular expressions to obtain valid Chinese text.

(3) Batch data storage. Using Python, store the batch text content under a fixed root directory, organized by article number, as UTF-8 encoded txt files.

(4) Automated labeling with composite labels. Read the gold mine text article by article and character by character, and combine the organized gold mine domain classification description labels with the conventional BIOES labels to obtain a character-level gold mine data labeling result beginning with B, I, O, E or S, as shown in FIG. 2.

(5) Automated sequence recognition with deep learning. Building on the labeled data, first train character vectors on the gold mine data with Word2vec, then learn from the labeled data using a bidirectional LSTM neural network combined with a conditional random field (CRF) in deep learning, and adjust the model parameters to obtain the training result for the gold mine data (stored as checkpoint files). The approach used in word segmentation tools, which scores word feature weights (shown in FIG. 3), is not used here.

(6) User query sentence sequence recognition. Input the user query sentence into the model, and use the trained model to automatically perform sequence recognition on the user sentence to obtain the labeling result of the user data, as shown in FIG. 4.

(7) User sentence classification. Feed the remaining parts of the input that the sequence recognition model did not recognize into a convolutional neural network for attribute classification; the attribute classification is realized automatically by training on labeled data and yields the user sentence classification.

(8) Obtaining the user sequence labeling result and the sentence attribute classification. Combine the results of steps (6) and (7) to interpret the user's sentence, obtaining the combined result of the two.

(9) Knowledge graph mapping and query. Map the combined result of step (8) onto the gold mine knowledge graph, and obtain feedback information through an automated knowledge graph query.
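The final two steps, packaging the combined result and mapping it onto the knowledge graph, can be sketched with plain dictionaries. The key names, the in-memory graph structure and the placeholder answer text are assumptions made for illustration; the patent does not describe the graph storage or the query mechanism.

```python
def package(entities, attribute):
    """Bundle the sequence labeling result and the attribute class into one map."""
    return {
        "entities": [{"name": name, "label": label} for name, label in entities],
        "attribute": attribute,
    }


def query_graph(graph, packaged):
    """Look up the requested attribute of each recognized entity."""
    answers = {}
    for entity in packaged["entities"]:
        node = graph.get(entity["name"], {})
        if packaged["attribute"] in node:
            answers[entity["name"]] = node[packaged["attribute"]]
    return answers


# Toy stand-in for the gold mine knowledge graph; the value is a placeholder.
toy_graph = {
    "青藏高原": {"label": "GENT", "profile": "<profile text stored in the graph>"},
}

packaged = package([("青藏高原", "GENT")], "profile")
print(packaged)
print(query_graph(toy_graph, packaged))
```

In a deployed system the lookup would more likely be a query against a graph database; the dictionary here only shows how the entity and the attribute jointly select the answer.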

Examples

The invention provides an automated data sequence labeling and recognition method for geological intelligent question answering, which comprises the following steps:

Step 5: the character sequence data of the gold mine data labeling result from step 4 is input for training, using a bidirectional LSTM model combined with a conditional random field (CRF) in deep learning; the structure of the memory cells and the overall parameters of the LSTM model are adjusted, and the organized gold mine knowledge graph entity data is added, to obtain a training result (stored as checkpoint files) for the gold mine literature data;

Step 8: the gold mine data labeling result and the classification of the user query sentence are combined, for the user query sentence, to obtain the gold mine data labels in the user query sentence together with the semantic attribute of the query sentence; the gold mine data labeling result refers to the entity parts found in the gold mine literature, such as geological entities (Qinghai-Tibet Plateau, volcanic mechanism), geological actions, geochemistry and geological methods; the classification of the user query sentence refers to the attribute category the user is asking about for the entity part, such as: brief introduction, kind, size, relationship, area scope;

In the above scheme, organizing the knowledge graph data of the gold mine literature comprises the following steps:

In the above scheme, the label combination in step 4 comprises the following steps:

In the above scheme, the automated labeling builds on the gold mine data labeling: character vectors are first trained on the gold mine data with Word2vec; the gold mine data labeling result is then learned using a bidirectional LSTM neural network combined with a conditional random field (CRF) in deep learning, and the model parameters are adjusted to obtain the training result for the gold mine data.

In the above scheme, user query sentence recognition comprises the following steps:

the user query sentence is input to the platform through an HTTP interface, and the character index of the user sentence is obtained first (for example, for the characters of 青藏高原, the Qinghai-Tibet Plateau: 青: 15, 藏: 23, 高: 54, 原: 113);

the character indices of the user sentence are then passed to the trained LSTM-CRF combined model from step 5, whose output yields the words formed by combining characters, namely the labeling result of the user query sentence.

In the above scheme, for user sentence classification, the remaining parts of the input that the sequence recognition model did not recognize are fed into a convolutional neural network for attribute classification; the classification of the user's query sentence is realized automatically by training on labeled data.

Claims

1. An automated data sequence labeling and recognition method for geological intelligent question answering, characterized by comprising the following steps:

Step 7: subtracting the content recognized by the model as gold mine data in the user sentence from the content of the user query sentence, and inputting the resulting residual sentence into a convolutional neural network for attribute classification to obtain the classification of the user query sentence;

Step 8: combining and packaging the gold mine data recognition result and the classification of the user query sentence through a Map collection to obtain the gold mine data labels in the user query sentence together with the semantic attribute of the query sentence;

2. The automated data sequence labeling and recognition method for geological intelligent question answering according to claim 1, wherein organizing the knowledge graph data of the gold mine literature comprises the following steps:

3. The automated data sequence labeling and recognition method for geological intelligent question answering according to claim 1, wherein the label combination in step 4 comprises the following steps:

4. The automated data sequence labeling and recognition method for geological intelligent question answering according to claim 3, wherein the automated labeling builds on the gold mine data labeling: character vectors are first trained on the gold mine data with Word2vec; the gold mine data labeling result is then learned using a bidirectional LSTM neural network combined with a conditional random field (CRF) in deep learning, and the model parameters are adjusted to obtain the training result for the gold mine data.

5. The automated data sequence labeling and recognition method for geological intelligent question answering according to claim 1, wherein user query sentence recognition comprises the following steps:

inputting the user query sentence to the platform through an HTTP interface, and first obtaining the character index of the user sentence;

6. The automated data sequence labeling and recognition method for geological intelligent question answering according to claim 1, wherein, for user sentence classification, the remaining parts of the input that the sequence recognition model did not recognize are fed into a convolutional neural network for attribute classification, and the classification of the user's query sentence is realized automatically by training on labeled data.