CN116737953A

CN116737953A - Entity extraction method and device

Info

Publication number: CN116737953A
Application number: CN202310685376.XA
Authority: CN
Inventors: 张静; 张宪波
Original assignee: Jingdong Technology Information Technology Co Ltd
Current assignee: Jingdong Technology Information Technology Co Ltd
Priority date: 2023-06-09
Filing date: 2023-06-09
Publication date: 2023-09-12

Abstract

The invention discloses a method and a device for entity extraction, and relates to the technical field of intelligent operation and maintenance. One embodiment of the method comprises: acquiring a plurality of historical operation and maintenance text data; preprocessing the historical operation and maintenance text data to obtain preprocessed historical operation and maintenance text data, wherein the preprocessing comprises word segmentation, stop word removal and part-of-speech tagging, and the stop word removal is performed using an operation and maintenance stop word library; labeling entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results; and training an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling results. The embodiment makes full use of the characteristics of the operation and maintenance field, so that entities extracted from operation and maintenance events are more valid and the operation and maintenance events can be managed efficiently.

Description

Entity extraction method and device

Technical Field

The invention relates to the technical field of intelligent operation and maintenance, in particular to a method and a device for entity extraction.

Background

The intelligent operation and maintenance technology applies artificial intelligence in the operation and maintenance field to improve the efficiency and accuracy of intelligent operation and maintenance and cope with complex operation and maintenance scenes. In the intelligent operation and maintenance, the operation and maintenance event is structured by using an entity extraction method, so that the management of the operation and maintenance event is facilitated.

In the related art, many methods for extracting entities from events exist, but most of them target the general domain, and only a few are designed for extracting entities from operation and maintenance events. As a result, information specific to the operation and maintenance field cannot be exploited, and the validity of the extracted entities is low.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a method and an apparatus for entity extraction, which can make full use of the characteristics of the operation and maintenance field, and make the effectiveness of the extracted entity higher when the entity is extracted in the operation and maintenance event, so as to implement efficient management of the operation and maintenance event.

To achieve the above object, according to an aspect of the embodiments of the present invention, there is provided a training method for an entity extraction model, including:

acquiring a plurality of historical operation and maintenance text data;

preprocessing the historical operation and maintenance text data to obtain preprocessed historical operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by adopting an operation and maintenance stop word stock;

labeling entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results;

and training to obtain an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling result.

Optionally, the operation and maintenance stop word library is obtained by the following method:

acquiring a plurality of first operation and maintenance text data;

classifying the plurality of first operation and maintenance text data to obtain a category corresponding to each first operation and maintenance text data;

matching the first operation and maintenance text data in each category by using a log template corresponding to each category to obtain effective words in the first operation and maintenance text data;

and determining the operation and maintenance stop word stock according to the first operation and maintenance text data and the effective words in the first operation and maintenance text data.

Optionally, labeling the entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results includes:

for preprocessed first historical operation and maintenance text data that satisfies a preset mode, extracting an entity and an entity type from the first historical operation and maintenance text data using the preset mode, so as to obtain a labeling result of the first historical operation and maintenance text data.

Optionally, after extracting the entity and the entity type from the first historical operation and maintenance text data by adopting a preset mode, the method further includes:

obtaining n-element fragments according to the extracted entity and entity type;

and denoising the extracted entity and entity type according to the occurrence frequency of each fragment among the n-element fragments.

Optionally, labeling the entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results further includes:

for preprocessed second historical operation and maintenance text data that does not satisfy the preset mode, if a word identical to an extracted entity exists in the second historical operation and maintenance text data, determining the entity and entity type of the second historical operation and maintenance text data according to the extracted entity and entity type, so as to obtain a labeling result of the second historical operation and maintenance text data.

Optionally, determining the entity and the entity type of the second historical operation and maintenance text data according to the extracted entity and the entity type includes:

taking the extracted entity as the entity of the second historical operation and maintenance text data;

and under the condition that the entity of the second historical operation and maintenance text data corresponds to a plurality of entity types, taking the entity type with the highest occurrence frequency in the plurality of entity types as the entity type corresponding to the entity of the second historical operation and maintenance text data.

Optionally, the preprocessed historical operation and maintenance text data comprises part-of-speech tagging results, and training to obtain an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling result includes:

word embedding is carried out on the preprocessed historical operation and maintenance text data, so that word embedding vectors are obtained;

determining a part-of-speech vector according to the part-of-speech tagging result;

weighting the word embedding vector according to the part-of-speech vector to obtain a weighted result;

inputting the weighted result into an entity identification network layer to obtain an entity extraction result corresponding to the historical operation and maintenance text, wherein the entity extraction result comprises the probabilities of entities and entity types;

training according to the entity extraction result and the labeling result to obtain the entity extraction model.

In another aspect of the embodiment of the present invention, there is provided a method for entity extraction, including:

acquiring operation and maintenance text data;

preprocessing the operation and maintenance text data to obtain preprocessed operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, de-stop word processing and part-of-speech tagging, and the de-stop word processing is performed by adopting an operation and maintenance stop word stock;

inputting the preprocessed operation and maintenance text data into an entity extraction model to obtain an entity and an entity type corresponding to the operation and maintenance text data;

wherein the entity extraction model is obtained by the training method of the entity extraction model according to the embodiment of the invention.

In still another aspect of the embodiment of the present invention, there is provided a training apparatus for entity extraction model, including:

the first acquisition module acquires a plurality of historical operation and maintenance text data;

the first preprocessing module is used for preprocessing the historical operation and maintenance text data to obtain preprocessed historical operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by adopting an operation and maintenance stop word library;

the labeling module is used for labeling entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results;

and the training module trains and obtains an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling result.

According to still another aspect of an embodiment of the present invention, there is provided an apparatus for entity extraction, including:

the second acquisition module acquires operation and maintenance text data;

the second preprocessing module is used for preprocessing the operation and maintenance text data to obtain preprocessed operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by adopting an operation and maintenance stop word library;

the determining module inputs the preprocessed operation and maintenance text data into the entity extraction model to obtain an entity and an entity type corresponding to the operation and maintenance text data;

the entity extraction model is obtained by adopting the training method of the entity extraction model in the embodiment of the invention.

According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including:

one or more processors;

storage means for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors are enabled to implement the training method of the entity extraction model or the entity extraction method provided by the invention.

According to still another aspect of the embodiments of the present invention, there is provided a computer readable medium having stored thereon a computer program, which when executed by a processor, implements the training method of entity extraction model or the entity extraction method provided by the present invention.

One embodiment of the above invention has the following advantages or benefits: a plurality of historical operation and maintenance text data are acquired and preprocessed, entities and entity types are labeled to obtain labeling results, an entity extraction model is trained on the preprocessed historical operation and maintenance text data and the labeling results, and the entity extraction model is then used to extract the entities and entity types of operation and maintenance text data. The preprocessing comprises word segmentation, stop word removal and part-of-speech tagging. Because the stop word removal uses an operation and maintenance stop word library, the characteristics of the operation and maintenance field are fully utilized, which improves the accuracy and validity of entity extraction in that field; combining part-of-speech tagging further improves the validity of the extracted entities.

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main flow of a training method of entity extraction model according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the main flow of another training method of entity extraction model according to an embodiment of the invention;

FIG. 3 is a schematic diagram of the main flow of a method of entity extraction according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the main modules of a training apparatus for entity extraction models according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the main modules of an entity extraction apparatus according to an embodiment of the present invention;

FIG. 6 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;

FIG. 7 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

FIG. 1 is a schematic diagram of the main flow of a training method of an entity extraction model according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S101: acquiring a plurality of historical operation and maintenance text data;

step S102: preprocessing the historical operation and maintenance text data to obtain preprocessed historical operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by adopting an operation and maintenance stop word library;

step S103: labeling entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results;

step S104: and training to obtain an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling result.

In the embodiment of the invention, the historical operation and maintenance text data is text data corresponding to operation and maintenance events in an operation and maintenance scene, where an operation and maintenance event is an event that affects the running stability of an application. The historical operation and maintenance text data may be alarm log data, static relation data, problems derived from business feedback, and the like. The static relation data may include dependency relation data, communication relation data and the like among applications and components, and may be stored in a CMDB (Configuration Management Database). The problems derived from business feedback may be configured alarm data and event problem data fed back from manual investigation.

In the embodiment of the present invention, the plurality of historical operation and maintenance text data may be acquired within a preset time range. After the plurality of historical operation and maintenance text data are acquired, each of them is preprocessed, where the preprocessing includes data cleaning of the historical operation and maintenance text data. The preprocessing comprises word segmentation, stop word removal and part-of-speech tagging. The word segmentation comprises a first word segmentation and a second word segmentation. Specifically, a word segmentation tool is first used to perform the first word segmentation on the historical operation and maintenance text data. Because the historical operation and maintenance text data may mix Chinese and English, both languages must be considered during data cleaning, and data normalization is performed after the first word segmentation, such as converting uppercase to lowercase and removing special symbols such as division signs and question marks. The second word segmentation is then performed, namely token segmentation using the Python NLTK (Natural Language Toolkit) package, to obtain a series of words and phrases. Stop word removal is then performed, and part-of-speech tagging is applied to the cleaned historical operation and maintenance text data to obtain the part of speech of each word; the part-of-speech tagging is implemented with Stanford CoreNLP (a natural language analysis tool) called from Python.
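The preprocessing order described above (normalize, tokenize, remove stop words, tag parts of speech) can be illustrated with a minimal, dependency-free sketch. It is not the embodiment's implementation: a regular expression stands in for the word segmentation tool and NLTK, a small lookup table stands in for Stanford CoreNLP, and the names `OPS_STOP_WORDS`, `TOY_POS` and `preprocess` are hypothetical.

```python
import re

# Hypothetical stand-ins: a real system would load the operation and
# maintenance stop word library and call a real part-of-speech tagger.
OPS_STOP_WORDS = {"please", "the"}
TOY_POS = {"gateway": "NN", "timeout": "NN"}

# English/number tokens may contain inner '.', '_' or '-' (for example
# "xx-xx-xx-gateway" or "127.0.0.1"); each CJK character is kept as its
# own token, which is only a crude substitute for Chinese segmentation.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[._-][a-z0-9]+)*|[\u4e00-\u9fff]")


def preprocess(text, stop_words=OPS_STOP_WORDS, pos_lookup=TOY_POS):
    """Return a list of (token, part_of_speech) pairs."""
    text = text.lower()                           # uppercase to lowercase
    text = re.sub(r"[÷/?？!！]", " ", text)        # drop special symbols
    tokens = TOKEN_RE.findall(text)               # token segmentation
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    return [(t, pos_lookup.get(t, "UNK")) for t in tokens]


print(preprocess("Please check: the Gateway TIMEOUT?"))
```

Tokens that the toy lookup does not know are tagged `UNK`; a real tagger would assign a part of speech to every token.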

In the embodiment of the invention, the stop word removal is performed using an operation and maintenance stop word library. This library consists of stop words specific to the operation and maintenance field, extracted from massive operation and maintenance log scenarios, which makes entity extraction more specialized. Specifically, the operation and maintenance stop word library is obtained by the following method:

acquiring a plurality of first operation and maintenance text data;

classifying the plurality of first operation and maintenance text data to obtain a category corresponding to each first operation and maintenance text data;

matching the first operation and maintenance text data in each category by using a log template corresponding to each category to obtain effective words in the first operation and maintenance text data;

and determining the operation and maintenance stop word library according to the first operation and maintenance text data and the effective words in the first operation and maintenance text data.

In the embodiment of the invention, when the operation and maintenance stop word library is constructed, a plurality of first operation and maintenance text data within a preset time range are acquired and then classified. The classification may be implemented with a clustering algorithm, yielding a plurality of categories and the first operation and maintenance text data belonging to each category. Log templates are then extracted from the first operation and maintenance text data of each category. A plurality of log templates may be extracted for each category, and each log template is extracted from, and therefore corresponds to, a plurality of first operation and maintenance text data. Each log template is then matched against each first operation and maintenance text data corresponding to it, so as to obtain the effective words in that text data, an effective word being a word that exists in the log template.

The operation and maintenance stop word library is then determined according to the first operation and maintenance text data and the effective words. Specifically, part of the first operation and maintenance text data corresponding to a log template is selected, and the words in it other than the effective words are tagged with parts of speech. The nouns, verbs and adjectives among them are added to the log template, and whether each added noun, verb or adjective is an effective word is judged manually. Where they are, the remaining words other than the effective words are marked as stop words, and the result is extended to all the first operation and maintenance text data to obtain the operation and maintenance stop word library.

When the operation and maintenance stop word library is maintained, a plurality of first operation and maintenance text data may be acquired at preset time intervals so as to update the library.
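As a rough illustration of the construction above, the following sketch treats the words of a log template as effective words and marks every other word in the matching logs as a stop word. The clustering and template extraction steps are assumed to have already run, the manual review is modeled as a plain set of confirmed words, and the function name `build_stop_words` and its parameters are hypothetical.

```python
def build_stop_words(logs_by_template, manually_confirmed_valid=frozenset()):
    """Collect stop words from tokenized logs grouped by log template.

    logs_by_template: list of (template_words, logs) pairs, where
        template_words is the set of words in a log template and logs is
        a list of token lists matched by that template.
    manually_confirmed_valid: nouns, verbs and adjectives that a human
        reviewer confirmed as effective words (stands in for the manual
        judgment step).
    """
    stop_words = set()
    for template_words, logs in logs_by_template:
        # Effective words: template words plus manually confirmed ones.
        valid = set(template_words) | set(manually_confirmed_valid)
        for tokens in logs:
            # Everything that is not an effective word becomes a stop word.
            stop_words.update(t for t in tokens if t not in valid)
    return stop_words


grouped = [
    ({"application", "timeout"},
     [["the", "application", "gateway", "timeout", "please", "check"]]),
]
print(sorted(build_stop_words(grouped, manually_confirmed_valid={"gateway"})))
```

Re-running the function on newly collected logs at a fixed interval would correspond to the periodic update of the library.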

In the embodiment of the invention, labeling the entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results comprises:

for preprocessed first historical operation and maintenance text data that satisfies the preset mode, extracting the entity and the entity type from the first historical operation and maintenance text data using the preset mode, so as to obtain a labeling result of the first historical operation and maintenance text data.

In the embodiment of the invention, after the preprocessed historical operation and maintenance text data is obtained, the labeling result of each historical operation and maintenance text data, namely the entities and entity types in it, needs to be determined, so that the entity extraction model can be trained from the historical operation and maintenance text data and the labeling results. Labeling the preprocessed historical operation and maintenance text data means labeling named entities, and to reduce the labeling cost the labeling may be done in an unsupervised manner. For preprocessed first historical operation and maintenance text data that conforms to the preset mode, the entities and entity types can be labeled using the preset mode. The preset mode may be a key-value mode: the entity and the entity type are both recorded in the first historical operation and maintenance text data, so they can be extracted automatically as key-value pairs. For example, consider the text data "task id:155946039962 application name: xx-xx-xx-gateway alert dimension: [ unified log ] xx-xx. Jsf. Gd. Error. ClientTimeou Exception first alert time: 2022-08-16 16:41:47 processing opinion: the ClientTimeou Exception exception occurs frequently but is not urgent. The exception may be captured and handled separately after troubleshooting the service timeout setting." From this text, the entity xx-xx-xx-gateway and the corresponding entity type "application name" can be extracted automatically according to the key-value mode.
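One way to picture key-value extraction is the sketch below, which assumes that the set of keys (entity types) is known in advance and that each key is followed by a colon. Both assumptions, together with the names `KNOWN_KEYS` and `extract_key_values`, are illustrative and are not specified by the embodiment.

```python
import re

# Hypothetical list of keys that act as entity types.
KNOWN_KEYS = ["task id", "application name", "alert dimension", "first alert time"]


def extract_key_values(text, keys=KNOWN_KEYS):
    """Return (entity, entity_type) pairs found as 'key: value' in text."""
    # Longest keys first so that a longer key wins over a shorter prefix.
    alternation = "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))
    pattern = re.compile(rf"({alternation})\s*:\s*")
    matches = list(pattern.finditer(text))
    pairs = []
    for i, m in enumerate(matches):
        # A value runs until the next known key or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[m.end():end].strip()
        if value:
            pairs.append((value, m.group(1)))
    return pairs


sample = ("task id:155946039962 application name: xx-xx-xx-gateway "
          "alert dimension: [unified log] first alert time: 2022-08-16 16:41:47")
for entity, entity_type in extract_key_values(sample):
    print(entity_type, "->", entity)
```

Splitting on known keys rather than on every colon keeps values that themselves contain colons, such as timestamps, intact.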

In the embodiment of the invention, when the preprocessed historical operation and maintenance text data is labeled, the plurality of preprocessed historical operation and maintenance text data may first be clustered to obtain the category of each one, and a plurality of log templates are then extracted for each category. The log templates are updated according to the templates extracted for the category and the historical operation and maintenance text data, so that a plurality of updated log templates are obtained for each category. The entities and entity types in an updated log template may be labeled manually or using the preset mode, which in turn labels the entities and entity types of the part of the preprocessed historical operation and maintenance text data corresponding to that template. The labeling result is then extended to all the preprocessed historical operation and maintenance text data corresponding to the updated log template, and so on, so that the labeling results of the historical operation and maintenance text data of every category are obtained. The log templates may be extracted with the FT-Tree algorithm.
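The step of extending a template-level labeling result to every log that matches the template might look like the following sketch. It does not implement FT-Tree. It assumes templates are token lists in which a hypothetical wildcard token `<*>` marks a variable slot, and that labeling a template means assigning an entity type to some slot positions.

```python
WILDCARD = "<*>"  # hypothetical marker for a variable slot in a template


def match_template(template, tokens):
    """True if tokens has the template's length and agrees on all constants."""
    if len(template) != len(tokens):
        return False
    return all(t == WILDCARD or t == tok for t, tok in zip(template, tokens))


def propagate_labels(template, slot_types, logs):
    """Copy slot-level entity types from a labeled template to each log.

    slot_types maps a token position in the template to an entity type.
    Logs that do not match the template receive an empty label list.
    """
    results = []
    for tokens in logs:
        if match_template(template, tokens):
            results.append([(tokens[i], etype) for i, etype in sorted(slot_types.items())])
        else:
            results.append([])
    return results


template = ["application", WILDCARD, "timeout", "on", WILDCARD]
slot_types = {1: "application name", 4: "ip"}
logs = [
    ["application", "xx-xx-xx-gateway", "timeout", "on", "10.0.0.1"],
    ["disk", "full"],
]
print(propagate_labels(template, slot_types, logs))
```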

In the embodiment of the present invention, after extracting the entity and the entity type from the first historical operation and maintenance text data using the preset mode, the method further includes:

obtaining n-element fragments according to the extracted entity and entity type;

and denoising the extracted entity and entity type according to the occurrence frequency of each fragment among the n-element fragments.

In the embodiment of the invention, the entities and entity types obtained using the preset mode are easily affected by noise, so the extracted entities and entity types need to be cleaned, that is, denoised, to improve the accuracy of entity extraction. N-element fragments can be extracted from the extracted entities and entity types, where n is a natural number that may be 1, 2 or 3. The occurrence frequency of each fragment is then determined, and the fragment with the highest occurrence frequency is used as the entity type corresponding to the entity. For example, in "application name 127.0.0.1: xx-xx-xx-gateway" and "10.0.0.1 application name: xx-xx-xx-gateway", the ip address is noise. The candidate set of extracted entity names and entity types is therefore processed further: n-element fragments (n from 1 to 3) are extracted from the candidate set, and the K n-element fragments with the highest frequency are kept, where K is a manually set parameter, which achieves the denoising. The former data can be turned into the 3-element fragment ["127.0.0.1", "application name", "xx-xx-xx-gateway"]; if "application name" appears most frequently in the candidate set, that entity and entity type are retained and the interference is removed. When the n-element fragments are extracted, the extracted entities and entity types may be matched against the ip set and the xx-xx-xx-gateway set stored in the CMDB, and the fragments are extracted according to the matching result.
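A minimal sketch of frequency-based denoising with n-element fragments follows. It assumes each candidate has already been split into tokens (for instance by matching against sets stored in the CMDB), counts all fragments with n from 1 to 3 across the candidate set, keeps the K most frequent fragments, and drops tokens that are not covered by any kept fragment. The function names and the choice to filter at token level are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-element fragments of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def denoise(candidates, k, max_n=3):
    """Keep only tokens that occur in one of the k most frequent fragments."""
    counts = Counter()
    for tokens in candidates:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    kept = {tok for fragment, _ in counts.most_common(k) for tok in fragment}
    return [[tok for tok in tokens if tok in kept] for tokens in candidates]


candidates = [
    ["127.0.0.1", "application name", "xx-xx-xx-gateway"],
    ["10.0.0.1", "application name", "xx-xx-xx-gateway"],
    ["application name", "xx-xx-xx-gateway"],
]
print(denoise(candidates, k=3))
```

In this example the two ip addresses each occur once, while "application name", "xx-xx-xx-gateway" and their pair occur three times, so with K = 3 only the latter survive.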

In the embodiment of the present invention, labeling the entity and the entity type corresponding to the entity in the preprocessed historical operation and maintenance text data to obtain a labeling result may include: and for the preprocessed second historical operation and maintenance text data which does not meet the preset mode, if the words which are the same as the extracted entities exist in the second historical operation and maintenance text data, determining the entities and the entity types of the second historical operation and maintenance text data according to the extracted entities and the entity types so as to obtain the labeling result of the second historical operation and maintenance text data.

In the embodiment of the invention, extracting entities and entity types using a preset mode does not suit all data, and the data matching such a mode is rather special and prone to cause overfitting during model training. Therefore, for preprocessed second historical operation and maintenance text data that does not conform to the preset mode, entities and entity types can be labeled by label expansion. Specifically, if the second historical operation and maintenance text data contains a character or word identical to an extracted entity, that is, identical to an entity extracted from the first historical operation and maintenance text data, the entities and entity types extracted from the first historical operation and maintenance text data can be used to label the second historical operation and maintenance text data. If the extracted entity corresponds to a single entity type, that is, the entity has the same entity type in every first historical operation and maintenance text data in which it occurs, that entity and entity type can be used directly as the entity and entity type of the second historical operation and maintenance text data. In other words, once a word in some text data has been labeled as an entity of a certain entity type, the same character or word appearing in any other text data is labeled as the same entity with the same entity type. For example, "xx-xx-xx-gateway" in all text data may be labeled with the entity type "application name".
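Label expansion for an entity with a single known type amounts to a dictionary lookup over the tokens of the unlabeled text. The sketch below assumes the text is already tokenized and that `entity_types` holds the entities extracted from the first historical text data; both names are hypothetical.

```python
def expand_labels(tokens, entity_types):
    """Label every token that equals a previously extracted entity.

    entity_types maps an extracted entity to its single entity type.
    """
    return [(tok, entity_types[tok]) for tok in tokens if tok in entity_types]


entity_types = {"xx-xx-xx-gateway": "application name"}
second_text = ["call", "to", "xx-xx-xx-gateway", "failed"]
print(expand_labels(second_text, entity_types))
```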

In the embodiment of the present invention, determining the entity and the entity type of the second historical operation and maintenance text data according to the extracted entity and entity type includes: taking the extracted entity as the entity of the second historical operation and maintenance text data, and, where that entity corresponds to a plurality of entity types, taking the entity type with the highest occurrence frequency among them as the entity type corresponding to the entity, so as to obtain the labeling result of the second historical operation and maintenance text data. Alternatively, the time at which each of the plurality of entity types was labeled can be obtained, and the entity type with the latest time is used as the entity type corresponding to the entity of the second historical operation and maintenance text data. An entity of the second historical operation and maintenance text data corresponds to a plurality of entity types when a plurality of first historical operation and maintenance text data contain the same entity but label it with different entity types. The plurality of first historical operation and maintenance text data may be operation and maintenance text data acquired at different times.

That is, if the same entity of some text data is labeled as different entity types at different times, or the same entity is labeled as different entity types in different text data, the occurrence frequency of each entity type is calculated, that is, how often each entity type occurs across the first historical operation and maintenance text data. The entity type with the highest occurrence frequency is then used as the entity type corresponding to the entity; alternatively, the entity type with the latest time, that is, the one closest to the current time, is used. For example, suppose an ip address is labeled as entity type a in historical operation and maintenance text data A, as entity type b in historical operation and maintenance text data B, and as entity type a in historical operation and maintenance text data C, where A, B and C may be operation and maintenance text data acquired at different times. Then, when the ip address occurs in some historical operation and maintenance text data, it is labeled as entity type a.
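The two ways of resolving a conflict between entity types, by frequency or by recency, can be sketched as follows. Each observation is assumed to be an `(entity_type, timestamp)` pair with ISO-formatted date strings, which compare correctly as plain strings; the function names and the data layout are illustrative.

```python
from collections import Counter


def resolve_by_frequency(observations):
    """Return the entity type that was assigned most often."""
    counts = Counter(entity_type for entity_type, _ in observations)
    return counts.most_common(1)[0][0]


def resolve_by_recency(observations):
    """Return the entity type with the latest labeling time."""
    return max(observations, key=lambda obs: obs[1])[0]


# The same ip address labeled in three historical texts A, B and C.
observations = [("a", "2022-08-14"), ("b", "2022-08-15"), ("a", "2022-08-16")]
print(resolve_by_frequency(observations))
print(resolve_by_recency(observations))
```

The two rules can disagree: with observations `[("a", "2022-08-14"), ("a", "2022-08-15"), ("b", "2022-08-16")]`, frequency selects `a` while recency selects `b`.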

In the embodiment of the invention, as shown in FIG. 2, the preprocessed historical operation and maintenance text data comprises part-of-speech tagging results, and training to obtain an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling result comprises the following steps:

step S201: word embedding is carried out on the preprocessed historical operation and maintenance text data, so that word embedding vectors are obtained;

step S202: determining part-of-speech vectors according to the part-of-speech tagging results;

step S203: weighting the word embedding vector according to the part-of-speech vector to obtain a weighted result;

step S204: inputting the weighted result into an entity identification network layer to obtain an entity extraction result corresponding to the historical operation and maintenance text, wherein the entity extraction result comprises the probability of an entity and an entity type;

step S205: training according to the entity extraction result and the labeling result to obtain an entity extraction model.

In the embodiment of the invention, a BERT (Bidirectional Encoder Representations from Transformers) pre-training model can be used to perform word embedding on the preprocessed historical operation and maintenance text data to obtain word embedding vectors. The part-of-speech tagging result is obtained by tagging with Stanford CoreNLP and includes the weight of each part of speech; a part-of-speech vector is obtained from the weight of the part of speech of each word, and the part-of-speech vector is used to weight the word embedding vector to obtain a weighted result, which is the input of the model. The weighted result is input into an entity identification network layer to obtain an entity extraction result, wherein the entity extraction result includes the probability of an entity and an entity type. The entity identification network layer may comprise a BiLSTM (Bi-directional Long Short-Term Memory network) layer plus a CRF (conditional random field) layer, an LSTM (Long Short-Term Memory network) layer plus a CRF layer, and the like. The requirements on entities and entity probabilities differ between scenarios in the operation and maintenance field, so different probability thresholds can be set for different scenarios, and the model can be trained according to the probability of the entity and the entity type in the entity extraction result and the probability threshold.

In the embodiment of the invention, the part-of-speech tagging result is obtained by using Stanford CoreNLP, wherein the weight of each part of speech can be user-defined or obtained by training a classification model, and different parts of speech correspond to different weights.
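A minimal sketch of the weighting step follows. The text does not state how the part-of-speech vector is combined with the word embedding vector; scaling each token's embedding by the scalar weight of its part of speech is an assumption made here, as are the weight values, the default weight, and the function names.

```python
# Hypothetical per-tag weights; the text says these may be user-defined
# or learned by a classification model.
POS_WEIGHTS = {"NN": 1.0, "NNP": 1.2, "VB": 0.8}
DEFAULT_WEIGHT = 0.5  # assumed weight for tags missing from the table

def pos_vector(pos_tags):
    """Map each token's part-of-speech tag to its weight."""
    return [POS_WEIGHTS.get(tag, DEFAULT_WEIGHT) for tag in pos_tags]

def weight_embeddings(embeddings, pos_tags):
    """Scale each token embedding by the weight of its part of speech.

    `embeddings` is a list of per-token vectors (lists of floats), for
    example the output of a pre-trained encoder.
    """
    weights = pos_vector(pos_tags)
    if len(embeddings) != len(weights):
        raise ValueError("one part-of-speech tag is required per token")
    return [[w * x for x in vec] for vec, w in zip(embeddings, weights)]

weighted = weight_embeddings([[1.0, 2.0], [2.0, 4.0]], ["NNP", "IN"])
print(weighted)
```

In a full pipeline the weighted vectors would be passed to the entity identification network layer (for example BiLSTM plus CRF) in place of the raw embeddings.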

Training according to the entity extraction result and the labeling result includes: comparing the entity extraction result with the labeling result, calculating a loss based on a loss function such as a cross entropy loss function, performing back propagation, and optimizing the model parameters through gradient descent to obtain the entity extraction model.
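As an illustration of the loss calculation only, the following sketch computes a token-level cross entropy between predicted entity-type probabilities and labeled types. The function name and input shapes are assumptions; back propagation and gradient descent are left to a deep learning framework, and a network ending in a CRF layer is commonly trained with a sequence-level negative log-likelihood rather than this per-token form.

```python
import math

def cross_entropy(predicted, gold_indices):
    """Average cross entropy over tokens.

    `predicted` holds one probability distribution over entity types per
    token; `gold_indices` holds the index of the labeled type per token.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for probs, gold in zip(predicted, gold_indices):
        total -= math.log(max(probs[gold], eps))
    return total / len(gold_indices)

# Two tokens, three candidate entity types each.
loss = cross_entropy([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1])
print(loss)
```

The loss falls toward zero as the probability assigned to each labeled type approaches one.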

In the embodiment of the invention, during training of the entity extraction model, the model can be validated on historical operation and maintenance text data by extracting the entities and entity types in that data. When the accuracy and the recall rate of the entity extraction model reach preset thresholds, the trained entity extraction model is obtained, and the extracted entities and entity types are added to the log templates corresponding to the historical operation and maintenance text data, so as to improve the accuracy of subsequent labeling of entities and entity types.
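The stopping check can be sketched as below. Reading "accuracy" as precision over extracted (entity, entity type) pairs is an assumption, and the function names and threshold values are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (entity, entity_type) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def meets_thresholds(predicted, gold, min_precision, min_recall):
    """Return True when both metrics reach their preset thresholds."""
    precision, recall = precision_recall(predicted, gold)
    return precision >= min_precision and recall >= min_recall

predicted = [("10.0.0.5", "ip"), ("gateway", "application")]
gold = [("10.0.0.5", "ip"), ("database", "application")]
print(precision_recall(predicted, gold))
print(meets_thresholds(predicted, gold, min_precision=0.9, min_recall=0.9))
```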

In another aspect of the embodiment of the present invention, as shown in fig. 3, there is provided a method for entity extraction, including:

Step S301: acquiring operation and maintenance text data;

step S302: preprocessing the operation and maintenance text data to obtain preprocessed operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by using an operation and maintenance stop word library;

step S303: inputting the preprocessed operation and maintenance text data into an entity extraction model to obtain an entity and an entity type corresponding to the operation and maintenance text data.

In the embodiment of the invention, when entity extraction is performed by using the entity extraction model, operation and maintenance text data is obtained and preprocessed as follows. The operation and maintenance text data is segmented by a word segmentation tool and then normalized, where normalization includes converting English uppercase letters to lowercase and removing symbols such as question marks, while colons need not be removed; the text is then tokenized by using the nltk package. Stop word removal is then performed by using the operation and maintenance stop word library, which can improve the effectiveness of the extracted entities. Parts of speech are extracted by using the Stanford CoreNLP package in python to obtain part-of-speech tagging results, thereby obtaining the preprocessed operation and maintenance text data, which is input into the entity extraction model to obtain the entity and the entity type corresponding to the operation and maintenance text data. Specifically, inputting the operation and maintenance text data into the entity extraction model comprises the following steps: performing word embedding on the operation and maintenance text data by using the BERT pre-training model to obtain word embedding vectors, obtaining part-of-speech vectors from the part-of-speech tagging results, weighting the word embedding vectors by the part-of-speech vectors, and inputting the weighted result into the BiLSTM layer and the CRF layer to obtain the entity and the entity type.
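The normalization and stop word steps can be illustrated with a standard-library stand-in. The embodiment names nltk for tokenization and Stanford CoreNLP for part-of-speech tagging; this sketch replaces tokenization with whitespace splitting and omits tagging. The stop word set, the function names, and the choice to also keep dots, hyphens, underscores and slashes (so that ip addresses and application names survive) are assumptions.

```python
import re

# Hypothetical operation and maintenance stop words.
OPS_STOP_WORDS = {"the", "is", "at", "please"}

def normalize(text):
    """Lowercase the text and drop symbols such as '?', keeping colons."""
    text = text.lower()
    # Keep word characters, whitespace, and : . - / ; replace the rest.
    return re.sub(r"[^\w\s:.\-/]", " ", text)

def preprocess(text, stop_words=OPS_STOP_WORDS):
    """Normalize, split on whitespace, and remove stop words."""
    tokens = normalize(text).split()
    return [token for token in tokens if token not in stop_words]

tokens = preprocess("Disk usage at 10.0.0.5 is HIGH? status: critical")
print(tokens)
```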

In the embodiment of the invention, before the operation and maintenance text data or the historical operation and maintenance text data is preprocessed, the text data can be matched against sets of specific values stored in a CMDB, such as a set of application names (for example, xx-xx-xx-gateway) and a set of ip addresses. The specific values of application names, ip addresses and the like found in the text data are removed, and after preprocessing is completed the removed values are restored into the preprocessed text data. This prevents word segmentation processing and stop word removal processing during preprocessing from affecting the accuracy and effectiveness of subsequent entity extraction, and thereby improves the accuracy and effectiveness of the entity extraction results.
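One way to remove the CMDB values before preprocessing and restore them afterwards is to swap each value for a placeholder token, sketched below. The placeholder scheme, the function names, and the in-memory set standing in for a CMDB query are assumptions, not details given in the text.

```python
def mask_known_values(text, known_values):
    """Replace each known value found in the text with a placeholder token.

    Returns the masked text and a placeholder-to-value mapping. Longer
    values are handled first so a value containing another is masked whole.
    """
    mapping = {}
    ordered = sorted(known_values, key=lambda v: (-len(v), v))
    for index, value in enumerate(ordered):
        if value in text:
            placeholder = f"cmdbvalue{index}"  # lowercase, no symbols
            text = text.replace(value, placeholder)
            mapping[placeholder] = value
    return text, mapping

def restore_known_values(tokens, mapping):
    """Put the original values back in place of their placeholders."""
    return [mapping.get(token, token) for token in tokens]

# Values that would come from the CMDB (hypothetical).
cmdb_values = {"xx-xx-xx-gateway", "10.0.0.5"}
masked, mapping = mask_known_values("restart xx-xx-xx-gateway on 10.0.0.5 now", cmdb_values)
tokens = masked.split()  # stands in for the full preprocessing step
print(restore_known_values(tokens, mapping))
```

Because the placeholders are plain lowercase alphanumeric tokens, lowercasing, symbol removal and word segmentation leave them intact, so the original values can be restored by exact token lookup.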

According to the entity extraction method of the embodiment of the invention, a plurality of historical operation and maintenance text data are obtained and preprocessed, entities and entity types are labeled to obtain labeling results, an entity extraction model is trained from the preprocessed historical operation and maintenance text data and the labeling results, and the entity extraction model is then used to extract the entities and entity types of operation and maintenance text data. The preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by using an operation and maintenance stop word library, so that the characteristics of the operation and maintenance field are fully utilized and the accuracy and effectiveness of entity extraction in the operation and maintenance field are improved; combining part-of-speech tagging further increases the effectiveness of the extracted entities. Constructing the stop word library and labeling the entities and entity types are carried out in an unsupervised manner, which improves labeling efficiency and reduces labeling cost; labeling the historical operation and maintenance text data through label expansion improves labeling efficiency and accuracy, so that operation and maintenance events can be managed efficiently.

In still another aspect of the embodiment of the present invention, as shown in fig. 4, there is provided a training apparatus 400 for entity extraction model, including:

a first obtaining module 401 for obtaining a plurality of historical operation and maintenance text data;

the first preprocessing module 402 is used for preprocessing the historical operation and maintenance text data to obtain preprocessed historical operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word processing and part-of-speech tagging, and the stop word processing is performed by adopting an operation and maintenance stop word library;

the labeling module 403 labels the entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results;

and the training module 404 trains to obtain the entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling result.

In the embodiment of the present invention, in the first preprocessing module 402, the operation and maintenance stop word library is obtained by the following method: acquiring a plurality of first operation and maintenance text data; classifying the plurality of first operation and maintenance text data to obtain a category corresponding to each first operation and maintenance text data; matching the first operation and maintenance text data in each category by using a log template corresponding to each category to obtain effective words in the first operation and maintenance text data; and determining the operation and maintenance stop word library according to the first operation and maintenance text data and the effective words in the first operation and maintenance text data.
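The final step, deriving the stop word library from the text data and its effective words, is not spelled out in this passage. One plausible reading, shown here purely as an assumption, is a set difference: words that occur in the first operation and maintenance text data but were never identified as effective become stop words. The function name and the sample data are hypothetical, and classification and log template matching are assumed to have already produced the effective words.

```python
def build_stop_words(texts, effective_words_per_text):
    """Collect words seen in the texts that are never marked effective.

    `effective_words_per_text` holds, for each text, the effective words
    obtained by matching that text against its category's log template.
    """
    vocabulary = set()
    effective = set()
    for text, words in zip(texts, effective_words_per_text):
        vocabulary.update(text.lower().split())
        effective.update(word.lower() for word in words)
    return vocabulary - effective

texts = ["service gateway is down", "host 10.0.0.5 is unreachable"]
effective = [["gateway", "down"], ["10.0.0.5", "unreachable"]]
print(sorted(build_stop_words(texts, effective)))
```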

In the embodiment of the present invention, the labeling module 403 is further configured to: for preprocessed first historical operation and maintenance text data that satisfies a preset pattern, extract the entity and the entity type from the first historical operation and maintenance text data by using the preset pattern, so as to obtain a labeling result of the first historical operation and maintenance text data.

In the embodiment of the present invention, the labeling module 403 is further configured to: extract an entity and an entity type from the preprocessed historical operation and maintenance text data by using the preset pattern, and obtain n-gram fragments according to the extracted entity and the extracted entity type; and denoise the extracted entity and entity type according to the occurrence frequency of each fragment among the n-gram fragments.

In the embodiment of the present invention, the labeling module 403 is further configured to: for preprocessed second historical operation and maintenance text data that does not satisfy the preset pattern, if the second historical operation and maintenance text data contains a word identical to an extracted entity, determine the entity and the entity type of the second historical operation and maintenance text data according to the extracted entity and entity type, so as to obtain the labeling result of the second historical operation and maintenance text data.

In the embodiment of the present invention, the labeling module 403 is further configured to: taking the extracted entity as the entity of the second historical operation and maintenance text data; and under the condition that the entity of the second historical operation and maintenance text data corresponds to a plurality of entity types, taking the entity type with the highest occurrence frequency in the plurality of entity types as the entity type corresponding to the entity of the second historical operation and maintenance text data.

In the embodiment of the invention, the preprocessed historical operation and maintenance text data comprises part-of-speech tagging results; the training module 404 is further configured to: perform word embedding on the preprocessed historical operation and maintenance text data to obtain word embedding vectors; determine part-of-speech vectors according to the part-of-speech tagging results; weight the word embedding vectors according to the part-of-speech vectors to obtain a weighted result; input the weighted result into an entity identification network layer to obtain an entity extraction result corresponding to the historical operation and maintenance text, wherein the entity extraction result comprises the probability of an entity and an entity type; and train according to the entity extraction result and the labeling result to obtain the entity extraction model.

According to yet another aspect of the embodiment of the present invention, as shown in fig. 5, there is provided an apparatus 500 for entity extraction, including:

a second obtaining module 501, which obtains operation and maintenance text data;

the second preprocessing module 502 is used for preprocessing the operation and maintenance text data to obtain preprocessed operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word processing and part-of-speech tagging, and the stop word processing is performed by adopting an operation and maintenance stop word library;

a determining module 503, which inputs the preprocessed operation and maintenance text data into the entity extraction model to obtain an entity and an entity type corresponding to the operation and maintenance text data.

According to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the training method of the entity extraction model or the entity extraction method provided by the invention.

According to still another aspect of the embodiments of the present invention, there is provided a computer readable medium having stored thereon a computer program, which when executed by a processor implements the training method of entity extraction model or the entity extraction method provided by the present invention.

Fig. 6 illustrates an exemplary system architecture 600 to which the training method or training apparatus of the entity extraction model, or the entity extraction method or entity extraction apparatus, of embodiments of the present invention may be applied.

As shown in fig. 6, the system architecture 600 may include terminal devices 601, 602, 603, a network 604, and a server 605. The network 604 is used as a medium to provide communication links between the terminal devices 601, 602, 603 and the server 605. The network 604 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 605 via the network 604 using the terminal devices 601, 602, 603 to receive or send messages, etc. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 601, 602, 603.

The terminal devices 601, 602, 603 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 605 may be a server providing various services, such as a background management server (by way of example only) providing support for shopping-type websites browsed by users using the terminal devices 601, 602, 603. The background management server may analyze and process received data such as a product information query request, and feed back the processing result (e.g., target push information or product information, by way of example only) to the terminal device.

It should be noted that, the training method of the entity extraction model or the entity extraction method provided by the embodiment of the present invention is generally executed by the server 605, and accordingly, the training device of the entity extraction model or the entity extraction device is generally disposed in the server 605.

It should be understood that the number of terminal devices, networks and servers in fig. 6 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 7, there is illustrated a schematic diagram of a computer system 700 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 7 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the system 700 are also stored. The CPU 701, ROM 702, and RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 708 including a hard disk or the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read therefrom is installed into the storage section 708 as necessary.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 709, and/or installed from the removable medium 711. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 701.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, as: a processor includes a first acquisition module, a first preprocessing module, a labeling module, and a training module. The names of these modules do not in some cases limit the module itself, and for example, the first acquisition module may also be described as a "module that acquires a plurality of historical operation and maintenance text data".

As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist alone without being fitted into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: acquiring a plurality of historical operation and maintenance text data; preprocessing the historical operation and maintenance text data to obtain preprocessed historical operation and maintenance text data, wherein the preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by using an operation and maintenance stop word library; labeling entities and entity types in the preprocessed historical operation and maintenance text data to obtain labeling results; and training to obtain an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling results.

According to the technical scheme of the embodiment of the invention, a plurality of historical operation and maintenance text data are obtained and preprocessed, entities and entity types are labeled to obtain labeling results, an entity extraction model is trained from the preprocessed historical operation and maintenance text data and the labeling results, and the entity extraction model is then used to extract the entities and entity types of operation and maintenance text data. The preprocessing comprises word segmentation processing, stop word removal processing and part-of-speech tagging, and the stop word removal processing is performed by using an operation and maintenance stop word library, so that the characteristics of the operation and maintenance field are fully utilized and the accuracy and effectiveness of entity extraction in the operation and maintenance field are improved; combining part-of-speech tagging further increases the effectiveness of the extracted entities. Constructing the stop word library and labeling the entities and entity types are carried out in an unsupervised manner, which improves labeling efficiency and reduces labeling cost; labeling the historical operation and maintenance text data through label expansion improves labeling efficiency and accuracy, so that operation and maintenance events can be managed efficiently.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for training an entity extraction model, comprising:

acquiring a plurality of historical operation and maintenance text data;

2. The training method according to claim 1, wherein the operation and maintenance stop word library is obtained by:

acquiring a plurality of first operation and maintenance text data;

3. The training method according to claim 1, wherein labeling the entity in the preprocessed historical operation and maintenance text data and the entity type corresponding to the entity to obtain a labeling result comprises:

4. The training method of claim 3, further comprising, after extracting the entity and the entity type from the first historical operation and maintenance text data using a predetermined pattern:

5. The training method of claim 3, wherein labeling the entity in the preprocessed historical operation and maintenance text data and the entity type corresponding to the entity to obtain a labeling result comprises:

6. The training method of claim 5, wherein determining the entity and entity type of the second historical operation and maintenance text data based on the extracted entity and entity type comprises:

7. The training method of claim 1, wherein the preprocessed historical operation and maintenance text data comprises part-of-speech tagging results; training to obtain an entity extraction model according to the preprocessed historical operation and maintenance text data and the labeling result, wherein the entity extraction model comprises the following steps:

8. A method of entity extraction, comprising:

acquiring operation and maintenance text data;

wherein the entity extraction model is obtained according to the training method of any one of claims 1-7.

9. A training device for entity extraction models, comprising:

10. An apparatus for entity extraction, comprising:

the second acquisition module acquires operation and maintenance text data;

wherein the entity extraction model is obtained by using the training method of any one of claims 1-7.

11. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs,

when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-8.

12. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-8.