CN108399157B - Dynamic extraction method of entity and attribute relationship, server and readable storage medium

Info

Publication number: CN108399157B
Application number: CN201711389560.0A
Authority: CN
Inventors: 陈虹; 董振江; 王宇; 龚乐君; 李涛
Original assignee: ZTE Corp; Nanjing University of Posts and Telecommunications
Current assignee: ZTE Corp; Nanjing University of Posts and Telecommunications
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2023-08-18
Anticipated expiration: 2037-12-21
Also published as: CN108399157A

Abstract

The application discloses a dynamic extraction method of entity and attribute relationship, which comprises the following steps: acquiring text data; based on the dynamic entity attribute relation library and the training model, each feature of the entity and the attribute is dynamically extracted from the text data. In addition, the application also provides a server and a readable storage medium, and the application constructs a dynamic entity attribute relation library and a training model and can automatically extract various characteristics of the entity and the attribute from the text data.

Description

Dynamic extraction method of entity and attribute relationship, server and readable storage medium

Technical Field

The application relates to the technical field of Internet, in particular to a dynamic extraction method of entity and attribute relationship, a server and a readable storage medium.

Background

With the rapid development of the Internet and the arrival of the big-data era, technology and business in certain specific fields, for example the telecommunications field, face both the opportunities and the challenges of technology upgrades and service updates. These fields generate a great deal of knowledge and specialized terminology and have become genuinely knowledge-dense industries. The volume of information in the telecommunications field keeps growing and has formed a very large, unordered information base, in which unstructured or semi-structured text data carries rich and valuable telecommunications information. Named entities are important linguistic units that carry information in a text and are indispensable for acquiring valuable information. Different entities have different attributes, while entities of the same kind have approximately the same attributes but different attribute values.

Named entity recognition includes the recognition of entities and the extraction of attributes. In the general domain, entity recognition classifies the entities in a text into semantic types. Existing methods fall into three main categories: dictionary-based, statistics-based and rule-based methods.

Dictionary-based methods find named entities in a lexicon mainly by string matching, but a comprehensive entity lexicon is generally not available and the comparison is time-consuming.

Rule-based methods add lexical, grammatical and semantic rules to the entity recognition process and recognize the various named entities by rule matching. However, rule-based methods are limited by the need to add rules manually.

Statistics-based methods are trained on manually annotated or raw corpora. They require that a language model be built first and that the model parameters then be estimated on training data, which makes them easier to port to different languages and new fields. Statistics-based methods mainly use statistical models such as hidden Markov models, maximum entropy models, support vector machines and conditional random fields.

The task of attribute extraction is to construct an attribute table for each semantic class of entity and to extract the attribute values. Attribute extraction mainly relies on pattern matching and statistics-based methods, but it has been studied far less than entity recognition. The prior art for extracting the relationships between entities and attributes therefore still has shortcomings.

Disclosure of Invention

The application mainly aims to provide a dynamic extraction method, a server and a readable storage medium for entity and attribute relationships, so as to solve the problem that the knowledge base and corpus in a specific technical field are incomplete.

In order to achieve the above object, the present application provides a method for dynamically extracting relationships between entities and attributes, the method comprising the steps of:

acquiring text data;

and dynamically extracting various characteristics of the entity and the attribute from the text data based on the dynamic entity attribute relation library and the training model.

In addition, in order to achieve the above object, the present application also proposes a server including a processor and a memory;

the processor is used for executing a dynamic extraction program of the relation between the entity and the attribute stored in the memory so as to realize the method.

In addition, to achieve the above object, the present application also proposes a computer-readable storage medium storing one or more programs executable by one or more processors to implement the above method.

According to the dynamic extraction method, the server and the readable storage medium for the entity and attribute relationship, the text data are obtained, and based on the dynamic entity attribute relationship library and the training model, all the characteristics of the entity and the attribute are dynamically extracted from the text data, so that the dynamic entity attribute relationship library and the training model are constructed, and all the characteristics of the entity and the attribute can be automatically extracted from the text data.

Drawings

FIG. 1 is a flow chart of a method for dynamic extraction of relationships between entities and attributes according to a first embodiment of the present application;

FIG. 2 is a flow chart of a method for dynamic extraction of relationships between entities and attributes according to a first embodiment of the present application;

FIG. 3 is a schematic view of a sub-process of a dynamic extraction method of entity and attribute relationships according to a first embodiment of the present application;

FIG. 4 is an exemplary diagram of a method for dynamic extraction of relationships between entities and attributes according to a first embodiment of the present application;

FIG. 5 is a second flowchart of a method for dynamic extraction of relationships between entities and attributes according to the first embodiment of the present application;

FIG. 6 is a second flow chart of a dynamic extraction method for entity and attribute relationships according to the first embodiment of the present application;

FIG. 7 is a schematic diagram of a server hardware architecture according to a second embodiment of the present application;

FIG. 8 is a block diagram illustrating a dynamic extraction process of the entity-attribute relationship in FIG. 7.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In the following description, suffixes such as "module", "component", or "unit" for representing elements are used only for facilitating the description of the present application, and have no specific meaning in themselves. Thus, "module," "component," and "unit" may be used interchangeably.

First embodiment

Fig. 1 is a schematic flow chart of a dynamic extraction method of relationships between entities and attributes according to a first embodiment of the present application. In fig. 1, the dynamic extraction method of the relationship between the entity and the attribute includes the following steps:

step 110, acquiring text data;

step 120, dynamically extracting various characteristics of the entity and the attribute from the text data based on the dynamic entity attribute relation library and the training model.

Specifically, when text data is obtained, based on a pre-established entity attribute relation library and an entity attribute relation training model, each feature of the entity and the attribute is dynamically extracted from the text data, and is structured into an entity and attribute pair, so that a dynamic extraction result is obtained.

Once the entity attribute relation library and the training model have been established, the relationships between entities and attributes in the text data can be identified, their features dynamically extracted, and the entity attribute relation corpus of the training model continuously and dynamically expanded. A corpus of more adequate scale is thereby obtained as the training corpus, which improves the performance of statistical machine learning methods for automatically extracting entities and attributes from large volumes of text, so that such extraction can be carried out comprehensively.

Optionally, as shown in fig. 2, before step 110, the method further includes:

step 210, capturing a plurality of sample data;

step 220, constructing an entity attribute relation library according to the plurality of sample data;

step 230, expanding the entity attribute relation library according to a preset characteristic rule.

Specifically, a large amount of sample data is acquired: crawler technology is used to capture text data on the Internet that relates to the relevant field (such as the telecommunications field), using keywords typical of that field. The captured sample data is analyzed, and an entity attribute seed table is automatically constructed using the Entity-Attribute-Value (EAV) model, to serve as the seed library of entity attribute relationships.
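By way of illustration only, an entity attribute seed table following the EAV model might be held in memory as sketched below. The `EavRow` type, the `build_seed_library` helper and all attribute values are hypothetical placeholders introduced for this sketch; only the entity and attribute names echo the FIG. 4 example, and the text does not specify a storage format.

```python
from collections import defaultdict
from typing import NamedTuple


class EavRow(NamedTuple):
    """One row of an entity-attribute-value (EAV) table."""
    entity: str
    attribute: str
    value: str


def build_seed_library(rows):
    """Group EAV rows into a nested mapping {entity: {attribute: value}}."""
    library = defaultdict(dict)
    for row in rows:
        library[row.entity][row.attribute] = row.value
    return dict(library)


# Hypothetical seed rows: the values are invented placeholders.
seed_rows = [
    EavRow("wireless broadband", "access mode", "wireless network card"),
    EavRow("wireless broadband", "terminal", "laptop"),
    EavRow("private line surfing", "service introduction", "dedicated line access"),
]
seed_library = build_seed_library(seed_rows)
```

Storing one row per (entity, attribute, value) triple lets new attributes be added to an entity without changing a fixed schema, which suits a library that is meant to grow.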

The text is segmented using preset characteristic rules: after the text has been preprocessed by sentence splitting, word segmentation and the like, preset keywords or key terms are retained, and the retained keywords or key terms are expanded into the entity attribute relation library. Taking the field of telecommunications as an example, these keywords or key terms may be "package," "on," "phone," "display," etc.; when detected, they are added to the entity attribute relation library.
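The preprocessing and keyword retention described above can be sketched roughly as follows. This is a simplified illustration, not the embodiment itself: the regular-expression tokenizer stands in for a real word segmenter (which Chinese text would require), and the function names are hypothetical.

```python
import re

# Keywords taken from the telecommunications example in the text.
PRESET_KEYWORDS = {"package", "phone", "display"}


def split_sentences(text):
    """Split text into sentences on ., !, ? and their full-width forms."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercasing tokenizer; a stand-in for a real word segmenter."""
    return re.findall(r"\w+", sentence.lower())


def retained_keywords(text, keywords=PRESET_KEYWORDS):
    """Return the preset keywords found in the text, in order of first appearance."""
    found = []
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            if token in keywords and token not in found:
                found.append(token)
    return found


def expand_library(library, text, keywords=PRESET_KEYWORDS):
    """Add each retained keyword to the library as an entity with no attributes yet."""
    for keyword in retained_keywords(text, keywords):
        library.setdefault(keyword, {})
    return library
```

For example, `retained_keywords("The phone has a large display. Which package is best?")` keeps only the three preset keywords and discards the remaining tokens.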

Optionally, as shown in fig. 3, step 230 specifically includes:

step 310, receiving a character string sentence;

step 320, judging whether the character string sentence includes a preset keyword in the entity attribute relation library; if yes, proceeding to step 330; if not, performing no processing;

step 330, dividing the character string sentence into one or more sub-string sentences;

step 340, judging whether the matching degree between each sub-string sentence and the preset keyword in the entity attribute relation library reaches a preset threshold; if yes, the entity in the sub-string sentence already exists in the original entity attribute relation library and is not processed; if not, proceeding to step 350;

step 350, expanding the sub-string sentence into the entity attribute relation library.

Specifically, a character string sentence input by a user is detected and received. If the character string sentence includes a preset keyword or key term, it is simplified by means of regular expressions into one or more sub-string sentences. Each sub-string sentence is then matched for similarity against the entities in the entity attribute relation library. The similarity matching process is as follows: a similarity threshold (for example, 1) is set. If the matching degree between a sub-string sentence and an entity in the entity attribute relation library is 1, the entity in the sub-string sentence already exists in the original entity attribute relation library and no expansion is needed. Otherwise, if the matching degree does not reach 1, the entity in the sub-string sentence does not exist in the original entity attribute relation library, and the library needs to be expanded. Preferably, if a plurality of entities do not reach the similarity threshold, the entities with high similarity are expanded into the entity attribute relation library.

Illustratively, FIG. 4 shows a display of the entity attribute relation library being expanded. In FIG. 4, when the input query "I want to know about wireless broadband and related information of private line surfing" is received, entity 1 is obtained as "wireless broadband", with a similarity result displayed as "0.800000011920929Pts" (approximately 0.8), and the information corresponding to entity 1 is: service introduction, access mode, terminal, wireless network card and fault analysis. Entity 2 is obtained as "private line surfing", and the information corresponding to entity 2 is service introduction. Because the similarity between entity 1 and the entity attribute relation library is smaller than 1, entity 1 is expanded into the entity attribute relation library.
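Steps 310 to 350 can be sketched as below, under stated assumptions. The text does not name a similarity measure, so this sketch substitutes the ratio computed by Python's `difflib.SequenceMatcher`; it is not expected to reproduce the 0.8 score of FIG. 4. The splitting rule (commas, semicolons and the word "and") and all function names are likewise hypothetical.

```python
import re
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed similarity measure: difflib ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def split_sub_sentences(sentence):
    """Step 330: split on commas, semicolons and the standalone word 'and'."""
    parts = re.split(r"[,;，；]|\band\b", sentence)
    return [p.strip() for p in parts if p.strip()]


def expand_from_sentence(sentence, library, keywords, threshold=1.0):
    """Steps 320-350: return the sub-sentences that were added to the library."""
    lowered = sentence.lower()
    # Step 320: ignore sentences that contain no preset keyword.
    if not any(keyword in lowered for keyword in keywords):
        return []
    added = []
    for sub in split_sub_sentences(sentence):
        # Step 340: best match of this sub-sentence against the existing entities.
        best = max((similarity(sub, entity) for entity in library), default=0.0)
        if best < threshold:
            # Step 350: the entity is not in the library yet, so add it.
            library.setdefault(sub, {})
            added.append(sub)
    return added
```

With a threshold of 1, only an exact (case-insensitive) match counts as already present, which mirrors the example threshold given in the text; a lower threshold would treat near-duplicates as existing entities.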

Optionally, as shown in fig. 5, after step 110, the method further includes the steps of:

step 510, labeling the entity and the attribute of the text data according to the entity attribute relation library;

step 520, analyzing the labeled corpus to select the characteristics of the entity and the attribute.

Specifically, the captured text data is annotated in XML by means of the entity attribute relation library, forming a text entity attribute corpus for the specific field. The annotated corpus is analyzed, and features of the entities and attributes are selected according to how entities and attributes appear in the text, for example contextual features, part-of-speech features and vocabulary features, so that the various features in the text are extracted.
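As a rough illustration of XML annotation and feature selection, the sketch below wraps library entities found in a text in a hypothetical `<entity>` tag and computes simple contextual and vocabulary features for a token. The tag name, the feature names and both functions are assumptions; part-of-speech features are omitted because they would require a tagger that the text does not specify.

```python
import re
from xml.sax.saxutils import escape


def annotate(text, library):
    """Wrap each library entity found in the text in an <entity> element."""
    # Longest names first, so 'wireless broadband' wins over 'broadband'.
    names = sorted(library, key=len, reverse=True)
    if not names:
        return escape(text)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    pieces, last = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[last:match.start()]))
        pieces.append(f"<entity>{escape(match.group())}</entity>")
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


def token_features(tokens, i, vocabulary):
    """Contextual and vocabulary features for the token at position i."""
    return {
        "word": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        "in_vocab": tokens[i] in vocabulary,
    }
```

Escaping the surrounding text keeps the annotated output well-formed even when the captured text contains characters such as `&` or `<`.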

Furthermore, words, phrases and the like that may be formed from existing entities can also be selected for annotation and expansion. For example, if the entity "package" already exists in the relation library, and "A package", "B package" and the like appear in other text data, then "A package" and "B package" may be annotated as entities, and the newly annotated entities may be expanded into the entity attribute relation library.

Optionally, the dynamic extraction method of the entity and attribute relationship of the present application further includes: the building of the entity attribute relationship training model, as shown in fig. 6, specifically includes the following steps:

step 610, capturing a plurality of text corpora;

step 620, processing the text corpus into one or more corpus files in a preset format;

step 630, training the one or more corpus files to generate a model file;

step 640, labeling the model file through a characteristic function set in the model file and a preset algorithm.

Specifically, the text corpus is preprocessed to generate one or more word-level training corpus files and word-level common training corpus files in a preset format; for example, a training file, a test file and a standard answer file for evaluation are generated in a specified format.
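The preset format is not specified in the text. One plausible choice, assumed here purely for illustration, is the column layout commonly used by CRF tools: one token and its label per line, separated by a tab, with a blank line between sentences. The `B-ENT`/`I-ENT`/`O` label scheme and the function names are also assumptions.

```python
def bio_labels(tokens, entities):
    """Label tokens B-ENT / I-ENT / O given (possibly multi-word) entity names."""
    labels = ["O"] * len(tokens)
    # Try longer entities first; skip blank names to avoid empty matches.
    entity_tokens = sorted(
        (e.split() for e in entities if e.strip()), key=len, reverse=True
    )
    i = 0
    while i < len(tokens):
        for ent in entity_tokens:
            n = len(ent)
            if tokens[i:i + n] == ent:
                labels[i] = "B-ENT"
                for j in range(i + 1, i + n):
                    labels[j] = "I-ENT"
                i += n
                break
        else:
            # No entity starts at this position.
            i += 1
    return labels


def to_column_format(sentences, entities):
    """Render tokenized sentences as 'token<TAB>label' lines, blank line between sentences."""
    blocks = []
    for tokens in sentences:
        labels = bio_labels(tokens, entities)
        blocks.append("\n".join(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels)))
    return "\n\n".join(blocks) + "\n"
```

The same function can produce the training file and the standard answer file; a test file would be written without the label column.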

The corpus files generated in the preprocessing stage are used to generate a training file; in this embodiment, the training file can be generated through a software development kit (SDK) provided by a CRF (conditional random field) implementation. A globally optimal labeling result for the test input data is then obtained with the Viterbi labeling algorithm, using the characteristic function set and the parameters in the model file.
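The Viterbi decoding step can be illustrated with the minimal sketch below. In a trained CRF the per-position and transition scores would be weighted sums of the characteristic functions stored in the model file; here they are passed in directly as plain dictionaries, which is an assumption made to keep the example self-contained. Missing scores default to 0.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list with one {label: score} dict per position
    transitions: {(previous_label, label): score}
    """
    if not emissions:
        return []
    # best[t][y] is the best score of any path ending in label y at position t.
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (
                best[-1][prev]
                + transitions.get((prev, y), 0.0)
                + emissions[t].get(y, 0.0)
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because each position keeps only the best path into every label, the search is linear in sentence length yet still returns the globally optimal sequence rather than a greedy one.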

Optionally, the process of building the entity attribute relationship training model may further include:

identifying the accuracy, recall rate and F-measure of the labeled model file.

Specifically, in this embodiment, the labeling result and the standard answer are compared to obtain the identified accuracy, recall and F measure.
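The comparison against the standard answers can be sketched as follows, assuming that "accuracy" here denotes precision, as is usual when it is reported alongside recall and the F-measure. Entities are represented as hypothetical (start, end, type) tuples, and the balanced F1 variant of the F-measure is assumed.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted entities with the standard answers.

    Both arguments are collections of hashable entity descriptions,
    for example (start, end, type) tuples.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

An entity counts as correct only when its boundaries and type both match the standard answer exactly.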

In practical applications, the above process is repeated each time text data is obtained, so that the entity attribute relation library and the training model are built dynamically; the model thus learns new knowledge from limited samples, and the screened elements are added to the vocabulary. As the data samples increase, telecommunications entities are automatically identified through learning from a large amount of data, which enlarges the named entity library. Dynamically constructing the telecommunications entity attribute corpus yields a corpus of relatively adequate scale as the training corpus, which improves the performance of statistical machine learning methods for automatically extracting entities and attributes from massive texts, so that such extraction can be carried out comprehensively.

According to the dynamic extraction method for the entity and attribute relationship, the text data is obtained, and based on the dynamic entity attribute relationship library and the training model, all the characteristics of the entity and the attribute are dynamically extracted from the text data, so that the dynamic entity attribute relationship library and the training model are constructed, and all the characteristics of the entity and the attribute can be automatically extracted from the text data.

Second embodiment

FIG. 7 is a schematic diagram of a server hardware architecture provided in a second embodiment of the present application. In FIG. 7, the server includes: a memory 710, a processor 720, and a dynamic extraction program 730 of entity and attribute relationships that is stored on the memory 710 and operable on the processor 720. In this embodiment, the dynamic extraction program 730 includes a series of computer program instructions stored in the memory 710 which, when executed by the processor 720, implement the dynamic extraction operations for entity and attribute relationships of the embodiments of the present application. In some embodiments, the dynamic extraction program 730 may be divided into one or more modules based on the particular operations implemented by portions of the computer program instructions. As shown in FIG. 8, the dynamic extraction program 730 includes: a data acquisition module 810, a dynamic extraction module 820, a relational library construction module 830, an expansion module 840, a labeling module 850, a feature selection module 860, and a model building module 870, wherein:

a data acquisition module 810 for acquiring text data;

the dynamic extraction module 820 is configured to dynamically extract each feature of the entity and the attribute from the text data based on the dynamic entity attribute relation library and the training model.

Specifically, when the data obtaining module 810 obtains the text data, the dynamic extraction module 820 dynamically extracts each feature of the entity and the attribute from the text data based on the pre-established entity attribute relation library and the entity attribute relation training model, and constructs the feature as an entity and attribute pair, so as to obtain a dynamic extraction result.

Once the entity attribute relation library and the training model have been established, the dynamic extraction module 820 can identify the relationships between entities and attributes in the text data, dynamically extract their features, and dynamically expand the entity attribute relation corpus of the training model. A corpus of more adequate scale is thereby obtained as the training corpus, which improves the performance of statistical machine learning methods for automatically extracting entities and attributes from large volumes of text, so that such extraction can be carried out comprehensively.

The data acquisition module 810 is further configured to capture a plurality of sample data;

a relational library construction module 830, configured to construct an entity attribute relational library according to the plurality of sample data;

and the expansion module 840 is configured to expand the entity attribute relationship library according to a preset feature rule.

Specifically, when acquiring a large amount of sample data, the data acquisition module 810 uses crawler technology to capture text data on the Internet that relates to the relevant field (e.g., the telecommunications field), using keywords typical of that field. The captured sample data is analyzed, and an entity attribute seed table is automatically constructed using the EAV model, to serve as the seed library of entity attribute relationships.

Optionally, as shown in fig. 3, the expansion module 840 is specifically configured to:

receiving a character string sentence;

judging whether the character string statement comprises a preset keyword in an entity attribute relation library or not; if yes, the character string statement is divided into one or more sub-character string statements;

judging whether the matching degree of each sub-string sentence and a preset keyword in the entity attribute relation library reaches a preset threshold value or not; if yes, the entity in the sub-string statement exists in the original entity attribute relation library, the entity is not processed, and if not, the sub-string statement is expanded to the entity attribute relation library.

The labeling module 850 is configured to label the entity and the attribute for the text data according to the entity attribute relation library;

the feature selection module 860 is configured to study the labeled corpus to select features of the entity and the attribute.

Specifically, the labeling module 850 labels the captured text data by using XML language through the entity attribute relational library, so as to form a text entity attribute corpus in a specific field. The feature selection module 860 researches the labeled corpus and selects features of entities and attributes according to the characteristics of the entities and attributes in the text, for example, according to contextual features, part-of-speech features, vocabulary features, and the like, thereby extracting various features in the text.

The model building module 870 is configured to build an entity attribute relationship training model, and includes: a preprocessing unit 871, a training unit 872, a labeling unit 873, and an evaluation unit 874, wherein:

a preprocessing unit 871, configured to process the captured multiple text corpora into one or more corpus files in a preset format;

a training unit 872, configured to train the one or more corpus files to generate a model file;

a labeling unit 873, configured to label the model file through the characteristic function set in the model file and a preset algorithm;

an evaluation unit 874, configured to identify the accuracy, recall rate and F-measure of the labeled model file.

The corpus files generated in the preprocessing stage are used to generate a training file; in this embodiment, the training file can be generated through an SDK provided by a CRF (conditional random field) implementation. A globally optimal labeling result for the test input data is then obtained with the Viterbi labeling algorithm, using the characteristic function set and the parameters in the model file.

In this embodiment, the labeling result is compared with the standard answer to obtain the accuracy rate, recall rate and F measure of recognition.

The server provided in this embodiment obtains text data through the data obtaining module 810, and dynamically extracts various features of entities and attributes from the text data based on the dynamic entity attribute relation library and the training model by the dynamic extracting module 820, thereby constructing the dynamic entity attribute relation library and the training model, and can automatically extract various features of entities and attributes from the text data.

Third embodiment

The embodiment of the application also provides a computer-readable storage medium that stores one or more programs. The computer-readable storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk or a solid state disk; it may also include a combination of the above types of memory. The one or more programs in the computer-readable storage medium may be executed by one or more processors to implement the method for dynamically extracting relationships between entities and attributes according to the first embodiment.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of embodiments, it will be clear to a person skilled in the art that the above embodiment method may be implemented by means of software plus a necessary general hardware platform, but may of course also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims

1. A method for dynamically extracting relationships between entities and attributes, the method comprising the steps of:

acquiring text data;

dynamically extracting various characteristics of the entity and the attribute from the text data based on a dynamic entity attribute relation library and a training model;

before acquiring the text data, the method further comprises:

capturing a plurality of sample data;

constructing an entity attribute relation library according to the plurality of sample data;

expanding the entity attribute relation library according to a preset characteristic rule;

the expanding the entity attribute relation library according to a preset characteristic rule comprises the following steps:

receiving a character string sentence;

judging whether the character string statement comprises a preset keyword in an entity attribute relation library or not;

if yes, the character string statement is divided into one or more sub-character string statements;

judging whether the matching degree of each sub-string sentence and a preset keyword in the entity attribute relation library reaches a preset threshold value or not;

if not, expanding the sub-string sentences to the entity attribute relational library.

2. The method for dynamic extraction of entity-attribute relationships according to claim 1, wherein after obtaining text data, the method further comprises:

labeling the entity and the attribute of the text data according to the entity attribute relation library;

and researching the annotated corpus to select the characteristics of the entity and the attribute.

3. The method for dynamic extraction of entity-attribute relationships according to claim 1, wherein prior to obtaining text data, the method further comprises:

and establishing an entity attribute relationship training model.

4. A method for dynamic extraction of entity-attribute relationships according to claim 3, wherein building an entity-attribute relationship training model comprises:

capturing a plurality of text corpus;

processing the text corpus into one or more corpus files in a preset format;

training the one or more corpus files to generate model files;

and labeling the model file through a characteristic function set in the model file and a preset algorithm.

5. The method for dynamic extraction of entity-attribute relationships according to claim 4, further comprising:

6. A server, wherein the server comprises a processor and a memory;

the processor is configured to execute a dynamic extraction program of entity and attribute relationships stored in the memory, so as to implement the method of any one of claims 1-5.

7. A computer readable storage medium storing one or more programs executable by one or more processors to implement the method of any of claims 1-5.