CN110569510A

CN110569510A - Method for identifying named entities in user request data

Info

Publication number: CN110569510A
Application number: CN201910877939.9A
Authority: CN
Inventors: 杜忠和
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2019-12-13

Abstract

The invention discloses a method for identifying named entities in user request data. Word segmentation and part-of-speech tagging yield a set of segmented words with part-of-speech tags, from which the nominal morphemes are screened out as a first entity set. Person names, place names and organization names in the request data are obtained by an entity extraction method as a second entity set. The subject-predicate relation, the verb-object relation and the components semantically related to the sentence core word are extracted from the request data by dependency analysis to form a third entity set. Finally, the first, second and third entity sets are cleaned on the basis of semantic analysis to obtain the final named entity recognition result. The method addresses the low accuracy of named entity recognition on request data during interaction between a user and an intelligent device, and thereby improves the recognition rate of user intent.

Description

Method for identifying named entities in user request data

Technical Field

The invention relates to the technical field of computer natural language processing, and in particular to a method for identifying named entities in user request data.

Background

With the rapid development of computer technology and the information industry, the volume of text information is growing explosively, and quickly and accurately extracting useful information from massive text is of great importance. Natural language processing is an important direction in computer science and artificial intelligence. Research on natural language processing and related technologies helps people interact with intelligent devices more conveniently and effectively, so that their real intentions can be fulfilled. Named entity recognition is an important basic tool for information extraction and machine learning and plays an important role in putting natural language processing technology into practice. Its task is to find, in a given text, the text units that carry practical meaning, namely entities. Common entities include person names, place names, organization names, times, and so on.

Early named entity recognition methods were mostly rule-based; such systems are costly to build and their portability is limited. At present, the popular named entity recognition methods are mainly feature-template-based methods and neural-network-based methods. The feature-template-based (statistical machine learning) methods treat named entity recognition as a sequence labeling task and learn a labeling model from large-scale corpora in order to label each position of a sentence. Commonly used models include the HMM (hidden Markov model) and the CRF (conditional random field). These methods need a large amount of labeled corpora and place certain demands on feature extraction, and the quality of the feature templates affects the entity recognition result. The neural-network-based methods extract features automatically, so that model training becomes an end-to-end process that does not depend on feature engineering. They are data-driven, but there are many network variants, the results depend heavily on parameter settings, and the models are hard to interpret. In addition, Chinese named entity recognition has difficulties of its own: the scope of an entity differs between domains, entity types are varied and numerous, and the result is strongly affected by the quality of Chinese word segmentation.

Disclosure of Invention

The invention aims to overcome the defects described in the background and to provide a method for identifying named entities in user request data. The method uses a Stanford natural language processing tool and combines part-of-speech tagging, entity extraction and dependency analysis to address the low accuracy of named entity recognition on request data during interaction between a user and an intelligent device, thereby improving the recognition rate of user intent.

To achieve this technical effect, the invention adopts the following technical scheme:

A method for identifying named entities in user request data, comprising the following steps:

A. Introducing a Stanford natural language processing tool, performing word segmentation and part-of-speech tagging on the request text to obtain a set of segmented words with part-of-speech tags, and screening out the nominal morphemes in that set, such as common nouns (NN), proper nouns (NR) and temporal nouns (NT), to form a first entity set;

B. Obtaining the person names, place names and organization names in the request data by an entity extraction method to serve as a second entity set;

C. Extracting the subject-predicate relations and verb-object relations in the request text by a dependency analysis method, and forming a third entity set from the subject of each subject-predicate relation, the direct object and indirect object of each verb-object relation, and the sentence core word ROOT;

D. Cleaning the first entity set, the second entity set and the third entity set on the basis of semantic analysis to obtain the final named entity recognition result.

Further, the word segmentation of the request text in step A specifically comprises: using a CRF model to label the position of each character within a word as word-beginning, word-middle, word-end or single-character word, and forming each segmented word either from the characters running from a word-beginning to the following word-end or from a single-character word.
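The grouping of position-labeled characters into words can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tag letters B, M, E and S (beginning, middle, end, single) are a common naming convention assumed here, the function name `bmes_to_words` is hypothetical, and the tags are supplied by hand rather than predicted by a CRF model.

```python
def bmes_to_words(chars, tags):
    """Group characters into words from per-character position tags.

    B = word-beginning, M = word-middle, E = word-end, S = single-character word.
    """
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if current:  # tolerate an unfinished word before a single
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:  # tolerate a missing E tag
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        elif tag == "E":
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown position tag: {tag!r}")
    if current:
        words.append(current)
    return words


# Illustrative input (an assumption, not taken from the patent text).
chars = list("我想看香港电影")
tags = list("SSSBEBE")
print(bmes_to_words(chars, tags))
```

In a full system the tag sequence would come from the trained sequence labeler; only the final grouping step is shown here.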

Further, the part-of-speech tagging of the request text in step A specifically comprises: defining a group of feature functions over all possible part-of-speech tag sequences of the segmented words, assigning a weight to each feature function, scoring each tag sequence, and taking the highest-scoring tag sequence as the part-of-speech tagging result.
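The scoring of candidate tag sequences by weighted feature functions can be sketched as below. Everything specific in it is an assumption for illustration: the three-tag set, the four feature functions, and their weights are hand-picked, whereas a CRF would learn the weights from labeled data and would normally find the best sequence with dynamic programming (Viterbi) instead of enumerating every sequence.

```python
from itertools import product

TAGS = ["PN", "VV", "NN"]  # toy tag set (assumption)


# Each feature function sees (previous tag, current tag, words, position)
# and returns 1 when it fires, otherwise 0.
def f_pronoun(prev, cur, words, i):
    return 1 if words[i] == "I" and cur == "PN" else 0


def f_verb_after_pronoun(prev, cur, words, i):
    return 1 if prev == "PN" and cur == "VV" else 0


def f_movie_is_noun(prev, cur, words, i):
    return 1 if words[i] == "movie" and cur == "NN" else 0


def f_noun_after_verb(prev, cur, words, i):
    return 1 if prev == "VV" and cur == "NN" else 0


# (feature function, weight) pairs; the weights are illustrative only.
FEATURES = [
    (f_pronoun, 2.0),
    (f_verb_after_pronoun, 1.5),
    (f_movie_is_noun, 2.0),
    (f_noun_after_verb, 0.5),
]


def score(words, tags):
    """Sum the weighted feature values over every position of the sequence."""
    total = 0.0
    for i, cur in enumerate(tags):
        prev = tags[i - 1] if i > 0 else "<START>"
        total += sum(weight * f(prev, cur, words, i) for f, weight in FEATURES)
    return total


def best_tagging(words):
    """Enumerate all tag sequences and return the highest-scoring one."""
    candidates = product(TAGS, repeat=len(words))
    return max(candidates, key=lambda tags: score(words, tags))


print(best_tagging(["I", "see", "movie"]))
```

Exhaustive enumeration grows exponentially with sentence length, so it is only suitable for showing the scoring idea on a very short input.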

Further, when part-of-speech tagging is performed on the request text in step A, user-defined part-of-speech tags are also applied to words from a specific domain. For example, the title "love you for ten thousand years" is tagged N_song, that is, a song noun, and the title "speed and passion" is tagged N_video, that is, a film or video noun.
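One simple way to picture the user-defined tagging is a domain lexicon that overrides the tagger's output for known titles. The sketch below assumes that the titles already arrive as single segmented tokens and that a dictionary lookup is sufficient; the names `CUSTOM_LEXICON` and `apply_custom_tags` are hypothetical, and the source does not say how the custom tags are actually wired into the Stanford tool.

```python
# Domain lexicon: word -> user-defined part-of-speech tag (entries from the text).
CUSTOM_LEXICON = {
    "love you for ten thousand years": "N_song",
    "speed and passion": "N_video",
}


def apply_custom_tags(tagged_words, lexicon=CUSTOM_LEXICON):
    """Replace the generic tag of any word found in the domain lexicon."""
    return [(word, lexicon.get(word, tag)) for word, tag in tagged_words]


# Illustrative tagger output (an assumption).
tagged = [
    ("I", "PN"),
    ("want", "VV"),
    ("watch", "VV"),
    ("speed and passion", "NN"),
]
print(apply_custom_tags(tagged))
```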

Further, step B specifically comprises inputting labeled data into a CRF model for training, and then extracting the person names, place names and organization names from the data to be predicted.

Further, in step B, for request data from which entity recognition cannot extract entities, syntactic analysis is used to identify the redundant components of the sentence that do not help entity recognition, such as qualifiers, ordinal words and quantifiers, and at the same time to parse the simple clauses, verb phrases and noun phrases in the sentence.

Further, in step C, when the core word is a verb, it is generally the predicate of the sentence; the predicate is followed by an object or a predicative, and the object is very likely an entity or part of an entity. When the core word is a noun, the noun is generally the entity to be extracted or part of it.

Further, the cleaning in step D is specifically a second-pass removal of the components that do not belong to named entities.

Further, the components that do not belong to named entities include at least pronouns, verbs and prepositions. Pronouns (e.g., "you", "I", "her"), verbs (e.g., "see") and prepositions (e.g., "after") may be present in the candidate set before the final named entity set is produced, because they can serve as the subject, predicate or other components of a sentence. They fall outside the scope of named entities and need to be removed from the named entity set.

Compared with the prior art, the invention has the following beneficial effects:

By introducing a Stanford natural language processing tool and combining part-of-speech tagging, entity extraction and dependency analysis, the disclosed method recognizes, on the basis of semantic analysis, the named entities in domain-specific user request data, including but not limited to the three major categories of person names, place names and organization names. It improves the recall of Chinese entity recognition and enriches the semantic features. It is also easier to understand and implement than the feature-template and neural-network methods, and it improves the efficiency of named entity recognition.

Drawings

FIG. 1 is a flow diagram illustrating a method for named entity identification of user requested data in accordance with the present invention.

Detailed Description

The invention will be further elucidated and described with reference to the embodiments of the invention described hereinafter.

Embodiment 1:

As shown in FIG. 1, a method for identifying named entities in user request data specifically includes the following steps:

In this embodiment, it is assumed that a user sends the voice request "I want to see the 1994 Hong Kong movie Dahua X You starring Zhou X chi and Zhu X" to an intelligent device. To better identify the user's intent, the named entities in the request text need to be identified. The specific steps are as follows:

Step 1, introducing Stanford CoreNLP and obtaining a list of segmented words with part-of-speech tags through word segmentation and part-of-speech tagging: [ I/PN, want/VV, see/VV, 1994/NT, Zhou X chi/NR, and/CC, Zhu X/NR, act/VV, 的/DEC, Hong Kong/NR, movie/NN, Dahua X You/NR ]. The nominal morphemes in the list are screened out to form a first named entity set [ 1994/NT, Zhou X chi/NR, Zhu X/NR, Hong Kong/NR, movie/NN, Dahua X You/NR ].
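The screening in step 1 amounts to keeping the tokens whose tag is nominal. A minimal sketch, assuming the tagger output is already available as (word, tag) pairs; the English token strings are glosses, the `TITLE` placeholder stands in for the partly masked movie title, and the particle's surface form is assumed.

```python
NOMINAL_TAGS = {"NN", "NR", "NT"}  # common noun, proper noun, temporal noun

TITLE = "<movie title>"  # placeholder for the partly masked movie title

# Tagger output for the example request, written as English glosses.
tagged = [
    ("I", "PN"), ("want", "VV"), ("see", "VV"), ("1994", "NT"),
    ("Zhou X chi", "NR"), ("and", "CC"), ("Zhu X", "NR"), ("act", "VV"),
    ("的", "DEC"),  # particle; surface form assumed
    ("Hong Kong", "NR"), ("movie", "NN"), (TITLE, "NR"),
]


def first_entity_set(tagged_words):
    """Keep only the tokens whose part-of-speech tag is nominal."""
    return [(word, tag) for word, tag in tagged_words if tag in NOMINAL_TAGS]


print(first_entity_set(tagged))
```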

Step 2, obtaining the person names (PERSON), place names (LOCATION) and organization names (ORGANIZATION) in the request text by an entity extraction method to form a second named entity set [ Zhou X chi/PERSON, Zhu X/PERSON, Hong Kong/LOCATION ].

Step 3, extracting the subject-predicate relations and verb-object relations in the request text by a dependency analysis method.

Specifically, dependency syntax explains syntactic structure by analyzing the dependency relations between the components of a language unit. It holds that the core verb of a sentence is the central component that governs the other components while not being governed by any of them, and that every governed component depends on its governor through some relation. The dependency analysis result in this example is [ (ROOT, want), (nsubj, want, I), (ccomp, want, see), (dep, Dahua X You, 1994), (conj, Zhu X, Zhou X chi), (cc, Zhu X, and), (nsubj, act, Zhu X), (acl, Dahua X You, act), (mark, act, 的), (nmod, movie, Hong Kong), (appos, Dahua X You, movie), (dobj, see, Dahua X You) ].

Each element in the list is a tuple whose elements are, in order, the dependency relation, the governing word and the dependent word; the ROOT entry names only the sentence core word. The subjects of the subject-predicate relations, the direct and indirect objects of the verb-object relations, and the sentence core word ROOT are extracted to form a third named entity set [ I/PN, want/VV, Zhu X/NR, Dahua X You/NN ].
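The selection of subjects, objects and the core word from the dependency result can be sketched as a filter over the relation tuples. The sketch assumes the tuples are ordered (relation, governor, dependent) as in the listed result, and that `iobj` is the label for indirect objects, which does not occur in this example; `TITLE` is a placeholder for the partly masked movie title and the particle's surface form is assumed.

```python
TITLE = "<movie title>"  # placeholder for the partly masked movie title

# Dependency result of the example, as (relation, governor, dependent).
dependencies = [
    ("ROOT", "want"),
    ("nsubj", "want", "I"),
    ("ccomp", "want", "see"),
    ("dep", TITLE, "1994"),
    ("conj", "Zhu X", "Zhou X chi"),
    ("cc", "Zhu X", "and"),
    ("nsubj", "act", "Zhu X"),
    ("acl", TITLE, "act"),
    ("mark", "act", "的"),  # particle; surface form assumed
    ("nmod", "movie", "Hong Kong"),
    ("appos", TITLE, "movie"),
    ("dobj", "see", TITLE),
]

ARGUMENT_RELATIONS = {"nsubj", "dobj", "iobj"}  # subject, direct and indirect object


def third_entity_set(deps):
    """Collect the sentence core word plus every subject and object."""
    selected = []
    for dep in deps:
        relation = dep[0]
        if relation == "ROOT":
            word = dep[1]  # the core word itself
        elif relation in ARGUMENT_RELATIONS:
            word = dep[2]  # the dependent, i.e. the subject or object
        else:
            continue
        if word not in selected:
            selected.append(word)
    return selected


print(third_entity_set(dependencies))
```

The result lists the core word first because it follows the order of the input tuples; as a set it contains the same four words as the third named entity set of the embodiment.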

Step 4, taking the union of the first, second and third named entity sets to obtain [ 1994/NT, movie/NN, Dahua X You/NR, Zhou X chi/PERSON, Zhu X/PERSON, Hong Kong/LOCATION, I/PN, want/VV ]. The components that do not belong to named entities, such as pronouns (PN), verbs (VV) and prepositions (P), are then cleaned out to obtain the final named entity set [ 1994/NT, Dahua X You/NR, Zhou X chi/PERSON, Zhu X/PERSON, Hong Kong/LOCATION ]. This completes named entity recognition for the whole user request.
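The union and cleaning of step 4 can be sketched as below. Two choices in it are assumptions rather than statements from the source: when a word appears in several sets, the entity label from the second set is preferred over its part-of-speech tag (this matches the labels shown in the union above), and only the PN, VV and P tags are removed. With that rule alone the token movie/NN survives, whereas the final set of the embodiment omits it; the source does not state which rule removes it.

```python
TITLE = "<movie title>"  # placeholder for the partly masked movie title

EXCLUDED_TAGS = {"PN", "VV", "P"}  # pronouns, verbs, prepositions

first = [("1994", "NT"), ("Zhou X chi", "NR"), ("Zhu X", "NR"),
         ("Hong Kong", "NR"), ("movie", "NN"), (TITLE, "NR")]
second = [("Zhou X chi", "PERSON"), ("Zhu X", "PERSON"), ("Hong Kong", "LOCATION")]
third = [("I", "PN"), ("want", "VV"), ("Zhu X", "NR"), (TITLE, "NN")]


def merge_and_clean(first_set, second_set, third_set):
    """Union the three sets, then drop words whose tag is excluded."""
    merged = {}
    # Part-of-speech tags first; the earliest tag seen for a word is kept.
    for word, tag in first_set + third_set:
        merged.setdefault(word, tag)
    # Entity labels override part-of-speech tags (assumption, see lead-in).
    for word, label in second_set:
        merged[word] = label
    return {word: tag for word, tag in merged.items() if tag not in EXCLUDED_TAGS}


print(merge_and_clean(first, second, third))
```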

It will be understood that the above embodiments are merely exemplary embodiments taken to illustrate the principles of the present invention, which is not limited thereto. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit and substance of the invention, and these modifications and improvements are also considered to be within the scope of the invention.

Claims

1. A method for identifying named entities in user request data, comprising the following steps:

A. Introducing a Stanford natural language processing tool, performing word segmentation and part-of-speech tagging on the request text to obtain a set of segmented words with part-of-speech tags, and screening out the nominal morphemes in that set to form a first entity set;

2. The method for identifying named entities in user request data according to claim 1, wherein the word segmentation of the request text in step A specifically comprises: using a CRF model to label the position of each character within a word as word-beginning, word-middle, word-end or single-character word, and forming each segmented word either from the characters running from a word-beginning to the following word-end or from a single-character word.

3. The method for identifying named entities in user request data according to claim 2, wherein the part-of-speech tagging of the request text in step A specifically comprises: defining a group of feature functions over all possible part-of-speech tag sequences of the segmented words, assigning a weight to each feature function, scoring each tag sequence, and taking the highest-scoring tag sequence as the part-of-speech tagging result.

4. The method according to claim 3, wherein the part-of-speech tagging of the request text in step A further comprises applying user-defined part-of-speech tags to words from a specific domain.

5. The method according to claim 3, wherein step B specifically comprises inputting labeled data into a CRF model for training, and then extracting the person names, place names and organization names from the data to be predicted.

6. The method for identifying named entities in user request data according to claim 5, wherein, for request data from which entity recognition in step B cannot extract entities, syntactic analysis is used to identify the redundant components of the sentence that do not help entity recognition, namely qualifiers, ordinal words and quantifiers, and to parse the simple clauses, verb phrases and noun phrases in the sentence.

7. The method for identifying named entities in user request data according to claim 5, wherein in step C, when the core word is a verb, it is the predicate of the sentence, the predicate is followed by an object or a predicative, and the object is an entity or part of an entity; when the core word is a noun, the noun is the entity to be extracted or part of it.

8. The method according to claim 7, wherein the cleaning in step D specifically comprises removing the components that do not belong to named entities.

9. The method according to claim 8, wherein the components that do not belong to named entities comprise at least pronouns, verbs and prepositions.