CN113343693A - Named entity identification method, device, equipment and machine readable medium - Google Patents

Named entity identification method, device, equipment and machine readable medium

Info

Publication number
CN113343693A
Authority
CN
China
Prior art keywords
language
merging
distance
language units
adjacent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010140584.8A
Other languages
Chinese (zh)
Inventor
仇伟
黄祥
陈漠沙
黄非
司罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010140584.8A priority Critical patent/CN113343693A/en
Publication of CN113343693A publication Critical patent/CN113343693A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a named entity recognition method, apparatus, device and machine-readable medium, wherein the method includes: segmenting a text to be recognized into a plurality of language units; calculating the distance between any two adjacent language units among the plurality of language units; merging the plurality of language units according to the distances between adjacent language units to obtain one or more merged language units; and determining an entity recognition result corresponding to the text to be recognized according to the entity information respectively corresponding to the one or more merged language units. The embodiment of the application can improve the efficiency of named entity recognition.

Description

Named entity identification method, device, equipment and machine readable medium
Technical Field
The present application relates to the field of communications technologies, and in particular, to a named entity recognition method, a named entity recognition apparatus, a device, and a machine-readable medium.
Background
With the popularization of computers and the rapid development of the internet, a large amount of information appears in the form of electronic documents. To meet the serious challenge brought by this information explosion, automatic tools are required to extract truly valuable information from massive data; this task is known as information extraction. A named entity refers to an entity in text that has a specific meaning, such as a person name, place name, organization name, proper noun, and the like. Named Entity Recognition (NER) plays an important role in information extraction: it can accurately recognize and classify the proper nouns representing named entities in text, and thereby provides important semantic support for many natural language processing tasks such as automatic question answering, opinion mining, and semantic analysis.
Named entities typically appear as sequences of consecutive characters. Inevitably, named entities can be nested, that is, inside one named entity there are multiple entities with a nested structure. For example, "Yanyuan of Beijing University" includes named entities such as "Beijing University", "Yanyuan", and "Yanyuan of Beijing University" itself. The different entities produced by a nested structure often carry different semantic information, and effective recognition and classification of multiple nested entities is necessary to preserve the full semantics of the original text.
To recognize entities with nested structures, current nested named entity recognition methods employ a multi-layer Long Short-Term Memory (LSTM) model, where the first LSTM layer predicts the innermost entities, the second LSTM layer predicts the next-innermost entities, and so on, with the last LSTM layer predicting the outermost entities.
In practical applications, a multi-layer LSTM model usually has high complexity, so named entity recognition based on it is less efficient.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present application is to provide a named entity recognition method that can improve the efficiency of named entity recognition.
Correspondingly, the embodiment of the application also provides a named entity recognition apparatus, a device and a machine-readable medium, so as to ensure the implementation and application of the above method.
In order to solve the above problem, an embodiment of the present application discloses a named entity identification method, including:
segmenting a text to be recognized into a plurality of language units;
calculating a distance between any two adjacent language units among the plurality of language units;
combining the plurality of language units according to the distance between any two adjacent language units to obtain one or more combined language units;
and determining an entity recognition result corresponding to the text to be recognized according to the entity information corresponding to the one or more merged language units respectively.
On the other hand, the embodiment of the present application further discloses a named entity recognition apparatus, the apparatus includes:
the segmentation module is used for segmenting the text to be recognized into a plurality of language units;
a distance calculation module for calculating a distance between any two adjacent language units among the plurality of language units;
a merging module, configured to merge the multiple language units according to a distance between any two adjacent language units to obtain one or more merged language units; and
and the result determining module is used for determining an entity recognition result corresponding to the text to be recognized according to the entity information corresponding to the one or more combined language units.
In another aspect, an embodiment of the present application further discloses an apparatus, including:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform one or more of the methods described above.
In yet another aspect, embodiments of the present application disclose one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform one or more of the methods described above.
The embodiment of the application has the following advantages:
in the embodiment of the present application, the plurality of language units in the text to be recognized are merged according to the distance between any two adjacent language units in the text to be recognized, and the merging information obtained in the merging process may include: the entity information corresponding to the merged language units. The entity information may represent whether a merged language unit is an entity, and, when it is, the entity tag corresponding to the merged language unit. Since the entity information corresponding to the merged language units can be obtained during the merging of the plurality of language units in the text to be recognized, entities with nested structures can be recognized.
In addition, the embodiment of the application merges a plurality of language units in the text to be recognized and determines entity information corresponding to the merged language units, which is independent of a hierarchical neural network structure, so that the efficiency of named entity recognition can be improved.
Drawings
FIG. 1 is a flow chart of the steps of an embodiment of a named entity recognition method of the present application;
FIG. 2 is a schematic representation of a tree structure of an embodiment of the present application;
FIG. 3 is a schematic diagram of a processing flow of a named entity recognition method according to an embodiment of the present application;
FIG. 4 is a block diagram of an embodiment of a named entity recognition apparatus according to the present application; and
fig. 5 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.
While the concepts of the present application are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that this description is not intended to limit the application to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.
Reference in the specification to "one embodiment," "an embodiment," "a particular embodiment," or the like, means that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, where a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. In addition, it should be understood that items in a list of the form "at least one of A, B, and C" may include the following possible items: (A); (B); (C); (A and B); (A and C); (B and C); or (A, B and C). Likewise, a list of items of the form "at least one of A, B, or C" may mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B and C).
In some cases, the disclosed embodiments may be implemented as hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be executed by one or more processors. A machine-readable storage medium may be implemented as a storage device, mechanism, or other physical structure (e.g., a volatile or non-volatile memory, a media disc, or other physical structure device) for storing or transmitting information in a form readable by a machine.
In the drawings, some structural or methodical features may be shown in a particular arrangement and/or ordering. However, such specific arrangement and/or ordering may not be required. Rather, in some embodiments, such features may be arranged in a manner and/or order different from that shown in the figures. Moreover, the inclusion of a structural or methodical feature in a particular figure does not imply that such feature is required in all embodiments; in some embodiments, such feature may not be included or may be combined with other features.
Aiming at the technical problem of low efficiency of named entity identification, the embodiment of the application provides a named entity identification scheme, which specifically comprises the following steps: segmenting a text to be recognized into a plurality of language units; calculating a distance between any two adjacent language units among the plurality of language units; combining the plurality of language units according to the distance between any two adjacent language units to obtain one or more combined language units; and determining an entity recognition result corresponding to the text to be recognized according to the entity information corresponding to the one or more merged language units respectively.
In the embodiment of the present application, the plurality of language units in the text to be recognized are merged according to the distance between any two adjacent language units in the text to be recognized, and the merging information obtained in the merging process may include: the entity information corresponding to the merged language units. The entity information may represent whether a merged language unit is an entity, and, when it is, the entity tag corresponding to the merged language unit. Since the entity information corresponding to the merged language units can be obtained during the merging of the plurality of language units in the text to be recognized, entities with nested structures can be recognized.
In addition, the embodiment of the application merges a plurality of language units in the text to be recognized and determines entity information corresponding to the merged language units, which is independent of a hierarchical neural network structure, so that the efficiency of named entity recognition can be improved.
The method provided by the embodiment of the application can be applied to application environments corresponding to the client and the server, wherein the client and the server are located in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
Optionally, the client may run on a terminal, and the terminal specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
In the embodiment of the present Application, the client may correspond to any APP (Application). For example, the APP corresponding to the client may include: entity recognition APP and the like. The client can receive the text to be recognized input by the user and return an entity recognition result corresponding to the text to be recognized to the user.
According to an embodiment, the client may determine the entity recognition result corresponding to the text to be recognized by using the scheme of the embodiment of the present application.
According to another embodiment, the client side can send the text to be recognized to the server side, and the server side determines the entity recognition result corresponding to the text to be recognized by using the scheme of the embodiment of the application.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a named entity identification method according to the present application is shown, where the method specifically includes the following steps:
Step 101: segmenting a text to be recognized into a plurality of language units;
Step 102: calculating the distance between any two adjacent language units in the plurality of language units;
Step 103: combining the plurality of language units according to the distance between any two adjacent language units to obtain one or more combined language units;
Step 104: determining an entity recognition result corresponding to the text to be recognized according to the entity information corresponding to the one or more combined language units respectively.
At least one step included in the method of the embodiment of the present application may be executed by a client or a server, and it is understood that the embodiment of the present application does not impose a limitation on a specific execution subject of the step included in the method.
In step 101, the text to be recognized is the text on which named entity recognition is to be performed in the embodiment of the present application. The text to be recognized may originate from a user, from a client, or from a computing device; it is understood that the embodiment of the present application does not limit the specific source of the text to be recognized.
In this embodiment of the present application, optionally, the granularity of a language unit may include: a character, or a word. For example, if the language of the text to be recognized is Chinese, a language unit may be a single character. For another example, if the language of the text to be recognized is English, a language unit may be a word, and the like. It is understood that the embodiment of the present application does not impose a limitation on the specific granularity of a language unit.
In the embodiment of the application, after the text to be recognized is segmented into a plurality of language units, the plurality of language units in the text to be recognized can represent text segments or a part of the text to be recognized.
For example, if the text A to be recognized is "Yanyuan of Beijing University" (北京大学的燕园), the text A to be recognized may be segmented into the following language units (single characters): "north" (北), "jing" (京), "big" (大), "school" (学), "of" (的), "swallow" (燕), and "garden" (园).
In step 102, the distance between any two adjacent language units in the text to be recognized may be a semantic distance between any two adjacent language units, which may represent the matching degree of any two adjacent language units in the semantic aspect.
In an optional embodiment of the present application, the calculating a distance between any two adjacent language units may specifically include: determining the characteristics corresponding to the plurality of language units respectively; and determining the semantic distance between any two adjacent language units in the text to be recognized according to the characteristics corresponding to the plurality of language units respectively.
In an alternative embodiment of the present application, a word embedding method may be adopted to determine the features corresponding to the plurality of language units respectively. The word embedding method may map language units to vectors of real numbers so that the vectors can contain semantic and grammatical information.
The word embedding method may include: Word2Vec (Word to Vector), the GloVe (Global Vectors) method, etc. It is to be understood that the embodiments of the present application are not limited to a specific word embedding method.
In an optional embodiment of the application, the determining the features corresponding to the plurality of language units respectively specifically includes: determining first vectors corresponding to the plurality of language units respectively by adopting a word embedding method; and performing feature extraction on the first vector by using a Convolutional Neural Network (CNN) to obtain a second vector. The second vector may contain character-level features such as character-shape features of radicals, etc. of Chinese characters, and features such as prefixes and suffixes of English words.
In another optional embodiment of the present application, the determining features corresponding to the plurality of language units respectively specifically includes: determining first vectors corresponding to the plurality of language units respectively by adopting a word embedding method; and performing feature extraction on the first vector by using a bidirectional neural network to obtain a third vector.
The basic idea of a bi-directional neural network is that an input sequence passes through the neural network once in both the forward and reverse directions, and such a bi-directional structure provides the output layer with complete past and future context information for the nodes in the input sequence. The bidirectional neural network may include: bidirectional LSTM, or bidirectional RNN (Recurrent Neural Network), etc.
In yet another optional embodiment of the present application, the determining features corresponding to the plurality of language units respectively specifically includes: determining first vectors corresponding to the plurality of language units respectively by adopting a word embedding method; performing feature extraction on the first vector by using the CNN to obtain a second vector; and performing feature extraction on the second vector by using a bidirectional neural network to obtain a third vector.
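A minimal sketch of this feature-extraction pipeline (word embedding, character-level CNN, bidirectional LSTM), written in PyTorch; the class name, dimensions, and kernel size are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class UnitEncoder(nn.Module):
    """Encodes each language unit: a word-embedding first vector, a
    character-level CNN second vector, and a BiLSTM context third vector."""
    def __init__(self, vocab_size, char_vocab_size,
                 emb_dim=100, char_dim=30, hidden_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)        # first vector
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(emb_dim + char_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T); char_ids: (B, T, C) character ids per language unit
        w = self.word_emb(word_ids)                                       # (B, T, emb_dim)
        b, t, c = char_ids.shape
        ch = self.char_emb(char_ids.view(b * t, c)).transpose(1, 2)       # (B*T, char_dim, C)
        ch = torch.relu(self.char_cnn(ch)).max(dim=2).values.view(b, t, -1)  # second vector
        third, _ = self.bilstm(torch.cat([w, ch], dim=-1))                # third vector with context
        return third                                                      # (B, T, 2*hidden_dim)
```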
The embodiment of the present application does not impose any limitation on the specific process of calculating the distance between any two adjacent language units.
According to one embodiment, the distance between any two adjacent language units may be calculated using a similarity metric method. The similarity measurement method may include: cosine similarity method, euclidean distance method, etc.
According to another embodiment, the distance between any two adjacent language units in the text to be recognized can be calculated by using an FCN (Fully Convolutional Network).
An FCN replaces fully connected layers with convolutional layers to form a fully convolutional network. It can perform semantic segmentation on the language units and, on that basis, can improve the accuracy of the distance between any two adjacent language units.
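As an illustration of the similarity-metric variant, the sketch below computes cosine distances between the feature vectors of adjacent units; the epsilon term and the choice of cosine distance (rather than Euclidean distance or an FCN) are assumptions made for this example.

```python
import numpy as np

def adjacent_distances(features: np.ndarray) -> np.ndarray:
    """features: (T, D) array, one row per language unit.
    Returns the T-1 semantic distances between adjacent units,
    here computed as cosine distance (1 - cosine similarity)."""
    a, b = features[:-1], features[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    return 1.0 - cos  # smaller distance = closer semantic match
```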
In step 103, the plurality of language units are merged according to the distance between any two adjacent language units, so that the language units are merged step by step back into the complete text to be recognized.
The merging information obtained by the merging process may include: the entity information corresponding to the merged language units. The entity information may represent whether a merged language unit is an entity, and, when it is, the entity tag corresponding to the merged language unit. Since the entity information corresponding to the merged language units is obtained during the merging of the plurality of language units in the text to be recognized, entities with nested structures can be recognized.
Optionally, the entity tag may characterize an entity class. The entity tag may include: PER (person name), LOC (place name), ORG (organization name), MISC (proper noun), and the like.
Optionally, merging the plurality of language units in the text to be recognized may include: determining the entity information corresponding to each merged language unit. A merged language unit characterizes the result of a particular merge, so determining the entity information corresponding to a merged language unit amounts to determining the entity information corresponding to that merge result.
The embodiment of the application can determine the entity information corresponding to the combined language unit by using a named entity identification method. The named entity identification method can comprise the following steps: a rule and dictionary based approach, a statistical based approach, or a neural network based approach, etc.
In an optional embodiment of the present application, the determining entity information corresponding to the merged language unit specifically may include: determining a first feature corresponding to a first merging object corresponding to a merging language unit and a second feature corresponding to a second merging object corresponding to the merging language unit; processing the first characteristic and the second characteristic by using the FCN to obtain a processing result; and processing the processing result by utilizing the normalization function to obtain entity information corresponding to the combined language unit.
The merged language unit is obtained by merging a first merging object and a second merging object. For example, "north" and "jing" in the aforementioned text A to be recognized may be merged into "Beijing", where "north" denotes the first merging object, "jing" denotes the second merging object, and "Beijing" denotes the merged language unit. As another example, "big" and "school" in the text A to be recognized may be merged into "university", where "big" represents the first merging object, "school" represents the second merging object, and "university" represents the merged language unit.
It is to be understood that the first merging object or the second merging object may be a single language unit, or may itself be the result of merging a plurality of language units. For example, two merged language units in the aforementioned text A to be recognized may be further merged, such as "Beijing" and "university" into "Beijing University", where "Beijing" represents the first merging object, "university" represents the second merging object, and "Beijing University" represents the merged language unit.
The FCN is used to process the first feature and the second feature, and can perform semantic segmentation on the first merging object and the second merging object, thereby improving the accuracy of the entity information corresponding to the merged language unit. The normalization function may include: the softmax() function, etc. The normalization function is used to determine the probability that the merged language unit corresponds to certain entity information. It is understood that the embodiment of the present application does not impose a limitation on the specific normalization function.
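A hedged sketch of this prediction step: the features of the two merging objects are concatenated and scored by 1x1 convolutions (one simple fully convolutional head) followed by softmax. The label set, layer sizes, and the plain concatenation are assumptions for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn

LABELS = ["empty", "PER", "LOC", "ORG", "MISC"]  # "empty" = not an entity

class MergeEntityHead(nn.Module):
    """Scores a merged language unit from the features of its two merging objects."""
    def __init__(self, feat_dim, num_labels=len(LABELS)):
        super().__init__()
        # 1x1 convolutions keep the head fully convolutional (no fully connected layer).
        self.fcn = nn.Sequential(
            nn.Conv1d(2 * feat_dim, feat_dim, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, num_labels, kernel_size=1),
        )

    def forward(self, first_feat, second_feat):
        # first_feat, second_feat: (batch, feat_dim) features of the two merging objects
        x = torch.cat([first_feat, second_feat], dim=-1).unsqueeze(-1)  # (B, 2D, 1)
        logits = self.fcn(x).squeeze(-1)                                # (B, num_labels)
        return torch.softmax(logits, dim=-1)  # probability of each entity label
```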
In another optional embodiment of the present application, the determining entity information corresponding to the merged language unit specifically may include: and determining entity information corresponding to the combined language unit according to a mapping relation between the pre-stored text and the entity information. Specifically, the merged language unit may be matched with the text in the mapping relationship to obtain entity information corresponding to the merged language unit.
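A minimal sketch of such a lookup, assuming the pre-stored mapping is held in an in-memory dictionary; the entries shown are illustrative.

```python
# Pre-stored mapping from text to entity information (illustrative entries).
ENTITY_DICT = {
    "Beijing University": "ORG",
    "Yanyuan": "LOC",
}

def lookup_entity(merged_text: str):
    """Match the merged language unit against the pre-stored mapping;
    returns the entity label, or None if the text is not a known entity."""
    return ENTITY_DICT.get(merged_text)
```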
Optionally, the merging information may further include: the merging order of the plurality of language units in the text to be recognized. For example, the merging order may be: two adjacent language units are first merged into a merged language unit, then the merged language unit is merged with further language units, and so on.
The embodiment of the application can provide the following technical scheme for merging a plurality of language units in the text to be recognized:
technical proposal A1,
In technical solution A1, the merging of the multiple language units in the text to be recognized may specifically include: merging the plurality of language units in the text to be recognized in ascending order of the distance between any two adjacent language units.
The distance between any two adjacent language units in the text to be recognized may be a semantic distance between any two adjacent language units, and may represent a matching degree of any two adjacent language units in terms of semantics. Generally, the smaller the distance is, the higher the matching degree of any two adjacent language units in the aspect of semantics is; conversely, the larger the distance is, the lower the matching degree of any two adjacent language units in terms of semantics is.
Technical solution A1 merges the plurality of language units in the text to be recognized in ascending order of distance, and can therefore preferentially merge language units with a high degree of matching.
In an optional embodiment of the present application, the merging the multiple language units in the text to be recognized specifically includes: sequentially storing the distances between any two adjacent language units to a distance set according to the sequence from small to large; according to the sequence from front to back, selecting a target distance from the distance set, and merging two adjacent merging objects corresponding to the target distance; the types of the two adjacent merging objects specifically include: a language unit and a language unit, or a language unit and a merged language unit, or a merged language unit and a merged language unit.
According to an embodiment, the target distance may include: a first distance ranked first in the set; the merging of the two adjacent merging objects corresponding to the target distance may include: merging the two adjacent language units corresponding to the first distance to obtain a first merged language unit.
The embodiment of the application can first merge the two adjacent language units with the minimum distance. Taking the text A to be recognized as an example, assuming that the two adjacent language units corresponding to the minimum distance are "big" and "school", "big" and "school" may be merged first.
According to another embodiment, the target distance may further include: a second distance that is not ranked first; the merging of the two adjacent merging objects corresponding to the target distance may further include: determining a first language unit from the language units adjacent to the first merged language unit according to the second distance; and merging the first merged language unit and the first language unit to obtain a second merged language unit. That is, the embodiment of the application can merge the first merged language unit with the first language unit.
Optionally, the first merged language unit includes a second language unit adjacent to the first language unit, and the distance between the second language unit and the first language unit is the second distance.
According to still another embodiment, the target distance may further include: a third distance that is not ranked first; the merging of the two adjacent merging objects corresponding to the target distance may further include: merging two adjacent first merged language units according to the third distance.
In the case that two first merged language units are adjacent, the embodiment of the present application may merge the two adjacent first merged language units. The distance between the two adjacent first merged language units can be taken as the distance between the two adjacent language units at their boundary. Taking the text A to be recognized as an example, the two adjacent first merged language units include "Beijing" and "university"; the distance between "Beijing" and "university" may be taken as the distance between the two adjacent single characters at their boundary, that is, the distance between "Beijing" and "university" is determined according to the distance between "jing" and "big".
According to another embodiment, the target distance may further include: a fourth distance that is not ranked first; the merging of the two adjacent merging objects corresponding to the target distance may further include: merging the adjacent first merged language unit and second merged language unit according to the fourth distance.
It is to be appreciated that in other embodiments of the present application, where two second merged language units are adjacent, the adjacent two second merged language units can be merged. It can be understood that, in the embodiment of the present application, no limitation is imposed on the first merging object and the second merging object corresponding to the merging process.
It can be understood that, no matter whether the merged object corresponding to one distance is a language unit or a merged language unit, as long as the distance meets the sorting condition, two adjacent merged objects corresponding to the distance can be merged.
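The sketch below illustrates one way to realize solution A1 in plain Python: boundaries are processed in ascending order of distance, and when two spans merge, the surviving boundaries keep their original distances, as in the "Beijing"/"university" example above. The function name and the `predict_entity` placeholder for the entity-information prediction are assumptions made for illustration.

```python
def merge_ascending(units, distances, predict_entity):
    """units: list of language units; distances[i] is the distance between
    units[i] and units[i+1]; predict_entity(text) returns entity info.
    Returns the (merged_text, entity_info) pairs produced while merging."""
    spans = [[u] for u in units]          # each span starts as a single language unit
    gaps = list(distances)                # gaps[i] lies between spans[i] and spans[i+1]
    merges = []
    for d in sorted(distances):           # smallest distance is merged first
        i = gaps.index(d)
        merged = spans[i] + spans[i + 1]  # merge the two adjacent merging objects
        text = "".join(merged)
        merges.append((text, predict_entity(text)))
        spans[i:i + 2] = [merged]         # the merged span replaces both merging objects
        gaps.pop(i)                       # surviving boundaries keep their original distances
    return merges
```

Applied to the seven language units of text A with suitably small distances inside "Beijing", "university", and "Yanyuan", such a routine reproduces the merging order described above.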
Technical solution A2
In technical solution A2, the merging of the multiple language units in the text to be recognized may specifically include: dividing the plurality of language units into at least two categories; merging the one or more language units belonging to the same category to obtain a corresponding sub-merging result; and merging the sub-merging results corresponding to the at least two categories.
Technical solution A2 may divide the language units in the text to be recognized into at least two categories, and the merging process corresponding to each category may be independent. For example, the merging processes corresponding to different categories are independent and parallel, so the merging efficiency can be improved.
In this embodiment of the application, optionally, the dividing of the language units in the text to be recognized into at least two categories may specifically include: dividing the language units in the text to be recognized into at least two categories according to the maximum distance between any two adjacent language units. Specifically, among the plurality of language units, the ith language unit and the (i+1)th language unit with the largest distance between them are obtained; the ith language unit is classified into one category, and the (i+1)th language unit is classified into another category. It is to be understood that the language units located before the ith language unit may be classified into the same category as the ith language unit, and the language units located after the (i+1)th language unit may be classified into the same category as the (i+1)th language unit.
Optionally, the at least two categories may include: a first category and a second category, wherein the maximum distance may be used to determine the boundary between the first category and the second category. For example, if the text to be recognized includes N words and the distance between the ith word and the (i+1)th word is the largest, the 1st through ith words can be used as the first category, and the (i+1)th through Nth words can be used as the second category; wherein i and N are both natural numbers greater than 0.
Taking the text A to be recognized as an example, assume that the first category includes: "north", "jing", "big", "school", and "of", and the second category includes: "swallow" and "garden". For the first category, "big" and "school" may be merged; for the second category, "swallow" and "garden" may be merged; the merging corresponding to the first category and the merging corresponding to the second category may proceed in parallel, as in the sketch below.
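A sketch of solution A2: split at the single largest adjacent distance and merge each category independently (the two calls could run in parallel), then combine the two categories' results. The function names are assumptions; `merge_category` stands in for whichever merging routine, for example the ascending-distance merge above, is used inside a category.

```python
def split_and_merge(units, distances, merge_category):
    """Split the units into two categories at the largest adjacent distance,
    merge each category on its own, then combine the two partial results."""
    i = max(range(len(distances)), key=distances.__getitem__)  # boundary: units[i] | units[i+1]
    first_units, second_units = units[:i + 1], units[i + 1:]
    first_result = merge_category(first_units, distances[:i])
    second_result = merge_category(second_units, distances[i + 1:])
    # Finally, the sub-merging results of the two categories are merged with each other.
    return first_result + second_result
```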
Technical solution A3
In technical solution A3, the merging of the multiple language units in the text to be recognized may specifically include: in the case that a first merging object and a second merging object are adjacent, judging whether to merge the first merging object and the second merging object according to the distance between them; the first merging object is a language unit or a merged language unit, and the second merging object is a language unit or a merged language unit.
According to an embodiment, in a case where a first merging object and a second merging object are adjacent, if the first merging object corresponds to a unique second merging object and the second merging object corresponds to a unique first merging object, the first merging object and the second merging object may be merged.
According to another embodiment, when the first merging object and the second merging object are adjacent, if the first merging object corresponds to a plurality of second merging objects, a target second merging object is determined from the plurality of second merging objects according to a distance between the first merging object and the second merging object, and the first merging object and the target second merging object are merged. The distance between the target second merged object and the first merged object is the minimum value among the distances between the first merged object and the plurality of second merged objects.
According to another embodiment, when the first merging object and the second merging object are adjacent, if the second merging object corresponds to a plurality of first merging objects, the target first merging object is determined from the plurality of first merging objects according to the distance between the first merging object and the second merging object, and the second merging object and the target first merging object are merged. The distance between the target first merged object and the second merged object is the minimum of the distances between the second merged object and the plurality of first merged objects.
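Read this way, solution A3 amounts to a mutual nearest-neighbour test on adjacent merging objects; the sketch below is one hedged interpretation of that decision, with illustrative names.

```python
def should_merge(i, gaps):
    """gaps[i] is the distance between adjacent merging objects i and i+1.
    Returns True if objects i and i+1 are each other's nearest adjacent
    merging object, i.e. the target objects described above."""
    left_ok = i == 0 or gaps[i] <= gaps[i - 1]                # object i prefers its right neighbour
    right_ok = i == len(gaps) - 1 or gaps[i] <= gaps[i + 1]   # object i+1 prefers its left neighbour
    return left_ok and right_ok
```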
The process of merging a plurality of language units in the text to be recognized is described in detail through technical solutions A1 to A3, and it can be understood that a person skilled in the art can adopt any one or a combination of the technical solutions A1 to A3 according to actual application requirements; the embodiment of the present application does not limit the specific process of merging the plurality of language units in the text to be recognized.
In step 104, an entity recognition result corresponding to the text to be recognized is determined according to the entity information corresponding to the one or more merged language units. After the entity recognition result is obtained, the entity recognition result may be output to the outside, or the entity recognition result may be stored.
Optionally, the entity recognition result includes: a tree structure, wherein the root node of the tree structure corresponds to the text to be recognized, a parent node of the tree structure corresponds to a merged language unit, and a leaf node of the tree structure corresponds to a language unit.
According to the embodiment of the application, the entity recognition result is represented through the tree structure, the hierarchy of the tree structure can represent the merging sequence, and the nodes of the tree structure can represent the merging language units appearing in the merging process.
Referring to FIG. 2, a schematic diagram of a tree structure according to an embodiment of the present application is shown. The tree structure may correspond to the text A to be recognized, "Yanyuan of Beijing University"; the root node of the tree structure may correspond to "Yanyuan of Beijing University", and the leaf nodes of the tree structure respectively correspond to the following language units: "north", "jing", "big", "school", "of", "swallow", and "garden". In the tree structure, "north" and "jing" are merged into "Beijing", "big" and "school" are merged into "university", "Beijing" and "university" are merged into "Beijing University", "swallow" and "garden" are merged into "Yanyuan", and these intermediate results, together with "of", are further merged into the root "Yanyuan of Beijing University". In the tree structure, the merged language unit corresponding to each parent node is labeled with entity information, such as entity or non-entity.
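One way to picture such a tree is with a simple recursive node type; the class below is an illustrative assumption, and the entity labels of the intermediate nodes are chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    text: str                                   # language unit or merged language unit
    entity: Optional[str] = None                # entity label, or None if not an entity
    children: List["Node"] = field(default_factory=list)  # empty for leaf nodes

# Leaves are single characters; parents are merged language units.
beijing = Node("Beijing", "LOC", [Node("north"), Node("jing")])
university = Node("university", None, [Node("big"), Node("school")])
pku = Node("Beijing University", "ORG", [beijing, university])
```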
In an optional embodiment of the present application, the method may further include: determining entity information corresponding to the individual language units. When the granularity of a language unit is a word, it can also be determined whether that single language unit is itself an entity.
To sum up, in the named entity recognition method according to the embodiment of the present application, the plurality of language units in the text to be recognized are merged according to the distance between any two adjacent language units, and the merging information obtained in the merging process may include: the entity information corresponding to the merged language units. The entity information may represent whether a merged language unit is an entity, and, when it is, the entity tag corresponding to the merged language unit. Since the entity information corresponding to the merged language units is obtained during the merging of the plurality of language units in the text to be recognized, entities with nested structures can be recognized.
In addition, the embodiment of the application merges a plurality of language units in the text to be recognized and determines entity information corresponding to the merged language units, which is independent of a hierarchical neural network structure, so that the efficiency of named entity recognition can be improved.
The embodiment of the application can recognize the text to be recognized through a named entity recognition apparatus. The named entity recognition apparatus may specifically include: a word embedding module, a bidirectional LSTM module, a first prediction module, a merging module, and a second prediction module.
The word embedding module is used for determining first characteristics corresponding to the plurality of language units respectively by adopting a word embedding method. And the bidirectional LSTM module is used for processing the first characteristic by adopting bidirectional LSTM, and the obtained second characteristic can comprise context information. And the first prediction module is used for predicting the entity information corresponding to the language unit. And the merging module is used for merging the plurality of language units according to the distance between any two adjacent language units to obtain a merged language unit. And the second prediction module is used for predicting the entity information corresponding to the combined language unit.
In order to make the embodiment of the present application better understood by those skilled in the art, the named entity identification method of the embodiment of the present application is described by specific examples herein.
Referring to FIG. 3, a schematic diagram of a processing flow of a named entity recognition method according to an embodiment of the present application is shown, where named entity recognition may be performed on a text B to be recognized, "g0160-included AP-1 complex", and the text B to be recognized may be segmented into the following language units: "g0160", "-", "included", "AP", "-", "1", and "complex". In FIG. 3, "<s>" indicates a start marker, and "</s>" indicates an end marker.
In fig. 3, the word embedding module is configured to determine first features corresponding to the plurality of language units by using a word embedding method. And the bidirectional LSTM module is used for processing the first characteristic by adopting bidirectional LSTM, and the obtained second characteristic can comprise context information.
In FIG. 3, the merging module is configured to merge the plurality of language units according to the distance between two adjacent language units. The distances between two adjacent language units may be: d1, d2, d3, d4, d5 and d6, where d5 < d4 < d6 < d1 < d2 < d3.
In FIG. 3, the entity information may include: "empty" and "G#Protein", where "G#Protein" characterizes an entity and "empty" characterizes a non-entity.
According to an embodiment, a plurality of language units in the text to be recognized may be merged according to a sequence from small to large of a distance between any two adjacent language units, where the corresponding merging process specifically includes:
firstly, the "-" and "1" corresponding to d5 are merged into "-1", and the entity information corresponding to "-1" is "empty";
then, the "AP" and "-1" corresponding to d4 are merged into "AP-1", and the entity information corresponding to "AP-1" is "G#Protein";
then, the "AP-1" and "complex" corresponding to d6 are merged into "AP-1 complex", and the entity information corresponding to "AP-1 complex" is "empty";
then, the "g0160" and "-" corresponding to d1 are merged into "g0160-", and the entity information corresponding to "g0160-" is "empty";
then, the "g0160-" and "included" corresponding to d2 are merged into "g0160-included", and the entity information corresponding to "g0160-included" is "empty";
finally, the "g0160-included" and "AP-1 complex" corresponding to d3 are merged into "g0160-included AP-1 complex", and the entity information corresponding to "g0160-included AP-1 complex" is "G#Protein".
According to another embodiment, the text to be recognized may be divided into a first category and a second category, the boundaries corresponding to the first category and the second category may be positions corresponding to the maximum distance d3, and the language units corresponding to the first category may include: "g 0160", "-", "included", the language unit corresponding to the second category may include: "AP", "-", "1", and "complex". Language units corresponding to the first category or the second category can be merged respectively; and then merging the merged language units corresponding to the first category and the second category.
The text to be recognized in the embodiment of the application can be derived from any application fields, such as the logistics field, the medical field, the geographic field, the e-commerce field and the like.
In one embodiment of the present application, the text to be recognized may originate from the field of logistics. For example, the text C to be recognized may be "No. 969, Wenyi West Road, Yuhang District, Hangzhou City", which may be segmented into the following language units (single characters): "hang", "state", "city", "remainder", "hang", "region", "text", "one", "west", "road", "9", "6", "9", and "number", and the distances between adjacent language units may be: d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12 and d13, where d3, d6 and d10 are greater than the other distances.
The processing procedure of the text C to be recognized may include:
firstly, the text C to be recognized is divided into 4 categories according to d3, d6 and d10, which are respectively: a first category "Hangzhou City", a second category "Yuhang District", a third category "Wenyi West Road", and a fourth category "No. 969";
then, the first category, the second category, the third category and the fourth category are respectively merged to obtain the corresponding merged language units: "Hangzhou City", "Yuhang District", "Wenyi West Road" and "No. 969", where "Hangzhou City" corresponds to a city name, "Yuhang District" corresponds to a district name, "Wenyi West Road" corresponds to a road name, and "No. 969" corresponds to a house number;
then, the merged language units corresponding to the first, second, third and fourth categories are merged to obtain the corresponding address name "No. 969, Wenyi West Road, Yuhang District, Hangzhou City".
In another embodiment of the present application, the text to be recognized may originate from the medical field. For example, the text D to be recognized may be "left pleural effusion", wherein the text D to be recognized may be segmented into a plurality of language units as follows: "left", "side", "chest", "cavity", "volume" and "liquid", the distance between two adjacent language units can be respectively: d1, d2, d3, d4 and d5, wherein d3 is greater than the other distances.
The processing procedure of the text D to be recognized may include:
first, according to d3, the text D to be recognized is divided into 2 categories, which are: a first category "left chest cavity" and a second category "effusion";
then, the first category and the second category are respectively merged to obtain the corresponding merged language units: "left chest cavity" and "effusion", where "left chest cavity" corresponds to a body part name;
then, the merged language units corresponding to the first category and the second category are merged to obtain the disease symptom name "left pleural effusion".
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those skilled in the art will also appreciate that the embodiments described in the specification are presently preferred and that no particular act is required of the embodiments of the application.
The embodiment of the application also provides a named entity recognition device.
Referring to FIG. 4, a block diagram of an embodiment of a named entity recognition apparatus according to the present application is shown, where the apparatus specifically includes the following modules: a segmentation module 401, a distance calculation module 402, a merging module 403 and a result determination module 404.
The segmentation module 401 is configured to segment a text to be recognized into a plurality of language units;
a distance calculating module 402, configured to calculate a distance between any two adjacent language units in the plurality of language units;
a merging module 403, configured to merge the multiple language units according to a distance between any two adjacent language units to obtain one or more merged language units; and
a result determining module 404, configured to determine an entity recognition result corresponding to the text to be recognized according to the entity information corresponding to the one or more merged language units, respectively.
Optionally, the merging module 403 may include:
and the first merging module is used for merging the plurality of language units in the text to be recognized according to the sequence of the distances between any two adjacent language units from small to large.
Optionally, the merging module 403 may include:
a category dividing module for dividing the plurality of language units into at least two categories;
the second merging module is used for merging the one or more language units belonging to the same category to obtain a corresponding sub-merging result; and
the fusion module is used for merging the sub-merging results corresponding to the at least two categories.
Optionally, the category classification module may include:
a demarcation point determining module, configured to obtain, among the plurality of language units, the ith language unit and the (i+1)th language unit between which the distance is the largest;
and a demarcation point category determining module, configured to classify the ith language unit into one category and the (i+1)th language unit into another category.
Optionally, the first merging module may include:
the set storage module is used for sequentially storing the distances between any two adjacent language units to a distance set according to the sequence from small to large;
a set sequence merging module, configured to select a target distance from the distance sets according to a sequence from front to back, and merge two adjacent merging objects corresponding to the target distance; the types of the two adjacent merging objects may include: a language unit and a language unit, or a language unit and a merged language unit, or a merged language unit and a merged language unit.
Optionally, the target distance may include: a first distance ranked first, and the set order merging module may include:
and the first set sequence merging module is used for merging two adjacent language units corresponding to the first distance to obtain a first merged language unit.
Optionally, the target distance may further include: a second distance, and the set order merging module may further include:
a first language unit determining module, configured to determine a first language unit from language units adjacent to the first merged language unit according to the second distance;
and the second set sequence merging module is used for merging the first merged language unit and the first language unit to obtain a second merged language unit.
Optionally, the first merged language unit may include: and a second language unit adjacent to the first language unit, wherein a distance between the second language unit and the first language unit is a second distance.
Optionally, the target distance may further include: a third distance, and the set order merging module may further include:
and the third set sequence merging module is used for merging two adjacent first merging language units according to the third distance.
Optionally, the target distance may further include: a fourth distance, and the set order merging module may further include:
and the fourth set sequence merging module is used for merging the adjacent first merged language unit and the second merged language unit according to the fourth distance.
Optionally, the merging module 403 may include:
the third merging module is used for judging whether the first merging object and the second merging object are merged or not according to the distance between the first merging object and the second merging object under the condition that the first merging object and the second merging object are adjacent; the first merging object is a language unit or a merging language unit, and the second merging object is a language unit or a merging language unit.
Optionally, the distance calculating module may include:
a characteristic determining module, configured to determine the characteristics corresponding to the plurality of language units respectively;
and a semantic distance determining module, configured to determine the semantic distance between any two adjacent language units in the text to be recognized according to the characteristics corresponding to the plurality of language units respectively.
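One plausible realization of this distance calculation, offered purely as an assumption, is to use a feature vector per language unit (for example, a character or word embedding) and take the cosine distance between the vectors of adjacent units:

```python
import numpy as np

def adjacent_semantic_distances(features):
    """features: array of shape (n_units, dim), one feature vector per
    language unit (e.g. pretrained embeddings - an assumed choice).
    Returns the cosine distance between every pair of adjacent units;
    a smaller distance means the two units are semantically closer."""
    features = np.asarray(features, dtype=float)
    distances = []
    for a, b in zip(features[:-1], features[1:]):
        cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        distances.append(1.0 - cos_sim)
    return distances
```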
Optionally, the entity recognition result may include a tree structure, where the root node of the tree structure corresponds to the text to be recognized, a parent node of the tree structure corresponds to a merged language unit, and a leaf node of the tree structure corresponds to a language unit.
Optionally, the language unit may include: a character, or a word.
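A minimal data-structure sketch of such a tree is shown below; the class name and fields are hypothetical, chosen only to mirror the description above (root corresponds to the text to be recognized, parent nodes to merged language units, and leaves to individual characters or words).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EntityNode:
    """One node of the entity recognition tree."""
    text: str                                   # text span covered by this node
    children: List["EntityNode"] = field(default_factory=list)
    entity_type: Optional[str] = None           # entity information, if any (assumed field)

    def is_leaf(self) -> bool:
        # a leaf corresponds to a single language unit (a character or a word)
        return not self.children
```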
Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Embodiments of the application may be implemented as a system or device using any suitable hardware and/or software for the desired configuration. Fig. 5 schematically illustrates an exemplary device 1300 that may be used to implement various embodiments described above in this application.
For one embodiment, fig. 5 illustrates an exemplary apparatus 1300, which may comprise: one or more processors 1302, a system control module (chipset) 1304 coupled to at least one of the processors 1302, system memory 1306 coupled to the system control module 1304, non-volatile memory (NVM)/storage 1308 coupled to the system control module 1304, one or more input/output devices 1310 coupled to the system control module 1304, and a network interface 1312 coupled to the system control module 1304. The system memory 1306 may include instructions 1362 executable by the one or more processors 1302.
Processor 1302 may include one or more single-core or multi-core processors, and processor 1302 may include any combination of general-purpose processors or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the device 1300 can be a server, a target device, a wireless device, etc. as described above in embodiments of the present application.
In some embodiments, device 1300 may include one or more machine-readable media (e.g., system memory 1306 or NVM/storage 1308) having instructions thereon and one or more processors 1302, which in combination with the one or more machine-readable media, are configured to execute the instructions to implement the modules included in the aforementioned devices to perform the actions described above in embodiments of the present application.
System control module 1304 for one embodiment may include any suitable interface controller to provide any suitable interface to at least one of processors 1302 and/or any suitable device or component in communication with system control module 1304.
System control module 1304 for one embodiment may include one or more memory controllers to provide an interface to system memory 1306. The memory controller may be a hardware module, a software module, and/or a firmware module.
System memory 1306 for one embodiment may be used to load and store data and/or instructions 1362. For one embodiment, system memory 1306 may include any suitable volatile memory, such as suitable DRAM (dynamic random access memory). In some embodiments, system memory 1306 may include: double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
System control module 1304 for one embodiment may include one or more input/output controllers to provide an interface to NVM/storage 1308 and input/output device(s) 1310.
NVM/storage 1308 for one embodiment may be used to store data and/or instructions 1382. NVM/storage 1308 may include any suitable non-volatile memory (e.g., flash memory, etc.) and/or may include any suitable non-volatile storage device(s), e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives, etc.
NVM/storage 1308 may include storage resources that are physically part of the device on which the apparatus 1300 is installed, or storage resources that are accessible by the apparatus without necessarily being part of it. For example, the NVM/storage 1308 may be accessed over a network via the network interface 1312 and/or through the input/output devices 1310.
Input/output device(s) 1310 for one embodiment may provide an interface for device 1300 to communicate with any other suitable device, and input/output devices 1310 may include communication components, audio components, sensor components, and so forth.
Network interface 1312 of one embodiment may provide an interface for device 1300 to communicate with one or more components of a Wireless network, e.g., to access a Wireless network based on a communication standard, such as WiFi (Wireless Fidelity), 2G or 3G or 4G or 5G, or a combination thereof, and/or with any other suitable device, and device 1300 may communicate wirelessly according to any of one or more Wireless network standards and/or protocols.
For one embodiment, at least one of the processors 1302 may be packaged together with logic for one or more controllers (e.g., memory controllers) of the system control module 1304. For one embodiment, at least one of the processors 1302 may be packaged together with logic for one or more controllers of the system control module 1304 to form a System in Package (SiP). For one embodiment, at least one of the processors 1302 may be integrated on the same die as logic for one or more controllers of the system control module 1304. For one embodiment, at least one of the processors 1302 may be integrated on the same chip with logic for one or more controllers of the system control module 1304 to form a system on a chip (SoC).
In various embodiments, apparatus 1300 may include, but is not limited to: a computing device such as a desktop computing device or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, device 1300 may have more or fewer components and/or different architectures. For example, in some embodiments, device 1300 may include one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
If the display includes a touch panel, the display screen may be implemented as a touch screen display to receive an input signal from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The present application also provides a non-transitory readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, they may cause the device to execute the instructions of the methods in this application.
Provided in one example is an apparatus comprising: one or more processors; and one or more machine-readable media having instructions stored thereon which, when executed by the one or more processors, cause the apparatus to perform a method as in embodiments of the present application, which may include the method shown in figure 1, figure 2, or figure 3.
One or more machine-readable media are also provided in one example, having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a method as in embodiments of the application, which may include: the method shown in figure 1 or figure 2 or figure 3.
The specific manner in which each module performs operations of the apparatus in the above embodiments has been described in detail in the embodiments related to the method, and will not be described in detail here, and reference may be made to part of the description of the method embodiments for relevant points.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable named entity recognition device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable named entity recognition device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable named entity recognition device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable named entity recognition device to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The named entity identification method, device, equipment, and machine-readable medium provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. A named entity recognition method, comprising:
segmenting a text to be recognized into a plurality of language units;
calculating a distance between any two adjacent language units among the plurality of language units;
combining the plurality of language units according to the distance between any two adjacent language units to obtain one or more combined language units;
and determining an entity recognition result corresponding to the text to be recognized according to the entity information corresponding to the one or more merged language units respectively.
2. The method of claim 1, wherein said merging the plurality of language units comprises:
and combining the plurality of language units in the text to be recognized in ascending order of the distances between any two adjacent language units.
3. The method of claim 1, wherein said merging the plurality of language units comprises:
dividing the plurality of language units into at least two categories;
merging one or more language units belonging to the same category to obtain corresponding sub-merging results;
and combining the sub-merging results corresponding to the at least two categories.
4. The method of claim 3, wherein said classifying said plurality of language units into at least two categories comprises:
obtaining, from the plurality of language units, the ith language unit and the (i+1)th language unit between which the distance between two adjacent language units is the largest;
and classifying the ith language unit into one category and the (i+1)th language unit into another category.
5. The method according to claim 2, wherein the merging the plurality of language units in the text to be recognized comprises:
sequentially storing the distances between any two adjacent language units into a distance set in ascending order;
and selecting a target distance from the distance set in order from front to back, and merging the two adjacent merging objects corresponding to the target distance; the types of the two adjacent merging objects include: a language unit and a language unit, or a language unit and a merged language unit, or a merged language unit and a merged language unit.
6. The method of claim 5, wherein the target distance comprises a first distance arranged at the head, and the merging of the two adjacent merging objects corresponding to the target distance comprises:
and merging the two adjacent language units corresponding to the first distance to obtain a first merged language unit.
7. The method of claim 6, wherein the target distance further comprises a second distance, other than the first distance, arranged at a non-leading position, and the merging of the two adjacent merging objects corresponding to the target distance further comprises:
determining a first language unit from the language units adjacent to the first merged language unit according to the second distance;
and merging the first merged language unit and the first language unit to obtain a second merged language unit.
8. The method of claim 7, wherein the first merged language unit comprises a second language unit adjacent to the first language unit, and the distance between the second language unit and the first language unit is the second distance.
9. The method of claim 6, wherein the target distance further comprises a third distance arranged at a non-leading position, and the merging of the two adjacent merging objects corresponding to the target distance further comprises:
and merging two adjacent first merging language units according to the third distance.
10. The method of claim 7, wherein the target distance further comprises a fourth distance arranged at a non-leading position, and the merging of the two adjacent merging objects corresponding to the target distance further comprises:
and merging the adjacent first merged language unit and the second merged language unit according to the fourth distance.
11. The method according to claim 1, wherein the merging the plurality of language units in the text to be recognized comprises:
in a case where a first merging object and a second merging object are adjacent, judging whether to merge the first merging object and the second merging object according to the distance between the first merging object and the second merging object; the first merging object is a language unit or a merged language unit, and the second merging object is a language unit or a merged language unit.
12. The method according to any one of claims 1 to 11, wherein said calculating a distance between any two adjacent language units comprises:
determining characteristics corresponding to the plurality of language units respectively;
and determining the semantic distance between any two adjacent language units in the text to be recognized according to the characteristics corresponding to the plurality of language units respectively.
13. An apparatus for named entity recognition, the apparatus comprising:
the segmentation module is used for segmenting the text to be recognized into a plurality of language units;
a distance calculation module for calculating a distance between any two adjacent language units among the plurality of language units;
a merging module, configured to merge the multiple language units according to a distance between any two adjacent language units to obtain one or more merged language units; and
and the result determining module is used for determining an entity recognition result corresponding to the text to be recognized according to the entity information corresponding to the one or more combined language units.
14. An apparatus, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited by one or more of claims 1-12.
15. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-12.
CN202010140584.8A 2020-03-03 2020-03-03 Named entity identification method, device, equipment and machine readable medium Pending CN113343693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010140584.8A CN113343693A (en) 2020-03-03 2020-03-03 Named entity identification method, device, equipment and machine readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010140584.8A CN113343693A (en) 2020-03-03 2020-03-03 Named entity identification method, device, equipment and machine readable medium

Publications (1)

Publication Number Publication Date
CN113343693A true CN113343693A (en) 2021-09-03

Family

ID=77467391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010140584.8A Pending CN113343693A (en) 2020-03-03 2020-03-03 Named entity identification method, device, equipment and machine readable medium

Country Status (1)

Country Link
CN (1) CN113343693A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731773A (en) * 2015-04-17 2015-06-24 深圳证券信息有限公司 Text sentiment analysis method and text sentiment analysis system
CN109241520A (en) * 2018-07-18 2019-01-18 五邑大学 A kind of sentence trunk analysis method and system based on the multilayer error Feedback Neural Network for segmenting and naming Entity recognition
CN110298019A (en) * 2019-05-20 2019-10-01 平安科技(深圳)有限公司 Name entity recognition method, device, equipment and computer readable storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104731773A (en) * 2015-04-17 2015-06-24 深圳证券信息有限公司 Text sentiment analysis method and text sentiment analysis system
CN109241520A (en) * 2018-07-18 2019-01-18 五邑大学 A kind of sentence trunk analysis method and system based on the multilayer error Feedback Neural Network for segmenting and naming Entity recognition
CN110298019A (en) * 2019-05-20 2019-10-01 平安科技(深圳)有限公司 Name entity recognition method, device, equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN107992596B (en) Text clustering method, text clustering device, server and storage medium
CN110390054B (en) Interest point recall method, device, server and storage medium
CN109783490B (en) Data fusion method and device, computer equipment and storage medium
KR20180112031A (en) Systems and methods for providing content selection
CN107301248B (en) Word vector construction method and device of text, computer equipment and storage medium
CN110427612B (en) Entity disambiguation method, device, equipment and storage medium based on multiple languages
US10152540B2 (en) Linking thumbnail of image to web page
CN108932320B (en) Article searching method and device and electronic equipment
CN112182217A (en) Method, device, equipment and storage medium for identifying multi-label text categories
US20220101060A1 (en) Text partitioning method, text classifying method, apparatus, device and storage medium
CN109271624B (en) Target word determination method, device and storage medium
CN110363206B (en) Clustering of data objects, data processing and data identification method
US20200104361A1 (en) Machine learning of colloquial place names
CN111160029A (en) Information processing method and device, electronic equipment and computer readable storage medium
CN113723077A (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN107577660B (en) Category information identification method and device and server
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
CN113051919A (en) Method and device for identifying named entity
CN110795562A (en) Map optimization method, device, terminal and storage medium
CN113343693A (en) Named entity identification method, device, equipment and machine readable medium
CN110781292A (en) Text data multi-level classification method and device, electronic equipment and storage medium
CN111949765B (en) Semantic-based similar text searching method, system, device and storage medium
CN115238080A (en) Entity linking method and related equipment
CN115129885A (en) Entity chain pointing method, device, equipment and storage medium
CN110276001B (en) Checking page identification method and device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination