CN110543638B - Named entity identification method and device - Google Patents

Info

Publication number
CN110543638B
Authority
CN
China
Prior art keywords
language
text
executing
original text
language word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910854243.4A
Other languages
Chinese (zh)
Other versions
CN110543638A (en
Inventor
徐祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Chengying Data Technology Co ltd
Original Assignee
Hangzhou Chengying Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Chengying Data Technology Co ltd
Priority to CN201910854243.4A
Publication of CN110543638A
Application granted
Publication of CN110543638B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a method and a device for named entity identification. The method comprises: receiving an original text and separating the original text to obtain text units; determining a text unit representation vector according to each text unit; acquiring split features corresponding to the text units, and determining a feature representation vector of the original text according to the split features of the text units; and determining the named entities in the original text according to the feature representation vector of the original text and the text unit representation vectors. Because the split features corresponding to the text units are processed as the minimum elements, the internal features that the text units carry as pictographic characters are preserved to the greatest extent, the internal features among text units are retained, and the accuracy of named entity recognition is improved.

Description

Named entity identification method and device
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for named entity identification, a computing device, and a computer-readable storage medium.
Background
Named entity recognition is a fundamental task in natural language processing and has a very wide range of applications. A named entity generally refers to an entity in a text that has a particular meaning or strong referential force, and typically includes person names, place names, organization names, times, proper nouns, and the like. Named entity recognition extracts such entities from unstructured input text. In the prior art, named entity identification methods cannot capture the internal features that link the words of the original text, so the accuracy of named entity identification in the original text is low.
Disclosure of Invention
In view of the above, embodiments of the present application provide a method and an apparatus for named entity identification, a computing device, and a computer-readable storage medium, so as to solve technical defects in the prior art.
The embodiment of the application discloses a method for named entity identification, which comprises the following steps: receiving an original text, and separating the original text to obtain text units;
determining a text unit representation vector according to each text unit;
acquiring split features corresponding to the text units, and determining a feature representation vector of the original text according to the split features of the text units;
and determining the named entities in the original text according to the feature representation vector of the original text and the text unit representation vectors.
The embodiment of the application also discloses a device for named entity identification, which comprises: a separation module configured to receive an original text and separate the original text to obtain text units;
a first determination module configured to determine a text unit representation vector according to each text unit;
a processing module configured to acquire split features corresponding to the text units and determine a feature representation vector of the original text according to the split features of the text units;
a second determination module configured to determine the named entities in the original text according to the feature representation vector of the original text and the text unit representation vectors.
The embodiment of the application discloses a computing device, which comprises a memory, a processor and computer instructions stored on the memory and capable of running on the processor, wherein the processor executes the instructions to realize the steps of the named entity identification method.
Embodiments of the present application disclose a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method for named entity identification as described above.
According to the method and the device for named entity identification provided by the application, an original text is received and separated to obtain text units; a text unit representation vector is determined according to each text unit; split features corresponding to the text units are acquired, and a feature representation vector of the original text is determined according to the split features of the text units; and the named entities in the original text are determined according to the feature representation vector of the original text and the text unit representation vectors. Because the split features corresponding to the text units are processed as the minimum elements, the internal features that the text units carry as pictographic characters are preserved to the greatest extent, the internal features among text units are retained, and the accuracy of named entity recognition is improved.
Drawings
FIG. 1 is a schematic block diagram of a computing device according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for named entity identification according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of determining the feature representation vector of the original text in the named entity identification method of the present application;
FIG. 4 is a schematic flowchart of determining the named entities in the original text in the named entity identification method of the present application;
FIG. 5 is a schematic flowchart of a method for named entity identification according to another embodiment of the present application;
FIG. 6 is a schematic flowchart of a method for named entity identification according to another embodiment of the present application;
FIG. 7 is a schematic structural diagram of a device for named entity identification according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, etc. may be used herein in one or more embodiments to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first can also be referred to as a second and, similarly, a second can also be referred to as a first without departing from the scope of one or more embodiments of the present description. The word "if," as used herein, may be interpreted as "at the time of ...", "when ...", or "in response to a determination," depending on the context.
In the present application, a method and an apparatus for named entity identification, a computing device and a computer-readable storage medium are provided, which are described in detail one by one in the following embodiments.
Fig. 1 is a block diagram illustrating a configuration of a computing device 100 according to an embodiment of the present specification. The components of the computing device 100 include, but are not limited to, memory 110 and processor 120. The processor 120 is coupled to the memory 110 via a bus 130 and a database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. Access device 140 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)) whether wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 100 and other components not shown in FIG. 1 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 1 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smartphone), a wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps of the method shown in FIG. 2. FIG. 2 is a schematic flowchart of a method of named entity identification according to an embodiment of the present application, including steps 202 to 208.
Step 202: receive an original text, and separate the original text to obtain text units.
The received original text may be at least one of a Chinese text, an English text, a Korean text and a Japanese text; the original text may of course also be in another language.
Optionally, the original text is separated to obtain at least one of first language single characters and second language words.
The following example, in which the first language is Chinese, separates the original text into first language single characters.
For example, the received original text is "我想听张三的六月的雨" ("I want to listen to Zhang San's 'June Rain'"), and the first language single characters obtained by separating it are "我" (I), "想" (want), "听" (listen), "张" (Zhang), "三" (San), "的" (of), "六" (six), "月" (month), "的" (of) and "雨" (rain).
The original text can be separated into text units in various ways; in the present application, it is separated by means of a regular expression.
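The regular-expression separation can be sketched as follows. The patent does not give the expression itself, so the pattern below (one CJK ideograph per text unit, or one run of Latin letters per text unit) and the function name `separate` are assumptions made for illustration.

```python
import re

# Assumed pattern: each Chinese character is one unit; each run of Latin
# letters is one unit. Anything else (spaces, punctuation) is skipped.
UNIT_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")

def separate(original_text):
    """Split the original text into text units (step 202)."""
    return UNIT_PATTERN.findall(original_text)

chinese_units = separate("我想听张三的六月的雨")
mixed_units = separate("我想听My love")
print(chinese_units)
print(mixed_units)
```

The same pattern handles a purely Chinese text and a mixed Chinese-English text, which is why a single alternation is used here rather than two separate passes.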
Step 204: determine a text unit representation vector according to each text unit.
The text unit representation vector is determined by text unit embedding, which represents a single character numerically: the text unit is mapped to a high-dimensional vector that represents it.
The text unit representation vectors determined for the text units "我", "想", "听", "张", "三", "的", "六", "月", "的" and "雨" are $w_1$, $w_2$, $w_3$, $w_4$, $w_5$, $w_6$, $w_7$, $w_8$, $w_9$ and $w_{10}$, respectively.
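As a minimal sketch of text unit embedding, the lookup table below maps each distinct text unit to a fixed vector. The class name `UnitEmbedding`, the dimensionality and the random initialisation are assumptions; in practice the vectors would be learned or loaded from a pretrained model, which the patent does not specify.

```python
import random

EMBED_DIM = 8  # assumed dimensionality, for illustration only

class UnitEmbedding:
    """Maps each distinct text unit to a fixed vector (randomly initialised here)."""

    def __init__(self, dim=EMBED_DIM, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def lookup(self, unit):
        # Create the vector the first time a unit is seen, then reuse it.
        if unit not in self.table:
            self.table[unit] = [self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]
        return self.table[unit]

units = ["我", "想", "听", "张", "三", "的", "六", "月", "的", "雨"]
embedding = UnitEmbedding()
# w[0] .. w[9] play the role of w_1 .. w_10 in the text.
w = [embedding.lookup(u) for u in units]
```

Note that in this sketch both occurrences of "的" receive the same vector; any context dependence would come from later layers.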
Step 206: acquire the split features corresponding to the text units, and determine the feature representation vector of the original text according to the split features of the text units.
A split feature may be obtained by splitting a first language single character into its radicals, or by splitting the pinyin of the character. Either way, it exposes the internal relationships of the first language single characters as pictophonetic characters, which improves the accuracy of named entity determination in the following steps.
Referring to fig. 3, step 206 includes steps 302 through 310.
Step 302: judge whether the ith first language single character can be split, where 1 ≤ i ≤ n and n is the total number of first language single characters contained in the original text; if so, execute step 304; if not, execute step 306.
For example, "我" (I), "六" (six), "月" (month) and "雨" (rain), separated from the received original text "我想听张三的六月的雨", are single-component characters. A single-component character is a Chinese character built directly from strokes, so it cannot be split.
Step 304: split the first language single character to obtain its radicals, determine the split feature representation vector of the first language single character according to its radicals, and execute step 308.
For example, splitting the first language single character "想" (want), separated from the received original text "我想听张三的六月的雨", yields the radicals "相" and "心" (heart).
Using the radicals of the first language single character as the minimum elements preserves, to the greatest extent, the inherent features of the character as a pictographic character, and retains the inherent features between characters.
Step 304 specifically includes step 3041 and step 3042.
Step 3041: determine the embedded representation corresponding to each radical of the first language single character.
Step 3042: input the embedded representations corresponding to the radicals of the first language single character into the convolution layer to obtain the split feature representation vector of the first language single character.
The embedded representation corresponding to the radical "相" of the first language single character "想" is $h_{21}$, and the embedded representation corresponding to its radical "心" is $h_{22}$. Inputting these embedded representations into the convolution layer gives the split feature representation vector $h_2 = [h_{21}, h_{22}]$ of the character.
By analogy, the split feature representation vectors $h_3$, $h_4$, $h_5$, $h_6$ and $h_9$ of the first language single characters "听", "张", "三", "的" and "的" are $[h_{31}, h_{32}]$, $[h_{41}, h_{42}]$, $[h_{51}, h_{52}]$, $[h_{61}, h_{62}]$ and $[h_{91}, h_{92}]$, respectively.
Modeling starts from the radicals of the first language single characters: the embedded representations corresponding to the radicals are input into a convolutional layer of a convolutional neural network. In the following steps, the text unit representation vector of each text unit and the feature representation vector of the original text are input into a long short-term memory (LSTM) model. This extracts the internal features among the first language single characters and expresses the relationships among them to the greatest extent, which improves recognition accuracy in the subsequent step of determining the named entities.
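The convolution over radical embeddings in step 3042 might look like the sketch below. The patent only states that the radical embeddings pass through a convolution layer; the kernel width of 2, the zero padding, the max pooling and the toy numbers are all assumptions, and the function names are hypothetical.

```python
def conv1d(seq, kernels, width=2):
    """Slide each kernel over `seq`, a list of equal-length vectors.

    kernels: list of flat weight lists, each of length width * dim.
    Returns one feature map (a list of floats) per kernel.
    """
    dim = len(seq[0])
    # Zero-pad so that even a single radical yields one window.
    padded = seq + [[0.0] * dim] * max(0, width - len(seq))
    maps = []
    for kernel in kernels:
        feature_map = []
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            feature_map.append(sum(k * x for k, x in zip(kernel, window)))
        maps.append(feature_map)
    return maps

def split_feature(radical_embeddings, kernels, width=2):
    """Max-pool each feature map to one number; the result stands in for h_k."""
    return [max(m) for m in conv1d(radical_embeddings, kernels, width)]

# Toy embeddings standing in for h_21 and h_22 (the two radicals of one character).
h_21 = [1.0, 0.0]
h_22 = [0.0, 2.0]
kernels = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 0.0, 0.5]]
h_2 = split_feature([h_21, h_22], kernels)
```

With two kernels the split feature has two components; in a trained model the kernel weights would be learned jointly with the rest of the network.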
Step 306: take the representation vector corresponding to the first language single character as its split feature representation vector, and execute step 308.
For example, for the first language single character "我" (I), separated from the received original text "我想听张三的六月的雨", the representation vector $h_1$ of the character is used directly as its split feature representation vector.
By analogy, the split feature representation vectors of "六", "月" and "雨" are $h_7$, $h_8$ and $h_{10}$, respectively.
Step 308: increase i by 1 and judge whether i is greater than n; if not, execute step 302; if so, execute step 310.
Step 310: determine the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character.
According to the split feature representation vector corresponding to each first language single character, the feature representation vector of the original text is determined as $H = \{h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8, h_9, h_{10}\}$.
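The loop of steps 302 to 310 can be sketched as below. The decomposition table `RADICALS` is a hypothetical, deliberately partial stand-in for a full radical dictionary, and `embed` is a toy deterministic embedding; neither is specified by the patent. The convolution of step 3042 is omitted so that the control flow stays visible.

```python
# Hypothetical, partial decomposition table; a real system would load a
# complete radical dictionary. Characters absent from it are treated as
# unsplittable in this sketch.
RADICALS = {
    "想": ["相", "心"],
    "听": ["口", "斤"],
}

def embed(symbol, dim=4):
    """Toy deterministic embedding derived from the code point (illustration only)."""
    code = ord(symbol)
    return [((code >> (4 * i)) & 0xF) / 15.0 for i in range(dim)]

def feature_vectors(characters):
    """Return H = [h_1, ..., h_n] for a list of first language single characters."""
    H = []
    i = 0
    n = len(characters)
    while i < n:                      # steps 302 and 308: loop until i exceeds n
        ch = characters[i]
        parts = RADICALS.get(ch)
        if parts:                     # step 304: the character can be split
            h = [embed(p) for p in parts]
        else:                         # step 306: reuse the character's own vector
            h = [embed(ch)]
        H.append(h)
        i += 1
    return H                          # step 310

H = feature_vectors(list("我想听"))
```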
Step 208: determine the named entities in the original text according to the feature representation vector of the original text and the text unit representation vectors.
Referring to FIG. 4, step 208 includes steps 402 to 406.
Step 402: input the text unit representation vector corresponding to each text unit and the feature representation vector of the original text into a long short-term memory (LSTM) model, which outputs a fusion vector corresponding to each text unit.
The text unit representation vector $w_1$ corresponding to the text unit "我" and the feature representation vector $H = \{h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8, h_9, h_{10}\}$ of the original text are input into the LSTM model, which outputs the fusion vector $f_1$ corresponding to the text unit "我". By analogy, the fusion vectors of the text units "想", "听", "张", "三", "的", "六", "月", "的" and "雨" are $f_2$, $f_3$, $f_4$, $f_5$, $f_6$, $f_7$, $f_8$, $f_9$ and $f_{10}$, respectively.
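One way to picture the fusion step is a small LSTM written out by hand. The patent does not say how $w_k$ and $H$ are combined before entering the model; concatenating $w_k$ with a fixed-length $h_k$ at each position, the single forward direction, the random weights and the names `LSTMCell` and `fuse` are all assumptions of this sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """A single LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        rows = 4 * hidden_size              # input, forget, candidate, output
        cols = input_size + hidden_size
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
        self.b = [0.0] * rows

    def step(self, x, h_prev, c_prev):
        z = x + h_prev                      # concatenate input and previous hidden state
        pre = [sum(wt * v for wt, v in zip(row, z)) + b
               for row, b in zip(self.W, self.b)]
        n = self.hidden_size
        i = [sigmoid(p) for p in pre[0:n]]
        f = [sigmoid(p) for p in pre[n:2 * n]]
        g = [math.tanh(p) for p in pre[2 * n:3 * n]]
        o = [sigmoid(p) for p in pre[3 * n:4 * n]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

def fuse(w_vectors, h_vectors, cell):
    """Return one fusion vector f_k per text unit."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    fused = []
    for w_k, h_k in zip(w_vectors, h_vectors):
        h, c = cell.step(w_k + h_k, h, c)   # assumed combination: concatenation
        fused.append(h)
    return fused

w_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h_vectors = [[1.0], [0.0], [0.5]]
cell = LSTMCell(input_size=3, hidden_size=4)
f_vectors = fuse(w_vectors, h_vectors, cell)
```

Each fusion vector depends on all earlier positions through the hidden and cell states, which is how information about neighbouring text units reaches the label for the current one.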
Step 404: input the fusion vectors corresponding to the text units into a conditional random field model, which outputs a label corresponding to each text unit.
The conditional random field model is a linear-chain conditional random field. In general, named entities are names of people, places and organizations; in the media field targeted here, named entities are person names and song names (the latter including film and television titles, website names and television station names). Named entity recognition identifies the type to which each separated text unit belongs.
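Decoding with a linear-chain conditional random field is commonly done with the Viterbi algorithm; the patent does not name the decoding procedure, so the following is a sketch under that assumption. The emission and transition scores are invented toy numbers, not trained parameters.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one dict per text unit mapping tag -> score.
    transitions: dict mapping (previous_tag, current_tag) -> score (default 0).
    """
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + em[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for pointers in reversed(back):        # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["O", "B-PER", "I-PER"]
# Toy constraint: an entity may not start with an inside tag.
transitions = {("O", "I-PER"): -10.0}
emissions = [
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 1.0, "I-PER": 1.5},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 2.0},
]
path = viterbi(emissions, transitions, tags)
print(path)  # ['O', 'B-PER', 'I-PER']
```

Choosing the best tag per position independently would pick "I-PER" for the second unit; the transition score is what steers the sequence to a well-formed "B-PER" followed by "I-PER".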
The fusion vectors $f_1$, $f_2$, $f_3$, $f_4$, $f_5$, $f_6$, $f_7$, $f_8$, $f_9$ and $f_{10}$ corresponding to the text units "我", "想", "听", "张", "三", "的", "六", "月", "的" and "雨" are input into a trained linear-chain conditional random field, and the original text "我想听张三的六月的雨" is labeled as: 我\O 想\O 听\O 张\B-PER 三\I-PER 的\O 六\B-NAME 月\I-NAME 的\I-NAME 雨\I-NAME.
Here "O" represents "other"; "B" represents "begin", that is, the beginning of an entity; "I" represents "internal", that is, the inside of an entity; "PER" and "NAME" indicate that the category of the entity is a person name and a domain name (here, a song name), respectively.
Inputting the fusion vectors corresponding to the text units into the conditional random field model automatically labels the category of each text unit, so that in the original text "我想听张三的六月的雨", "张三" (Zhang San) is identified as a person name entity and "六月的雨" ("June Rain") as a song name entity.
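Turning a B/I/O label sequence into entity spans can be sketched as follows. The function name `decode_bio` and the `sep` parameter (empty for Chinese, a space for English words) are illustrative choices, not part of the patent.

```python
def decode_bio(units, tags, sep=""):
    """Group text units into (entity_text, entity_type) pairs from B/I/O tags."""
    entities = []
    current_units, current_type = [], None
    for unit, tag in zip(units, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity still open.
            if current_units:
                entities.append((sep.join(current_units), current_type))
            current_units, current_type = [unit], tag[2:]
        elif tag.startswith("I-") and current_units and tag[2:] == current_type:
            current_units.append(unit)
        else:
            # "O", or an inside tag that does not continue the open entity.
            if current_units:
                entities.append((sep.join(current_units), current_type))
            current_units, current_type = [], None
    if current_units:
        entities.append((sep.join(current_units), current_type))
    return entities

units = list("我想听张三的六月的雨")
tags = ["O", "O", "O", "B-PER", "I-PER", "O",
        "B-NAME", "I-NAME", "I-NAME", "I-NAME"]
entities = decode_bio(units, tags)
print(entities)  # [('张三', 'PER'), ('六月的雨', 'NAME')]
```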
Step 406: determine the named entities in the original text according to the labels corresponding to the text units.
Table 1 shows the labels output by the conditional random field model for the text units "我", "想", "听", "张", "三", "的", "六", "月", "的" and "雨".
TABLE 1

| Text unit | 我 | 想 | 听 | 张 | 三 | 的 | 六 | 月 | 的 | 雨 |
|-----------|----|----|----|----|----|----|----|----|----|----|
| Label     | 0  | 0  | 0  | 1  | 1  | 0  | 1  | 1  | 1  | 1  |

The person name entity "张三" (Zhang San) and the song name entity "六月的雨" ("June Rain") in the original text are determined from the text units labeled 1.
In this method for named entity identification, a text unit representation vector is determined according to each text unit; the split features corresponding to the text units are acquired, and the feature representation vector of the original text is determined according to them. Modeling starts from the radicals of the first language single characters, which extracts the internal features among the characters and expresses the relationships among them to the greatest extent. The named entities in the original text are then determined according to the feature representation vector of the original text and the text unit representation vectors, which improves the accuracy of named entity recognition.
Fig. 5 shows a schematic flow chart of a method of named entity identification of another embodiment of the present application, comprising steps 502 to 516.
Step 502: receive an original text, and separate the original text to obtain m second language words.
The second language may be English.
Step 504: determine a text unit representation vector according to each text unit.
Step 506: judge whether the jth second language word can be split, where 1 ≤ j ≤ m; if so, execute step 508; if not, execute step 510.
Step 508: split the second language word to obtain its characters, determine the split feature representation vector of the second language word according to its characters, and execute step 512.
Step 510: take the representation vector corresponding to the second language word as its split feature representation vector, and execute step 512.
Step 512: increase j by 1 and judge whether j is greater than m; if not, execute step 506; if so, execute step 514.
Step 514: determine the feature representation vector of the original text according to the split feature representation vector corresponding to each second language word.
Step 516: determine the named entities in the original text according to the feature representation vector of the original text and the text unit representation vectors.
In this method for named entity identification, modeling starts from the characters of the second language words, which extracts the internal features of those characters and expresses the relationships among them to the greatest extent. The named entities in the original text are determined according to the feature representation vector of the original text and the text unit representation vectors, which improves the accuracy of named entity recognition.
Fig. 6 shows a schematic flow chart of a method of named entity identification of another embodiment of the present application, comprising steps 602 to 624.
Step 602: receiving an original text, separating the original text to obtain h text units, wherein the text units are single words in a first language or words in a second language.
The first language is Chinese, the second language is English for illustration, and the original text is assumed to be "I want to listen to My love".
Step 604: and determining a text unit representation vector according to the text unit.
Table 2 shows the text units obtained by separating the original text and the text unit representing vectors corresponding to the text units.
TABLE 2
Text unit I am Want to Listening device My love
Text unit representation vector w 1 w 2 w 3 w 4 w 5
The text unit expression vectors determined by the text units "I", "want", "listen", "My" and "love" are w respectively 1 、w 2 、w 3 、w 4 And w 5
Step 606: judge whether the kth text unit is a first language single character or a second language word, where 1 ≤ k ≤ h; if it is a first language single character, execute step 608; if it is a second language word, execute step 614.
Among the text units, "我", "想" and "听" are judged to be first language single characters, and "My" and "love" are judged to be second language words.
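The dispatch in step 606 can be sketched with two regular expressions. The patent does not state how the language of a text unit is judged; matching on Unicode ranges, and the function name `classify`, are assumptions that fit the Chinese-English example.

```python
import re

FIRST_LANGUAGE = re.compile(r"[\u4e00-\u9fff]")   # exactly one Chinese character
SECOND_LANGUAGE = re.compile(r"[A-Za-z]+")        # one English word

def classify(unit):
    """Decide which branch of step 606 a text unit takes."""
    if FIRST_LANGUAGE.fullmatch(unit):
        return "first"     # continue with step 608
    if SECOND_LANGUAGE.fullmatch(unit):
        return "second"    # continue with step 614
    return "unknown"       # not covered by the two-language example

kinds = [classify(u) for u in ["我", "想", "听", "My", "love"]]
print(kinds)  # ['first', 'first', 'first', 'second', 'second']
```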
Step 608: judge whether the kth first language single character can be split; if so, execute step 610; if not, execute step 612.
Step 610: split the first language single character to obtain its radicals, determine the split feature representation vector of the first language single character according to its radicals, and execute step 620.
Splitting the first language single character "想" (want) yields the radicals "相" and "心" (heart). The embedded representation corresponding to the radical "相" is $h_{21}$, and the embedded representation corresponding to the radical "心" is $h_{22}$. Inputting these embedded representations into the convolution layer gives the split feature representation vector $h_2 = [h_{21}, h_{22}]$ of the character.
Splitting the first language single character "听" (listen) yields the radicals "口" (mouth) and "斤" (jin). The embedded representation corresponding to the radical "口" is $h_{31}$, and the embedded representation corresponding to the radical "斤" is $h_{32}$. Inputting these embedded representations into the convolution layer gives the split feature representation vector $h_3 = [h_{31}, h_{32}]$ of the character.
Step 612: take the representation vector corresponding to the first language single character as its split feature representation vector, and execute step 620.
The representation vector $h_1$ of the first language single character "我" (I) is used directly as its split feature representation vector.
Step 614: judge whether the kth second language word can be split; if so, execute step 616; if not, execute step 618.
Step 616: split the second language word to obtain its characters, determine the split feature representation vector of the second language word according to its characters, and execute step 620.
Splitting the second language word "My" yields the characters "M" and "y", whose embedded representations are $h_{41}$ and $h_{42}$, respectively; the split feature representation vector corresponding to "My" is $h_4 = [h_{41}, h_{42}]$.
Splitting the second language word "love" yields the characters "l", "o", "v" and "e", whose embedded representations are $h_{51}$, $h_{52}$, $h_{53}$ and $h_{54}$, respectively; the split feature representation vector corresponding to "love" is $h_5 = [h_{51}, h_{52}, h_{53}, h_{54}]$.
Step 618: take the representation vector corresponding to the second language word as its split feature representation vector, and execute step 620.
For example, the second language word "a" cannot be split, so step 618 would be executed for it. Every second language word in the original text of the above example can be split, so step 618 is not executed there.
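Steps 614 to 618 for second language words can be sketched as below. The rule that a one-letter word cannot be split follows the "a" example in the text; the function names and the toy embedding callbacks are hypothetical.

```python
def split_word(word):
    """Return the characters of a second language word, or None if it cannot be split."""
    return list(word) if len(word) > 1 else None

def word_split_feature(word, embed_char, embed_word):
    """Return the split feature representation of one second language word."""
    chars = split_word(word)                 # step 614
    if chars is None:
        return [embed_word(word)]            # step 618: reuse the word's own vector
    return [embed_char(c) for c in chars]    # step 616: one embedding per character

# Toy embedding callbacks, for illustration only.
def embed_char(c):
    return [ord(c) / 128.0]

def embed_word(w):
    return [len(w) / 10.0]

h_4 = word_split_feature("My", embed_char, embed_word)
h_5 = word_split_feature("love", embed_char, embed_word)
h_a = word_split_feature("a", embed_char, embed_word)
```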
Step 620: increase k by 1 and judge whether k is greater than h; if not, execute step 606; if so, execute step 622.
Step 622: determine the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character and the split feature representation vector corresponding to each second language word, and execute step 624.
According to the split feature representation vectors $h_1$, $h_2$ and $h_3$ corresponding to the first language single characters "我", "想" and "听", and the split feature representation vectors $h_4$ and $h_5$ corresponding to the second language words "My" and "love", the feature representation vector of the original text "我想听My love" is determined as $H = \{h_1, h_2, h_3, h_4, h_5\}$.
Step 624: and determining the named entity in the original text according to the feature representation vector and the text unit representation vector of the original text.
Representing the text unit corresponding to the text unit 'I' by a vector w 1 And the feature expression vector H of the original text is { H } 1 ,h 2 ,h 3 ,h 4 ,h 5 Long and short term notes of inputMemory model, the fusion vector corresponding to the output text unit 'I' is f 1 By analogy, the fusion vectors of the text units of thought, listen, my and love are respectively f 2 、f 3 、f 4 And f 5
The text units "i", "want", "listen", "My" and "love" are respectively corresponded to a fusion vector f 1 、f 2 、f 3 、f 4 And f 5 Inputting a trained random field of linear chain elements.
The original text "I want to listen to My love" is labeled as: I\O want\O listen\O My\B-NAME love\I-NAME, where "O" represents other; "B" represents "begin", i.e., the beginning of an entity; "I" represents "inside", i.e., the interior of an entity; and "NAME" indicates the category of the entity, here a song name.
The fusion vectors corresponding to the text units are input into the conditional random field model, which automatically labels the category of each text unit, so that "My love" in the original text "I want to listen to My love" is identified as a song name entity.
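The labeling scheme above can be turned into entities by grouping each "B-" label with the "I-" labels that follow it. The following is a minimal sketch, not the patent's implementation; the function name `decode_bio` and the choice to join units with a space are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def decode_bio(units: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO labels into (entity_text, category) pairs."""
    entities: List[Tuple[str, str]] = []
    current_units: List[str] = []
    current_type: Optional[str] = None
    for unit, label in zip(units, labels):
        if label.startswith("B-"):
            # A new entity starts; flush any entity that was still open.
            if current_units:
                entities.append((" ".join(current_units), current_type))
            current_units = [unit]
            current_type = label[2:]
        elif label.startswith("I-") and current_units and label[2:] == current_type:
            # Continuation of the currently open entity.
            current_units.append(unit)
        else:
            # "O" (or an inconsistent "I-") closes any open entity.
            if current_units:
                entities.append((" ".join(current_units), current_type))
            current_units, current_type = [], None
    if current_units:
        entities.append((" ".join(current_units), current_type))
    return entities


units = ["I", "want", "listen", "My", "love"]
labels = ["O", "O", "O", "B-NAME", "I-NAME"]
print(decode_bio(units, labels))  # [('My love', 'NAME')]
```

Joining with a space suits second language words such as "My" and "love"; for first language single characters an empty separator would likely be more appropriate.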
According to the method for recognizing the named entity, the original text comprises the first language words and the second language words, namely the original text is a mixed text of two languages, for example, the original text is a Chinese-English mixed text.
Fig. 7 illustrates an apparatus for named entity identification provided in an embodiment of the present application, where the apparatus includes:
a separation module 701 configured to receive an original text, separate the original text to obtain a text unit;
a first determining module 702 configured to determine a text unit representation vector from the text unit;
the processing module 703 is configured to obtain a splitting feature corresponding to the text unit, and determine a feature expression vector of the original text according to the splitting feature of the text unit;
a second determining module 704 configured to determine a named entity in the original text from the feature representation vector and the text unit representation vector of the original text.
The apparatus for named entity identification processes the split features corresponding to the text units as the minimum elements, so that the internal features of the text units, as pictographic characters, are preserved to the greatest extent, the internal features among the text units are retained, and the accuracy of named entity identification is improved.
Optionally, the separation module 701 is further configured to separate the original text to obtain first language words, where the original text includes n first language words;
the processing module 703 is further configured to perform the following steps, S301: judging whether the ith first language single character can be split, where 1 ≤ i ≤ n; if so, executing S302, and if not, executing S303;
S302: splitting the first language single character to obtain the radicals of the first language single character, determining a split feature representation vector of the first language single character according to the radicals of the first language single character, and executing S304;
S303: taking the representation vector corresponding to the first language single character as the split feature representation vector corresponding to the first language single character, and executing S304;
S304: increasing i by 1 and judging whether i is greater than n; if not, executing S301, and if so, executing S305;
S305: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character.
Optionally, the separation module 701 is further configured to separate the original text to obtain a second language word.
The processing module 703 is further configured to perform the following steps, S401: judging whether the jth second language word can be split, where 1 ≤ j ≤ m and the original text comprises m second language words; if so, executing S402, and if not, executing S403;
S402: splitting the second language word to obtain the characters of the second language word, determining a split feature representation vector of the second language word according to the characters of the second language word, and executing S404;
S403: taking the representation vector corresponding to the second language word as the split feature representation vector corresponding to the second language word, and executing S404;
S404: increasing j by 1 and judging whether j is greater than m; if not, executing S401, and if so, executing S405;
S405: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each second language word.
Optionally, the separation module 701 is further configured to separate the original text to obtain h text units, where each text unit is a first language single character or a second language word;
the processing module 703 is further configured to perform the following steps, S501: judging whether the kth text unit is a first language single character or a second language word, where 1 ≤ k ≤ h; if the kth text unit is a first language single character, executing S502, and if the kth text unit is a second language word, executing S505;
S502: judging whether the kth first language single character can be split; if so, executing S503, and if not, executing S504;
S503: splitting the first language single character to obtain the radicals of the first language single character, determining a split feature representation vector of the first language single character according to the radicals of the first language single character, and executing S508;
S504: taking the representation vector corresponding to the first language single character as the split feature representation vector corresponding to the first language single character, and executing S508;
S505: judging whether the kth second language word can be split; if so, executing S506, and if not, executing S507;
S506: splitting the second language word to obtain the characters of the second language word, determining a split feature representation vector of the second language word according to the characters of the second language word, and executing S508;
S507: taking the representation vector corresponding to the second language word as the split feature representation vector corresponding to the second language word, and executing S508;
S508: increasing k by 1 and judging whether k is greater than h; if not, executing S501, and if so, executing S509;
S509: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character and the split feature representation vector corresponding to each second language word.
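The control flow of S501 to S509 can be sketched as a loop that dispatches on the type of each text unit. The sketch below is illustrative only: it returns the split pieces (radicals or characters) rather than vectors, the `RADICALS` table is a tiny hypothetical stand-in for a real decomposition dictionary, and treating any single CJK ideograph as a first language single character is an assumption.

```python
from typing import Dict, List

# Hypothetical decomposition table; a real system would need a full dictionary.
RADICALS: Dict[str, List[str]] = {
    "想": ["相", "心"],
    "听": ["口", "斤"],
}


def is_first_language(unit: str) -> bool:
    """S501: treat a single CJK ideograph as a first language single character."""
    return len(unit) == 1 and "\u4e00" <= unit <= "\u9fff"


def split_features(unit: str) -> List[str]:
    """Return the split pieces of one text unit, or the unit itself as a fallback."""
    if is_first_language(unit):
        # S502 to S504: use radicals when the character can be split,
        # otherwise fall back to the character itself.
        return RADICALS.get(unit, [unit])
    # S505 to S507: split a word into characters when it has more than one,
    # otherwise fall back to the word itself (for example "a").
    return list(unit) if len(unit) > 1 else [unit]


def text_split_features(units: List[str]) -> List[List[str]]:
    """S508 and S509: visit units k = 1..h in order and collect their features."""
    return [split_features(u) for u in units]


print(text_split_features(["我", "想", "听", "My", "love"]))
# [['我'], ['相', '心'], ['口', '斤'], ['M', 'y'], ['l', 'o', 'v', 'e']]
```

In the apparatus described above, each list of pieces would subsequently be mapped to a split feature representation vector; that step is omitted here.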
Optionally, the processing module 703 is further configured to determine, according to the radical of the first language word, an embedded representation corresponding to the radical of the first language word;
and inputting the embedded representation corresponding to the radical of the first language single character into the convolution layer to obtain a split feature representation vector corresponding to the radical of the first language single character.
Optionally, the processing module 703 is further configured to determine, from the characters of the second language word, an embedded representation corresponding to the characters of the second language word;
and inputting the embedded representation corresponding to the characters of the second language word into the convolution layer to obtain a split feature representation vector corresponding to the characters of the second language word.
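As a rough illustration of passing embedded radicals or characters through a convolution layer, the sketch below applies a one-dimensional convolution over the piece embeddings and then max-pools over positions. It is a sketch under stated assumptions, not the patent's model: the embedding size, filter count, window width, random initialization, zero padding and the max-pooling step are all assumptions, and the weights are untrained.

```python
import random
from typing import Dict, List

EMBED_DIM = 4    # assumed embedding size
NUM_FILTERS = 3  # assumed number of convolution filters
WINDOW = 2       # assumed convolution window width

rng = random.Random(0)
_embeddings: Dict[str, List[float]] = {}

# One weight vector per filter, spanning WINDOW consecutive embeddings.
FILTERS = [[rng.uniform(-1, 1) for _ in range(WINDOW * EMBED_DIM)]
           for _ in range(NUM_FILTERS)]


def embed(piece: str) -> List[float]:
    """Look up (or lazily create) a random embedding for a radical or character."""
    if piece not in _embeddings:
        _embeddings[piece] = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return _embeddings[piece]


def split_feature_vector(pieces: List[str]) -> List[float]:
    """1-D convolution over piece embeddings, then max pooling over positions."""
    rows = [embed(p) for p in pieces]
    # Pad with zero rows so that at least one full window exists.
    while len(rows) < WINDOW:
        rows.append([0.0] * EMBED_DIM)
    pooled = []
    for filt in FILTERS:
        responses = []
        for start in range(len(rows) - WINDOW + 1):
            window = [x for row in rows[start:start + WINDOW] for x in row]
            responses.append(sum(w * x for w, x in zip(filt, window)))
        pooled.append(max(responses))
    return pooled


vector = split_feature_vector(["l", "o", "v", "e"])
print(len(vector))  # 3
```

Pooling makes the output length independent of how many radicals or characters a text unit has, which is one common way to obtain a fixed-size vector per unit.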
Optionally, the second determining module 704 is further configured to input the text unit representation vector corresponding to each text unit and the feature representation vector of the original text into a long-short term memory model, and the long-short term memory model outputs a fused vector corresponding to each text unit;
inputting the fusion vector corresponding to the text unit into a conditional random field model, and outputting a label corresponding to each text unit by the conditional random field model;
and determining the named entities in the original text according to the labels corresponding to the text units.
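The labeling step of a linear-chain conditional random field is commonly carried out with Viterbi decoding over per-unit scores and label-to-label transition scores. The sketch below shows that decoding only; the emission and transition numbers are made up for illustration (in the described apparatus the per-unit scores would be derived from the fusion vectors), and the function name `viterbi` is an assumption.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model."""
    # best[t][y]: best score of any label path that ends in y at position t.
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["O", "B-NAME", "I-NAME"]
# Illustrative transition scores: an entity may not start with "I-NAME" after "O".
TRANSITIONS = {p: {y: 0.0 for y in LABELS} for p in LABELS}
TRANSITIONS["O"]["I-NAME"] = -10.0
# Illustrative emission scores for "I", "want", "listen", "My", "love".
EMISSIONS = [
    {"O": 2.0, "B-NAME": 0.0, "I-NAME": 0.0},
    {"O": 2.0, "B-NAME": 0.0, "I-NAME": 0.0},
    {"O": 2.0, "B-NAME": 0.0, "I-NAME": 0.0},
    {"O": 0.0, "B-NAME": 1.0, "I-NAME": 1.5},
    {"O": 0.0, "B-NAME": 0.0, "I-NAME": 2.0},
]
print(viterbi(EMISSIONS, TRANSITIONS, LABELS))
# ['O', 'O', 'O', 'B-NAME', 'I-NAME']
```

Note that "My" has a higher emission score for "I-NAME" than for "B-NAME" in this toy data; the transition penalty is what steers the decoder to a well-formed "B-NAME" followed by "I-NAME".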
Optionally, the separation module 701 is further configured to separate the original text by a regular expression to obtain the text units.
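A regular-expression separation for Chinese-English mixed text might look like the following. This is a minimal sketch: the pattern, the name `separate`, and the Chinese rendering of the example sentence are assumptions, and the pattern covers only the basic CJK Unified Ideographs block and ASCII letters.

```python
import re
from typing import List

# One CJK ideograph is one text unit; a run of ASCII letters is one text unit.
UNIT_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+")


def separate(text: str) -> List[str]:
    """Separate mixed text into single characters and words, skipping spaces."""
    return UNIT_PATTERN.findall(text)


print(separate("我想听My love"))  # ['我', '想', '听', 'My', 'love']
```

Digits, punctuation and other scripts are silently dropped by this pattern; a production pattern would need additional alternatives for them.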
An embodiment of the present application also provides a computing device, which includes a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor executes the instructions to implement the steps of the method for named entity identification as described above.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the method for named entity identification as described above.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the named entity identifying method belong to the same concept, and for details that are not described in detail in the technical solution of the storage medium, reference may be made to the description of the technical solution of the named entity identifying method.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.
It should be noted that for simplicity and convenience of description, the above-described method embodiments are described as a series of combinations of acts, but those skilled in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders and/or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (15)

1. A method of named entity recognition, comprising:
receiving an original text, separating the original text to obtain a text unit;
determining a text unit representation vector according to the text unit;
acquiring splitting characteristics corresponding to the text unit, and determining a characteristic representation vector of the original text according to the splitting characteristics of the text unit;
determining named entities in the original text according to the feature representation vector and the text unit representation vector of the original text;
wherein the determining the named entity in the original text according to the feature representation vector and the text unit representation vector of the original text comprises:
inputting the text unit representation vector corresponding to each text unit and the feature representation vector of the original text into a long-short term memory model, and outputting a fusion vector corresponding to each text unit by the long-short term memory model;
inputting the fusion vector corresponding to the text unit into a conditional random field model, and outputting a label corresponding to each text unit by the conditional random field model;
and determining the named entities in the original text according to the labels corresponding to the text units.
2. The method of claim 1, wherein the original text comprises n first language words;
separating the original text to obtain a text unit, comprising:
separating the original text to obtain a first language word;
acquiring the splitting characteristic corresponding to the text unit, and determining the characteristic expression vector of the original text according to the splitting characteristic of the text unit, wherein the method comprises the following steps:
S301: judging whether the ith first language single character can be split, where 1 ≤ i ≤ n; if so, executing S302, and if not, executing S303;
S302: splitting the first language single character to obtain the radicals of the first language single character, determining a split feature representation vector of the first language single character according to the radicals of the first language single character, and executing S304;
S303: taking the representation vector corresponding to the first language single character as the split feature representation vector corresponding to the first language single character, and executing S304;
S304: increasing i by 1 and judging whether i is greater than n; if not, executing S301, and if so, executing S305;
S305: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character.
3. The method of claim 1, wherein the original text comprises m second language words;
separating the original text to obtain a text unit, comprising:
separating the original text to obtain a second language word;
acquiring the splitting characteristic corresponding to the text unit, and determining the characteristic expression vector of the original text according to the splitting characteristic of the text unit, wherein the method comprises the following steps:
S401: judging whether the jth second language word can be split, where 1 ≤ j ≤ m; if so, executing S402, and if not, executing S403;
S402: splitting the second language word to obtain the characters of the second language word, determining a split feature representation vector of the second language word according to the characters of the second language word, and executing S404;
S403: taking the representation vector corresponding to the second language word as the split feature representation vector corresponding to the second language word, and executing S404;
S404: increasing j by 1 and judging whether j is greater than m; if not, executing S401, and if so, executing S405;
S405: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each second language word.
4. The method of claim 1, wherein separating the original text to obtain a text unit comprises:
separating the original text to obtain h text units, wherein the text units are first language words or second language words;
acquiring splitting characteristics corresponding to the text unit, and determining a feature expression vector of the original text according to the splitting characteristics of the text unit, wherein the method comprises the following steps:
S501: judging whether the kth text unit is a first language single character or a second language word, where 1 ≤ k ≤ h; if the kth text unit is a first language single character, executing S502, and if the kth text unit is a second language word, executing S505;
S502: judging whether the kth first language single character can be split; if so, executing S503, and if not, executing S504;
S503: splitting the first language single character to obtain the radicals of the first language single character, determining a split feature representation vector of the first language single character according to the radicals of the first language single character, and executing S508;
S504: taking the representation vector corresponding to the first language single character as the split feature representation vector corresponding to the first language single character, and executing S508;
S505: judging whether the kth second language word can be split; if so, executing S506, and if not, executing S507;
S506: splitting the second language word to obtain the characters of the second language word, determining a split feature representation vector of the second language word according to the characters of the second language word, and executing S508;
S507: taking the representation vector corresponding to the second language word as the split feature representation vector corresponding to the second language word, and executing S508;
S508: increasing k by 1 and judging whether k is greater than h; if not, executing S501, and if so, executing S509;
S509: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character and the split feature representation vector corresponding to each second language word.
5. The method according to claim 2 or 4, wherein determining the split feature representation vector of the first language word according to the radical of the first language word comprises:
determining the embedded representation corresponding to the radicals of the first language single character according to the radicals of the first language single character;
and inputting the embedded representation corresponding to the radical of the first language single character into a convolution layer to obtain a split feature representation vector corresponding to the radical of the first language single character.
6. The method of claim 3 or 4, wherein determining the split feature representation vector for the second language word from the characters of the second language word comprises:
determining an embedded representation corresponding to the characters of the second language words according to the characters of the second language words;
and inputting the embedded representation corresponding to the characters of the second language word into the convolution layer to obtain a split feature representation vector corresponding to the characters of the second language word.
7. The method of claim 1, wherein separating the original text to obtain a text unit comprises:
and separating the original text through a regular expression to obtain a text unit.
8. An apparatus for named entity recognition, comprising:
the separation module is configured to receive an original text, separate the original text and obtain a text unit;
a first determination module configured to determine a text unit representation vector from the text unit;
the processing module is configured to acquire the splitting characteristics corresponding to the text unit and determine the characteristic representation vector of the original text according to the splitting characteristics of the text unit;
a second determination module configured to determine a named entity in the original text according to the feature representation vector and the text unit representation vector of the original text;
the second determination module is further configured to input the text unit representation vector corresponding to each text unit and the feature representation vector of the original text into a long-short term memory model, and the long-short term memory model outputs a fusion vector corresponding to each text unit; inputting the fusion vector corresponding to the text unit into a conditional random field model, and outputting a label corresponding to each text unit by the conditional random field model; and determining the named entities in the original text according to the labels corresponding to the text units.
9. The apparatus of claim 8, wherein the original text comprises n first language words, and the separation module is further configured to separate the original text to obtain the first language words;
the processing module is further configured to perform the following steps, S301: judging whether the ith first language single character can be split, where 1 ≤ i ≤ n; if so, executing S302, and if not, executing S303;
S302: splitting the first language single character to obtain the radicals of the first language single character, determining a split feature representation vector of the first language single character according to the radicals of the first language single character, and executing S304;
S303: taking the representation vector corresponding to the first language single character as the split feature representation vector corresponding to the first language single character, and executing S304;
S304: increasing i by 1 and judging whether i is greater than n; if not, executing S301, and if so, executing S305;
S305: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character.
10. The apparatus according to claim 8, wherein the separation module is further configured to separate the original text to obtain second language words, and the original text includes m second language words;
the processing module is further configured to perform the following steps, S401: judging whether the jth second language word can be split, where 1 ≤ j ≤ m; if so, executing S402, and if not, executing S403;
S402: splitting the second language word to obtain the characters of the second language word, determining a split feature representation vector of the second language word according to the characters of the second language word, and executing S404;
S403: taking the representation vector corresponding to the second language word as the split feature representation vector corresponding to the second language word, and executing S404;
S404: increasing j by 1 and judging whether j is greater than m; if not, executing S401, and if so, executing S405;
S405: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each second language word.
11. The apparatus according to claim 8, wherein the separation module is further configured to separate the original text to obtain h text units, where the text units are words in a first language or words in a second language;
the processing module is further configured to perform the following steps, S501: judging whether the kth text unit is a first language single character or a second language word, where 1 ≤ k ≤ h; if the kth text unit is a first language single character, executing S502, and if the kth text unit is a second language word, executing S505;
S502: judging whether the kth first language single character can be split; if so, executing S503, and if not, executing S504;
S503: splitting the first language single character to obtain the radicals of the first language single character, determining a split feature representation vector of the first language single character according to the radicals of the first language single character, and executing S508;
S504: taking the representation vector corresponding to the first language single character as the split feature representation vector corresponding to the first language single character, and executing S508;
S505: judging whether the kth second language word can be split; if so, executing S506, and if not, executing S507;
S506: splitting the second language word to obtain the characters of the second language word, determining a split feature representation vector of the second language word according to the characters of the second language word, and executing S508;
S507: taking the representation vector corresponding to the second language word as the split feature representation vector corresponding to the second language word, and executing S508;
S508: increasing k by 1 and judging whether k is greater than h; if not, executing S501, and if so, executing S509;
S509: determining the feature representation vector of the original text according to the split feature representation vector corresponding to each first language single character and the split feature representation vector corresponding to each second language word.
12. The apparatus according to claim 9 or 11, wherein the processing module is further configured to determine, from the radical of the first language word, an embedded representation corresponding to the radical of the first language word;
and inputting the embedded representation corresponding to the radical of the first language single character into a convolution layer to obtain a split feature representation vector corresponding to the radical of the first language single character.
13. The apparatus of claim 10 or 11, wherein the processing module is further configured to determine, from the characters of the second language word, embedded representations corresponding to the characters of the second language word;
and inputting the embedded representation corresponding to the characters of the second language word into the convolution layer to obtain a split feature representation vector corresponding to the characters of the second language word.
14. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1-7 when executing the instructions.
15. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN201910854243.4A 2019-09-10 2019-09-10 Named entity identification method and device Active CN110543638B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910854243.4A CN110543638B (en) 2019-09-10 2019-09-10 Named entity identification method and device

Publications (2)

Publication Number Publication Date
CN110543638A CN110543638A (en) 2019-12-06
CN110543638B true CN110543638B (en) 2022-12-27

Family

ID=68713595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910854243.4A Active CN110543638B (en) 2019-09-10 2019-09-10 Named entity identification method and device

Country Status (1)

Country Link
CN (1) CN110543638B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013086998A1 (en) * 2011-12-13 2013-06-20 北大方正集团有限公司 Method and device for named-entity recognition
CN103309926A (en) * 2013-03-12 2013-09-18 中国科学院声学研究所 Chinese and English-named entity identification method and system based on conditional random field (CRF)
CN106874256A (en) * 2015-12-11 2017-06-20 北京国双科技有限公司 Name the method and device of entity in identification field
CN107797992A (en) * 2017-11-10 2018-03-13 北京百分点信息科技有限公司 Name entity recognition method and device
CN109726397A (en) * 2018-12-27 2019-05-07 网易(杭州)网络有限公司 Mask method, device, storage medium and the electronic equipment of Chinese name entity

Also Published As

Publication number Publication date
CN110543638A (en) 2019-12-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant