CN115952798A - Named entity recognition method, device, server and storage medium - Google Patents

Info

Publication number
CN115952798A
Authority
CN
China
Prior art keywords
named
named entity
named entities
entities
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211316156.1A
Other languages
Chinese (zh)
Inventor
张徐润
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Financial Technology Co Ltd
Original Assignee
Bank of China Financial Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Financial Technology Co Ltd
Priority to CN202211316156.1A
Publication of CN115952798A

Abstract

The application discloses a named entity recognition method, device, server and storage medium, which can be applied in the financial field or other fields. The method comprises: obtaining a text to be recognized, wherein the text consists of a plurality of keywords; obtaining a first probability value set corresponding to each of the keywords; inputting the first probability value sets into a pre-constructed conditional random field model; dividing the text into a plurality of named entities through the conditional random field model; and obtaining, through the conditional random field model, the named entity category to which each named entity belongs, wherein each named entity comprises one or more keywords belonging to the same named entity category. The named entities contained in the text are thereby recognized more accurately, which effectively reduces cases in which a screening engine misjudges whether a customer poses a money laundering risk.

Description

Named entity recognition method, device, server and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a named entity identification method, apparatus, server, and storage medium.
Background
In the course of handling business, for example when a financial transaction is performed, a financial institution needs to confirm whether the customer poses a money laundering risk. The customer information provided by the customer is therefore sent to a screening engine in the anti-money laundering system for identification and screening, and whether the customer poses a money laundering risk is confirmed based on the screening result. However, when the anti-money laundering system identifies the named entities in the customer information, it does not consider the semantic relations between words, so the screening engine may fail to identify the named entities accurately; the screening result is then inaccurate, and whether the customer poses a money laundering risk may be misjudged.
In summary, how to accurately identify a named entity is a problem that needs to be solved urgently by those skilled in the art.
Disclosure of Invention
In view of the above, the present application provides a named entity identification method, apparatus, server and storage medium.
In order to achieve the above purpose, the present application provides the following technical solutions:
a first aspect of the present application provides a named entity identification method, including:
acquiring a text to be recognized, wherein the text consists of a plurality of keywords;
acquiring a first probability value set corresponding to each of the plurality of keywords, wherein the first probability value set corresponding to a keyword comprises: a first probability value that the keyword belongs to the start label under each named entity category; a first probability value that the keyword belongs to the process label under each named entity category; first probability values that the i keywords located before the keyword belong to the start label under each named entity category; first probability values that the i keywords located before the keyword belong to the process label under each named entity category; first probability values that the j keywords located after the keyword belong to the start label under each named entity category; and first probability values that the j keywords located after the keyword belong to the process label under each named entity category; i takes the values 1, 2, …, N-1 in sequence, where the maximum value of N is the position of the keyword in the text; j takes the values 1, 2, …, M in sequence, where the maximum value of M is the difference between the total number of words in the text and that position; the probability value that a keyword belongs to the start label is the first probability value that the keyword is the first keyword of a named entity, and the probability value that a keyword belongs to the process label is the first probability value that the keyword is a non-first keyword of a named entity;
inputting the first probability value sets respectively corresponding to the plurality of keywords into a pre-constructed conditional random field model, dividing the text into a plurality of named entities through the conditional random field model, and obtaining, through the conditional random field model, the named entity categories to which the named entities respectively belong, wherein each named entity comprises one or more keywords belonging to the same named entity category, and the conditional random field model is trained by inputting into it a sample text and the first probability value sets corresponding to the keywords contained in the sample text, with the named entities labeled in the sample text and the named entity categories to which those named entities belong as training targets.
With reference to the first aspect, in a first possible implementation manner, the method further includes:
acquiring a second probability value set corresponding to each of the named entities, wherein the second probability value set corresponding to a named entity comprises: a second probability value that the named entity belongs to a nested named entity under each named entity category; second probability values that the q named entities located before the named entity belong to a nested named entity under each named entity category; and second probability values that the k named entities located after the named entity belong to a nested named entity under each named entity category; q takes the values 1, 2, …, X-1 in sequence, where the maximum value of X is the position of the named entity among the plurality of named entities; k takes the values 1, 2, …, Y in sequence, where the maximum value of Y is the difference between the total number of named entities and the position of the named entity;
inputting the plurality of named entities, the second probability value sets respectively corresponding to the named entities, and the text into a pre-constructed feature model, and obtaining, through the feature model, nested named entities and the named entity categories to which the nested named entities belong; a nested named entity comprises a plurality of the named entities; the feature model is trained by inputting into it the named entities contained in a sample text, the second probability value sets respectively corresponding to those named entities, and the sample text, with the nested named entities labeled in the sample text and the named entity categories to which those nested named entities belong as training targets.
With reference to the first aspect, in a second possible implementation manner, the inputting the plurality of named entities, the second probability value sets corresponding to the plurality of named entities, and the text into a pre-constructed feature model, and the obtaining, by the feature model, the nested named entities and the named entity categories to which the nested named entities belong includes:
performing word segmentation on the keywords in the text other than those contained in the named entities, to obtain keywords;
inputting a plurality of named entities, the keywords, a plurality of second probability value sets respectively corresponding to the named entities and the text into a pre-constructed feature model, and obtaining nested named entities and named entity categories to which the nested named entities belong through the feature model, wherein the nested named entities comprise at least one named entity and the keywords.
With reference to the first aspect, in a third possible implementation manner, the method further includes:
by a first formula, P = a/b, calculating the precision P with which the conditional random field model divides named entities, wherein a is the number of named entities correctly divided by the conditional random field model, and b is the total number of named entities divided by the conditional random field model;
by a second formula, R = a/c, calculating the recall R with which the conditional random field model divides named entities, wherein c is the total number of named entities contained in the text;
and if F = P × R × 2/(P + R) is greater than or equal to a preset threshold, stopping training the conditional random field model.
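As a minimal sketch of the stopping criterion described above, the three quantities can be computed as follows. The function names and the threshold value 0.9 are assumptions made for illustration; the text only refers to "a preset threshold".

```python
from typing import Tuple


def f_score(a: int, b: int, c: int) -> Tuple[float, float, float]:
    """Return (P, R, F) for the division of named entities.

    a: number of named entities correctly divided by the model
    b: total number of named entities divided by the model
    c: total number of named entities contained in the text
    """
    precision = a / b if b else 0.0          # P = a/b
    recall = a / c if c else 0.0             # R = a/c
    denom = precision + recall
    f = precision * recall * 2 / denom if denom else 0.0  # F = P*R*2/(P+R)
    return precision, recall, f


def should_stop_training(a: int, b: int, c: int, threshold: float = 0.9) -> bool:
    # The threshold default is a hypothetical value, not taken from the text.
    return f_score(a, b, c)[2] >= threshold


p, r, f = f_score(a=8, b=10, c=16)
print(p, r, f)
```

Guarding against zero denominators is a design choice of this sketch; the text does not say how an empty prediction set is handled.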
A second aspect of the present application provides a named entity recognition apparatus, including:
a first obtaining unit, configured to obtain a text to be recognized, wherein the text consists of a plurality of keywords;
a second obtaining unit, configured to obtain a first probability value set corresponding to each of the plurality of keywords, wherein the first probability value set corresponding to a keyword comprises: a first probability value that the keyword belongs to the start label under each named entity category; a first probability value that the keyword belongs to the process label under each named entity category; first probability values that the i keywords located before the keyword belong to the start label under each named entity category; first probability values that the i keywords located before the keyword belong to the process label under each named entity category; first probability values that the j keywords located after the keyword belong to the start label under each named entity category; and first probability values that the j keywords located after the keyword belong to the process label under each named entity category; i takes the values 1, 2, …, N-1 in sequence, where the maximum value of N is the position of the keyword in the text; j takes the values 1, 2, …, M in sequence, where the maximum value of M is the difference between the total number of words in the text and that position; the probability value that a keyword belongs to the start label is the first probability value that the keyword is the first keyword of a named entity, and the probability value that a keyword belongs to the process label is the first probability value that the keyword is a non-first keyword of a named entity;
a third obtaining unit, configured to input the first probability value sets respectively corresponding to the plurality of keywords into a pre-constructed conditional random field model, divide the text into a plurality of named entities through the conditional random field model, and obtain, through the conditional random field model, the named entity categories to which the named entities respectively belong, wherein each named entity comprises one or more keywords belonging to the same named entity category, and the conditional random field model is trained by inputting into it a sample text and the first probability value sets corresponding to the keywords contained in the sample text, with the named entities labeled in the sample text and the named entity categories to which those named entities belong as training targets.
With reference to the second aspect, in a first possible implementation manner, the apparatus further includes:
a first obtaining module, configured to obtain a second probability value set corresponding to each of the named entities, wherein the second probability value set corresponding to a named entity comprises: a second probability value that the named entity belongs to a nested named entity under each named entity category; second probability values that the q named entities located before the named entity belong to a nested named entity under each named entity category; and second probability values that the k named entities located after the named entity belong to a nested named entity under each named entity category; q takes the values 1, 2, …, X-1 in sequence, where the maximum value of X is the position of the named entity among the plurality of named entities; k takes the values 1, 2, …, Y in sequence, where the maximum value of Y is the difference between the total number of named entities and the position of the named entity;
a second obtaining module, configured to input the plurality of named entities, the second probability value sets respectively corresponding to the named entities, and the text into a pre-constructed feature model, and to obtain, through the feature model, nested named entities and the named entity categories to which the nested named entities belong; a nested named entity comprises a plurality of the named entities; the feature model is trained by inputting into it the named entities contained in a sample text, the second probability value sets respectively corresponding to those named entities, and the sample text, with the nested named entities labeled in the sample text and the named entity categories to which those nested named entities belong as training targets.
With reference to the second aspect, in a second possible implementation manner, the second obtaining module includes:
a word segmentation submodule, configured to perform word segmentation on the keywords in the text other than those contained in the named entities, to obtain keywords;
the acquisition submodule is used for inputting the plurality of named entities, the keywords, the second probability value sets respectively corresponding to the named entities and the text into a pre-constructed feature model, and acquiring nested named entities and named entity categories to which the nested named entities belong through the feature model, wherein the nested named entities comprise at least one named entity and the keywords.
With reference to the second aspect, in a third possible implementation manner, the apparatus further includes:
a first calculation submodule, configured to calculate, by a first formula, P = a/b, the precision P with which the conditional random field model divides named entities, wherein a is the number of named entities correctly divided by the conditional random field model, and b is the total number of named entities divided by the conditional random field model;
a second calculation submodule, configured to calculate, by a second formula, R = a/c, the recall R with which the conditional random field model divides named entities, wherein c is the total number of named entities contained in the text;
a processing submodule, configured to stop training the conditional random field model if F = P × R × 2/(P + R) is greater than or equal to a preset threshold.
A third aspect of the present application provides a server comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the named entity recognition method as described in any one of the above.
A fourth aspect of the present application provides a computer-readable storage medium, wherein instructions, when executed by a processor of a server, enable the server to perform any one of the named entity recognition methods described above.
According to the above technical solution, the named entity recognition method acquires a text to be recognized, wherein the text consists of a plurality of keywords, and acquires a first probability value set corresponding to each of the keywords. The first probability value set corresponding to a keyword comprises: a first probability value that the keyword belongs to the start label under each named entity category; a first probability value that the keyword belongs to the process label under each named entity category; first probability values that the i keywords located before the keyword belong to the start label under each named entity category; first probability values that the i keywords located before the keyword belong to the process label under each named entity category; first probability values that the j keywords located after the keyword belong to the start label under each named entity category; and first probability values that the j keywords located after the keyword belong to the process label under each named entity category. Here i takes the values 1, 2, …, N-1 in sequence, where the maximum value of N is the position of the keyword in the text, and j takes the values 1, 2, …, M in sequence, where the maximum value of M is the difference between the total number of words in the text and that position. The probability value that a keyword belongs to the start label is the first probability value that the keyword is the first keyword of a named entity, and the probability value that a keyword belongs to the process label is the first probability value that the keyword is a non-first keyword of a named entity. Once the first probability value set corresponding to each keyword has been obtained, a named entity, which consists of a keyword belonging to the start label and keywords belonging to the process label under the same named entity category, can be determined from the first probability value sets. Therefore, after the first probability value sets are obtained, they are input into the pre-constructed conditional random field model, the text is divided into a plurality of named entities through the conditional random field model, and the named entity category to which each named entity belongs is obtained through the conditional random field model, wherein each named entity comprises one or more keywords belonging to the same named entity category. Because the first probability value set corresponding to each keyword, and the relation between the keyword and its preceding i keywords and following j keywords, are determined, the named entities divided by the conditional random field model are more accurate, which effectively reduces cases in which the screening engine misjudges whether a customer poses a money laundering risk.
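The text does not state how the conditional random field model selects the label sequence for the keywords. Linear-chain conditional random fields are commonly decoded with the Viterbi algorithm, so the following is a minimal sketch under that assumption. The label names, the additive (log-space) scores, and the function name `viterbi` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence for a non-empty keyword sequence.

    emissions[t][label] is the score of `label` at position t;
    transitions[(prev, cur)] is the score of moving from prev to cur.
    Missing entries count as 0.0. Scores are added, as in log space.
    """
    best = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    back: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        scores, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at this position.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emit.get(cur, 0.0)
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["start", "process", "other"]
emissions = [
    {"start": 2.0, "other": 1.0},
    {"process": 1.5, "start": 1.0},
    {"other": 2.0},
]
# A process label directly after "other" is penalised; after "start" it is favoured.
transitions = {("other", "process"): -10.0, ("start", "process"): 1.0}
print(viterbi(emissions, transitions, labels))
```

In a trained model the emission and transition scores would come from the learned parameters rather than from hand-written dictionaries.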
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a block diagram of a hardware architecture provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a named entity recognition method in accordance with an exemplary embodiment;
FIG. 3 is a block diagram illustrating a named entity recognition apparatus in accordance with an exemplary embodiment;
FIG. 4 is a block diagram of an appliance apparatus provided in accordance with an example embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
The embodiment of the application provides a named entity identification method, a named entity identification device, a server and a storage medium, and prior to introducing the technical scheme provided by the embodiment of the application, the related technology and hardware architecture related to the embodiment of the application are explained first.
First, a related art to which the embodiments of the present disclosure relate will be explained.
In the related art, customer information provided by a customer (the customer information may be represented in text form) is input into a named entity recognition model, which recognizes each named entity in the text and the named entity category to which each named entity belongs. Each named entity and its named entity category are then input into a screening engine, which determines whether the customer poses a money laundering risk.
Illustratively, the customer information provided by the customer is: "Sun Jiu, president of Beijing Express Co., Ltd., transfers 50w to Zhang San". The named entities identified include: Beijing, express, limited company, director, grandchild, Zhang San, whereas the correct division into named entities is: Beijing, express, limited company, president, Sun Jiu, Zhang San, Xiongan New Area.
The named entity recognition model in the related art recognizes named entities as follows: it determines whether adjacent keywords can form a phrase, without considering the named entity categories of the several keywords before and after them, so its recognition of named entities is inaccurate.
Next, a hardware architecture according to an embodiment of the present disclosure will be described.
As shown in fig. 1, a block diagram of a hardware architecture according to an embodiment of the present application includes, but is not limited to: an electronic device 11 and a server 12.
For example, the electronic device 11 may be any electronic product capable of interacting with a user through one or more manners such as a keyboard, a touch pad, a touch screen, a remote controller, voice interaction, or a handwriting device, for example, a mobile phone, a tablet computer, a palm computer, a personal computer, a wearable device, a smart television, and the like.
The server 12 may be, for example, one server, a server cluster composed of a plurality of servers, or a cloud computing service center.
It should be understood that fig. 1 is only an example, and does not limit the number of electronic devices 11 and the number of servers 12.
In an alternative implementation manner, the named entity identification method provided by the embodiment of the present application may be applied to the electronic device 11, which involves two cases.
Case one: the screening engine and the conditional random field model are disposed in the electronic device 11.
In case one, the electronic device 11 acquires a text to be recognized, where the text consists of a plurality of keywords, and then acquires a first probability value set corresponding to each of the keywords. The electronic device 11 inputs the first probability value sets into a pre-constructed conditional random field model, divides the text into a plurality of named entities through the conditional random field model, and obtains, through the conditional random field model, the named entity categories to which the named entities respectively belong, where each named entity comprises one or more keywords belonging to the same named entity category. The electronic device 11 then inputs the named entities and the named entity categories to which they respectively belong into the screening engine to judge the customer's money laundering risk.
Illustratively, a keyword refers to a character that makes up the text. For example, the customer information provided by the customer is: "Sun Jiu, president of Beijing Express Co., Ltd., inspects work in Xiongan New Area". After the first probability value set corresponding to each keyword is obtained, a named entity consisting of a keyword belonging to the start label and keywords belonging to the process label under the same named entity category can be determined from the first probability value sets. Therefore, after the first probability value sets are obtained, they are input into the pre-constructed conditional random field model, and the named entities obtained from the text through the conditional random field model include: Beijing, express, limited company, president, Sun Jiu, Xiongan New Area. The named entities are thus obtained while taking into account both the named entity category of each keyword and the probability value that the keyword belongs to the process label or the start label.
Case two: the screening engine is disposed in the server 12 and the conditional random field model is disposed in the electronic device 11.
In case two, the electronic device 11 and the server 12 are required to interact in the course of performing the named entity recognition method.
The electronic device 11 acquires a text to be recognized, where the text consists of a plurality of keywords, and then acquires a first probability value set corresponding to each of the keywords. The electronic device 11 inputs the first probability value sets into a pre-constructed conditional random field model, divides the text into a plurality of named entities through the conditional random field model, and obtains, through the conditional random field model, the named entity categories to which the named entities respectively belong, where each named entity comprises one or more keywords belonging to the same named entity category. The electronic device 11 sends the named entities and the obtained named entity categories to the server 12, and the server 12 inputs the named entities and the named entity categories to which they respectively belong into the screening engine to judge the customer's money laundering risk.
In an alternative implementation, the screening engine and the conditional random field model may both be located in the server 12.
It will be understood by those skilled in the art that the foregoing electronic devices and servers are merely exemplary and that other existing or future electronic devices or servers may be suitable for use with the present disclosure and are intended to be included within the scope of the present disclosure and are hereby incorporated by reference.
The named entity identification method provided by the embodiment of the present application is described below with reference to the above hardware architecture.
Referring to fig. 2, fig. 2 is a flowchart of a named entity identifying method provided according to an exemplary embodiment, which may be applied to the server 12 or the electronic device 11 described above, and the method includes the following steps S201 to S203 in implementation.
Step S201: acquiring a text to be recognized, wherein the text consists of a plurality of keywords.
Illustratively, a keyword refers to a character constituting the text. For example, the text is "Zhang San, president of Beijing Express Co., Ltd., inspects work in Xiongan New Area"; in the original Chinese, this text includes 22 keywords (characters), which are, romanized and in order: bei, jing, kuai, di, you, xian, gong, si, dong, shi, zhang, Zhang, San, zai, Xiong, An, Xin, Qu, shi, cha, gong, zuo.
Illustratively, the characters may be English, Chinese, Japanese, Korean, special symbols, and the like.
For example, if the characters are English, a keyword may be an English word.
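A minimal sketch of this splitting is given below, assuming that each run of ASCII letters or digits is treated as one keyword and every other non-space character (for example, each Chinese character) is its own keyword. The function name `split_keywords` and the Chinese sample string, which is a reconstruction of the translated example sentence, are assumptions.

```python
import re
from typing import List


def split_keywords(text: str) -> List[str]:
    """Split text into keywords.

    A run of ASCII letters/digits (e.g. an English word) becomes one keyword;
    any other non-whitespace character becomes a keyword on its own.
    """
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)


# Assumed reconstruction of the example sentence in Chinese (22 characters).
sample = "北京快递有限公司董事长张三在雄安新区视察工作"
keywords = split_keywords(sample)
print(len(keywords), keywords)
print(split_keywords("张三 transfers 50w"))
```

Treating digits and letters as one token is a simplification of this sketch; the text does not specify how mixed tokens are handled.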
Step S202: acquiring a first probability value set corresponding to each keyword.
Step S203: inputting a first probability value set corresponding to the keywords respectively into a pre-constructed conditional random field model, dividing the text into a plurality of named entities through the conditional random field model, and obtaining the named entity categories to which the named entities respectively belong through the conditional random field model.
The first probability value set corresponding to a keyword comprises: a first probability value that the keyword belongs to the start label under each named entity category; a first probability value that the keyword belongs to the process label under each named entity category; first probability values that the i keywords located before the keyword belong to the start label under each named entity category; first probability values that the i keywords located before the keyword belong to the process label under each named entity category; first probability values that the j keywords located after the keyword belong to the start label under each named entity category; and first probability values that the j keywords located after the keyword belong to the process label under each named entity category. Here i takes the values 1, 2, …, N-1 in sequence, where the maximum value of N is the position of the keyword in the text, and j takes the values 1, 2, …, M in sequence, where the maximum value of M is the difference between the total number of words in the text and that position. The probability value that a keyword belongs to the start label is the first probability value that the keyword is the first keyword of a named entity, and the probability value that a keyword belongs to the process label is the first probability value that the keyword is a non-first keyword of a named entity.
Where a named entity is an entity identified by a name.
Each named entity comprises one or more keywords belonging to the same named entity category. The conditional random field model is trained by inputting into it a sample text and the first probability value sets corresponding to the keywords contained in the sample text, with the named entities labeled in the sample text and the named entity categories to which those named entities belong as training targets.
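To illustrate how keywords carrying a start label and process labels of the same category combine into named entities, a minimal sketch follows. The tag representation (a `(label, category)` pair per keyword, or `None` for a keyword outside any named entity), the function name `group_entities`, and the sample characters are assumptions for illustration, not the model's actual output format.

```python
from typing import List, Optional, Tuple

Tag = Optional[Tuple[str, str]]  # (label, category) or None


def group_entities(keywords: List[str], tags: List[Tag], sep: str = "") -> List[Tuple[str, str]]:
    """Group keywords into (entity text, category) pairs.

    A "start" label opens a new entity; a "process" label of the same
    category extends the current one; anything else closes it.
    """
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_cat: Optional[str] = None
    for keyword, tag in zip(keywords, tags):
        label, category = tag if tag else (None, None)
        if label == "process" and current and category == current_cat:
            current.append(keyword)
            continue
        if current:  # close the entity collected so far
            entities.append((sep.join(current), current_cat))
            current, current_cat = [], None
        if label in ("start", "process"):  # a stray process label opens a new entity
            current, current_cat = [keyword], category
    if current:
        entities.append((sep.join(current), current_cat))
    return entities


keywords = ["张", "三", "在", "雄", "安", "新", "区"]
tags = [
    ("start", "person name"), ("process", "person name"), None,
    ("start", "place name"), ("process", "place name"),
    ("process", "place name"), ("process", "place name"),
]
print(group_entities(keywords, tags))
```

For English keywords, passing `sep=" "` would join the words of an entity with spaces.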
It can be understood that, for each keyword, the more preceding and following keywords are covered by its first probability value set, that is, the larger the maximum values of i and j, the more complete the contextual information carried by that set. However, covering too many preceding and following keywords seriously reduces the operating efficiency of the conditional random field model and may also cause overfitting. Moreover, since a named entity is an entity identified by a name, the number of keywords making up a named entity is usually not large. The two keywords before and the two keywords after each keyword can therefore be observed, that is, the maximum value of i = the maximum value of j = 2.
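The bounded context window can be sketched as follows, assuming 1-based positions and a cap of two keywords on each side. The function name `context_positions` and its return format are assumptions made for illustration.

```python
from typing import List, Tuple


def context_positions(position: int, total: int, window: int = 2) -> Tuple[List[int], List[int]]:
    """Return the 1-based positions of the keywords observed before and after.

    position: position N of the keyword in the text (1-based)
    total:    total number of keywords in the text
    window:   cap on i and j (2 in the example above)
    """
    max_i = min(window, position - 1)      # i = 1 .. N-1, capped at the window
    max_j = min(window, total - position)  # j = 1 .. M, with M = total - N
    before = [position - i for i in range(1, max_i + 1)]
    after = [position + j for j in range(1, max_j + 1)]
    return before, after


print(context_positions(1, 22))   # first keyword: nothing before it
print(context_positions(12, 22))  # a keyword in the middle of the text
print(context_positions(22, 22))  # last keyword: nothing after it
```

Keywords near the start or end of the text simply have fewer neighbours, which is why the cap is combined with N-1 and M.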
Step S202 and step S203 are illustrated below. Exemplary named entity categories include, but are not limited to: person name, organization, ship/aircraft, place name, commodity, and other.
Exemplarily, before the first probability value set corresponding to each keyword is obtained, a first common lexicon corresponding to each named entity category needs to be constructed, and the first common lexicon corresponding to each named entity category is used as a basis for obtaining the first probability value set corresponding to each keyword.
For example, the first common lexicon corresponding to a named entity category includes the vocabulary of the named entities belonging to that category. Taking the named entity category "person name" as an example, the first common lexicon corresponding to the person name includes, but is not limited to: surnames and given names.
Exemplary surnames include, but are not limited to: Zhang, Wang, Zheng …. The surname is the beginning of a person name and corresponds to the start label; the given name is the non-beginning part of a person name and corresponds to the process label.
For example, if the keyword is an English word, the first common lexicon corresponding to each named entity category may further include: English words of various parts of speech and English words of various tenses.
The first common lexicon corresponding to a named entity category can be represented as a table, a function, or a struct; the description below uses the table representation and takes the named entity category "person name" as an example.
Illustratively, the first common thesaurus for the named entity category "person name" is shown in Table 1.
TABLE 1 First common lexicon for the named entity category "person name"

| Type | Vocabulary | Preset first probability value |
|---|---|---|
| Common surnames | Wang, Zhang, Li | 0.8 |
| Common given names | Jian, Guo, Qing, Dao, Ting | 0.8 |
| Rare surnames | Xi, Tian, Zu | 0.2 |
| Rare given names | Lang … | 0.2 |
For example, for each named entity category, the keyword may be compared with the vocabulary in the first common lexicon corresponding to that category to obtain the first probability value that the keyword belongs to the start label or the process label under that category. Take the named entity category "person name" and the keyword "Jian" as an example: the keyword "Jian" matches "Jian" among the common given names listed in Table 1, and common given names correspond to the process label, so the first probability value that the keyword "Jian" belongs to the process label under the named entity category "person name" is 0.8.
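The lookup described above can be sketched as a dictionary query. This is only an illustrative sketch: the dictionary layout, the function name `first_probability`, and the default value of 0.0 for words missing from the lexicon are assumptions, since the text does not say how unlisted keywords are scored. Only a few entries of Table 1 are reproduced.

```python
# Hypothetical in-memory form of part of Table 1 (layout and names are assumptions).
PERSON_NAME_LEXICON = {
    # Surnames open a person name, so they map to the start label.
    "start": {"Zhang": 0.8, "Li": 0.8, "Xi": 0.2},
    # Given-name words continue a person name, so they map to the process label.
    "process": {"Jian": 0.8, "Ting": 0.8, "Lang": 0.2},
}


def first_probability(keyword, label, lexicon=PERSON_NAME_LEXICON, default=0.0):
    """Return the preset first probability value of `keyword` for `label`.

    `default` is an assumed fallback for keywords absent from the lexicon.
    """
    return lexicon[label].get(keyword, default)


print(first_probability("Jian", "process"))  # 0.8
print(first_probability("Xi", "start"))      # 0.2
```

A lexicon per named entity category would be queried the same way, once for the start label and once for the process label.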
The following examples all continue the example in step S201. The first probability value set corresponding to the keyword "Zhang" is illustrated with i = 1 and j = 1; the other keywords follow by analogy and are not listed one by one. The set comprises: the first probability values that the keyword "Zhang" belongs to the start label under each named entity category; the first probability values that the keyword "Zhang" belongs to the process label under each named entity category; the first probability values that the 1 keyword before "Zhang" belongs to the start label under each named entity category; the first probability values that the 1 keyword before "Zhang" belongs to the process label under each named entity category; the first probability values that the 1 keyword after "Zhang" belongs to the start label under each named entity category; and the first probability values that the 1 keyword after "Zhang" belongs to the process label under each named entity category.
In summary, the keywords covered by the first probability value set corresponding to the keyword "Zhang" are: {"0word": "Zhang", "-1word": "long", "+1word": "San"}, where "0word" represents the current keyword, "-1word" represents the 1 keyword before the current keyword, and "+1word" represents the 1 keyword after the current keyword.
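As a rough illustration of how such a context window could be assembled, the sketch below builds the "0word" / "-iword" / "+jword" mapping for one keyword. The function name `window_keywords` and its parameters are hypothetical; positions that fall outside the text are simply skipped, which is an assumption about boundary handling.

```python
def window_keywords(keywords, index, max_i=1, max_j=1):
    """Collect the current keyword plus up to max_i preceding and max_j
    following keywords, using the "0word"/"-1word"/"+1word" key scheme."""
    features = {"0word": keywords[index]}
    for i in range(1, max_i + 1):
        if index - i >= 0:                 # skip positions before the text start
            features[f"-{i}word"] = keywords[index - i]
    for j in range(1, max_j + 1):
        if index + j < len(keywords):      # skip positions past the text end
            features[f"+{j}word"] = keywords[index + j]
    return features


keywords = ["long", "Zhang", "San"]
print(window_keywords(keywords, 1))
# {'0word': 'Zhang', '-1word': 'long', '+1word': 'San'}
```

Setting `max_i=2` and `max_j=2` gives the two-keyword window suggested earlier in the description.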
In an alternative embodiment, the process by which the conditional random field model partitions the text into named entities includes steps A11 through A13.
Step A11: for each keyword, determine the maximum value A among the first probability values that the keyword belongs to the start label and the first probability values that the keyword belongs to the process label under each named entity category.
Step A12: for each keyword, determine the named entity category corresponding to the maximum value A of the keyword as the named entity category to which the keyword belongs, and determine the label (either the process label or the start label) corresponding to the maximum value A of the keyword as the label of the keyword.
Step A12 is illustrated below. For example, the first probability value that the keyword "Zhang" belongs to the start label is 0.8 under the named entity category "person name", 0.3 under "organization", 0.1 under "ship/aircraft", 0.6 under "place name", 0.3 under "commodity", and 0.3 under "other". The first probability value that the keyword "Zhang" belongs to the process label is 0.5 under "person name", 0.4 under "organization", 0.1 under "ship/aircraft", 0.3 under "place name", 0.3 under "commodity", and 0.4 under "other".
Since the highest first probability value of the keyword "Zhang" is 0.8, obtained for the start label under the named entity category "person name", the keyword "Zhang" is assigned the start label in the named entity category "person name".
Illustratively, the 1 keyword before the keyword "Zhang" is "long". The first probability value that the keyword "long" belongs to the start label is 0.2 under the named entity category "person name", 0.5 under "organization", 0.2 under "ship/aircraft", 0.6 under "place name", 0.2 under "commodity", and 0.3 under "other". The first probability value that the keyword "long" belongs to the process label is 0.6 under "person name", 0.5 under "organization", 0.5 under "ship/aircraft", 0.5 under "place name", 0.3 under "commodity", and 0.7 under "other".
Since the highest first probability value of the keyword "long" is 0.7, obtained for the process label under the named entity category "other", the keyword "long" is assigned the process label in the named entity category "other".
Illustratively, the 1 keyword after the keyword "Zhang" is "San". The first probability value that the keyword "San" belongs to the start label is 0.3 under the named entity category "person name", 0.6 under "organization", 0.4 under "ship/aircraft", 0.5 under "place name", 0.5 under "commodity", and 0.4 under "other". The first probability value that the keyword "San" belongs to the process label is 0.7 under "person name", 0.4 under "organization", 0.4 under "ship/aircraft", 0.4 under "place name", 0.5 under "commodity", and 0.3 under "other".
Since the highest first probability value of the keyword "San" is 0.7, obtained for the process label under the named entity category "person name", the keyword "San" is assigned the process label in the named entity category "person name".
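Steps A11 and A12 amount to taking a maximum over all (category, label) pairs for each keyword. The sketch below replays the three worked examples with the probability values quoted above; the data layout, the romanized keyword spellings "Zhang" and "San", the category spelling "ship/aircraft", and the function name `best_label` are assumptions made for illustration. How ties are resolved is not stated in the text; this sketch keeps the first maximum encountered.

```python
CATEGORIES = ["person name", "organization", "ship/aircraft",
              "place name", "commodity", "other"]

# First probability values per keyword, listed in CATEGORIES order.
FIRST_PROBS = {
    "Zhang": {"start":   [0.8, 0.3, 0.1, 0.6, 0.3, 0.3],
              "process": [0.5, 0.4, 0.1, 0.3, 0.3, 0.4]},
    "long":  {"start":   [0.2, 0.5, 0.2, 0.6, 0.2, 0.3],
              "process": [0.6, 0.5, 0.5, 0.5, 0.3, 0.7]},
    "San":   {"start":   [0.3, 0.6, 0.4, 0.5, 0.5, 0.4],
              "process": [0.7, 0.4, 0.4, 0.4, 0.5, 0.3]},
}


def best_label(probs):
    """Steps A11/A12: return (maximum value A, category, label)."""
    candidates = [
        (value, category, label)
        for label, values in probs.items()
        for category, value in zip(CATEGORIES, values)
    ]
    return max(candidates, key=lambda item: item[0])


for keyword, probs in FIRST_PROBS.items():
    print(keyword, best_label(probs))
# Zhang (0.8, 'person name', 'start')
# long (0.7, 'other', 'process')
# San (0.7, 'person name', 'process')
```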
Step A13: divide the text into a plurality of named entities according to the rule "start label + process label" or the rule "start label", together with the rule of the same named entity category.
The rule "start label + process label" means that if a named entity includes multiple keywords, the first keyword of the named entity belongs to the start label and the other keywords belong to the process label. The rule "start label" means that the named entity consists of a single keyword, and that keyword belongs to the start label. The rule of the same named entity category means that all keywords constituting the same named entity belong to the same named entity category.
Step A13 is illustrated below. The keyword "Zhang" belongs to the start label in the named entity category "person name", and the keyword "San" belongs to the process label in the named entity category "person name". Because the two keywords belong to the same named entity category, "Zhang" carries the start label, and "San" carries the process label, it can be determined that "Zhang" and "San" belong to the same named entity, "Zhang San", whose named entity category is "person name". The other named entities follow by analogy and are not illustrated one by one. The named entities finally divided by the conditional random field model are: Beijing, express delivery, limited company, chairman, Zhang San, Xiong'an New Area. The named entity categories to which they belong are, respectively: place name, other, organization, other, person name, place name.
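The grouping rule of step A13 can be sketched as a single pass over the labeled keywords. The function name `group_entities` and the tuple layout are hypothetical. One point the text does not cover is a process label that has no open entity of the same category in front of it; this sketch simply starts a new entity in that case, which is an assumption.

```python
def group_entities(tagged):
    """Group (keyword, category, label) triples into named entities.

    A keyword with the "process" label joins the entity currently being
    built when the category matches; any other keyword opens a new entity.
    """
    entities = []
    for keyword, category, label in tagged:
        continues = (
            label == "process"
            and bool(entities)
            and entities[-1]["category"] == category
        )
        if continues:
            entities[-1]["keywords"].append(keyword)
        else:
            # Start label, or a stray process label (assumed to open an entity).
            entities.append({"category": category, "keywords": [keyword]})
    return entities


tagged = [
    ("Zhang", "person name", "start"),
    ("San", "person name", "process"),
]
print(group_entities(tagged))
# [{'category': 'person name', 'keywords': ['Zhang', 'San']}]
```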
In conclusion, a text to be recognized is acquired, the text consisting of a plurality of keywords, and the first probability value sets respectively corresponding to the keywords are acquired. The first probability value set corresponding to a keyword comprises the first probability values that the keyword belongs to the start label and to the process label under each named entity category, the first probability values that the i keywords located before the keyword belong to the start label and to the process label under each named entity category, and the first probability values that the j keywords located after the keyword belong to the start label and to the process label under each named entity category. Here i takes the values 1, 2, …, N-1 in sequence, where N is the position of the keyword in the text, and j takes the values 1, 2, …, M in sequence, where the maximum value of M is the difference between the total number of words in the text and the position of the keyword. The probability value that a keyword belongs to the start label is the first probability value that it is the first keyword of a named entity, and the probability value that it belongs to the process label is the first probability value that it is a non-first keyword of a named entity. Once the first probability value set corresponding to each keyword is obtained, it can be determined from these sets that a named entity consists of a keyword belonging to the start label and keywords belonging to the process label under the same named entity category. Therefore, the first probability value sets corresponding to the keywords are input into the pre-constructed conditional random field model, the text is divided into a plurality of named entities through the conditional random field model, and the named entity categories to which the named entities belong are obtained through the conditional random field model, each named entity comprising one or more keywords belonging to the same named entity category. Because the first probability value set of each keyword captures the relation between the keyword and its preceding i keywords and following j keywords, the named entities divided by the conditional random field model are more accurate, which effectively reduces cases in which the screening engine misjudges whether a customer poses a money laundering risk.
In an optional implementation, the customer information provided by a customer may contain a nested named entity, that is, a named entity composed of a plurality of named entities. When a nested named entity is recognized, the named entities that belong to the same nested named entity may be recognized as several separate named entities. The screening engine then screens the named entity category of each of these named entities individually, although they should be treated as one nested named entity, which makes the screening result inaccurate and leads to misjudging whether the customer poses a money laundering risk. The nested named entity therefore needs to be recognized accurately. Based on this, the embodiments of the present application provide, but are not limited to, the following method, which includes steps B11 and B12.
Step B11: acquire the second probability value sets respectively corresponding to the named entities.
Step B12: inputting the second probability value sets respectively corresponding to the named entities and the text into a pre-constructed feature model, and obtaining nested named entities and named entity categories to which the nested named entities belong through the feature model.
The feature model is obtained by inputting the named entities contained in a sample text, the second probability value sets respectively corresponding to those named entities, and the sample text into the feature model, and training with the labeled nested named entities contained in the sample text and the named entity categories to which the nested named entities belong as targets.
The second probability value set corresponding to a named entity comprises: the second probability values that the named entity belongs to a nested named entity under each named entity category; the second probability values that each of the q named entities located before the named entity belongs to a nested named entity under each named entity category; and the second probability values that each of the k named entities located after the named entity belongs to a nested named entity under each named entity category. Here q takes the values 1, 2, …, X-1 in sequence, where X is the position of the named entity among the plurality of named entities, and k takes the values 1, 2, …, Y in sequence, where the maximum value of Y is the difference between the total number of named entities and the position of the named entity.
Wherein the nested named entity comprises a plurality of named entities.
Exemplary named entity categories include, but are not limited to: person name, organization, ship/aircraft, place name, commodity, and other.
Illustratively, the text to be recognized is "Zhang San, chairman of China Petroleum and Natural Gas Co., Ltd., holds a meeting in Xiong'an New Area", and the named entities obtained through steps S201 to S203 are, in the order in which they appear in the text: China, petroleum, natural gas, limited company, chairman, Zhang San, Xiong'an New Area. The second probability value set corresponding to the named entity "petroleum" is illustrated below; the second probability value sets corresponding to the other named entities follow by analogy and are not listed one by one.
Exemplarily, before obtaining the second probability value set of each named entity belonging to the nested named entity under each named entity category, a second common lexicon corresponding to each named entity category needs to be constructed, and the second common lexicon corresponding to each named entity category is used as a basis for obtaining the second probability value set of each named entity belonging to the nested named entity under each named entity category.
After the plurality of named entities, the second probability value sets respectively corresponding to the named entities, and the text are input into the pre-constructed feature model, the feature model takes, for each named entity, the highest second probability value across the named entity categories in its second probability value set as the second probability value of that named entity, and then divides the nested named entities.
Steps B11 and B12 are illustrated below. For example, the second common lexicon corresponding to a named entity category records the positional relationship between the named entities contained in the nested named entities under that category. The description below takes the named entity category "organization" as an example.
The second common lexicon corresponding to a named entity category can be represented as a table, a function, or a struct; the table representation is used here. Illustratively, the second common lexicon for the named entity category "organization" is shown in Table 2.
TABLE 2 Second common lexicon for the named entity category "organization"

| Type | Named entities | Preset probability value |
|---|---|---|
| High-probability named entities on the left | China, Daqing | 0.8 |
| Low-probability named entities on the left | Zhang San, Li Si | 0.2 |
| High-probability named entities on the right | Natural gas, coal | 0.9 |
| Low-probability named entities on the right | Reporter, captain | 0.1 |
It will be appreciated that nested named entities include a plurality of named entities, with "high-probability named entity on the left" referring to a named entity with a high probability of occurring to the left (i.e., in front), a "low-probability named entity on the left" referring to a named entity with a low probability of occurring to the left (i.e., in front), and a "high-probability named entity on the right" referring to a named entity with a high probability of occurring to the right (i.e., behind). "Low probability named entity on the right" refers to a named entity that has a low probability of appearing on the right (i.e., behind).
In an alternative implementation, for each named entity category, the process of calculating the second probability value that a named entity A belongs to a nested named entity under that category comprises: acquiring one or more named entities B in front of (that is, on the left side of) the named entity A and one or more named entities C behind (that is, on the right side of) the named entity A; determining type 1, the type in the second common lexicon of the named entity category that matches the named entity B; obtaining the preset probability value 1 corresponding to type 1; determining type 2, the type in the second common lexicon of the named entity category that matches the named entity C; obtaining the preset probability value 2 corresponding to type 2; and determining, based on probability value 1 and probability value 2, the second probability value that the named entity A belongs to a nested named entity under the named entity category.
For example, still taking the named entities obtained in steps S201 to S203, the named entities are, in the order in which they appear in the text: China, petroleum, natural gas, limited company, chairman, Zhang San, Xiong'an New Area. Taking the named entity "petroleum" as an example, the 1 named entity before "petroleum" is "China" and the 1 named entity after it is "natural gas". Because "China" is listed among the high-probability named entities on the left with a probability value of 0.8, and "natural gas" is listed among the high-probability named entities on the right with a probability value of 0.9, it can be determined that the second probability value that "petroleum" belongs to a nested named entity under the named entity category "organization" is high, for example (0.8 + 0.9)/2 = 0.85.
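The example calculation can be sketched as two lexicon lookups followed by an average. Averaging is the combination shown in the example above; the text only requires that the result be determined "based on" the two probability values, so other combinations are possible. The dictionary layout, the function name `second_probability`, and the 0.0 fallback for unlisted neighbors are assumptions, and only some entries of Table 2 are reproduced.

```python
# Hypothetical in-memory form of part of Table 2 (category "organization").
ORG_LEXICON = {
    "left":  {"China": 0.8, "Daqing": 0.8},          # neighbors seen in front
    "right": {"natural gas": 0.9, "coal": 0.9,
              "reporter": 0.1, "captain": 0.1},       # neighbors seen behind
}


def second_probability(left_entity, right_entity,
                       lexicon=ORG_LEXICON, default=0.0):
    """Average the preset values of the left and right neighbors.

    `default` is an assumed value for neighbors absent from the lexicon.
    """
    p_left = lexicon["left"].get(left_entity, default)
    p_right = lexicon["right"].get(right_entity, default)
    return (p_left + p_right) / 2


# "petroleum" has "China" on its left and "natural gas" on its right.
print(round(second_probability("China", "natural gas"), 2))  # 0.85
```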
Illustratively, the second probability value set corresponding to the named entity "petroleum" is described with q = 1 and k = 1; the other named entities follow by analogy and are not illustrated one by one. The set comprises the second probability values that the named entity "petroleum" belongs to a nested named entity under each named entity category, the second probability values that the 1 named entity before "petroleum" belongs to a nested named entity under each named entity category, and the second probability values that the 1 named entity after "petroleum" belongs to a nested named entity under each named entity category.
In summary, the named entities covered by the second probability value set corresponding to the named entity "petroleum" are: {"0word": "petroleum", "-1word": "China", "+1word": "natural gas"}, where "0word" represents the current named entity, "-1word" represents the 1 named entity before the current named entity, and "+1word" represents the 1 named entity after the current named entity.
The process of obtaining a nested named entity and the named entity category to which it belongs through the feature model is described below. The process includes: for each named entity, determining the maximum value B among the second probability values that the named entity belongs to a nested named entity under each named entity category; for each named entity, determining the named entity category corresponding to its maximum value B as the named entity category to which the named entity belongs; determining, from the preset rules respectively corresponding to the named entity categories, the preset rule of the target named entity category that the named entities to be divided conform to; dividing the plurality of named entities by the preset rule of the target named entity category to obtain the nested named entity; and determining the target named entity category as the named entity category to which the nested named entity belongs.
For example, the preset rules corresponding to different named entity categories may differ. Within the same text, different named entities may be divided by different preset rules; for example, the first 3 named entities in the text are divided by the preset rule corresponding to the named entity category "person name", and the last 4 named entities in the text are divided by the preset rule corresponding to the named entity category "organization". Taking the named entity category "organization" as an example, the preset rule is "place name-T organizations", "place name-person name-T organizations", or "organization-T organizations". "Place name-T organizations" means that the named entity categories of the named entities contained in the nested named entity are, in order: place name, organization, …, organization. "Place name-person name-T organizations" means that the named entity categories of the named entities contained in the nested named entity are, in order: place name, person name, organization, …, organization. The others follow by analogy.
The following describes, by way of example, a process of obtaining a nested named entity and a named entity category to which the nested named entity belongs through the feature model.
Illustratively, the second probability value that the named entity "petroleum" belongs to a nested named entity is 0.1 under the named entity category "person name", 0.85 under "organization", 0.1 under "ship/aircraft", 0.3 under "place name", 0.5 under "commodity", and 0.3 under "other".
Since the highest second probability value of the named entity "petroleum" is 0.85, obtained under the named entity category "organization", the named entity "petroleum" belongs to the named entity category "organization".
Illustratively, the 1 named entity before the named entity "petroleum" is "China". The second probability value that the named entity "China" belongs to a nested named entity is 0.3 under the named entity category "person name", 0.8 under "organization", 0.3 under "ship/aircraft", 0.9 under "place name", 0.5 under "commodity", and 0.4 under "other".
Since the highest second probability value of the named entity "China" is 0.9, obtained under the named entity category "place name", the named entity "China" belongs to the named entity category "place name".
Illustratively, the 1 named entity after the named entity "petroleum" is "natural gas". The second probability value that the named entity "natural gas" belongs to a nested named entity is 0.1 under the named entity category "person name", 0.7 under "organization", 0.1 under "ship/aircraft", 0.2 under "place name", 0.3 under "commodity", and 0.6 under "other".
Since the highest second probability value of the named entity "natural gas" is 0.7, obtained under the named entity category "organization", the named entity "natural gas" belongs to the named entity category "organization".
Similarly, "limited company" belongs to the named entity category "organization".
Exemplarily, the named entity category of "China" is "place name", and the named entity categories of "petroleum", "natural gas", and "limited company" are all "organization". It can therefore be understood that the preset rule for dividing these 4 named entities is the preset rule of the named entity category "organization". The feature model then obtains a nested named entity according to that preset rule, namely "place name-T organizations" with T = 3, giving "China Petroleum and Natural Gas Co., Ltd.", whose named entity category is "organization".
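As an illustration of the "place name-T organizations" rule only, the sketch below merges a place name with the run of organization entities that directly follows it. The function name `merge_place_then_orgs`, the tuple layout, and joining the parts with spaces are assumptions; the other preset rules mentioned above are not implemented, and the trained feature model itself is not represented.

```python
def merge_place_then_orgs(entities):
    """Apply the "place name-T organizations" rule (T >= 1).

    `entities` is a list of (text, category) pairs in text order. A place
    name followed by one or more organizations becomes one nested entity
    of category "organization"; everything else is kept unchanged.
    """
    merged = []
    i = 0
    while i < len(entities):
        text, category = entities[i]
        j = i + 1
        if category == "place name":
            # Extend over the run of organizations right after the place name.
            while j < len(entities) and entities[j][1] == "organization":
                j += 1
        if j - i > 1:
            nested_text = " ".join(part for part, _ in entities[i:j])
            merged.append((nested_text, "organization"))
        else:
            merged.append((text, category))
        i = j
    return merged


entities = [
    ("China", "place name"), ("petroleum", "organization"),
    ("natural gas", "organization"), ("limited company", "organization"),
    ("chairman", "other"), ("Zhang San", "person name"),
]
print(merge_place_then_orgs(entities))
# [('China petroleum natural gas limited company', 'organization'),
#  ('chairman', 'other'), ('Zhang San', 'person name')]
```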
In summary, acquiring the second probability value sets respectively corresponding to the named entities makes it possible to determine the named entity category relation between each named entity and the named entities before and after it. After the named entities, their second probability value sets, and the text are input into the pre-constructed feature model, the feature model can identify the nested named entities contained among the named entities, thereby achieving the purpose of recognizing nested named entities.
In an optional implementation, the conditional random field model may fail to divide all the named entities contained in the text, and if some named entities are not divided, the screening engine may misjudge whether the customer poses a money laundering risk. Therefore, before the feature model is used to divide nested named entities, a word segmentation tool may be used to segment the part of the text outside the plurality of named entities, and the resulting keywords assist the feature model in dividing the nested named entities. Based on this, the embodiments of the present application provide, but are not limited to, the following method, which includes steps C11 and C12.
Step C11: segment the part of the text other than the plurality of named entities to obtain keywords.
Step C12: inputting a plurality of named entities, the keyword, a second probability value set corresponding to each named entity and the text into a pre-constructed feature model, and obtaining a nested named entity and a named entity category to which the nested named entity belongs through the feature model, wherein the nested named entity comprises at least one named entity and the keyword.
Step C11 and step C12 are illustrated below. The text is, for example: "Zhang San, chairman of China Petroleum and Natural Gas Co., Ltd., will hold a meeting at the Xiong'an New Area branch on how to cultivate the enthusiasm of employees." The named entities obtained in steps S201 to S203 are: China, petroleum, natural gas, limited company, chairman, Zhang San, employee, meeting. Segmenting the part of the text other than these named entities with a word segmentation tool yields the keywords: will, at, Xiong'an New Area, branch, hold, about, how, cultivate, enthusiasm. The nested named entities obtained through the feature model are: China Petroleum and Natural Gas Co., Ltd. and Xiong'an New Area branch, whose named entity categories are, respectively: organization, organization. In summary, the named entities and nested named entities obtained by the conditional random field model and the feature model are: China Petroleum and Natural Gas Co., Ltd., chairman, Zhang San, Xiong'an New Area branch, employee, meeting. Their named entity categories are, respectively: organization, other, person name, organization, other, other.
In conclusion, before the feature model is used to divide the named entities into nested named entities, the word segmentation tool segments the part of the text other than the named entities, which recovers the named entities that the conditional random field model failed to divide, so that nested named entities can be divided more accurately.
In an optional embodiment, after the named entities in the text are divided by the conditional random field model, whether the division of the conditional random field model on the named entities reaches the standard needs to be checked, if the division accuracy reaches the standard, the conditional random field model does not need to be trained, and if the division accuracy does not reach the standard, the conditional random field model continues to be trained. Based on this, the present embodiment provides, but is not limited to, a method including the following steps D11 to D13.
Step D11: by a first formula, P = a/b, calculating the accuracy P of the division of named entities by the conditional random field model, wherein a is the number of named entities correctly divided by the conditional random field model, and b is the total number of named entities divided by the conditional random field model.
Step D12: by a second formula, R = a/c, calculating the recall rate R of the named entities divided by the conditional random field model, wherein c is the total number of named entities contained in the text.
Step D13: if F = P × R × 2/(P + R) is greater than or equal to a preset threshold, stopping training the conditional random field model.
Steps D11 to D13 are illustrated below. For example, the preset threshold may be 0.75. Taking the example used for steps C11 and C12, the text is: "Zhang San, the director of China Oil and Natural Gas Co., Ltd., will hold a meeting in the Xiongan New Area branch on how to cultivate the enthusiasm of employees." The named entities contained in the text are: China, oil, natural gas, limited company, director, Zhang San, Xiongan New Area, branch, employees, meeting, and c is 9. The named entities divided by the conditional random field model are: China, oil, natural gas, limited company, director, Zhang San, so b is 6; all of them are divided correctly, so a is 6. By the first formula, the accuracy is P = a/b = 6/6 = 1; by the second formula, the recall rate is R = a/c = 6/9 = 2/3. It follows that F = P × R × 2/(P + R) = 1 × (2/3) × 2/(1 + 2/3) = 0.8. Since 0.8 is greater than the preset threshold of 0.75, the division accuracy reaches the standard, and the training of the conditional random field model can be stopped.
In conclusion, staff can calculate the accuracy and the recall rate through the first formula and the second formula, determine from them whether the division of the named entities in the text by the conditional random field model reaches the preset threshold, and thereby decide whether the conditional random field model needs further training.
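The check in steps D11 to D13 can be written down in a few lines. The following is a minimal sketch; the function names `f_score` and `should_stop_training` and the handling of zero denominators are assumptions made for illustration, while the formulas and the example threshold of 0.75 come from the description above.

```python
def f_score(a: int, b: int, c: int) -> float:
    """Compute F = P * R * 2 / (P + R) with P = a / b and R = a / c.

    a: number of named entities correctly divided by the model
    b: total number of named entities divided by the model
    c: total number of named entities contained in the text
    """
    if b == 0 or c == 0:
        # Assumption: treat an undefined ratio as a score of 0.
        return 0.0
    p = a / b  # first formula (accuracy P)
    r = a / c  # second formula (recall rate R)
    if p + r == 0:
        return 0.0
    return p * r * 2 / (p + r)


def should_stop_training(a: int, b: int, c: int, threshold: float = 0.75) -> bool:
    """Return True when F reaches the preset threshold (step D13)."""
    return f_score(a, b, c) >= threshold


# Values from the worked example: a = 6, b = 6, c = 9.
example_f = f_score(6, 6, 9)
stop = should_stop_training(6, 6, 9)
```

With the values of the worked example, `example_f` evaluates to approximately 0.8 and `stop` is `True`.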
The method has been described in detail in the embodiments disclosed above. Since the method of the present application can be implemented by various types of apparatuses, the present application also discloses an apparatus, specific embodiments of which are described in detail below.
Referring to Fig. 3, which is a block diagram illustrating a named entity recognition apparatus according to an example embodiment, the apparatus includes: a first acquisition unit 31, a second acquisition unit 32, and a third acquisition unit 33, wherein:
The first acquisition unit 31 is configured to acquire a text to be recognized, where the text is composed of a plurality of keywords.
The second acquisition unit 32 is configured to obtain first probability value sets respectively corresponding to the plurality of keywords, where the first probability value set corresponding to a keyword includes: a first probability value that the keyword belongs to the beginning label under each named entity category; a first probability value that the keyword belongs to the process label under each named entity category; a first probability value that the i keywords located before the keyword belong to the beginning label under each named entity category; a first probability value that the i keywords located before the keyword belong to the process label under each named entity category; a first probability value that the j keywords located after the keyword belong to the beginning label under each named entity category; and a first probability value that the j keywords located after the keyword belong to the process label under each named entity category. The value of i is 1, 2, …, N-1 in sequence, where N is the position of the keyword in the text; the value of j is 1, 2, …, M in sequence, where the maximum value of M is the difference between the total number of words in the text and the position. The first probability value that the keyword belongs to the beginning label is the probability that the keyword is the first keyword of a named entity, and the first probability value that the keyword belongs to the process label is the probability that the keyword is a non-first keyword of a named entity.
The third acquisition unit 33 is configured to input the first probability value sets respectively corresponding to the plurality of keywords into a pre-constructed conditional random field model, divide the text into a plurality of named entities through the conditional random field model, and obtain, through the conditional random field model, the named entity categories to which the plurality of named entities respectively belong, where each named entity includes one or more keywords belonging to the same named entity category. The conditional random field model is obtained by inputting, into the conditional random field model, a sample text and the first probability value sets respectively corresponding to the plurality of keywords contained in the sample text, and training with the labeled named entities in the sample text and the named entity categories to which those named entities belong as targets.
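To make the role of the beginning label and the process label concrete, the sketch below groups labeled keywords into named entities. It is an illustration under stated assumptions, not the conditional random field model itself: the label strings `B-<category>`, `I-<category>`, and `O` and the function name `group_entities` are hypothetical, and the labels are assumed to have already been predicted by some sequence labeling model.

```python
from typing import List, Optional, Tuple


def group_entities(keywords: List[str], labels: List[str]) -> List[Tuple[List[str], str]]:
    """Group keywords into named entities from beginning/process labels.

    "B-X" marks the first keyword of an entity of category X,
    "I-X" marks a non-first keyword of an entity of category X,
    and any other label marks a keyword outside every entity.
    """
    entities: List[Tuple[List[str], str]] = []
    current: List[str] = []
    category: Optional[str] = None
    for word, label in zip(keywords, labels):
        if label.startswith("B-"):
            # A beginning label closes any open entity and opens a new one.
            if current:
                entities.append((current, category))
            current, category = [word], label[2:]
        elif label.startswith("I-") and current and label[2:] == category:
            # A process label of the same category extends the open entity.
            current.append(word)
        else:
            # Outside label, or a process label with no matching open entity:
            # close the open entity and skip this keyword.
            if current:
                entities.append((current, category))
            current, category = [], None
    if current:
        entities.append((current, category))
    return entities


keywords = ["Zhang", "San", "will", "meet"]
labels = ["B-NAME", "I-NAME", "O", "O"]
entities = group_entities(keywords, labels)
```

Here `entities` contains a single named entity made of the keywords "Zhang" and "San" with category "NAME", which matches the statement that a named entity includes one or more keywords belonging to the same named entity category.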
In an optional implementation manner, the named entity identifying apparatus further includes:
a first obtaining module, configured to obtain second probability value sets respectively corresponding to the plurality of named entities, where the second probability value set corresponding to a named entity includes: a second probability value that the named entity belongs to a nested named entity under each named entity category; a second probability value that the q named entities located before the named entity belong to a nested named entity under each named entity category; and a second probability value that the k named entities located after the named entity belong to a nested named entity under each named entity category. The value of q is 1, 2, …, X-1 in sequence, where X is the position of the named entity among the plurality of named entities; the value of k is 1, 2, …, Y in sequence, where the maximum value of Y is the difference between the total number of the named entities and the position of the named entity.
The second acquisition module is used for inputting the second probability value sets respectively corresponding to the named entities and the text into a pre-constructed feature model, and acquiring nested named entities and named entity categories to which the nested named entities belong through the feature model; the nested named entity comprises a plurality of the named entities; the feature model is obtained by inputting a plurality of named entities contained in a sample text, a second probability value set corresponding to each named entity and the sample text into the feature model, and training with the labeled nested named entities contained in the sample text and the named entity classes to which the nested named entities belong as targets.
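The ranges of q and k used for the second probability value set can be illustrated with a short sketch. It assumes 1-based positions, as the description suggests, and the function names `neighbour_offsets` and `neighbours` are hypothetical. The sketch only enumerates which neighbouring named entities are considered; it does not compute any probability value.

```python
from typing import List, Tuple


def neighbour_offsets(position: int, total: int) -> Tuple[List[int], List[int]]:
    """Return the values taken by q and k for a named entity.

    position: 1-based position X of the named entity among all named entities
    total:    total number of named entities
    """
    if not 1 <= position <= total:
        raise ValueError("position must lie between 1 and total")
    q_values = list(range(1, position))              # q = 1, 2, ..., X-1
    k_values = list(range(1, total - position + 1))  # k = 1, 2, ..., Y with Y = total - X
    return q_values, k_values


def neighbours(entities: List[str], position: int) -> Tuple[List[str], List[str]]:
    """Return the named entities located before and after the given position."""
    q_values, k_values = neighbour_offsets(position, len(entities))
    idx = position - 1  # convert the 1-based position to a 0-based index
    before = [entities[idx - q] for q in q_values]
    after = [entities[idx + k] for k in k_values]
    return before, after


entities = ["China", "oil", "natural gas", "limited company", "director", "Zhang San"]
q_values, k_values = neighbour_offsets(3, len(entities))
before, after = neighbours(entities, 3)
```

For the third named entity ("natural gas") out of six, q takes the values 1 and 2 and k takes the values 1, 2, and 3, so every other named entity in the list is covered exactly once.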
In an optional implementation manner, in the named entity identifying apparatus, the second obtaining module includes:
a word segmentation submodule, configured to segment the parts of the text other than the plurality of named entities to obtain keywords.
The acquisition submodule is used for inputting the plurality of named entities, the keywords, the second probability value sets respectively corresponding to the named entities and the text into a pre-constructed feature model, and acquiring nested named entities and named entity categories to which the nested named entities belong through the feature model, wherein the nested named entities comprise at least one named entity and the keywords.
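Once the feature model has decided which consecutive units belong together, assembling the nested named entities is a simple joining step. The sketch below assumes that the feature model's output is available as half-open index ranges with a category; that output format, the function name `build_nested`, and the space separator are assumptions for illustration (Chinese text would typically be joined without a separator). The units mirror the example given for steps C11 and C12.

```python
from typing import List, Tuple


def build_nested(
    units: List[str],
    spans: List[Tuple[int, int, str]],
    sep: str = " ",
) -> List[Tuple[str, str]]:
    """Join consecutive units into nested named entities.

    units: named entities and keywords in text order
    spans: (start, end, category) half-open index ranges, assumed here to be
           the (hypothetical) output of the feature model
    """
    nested = []
    for start, end, category in spans:
        if not 0 <= start < end <= len(units):
            raise ValueError("span out of range")
        nested.append((sep.join(units[start:end]), category))
    return nested


units = [
    "China", "oil", "natural gas", "limited company",  # named entities
    "director", "Zhang San",                           # named entities
    "will", "in", "Xiongan New Area", "branch",        # keywords from segmentation
]
spans = [(0, 4, "organization"), (8, 10, "organization")]
nested = build_nested(units, spans)
```

Keeping the spans as index ranges over the ordered units, rather than over raw characters, makes it easy to mix named entities from the conditional random field model with keywords from the word segmentation tool in one nested named entity.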
In an optional implementation manner, the named entity identifying apparatus further includes:
a first calculation submodule, configured to calculate, by a first formula, P = a/b, the accuracy P of the division of named entities by the conditional random field model, wherein a is the number of named entities correctly divided by the conditional random field model, and b is the total number of named entities divided by the conditional random field model.
A second calculation submodule, configured to calculate, by a second formula, R = a/c, the recall rate R of the named entities divided by the conditional random field model, wherein c is the total number of named entities contained in the text.
A processing submodule, configured to stop training the conditional random field model if F = P × R × 2/(P + R) is greater than or equal to a preset threshold.
Fig. 4 is a block diagram of an apparatus, which may be a server 12, provided according to an example embodiment, including but not limited to: a processor 41, a memory 42, a network interface 43, an I/O controller 44, and a communication bus 45.
It should be noted that the structure of the device shown in Fig. 4 is not intended to limit the device, and the device may include more or fewer components than those shown in Fig. 4, or some components may be combined, or a different arrangement of components may be used, as will be appreciated by those skilled in the art.
The following describes each component of the server in detail with reference to fig. 4:
The processor 41 is the control center of the apparatus. It connects the various parts of the entire apparatus through various interfaces and lines, and performs the functions of the apparatus and processes data by running or executing software programs and/or modules stored in the memory 42 and calling data stored in the memory 42, thereby monitoring the apparatus as a whole. The processor 41 may include one or more processing units; illustratively, the processor 41 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may also not be integrated into the processor 41.
Processor 41 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention, etc.
The memory 42 may include internal memory, such as a Random-Access Memory (RAM) 421 and a Read-Only Memory (ROM) 422, and may also include a mass storage device 423, such as at least one disk storage. Of course, the device may also include hardware required for other services.
The memory 42 is configured to store instructions executable by the processor 41. The processor 41 described above has the functionality shown in the named entity recognition method.
A wired or wireless network interface 43 is configured to connect the server to a network.
The processor 41, the memory 42, the network interface 43, and the I/O controller 44 may be connected to each other by a communication bus 45, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc.
In an exemplary embodiment, the device may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the named entity recognition method as provided by any of the embodiments of the present disclosure.
In an exemplary embodiment, there is also provided a computer readable storage medium having instructions which, when executed by a processor of the server 12, enable the server 12 to perform the named entity recognition method as described in any of the above.
In an exemplary embodiment, there is also provided a computer program product directly loadable into the internal memory of a computer, said memory being the memory 42 comprised by said server 12 and containing software code, said computer program being loadable and executable via a computer and being able to carry out the named entity recognition method as defined in any of the above.
It should be noted that the named entity identification method, device, server and storage medium provided by the invention can be used in the financial field or other fields. The other fields are arbitrary fields other than the financial field, for example, the identification field. The above description is only an example, and does not limit the application fields of the named entity identifying method, apparatus, server and storage medium provided by the present invention.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Note that the features described in the embodiments in the present specification may be replaced with or combined with each other. For the device or system type embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment. It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
It will be appreciated by those skilled in the art that the above-described servers are merely exemplary and that other existing or future servers, which may be present, are also encompassed within the scope of the present disclosure and are hereby incorporated by reference.

Claims (10)

1. A named entity recognition method, comprising:
acquiring a text to be recognized, wherein the text consists of a plurality of keywords;
acquiring first probability value sets respectively corresponding to the plurality of keywords, wherein the first probability value set corresponding to a keyword comprises a first probability value that the keyword belongs to the beginning label under each named entity category, a first probability value that the keyword belongs to the process label under each named entity category, a first probability value that the i keywords located before the keyword belong to the beginning label under each named entity category, a first probability value that the i keywords located before the keyword belong to the process label under each named entity category, a first probability value that the j keywords located after the keyword belong to the beginning label under each named entity category, and a first probability value that the j keywords located after the keyword belong to the process label under each named entity category; the value of i is 1, 2, …, N-1 in sequence, where N is the position of the keyword in the text, the value of j is 1, 2, …, M in sequence, and the maximum value of M is the difference between the total number of words in the text and the position; the first probability value that the keyword belongs to the beginning label is the probability that the keyword is the first keyword of a named entity, and the first probability value that the keyword belongs to the process label is the probability that the keyword is a non-first keyword of a named entity;
inputting the first probability value sets respectively corresponding to the plurality of keywords into a pre-constructed conditional random field model, dividing the text into a plurality of named entities through the conditional random field model, and obtaining, through the conditional random field model, the named entity categories to which the plurality of named entities respectively belong, wherein each named entity comprises one or more keywords belonging to the same named entity category, and the conditional random field model is obtained by inputting, into the conditional random field model, a sample text and the first probability value sets respectively corresponding to the plurality of keywords contained in the sample text, and training with the labeled named entities in the sample text and the named entity categories to which those named entities belong as targets.
2. The named entity recognition method of claim 1, further comprising:
acquiring second probability value sets respectively corresponding to the plurality of named entities, wherein the second probability value set corresponding to a named entity comprises a second probability value that the named entity belongs to a nested named entity under each named entity category, a second probability value that the q named entities located before the named entity belong to a nested named entity under each named entity category, and a second probability value that the k named entities located after the named entity belong to a nested named entity under each named entity category; the value of q is 1, 2, …, X-1 in sequence, where X is the position of the named entity among the plurality of named entities, the value of k is 1, 2, …, Y in sequence, and the maximum value of Y is the difference between the total number of the named entities and the position of the named entity;
inputting a plurality of named entities, a plurality of second probability value sets respectively corresponding to the named entities and the text into a pre-constructed feature model, and obtaining nested named entities and named entity categories to which the nested named entities belong through the feature model; the nested named entity comprises a plurality of the named entities; the feature model is obtained by inputting a plurality of named entities contained in a sample text, a second probability value set corresponding to each named entity and the sample text into the feature model, and training with the labeled nested named entities contained in the sample text and the named entity classes to which the nested named entities belong as targets.
3. The named entity recognition method of claim 2, wherein the step of inputting the plurality of named entities, the second set of probability values corresponding to the plurality of named entities, and the text into a pre-constructed feature model, and obtaining nested named entities and named entity categories to which the nested named entities belong via the feature model comprises:
segmenting the parts of the text other than the plurality of named entities to obtain keywords;
inputting a plurality of named entities, the keyword, a second probability value set corresponding to each named entity and the text into a pre-constructed feature model, and obtaining a nested named entity and a named entity category to which the nested named entity belongs through the feature model, wherein the nested named entity comprises at least one named entity and the keyword.
4. The named entity recognition method of any one of claims 1-3, further comprising:
by a first formula: P = a/b, calculating the accuracy P of the division of named entities by the conditional random field model, wherein a is the number of the named entities correctly divided by the conditional random field model, and b is the total number of the named entities divided by the conditional random field model;
by a second formula: R = a/c, calculating the recall rate R of the named entities divided by the conditional random field model, wherein c is the total number of the named entities contained in the text;
and if F = P × R × 2/(P + R) is greater than or equal to a preset threshold, stopping training the conditional random field model.
5. A named entity recognition apparatus, comprising:
a first acquisition unit, configured to acquire a text to be recognized, wherein the text consists of a plurality of keywords;
a second obtaining unit, configured to obtain first probability value sets respectively corresponding to the plurality of keywords, wherein the first probability value set corresponding to a keyword includes a first probability value that the keyword belongs to the beginning label under each named entity category, a first probability value that the keyword belongs to the process label under each named entity category, a first probability value that the i keywords located before the keyword belong to the beginning label under each named entity category, a first probability value that the j keywords located after the keyword belong to the process label under each named entity category, and a first probability value that the j keywords located after the keyword belong to the beginning label under each named entity category; the value of i is 1, 2, …, N-1 in sequence, where N is the position of the keyword in the text, the value of j is 1, 2, …, M in sequence, and the maximum value of M is the difference between the total number of words in the text and the position; the first probability value that the keyword belongs to the beginning label is the probability that the keyword is the first keyword of a named entity, and the first probability value that the keyword belongs to the process label is the probability that the keyword is a non-first keyword of a named entity;
a third obtaining unit, configured to input the first probability value sets respectively corresponding to the plurality of keywords into a pre-constructed conditional random field model, divide the text into a plurality of named entities through the conditional random field model, and obtain, through the conditional random field model, the named entity categories to which the plurality of named entities respectively belong, wherein each named entity includes one or more keywords belonging to the same named entity category, and the conditional random field model is obtained by inputting, into the conditional random field model, a sample text and the first probability value sets respectively corresponding to the plurality of keywords contained in the sample text, and training with the labeled named entities in the sample text and the named entity categories to which those named entities belong as targets.
6. The named entity recognition device of claim 5, further comprising:
a first obtaining module, configured to obtain second probability value sets respectively corresponding to the plurality of named entities, wherein the second probability value set corresponding to a named entity comprises a second probability value that the named entity belongs to a nested named entity under each named entity category, a second probability value that the q named entities located before the named entity belong to a nested named entity under each named entity category, and a second probability value that the k named entities located after the named entity belong to a nested named entity under each named entity category; the value of q is 1, 2, …, X-1 in sequence, where X is the position of the named entity among the plurality of named entities, the value of k is 1, 2, …, Y in sequence, and the maximum value of Y is the difference between the total number of the named entities and the position of the named entity;
the second acquisition module is used for inputting the second probability value sets respectively corresponding to the named entities and the text into a pre-constructed feature model, and acquiring nested named entities and named entity categories to which the nested named entities belong through the feature model; the nested named entity comprises a plurality of the named entities; the feature model is obtained by inputting a plurality of named entities contained in a sample text, a second probability value set corresponding to each named entity and the sample text into the feature model, and training with the labeled nested named entities contained in the sample text and the named entity classes to which the nested named entities belong as targets.
7. The named entity recognition device of claim 6, wherein the second obtaining module comprises:
a word segmentation submodule, configured to segment the parts of the text other than the plurality of named entities to obtain keywords;
the acquisition submodule is used for inputting the plurality of named entities, the keyword, the second probability value set corresponding to the named entities and the text into a pre-constructed feature model, and acquiring a nested named entity and a named entity category to which the nested named entity belongs through the feature model, wherein the nested named entity comprises at least one named entity and the keyword.
8. The named entity recognition device of any one of claims 5-7, further comprising:
a first calculation submodule, configured to calculate, by a first formula: P = a/b, the accuracy P of the division of named entities by the conditional random field model, wherein a is the number of the named entities correctly divided by the conditional random field model, and b is the total number of the named entities divided by the conditional random field model;
a second calculation submodule, configured to calculate, by a second formula: R = a/c, the recall rate R of the named entities divided by the conditional random field model, wherein c is the total number of the named entities contained in the text;
a processing submodule, configured to stop training the conditional random field model if F = P × R × 2/(P + R) is greater than or equal to a preset threshold.
9. A server, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the named entity recognition method according to any of claims 1 to 4.
10. A computer readable storage medium, instructions in which, when executed by a processor of a server, enable the server to perform the named entity recognition method of any of claims 1 to 4.
CN202211316156.1A 2022-10-26 2022-10-26 Named entity recognition method, device, server and storage medium Pending CN115952798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211316156.1A CN115952798A (en) 2022-10-26 2022-10-26 Named entity recognition method, device, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211316156.1A CN115952798A (en) 2022-10-26 2022-10-26 Named entity recognition method, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN115952798A true CN115952798A (en) 2023-04-11

Family

ID=87286456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211316156.1A Pending CN115952798A (en) 2022-10-26 2022-10-26 Named entity recognition method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN115952798A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination