CN113642331B - Financial named entity identification method and system, storage medium and terminal - Google Patents
- Publication number
- CN113642331B CN202110913735.3A
- Authority
- CN
- China
- Prior art keywords
- entity
- word
- financial
- words
- expanded
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Business, Economics & Management (AREA)
- General Engineering & Computer Science (AREA)
- Development Economics (AREA)
- Strategic Management (AREA)
- Accounting & Taxation (AREA)
- Data Mining & Analysis (AREA)
- Economics (AREA)
- Finance (AREA)
- Marketing (AREA)
- Databases & Information Systems (AREA)
- Technology Law (AREA)
- General Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a financial named entity identification method and system, a storage medium and a terminal, comprising the following steps: expanding the entity words in the financial named entity database to generate an expanded entity word database; constructing an entity word candidate model of the financial named entity; screening candidate entity words from the text to be recognized based on the entity word candidate model; verifying the candidate entity words based on the expanded entity word database; and carrying out disambiguation processing on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified. The financial named entity identification method and system, the storage medium and the terminal effectively improve the coverage rate of the financial named entity and realize the quick and efficient identification of the financial named entity.
Description
Technical Field
The invention relates to the technical field of named entity identification, in particular to a financial named entity identification method and system, a storage medium and a terminal.
Background
Named entities are names of people, organizations, places, and all other entities identified by names. Named entity recognition refers to recognition of entities with specific meanings in text, and is a basic key task in natural language processing.
Financial named entity identification is the identification of named entities of particular significance within the financial domain. Financial named entities include, among others, stocks, funds, bonds, companies, and organizations. Financial named entity recognition plays an important role in classifying financial information and extracting keywords from it, and is also the foundation for event extraction and relationship extraction in the analysis of finance-related text.
In the prior art, named entity identification mainly adopts the following four methods:
(1) dictionary-based methods;
(2) rule-based methods;
(3) methods based on probabilistic models;
(4) methods based on deep learning.
However, because financial named entities are very numerous, the existing identification methods suffer from low coverage and low speed.
Disclosure of Invention
In view of the above disadvantages of the prior art, an object of the present invention is to provide a method and a system for identifying a financial named entity, a storage medium, and a terminal, which effectively improve the coverage of the financial named entity and achieve fast and efficient identification of the financial named entity.
To achieve the above and other related objects, the present invention provides a financial named entity recognition method, comprising the steps of: expanding the entity words in the financial named entity database to generate an expanded entity word database; constructing an entity word candidate model of the financial named entity; screening out candidate entity words from the text to be recognized based on the entity word candidate model; verifying the candidate entity words based on the expanded entity word database; and carrying out disambiguation processing on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified.
In an embodiment of the present invention, expanding the entity words in the financial named entity database to generate the expanded entity word database includes the following steps:
sequentially acquiring entity words to be expanded according to the entity word type priority;
for an entity word to be expanded that contains a company suffix, judging whether the entity word with the company suffix removed is already contained in the expanded entity word database; if not, adding the entity word with the company suffix removed to the expanded entity word database;
for an entity word to be expanded that contains a place name prefix, judging whether the entity word with the place name prefix removed is already contained in the expanded entity word database; if not, adding the entity word with the place name prefix removed to the expanded entity word database;
for an entity word to be expanded that contains both a place name prefix and a company suffix, judging whether the entity word with the place name prefix and the company suffix removed is already contained in the expanded entity word database; and if not, adding the entity word with the place name prefix and the company suffix removed to the expanded entity word database.
In an embodiment of the present invention, the entity word type priority is, from high to low: listed companies, unlisted companies that issue financial products, and unlisted companies that do not issue financial products; unlisted companies that do not issue financial products are prioritized by registered capital.
In an embodiment of the present invention, the constructing of the entity word candidate model of the financial named entity includes the following steps:
setting the first two characters and the last two characters of the entity words, and determining the maximum length of the entity words comprising the first two characters and the last two characters;
mapping the entity words into 128-bit data based on an MD5 algorithm;
equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence;
for each piece of 32-bit data, mapping the first 27 bits to a subscript of an integer array having 2^27 elements, each with an initial value of 0, mapping the last 5 bits to the last 5 mapping bits of the integer element at that subscript, and setting the last 5 mapping bits to 1; the last 5 mapping bits denote the bit of the integer element, counted from back to front, whose position is the value 0-31 obtained by converting the last 5 bits of data.
In an embodiment of the present invention, the step of screening candidate entity words from the text to be recognized based on the entity word candidate model includes the following steps:
traversing the text to be recognized by taking the two characters as a window, and screening suspected entity words based on the first two characters, the last two characters and the maximum length of the entity words;
for each suspected entity word, mapping to 128-bit data based on an MD5 algorithm; equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence; for each 32-bit data, the first 27-bit data is mapped to be a subscript of an integer array, and the last 5-bit data is mapped to be a last 5 mapping bits corresponding to an integer element corresponding to the subscript;
and searching bits corresponding to the last 5 mapping bits of the four integer elements of the suspected entity word in the entity word candidate model, and judging the suspected entity word as a candidate entity word only when the four bits are all 1.
In an embodiment of the present invention, verifying the candidate entity words based on the expanded entity word database includes the following steps:
enabling entity words corresponding to the same financial named entity in the expanded entity word database to have the same unique identification information;
searching the unique identification information and the financial named entity full name corresponding to the candidate entity words in the expanded entity word database; and if the search is successful, the candidate entity word passes the verification.
In an embodiment of the present invention, disambiguating the verified candidate entity word to obtain the recognition result of the financial named entity in the text to be recognized includes the following steps:
judging whether the candidate entity words are labeled ambiguous entity words or not;
if the candidate entity word is a labeled ambiguous entity word, obtaining the sentence in which the candidate entity word is located, together with the preceding and following sentences, as the corpus $s^*$; performing word segmentation on the corpus to obtain the words $w_1^*, w_2^*, \ldots, w_n^*$; respectively calculating the probability $P(c_0 \mid s^*)$ that the candidate entity word is ambiguous and the probability $P(c_1 \mid s^*)$ that it is not ambiguous; wherein $P(c_0 \mid s^*) = P(c_0 \mid w_1^*) P(c_0 \mid w_2^*) \cdots P(c_0 \mid w_n^*)$; $P(c_1 \mid s^*) = P(c_1 \mid w_1^*) P(c_1 \mid w_2^*) \cdots P(c_1 \mid w_n^*)$; $P(c_0 \mid w_a^*)$ and $P(c_1 \mid w_a^*)$ are respectively the probability of ambiguity and the probability of no ambiguity for the word $w_a^*$, $a = 1, 2, \ldots, n$;
when $P(c_0 \mid s^*) > P(c_1 \mid s^*)$, judging the candidate entity word to be an ambiguous word; otherwise, judging the candidate entity word to be a recognition result.
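The comparison above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `is_ambiguous`, the per-word probability tables `p_c0` and `p_c1`, and the `default` fallback for unseen words are all hypothetical, and every probability is assumed to be strictly positive. Logarithms are summed instead of multiplying raw probabilities, which preserves the comparison while avoiding floating-point underflow on long contexts.

```python
import math

def is_ambiguous(words, p_c0, p_c1, default=0.5):
    """Return True when P(c0|s*) > P(c1|s*) for the segmented context words.

    words   : segmented words w1*..wn* of the corpus s*
    p_c0    : dict mapping a word to P(c0|w), the probability of ambiguity
    p_c1    : dict mapping a word to P(c1|w), the probability of no ambiguity
    default : hypothetical fallback for words missing from the tables
              (an assumption; the source does not say how these are handled)
    """
    # The product of probabilities becomes a sum of logarithms.
    log_c0 = sum(math.log(p_c0.get(w, default)) for w in words)
    log_c1 = sum(math.log(p_c1.get(w, default)) for w in words)
    return log_c0 > log_c1
```

A candidate for which `is_ambiguous` returns `False` would be kept as a recognition result.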
In an embodiment of the invention, the expanded entity word database is constructed based on Elasticsearch.
In an embodiment of the present invention, the entity word candidate model is stored based on HBase and Redis.
The invention provides a financial named entity recognition system, which comprises an expansion module, a construction module, a screening module, a verification module and a disambiguation module;
the expansion module is used for expanding the entity words in the financial named entity database to generate an expanded entity word database;
the construction module is used for constructing an entity word candidate model of the financial named entity;
the screening module is used for screening out candidate entity words from the text to be recognized based on the entity word candidate model;
the verification module is used for verifying the candidate entity words based on the expanded entity word database;
and the disambiguation module is used for carrying out disambiguation on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified.
In an embodiment of the present invention, the expansion module expanding the entity words in the financial named entity database to generate the expanded entity word database includes the following steps:
sequentially acquiring entity words to be expanded according to the entity word type priority;
for an entity word to be expanded that contains a company suffix, judging whether the entity word with the company suffix removed is already contained in the expanded entity word database; if not, adding the entity word with the company suffix removed to the expanded entity word database;
for an entity word to be expanded that contains a place name prefix, judging whether the entity word with the place name prefix removed is already contained in the expanded entity word database; if not, adding the entity word with the place name prefix removed to the expanded entity word database;
for an entity word to be expanded that contains both a place name prefix and a company suffix, judging whether the entity word with the place name prefix and the company suffix removed is already contained in the expanded entity word database; and if not, adding the entity word with the place name prefix and the company suffix removed to the expanded entity word database.
In an embodiment of the present invention, the entity word type priority is, from high to low: listed companies, unlisted companies that issue financial products, and unlisted companies that do not issue financial products; unlisted companies that do not issue financial products are prioritized by registered capital.
In an embodiment of the invention, the construction module constructing the entity word candidate model of the financial named entity includes the following steps:
setting the first two characters and the last two characters of the entity words, and determining the maximum length of the entity words comprising the first two characters and the last two characters;
mapping the entity words into 128-bit data based on an MD5 algorithm;
equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence;
for each piece of 32-bit data, mapping the first 27 bits to a subscript of an integer array having 2^27 elements, each with an initial value of 0, mapping the last 5 bits to the last 5 mapping bits of the integer element at that subscript, and setting the last 5 mapping bits to 1; the last 5 mapping bits denote the bit of the integer element, counted from back to front, whose position is the value 0-31 obtained by converting the last 5 bits of data.
In an embodiment of the present invention, the screening module screening candidate entity words from the text to be recognized based on the entity word candidate model includes the following steps:
traversing the text to be recognized by taking the two characters as a window, and screening suspected entity words based on the first two characters, the last two characters and the maximum length of the entity words;
for each suspected entity word, mapping to 128-bit data based on an MD5 algorithm; equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence; for each 32-bit data, mapping the first 27-bit data to the subscript of the integer array, and mapping the last 5-bit data to the last 5 mapping bits corresponding to the integer elements corresponding to the subscript;
and searching bits corresponding to the last 5 mapping bits of the four integer elements of the suspected entity word in the entity word candidate model, and judging the suspected entity word as a candidate entity word only when the four bits are all 1.
In an embodiment of the present invention, the verification module verifying the candidate entity words based on the expanded entity word database includes the following steps:
enabling entity words corresponding to the same financial named entity in the expanded entity word database to have the same unique identification information;
searching the unique identification information and the financial named entity full name corresponding to the candidate entity words in the expanded entity word database; and if the search is successful, the candidate entity word passes the verification.
In an embodiment of the present invention, the disambiguation module disambiguating the verified candidate entity words to obtain the recognition result of the financial named entity in the text to be recognized includes the following steps:
judging whether the candidate entity words are labeled ambiguous entity words or not;
if the candidate entity word is a labeled ambiguous entity word, obtaining the sentence in which the candidate entity word is located, together with the preceding and following sentences, as the corpus $s^*$; performing word segmentation on the corpus to obtain the words $w_1^*, w_2^*, \ldots, w_n^*$; respectively calculating the probability $P(c_0 \mid s^*)$ that the candidate entity word is ambiguous and the probability $P(c_1 \mid s^*)$ that it is not ambiguous; wherein $P(c_0 \mid s^*) = P(c_0 \mid w_1^*) P(c_0 \mid w_2^*) \cdots P(c_0 \mid w_n^*)$; $P(c_1 \mid s^*) = P(c_1 \mid w_1^*) P(c_1 \mid w_2^*) \cdots P(c_1 \mid w_n^*)$; $P(c_0 \mid w_a^*)$ and $P(c_1 \mid w_a^*)$ are respectively the probability of ambiguity and the probability of no ambiguity for the word $w_a^*$, $a = 1, 2, \ldots, n$;
when $P(c_0 \mid s^*) > P(c_1 \mid s^*)$, judging the candidate entity word to be an ambiguous word; otherwise, judging the candidate entity word to be a recognition result.
In an embodiment of the present invention, the expansion module constructs the expanded entity word database based on Elasticsearch.
In an embodiment of the present invention, the construction module stores the entity word candidate model based on HBase and Redis.
The present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described financial named entity recognition method.
The invention provides a financial named entity recognition terminal, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory to cause the financial named entity identification terminal to perform the above-mentioned financial named entity identification method.
As described above, the financial named entity identification method and system, the storage medium and the terminal of the present invention have the following advantages:
(1) by expanding the financial named entity words, higher coverage can be obtained;
(2) the candidate word model supports rapid candidate word matching against a massive number of entity words, so that financial named entity words can be screened out quickly and the recognition speed is increased;
(3) the problem of ambiguous entity words is solved through a disambiguation algorithm.
Drawings
FIG. 1 is a flow chart illustrating a method for identifying financial named entities according to an embodiment of the invention;
FIG. 2 is a flow chart illustrating the expansion of entity words in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart illustrating the construction of an entity word candidate model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an embodiment of a financial named entity identification system according to the present invention;
FIG. 5 is a block diagram of a financial named entity identification terminal according to an embodiment of the invention.
Description of the element reference numerals
41 expansion module
42 construction module
43 screening module
44 verification module
45 disambiguation module
51 processing unit
52 memory
521 random access memory
522 cache memory
523 storage system
524 program/utility
5241 program module
53 bus
54 I/O interface
55 network adapter
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
It should be noted that the drawings provided in the present embodiment are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Through entity word expansion, candidate word screening, ambiguous word elimination and similar means, the financial named entity recognition method and system, the storage medium and the terminal of the invention effectively improve the coverage of financial named entities and recognize them quickly and efficiently. They are suitable for recognizing financial named entities in massive data and are highly practical.
As shown in fig. 1, in an embodiment, the method for identifying a financial named entity of the present invention includes the following steps:
Step S1: expanding the entity words in the financial named entity database to generate an expanded entity word database.
Specifically, the financial named entity database includes the unique identification information, full entity names and short entity names of the financial named entities. Preferably, the unique identification information is a unique code. Using the financial named entity database directly to identify financial named entities gives low coverage, so the entity words in the database need to be expanded further.
In the invention, entity word expansion mainly processes company-type entity words, in sequence according to the entity word type priority. During expansion, company suffix expansion, place name prefix expansion, and combined place name prefix and company suffix expansion are performed on each entity word in turn. For example, "Zhuhai Gree Electric Appliances Co., Ltd." can be expanded to obtain the following expanded entity words: "Zhuhai Gree Electric Appliances", "Gree Electric Appliances Co., Ltd." and "Gree Electric Appliances". After the expansion is finished, the expanded entity word database includes full entity names, short entity names and expanded entity names.
In an embodiment of the present invention, expanding the entity words in the financial named entity database to generate the expanded entity word database includes the following steps:
11) Sequentially acquiring entity words to be expanded according to the entity word type priority.
Listed companies have higher priority than unlisted companies; among unlisted companies, those that issue financial products have higher priority than those that do not; and unlisted companies that do not issue financial products are prioritized according to registered capital.
12) For an entity word to be expanded that contains a company suffix, judging whether the entity word with the company suffix removed is already contained in the expanded entity word database; and if not, adding the entity word with the company suffix removed to the expanded entity word database.
13) For an entity word to be expanded that contains a place name prefix, judging whether the entity word with the place name prefix removed is already contained in the expanded entity word database; and if not, adding the entity word with the place name prefix removed to the expanded entity word database.
14) For an entity word to be expanded that contains both a place name prefix and a company suffix, judging whether the entity word with the place name prefix and the company suffix removed is already contained in the expanded entity word database; and if not, adding the entity word with the place name prefix and the company suffix removed to the expanded entity word database.
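Steps 11) to 14) can be sketched as below. This is only an illustration under stated assumptions: the suffix and prefix lists, the helper names `strip_suffix`, `strip_prefix` and `expand`, and the use of an in-memory dict in place of the expanded entity word database are all hypothetical, and the input list is assumed to be sorted by entity word type priority already. Because a variant is added only when it is absent, a higher-priority entity keeps any shortened name that a lower-priority entity would also produce.

```python
# Hypothetical suffix and prefix lists; the source does not enumerate them.
COMPANY_SUFFIXES = ["Co., Ltd.", "Inc."]
PLACE_PREFIXES = ["Zhuhai", "Shanghai"]

def strip_suffix(name):
    """Return name without its company suffix, or None if it has none."""
    for suffix in COMPANY_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)].strip()
    return None

def strip_prefix(name):
    """Return name without its place name prefix, or None if it has none."""
    for prefix in PLACE_PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):].strip()
    return None

def expand(entities):
    """Build a mapping from entity word to entity id.

    entities: list of (entity_id, full_name), sorted by type priority.
    """
    db = {}
    for entity_id, name in entities:
        db.setdefault(name, entity_id)
        no_suffix = strip_suffix(name)                       # step 12
        no_prefix = strip_prefix(name)                       # step 13
        both = strip_prefix(no_suffix) if no_suffix else None  # step 14
        for variant in (no_suffix, no_prefix, both):
            # Add a variant only if it is not already in the database.
            if variant and variant not in db:
                db[variant] = entity_id
    return db
```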
Preferably, the expanded entity word database is constructed based on Elasticsearch. Elasticsearch is a search server based on Lucene that provides a distributed, multi-user full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache license terms. It is used in cloud computing, supports real-time search, and is stable, reliable, fast, and easy to install and use. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages.
Step S2: constructing an entity word candidate model of the financial named entity.
Specifically, in order to quickly locate the entity words in the text to be recognized, the entity word candidate model is built to support the subsequent recognition of candidate entity words. Preferably, the entity word candidate model is stored in HBase and Redis and loaded into memory when it is used. HBase is a distributed, column-oriented, open-source database. HBase provides Bigtable-like capabilities on top of Hadoop and, unlike a typical relational database, is suited to storing unstructured data. Redis (Remote Dictionary Server) is an open-source, log-type key-value database written in ANSI C that supports networking, can be memory-based or persistent, and provides APIs for multiple languages.
In an embodiment of the present invention, as shown in FIG. 3, constructing the entity word candidate model of the financial named entity includes the following steps:
21) setting the first two characters and the last two characters of the entity word, and determining the maximum length of the entity word containing the first two characters and the last two characters.
Specifically, the entity words in the expanded entity word database are subjected to statistical analysis to obtain first two characters and last two characters, and the maximum length of the entity words including the first two characters and the last two characters is determined, so that subsequent entity word positioning is facilitated.
It should be noted that the number of first and last characters used affects both the hardware overhead and the positioning effect. Generally, the more first and last characters are used, the larger the hardware overhead and the better the positioning effect. To balance the two, the invention defines an overhead-confusion function. For each number of first characters $w$ and each number of last characters $w'$, the numbers of distinct first-character strings and last-character strings, $len\_s_w$ and $len\_e_{w'}$, are counted, where $len\_s_w$ denotes the number of first-character strings with $w$ characters and $len\_e_{w'}$ denotes the number of last-character strings with $w'$ characters. The number of entity words corresponding to each first-character string and each last-character string is then counted, giving the averages $ave\_s_w$ and $ave\_e_{w'}$, where $ave\_s_w$ denotes the mean number of entity words per first-character string with $w$ characters and $ave\_e_{w'}$ denotes the mean number of entity words per last-character string with $w'$ characters. The overhead-confusion function is $Efv = \log_2(len\_s_w \cdot len\_e_{w'}) + \log_{10}(ave\_s_w \cdot ave\_e_{w'})^2$, in which the first term is the overhead and the second term is the confusion. The combination of $w$ and $w'$ that gives a small value of the overhead-confusion function is taken as the preferred parameters, which are computed according to the actual situation. By calculation, $w = 2$ first characters and $w' = 2$ last characters are the preferred parameters.
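The overhead-confusion calculation might be sketched as follows. Two points are assumptions rather than facts from the source: the trailing exponent 2 in the confusion term is read here as squaring the product inside the logarithm, and entity words shorter than the requested number of characters are skipped. The function name `overhead_confusion` is hypothetical, and the word list must contain at least one sufficiently long word.

```python
import math
from collections import defaultdict

def overhead_confusion(entity_words, w, w_prime):
    """Evaluate the overhead-confusion function for w first and w_prime last characters."""
    heads = defaultdict(int)  # first-character string -> number of entity words
    tails = defaultdict(int)  # last-character string  -> number of entity words
    for word in entity_words:
        if len(word) < max(w, w_prime):
            continue  # assumption: too-short words are ignored
        heads[word[:w]] += 1
        tails[word[-w_prime:]] += 1
    len_s, len_e = len(heads), len(tails)
    ave_s = sum(heads.values()) / len_s
    ave_e = sum(tails.values()) / len_e
    overhead = math.log2(len_s * len_e)
    # Assumption: the exponent applies to the product inside the logarithm.
    confusion = math.log10((ave_s * ave_e) ** 2)
    return overhead + confusion
```

Evaluating the function over a small grid of `(w, w_prime)` pairs and keeping a pair with a small value mirrors the parameter selection described above.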
22) And mapping the entity words into 128-bit data based on an MD5 algorithm.
Specifically, when a hash algorithm maps an entity word to fixed-length data, collisions may occur because an infinite space is mapped onto a finite one; that is, different contents may correspond to the same data in the finite space. Generally, the longer the fixed-length output, the smaller the collision probability. Weighing storage space against collision probability, the invention adopts the MD5 (Message-Digest Algorithm) hash algorithm. MD5 is a widely used cryptographic hash function that generates a 128-bit (16-byte) hash value to ensure the integrity of information transmission. Therefore, each entity word in the expanded entity word database is mapped into one piece of 128-bit data based on the MD5 algorithm, i.e. MD5 VALUE 128BIT BYTE[16].
23) The 128-bit data is equally divided into 4 pieces of 32-bit data in order.
Specifically, for fast search, each piece of 128-bit data would need to be mapped to its own bit, which would require 2^128 bits of storage space. In order to compress the storage space while limiting the rise in collision probability that compression causes, the invention divides the 128-bit data into 4 parts, each stored by an integer array of length 2^27, thereby obtaining MD5 VALUE 128BIT [0-3]32BIT, MD5 VALUE 128BIT [4-7]32BIT, MD5 VALUE 128BIT [8-11]32BIT and MD5 VALUE 128BIT [12-15]32BIT. Preferably, the four integer arrays are denoted hash[i] (i = 0, 1, 2, 3).
24) For each piece of 32-bit data, mapping the first 27 bits to a subscript of an integer array having 2^27 elements, each with an initial value of 0, mapping the last 5 bits to the last 5 mapping bits of the integer element at that subscript, and setting the last 5 mapping bits to 1; the last 5 mapping bits denote the bit of the integer element, counted from back to front, whose position is the value 0-31 obtained by converting the last 5 bits of data.
Specifically, for each piece of 32-bit data, the first 27 bits (MD5 VALUE 27BIT) can be represented by an integer array with 2^27 elements whose initial values are all 0, the first 27 bits corresponding to a subscript of the integer array. The value of the last 5 bits (MD5 VALUE 5BIT) is an integer between 0 and 31 and can be represented by one integer element (32 bits): the bit of the integer element corresponding to the value of the last 5 bits is taken as the last 5 mapping bits and is set to 1. For example, when the last 5 bits are 00001, the value is 1 and the 1st bit of the integer element is set to 1; when the last 5 bits are 00110, the value is 6 and the 6th bit of the integer element is set to 1.
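Steps 22) to 24) can be sketched as below using Python's standard `hashlib`. Several details are assumptions that the source does not fix: entity words are encoded as UTF-8, each 32-bit piece is read big-endian, the bit selected by the last 5 bits is numbered from 0 at the least significant end, and a sparse dict stands in for each integer array of 2^27 zero-initialised elements so that the sketch stays small in memory. The names `md5_slots` and `build_model` are hypothetical.

```python
import hashlib
from collections import defaultdict

def md5_slots(word):
    """Return four (subscript, bit) pairs derived from the MD5 digest of word."""
    digest = hashlib.md5(word.encode("utf-8")).digest()  # 16 bytes = 128 bits
    slots = []
    for i in range(4):
        # One 32-bit piece of the digest (big-endian is an assumption).
        chunk = int.from_bytes(digest[4 * i: 4 * i + 4], "big")
        subscript = chunk >> 5  # first 27 bits: index in [0, 2**27)
        bit = chunk & 0x1F      # last 5 bits: value 0..31 selecting one bit
        slots.append((subscript, bit))
    return slots

def build_model(entity_words):
    """Set one bit per 32-bit piece for every entity word."""
    # Sparse stand-in for the four arrays hash[0]..hash[3].
    model = [defaultdict(int) for _ in range(4)]
    for word in entity_words:
        for i, (subscript, bit) in enumerate(md5_slots(word)):
            model[i][subscript] |= 1 << bit
    return model
```

The structure behaves like a Bloom filter with four hash functions taken from one MD5 digest: membership tests can give false positives but not false negatives, which is why a separate verification step against the expanded entity word database follows.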
Step S3: screening candidate entity words from the text to be recognized based on the entity word candidate model.
Specifically, for the text to be recognized, the entity word candidate model is adopted to screen out candidate entity words.
In an embodiment of the present invention, the step of screening candidate entity words from the text to be recognized based on the entity word candidate model includes the following steps:
31) Traversing the text to be recognized with a two-character window, and screening suspected entity words based on the first two characters, the last two characters and the maximum length of the entity words.
Specifically, using the first two characters, the last two characters and the maximum entity word length set by the entity word candidate model, the text to be recognized is traversed with a two-character window to obtain suspected entity words whose length does not exceed the maximum entity word length and whose first two and last two characters match.
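The window traversal described above can be sketched as follows. This is a hedged illustration only: the function name, the use of plain sets for the known first and last two characters, and the single global maximum length are assumptions, since the text does not say whether the maximum length is kept globally or per head/tail pair.

```python
from typing import List, Set


def find_suspected_words(text: str, heads: Set[str], tails: Set[str],
                         max_len: int) -> List[str]:
    """Return substrings that start with a known two-character head, end
    with a known two-character tail, and are at most max_len characters."""
    found = []
    for start in range(len(text) - 1):
        if text[start:start + 2] not in heads:
            continue
        # Try every end position allowed by the maximum entity word length.
        last_end = min(len(text), start + max_len)
        for end in range(start + 2, last_end + 1):
            if text[end - 2:end] in tails:
                found.append(text[start:end])
    return found


# Hypothetical example: one known head and one known tail.
suspected = find_suspected_words("今日格力电器股价上涨", {"格力"}, {"电器"}, 10)
```

Each suspected word found this way is only a positional match; it still has to pass the bitmap lookup of steps 32) and 33).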
32) For each suspected entity word, mapping the suspected entity word into 128 bits of data based on an MD5 algorithm; equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence; for each 32-bit data, the first 27-bit data is mapped to the subscript of the integer array, and the last 5-bit data is mapped to the last 5 mapping bits corresponding to the integer element corresponding to the subscript.
Specifically, similar to the processing manner of the entity word candidate model, the suspected entity words are mapped to four integer arrays of the entity word candidate model.
33) And searching bits corresponding to the last 5 mapping bits of the four integer elements of the suspected entity word in the entity word candidate model, and judging the suspected entity word as a candidate entity word only when the four bits are all 1.
Specifically, for the last 5 mapping bits of the four integer elements of the suspected entity word, sequentially searching the bit corresponding to the last 5 mapping bits of the four integer elements in the integer array of the entity word candidate model, and if the bit corresponding to the four last 5 mapping bits is 1, indicating that the suspected entity word is a candidate entity word; and if the bit corresponding to one or more final 5 mapping bits is not 1, judging that the suspected entity word is not a candidate entity word.
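Building the model and testing a suspected entity word against it can be sketched together as below. This is a minimal sketch under stated assumptions, not the patented implementation: the class and method names are hypothetical, sparse dictionaries stand in for the four 2^27-element integer arrays so the example stays small in memory, the digest bytes are read in big-endian order, and bits are counted from the least significant bit.

```python
import hashlib
import struct
from collections import defaultdict
from typing import List, Tuple


class EntityWordCandidateModel:
    """Four bitmaps, one per 32-bit quarter of the MD5 digest."""

    def __init__(self) -> None:
        # Assumption: sparse dicts replace the four dense integer arrays.
        self.hash = [defaultdict(int) for _ in range(4)]

    @staticmethod
    def _positions(word: str) -> List[Tuple[int, int]]:
        digest = hashlib.md5(word.encode("utf-8")).digest()  # 16 bytes
        chunks = struct.unpack(">4I", digest)                # 4 x 32 bits
        # First 27 bits -> subscript, last 5 bits -> bit position 0..31.
        return [(chunk >> 5, chunk & 0x1F) for chunk in chunks]

    def add(self, word: str) -> None:
        """Set the four mapping bits of an entity word to 1."""
        for i, (index, bit) in enumerate(self._positions(word)):
            self.hash[i][index] |= 1 << bit

    def is_candidate(self, word: str) -> bool:
        """A word is a candidate only if all four mapping bits are 1."""
        return all(
            (self.hash[i].get(index, 0) >> bit) & 1
            for i, (index, bit) in enumerate(self._positions(word))
        )


model = EntityWordCandidateModel()
model.add("格力电器")
```

Because different words can share mapping bits, a positive answer from `is_candidate` is only a candidate match, which is why the subsequent verification step against the expanded entity word database is still needed.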
Step S4, verifying the candidate entity words based on the expanded entity word database.
Specifically, for the obtained candidate entity words, verification is performed in the expanded entity word database, and whether the candidate entity words are recorded in the expanded entity word database is verified. If yes, the verification is passed; if not, the verification fails.
In an embodiment of the present invention, verifying the candidate entity word based on the extended entity word database includes the following steps:
41) Enabling the entity words corresponding to the same financial named entity in the expanded entity word database to have the same unique identification information.
When the expanded entity word database is constructed based on Elasticsearch, Elasticsearch builds an entity word index containing the unique identification information to facilitate subsequent entity word queries. The entity words corresponding to the same financial named entity in the expanded entity word database (such as the entity full name, entity short name and entity alternative names) have the same unique identification information.
42) Searching the expanded entity word database for the unique identification information and the financial named entity full name corresponding to the candidate entity word; if the search succeeds, the candidate entity word passes the verification; if the search fails, the candidate entity word fails the verification.
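The verification lookup can be illustrated with an in-memory dictionary standing in for the entity word index. This is only a sketch: the record type, the function name, the identifier "E0001" and the entity names are hypothetical, and a real deployment as described in the text would query the search index instead of a Python dict.

```python
from typing import Dict, NamedTuple, Optional


class EntityRecord(NamedTuple):
    entity_id: str   # unique identification information
    full_name: str   # full name of the financial named entity


def verify_candidate(word: str,
                     index: Dict[str, EntityRecord]) -> Optional[EntityRecord]:
    """Return the matching record if the candidate passes verification,
    otherwise None."""
    return index.get(word)


# Hypothetical index: every surface form of one entity shares the same id.
record = EntityRecord("E0001", "Example Holdings Co., Ltd.")
entity_index = {
    "Example Holdings Co., Ltd.": record,
    "Example Holdings": record,
}
```

A short name and a full name resolve to the same record, so later steps can work with the unique identification information and the full name regardless of which form appeared in the text.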
Step S5, carrying out disambiguation processing on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified.
In an embodiment of the present invention, disambiguating the verified candidate entity word to obtain the recognition result of the financial named entity in the text to be recognized includes the following steps:
51) Judging, according to the unique identification information and the financial named entity full name, whether the candidate entity word is a labeled ambiguous entity word.
Specifically, corpora related to ambiguous entity words are collected in advance, each corpus being a text that contains the ambiguous entity word. Whether the ambiguous entity word refers to the corresponding entity is determined by manual labeling, and a labeled set of ambiguous entity words is constructed with the unique identification information and the financial named entity full name as indexes. For example, in the corpus "The Apple developer conference is coming; why is Apple paying more and more attention to software?", "Apple" is the financial entity Apple Inc.; in the corpus "Qianyang County of Baoji City vigorously promotes the transformation and upgrading of the apple industry", "apple" is a non-financial entity.
52) If it is a labeled ambiguous entity word, obtaining the sentence in which the candidate entity word is located, together with the preceding and following sentences, as the corpus $s^*$; performing word segmentation on the corpus to obtain the words $w_1^*, w_2^*, \dots, w_n^*$; and calculating the probability $P(c_0 \mid s^*)$ that the candidate entity word is ambiguous and the probability $P(c_1 \mid s^*)$ that it is not ambiguous; wherein $P(c_0 \mid s^*) = P(c_0 \mid w_1^*)\,P(c_0 \mid w_2^*) \cdots P(c_0 \mid w_n^*)$; $P(c_1 \mid s^*) = P(c_1 \mid w_1^*)\,P(c_1 \mid w_2^*) \cdots P(c_1 \mid w_n^*)$; $P(c_0 \mid w_a^*)$ and $P(c_1 \mid w_a^*)$ are respectively the probabilities that the word $w_a^*$ is ambiguous and not ambiguous, $a = 1, 2, \dots, n$.
In particular, a Conditional Random Field (CRF) algorithm is used for word segmentation. Let the word segmentation result of a sentence corpus $s_i$ be $w_1, w_2, \dots, w_n$. If the ambiguous word in the sentence does not refer to the entity, the sentence is regarded as belonging to the ambiguous class $c_0$; otherwise, it belongs to the unambiguous class $c_1$. Suppose $w_1$ appears in $k$ sentences in the corpus, of which $k_0$ do not refer to the entity, i.e. belong to class $c_0$, and $k_1$ refer to the entity, i.e. belong to class $c_1$. Then the probabilities are $P(c_0 \mid w_1) = k_0/(k_0+k_1)$ and $P(c_1 \mid w_1) = k_1/(k_0+k_1)$.
53) When $P(c_0 \mid s^*) > P(c_1 \mid s^*)$, judging the candidate entity word to be an ambiguous word; otherwise, judging the candidate entity word to be a recognition result.
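The probability comparison of steps 52) and 53) can be sketched with simple counting. This is a hedged sketch on an invented toy corpus: the function names and the labeled sentences are hypothetical, the sentences are assumed to be already segmented (the CRF segmentation itself is not shown), and skipping words that never occur in the labeled set is an assumption the text does not address.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

AMBIGUOUS, ENTITY = 0, 1  # classes c0 and c1


def train(labeled: Iterable[Tuple[List[str], int]]) -> Dict[int, Counter]:
    """Count, per word, the labeled sentences of each class it occurs in."""
    counts = {AMBIGUOUS: Counter(), ENTITY: Counter()}
    for words, label in labeled:
        counts[label].update(set(words))
    return counts


def class_score(words: List[str], label: int,
                counts: Dict[int, Counter]) -> float:
    """P(c|s*) as the product of P(c|w) = k_c / (k0 + k1) over the words."""
    score = 1.0
    for w in words:
        k0, k1 = counts[AMBIGUOUS][w], counts[ENTITY][w]
        if k0 + k1 == 0:
            continue  # assumption: words unseen in the labeled set are skipped
        score *= (k0 if label == AMBIGUOUS else k1) / (k0 + k1)
    return score


def is_entity(words: List[str], counts: Dict[int, Counter]) -> bool:
    """The candidate is kept unless P(c0|s*) > P(c1|s*)."""
    p0 = class_score(words, AMBIGUOUS, counts)
    p1 = class_score(words, ENTITY, counts)
    return not (p0 > p1)


# Hypothetical labeled set for the ambiguous word "apple".
counts = train([
    (["apple", "developer", "conference"], ENTITY),
    (["apple", "software", "release"], ENTITY),
    (["apple", "orchard", "harvest"], AMBIGUOUS),
    (["apple", "farm", "county"], AMBIGUOUS),
])
```

With this toy data, a context containing "developer" and "software" is kept as a recognition result, while a context containing "orchard" and "county" is judged ambiguous and discarded.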
As shown in FIG. 4, in one embodiment, the financial named entity identification system of the present invention includes an expansion module 41, a construction module 42, a screening module 43, a verification module 44, and a disambiguation module 45.
The expanding module 41 is configured to expand the entity words in the financial named entity database to generate an expanded entity word database.
Specifically, the financial named entity database contains the unique identification information, entity full names and entity short names of financial named entities. Preferably, the unique identification information is a unique code. When the financial named entity database alone is used to identify financial named entities, coverage is low. Therefore, the financial named entity database needs to be expanded to further extend the entity words.
In the invention, entity word expansion mainly targets company-type entity words, which are processed in order of entity word type priority. When entity word expansion is performed, each entity word undergoes, in sequence, company suffix expansion, place name prefix expansion, and combined place name prefix and company suffix expansion. For example, "Zhuhai Geli Electric Appliances Co., Ltd." can be expanded to obtain the following expanded entity words: "Zhuhai Geli Electric Appliances", "Geli Electric Appliances Co., Ltd." and "Geli Electric Appliances". After the expansion is finished, the expanded entity word database contains entity full names, entity short names and expanded entity names.
In an embodiment of the present invention, expanding the entity words in the financial naming entity database to generate the expanded entity word database includes the following steps:
11) Sequentially acquiring entity words to be expanded according to the entity word type priority.
Listed companies have higher priority than non-listed companies; among non-listed companies, those that have issued financial products have higher priority than those that have not; and companies that have not issued financial products can be prioritized according to registered capital.
12) For an entity word to be expanded that contains a company suffix, judging whether the entity word with the company suffix removed is contained in the expanded entity word database; if not, adding the entity word with the company suffix removed to the expanded entity word database.
13) For an entity word to be expanded that contains a place name prefix, judging whether the entity word with the place name prefix removed is contained in the expanded entity word database; if not, adding the entity word with the place name prefix removed to the expanded entity word database.
14) For an entity word to be expanded that contains both a place name prefix and a company suffix, judging whether the entity word with the place name prefix and the company suffix removed is contained in the expanded entity word database; if not, adding the entity word with the place name prefix and the company suffix removed to the expanded entity word database.
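Steps 12) to 14) can be sketched as three stripping operations applied to each entity word in priority order. This is a minimal sketch under stated assumptions: the suffix and prefix lists, the function names and the use of a Python set as the database are all hypothetical, and the association of each variant with the unique identification information of its entity is not modelled.

```python
from typing import Iterable, List, Set

# Hypothetical affix lists; longer suffixes are listed first so they win.
COMPANY_SUFFIXES = ("股份有限公司", "有限公司")
PLACE_PREFIXES = ("珠海", "上海", "北京")


def strip_suffix(word: str) -> str:
    """Remove a company suffix if present, otherwise return the word."""
    for suffix in COMPANY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def strip_prefix(word: str) -> str:
    """Remove a place name prefix if present, otherwise return the word."""
    for prefix in PLACE_PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word


def expand(words_by_priority: Iterable[str], database: Set[str]) -> List[str]:
    """Add suffix-stripped, prefix-stripped and doubly stripped variants
    that the database does not already contain; return what was added."""
    added = []
    for word in words_by_priority:
        variants = (strip_suffix(word), strip_prefix(word),
                    strip_prefix(strip_suffix(word)))
        for variant in variants:
            if variant not in database:
                database.add(variant)
                added.append(variant)
    return added


database = {"珠海格力电器股份有限公司"}
added = expand(["珠海格力电器股份有限公司"], database)
```

Because higher-priority entity words are processed first, a short form that two companies could both produce is claimed by the higher-priority one, and the later "already contained" check prevents it from being added again.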
Preferably, the expanded entity word database is constructed based on Elasticsearch. Elasticsearch is a search server based on Lucene that provides a distributed, multi-user full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the Apache license terms; it is used in cloud computing, supports real-time search, and is stable, reliable, fast, and convenient to install and use. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages.
The constructing module 42 is connected to the expanding module 41, and is configured to construct an entity word candidate model of the financial named entity.
Specifically, to quickly locate entity words in the text to be recognized, the entity word candidate model is set up to facilitate subsequent recognition of candidate entity words. Preferably, the entity word candidate model is stored in HBase and Redis and loaded into memory at the time of use. HBase is a distributed, column-oriented, open source database. HBase provides Bigtable-like capabilities on top of Hadoop and, unlike a typical relational database, is suited to storing unstructured data. Redis (Remote Dictionary Server) is an open source, log-type key-value database written in ANSI C that supports networking, can be memory-based or persistent, and provides APIs for multiple languages.
In one embodiment of the present invention, as shown in FIG. 3, constructing the entity word candidate model of the financial named entity includes the following steps:
21) setting the first two characters and the last two characters of the entity word, and determining the maximum length of the entity word containing the first two characters and the last two characters.
Specifically, the entity words in the expanded entity word database are subjected to statistical analysis to obtain first two characters and last two characters, and the maximum length of the entity words including the first two characters and the last two characters is determined, so that subsequent entity word positioning is facilitated.
It should be noted that different character counts for the first and last words have different effects on hardware overhead and positioning effect. Generally, the larger the character count of the first and last words, the larger the hardware overhead and the better the positioning effect. To balance hardware overhead against positioning effect, the invention defines an overhead-confusion function. For each character count, the numbers of distinct first and last words, $len\_s_w$ and $len\_e_{w'}$, are counted, where $len\_s_w$ denotes the number of first words with character count $w$ and $len\_e_{w'}$ denotes the number of last words with character count $w'$. The number of entity words corresponding to each first and last word is also counted, from which the averages $ave\_s_w$ and $ave\_e_{w'}$ are calculated for each character count, where $ave\_s_w$ denotes the mean number of entity words corresponding to a first word with character count $w$ and $ave\_e_{w'}$ denotes the mean number of entity words corresponding to a last word with character count $w'$. The overhead-confusion function is $Efv = \log_2(len\_s_w \cdot len\_e_{w'}) + \log_{10}(ave\_s_w \cdot ave\_e_{w'})^2$, where the first term is the overhead and the second term is the confusion. The combination of $w$ and $w'$ that gives a small value of the overhead-confusion function is taken as the preferred parameter. The preferred parameters need to be calculated and selected according to the actual situation. Calculation shows that a first-word character count of $w = 2$ and a last-word character count of $w' = 2$ are the preferred parameters.
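The statistics behind the overhead-confusion function can be sketched as follows. This is an illustration under explicit assumptions, not a definitive implementation: the function names are hypothetical, and the trailing exponent in the formula is read here as squaring the argument of the base-10 logarithm, which is only one possible reading of the printed formula.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def affix_stats(words: Iterable[str], n: int, head: bool) -> Tuple[int, float]:
    """Return (number of distinct n-character first or last words,
    mean number of entity words per such first or last word)."""
    affixes = Counter((w[:n] if head else w[-n:]) for w in words if len(w) >= n)
    return len(affixes), sum(affixes.values()) / len(affixes)


def overhead_confusion(words: Iterable[str], w: int, w_prime: int) -> float:
    """Evaluate the overhead-confusion value for character counts w and w'."""
    words = list(words)
    len_s, ave_s = affix_stats(words, w, head=True)
    len_e, ave_e = affix_stats(words, w_prime, head=False)
    overhead = math.log2(len_s * len_e)
    # Assumption: the exponent 2 applies to the argument of the logarithm.
    confusion = math.log10((ave_s * ave_e) ** 2)
    return overhead + confusion


# Hypothetical entity words; compare several (w, w') combinations.
entity_words = ["abcd", "abef", "cdef"]
scores = {(w, wp): overhead_confusion(entity_words, w, wp)
          for w in (1, 2) for wp in (1, 2)}
```

Evaluating the function for each combination of `w` and `w_prime` on the real expanded entity word database and picking a combination with a small value corresponds to the parameter selection described above.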
22) And mapping the entity words into 128-bit data based on an MD5 algorithm.
Specifically, when an entity word is mapped to fixed-length data by a hash algorithm, collisions may occur because the mapping goes from an infinite space to a finite space; that is, different contents may correspond to the same data in the finite space. Generally, the larger the finite space, the smaller the collision probability. Weighing storage space against collision probability, the present invention employs the MD5 (Message-Digest Algorithm) hash algorithm. MD5 is a widely used cryptographic hash function that generates a 128-bit (16-byte) hash value to ensure the integrity and consistency of information transmission. Therefore, each entity word in the expanded entity word database is mapped to one 128-bit data item based on the MD5 algorithm, i.e. MD5 VALUE 128BIT BYTE[16].
23) The 128-bit data is equally divided into 4 pieces of 32-bit data in order.
Specifically, for fast lookup, each 128-bit value would have to be mapped to its own bit, which would require 2^128 bits of storage space. To compress the storage space while limiting the rise in collision probability that compression causes, the invention divides the 128-bit data into 4 parts, each stored in an integer array of length 2^27, thereby obtaining MD5 VALUE 128BIT [0-3] 32BIT, MD5 VALUE 128BIT [4-7] 32BIT, MD5 VALUE 128BIT [8-11] 32BIT and MD5 VALUE 128BIT [12-15] 32BIT. Preferably, the four integer arrays are denoted hash[i] (i = 0, 1, 2, 3).
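The MD5 mapping and the four-way split can be shown with the Python standard library. This is a small sketch with assumptions: the function name is hypothetical, the word is encoded as UTF-8 before hashing, each group of 4 bytes is read in big-endian order (the text does not specify a byte order), and array elements are taken to be 32-bit integers when estimating storage.

```python
import hashlib
import struct
from typing import Tuple


def md5_chunks(word: str) -> Tuple[int, int, int, int]:
    """Map a word to its 16-byte MD5 digest and split the digest, in order,
    into four unsigned 32-bit integers (bytes 0-3, 4-7, 8-11, 12-15)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return struct.unpack(">4I", digest)


ARRAY_LENGTH = 2 ** 27              # elements per integer array
BYTES_PER_ARRAY = ARRAY_LENGTH * 4  # 32-bit elements: 512 MiB per array
TOTAL_BYTES = 4 * BYTES_PER_ARRAY   # four arrays: 2 GiB in total

chunks = md5_chunks("abc")
```

Under these assumptions the four dense arrays occupy 2 GiB in total, compared with the 2^128 bits that a single direct bitmap over the full digest would need.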
24) For each 32-bit data, the first 27 bits are mapped to a subscript of an integer array that has 2^27 elements, each with an initial value of 0; the last 5 bits are mapped to the last-5 mapping bit of the integer element at that subscript, and that mapping bit is set to 1. The last-5 mapping bit is the bit of the integer element, counted from back to front, whose position equals the value 0-31 obtained by converting the last 5 bits.
Specifically, for each 32-bit data, the first 27 bits (MD5 VALUE 27BIT) are handled by an integer array with 2^27 elements whose initial values are all 0; the first 27 bits correspond to a subscript of this array. The value of the last 5 bits (MD5 VALUE 5BIT) is an integer between 0 and 31 and can therefore be represented by one bit of a single integer element (32 bits). The bit of the integer element whose position equals the value of the last 5 bits is taken as the last-5 mapping bit and is set to 1. For example, when the last 5 bits are 00001, the value is 1 and the 1st bit of the integer element is set to 1; when the last 5 bits are 00110, the value is 6 and the 6th bit of the integer element is set to 1.
The screening module 43 is connected to the building module 42, and is configured to screen candidate entity words from the text to be recognized based on the entity word candidate model.
Specifically, for the text to be recognized, the entity word candidate model is adopted to screen out candidate entity words.
In an embodiment of the present invention, the step of screening candidate entity words from the text to be recognized based on the entity word candidate model includes the following steps:
31) Traversing the text to be recognized with a two-character window, and screening suspected entity words based on the first two characters, the last two characters and the maximum length of the entity words.
Specifically, using the first two characters, the last two characters and the maximum entity word length set by the entity word candidate model, the text to be recognized is traversed with a two-character window to obtain suspected entity words whose length does not exceed the maximum entity word length and whose first two and last two characters match.
32) For each suspected entity word, mapping to 128-bit data based on an MD5 algorithm; equally dividing the 128-bit data into 4 32-bit data in sequence; for each 32-bit data, the first 27-bit data is mapped to the subscript of the integer array, and the last 5-bit data is mapped to the last 5 mapping bits corresponding to the integer elements corresponding to the subscript.
Specifically, similar to the processing manner of the entity word candidate model, the suspected entity words are mapped to the four integer arrays of the entity word candidate model.
33) And searching bit positions corresponding to the last 5 mapping bits of the four integer elements of the suspected entity word in the entity word candidate model, and judging that the suspected entity word is a candidate entity word only when the four bit positions are all 1.
Specifically, for the last 5 mapping bits of the four integer elements of the suspected entity word, sequentially searching the bit corresponding to the last 5 mapping bits of the four integer elements in the integer array of the entity word candidate model, and if the bit corresponding to the last 5 mapping bits is all 1, indicating that the suspected entity word is a candidate entity word; and if the bit corresponding to one or more last 5 mapping bits is not 1, judging that the suspected entity word is not a candidate entity word.
The verification module 44 is connected to the expansion module 41 and the screening module 43, and is configured to verify the candidate entity words based on the expanded entity word database.
Specifically, for the obtained candidate entity words, verification is performed in the expanded entity word database, and whether the candidate entity words are recorded in the expanded entity word database is verified. If yes, the verification is passed; if not, the verification fails.
In an embodiment of the present invention, verifying the candidate entity word based on the extended entity word database includes the following steps:
41) Enabling the entity words corresponding to the same financial named entity in the expanded entity word database to have the same unique identification information.
When the expanded entity word database is constructed based on Elasticsearch, Elasticsearch builds an entity word index containing the unique identification information to facilitate subsequent entity word queries. The entity words corresponding to the same financial named entity in the expanded entity word database (such as the entity full name, entity short name and entity alternative names) have the same unique identification information.
42) Searching the expanded entity word database for the unique identification information and the financial named entity full name corresponding to the candidate entity word; if the search succeeds, the candidate entity word passes the verification; if the search fails, the candidate entity word fails the verification.
The disambiguation module 45 is connected to the verification module 44, and is configured to perform disambiguation processing on the verified candidate entity words to obtain the recognition result of the financial named entity in the text to be recognized.
In an embodiment of the present invention, disambiguating the verified candidate entity word to obtain the recognition result of the financial named entity in the text to be recognized includes the following steps:
51) Judging, according to the unique identification information and the financial named entity full name, whether the candidate entity word is a labeled ambiguous entity word.
Specifically, corpora related to ambiguous entity words are collected in advance, each corpus being a text that contains the ambiguous entity word. Whether the ambiguous entity word refers to the corresponding entity is determined by manual labeling, and a labeled set of ambiguous entity words is constructed with the unique identification information and the financial named entity full name as indexes. For example, in the corpus "The Apple developer conference is coming; why is Apple paying more and more attention to software?", "Apple" is the financial entity Apple Inc.; in the corpus "Qianyang County of Baoji City vigorously promotes the transformation and upgrading of the apple industry", "apple" is a non-financial entity.
52) If it is a labeled ambiguous entity word, obtaining the sentence in which the candidate entity word is located, together with the preceding and following sentences, as the corpus $s^*$; performing word segmentation on the corpus to obtain the words $w_1^*, w_2^*, \dots, w_n^*$; and calculating the probability $P(c_0 \mid s^*)$ that the candidate entity word is ambiguous and the probability $P(c_1 \mid s^*)$ that it is not ambiguous; wherein $P(c_0 \mid s^*) = P(c_0 \mid w_1^*)\,P(c_0 \mid w_2^*) \cdots P(c_0 \mid w_n^*)$; $P(c_1 \mid s^*) = P(c_1 \mid w_1^*)\,P(c_1 \mid w_2^*) \cdots P(c_1 \mid w_n^*)$; $P(c_0 \mid w_a^*)$ and $P(c_1 \mid w_a^*)$ are respectively the probabilities that the word $w_a^*$ is ambiguous and not ambiguous, $a = 1, 2, \dots, n$.
Specifically, a Conditional Random Field (CRF) algorithm is used for word segmentation. Let the word segmentation result of a sentence corpus $s_i$ be $w_1, w_2, \dots, w_n$. If the ambiguous word in the sentence does not refer to the entity, the sentence is regarded as belonging to the ambiguous class $c_0$; otherwise, it belongs to the unambiguous class $c_1$. Suppose $w_1$ appears in $k$ sentences in the corpus, of which $k_0$ do not refer to the entity, i.e. belong to class $c_0$, and $k_1$ refer to the entity, i.e. belong to class $c_1$. Then the probabilities are $P(c_0 \mid w_1) = k_0/(k_0+k_1)$ and $P(c_1 \mid w_1) = k_1/(k_0+k_1)$.
53) When $P(c_0 \mid s^*) > P(c_1 \mid s^*)$, judging the candidate entity word to be an ambiguous word; otherwise, judging the candidate entity word to be a recognition result.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the x module may be a processing element that is set up separately, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and the function of the x module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), or one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or other processor capable of calling program code. As another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
The storage medium of the present invention has stored thereon a computer program which, when executed by a processor, implements the above-described financial named-entity recognition method. The storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
Any combination of one or more storage media may be employed. The storage medium may be a computer-readable signal medium or a computer-readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the computer program instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
In one embodiment, the financial named-entity recognition terminal of the present invention comprises: a processor and a memory.
The memory is for storing a computer program.
The memory includes: various media that can store program codes, such as ROM, RAM, magnetic disk, U-disk, memory card, or optical disk.
The processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the financial named entity identification terminal to execute the financial named entity identification method.
Preferably, the processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
As shown in FIG. 5, the financial named entity identification terminal of the present invention is embodied in the form of a general purpose computing device. The components of the financial named entity identification terminal may include, but are not limited to: one or more processors or processing units 51, a memory 52, and a bus 53 that couples the various system components (including the memory 52 and the processing unit 51).
Financial named entity identification terminals typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the financial named entity identification terminal and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 52 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)521 and/or cache memory 522. The financial named entity identification terminal may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 523 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5 and commonly referred to as a "hard disk drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 53 by one or more data media interfaces. Memory 52 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 524 having a set (at least one) of program modules 5241 may be stored, for example, in the memory 52. Such program modules 5241 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 5241 generally perform the functions and/or methods of the described embodiments of the invention.
The financial named entity identification terminal may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the financial named entity identification terminal, and/or with any devices (e.g., network card, modem, etc.) that enable the financial named entity identification terminal to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 54. Also, the financial named entity identification terminal may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 55. As shown in FIG. 5, the network adapter 55 communicates with the other modules of the financial named entity identification terminal via the bus 53. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in connection with the financial named entity identification terminal, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In summary, the financial named entity recognition method and system, the storage medium and the terminal of the invention obtain a higher coverage rate by expanding the financial named entity words; the entity word candidate model supports rapid candidate word matching against a massive number of entity words, so that financial named entity words can be screened out quickly and the recognition speed is increased; and the problem of ambiguous entity words is addressed by a disambiguation algorithm. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical ideas of the present invention be covered by the claims of the present invention.
Claims (18)
1. A financial named entity recognition method, characterized by: the method comprises the following steps:
expanding the entity words in the financial named entity database to generate an expanded entity word database;
constructing an entity word candidate model of the financial named entity;
screening candidate entity words from the text to be recognized based on the entity word candidate model;
verifying the candidate entity words based on the expanded entity word database;
carrying out disambiguation processing on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified;
the method for constructing the entity word candidate model of the financial named entity comprises the following steps:
setting the first two characters and the last two characters of an entity word, and determining the maximum length of the entity words comprising the first two characters and the last two characters;
mapping the entity words into 128-bit data based on an MD5 algorithm;
equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence;
for each 32-bit data, mapping the first 27 bits of data to a subscript of an integer array having $2^{27}$ elements, each with an initial value of 0, mapping the last 5 bits of data to the last 5 mapping bits of the integer element corresponding to the subscript, and setting the last 5 mapping bits to 1; the last 5 mapping bits are the bit of the integer element, counted from back to front, at the position given by the value 0-31 obtained by converting the last 5 bits of data.
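The construction in claim 1 resembles a Bloom filter with four hash positions taken from a single MD5 digest. The following is a minimal Python sketch of that reading, not the patented implementation. Several details are assumptions not stated in the claim: UTF-8 encoding of the entity word, big-endian interpretation of each 4-byte chunk, bit 0 being the lowest-order bit, and a sparse dictionary standing in for the $2^{27}$-element integer array. The names `chunk_positions` and `build_candidate_model` are hypothetical.

```python
import hashlib
from collections import defaultdict

BIT_BITS = 5  # the last 5 bits of each chunk select one of 32 bits


def chunk_positions(word):
    """Return four (subscript, bit) pairs derived from the MD5 digest of word."""
    digest = hashlib.md5(word.encode("utf-8")).digest()  # 16 bytes = 128 bits
    positions = []
    for i in range(0, 16, 4):
        # Assumption: each 4-byte chunk is read as a big-endian 32-bit integer.
        chunk = int.from_bytes(digest[i:i + 4], "big")
        subscript = chunk >> BIT_BITS        # first 27 bits, in [0, 2**27)
        bit = chunk & ((1 << BIT_BITS) - 1)  # last 5 bits, in [0, 31]
        positions.append((subscript, bit))
    return positions


def build_candidate_model(entity_words):
    """Set one bit per chunk for every entity word."""
    # Sparse stand-in for the 2**27-element array; missing keys read as 0.
    model = defaultdict(int)
    for word in entity_words:
        for subscript, bit in chunk_positions(word):
            model[subscript] |= 1 << bit  # bit 0 is the last (lowest) bit
    return model
```

A dense array of $2^{27}$ 32-bit integers occupies 512 MiB, which is why this sketch uses a dictionary; a production system would more likely use a flat array or an external store.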
2. The financial named entity identification method of claim 1, wherein: the method for expanding the entity words in the financial named entity database to generate the expanded entity word database comprises the following steps:
sequentially acquiring entity words to be expanded according to the entity word type priority;
for an entity word to be expanded that contains a company suffix, judging whether the entity word to be expanded with the company suffix removed is contained in the expanded entity word database; if not, adding the entity word to be expanded with the company suffix removed into the expanded entity word database;
for an entity word to be expanded that contains a place name prefix, judging whether the entity word to be expanded with the place name prefix removed is contained in the expanded entity word database; if not, adding the entity word to be expanded with the place name prefix removed into the expanded entity word database;
for an entity word to be expanded that contains both a place name prefix and a company suffix, judging whether the entity word to be expanded with the place name prefix and the company suffix removed is contained in the expanded entity word database; and if not, adding the entity word to be expanded with the place name prefix and the company suffix removed into the expanded entity word database.
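The expansion steps of claim 2 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the suffix and prefix lists are hypothetical examples, the input is assumed to be already sorted by entity word type priority, the full name is assumed to be stored alongside its shortened variants, and a plain dictionary stands in for the expanded entity word database.

```python
# Hypothetical example lists; a real system would load these from data.
COMPANY_SUFFIXES = (" Co., Ltd.", " Inc.")
PLACE_PREFIXES = ("Shanghai ", "Beijing ")


def strip_suffix(word):
    """Remove a company suffix if present, otherwise return word unchanged."""
    for suffix in COMPANY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def strip_prefix(word):
    """Remove a place name prefix if present, otherwise return word unchanged."""
    for prefix in PLACE_PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word


def expand(entity_words):
    """entity_words: iterable of (entity_id, full_name) in priority order."""
    expanded = {}
    for entity_id, full_name in entity_words:
        variants = [
            full_name,
            strip_suffix(full_name),
            strip_prefix(full_name),
            strip_prefix(strip_suffix(full_name)),
        ]
        for variant in variants:
            # A variant is added only if it is not already in the database,
            # so an earlier (higher-priority) entity keeps the short form.
            expanded.setdefault(variant, (entity_id, full_name))
    return expanded
```

Because entities are processed in priority order, a shortened form shared by two companies stays attached to the higher-priority one.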
3. The financial named entity identification method of claim 2, wherein: the entity word type priority is, from high to low, listed companies, non-listed companies that issue financial products, and non-listed companies that do not issue financial products; non-listed companies that do not issue financial products are prioritized by registered capital.
4. The financial named entity identification method of claim 1, wherein: based on the entity word candidate model, the step of screening out candidate entity words from the text to be recognized comprises the following steps:
traversing the text to be recognized with a two-character window, and screening suspected entity words based on the first two characters, the last two characters and the maximum length of the entity words;
for each suspected entity word, mapping to 128-bit data based on an MD5 algorithm; equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence; for each 32-bit data, mapping the first 27-bit data to the subscript of the integer array, and mapping the last 5-bit data to the last 5 mapping bits corresponding to the integer elements corresponding to the subscript;
and searching bits corresponding to the last 5 mapping bits of the four integer elements of the suspected entity word in the entity word candidate model, and judging the suspected entity word as a candidate entity word only when the four bits are all 1.
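The screening steps of claim 4 can be sketched as below. This is one possible reading, not the patented implementation: it assumes the model keeps, for each pair of first two and last two characters, the maximum entity word length, and it uses a sparse dictionary in place of the integer array. The function names `positions`, `build` and `screen` are hypothetical, and the byte order and bit order are the same assumptions as in any MD5-based sketch of claim 1.

```python
import hashlib


def positions(word):
    """Four (subscript, bit) pairs from the MD5 digest (big-endian assumed)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    chunks = (int.from_bytes(digest[i:i + 4], "big") for i in range(0, 16, 4))
    return [(c >> 5, c & 31) for c in chunks]


def build(entity_words):
    bits = {}     # sparse stand-in for the integer array
    max_len = {}  # (first two chars, last two chars) -> longest entity word
    for word in entity_words:
        for subscript, bit in positions(word):
            bits[subscript] = bits.get(subscript, 0) | (1 << bit)
        key = (word[:2], word[-2:])
        max_len[key] = max(max_len.get(key, 0), len(word))
    return bits, max_len


def screen(text, bits, max_len):
    """Return (start, word) for every suspected word whose four bits are all 1."""
    heads = {}  # first two chars -> list of (last two chars, max length)
    for (head, tail), length in max_len.items():
        heads.setdefault(head, []).append((tail, length))
    candidates = []
    for start in range(len(text) - 1):  # slide a two-character window
        for tail, length in heads.get(text[start:start + 2], []):
            for end in range(start + 2, min(start + length, len(text)) + 1):
                if text[end - 2:end] != tail:
                    continue
                suspect = text[start:end]
                if all(bits.get(s, 0) >> b & 1 for s, b in positions(suspect)):
                    candidates.append((start, suspect))
    return candidates
```

The head/tail check limits hashing to spans that could plausibly be entity words; the bit test can still yield false positives, which is presumably why a separate verification step follows.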
5. The financial named entity identification method of claim 1, wherein: verifying the candidate entity words based on the expanded entity word database comprises the following steps:
enabling entity words corresponding to the same financial named entity in the expanded entity word database to have the same unique identification information;
searching the unique identification information and the financial named entity full name corresponding to the candidate entity words in the expanded entity word database; and if the search is successful, the candidate entity word passes the verification.
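The verification step of claim 5 amounts to an exact lookup. A minimal Python sketch follows, assuming the expanded entity word database can be modeled as a dictionary from entity word to a pair of unique identifier and full name; the function name `verify` and the record layout are hypothetical.

```python
def verify(candidates, expanded_db):
    """Keep only candidates found in the expanded entity word database.

    expanded_db maps an entity word to (unique_id, full_name); every alias
    of the same financial named entity shares the same unique_id.
    """
    verified = []
    for word in candidates:
        record = expanded_db.get(word)
        if record is not None:  # the search succeeded, so the word passes
            unique_id, full_name = record
            verified.append((word, unique_id, full_name))
    return verified
```

A candidate that passed the bit test only by hash collision has no record in the database and is dropped here.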
6. The financial named entity identification method of claim 1, wherein: carrying out disambiguation processing on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified comprises the following steps:
judging whether the candidate entity words are labeled ambiguous entity words or not;
if the candidate entity word is a labeled ambiguous entity word, obtaining the sentence in which the candidate entity word is located together with the preceding and following sentences as a corpus $s^*$; performing word segmentation on the corpus to obtain words $w_1^*, w_2^*, \ldots, w_n^*$; respectively calculating the probability $P(c_0 \mid s^*)$ that the candidate entity word is ambiguous and the probability $P(c_1 \mid s^*)$ that it is not ambiguous; wherein $P(c_0 \mid s^*) = P(c_0 \mid w_1^*) P(c_0 \mid w_2^*) \cdots P(c_0 \mid w_n^*)$; $P(c_1 \mid s^*) = P(c_1 \mid w_1^*) P(c_1 \mid w_2^*) \cdots P(c_1 \mid w_n^*)$; $P(c_0 \mid w_a^*)$ and $P(c_1 \mid w_a^*)$ are respectively the probability of ambiguity and the probability of no ambiguity for the word $w_a^*$, $a = 1, 2, \ldots, n$;
when $P(c_0 \mid s^*) > P(c_1 \mid s^*)$, judging the candidate entity word to be an ambiguous word; otherwise, judging the candidate entity word to be the recognition result.
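The comparison in claim 6 multiplies per-word probabilities over the context and picks the larger product. The Python sketch below illustrates this under assumptions the claim does not state: the per-word probability of no ambiguity is taken as one minus the probability of ambiguity, unseen words get a neutral default of 0.5, and sums of logarithms replace the products to avoid underflow (the comparison result is unchanged). The function name `is_ambiguous` and the probability table are hypothetical.

```python
import math


def is_ambiguous(context_words, p_ambiguous, default=0.5):
    """Compare the product of P(c0 | w) against the product of P(c1 | w).

    p_ambiguous maps a context word to P(c0 | w). P(c1 | w) is assumed to be
    1 - P(c0 | w), which the source does not state explicitly.
    """
    log_p0 = 0.0
    log_p1 = 0.0
    for word in context_words:
        p0 = p_ambiguous.get(word, default)
        p0 = min(max(p0, 1e-9), 1 - 1e-9)  # keep both logarithms finite
        log_p0 += math.log(p0)
        log_p1 += math.log(1 - p0)
    # Ambiguous only when the c0 product is strictly larger.
    return log_p0 > log_p1
```

For example, with a table such as `{"fruit": 0.9, "juice": 0.8}`, the context `["fruit", "juice"]` would be judged ambiguous, while a context with no known words is not, so the candidate is kept as a recognition result.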
7. The financial named entity identification method of claim 1, wherein: the expanded entity word database is constructed based on Elasticsearch.
8. The financial named entity identification method of claim 1, wherein: the entity word candidate model is stored based on HBase and Redis.
9. A financial named entity recognition system, characterized by: the system comprises an expansion module, a construction module, a screening module, a verification module and a disambiguation module;
the expansion module is used for expanding the entity words in the financial named entity database to generate an expanded entity word database;
the construction module is used for constructing an entity word candidate model of the financial named entity;
the screening module is used for screening out candidate entity words from the text to be recognized based on the entity word candidate model;
the verification module is used for verifying the candidate entity words based on the expanded entity word database;
the disambiguation module is used for carrying out disambiguation on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified;
the construction module for constructing the entity word candidate model of the financial named entity comprises the following steps:
setting the first two characters and the last two characters of an entity word, and determining the maximum length of the entity words comprising the first two characters and the last two characters;
mapping the entity words into 128-bit data based on an MD5 algorithm;
equally dividing the 128-bit data into 4 pieces of 32-bit data in sequence;
for each 32-bit data, mapping the first 27 bits of data to a subscript of an integer array having $2^{27}$ elements, each with an initial value of 0, mapping the last 5 bits of data to the last 5 mapping bits of the integer element corresponding to the subscript, and setting the last 5 mapping bits to 1; the last 5 mapping bits are the bit of the integer element, counted from back to front, at the position given by the value 0-31 obtained by converting the last 5 bits of data.
10. The financial named entity identification system of claim 9, wherein: the expanding module expands the entity words in the financial named entity database to generate an expanded entity word database, and the expanding module comprises the following steps:
sequentially acquiring entity words to be expanded according to the entity word type priority;
for an entity word to be expanded that contains a company suffix, judging whether the entity word to be expanded with the company suffix removed is contained in the expanded entity word database; if not, adding the entity word to be expanded with the company suffix removed into the expanded entity word database;
for an entity word to be expanded that contains a place name prefix, judging whether the entity word to be expanded with the place name prefix removed is contained in the expanded entity word database; if not, adding the entity word to be expanded with the place name prefix removed into the expanded entity word database;
for an entity word to be expanded that contains both a place name prefix and a company suffix, judging whether the entity word to be expanded with the place name prefix and the company suffix removed is contained in the expanded entity word database; and if not, adding the entity word to be expanded with the place name prefix and the company suffix removed into the expanded entity word database.
11. The financial named entity identification system of claim 10, wherein: the entity word type priority is, from high to low, listed companies, non-listed companies that issue financial products, and non-listed companies that do not issue financial products; non-listed companies that do not issue financial products are prioritized by registered capital.
12. The financial named entity identification system of claim 9, wherein: the screening module screens candidate entity words in the text to be recognized based on the entity word candidate model and comprises the following steps:
traversing the text to be recognized with a two-character window, and screening suspected entity words based on the first two characters, the last two characters and the maximum length of the entity words;
for each suspected entity word, mapping to 128-bit data based on an MD5 algorithm; equally dividing the 128-bit data into 4 32-bit data in sequence; for each 32-bit data, mapping the first 27-bit data to the subscript of the integer array, and mapping the last 5-bit data to the last 5 mapping bits corresponding to the integer elements corresponding to the subscript;
and searching bits corresponding to the last 5 mapping bits of the four integer elements of the suspected entity word in the entity word candidate model, and judging the suspected entity word as a candidate entity word only when the four bits are all 1.
13. The financial named entity recognition system of claim 9, wherein: the verification module verifying the candidate entity words based on the expanded entity word database comprises the following steps:
enabling entity words corresponding to the same financial named entity in the expanded entity word database to have the same unique identification information;
searching the unique identification information and the financial named entity full name corresponding to the candidate entity words in the expanded entity word database; and if the search is successful, the candidate entity word passes the verification.
14. The financial named entity identification system of claim 9, wherein: the disambiguation module carrying out disambiguation processing on the verified candidate entity words to obtain the identification result of the financial named entity in the text to be identified comprises the following steps:
judging whether the candidate entity words are labeled ambiguous entity words or not;
if the candidate entity word is a labeled ambiguous entity word, obtaining the sentence in which the candidate entity word is located together with the preceding and following sentences as a corpus $s^*$; performing word segmentation on the corpus to obtain words $w_1^*, w_2^*, \ldots, w_n^*$; respectively calculating the probability $P(c_0 \mid s^*)$ that the candidate entity word is ambiguous and the probability $P(c_1 \mid s^*)$ that it is not ambiguous; wherein $P(c_0 \mid s^*) = P(c_0 \mid w_1^*) P(c_0 \mid w_2^*) \cdots P(c_0 \mid w_n^*)$; $P(c_1 \mid s^*) = P(c_1 \mid w_1^*) P(c_1 \mid w_2^*) \cdots P(c_1 \mid w_n^*)$; $P(c_0 \mid w_a^*)$ and $P(c_1 \mid w_a^*)$ are respectively the probability of ambiguity and the probability of no ambiguity for the word $w_a^*$, $a = 1, 2, \ldots, n$;
when $P(c_0 \mid s^*) > P(c_1 \mid s^*)$, judging the candidate entity word to be an ambiguous word; otherwise, judging the candidate entity word to be the recognition result.
15. The financial named entity identification system of claim 9, wherein: the expansion module constructs the expanded entity word database based on Elasticsearch.
16. The financial named entity identification system of claim 9, wherein: the construction module stores the entity word candidate model based on HBase and Redis.
17. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the financial named entity recognition method of any one of claims 1 to 8.
18. A financial named entity identification terminal, comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory to cause the financial named entity identification terminal to perform the financial named entity identification method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110913735.3A CN113642331B (en) | 2021-08-10 | 2021-08-10 | Financial named entity identification method and system, storage medium and terminal |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113642331A (en) | 2021-11-12 |
CN113642331B (en) | 2022-05-03 |
Family ID: 78420528
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110913735.3A (Active, CN113642331B) | 2021-08-10 | 2021-08-10 | Financial named entity identification method and system, storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113642331B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202382A (en) * | 2016-07-08 | 2016-12-07 | 南京缘长信息科技有限公司 | Link instance method and system |
CN111353310A (en) * | 2020-02-28 | 2020-06-30 | 腾讯科技(深圳)有限公司 | Named entity identification method and device based on artificial intelligence and electronic equipment |
CN112215008A (en) * | 2020-10-23 | 2021-01-12 | 中国平安人寿保险股份有限公司 | Entity recognition method and device based on semantic understanding, computer equipment and medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9684648B2 (en) * | 2012-05-31 | 2017-06-20 | International Business Machines Corporation | Disambiguating words within a text segment |
CN110852106B (en) * | 2019-11-06 | 2024-05-03 | 腾讯科技(深圳)有限公司 | Named entity processing method and device based on artificial intelligence and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN113642331A (en) | 2021-11-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||