CN111783460A - Enterprise abbreviation extraction method and device, computer equipment and storage medium

Publication number
CN111783460A
Authority
CN
China
Prior art keywords
entity
name
enterprise
matching
sequence
Legal status
Pending
Application number
CN202010542872.6A
Other languages
Chinese (zh)
Inventor
孙华蔚
沈春泽
李加庆
周张泉
Current Assignee
Suning Financial Technology Nanjing Co Ltd
Original Assignee
Suning Financial Technology Nanjing Co Ltd
Application filed by Suning Financial Technology Nanjing Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Abstract

The invention discloses an enterprise abbreviation extraction method and device, computer equipment and a storage medium, belonging to the technical field of text information processing. The method comprises the following steps: acquiring an enterprise name; matching the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, following the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity; acquiring the entity sequence matched for the enterprise name, and identifying the name entity of the enterprise name according to the entity category of each entity in the entity sequence; and verifying the name entity and, if the verification succeeds, determining the name entity as the abbreviation of the enterprise name. The invention can effectively improve the efficiency and accuracy of enterprise abbreviation extraction.

Description

Enterprise abbreviation extraction method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of text information processing, in particular to an enterprise abbreviation extraction method, an enterprise abbreviation extraction device, computer equipment and a storage medium.
Background
With the rapid development of internet technology, a large amount of public opinion information about companies is generated online. For example, large volumes of financial news contain company names, and internet users mostly refer to companies by abbreviated names in web page text. To obtain comprehensive enterprise-related information from the internet in a timely and accurate manner, the abbreviations of enterprises therefore need to be identified. An effective public opinion processing system can handle company names in different forms, thereby supporting analysis, research and decision-making for various businesses.
Existing enterprise abbreviation extraction mainly relies on statistics-based algorithms. Such methods require a large amount of manually labelled corpora for training, the feature scale of the corpora is large, the cost is high, and the accuracy is low.
Disclosure of Invention
In order to solve the problems mentioned in the background art, the invention provides an enterprise abbreviation extraction method, an enterprise abbreviation extraction device, a computer device and a storage medium, which can effectively improve the efficiency and accuracy of enterprise abbreviation extraction. The embodiment of the invention provides the following specific technical scheme:
In a first aspect, an enterprise abbreviation extraction method is provided, the method comprising:
acquiring an enterprise name;
matching the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, following the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity;
acquiring the entity sequence matched for the enterprise name, and identifying the name entity of the enterprise name according to the entity category of each entity in the entity sequence;
and verifying the name entity and, if the verification succeeds, determining the name entity as the abbreviation of the enterprise name.
Further, matching the enterprise name against the plurality of preset entity dictionaries and the plurality of regular expressions, following the matching priority order of the regular expressions, includes:
traversing each regular expression in sequence according to the matching priority order of each regular expression;
in the traversal process, if the currently traversed regular expression is combined with a plurality of entity dictionaries to successfully match the enterprise names to obtain an entity sequence, the traversal is stopped, and if not, the traversal is continued until the matching is successful.
Further, the checking the name entity includes:
acquiring the total word number of the name entity;
judging whether the total word number of the name entity is larger than a first preset word number and smaller than a second preset word number or not;
if so, the verification is successful, otherwise, the verification fails.
Further, the method further comprises:
if the total word number of the name entity is judged to be not smaller than the second preset word number, matching the name entity according to the entity dictionaries and the regular expressions and the matching priority sequence of the regular expressions;
judging whether the entity sequence of the name entity is successfully matched;
if so, identifying the abbreviation of the enterprise name from the entity sequence of the name entity;
and if not, screening the abbreviation of the enterprise name out of the name entity, and supplementing the remaining words of the name entity into the corresponding entity dictionary based on a Bootstrapping algorithm.
Further, the method further comprises:
if the total word number of the name entity is judged to be not greater than the first preset word number, splicing the name entity with the previous entity or the next entity of the name entity in the entity sequence, and determining the splicing result as the abbreviation of the enterprise name.
Further, the method further comprises:
storing the enterprise name in a database in association with the abbreviation of the enterprise name.
Further, the method further comprises the step of pre-constructing a plurality of entity dictionaries:
constructing an enterprise name sample library;
extracting region names, industry names and enterprise types from the enterprise names in the enterprise name sample library through an N-Gram algorithm;
taking the region names, the industry names and the enterprise types as entities, and correspondingly constructing a region dictionary, an industry dictionary and an enterprise type dictionary;
and performing word segmentation processing on each enterprise name in the enterprise name sample library through a word segmentation algorithm, and supplementing the region dictionary, the industry dictionary and the enterprise type dictionary according to word segmentation results.
In a second aspect, an enterprise abbreviation extraction apparatus is provided, the apparatus including:
the acquisition module is used for acquiring enterprise names;
the first matching module is used for matching the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, following the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity;
the identification module is used for acquiring the entity sequence matched for the enterprise name and identifying the name entity of the enterprise name according to the entity category of each entity in the entity sequence;
the verification module is used for verifying the name entity;
and the determining module is used for determining the name entity as the abbreviation of the enterprise name when the verification by the verification module succeeds.
Further, the first matching module is specifically configured to:
traversing each regular expression in sequence according to the matching priority order of each regular expression;
in the traversal process, if the currently traversed regular expression is combined with a plurality of entity dictionaries to successfully match the enterprise names to obtain an entity sequence, the traversal is stopped, and if not, the traversal is continued until the matching is successful.
Further, the verification module is specifically configured to:
acquiring the total word number of the name entity;
judging whether the total word number of the name entity is larger than a first preset word number and smaller than a second preset word number or not;
if so, the verification is successful, otherwise, the verification fails.
Further, the apparatus further includes a second matching module, and the second matching module is specifically configured to:
if the verification module judges that the total word number of the name entity is not less than the second preset word number, matching the name entity according to the entity dictionaries and the regular expressions and the matching priority order of the regular expressions;
the determining module is specifically further configured to:
judging whether the second matching module successfully matches an entity sequence of the name entity;
if so, identifying the abbreviation of the enterprise name from the entity sequence of the name entity;
and if not, screening the abbreviation of the enterprise name out of the name entity, and supplementing the remaining words of the name entity into the corresponding entity dictionary based on a Bootstrapping algorithm.
Further, the determining module is specifically configured to:
if the verification module judges that the total word number of the name entity is not greater than the first preset word number, splicing the name entity with the previous entity or the next entity of the name entity in the entity sequence, and determining the splicing result as the abbreviation of the enterprise name.
Further, the apparatus further includes a saving module, and the saving module is specifically configured to:
storing the enterprise name in a database in association with the abbreviation of the enterprise name.
Further, the apparatus further comprises a construction module, which is specifically configured to:
constructing an enterprise name sample library;
extracting region names, industry names and enterprise types from the enterprise names in the enterprise name sample library through an N-Gram algorithm;
taking the region names, the industry names and the enterprise types as entities, and correspondingly constructing a region dictionary, an industry dictionary and an enterprise type dictionary;
and performing word segmentation processing on each enterprise name in the enterprise name sample library through a word segmentation algorithm, and supplementing the region dictionary, the industry dictionary and the enterprise type dictionary according to word segmentation results.
In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the enterprise abbreviation extraction method according to any implementation of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the enterprise abbreviation extraction method according to any implementation of the first aspect.
The technical scheme provided by the invention at least has the following beneficial effects:
the embodiment of the invention provides an enterprise abbreviation extracting method, an enterprise abbreviation extracting device, computer equipment and a storage medium.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an enterprise abbreviation extraction method provided in an embodiment of the present invention;
FIG. 2 is a flowchart of constructing entity dictionaries according to an embodiment of the present invention;
FIG. 3 is a block diagram of an enterprise abbreviation extraction device according to an embodiment of the present invention;
FIG. 4 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is to be understood that, unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
Furthermore, in the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
As described in the background above, existing enterprise abbreviation extraction mainly relies on statistics-based algorithms, which require a large amount of manually labelled corpora for training, so that the feature scale of the corpora is large, the cost is high and the accuracy is low. The embodiment of the invention therefore provides an enterprise abbreviation extraction method that matches an enterprise name against dictionaries and regular expressions, following the matching priority order of the expressions, and extracts the abbreviation corresponding to the enterprise name. The key name information can thus be extracted accurately both from company names in normalized form and from company names in non-normalized form, which provides a data basis for technologies such as enterprise entity recognition and company-name-based public opinion risk analysis.
Example one
The embodiment of the invention provides an enterprise abbreviation extracting method, which is exemplified by being applied to an enterprise abbreviation extracting device, and the device can be configured in any computer equipment so that the computer equipment can execute the enterprise abbreviation extracting method. Referring to fig. 1, the method may include the steps of:
Step 101: acquiring the enterprise name.
The enterprise name may be a name in the four-segment normalized form <region><name><industry><type>, in which the fields appear from left to right, each pair of angle brackets denotes the entity category of a field, and the name field is the abbreviation of the enterprise name. The enterprise name may also be a company name in a non-normalized form, for example one that contains only some of the entity categories of the normalized form (such as <name><industry><type>), or one in which several fields belong to the same entity category (such as <region><name><industry><type>), and so on.
Specifically, the enterprise name may be obtained from an enterprise name database opened by the industry and commerce department, which is not specifically limited in this embodiment.
Step 102: matching the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, following the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity.
Each entity dictionary can be pre-constructed based on the enterprise name sample corpus, and comprises a region dictionary, an industry dictionary and an enterprise type dictionary. Specifically, the region dictionary may be divided into a provincial region dictionary, a city region dictionary, and a prefecture region dictionary according to the administrative division level, and the industry dictionary is divided into a two-word industry dictionary, a three-word industry dictionary, and a four-word industry dictionary according to the number of words of dictionary elements.
Data exploration is performed in advance on the enterprise name samples in the enterprise name sample library, enterprise names in non-normalized forms other than the four-segment normalized form (<region><name><industry><type>) are identified, and a plurality of regular expression field forms are designed to match enterprise names in different forms.
In a specific application, in addition to the regular expression for the four-segment normalized form, more than 80 further regular expression field forms are designed for matching. These include forms that change the position of the entity categories of the four-segment normalized form (for example, <name><region><industry><type>), forms in which several fields belong to the same entity category (for example, <region><name><industry><type>), and forms that contain only some of the entity categories (for example, <name><industry><type>). Designing a variety of regular expression field forms improves the coverage of enterprise name matching and safeguards the accuracy of the subsequent abbreviation extraction.
Specifically, the regular expressions can be traversed sequentially according to the matching priority order of the regular expressions; in the traversal process, if the currently traversed regular expression is combined with a plurality of entity dictionaries to successfully match the enterprise names to obtain an entity sequence, the traversal is stopped, and if not, the traversal is continued until the matching is successful.
The matching priority order of the regular expressions can be set as:
Matching starts from the regular expressions with the largest number of fields, and among these the regular expressions in which the order of entity categories is changed are matched first. If matching is unsuccessful, the number of entity categories is decreased step by step, and at each step the regular expressions that lack only the region field are matched first. When matching industry entities, a regular expression with a four-word industry element field has a higher matching priority than one with a two-word industry element field. Matching proceeds in this order until an entity sequence is successfully matched for the enterprise name.
For example, for the enterprise name "XX asset assessment (shanghai) limited", the regular expression <name><industry><region><type> should be matched first according to the matching priority order, which gives the enterprise abbreviation "XX", the correct result. If the form <name><region><type> were matched first, the abbreviation "XX asset assessment" would be obtained, which is not concise enough.
In the embodiment, the enterprise names are matched according to the preset entity dictionaries and the regular expressions and the matching priority sequence of each regular expression, so that the efficiency and the accuracy of the matching algorithm can be improved.
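The priority-ordered traversal described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the toy dictionaries, the four field forms, the separator pattern and all function names are assumptions made for the example, and each form may contain a given entity category only once.

```python
import re

# Hypothetical toy dictionaries; real ones would be built from a sample library.
DICTIONARIES = {
    "region": ["shanghai", "beijing"],
    "industry": ["asset assessment", "technology"],
    "type": ["limited company", "limited"],
}
SEPARATOR = r"[\s()]*"  # assumption: spaces and brackets may separate fields


def field_regex(category):
    """Regex fragment for one field; the name field has no dictionary."""
    if category == "name":
        return r"(?P<name>.+?)"
    # Longer dictionary elements are tried before shorter ones.
    words = sorted(DICTIONARIES[category], key=len, reverse=True)
    alternatives = "|".join(re.escape(w) for w in words)
    return "(?P<" + category + ">" + alternatives + ")"


def compile_form(categories):
    """Turn a field form such as [name, industry, type] into a regex."""
    body = SEPARATOR.join(field_regex(c) for c in categories)
    return categories, re.compile("^" + body + SEPARATOR + "$")


# Field forms listed from highest to lowest matching priority (illustrative).
FORMS = [compile_form(form) for form in (
    ["name", "industry", "region", "type"],
    ["region", "name", "industry", "type"],
    ["name", "industry", "type"],
    ["name", "region", "type"],
)]


def match_enterprise_name(enterprise_name):
    """Return the entity sequence of the first form that matches, else None."""
    for categories, pattern in FORMS:
        m = pattern.match(enterprise_name)
        if m:  # traversal stops at the first successful match
            return [(c, m.group(c)) for c in categories]
    return None


print(match_enterprise_name("XX asset assessment (shanghai) limited"))
```

With these toy dictionaries, the highest-priority form yields the name field "XX", whereas compiling only the lower-priority form `["name", "region", "type"]` would yield "XX asset assessment", which mirrors the example in the text.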
Step 103: acquiring the entity sequence matched for the enterprise name, and identifying the name entity of the enterprise name according to the entity category of each entity in the entity sequence.
The sequencing position of each entity included in the entity sequence corresponds to the position of each entity in the business name, and the entities included in the entity sequence do not overlap with each other in the business name.
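As a small sketch of this step, the entity sequence can be modelled as a list of `(category, text, start, end)` tuples. The tuple layout and the function name are assumptions made for illustration; the source does not prescribe a data structure.

```python
def identify_name_entity(entity_sequence):
    """Pick the name entity from a sequence of (category, text, start, end)."""
    # Order entities by their position in the enterprise name.
    ordered = sorted(entity_sequence, key=lambda e: e[2])
    # Entities in the sequence must not overlap in the enterprise name.
    for prev, cur in zip(ordered, ordered[1:]):
        if prev[3] > cur[2]:
            raise ValueError("entities overlap in the enterprise name")
    names = [text for category, text, _, _ in ordered if category == "name"]
    return names[0] if names else None


sequence = [
    ("industry", "asset assessment", 3, 19),
    ("name", "XX", 0, 2),
    ("region", "shanghai", 21, 29),
    ("type", "limited", 31, 38),
]
print(identify_name_entity(sequence))
```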
Step 104: verifying the name entity and, if the verification succeeds, determining the name entity as the abbreviation of the enterprise name.
Specifically, the process may include:
The total word number of the name entity is acquired, and it is judged whether the total word number is greater than a first preset word number and less than a second preset word number; if so, the verification succeeds, otherwise the verification fails.
The first preset word number and the second preset word number may be set according to actual needs, for example, the value of the first preset word number may be set to 1, and the value of the second preset word number may be set to 6.
In this embodiment, by verifying the identified name entity, the name entity is determined as the abbreviation of the enterprise name in the case of successful verification, and the extraction accuracy of the abbreviation of the enterprise can be improved.
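The length check can be sketched as below, using the example thresholds 1 and 6 from the text. The function name, the three return labels and the `count` parameter are assumptions; `count` defaults to the character count on the assumption that each character of a Chinese name counts as one word.

```python
def verify_name_entity(name_entity, first_preset=1, second_preset=6, count=len):
    """Classify a name entity by its total word number."""
    n = count(name_entity)
    if n <= first_preset:
        return "too_short"   # not greater than the first preset word number
    if n >= second_preset:
        return "too_long"    # not less than the second preset word number
    return "ok"              # verification succeeds


for candidate in ("A", "AB", "ABCDE", "ABCDEF"):
    print(candidate, verify_name_entity(candidate))
```

The two failure labels correspond to the two follow-up branches described later in this embodiment (splicing for names that are too short, secondary matching for names that are too long).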
In one example, referring to fig. 2, the entity dictionaries used in step 102 may be constructed through the following steps:
Step 201: constructing an enterprise name sample library.
Specifically, data cleaning is carried out on an original enterprise name corpus, non-enterprise name data, numbers and other non-standard symbols are removed, an enterprise name sample library is obtained, and sample data are extracted from the enterprise name sample library for data inspection.
The original enterprise corpus includes company names in normalized name form and company names in non-normalized name form, and cleaning work is required before processing, including but not limited to:
1) removing data of non-company names, such as company names which are not recorded fully or are composed of all numbers;
2) establishing a stop word dictionary, that is, collecting fields that clearly do not contain keywords of the company abbreviation, such as "Changdu region" and "Korean family", as well as special symbols, and deleting the corresponding fields from the company names as a preliminary processing of the data;
after data cleaning is carried out on the original enterprise name corpus, an enterprise name sample library is obtained, and partial enterprise name sample data is extracted for data exploration.
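A minimal sketch of the cleaning step follows. The stop word list, the symbol set and the function name are hypothetical placeholders chosen for illustration; the source does not list them in full.

```python
import re

STOP_WORDS = ["(deregistered)"]       # hypothetical stop word dictionary
SPECIAL_SYMBOLS = re.compile(r"[*#\"']")  # hypothetical set of special symbols


def clean_enterprise_names(raw_names):
    """Drop non-company records and strip stop words and special symbols."""
    cleaned = []
    for name in raw_names:
        name = name.strip()
        # Remove empty records and names composed entirely of digits.
        if not name or name.isdigit():
            continue
        for stop_word in STOP_WORDS:
            name = name.replace(stop_word, "")
        name = SPECIAL_SYMBOLS.sub("", name).strip()
        if name:
            cleaned.append(name)
    return cleaned


raw = ["12345", " XX technology limited* ", "", "YY trading limited(deregistered)"]
print(clean_enterprise_names(raw))
```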
Step 202: extracting region names, industry names and enterprise types from the enterprise name sample library through an N-Gram algorithm.
In a specific application, the region dictionary can be constructed in combination with external region name data and is divided by administrative level into three region name dictionaries (provincial level, city level and prefecture level) so that matching is more precise. The industry dictionary is divided by the word count of its elements into two-word, three-word and four-word industry dictionaries, and four-word industry elements that contain a two-word industry element are de-duplicated, which improves the efficiency of the subsequent algorithm.
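One possible reading of the industry dictionary split is sketched below: elements are grouped by length, and four-word elements that already contain a two-word element are dropped. This interpretation of the de-duplication, the function name and the placeholder strings (each letter stands for one character of a Chinese word) are assumptions.

```python
def split_industry_dictionary(industry_words):
    """Group industry elements by length and de-duplicate four-word elements."""
    two = {w for w in industry_words if len(w) == 2}
    three = {w for w in industry_words if len(w) == 3}
    # Assumed rule: drop a four-word element if it contains a two-word element.
    four = {
        w for w in industry_words
        if len(w) == 4 and not any(short in w for short in two)
    }
    return two, three, four


two, three, four = split_industry_dictionary(["AB", "CD", "XYZ", "ABCD", "EFGH"])
print(sorted(two), sorted(three), sorted(four))
```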
Specifically, a region name, an industry name and an enterprise type are extracted from an enterprise name sample in an enterprise name sample library through a bi-gram model, a tri-gram model and a 4-gram model, and a region dictionary, an industry dictionary and an enterprise type dictionary are correspondingly established.
The N-Gram model is based on the following assumption: the occurrence of the n-th word depends only on the preceding n-1 words and on no other words, and the probability of the whole sentence equals the product of the probabilities of the individual words. Assuming that a sentence T consists of the word sequence w1, w2, w3, …, wn, the probability of each word can be obtained by counting occurrences in the corpus, where N(wi) denotes the number of occurrences of wi:
P(wi) = N(wi) / (N(w1) + N(w2) + N(w3) + … + N(wn));
After the probabilities of the words are sorted in descending order, the entities for the three dictionaries are screened and extracted.
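The frequency computation can be sketched as follows: character n-grams are counted over the sample names, each count is divided by the total count as in the formula above, and the candidates are ranked in descending order. The function names, the cut-off `k` and the toy sample are assumptions; deciding which ranked candidates enter which dictionary is left to the screening described in the text.

```python
from collections import Counter


def ngram_probabilities(names, n):
    """Relative frequency P(w) = N(w) / sum of all N over character n-grams."""
    counts = Counter()
    for name in names:
        for i in range(len(name) - n + 1):
            counts[name[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def top_candidates(names, sizes=(2, 3, 4), k=3):
    """Rank n-grams by descending probability for each n-gram size."""
    ranked = {}
    for n in sizes:
        probs = ngram_probabilities(names, n)
        ordered = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[n] = ordered[:k]
    return ranked


sample = ["abcd", "bcd"]  # toy stand-ins for enterprise names
print(top_candidates(sample, sizes=(2,), k=2))
```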
Step 203: constructing a region dictionary, an industry dictionary and an enterprise type dictionary by taking the region names, the industry names and the enterprise types, respectively, as dictionary elements.
Step 204: performing word segmentation on the enterprise name sample library through a preset word segmentation algorithm, and supplementing the region dictionary, the industry dictionary and the enterprise type dictionary according to the word segmentation results.
Specifically, the enterprise name sample library may be segmented using an HMM-based Chinese word segmentation technique.
In this embodiment, three entity dictionaries for regions, industries and enterprise types are established from the enterprise name sample library through the N-Gram model, and the dictionaries are supplemented using Chinese word segmentation, so that the entities in each entity dictionary are more comprehensive and the accuracy of the subsequent enterprise abbreviation extraction is improved.
In one example, the method may further comprise:
if the total word number of the name entity is judged to be not less than the second preset word number, matching the name entity according to the entity dictionaries and the regular expressions and the matching priority sequence of the regular expressions;
judging whether the entity sequence of the name entity is successfully matched;
if so, identifying the abbreviation of the enterprise name from the entity sequence of the name entity;
if not, screening the abbreviation of the enterprise name out of the name entity, and supplementing the remaining words of the name entity into the corresponding entity dictionary based on a Bootstrapping algorithm.
When the verification in step 104 fails because the identified name entity has too many words, the name entity may contain fields that do not belong to the abbreviation. For example, it may contain an industry field that is not in the industry dictionary, or the number of fields containing industry entities may exceed the maximum number N that can be matched (for example, N is set to 3). In this case, the name entity can be matched a second time using the regular expressions in combination with the dictionaries. If an entity sequence of the name entity is matched, the final abbreviation of the enterprise name is identified from that entity sequence. If no entity sequence can be matched, the abbreviation of the enterprise name is screened out of the name entity manually, and the remaining words of the name entity are added to the corresponding entity dictionary based on a Bootstrapping algorithm, so that subsequent enterprise name matching can use the updated entity dictionaries together with the regular expressions. This further improves the accuracy of enterprise abbreviation extraction.
For example, an incomplete pre-constructed entity dictionary can make the extracted enterprise abbreviation insufficiently concise: if the industry dictionary does not contain "nano", the abbreviation of "future nano technology limited company" is extracted as "future nano", which is an erroneous result. In this case, "future" is screened out of "future nano" manually as the enterprise abbreviation, and "nano" is added to the industry dictionary based on a Bootstrapping algorithm.
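The fallback for over-long name entities can be sketched as below. This is a simplified stand-in and not the patented procedure: the secondary match is reduced to a suffix lookup in the industry dictionary, the manual screening is modelled as an optional argument, and the Bootstrapping algorithm is reduced to adding the leftover words to the dictionary. All names are hypothetical.

```python
def resolve_long_name(name_entity, industry_dict, manual_abbreviation=None):
    """Shorten an over-long name entity, updating the dictionary if needed."""
    # Simplified secondary match: strip a known industry word at the end.
    for word in sorted(industry_dict, key=len, reverse=True):
        if name_entity.endswith(word) and len(name_entity) > len(word):
            return name_entity[: -len(word)].strip()
    # No match: fall back to a manually screened abbreviation, if provided.
    if manual_abbreviation is None:
        return None
    remainder = name_entity.replace(manual_abbreviation, "", 1).strip()
    if remainder:
        industry_dict.add(remainder)  # dictionary grows for later matching
    return manual_abbreviation


industry = {"technology"}
print(resolve_long_name("future nano", industry, manual_abbreviation="future"))
print(resolve_long_name("bright nano", industry))
```

After the first call, "nano" is in the dictionary, so the second name is shortened without manual input.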
In one example, the method may further comprise:
if the total word number of the name entity is judged to be not greater than the first preset word number, splicing the name entity with the previous entity or the next entity of the name entity in the entity sequence, and determining the splicing result as the abbreviation of the enterprise name.
When the verification in step 104 fails because the identified name entity has too few words, the cause may be ambiguity: a field that belongs to the enterprise abbreviation is mistakenly matched as an industry entity, leaving the abbreviation incomplete. For example, because "culture" is in the two-word industry dictionary, the "culture" in "beijing love culture limited company" is matched as an industry entity, so the extracted abbreviation is "love", which is wrong. Splicing "love" with the next entity, "culture", gives "love culture", the correct abbreviation.
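The splicing step can be sketched as follows. The source does not state how to choose between the previous and the next entity, so preferring the next entity is an assumption here, as are the function name, the whitespace-based word count (used because the example names are English renderings) and the joiner.

```python
def splice_short_name(entity_sequence, first_preset=1,
                      count=lambda s: len(s.split()), joiner=" "):
    """Extend a too-short name entity with a neighbouring entity."""
    idx = next(i for i, (cat, _) in enumerate(entity_sequence) if cat == "name")
    name = entity_sequence[idx][1]
    if count(name) > first_preset:
        return name  # long enough, nothing to splice
    if idx + 1 < len(entity_sequence):
        # Assumption: prefer the next entity, as in the example in the text.
        return name + joiner + entity_sequence[idx + 1][1]
    if idx > 0:
        return entity_sequence[idx - 1][1] + joiner + name
    return name


sequence = [
    ("region", "beijing"),
    ("name", "love"),
    ("industry", "culture"),
    ("type", "limited company"),
]
print(splice_short_name(sequence))
```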
Example two
The embodiment of the present invention provides an enterprise abbreviation extraction apparatus, which can be configured in any computer device so that the computer device can execute the enterprise abbreviation extraction method provided in the above embodiment. The computer device may be any of various terminals or a server, and the server may be implemented as a single server or as a cluster of servers.
Referring to fig. 3, the apparatus may include:
an obtaining module 31, configured to obtain a name of an enterprise;
a first matching module 32, configured to match the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, in the matching priority order of the regular expressions, where none of the entity dictionaries contains an enterprise name entity;
the identifying module 33 is configured to obtain an entity sequence matched with the enterprise name, and identify a name entity of the enterprise name according to an entity type of each entity in the entity sequence;
a checking module 34, configured to check the name entity;
and a determining module 35, configured to determine the name entity as the enterprise abbreviation when the checking module 34 successfully checks the name entity.
In one example, the first matching module 32 is specifically configured to:
traversing all regular expressions in sequence according to the matching priority order of all regular expressions;
during the traversal, if the currently traversed regular expression, combined with the plurality of entity dictionaries, successfully matches the enterprise name and yields an entity sequence, stopping the traversal; otherwise, continuing the traversal until a match succeeds.
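The priority-ordered traversal can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dictionary contents, the four templates and their order, and the names DICTS, TEMPLATES, compile_template, and match_entities are all assumptions made for the example. English tokens from the translated examples stand in for Chinese strings, which is why optional whitespace is allowed between entities.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical entity dictionaries; none of them contains name entities.
DICTS = {
    "region": ["Beijing", "Nanjing"],
    "industry": ["Technology", "Culture"],
    "type": ["Limited Company"],
}

# Hypothetical templates, listed from highest to lowest matching priority.
TEMPLATES = [
    ["region", "name", "industry", "type"],
    ["name", "region", "industry", "type"],
    ["region", "name", "type"],
    ["name", "type"],
]


def compile_template(template: List[str]) -> "re.Pattern[str]":
    """Build one regular expression from a template and the dictionaries."""
    parts = []
    for category in template:
        if category == "name":
            # The name is whatever the dictionaries do not account for.
            parts.append(r"(?P<name>.+?)")
        else:
            # Longest entries first so a longer entry wins over its prefix.
            words = sorted(DICTS[category], key=len, reverse=True)
            alternation = "|".join(re.escape(w) for w in words)
            parts.append(f"(?P<{category}>{alternation})")
    return re.compile(r"\s*".join(parts))


def match_entities(enterprise_name: str) -> Optional[List[Tuple[str, str]]]:
    """Try each template in priority order and stop at the first full match."""
    for template in TEMPLATES:
        match = compile_template(template).fullmatch(enterprise_name)
        if match:
            return [(match.group(c).strip(), c) for c in template]
    return None  # no template matched


print(match_entities("Beijing Love Culture Limited Company"))
```

The name entity is then the element of the returned sequence whose category is "name". Note that the sample call reproduces the ambiguity discussed in the description: "Culture" is taken as an industry entity, leaving "Love" as the name entity.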
Further, the checking module 34 is specifically configured to:
acquiring the total word number of the name entity;
judging whether the total word number of the name entity is larger than a first preset word number and smaller than a second preset word number or not;
if so, the verification is successful, otherwise, the verification fails.
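The check can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the thresholds 5 and 10 are made-up values, the "word number" is taken to be a character count, and the three-way result (rather than a plain success or failure) is a convenience added here so that the two failure cases can be routed to different follow-up steps.

```python
from enum import Enum


class Check(Enum):
    OK = "ok"                # first_preset < length < second_preset
    TOO_SHORT = "too_short"  # length <= first_preset
    TOO_LONG = "too_long"    # length >= second_preset


def check_name_entity(name: str, first_preset: int = 5, second_preset: int = 10) -> Check:
    """Verify a name entity by its length (thresholds are hypothetical)."""
    length = len(name)
    if length <= first_preset:
        return Check.TOO_SHORT
    if length >= second_preset:
        return Check.TOO_LONG
    return Check.OK


for candidate in ("Love", "Future", "Future Nano"):
    print(candidate, check_name_entity(candidate).value)
```

Verification succeeds only for Check.OK; the other two values correspond to the failure cases that lead to splicing or to a second round of matching.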
In one example, the apparatus further comprises a second matching module 36, and the second matching module 36 is specifically configured to:
if the checking module 34 determines that the total word number of the name entity is not less than the second preset word number, matching the name entity according to the entity dictionaries and the regular expressions and the matching priority order of the regular expressions;
the determining module 35 is further specifically configured to:
judging whether the second matching module 36 has successfully matched an entity sequence of the name entity;
if so, identifying the abbreviation of the enterprise name from the entity sequence of the name entity;
if not, screening out the abbreviation of the enterprise name from the name entity, and supplementing the remaining words in the name entity into the corresponding entity dictionary based on a Bootstrapping algorithm.
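The handling of an overly long name entity can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: the second matching is reduced to stripping trailing words found in an industry dictionary, the manual screening is represented by a caller-supplied choice, and a plain set update stands in for the Bootstrapping-based supplement, whose details the text does not give. The names refine_long_name and apply_manual_screening are hypothetical.

```python
from typing import Set, Tuple


def refine_long_name(name: str, industry_dict: Set[str]) -> Tuple[str, bool]:
    """Re-match a long name entity against the industry dictionary.

    Returns (abbreviation, needs_manual_screening). Trailing words that are
    known industry entities are dropped; if nothing can be dropped, the
    name is returned unchanged and flagged for manual screening.
    """
    words = name.split()
    kept = list(words)
    while len(kept) > 1 and kept[-1] in industry_dict:
        kept.pop()
    if len(kept) < len(words):
        return " ".join(kept), False
    return name, True


def apply_manual_screening(name: str, chosen: str, industry_dict: Set[str]) -> str:
    """Record a reviewer's choice and add the leftover words to the dictionary."""
    leftover = name.replace(chosen, "", 1).split()
    industry_dict.update(leftover)  # stand-in for the Bootstrapping supplement
    return chosen


industry = {"Technology"}
print(refine_long_name("Future Nano", industry))  # "Nano" is unknown, so flagged
print(apply_manual_screening("Future Nano", "Future", industry))
print(refine_long_name("Future Nano", industry))  # "Nano" is now known
```

Once "Nano" has been added to the dictionary, the same name entity is shortened automatically on the next pass.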
In one example, the determining module 35 is specifically configured to:
if the checking module 34 determines that the total word number of the name entity is not greater than the first preset word number, splicing the name entity with a previous entity or a next entity of the name entity in the entity sequence, and determining the splicing result as the enterprise abbreviation.
In an example, the apparatus further includes a saving module 37, where the saving module 37 is specifically configured to:
storing the enterprise name and the abbreviation of the enterprise name in a database in association with each other.
In one example, the apparatus further comprises a building module 30, the building module 30 being specifically configured to:
constructing an enterprise name sample library;
extracting each enterprise name in the enterprise name sample library through an N-Gram algorithm to obtain a region name, an industry name and an enterprise type;
respectively taking the area name, the industry name and the enterprise type as entities, and correspondingly constructing an area dictionary, an industry dictionary and an enterprise type dictionary;
and performing word segmentation processing on each enterprise name in the enterprise name sample library through a word segmentation algorithm, and supplementing a region dictionary, an industry dictionary and an enterprise type dictionary according to word segmentation results.
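The N-Gram extraction over a sample library can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the sample names, the min_count threshold, and the function names are assumptions, and the n-grams are taken over whitespace-separated English tokens, whereas Chinese enterprise names would more naturally be split into characters. Assigning each frequent n-gram to the region, industry, or enterprise type dictionary is not shown, since the text does not specify how that is decided.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical sample library of enterprise names.
NAMES = [
    "Beijing Love Culture Limited Company",
    "Beijing Future Nano Technology Limited Company",
    "Nanjing Star Technology Limited Company",
]


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(names: List[str], n: int, min_count: int = 2) -> Set[str]:
    """Collect n-grams that occur in at least min_count different names."""
    counts: Counter = Counter()
    for name in names:
        # A set is used so each n-gram is counted once per name.
        counts.update(set(ngrams(name.split(), n)))
    return {" ".join(gram) for gram, count in counts.items() if count >= min_count}


print(sorted(frequent_ngrams(NAMES, 1)))
print(sorted(frequent_ngrams(NAMES, 2)))
```

Fragments that recur across many names, such as "Limited Company", are candidates for dictionary entries, while fragments unique to one name, such as "Love", are more likely to belong to a name entity.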
It should be noted that the division into functional modules in the enterprise abbreviation extracting apparatus of this embodiment is only an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, for the specific implementation process and beneficial effects of the enterprise abbreviation extracting apparatus in this embodiment, refer to the enterprise abbreviation extraction method in the above embodiment; details are not repeated here.
Fig. 4 is an internal structural diagram of a computer device according to an embodiment of the present invention. The computer device may be a server, and its internal structure diagram may be as shown in fig. 4. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for enterprise abbreviation extraction.
Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computing devices to which aspects of the present invention may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is also provided a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring an enterprise name;
matching the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, in the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity;
acquiring an entity sequence matched with the enterprise name, and identifying a name entity of the enterprise name according to the entity type of each entity in the entity sequence;
and checking the name entity, and if the checking is successful, determining the name entity as the enterprise abbreviation.
In one embodiment, there is also provided a computer readable storage medium having a computer program stored thereon, the computer program when executed by a processor implementing the steps of:
acquiring an enterprise name;
matching the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, in the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity;
acquiring an entity sequence matched with the enterprise name, and identifying a name entity of the enterprise name according to the entity type of each entity in the entity sequence;
and checking the name entity, and if the checking is successful, determining the name entity as the enterprise abbreviation.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only some implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An enterprise abbreviation extraction method, the method comprising:
acquiring an enterprise name;
matching the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, in the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity;
acquiring an entity sequence matched with the enterprise name, and identifying a name entity of the enterprise name according to the entity category of each entity in the entity sequence;
and checking the name entity, and if the checking is successful, determining the name entity as the enterprise abbreviation.
2. The method according to claim 1, wherein the matching of the enterprise name against the plurality of preset entity dictionaries and the plurality of regular expressions, in the matching priority order of the regular expressions, comprises:
traversing each regular expression in sequence according to the matching priority order of the regular expressions;
during the traversal, if the currently traversed regular expression, combined with the plurality of entity dictionaries, successfully matches the enterprise name and yields an entity sequence, stopping the traversal; otherwise, continuing the traversal until a match succeeds.
3. The method of claim 1, wherein the checking of the name entity comprises:
acquiring the total word number of the name entity;
judging whether the total word number of the name entity is larger than a first preset word number and smaller than a second preset word number or not;
if so, the verification is successful, otherwise, the verification fails.
4. The method of claim 3, further comprising:
if the total word number of the name entity is judged to be not smaller than the second preset word number, matching the name entity according to the entity dictionaries and the regular expressions and the matching priority sequence of the regular expressions;
judging whether the entity sequence of the name entity is successfully matched;
if so, identifying the abbreviation of the enterprise name from the entity sequence of the name entity;
and if not, screening out the abbreviation of the enterprise name from the name entity, and supplementing the remaining words in the name entity into the corresponding entity dictionary based on a Bootstrapping algorithm.
5. The method of claim 3, further comprising:
if the total word number of the name entity is determined to be not greater than the first preset word number, splicing the name entity with a previous entity or a next entity of the name entity in the entity sequence, and determining the splicing result as the enterprise abbreviation.
6. The method according to any one of claims 1 to 5, further comprising the step of pre-constructing a plurality of said entity dictionaries:
constructing an enterprise name sample library;
extracting each enterprise name in the enterprise name sample library through an N-Gram algorithm to obtain a region name, an industry name and an enterprise type;
respectively taking the area name, the industry name and the enterprise type as entities, and correspondingly constructing an area dictionary, an industry dictionary and an enterprise type dictionary;
and performing word segmentation processing on each enterprise name in the enterprise name sample library through a word segmentation algorithm, and supplementing the region dictionary, the industry dictionary and the enterprise type dictionary according to word segmentation results.
7. An enterprise abbreviation extraction device, the device comprising:
an acquisition module, configured to acquire an enterprise name;
a first matching module, configured to match the enterprise name against a plurality of preset entity dictionaries and a plurality of regular expressions, in the matching priority order of the regular expressions, wherein none of the entity dictionaries contains an enterprise name entity;
an identification module, configured to acquire an entity sequence matched with the enterprise name and to identify a name entity of the enterprise name according to the entity category of each entity in the entity sequence;
a checking module, configured to check the name entity;
and a determining module, configured to determine the name entity as the enterprise abbreviation when the checking module successfully checks the name entity.
8. The device of claim 7, wherein the first matching module is specifically configured to:
traverse each regular expression in sequence according to the matching priority order of the regular expressions;
during the traversal, if the currently traversed regular expression, combined with the plurality of entity dictionaries, successfully matches the enterprise name and yields an entity sequence, stop the traversal; otherwise, continue the traversal until a match succeeds.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the enterprise abbreviation extraction method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, which stores a computer program, wherein the computer program, when executed by a processor, implements the enterprise abbreviation extraction method according to any one of claims 1 to 6.
CN202010542872.6A 2020-06-15 2020-06-15 Enterprise abbreviation extraction method and device, computer equipment and storage medium Pending CN111783460A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010542872.6A CN111783460A (en) 2020-06-15 2020-06-15 Enterprise abbreviation extraction method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111783460A true CN111783460A (en) 2020-10-16

Family

ID=72756499

Country Status (1)

Country Link
CN (1) CN111783460A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010694A (en) * 2021-04-19 2021-06-22 华北电力大学 Regular expression-based relay protection defect text proper noun dictionary construction method
CN113033208A (en) * 2021-04-21 2021-06-25 浙江非线数联科技股份有限公司 Government affair text data part-of-speech tagging-based enterprise owner matching method
CN113642867A (en) * 2021-07-30 2021-11-12 南京星云数字技术有限公司 Method and system for assessing risk
CN113987145A (en) * 2021-10-22 2022-01-28 智联(无锡)信息技术有限公司 Method, system, equipment and storage medium for accurately reasoning user attribute entity
CN113987145B (en) * 2021-10-22 2024-02-02 智联网聘信息技术有限公司 Method, system, equipment and storage medium for accurately reasoning user attribute entity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination