CN111783467A - Enterprise name identification method and device

Info

Publication number: CN111783467A
Application number: CN202010704340.8A
Authority: CN
Inventors: 谭树国; 艾青; 王征; 梁华欣
Original assignee: Beijing Yixin Zhicheng Credit Management Co ltd; Zhicheng Afu Technology Development Beijing Co ltd
Current assignee: Beijing Yixin Zhicheng Credit Management Co ltd; Zhicheng Afu Technology Development Beijing Co ltd
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2020-10-16

Abstract

The embodiments of the application disclose an enterprise name identification method and device. The method comprises: obtaining a text to be recognized, wherein the text to be recognized comprises at least one enterprise name; and obtaining a word to be queried from the text to be recognized. The word to be queried is used as a keyword to search in a search engine, and an alternative name set comprising alternative enterprise names is obtained from the search results. In other words, internet data is used to obtain more alternative enterprise names that include the word to be queried, which provides more reference data for subsequently determining the enterprise names. Each alternative enterprise name is matched against a pre-constructed enterprise name set, and an alternative enterprise name that matches one enterprise name in the set is determined to be a standard enterprise name. Because the search engine yields a large number of alternative enterprise names that include the word to be queried, enterprise names that may be present in the text to be recognized are less likely to be missed, which improves recognition accuracy and recall.

Description

Enterprise name identification method and device

Technical Field

The application relates to the technical field of data processing, in particular to an enterprise name identification method and device.

Background

At present, named entity recognition achieves good results only for limited text types and entity categories (mainly person names, place names, and the like). In most cases, named entity recognition for Chinese text is more complex, and entity boundaries are harder to identify.

In the prior art, named entity recognition mostly relies either on dictionaries and rule templates or on a recognition model trained with a deep learning algorithm. However, the key to training a recognition model is a high-quality training set, which requires extensive manual judgment and annotation and is time-consuming and labor-intensive. Moreover, because manual labeling is inefficient, the accuracy and recall of the trained recognition model are insufficient for more complex scenarios.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and an apparatus for identifying an enterprise name, so as to achieve more accurate identification of an enterprise name and improve accuracy and recall rate of identifying an enterprise name.

In order to solve the above problem, the technical solution provided by the embodiment of the present application is as follows:

In a first aspect of an embodiment of the present application, a method for identifying an enterprise name is provided, where the method may include:

acquiring a text to be recognized, wherein the text to be recognized at least comprises an enterprise name;

acquiring a word to be queried from the text to be recognized;

searching according to the word to be queried to obtain an alternative name set, wherein the alternative name set comprises at least one alternative enterprise name, and the alternative enterprise name comprises the word to be queried;

for any one of the alternative enterprise names, matching the alternative enterprise name with a pre-constructed enterprise name set;

and when the alternative enterprise name is matched with one enterprise name in the enterprise name set, determining the alternative enterprise name as a standard enterprise name.

In a possible implementation manner, the obtaining a word to be queried from the text to be recognized includes:

inputting the text to be recognized into an enterprise name extraction model to obtain at least one enterprise name to be processed;

performing word segmentation processing on the text to be recognized to obtain a word segment set, wherein the word segment set comprises at least one word segment;

and for any one of the enterprise names to be processed, combining the enterprise name to be processed with adjacent word segments according to their order of appearance in the text to be recognized to obtain the word to be queried.

In one possible implementation, the method further includes:

obtaining suffix information of the enterprise name to be processed;

and when the suffix information meets a preset condition, determining the name of the enterprise to be processed as a standard enterprise name.

In one possible implementation, when the alternative enterprise name matches a plurality of enterprise names in the enterprise name set, the method further includes:

combining the enterprise name to be processed with a preset number of preceding word segments and/or a preset number of following word segments to obtain a new word to be queried.

In one possible implementation, the obtaining a word to be queried from the text to be recognized includes:

performing word segmentation processing on the text to be recognized to obtain a word segment set, wherein the word segment set comprises at least one word segment;

judging whether the character length of a target word segment in the word segment set meets a preset length, wherein each word segment in the word segment set is taken in order as the target word segment;

when the character length of the target word segment meets the preset length, determining the target word segment as a word to be queried;

when the character length of the target word segment does not meet the preset length, combining the target word segment with a preset number of following word segments to generate a new target word segment that meets the preset length;

and determining the new target word segment as a word to be queried.

In one possible implementation, the determining the target word segment as a word to be queried includes:

inputting the target word segment into a local phrase discrimination model to obtain a target probability, wherein the local phrase discrimination model is used for determining the probability that the target word segment is a local phrase of an enterprise name;

and when the target probability is greater than a preset threshold, determining the target word segment as a word to be queried.

and combining the target word segment with a preset number of following word segments to generate a new target word segment, and determining the new target word segment as a word to be queried.

In one possible implementation, the method further includes:

when the target probability is less than or equal to the preset threshold and the target word segment was generated by combination, determining a later word segment within the target word segment as the new target word segment;

if the character length of the target word segment meets the preset length, determining the target word segment as a word to be queried;

if the character length of the target word segment does not meet the preset length, combining the target word segment with a preset number of following word segments to generate a new target word segment that meets the preset length;

and determining the new target word segment as a word to be queried.

In one possible implementation, the method further includes:

when the target probability is less than or equal to the preset threshold and the target word segment was not generated by combination, determining the next word segment in the word segment set as the target word segment;

and determining the target word segment as a word to be queried.

In a possible implementation manner, the matching the alternative enterprise name with a pre-constructed enterprise name set includes:

splitting the alternative enterprise name and each enterprise name in the enterprise name set according to the fields that compose an enterprise name to obtain information corresponding to each field;

matching first information corresponding to a target field with second information to obtain a matching result, wherein the first information is the information corresponding to the target field in the alternative enterprise name, the second information is the information corresponding to the target field in the enterprise name set, and the target field is any one of the fields;

and determining the matching result of the alternative enterprise name and the enterprise name set according to the matching result corresponding to each field.

In a second aspect of embodiments of the present application, there is provided an apparatus for identifying a business name, the apparatus including:

a first acquisition unit, configured to acquire a text to be recognized, wherein the text to be recognized comprises at least one enterprise name;

a second acquisition unit, configured to acquire a word to be queried from the text to be recognized;

a third acquisition unit, configured to search according to the word to be queried and obtain an alternative name set, wherein the alternative name set comprises at least one alternative enterprise name, and the alternative enterprise name includes the word to be queried;

a matching unit, configured to match any one of the alternative enterprise names with a pre-constructed enterprise name set;

and a determining unit, configured to determine the alternative enterprise name as a standard enterprise name when the alternative enterprise name matches one enterprise name in the enterprise name set.

Therefore, the embodiment of the application has the following beneficial effects:

First, a text to be recognized is obtained, wherein the text to be recognized comprises at least one enterprise name, and a word to be queried is obtained from the text to be recognized, for example by word segmentation processing. The word to be queried is then used as a keyword to search in a search engine, and an alternative name set is obtained from the search results, wherein the alternative name set comprises alternative enterprise names, each of which includes the word to be queried. That is, internet data is used to obtain more alternative enterprise names that include the word to be queried, which provides more reference data for subsequently determining the enterprise names. Each alternative enterprise name is then matched against a pre-constructed enterprise name set, and if it matches one enterprise name in the set, it is determined to be a standard enterprise name. Because the search engine yields a large number of alternative enterprise names that include the word to be queried, enterprise names that may be present in the text to be recognized are less likely to be missed, which improves recognition accuracy and recall.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flowchart of an enterprise name identification method according to an embodiment of the present disclosure;

Fig. 2 is a flowchart of a method for obtaining a word to be queried according to an embodiment of the present application;

Fig. 3 is a flowchart of another method for obtaining a word to be queried according to an embodiment of the present application;

Fig. 4 is a diagram of an enterprise name identification framework provided by an embodiment of the present application;

Fig. 5 is a structural diagram of an enterprise name identification device according to an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways than those specifically described and will be readily apparent to those of ordinary skill in the art without departing from the spirit of the present invention, and therefore the present invention is not limited to the specific embodiments disclosed below.

For the purpose of facilitating understanding of the embodiments of the present application, the terms referred to will be described below.

Named entity recognition refers to extracting named entities with specific meanings, such as person names, organization names, and place names, by a certain method.

Enterprise name identification is a kind of named entity recognition and is mainly used for extracting and identifying enterprise names in text.

Word segment combination: an entity is often formed by combining several independent words, so new words are generated by multi-level combination of the word segments obtained after word segmentation.

Knowledge verification analyzes and verifies the rules that entities follow, so as to improve the accuracy of the extracted entities.

Based on the above description, the business name identification method provided in the embodiment of the present application will be described below with reference to the drawings.

Method embodiment one

Referring to fig. 1, which is a flowchart of an enterprise name identification method provided in an embodiment of the present application, as shown in fig. 1, the method may include:

S101: Acquiring a text to be recognized, wherein the text to be recognized comprises at least one enterprise name.

In this embodiment, when the enterprise names contained in a certain text need to be acquired, the text is acquired and taken as the text to be recognized. Specifically, the text to be recognized may be news text in any field. For example, the text to be recognized is "On June 8, Beijing XXXX Limited Company (hereinafter abbreviated as XXXX) issued recruitment information, and the specific recruitment information is as follows: …; Beijing YYYY company published recruitment information …".

S102: Acquiring a word to be queried from the text to be recognized.

In this embodiment, after the text to be recognized is obtained, the word to be queried is obtained from the text to be recognized, so that subsequent search is performed by using the word to be queried. The word to be queried is a word which may be a name of a business. For example, the acquired words to be queried are "XXXX" and "YYYY".

Specifically, so that the words to be queried obtained from the text to be recognized cover as many potential enterprise names as possible, this embodiment provides two acquisition modes: one obtains the words to be queried by combining an enterprise name extraction model with word segmentation processing; the other obtains them by combining word segmentation processing with a local phrase discrimination model. In this embodiment, the two modes are used together when obtaining the words to be queried, which improves both the quantity and the quality of the words obtained. Specific implementations of the two acquisition modes are described in the following embodiments.

S103: Searching according to the word to be queried to obtain an alternative name set.

In this embodiment, after the word to be queried is obtained through S102, it is used as a keyword to search in a search engine, and an alternative name set is obtained from the search results. That is, searching with the word to be queried as the keyword returns a large number of search results that include the keyword, and the alternative enterprise names that include the word to be queried are extracted from the search results to form the alternative name set. For example, if the word to be queried is "information technology", the search returns a large number of alternative enterprise names that include it, such as "Beijing China Science and Technology Information Technology Limited" and "Shanghai High and Constant Information Technology Limited", and these alternative enterprise names constitute the alternative name set.
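The collection step in S103 can be illustrated with a minimal sketch. It assumes the search results have already been reduced to plain name strings; `build_candidate_set` and `fake_search` are hypothetical names introduced here, and `fake_search` is a fixed list standing in for a real search engine, which the source does not specify.

```python
from typing import Callable, Iterable, List


def build_candidate_set(
    query_word: str,
    search: Callable[[str], Iterable[str]],
) -> List[str]:
    """Collect alternative enterprise names that contain the word to be queried."""
    seen = set()
    candidates: List[str] = []
    for name in search(query_word):
        name = name.strip()
        # Keep a result only if it contains the query word and is not a duplicate.
        if query_word in name and name not in seen:
            seen.add(name)
            candidates.append(name)
    return candidates


def fake_search(keyword: str) -> List[str]:
    """Hypothetical stand-in for a search engine call (fixed results)."""
    return [
        "Beijing Alpha Information Technology Limited",
        "Shanghai Beta Information Technology Limited",
        "Shanghai Beta Information Technology Limited",  # duplicate hit
        "Gamma Catering Limited",  # does not contain the keyword
    ]


candidates = build_candidate_set("Information Technology", fake_search)
```

Passing the search function in as a parameter keeps the sketch independent of any particular search backend.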

In some implementations, after a plurality of alternative enterprise names that include the word to be queried have been obtained by searching, the alternative enterprise names are first screened, before being used in the subsequent matching, so that only those conforming to enterprise naming rules are retained. In particular, an enterprise name discrimination model may be used to verify and screen the alternative enterprise names. The enterprise name discrimination model can be trained on a positive sample training set and a negative sample training set, wherein the positive sample training set is a data set of enterprise names that conform to enterprise naming rules, and the negative sample training set is a data set of names that do not conform to those rules.

S104: For any alternative enterprise name, matching the alternative enterprise name with a pre-constructed enterprise name set.

Each alternative enterprise name in the alternative name set is matched against the pre-constructed enterprise name set to determine whether it matches one or more enterprise names in that set.

In some implementations, the present embodiment may employ a business name equivalence determination to determine whether the candidate business name matches a business name in the set of business names. Specifically, the method can be realized by the following steps:

1) Splitting the alternative enterprise name and the enterprise names in the enterprise name set according to the fields that compose an enterprise name, to obtain information corresponding to each field.

It is understood that an enterprise name is formed according to certain rules and consists of specific fields, such as administrative division information, trade name information (the "font size" or "word size" field), industry information, and enterprise suffix information. An administrative division is a region defined by the state for hierarchical administration. The trade name is the core element of an enterprise name; it is the most distinctive and important component and is used to distinguish different enterprises. The industry information indicates the industry to which the enterprise belongs. The enterprise suffix information indicates the organizational form and type of the enterprise, which mainly comprises three forms: sole proprietorships, partnerships, and companies. For example, the alternative enterprise name "Beijing Yigaiconing Management Limited" is split into "Beijing", "Yigaiconing", "credit management", and "limited".

Each enterprise name in the enterprise name set is split according to the same composition rule to obtain the information corresponding to each field of each enterprise name.

2) And matching the first information and the second information corresponding to the target field to obtain a matching result.

In this embodiment, for any field, the field is used as a target field, first information corresponding to the target field in the candidate enterprise name and second information corresponding to the target field in the enterprise name set are obtained, and the first information and the second information corresponding to the target field are matched. That is, the first information is matched with the second information corresponding to each business name, and a matching result corresponding to each field is obtained.

For example, when the target field is an administrative division, matching information corresponding to the target field in the candidate enterprise name with information corresponding to the target field in each enterprise name in the enterprise name set to obtain a matching result corresponding to the target field.

3) And determining a matching result of the alternative enterprise name and the enterprise name set according to the matching result corresponding to each field.

In this embodiment, the matching result between the alternative enterprise name and the enterprise name set may be determined after the matching results for all fields have been obtained, or as soon as the matching result for a particular field has been obtained. For example, when the target field is the enterprise trade name (the "font size" field) and the corresponding information in the alternative enterprise name is "good faith", and traversing the enterprise name set finds no second information equal to "good faith", the match between the alternative enterprise name and the enterprise name set is determined to have failed. The matching analysis above is then performed on the next alternative enterprise name in the alternative name set.

In addition, in practical applications an alternative enterprise name may match multiple enterprise names in the enterprise name set. For example, if the alternative enterprise name is "Puppy Technology" and the enterprise name set includes both "Beijing Puppy Technology Limited" and "Shenzhen Puppy Technology Limited", field matching yields two matching results for the alternative enterprise name. When there are multiple matching results, the alternative enterprise name needs to be corrected so that a unique matching result can be determined; the specific correction manner is described in the following embodiments.
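The field-by-field comparison in steps 1) to 3) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: names are assumed to be already split into four fields, the `NameFields` class and both functions are hypothetical, and a field that is absent from the alternative name is treated as unconstrained, which is one way to reproduce the "Puppy Technology" ambiguity described above.

```python
from dataclasses import dataclass, fields
from typing import Iterable, List


@dataclass(frozen=True)
class NameFields:
    """An enterprise name already split into its composing fields."""
    division: str = ""    # administrative division, e.g. "Beijing"
    trade_name: str = ""  # distinctive core of the name
    industry: str = ""    # industry the enterprise belongs to
    suffix: str = ""      # organizational form, e.g. "Limited"


def match_fields(candidate: NameFields, known: NameFields) -> bool:
    """Compare first information (candidate) with second information (known) per field."""
    for f in fields(NameFields):
        first = getattr(candidate, f.name)
        second = getattr(known, f.name)
        # Assumption: an empty candidate field places no constraint on the match.
        if first and first != second:
            return False  # a single mismatching field fails the whole match
    return True


def match_against_set(candidate: NameFields, name_set: Iterable[NameFields]) -> List[NameFields]:
    """Return every entry of the enterprise name set that the candidate matches."""
    return [known for known in name_set if match_fields(candidate, known)]


name_set = [
    NameFields("Beijing", "Puppy", "Technology", "Limited"),
    NameFields("Shenzhen", "Puppy", "Technology", "Limited"),
]
ambiguous = match_against_set(NameFields(trade_name="Puppy", industry="Technology"), name_set)
unique = match_against_set(NameFields("Beijing", "Puppy", "Technology", "Limited"), name_set)
missing = match_against_set(NameFields(trade_name="Good Faith"), name_set)
```

Here `ambiguous` holds two entries, which corresponds to the case that requires correction, while `unique` holds exactly one and `missing` is empty.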

S105: If the alternative enterprise name matches one enterprise name in the enterprise name set, determining the alternative enterprise name as a standard enterprise name.

When the above matching determines that the alternative enterprise name matches one enterprise name in the enterprise name set, this indicates that the alternative enterprise name is the name of an existing enterprise. The alternative enterprise name is therefore determined to be a standard enterprise name and stored, so that all possible enterprise names are extracted from the text to be recognized.

In this method, the search engine is used to obtain a large number of alternative enterprise names that include the word to be queried, so that enterprise names that may be present in the text to be recognized are less likely to be missed, which improves recognition accuracy and recall.
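As a rough end-to-end illustration of S101 to S105, the sketch below wires the steps together with injected components. All names are hypothetical, the toy extractor and search function are invented stand-ins, and exact set membership is used as a simplification of the field-level matching described in S104.

```python
from typing import Callable, Iterable, List, Set


def identify_enterprise_names(
    text: str,
    extract_query_words: Callable[[str], Iterable[str]],
    search: Callable[[str], Iterable[str]],
    known_names: Set[str],
) -> List[str]:
    """Run S101-S105 with injected, hypothetical components."""
    standard_names: List[str] = []
    for word in extract_query_words(text):      # S102: words to be queried
        for candidate in search(word):          # S103: alternative names
            if word not in candidate:
                continue  # an alternative name must contain the query word
            # S104/S105, simplified: exact membership replaces field matching.
            if candidate in known_names and candidate not in standard_names:
                standard_names.append(candidate)
    return standard_names


def toy_extract(text: str) -> List[str]:
    """Toy extractor: looks for two fixed placeholder words."""
    return [w for w in ("XXXX", "YYYY") if w in text]


def toy_search(word: str) -> List[str]:
    """Toy search: fabricates two names per query word."""
    return [f"Beijing {word} Limited", f"Shanghai {word} Trading Limited"]


known = {"Beijing XXXX Limited", "Beijing YYYY Limited"}
result = identify_enterprise_names(
    "Beijing XXXX Limited and Beijing YYYY posted job openings.",
    toy_extract,
    toy_search,
    known,
)
```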

In the above method, it is mentioned that the word to be queried can be obtained from the text to be recognized in two ways, which will be described below with reference to the accompanying drawings.

Method embodiment two

Referring to fig. 2, which is a flowchart of a method for obtaining a term to be queried according to an embodiment of the present application, the method may include:

S201: Inputting the text to be recognized into the enterprise name extraction model to obtain at least one enterprise name to be processed.

In this embodiment, after the text to be recognized is obtained, it is input into a pre-trained enterprise name extraction model, which extracts the enterprise names present in the text, thereby obtaining at least one enterprise name to be processed. The enterprise name extraction model is trained on a manually labeled enterprise name training set and may be a model that combines a bidirectional long short-term memory (BiLSTM) recurrent neural network with a conditional random field (CRF). A conditional random field is a discriminative probabilistic model commonly used for labeling or analyzing sequence data.

That is, in this embodiment, the enterprise name extraction model is first used to extract the enterprise names that may exist in the text to be recognized, and each extracted enterprise name is taken as an enterprise name to be processed.

In some implementations, after the enterprise name to be processed is obtained, S203 need not be executed immediately; instead, suffix information of the enterprise name to be processed is obtained first, and if the suffix information meets a preset condition, the enterprise name to be processed is determined to be a standard enterprise name. That is, when the obtained enterprise name to be processed has suffix information, it is first determined whether the suffix information satisfies the suffix rule; if so, the enterprise name to be processed can be determined to be a standard enterprise name directly, which improves the accuracy of enterprise name acquisition.
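A minimal sketch of the suffix check follows. The source does not state the preset condition, so the suffix list and the function name `has_valid_suffix` are assumptions made purely for illustration.

```python
from typing import Tuple

# Hypothetical suffix list; the actual preset condition is not given in the source.
ALLOWED_SUFFIXES: Tuple[str, ...] = ("Co., Ltd.", "Limited", "LLP")


def has_valid_suffix(name: str, allowed: Tuple[str, ...] = ALLOWED_SUFFIXES) -> bool:
    """Return True when the name ends with one of the allowed enterprise suffixes."""
    # str.endswith accepts a tuple and returns True if any element matches.
    return name.strip().endswith(tuple(allowed))


accepted = has_valid_suffix("Beijing XXXX Limited")
rejected = has_valid_suffix("XXXX")
```

A name that passes this check could be accepted directly, while one that fails would continue to S203.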

S202: Performing word segmentation processing on the text to be recognized to obtain a word segment set.

In this embodiment, word segmentation processing is also performed on the text to be recognized, yielding a word segment set that includes all of the word segments. The word segmentation technique may be any existing, relatively mature technique, such as a word segmentation method based on dictionary and lexicon matching, a method based on word frequency statistics, or a method based on knowledge understanding.

It should be noted that in this embodiment, the execution sequence of S201 and S202 may be to execute S202 first and then execute S201, or execute S201 and S202 simultaneously.

S203: For any enterprise name to be processed, combining the enterprise name to be processed with adjacent word segments according to their order of appearance in the text to be recognized to obtain the word to be queried.

After the enterprise name to be processed and the word segment set are obtained, the enterprise name to be processed is combined with adjacent word segments according to the order in which the enterprise name and each word segment appear in the text to be recognized, thereby obtaining the word to be queried.

Specifically, the enterprise name to be processed may be combined with one adjacent word segment or with several consecutive adjacent word segments. For example, if the text to be recognized is "XXX Company Xi'an Branch", the extracted enterprise name to be processed is "XXX Company", and the word segment set obtained is "XXX Company", "Xi'an", "Branch", then the word to be queried obtained by combination may be "XXX Company Xi'an" or "XXX Company Xi'an Branch".

In some implementations, after the word to be queried is obtained through S201 to S203, it is used to obtain alternative enterprise names, which are then matched against the enterprise name set. When an alternative enterprise name matches a plurality of enterprise names in the enterprise name set, the word to be queried is not precise enough and needs to be corrected. Specifically, the context is expanded by combining the enterprise name to be processed with a preset number of preceding word segments and/or a preset number of following word segments to obtain a new word to be queried, which is then used for the subsequent search and matching, until the alternative enterprise name matches exactly one enterprise name in the enterprise name set.
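The combination with preceding and following word segments can be sketched as a window over the word segment set. The function name, its parameters, and the sample segments are hypothetical; the separator defaults to an empty string on the assumption that Chinese segments are concatenated directly, and a space is used here only to keep the English example readable.

```python
from typing import List


def combine_with_context(
    segments: List[str],
    index: int,
    before: int = 0,
    after: int = 0,
    sep: str = "",
) -> str:
    """Join the segment at `index` with `before` preceding and `after` following segments."""
    start = max(0, index - before)               # do not run past the first segment
    end = min(len(segments), index + 1 + after)  # do not run past the last segment
    return sep.join(segments[start:end])


# Hypothetical segmentation in which index 1 holds the enterprise name to be processed.
segments = ["Notice", "ABC Company", "Northern", "Branch", "opened"]
one_after = combine_with_context(segments, 1, after=1, sep=" ")
two_after = combine_with_context(segments, 1, after=2, sep=" ")
both_sides = combine_with_context(segments, 1, before=1, after=2, sep=" ")
```

Widening `before` and `after` step by step gives progressively longer words to be queried, which is one way to realize the correction when a query matches several enterprise names.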

Method embodiment three

Referring to fig. 3, which is a flowchart of another method for obtaining a term to be queried according to an embodiment of the present application, the method may include:

S301: Performing word segmentation processing on the text to be recognized to obtain a word segment set.

In this embodiment, after the text to be recognized is obtained, word segmentation processing is performed on it to obtain all of the words it contains, which form the word segment set. The word segmentation technique may be any existing, relatively mature technique, such as a word segmentation method based on dictionary and lexicon matching, a method based on word frequency statistics, or a method based on knowledge understanding.

S302: Judging whether the character length of the target word segment in the word segment set meets a preset length; if so, executing S304; otherwise, executing S303.

In this embodiment, each word segment in the word segment set is taken in turn as the target word segment, and it is judged whether the character length of the target word segment meets the preset length; if so, S304 is executed; otherwise, S303 is executed. The preset length may be set according to the actual application. For example, if the preset length is 4, a target word segment containing at least 4 characters meets the preset length, and one containing fewer does not.

It should be noted that, when the character length of the target word segment does not meet the preset length, the target word segment is combined with following word segments in the order in which they appear in the text to be recognized; the target word segments are therefore taken from the word segment set in sequence.

S303: and combining the target word segmentation with a plurality of subsequent preset word segmentation to generate the target word segmentation.

In this embodiment, when the character length of the target participle in S301 does not satisfy the preset length, the target participle (the first target participle) is combined with a plurality of following preset participles, and the combined word is used as the target participle. Wherein the next preset segmentation words are one or more next segmentation words adjacent to the first target segmentation word. For example, if the set of participles is [ today is a good weather ], then when combined, the current target participle is "today", and the target participles that can be generated by combination can be "today is" or "today is one".

After the target participle is regenerated by combination, it is judged again whether the target participle meets the preset length. If so, S304 is executed; if not, combination continues until the combined target participle meets the preset length.

S304: determining the target participle as the word to be queried.

In this embodiment, a target participle whose character length satisfies the preset length is determined as the word to be queried, which ensures that the word to be queried at least satisfies the length rule for enterprise names.
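Steps S302 to S304 can be illustrated with a minimal Python sketch. The function name `words_to_query` and the placeholder participles are hypothetical. The sketch also assumes that every participle is tried as a starting point and that a trailing participle that can never reach the preset length is dropped; the description does not specify either detail.

```python
from typing import List


def words_to_query(participles: List[str], preset_length: int = 4) -> List[str]:
    """Return one candidate word to be queried per starting participle."""
    queries = []
    for start in range(len(participles)):
        target = participles[start]
        end = start + 1
        # S303: append following participles until the preset length is met
        # or no participles remain.
        while len(target) < preset_length and end < len(participles):
            target += participles[end]
            end += 1
        # S302/S304: keep the target only if it satisfies the preset length.
        if len(target) >= preset_length:
            queries.append(target)
    return queries


# Placeholder participles (assumed, not taken from the application).
example = words_to_query(["to", "day", "acme", "tech", "ltd"], preset_length=4)
```

Here "to" is too short and is combined with "day", while "acme" and "tech" already satisfy the preset length and are used as they are.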

In some implementations, after the target participle is obtained, participles that are unlikely to appear in an enterprise name, such as "local time" or "drama words", may be excluded. To this end, before the target participle is determined as the word to be queried, a local phrase discrimination model is used to screen for participles that are likely to be part of an enterprise name. Specifically, the target participle is input into the local phrase discrimination model to obtain a target probability, and when the target probability is greater than a preset threshold, the target participle is determined as the word to be queried. The local phrase discrimination model determines the probability that an input phrase is a local phrase of an enterprise name, and is trained on a manually extracted training set of phrases that appear in enterprise names.

When the target probability corresponding to the target participle is greater than the preset threshold, the target participle is determined as the word to be queried. The preset threshold may be set according to the actual application; for example, the preset threshold is 0.8.
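The threshold test can be sketched as follows. The trained local phrase discrimination model is not specified in this application, so the sketch substitutes a hypothetical lookup table (`TOY_PROBABILITIES`) for it; `select_query_words` and `toy_model` are likewise illustrative names.

```python
from typing import Callable, Iterable, List


def select_query_words(
    targets: Iterable[str],
    local_phrase_probability: Callable[[str], float],
    threshold: float = 0.8,
) -> List[str]:
    """Keep only targets whose probability is strictly greater than the threshold."""
    return [t for t in targets if local_phrase_probability(t) > threshold]


# Stand-in for a trained model: fixed, made-up probabilities.
TOY_PROBABILITIES = {"acmenetwork": 0.93, "localtime": 0.05}


def toy_model(phrase: str) -> float:
    return TOY_PROBABILITIES.get(phrase, 0.0)


selected = select_query_words(["localtime", "acmenetwork"], toy_model)
```

A probability exactly equal to the threshold is rejected, consistent with the "greater than" condition above.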

When the target probability is less than or equal to the preset threshold, the target participle is not a phrase contained in a standard enterprise name and needs to be corrected. How the target participle is corrected depends on how it was formed.

In one case, when the target participle is a word generated by combination, the later participle in the target participle is determined as the new target participle. It is then judged whether the character length of the new target participle meets the preset length. If so, the target participle is determined as the word to be queried; if not, the target participle is again combined with a preset number of following participles until it meets the preset length, and is then determined as the word to be queried. For example, if the target participle is the combined word "today's network communication technology", the later participle "network communication technology" is determined as the new target participle; if the preset length is 4, "network communication technology" is determined as the word to be queried. If the preset length is 6, "network communication technology" is combined with one or more of the participles that follow it; if the combination yields "network communication technology limited company", that word is determined as the word to be queried.

In another case, when the target participle is not a word generated by combination, the next participle in the participle set is determined as the target participle. If the character length of the target participle meets the preset length, the target participle is determined as the word to be queried; if not, the target participle is combined with a preset number of following participles to generate a new target participle, and the target participle that meets the preset length is determined as the word to be queried. For example, if the current target participle is "local time" and its target probability is less than the preset threshold, the next participle, "network communication technology", is taken as the target participle. If its character length, 4, meets the preset length, "network communication technology" is used as the word to be queried; if not, it is combined with one or more subsequent participles to generate a target participle that meets the preset length, which is determined as the word to be queried.
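The two correction cases can be sketched together in Python. This is an illustrative reading, not the definitive procedure: it assumes that "the later participle" of a combined word means dropping the leading participle, and that a corrected target is scored again before being accepted (as the reference to subsequent operations in the architecture description suggests). The names `build_target` and `first_query_word`, the scoring table, and the participles are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def build_target(participles: List[str], start: int, preset_length: int) -> Tuple[str, int]:
    """Combine participles from `start` until the preset length is met."""
    target = participles[start]
    end = start + 1
    while len(target) < preset_length and end < len(participles):
        target += participles[end]
        end += 1
    return target, end  # `end` is the index just past the last participle used


def first_query_word(
    participles: List[str],
    score: Callable[[str], float],
    preset_length: int = 4,
    threshold: float = 0.8,
) -> Optional[str]:
    start = 0
    while start < len(participles):
        target, end = build_target(participles, start, preset_length)
        if len(target) < preset_length:
            return None  # ran out of participles before reaching the length
        if score(target) > threshold:
            return target
        if end - start > 1:
            # Combined word: keep its later part by dropping the leading participle.
            start += 1
        else:
            # Single participle: move on to the next participle in the set.
            start = end
    return None


# Made-up probabilities standing in for the local phrase discrimination model.
toy_scores = {"todayacme": 0.1, "acmenetwork": 0.95}
result = first_query_word(
    ["today", "acme", "network", "ltd"],
    lambda w: toy_scores.get(w, 0.0),
    preset_length=6,
)
```

In this sketch both branches advance by exactly one participle; they are kept separate only to mirror the two cases in the text.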

In some implementations, when a candidate enterprise name found using the word to be queried obtained in this embodiment matches a plurality of enterprise names in the enterprise name set, the word to be queried corresponding to that candidate enterprise name needs to be corrected in order to determine a uniquely matching enterprise name. Specifically, the target participle corresponding to the word to be queried is combined with a preset number of following participles to generate a new target participle, which is determined as the word to be queried. That is, increasing the number of participles included in the word to be queried allows more accurate matching in subsequent matching.
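A minimal sketch of this refinement loop follows. The search engine is replaced by a hypothetical `toy_search` over an in-memory list, matching is simplified to exact membership in the enterprise name set, and the sketch returns nothing when no name matches, a case the description leaves open. All names and data are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Set


def refine_until_unique(
    target: str,
    following: List[str],
    search: Callable[[str], List[str]],
    name_set: Set[str],
    max_extra: int = 3,
) -> Optional[str]:
    """Lengthen the word to be queried until exactly one enterprise name matches."""
    for extra in range(min(max_extra, len(following)) + 1):
        query = target + "".join(following[:extra])
        candidates = search(query)
        matched = {name for name in candidates if name in name_set}
        if len(matched) == 1:
            return next(iter(matched))
        if not matched:
            return None  # nothing to disambiguate
        # Several names matched: add one more following participle and retry.
    return None


# Made-up enterprise name set; the toy search returns names containing the query.
NAME_SET = {"Acme Network Technology Co., Ltd.", "Acme Network Trading Co., Ltd."}


def toy_search(query: str) -> List[str]:
    return [name for name in sorted(NAME_SET) if query in name]


unique = refine_until_unique("Acme Network", [" Technology", " Co., Ltd."], toy_search, NAME_SET)
```

The first query, "Acme Network", matches both names; adding the following participle " Technology" leaves a single match.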

It should be noted that, in practical applications, for the same text to be recognized, the embodiments described in FIG. 2 and FIG. 3 may both be used to obtain words to be queried from the text to be recognized. Subsequent search and matching are then performed on the words to be queried obtained by each method, and the standard enterprise names determined by the two methods are merged, so as to obtain as many of the enterprise names included in the text to be recognized as possible and to improve the accuracy and recall of enterprise name identification.

To facilitate an understanding of the overall architecture of the present application, reference is made to the enterprise name identification architecture diagram shown in FIG. 4. The same text to be recognized can be processed in two different ways. In the first, an enterprise name extraction model is used to extract an enterprise name to be processed from the text to be recognized, and word segmentation is performed on the text to be recognized to obtain a participle set. The enterprise name to be processed is combined with its adjacent participles to form a word to be queried, and a search is performed with the word to be queried to obtain an alternative name set. Each alternative enterprise name in the alternative name set is matched against the enterprise name set; if an alternative enterprise name successfully matches one enterprise name in the enterprise name set, it is determined as a standard enterprise name, which completes the identification of the enterprise name. If a plurality of enterprise names are matched, the enterprise name to be processed is combined with more participles to regenerate the word to be queried, and search and matching are repeated until exactly one enterprise name is matched.

In this implementation, before the enterprise name to be processed is combined with participles to obtain the word to be queried, the suffix of the enterprise name to be processed may be verified first. If the suffix passes the verification, the enterprise name to be processed is directly determined as a standard enterprise name. If it does not pass the verification, the enterprise name to be processed is combined with participles to generate the word to be queried.
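The suffix check can be sketched as below. The application only states that the suffix information must satisfy a preset condition; the sketch assumes that the condition is "ends with one of a fixed list of suffixes", and the list `ASSUMED_SUFFIXES` and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumed suffix list; the actual preset condition is not given in the text.
ASSUMED_SUFFIXES: Tuple[str, ...] = ("Co., Ltd.", "Inc.", "有限公司")


def passes_suffix_check(name: str, suffixes: Tuple[str, ...] = ASSUMED_SUFFIXES) -> bool:
    """Return True if the name ends with one of the assumed suffixes."""
    return name.strip().endswith(suffixes)


def name_or_query(pending_name: str, following: List[str], n_following: int = 1) -> Tuple[str, str]:
    """Accept the name directly, or build a word to be queried from it."""
    if passes_suffix_check(pending_name):
        return ("standard_name", pending_name)
    # Verification failed: combine with following participles to form a query.
    return ("query", pending_name + "".join(following[:n_following]))


accepted = name_or_query("Acme Network Technology Co., Ltd.", [])
needs_search = name_or_query("Acme Network", [" Technology", " Co., Ltd."])
```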

In the second, word segmentation is performed on the text to be recognized to obtain a participle set. The participles in the participle set are taken in turn as the target participle, and it is first judged whether the character length of the target participle meets the preset length. If so, the enterprise name local phrase discrimination model is used to obtain the target probability that the target participle is a local phrase; if not, the target participle is combined with a preset number of adjacent following participles to regenerate the target participle, until its character length meets the preset length.

Once the target probability corresponding to the target participle is obtained, it is judged whether the target probability is greater than the preset threshold. If so, the target participle is used as the word to be queried, and a search engine is used to search with it and obtain an alternative name set. Each alternative enterprise name in the alternative name set is matched against the enterprise name set until an alternative enterprise name successfully matches one enterprise name in the enterprise name set.

When the target probability is not greater than the preset threshold and the target participle is a word generated by combination, the later participle in the target participle is taken as the target participle, and subsequent operations such as character length verification are performed. If the target participle is not a word generated by combination, the participle that follows the current target participle in the participle set is taken as the target participle, and subsequent operations such as character length verification are performed.

Therefore, the embodiments of the present application use internet resources to compensate for insufficient training sets, and avoid the low recall rate that results from relying solely on model training when samples are insufficient. The word segmentation, combination and matching logic makes it possible to traverse long texts quickly and combine related words, and internet resources are used to find and screen more alternative enterprise names that include the word to be queried. This greatly reduces the work of manual verification and data labeling and improves the accuracy of enterprise name identification.

Based on the foregoing method embodiments, an embodiment of the present application provides an enterprise name identification device. As shown in the structure diagram of FIG. 5, the device may include:

a first obtaining unit 501, configured to obtain a text to be recognized, where the text to be recognized at least includes an enterprise name;

a second obtaining unit 502, configured to obtain a word to be queried from the text to be recognized;

a third obtaining unit 503, configured to perform a search according to the word to be queried, and obtain an alternative name set, where the alternative name set includes at least one alternative enterprise name, and the alternative enterprise name includes the word to be queried;

a matching unit 504, configured to match, for any one of the candidate enterprise names, the candidate enterprise name with a pre-constructed enterprise name set;

a determining unit 505, configured to determine the candidate business name as a standard business name when the candidate business name matches one of the business names in the business name set.

In a possible implementation manner, the second obtaining unit includes:

the first obtaining subunit is used for inputting the text to be recognized into an enterprise name extraction model and obtaining at least one enterprise name to be processed;

the first word segmentation processing subunit is used for carrying out word segmentation processing on the text to be recognized to obtain a word segmentation set, wherein the word segmentation set comprises at least one word segmentation;

and the first combination subunit is used for combining, for any enterprise name to be processed, the enterprise name to be processed with its adjacent participles according to their order of appearance in the text to be recognized, to obtain the word to be queried.

In one possible implementation, the apparatus further includes:

the fourth acquisition unit is used for acquiring suffix information of the enterprise name to be processed;

the determining unit is further configured to determine the name of the to-be-processed enterprise as a standard enterprise name when the suffix information satisfies a preset condition.

In a possible implementation manner, when the candidate enterprise name is matched with a plurality of enterprise names in the enterprise name set, the first combining subunit is further configured to combine the enterprise name to be processed with a preset number of preceding participles and/or a preset number of following participles to obtain a word to be queried.

In a possible implementation manner, the second obtaining unit includes:

the second word segmentation processing subunit is used for performing word segmentation processing on the text to be recognized to obtain a word segmentation set, wherein the word segmentation set at least comprises one word segmentation;

the judging subunit is used for judging whether the character length of a target participle in the participle set meets a preset length, wherein the participles in the participle set are taken in sequence as the target participle;

the first determining subunit is used for determining the target word segmentation as a word to be queried when the character length of the target word segmentation meets the preset length;

the second combination subunit is used for combining, when the character length of the target participle does not meet the preset length, the target participle with a preset number of following participles to generate a new target participle that meets the preset length;

the first determining subunit is further configured to determine the target word segmentation as a word to be queried.

In a possible implementation manner, the first determining subunit is specifically configured to input the target word segmentation into a local phrase decision model, and obtain a target probability, where the local phrase decision model is configured to determine a probability that the target word segmentation is a local phrase of an enterprise name; and when the target probability is greater than a preset threshold value, determining the target word segmentation as a word to be queried.

In a possible implementation manner, when the candidate enterprise name is matched with a plurality of enterprise names in the enterprise name set, the second combining subunit is further configured to combine the target participle with a plurality of following preset participles to generate a target participle, and determine the target participle as a word to be queried.

In a possible implementation manner, the first determining subunit is further configured to determine, when the target probability is smaller than or equal to the preset threshold and the target participle is a word generated by combination, a subsequent participle in the target participle as a target participle;

and determining the target word segmentation as a word to be queried.

In a possible implementation manner, the first determining subunit is further configured to determine, when the target probability is smaller than or equal to the preset threshold and the target participle is a non-combination generated word, a next participle in the participle set as a target participle;

and determining the target word segmentation as a word to be queried.

In one possible implementation manner, the matching unit includes:

the splitting subunit is used for splitting the alternative enterprise name and the enterprise names in the enterprise name set according to the constituent fields of an enterprise name, to obtain the information corresponding to each field;

a matching subunit, configured to match first information corresponding to a target field with second information to obtain a matching result, where the first information is the information corresponding to the target field in the candidate enterprise name, the second information is the information corresponding to the target field in the enterprise name set, and the target field is any one of the fields;

and the second determining subunit is configured to determine, according to the matching result corresponding to each field, a matching result between the candidate enterprise name and the enterprise name set.
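The field-by-field matching performed by these subunits can be sketched as follows. The field names in `FIELDS` are hypothetical, the names are assumed to have been split into fields already, and the rule that a field missing from the candidate matches any value is an assumption introduced for this illustration only.

```python
from typing import Dict, List, Optional

# Hypothetical constituent fields of an enterprise name.
FIELDS = ("region", "trade_name", "industry", "organization_form")
Name = Dict[str, Optional[str]]


def field_results(candidate: Name, reference: Name) -> Dict[str, bool]:
    """Match the first information (candidate) against the second (reference) per field."""
    results = {}
    for field in FIELDS:
        first = candidate.get(field)
        second = reference.get(field)
        # Assumed rule: a field absent from the candidate does not rule out a match.
        results[field] = first is None or first == second
    return results


def matching_names(candidate: Name, name_set: List[Name]) -> List[Name]:
    """Return the enterprise names whose every field matches the candidate."""
    return [ref for ref in name_set if all(field_results(candidate, ref).values())]


# Made-up enterprise name set, already split into fields.
name_set = [
    {"region": "Beijing", "trade_name": "Acme", "industry": "Technology", "organization_form": "Co., Ltd."},
    {"region": "Shanghai", "trade_name": "Acme", "industry": "Technology", "organization_form": "Co., Ltd."},
]
ambiguous = matching_names({"trade_name": "Acme", "industry": "Technology"}, name_set)
unique = matching_names({"region": "Beijing", "trade_name": "Acme"}, name_set)
```

A candidate that lacks the region field matches both entries, which corresponds to the case where the word to be queried has to be lengthened before a unique match is found.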

It should be noted that, implementation of each unit in this embodiment may refer to the above method embodiment, and this embodiment is not described herein again.

In addition, an embodiment of the present application provides an apparatus, including: a processor and a memory;

the memory is configured to store instructions;

the processor is configured to execute the instructions in the memory so as to perform the enterprise name identification method described above.

Embodiments of the present application provide a computer-readable storage medium storing program code or instructions which, when run on a computer, cause the computer to perform the enterprise name identification method described above.

It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the system or the device disclosed by the embodiment, the description is simple because the system or the device corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.

It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for identifying a business name, the method comprising:

acquiring a word to be queried from the text to be recognized;

2. The method according to claim 1, wherein the obtaining a word to be queried from the text to be recognized comprises:

3. The method of claim 2, further comprising:

obtaining suffix information of the enterprise name to be processed;

4. The method of claim 2, wherein when the candidate business name matches a plurality of business names in the set of business names, the method further comprises:

5. The method according to claim 1, wherein the obtaining a word to be queried from the text to be recognized comprises:

and determining the target word segmentation as a word to be queried.

6. The method of claim 5, wherein the determining the target participle as a word to be queried comprises:

7. The method of claim 5 or 6, wherein when the candidate business name matches a plurality of business names in the set of business names, the method further comprises:

8. The method of claim 6, further comprising:

and determining the target word segmentation as a word to be queried.

9. The method of claim 6, further comprising:

and determining the target word segmentation as a word to be queried.

10. The method of claim 1, wherein matching the candidate business name with a set of pre-constructed business names comprises:

11. An apparatus for identifying a business name, the apparatus comprising: