CN108710671B - Method and device for extracting company name in text - Google Patents

Method and device for extracting company name in text

Info

Publication number
CN108710671B
CN108710671B CN201810471361.2A CN201810471361A CN108710671B CN 108710671 B CN108710671 B CN 108710671B CN 201810471361 A CN201810471361 A CN 201810471361A CN 108710671 B CN108710671 B CN 108710671B
Authority
CN
China
Prior art keywords
node
characters
matched
character
company name
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810471361.2A
Other languages
Chinese (zh)
Other versions
CN108710671A (en)
Inventor
黄文瀚
程浩
柳超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jindi Technology Co Ltd
Original Assignee
Beijing Jindi Technology Co Ltd
Application filed by Beijing Jindi Technology Co Ltd
Priority to CN201810471361.2A
Publication of CN108710671A
Application granted
Publication of CN108710671B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for extracting company names in texts, wherein the method comprises the following steps: acquiring a text to be extracted, and determining characters to be matched in the text to be extracted; matching the characters in the characters to be matched with node characters along a node path in a company name prefix tree dictionary according to the character sequence of the characters to be matched until the characters cannot be matched with the node path, so as to obtain the longest matching substring matched with the node characters in the node path in the text to be extracted; judging whether the node corresponding to the last character of the longest matching substring is an end node or not; if so, the longest matching substring is taken as the company name. When the company name is extracted, the time consumption is far less than that of the traditional extraction method, the extraction speed is high, the process is simple, the accuracy is good, and the technical problems of complexity, time consumption and poor accuracy of the existing company name extraction method are solved.

Description

Method and device for extracting company name in text
Technical Field
The invention relates to the technical field of information retrieval, in particular to a method and a device for extracting company names in texts.
Background
A company is a party to business activity, and company names appear frequently in business and financial information. If company names can be extracted from such information accurately and quickly, and a link to each company's details embedded in it, users can learn about a company much more easily.
The traditional approach to company name extraction is a rule-based natural language processing algorithm. It first segments the whole document into words and then matches the segmentation results against preset sequence rules; if the match succeeds, the current word sequence is taken to be a company name. This approach requires several steps on the text, such as part-of-speech tagging and sequence rule setting, so it is slow. Its accuracy also depends heavily on the accuracy of the word segmentation.
In another prior-art approach, deciding whether text characters form a company name requires traversing the company names in a dictionary one by one and matching each against the text. This identifies company names, but its time complexity is O(N × T), where N is the number of companies in the dictionary and T is the text length. The approach is accurate and effective when N is small, but when N is large it is too slow to be usable. The number of company names now exceeds 1.2 billion, so this approach is clearly impractical.
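The one-by-one traversal described above can be illustrated with a minimal Python sketch. The function name `extract_naive` and the toy company names are hypothetical and do not come from the patent; the sketch only shows why the work grows with both the dictionary size N and the text length T.

```python
def extract_naive(text, company_names):
    """Check every dictionary entry against the text (the prior-art baseline)."""
    found = []
    for name in company_names:              # N dictionary entries
        start = text.find(name)             # each search scans the text of length T
        while start != -1:
            found.append((start, name))
            start = text.find(name, start + 1)
    return sorted(found)


# Hypothetical toy dictionary and text, for illustration only.
names = ["Acme Ltd", "Bolt Inc"]
print(extract_naive("Acme Ltd invested in Bolt Inc", names))
# → [(0, 'Acme Ltd'), (21, 'Bolt Inc')]
```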
In conclusion, the existing extraction method for company names is complex, time-consuming and poor in accuracy.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for extracting a company name from a text, so as to solve the technical problems of complexity, time consumption and poor accuracy of the existing method for extracting a company name.
In a first aspect, an embodiment of the present invention provides a method for extracting a company name in a text, where the method includes:
acquiring a text to be extracted, and determining characters to be matched in the text to be extracted;
matching the characters to be matched, in character order, against node characters along a node path in a company name prefix tree dictionary until no further character can be matched, so as to obtain the longest matching substring in the text to be extracted that matches the node characters of the node path, wherein the company name prefix tree dictionary comprises a plurality of nodes and node paths composed of the nodes; each node other than the root node corresponds to one node character, and each node other than the root node is one of the following: a start node, an intermediate node, or an end node;
judging whether the node corresponding to the last character of the longest matching substring is an end node or not;
and if so, taking the longest matching substring as the company name.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where, if the node corresponding to the last character of the longest matching substring is not an end node, the characters located after a first target character in the text to be extracted are taken as the characters to be matched, and the method returns to the step of matching the characters to be matched, in character order, against node characters along a node path in the company name prefix tree dictionary, where the first target character is the character immediately following the first character of the longest matching substring in the text to be extracted.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where after taking the longest matching substring as a company name, the method further includes:
taking the characters located after a second target character in the text to be extracted as the characters to be matched, and returning to the step of matching the characters to be matched, in character order, against node characters along a node path in the company name prefix tree dictionary, until all characters in the text to be extracted have been traversed, wherein the second target character is the character immediately following the last character of the longest matching substring in the text to be extracted.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where matching the characters to be matched against node characters along a node path in the company name prefix tree dictionary includes:
matching the first character in the characters to be matched with the node character corresponding to the starting node;
and if the first character is matched with the node character corresponding to the starting node, starting from the second character in the characters to be matched, and matching the node characters of the characters in the rest characters to be matched along the node path in the company name prefix tree dictionary.
With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, where matching the characters to be matched against node characters along a node path in the company name prefix tree dictionary further includes:
if the first character is not matched with the node character corresponding to the starting node, judging whether a second character in the characters to be matched is matched with the node character corresponding to the starting node;
and if the second character is matched with the node character corresponding to the starting node, starting from the third character in the characters to be matched, and matching the node characters of the characters in the rest characters to be matched along the node path in the company name prefix tree dictionary.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where before obtaining the text to be extracted, the method further includes:
obtaining a company name dictionary, wherein the company name dictionary comprises a plurality of company names;
adding a character attribute to each character of each company name in the company name dictionary to obtain a company name dictionary carrying character attributes, wherein the character attribute of each character in the company name dictionary carrying character attributes comprises any one of the following characters: starting characters, middle characters and ending characters;
and combining the company name dictionary carrying the character attributes and a preset prefix tree construction rule to construct the company name prefix tree dictionary.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where the preset prefix tree construction rule includes:
the root node is a null node, and each node except the root node corresponds to a node character, wherein the root node is a node on the upper layer of the starting node;
in a node path from the starting node to the ending node, corresponding node characters form the company name, wherein the node characters corresponding to the starting node are matched with the starting characters, and the node characters corresponding to the ending node are matched with the ending characters;
in all nodes on the next layer of each node, the corresponding node characters of any two nodes are different;
starting from the start node, only one node is represented when there are successive repeated node characters.
In a second aspect, an embodiment of the present invention further provides an apparatus for extracting a company name in a text, where the apparatus includes:
the determining module is used for acquiring a text to be extracted and determining characters to be matched in the text to be extracted;
a matching module, configured to perform matching of node characters on the characters in the characters to be matched along a node path in a company name prefix tree dictionary according to the character sequence of the characters to be matched until the characters cannot be matched, so as to obtain a longest matching substring matched with the node characters in the node path in the text to be extracted, where the company name prefix tree dictionary includes: the system comprises a plurality of nodes and a node path composed of the nodes, wherein except for a root node, each node corresponds to a node character, and except for the root node, each node comprises any one of the following components: a start node, an intermediate node, and an end node;
the judging module is used for judging whether the node corresponding to the last character of the longest matching substring is an end node or not;
and the first setting module is used for taking the longest matching substring as the company name if the node corresponding to its last character is an end node.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner of the second aspect, where the apparatus further includes:
a second setting module, configured to, if the node corresponding to the last character of the longest matching substring is not an end node, take the characters located after a first target character in the text to be extracted as the characters to be matched, and return to the step of matching the characters to be matched, in character order, against node characters along a node path in the company name prefix tree dictionary, where the first target character is the character immediately following the first character of the longest matching substring in the text to be extracted.
With reference to the second aspect, an embodiment of the present invention provides a second possible implementation manner of the second aspect, where the apparatus further includes:
and the third setting module is used for taking the characters behind the second target character in the text to be extracted as the characters to be matched, returning to execute the step of matching the characters in the characters to be matched with the node characters along the node path in the company name prefix tree dictionary according to the character sequence of the characters to be matched until all the characters in the text to be extracted are traversed, wherein the second target character is the next character of the last character of the longest matching substring in the text to be extracted.
The embodiment of the invention has the following beneficial effects:
One existing company name extraction method is a rule-based natural language processing algorithm, which requires several steps on the text, such as part-of-speech tagging and sequence rule setting, and is therefore slow; its accuracy also depends heavily on the accuracy of word segmentation. The other existing method traverses the company names one by one when judging whether text characters form a company name, which is time-consuming. By contrast, the method for extracting a company name in a text according to the embodiment of the invention determines the characters to be matched in the text to be extracted, matches them in character order against node characters along a node path in the company name prefix tree dictionary to obtain the longest matching substring, and takes the longest matching substring as the company name if the node corresponding to its last character is an end node. The time complexity of extraction is O(L × T), where L is the length of a company name and T is the length of the text to be extracted. This takes far less time than the traditional extraction methods, so extraction is fast, simple, and accurate, which solves the technical problems that existing company name extraction methods are complex, time-consuming, and inaccurate.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a method for extracting a company name from a text according to an embodiment of the present invention;
Fig. 2 is a diagram of a company name prefix tree dictionary according to an embodiment of the present invention;
Fig. 3 is a flowchart of another method for extracting a company name from a text according to an embodiment of the present invention;
Fig. 4 is a flowchart of a method for constructing a company name prefix tree dictionary according to an embodiment of the present invention;
Fig. 5 is a functional block diagram of an apparatus for extracting company names in text according to an embodiment of the present invention.
Reference numerals:
11: determining module; 12: matching module; 13: judging module; 14: first setting module.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To facilitate understanding of the embodiment, a method for extracting a company name in a text disclosed in the embodiment of the present invention is first described in detail.
Embodiment one:
A method for extracting a company name in a text, referring to Fig. 1, the method comprising:
S102, obtaining a text to be extracted, and determining characters to be matched in the text to be extracted;
In the embodiment of the invention, the text to be extracted can be any text; after it is obtained, the characters to be matched are determined within it.
On the first extraction pass, the characters to be matched are all the characters in the text to be extracted. After part of the extraction method has been executed, the characters to be matched are the characters in the text that have not yet been matched; this is described in detail below.
S104, according to the character sequence of the characters to be matched, matching the characters in the characters to be matched with the node characters along the node path in the company name prefix tree dictionary until the characters cannot be matched with the node path, so as to obtain the longest matching substring matched with the node characters in the node path in the text to be extracted, wherein the company name prefix tree dictionary comprises: a plurality of nodes and a node path consisting of the plurality of nodes, each node corresponding to a node character except for a root node, each node including any one of the following except the root node: a start node, an intermediate node, and an end node;
After the characters to be matched are obtained, they are matched one by one, in character order, against the node characters along a node path in the company name prefix tree dictionary until no further match is possible, which yields the longest matching substring in the text to be extracted.
Specifically, the company name prefix tree dictionary is a pre-constructed prefix tree dictionary related to company names. A company name prefix tree dictionary is shown in fig. 2. As can be seen from fig. 2: the company name prefix tree dictionary includes: a plurality of nodes and a node path consisting of the plurality of nodes, each node corresponding to a node character except for a root node, each node including any one of the following except the root node: a start node, an intermediate node, and an end node.
S106, judging whether a node corresponding to the last character of the longest matching substring is an end node or not;
After the longest matching substring is obtained, it is judged whether the node corresponding to its last character is an end node.
S108, if so, taking the longest matching substring as the company name.
For example, suppose the characters to be matched are "Beijing Jindi Technology Co., Ltd. invested 5 million". Matching them against the node paths of the company name prefix tree dictionary in Fig. 2 yields the longest matching substring "Beijing Jindi Technology Co., Ltd.". The node corresponding to its last character, "Si", is an end node, so the longest matching substring (that is, Beijing Jindi Technology Co., Ltd.) is taken as the company name.
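Steps S104 to S108 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the prefix tree is represented as nested dictionaries, `END` is a hypothetical marker for end nodes, and the company names are toy placeholders.

```python
END = object()  # hypothetical marker key: the node holding it is an end node


def build_trie(names):
    """Build a nested-dict prefix tree; the outer dict plays the role of the root."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start=0):
    """Walk the tree from text[start] until no child matches (S104).

    Returns the longest matching substring and whether the node of its
    last character is an end node (S106).
    """
    node, i = trie, start
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
    return text[start:i], END in node


trie = build_trie(["Acme Ltd", "Acme Labs"])
print(longest_match(trie, "Acme Ltd invested 5 million"))
# → ('Acme Ltd', True)
```

When the second element of the result is `True`, the substring is taken as the company name (S108); otherwise the match is rejected.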
One existing company name extraction method is a rule-based natural language processing algorithm, which requires several steps on the text, such as part-of-speech tagging and sequence rule setting, and is therefore slow; its accuracy also depends heavily on the accuracy of word segmentation. The other existing method traverses the company names one by one when judging whether text characters form a company name, which is time-consuming. By contrast, the method for extracting a company name in a text according to the embodiment of the invention determines the characters to be matched in the text to be extracted, matches them in character order against node characters along a node path in the company name prefix tree dictionary to obtain the longest matching substring, and takes the longest matching substring as the company name if the node corresponding to its last character is an end node. The time complexity of extraction is O(L × T), where L is the length of a company name and T is the length of the text to be extracted. This takes far less time than the traditional extraction methods, so extraction is fast, simple, and accurate, which solves the technical problems that existing company name extraction methods are complex, time-consuming, and inaccurate.
The above description briefly introduces the method for extracting the company name in the text, and the details of the related contents are described in detail below.
Optionally, referring to Fig. 3, S110: if the node corresponding to the last character of the longest matching substring is not an end node, taking the characters located after a first target character in the text to be extracted as the characters to be matched, and returning to the step of matching the characters to be matched, in character order, against node characters along a node path in the company name prefix tree dictionary, wherein the first target character is the character immediately following the first character of the longest matching substring in the text to be extracted.
Specifically, suppose the characters to be matched are "Beijing cannetitum". Matching them against node characters along the node path in the company name prefix tree dictionary yields the longest matching substring "Beijing", whose last node is clearly not an end node. The character following the first character of the longest matching substring, that is, the second character of "Beijing", is therefore taken as the first target character, and matching restarts from that character. On this second pass, the company name that begins with that character can be extracted.
The character following the first character of the longest matching substring is used as the first target character in order to avoid missing company names. In the example above, the longest matching substring is "Beijing" and its last node is not an end node. If matching instead resumed at the character following the last character of the longest matching substring, the characters to be matched would be "all can be read", and the company name that begins at the second character of "Beijing" would be missed.
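The effect of the restart position can be shown with a small Python sketch. Everything here is an assumption for illustration: the nested-dict tree, the `END` marker, the `restart_after_first` flag, and the toy names "abX" and "bcd" are hypothetical. With the text "abcd", the first pass matches "ab" and fails; only a restart at the second character finds "bcd".

```python
END = "<end>"  # hypothetical end-node marker; longer than one character, so it never collides


def build_trie(names):
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def extract(trie, text, restart_after_first=True):
    found, pos = [], 0
    while pos < len(text):
        node, i = trie, pos
        while i < len(text) and text[i] in node:
            node, i = node[text[i]], i + 1
        if i > pos and END in node:
            found.append(text[pos:i])
            pos = i                      # success: continue after the match
        elif restart_after_first or i == pos:
            pos += 1                     # failure: resume after the first character
        else:
            pos = i                      # failure: resume after the last matched character
    return found


trie = build_trie(["abX", "bcd"])
print(extract(trie, "abcd"))                             # → ['bcd']
print(extract(trie, "abcd", restart_after_first=False))  # → []
```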
Optionally, referring to Fig. 3, after taking the longest matching substring as the company name, the method further comprises:
S112, taking the characters located after a second target character in the text to be extracted as the characters to be matched, and returning to the step of matching the characters to be matched, in character order, against node characters along a node path in the company name prefix tree dictionary, until all characters in the text to be extracted have been traversed, wherein the second target character is the character immediately following the last character of the longest matching substring in the text to be extracted.
The above description describes the whole process of the method for extracting the company name in the text, and the details thereof are described in detail below.
In an optional embodiment, before obtaining the text to be extracted, referring to fig. 4, the method further includes:
S401, obtaining a company name dictionary, wherein the company name dictionary comprises a plurality of company names;
Specifically, every company must register with the administration for industry and commerce, which publishes company registration information (including company names) on the enterprise industry and commerce information network, so the company name dictionary can be obtained from that network.
The company name dictionary is a list of company names.
S402, adding a character attribute to each character of each company name in the company name dictionary to obtain a company name dictionary carrying character attributes, wherein the character attribute of each character is one of the following: start character, middle character, or end character;
For example, in the name "Beijing Baidu Network Communication Technology Co., Ltd.", the first character ("North", the "Bei" of Beijing) is the start character, the last character ("Si") is the end character, and all other characters are middle characters.
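The attribute tagging of step S402 might look like the following Python sketch. The function name `tag_characters` and the attribute strings are hypothetical, and treating a one-character name as a start character is an assumption the source does not address.

```python
def tag_characters(name):
    """Label each character of a company name as start, middle, or end."""
    tags = []
    for i, ch in enumerate(name):
        if i == 0:
            attr = "start"      # assumption: a one-character name is tagged "start"
        elif i == len(name) - 1:
            attr = "end"
        else:
            attr = "middle"
        tags.append((ch, attr))
    return tags


print(tag_characters("Acme"))
# → [('A', 'start'), ('c', 'middle'), ('m', 'middle'), ('e', 'end')]
```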
And S403, constructing a company name prefix tree dictionary by combining the company name dictionary carrying the character attributes and the preset prefix tree construction rule.
Specifically, the preset prefix tree construction rule includes:
the root node is a null node, and each node except the root node corresponds to a node character, wherein the root node is a node on the upper layer of the starting node;
in a node path from a starting node to an ending node, corresponding node characters form a company name, wherein the node characters corresponding to the starting node are matched with the starting characters, and the node characters corresponding to the ending node are matched with the ending characters;
in all nodes on the next layer of each node, the corresponding node characters of any two nodes are different;
starting from the start node, only one node is represented when there are successive repeated node characters.
Fig. 2 shows the company name prefix tree dictionary constructed, according to the preset prefix tree construction rule, from a company name dictionary containing "Beijing Baidu Network Communication Technology Co., Ltd.", "Beijing Baidu Duoku Technology Co., Ltd.", "Beijing Baidu Investment Management Co., Ltd.", "Beijing Jindi Technology Co., Ltd.", "Dalian Wanda Technology Co., Ltd.", and "Tianjin Saikoku Technology Co., Ltd.".
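The construction rules above can be sketched in Python. This is one possible reading, not the patent's implementation: the `Node` class, its `is_end` flag, and the toy names are hypothetical, and start nodes are simply taken to be the children of the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # node character -> child Node
    is_end: bool = False                          # True if a company name ends at this node


def build_prefix_tree(names):
    root = Node()                       # the root is an empty node with no character
    for name in names:
        node = root
        for ch in name:
            # sibling nodes carry distinct characters, so shared prefixes reuse nodes
            node = node.children.setdefault(ch, Node())
        node.is_end = True
    return root


root = build_prefix_tree(["Acme Ltd", "Acme Labs", "Bolt Inc"])
print(sorted(root.children))            # characters of the start nodes
# → ['A', 'B']
```

In this sketch the two names beginning with "Acme L" share six nodes and branch only at the seventh character.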
In an optional embodiment, matching the characters to be matched against node characters along a node path in the company name prefix tree dictionary comprises:
(1) matching the first character of the characters to be matched against the node characters corresponding to the start nodes;
(2) if the first character matches the node character of a start node, matching the remaining characters to be matched, starting from the second character, against node characters along the node path in the company name prefix tree dictionary;
(3) if the first character does not match the node character of any start node, judging whether the second character of the characters to be matched matches the node character of a start node;
(4) if the second character matches the node character of a start node, matching the remaining characters to be matched, starting from the third character, against node characters along the node path in the company name prefix tree dictionary.
That is, the characters to be matched are examined in order until one is found that matches the node character of a start node, and the characters that follow it are then matched against node characters along the node path in the company name prefix tree dictionary.
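The search for a start character in steps (1) to (4) can be sketched as a simple scan. The function name `next_start` and the set `start_chars` (the node characters of all start nodes) are hypothetical names introduced for illustration.

```python
def next_start(text, start_chars, pos=0):
    """Return the index of the first character at or after pos that matches
    the node character of a start node, or -1 if there is none."""
    for i in range(pos, len(text)):
        if text[i] in start_chars:
            return i
    return -1


start_chars = {"A", "B"}                 # toy start-node characters
print(next_start("The Acme Ltd", start_chars))
# → 4
```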
The process of the present invention is described below in plain language:
S1: read the characters of the text in order and judge whether the current character is a first-layer node (namely a start node) of the company name prefix tree dictionary; if so, go to S2; if not, read the next character;
S2: continue traversing and matching downward along the path of the company name prefix tree dictionary until no further match is possible, obtaining the longest matching substring;
S3: record the position of the last node of the longest matching substring and judge whether that node carries an end mark (namely whether it is an end node); if so, go to S4; if not, move to the character following the first character of the longest matching substring and return to S1;
S4: record the longest matching substring as a company name, skip the current substring, move to the character following the last character of the longest matching substring, and return to S1.
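Steps S1 to S4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the trie is represented as a nested dict, and the `END` marker key, the function names `build_trie` and `extract_names`, and the sample company names are all hypothetical.

```python
END = "$end"  # hypothetical multi-character key marking an end node


def build_trie(names):
    """Build a nested-dict prefix tree; the root is an empty dict."""
    root = {}
    for name in names:
        node = root
        for ch in name:
            node = node.setdefault(ch, {})  # shared prefixes reuse nodes
        node[END] = True                    # last character is an end node
    return root


def extract_names(text, trie):
    """Scan text and return the substrings accepted by steps S1 to S4."""
    found = []
    i = 0
    while i < len(text):
        # S1: is the current character a first-layer (start) node?
        if text[i] not in trie:
            i += 1
            continue
        # S2: walk down the tree until no further match is possible.
        node, j = trie, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
        # S3: does the last matched node carry the end mark?
        if END in node:
            found.append(text[i:j])  # S4: record the name ...
            i = j                    # ... and resume after its last character
        else:
            i += 1                   # resume after the first matched character
    return found


if __name__ == "__main__":
    trie = build_trie(["Acme Ltd", "Acme Labs Ltd", "Bolt Inc"])
    print(extract_names("Today Acme Labs Ltd and Bolt Inc signed.", trie))
    # prints ['Acme Labs Ltd', 'Bolt Inc']
```

Read literally, S3 tests only the node reached at the end of the longest match. In this sketch a text such as "abc" therefore yields nothing for the dictionary {"ab", "abcd"}, because the walk stops on "c", which is not an end node. Whether a real implementation should fall back to the last end node passed on the way is not stated in the text above.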
When there are a large number of company names, many characters are necessarily repeated among them, since each company name is roughly 5 to 20 characters long. The invention uses a prefix tree to extract company names, which is fast and also has good space complexity. From the structure of the Trie (namely the prefix tree), the space complexity of one company name in the tree is O(L), where L is the length of the company name, and the time complexity of the algorithm for extracting company names from a text is O(L × T), where T is the text length. This is much less than the O(N × T) of the conventional approach, where N is the number of company names.
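As a rough illustration of the gap between the two bounds, take hypothetical values chosen only for this example: N = 10^6 company names, L = 20 characters per name, and T = 10^3 characters of text.

```latex
\[
L \times T = 20 \times 10^{3} = 2 \times 10^{4},
\qquad
N \times T = 10^{6} \times 10^{3} = 10^{9},
\qquad
\frac{N \times T}{L \times T} = \frac{N}{L} = 5 \times 10^{4}.
\]
```

These figures compare the stated asymptotic bounds only; they are not measured running times.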
The invention is simple to implement and has low time complexity, so company names can be extracted from text quickly and effectively. Compared with multi-step natural language processing methods based on word segmentation, the method requires only two steps, constructing the prefix tree dictionary and judging substrings, and is simple and clear. Introducing the prefix tree effectively solves the problem of excessive time complexity when N (namely the number of company names) is very large.
The method is not limited to extracting company names from text; it can also be used to extract other named entities, such as place names and person names.
Embodiment two:
An apparatus for extracting a company name from a text, referring to fig. 5, the apparatus comprising:
a determining module 11, configured to obtain a text to be extracted and determine the characters to be matched in the text to be extracted;
a matching module 12, configured to match, according to the character order of the characters to be matched, the characters to be matched against node characters along a node path in the company name prefix tree dictionary until no further match is possible, so as to obtain, in the text to be extracted, the longest matching substring that matches the node characters in the node path, where the company name prefix tree dictionary includes: a plurality of nodes and node paths composed of the plurality of nodes, each node except the root node corresponding to one node character, and each node except the root node being any one of: a start node, an intermediate node, and an end node;
a judging module 13, configured to judge whether the node corresponding to the last character of the longest matching substring is an end node;
a first setting module 14, configured to take the longest matching substring as the company name if so.
In this apparatus for extracting a company name from a text, the characters to be matched in the text to be extracted are determined first; the characters to be matched are then matched, in character order, against node characters along a node path in the company name prefix tree dictionary, so as to obtain, in the text to be extracted, the longest matching substring that matches the node characters in the node path; and if the node corresponding to the last character of the longest matching substring is an end node, the longest matching substring is taken as the company name. When the apparatus extracts company names, the time complexity is O(L × T), where L is the length of a company name and T is the length of the text to be extracted. It takes far less time than a conventional extraction apparatus, its process is simple, and its accuracy is good, which alleviates the technical problems that existing company name extraction apparatuses are complex, time-consuming and inaccurate.
Optionally, the apparatus further comprises:
a second setting module, configured to, if the node corresponding to the last character of the longest matching substring is not an end node, take the characters located after a first target character in the text to be extracted as the characters to be matched, and return to the step of matching the characters to be matched, according to their character order, against node characters along a node path in the company name prefix tree dictionary, where the first target character is the character following the first character of the longest matching substring in the text to be extracted.
Optionally, the apparatus further comprises:
a third setting module, configured to take the characters located after a second target character in the text to be extracted as the characters to be matched, and return to the step of matching the characters to be matched, according to their character order, against node characters along a node path in the company name prefix tree dictionary, until all characters in the text to be extracted have been traversed, where the second target character is the character following the last character of the longest matching substring in the text to be extracted.
Optionally, the matching module comprises:
a first matching unit, configured to match the first character of the characters to be matched against the node character corresponding to a start node;
a second matching unit, configured to, if the first character matches the node character corresponding to the start node, match the remaining characters to be matched, starting from the second character, against node characters along the node path in the company name prefix tree dictionary.
Optionally, the matching module further comprises:
a judging unit, configured to, if the first character does not match the node character corresponding to the start node, judge whether the second character of the characters to be matched matches the node character corresponding to the start node;
a third matching unit, configured to, if the second character matches the node character corresponding to the start node, match the remaining characters to be matched, starting from the third character, against node characters along the node path in the company name prefix tree dictionary.
Optionally, the apparatus further comprises:
an acquisition module, configured to acquire a company name dictionary, where the company name dictionary includes a plurality of company names;
an adding module, configured to add a character attribute to each character of each company name in the company name dictionary to obtain a company name dictionary carrying character attributes, where the character attribute of each character in the company name dictionary carrying character attributes is any one of: start character, middle character, and end character;
a construction module, configured to construct the company name prefix tree dictionary from the company name dictionary carrying character attributes and the preset prefix tree construction rule.
Optionally, the preset prefix tree construction rule includes:
the root node is an empty node, and each node except the root node corresponds to one node character, where the root node is the node one layer above the start nodes;
along a node path from a start node to an end node, the corresponding node characters form a company name, where the node character corresponding to the start node matches the start character and the node character corresponding to the end node matches the end character;
among all nodes on the layer directly below any node, the node characters corresponding to any two nodes are different;
starting from the start node, consecutive repeated node characters are represented by only one node.
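The construction rules above can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: the `Node` class, the functions `tag_characters` and `build_prefix_tree`, and the sample names are hypothetical. Two assumptions are made here: the last rule is read as "names that share a prefix share the nodes of that prefix", and a node may carry more than one attribute when one name is a prefix of another.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    char: str = ""                                  # empty for the root node
    kinds: set = field(default_factory=set)         # "start", "middle", "end"
    children: dict = field(default_factory=dict)    # char -> Node, unique keys


def tag_characters(name):
    """Attach a character attribute to each character of one company name."""
    if len(name) < 2:
        raise ValueError("this sketch assumes names of at least two characters")
    last = len(name) - 1
    tagged = []
    for i, ch in enumerate(name):
        if i == 0:
            attr = "start"
        elif i == last:
            attr = "end"
        else:
            attr = "middle"
        tagged.append((ch, attr))
    return tagged


def build_prefix_tree(names):
    """Insert every tagged name under an empty root node."""
    root = Node()
    for name in names:
        node = root
        for ch, attr in tag_characters(name):
            # Dict keys keep sibling node characters pairwise different,
            # and an existing child is reused for a shared prefix.
            child = node.children.get(ch)
            if child is None:
                child = Node(char=ch)
                node.children[ch] = child
            child.kinds.add(attr)
            node = child
    return root


if __name__ == "__main__":
    tree = build_prefix_tree(["Acme Ltd", "Acme Labs Ltd", "Bolt Inc"])
    print(sorted(tree.children))  # prints ['A', 'B']
```

Storing children in a dict keyed by character is one simple way to satisfy the rule that sibling node characters differ; other layouts, such as arrays indexed by character code, would serve equally well.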
For details of the second embodiment, reference may be made to the description of the first embodiment; they are not repeated here.
The method and device for extracting a company name from a text provided by the embodiments of the present invention include a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method described in the foregoing method embodiment. For the specific implementation, refer to the method embodiment; it is not described again here.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not described again here.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted" and "connected" are to be construed broadly: a connection may, for example, be fixed, removable, or integral; it may be mechanical or electrical; it may be direct, indirect through an intervening medium, or an internal communication between two elements. Those skilled in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they should therefore not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art can, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall be construed as being included therein. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method for extracting a company name from a text, characterized by comprising the following steps:
acquiring a text to be extracted, and determining the characters to be matched in the text to be extracted;
matching, according to the character order of the characters to be matched, the characters to be matched against node characters along a node path in a company name prefix tree dictionary until no further match with the node characters in the company name prefix tree dictionary is possible, so as to obtain, in the text to be extracted, the longest matching substring that matches the node characters in the node path, wherein the company name prefix tree dictionary comprises: a plurality of nodes and node paths composed of the plurality of nodes, each node except a root node corresponding to one node character, and each node except the root node being any one of: a start node, an intermediate node, and an end node;
judging whether the node corresponding to the last character of the longest matching substring is an end node;
if so, taking the longest matching substring as a company name;
if the node corresponding to the last character of the longest matching substring is not an end node, taking the characters located after a first target character in the text to be extracted as the characters to be matched, and returning to the step of matching the characters to be matched, according to their character order, against node characters along the node path in the company name prefix tree dictionary, wherein the first target character is the character following the first character of the longest matching substring in the text to be extracted.
2. The method of claim 1, wherein after taking the longest matching substring as a company name, the method further comprises:
taking the characters located after a second target character in the text to be extracted as the characters to be matched, and returning to the step of matching the characters to be matched, according to their character order, against node characters along the node path in the company name prefix tree dictionary, until all characters in the text to be extracted have been traversed, wherein the second target character is the character following the last character of the longest matching substring in the text to be extracted.
3. The method of claim 1, wherein matching the characters to be matched against node characters along a node path in the company name prefix tree dictionary comprises:
matching the first character of the characters to be matched against the node character corresponding to a start node;
if the first character matches the node character corresponding to the start node, matching the remaining characters to be matched, starting from the second character, against node characters along the node path in the company name prefix tree dictionary.
4. The method of claim 3, wherein matching the characters to be matched against node characters along a node path in the company name prefix tree dictionary further comprises:
if the first character does not match the node character corresponding to the start node, judging whether the second character of the characters to be matched matches the node character corresponding to the start node;
if the second character matches the node character corresponding to the start node, matching the remaining characters to be matched, starting from the third character, against node characters along the node path in the company name prefix tree dictionary.
5. The method of claim 1, wherein prior to acquiring the text to be extracted, the method further comprises:
acquiring a company name dictionary, wherein the company name dictionary comprises a plurality of company names;
adding a character attribute to each character of each company name in the company name dictionary to obtain a company name dictionary carrying character attributes, wherein the character attribute of each character in the company name dictionary carrying character attributes is any one of: start character, middle character, and end character;
constructing the company name prefix tree dictionary from the company name dictionary carrying character attributes and a preset prefix tree construction rule.
6. The method of claim 5, wherein the preset prefix tree construction rule comprises:
the root node is an empty node, and each node except the root node corresponds to one node character, wherein the root node is the node one layer above the start nodes;
along a node path from a start node to an end node, the corresponding node characters form a company name, wherein the node character corresponding to the start node matches the start character and the node character corresponding to the end node matches the end character;
among all nodes on the layer directly below any node, the node characters corresponding to any two nodes are different;
starting from the start node, consecutive repeated node characters are represented by only one node.
7. An apparatus for extracting a company name from a text, the apparatus comprising:
a determining module, configured to acquire a text to be extracted and determine the characters to be matched in the text to be extracted;
a matching module, configured to match, according to the character order of the characters to be matched, the characters to be matched against node characters along a node path in a company name prefix tree dictionary until no further match is possible, so as to obtain, in the text to be extracted, the longest matching substring that matches the node characters in the node path, wherein the company name prefix tree dictionary comprises: a plurality of nodes and node paths composed of the plurality of nodes, each node except a root node corresponding to one node character, and each node except the root node being any one of: a start node, an intermediate node, and an end node;
a judging module, configured to judge whether the node corresponding to the last character of the longest matching substring is an end node;
a first setting module, configured to take the longest matching substring as a company name if so;
wherein the apparatus further comprises:
a second setting module, configured to, if the node corresponding to the last character of the longest matching substring is not an end node, take the characters located after a first target character in the text to be extracted as the characters to be matched, and return to the step of matching the characters to be matched, according to their character order, against node characters along the node path in the company name prefix tree dictionary, wherein the first target character is the character following the first character of the longest matching substring in the text to be extracted.
8. The apparatus of claim 7, further comprising:
a third setting module, configured to take the characters located after a second target character in the text to be extracted as the characters to be matched, and return to the step of matching the characters to be matched, according to their character order, against node characters along the node path in the company name prefix tree dictionary, until all characters in the text to be extracted have been traversed, wherein the second target character is the character following the last character of the longest matching substring in the text to be extracted.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810471361.2A CN108710671B (en) 2018-05-16 2018-05-16 Method and device for extracting company name in text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810471361.2A CN108710671B (en) 2018-05-16 2018-05-16 Method and device for extracting company name in text

Publications (2)

Publication Number Publication Date
CN108710671A CN108710671A (en) 2018-10-26
CN108710671B true CN108710671B (en) 2020-06-05

Family

ID=63868196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810471361.2A Active CN108710671B (en) 2018-05-16 2018-05-16 Method and device for extracting company name in text

Country Status (1)

Country Link
CN (1) CN108710671B (en)

Also Published As

Publication number Publication date
CN108710671A (en) 2018-10-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant