WO2020098315A1 - Information matching method and terminal - Google Patents

Information matching method and terminal

Info

Publication number
WO2020098315A1
Authority
WO
WIPO (PCT)
Prior art keywords
enterprise
participle
word segmentation
information
participles
Application number
PCT/CN2019/099123
Other languages
English (en)
French (fr)
Inventor
吴超鹏
张若峰
龚浩杰
郑俊杰
陈志飞
许琨
Original Assignee
厦门市美亚柏科信息股份有限公司
Application filed by 厦门市美亚柏科信息股份有限公司
Publication of WO2020098315A1 (zh)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present disclosure relates to the field of data processing, and in particular, to an information matching method and terminal.
  • the technical problem to be solved by the present disclosure is: how to improve the accuracy of matching text information and enterprise information.
  • the present disclosure provides an information matching method, including:
  • step S5. Repeat step S4 until all elements in the enterprise information set are traversed;
  • S1 is specifically:
  • the ordered set of first participles is generated according to the administrative division word, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.
  • the matching score is calculated as follows:
  • if the second participle set includes every participle of a first-participle ordered set, the matching score corresponding to that ordered set is set to a first value;
  • if the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, the corresponding matching score is set to a second value;
  • if the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, the corresponding matching score is set to a third value;
  • if the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, the corresponding matching score is set to a fourth value;
  • the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
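  • the first-participle ordered set also includes an address participle and an industry name participle;
  • when the second participle set includes the address participle, the matching score is increased by a fifth value;
  • when the second participle set includes the industry name participle, the matching score is increased by a sixth value;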
  • the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
  • the matching score is calculated as follows:
  • if a participle of the text information matches the enterprise abbreviation participle in a first-participle ordered set, the matching score is calculated from the number of participles matched between the second participle set and that ordered set and from the sequence numbers of the matched participles within that ordered set.
  • S3 is specifically:
  • before S3, the method also includes: if the preset text information contains brackets and the number of characters inside the brackets is less than 10, deleting the brackets and the characters inside them.
  • the present disclosure also provides a computer-readable storage medium having a program stored thereon, which executes the information matching method when executed by a computer.
  • the present disclosure also provides an information matching terminal, including one or more processors and a memory, the memory storing a program configured to be executed by the one or more processors to perform the following steps:
  • step S5. Repeat step S4 until all elements in the enterprise information set are traversed;
  • S1 is specifically:
  • the matching score is calculated according to the number of participles matched between the second participle set and a first-participle ordered set and the sequence numbers of the matched participles in that ordered set, specifically:
  • if a participle of the text information matches the enterprise abbreviation participle in a first-participle ordered set, then: when the second participle set includes every participle of that ordered set, the matching score corresponding to that ordered set is set to a first value; when the second participle set includes only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, the matching score is set to a second value; when the second participle set includes only the enterprise abbreviation participle and the enterprise nature participle, the matching score is set to a third value; when the second participle set includes only the enterprise abbreviation participle and the enterprise type participle, the matching score is set to a fourth value; the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
  • the first-participle ordered set also includes an address participle and an industry name participle; when the second participle set includes the address participle, the matching score is increased by a fifth value; when the second participle set includes the industry name participle, the matching score is increased by a sixth value; the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
  • S1 also includes: deleting brackets, and the characters inside them, from the enterprise information;
  • S3 is specifically:
  • if the preset text information contains brackets and the number of characters inside the brackets is less than 10, deleting the brackets and the characters inside them;
  • the beneficial effect of the present disclosure is as follows: word segmentation is performed on the enterprise information, and the resulting participles differ in importance.
  • the present disclosure arranges the participles corresponding to each piece of enterprise information in order in a first-participle ordered set, so that when text information such as public opinion is matched in turn against the segmented enterprise information in the enterprise information database, a matching score can be generated from the number of participles the two have in common and the importance of the matched participles.
  • from the matching scores between the text information and each piece of enterprise information, the information of the enterprise most relevant to the event report or public opinion is obtained, which greatly improves the accuracy of matching the text information with the enterprise information.
  • FIG. 1 is a flowchart of a specific embodiment of the information matching method provided by the present disclosure
  • FIG. 2 is a structural block diagram of a specific embodiment of the information matching terminal provided by the present disclosure
  • matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.
  • the parenthesized part of a company name usually does not appear in the text information. Deleting the parentheses from the enterprise information during preprocessing therefore helps improve the accuracy and efficiency of matching.
  • deleting the parentheses and the characters inside them from the text information keeps the processing consistent with that applied when the enterprise information is segmented, which ensures consistent segmentation results and improves the accuracy of matching the enterprise information with the text information.
  • the content in parentheses in a company name basically does not exceed five words. To avoid accidentally deleting other parts of the text information, the deletion is performed only if the parentheses contain fewer than 10 characters.
  • This embodiment provides an information matching method, including:
  • S1 is specifically: obtaining the characters corresponding to the administrative division in a piece of enterprise information to obtain the administrative division word; obtaining the characters corresponding to the enterprise abbreviation in that enterprise information to obtain the enterprise abbreviation participle; obtaining the characters corresponding to the nature of the enterprise to obtain the enterprise nature participle; obtaining the characters corresponding to the type of enterprise to obtain the enterprise type participle; and generating the first-participle ordered set from the administrative division word, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.
  • administrative division words are generally province name + county name or city name + district name, such as Fujian Province or Siming District of Fujian Province.
  • enterprise nature participles are generally information, e-commerce, real estate, etc.
  • enterprise type participles are generally limited liability company, joint-stock company, partnership, etc.
  • for example, a piece of enterprise information is "Xiamen XXXX Information Co., Ltd., Fujian Province".
  • the administrative division word is "Xiamen City, Fujian Province";
  • the enterprise abbreviation participle is "XXXX";
  • the enterprise nature participle is "Information";
  • the enterprise type participle is "Company Limited".
  • the above participles are arranged in order in the first-participle ordered set, which is specifically {"Xiamen City, Fujian Province", "XXXX", "Information", "Company Limited"}.
  • an enterprise information database is formed.
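As a minimal sketch of the data structure described above, the first-participle ordered set can be modelled as a tuple whose field order is the set order. The class name `FirstParticipleSet`, its field names and the list `enterprise_database` are illustrative assumptions; the source does not say how the participles are extracted from the enterprise name, so the example constructs the set by hand.

```python
from typing import List, NamedTuple, Optional, Tuple


class FirstParticipleSet(NamedTuple):
    """Ordered set of first participles for one enterprise (field order = set order)."""
    administrative_division: str
    abbreviation: str
    nature: str
    enterprise_type: str
    address: Optional[str] = None        # optional extension named in the source
    industry_name: Optional[str] = None  # optional extension named in the source

    def core(self) -> Tuple[str, str, str, str]:
        """Return the four core participles in their defined order."""
        return (self.administrative_division, self.abbreviation,
                self.nature, self.enterprise_type)


# The worked example from the text, built by hand (extraction itself is not specified).
example = FirstParticipleSet(
    "Xiamen City, Fujian Province", "XXXX", "Information", "Company Limited"
)

# Hypothetical in-memory stand-in for the enterprise information database.
enterprise_database: List[FirstParticipleSet] = [example]
```

Because the structure is a tuple, the sequence number of each participle is simply its index, which is what the later scoring rules rely on.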
  • the event title and event content are obtained as text information.
  • if the preset text information contains brackets and the number of characters inside the brackets is less than 10, the brackets and the characters inside them are deleted.
  • removing the parentheses and the characters inside them from the text information keeps the processing consistent with that applied when the enterprise information is segmented, which ensures consistent segmentation results and improves the accuracy of matching the enterprise information with the text information.
  • the content in parentheses in a company name basically does not exceed five words.
  • to avoid accidentally deleting other parts of the text information, the deletion is performed only if the parentheses contain fewer than 10 characters.
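The bracket rule above can be sketched with a regular expression. This is only an illustration under stated assumptions: the function name `strip_short_brackets` is hypothetical, both ASCII and full-width parentheses are treated as brackets, and nested brackets are not handled.

```python
import re

# Matches an opening bracket, 0 to 9 non-bracket characters, and a closing bracket.
# ASCII "()" and full-width "（）" are both accepted (an assumption; the source
# only says "brackets").
_SHORT_BRACKET = re.compile(r"[(（][^()（）]{0,9}[)）]")


def strip_short_brackets(text: str) -> str:
    """Delete brackets whose content is fewer than 10 characters, content included."""
    return _SHORT_BRACKET.sub("", text)


print(strip_short_brackets("XXXX(Xiamen)Information"))  # bracket holds 6 characters
print(strip_short_brackets("A(0123456789)B"))            # bracket holds 10 characters
```

The first call removes the bracketed part; the second leaves the text unchanged because the bracket holds exactly 10 characters, which is not less than 10.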
  • the S3 is specifically:
  • after word segmentation, unneeded participles, including purely numeric participles, are filtered out, which reduces the number of matching cycles against the enterprise information in the enterprise information database and helps improve the efficiency of matching enterprise information with text information.
  • if a participle of the text information matches the enterprise abbreviation participle in a first-participle ordered set, the matching score is calculated from the number of participles matched between the second participle set and that ordered set and from the sequence numbers of the matched participles within that ordered set.
  • because the text information is taken from event reports and public opinion, it may not state the company name and other information in a detailed, standard form. The administrative division word, enterprise nature participle and enterprise type participle corresponding to the enterprise information may therefore be absent from the text information, whereas the enterprise abbreviation participle must be present.
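A small sketch of the post-segmentation filter follows. The source clearly names purely numeric participles; the `stop_words` parameter and the dropping of blank tokens are assumptions added for illustration, and the function name `filter_participles` is hypothetical.

```python
from typing import FrozenSet, Iterable, List


def filter_participles(participles: Iterable[str],
                       stop_words: FrozenSet[str] = frozenset()) -> List[str]:
    """Drop blank tokens, purely numeric tokens, and any caller-supplied stop words."""
    kept = []
    for participle in participles:
        if not participle.strip():
            continue  # blank token (assumed to be noise)
        if participle.isdigit():
            continue  # purely numeric participle, as named in the source
        if participle in stop_words:
            continue  # optional stop-word list (an assumption)
        kept.append(participle)
    return kept


print(filter_participles(["XXXX", "2019", "Information", " "]))
```

Fewer participles in the second participle set means fewer lookups against the enterprise information database, which is the efficiency gain the text describes.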
  • the enterprise abbreviation participle is used as the key, and the string "full enterprise name # administrative division word # enterprise nature participle # enterprise type participle # industry name participle # address participle" is used as the corresponding value
  • when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information in the enterprise information database, a further matching operation is performed, which greatly improves matching efficiency.
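The key-value lookup described above might be sketched as follows. The names `build_index`, `candidates`, `FIELDS` and the record keys are hypothetical, and the order of the `#`-separated fields between the administrative division word and the industry name is an assumption, since the translated text is damaged at that point.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed order of the "#"-joined value fields.
FIELDS = ("full_name", "administrative_division", "nature",
          "enterprise_type", "industry_name", "address")


def build_index(records: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each enterprise abbreviation participle to its "#"-joined values."""
    index: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        value = "#".join(record.get(field, "") for field in FIELDS)
        index[record["abbreviation"]].append(value)
    return dict(index)


def candidates(index: Dict[str, List[str]], text_participles: Iterable[str]) -> List[str]:
    """Return the values of every enterprise whose abbreviation occurs in the text."""
    hits: List[str] = []
    for participle in dict.fromkeys(text_participles):  # de-duplicate, keep order
        hits.extend(index.get(participle, []))
    return hits


record = {
    "abbreviation": "XXXX",
    "full_name": "Xiamen XXXX Information Co., Ltd., Fujian Province",
    "administrative_division": "Xiamen City, Fujian Province",
    "nature": "Information",
    "enterprise_type": "Company Limited",
    "industry_name": "information system integration service",
    "address": "Software Park Phase II Guanri Road",
}
index = build_index([record])
print(candidates(index, ["report", "XXXX"]))
```

Only enterprises returned by `candidates` need the more expensive scoring step, so texts that mention no known abbreviation cost one dictionary lookup per participle.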
  • the matching score is calculated as follows:
  • if the second participle set includes every participle of a first-participle ordered set, the matching score corresponding to that ordered set is set to a first value;
  • if the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, the corresponding matching score is set to a second value;
  • if the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, the corresponding matching score is set to a third value;
  • if the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, the corresponding matching score is set to a fourth value;
  • the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
  • for example, the first-participle ordered set corresponding to a piece of enterprise information in the enterprise information database is {"Xiamen City, Fujian Province", "XXXX", "Information", "Company Limited"}. If "Xiamen City, Fujian Province", "XXXX", "Information" and "Company Limited" are all present in the text information, the enterprise referred to in the text information is fully consistent with the enterprise information corresponding to that ordered set, and the matching score is 100 points. If only "XXXX", "Information" and "Company Limited" are present in the text information, the enterprise referred to in the text information matches the enterprise information highly, and the matching score is 90 points.
  • if only "XXXX" and "Information" are present in the text information, the enterprise referred to in the text information matches the enterprise information to a fairly high degree, and the matching score is 80 points. If only "XXXX" is present in the text information, the enterprise referred to in the text information basically matches the enterprise information, and the matching score is 50 points.
  • scoring according to the different degrees to which the participles of the text information and of the enterprise information match helps improve the accuracy of the matching result.
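The tiered scores in the example can be sketched as a small function. The values 100, 90 and 50 come from the example; the condition for the 80-point tier (abbreviation plus nature) is taken from the third-value rule. Two choices are assumptions made only for this sketch: a text without the abbreviation scores 0, and any combination the example does not cover (for instance abbreviation plus type only) falls back to the abbreviation-only score. The function name `base_score` is hypothetical.

```python
from typing import Iterable


def base_score(text_participles: Iterable[str], division: str, abbreviation: str,
               nature: str, enterprise_type: str) -> int:
    """Score one enterprise against the participles found in the text."""
    present = set(text_participles)
    if abbreviation not in present:
        return 0    # assumption: no abbreviation hit means no candidate
    has_division = division in present
    has_nature = nature in present
    has_type = enterprise_type in present
    if has_division and has_nature and has_type:
        return 100  # all four core participles present
    if has_nature and has_type:
        return 90   # abbreviation + nature + type
    if has_nature:
        return 80   # abbreviation + nature
    return 50       # abbreviation only (also the assumed fallback for other cases)


core = ("Xiamen City, Fujian Province", "XXXX", "Information", "Company Limited")
print(base_score(["XXXX", "Information", "Company Limited"], *core))
```

Checking the richest combination first keeps the tiers mutually exclusive without having to spell out every "contains only" condition.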
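  • the first-participle ordered set also includes an address participle and an industry name participle;
  • when the second participle set includes the address participle, the matching score is increased by a fifth value;
  • when the second participle set includes the industry name participle, the matching score is increased by a sixth value;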
  • the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
  • if the text information includes the address participle, the matching score is increased by the fifth value. If the text information includes the industry name participle, the matching score is increased by the sixth value; if the text information does not include the industry name participle, word segmentation is performed again on the industry name participle to obtain an industry name participle list; the list is traversed, checking in turn whether the text information contains each participle in the list, and the matching score is increased correspondingly for each hit until the traversal ends.
  • the address participle is not precise down to the house number; it is truncated at the road or street level, for example: Software Park Phase II Guanri Road.
  • for example, suppose the score obtained after matching the administrative division word, enterprise abbreviation participle, enterprise nature participle and enterprise type participle in the first-participle ordered set of a piece of enterprise information is 80 points. If the text information contains the address participle corresponding to that enterprise information, the matching score is increased by 5 points to 85 points. If the text information also contains the industry name participle corresponding to that enterprise information, the matching score is increased by another 5 points to 90 points. If the text information does not fully match the industry name participle, the industry name participle is further subdivided, and the matching score is increased according to the matches found.
  • for example, if the industry name participle is "information system integration service", it can be further subdivided into "information", "system integration" and "service", which are then matched against the text information.
  • matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.
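The address and industry bonuses can be sketched as below. The 5-point bonuses follow the worked example; the per-hit increment for subdivided industry terms (`part_bonus`) is not given in the source and is a hypothetical parameter, as are the function name `bonus` and the use of plain substring matching. The subdivision of the industry name is passed in by the caller because the source does not specify the segmenter.

```python
from typing import Iterable, Optional


def bonus(text: str, address: Optional[str], industry_name: Optional[str],
          industry_parts: Iterable[str], address_bonus: int = 5,
          industry_bonus: int = 5, part_bonus: int = 1) -> int:
    """Extra points for address and industry matches found in the text."""
    extra = 0
    if address and address in text:
        extra += address_bonus
    if industry_name and industry_name in text:
        extra += industry_bonus
    else:
        # Full industry name absent: fall back to its subdivided terms,
        # adding an assumed per-hit increment for each term found.
        extra += sum(part_bonus for part in industry_parts if part in text)
    return extra


text = "XXXX offers system integration near Software Park Phase II Guanri Road"
parts = ["information", "system integration", "service"]
print(bonus(text, "Software Park Phase II Guanri Road",
            "information system integration service", parts))
```

In this call the address matches (5 points) and only "system integration" matches among the subdivided terms (1 point under the assumed increment).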
  • step S5. Repeat step S4 until all elements in the enterprise information set are traversed.
  • the enterprise information with the highest matching score is the information of the enterprise in the enterprise information database that most closely matches the event or public opinion reported by the text information.
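Steps S4 and S5 together amount to scoring every enterprise and keeping the best one. The sketch below assumes a caller-supplied scoring function; the names `match_best` and `score_fn` are hypothetical, and ties are resolved in favour of the first enterprise encountered, which the source does not specify.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

E = TypeVar("E")


def match_best(enterprises: Iterable[E],
               score_fn: Callable[[E], float]) -> Tuple[Optional[E], Optional[float]]:
    """Score every enterprise (S4 repeated per S5) and return the top scorer."""
    best: Optional[E] = None
    best_score: Optional[float] = None
    for enterprise in enterprises:      # S5: traverse the whole enterprise set
        score = score_fn(enterprise)    # S4: matching score for this enterprise
        if best_score is None or score > best_score:
            best, best_score = enterprise, score
    return best, best_score


scores = {"Enterprise A": 50, "Enterprise B": 90, "Enterprise C": 80}
print(match_best(scores, scores.get))
```

Keeping the full score table instead of only the maximum would also support retrieving enterprises at several relevance levels, as the following paragraph suggests.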
  • This disclosure establishes in advance an enterprise information database containing the segmented enterprise information, and then quickly matches enterprises to event report text through each enterprise's key information. Associating events with enterprises is therefore very efficient, and the relevance of an event to an enterprise is determined by the different matching degrees, so that enterprise information of differing relevance can subsequently be obtained according to actual needs. The method can also be extended: new dimensions can be added to the matching to improve accuracy.
  • This embodiment provides an information matching terminal, including one or more processors 1 and a memory 2, the memory 2 stores a program, and is configured to be executed by the one or more processors 1 in the following steps:
  • the S1 is specifically: obtaining characters corresponding to the division of the administrative area in the enterprise information to obtain a division word of the administrative area; obtaining characters corresponding to the enterprise abbreviation in the enterprise information, obtaining a participle of the enterprise abbreviation; obtaining Characters corresponding to the nature of the enterprise in the enterprise information are obtained as a participle of the nature of the enterprise; characters corresponding to the type of enterprise in the information of the enterprise are obtained as a participle of the type of enterprise; according to the administrative division word, the enterprise abbreviated participle, The enterprise nature word segmentation and the enterprise type word segmentation generate the ordered set of the first word segmentation.
  • brackets and the characters in the brackets are generally province name + county name or city name + district name, such as Fujian province or Siming District of Fujian province.
  • the business participles are generally information, e-commerce, real estate, etc.
  • Enterprise type segmentation is generally limited liability companies, joint stock companies, partnerships, etc.
  • an enterprise's information is "Xiamen XXXX Information Co., Ltd. in Fujian province”.
  • the administrative division word "Xiamen City, Fujian province”
  • the enterprise short name segmentation "XXXX”
  • the enterprise nature segmentation "information”
  • the enterprise type segmentation "shareholding company”.
  • the above participles are arranged in an orderly set in the first participle, the first participle set is specifically ⁇ "Xiamen City, Fujian province", “XXXX”, “Information", “Company Limited” ⁇ .
  • an enterprise information database is formed.
  • the event title and event content are obtained as text information.
  • brackets in the preset text information if there are brackets in the preset text information and the number of characters in the brackets is less than 10, the brackets and the characters in the brackets are deleted.
  • the removal of the parentheses and the characters in the parentheses in the text information is to be consistent with the operation when the enterprise information is split, to ensure the consistency of the word segmentation results, and to improve the matching accuracy of the enterprise information and the text information.
  • the content in parentheses in the company name basically does not exceed five words.
  • the deletion operation is performed only if the characters in the parentheses are less than 10.
  • the S3 is specifically:
  • the word segmentation and pure number segmentation are filtered out after the word segmentation, which effectively reduces the number of matching cycles with the enterprise information in the enterprise information database, which is conducive to improving the efficiency of matching enterprise information and text information.
  • the text information word segmentation matches the business abbreviation word segmentation in the first set of first word segmentation, then the number of word segments and the number of matches of the second word segmentation set and the first first word segmentation ordered set match The number of the participle of is in the ordered set of the first participle, and the matching score is calculated.
  • the text information is taken from event reports and public opinion, it may not be possible to state the company name and other information in a detailed and standard manner. Therefore, the administrative division words, enterprise nature segmentation and enterprise type corresponding to the enterprise information may not be included in the text information. Appears, and the participle of enterprise abbreviation must exist in the text information.
  • the enterprise abbreviation participle is used as the keyword key, and the enterprise complete name # Administrative Division Word # ⁇ ⁇ ⁇ ⁇ # ⁇ ⁇ ⁇ ⁇ # Industry name segmentation # Address segmentation as the keyword corresponding value
  • a participle in the text information matches a business abbreviation participle corresponding to one or more enterprise information in the enterprise information database, a further matching operation is performed, which greatly improves the matching efficiency.
  • the matching score is calculated as follows:
  • the second participle set includes the first first participle ordered set, set the matching score corresponding to the first first participle ordered set to the first value;
  • the second participle set contains only the enterprise abbreviated participle, the enterprise nature participle, and the enterprise type participle in the first first participle ordered set
  • an ordered set with the first first participle is set
  • the corresponding matching score is the second value
  • the second participle set contains only the enterprise abbreviated participle and the enterprise property participle in the first set of first participles, set the matching score corresponding to the first set of first participles as the first Three values
  • the second participle set contains only the enterprise abbreviated participle and the business type participle in the first set of first participles, set the matching score corresponding to the first set of first participles as the first Four values
  • the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
  • the ordered set of the first participle corresponding to an enterprise's information in the enterprise information database is: ⁇ "Xiamen City, Fujian province", “XXXX”, “Information”, “Company Limited” ⁇ . If “Xiamen City, Fujian province”, “XXXX”, “Information”, and “Company Limited” are also present in the text information, the enterprise referred to in the text information and the enterprise information corresponding to the ordered set of the first participle are fully consistent, The matching score is 100 points. If only “XXXX”, “Information”, and “Company Limited” exist in the text information, the enterprise indicated in the text information matches the business information corresponding to the ordered set of the first participle highly, with a matching score of 90 Minute.
  • the enterprise indicated in the text information matches the business information corresponding to the ordered set of the first participle with a high matching degree, and the matching score is 80 points. If there is only "XXXX” in the text information, the enterprise indicated in the text information basically matches the enterprise information corresponding to the ordered set of the first participle, and the matching score is 50 points.
  • scoring according to the different matching degrees of the respective word segmentations of the text information and the enterprise information is helpful to improve the accuracy of the matching result.
  • the ordered set of first participles also includes an address participle and an industry name participle;
  • when the second participle set contains the address participle, the matching score is increased by a fifth value;
  • when the second participle set contains the industry name participle, the matching score is increased by a sixth value;
  • the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
  • If the text information contains the address participle, the matching score is increased by the fifth value. If the text information contains the industry name participle, the matching score is increased by the sixth value; if it does not, the industry name participle is itself segmented to obtain an industry name participle list, the list is traversed, each entry is checked against the text information in turn, and the matching score is increased accordingly for each hit until the traversal ends.
  • The address is not precise down to the house number; it is truncated at the road or street, for example: Software Park Phase II, Guanri Road.
  • For example, suppose the score obtained after matching the text information against the administrative division participle, enterprise abbreviation participle, enterprise nature participle and enterprise type participle of the ordered set of first participles of a piece of enterprise information is 80 points. If the text information contains the address participle of that enterprise information, the matching score is increased by 5 points to 85 points. If the text information also contains the industry name participle of that enterprise information, another 5 points are added, giving 90 points. If the text information does not fully match the industry name participle, the industry name participle is further subdivided and the matching score is increased according to which parts match.
  • For instance, the industry name participle "information system integration service" can be subdivided into "information", "system integration" and "service", each of which is then matched against the text information.
  • Matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.
  • step S5. Repeat step S4 until all elements in the enterprise information set are traversed.
  • the enterprise information with the highest matching score is the information of the enterprise in the enterprise information database that most closely matches the event or public opinion reported by the text information.
  • This embodiment of the present disclosure also provides a computer-readable storage medium on which a program is stored; when executed by a computer, the program performs the following steps:
  • S1 is specifically: obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle; obtaining the characters that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle; obtaining the characters that correspond to the enterprise nature, to obtain an enterprise nature participle; obtaining the characters that correspond to the enterprise type, to obtain an enterprise type participle; and generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.
  • Before the enterprise information is segmented, the brackets and the characters inside them are deleted. The administrative division participle is generally a province name plus a county name, or a city name plus a district name, such as Fujian Province, or Siming District, Fujian Province.
  • The enterprise nature participle is generally a word such as information, e-commerce or real estate.
  • The enterprise type participle is generally a term such as limited liability company, joint stock limited company or partnership.
  • For example, a piece of enterprise information is "Fujian Province Xiamen City XXXX Information Co., Ltd.". Segmenting it yields:
  • the administrative division participle "Fujian Province Xiamen City";
  • the enterprise abbreviation participle "XXXX";
  • the enterprise nature participle "Information";
  • the enterprise type participle "Co., Ltd.".
  • These participles are arranged in order in the ordered set of first participles, which is specifically {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}.
  • For example, an enterprise information database is formed after multiple pieces of enterprise information have been segmented.
  • The event title and event content are obtained as the text information.
  • If the preset text information contains brackets and the number of characters inside them is fewer than 10, the brackets and the characters inside them are deleted.
  • Deleting the brackets and the characters inside them from the text information keeps the operation consistent with the one applied when the enterprise information is split, ensures consistent segmentation results, and improves the accuracy of matching enterprise information with text information.
  • The bracketed content of an enterprise name rarely exceeds five characters.
  • To avoid deleting other parts of the text information by mistake, the deletion is performed only when the brackets hold fewer than 10 characters.
  • S3 is specifically:
  • Filtering out single-character participles and purely numeric participles after segmentation effectively reduces the number of matching loops against the enterprise information in the enterprise information database, which helps improve the efficiency of matching enterprise information with text information.
  • If the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, the matching score is calculated from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set.
  • Because the text information is taken from event reports and public opinion, it may not state the enterprise name and similar details fully or in standard form. The administrative division participle, enterprise nature participle and enterprise type participle of the enterprise information may therefore all be absent from the text information, whereas the enterprise abbreviation participle is necessarily present.
  • In the enterprise information database, the enterprise abbreviation participle is used as the key, and the string "full enterprise name#administrative division participle#enterprise nature participle#enterprise type participle#industry name participle#address participle" is used as the corresponding value. A further matching operation is performed only when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information in the database, which greatly improves matching efficiency.
  • the matching score is calculated as follows:
  • when the second participle set contains the whole ordered set of first participles, the matching score corresponding to that ordered set is set to the first value;
  • when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of the ordered set of first participles, the matching score corresponding to that ordered set is set to the second value;
  • when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of the ordered set of first participles, the matching score corresponding to that ordered set is set to the third value;
  • when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of the ordered set of first participles, the matching score corresponding to that ordered set is set to the fourth value;
  • the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
  • For example, the ordered set of first participles corresponding to a piece of enterprise information in the enterprise information database is {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}. If "Fujian Province Xiamen City", "XXXX", "Information" and "Co., Ltd." are all present in the text information, the enterprise referred to in the text information fully matches the enterprise information corresponding to that ordered set, and the matching score is 100 points. If only "XXXX", "Information" and "Co., Ltd." are present, the match is very close, and the matching score is 90 points.
  • If only "XXXX" and "Information" are present in the text information, the enterprise referred to in the text information matches the enterprise information corresponding to that ordered set fairly closely, and the matching score is 80 points. If only "XXXX" is present, the enterprise referred to in the text information basically matches that enterprise information, and the matching score is 50 points.
  • Scoring according to how closely the participles of the text information match those of the enterprise information helps improve the accuracy of the matching result.
  • the ordered set of first participles also includes an address participle and an industry name participle;
  • when the second participle set contains the address participle, the matching score is increased by a fifth value;
  • when the second participle set contains the industry name participle, the matching score is increased by a sixth value;
  • the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
  • If the text information contains the address participle, the matching score is increased by the fifth value. If the text information contains the industry name participle, the matching score is increased by the sixth value; if it does not, the industry name participle is itself segmented to obtain an industry name participle list, the list is traversed, each entry is checked against the text information in turn, and the matching score is increased accordingly for each hit until the traversal ends.
  • The address is not precise down to the house number; it is truncated at the road or street, for example: Software Park Phase II, Guanri Road.
  • For example, suppose the score obtained after matching the text information against the administrative division participle, enterprise abbreviation participle, enterprise nature participle and enterprise type participle of the ordered set of first participles of a piece of enterprise information is 80 points. If the text information contains the address participle of that enterprise information, the matching score is increased by 5 points to 85 points. If the text information also contains the industry name participle of that enterprise information, another 5 points are added, giving 90 points. If the text information does not fully match the industry name participle, the industry name participle is further subdivided and the matching score is increased according to which parts match.
  • For instance, the industry name participle "information system integration service" can be subdivided into "information", "system integration" and "service", each of which is then matched against the text information.
  • Matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.
  • step S5. Repeat step S4 until all elements in the enterprise information set are traversed.
  • the enterprise information with the highest matching score is the information of the enterprise in the enterprise information database that most closely matches the event or public opinion reported by the text information.
  • In the information matching method and terminal provided by the present disclosure, the participles of each piece of enterprise information are arranged in order in an ordered set of first participles. When text information is matched against the segmented enterprise information, a matching score can be generated from the number of participles the two have in common and the importance of the matched participles.
  • The information of the enterprise most relevant to the event report or public opinion can then be derived from the matching scores between the text information and each piece of enterprise information in the enterprise information database, which greatly improves the accuracy of matching text information with enterprise information.
  • Scoring according to how closely the participles of the text information match those of the enterprise information helps improve the accuracy of the matching result.
  • Matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.
  • A further matching operation is performed only when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information, which greatly improves matching efficiency.
  • Filtering out single-character participles and purely numeric participles effectively reduces the number of matching loops against the enterprise information in the enterprise information database, which helps improve the efficiency of matching enterprise information with text information.
  • Deleting the brackets and the characters inside them from the text information keeps the operation consistent with the one applied when the enterprise information is split, ensures consistent segmentation results, and improves the accuracy of matching enterprise information with text information.
  • The bracketed content of an enterprise name rarely exceeds five characters.
  • To avoid deleting other parts of the text information by mistake, the deletion is performed only when the brackets hold fewer than 10 characters.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

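The present disclosure relates to the field of data processing, and in particular to an information matching method and terminal. The present disclosure: S1, segments a piece of enterprise information to obtain an ordered set of first participles corresponding to that enterprise information; S2, obtains two or more of the ordered sets of first participles to form an enterprise information set; S3, segments preset text information to obtain a second participle set; S4, takes one ordered set of first participles from the enterprise information set and calculates a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set; S5, repeats step S4 until every element of the enterprise information set has been traversed; S6, obtains the enterprise information corresponding to the ordered set of first participles with the highest matching score. This greatly improves the accuracy of matching text information with enterprise information.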

Description

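An information matching method and terminal
Related Application
This application claims priority to Chinese patent application No. 201811341250.6, filed on November 12, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of data processing, and in particular to an information matching method and terminal.
Background
As society develops, conflicts and disputes between individuals and enterprises are becoming more frequent. To prevent such disputes effectively, or to handle them promptly, the relevant authorities need to quickly obtain the enterprise information that matches the relevant event information and public opinion.
Two information matching methods are commonly used at present. In the first, the event information or public opinion is segmented, and the resulting participles are fuzzy-matched against a preset enterprise information database to obtain the enterprise information related to the event information or public opinion. This approach also returns a large amount of irrelevant enterprise information, so its hit rate is low. In the second, event information and enterprise information are associated manually. This approach is highly accurate but inefficient, and consumes considerable human resources.
Summary of the Disclosure
The technical problem to be solved by the present disclosure is how to improve the accuracy of matching text information with enterprise information.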
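To solve the above technical problem, the present disclosure adopts the following technical solution:
The present disclosure provides an information matching method, including:
S1. Segment a piece of enterprise information to obtain an ordered set of first participles corresponding to that enterprise information;
S2. Obtain two or more of the ordered sets of first participles to form an enterprise information set;
S3. Segment preset text information to obtain a second participle set;
S4. Take one ordered set of first participles from the enterprise information set; calculate a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set;
S5. Repeat step S4 until every element of the enterprise information set has been traversed;
S6. Obtain the enterprise information corresponding to the ordered set of first participles with the highest matching score.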
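Further, S1 is specifically:
obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle;
obtaining the characters in the enterprise information that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle;
obtaining the characters in the enterprise information that correspond to the enterprise nature, to obtain an enterprise nature participle;
obtaining the characters in the enterprise information that correspond to the enterprise type, to obtain an enterprise type participle;
generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.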
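Further, calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value;
when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a second value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score corresponding to that ordered set to a third value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a fourth value;
the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.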
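Further, the ordered set of first participles also includes an address participle and an industry name participle;
when the second participle set contains the address participle, the matching score is increased by a fifth value;
when the second participle set contains the industry name participle, the matching score is increased by a sixth value;
the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.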
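Further, calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
taking one participle from the second participle set as a text information participle;
if the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, calculating the matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set.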
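Further, S3 is specifically:
segmenting the preset text information to obtain an initial participle set;
deleting the numeric participles and single-character participles from the initial participle set to obtain the second participle set.
Further, before S1, the method also includes:
deleting the brackets in the enterprise information together with the characters inside them;
before S3, the method also includes: if the preset text information contains brackets and the number of characters inside them is fewer than 10, deleting the brackets and the characters inside them.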
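The present disclosure further provides a computer-readable storage medium storing a program which, when executed by a computer, performs the information matching method described above.
The present disclosure also provides an information matching terminal, including one or more processors and a memory, the memory storing a program configured to be executed by the one or more processors to perform the following steps:
S1. Segment a piece of enterprise information to obtain an ordered set of first participles corresponding to that enterprise information;
S2. Obtain two or more of the ordered sets of first participles to form an enterprise information set;
S3. Segment preset text information to obtain a second participle set;
S4. Take one ordered set of first participles from the enterprise information set; calculate a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set;
S5. Repeat step S4 until every element of the enterprise information set has been traversed;
S6. Obtain the enterprise information corresponding to the ordered set of first participles with the highest matching score.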
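Further, S1 is specifically:
obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle;
obtaining the characters in the enterprise information that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle;
obtaining the characters in the enterprise information that correspond to the enterprise nature, to obtain an enterprise nature participle;
obtaining the characters in the enterprise information that correspond to the enterprise type, to obtain an enterprise type participle;
generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle;
calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
taking one participle from the second participle set as a text information participle;
if the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, then: when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value; when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score to a second value; when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score to a third value; when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score to a fourth value; the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value;
the ordered set of first participles also includes an address participle and an industry name participle; when the second participle set contains the address participle, the matching score is increased by a fifth value; when the second participle set contains the industry name participle, the matching score is increased by a sixth value; the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.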
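Further, before S1, the steps also include: deleting the brackets in the enterprise information together with the characters inside them;
S3 is specifically:
if the preset text information contains brackets and the number of characters inside them is fewer than 10, deleting the brackets and the characters inside them;
segmenting the preset text information to obtain an initial participle set;
deleting the numeric participles and single-character participles from the initial participle set to obtain the second participle set.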
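The beneficial effects of the present disclosure are as follows. When enterprise information is segmented, the resulting participles differ in importance. The present disclosure arranges the participles of each piece of enterprise information in order in an ordered set of first participles, so that when the text information of an event report or public opinion is matched in turn against the segmented enterprise information in the enterprise information database, a matching score can be generated from the number of participles the two have in common and the importance of the matched participles. The information of the enterprise most relevant to the event report or public opinion can then be derived from the matching scores between the text information and each piece of enterprise information in the database, which greatly improves the accuracy of matching text information with enterprise information.
Brief Description of the Drawings
Fig. 1 is a flow block diagram of a specific embodiment of an information matching method provided by the present disclosure;
Fig. 2 is a structural block diagram of a specific embodiment of an information matching terminal provided by the present disclosure;
Reference numerals:
1. processor; 2. memory.
Detailed Description
To explain the technical content, objectives and effects of the present disclosure in detail, a description is given below with reference to the embodiments and the accompanying drawings.
Please refer to Fig. 1 and Fig. 2.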
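As shown in Fig. 1, the present disclosure provides an information matching method, including:
S1. Segment a piece of enterprise information to obtain an ordered set of first participles corresponding to that enterprise information;
S2. Obtain two or more of the ordered sets of first participles to form an enterprise information set;
S3. Segment preset text information to obtain a second participle set;
S4. Take one ordered set of first participles from the enterprise information set; calculate a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set;
S5. Repeat step S4 until every element of the enterprise information set has been traversed;
S6. Obtain the enterprise information corresponding to the ordered set of first participles with the highest matching score.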
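Further, S1 is specifically:
obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle;
obtaining the characters in the enterprise information that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle;
obtaining the characters in the enterprise information that correspond to the enterprise nature, to obtain an enterprise nature participle;
obtaining the characters in the enterprise information that correspond to the enterprise type, to obtain an enterprise type participle;
generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.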
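Further, calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value;
when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a second value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score corresponding to that ordered set to a third value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a fourth value;
the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
As can be seen from the above, scoring according to how closely the participles of the text information match those of the enterprise information helps improve the accuracy of the matching result.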
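Further, the ordered set of first participles also includes an address participle and an industry name participle;
when the second participle set contains the address participle, the matching score is increased by a fifth value;
when the second participle set contains the industry name participle, the matching score is increased by a sixth value;
the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
As can be seen from the above, matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.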
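Further, calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
taking one participle from the second participle set as a text information participle;
if the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, calculating the matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set.
As can be seen from the above, a further matching operation is performed only when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information in the enterprise information database, which greatly improves matching efficiency.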
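Further, S3 is specifically:
segmenting the preset text information to obtain an initial participle set;
deleting the numeric participles and single-character participles from the initial participle set to obtain the second participle set.
As can be seen from the above, filtering out single-character participles and purely numeric participles after segmentation effectively reduces the number of matching loops against the enterprise information in the enterprise information database, which helps improve the efficiency of matching enterprise information with text information.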
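Further, before S1, the method also includes:
deleting the brackets in the enterprise information together with the characters inside them;
before S3, the method also includes: if the preset text information contains brackets and the number of characters inside them is fewer than 10, deleting the brackets and the characters inside them.
As can be seen from the above, the bracketed content of an enterprise name usually does not appear in the text information being analysed, so deleting the bracketed content of the enterprise information during preprocessing helps improve matching accuracy and efficiency. Deleting the brackets and the characters inside them from the text information keeps the operation consistent with the one applied when the enterprise information is split, ensures consistent segmentation results, and improves the accuracy of matching enterprise information with text information. Moreover, the bracketed content of an enterprise name rarely exceeds five characters; to avoid deleting other parts of the text information by mistake, the deletion is performed only when the brackets hold fewer than 10 characters.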
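The present disclosure further provides a computer-readable storage medium storing a program which, when executed by a computer, performs the information matching method described above.
As shown in Fig. 2, the present disclosure also provides an information matching terminal, including one or more processors 1 and a memory 2, the memory 2 storing a program configured to be executed by the one or more processors 1 to perform the following steps:
S1. Segment a piece of enterprise information to obtain an ordered set of first participles corresponding to that enterprise information;
S2. Obtain two or more of the ordered sets of first participles to form an enterprise information set;
S3. Segment preset text information to obtain a second participle set;
S4. Take one ordered set of first participles from the enterprise information set; calculate a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set;
S5. Repeat step S4 until every element of the enterprise information set has been traversed;
S6. Obtain the enterprise information corresponding to the ordered set of first participles with the highest matching score.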
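Further, S1 is specifically:
obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle;
obtaining the characters in the enterprise information that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle;
obtaining the characters in the enterprise information that correspond to the enterprise nature, to obtain an enterprise nature participle;
obtaining the characters in the enterprise information that correspond to the enterprise type, to obtain an enterprise type participle;
generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle;
calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
taking one participle from the second participle set as a text information participle;
if the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, then: when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value; when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score to a second value; when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score to a third value; when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score to a fourth value; the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value;
the ordered set of first participles also includes an address participle and an industry name participle; when the second participle set contains the address participle, the matching score is increased by a fifth value; when the second participle set contains the industry name participle, the matching score is increased by a sixth value; the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.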
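Further, before S1, the steps also include: deleting the brackets in the enterprise information together with the characters inside them;
S3 is specifically:
if the preset text information contains brackets and the number of characters inside them is fewer than 10, deleting the brackets and the characters inside them;
segmenting the preset text information to obtain an initial participle set;
deleting the numeric participles and single-character participles from the initial participle set to obtain the second participle set.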
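Embodiment 1 of the present disclosure:
This embodiment provides an information matching method, including:
S1. Delete the brackets in a piece of enterprise information together with the characters inside them; segment the enterprise information to obtain an ordered set of first participles corresponding to that enterprise information.
Optionally, S1 is specifically: obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle; obtaining the characters that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle; obtaining the characters that correspond to the enterprise nature, to obtain an enterprise nature participle; obtaining the characters that correspond to the enterprise type, to obtain an enterprise type participle; and generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.
Before the enterprise information is segmented, the brackets and the characters inside them are deleted. The administrative division participle is generally a province name plus a county name, or a city name plus a district name, for example Fujian Province, or Siming District, Fujian Province. The enterprise nature participle is generally a word such as information, e-commerce or real estate. The enterprise type participle is generally a term such as limited liability company, joint stock limited company or partnership.
For example, a piece of enterprise information is "Fujian Province Xiamen City XXXX Information Co., Ltd." (福建省厦门市XXXX信息股份有限公司). Segmenting it yields the administrative division participle "Fujian Province Xiamen City", the enterprise abbreviation participle "XXXX", the enterprise nature participle "Information" and the enterprise type participle "Co., Ltd.". These participles are arranged in order in the ordered set of first participles, which is specifically {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}.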
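The splitting in S1 can be sketched as follows, using the original Chinese form of the example name. This is only an illustration under stated assumptions: the source does not say how each field is located, so the vocabularies ENTERPRISE_TYPES and ENTERPRISE_NATURES, the pattern ADMIN_PATTERN and the function split_enterprise_name are all hypothetical.

```python
import re

# Hypothetical vocabularies (assumption); longer enterprise types come first
# so that "股份有限公司" is tried before its suffix "有限公司".
ENTERPRISE_TYPES = ["股份有限公司", "有限责任公司", "有限公司", "合伙企业"]
ENTERPRISE_NATURES = ["电子商务", "房地产", "信息"]

# Assumed shape of the administrative division: an optional province followed
# by any number of city / county / district names at the start of the name.
ADMIN_PATTERN = re.compile(r"^(?:[^省市县区]{2,}省)?(?:[^省市县区]{2,}[市县区])*")

# Brackets (full-width or ASCII) and their content are removed before S1.
BRACKETS = re.compile(r"[（(][^（）()]*[）)]")


def split_enterprise_name(name):
    """Return [administrative division, abbreviation, nature, type]."""
    name = BRACKETS.sub("", name)
    admin = ADMIN_PATTERN.match(name).group(0)
    rest = name[len(admin):]
    etype = next((t for t in ENTERPRISE_TYPES if rest.endswith(t)), "")
    if etype:
        rest = rest[: -len(etype)]
    nature = next((n for n in ENTERPRISE_NATURES if rest.endswith(n)), "")
    if nature:
        rest = rest[: -len(nature)]
    # Whatever remains is treated as the enterprise abbreviation.
    return [admin, rest, nature, etype]


print(split_enterprise_name("福建省厦门市XXXX信息股份有限公司"))
```

The list order mirrors the ordered set of first participles in the example; a production system would need far richer reference data than these short lists.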
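S2. Obtain two or more of the ordered sets of first participles to form an enterprise information set.
For example, an enterprise information database is formed after multiple pieces of enterprise information have been segmented.
S3. Segment preset text information to obtain a second participle set.
The event title and event content are obtained as the text information.
Optionally, if the preset text information contains brackets and the number of characters inside them is fewer than 10, the brackets and the characters inside them are deleted.
Deleting the brackets and the characters inside them from the text information keeps the operation consistent with the one applied when the enterprise information is split, ensures consistent segmentation results, and improves the accuracy of matching enterprise information with text information. Moreover, the bracketed content of an enterprise name rarely exceeds five characters; to avoid deleting other parts of the text information by mistake, the deletion is performed only when the brackets hold fewer than 10 characters.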
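The bracket rule for text information might look like the sketch below. The function name strip_short_brackets and its limit parameter are hypothetical, and nested brackets are not handled; only the "fewer than 10 characters" threshold comes from the text.

```python
import re

# Matches one non-nested bracket pair, full-width or ASCII.
BRACKET_RE = re.compile(r"[（(]([^（）()]*)[）)]")


def strip_short_brackets(text, limit=10):
    """Delete a bracketed span only when it holds fewer than `limit` characters."""
    def repl(match):
        inner = match.group(1)
        # Short spans are assumed to be enterprise-name qualifiers and dropped;
        # longer spans are kept so real content is not deleted by mistake.
        return "" if len(inner) < limit else match.group(0)

    return BRACKET_RE.sub(repl, text)


print(strip_short_brackets("XXXX信息（厦门）公司被投诉"))
```

Passing a function as the replacement to re.sub lets the length check be applied to each bracket pair separately.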
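Optionally, S3 is specifically:
segmenting the preset text information to obtain an initial participle set;
deleting the numeric participles and single-character participles from the initial participle set to obtain the second participle set.
Filtering out single-character participles and purely numeric participles after segmentation effectively reduces the number of matching loops against the enterprise information in the enterprise information database, which helps improve the efficiency of matching enterprise information with text information.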
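The filtering half of S3 is small enough to show directly. The sketch assumes the initial participle set has already been produced by some word segmenter, which the source does not name; filter_tokens is a hypothetical helper.

```python
def filter_tokens(tokens):
    """Drop single-character participles and purely numeric participles."""
    # str.isdigit() is True only when every character is a digit, so a value
    # such as "12.5" is kept; that boundary is an assumption of this sketch.
    return [t for t in tokens if len(t) > 1 and not t.isdigit()]


initial = ["XXXX", "信息", "的", "2018", "股份有限公司", "3"]
print(filter_tokens(initial))
```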
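S4. Take one ordered set of first participles from the enterprise information set; calculate a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set. Specifically:
taking one participle from the second participle set as a text information participle;
if the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, calculating the matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set.
Because the text information is taken from event reports and public opinion, it may not state the enterprise name and similar details fully or in standard form. The administrative division participle, enterprise nature participle and enterprise type participle of the enterprise information may therefore all be absent from the text information, whereas the enterprise abbreviation participle is necessarily present. In the enterprise information database of the present disclosure, the enterprise abbreviation participle is used as the key, and the string "full enterprise name#administrative division participle#enterprise nature participle#enterprise type participle#industry name participle#address participle" is used as the corresponding value. A further matching operation is performed only when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information in the database, which greatly improves matching efficiency.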
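The key/value layout described above can be illustrated with an in-memory dictionary. This is a sketch, not the database actually used: the field names in FIELDS, the record layout and the helpers build_index and lookup are assumptions, and the fields are assumed to contain no "#" character.

```python
from collections import defaultdict

# Order of the "#"-joined value, following the description.
FIELDS = ("full_name", "admin", "nature", "etype", "industry", "address")


def build_index(records):
    """Map each enterprise abbreviation to its '#'-joined value strings."""
    index = defaultdict(list)
    for rec in records:
        index[rec["short_name"]].append("#".join(rec[f] for f in FIELDS))
    return index


def lookup(index, text_tokens):
    """Return only the enterprises whose abbreviation occurs in the text."""
    hits = []
    for tok in dict.fromkeys(text_tokens):  # de-duplicate, keep order
        for value in index.get(tok, []):
            hits.append(dict(zip(FIELDS, value.split("#")), short_name=tok))
    return hits


records = [{
    "short_name": "XXXX",
    "full_name": "福建省厦门市XXXX信息股份有限公司",
    "admin": "福建省厦门市",
    "nature": "信息",
    "etype": "股份有限公司",
    "industry": "信息系统集成服务",
    "address": "软件园二期观日路",
}]
index = build_index(records)
print(lookup(index, ["投诉", "XXXX", "XXXX"]))
```

Because only abbreviation hits are expanded, the more expensive scoring step runs on a handful of candidates rather than on the whole database.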
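Optionally, calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value;
when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a second value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score corresponding to that ordered set to a third value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a fourth value;
the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
For example, the ordered set of first participles corresponding to a piece of enterprise information in the enterprise information database is {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}. If "Fujian Province Xiamen City", "XXXX", "Information" and "Co., Ltd." are all present in the text information, the enterprise referred to in the text information fully matches the enterprise information corresponding to that ordered set, and the matching score is 100 points. If only "XXXX", "Information" and "Co., Ltd." are present, the match is very close, and the matching score is 90 points. If only "XXXX" and "Information" are present, the match is fairly close, and the matching score is 80 points. If only "XXXX" is present, the enterprise referred to in the text information basically matches that enterprise information, and the matching score is 50 points.
Scoring according to how closely the participles of the text information match those of the enterprise information helps improve the accuracy of the matching result.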
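A minimal sketch of the tiered scoring follows. The values 100, 90, 80 and 50 come from the worked example; the value 70 for the "abbreviation plus enterprise type" case is an assumption, since the text only requires it to be below the third value. Returning 0 when the abbreviation is missing, and ignoring the administrative division unless everything matches, are also assumptions of this sketch. The function name tier_score is hypothetical.

```python
def tier_score(ordered, text_tokens,
               first=100, second=90, third=80, fourth=70, short_only=50):
    """Score one ordered set [admin, short, nature, etype] against the text."""
    admin, short, nature, etype = ordered[:4]
    tokens = set(text_tokens)
    if short not in tokens:
        return 0  # gate: without the abbreviation no further matching is done
    has_admin = admin in tokens
    has_nature = nature in tokens
    has_type = etype in tokens
    if has_admin and has_nature and has_type:
        return first       # the whole ordered set is present
    if has_nature and has_type:
        return second      # abbreviation + nature + type
    if has_nature:
        return third       # abbreviation + nature
    if has_type:
        return fourth      # abbreviation + type (assumed value)
    return short_only      # abbreviation only, as in the worked example


ordered = ["福建省厦门市", "XXXX", "信息", "股份有限公司"]
print(tier_score(ordered, ["XXXX", "信息", "股份有限公司"]))
```

Keeping the tier values as parameters makes it easy to check the required ordering (first > second > third > fourth) against whatever numbers a deployment chooses.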
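Optionally, the ordered set of first participles also includes an address participle and an industry name participle;
when the second participle set contains the address participle, the matching score is increased by a fifth value;
when the second participle set contains the industry name participle, the matching score is increased by a sixth value;
the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
If the text information contains the address participle, the matching score is increased by the fifth value. If the text information contains the industry name participle, the matching score is increased by the sixth value; if it does not, the industry name participle is itself segmented to obtain an industry name participle list, the list is traversed, each entry is checked against the text information in turn, and the matching score is increased accordingly for each hit until the traversal ends.
The address is not precise down to the house number; it is truncated at the road or street, for example: Software Park Phase II, Guanri Road.
For example, suppose the score obtained after matching the text information against the administrative division participle, enterprise abbreviation participle, enterprise nature participle and enterprise type participle of the ordered set of first participles of a piece of enterprise information is 80 points. If the text information contains the address participle of that enterprise information, the matching score is increased by 5 points to 85 points. If the text information also contains the industry name participle of that enterprise information, another 5 points are added, giving 90 points. If the text information does not fully match the industry name participle, the industry name participle is further subdivided and the matching score is increased according to which parts match. For instance, the industry name participle "information system integration service" can be subdivided into "information", "system integration" and "service", each of which is then matched against the text information.
Matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.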
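The address and industry bonus can be sketched in the same spirit. The 5-point values follow the example; the per-part increment part_bonus=1 is purely an assumption, because the text only says the score increases "accordingly" for each hit. Matching by substring on the raw text, rather than on the participle set, is also an assumption, and bonus_score is a hypothetical name.

```python
def bonus_score(text, address, industry, industry_parts,
                fifth=5, sixth=5, part_bonus=1):
    """Extra points for address and industry-name hits in the text."""
    bonus = 0
    if address and address in text:
        bonus += fifth
    if industry and industry in text:
        bonus += sixth
    else:
        # Fall back to the subdivided industry name and count each hit.
        for part in industry_parts:
            if part in text:
                bonus += part_bonus
    return bonus


text = "软件园二期观日路的XXXX信息公司提供信息系统集成服务"
base = 80
print(base + bonus_score(text, "软件园二期观日路", "信息系统集成服务",
                         ["信息", "系统集成", "服务"]))
```

With the defaults above, a base score of 80 plus both bonuses reproduces the 90 points of the worked example.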
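S5. Repeat step S4 until every element of the enterprise information set has been traversed.
S6. Obtain the enterprise information corresponding to the ordered set of first participles with the highest matching score.
The enterprise information with the highest matching score is the information of the enterprise in the enterprise information database that best matches the event or public opinion reported in the text information.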
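Steps S4 to S6 amount to a loop that keeps the best-scoring candidate. In the sketch below, count_score is only a stand-in scorer (it counts common participles) so that the block runs on its own; it is not the scoring rule of the disclosure. best_match is a hypothetical name, and returning the first candidate on a tie is an assumption.

```python
def count_score(ordered, text_tokens):
    """Stand-in scorer: how many participles of the ordered set occur in the text."""
    tokens = set(text_tokens)
    return sum(1 for p in ordered if p in tokens)


def best_match(enterprise_sets, text_tokens, score_fn=count_score):
    """Traverse every ordered set (S5), score it (S4), keep the highest (S6)."""
    best, best_score = None, 0
    for ordered in enterprise_sets:
        score = score_fn(ordered, text_tokens)
        if score > best_score:  # strict comparison: the first of equal scores wins
            best, best_score = ordered, score
    return best, best_score


sets = [
    ["福建省厦门市", "XXXX", "信息", "股份有限公司"],
    ["福建省福州市", "YYYY", "房地产", "有限责任公司"],
]
print(best_match(sets, ["XXXX", "信息", "投诉"]))
```

Passing the scorer in as score_fn keeps the traversal independent of how the matching score is actually computed.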
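The present disclosure builds, in advance, an enterprise information database containing the participles of enterprise information, and then uses the key information of each enterprise to quickly match enterprises against event report text, so associating events with enterprises is very efficient. The degree of association between an event and an enterprise is determined by the matching degree, so enterprise information of different association degrees can later be retrieved as needed. The method can also be extended by adding new dimensions to the matching to improve accuracy.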
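Embodiment 2 of the present disclosure:
This embodiment provides an information matching terminal, including one or more processors 1 and a memory 2, the memory 2 storing a program configured to be executed by the one or more processors 1 to perform the following steps:
S1. Delete the brackets in a piece of enterprise information together with the characters inside them; segment the enterprise information to obtain an ordered set of first participles corresponding to that enterprise information.
Optionally, S1 is specifically: obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle; obtaining the characters that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle; obtaining the characters that correspond to the enterprise nature, to obtain an enterprise nature participle; obtaining the characters that correspond to the enterprise type, to obtain an enterprise type participle; and generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.
Before the enterprise information is segmented, the brackets and the characters inside them are deleted. The administrative division participle is generally a province name plus a county name, or a city name plus a district name, for example Fujian Province, or Siming District, Fujian Province. The enterprise nature participle is generally a word such as information, e-commerce or real estate. The enterprise type participle is generally a term such as limited liability company, joint stock limited company or partnership.
For example, a piece of enterprise information is "Fujian Province Xiamen City XXXX Information Co., Ltd." (福建省厦门市XXXX信息股份有限公司). Segmenting it yields the administrative division participle "Fujian Province Xiamen City", the enterprise abbreviation participle "XXXX", the enterprise nature participle "Information" and the enterprise type participle "Co., Ltd.". These participles are arranged in order in the ordered set of first participles, which is specifically {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}.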
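S2. Obtain two or more of the ordered sets of first participles to form an enterprise information set.
For example, an enterprise information database is formed after multiple pieces of enterprise information have been segmented.
S3. Segment preset text information to obtain a second participle set.
The event title and event content are obtained as the text information.
Optionally, if the preset text information contains brackets and the number of characters inside them is fewer than 10, the brackets and the characters inside them are deleted.
Deleting the brackets and the characters inside them from the text information keeps the operation consistent with the one applied when the enterprise information is split, ensures consistent segmentation results, and improves the accuracy of matching enterprise information with text information. Moreover, the bracketed content of an enterprise name rarely exceeds five characters; to avoid deleting other parts of the text information by mistake, the deletion is performed only when the brackets hold fewer than 10 characters.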
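Optionally, S3 is specifically:
segmenting the preset text information to obtain an initial participle set;
deleting the numeric participles and single-character participles from the initial participle set to obtain the second participle set.
Filtering out single-character participles and purely numeric participles after segmentation effectively reduces the number of matching loops against the enterprise information in the enterprise information database, which helps improve the efficiency of matching enterprise information with text information.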
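S4. Take one ordered set of first participles from the enterprise information set; calculate a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set. Specifically:
taking one participle from the second participle set as a text information participle;
if the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, calculating the matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set.
Because the text information is taken from event reports and public opinion, it may not state the enterprise name and similar details fully or in standard form. The administrative division participle, enterprise nature participle and enterprise type participle of the enterprise information may therefore all be absent from the text information, whereas the enterprise abbreviation participle is necessarily present. In the enterprise information database of the present disclosure, the enterprise abbreviation participle is used as the key, and the string "full enterprise name#administrative division participle#enterprise nature participle#enterprise type participle#industry name participle#address participle" is used as the corresponding value. A further matching operation is performed only when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information in the database, which greatly improves matching efficiency.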
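Optionally, calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value;
when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a second value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score corresponding to that ordered set to a third value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a fourth value;
the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
For example, the ordered set of first participles corresponding to a piece of enterprise information in the enterprise information database is {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}. If "Fujian Province Xiamen City", "XXXX", "Information" and "Co., Ltd." are all present in the text information, the enterprise referred to in the text information fully matches the enterprise information corresponding to that ordered set, and the matching score is 100 points. If only "XXXX", "Information" and "Co., Ltd." are present, the match is very close, and the matching score is 90 points. If only "XXXX" and "Information" are present, the match is fairly close, and the matching score is 80 points. If only "XXXX" is present, the enterprise referred to in the text information basically matches that enterprise information, and the matching score is 50 points.
Scoring according to how closely the participles of the text information match those of the enterprise information helps improve the accuracy of the matching result.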
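Optionally, the ordered set of first participles also includes an address participle and an industry name participle;
when the second participle set contains the address participle, the matching score is increased by a fifth value;
when the second participle set contains the industry name participle, the matching score is increased by a sixth value;
the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
If the text information contains the address participle, the matching score is increased by the fifth value. If the text information contains the industry name participle, the matching score is increased by the sixth value; if it does not, the industry name participle is itself segmented to obtain an industry name participle list, the list is traversed, each entry is checked against the text information in turn, and the matching score is increased accordingly for each hit until the traversal ends.
The address is not precise down to the house number; it is truncated at the road or street, for example: Software Park Phase II, Guanri Road.
For example, suppose the score obtained after matching the text information against the administrative division participle, enterprise abbreviation participle, enterprise nature participle and enterprise type participle of the ordered set of first participles of a piece of enterprise information is 80 points. If the text information contains the address participle of that enterprise information, the matching score is increased by 5 points to 85 points. If the text information also contains the industry name participle of that enterprise information, another 5 points are added, giving 90 points. If the text information does not fully match the industry name participle, the industry name participle is further subdivided and the matching score is increased according to which parts match. For instance, the industry name participle "information system integration service" can be subdivided into "information", "system integration" and "service", each of which is then matched against the text information.
Matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.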
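S5. Repeat step S4 until every element of the enterprise information set has been traversed.
S6. Obtain the enterprise information corresponding to the ordered set of first participles with the highest matching score.
The enterprise information with the highest matching score is the information of the enterprise in the enterprise information database that best matches the event or public opinion reported in the text information.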
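Embodiment 3 of the present disclosure:
This embodiment provides a computer-readable storage medium storing a program which, when executed by a computer, performs the following steps:
S1. Delete the brackets in a piece of enterprise information together with the characters inside them; segment the enterprise information to obtain an ordered set of first participles corresponding to that enterprise information.
Optionally, S1 is specifically: obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle; obtaining the characters that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle; obtaining the characters that correspond to the enterprise nature, to obtain an enterprise nature participle; obtaining the characters that correspond to the enterprise type, to obtain an enterprise type participle; and generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.
Before the enterprise information is segmented, the brackets and the characters inside them are deleted. The administrative division participle is generally a province name plus a county name, or a city name plus a district name, for example Fujian Province, or Siming District, Fujian Province. The enterprise nature participle is generally a word such as information, e-commerce or real estate. The enterprise type participle is generally a term such as limited liability company, joint stock limited company or partnership.
For example, a piece of enterprise information is "Fujian Province Xiamen City XXXX Information Co., Ltd." (福建省厦门市XXXX信息股份有限公司). Segmenting it yields the administrative division participle "Fujian Province Xiamen City", the enterprise abbreviation participle "XXXX", the enterprise nature participle "Information" and the enterprise type participle "Co., Ltd.". These participles are arranged in order in the ordered set of first participles, which is specifically {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}.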
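S2. Obtain two or more of the ordered sets of first participles to form an enterprise information set.
For example, an enterprise information database is formed after multiple pieces of enterprise information have been segmented.
S3. Segment preset text information to obtain a second participle set.
The event title and event content are obtained as the text information.
Optionally, if the preset text information contains brackets and the number of characters inside them is fewer than 10, the brackets and the characters inside them are deleted.
Deleting the brackets and the characters inside them from the text information keeps the operation consistent with the one applied when the enterprise information is split, ensures consistent segmentation results, and improves the accuracy of matching enterprise information with text information. Moreover, the bracketed content of an enterprise name rarely exceeds five characters; to avoid deleting other parts of the text information by mistake, the deletion is performed only when the brackets hold fewer than 10 characters.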
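Optionally, S3 is specifically:
segmenting the preset text information to obtain an initial participle set;
deleting the numeric participles and single-character participles from the initial participle set to obtain the second participle set.
Filtering out single-character participles and purely numeric participles after segmentation effectively reduces the number of matching loops against the enterprise information in the enterprise information database, which helps improve the efficiency of matching enterprise information with text information.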
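S4. Take one ordered set of first participles from the enterprise information set; calculate a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set. Specifically:
taking one participle from the second participle set as a text information participle;
if the text information participle matches the enterprise abbreviation participle in the ordered set of first participles, calculating the matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set.
Because the text information is taken from event reports and public opinion, it may not state the enterprise name and similar details fully or in standard form. The administrative division participle, enterprise nature participle and enterprise type participle of the enterprise information may therefore all be absent from the text information, whereas the enterprise abbreviation participle is necessarily present. In the enterprise information database of the present disclosure, the enterprise abbreviation participle is used as the key, and the string "full enterprise name#administrative division participle#enterprise nature participle#enterprise type participle#industry name participle#address participle" is used as the corresponding value. A further matching operation is performed only when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information in the database, which greatly improves matching efficiency.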
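Optionally, calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value;
when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a second value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score corresponding to that ordered set to a third value;
when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a fourth value;
the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.
For example, the ordered set of first participles corresponding to a piece of enterprise information in the enterprise information database is {"Fujian Province Xiamen City", "XXXX", "Information", "Co., Ltd."}. If "Fujian Province Xiamen City", "XXXX", "Information" and "Co., Ltd." are all present in the text information, the enterprise referred to in the text information fully matches the enterprise information corresponding to that ordered set, and the matching score is 100 points. If only "XXXX", "Information" and "Co., Ltd." are present, the match is very close, and the matching score is 90 points. If only "XXXX" and "Information" are present, the match is fairly close, and the matching score is 80 points. If only "XXXX" is present, the enterprise referred to in the text information basically matches that enterprise information, and the matching score is 50 points.
Scoring according to how closely the participles of the text information match those of the enterprise information helps improve the accuracy of the matching result.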
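Optionally, the ordered set of first participles also includes an address participle and an industry name participle;
when the second participle set contains the address participle, the matching score is increased by a fifth value;
when the second participle set contains the industry name participle, the matching score is increased by a sixth value;
the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.
If the text information contains the address participle, the matching score is increased by the fifth value. If the text information contains the industry name participle, the matching score is increased by the sixth value; if it does not, the industry name participle is itself segmented to obtain an industry name participle list, the list is traversed, each entry is checked against the text information in turn, and the matching score is increased accordingly for each hit until the traversal ends.
The address is not precise down to the house number; it is truncated at the road or street, for example: Software Park Phase II, Guanri Road.
For example, suppose the score obtained after matching the text information against the administrative division participle, enterprise abbreviation participle, enterprise nature participle and enterprise type participle of the ordered set of first participles of a piece of enterprise information is 80 points. If the text information contains the address participle of that enterprise information, the matching score is increased by 5 points to 85 points. If the text information also contains the industry name participle of that enterprise information, another 5 points are added, giving 90 points. If the text information does not fully match the industry name participle, the industry name participle is further subdivided and the matching score is increased according to which parts match. For instance, the industry name participle "information system integration service" can be subdivided into "information", "system integration" and "service", each of which is then matched against the text information.
Matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information.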
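S5. Repeat step S4 until every element of the enterprise information set has been traversed.
S6. Obtain the enterprise information corresponding to the ordered set of first participles with the highest matching score.
The enterprise information with the highest matching score is the information of the enterprise in the enterprise information database that best matches the event or public opinion reported in the text information.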
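In summary, when enterprise information is segmented, the resulting participles differ in importance. The information matching method and terminal provided by the present disclosure arrange the participles of each piece of enterprise information in order in an ordered set of first participles, so that when the text information of an event report or public opinion is matched in turn against the segmented enterprise information in the enterprise information database, a matching score can be generated from the number of participles the two have in common and the importance of the matched participles. The information of the enterprise most relevant to the event report or public opinion can then be derived from the matching scores between the text information and each piece of enterprise information in the database, which greatly improves the accuracy of matching text information with enterprise information. Further, scoring according to how closely the participles of the text information match those of the enterprise information helps improve the accuracy of the matching result. Further, matching on enterprise address keywords and industry keywords helps improve the accuracy of matching text information with enterprise information. Further, a further matching operation is performed only when a participle in the text information matches the enterprise abbreviation participle of one or more pieces of enterprise information in the database, which greatly improves matching efficiency. Further, filtering out single-character participles and purely numeric participles after segmentation effectively reduces the number of matching loops against the enterprise information in the database, which helps improve the efficiency of matching enterprise information with text information. Further, deleting the brackets and the characters inside them from the text information keeps the operation consistent with the one applied when the enterprise information is split, ensures consistent segmentation results, and improves the accuracy of matching enterprise information with text information. Moreover, the bracketed content of an enterprise name rarely exceeds five characters; to avoid deleting other parts of the text information by mistake, the deletion is performed only when the brackets hold fewer than 10 characters.
The above are only embodiments of the present disclosure and do not thereby limit its patent scope. Any equivalent transformation made using the contents of the specification and drawings of the present disclosure, or any direct or indirect application in a related technical field, is likewise included within the scope of patent protection of the present disclosure.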

Claims (11)

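  1. An information matching method, characterized by comprising:
    S1. segmenting a piece of enterprise information to obtain an ordered set of first participles corresponding to that enterprise information;
    S2. obtaining two or more of the ordered sets of first participles to form an enterprise information set;
    S3. segmenting preset text information to obtain a second participle set;
    S4. taking one ordered set of first participles from the enterprise information set; calculating a matching score from the number of participles matched between the second participle set and that ordered set and from the ordinal positions of the matched participles within that ordered set;
    S5. repeating step S4 until every element of the enterprise information set has been traversed;
    S6. obtaining the enterprise information corresponding to the ordered set of first participles with the highest matching score.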
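  2. The information matching method according to claim 1, characterized in that S1 is specifically:
    obtaining the characters in the enterprise information that correspond to the administrative division, to obtain an administrative division participle;
    obtaining the characters in the enterprise information that correspond to the enterprise abbreviation, to obtain an enterprise abbreviation participle;
    obtaining the characters in the enterprise information that correspond to the enterprise nature, to obtain an enterprise nature participle;
    obtaining the characters in the enterprise information that correspond to the enterprise type, to obtain an enterprise type participle;
    generating the ordered set of first participles from the administrative division participle, the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle.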
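  3. The information matching method according to claim 2, characterized in that calculating the matching score from the number of participles matched between the second participle set and the ordered set of first participles and from the ordinal positions of the matched participles within that ordered set is specifically:
    when the second participle set contains the whole ordered set of first participles, setting the matching score corresponding to that ordered set to a first value;
    when the second participle set contains only the enterprise abbreviation participle, the enterprise nature participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a second value;
    when the second participle set contains only the enterprise abbreviation participle and the enterprise nature participle of that ordered set, setting the matching score corresponding to that ordered set to a third value;
    when the second participle set contains only the enterprise abbreviation participle and the enterprise type participle of that ordered set, setting the matching score corresponding to that ordered set to a fourth value;
    the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value.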
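  4. The information matching method according to claim 3, characterized in that the first ordered segment set further comprises an address segment and an industry-name segment;
    when the second segment set contains the address segment, the matching score is increased by a fifth value;
    when the second segment set contains the industry-name segment, the matching score is increased by a sixth value;
    the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.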
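  5. The information matching method according to claim 2, characterized in that calculating the matching score according to the number of segments matched between the second segment set and the first ordered segment set and the sequence numbers of the matched segments in the first ordered segment set specifically comprises:
    obtaining a segment from the second segment set to obtain a text information segment;
    if the text information segment matches the enterprise abbreviation segment in the first ordered segment set, calculating the matching score according to the number of segments matched between the second segment set and the first ordered segment set and the sequence numbers of the matched segments in the first ordered segment set.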
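  6. The information matching method according to claim 1, characterized in that S3 specifically comprises:
    segmenting the preset text information to obtain an initial segment set;
    deleting the numeric segments and the single-character segments from the initial segment set to obtain the second segment set.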
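  7. The information matching method according to claim 1, characterized in that, before S1, the method further comprises:
    deleting the brackets and the characters inside the brackets from the enterprise information item;
    and before S3, the method further comprises: if brackets exist in the preset text information and the number of characters inside the brackets is fewer than 10, deleting the brackets and the characters inside the brackets.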
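  8. An information matching terminal, characterized by comprising one or more processors and a memory, the memory storing a program configured to be executed by the one or more processors to perform the following steps:
    S1. segmenting an enterprise information item to obtain a first ordered segment set corresponding to the enterprise information item;
    S2. obtaining two or more first ordered segment sets to obtain an enterprise information set;
    S3. segmenting preset text information to obtain a second segment set;
    S4. obtaining a first ordered segment set from the enterprise information set; and calculating a matching score according to the number of segments matched between the second segment set and the first ordered segment set and the sequence numbers of the matched segments in the first ordered segment set;
    S5. repeating step S4 until all elements in the enterprise information set have been traversed;
    S6. obtaining the enterprise information corresponding to the first ordered segment set having the highest matching score.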
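  9. The information matching terminal according to claim 8, characterized in that S1 specifically comprises:
    obtaining the characters in the enterprise information item that correspond to an administrative division, to obtain an administrative-division segment;
    obtaining the characters in the enterprise information item that correspond to an enterprise abbreviation, to obtain an enterprise abbreviation segment;
    obtaining the characters in the enterprise information item that correspond to an enterprise nature, to obtain an enterprise nature segment;
    obtaining the characters in the enterprise information item that correspond to an enterprise type, to obtain an enterprise type segment;
    generating the first ordered segment set according to the administrative-division segment, the enterprise abbreviation segment, the enterprise nature segment and the enterprise type segment;
    calculating the matching score according to the number of segments matched between the second segment set and the first ordered segment set and the sequence numbers of the matched segments in the first ordered segment set specifically comprises:
    obtaining a segment from the second segment set to obtain a text information segment;
    if the text information segment matches the enterprise abbreviation segment in the first ordered segment set, then: when the second segment set contains the first ordered segment set, setting the matching score corresponding to the first ordered segment set to a first value; when the second segment set contains only the enterprise abbreviation segment, the enterprise nature segment and the enterprise type segment of the first ordered segment set, setting the matching score corresponding to the first ordered segment set to a second value; when the second segment set contains only the enterprise abbreviation segment and the enterprise nature segment of the first ordered segment set, setting the matching score corresponding to the first ordered segment set to a third value; when the second segment set contains only the enterprise abbreviation segment and the enterprise type segment of the first ordered segment set, setting the matching score corresponding to the first ordered segment set to a fourth value; the first value is greater than the second value; the second value is greater than the third value; the third value is greater than the fourth value;
    the first ordered segment set further comprises an address segment and an industry-name segment; when the second segment set contains the address segment, the matching score is increased by a fifth value; when the second segment set contains the industry-name segment, the matching score is increased by a sixth value; the fourth value is greater than the fifth value; the fourth value is greater than the sixth value.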
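  10. The information matching terminal according to claim 8, characterized in that, before S1, the steps further comprise: deleting the brackets and the characters inside the brackets from the enterprise information item;
    S3 specifically comprises:
    if brackets exist in the preset text information and the number of characters inside the brackets is fewer than 10, deleting the brackets and the characters inside the brackets;
    segmenting the preset text information to obtain an initial segment set;
    deleting the numeric segments and the single-character segments from the initial segment set to obtain the second segment set.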
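  11. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a computer, performs the method according to any one of claims 1-8.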
PCT/CN2019/099123 2018-11-12 2019-08-02 一种信息匹配方法及终端 WO2020098315A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811341250.6 2018-11-12
CN201811341250.6A CN109635276B (zh) 2018-11-12 2018-11-12 一种信息匹配方法及终端

Publications (1)

Publication Number Publication Date
WO2020098315A1 true WO2020098315A1 (zh) 2020-05-22

Family

ID=66067772

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/099123 WO2020098315A1 (zh) 2018-11-12 2019-08-02 一种信息匹配方法及终端

Country Status (2)

Country Link
CN (1) CN109635276B (zh)
WO (1) WO2020098315A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127009A (zh) * 2022-11-17 2023-05-16 上海倍通医药科技咨询有限公司 一种企业信息匹配系统及方法

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635276B (zh) * 2018-11-12 2020-12-11 厦门市美亚柏科信息股份有限公司 一种信息匹配方法及终端
CN110134801A (zh) * 2019-04-28 2019-08-16 福建星网视易信息系统有限公司 一种作品名称与多媒体文件的匹配方法及存储介质
CN110377818A (zh) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 用户信息推送方法、装置、存储介质和计算机设备
CN111294347B (zh) * 2020-01-22 2022-06-10 奇安信科技集团股份有限公司 一种工控设备的安全管理方法及系统
CN113239261A (zh) * 2021-06-18 2021-08-10 红盾大数据(北京)有限公司 企业名称匹配方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080183759A1 (en) * 2007-01-29 2008-07-31 Word Data Corp System and method for matching expertise
CN103092894A (zh) * 2011-11-08 2013-05-08 阿里巴巴集团控股有限公司 一种结构化信息检索方法和系统
CN103309886A (zh) * 2012-03-13 2013-09-18 阿里巴巴集团控股有限公司 一种基于交易平台的结构化信息搜索方法和装置
CN103885937A (zh) * 2014-04-14 2014-06-25 焦点科技股份有限公司 基于核心词相似度判断企业中文名称重复的方法
CN106951548A (zh) * 2017-03-27 2017-07-14 聚龙融创科技有限公司 基于rm算法提升特写词语搜索精度的方法及系统
CN109635276A (zh) * 2018-11-12 2019-04-16 厦门市美亚柏科信息股份有限公司 一种信息匹配方法及终端

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8200477B2 (en) * 2003-10-22 2012-06-12 International Business Machines Corporation Method and system for extracting opinions from text documents
US20080052147A1 (en) * 2006-07-18 2008-02-28 Eran Reshef System and method for influencing public opinion
CN103064951B (zh) * 2012-12-31 2016-08-31 南京烽火星空通信发展有限公司 一种舆情信息的地域识别方法和装置
CN104636386A (zh) * 2013-11-14 2015-05-20 华为技术有限公司 信息监控方法及装置
CN105574092B (zh) * 2015-12-10 2019-08-23 百度在线网络技术(北京)有限公司 信息挖掘方法和装置
CN107544988B (zh) * 2016-06-27 2021-03-19 百度在线网络技术(北京)有限公司 一种获取舆情数据的方法和装置
CN106951415A (zh) * 2017-04-01 2017-07-14 银联智策顾问(上海)有限公司 一种商户名称搜索方法和装置
CN108460014B (zh) * 2018-02-07 2022-02-25 百度在线网络技术(北京)有限公司 企业实体的识别方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN109635276B (zh) 2020-12-11
CN109635276A (zh) 2019-04-16

Similar Documents

Publication Publication Date Title
WO2020098315A1 (zh) 一种信息匹配方法及终端
US7219104B2 (en) Data cleansing
CN104598439B (zh) 信息对象的标题修正方法及装置和推送信息对象的方法
JP5768063B2 (ja) 適合を特徴付けるルールを用いたメタデータソースの照合
US7324998B2 (en) Document search methods and systems
US10671671B2 (en) Supporting tuples in log-based representations of graph databases
US9996607B2 (en) Entity resolution between datasets
WO2018236732A1 (en) AUTOMATIC LEARNING SYSTEM FOR PROCESSING QUERIES FOR DIGITAL CONTENT
CN110781246A (zh) 一种企业关联关系构建方法及系统
JP5227333B2 (ja) ウェブページの分類とそのコンテンツの整理をするための方法
WO2020168839A1 (zh) 物品召回方法、系统、电子设备及可读存储介质
Cheng et al. Rule-based graph repairing: Semantic and efficient repairing methods
US11263218B2 (en) Global matching system
US9430520B2 (en) Semantic reflection storage and automatic reconciliation of hierarchical messages
CN114168608B (zh) 一种用于更新知识图谱的数据处理系统
US20060101452A1 (en) Method and apparatus for preserving dependancies during data transfer and replication
CN110019542B (zh) 企业关系的生成、生成组织成员数据库及识别同名成员
US11170050B1 (en) Method and device for graph data quality verification
US20180203944A1 (en) Graph databases
CN110704719A (zh) 企业搜索文本分词方法和装置
US20080189150A1 (en) Supply chain multi-dimensional serial containment process
US20180357328A1 (en) Functional equivalence of tuples and edges in graph databases
Mezzanzanica et al. Data quality sensitivity analysis on aggregate indicators
CN112612810A (zh) 慢sql语句识别方法及系统
CN115526500A (zh) 一种惠政信息推送方法、装置、设备、介质及程序产品

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19885805

Country of ref document: EP

Kind code of ref document: A1