CN115730595A

CN115730595A - Method, apparatus and medium for identifying pharmaceutical industry target object to be identified

Info

Publication number: CN115730595A
Application number: CN202211211885.0A
Authority: CN
Inventors: 姜金陆
Original assignee: Shanghai Huantong Business Technology Co ltd
Current assignee: Shanghai Huantong Business Technology Co ltd
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2023-03-03
Anticipated expiration: 2042-09-30
Also published as: CN115730595B; CN117272991A; WO2024066903A1

Abstract

Embodiments of the present disclosure relate to a method, apparatus, and medium for identifying a pharmaceutical industry target object to be identified. The method comprises: acquiring original data to be identified that indicates a pharmaceutical industry target object; identifying administrative division information and channel type information in the original data to be identified; performing noise removal and word segmentation on the original data to be identified, based on the administrative division information, the channel type information, and at least one of a noise word bank, a semantic equivalent word bank, and a fixed word bank, so as to generate a word segmentation result; performing hash calculation on a plurality of keywords included in the word segmentation result so as to determine whether the word segmentation result matches a reference name; and, in response to confirming that the word segmentation result does not match the reference name, performing semantic similarity analysis on the reference name and preprocessed data based on the word segmentation result, so as to identify the pharmaceutical industry target object to be identified. The pharmaceutical industry target object can thereby be identified quickly and accurately.

Description

Method, apparatus and medium for identifying pharmaceutical industry target object to be identified

Technical Field

Embodiments of the present disclosure relate generally to the field of data recognition, and more particularly, to a method, computing device, and computer storage medium for identifying a pharmaceutical industry target object to be identified.

Background

Conventional methods for identifying a pharmaceutical industry target object to be identified (e.g., without limitation, an organization in the field of pharmaceutical distribution) generally fall into two categories: identifying the unknown pharmaceutical industry target object purely manually; and identifying it based on simple word segmentation techniques from natural language processing.

Purely manual identification can handle irregular original data of pharmaceutical industry target objects, but its efficiency is low and its results vary with the experience of the person performing the identification. It is therefore difficult to adapt to accurate and rapid identification of large volumes of pharmaceutical industry target objects, and to the identification requirements that a pharmaceutical industry service platform places on such objects. Identification based on simple word segmentation has relatively low accuracy, because the original data of pharmaceutical industry target objects is not expressed in a standard manner and usually differs noticeably in content and structure, and because no ready-made word segmentation and matching logic exists for the pharmaceutical industry.

In summary, the conventional method for identifying a target object in the pharmaceutical industry to be identified has the following disadvantages: it is difficult to quickly and accurately identify target objects for the pharmaceutical industry.

Disclosure of Invention

In view of the above, the present disclosure provides a method, a computing device and a computer storage medium for identifying a pharmaceutical industry target object to be identified, which can quickly and accurately identify the pharmaceutical industry target object.

According to a first aspect of the present disclosure, there is provided a method for identifying a pharmaceutical industry target object to be identified, comprising: acquiring original data to be identified that indicates a pharmaceutical industry target object; identifying administrative division information and channel type information in the original data to be identified; performing noise removal and word segmentation on the original data to be identified, based on the administrative division information, the channel type information, and at least one of a noise word bank, a semantic equivalent word bank, and a fixed word bank, so as to generate a word segmentation result, wherein the word segmentation result comprises a plurality of keywords; performing hash calculation on the plurality of keywords included in the word segmentation result so as to determine whether the word segmentation result matches a reference name; and, in response to confirming that the word segmentation result does not match the reference name, performing semantic similarity analysis on the reference name and preprocessed data based on the word segmentation result, so as to identify the pharmaceutical industry target object to be identified based on the result of the similarity analysis.

According to a second aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the disclosure.

In a third aspect of the present disclosure, a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect of the present disclosure is provided.

In some embodiments, performing a hash calculation on a plurality of keywords included in the word segmentation result to confirm whether the word segmentation result matches the reference name comprises: calculating the sum of the hash values of the plurality of keywords included in the word segmentation result, so as to generate a word segmentation result hash sum; calculating the sum of the hash values of the plurality of keywords included in the reference name, so as to generate a reference name hash sum; confirming whether the word segmentation result hash sum is equal to the reference name hash sum; and determining that the word segmentation result matches the reference name in response to determining that the two hash sums are equal.
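The hash-sum comparison described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the choice of MD5 truncated to 64 bits and the function names `keyword_hash`, `hash_sum`, and `matches_reference` are assumptions, since the text does not name a hash function.

```python
import hashlib


def keyword_hash(keyword: str) -> int:
    """Stable integer hash of one keyword (MD5 is an assumed choice)."""
    digest = hashlib.md5(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_sum(keywords: list[str]) -> int:
    """Sum of the per-keyword hash values."""
    return sum(keyword_hash(k) for k in keywords)


def matches_reference(segmented: list[str], reference: list[str]) -> bool:
    """True when both keyword lists produce the same hash sum."""
    return hash_sum(segmented) == hash_sum(reference)
```

Because addition is commutative, the comparison does not depend on keyword order, so a name whose keywords appear in a different order than in the reference name can still match.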

In some embodiments, the method for identifying a pharmaceutical industry target object to be identified further comprises: and in response to confirming that the word segmentation result is matched with the reference name, identifying the pharmaceutical industry target object to be identified as the target object associated with the reference name.

In some embodiments, generating the word segmentation result comprises: acquiring, based on the identified administrative division information, the non-administrative-division data in the original data to be identified, i.e., the data other than the administrative division information; removing noise words from, and replacing equivalent words in, the non-administrative-division data; and segmenting the data that has undergone noise word removal and equivalent word replacement based on the fixed word bank, so as to generate a word segmentation result corresponding to the original data to be identified, wherein the word segmentation result comprises a plurality of keywords and a plurality of predetermined identifiers indicating segmentation positions.
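A minimal sketch of this pipeline is given below, under stated assumptions: the separator `|` stands in for the predetermined identifier, the function name `segment` and all sample words are hypothetical, and plain substring replacement is used where a production system would likely need more careful matching.

```python
import re

SEP = "|"  # assumed stand-in for the predetermined identifier


def segment(raw, division_words, noise_words, equivalents, fixed_words):
    """Strip division and noise words, replace equivalents, split at fixed words."""
    text = raw
    for word in division_words:      # drop administrative division information
        text = text.replace(word, "")
    for word in noise_words:         # drop noise words
        text = text.replace(word, "")
    for original, equivalent in equivalents.items():
        text = text.replace(original, equivalent)
    if not fixed_words:
        return text
    # Longest fixed words first so that longer entries win over their substrings.
    ordered = sorted(fixed_words, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(w) for w in ordered) + ")"
    parts = [p for p in re.split(pattern, text) if p]
    return SEP.join(parts)


result = segment(
    "烟台市康泰大药店(总店)有限责任公司",
    division_words=["烟台市"],
    noise_words=["(总店)"],
    equivalents={"有限责任公司": "有限公司", "大药店": "大药房"},
    fixed_words=["大药房", "有限公司"],
)
```

With the sample inputs shown, `result` contains three keywords joined by the separator.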

In some embodiments, the method for identifying a pharmaceutical industry target object to be identified further comprises: identifying numeric words in the original data to be identified; normalizing the identified numeric words so as to segment out keywords in numeric form in the original data to be identified; and combining the plurality of keywords included in the word segmentation result into preprocessed data without geographic information, for matching against the reference name.

In some embodiments, normalizing the identified numeric words so as to segment out keywords in numeric form in the original data to be identified comprises: converting upper-case and/or lower-case Chinese numerals in the original data to be identified into Arabic numerals; determining whether the number of digits of the converted Arabic numerals is greater than or equal to a predetermined digit-count threshold; removing the converted Arabic numerals in response to determining that their number of digits is greater than or equal to the predetermined digit-count threshold; in response to determining that the number of digits is less than the predetermined digit-count threshold, determining whether the converted Arabic numerals are located at the start position or the end position of the original data to be identified; in response to determining that they are located at the start position or the end position, determining whether the data adjacent to those Arabic numerals indicates a predetermined channel type; and removing the converted Arabic numerals in response to determining that the adjacent data does not indicate the predetermined channel type.
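The numeric normalization rules above can be sketched as follows. The sketch makes several assumptions: the threshold value of 4 digits, the helper names `chinese_to_arabic` and `normalize_numbers`, the channel word list, and a simplified numeral converter that only handles digits and the units ten, hundred, and thousand.

```python
import re

# Lower-case and upper-case (financial) Chinese digits and units.
DIGITS = {c: i for i, c in enumerate("零一二三四五六七八九")}
DIGITS.update({c: i for i, c in enumerate("零壹贰叁肆伍陆柒捌玖")})
UNITS = {"十": 10, "拾": 10, "百": 100, "佰": 100, "千": 1000, "仟": 1000}
NUMERAL_RUN = re.compile("[" + "".join(DIGITS) + "".join(UNITS) + "]+")


def chinese_to_arabic(run: str) -> str:
    """Convert one run of Chinese numerals to an Arabic digit string."""
    if not any(ch in UNITS for ch in run):
        # Purely positional run: map digit by digit.
        return "".join(str(DIGITS[ch]) for ch in run)
    total, current = 0, 0
    for ch in run:
        if ch in DIGITS:
            current = DIGITS[ch]
        else:
            total += (current or 1) * UNITS[ch]
            current = 0
    return str(total + current)


def normalize_numbers(text: str, channel_words: list[str], max_digits: int = 4) -> str:
    """Apply the digit-count and position rules to every number in the text."""
    converted = NUMERAL_RUN.sub(lambda m: chinese_to_arabic(m.group()), text)

    def decide(m: re.Match) -> str:
        digits = m.group()
        if len(digits) >= max_digits:          # too long: treat as noise
            return ""
        at_start = m.start() == 0
        at_end = m.end() == len(converted)
        if at_start:
            keep = any(converted[m.end():].startswith(w) for w in channel_words)
            return digits if keep else ""
        if at_end:
            keep = any(converted[:m.start()].endswith(w) for w in channel_words)
            return digits if keep else ""
        return digits                          # numbers inside the name are kept

    return re.sub(r"\d+", decide, converted)
```

Numbers in the middle of a name are kept as keywords, while long digit strings and leading or trailing numbers that are not next to a channel word are dropped.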

In some embodiments, the channel type information comprises: a channel type sub-classification name, a channel type classification name and a channel type classification serial number.

In some embodiments, identifying administrative division information and channel type information in the original data to be identified comprises: determining a plurality of keyword sets respectively associated with different priority orders, wherein each keyword set comprises a plurality of preset keywords; determining, among the plurality of keyword sets, a target keyword set that contains a preset keyword appearing in the original data to be identified; determining a channel type sub-classification name matched with the original data to be identified based on the priority order associated with the target keyword set; and determining a channel type classification name and a channel type classification serial number matched with the original data to be identified based on the determined channel type sub-classification name.
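The priority-based lookup above can be sketched as follows. All keyword sets, sub-classification names, classification names, and serial numbers below are hypothetical sample data, and the convention that a lower number means a higher priority is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeywordSet:
    priority: int                 # lower number = higher priority (assumed)
    sub_category: str             # channel type sub-classification name
    keywords: tuple[str, ...]     # preset keywords


# Hypothetical keyword sets.
KEYWORD_SETS = [
    KeywordSet(1, "chain pharmacy", ("连锁",)),
    KeywordSet(2, "retail pharmacy", ("药房", "药店")),
    KeywordSet(3, "health center", ("卫生院", "医院")),
]

# Hypothetical mapping: sub-classification -> (classification name, serial number).
CATEGORY_OF = {
    "chain pharmacy": ("pharmacy", 2),
    "retail pharmacy": ("pharmacy", 2),
    "health center": ("medical institution", 1),
}


def identify_channel_type(raw: str) -> Optional[tuple[str, str, int]]:
    """Return (sub-classification, classification, serial number) or None."""
    hits = [ks for ks in KEYWORD_SETS if any(k in raw for k in ks.keywords)]
    if not hits:
        return None
    best = min(hits, key=lambda ks: ks.priority)
    category, serial = CATEGORY_OF[best.sub_category]
    return best.sub_category, category, serial
```

When a name hits several keyword sets, the set with the highest priority decides the sub-classification, and the classification name and serial number follow from it.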

In some embodiments, performing noise removal and word segmentation on the original data to be identified comprises: determining a plurality of groups of associated words, wherein each group comprises an original word and an equivalent word that have consistent semantics when indicating a pharmaceutical industry target object; determining a sequence number and a classification for each group of associated words, wherein the sequence number indicates the priority of the group; and replacing and segmenting the original data to be identified using the equivalent words, in the order given by the determined sequence numbers, so that the data after equivalent word replacement and segmentation comprises the equivalent words and predetermined identifiers, wherein each predetermined identifier indicates a segmentation position.
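A minimal sketch of priority-ordered equivalent word replacement follows. The separator `|`, the function name `replace_equivalents`, and the associated-word groups are assumptions made for illustration.

```python
SEP = "|"  # assumed stand-in for the predetermined identifier

# Hypothetical groups: (sequence number, classification, original word, equivalent word).
ASSOCIATED_WORDS = [
    (3, "channel", "药店", "药房"),
    (1, "company form", "有限责任公司", "有限公司"),
    (2, "channel", "大药店", "大药房"),
]


def replace_equivalents(text: str) -> str:
    """Replace original words in sequence-number order and mark segmentation positions."""
    for _, _, original, equivalent in sorted(ASSOCIATED_WORDS):
        text = text.replace(original, SEP + equivalent + SEP)
    # Collapse the empty pieces produced by adjacent or trailing identifiers.
    return SEP.join(part for part in text.split(SEP) if part)
```

The ordering matters in this sample: applying the shorter original word before the longer one would split the longer word in the wrong place, which is why the groups are processed by sequence number.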

In some embodiments, performing semantic similarity analysis on the preprocessed data and the reference name comprises: determining an overlapping part of the preprocessed data and the reference name; deleting the overlapping part from the preprocessed data to obtain a remaining part; and, in response to determining that a first predetermined confidence condition is satisfied, determining that the matching confidence level between the original data to be identified and the reference name is a first level, the first level indicating a match between the original data to be identified and the reference name. The first predetermined confidence condition includes any one of the following: the remaining part includes a number of words less than or equal to a first word number threshold; the remaining part includes a number of words greater than a second word number threshold, and the remaining part and the overlapping part are associated with the same channel type information, the second word number threshold being greater than the first word number threshold; the remaining part includes a number of words greater than the first word number threshold and less than the second word number threshold, and the remaining part contains a pair of parentheses; the remaining part contains "original", or parentheses and "original"; the remaining part contains a pair of brackets and the number of words within the brackets is less than a third word number threshold, the third word number threshold being greater than the first word number threshold and less than the second word number threshold; or there is an overlap between the preprocessed data and the reference name, and the preprocessed data and the reference name have the same channel type sub-classification.

In some embodiments, performing semantic similarity analysis on the segmentation results and the reference names further comprises: in response to determining that a second predetermined confidence condition is satisfied, determining a matching confidence level between the original data to be identified and the reference name to be a second level, the second predetermined confidence condition comprising: determining that the word segmentation results of the preprocessed data and the reference name have a coincidence part after structural reorganization, and the channel type classification information of the preprocessed data and the reference name is the same; in response to determining that a third predetermined confidence condition is satisfied, determining a mismatch between the original data to be identified and the reference name, the third predetermined confidence condition comprising: the word segmentation results of the preprocessed data and the reference names have overlapped parts after structural reorganization, and the channel type classification information of the preprocessed data and the reference names is different.
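The confidence grading can be sketched as follows. This is a deliberately simplified illustration that covers only some of the conditions: it uses the longest common substring as the overlapping part, counts characters as words, omits the structural reorganization step and the bracket-related conditions, and assumes a first threshold of 2. The names `remaining_after_overlap` and `confidence` are hypothetical.

```python
from difflib import SequenceMatcher


def remaining_after_overlap(pre: str, ref: str) -> tuple[str, int]:
    """Remove the longest common substring from pre; return (remainder, overlap size)."""
    matcher = SequenceMatcher(None, pre, ref, autojunk=False)
    m = matcher.find_longest_match(0, len(pre), 0, len(ref))
    return pre[:m.a] + pre[m.a + m.size:], m.size


def confidence(pre: str, ref: str,
               pre_channel: tuple[str, str], ref_channel: tuple[str, str],
               first_threshold: int = 2) -> str:
    """Channels are (classification, sub-classification) pairs."""
    remaining, overlap = remaining_after_overlap(pre, ref)
    if overlap == 0:
        return "mismatch"
    if len(remaining) <= first_threshold:
        return "level 1"            # little is left after removing the overlap
    if pre_channel[1] == ref_channel[1]:
        return "level 1"            # overlap plus same sub-classification
    if pre_channel[0] == ref_channel[0]:
        return "level 2"            # overlap plus same classification only
    return "mismatch"               # overlap but different classification
```

A second-level result signals a weaker match than a first-level one, and an overlap with differing channel type classification is treated as a mismatch.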

In some embodiments, identifying administrative division information and channel type information in the original data to be identified comprises: identifying administrative division information in the organization name to be identified based on full names, short names, former names, and exclusion words for provinces, cities, and districts/counties, the administrative division information including province information, city information, and district/county information; and, in response to confirming that the identified district/county information or city information does not indicate a unique district/county or city, identifying the administrative division information in the original data to be identified using lower-level administrative division information of the identified district/county or city, or using administrative division information of a target object associated with the target object to be identified.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements.

Fig. 1 shows a schematic diagram of a system for implementing a method for identifying a pharmaceutical industry target object to be identified according to an embodiment of the invention.

Fig. 2 shows a flow chart of a method for identifying a pharmaceutical industry target object to be identified according to an embodiment of the present disclosure.

Fig. 3 shows a flowchart of a method for identifying administrative division information and channel type information in original data to be identified according to an embodiment of the present disclosure.

Fig. 4 shows a flow diagram of a method for segmenting out keywords in the form of digital words in original data to be identified, according to an embodiment of the present disclosure.

Fig. 5 shows a flow diagram of a method for semantic similarity analysis for a word segmentation result and a reference name, according to an embodiment of the present disclosure.

Fig. 6 shows a flow diagram of a method for generating a segmentation result, in accordance with an embodiment of the present disclosure.

Fig. 7 shows a block diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.

As described above, conventional purely manual identification is inefficient, and its results vary with the experience of the person performing the identification. It is therefore difficult to adapt to accurate and fast identification of large volumes of pharmaceutical industry target objects, and it cannot meet the requirements of a pharmaceutical industry service platform for identifying such objects. Conventional identification based on simple word segmentation has relatively low accuracy for the target object, because word segmentation and matching logic specific to the pharmaceutical industry is lacking. Conventional methods therefore have the following disadvantage: it is difficult to quickly and accurately identify pharmaceutical industry target objects. For example, conventional methods have difficulty quickly and accurately identifying "three medical limited liability companies" and "Huarun three gorge medical limited companies".

To at least partially solve one or more of the above problems and other potential problems, example embodiments of the present disclosure propose a scheme for identifying a pharmaceutical industry target object to be identified. In this scheme, administrative division information and channel type information are identified in acquired original data to be identified that indicates a pharmaceutical industry target object. Noise removal and word segmentation are then performed on the original data to be identified based on the identified administrative division information, the channel type information, and at least one of a noise lexicon, a semantic equivalent lexicon, and a fixed lexicon, so as to generate a word segmentation result. The word segmentation result is thus denoised and normalized via semantic equivalents and/or fixed words, with the channel type information assisting the judgment, which overcomes the structural differences, irregular expression, and easy confusion found in the original data of pharmaceutical industry target objects. In addition, the present disclosure performs hash calculation on the plurality of keywords included in the word segmentation result to confirm whether the word segmentation result matches the reference name. If it does not match, semantic similarity analysis is performed on the reference name and the preprocessed data generated from the word segmentation result, so as to identify the pharmaceutical industry target object to be identified based on the result of the similarity analysis.

Fig. 1 shows a schematic diagram of a system 100 for implementing a method for identifying a pharmaceutical industry target object to be identified according to an embodiment of the invention. As shown in Fig. 1, the system 100 includes a computing device 110, a server 130, and a network 140. The computing device 110 and the server 130 may exchange data via the network 140 (e.g., the internet).

The server 130 may, for example, send original data to be identified that indicates a pharmaceutical industry target object to the computing device 110.

The computing device 110 is used, for example, to obtain the original data to be identified that is provided by the server 130 and indicates a pharmaceutical industry target object, and to identify administrative division information and channel type information in the original data to be identified. The computing device 110 may also perform noise removal and word segmentation on the original data to be identified based on the administrative division information, the channel type information, and at least one of the noise lexicon, the semantic equivalent lexicon, and the fixed lexicon, so as to generate a word segmentation result; perform hash calculation on a plurality of keywords included in the word segmentation result so as to determine whether the word segmentation result matches the reference name; and, if the word segmentation result does not match the reference name, perform semantic similarity analysis on the word segmentation result and the reference name so as to identify the pharmaceutical industry target object to be identified based on the result of the similarity analysis. The computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as CPUs. Additionally, one or more virtual machines may be running on each computing device 110. In some embodiments, the computing device 110 and the medical image imaging device 110 may be integrated or may be separate from each other. In some embodiments, the computing device 110 includes, for example, a raw data to be identified acquisition unit 112, an administrative division and channel type information identification unit 114, a word segmentation result generation unit 116, a hash calculation unit 118, and a pharmaceutical industry target object to be identified identification unit 120.

The raw data to be identified acquisition unit 112 is used to obtain original data to be identified that indicates a pharmaceutical industry target object.

The administrative division and channel type information identification unit 114 is used to identify administrative division information and channel type information in the original data to be identified.

The word segmentation result generation unit 116 is used to perform noise removal and word segmentation on the original data to be identified based on the administrative division information, the channel type information, and at least one of the noise lexicon, the semantic equivalent lexicon, and the fixed lexicon, so as to generate a word segmentation result that includes a plurality of keywords.

The hash calculation unit 118 is used to perform hash calculation on the plurality of keywords included in the word segmentation result in order to confirm whether the word segmentation result matches the reference name.

The pharmaceutical industry target object to be identified identification unit 120 is used to perform, if the word segmentation result does not match the reference name, semantic similarity analysis on the reference name and the preprocessed data based on the word segmentation result, so as to identify the pharmaceutical industry target object to be identified based on the result of the similarity analysis.

A method 200 for identifying a pharmaceutical industry target object to be identified is described below in connection with fig. 2. Fig. 2 shows a flow diagram of a method 200 for identifying a pharmaceutical industry target object to be identified, according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 700 shown in FIG. 7. It should be understood that method 200 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.

At step 202, the computing device 110 obtains raw data to be identified that indicates a pharmaceutical industry target object. For example, the computing device 110 obtains raw data to be identified from the server 130 regarding unknown institutions in the medical distribution field.

The pharmaceutical industry target object to be identified is, for example and without limitation, an unknown institution in the field of pharmaceutical distribution. For example, the computing device 110 needs to identify which standard institution name is represented by an unknown company or institution name in a certain pharmaceutical distribution field. It should be understood that the same pharmaceutical industry target object (e.g., without limitation, the same pharmacy) may have supply relationships with different medical institutions (e.g., distributors), and that its name or title may not be consistent across those institutions.

At step 204, the computing device 110 identifies administrative division information and channel type information in the raw data to be identified.

The administrative division information includes, for example, affiliation information at the three administrative levels of province, city, and district/county.

A method for identifying administrative division information in the original data to be identified includes, for example, the following. The computing device 110 identifies administrative division information in the organization name to be identified based on full names, short names, former names, and exclusion words for provinces, cities, and districts/counties, the administrative division information including province information, city information, and district/county information. If it is confirmed that the identified district/county information or city information does not indicate a unique district/county or city, the administrative division information in the original data to be identified is identified using lower-level administrative division information of the identified district/county or city, or using administrative division information of a target object associated with the target object to be identified. Specifically, if the computing device 110 determines that the original data to be identified contains a province full name, a province short name, or the name or former name of a provincial city, and does not contain an exclusion word for the province or the provincial city, it determines that province information is identified. If it determines that the original data to be identified contains a city full name, city short name, or city former name, and does not contain an exclusion word for the city, it determines that city information is identified. If it determines that the original data to be identified contains the full name, short name, or former name of a district/county, and does not contain an exclusion word for the district/county, it determines that district/county information is identified. The computing device 110 determines that the administrative division information is identified if any of the following is satisfied: province information, city information, and district/county information are all identified; province information and district/county information are identified; city information and district/county information are identified; or the identified district/county information or city information indicates a unique district/county or city.

For example, if the computing device 110 determines that the name of the institution to be identified includes a province-city-county three-level administrative structure, a province-county two-level administrative structure (e.g., province + county/second-level city/district), or a city-county two-level administrative structure (e.g., city level + county level), the administrative division to which the name belongs can be identified directly without subsequent checks; that is, the province, city, and county to which the institution name belongs are considered to have been found accurately.

For example, if the computing device 110 determines that the institution name to be identified includes the full name of a district/county or of a city, and that full name is unique, it determines that the administrative division information to which the institution name belongs is identified. It should be understood that where the name of a city or county is unique nationwide, an institution name that includes the full name, short name, or former name of that city or county is considered to uniquely identify the province, city, and county to which it belongs.

If the computing device 110 confirms that the identified district/county information or city information does not indicate a unique district/county or city, the administrative division information in the original data to be identified is identified using lower-level administrative division information of the identified district/county or city, or using administrative division information of a target object associated with the target object to be identified. Consider, for example, "Tongzhou district Yongshun Touchun Yuancun clean Room" and "Tongzhou district Jinshazhen pharmacia". Here "Tongzhou District" does not indicate a unique district, since both Beijing and Jiangsu Province contain a Tongzhou District. The administrative division information in the organization name to be identified can therefore be identified with the help of lower-level administrative division information (e.g., the relationship between a township and its county). For example, the unique administrative division relationship "Beijing + Tongzhou + Yongshun" can be found from "Tongzhou" + "Yongshun", which locates the Tongzhou District of Beijing; similarly, the unique administrative division relationship "Jiangsu Province + Nantong City + Tongzhou District" can be found from "Tongzhou" + "Jinsha".
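The disambiguation by lower-level divisions can be sketched as follows. The two-row gazetteer mirrors the Tongzhou example above; the table layout, the romanized strings, and the function name `resolve_county` are assumptions made for illustration.

```python
from typing import Optional

# (province, city, district/county, town) rows; a tiny hypothetical gazetteer.
GAZETTEER = [
    ("Beijing", "Beijing", "Tongzhou", "Yongshun"),
    ("Jiangsu", "Nantong", "Tongzhou", "Jinsha"),
]


def resolve_county(county: str, name: str) -> Optional[tuple[str, str, str]]:
    """Resolve an ambiguous district/county using town names found in the raw name."""
    candidates = [row for row in GAZETTEER if row[2] == county]
    parents = {row[:3] for row in candidates}
    if len(parents) == 1:
        return next(iter(parents))      # the county name is already unique
    # Ambiguous: keep only candidates whose town also appears in the name.
    hits = {row[:3] for row in candidates if row[3] in name}
    return next(iter(hits)) if len(hits) == 1 else None
```

If no town in the name narrows the candidates to exactly one, the sketch returns `None`, at which point the associated target object's division could be consulted instead.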

For another example, as shown in Table one below, the name of the target object to be identified (e.g., the buyer organization) is "Beigou health center", from which the province, city, and district/county to which the target object belongs cannot be determined. The computing device 110 may identify that the associated target object (e.g., the seller organization "Huarun Yantai Pharmaceutical Co., Ltd.") belongs to Yantai, Shandong. The computing device 110 may then check whether a Beigou exists among the towns under Yantai, and finally finds the unique Beigou town under Penglai district.

Table 1

In some embodiments, the computing device 110 identifies the administrative division information (including province information, city information, and county information) in the organization name to be identified based on the full names, short names, former names, and exclusion words of provinces, cities, and counties. Table two below illustrates the full names, short names, former names, and exclusion words for cities and counties; those for provinces are not shown in Table two.

Table 2

For example, in names such as "three-department of medicine, llc", "huarun three-gorge medicine, llc", and "three cities of medicine, jianjiang county", some district and county short names are easily confused with other words. The computing device 110 may use the exclusion words for provinces, cities, and counties to assist in identifying the administrative division information in such organization names. For example, for a county whose short name is "three", the exclusion words include: three gorges, three cities, and three shops. In this way, easily confused administrative division information can be identified accurately, which in turn helps improve the accuracy of identifying the target object.
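One way to apply exclusion words is to mask them out of the name before looking for the short name. The sketch below assumes simple substring matching and uses the translated example words; the function name `mentions_short_name` and the list `EXCLUSIONS` are hypothetical.

```python
def mentions_short_name(name, short_name, exclusions):
    """Return True if `short_name` occurs in `name` outside every exclusion word."""
    masked = name
    # Mask longer exclusion words first so that shorter ones cannot split them.
    for word in sorted(exclusions, key=len, reverse=True):
        masked = masked.replace(word, "\0" * len(word))
    return short_name in masked


# Illustrative values taken from the translated example in the text.
EXCLUSIONS = ["three gorges", "three cities", "three shops"]

print(mentions_short_name("huarun three gorges medicine llc", "three", EXCLUSIONS))
print(mentions_short_name("three county people's hospital", "three", EXCLUSIONS))
```

The first call finds "three" only inside an exclusion word and therefore does not treat it as a county short name; the second call does.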

The channel type information includes, for example, a channel type classification name, a channel type sub-classification name, and a channel type classification serial number. It should be understood that organization data in the pharmaceutical distribution industry falls into three broad categories; the retail terminal category, for example, is divided into individual drug stores and chain drug store branches. An organization name usually contains attribute information such as channel type information, and this attribute information helps improve the accuracy of identifying the pharmaceutical industry target object to be identified; for example, a retail terminal cannot be identified as a medical terminal. Therefore, identifying the channel type information in the original data to be identified improves the identification accuracy for the pharmaceutical industry target object to be identified.

The method for identifying administrative division information and channel type information in the original data to be identified includes, for example: the computing device 110 determines a plurality of keyword sets respectively associated with different priority orders, each keyword set comprising a plurality of predetermined keywords; determines, among the plurality of keyword sets, the target keyword set that contains a predetermined keyword included in the original data to be identified; determines a channel type sub-classification name matching the original data to be identified based on the priority order associated with the target keyword set; and determines a channel type classification name and a channel type classification serial number matching the original data to be identified based on the determined channel type sub-classification name.

At step 206, the computing device 110 performs noise removal and word segmentation on the original data to be recognized based on the administrative district information, the channel type information, and at least one of the noise lexicon, the semantic equivalent lexicon, and the fixed lexicon, so as to generate a word segmentation result, which includes a plurality of keywords.

The method for performing noise removal and word segmentation on the original data to be recognized includes, for example: confirming whether the preprocessed data obtained through noise removal and normalization matches at least one of the full name, an alias, and a former name of the reference name; and, if the preprocessed data matches none of them, performing word segmentation on the preprocessed data to generate a word segmentation result. If the preprocessed data equals an alias or former name of the reference name, or the preprocessed data plus its upstream name equals the reference name or its alias, or the preprocessed data plus its upstream name is a homophone of the reference name or its alias, the computing device 110 determines that the raw data to be identified matches the reference name, and no word segmentation of the preprocessed data is required.

The method of generating the word segmentation result includes, for example: acquiring, based on the identified administrative division information, the non-administrative-division data in the original data to be identified, i.e., the data other than the administrative division information; performing noise word removal and equivalent word replacement on the non-administrative-division data; segmenting, based on the fixed word bank, the data that has undergone noise word removal and equivalent word replacement, so as to generate a word segmentation result corresponding to the original data to be recognized, wherein the word segmentation result comprises a plurality of keywords and a plurality of predetermined identifiers indicating segmentation positions; identifying digital words in the original data to be identified; normalizing the identified digital words so as to segment out the keywords in the form of digital words in the original data to be recognized; and combining the plurality of keywords included in the word segmentation result into preprocessed data without geographic information, for matching with the reference name. The method for generating the word segmentation result will be described in detail below with reference to fig. 6 and is not repeated here.

The method for segmenting out a keyword in the form of a digital word in the original data to be recognized includes, for example: the computing device 110 converts uppercase and/or lowercase Chinese numerals in the original data to be recognized into Arabic numerals; determines whether the number of digits of the converted Arabic numeral is greater than or equal to a predetermined number-of-digits threshold; removes the converted Arabic numeral in response to determining that its number of digits is greater than or equal to the predetermined number-of-digits threshold; in response to determining that its number of digits is smaller than the predetermined number-of-digits threshold, determines whether the converted Arabic numeral is located at the start position or the end position of the original data to be recognized; in response to determining that the converted Arabic numeral is located at the start position or the end position, determines whether the data adjacent to the Arabic numeral indicates a predetermined channel type; and removes the converted Arabic numeral in response to determining that the adjacent data does not indicate the predetermined channel type. This method will be described in detail below with reference to fig. 4 and is not repeated here.

The method for performing noise word removal and equivalent word replacement on the non-administrative-division data includes, for example: the computing device 110 determines a plurality of groups of associated words, each group including an original word and an equivalent word that have consistent semantics when indicating a target object in the pharmaceutical industry; determines a sequence number and a classification for each group of associated words, wherein the sequence number indicates the priority of the group; and, based on the determined sequence numbers, replaces and splits the original data to be recognized using the equivalent words, so that the resulting data comprises the equivalent words and predetermined identifiers, wherein each predetermined identifier indicates a segmentation position.
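The equivalent word replacement can be sketched as follows. This is a minimal illustration under stated assumptions: the separator `SEP`, the entries in `ASSOCIATED_WORDS`, and the rule that a lower sequence number means a higher priority are all hypothetical, and the classification attached to each group is omitted.

```python
SEP = "|"  # hypothetical predetermined identifier marking a segmentation position

# (sequence number, original word, equivalent word). Illustrative entries only;
# a lower sequence number is assumed to mean a higher priority.
ASSOCIATED_WORDS = [
    (1, "big pharmacy", "pharmacy"),
    (2, "drugstore", "pharmacy"),
    (3, "co., ltd.", "company"),
]


def replace_equivalents(text, groups=ASSOCIATED_WORDS, sep=SEP):
    """Replace original words with equivalents and split at the inserted identifiers."""
    for _, original, equivalent in sorted(groups):
        # Surround each equivalent word with the identifier so it becomes its own segment.
        text = text.replace(original, f"{sep}{equivalent}{sep}")
    # Split on the identifier and drop empty segments.
    return [part.strip() for part in text.split(sep) if part.strip()]


print(replace_equivalents("yangsheng big pharmacy retail chain co., ltd."))
```

Applying the groups in sequence-number order keeps the result deterministic when several original words overlap.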

In step 208, the computing device 110 performs a hash calculation on a plurality of keywords included in the segmentation result to confirm whether the segmentation result matches the reference name.

The method of confirming whether the word segmentation result matches the reference name includes, for example: the computing device 110 calculates the sum of the hash values of the plurality of keywords included in the word segmentation result, so as to generate the sum of the word segmentation result hash values; calculates the sum of the hash values of the plurality of keywords included in the reference name, so as to generate the sum of the reference name hash values; confirms whether the sum of the word segmentation result hash values equals the sum of the reference name hash values; and determines that the word segmentation result matches the reference name in response to determining that the two sums are equal. Because the hash values of the segmented keywords are summed, the matching result is not affected by differences in word order.

The following formula (1) schematically shows an algorithm for confirming whether the word segmentation result matches the reference name.

$$\sum_{i=1}^{n} \mathrm{ora\_hash}(\mathrm{key\_reference}_i) = \sum_{i=1}^{n} \mathrm{ora\_hash}(\mathrm{key\_original}_i) \tag{1}$$

In the above formula (1), $\mathrm{ora\_hash}(\mathrm{key\_reference}_i)$ represents the hash value calculated for the i-th keyword included in the word segmentation result of the reference name, and $i$ represents the serial number of the keyword. $\sum_{i=1}^{n} \mathrm{ora\_hash}(\mathrm{key\_reference}_i)$ represents the sum of the reference name hash values. $n$ represents the total number of keywords; for example, in Table three or Table four, the total number of keywords $n$ is 19. $\mathrm{ora\_hash}(\mathrm{key\_original}_i)$ represents the hash value calculated for the i-th keyword included in the word segmentation result of the original data to be recognized. $\sum_{i=1}^{n} \mathrm{ora\_hash}(\mathrm{key\_original}_i)$ represents the sum of the hash values of the word segmentation result.
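The order-independent comparison can be illustrated with a short sketch. It is not the Oracle implementation: `keyword_hash` is a stand-in for `ora_hash` built on SHA-256, and skipping empty keyword slots is an assumption made for this example.

```python
import hashlib


def keyword_hash(keyword):
    """Stable 32-bit hash of one keyword (a stand-in for Oracle's ORA_HASH)."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def hash_sum(keywords):
    """Sum of keyword hashes; empty slots are skipped (assumption)."""
    return sum(keyword_hash(k) for k in keywords if k)


def matches_reference(original_keywords, reference_keywords):
    """True when both keyword collections have the same hash sum."""
    return hash_sum(original_keywords) == hash_sum(reference_keywords)


reference = ["yangsheng", "pharmacy", "retail", "chain", "menglian"]
original = ["yangsheng", "pharmacy", "chain", "retail", "", "menglian"]
print(matches_reference(original, reference))
```

Because addition is commutative, reordering the keywords does not change the sum. Equal sums do not strictly prove equal keyword sets, so a production system may want an additional check when a collision would be costly.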

For example, Table three below schematically shows the word segmentation result of the reference name. The reference name is, for example, "Fuyang City Yangsheng big pharmacy retail chain Co., Ltd., Menglian branch", and it is segmented into, for example, nineteen keywords, keyword 1 to keyword 19 in Table three. Only nine of the keywords are shown in Table three.

Table 3

For example, Table four below schematically shows the word segmentation result of the original data to be recognized. The original data to be identified is, for example, "Fuyang City Yangsheng big pharmacy retail chain company (Menglian)", and it is segmented into, for example, nineteen keywords, keyword 1 to keyword 19 in Table four. Only nine of the keywords are shown in Table four.

Table 4

The original data to be identified, namely "Fuyang City Yangsheng big pharmacy retail chain company (Menglian)", is denoised and segmented, which removes the dependence on word order, and its upper and lower case forms are normalized, generating the nineteen keywords, keyword 1 to keyword 19, in Table four. The sum of the hash values of keyword 1 to keyword 19 of the word segmentation result of the original data to be recognized (i.e., the sum of the word segmentation result hash values) is equal to the sum of the reference name hash values (the reference name being a standard target object name), so the computing device 110 determines that the word segmentation result matches the reference name.

The following illustrates exemplary program code for implementing an algorithm for confirming whether a word segmentation result matches a reference name.

select *
from (select a.collatejobdetailid, a.orgname, o.ovalmasterid as stdorgid, o.orgcode as stdorgcode,
             o.orgname as stdorgname, 2 as status, 2 as gradelevel, length(o.orgname) as orglen,
             -- grade the match by channel agreement and by the last character of keyword05
             case when a.channelname = o.channel
                       and substr(a.keyword05, -1) in ('-', 'shop', 'drug', 'birth', 'clinic', 'hospital') then '99%'
                  when a.channelname = o.channel then '98%'
                  else '95%' end as grade,
             'word segmentation match recommendation' as splitstatus_std
      from collatejobdetail a, ovalmaster o
      where a.jobid = v_jobid ......
        and a.keyword01 = o.keyword01
        and a.keyword02 = o.keyword02
        and a.keyword03 = o.keyword03
        and a.hashvalue = o.hashvalue
        /* hashvalue:
           ora_hash(a.keyword04) + ora_hash(a.keyword05) +
           ora_hash(a.keyword06) + ora_hash(a.keyword07) +
           ora_hash(a.keyword08) + ora_hash(a.keyword09) +
           ora_hash(a.keyword10) + ora_hash(a.keyword11) +
           ora_hash(a.keyword12) + ora_hash(a.keyword19) =
           ora_hash(o.keyword04) + ora_hash(o.keyword05) +
           ora_hash(o.keyword06) + ora_hash(o.keyword07) +
           ora_hash(o.keyword08) + ora_hash(o.keyword09) +
           ora_hash(o.keyword10) + ora_hash(o.keyword11) +
           ora_hash(o.keyword12) + ora_hash(o.keyword19) */
     )
order by orgname, orglen

In step 210, if the computing device 110 confirms that the word segmentation result does not match the reference name, it performs semantic similarity analysis on the reference name and the preprocessed data generated based on the word segmentation result, so as to identify the pharmaceutical industry target object to be identified based on the result of the similarity analysis. Conversely, if the computing device 110 confirms that the word segmentation result matches the reference name, the pharmaceutical industry target object to be identified is identified as the target object associated with the reference name.

The method for performing semantic similarity analysis on the word segmentation result and the reference name includes, for example: determining the overlapping part of the preprocessed data and the reference name; and deleting the overlapping part from the preprocessed data to obtain a remaining part.

In response to determining that a first predetermined confidence condition is satisfied, the matching confidence level between the original data to be recognized and the reference name is determined to be a first level, the first level indicating a match between the original data to be recognized and the reference name. The first predetermined confidence condition includes any one of: the remaining part includes a number of words less than or equal to a first word number threshold; the remaining part includes a number of words greater than a second word number threshold, and the remaining part and the overlapping part are associated with the same channel type information, the second word number threshold being greater than the first word number threshold; the remaining part includes a number of words greater than the first word number threshold and less than the second word number threshold, and the remaining part includes a pair of parentheses; the remaining part contains "original", or a parenthesis followed by "original"; the remaining part contains a pair of parentheses and the number of words inside the parentheses is less than a third word number threshold, the third word number threshold being greater than the first word number threshold and less than the second word number threshold.

In response to a second predetermined confidence condition being met, the matching confidence level between the original data to be identified and the reference name is determined to be a second level. The second predetermined confidence condition includes any one of: the preprocessed data and the reference name have an overlapping part, and the preprocessed data and the reference name have the same channel type sub-classification; the word segmentation results of the preprocessed data and the reference name have an overlapping part after structural reorganization, and the channel type classification information of the preprocessed data and the reference name is the same.

In response to a third predetermined confidence condition being met, a mismatch between the original data to be identified and the reference name is determined. The third predetermined confidence condition includes: the word segmentation results of the preprocessed data and the reference name have an overlapping part after structural reorganization, and the channel type classification information of the preprocessed data and the reference name is different. The method for performing semantic similarity analysis on the word segmentation result and the reference name will be described in detail below with reference to fig. 5 and is not repeated here.

In some embodiments, if neither the hash calculation on the word segmentation result nor the semantic similarity analysis on the word segmentation result and the reference name accurately identifies the pharmaceutical industry target object to be identified, the computing device 110 may adjust the weights of the semantic similarity analysis based on the channel type information, and then perform the semantic similarity analysis on the word segmentation result and the reference name again using the adjusted weights.

In the above scheme, administrative division information and channel type information are identified in the acquired original data to be identified that indicates a target object in the pharmaceutical industry, and noise removal and word segmentation are then performed on the original data based on the identified administrative division information, the channel type information, and at least one of a noise word bank, a semantic equivalent word bank, and a fixed word bank, so as to generate a word segmentation result. The word segmentation result is thus denoised and standardized by semantic equivalent words and/or fixed words, with the channel type information assisting the judgment, which addresses the structural differences, irregular expression, and easy confusion found in original data about pharmaceutical industry target objects. In addition, the present disclosure performs hash calculation on the plurality of keywords included in the word segmentation result in order to confirm whether the word segmentation result matches the reference name; if it does not, semantic similarity analysis is performed on the reference name and the preprocessed data generated from the word segmentation result, so as to identify the pharmaceutical industry target object to be identified based on the result of the similarity analysis.

A method for identifying administrative division information and channel type information in the original data to be identified is described below with reference to fig. 3. Fig. 3 shows a flowchart of a method 300 for identifying administrative division information and channel type information in raw data to be identified, according to an embodiment of the present disclosure. The method 300 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 700 shown in FIG. 7. It should be understood that method 300 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.

At step 302, the computing device 110 determines a plurality of sets of keywords respectively associated with different priority orders, each set of keywords comprising a plurality of predetermined keywords.

Regarding the keyword set for identifying the channel type classification, it includes, for example, a keyword set for identifying chain drug stores, a keyword set for identifying individual drug stores, a keyword set for identifying chain companies, a keyword set for identifying hospitals, and a keyword set for identifying health supervisors.

Table five below illustrates exemplary keyword sets for identifying chain pharmacies and keyword sets for identifying individual pharmacies.

Table 5

At step 304, the computing device 110 determines, among the plurality of keyword sets, the target keyword set that contains a predetermined keyword included in the original data to be recognized. For example, if the computing device 110 determines that the raw data to be identified includes the predetermined keyword "%retail center%" and does not include "%chain store%", the target keyword set is the keyword set in the second row of Table five.

At step 306, the computing device 110 determines a channel type sub-classification name matching the raw data to be identified based on the priority order associated with the target keyword set. For example, the keyword set in the second row of Table five is associated with priority order 18, and the computing device 110 determines, based on priority order 18, the channel type sub-classification name "individual pharmacy" matching the raw data to be identified. It should be appreciated that, when identifying a channel type, the keyword sets are evaluated in their associated priority order, and the first set whose condition is satisfied is considered to locate the channel type sub-classification name that matches the original data to be identified.

At step 308, the computing device 110 determines a channel type classification name and a channel type classification serial number that match the raw data to be identified based on the determined channel type sub-classification name. For example, based on the determined channel type sub-classification name "individual pharmacy", the computing device 110 determines the channel type classification name "terminal pharmacy" and the channel type classification serial number "114" that match the original data to be identified.
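Steps 302 to 308 can be sketched as a priority-ordered scan over keyword sets. Only the row with priority 18 reflects values given in the text; the row with priority 17, its sub-classification name, and the assumption that a lower priority number is checked first are hypothetical, as are the names `KEYWORD_SETS`, `CATEGORY_OF`, and `classify_channel`.

```python
# Each keyword set: (priority, must_contain, must_not_contain, sub_classification).
# The priority-18 row mirrors the example in the text; the priority-17 row is a
# hypothetical placeholder. A lower priority number is assumed to be checked first.
KEYWORD_SETS = [
    (17, ("chain store",), (), "chain pharmacy branch"),
    (18, ("retail center",), ("chain store",), "individual pharmacy"),
]

# sub_classification -> (classification name, classification serial number).
CATEGORY_OF = {
    "individual pharmacy": ("terminal pharmacy", "114"),
    "chain pharmacy branch": ("terminal pharmacy", None),  # serial number not given
}


def classify_channel(name):
    """Return (sub_classification, classification, serial) or None if nothing matches."""
    for _, include, exclude, sub in sorted(KEYWORD_SETS, key=lambda s: s[0]):
        has_keyword = any(k in name for k in include)
        has_excluded = any(k in name for k in exclude)
        if has_keyword and not has_excluded:
            category, serial = CATEGORY_OF[sub]
            return sub, category, serial
    return None


print(classify_channel("Wangzheng retail center"))
```

The first keyword set that matches in priority order wins, so the more specific sets should carry the earlier priorities.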

In the above scheme, the present disclosure can accurately determine the channel type to which the original data to be identified belongs, and this accurate channel type in turn helps improve the accuracy of identifying the target object in the pharmaceutical industry.

A method for segmenting keywords in the form of digital words in the original data to be recognized is described below with reference to fig. 4. Fig. 4 shows a flow diagram of a method 400 for segmenting out keywords in the form of digital words in original data to be identified, according to an embodiment of the present disclosure. The method 400 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 700 shown in FIG. 7. It should be understood that method 400 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.

In step 402, the computing device 110 converts uppercase and/or lowercase Chinese numerals in the raw data to be identified into Arabic numerals. It should be understood that the original data to be recognized may contain uppercase or lowercase Chinese numerals, telephone numbers, and zip codes. These numbers may appear before, after, or in the middle of the name of the target object to be recognized. For example, the computing device 110 may unify all uppercase and lowercase Chinese numerals appearing in the raw data to be recognized into Arabic numerals; e.g., "one hundred fifty one" written in lowercase Chinese numerals, the digit-by-digit form "one five one", and "one hundred fifty one" written in uppercase Chinese numerals are all converted into the Arabic numeral "151". In this way, the digital words in the original data to be recognized are normalized.
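The numeral normalization in step 402 might look like the sketch below. It is an illustration under assumptions: the function names are hypothetical, only values below 10,000 are handled, and colloquial abbreviations are not covered.

```python
import re

# Lowercase and uppercase (financial) Chinese digits.
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9, "壹": 1, "贰": 2, "叁": 3, "肆": 4,
          "伍": 5, "陆": 6, "柒": 7, "捌": 8, "玖": 9}
UNITS = {"十": 10, "拾": 10, "百": 100, "佰": 100, "千": 1000, "仟": 1000}
NUMERAL_RUN = re.compile("[" + "".join(DIGITS) + "".join(UNITS) + "]+")


def chinese_to_arabic(numeral):
    """Convert a run of Chinese numeral characters (value < 10000) to an int."""
    if not any(ch in UNITS for ch in numeral):
        # Digit-by-digit form such as 一五一.
        return int("".join(str(DIGITS[ch]) for ch in numeral))
    total, current = 0, 0
    for ch in numeral:
        if ch in DIGITS:
            current = DIGITS[ch]
        else:
            # A unit with no preceding digit (e.g. 十五) counts as one unit.
            total += (current or 1) * UNITS[ch]
            current = 0
    return total + current


def normalize_numerals(text):
    """Replace every run of Chinese numerals in `text` with Arabic digits."""
    return NUMERAL_RUN.sub(lambda m: str(chinese_to_arabic(m.group())), text)


print(normalize_numerals("第壹佰伍拾壹药店"))
```

All three spellings from the example (一百五十一, 一五一, 壹佰伍拾壹) map to 151 under this sketch.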

At step 404, the computing device 110 determines whether the number of digits of the converted Arabic numeral is greater than or equal to a predetermined number-of-digits threshold.

The predetermined number-of-digits threshold is, for example and without limitation, 6 digits.

If the computing device 110 determines that the number of digits of the converted Arabic numeral is greater than or equal to the predetermined number-of-digits threshold, the converted Arabic numeral is removed at step 406. For example, if the converted Arabic numeral has 6 or more digits, it is discarded regardless of where it appears, because it is likely to be a telephone number, a zip code, or similar information.

In step 408, if the computing device 110 determines that the number of digits of the converted Arabic numeral is less than the predetermined number-of-digits threshold, it determines whether the converted Arabic numeral is located at the start position or the end position of the original data to be recognized.

In step 410, if the computing device 110 determines that the converted Arabic numeral is located at the start position or the end position of the original data to be identified, it determines whether the data adjacent to the Arabic numeral indicates a predetermined channel type. For example, if the converted Arabic numeral has fewer digits than the predetermined number-of-digits threshold and appears at the start of the target object name to be identified, it may also be removed, because it is likely to be a sequence number added inadvertently when the target object name was provided.

If the computing device 110 determines that the data adjacent to the Arabic numeral located at the start position or the end position does not indicate a predetermined channel type, it jumps to step 406 to remove the converted Arabic numeral.

In step 412, if the computing device 110 determines that the data adjacent to the Arabic numeral located at the start position or the end position indicates a predetermined channel type, the converted Arabic numeral is not removed.

For example, if the computing device 110 determines that the converted Arabic numeral is located at the start position or the end position of the original data to be recognized and the data adjacent to the numeral is not a pharmacy type or a medical institution type (e.g., the channel type is not a pharmacy), the Arabic numeral may be removed. If the computing device 110 determines that the converted Arabic numeral is located at the end position, immediately follows a pharmacy type or a medical institution type, and is less than or equal to the predetermined number threshold, the numeral cannot be removed. For example, in "56-shop Wangzheng" as exemplified in Table six below, the Arabic numeral "56" is located at the start of the raw data to be identified, the data immediately following it is a pharmacy type, and "56" is less than the predetermined number threshold associated with a municipal pharmaceutical company; the computing device 110 therefore determines that the Arabic numeral "56" cannot be removed.

As another example, consider "hippur king building health room 50" and "hippur king building health room 1" illustrated in Table six below. The Arabic numeral "50" or "1" is located at the end of the raw data to be identified, and the data immediately preceding the numeral is a medical institution type. Assuming that "50" is greater than the predetermined number threshold associated with rural health rooms, the computing device 110 determines to remove the Arabic numeral "50"; because "1" is less than that threshold, the computing device 110 determines that the Arabic numeral "1" cannot be removed.

Table 6

In the scheme, the digital noise in the original data to be recognized can be accurately recognized and removed, and the digital keywords which are beneficial to recognizing the target object can be accurately segmented.
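The decision flow of steps 404 to 412 can be sketched over an already tokenized name. The token format, the list `CHANNEL_WORDS`, and the choice to keep numerals found in the middle of the name are assumptions; the per-channel value threshold used in the Table six examples is left out for brevity.

```python
import re

CHANNEL_WORDS = ("pharmacy", "shop", "clinic", "health room")  # hypothetical list
DIGIT_THRESHOLD = 6  # example value from the text


def filter_numbers(tokens, channel_words=CHANNEL_WORDS, digit_threshold=DIGIT_THRESHOLD):
    """Drop numeric tokens that look like noise; keep those that identify a branch."""
    kept = []
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        if not re.fullmatch(r"[0-9]+", tok):
            kept.append(tok)
            continue
        if len(tok) >= digit_threshold:
            continue  # step 406: likely a telephone number or zip code
        if i in (0, last):
            # Steps 408-410: look at the token adjacent to the numeral.
            if i == 0:
                neighbor = tokens[1] if last > 0 else ""
            else:
                neighbor = tokens[i - 1]
            if not any(word in neighbor for word in channel_words):
                continue  # no channel type next to it: remove
        kept.append(tok)  # step 412, or a numeral in the middle (kept by assumption)
    return kept


print(filter_numbers(["56", "shop", "Wangzheng"]))
print(filter_numbers(["Wangzheng", "pharmacy", "13812345678"]))
```

The first call keeps "56" because a channel type word follows it; the second drops the eleven-digit number regardless of position.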

A method for performing semantic similarity analysis on the word segmentation result and the reference name is described below with reference to fig. 5. Fig. 5 shows a flow diagram of a method 500 for performing semantic similarity analysis on a word segmentation result and a reference name, according to an embodiment of the present disclosure. The method 500 may be performed by the computing device 110 as shown in fig. 1, or may be performed at the electronic device 700 shown in fig. 7. It should be understood that method 500 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.

At step 502, the computing device 110 determines the overlap of the preprocessed data and the reference name.

At step 504, the computing device 110 deletes the overlapping portion in the pre-processed data to obtain a remaining portion.

For example, the overlapping part of the preprocessed data and the reference name is "clinical setting of the public area, field, and eastern clinic". The remaining part of the preprocessed data after the overlapping part is deleted is "1".

At step 506, if the computing device 110 determines that a first predetermined confidence condition is satisfied, it determines the matching confidence level between the original data to be recognized and the reference name to be a first level, the first level indicating a match between the original data to be recognized and the reference name. The first predetermined confidence condition includes any one of: the remaining part includes a number of words less than or equal to a first word number threshold; the remaining part includes a number of words greater than a second word number threshold, and the remaining part and the overlapping part are associated with the same channel type information, the second word number threshold being greater than the first word number threshold; the remaining part includes a number of words greater than the first word number threshold and less than the second word number threshold, and the remaining part includes a pair of parentheses; the remaining part contains "original", or a parenthesis followed by "original"; the remaining part contains a pair of parentheses and the number of words inside the parentheses is less than a third word number threshold, the third word number threshold being greater than the first word number threshold and less than the second word number threshold; or there is an overlapping part between the preprocessed data and the reference name, and the preprocessed data and the reference name have the same channel type sub-classification.

The first word number threshold is, for example and without limitation, 2. For example, the number of words in the above remaining portion "1" is less than the first word number threshold, so the matching confidence level between the original data to be recognized and the reference name is determined to be the first level; for example, the matching similarity is 100%, that is, the original data to be recognized matches the reference name.

The second word number threshold is, for example and without limitation, 10. For example, the overlapping portion between the preprocessed data "Qitaihe City Yuanfu Grand Pharmacy (Qitaihe City Yuanhongfu Medical Instrument Shop)" and the reference name "Qitaihe City Yuanfu Grand Pharmacy" is "Qitaihe City Yuanfu Grand Pharmacy". The remaining portion of the preprocessed data after the overlapping portion is deleted is "(Qitaihe City Yuanhongfu Medical Instrument Shop)". Because the remaining portion contains more than 10 words and the channel type information of the remaining portion is the same as that of the overlapping portion, the matching confidence level between the original data to be recognized and the reference name is determined to be the first level; for example, the matching similarity is 98%, that is, the original data to be recognized and the reference name are highly similar and therefore match.

For example, the overlapping portion between the preprocessed data "Kangwu North Village Health Room (original Kangwu Joint Health Room)" and the reference name "Kangwu North Village Health Room" is "Kangwu North Village Health Room". The remaining portion of the preprocessed data after the overlapping portion is deleted is "(original Kangwu Joint Health Room)". Because the remaining portion contains "original" or "(original", the matching confidence level between the original data to be recognized and the reference name is determined to be the first level; for example, the matching similarity is 99%, that is, the original data to be recognized and the reference name are highly similar and therefore match.

The third word number threshold is, for example and without limitation, 4. For example, the overlapping portion between the preprocessed data "Zhongshan City Sanxiang Town Xiang Hongtang Medicine Retail (06)" and the reference name "Zhongshan City Sanxiang Town Xiang Hongtang Medicine Retail" is "Zhongshan City Sanxiang Town Xiang Hongtang Medicine Retail". The remaining portion of the preprocessed data after the overlapping portion is deleted, "(06)", contains a pair of brackets, and the content of the brackets is shorter than 4 characters. The matching confidence level between the original data to be recognized and the reference name is therefore determined to be the first level; for example, the matching similarity is 96%, that is, the original data to be recognized and the reference name are highly similar and therefore match.

For example, the overlapping portion between the preprocessed data "Xiamen Lake Ling First Clinic Co., Ltd." and the reference name "Xiamen Lake Ling First Clinic" is "Xiamen Lake Ling First Clinic". Because the reference name is entirely contained in the preprocessed data and both have the same channel type sub-classification name, the matching confidence level between the original data to be recognized and the reference name is determined to be the first level; for example, the matching similarity is 70%, that is, the original data to be recognized and the reference name are sufficiently similar and therefore match.

A matching confidence level of the first level indicates, for example, that the matching similarity is between 70% and 100%.
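The first predetermined confidence condition of step 506 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `bracket_content`, `is_first_level`, `channel_of` and `subcategory_of` are hypothetical, the overlapping portion is simplified to a verbatim occurrence of the reference name, and "number of words" is approximated by the number of characters. The threshold values 2, 10 and 4 are the non-limiting examples given above.

```python
from typing import Callable, Optional

# Example threshold values quoted in the text (non-limiting).
FIRST_THRESHOLD = 2
SECOND_THRESHOLD = 10
THIRD_THRESHOLD = 4

# Half-width and full-width bracket pairs (assumption: both may occur).
BRACKET_PAIRS = (("(", ")"), ("（", "）"))


def bracket_content(text: str) -> Optional[str]:
    """Return the content of the first complete bracket pair, or None."""
    for opening, closing in BRACKET_PAIRS:
        start = text.find(opening)
        if start == -1:
            continue
        end = text.find(closing, start + 1)
        if end != -1:
            return text[start + 1:end]
    return None


def is_first_level(
    preprocessed: str,
    reference: str,
    channel_of: Callable[[str], str],      # hypothetical channel type lookup
    subcategory_of: Callable[[str], str],  # hypothetical sub-classification lookup
) -> bool:
    """Check the six alternatives of the first predetermined confidence condition."""
    # Simplification: the overlap is the reference name found verbatim.
    if reference not in preprocessed:
        return False
    overlap = reference
    remaining = preprocessed.replace(overlap, "", 1).strip()
    n = len(remaining)  # assumption: characters stand in for "words"
    inner = bracket_content(remaining)
    return (
        n <= FIRST_THRESHOLD
        or (n > SECOND_THRESHOLD and channel_of(remaining) == channel_of(overlap))
        or (FIRST_THRESHOLD < n < SECOND_THRESHOLD and inner is not None)
        or "original" in remaining
        or (inner is not None and len(inner) < THIRD_THRESHOLD)
        or subcategory_of(preprocessed) == subcategory_of(reference)
    )
```

In practice the two lookup callables would be backed by the channel type identification described earlier; here they are left as parameters so the sketch stays self-contained.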

At step 508, if the computing device 110 determines that a second predetermined confidence condition is satisfied, it determines the matching confidence level between the original data to be recognized and the reference name to be a second level. The second predetermined confidence condition comprises: the word segmentation results of the preprocessed data and of the reference name have an overlapping portion after structural reorganization, and the channel type classification information of the preprocessed data and of the reference name is the same.

For example, the preprocessed data is "Huainan City Panji District Luji Town Health Center (Chengbei Village Health Room)" and the reference name is "Luji Town Chengbei Village Health Room", and there is an overlapping portion between them. The word segmentation result of the preprocessed data is "Luji health (Chengbei health)". After structural reorganization, the part "(Chengbei health)" of this word segmentation result is contained in the word segmentation result of the reference name, "Luji Chengbei health", and the channel type classification information of the preprocessed data and of the reference name is the same. The matching confidence level between the original data to be recognized and the reference name is therefore determined to be the second level; for example, the matching similarity is 65%, that is, there is a certain similarity between the original data to be recognized and the reference name.

In some embodiments, a matching confidence level of a second level may be considered a match between the preprocessed data and the reference name.

At step 510, if the computing device 110 determines that a third predetermined confidence condition is satisfied, it determines that the original data to be recognized and the reference name do not match. The third predetermined confidence condition comprises: the word segmentation results of the preprocessed data and of the reference name have an overlapping portion after structural reorganization, and the channel type classification information of the preprocessed data and of the reference name is different.

For example, there is an overlapping portion between the preprocessed data "Laoshan Branch of Huarun Qingdao Medicine Co., Ltd." and the reference name "Huarun Qingdao Medicine Co., Ltd.". However, the channel type information of the preprocessed data differs from that of the reference name: the reference name "Huarun Qingdao Medicine Co., Ltd." is a commercial company, whereas the preprocessed data "Laoshan Branch of Huarun Qingdao Medicine Co., Ltd." is a drug store, so the two are not considered similar. A mismatch between the original data to be recognized and the reference name is therefore determined.

By adopting the above means, the method and the device can still quickly and accurately identify whether the original data to be identified is matched with the reference name or not under the condition that the preprocessed data is different from the reference name.
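The decisions of steps 508 and 510 can be sketched as follows. This is only an illustration under stated assumptions: "structural reorganization" is approximated by comparing keyword sets regardless of order, and the function name `classify_after_reorganization` and the returned labels are hypothetical.

```python
from typing import List


def classify_after_reorganization(
    pre_keywords: List[str],
    ref_keywords: List[str],
    pre_channel: str,
    ref_channel: str,
) -> str:
    """Return "second_level", "mismatch", or "undetermined".

    Assumption: keywords that appear in both segmentation results,
    irrespective of their order, count as the overlapping portion.
    """
    shared = set(pre_keywords) & set(ref_keywords)
    if not shared:
        # Neither the second nor the third condition applies.
        return "undetermined"
    # Same channel type classification: second level; otherwise mismatch.
    return "second_level" if pre_channel == ref_channel else "mismatch"
```

With this sketch, the health room example above would yield `"second_level"`, while the commercial company versus drug store example would yield `"mismatch"`.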

The following describes the method for generating the segmentation result in conjunction with fig. 6. FIG. 6 shows a flow diagram of a method 600 for generating a segmentation result, in accordance with an embodiment of the present disclosure. The method 600 may be performed by the computing device 110 as shown in FIG. 1, or may be performed at the electronic device 700 shown in FIG. 7. It should be understood that method 600 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect.

At step 602, the computing device 110 obtains non-administrative division data excluding administrative division information from the original data to be identified based on the identified administrative division information.

At step 604, the computing device 110 performs noise word removal and equivalent word replacement for the non-administrative division data.

The equivalent word bank comprises a large database of equivalent words that are classified, for example, by manual labeling or by machine learning. Table six below schematically shows some of the equivalent words in the equivalent word bank.

Table six

Sequence number	Original word	Equivalent word	Classification
126	Pharmaceutical trade company	%company%	Chain
126	Drugstore Co Ltd	%company%	Chain
140	Store Co Ltd	%company%	Chain
61	Chain of drugstores, inc	%company%	Chain
209	Large drug store for Chinese medicine stock control	%company%	Chain
225	Retail store	%store%	Chain
226	Straight camp part	%store%	Chain
252	Storefront	%store%	Chain
224	Franchise chain stores	%store%	Chain
230	Joining chain	%store%	Chain
235	Affiliate store	%store%	Chain
240	Chain store	%store%	Chain
245	Branch store	%store%	Chain

The computing device 110 determines a plurality of groups of associated words, each group of associated words comprising an original word and an equivalent word, where the original word and the equivalent word have consistent semantics when indicating a target object in the pharmaceutical industry; determines a sequence number and a classification for each group of associated words, where the sequence number indicates the priority of the group of associated words; and replaces and segments the original data to be recognized using the equivalent words based on the determined sequence numbers, so that the data after equivalent word replacement and segmentation comprises the equivalent words and a predetermined identifier, where the predetermined identifier indicates a segmentation position.

The predetermined identifier is, for example and without limitation, "%", and indicates that a segmentation position is located at the corresponding position. Equivalent words are applied in priority order: the computing device 110 replaces and segments the original data to be recognized using the equivalent words in order of "sequence number" (for example, as shown in Table six, longer original words generally have higher priority). For example, the original word "store Co., Ltd." in "Suzhou Huiren pharmaceutical store Co., Ltd." is replaced and segmented by the corresponding equivalent word, and the data after equivalent word replacement and segmentation is, for example, "Suzhou Huiren pharmaceutical%company".
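The priority-ordered replacement can be sketched in Python as follows. The sketch rests on several assumptions that the text does not confirm: a smaller sequence number is treated as higher priority, the sample entries are lower-cased English stand-ins for the original words, and the final tidy-up (removing spaces around the identifier and trimming it from the ends) is added only so that the output resembles the example above. The names `EQUIVALENTS` and `replace_equivalents` are hypothetical.

```python
import re
from typing import List, Tuple

# (sequence number, original word, equivalent word); illustrative stand-ins.
EQUIVALENTS: List[Tuple[int, str, str]] = [
    (140, "store co ltd", "%company%"),
    (126, "drugstore co ltd", "%company%"),
    (252, "storefront", "%store%"),
]


def replace_equivalents(
    name: str,
    table: List[Tuple[int, str, str]] = EQUIVALENTS,
    sep: str = "%",
) -> str:
    """Replace original words by equivalent words in priority order."""
    # Assumption: a smaller sequence number means a higher priority.
    for _, original, equivalent in sorted(table, key=lambda row: row[0]):
        name = name.replace(original, equivalent)
    # Tidy-up (assumption): no spaces around identifiers, no repeats,
    # and no identifier at the very start or end.
    pattern = rf"\s*(?:{re.escape(sep)})+\s*"
    name = re.sub(pattern, lambda _match: sep, name)
    return name.strip(sep)
```

Applying entry 126 before entry 140 matters here: if "store co ltd" were replaced first, "drugstore co ltd" would be left with a stray "drug" fragment, which is one reason longer original words would be given higher priority.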

At step 606, the computing device 110 segments the data obtained after noise word removal and equivalent word replacement based on the fixed word bank to generate a word segmentation result corresponding to the original data to be recognized, the word segmentation result including a plurality of keywords and a plurality of predetermined identifiers indicating segmentation positions. Fixed words include, for example, at least: provinces, cities, counties, and other conventional fixed phrases. For example, in the institution name "health station of the middle school affiliated with Beijing University", the terms "university", "affiliated", "middle school" and "health station" are fixed words and do not need to be split.

In some embodiments, after performing noise word removal and equivalent word replacement, the computing device 110 checks, based on the ASCII code table, whether the ASCII value of each ASCII character in the resulting data is outside a first predetermined numerical range (e.g., 48 to 57, i.e., the digits 0 to 9), and removes all such characters that are outside the range.

The main reason is as follows: noise word removal and equivalent word replacement remove one batch of noise words, but letters, hyphens and similar content often still appear in institution names, whereas the names of Chinese pharmaceutical retail institutions and medical terminals do not normally contain capital letters or other English symbols. Symbols other than Chinese characters and digits therefore need to be removed, which further filters out noise.
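The ASCII-based filtering can be sketched as below. The function name `strip_ascii_non_digits` and the `keep` parameter are hypothetical; in particular, preserving the "%" identifier is an assumption made here so that segmentation positions inserted by equivalent word replacement survive the filter, which the text does not state explicitly.

```python
def strip_ascii_non_digits(
    text: str, low: int = 48, high: int = 57, keep: str = "%"
) -> str:
    """Drop ASCII characters whose code lies outside [low, high].

    Non-ASCII characters (for example Chinese characters) are kept.
    Characters listed in `keep` are also kept (assumption).
    """
    kept = []
    for ch in text:
        code = ord(ch)
        is_ascii = code < 128
        if is_ascii and not (low <= code <= high) and ch not in keep:
            continue  # ASCII letter, hyphen, or other symbol: remove
        kept.append(ch)
    return "".join(kept)
```

For instance, a name fragment such as "大药房A-12%公司" would be reduced to "大药房12%公司": the letter and the hyphen are removed, while the Chinese characters, the digits and the identifier remain.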

The data obtained after noise word removal and equivalent word replacement is segmented, for example, as follows: after the noise word removal and equivalent word replacement are completed, the computing device 110 uses the fixed words to "segment" the original data to be recognized. Taking "Suzhou Huiren pharmaceutical store Co., Ltd." as an example, it becomes "Suzhou Huiren pharmaceutical%company" after equivalent word replacement and segmentation, and is then segmented into "Suzhou%Huiren%pharmaceutical%company" using the fixed word bank. Here "Suzhou" is geographic information, which is extracted and stored separately, and the rest is separated item by item to generate the word segmentation result corresponding to the original data to be recognized. For example, Table seven below illustrates the analysis result of the original data to be recognized "Suzhou Huiren pharmaceutical store Co., Ltd.", Table eight illustrates the word segmentation result of the original data to be recognized "pharmaceutical store Co., Ltd. (Suzhou Hui)", and Table nine illustrates the word segmentation result of the original data to be recognized "Suzhou Huiren pharmaceutical trade Co., Ltd.".

Table seven

Table eight

Table nine
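Segmentation against the fixed word bank can be sketched as a greedy longest-match scan. This is one plausible reading of the step, not the method confirmed by the text; the function name `segment_with_fixed_words` is hypothetical, and the example uses an English name written without spaces as a stand-in for a Chinese institution name.

```python
from typing import Iterable, List


def segment_with_fixed_words(
    text: str, fixed_words: Iterable[str], sep: str = "%"
) -> List[str]:
    """Split text at existing identifiers and around fixed words."""
    # Longest fixed words first, so that longer phrases win over shorter ones.
    words = sorted({w for w in fixed_words if w}, key=len, reverse=True)
    tokens: List[str] = []
    for chunk in text.split(sep):
        buffer = ""
        i = 0
        while i < len(chunk):
            match = next((w for w in words if chunk.startswith(w, i)), None)
            if match is not None:
                if buffer:  # flush the non-fixed text collected so far
                    tokens.append(buffer)
                    buffer = ""
                tokens.append(match)
                i += len(match)
            else:
                buffer += chunk[i]
                i += 1
        if buffer:
            tokens.append(buffer)
    return tokens


keywords = segment_with_fixed_words(
    "SuzhouHuirenPharmaceutical%Company", ["Suzhou", "Pharmaceutical"]
)
segmented = "%".join(keywords)
```

Joining the returned keywords with the identifier reproduces the form of the word segmentation result described above, with one identifier at each segmentation position.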

At step 608, the computing device 110 identifies words of the numeric type in the raw data to be identified.

At step 610, the computing device 110 performs a normalization process on the identified words of the numeric type to segment the keywords in the form of numeric words in the raw data to be identified. The method for segmenting the keywords in the form of digital words in the original data to be recognized has already been described above with reference to fig. 4, and thus, the description thereof is omitted here.

At step 612, the computing device 110 combines the plurality of keywords included in the word segmentation result into pre-processed data without geographic information for matching with the reference name.

In some embodiments, the computing device 110 may combine the plurality of segmented keywords into preprocessed data without geographic information for matching with a reference name. For example, for "Suzhou Huiren pharmaceutical store Co., Ltd.", the combined preprocessed data without geographic information is "Huiren pharmaceutical Co., Ltd.", which is used for matching with a reference name.
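Step 612 can be sketched in a few lines. The function name `combine_without_geo` is hypothetical, and the geographic terms are assumed to be available from the administrative division identification performed earlier.

```python
from typing import Iterable, List


def combine_without_geo(keywords: List[str], geo_terms: Iterable[str]) -> str:
    """Join the keywords, skipping those that are geographic information."""
    geo = set(geo_terms)
    return "".join(k for k in keywords if k not in geo)
```

The keywords are concatenated without a separator because Chinese names carry no spaces; for space-delimited text a different joiner would be appropriate.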

Fig. 7 schematically shows a block diagram of an electronic device 700 suitable for implementing an embodiment of the invention. The electronic device 700 may be a device for implementing the methods 200 to 600 shown in fig. 2 to 6. As shown in fig. 7, the electronic device 700 includes a central processing unit (i.e., CPU 701) that can perform various appropriate actions and processes in accordance with computer program instructions stored in a read-only memory (i.e., ROM 702) or loaded from a storage unit 708 into a random access memory (i.e., RAM 703). The RAM 703 can also store various programs and data necessary for the operation of the electronic device 700. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output interface (i.e., I/O interface 705) is also connected to the bus 704.

A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The CPU 701 performs the respective methods and processes described above, such as the methods 200 to 600. For example, in some embodiments, the methods 200 to 600 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the CPU 701, one or more operations of the methods 200 to 600 described above may be performed. Alternatively, in other embodiments, the CPU 701 may be configured in any other suitable manner (e.g., by way of firmware) to perform one or more acts of the methods 200 to 600.

It should be further appreciated that the present invention may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as a punch card or an in-groove protruding structure with instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above description is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations of the present invention are possible to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for identifying a pharmaceutical industry target object to be identified, comprising:

acquiring original data to be identified for indicating a target object in the pharmaceutical industry;

identifying administrative division information and channel type information in original data to be identified;

based on administrative division information, channel type information and at least one word bank of a noise word bank, a semantic equivalent word bank and a fixed word bank, performing noise removal and word segmentation on original data to be identified so as to generate word segmentation results, wherein the word segmentation results comprise a plurality of keywords;

performing hash calculation on a plurality of keywords included in the word segmentation result so as to determine whether the word segmentation result is matched with the reference name; and

in response to confirming that the word segmentation result does not match the reference name, performing semantic similarity analysis on the reference name and the preprocessed data that is based on the word segmentation result, so as to identify the pharmaceutical industry target object to be identified based on the result of the similarity analysis.

2. The method of claim 1, wherein performing a hash calculation on a plurality of keywords included in the segmentation result to confirm whether the segmentation result matches the reference name comprises:

calculating the sum of the hash values of a plurality of keywords included in the word segmentation result so as to generate the sum of the hash values of the word segmentation result;

calculating the sum of hash values of a plurality of keywords included in the reference name so as to generate the sum of hash values of the reference name;

determining whether the sum of the word segmentation result hash values is equal to the sum of the reference name hash values; and

in response to determining that the sum of the word segmentation result hash values is equal to the sum of the reference name hash values, determining that the word segmentation result matches the reference name.

3. The method of claim 1 or 2, further comprising:

identifying the pharmaceutical industry target object to be identified as the target object associated with the reference name in response to confirming that the word segmentation result matches the reference name.

4. The method of claim 1, wherein generating a word segmentation result comprises:

acquiring non-administrative division data except the administrative division information in the original data to be identified based on the identified administrative division information;

carrying out noise word removal and equivalent word replacement aiming at non-administrative division data; and

segmenting data subjected to noise word removal and equivalent word replacement based on a fixed word bank so as to generate segmentation results corresponding to original data to be recognized, wherein the segmentation results comprise a plurality of key words and a plurality of preset identifiers indicating segmentation positions.

5. The method of claim 4, further comprising:

recognizing digital words in original data to be recognized;

normalizing the recognized digital words so as to segment the keywords in the form of the digital words in the original data to be recognized; and

combining a plurality of keywords included in the word segmentation result into preprocessed data without geographic information for matching with the reference name.

6. The method of claim 2, wherein normalizing the identified words of the numeric type to segment the keywords in the form of numeric words in the raw data to be identified comprises:

converting upper case Chinese numbers and/or lower case Chinese numbers in original data to be identified into Arabic numbers;

determining whether the number of digits of the converted Arabic numerals is greater than or equal to a predetermined number-of-digits threshold;

removing the converted Arabic numerals in response to determining that the number of digits of the converted Arabic numerals is greater than or equal to the predetermined number-of-digits threshold;

in response to determining that the number of digits of the converted Arabic numerals is less than the predetermined number-of-digits threshold, determining whether the converted Arabic numerals are located at a start position or an end position of the original data to be recognized;

in response to determining that the converted Arabic numerals are located at the start position or the end position of the original data to be recognized, determining whether data adjacent to the Arabic numerals located at the start position or the end position indicates a predetermined channel type; and

in response to determining that the data adjacent to the Arabic numerals located at the start position or the end position does not indicate the predetermined channel type, removing the converted Arabic numerals.

7. The method of claim 2 wherein channel type information comprises: a channel type sub-classification name, a channel type classification name and a channel type classification serial number.

8. The method of claim 1, wherein identifying administrative division information and channel type information in raw data to be identified comprises:

determining a plurality of keyword sets respectively associated with different priority orders, wherein each keyword set comprises a plurality of preset keywords;

determining a target keyword set in which a preset keyword included in original data to be identified is located in a plurality of keyword sets;

determining a channel type sub-classification name matched with original data to be identified based on the priority order associated with the target keyword set; and

determining a channel type classification name and a channel type classification serial number which are matched with the original data to be identified based on the determined channel type sub-classification name.

9. The method of claim 1, wherein performing noise removal and word segmentation on the raw data to be identified comprises:

determining a plurality of groups of associated words, wherein each group of associated words comprises an original word and an equivalent word, and the original word and the equivalent word have consistent semantics when indicating a target object in the pharmaceutical industry;

determining a sequence number and a classification for each group of associated words, wherein the sequence number indicates the priority of the group of associated words; and

replacing and segmenting the original data to be recognized by using the equivalent words based on the determined sequence numbers, so that the data subjected to equivalent word replacement and segmentation comprises the equivalent words and a predetermined identifier, wherein the predetermined identifier indicates a segmentation position.

10. The method of claim 1, wherein performing semantic similarity analysis on the reference name and the preprocessed data comprises:

determining a coincidence part of the preprocessed data and the reference name;

deleting the overlapping portion in the preprocessed data to obtain a remaining portion;

in response to determining that a first predetermined confidence condition is satisfied, determining a matching confidence level between the original data to be recognized and the reference name as a first level, the matching confidence level being the first level indicating a match between the original data to be recognized and the reference name, the first predetermined condition including any one of:

the remaining portion includes a number of words less than or equal to a first word number threshold;

the remaining portion includes a number of words greater than a second word number threshold, and the remaining portion and the overlapping portion are associated with the same channel type information, the second word number threshold being greater than the first word number threshold;

the remaining portion includes a number of words greater than the first word number threshold and less than the second word number threshold, and the remaining portion includes a pair of parentheses;

the remaining portion comprises "original", or comprises parentheses and "original";

the remaining portion contains a pair of brackets and the number of words within the brackets is less than a third word number threshold, the third word number threshold being greater than the first word number threshold and less than the second word number threshold;

an overlapping portion exists between the preprocessed data and the reference name, and the preprocessed data and the reference name have the same channel type sub-classification.
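
The first predetermined confidence condition can be illustrated with the following sketch. The threshold values, the word-counting rule (runs of word characters), and the function names are assumptions made for illustration; the claim itself fixes only the ordering first threshold < third threshold < second threshold.

```python
import re

# Hypothetical thresholds satisfying T1 < T3 < T2, as the claim requires.
T1, T3, T2 = 2, 4, 6

# Matches one pair of ASCII or full-width parentheses and captures the content.
PAREN = re.compile(r"[(（]([^()（）]*)[)）]")


def word_count(s: str) -> int:
    # Assumption: a "word" is a run of word characters.
    return len(re.findall(r"\w+", s))


def is_first_level(remaining: str, overlap: str,
                   same_channel_type: bool, same_sub_class: bool) -> bool:
    """True if any one of the listed conditions holds."""
    n = word_count(remaining)
    paren = PAREN.search(remaining)
    return any([
        n <= T1,
        n > T2 and same_channel_type,
        T1 < n < T2 and paren is not None,
        # '"original"' and 'parentheses and "original"' both imply this check.
        "original" in remaining,
        paren is not None and word_count(paren.group(1)) < T3,
        bool(overlap) and same_sub_class,
    ])
```

Here `same_channel_type` stands for the comparison between the remaining portion and the overlapping portion, and `same_sub_class` for the comparison between the preprocessed data and the reference name; both are assumed to be computed elsewhere.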

11. The method of claim 10, wherein performing semantic similarity analysis on the word segmentation results and the reference names further comprises:

in response to determining that a second predetermined confidence condition is satisfied, determining a matching confidence level between the original data to be recognized and the reference name as a second level, the second predetermined confidence condition comprising:

determining that the word segmentation results of the preprocessed data and of the reference name have an overlapping portion after structural reorganization, and that the channel type classification information of the preprocessed data and of the reference name is the same; and

in response to determining that a third predetermined confidence condition is satisfied, determining that the original data to be identified and the reference name do not match, the third predetermined confidence condition comprising:

determining that the word segmentation results of the preprocessed data and of the reference name have an overlapping portion after structural reorganization, and that the channel type classification information of the preprocessed data and of the reference name is different.
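
The second and third predetermined confidence conditions differ only in whether the channel type classification information agrees. A rough sketch, assuming that "structural reorganization" can be modeled as comparing keyword sets without regard to order (this modeling choice, and all names below, are hypothetical):

```python
from enum import Enum
from typing import Iterable, Set


class MatchResult(Enum):
    SECOND_LEVEL = "second level"
    MISMATCH = "mismatch"
    UNDECIDED = "undecided"  # neither condition applies


def reorganized_overlap(a: Iterable[str], b: Iterable[str]) -> Set[str]:
    # Assumption: reorganization ignores keyword order, so sets are compared.
    return set(a) & set(b)


def second_or_third(pre_tokens: Iterable[str], ref_tokens: Iterable[str],
                    pre_channel: str, ref_channel: str) -> MatchResult:
    if not reorganized_overlap(pre_tokens, ref_tokens):
        return MatchResult.UNDECIDED
    if pre_channel == ref_channel:
        return MatchResult.SECOND_LEVEL  # second predetermined confidence condition
    return MatchResult.MISMATCH          # third predetermined confidence condition
```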

12. The method of claim 1, wherein identifying administrative division information and channel type information in the raw data to be identified comprises:

identifying administrative division information in the original data to be identified based on full names, abbreviations, former names, and excluded words relating to provinces, cities, and counties, the administrative division information including province information, city information, and county information; and

in response to confirming that the identified county information or city information does not indicate a unique county or city, identifying the administrative division information in the original data to be identified using subordinate administrative division information of the identified county information or city information, or using administrative division information of a target object associated with the target object to be identified.
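
The disambiguation step above can be sketched as follows. The gazetteer tables, their entries, and the function name `resolve_division` are hypothetical placeholders; handling of full names, abbreviations, former names, and excluded words is omitted from this sketch.

```python
from typing import Dict, List, Optional, Tuple

Division = Tuple[str, str, str]  # (province, city, county)

# Hypothetical gazetteer: county name -> possible (province, city) parents.
COUNTIES: Dict[str, List[Tuple[str, str]]] = {
    "Chaoyang": [("Beijing", "Beijing"), ("Jilin", "Changchun")],
    "Pudong": [("Shanghai", "Shanghai")],
}

# Hypothetical subordinate divisions -> the unique division they belong to.
LOWER: Dict[str, Division] = {"Wangjing": ("Beijing", "Beijing", "Chaoyang")}


def resolve_division(text: str,
                     associated: Optional[Division] = None) -> Optional[Division]:
    for county, parents in COUNTIES.items():
        if county not in text:
            continue
        if len(parents) == 1:
            # The county name is unique; no disambiguation is needed.
            return (*parents[0], county)
        # Ambiguous: first try a subordinate division mentioned in the text.
        for lower, division in LOWER.items():
            if lower in text and division[2] == county:
                return division
        # Otherwise fall back to the division of an associated target object.
        if associated and associated[2] == county and associated[:2] in parents:
            return associated
    return None
```

In this sketch an ambiguous county name is left unresolved (`None`) when neither a subordinate division nor an associated target object is available.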

13. A computing device, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.

14. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-12.