CN110390332A - Method, device and equipment for determining category - Google Patents


Info

Publication number
CN110390332A
Authority
CN
China
Prior art keywords
character
target
target character
group
character set
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810344756.6A
Other languages
Chinese (zh)
Other versions
CN110390332B (en)
Inventor
梁奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201810344756.6A
Publication of CN110390332A
Application granted
Publication of CN110390332B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a category determination method, device and equipment. The method comprises: dividing the name information of data into at least one character group by using character attributes; selecting a target character group from the at least one character group by using the character attributes of the character groups; and determining the category of the data according to the target character group. With this technical solution, the category of data can be determined effectively and the accuracy of category determination is improved; multiple descriptions of the same kind of commodity can be normalized to the same category as far as possible, which reduces the number of categories.

Description

Method, device and equipment for determining category
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, and a device for determining a category.
Background
Data classification means determining the category of data, acquiring all data that belong to the same category, and processing all data of that category together. For example, in the tax industry the commodity name is a key element of invoice data: the category of the invoice data can be determined from the commodity name, all invoice data of that category can be obtained, and that data can then be used for processing such as macroscopic analysis, sales anomaly detection, and tax evasion detection.
However, there is currently no effective way to determine the category of data.
For example, the commodity name in invoice data may be entered manually by the user, and there is no standard naming convention. In invoice data of the "cement" category, the commodity name may be: cement 208, cement 322, cement (quick set), cement PC 325-paper bag 50 KG-red water river brand, and so on. There is currently no effective way to classify invoice data with these commodity names into the "cement" category.
Disclosure of Invention
The application provides a category determination method, which comprises the following steps:
dividing name information of the data into at least one character group by using character attributes;
selecting a target character group from the at least one character group by using the character attributes of the character groups;
and determining the category of the data according to the target character group.
The application provides a category determination method, which comprises the following steps:
dividing the commodity name in the data into at least one character group by utilizing the character attribute;
selecting a target character group from the at least one character group by using the character attributes of the character groups;
determining the category corresponding to the commodity name according to the target character group;
and collecting the data into a category corresponding to the commodity name.
The present application provides a category determination device, the device comprising:
the segmentation module is used for segmenting the name information of the data into at least one character group by utilizing the character attributes;
the selection module is used for selecting a target character group from the at least one character group by utilizing the character attributes of the character groups;
and the determining module is used for determining the category of the data according to the target character group.
The present application provides a category determination device, the device comprising:
the segmentation module is used for segmenting the commodity name in the data into at least one character group by utilizing the character attribute;
the selection module is used for selecting a target character group from the at least one character group by utilizing the character attributes of the character groups;
the determining module is used for determining the category corresponding to the commodity name according to the target character group;
and the collection module is used for collecting the data to the category corresponding to the commodity name.
The present application provides a category determination device, including:
a processor and a machine-readable storage medium having stored thereon a plurality of computer instructions, wherein the processor, when executing the computer instructions, performs: dividing name information of the data into at least one character group by using character attributes; selecting a target character group from the at least one character group by using the character attributes of the character groups; and determining the category of the data according to the target character group.
Based on the technical scheme, in the embodiment of the application, the name information can be divided into at least one character group by using the character attribute, a target character group is selected from the at least one character group by using the character attribute of the character group, and then the category of the data is determined according to the target character group. The method can effectively determine the category of the data, improve the accuracy of category determination, and can reduce the number of categories by unifying multiple descriptions of the same type of commodities to the same category as much as possible. The method does not need to use a word segmentation device to perform word segmentation processing on the name information, and can identify the category even if the word segmentation dictionary of the word segmentation device does not have the name information.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments of the present application or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings of the embodiments of the present application.
FIG. 1 is a flow diagram of a category determination method in one embodiment of the present application;
FIG. 2 is a schematic diagram of a system architecture in one embodiment of the present application;
FIG. 3 is a schematic diagram of a graph structure in one embodiment of the present application;
FIG. 4 is a block diagram of a category determination device in one embodiment of the present application.
Detailed Description
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein is meant to encompass any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Moreover, depending on the context, the word "if" as used herein may be interpreted as "at the time of ...", "when ...", or "in response to a determination".
The embodiment of the present application provides a category determining method, which may be applied to category determining devices, such as a Personal Computer (PC), a notebook Computer, a mobile terminal, a terminal device, a smart phone, a server, a data platform, an analysis platform, and the like, without limitation to the types of the devices.
Referring to fig. 1, a schematic flow chart of the above category determining method is shown, where the method may include:
Step 101, dividing the name information of the data into at least one character group by using character attributes.
Dividing the name information of the data into at least one character group by using character attributes may include, but is not limited to, the following modes:
Mode 1: dividing adjacent characters with the same character attribute in the name information into the same character group; dividing characters with different character attributes in the name information into different character groups; and dividing non-adjacent characters with the same character attribute in the name information into different character groups.
Mode 2: performing hash coding on the name information by using the character attributes to obtain at least one coded value; and determining, from the name information, the character group corresponding to each coded value, thereby obtaining at least one character group.
In Mode 2, performing hash coding on the name information by using the character attributes to obtain at least one coded value may include, but is not limited to: determining the coded value corresponding to the character attribute of each character in the name information; and merging adjacent identical coded values to obtain at least one coded value.
Further, determining the coded value corresponding to the character attribute of each character in the name information may include, but is not limited to: querying a mapping table according to the character attribute of the character to obtain the coded value corresponding to that character attribute, where the mapping table records the correspondence between character attributes and coded values.
Step 102, selecting a target character group from the at least one character group by using the character attributes of the character groups.
Selecting a target character group from the at least one character group by using the character attributes of the character groups may include, but is not limited to: selecting a character group with a specific character attribute from the at least one character group, and determining the selected character group as a target character group. There may be one target character group or at least two.
Step 103, determining the category of the data according to the target character group. Determining the category of the data according to the target character group may include, but is not limited to: if there is one target character group, determining that target character group as the category of the data; or, if there are at least two target character groups, selecting one target character group from the at least two target character groups and determining the selected target character group as the category of the data.
In one example, selecting one target character group from at least two target character groups may include, but is not limited to: determining the score value of each target character group according to the characteristic information of that target character group, and then selecting the target character group with the highest score value from the at least two target character groups. The characteristic information of a target character group may include, but is not limited to, one or any combination of the following: the total number of occurrences of the target character group; the total number of enterprises using the target character group; and the number of directories corresponding to the target character group.
When the score value of a target character group is determined according to its characteristic information, the score value is directly proportional to the total number of occurrences, directly proportional to the total number of enterprises, and inversely proportional to the number of directories (catalogues).
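The source states only these three proportionality relations and gives no closed-form formula. One form that satisfies all three, offered purely as an illustrative assumption (the symbols S, k, N_occ, N_ent, and N_dir do not appear in the original), is:

```latex
S(g) \;=\; k \cdot \frac{N_{\mathrm{occ}}(g)\, N_{\mathrm{ent}}(g)}{N_{\mathrm{dir}}(g)}, \qquad k > 0
```

Here g is a target character group, N_occ its total number of occurrences, N_ent the total number of enterprises using it, and N_dir the number of directories corresponding to it. Other monotone forms, such as weighted sums, would satisfy the stated relations equally well.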
The execution sequence above is only an example given for convenience of description; in practical applications the order of the steps may be changed, and the execution sequence is not limited. Moreover, in other embodiments, the steps of the respective methods do not have to be performed in the order shown and described herein, and the methods may include more or fewer steps than those described herein. A single step described in this specification may be broken down into multiple steps in other embodiments, and multiple steps described in this specification may be combined into a single step in other embodiments.
In one example, after determining the category of the data according to the target character set, the target character set may be recorded in a segmentation dictionary, and the segmentation dictionary is used for performing segmentation processing. That is, the word segmentation device can perform word segmentation processing by using the target character set in the word segmentation dictionary, and the word segmentation processing process is not limited.
The character attributes may include, but are not limited to, one or any combination of the following: literal characters (such as Chinese characters), alphabetic characters, numeric characters, and symbol characters. Of course, there may be other character attributes, without limitation.
Based on the technical scheme, in the embodiment of the application, the name information can be divided into at least one character group by using the character attribute, a target character group is selected from the at least one character group by using the character attribute of the character group, and then the category of the data is determined according to the target character group. The method can effectively determine the category of the data, improve the accuracy of category determination, and can reduce the number of categories by unifying multiple descriptions of the same type of commodities to the same category as much as possible. The method does not need to use a word segmentation device to perform word segmentation processing on the name information, and can identify the category even if the word segmentation dictionary of the word segmentation device does not have the name information.
Based on the same application concept as the method, another category determination method is also provided in the embodiment of the present application, and the method may include: dividing the commodity name in the data into at least one character group by utilizing the character attribute; selecting a target character set from the at least one character set by using the character attributes of the character sets; determining the category corresponding to the commodity name according to the target character group; and collecting the data into a category corresponding to the commodity name. The implementation process of the above steps may refer to the flow shown in fig. 1, and is not described herein again.
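The final step of this variant, collecting data into the category of its commodity name, can be sketched as a simple grouping operation. This is a minimal sketch under stated assumptions: the record schema (a dict with a `commodity_name` key) and the function name `collect_by_category` are hypothetical, and the category function is passed in as a parameter rather than implemented here.

```python
from collections import defaultdict


def collect_by_category(records, categorize):
    """Group records under the category derived from each commodity name.

    `records` is an iterable of dicts with a "commodity_name" key
    (hypothetical schema); `categorize` is any callable that maps a
    commodity name to a category string.
    """
    buckets = defaultdict(list)
    for record in records:
        buckets[categorize(record["commodity_name"])].append(record)
    return dict(buckets)
```

Passing the category function as an argument keeps the collection step independent of how the category itself is determined.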
The above technical solution is described in detail below with reference to a specific application scenario. The scenario takes invoice data as an example; in practical applications other types of data may of course be used, without limitation. Because the commodity name is a key element of invoice data, the category of the invoice data can be determined from the commodity name. The name information may therefore be the commodity name of the invoice data; it may also be other information in the invoice data, without limitation. The commodity name is used as the example below.
Because the commodity name in invoice data may be entered manually by the user, there is no standard naming convention. For example, in invoice data of the "cement" category, the commodity name may be: cement 208, cement 322, cement (quick set), cement PC 325-paper bag 50 KG-red water river brand, and so on. To classify invoice data with these commodity names into the "cement" category, the system architecture shown in fig. 2 may be employed.
Referring to fig. 2, after the invoice data is obtained, the commodity name of the invoice data may be input to the data preprocessing module, which preprocesses the commodity name. For example, special characters in the commodity name may be removed, useless words (such as kg, kilogram, and the like) may be removed, full-width characters may be converted to half-width characters, and uppercase letters may be converted to lowercase letters; the preprocessing process is not limited. The preprocessed commodity name is then output to the mode hash encoding module.
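The preprocessing described above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the function name `preprocess` and the default lists of useless words and special characters are assumptions (the source names "kg" and "kilogram" but does not enumerate the special characters), and Unicode NFKC normalization is swapped in as one convenient way to fold full-width characters to half-width.

```python
import unicodedata

# Hypothetical defaults: the source names "kg" and "kilogram" as useless
# words but does not list which special characters are removed.
USELESS_WORDS = ("kilogram", "kg")
SPECIAL_CHARS = frozenset("-_ \t")


def preprocess(name, useless_words=USELESS_WORDS, special_chars=SPECIAL_CHARS):
    """Normalize a raw commodity name before segmentation."""
    # NFKC folds full-width letters, digits, and punctuation to half-width
    # (it also applies other compatibility mappings).
    text = unicodedata.normalize("NFKC", name)
    # Convert uppercase letters to lowercase.
    text = text.lower()
    # Remove useless words (naive substring removal).
    for word in useless_words:
        text = text.replace(word, "")
    # Remove special characters.
    return "".join(ch for ch in text if ch not in special_chars)
```

Parentheses are deliberately left out of the default special characters, because the worked examples in this description keep "(" and ")" as character groups of their own.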
After obtaining the commodity name, the mode hash encoding module may divide it into at least one character group by using character attributes, where each character group may include one or more characters. The character attributes may include, but are not limited to: literal characters (e.g., Chinese characters such as "water", "mud", and "paper" are all literal characters), alphabetic characters (e.g., English letters, French letters, and the like are all alphabetic characters), numeric characters (e.g., 1, 2, 3, and the like are all numeric characters), and symbol characters (e.g., "|", "?", and the like are all symbol characters).
The mode hash encoding module may divide the commodity name into at least one character group, each including one or more characters, by using character attributes in ways that include, but are not limited to, the following:
Mode 1: the mode hash encoding module may divide adjacent characters with the same character attribute in the commodity name into the same character group, divide characters with different character attributes in the commodity name into different character groups, and divide non-adjacent characters with the same character attribute in the commodity name into different character groups.
For example, for "cement 208" (in the original Chinese, "cement" is the two-character word 水泥, whose characters literally mean "water" and "mud"), "cement" may be divided into character group 1, because "water" and "mud" have the same character attribute and are adjacent characters. Similarly, "208" may be divided into character group 2. Because the character attribute of "mud" differs from that of "2", "mud" and "2" are divided into different character groups even though they are adjacent. As another example, for "cement (quick set)", "cement" may be divided into character group 1, "(" into character group 2, "quick set" into character group 3, and ")" into character group 4. As a further example, for "cement pc325 paper bag 50 red water river brand", "cement" may be divided into character group 1, "pc" into character group 2, "325" into character group 3, "paper bag" into character group 4, "50" into character group 5, and "red water river brand" into character group 6.
Although "cement" and "paper bag" have the same character attribute, they are not adjacent, so they fall into different character groups. Similarly, "cement" and "red water river brand" are in different character groups, and "paper bag" and "red water river brand" are in different character groups.
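Mode 1 amounts to splitting the name into maximal runs of adjacent characters that share an attribute. The sketch below illustrates this with `itertools.groupby`; the classification rules in `char_attribute` (a single CJK code-point range for literal characters, `str.isalpha` and `str.isdigit` for the rest) are simplifying assumptions, and the function names are hypothetical.

```python
from itertools import groupby


def char_attribute(ch):
    """Return the attribute of one character (hypothetical rules)."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs -> literal
        return "literal"
    if ch.isalpha():                # remaining letters, e.g. English, French
        return "alphabetic"
    if ch.isdigit():
        return "numeric"
    return "symbol"


def segment(name):
    """Split a name into (attribute, group) pairs.

    Adjacent characters with the same attribute share a group; a change
    of attribute, or a gap between same-attribute runs, starts a new one.
    """
    return [
        (attr, "".join(chars))
        for attr, chars in groupby(name, key=char_attribute)
    ]
```

Because `groupby` only merges consecutive items, non-adjacent runs with the same attribute automatically end up in separate groups, which matches the third rule of Mode 1.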
Mode 2: the mode hash encoding module performs hash coding on the commodity name by using the character attributes to obtain at least one coded value, and determines from the commodity name the character group corresponding to each coded value, thereby obtaining at least one character group. When hash coding is performed on the commodity name by using the character attributes, the coded value corresponding to the character attribute of each character in the commodity name may be determined, and adjacent identical coded values may then be merged.
Further, when determining the code value corresponding to the character attribute of each character in the commodity name, the mode hash coding module may query the mapping table according to the character attribute of each character to obtain the code value corresponding to the character attribute; the mapping table is used for recording the corresponding relation between the character attribute and the coding value.
For example, without limitation, the mode hash encoding module may establish the mapping table shown in table 1.
TABLE 1
Character attribute      Coded value
Literal characters       A
Alphabetic characters    B
Numeric characters       C
Symbol characters        D
For "cement 208" (two Chinese characters, "water" and "mud", followed by three digits), the character attributes of "water" and "mud" are both literal characters, so their coded values are both A; the character attributes of "2", "0", and "8" are all numeric characters, so their coded values are all C. The per-character coded value of "cement 208" is therefore AACCC. Adjacent identical coded values are then merged, that is, the two adjacent A's are merged and the three adjacent C's are merged, giving the final coded value AC. Because the two adjacent A's were merged, the first character "water" and the second character "mud" are divided into character group 1, corresponding to coded value A, so character group 1 is "cement". Similarly, "2", "0", and "8" are divided into character group 2, corresponding to coded value C, so character group 2 is "208".
Similarly, for "cement pc325 paper bag 50 red water river brand", the coded value before merging may be AABBCCCAACCAAAA and the coded value after merging may be ABCACA. The first character "water" and the second character "mud" are divided into character group 1, corresponding to the first coded value A, so character group 1 is "cement". Likewise, character group 2 is "pc", character group 3 is "325", character group 4 is "paper bag", character group 5 is "50", and character group 6 is "red water river brand".
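The Mode 2 walk-through above can be sketched as a table lookup followed by run merging. This is a minimal sketch under assumptions: `CODE_TABLE` mirrors Table 1, while `char_attribute`, `hash_encode`, and the classification rules are hypothetical, and the Chinese input strings used with it are a reconstruction of the translated examples.

```python
import re

# Mapping table in the spirit of Table 1: character attribute -> coded value.
CODE_TABLE = {"literal": "A", "alphabetic": "B", "numeric": "C", "symbol": "D"}


def char_attribute(ch):
    """Return the attribute of one character (hypothetical rules)."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs -> literal
        return "literal"
    if ch.isalpha():
        return "alphabetic"
    if ch.isdigit():
        return "numeric"
    return "symbol"


def hash_encode(name):
    """Return (per-character code, merged code, character groups)."""
    # One coded value per character, looked up in the mapping table.
    raw = "".join(CODE_TABLE[char_attribute(ch)] for ch in name)
    # Each regex match is one run of adjacent identical coded values.
    runs = list(re.finditer(r"(.)\1*", raw))
    merged = "".join(m.group(1) for m in runs)
    # A run's span in `raw` is also the span of its group in `name`.
    groups = [name[m.start():m.end()] for m in runs]
    return raw, merged, groups
```

For the reconstructed input "水泥208" this yields the per-character code AACCC, the merged code AC, and the groups "水泥" and "208", in line with the walk-through.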
After the mode hash encoding module obtains the character groups, it may output them to the data layering module. For example, for "cement 208", character group 1 and character group 2 may be output to the data layering module; for "cement pc325 paper bag 50 red water river brand", character groups 1 to 6 may be output to the data layering module.
After obtaining the character groups, the data layering module may select a target character group from all the character groups by using the character attribute of each character group. For example, the data layering module may select a character group with a specific character attribute from all the character groups and determine the selected character group as a target character group. In one example, the specific character attribute may be literal characters; that is, the data layering module may select the character groups made up of literal characters and determine them as target character groups.
For example, for "cement 208", the data layering module obtains character group 1 and character group 2. Because the character attribute of character group 1 is literal characters and the character attribute of character group 2 is numeric characters, the data layering module determines character group 1 as the target character group; character group 2 is not a target character group.
As another example, for "cement pc325 paper bag 50 red water river brand", the data layering module obtains character groups 1 to 6. Because the character attribute of character groups 1, 4, and 6 is literal characters, that of character group 2 is alphabetic characters, and that of character groups 3 and 5 is numeric characters, the data layering module determines character groups 1, 4, and 6 as the target character groups.
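The selection performed by the data layering module reduces to a filter on the group attribute. The sketch below assumes the groups arrive as (attribute, text) pairs; that representation, the attribute labels, and the function name are hypothetical.

```python
def select_target_groups(groups, target_attribute="literal"):
    """Keep the character groups whose attribute is the target attribute.

    `groups` is a list of (attribute, text) pairs (hypothetical
    representation); the default target is literal characters, as in
    the example above.
    """
    return [text for attr, text in groups if attr == target_attribute]
```

Leaving `target_attribute` as a parameter reflects the source's statement that literal characters are only one example of the specific character attribute.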
After the data layering module obtains the target character groups, it may output them to the propagation aggregation module. For example, for "cement 208", character group 1 is output to the propagation aggregation module; for "cement pc325 paper bag 50 red water river brand", character groups 1, 4, and 6 are output to the propagation aggregation module.
After the propagation aggregation module obtains the target character groups, it may determine the category of the invoice data according to them. For example, if there is one target character group, the propagation aggregation module determines that target character group as the category of the invoice data; if there are at least two target character groups, the propagation aggregation module selects one of them and determines the selected target character group as the category of the invoice data.
For example, for "cement 208", the propagation aggregation module receives only one target character group, namely character group 1, so character group 1 may be determined as the category of the invoice data. Because character group 1 is "cement", the category of the invoice data is "cement", and the invoice data is classified correctly.
As another example, for "cement pc325 paper bag 50 red water river brand", the propagation aggregation module receives several target character groups, namely character groups 1, 4, and 6, and may select one of them. If character group 1 is selected, it is determined as the category of the invoice data; because character group 1 is "cement", the category of the invoice data is "cement". If character group 4 is selected, it is determined as the category of the invoice data; because character group 4 is "paper bag", the category is "paper bag"; and so on.
The propagation aggregation module may select one target character group from at least two target character groups in ways that include, but are not limited to: randomly selecting one target character group from the at least two target character groups; or determining the score value of each target character group according to its characteristic information and selecting the target character group with the highest score value. The characteristic information of a target character group may include: the total number of occurrences of the target character group; the total number of enterprises using the target character group; and the number of directories corresponding to the target character group. When the score value is determined from the characteristic information, the score value is directly proportional to the total number of occurrences, directly proportional to the total number of enterprises, and inversely proportional to the number of directories.
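The two selection strategies can be sketched as follows. This is a sketch under stated assumptions: the source gives only proportionality relations, so the product-over-quotient `score` is one illustrative choice, and `GroupStats`, `choose_category`, and the guard against a zero directory count are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class GroupStats:
    occurrences: int  # total number of occurrences of the group
    enterprises: int  # total number of enterprises using the group
    directories: int  # number of directories corresponding to the group


def score(stats):
    """Illustrative score: rises with occurrences and enterprises,
    falls with directories. The exact formula is an assumption."""
    return stats.occurrences * stats.enterprises / max(stats.directories, 1)


def choose_category(target_groups, stats_by_group=None):
    """Pick the category from the target character groups."""
    if not target_groups:
        return None
    if len(target_groups) == 1:
        # A single target character group is the category itself.
        return target_groups[0]
    if stats_by_group is None:
        # Without characteristic information, fall back to random choice.
        return random.choice(target_groups)
    # Otherwise select the group with the highest score value.
    return max(target_groups, key=lambda g: score(stats_by_group[g]))
```

A caller would supply `stats_by_group` from statistics gathered over the whole data set, so that a group appearing across many invoices and enterprises outranks one that appears rarely.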
The processing procedure of the propagation aggregation module is described below with reference to the graph structure shown in fig. 3. The graph structure may be a computing framework oriented to graph computation; for example, it may be built on an ODPS (Open Data Processing Service) platform or on another platform, without limitation.
For example, referring to fig. 3, for "cement 208", the propagation aggregation module receives only one target character group, namely the character group "cement", and may therefore directly determine that the category of the invoice data is "cement".
For another example, as shown in fig. 3, for the "cement pc325 paper bag 50 red river card", the propagation aggregation module receives three target character sets, such as a character set "cement", a character set "paper bag", and a character set "red river card", so that a character set may be selected from the character set "cement", the character set "paper bag", and the character set "red river card", by using the total occurrence number 1, the total enterprise number 1, and the directory number 1 corresponding to the character set "cement", the total occurrence number 2, the total enterprise number 2, and the directory number 2 corresponding to the character set "paper bag", the total occurrence number 3, the total enterprise number 3, and the directory number 3 corresponding to the character set "red river card", and the selection of the character set "cement" in fig. 3 is taken as an example, so that the type of the invoice data may be "cement".
For another example, referring to fig. 3, for "cement (quick drying)", the propagation aggregation module receives two target character sets: the character set "cement" and the character set "quick drying". One character set is selected from the two by using total occurrence number 1, total enterprise number 1, and directory number 1 corresponding to the character set "cement", and total occurrence number 4, total enterprise number 4, and directory number 4 corresponding to the character set "quick drying". Taking the selection of the character set "cement" as an example, the category of the invoice data is determined to be "cement".
Assume that the characteristic information is the total occurrence number. When one character set is selected from the character set "cement", the character set "paper bag", and the character set "red river card", and total occurrence number 1 is greater than total occurrence number 2, which in turn is greater than total occurrence number 3, the score value of the character set "cement" is higher than that of the character set "paper bag", and the score value of the character set "paper bag" is higher than that of the character set "red river card". The propagation aggregation module can therefore select the character set "cement", which has the highest score value; that is, the category may be "cement".
Assume that the characteristic information is the total number of enterprises. When one character set is selected from the character set "cement", the character set "paper bag", and the character set "red river card", and total enterprise number 1 is greater than total enterprise number 2, which in turn is greater than total enterprise number 3, the score value of the character set "cement" is higher than that of the character set "paper bag", and the score value of the character set "paper bag" is higher than that of the character set "red river card". The propagation aggregation module can therefore select the character set "cement", which has the highest score value; that is, the category may be "cement".
Assume that the characteristic information is the number of directories. When one character set is selected from the character set "cement", the character set "paper bag", and the character set "red river card", and directory number 1 is smaller than directory number 2, which in turn is smaller than directory number 3, the score value of the character set "cement" is higher than that of the character set "paper bag", and the score value of the character set "paper bag" is higher than that of the character set "red river card". The propagation aggregation module can therefore select the character set "cement", which has the highest score value; that is, the category may be "cement".
Assume that the characteristic information is at least two of the total occurrence number, the total number of enterprises, and the number of directories. Taking all three as an example, corresponding weights may also be configured for the total occurrence number, the total number of enterprises, and the number of directories, and the score value of each character set is determined from the three values. The determination method is not limited, as long as the score value is directly proportional to the total occurrence number, directly proportional to the total number of enterprises, and inversely proportional to the number of directories.
In the above embodiment, the reason why the score value is proportional to the total occurrence number is as follows: the larger the total occurrence number corresponding to a character group, the more often the character group is used, and the higher the probability that the character group is a category. Therefore, the larger the total occurrence number, the larger the score value; that is, the score value is proportional to the total occurrence number.
For example, suppose the total occurrence number corresponding to the character set "cement" is 10000 and the total occurrence number corresponding to the character set "red river card" is 20. This means that, across the commodity names of all invoice data, the character set "cement" occurs 10000 times in total and the character set "red river card" occurs 20 times in total. Since the total occurrence number of the character set "cement" is far greater than that of the character set "red river card", the character set "cement" has universality and should be determined as the category.
In the above embodiment, the reason why the score value is proportional to the total number of enterprises is as follows: the larger the total number of enterprises corresponding to a character group, the higher the probability that the character group is a category. Therefore, the larger the total number of enterprises, the larger the score value; that is, the score value is proportional to the total number of enterprises.
For example, suppose the total number of enterprises corresponding to the character set "cement" is 300 and the total number of enterprises corresponding to the character set "red river card" is 1. This means that, across the commodity names of all invoice data, 300 enterprises in total use the character set "cement" and 1 enterprise uses the character set "red river card". Since the total number of enterprises for the character set "cement" is far greater than that for the character set "red river card", the character set "cement" has universality and should be determined as the category.
In the above embodiment, the reason why the score value is inversely proportional to the number of directories is as follows: the larger the number of directories corresponding to a character group, the more likely the character group is to appear in every commodity directory, and the lower the probability that the character group is a category.
For example, suppose the number of directories corresponding to the character set "cement" is 1 and the number of directories corresponding to the character set "paper bag" is 20. This means that the character set "cement" appears in only 1 commodity directory, whereas the character set "paper bag" appears in 20 commodity directories. Because the character set "paper bag" appears in 20 commodity directories, it cannot effectively distinguish different categories: if the character set "paper bag" were used as a category, that category would belong to several commodity directories at the same time and could not effectively distinguish them.
Therefore, the larger the number of directories corresponding to a character group, the lower the probability that the character group is a category; conversely, the smaller the number of directories, the higher that probability. In summary, since the number of directories of the character set "cement" is far smaller than that of the character set "paper bag", the character set "cement" reflects the uniqueness of its commodity directory and should be determined as the category.
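The scoring rule described above can be illustrated with a minimal sketch. The description leaves the determination method open, so the product-and-quotient formula, the default weights, and the names `GroupStats`, `score`, and `pick_category` are all assumptions made only for illustration. The occurrence and enterprise counts for "cement" and "red river card" and the directory counts for "cement" and "paper bag" are taken from the examples above; the remaining counts are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupStats:
    """Characteristic information of one target character group."""
    occurrences: int  # total occurrence number across all commodity names
    enterprises: int  # total number of enterprises using the group
    directories: int  # number of commodity directories containing the group (>= 1)


def score(stats: GroupStats, w_occ: float = 1.0, w_ent: float = 1.0,
          w_dir: float = 1.0) -> float:
    """Assumed formula: grows with occurrences and enterprises, shrinks with directories."""
    return (stats.occurrences ** w_occ) * (stats.enterprises ** w_ent) \
        / (stats.directories ** w_dir)


def pick_category(candidates: dict[str, GroupStats]) -> str:
    """Return the only candidate, or the candidate with the highest score value."""
    if len(candidates) == 1:
        return next(iter(candidates))
    return max(candidates, key=lambda name: score(candidates[name]))


candidates = {
    # Values from the examples in the description.
    "cement": GroupStats(occurrences=10000, enterprises=300, directories=1),
    # directories=20 is from the description; the other two values are hypothetical.
    "paper bag": GroupStats(occurrences=5000, enterprises=200, directories=20),
    # occurrences and enterprises are from the description; directories is hypothetical.
    "red river card": GroupStats(occurrences=20, enterprises=1, directories=1),
}

print(pick_category(candidates))
```

Any other monotonic combination, for example a weighted sum of logarithms, would satisfy the same proportionality constraints; the weights let one characteristic dominate when it is considered more reliable.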
Through the above processing, the propagation aggregation module can obtain and output the category "cement" of the invoice data. Further, an upper-layer application can obtain all invoice data of the category "cement" based on the output of the propagation aggregation module, and can use that invoice data for processing such as macroscopic analysis, abnormal sales detection, and detection of tax evasion and tax leakage; the processing is not limited.
In one example, the target character set (e.g., character set "cement", character set "paper bag", etc.) may also be recorded in the word segmentation dictionary, so that the word segmentation device performs word segmentation processing using the character set in the word segmentation dictionary.
Based on the technical scheme, in the embodiment of the application, the commodity name can be divided into at least one character group by using the character attribute, the target character group is selected from the at least one character group by using the character attribute of the character group, and then the category of the data is determined according to the target character group. The method can effectively determine the category of the data, improve the accuracy of category determination, and can reduce the number of categories by unifying multiple descriptions of the same type of commodities to the same category as much as possible. The method does not need to use a word segmentation device to perform word segmentation processing on the commodity name, and the category can be identified even if the commodity name does not exist in a word segmentation dictionary of the word segmentation device.
In the above manner, the commodity name is divided into at least one character group by using the character attributes instead of by using a word segmenter. The reasons may include, but are not limited to:
1. When the commodity name is divided into character groups by using a word segmenter, the division result depends on the accuracy of the word segmentation dictionary. If the commodity name does not exist in the word segmentation dictionary, or the entries in the dictionary are inaccurate, the commodity name cannot be divided accurately and may even be divided wrongly. In this embodiment, the commodity name is divided into at least one character group by using the character attributes without a word segmenter, so the division does not depend on the accuracy of a word segmentation dictionary. The commodity name can be divided accurately even if it does not exist in the dictionary or the entries in the dictionary are inaccurate, which improves the user experience.
2. When a commodity name is segmented by using a word segmenter, a complete commodity name may be split into several character groups, which leads to a wrong category determination result. For example, when the commodity name is "lemipramine hydrochloride tablet", the word segmenter may split it into "hydrochloric acid", "lemipramine", and "pamine tablet", and the commodity name "lemipramine hydrochloride tablet" may then be recognized as the category "hydrochloric acid". This is obviously a wrong recognition result; the category should be "lemipramine hydrochloride tablet".
In this embodiment, the commodity name is divided into at least one character group by using the character attributes, and the division result of the commodity name "lemipramine hydrochloride tablet" is "lemipramine hydrochloride tablet", so the finally determined category is "lemipramine hydrochloride tablet", which is an accurate recognition result.
Based on the same application concept as the method described above, an embodiment of the present application further provides a category determining apparatus, as shown in fig. 4, which is a structural diagram of the category determining apparatus, and the apparatus includes:
a dividing module 401, configured to divide the name information of the data into at least one character group by using the character attribute; a selecting module 402, configured to select a target character set from the at least one character set by using a character attribute of the character set; a determining module 403, configured to determine a category of the data according to the target character set.
The segmentation module 401 is specifically configured to perform at least one of the following operations when segmenting the name information of the data into at least one character group by using the character attribute: dividing adjacent characters with the same character attribute in the name information into the same character group; dividing characters with different character attributes in the name information into different character groups; and dividing non-adjacent characters with the same character attribute in the name information into different character groups.
The segmenting module 401, when segmenting the name information of the data into at least one character group by using the character attribute, is specifically configured to: carrying out Hash coding on the name information by utilizing character attributes to obtain at least one coded value; determining a character group corresponding to the code value from the name information;
the segmentation module 401 performs hash coding on the name information by using a character attribute, and is specifically configured to: determining a coded value corresponding to the character attribute of each character in the name information; and combining adjacent coded values with the same coded value to obtain the at least one coded value.
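The segmentation steps listed above (map each character to a coded value by its attribute, then merge adjacent identical coded values) can be sketched as follows. The mapping table `ATTRIBUTE_CODES`, its three attribute names and code numbers, and the helper names are illustrative assumptions; the attribute test relies on Python's built-in `str.isdigit` and `str.isalpha`, and treating the letter attribute as the "specific character attribute" used to pick target groups is likewise an assumption.

```python
from itertools import groupby

# Hypothetical mapping table: character attribute -> coded value.
ATTRIBUTE_CODES = {"letter": 1, "digit": 2, "symbol": 3}


def attribute_of(ch: str) -> str:
    """Classify one character; anything that is neither digit nor letter is a symbol."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "symbol"


def segment(name: str) -> list[tuple[int, str]]:
    """Code every character, then merge runs of adjacent identical coded values."""
    coded = [(ATTRIBUTE_CODES[attribute_of(ch)], ch) for ch in name]
    return [(code, "".join(ch for _, ch in run))
            for code, run in groupby(coded, key=lambda pair: pair[0])]


def target_groups(name: str) -> list[str]:
    """Assumed selection rule: keep only the groups with the letter attribute."""
    return [text for code, text in segment(name) if code == ATTRIBUTE_CODES["letter"]]


print(segment("cement 208"))
print(target_groups("cement(quick)"))
```

Because only adjacent characters are merged, the two letter runs in "cement(quick)" stay in separate character groups even though they share the same attribute.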
The determining module 403 is specifically configured to, when determining the category of the data according to the target character set:
if the number of target character groups is one, determining the target character group as the category of the data; or,
if the number of target character groups is at least two, selecting one target character group from the at least two target character groups, and determining the selected target character group as the category of the data.
The determining module 403 is specifically configured to, when selecting one target character group from the at least two target character groups: determining the score value of the target character set according to the characteristic information of the target character set; selecting a target character group with the highest score value from at least two target character groups;
wherein the characteristic information of the target character group comprises one or any combination of the following: the total occurrence number corresponding to the target character set; the total number of enterprises using the target character set; the number of directories corresponding to the target character set; the score value is proportional to the total occurrence number, the score value is proportional to the total number of enterprises, and the score value is inversely proportional to the number of directories.
Based on the same application concept as the method, an embodiment of the present application further provides a category determining apparatus, including: a processor and a machine-readable storage medium having stored thereon a plurality of computer instructions, the processor when executing the computer instructions performs: dividing name information of the data into at least one character group by using character attributes; selecting a target character set from the at least one character set by using the character attributes of the character sets; and determining the category of the data according to the target character set.
Based on the same application concept as the method, the embodiment of the present application further provides a machine-readable storage medium, where a plurality of computer instructions are stored on the machine-readable storage medium, and when executed, the computer instructions perform the following processes: dividing name information of the data into at least one character group by using character attributes; selecting a target character set from the at least one character set by using the character attributes of the character sets; and determining the category of the data according to the target character set.
Based on the same application concept as the method, an embodiment of the present application further provides a category determining apparatus, including: the segmentation module is used for segmenting the commodity name in the data into at least one character group by utilizing the character attribute; the selection module is used for selecting a target character set from the at least one character set by utilizing the character attributes of the character sets; the determining module is used for determining the category corresponding to the commodity name according to the target character group; and the collection module is used for collecting the data to the category corresponding to the commodity name.
The functions of the modules can be seen in fig. 4, and are not described herein again.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, with each unit described separately. Of course, when the present application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Furthermore, these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (20)

1. A method for class determination, the method comprising:
dividing name information of the data into at least one character group by using character attributes;
selecting a target character set from the at least one character set by using the character attributes of the character sets;
and determining the category of the data according to the target character set.
2. The method of claim 1, wherein the segmenting the name information of the data into at least one character group by using the character attribute comprises at least one of the following ways:
dividing adjacent characters with the same character attribute in the name information into the same character group;
dividing characters with different character attributes in the name information into different character groups;
and dividing non-adjacent characters with the same character attribute in the name information into different character groups.
3. The method of claim 1,
the dividing of the name information of the data into at least one character group by using the character attribute includes:
carrying out Hash coding on the name information by utilizing character attributes to obtain at least one coded value;
and determining a character group corresponding to the code value from the name information.
4. The method of claim 3, wherein the hash-coding the name information using the character attribute to obtain at least one coded value comprises:
determining a coded value corresponding to the character attribute of each character in the name information;
and combining adjacent coded values with the same coded value to obtain the at least one coded value.
5. The method of claim 4,
the determining an encoding value corresponding to a character attribute of each character in the name information includes:
inquiring a mapping table according to character attributes of characters to obtain a coded value corresponding to the character attributes;
the mapping table is used for recording the corresponding relation between the character attribute and the coding value.
6. The method of claim 1,
selecting a target character set from the at least one character set by using the character attributes of the character sets, comprising:
selecting a character group with a specific character attribute from the at least one character group;
and determining the selected character set as the target character set.
7. The method of claim 1,
the determining the category of the data according to the target character set comprises:
if the number of target character groups is one, determining the target character group as the category of the data; or,
if the number of target character groups is at least two, selecting one target character group from the at least two target character groups, and determining the selected target character group as the category of the data.
8. The method of claim 7,
the selecting a target character group from at least two target character groups includes:
determining the score value of the target character set according to the characteristic information of the target character set;
the target character group with the highest score value is selected from the at least two target character groups.
9. The method according to claim 8, wherein the characteristic information of the target character set comprises one or any combination of the following: the total occurrence number corresponding to the target character set; the total number of enterprises using the target character set; and the number of directories corresponding to the target character set.
10. The method as claimed in claim 9, wherein, when the score value of the target character set is determined according to the characteristic information of the target character set, the score value is proportional to the total occurrence number, the score value is proportional to the total number of enterprises, and the score value is inversely proportional to the number of directories.
11. The method of claim 1,
after determining the category of the data according to the target character set, the method further comprises:
and recording the target character set into a word segmentation dictionary, wherein the word segmentation dictionary is used for performing word segmentation processing.
12. The method of claim 1, wherein the character attribute comprises one or any combination of the following: alphabetic characters, numeric characters, symbolic type characters.
13. A method for class determination, the method comprising:
dividing the commodity name in the data into at least one character group by utilizing the character attribute;
selecting a target character set from the at least one character set by using the character attributes of the character sets;
determining the category corresponding to the commodity name according to the target character set;
and collecting the data into a category corresponding to the commodity name.
14. A class determination apparatus, the apparatus comprising:
the segmentation module is used for segmenting the name information of the data into at least one character group by utilizing the character attributes;
the selection module is used for selecting a target character set from the at least one character set by utilizing the character attributes of the character sets;
and the determining module is used for determining the category of the data according to the target character group.
15. The apparatus of claim 14, wherein the segmentation module is specifically configured to perform at least one of the following when segmenting the name information of the data into at least one character group using the character attributes:
dividing adjacent characters with the same character attribute in the name information into the same character group;
dividing characters with different character attributes in the name information into different character groups;
and dividing non-adjacent characters with the same character attribute in the name information into different character groups.
16. The apparatus of claim 14,
the segmentation module is specifically configured to, when segmenting the name information of the data into at least one character group by using the character attribute: carrying out Hash coding on the name information by utilizing character attributes to obtain at least one coded value; determining a character group corresponding to the code value from the name information;
the segmentation module performs hash coding on the name information by using a character attribute, and is specifically configured to: determining a coded value corresponding to the character attribute of each character in the name information; and combining adjacent coded values with the same coded value to obtain the at least one coded value.
17. The apparatus of claim 14,
the determining module is specifically configured to, when determining the category of the data according to the target character group:
if the number of target character groups is one, determining the target character group as the category of the data; or,
if the number of target character groups is at least two, selecting one target character group from the at least two target character groups, and determining the selected target character group as the category of the data.
18. The apparatus of claim 17, wherein the determining module is specifically configured to, when selecting one target character group from the at least two target character groups: determine the score value of the target character set according to the characteristic information of the target character set; and select the target character group with the highest score value from the at least two target character groups; wherein the characteristic information of the target character group comprises one or any combination of the following: the total occurrence number corresponding to the target character set; the total number of enterprises using the target character set; the number of directories corresponding to the target character set; the score value is proportional to the total occurrence number, the score value is proportional to the total number of enterprises, and the score value is inversely proportional to the number of directories.
19. A class determination apparatus, the apparatus comprising:
the segmentation module is used for segmenting the commodity name in the data into at least one character group by utilizing the character attribute;
the selection module is used for selecting a target character set from the at least one character set by utilizing the character attributes of the character sets;
the determining module is used for determining the category corresponding to the commodity name according to the target character group;
and the collection module is used for collecting the data to the category corresponding to the commodity name.
20. A category determination device, comprising:
a processor and a machine-readable storage medium having stored thereon a plurality of computer instructions, the processor when executing the computer instructions performs: dividing name information of the data into at least one character group by using character attributes; selecting a target character set from the at least one character set by using the character attributes of the character sets; and determining the category of the data according to the target character set.
CN201810344756.6A 2018-04-17 2018-04-17 Class determination method, device and equipment Active CN110390332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810344756.6A CN110390332B (en) 2018-04-17 2018-04-17 Class determination method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810344756.6A CN110390332B (en) 2018-04-17 2018-04-17 Class determination method, device and equipment

Publications (2)

Publication Number Publication Date
CN110390332A true CN110390332A (en) 2019-10-29
CN110390332B CN110390332B (en) 2023-12-15

Family

ID=68283162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810344756.6A Active CN110390332B (en) 2018-04-17 2018-04-17 Class determination method, device and equipment

Country Status (1)

Country Link
CN (1) CN110390332B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103535033A (en) * 2011-05-10 2014-01-22 高通股份有限公司 Offset type and coefficients signaling method for sample adaptive offset
US20140143254A1 (en) * 2012-11-16 2014-05-22 Ritendra Datta Category and Attribute Specifications for Product Search Queries
CN104331173A (en) * 2012-04-16 2015-02-04 宗刚 Computer processing method and system for character information
CN105184052A (en) * 2015-08-13 2015-12-23 易保互联医疗信息科技(北京)有限公司 Automatic coding method and system for medicine information

Also Published As

Publication number Publication date
CN110390332B (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
JP2019502979A (en) Automatic interpretation of structured multi-field file layouts
AU2014201516A1 (en) Resolving similar entities from a transaction database
CN111291571A (en) Semantic error correction method, electronic device and storage medium
CN111241389A (en) Sensitive word filtering method and device based on matrix, electronic equipment and storage medium
CN113449187A (en) Product recommendation method, device and equipment based on double portraits and storage medium
CN111191652A (en) Certificate image identification method and device, electronic equipment and storage medium
CN106462633B (en) Efficiently storing related sparse data in a search index
US20230205755A1 (en) Methods and systems for improved search for data loss prevention
CN114153962A (en) Data matching method and device and electronic equipment
CN111597309A (en) Similar enterprise recommendation method and device, electronic equipment and medium
CN106933878B (en) Information processing method and device
CN110825817B (en) Enterprise suspected association judgment method and system
CN113591881B (en) Intention recognition method and device based on model fusion, electronic equipment and medium
CN118396786A (en) Contract document auditing method and device, electronic equipment and computer readable storage medium
CN109241360B (en) Matching method and device of combined character strings and electronic equipment
CN113327132A (en) Multimedia recommendation method, device, equipment and storage medium
CN110427496B (en) Knowledge graph expansion method and device for text processing
CN110309313B (en) Method and device for generating event transfer graph
CN110765100A (en) Label generation method and device, computer readable storage medium and server
CN116228374A (en) Logistics industry market single data early warning method, device, equipment and storage medium
CN110019829B (en) Data attribute determination method and device
CN110390332B (en) Class determination method, device and equipment
CN114706899A (en) Express delivery data sensitivity calculation method and device, storage medium and equipment
CN111191049B (en) Information pushing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant