CN109784776B

CN109784776B - Commodity quality risk judgment method based on label identification

Info

Publication number: CN109784776B
Application number: CN201910145851.8A
Authority: CN
Inventors: 张华桁; 何军良; 宋博; 严伟; 杨锐
Original assignee: Shanghai Pinboluo Intelligent Technology Co ltd
Current assignee: Shanghai Pinboluo Intelligent Technology Co ltd
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2020-11-06
Anticipated expiration: 2039-02-27
Also published as: CN109784776A

Abstract

The invention provides a commodity quality risk judgment method based on label identification, which comprises the following steps: reading the label information of a commodity, converting the commodity name into a numerically coded set, and representing the commodity formula as a 0-1 vector; calculating the comprehensive distance between every two commodities, and clustering with a K-Medoids algorithm to obtain commodity classes; determining an illegal additive set according to the occurrence frequency of ingredients in the commodities, calculating for each commodity the average mutual information among its ingredients, selecting a set number of ingredients with the smallest average mutual information, and checking them against the illegal additive set. The method comprises commodity clustering, commodity classification and illegal additive identification. It does not require commodity information to be registered in advance or a large number of rules to be coded, can automatically identify and memorize illegal additives in a commodity formula and thus has a learning capability, is compatible with multiple languages, and achieves rapid identification of illegal additives in commodities and automatic screening of commodity quality risks.

Description

Commodity quality risk judgment method based on label identification

Technical Field

The invention relates to the technical field of commodity quality analysis, in particular to a commodity quality risk judgment method based on label identification.

Background

Commodity quality bears on the safety of people's lives and property and is an important field of government supervision. The rapid development of cross-border import e-commerce allows a large quantity of goods produced and sold abroad to enter China quickly. Cross-border e-commerce import trade is characterized by small lot sizes and many batches, which puts great pressure on customs and other supervision departments. Because the text on foreign product labels is in a foreign language, it is difficult to determine the product type according to domestic standards, and it is therefore very difficult to judge the risk of the product.

The main basis for judging commodity quality risk is the national standards of the People's Republic of China. At present, 303 national food safety standards have been established and published in China, such as standards for dairy safety, mycotoxins, pesticide and veterinary drug residues, food additives and nutrition enhancers, and general rules for prepackaged food labels and nutrition labels, covering more than 6000 food safety indexes. Strictly speaking, judging whether a commodity has a quality risk requires sending a sample to a laboratory for inspection and then comparing the result with the national standard. In practical regulatory procedures, however, only a small portion of commodities is sampled for inspection because of practical constraints. In most cases, because no pre-judgment of the commodities is available, random inspection has low representativeness and cannot accurately reflect commodity quality. To improve the risk detection rate and to submit suspected risk commodities for inspection as far as possible, the commodity quality risk needs to be judged before sampling is carried out, and targeted sampling is then performed according to the judgment result.

Existing commodity quality risk judgment methods mainly rely either on scanning the commodity bar code, obtaining the commodity information registered on a server and identifying risk items according to a predefined list of illegal words, or on coding the national standards into rules and performing risk reasoning with a rule engine. The bar code method requires the commodity information to be registered in a database in advance, which is not suitable for imported commodities, especially commodities imported for the first time. The method of coding all national standards into rules consumes considerable labor and time, and the names, formulas and other information of foreign-language commodities must be translated into Chinese so that they correspond to the national standards of China.

At present, certain achievements have been made in building quality and safety risk supervision systems for import and export commodities at home and abroad, but some problems remain. In some countries or regions, such as the European Union, the risk assessment of the commodity quality and safety risk supervision system is not data-driven but issues early warnings based on single events. Domestically, some units or departments use an 'intelligent import and export industrial product risk management informatization platform'; although it applies data analysis methods to manage commodity quality risk, its evaluation model is relatively fixed and can neither be extended nor learn automatically. In the prior art, although the 'risk assessment grading rule in the technology' considers the inherent risk of the product and special risks arising from the place of production, the user and the like, the rules are established manually and the degree of automation is low.

Disclosure of Invention

The invention aims to provide a commodity quality risk judgment method based on label identification, which can realize identification and judgment of illegal additives in a commodity formula based on the information of a commodity label without coding a large number of rules, has high accuracy, is compatible with multiple languages, and can automatically screen commodity quality risks.

The invention adopts the following technical scheme:

the commodity quality risk judgment method based on label identification comprises the following steps:

firstly, inputting a commodity label;

secondly, judging whether the input is a batch of commodity labels or a single commodity label;

1. when the label is a batch commodity label, the method comprises the following steps:

1.1 scanning the batch of commodity labels, extracting the commodity names by adopting an N-gram language model, converting each commodity name into a set consisting of N consecutive characters, and calculating the Jaccard distance between every two commodity names;

meanwhile, representing each commodity formula as a 0-1 vector, and calculating the Cosine distance between every two commodity formulas;

1.2 calculating the comprehensive distance between two commodities;

clustering the commodities by adopting a K-Medoids clustering algorithm based on the comprehensive distance to obtain commodity classes;

1.3 establishing an illegal additive set for each commodity class, the illegal additive set comprising a confirmed illegal additive set and a suspected illegal additive set, wherein the confirmed illegal additive set comprises the illegal additives that have been determined in historical risk information and belong to the commodity class, and the suspected illegal additive set comprises newly appearing illegal additives; the newly appearing illegal additives are the ingredients whose occurrence frequency in the ingredient information list of the commodities is between 0 and p × n, where n is the number of commodities contained in the commodity class, p is a manually set constant, and 0 < p < 1;

1.4 calculating the average mutual information of each ingredient and other ingredients of each commodity inside each commodity class;

1.5 selecting, for each commodity, the q ingredients with the smallest average mutual information, and detecting one by one whether the q ingredients are contained in the suspected illegal additive set; q is a manually set positive integer and is less than the total number of ingredients of the commodity;

if yes, marking the contained ingredients as illegal additives of the commodity, and entering the next step; if not, entering the next step;

1.6 detecting whether any of the ingredients other than the q ingredients is contained in the confirmed illegal additive set;

1.7 judging whether the commodity has marked illegal additives: if yes, reporting the commodity and the corresponding illegal additives, and entering a third step; if not, judging the product to be qualified, and entering a third step;

2. when the label is a single commodity label, the method comprises the following steps:

2.1 scanning a single commodity label, extracting the commodity name by adopting an N-gram language model, converting the commodity name into a set consisting of continuous N characters, and calculating the Jaccard distance between the commodity and at most x commodities in each commodity class in historical data; wherein x is a manually set positive integer and is less than or equal to the number of all commodities in the commodity class;

meanwhile, carrying out 0-1 vectorization representation on the commodity formula, and calculating the Cosine distance between the commodity and at most x commodities in each commodity class in the historical data;

2.2 calculating the comprehensive distance between the commodity and at most x commodities in each commodity class in the historical data;

2.3 selecting the y commodities with the smallest comprehensive distance to the commodity, and counting the commodity classes to which they belong, wherein the commodity class containing the largest number of the y commodities is taken as the commodity class of the commodity; wherein y is a manually set positive integer and is smaller than the total number of commodities participating in calculating the comprehensive distance;

2.4 calculating the average mutual information of each ingredient and other ingredients of the commodity inside the commodity class;

2.5 selecting the q ingredients with the smallest average mutual information, and detecting whether the q ingredients are contained in the suspected illegal additive set; q is a manually set positive integer and is less than the total number of ingredients of the commodity;

2.6 detecting whether any of the ingredients other than the q ingredients is contained in the confirmed illegal additive set;

2.7 judging whether the commodity has marked illegal additives: if yes, reporting the commodity and the corresponding illegal additives, and entering a third step; if not, judging the product to be qualified, and entering a third step;

and thirdly, auditing, modifying, confirming and storing data by the user.

Preferably, the commodity class is the commodity class obtained in step 1.2 or a commodity class existing in a database.

In a preferred embodiment, the N-gram language models include a 1-gram language model and a 2-gram language model, wherein: the 1-gram language model extracts the names of commodities whose label information is written in an Indo-European language; the 2-gram language model extracts the names of commodities whose label information is written in Chinese.
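As a non-limiting illustration, the name coding and the Jaccard distance of steps 1.1 and 2.1 can be sketched as follows. The function names `ngram_set` and `jaccard_distance` are hypothetical, and the sketch assumes that the N-gram set is built from consecutive characters of the name and that the Jaccard distance takes its standard form 1 - |A ∩ B| / |A ∪ B|.

```python
from __future__ import annotations


def ngram_set(name: str, n: int) -> set[str]:
    """Return the set of substrings made of n consecutive characters of a name."""
    text = name.strip()
    if len(text) < n:
        # Assumption: a name shorter than n is kept as a single element.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Standard Jaccard distance: 1 - |A intersect B| / |A union B|."""
    union = a | b
    if not union:
        # Assumption: two empty names are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two Chinese commodity names coded with the 2-gram model.
set_a = ngram_set("牛奶饼干", 2)    # {"牛奶", "奶饼", "饼干"}
set_b = ngram_set("牛奶巧克力", 2)  # {"牛奶", "奶巧", "巧克", "克力"}
distance = jaccard_distance(set_a, set_b)
```

Here the two names share one of six distinct 2-grams, so the distance is 1 - 1/6.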

In a preferred embodiment, in steps 1.1 and 2.1, the method for representing the commodity formula as a 0-1 vector comprises the following steps:

establishing an ingredient information list to whose tail ingredients can be appended; if an ingredient of the commodity is already in the current ingredient information list, the ingredient is replaced by its position in the list; if the ingredient is not in the list, it is appended at the tail of the list and then replaced by its position in the list;

the formula of each commodity is represented by an array whose length equals the length of the ingredient information list; when an ingredient is contained in the formula, the array takes 1 at the position of that ingredient in the ingredient information list, and otherwise takes 0 at that position.
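The 0-1 vectorization described above can be sketched, purely for illustration, as follows. The class name `IngredientList` and its methods are hypothetical; the sketch assumes 0-based positions and assumes that all formulas are coded before any vector is built, so that every vector has the final list length.

```python
from __future__ import annotations


class IngredientList:
    """Append-only ingredient information list mapping ingredient -> position."""

    def __init__(self) -> None:
        self.index: dict[str, int] = {}

    def encode(self, formula: list[str]) -> list[int]:
        """Replace each ingredient by its position, appending unseen ones at the tail."""
        codes = []
        for ingredient in formula:
            if ingredient not in self.index:
                self.index[ingredient] = len(self.index)
            codes.append(self.index[ingredient])
        return codes

    def to_vector(self, codes: list[int]) -> list[int]:
        """Build the 0-1 array whose length is the current list length."""
        vector = [0] * len(self.index)
        for code in codes:
            vector[code] = 1
        return vector


ingredients = IngredientList()
codes_a = ingredients.encode(["sugar", "milk"])   # [0, 1]
codes_b = ingredients.encode(["milk", "cocoa"])   # [1, 2]
vector_a = ingredients.to_vector(codes_a)         # [1, 1, 0]
vector_b = ingredients.to_vector(codes_b)         # [0, 1, 1]
```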

In a preferred embodiment, in step 1, the method for calculating the comprehensive distance between two commodities comprises the following steps:

the name of a commodity A is extracted through the N-gram language model to obtain a set A, and the name of a commodity B is extracted to obtain a set B; a is the 0-1 vector representation of the formula of the commodity A, and b is the 0-1 vector representation of the formula of the commodity B; the comprehensive distance between the commodity A and the commodity B is denoted by D(A, B) and is expressed in terms of J(A, B) and C(A, B), wherein a_i represents the i-th component of the vector a, b_i represents the i-th component of the vector b, J(A, B) represents the Jaccard distance, and C(A, B) represents the Cosine distance.
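For illustration only, the Cosine distance and one possible comprehensive distance can be sketched as follows. The standard definitions J(A, B) = 1 - |A ∩ B| / |A ∪ B| and C(A, B) = 1 - (Σ a_i b_i) / (sqrt(Σ a_i²) · sqrt(Σ b_i²)) are assumed. How D(A, B) combines the two distances is an assumption of this sketch: a weighted sum with a hypothetical `weight` parameter is used here, and the combination actually used by the method may differ.

```python
from __future__ import annotations

import math


def cosine_distance(a: list[int], b: list[int]) -> float:
    """Standard cosine distance between two equal-length 0-1 vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty formula is maximally distant from any other.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def comprehensive_distance(jaccard: float, cosine: float, weight: float = 0.5) -> float:
    """ASSUMED form of D(A, B): a weighted sum of J(A, B) and C(A, B)."""
    return weight * jaccard + (1.0 - weight) * cosine


c_ab = cosine_distance([1, 1, 0], [0, 1, 1])   # one shared ingredient out of two each
d_ab = comprehensive_distance(0.8, c_ab)       # 0.8 stands in for a Jaccard distance
```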

In a preferred embodiment, the average mutual information is represented by MI, and the calculating method includes:

(1) calculating the mutual information I(r, r') of every two ingredients r and r', wherein p(r, r') is the proportion of all commodities that contain both the ingredient r and the ingredient r', p(r) is the proportion of all commodities that contain the ingredient r, and p(r') is the proportion of all commodities that contain the ingredient r';

(2) calculating the average mutual information MI of each ingredient and the other ingredients, wherein the average mutual information MI(r) of the ingredient r and the other n ingredients r'_1, r'_2, r'_3, ..., r'_i, ..., r'_n is:

$$MI(r) = \frac{1}{n} \sum_{i=1}^{n} I(r, r'_i)$$
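A minimal sketch of this calculation is given below for illustration. It assumes that the pairwise term takes the common pointwise form I(r, r') = log(p(r, r') / (p(r) · p(r'))), which is one formula consistent with the three proportions defined above but is not confirmed by this text; the function names `pairwise_mi` and `average_mi` are hypothetical.

```python
from __future__ import annotations

import math


def pairwise_mi(r: str, r2: str, formulas: list[set[str]]) -> float:
    """ASSUMED pointwise form: log(p(r, r') / (p(r) * p(r')))."""
    total = len(formulas)
    p_r = sum(r in f for f in formulas) / total
    p_r2 = sum(r2 in f for f in formulas) / total
    p_joint = sum(r in f and r2 in f for f in formulas) / total
    if p_joint == 0:
        # Assumption: ingredients that never co-occur get the lowest possible score.
        return float("-inf")
    return math.log(p_joint / (p_r * p_r2))


def average_mi(r: str, formula: set[str], formulas: list[set[str]]) -> float:
    """MI(r): mean of I(r, r') over the other ingredients r' of one commodity."""
    others = [x for x in formula if x != r]
    if not others:
        return 0.0
    return sum(pairwise_mi(r, o, formulas) for o in others) / len(others)


# Four commodities of one class; the third one is examined.
class_formulas = [
    {"sugar", "milk"},
    {"sugar", "milk"},
    {"sugar", "milk", "dye"},
    {"sugar", "cocoa"},
]
mi_values = {r: average_mi(r, class_formulas[2], class_formulas)
             for r in class_formulas[2]}
```

The dictionary `mi_values` holds one MI(r) per ingredient of the examined commodity and is the input to the selection of the q smallest values.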

In a preferred embodiment, the K-Medoids clustering algorithm computes the clustering result by partitioning, and the method includes:

(1) randomly selecting K commodities as clustering centers, wherein K can be specified by a user;

(2) calculating the distance from each remaining commodity to each of the K selected cluster centers, and assigning each commodity to its nearest cluster center to form commodity clusters;

(3) calculating the sum of the distances between each commodity and other commodities in each commodity cluster, and selecting the commodity with the minimum distance sum as a new clustering center;

(4) repeating steps (2) and (3) until the center of each commodity cluster no longer changes, the commodity clusters finally formed being the clustering result.
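Steps (1) to (4) above can be sketched as follows, as a non-limiting illustration operating on a precomputed matrix of comprehensive distances. The function names, the `seed` and `max_iter` parameters, and the tie-breaking behaviour (the first candidate wins) are assumptions of this sketch.

```python
from __future__ import annotations

import random


def assign(dist: list[list[float]], medoids: list[int]) -> dict[int, list[int]]:
    """Step (2): attach every commodity to its nearest cluster center."""
    clusters: dict[int, list[int]] = {m: [] for m in medoids}
    for i in range(len(dist)):
        # A center always stays in its own cluster, so no cluster is empty.
        nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist: list[list[float]], k: int, seed: int = 0,
              max_iter: int = 100) -> dict[int, list[int]]:
    """Return a mapping from cluster center index to member indices."""
    rng = random.Random(seed)
    medoids = sorted(rng.sample(range(len(dist)), k))  # step (1): random centers
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # Step (3): the member with the smallest distance sum becomes the center.
        new_medoids = sorted(
            min(members, key=lambda c, ms=members: sum(dist[c][j] for j in ms))
            for members in clusters.values()
        )
        if new_medoids == medoids:  # step (4): centers no longer change
            break
        medoids = new_medoids
    return assign(dist, medoids)


# Toy data: six commodities placed on a line, forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
matrix = [[abs(p - q) for q in points] for p in points]
result = k_medoids(matrix, k=2)
```

Because the centers are always actual commodities, the algorithm only needs pairwise distances and never has to average names or formulas.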

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

(1) after image recognition of the commodity label, the method performs commodity clustering, commodity classification and mutual-information-based illegal additive identification; it does not require commodity information to be registered in advance or a large number of rules to be coded, and illegal additives in a commodity formula can be identified and added automatically in an unsupervised setting, so that the method has a learning capability and effectively improves operating efficiency and accuracy;

(2) the name of the commodity is extracted by adopting an N-gram language model, so that the commodity is compatible with multiple languages;

(3) rapid identification of illegal additives in one or more commodities and automatic screening of commodity quality risks are realized.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:

Fig. 1 is a flowchart of the commodity quality risk judgment method based on label identification according to the present invention.

Detailed Description

The present invention provides a commodity quality risk judgment method based on label identification, which is described in further detail below with reference to the accompanying drawings and examples in order to make the objects, technical solutions and effects of the present invention clearer. It should be understood that the embodiments described herein are only for the purpose of illustrating the present invention and are not to be construed as limiting the present invention.

The embodiment provides a commodity quality risk judgment method based on label identification, as shown in Fig. 1, including the following steps:

firstly, inputting a commodity label.

Secondly, judging whether the input is a batch of commodity labels or a single commodity label.

(1) When the input is a batch of commodity labels, the method comprises the following steps:

1.1 scanning the batch of commodity labels, extracting the commodity names by adopting an N-gram language model, converting each commodity name into a set consisting of N consecutive characters, and calculating the Jaccard distance between every two commodity names. The N-gram language models include a 1-gram language model and a 2-gram language model, wherein: the 1-gram language model extracts the names of commodities whose label information is written in an Indo-European language; the 2-gram language model extracts the names of commodities whose label information is written in Chinese.

Meanwhile, each commodity formula is represented as a 0-1 vector, and the Cosine distance between every two commodity formulas is calculated. The method for representing the commodity formula as a 0-1 vector comprises the following steps:

establishing an ingredient information list to whose tail ingredients can be appended; if an ingredient of the commodity is already in the current ingredient information list, the ingredient is replaced by its position in the list; if the ingredient is not in the list, it is appended at the tail of the list and then replaced by its position in the list;

representing the formula of each commodity by an array whose length equals the length of the ingredient information list; when an ingredient is contained in the formula, the array takes 1 at the position of that ingredient in the ingredient information list, and otherwise takes 0 at that position.

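1.2 calculating the comprehensive distance between every two commodities; and clustering the commodities by adopting a K-Medoids clustering algorithm based on the comprehensive distance to obtain commodity classes.

The comprehensive distance between two commodities is calculated as follows: the name of a commodity A is extracted through the N-gram language model to obtain a set A, and the name of a commodity B is extracted to obtain a set B; a is the 0-1 vector representation of the formula of the commodity A, and b is the 0-1 vector representation of the formula of the commodity B; the comprehensive distance D(A, B) between the commodity A and the commodity B is expressed in terms of the Jaccard distance J(A, B) and the Cosine distance C(A, B).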

1.3 establishing an illegal additive set for each commodity class, the illegal additive set comprising a confirmed illegal additive set and a suspected illegal additive set, wherein the confirmed illegal additive set comprises the illegal additives that have been determined in the historical risk information and belong to the commodity class, and the suspected illegal additive set comprises newly appearing illegal additives; the newly appearing illegal additives are the ingredients whose occurrence frequency in the ingredient information list of the commodities is between 0 and p × n, where n is the number of commodities contained in the commodity class, p is a manually set constant, and 0 < p < 1.
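Step 1.3 can be illustrated by the following minimal sketch. The function name `suspected_additives` is hypothetical, the occurrence frequency is assumed to be the number of commodities of the class that contain the ingredient, and the treatment of the interval boundaries (greater than 0, at most p × n) is an assumption, since the boundaries are not specified.

```python
from __future__ import annotations

from collections import Counter


def suspected_additives(formulas: list[set[str]], p: float) -> set[str]:
    """Ingredients whose occurrence count in the class lies in (0, p * n]."""
    n = len(formulas)
    counts = Counter(ingredient for formula in formulas for ingredient in formula)
    return {ingredient for ingredient, count in counts.items() if 0 < count <= p * n}


class_formulas = [
    {"sugar", "milk"},
    {"sugar", "milk"},
    {"sugar", "milk", "dye"},
    {"sugar", "cocoa"},
]
rare = suspected_additives(class_formulas, p=0.25)   # threshold p * n = 1
```

With a smaller p the threshold falls below one occurrence and the suspected set becomes empty, so p controls how rare an ingredient must be to become a candidate.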

1.4 calculating the average mutual information of each ingredient and other ingredients of each commodity inside each commodity class; the average mutual information is represented by MI, and the calculation method comprises the following steps:

calculating the mutual information I(r, r') of every two ingredients r and r', wherein p(r, r') is the proportion of all commodities that contain both the ingredient r and the ingredient r', p(r) is the proportion of all commodities that contain the ingredient r, and p(r') is the proportion of all commodities that contain the ingredient r';

calculating the average mutual information MI of each ingredient and the other ingredients, wherein the average mutual information MI(r) of the ingredient r and the other n ingredients r'_1, r'_2, r'_3, ..., r'_i, ..., r'_n is:

$$MI(r) = \frac{1}{n} \sum_{i=1}^{n} I(r, r'_i)$$

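1.5 selecting, for each commodity, the q ingredients with the smallest average mutual information, and detecting one by one whether the q ingredients are contained in the suspected illegal additive set; q is a manually set positive integer and is less than the total number of ingredients of the commodity.

If yes, marking the contained ingredients as illegal additives of the commodity, and entering the next step; if not, entering the next step.

1.6 detecting whether any of the ingredients other than the q ingredients is contained in the confirmed illegal additive set.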

1.7 judging whether the commodity has marked illegal additives: if yes, reporting the commodity and the corresponding illegal additives, and entering a third step; if not, the judgment is qualified, and the third step is entered.
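The judgment flow of steps 1.5 to 1.7 can be sketched as follows, for illustration only. The function name `flag_illegal_additives`, the alphabetical tie-break among equal MI values, and the assumption that ingredients found in the confirmed set are also marked are choices of this sketch.

```python
from __future__ import annotations


def flag_illegal_additives(mi_by_ingredient: dict[str, float],
                           suspected: set[str],
                           confirmed: set[str],
                           q: int) -> list[str]:
    """Return the marked ingredients of one commodity (empty list: qualified)."""
    # Rank by average mutual information; ties broken by name (assumption).
    ranked = sorted(mi_by_ingredient, key=lambda r: (mi_by_ingredient[r], r))
    lowest_q, rest = ranked[:q], ranked[q:]
    # Step 1.5: the q lowest-MI ingredients are checked against the suspected set.
    flagged = [r for r in lowest_q if r in suspected]
    # Step 1.6: the remaining ingredients are checked against the confirmed set.
    flagged += [r for r in rest if r in confirmed]
    return flagged


mi = {"sugar": 0.9, "milk": 0.8, "dye": 0.1, "borax": 0.5}
report = flag_illegal_additives(mi, suspected={"dye"}, confirmed={"borax"}, q=1)
clean = flag_illegal_additives({"sugar": 0.9, "milk": 0.8},
                               suspected={"dye"}, confirmed={"borax"}, q=1)
```

A non-empty result corresponds to reporting the commodity together with its marked additives in step 1.7; an empty result corresponds to judging the commodity qualified.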

(2) When the input is a single commodity label, the method comprises the following steps:

2.1 scanning the single commodity label, extracting the commodity name by adopting an N-gram language model, converting the commodity name into a set consisting of N consecutive characters, and calculating the Jaccard distance between the commodity and at most x commodities in each commodity class in the historical data; wherein x is a manually set positive integer and is less than or equal to the number of all commodities in the commodity class.

Meanwhile, the commodity formula is subjected to 0-1 vectorization expression, and the Cosine distance between the commodity and at most x commodities in each commodity class in the historical data is calculated.

The commodity classes in the historical data are either the commodity classes obtained by running step 1.2 on batch-input commodities that carry no category label, or the classes of commodities that are already classified in the database.

2.2 calculating the comprehensive distance between the commodity and at most x commodities in each commodity class in the historical data; the method for calculating the comprehensive distance is the same as the method for calculating the comprehensive distance in the step 1.2.

2.3 selecting the y commodities with the smallest comprehensive distance to the commodity, and counting the commodity classes to which they belong, wherein the commodity class containing the largest number of the y commodities is taken as the commodity class of the commodity; wherein y is a manually set positive integer and is less than the total number of commodities participating in calculating the comprehensive distance.
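Step 2.3 amounts to a nearest-neighbour vote, which can be sketched as follows for illustration. The function name `classify_by_vote` and the representation of the candidates as (comprehensive distance, commodity class) pairs are assumptions of this sketch; when two classes receive the same number of votes, the class encountered first among the nearest commodities is returned.

```python
from __future__ import annotations

from collections import Counter


def classify_by_vote(candidates: list[tuple[float, str]], y: int) -> str:
    """Pick the class that occurs most often among the y nearest commodities."""
    nearest = sorted(candidates, key=lambda pair: pair[0])[:y]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Comprehensive distances from the new commodity to sampled historical commodities.
candidates = [
    (0.10, "biscuit"),
    (0.90, "candy"),
    (0.20, "biscuit"),
    (0.30, "candy"),
    (0.95, "candy"),
]
predicted = classify_by_vote(candidates, y=3)
```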

2.4 calculating the average mutual information of each ingredient and other ingredients of the commodity inside the commodity class; the method for calculating the average mutual information is the same as the method for calculating the average mutual information in step 1.4.

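2.5 selecting the q ingredients with the smallest average mutual information, and detecting whether the q ingredients are contained in the suspected illegal additive set; q is a manually set positive integer and is less than the total number of ingredients of the commodity.

2.6 detecting whether any of the ingredients other than the q ingredients is contained in the confirmed illegal additive set.

2.7 judging whether the commodity has marked illegal additives: if yes, reporting the commodity and the corresponding illegal additives, and entering the third step; if not, judging the commodity to be qualified, and entering the third step.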

And thirdly, auditing, modifying, confirming and storing data by the user.

The embodiments of the present invention have been described in detail above, but they are merely examples, and the present invention is not limited to the embodiments described above. Any equivalent modification or substitution that would occur to those skilled in the art also falls within the scope of the present invention. Accordingly, equivalent changes and modifications made without departing from the spirit and scope of the present invention are covered by the present invention.

Claims

1. The commodity quality risk judgment method based on label identification is characterized by comprising the following steps of:

firstly, inputting a commodity label;

(1) when the label is a batch commodity label, the method comprises the following steps:

1.2 calculating the comprehensive distance between two commodities;

(2) when the label is a single commodity label, the method comprises the following steps:

and thirdly, auditing, modifying, confirming and storing data by the user.

2. The commodity quality risk judgment method according to claim 1, wherein the N-gram language model includes a 1-gram language model and a 2-gram language model, wherein: the 1-gram language model extracts the names of commodities whose label information is written in an Indo-European language; the 2-gram language model extracts the names of commodities whose label information is written in Chinese.

3. The commodity quality risk judgment method according to claim 1, wherein in steps 1.1 and 2.1, the method for representing the commodity formula as a 0-1 vector comprises the following steps:

4. The commodity quality risk judgment method according to claim 1, wherein the method for calculating the comprehensive distance between two commodities in step 1 includes the steps of:

5. The commodity quality risk judgment method according to claim 1, wherein the average mutual information is represented by MI, and the calculation method includes:

(1) calculating the mutual information I(r, r') of every two ingredients;

(2) calculating the average mutual information MI of each ingredient and the other ingredients, wherein the average mutual information MI(r) of the ingredient r and the other n ingredients r'_1, r'_2, r'_3, ..., r'_i, ..., r'_n is:

$$MI(r) = \frac{1}{n} \sum_{i=1}^{n} I(r, r'_i)$$