CN113468885A

CN113468885A - Chinese trademark similarity calculation method

Info

Publication number: CN113468885A
Application number: CN202110790797.XA
Authority: CN
Inventors: 李学俊; 高仕锦; 廖伟伟
Original assignee: Green Industry Innovation Research Institute of Anhui University
Current assignee: Green Industry Innovation Research Institute of Anhui University
Priority date: 2021-07-13
Filing date: 2021-07-13
Publication date: 2021-10-01

Abstract

The invention discloses a Chinese trademark similarity calculation method, which belongs to the technical field of trademark retrieval and comprises the following steps: acquiring the names of a first trademark and a second trademark, and segmenting the first trademark name and the second trademark name into words to obtain a word list for each; for each pairwise combination of words from the two word lists, calculating the comprehensive similarity of the Cilin (synonym forest) word similarity, the HowNet word similarity and the Word2Vec word similarity as the word similarity, and taking the maximum value for each word as its local similarity, to obtain two local similarity lists; and calculating the meaning similarity of the two trademarks from the two local similarity lists, and finally judging whether the two trademarks are similar. The method addresses the problems of inaccurate synonym recognition, a limited knowledge base and inaccurate similarity results that arise when ontology-knowledge methods use a semantic dictionary to calculate trademark meaning similarity.

Description

Chinese trademark similarity calculation method

Technical Field

The invention relates to the technical field of trademark retrieval, in particular to a Chinese trademark similarity calculation method.

Background

A review of the relevant literature shows that current methods for judging whether text trademarks are similar in meaning have several shortcomings. For example, conventional ontology-based methods for calculating the semantic similarity of short texts commonly rely on the synonym forest (Cilin) or on HowNet. These methods essentially calculate the similarity of trademark meanings from a semantic dictionary, so they depend heavily on the dictionary's ontology, which cannot be updated in a timely manner. As a result, the synonyms in the semantic dictionary do not match how the trademark field judges similar trademarks, the knowledge base is limited, and the similarity results are inaccurate.

Disclosure of Invention

The invention aims to overcome the defects in the background art and solve the problem of inaccurate synonym identification.

In order to achieve the purpose, a Chinese trademark similarity calculation method is adopted, and comprises the following steps:

acquiring the names of a first trademark and a second trademark to be compared, and performing word segmentation on the first trademark name and the second trademark name to obtain a first word list W_a and a second word list W_b, respectively;

calculating the Cilin word similarity, the HowNet word similarity and the Word2Vec word similarity for each pairwise combination of the words in the two word lists;

for each word in the first word list, calculating, by a dynamic weighting strategy, the comprehensive similarity of the Cilin word similarity, the HowNet word similarity and the Word2Vec word similarity between that word and each word in the second word list, to form the word similarity group of that word; taking the maximum value in the word similarity group as the local similarity of that word; and forming a first local similarity list from the local similarities of all words in the first word list;

for each word in the second word list, calculating, by the dynamic weighting strategy, the comprehensive similarity of the Cilin word similarity, the HowNet word similarity and the Word2Vec word similarity between that word and each word in the first word list, to form the word similarity group of that word; taking the maximum value in the word similarity group as the local similarity of that word; and forming a second local similarity list from the local similarities of all words in the second word list;

and calculating the meaning similarity of the name of the first trademark and the name of the second trademark according to the first local similarity list and the second local similarity list.

Further, the Cilin word similarity Sim_Cilin(W_ar, W_bm) between the r-th word W_ar in the first word list and the m-th word W_bm in the second word list is calculated by:

constructing a trademark Cilin synonym library traCilinFile from the dictionary file cilinFile of the extended edition of the synonym forest;

converting the words W_ar and W_bm into Cilin codes according to the trademark Cilin synonym library traCilinFile, and obtaining all combinations of the Cilin codes of W_ar and W_bm;

judging, over all the Cilin code combinations, whether any combination has equal Cilin codes;

if so, reading the word group in the row of that code and judging from the word marks whether W_ar and W_bm are similar: if they are not similar, Sim_Cilin(W_ar, W_bm) = 0; if they are similar, Sim_Cilin(W_ar, W_bm) = 1;

if not, calculating the similarity of all Cilin code combinations by an information-content-based Cilin similarity calculation method, and taking the maximum value as Sim_Cilin(W_ar, W_bm).

Further, the similarity over the Cilin code combinations is calculated as follows:

$$\mathrm{Sim}_{Cilin}(W_{ar}, W_{bm}) = \max_{1 \le i \le N_1,\; 1 \le j \le N_2} \mathrm{Sim}'_{Cilin}(C_{ai}, C_{bj})$$

where Sim'_Cilin(C_ai, C_bj) denotes the similarity between the i-th Cilin code C_ai of the word W_ar and the j-th Cilin code C_bj of the word W_bm; N₁ and N₂ are both positive integers;

the Cilin similarity Sim'_Cilin(C_a, C_b) of two Cilin codes is calculated as follows:

where LCS(C_a, C_b) denotes the nearest common parent node of the Cilin codes C_a and C_b; IC(C) denotes the information content of the Cilin code C, calculated as follows:

where hypo(C) is the number of hyponym nodes of C in the ontology, C being C_a or C_b; and maxnodes is the total number of nodes in the ontology.

Further, constructing the trademark Cilin synonym library traCilinFile from the dictionary file cilinFile of the extended edition of the synonym forest comprises:

marking with the number 0 those words in the same row of cilinFile (sharing the same Cilin code) that are not similar to one another;

and marking with the same non-zero number those words in the same row of cilinFile (sharing the same Cilin code) that are similar to one another, thereby constructing the trademark Cilin synonym library traCilinFile.

Further, the HowNet word similarity Sim_HowNet(W_ar, W_bm) between the r-th word W_ar in the first word list and the m-th word W_bm in the second word list is calculated by:

constructing a trademark HowNet synonym library traHownetFile from the HowNet dictionary file;

obtaining the senses of the words W_ar and W_bm according to the trademark HowNet synonym library traHownetFile, and obtaining all combinations of the senses of W_ar and W_bm;

judging, over all the sense combinations, whether any combination has equal senses;

if so, reading the word group in the row of that sense and judging from the word marks whether W_ar and W_bm are similar: if they are not similar, Sim_HowNet(W_ar, W_bm) = 0; if they are similar, Sim_HowNet(W_ar, W_bm) = 1;

if not, calculating the similarity of all the sense combinations by a HowNet-based similarity calculation method, and taking the maximum value as Sim_HowNet(W_ar, W_bm).

Further, the similarity over all the sense combinations is calculated as follows:

$$\mathrm{Sim}_{HowNet}(W_{ar}, W_{bm}) = \max_{1 \le i \le N_1,\; 1 \le j \le N_2} \mathrm{Sim}'_{HowNet}(S_{ai}, S_{bj})$$

where Sim'_HowNet(S_ai, S_bj) denotes the similarity between the i-th sense S_ai of W_ar and the j-th sense S_bj of W_bm; N₁ and N₂ are both positive integers; the similarity Sim'_HowNet(S_a, S_b) of two senses is calculated as follows:

$$\mathrm{Sim}'_{HowNet}(S_a, S_b) = \sum_{k=1}^{4} \beta_k \prod_{i=1}^{k} \mathrm{Sim}'_i(S_a, S_b)$$

where Sim'₁(S_a, S_b) denotes the similarity of the first independent sememe descriptions of the two senses S_a and S_b; Sim'₂(S_a, S_b) denotes the similarity of the other independent sememe descriptions; Sim'₃(S_a, S_b) denotes the similarity of the relation sememe descriptions; Sim'₄(S_a, S_b) denotes the similarity of the symbol sememe descriptions; β_k (1 ≤ k ≤ 4) are adjustable parameters satisfying β₁+β₂+β₃+β₄ = 1 and β₁ ≥ β₂ ≥ β₃ ≥ β₄, with the values β₁ = 0.5, β₂ = 0.2, β₃ = 0.17 and β₄ = 0.13.

Further, the first independent sememe similarity Sim'₁(S_a, S_b) of the two senses S_a and S_b is calculated as follows:

$$\mathrm{Sim}'_1(S_a, S_b) = \mathrm{Sim}(p_a, p_b) = \frac{\alpha \cdot \min(dep(p_a), dep(p_b))}{dist(p_a, p_b) + \alpha \cdot \min(dep(p_a), dep(p_b))}$$

where p_a and p_b denote sememes; α is an adjustable parameter, with α = 1.6; dep(p_a) and dep(p_b) denote the depths of p_a and p_b in the sememe hierarchy tree, i.e. the sememe depths, and min(dep(p_a), dep(p_b)) denotes the smaller of the two sememe depths; dist(p_a, p_b) denotes the path length between p_a and p_b in the sememe hierarchy tree, i.e. the sememe distance; when p_a and p_b are not in the same sememe hierarchy tree, the sememe distance is uniformly set to 20.

Further, Sim'₂(S_a, S_b), Sim'₃(S_a, S_b) and Sim'₄(S_a, S_b) are calculated as follows:

if neither of the two sememe descriptions contains any sememe, the similarity is directly 1;

if exactly one of the two sememe descriptions contains no sememe, the similarity takes the default value 0.2;

if both sememe descriptions contain one or more sememes, the similarity of every pairwise combination of their sememes is calculated in the same way as for the first independent sememe description, and the maximum value is taken as the similarity.
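The three cases above can be sketched as a small helper. This is a minimal illustration rather than the patent's implementation: the function name `description_similarity`, the representation of a sememe description as a Python list of sememe names, and the toy pairwise function `toy_sememe_sim` are all assumptions made for the example.

```python
from itertools import product
from typing import Callable, Sequence


def description_similarity(
    sememes_a: Sequence[str],
    sememes_b: Sequence[str],
    sememe_sim: Callable[[str, str], float],
    default: float = 0.2,
) -> float:
    """Similarity of two sememe descriptions, following the three cases."""
    # Case 1: neither description contains a sememe.
    if not sememes_a and not sememes_b:
        return 1.0
    # Case 2: exactly one description is empty, so use the default value.
    if not sememes_a or not sememes_b:
        return default
    # Case 3: best score over every pairwise combination of sememes.
    return max(sememe_sim(pa, pb) for pa, pb in product(sememes_a, sememes_b))


def toy_sememe_sim(pa: str, pb: str) -> float:
    """Hypothetical stand-in for the sememe similarity formula."""
    return 1.0 if pa == pb else 0.3


case1 = description_similarity([], [], toy_sememe_sim)
case2 = description_similarity(["earth"], [], toy_sememe_sim)
case3 = description_similarity(["earth", "building"], ["wealth", "building"], toy_sememe_sim)
```

Passing the pairwise function as a parameter keeps the case logic separate from whichever sememe similarity formula is plugged in.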

Further, constructing the trademark HowNet synonym library traHownetFile from the HowNet dictionary file hownetFile comprises:

marking with the number 0 those words in the same row of hownetFile (sharing the same sense) that are not similar to one another;

and marking with the same non-zero number those words in the same row of hownetFile (sharing the same sense) that are similar to one another, thereby constructing the trademark HowNet synonym library traHownetFile.

Further, the Word2Vec word similarity Sim_Word2Vec(W_ar, W_bm) between the r-th word W_ar in the first word list and the m-th word W_bm in the second word list is calculated by:

training a Word2Vec deep learning model on a Chinese Wikipedia corpus to obtain a word vector file word2vecFile;

converting the words W_ar and W_bm into word vectors according to the word vector file word2vecFile, and calculating Sim_Word2Vec(W_ar, W_bm) by the cosine formula.

Further, Sim_Word2Vec(W_ar, W_bm) is calculated from the word vectors corresponding to the words, as follows:

$$\mathrm{Sim}_{Word2Vec}(W_{ar}, W_{bm}) = \frac{\sum_{n=1}^{N} V_{arn} V_{bmn}}{\sqrt{\sum_{n=1}^{N} V_{arn}^2}\,\sqrt{\sum_{n=1}^{N} V_{bmn}^2}}$$

where V_ar and V_bm denote the word vector of the r-th word W_ar in W_a and the word vector of the m-th word W_bm in W_b, respectively; N denotes the dimension of the word vectors used when training the Word2Vec model; V_arn denotes the n-th component of V_ar, and V_bmn denotes the n-th component of V_bm.
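The cosine formula can be sketched in a few lines of plain Python. The function name `cosine_similarity` and the handling of zero-length vectors (returning 0.0) are assumptions made for this illustration; the text does not say how such vectors are treated.

```python
import math
from typing import Sequence


def cosine_similarity(v_a: Sequence[float], v_b: Sequence[float]) -> float:
    """Cosine of the angle between two word vectors of equal dimension N."""
    if len(v_a) != len(v_b):
        raise ValueError("word vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(v_a, v_b))
    norm_a = math.sqrt(sum(x * x for x in v_a))
    norm_b = math.sqrt(sum(y * y for y in v_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_a * norm_b)


# Parallel vectors score (approximately) 1, orthogonal vectors score 0.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```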

Further, training the Word2Vec deep learning model on the Chinese Wikipedia corpus to obtain the word vector file word2vecFile comprises:

acquiring a Chinese Wikipedia corpus, and cleaning and preprocessing it to obtain the corpus to be trained;

constructing a sentence iterator with the LineSentence() method;

setting the model parameters, inputting them into the Word2Vec model, and starting training;

and saving the trained Word2Vec deep learning model, and saving the word vector file word2vecFile in non-binary form.

Further, the comprehensive similarity Sim_W(W_ar, W_bm) is calculated as follows:

Sim_W(W_ar, W_bm) = λ₁·Sim_Cilin(W_ar, W_bm) + λ₂·Sim_HowNet(W_ar, W_bm) + λ₃·Sim_Word2Vec(W_ar, W_bm)

where λ₁, λ₂ and λ₃ denote the weights of Sim_Cilin(W_ar, W_bm), Sim_HowNet(W_ar, W_bm) and Sim_Word2Vec(W_ar, W_bm), respectively, and satisfy λ₁+λ₂+λ₃ = 1.

Further, λ₁, λ₂ and λ₃ take the following values:

(1) when W_ar ∈ D and W_bm ∈ D, i.e. when W_ar and W_bm both exist in traCilinFile, traHownetFile and word2vecFile simultaneously, λ₁ = λ₂ = λ₃ = 1/3;

(2) when W_ar ∈ E, W_bm ∈ E; W_ar ∈ D, W_bm ∈ E; or W_ar ∈ E, W_bm ∈ D, i.e. when one of the words exists in both traCilinFile and word2vecFile and the other exists either in both traCilinFile and word2vecFile or in all of traCilinFile, traHownetFile and word2vecFile, λ₁ = λ₃ = 1/2, λ₂ = 0;

(3) when W_ar ∈ F, W_bm ∈ F; W_ar ∈ D, W_bm ∈ F; or W_ar ∈ F, W_bm ∈ D, λ₂ = λ₃ = 1/2, λ₁ = 0;

(4) when W_ar ∈ G, W_bm ∈ G; W_ar ∈ D, W_bm ∈ G; or W_ar ∈ G, W_bm ∈ D, λ₁ = λ₂ = 1/2, λ₃ = 0;

(5) when W_ar ∈ A, W_bm ∈ A; W_ar ∈ A, W_bm ∈ D; W_ar ∈ D, W_bm ∈ A; W_ar ∈ A, W_bm ∈ E; W_ar ∈ E, W_bm ∈ A; W_ar ∈ A, W_bm ∈ G; W_ar ∈ G, W_bm ∈ A; W_ar ∈ E, W_bm ∈ G; or W_ar ∈ G, W_bm ∈ E, λ₁ = 1, λ₂ = λ₃ = 0;

(6) when W_ar ∈ B, W_bm ∈ B; W_ar ∈ B, W_bm ∈ D; W_ar ∈ D, W_bm ∈ B; W_ar ∈ B, W_bm ∈ F; W_ar ∈ F, W_bm ∈ B; W_ar ∈ B, W_bm ∈ G; W_ar ∈ G, W_bm ∈ B; W_ar ∈ F, W_bm ∈ G; or W_ar ∈ G, W_bm ∈ F, λ₂ = 1, λ₁ = λ₃ = 0;

(7) when W_ar ∈ C, W_bm ∈ C; W_ar ∈ C, W_bm ∈ D; W_ar ∈ D, W_bm ∈ C; W_ar ∈ C, W_bm ∈ E; W_ar ∈ E, W_bm ∈ C; W_ar ∈ C, W_bm ∈ F; W_ar ∈ F, W_bm ∈ C; W_ar ∈ E, W_bm ∈ F; or W_ar ∈ F, W_bm ∈ E, λ₃ = 1, λ₁ = λ₂ = 0;

(8) when the two words share none of traCilinFile, traHownetFile and word2vecFile, the similarity weights λ₁, λ₂ and λ₃ are meaningless;

where A denotes the set of words included only in traCilinFile; B denotes the set of words included only in traHownetFile; C denotes the set of words that exist only in word2vecFile; D denotes the set of words that exist in traCilinFile, traHownetFile and word2vecFile simultaneously; G denotes the set of words included in both traCilinFile and traHownetFile; E denotes the set of words that exist in both traCilinFile and word2vecFile; and F denotes the set of words that exist in both traHownetFile and word2vecFile.
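Read together, cases (1) to (8) appear to give equal weight to every resource that contains both words and zero weight to the others. The sketch below encodes that reading; it is an interpretation, not the patent's code. The names `dynamic_weights`, `comprehensive_similarity` and the resource labels are hypothetical, each word is represented by the set of resources that contain it, and the membership of 'real estate' in all three resources is assumed for the usage line.

```python
from typing import Optional, Sequence, Set, Tuple

# Order matches (lambda_1, lambda_2, lambda_3).
RESOURCES = ("cilin", "hownet", "word2vec")


def dynamic_weights(res_a: Set[str], res_b: Set[str]) -> Optional[Tuple[float, ...]]:
    """Equal weights over the resources shared by both words, else None."""
    shared = [r for r in RESOURCES if r in res_a and r in res_b]
    if not shared:
        return None  # case (8): the weights are meaningless
    weight = 1.0 / len(shared)
    return tuple(weight if r in shared else 0.0 for r in RESOURCES)


def comprehensive_similarity(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Sim_W as the weighted sum of the Cilin, HowNet and Word2Vec scores."""
    return sum(lam * sim for lam, sim in zip(weights, sims))


# The word sets from the text, written as resource memberships.
D = {"cilin", "hownet", "word2vec"}
E = {"cilin", "word2vec"}
G = {"cilin", "hownet"}
B = {"hownet"}
C = {"word2vec"}

# 'Sumitomo' (only in word2vecFile) against a word assumed to be in all three:
sim_w = comprehensive_similarity((0.0, 0.0, 0.5816), dynamic_weights(C, D))
```

Representing each word by its resource set avoids enumerating the set pairs one by one, at the cost of relying on the equal-weights reading stated above.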

Further, calculating the meaning similarity of the name of the first trademark and the name of the second trademark according to the first local similarity list and the second local similarity list comprises:

calculating the meaning similarity Sim(a, b) of the first trademark name and the second trademark name from the first and second local similarity lists using the following formula:

where sim_ar denotes the local similarity between the r-th word of W_a and W_b; sim_bm denotes the local similarity between the m-th word of W_b and W_a; the first local similarity list is [sim_a1, sim_a2, …, sim_as]; and the second local similarity list is [sim_b1, sim_b2, …, sim_bt].

Compared with the prior art, the invention has the following technical effects. By combining the Trademark Examination and Trial Standards revised by the Trademark Office and the Trademark Review and Adjudication Board in December 2016, the invention constructs synonym libraries for the synonym forest (Cilin) and HowNet that conform to the trademark examination standards, which solves the problem of inaccurate synonym identification. It trains a high-quality word vector model with the Word2Vec deep learning model, which greatly expands the range of words that can be compared and solves the problem of a limited knowledge base. It calculates the final trademark meaning similarity with a dynamic weighting strategy, so that the result is more balanced and reasonable and the accuracy of detecting trademarks that are similar in meaning is improved.

Drawings

The following detailed description of embodiments of the invention refers to the accompanying drawings in which:

FIG. 1 is a flow chart of a method for calculating the similarity of Chinese trademarks;

FIG. 2 is an overall flow chart of a method for calculating the similarity of Chinese trademarks;

FIG. 3 is a text structure diagram of the trademark Cilin synonym library;

FIG. 4 is a text structure diagram of the trademark HowNet synonym library;

FIG. 5 is a diagram of a sememe hierarchy tree;

FIG. 6 is a schematic diagram of a distribution of words.

Detailed Description

To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.

As shown in FIGS. 1 and 2, the present embodiment discloses a Chinese trademark similarity calculation method. Taking the calculation of the meaning similarity between a first trademark a = "Sumitomo property" and a second trademark b = "Sumitomo real estate" as an example, the method includes the following steps S1 to S5:

S1, acquiring the name str_a of the first trademark a to be compared and the name str_b of the second trademark b, and performing word segmentation on the first trademark name and the second trademark name to obtain a first word list W_a and a second word list W_b, respectively;

It should be noted that the ansj word segmentation tool is used to segment str_a and str_b, which gives the word lists W_a = {'Sumitomo', 'property'} and W_b = {'Sumitomo', 'real estate'} for a and b, respectively.

S2, calculating the Cilin word similarity, the HowNet word similarity and the Word2Vec word similarity for each pairwise combination of the words in the two word lists;

specifically: traversing each word in W_a in order and, using the Cilin, HowNet and Word2Vec word similarity calculation methods, calculating the Cilin similarity, the HowNet similarity and the Word2Vec similarity between the currently traversed word and each word in W_b; then traversing each word in W_b in order and calculating, by the same three methods, the Cilin similarity, the HowNet similarity and the Word2Vec similarity between the currently traversed word and each word in W_a.

S3, for each word in the first word list, calculating, by a dynamic weighting strategy, the comprehensive similarity of the Cilin word similarity, the HowNet word similarity and the Word2Vec word similarity between that word and each word in the second word list, to form the word similarity group of that word; taking the maximum value in the word similarity group as the local similarity of that word; and forming a first local similarity list from the local similarities of all words in the first word list;

S4, for each word in the second word list, calculating, by the dynamic weighting strategy, the comprehensive similarity of the Cilin word similarity, the HowNet word similarity and the Word2Vec word similarity between that word and each word in the first word list, to form the word similarity group of that word; taking the maximum value in the word similarity group as the local similarity of that word; and forming a second local similarity list from the local similarities of all words in the second word list;

specifically: the comprehensive similarity of the three similarities, calculated by the dynamic weighting strategy, is taken as the word similarity between the traversed word of W_a and each word of W_b, and the maximum of these word similarities is taken as the local similarity between the traversed word of W_a and W_b; once all the words in W_a have been traversed, a first local similarity list [sim_a1, sim_a2, …, sim_as] of length s is obtained. In the same way, traversing each word in W_b gives the local similarity between each traversed word of W_b and W_a, and finally a second local similarity list [sim_b1, sim_b2, …, sim_bt] of length t.

S5, calculating the meaning similarity of the name of the first trademark and the name of the second trademark according to the first local similarity list and the second local similarity list.

In this embodiment, a dynamic weighting strategy is used to calculate the comprehensive similarity of the Cilin word similarity, the HowNet word similarity and the Word2Vec word similarity as the word similarity, from which the trademark meaning similarity is calculated. This makes the result more balanced and reasonable and improves the accuracy of judging whether trademarks are similar in meaning.
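Steps S3 and S4 can be sketched as row and column maxima over a matrix of comprehensive word similarities. This is a minimal sketch under stated assumptions: the function name `local_similarity_lists` is hypothetical, the matrix values are made-up illustrative numbers, and Sim_W is assumed to be symmetric so that one matrix serves both directions. The final aggregation into Sim(a, b) in step S5 is left out.

```python
from typing import List, Sequence, Tuple


def local_similarity_lists(
    sim_matrix: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Build both local similarity lists from sim_matrix[r][m] = Sim_W(W_ar, W_bm)."""
    # S3: for each word of W_a, the best match among the words of W_b.
    first = [max(row) for row in sim_matrix]
    # S4: for each word of W_b, the best match among the words of W_a.
    second = [max(column) for column in zip(*sim_matrix)]
    return first, second


# Illustrative 2 x 3 matrix: W_a has s = 2 words, W_b has t = 3 words.
matrix = [
    [0.9, 0.2, 0.4],
    [0.1, 0.6, 0.3],
]
first_list, second_list = local_similarity_lists(matrix)
```

The first list has length s and the second has length t, matching the lists [sim_a1, …, sim_as] and [sim_b1, …, sim_bt].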

As a further preferable technical solution, in step S2, the Cilin word similarity Sim_Cilin(W_ar, W_bm) between the r-th word W_ar in the first word list and the m-th word W_bm in the second word list is calculated by:

(1) constructing a trademark Cilin synonym library traCilinFile from the dictionary file cilinFile of the extended edition of the synonym forest;

specifically: words in the same row of cilinFile (sharing the same Cilin code) that are not similar to one another are marked with the number 0; words in the same row that are similar to one another are marked with the same non-zero number. FIG. 3 shows the text structure of traCilinFile. Taking 'property' and 'real estate' as an example, both words have the Cilin code 'Dj03A09#' and both are followed by the mark 1, which indicates that the two words are likely to be confused and judged similar in the trademark field.

(2) calculating the Cilin word similarity Sim_Cilin(W_ar, W_bm) according to the trademark Cilin synonym library traCilinFile.

specifically: converting the words W_ar and W_bm into Cilin codes according to traCilinFile, and obtaining all combinations of the Cilin codes of W_ar and W_bm;

judging whether any combination has equal Cilin codes; if so, reading the word group in the row of that code and judging from the word marks whether W_ar and W_bm are similar, with Sim_Cilin(W_ar, W_bm) = 0 if they are not similar and Sim_Cilin(W_ar, W_bm) = 1 if they are;

if not, calculating the similarity of all Cilin code combinations by the information-content-based Cilin similarity calculation method, and taking the maximum value as Sim_Cilin(W_ar, W_bm), as follows:

$$\mathrm{Sim}_{Cilin}(W_{ar}, W_{bm}) = \max_{1 \le i \le N_1,\; 1 \le j \le N_2} \mathrm{Sim}'_{Cilin}(C_{ai}, C_{bj})$$

In this example, 'Sumitomo' is not recorded in traCilinFile, and therefore Sim_Cilin('Sumitomo', 'real estate') = 0 and Sim_Cilin('property', 'Sumitomo') = 0;

in summary, Sim_Cilin('Sumitomo', 'Sumitomo') = 1, Sim_Cilin('Sumitomo', 'real estate') = 0, Sim_Cilin('property', 'Sumitomo') = 0, and Sim_Cilin('property', 'real estate') = 1.
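The lookup logic behind these four values can be sketched as follows. Everything about the data layout is an assumption: the library is held as a hypothetical in-memory dictionary from Cilin code to {word: mark}, the rows other than 'Dj03A09#' and the word 'land' are invented for the example, and `placeholder_code_sim` merely stands in for the information-content method. Scoring identical words as 1 follows the example values rather than an explicit rule in the text.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical in-memory form of traCilinFile: Cilin code -> {word: mark}.
TraCilin = Dict[str, Dict[str, int]]


def codes_of(word: str, library: TraCilin) -> List[str]:
    """All Cilin codes whose row contains the word."""
    return [code for code, row in library.items() if word in row]


def cilin_similarity(
    w_a: str, w_b: str, library: TraCilin, code_sim: Callable[[str, str], float]
) -> float:
    if w_a == w_b:
        return 1.0  # identical words score 1 in the example values
    codes_a, codes_b = codes_of(w_a, library), codes_of(w_b, library)
    if not codes_a or not codes_b:
        return 0.0  # a word missing from the library, as with 'Sumitomo'
    shared = set(codes_a) & set(codes_b)
    if shared:
        # Same row: similar only if both carry the same non-zero mark.
        for code in shared:
            mark_a, mark_b = library[code][w_a], library[code][w_b]
            if mark_a != 0 and mark_a == mark_b:
                return 1.0
        return 0.0
    # No shared code: best score over all code combinations.
    return max(code_sim(ca, cb) for ca, cb in product(codes_a, codes_b))


def placeholder_code_sim(code_a: str, code_b: str) -> float:
    """Stand-in for the information-content method (not implemented here)."""
    return 0.5


library: TraCilin = {
    "Dj03A09#": {"property": 1, "real estate": 1, "land": 0},  # 'land' is invented
    "Aa01A01=": {"person": 1},  # invented row to exercise the fallback
}
same_row = cilin_similarity("property", "real estate", library, placeholder_code_sim)
missing = cilin_similarity("Sumitomo", "real estate", library, placeholder_code_sim)
```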

It should be noted that this embodiment constructs a trademark Cilin synonym library that conforms to the trademark examination standards and calculates the Cilin word similarity on the basis of that library, thereby taking into account the influence of information content on word meaning.

As a further preferable technical solution, in step S2, the HowNet word similarity Sim_HowNet(W_ar, W_bm) between the r-th word W_ar in the first word list and the m-th word W_bm in the second word list is calculated by:

(1) constructing a trademark HowNet synonym library traHownetFile from the HowNet dictionary file hownetFile;

specifically: for each sense in hownetFile, the word group of that sense is first collected, and the words in the same row of each sense are then marked to obtain traHownetFile. Synonyms are judged in combination with the Trademark Examination and Trial Standards when constructing traHownetFile, and the construction steps comprise: marking with the number 0 those words in the same row of hownetFile (sharing the same sense) that are not similar to one another; and marking with the same non-zero number those words in the same row that are similar to one another. FIG. 4 shows the text structure of traHownetFile. Taking 'property' and 'real estate' as an example, the two words have different sense codes and are therefore not in the same row; the marks following the two words are shown in FIG. 4.

(2) calculating the HowNet word similarity Sim_HowNet(W_ar, W_bm) according to the trademark HowNet synonym library traHownetFile.

specifically: obtaining the senses of the words W_ar and W_bm according to traHownetFile, and obtaining all combinations of the senses of W_ar and W_bm;

judging whether any combination has equal senses; if so, reading the word group in the row of that sense and judging from the word marks whether W_ar and W_bm are similar, with Sim_HowNet(W_ar, W_bm) = 0 if they are not similar and Sim_HowNet(W_ar, W_bm) = 1 if they are;

if not, calculating the similarity of all the sense combinations by the HowNet-based similarity calculation method, and taking the maximum value as Sim_HowNet(W_ar, W_bm), as follows:

$$\mathrm{Sim}_{HowNet}(W_{ar}, W_{bm}) = \max_{1 \le i \le N_1,\; 1 \le j \le N_2} \mathrm{Sim}'_{HowNet}(S_{ai}, S_{bj})$$

where Sim'_HowNet(S_ai, S_bj) denotes the similarity between the i-th sense S_ai of W_ar and the j-th sense S_bj of W_bm, and N₁ and N₂ are both positive integers;

the similarity Sim'_HowNet(S_a, S_b) of two senses is calculated as follows:

$$\mathrm{Sim}'_{HowNet}(S_a, S_b) = \sum_{k=1}^{4} \beta_k \prod_{i=1}^{k} \mathrm{Sim}'_i(S_a, S_b)$$

In the present embodiment, since none of the sense codes of 'property' and 'real estate' are equal, the similarity of all the sense combinations is calculated directly by the HowNet-based similarity calculation method.

The sememe descriptions of each word are obtained from the word's sememe expression. In the sememe expression of 'property', 'wealth|money' is the first independent sememe description, and there are two symbol sememe descriptions, '#earth|earth' and '#building|building'; there is no other independent sememe description and no relation sememe description. In the sememe expression of 'real estate', 'physical|substance' is the first independent sememe description, and there are two symbol sememe descriptions, '#wealth|money' and '^TakeAway|move'; there is no other independent sememe description and no relation sememe description.

Calculating Sim'₁(S_a, S_b): because each first independent sememe description contains only one sememe, the sememe similarity formula can be applied directly:

$$\mathrm{Sim}'_1(S_a, S_b) = \mathrm{Sim}(p_a, p_b) = \frac{\alpha \cdot \min(dep(p_a), dep(p_b))}{dist(p_a, p_b) + \alpha \cdot \min(dep(p_a), dep(p_b))}$$

where p_a and p_b denote sememes; α is an adjustable parameter, with α = 1.6; dep(p_a) and dep(p_b) denote the depths of p_a and p_b in the sememe hierarchy tree, i.e. the sememe depths, and min(dep(p_a), dep(p_b)) denotes the smaller of the two sememe depths; dist(p_a, p_b) denotes the path length between p_a and p_b in the sememe hierarchy tree, i.e. the sememe distance; when p_a and p_b are not in the same sememe hierarchy tree, the sememe distance is set to 20;

FIG. 5 shows the sememe hierarchy tree containing p_a = 'wealth|money' and p_b = 'physical|substance', from which dist(p_a, p_b) = 3, dep(p_a) = 5, dep(p_b) = 2 and min(dep(p_a), dep(p_b)) = 2 are obtained, so the formula gives Sim(p_a, p_b) = Sim'₁(S_a, S_b) = 0.5161;
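The value 0.5161 can be checked numerically. The sketch below assumes the sememe similarity has the form α·min(dep) / (dist + α·min(dep)), which is the form that reproduces the reported value from dist = 3, depths 5 and 2, and α = 1.6; the function name `sememe_similarity` and the constant name are hypothetical.

```python
def sememe_similarity(dist: float, dep_a: int, dep_b: int, alpha: float = 1.6) -> float:
    """Depth-weighted sememe similarity (assumed form, see lead-in)."""
    scaled_depth = alpha * min(dep_a, dep_b)
    return scaled_depth / (dist + scaled_depth)


# Distance used when two sememes are not in the same hierarchy tree.
DIFFERENT_TREE_DISTANCE = 20

# 'wealth|money' (depth 5) against 'physical|substance' (depth 2), distance 3.
sim1 = sememe_similarity(dist=3, dep_a=5, dep_b=2)
print(round(sim1, 4))  # 0.5161
```

With these inputs the expression evaluates to 3.2 / 6.2, so a shallower common region or a longer path both lower the score.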

Calculating Sim'₂(S_a, S_b), Sim'₃(S_a, S_b) and Sim'₄(S_a, S_b): there are three cases. If neither of the two sememe descriptions contains any sememe, the similarity is directly 1; if exactly one of the two sememe descriptions contains no sememe, the similarity takes the default value 0.2; if both sememe descriptions contain one or more sememes, the similarity of every pairwise combination of their sememes is calculated in the same way as for the first independent sememe description, and the maximum value is taken as the similarity.

In this embodiment, neither 'property' nor 'real estate' has any other independent sememe description or relation sememe description, which corresponds to the first case, so Sim'₂(S_a, S_b) = 1 and Sim'₃(S_a, S_b) = 1. Both 'property' and 'real estate' have several symbol sememes, namely '#earth|earth, #building|building' and '#wealth|money, ^TakeAway|move' respectively, which corresponds to the third case, and the calculation gives Sim'₄(S_a, S_b) = 0.2.

Finally, from Sim'₁(S_a, S_b), Sim'₂(S_a, S_b), Sim'₃(S_a, S_b) and Sim'₄(S_a, S_b), the sense similarity of 'property' and 'real estate' is calculated as Sim'_HowNet(S_a, S_b) = 0.4625. Since both words have only one sense, Sim_HowNet('property', 'real estate') = 0.4625. In this example, 'Sumitomo' is not recorded in traHownetFile, so Sim_HowNet('Sumitomo', 'real estate') = 0 and Sim_HowNet('property', 'Sumitomo') = 0.

In summary, Sim_HowNet('Sumitomo', 'Sumitomo') = 1, Sim_HowNet('Sumitomo', 'real estate') = 0, Sim_HowNet('property', 'Sumitomo') = 0, and Sim_HowNet('property', 'real estate') = 0.4625.
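The value 0.4625 can be reproduced by combining the four partial similarities with the weights β₁ to β₄. The sketch assumes the cumulative-product form, in which the k-th term is β_k multiplied by the product of the first k partial similarities; that form matches the reported value, whereas a plain weighted sum of the four parts does not. The function name `sense_similarity` is hypothetical.

```python
from typing import Sequence

BETAS = (0.5, 0.2, 0.17, 0.13)  # beta_1 .. beta_4 from the text


def sense_similarity(parts: Sequence[float], betas: Sequence[float] = BETAS) -> float:
    """Sum over k of beta_k times the product of the first k partial similarities."""
    total = 0.0
    running_product = 1.0
    for beta, part in zip(betas, parts):
        running_product *= part
        total += beta * running_product
    return total


# Partial similarities for 'property' vs 'real estate':
# Sim'_1 = 3.2 / 6.2 (unrounded 0.5161), Sim'_2 = Sim'_3 = 1, Sim'_4 = 0.2.
sim1 = (1.6 * 2) / (3 + 1.6 * 2)
sim_hownet = sense_similarity((sim1, 1.0, 1.0, 0.2))
print(round(sim_hownet, 4))  # 0.4625
```

Because every term carries the factor Sim'₁, a low first independent sememe similarity caps the whole score regardless of the other three parts.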

In this embodiment, a trademark HowNet synonym library that conforms to the trademark examination standards is constructed on the basis of the HowNet sense dictionary, and the HowNet word similarity is calculated on the basis of that library, taking into account the influence of sememe depth and sememe distance on the similarity of word meanings.

It should be noted that, in this embodiment, synonym libraries that conform to the trademark examination standards are constructed for the two ontology knowledge bases, the synonym forest (Cilin) and HowNet, in combination with the Trademark Examination and Trial Standards, so that words sharing a Cilin code or a sense are judged as synonyms in a way that better fits the trademark field.

As a further preferable technical solution, in step S2, the Word2Vec word similarity Sim_Word2Vec(W_ar, W_bm) between the r-th word W_ar in the first word list and the m-th word W_bm in the second word list is calculated by:

(1) training a Wikipedia Chinese language corpus by using a Word2Vec deep learning model to obtain a Word vector file Word2vecFile, which specifically comprises the following steps:

1-1) downloading a Wikipedia Chinese language database of 12 months in 2020, cleaning and preprocessing the Wikipedia Chinese language database to obtain a language database to be trained;

1-2) constructing a sentence iterator by using the LineSentence() method;

1-3) setting model parameters: the word vector dimension size is set to 100; the maximum distance window between the current central word and the predicted context word is set to 5; the minimum word frequency min_count allowed in the corpus is set to 5; the training model sg is set to 1 (skip-gram); the iteration number iter is set to 5;

1-4) inputting the model parameters into a Word2Vec model, and starting training;

1-5) storing the trained Word2Vec deep learning model and storing the Word vector file Word2vecFile in a non-binary form.
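Steps 1-2) to 1-5) can be sketched roughly as follows. This is only an illustration, assuming the gensim library (version 4.0 or later) and a corpus that has already been cleaned and word-segmented with one sentence per line. The function name and the three path arguments are hypothetical. gensim 4 calls the dimension `vector_size` and the iteration count `epochs`, whereas older releases use `size` and `iter`.

```python
# Sketch of steps 1-2) to 1-5), assuming gensim >= 4.0 is installed.
# Parameter values follow step 1-3); the gensim 4 names are used here.
W2V_PARAMS = {
    "vector_size": 100,  # word vector dimension
    "window": 5,         # max distance between central word and context word
    "min_count": 5,      # minimum word frequency kept in the vocabulary
    "sg": 1,             # 1 selects skip-gram
    "epochs": 5,         # number of training iterations
}


def train_word2vec(corpus_path, model_path, vector_path):
    """Train on a pre-segmented corpus (one sentence per line, words
    separated by spaces) and save the model plus a text-format vector file.
    All three paths are hypothetical and supplied by the caller."""
    # Imported here so the parameter table can be inspected without gensim.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence(corpus_path)        # step 1-2)
    model = Word2Vec(sentences, **W2V_PARAMS)    # steps 1-3) and 1-4)
    model.save(model_path)                       # step 1-5): trained model
    # step 1-5): word vectors in non-binary (text) form, i.e. word2vecFile
    model.wv.save_word2vec_format(vector_path, binary=False)
    return model
```

Saving with `binary=False` yields a plain-text file with one word and its vector per line, which is what "non-binary form" is taken to mean here.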

(2) According to the word vector file word2vecFile, the word W_ar and the word W_bm are converted into word vectors, and Sim_Word2Vec(W_ar,W_bm) is calculated with the cosine formula, as follows:

$$Sim_{Word2Vec}(W_{ar},W_{bm})=\frac{\sum_{n=1}^{N}V_{arn}V_{bmn}}{\sqrt{\sum_{n=1}^{N}V_{arn}^{2}}\,\sqrt{\sum_{n=1}^{N}V_{bmn}^{2}}}$$

in the formula, V_ar and V_bm respectively represent the word vector of the r-th word W_ar in W_a and the word vector of the m-th word W_bm in W_b, N denotes the dimension of the word vector used when training the Word2Vec model, V_arn represents the n-th value of V_ar, and V_bmn represents the n-th value of V_bm. It should be noted that in this embodiment the calculation gives Sim_Word2Vec('Sumitomo', 'Sumitomo') = 1, Sim_Word2Vec('Sumitomo', 'real estate') = 0.5816, Sim_Word2Vec('property', 'Sumitomo') = 0.4892, and Sim_Word2Vec('property', 'real estate') = 0.6853.
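The cosine calculation between two word vectors can be illustrated with a short standard-library sketch. The function name is hypothetical, and returning 0 for a zero vector is an assumption made here to avoid division by zero; the text does not address that case.

```python
import math


def cosine_similarity(v_ar, v_bm):
    """Cosine of the angle between two word vectors of equal dimension N."""
    if len(v_ar) != len(v_bm):
        raise ValueError("word vectors must have the same dimension N")
    dot = sum(x * y for x, y in zip(v_ar, v_bm))
    norm_a = math.sqrt(sum(x * x for x in v_ar))
    norm_b = math.sqrt(sum(y * y for y in v_bm))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector has no similarity to anything
    return dot / (norm_a * norm_b)
```

A vector compared with itself or with a positive multiple of itself gives 1, which matches the value reported for two identical words.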

It should be noted that, in this embodiment, a Word2Vec deep learning model is trained on a Wikipedia Chinese corpus with a rich vocabulary to obtain a higher-quality word vector model, and the word vectors are then used to find more similar words, which addresses the limited knowledge base of ontology semantic dictionaries.

As a more preferred technical solution, in step S3, the first local similarity list [sim_a1, sim_a2, …, sim_as] is constructed as follows:

using the dynamic weighting strategy, for each word traversed in the first word segmentation list W_a, calculating the comprehensive similarity of the word-forest word similarity, the HowNet word similarity and the Word2Vec word similarity between that word and each word in the second word segmentation list W_b, and taking it as the word similarity between the traversed word of W_a and each word of W_b; taking the maximum of these word similarities as the local similarity between the traversed word of W_a and W_b; when all words in W_a have been traversed, a local similarity list [sim_a1, sim_a2, …, sim_as] of length s is obtained;

In step S4, the second local similarity list [sim_b1, sim_b2, …, sim_bt] is constructed as follows:

using the dynamic weighting strategy, for each word traversed in the second word segmentation list W_b, calculating the comprehensive similarity of the word-forest word similarity, the HowNet word similarity and the Word2Vec word similarity between that word and each word in the first word segmentation list W_a, and taking it as the word similarity between the traversed word of W_b and each word of W_a; taking the maximum of these word similarities as the local similarity between the traversed word of W_b and W_a; when all words in W_b have been traversed, a local similarity list [sim_b1, sim_b2, …, sim_bt] of length t is obtained.

The comprehensive similarity Sim_W(W_ar,W_bm) is calculated as:

Sim_W(W_ar,W_bm) = λ₁Sim_Cilin(W_ar,W_bm) + λ₂Sim_HowNet(W_ar,W_bm) + λ₃Sim_Word2Vec(W_ar,W_bm)

in the formula, λ₁, λ₂ and λ₃ respectively represent the weights of Sim_Cilin(W_ar,W_bm), Sim_HowNet(W_ar,W_bm) and Sim_Word2Vec(W_ar,W_bm), and satisfy λ₁+λ₂+λ₃ = 1. As shown in fig. 6, λ₁, λ₂ and λ₃ are determined by the distribution of W_ar and W_bm in traCilinFile, traHownetFile and word2vecFile. In fig. 6, U represents the set of all words; A represents words included only in traCilinFile; B represents words included only in traHownetFile; C represents words existing only in word2vecFile; D represents words existing in traCilinFile, traHownetFile and word2vecFile at the same time; G represents words included in both traCilinFile and traHownetFile; E represents words existing in both traCilinFile and word2vecFile; F represents words existing in both traHownetFile and word2vecFile. The values fall into 8 cases:

In addition, W_ar∈D, W_bm∈E or W_ar∈E, W_bm∈D both indicate that one word exists in the word forest, HowNet and Word2Vec at the same time, while the other word exists only in the word forest and Word2Vec. One word is therefore not in HowNet and the HowNet similarity is necessarily 0, so λ₁ = λ₃ = 1/2 and λ₂ = 0.

In addition, W_ar∈A, W_bm∈D or W_ar∈D, W_bm∈A both indicate that one word exists in traCilinFile, traHownetFile and word2vecFile at the same time, while the other word exists only in traCilinFile. One word is therefore in neither traHownetFile nor word2vecFile, and the HowNet similarity and the Word2Vec similarity are necessarily 0, so λ₁ = 1 and λ₂ = λ₃ = 0.

(8) in other cases, the weights are meaningless;

it should be noted that, in this embodiment, "other cases" means that the two words do not appear together in any of traCilinFile, traHownetFile or word2vecFile. All three similarity values are then necessarily 0 and the similarity weights are meaningless. An example is W_ar∈B, W_bm∈C or W_ar∈C, W_bm∈B, i.e. one word exists only in traHownetFile and the other word exists only in word2vecFile.
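The weighting logic can be illustrated with a small sketch. The full table of cases is not reproduced in this text, so the rule used below, equal weight on every resource that contains both words, is an assumption: it reproduces the two cases explained above (λ₁ = λ₃ = 1/2, λ₂ = 0 and λ₁ = 1, λ₂ = λ₃ = 0) and the "other cases" rule, but the remaining cases may differ. The function names and the resource labels "cilin", "hownet" and "word2vec" are hypothetical.

```python
RESOURCES = ("cilin", "hownet", "word2vec")  # traCilinFile, traHownetFile, word2vecFile


def dynamic_weights(in_a, in_b):
    """in_a / in_b: sets of resource labels that contain each word.
    Returns (lambda1, lambda2, lambda3), or None when the two words share
    no resource and the weights are meaningless (case 8).
    Assumption: equal weight on each shared resource."""
    shared = set(in_a) & set(in_b)
    if not shared:
        return None
    w = 1.0 / len(shared)
    return tuple(w if name in shared else 0.0 for name in RESOURCES)


def combined_similarity(sims, weights):
    """sims: (Sim_Cilin, Sim_HowNet, Sim_Word2Vec) for one word pair."""
    if weights is None:
        return 0.0  # all three similarities are 0 in this case
    return sum(lam * s for lam, s in zip(weights, sims))
```

For instance, if two words were assumed to share only word2vecFile, the weights would be (0, 0, 1) and the comprehensive similarity would equal the Word2Vec similarity.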

In this example, the calculation gives Sim_W('Sumitomo', 'Sumitomo') = 1, Sim_W('Sumitomo', 'real estate') = 0.5816, Sim_W('property', 'Sumitomo') = 0.4892, and Sim_W('property', 'real estate') = 0.7159, so the two local similarity lists are both [1.0, 0.7159].
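The construction of the two local similarity lists from these four values can be sketched as row and column maxima of a word similarity matrix. The function name is hypothetical, and the sketch assumes the word similarity is symmetric, so one matrix serves both directions.

```python
def local_similarity_lists(sim_matrix):
    """sim_matrix[r][m] holds Sim_W(W_ar, W_bm).
    Returns (first list, second list): the row maxima give the local
    similarity of each word of W_a to W_b, the column maxima give the
    local similarity of each word of W_b to W_a."""
    first = [max(row) for row in sim_matrix]
    second = [max(col) for col in zip(*sim_matrix)]
    return first, second


# Values from this example; rows are the words of the first trademark name,
# columns are the words of the second trademark name.
sim_w = [
    [1.0, 0.5816],
    [0.4892, 0.7159],
]
first, second = local_similarity_lists(sim_w)
print(first, second)  # [1.0, 0.7159] [1.0, 0.7159]
```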

It should be noted that, in this embodiment, a dynamic weighting strategy is adopted: the comprehensive similarity of the word-forest word similarity, the HowNet word similarity and the Word2Vec word similarity is used as the word similarity from which the trademark meaning similarity is calculated, which makes the calculation result more consistent and reasonable and improves the accuracy of the trademark meaning similarity judgment.

As a more preferred embodiment, in step S5, calculating the meaning similarity of the first trademark name and the second trademark name according to the first local similarity list and the second local similarity list comprises:

according to the first local similarity list [sim_a1, sim_a2, …, sim_as] and the second local similarity list [sim_b1, sim_b2, …, sim_bt], calculating the meaning similarity Sim(a, b) of the first trademark name and the second trademark name using the formula:

in the formula, sim_ar represents the local similarity between the r-th word in W_a and W_b, and sim_bm represents the local similarity between the m-th word in W_b and W_a.

In this embodiment, applying the meaning similarity formula to the first local similarity list and the second local similarity list calculated from the first trademark name and the second trademark name gives Sim(a, b) = 0.858.
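The formula itself is not reproduced in this text. As an illustration only, the sketch below assumes a plain average over all entries of both local similarity lists, which is one form that reproduces the stated value of 0.858 for this example. The actual formula may differ, and the function name is hypothetical.

```python
def meaning_similarity(first_list, second_list):
    """Assumed form: (sum of sim_ar + sum of sim_bm) / (s + t).
    This is an assumption consistent with the worked example only."""
    total = sum(first_list) + sum(second_list)
    count = len(first_list) + len(second_list)
    if count == 0:
        raise ValueError("both local similarity lists are empty")
    return total / count


sim_ab = meaning_similarity([1.0, 0.7159], [1.0, 0.7159])
print(round(sim_ab, 3))  # 0.858
```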

As a more preferred mode, in this embodiment Sim(a, b) is compared with the infringement threshold θ = 0.75 for trademark meaning similarity, and if Sim(a, b) is greater than or equal to the infringement threshold, the application is determined to be a similar trademark application. Since 0.858 ≥ 0.75, "Sumitomo property" and "Sumitomo real estate" are determined to be similar trademark applications.

It should be understood that the specific value of the infringement threshold in this embodiment is an example, and those skilled in the art may set the specific value of the infringement threshold according to actual situations.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A Chinese trademark similarity calculation method is characterized by comprising the following steps:

acquiring names of a first trademark and a second trademark to be compared, and performing word segmentation processing on the first trademark name and the second trademark name to respectively obtain a first word segmentation list and a second word segmentation list;

2. The Chinese trademark similarity calculation method according to claim 1, wherein the step of calculating the word-forest word similarity Sim_Cilin(W_ar, W_bm) between the r-th word W_ar in the first word segmentation list and the m-th word W_bm in the second word segmentation list comprises:

if yes, reading the word group in the row of the current code and judging whether the word W_ar and the word W_bm are similar; if not similar, Sim_Cilin(W_ar, W_bm) = 0; if similar, Sim_Cilin(W_ar, W_bm) = 1;

If not, calculating the similarity of all word forest coding combinations by adopting a word forest similarity calculation method based on information content, and taking the maximum value as Sim_Cilin(W_ar，W_bm)。

3. The Chinese trademark similarity calculation method according to claim 2, wherein constructing the trademark word-forest synonym library traCilinFile from the "synonym forest extension" dictionary file cilinFile comprises:

4. The Chinese trademark similarity calculation method according to claim 1, wherein the step of calculating the HowNet word similarity Sim_HowNet(W_ar, W_bm) between the r-th word W_ar in the first word segmentation list and the m-th word W_bm in the second word segmentation list comprises:

if yes, reading the word group in the line of the current sense and judging, according to the word marks, whether W_ar and W_bm are similar; if not similar, Sim_HowNet(W_ar, W_bm) = 0; if similar, Sim_HowNet(W_ar, W_bm) = 1;

If not, calculating the similarity of all sense combinations using a HowNet-based similarity calculation method, and taking the maximum value as Sim_HowNet(W_ar, W_bm).

5. The Chinese trademark similarity calculation method according to claim 4, wherein constructing the trademark HowNet synonym library traHownetFile from the HowNet dictionary file comprises:

6. The Chinese trademark similarity calculation method according to claim 1, wherein the step of calculating the Word2Vec word similarity Sim_Word2Vec(W_ar, W_bm) between the r-th word W_ar in the first word segmentation list and the m-th word W_bm in the second word segmentation list comprises:

according to the word vector file word2vecFile, converting the word W_ar and the word W_bm into word vectors, and calculating Sim_Word2Vec(W_ar, W_bm) with the cosine formula.

7. The Chinese trademark similarity calculation method according to claim 6, wherein training on the Wikipedia Chinese corpus with the Word2Vec deep learning model to obtain the word vector file word2vecFile comprises the following steps:

constructing a sentence iterator by using the LineSentence() method;

8. The Chinese trademark similarity calculation method according to claim 1, wherein the comprehensive similarity Sim_W(W_ar, W_bm) is calculated as:

Sim_W(W_ar, W_bm) = λ₁Sim_Cilin(W_ar, W_bm) + λ₂Sim_HowNet(W_ar, W_bm) + λ₃Sim_Word2Vec(W_ar, W_bm)

in the formula, λ₁, λ₂ and λ₃ respectively represent the weights of Sim_Cilin(W_ar, W_bm), Sim_HowNet(W_ar, W_bm) and Sim_Word2Vec(W_ar, W_bm), and satisfy λ₁+λ₂+λ₃ = 1.

9. The Chinese trademark similarity calculation method according to claim 8, wherein λ₁, λ₂ and λ₃ take values as follows:

10. The Chinese trademark similarity calculation method according to any one of claims 1 to 9, wherein calculating the meaning similarity of the first trademark name and the second trademark name based on the first local similarity list and the second local similarity list comprises:

in the formula, sim_ar represents the local similarity between the r-th word in W_a and W_b; sim_bm represents the local similarity between the m-th word in W_b and W_a; the first local similarity list is [sim_a1, sim_a2, ..., sim_as]; the second local similarity list is [sim_b1, sim_b2, ..., sim_bt].