CN111125160A

CN111125160A - Data preprocessing method, system and terminal based on trademark approximate analysis

Info

Publication number: CN111125160A
Application number: CN201911370644.9A
Authority: CN
Inventors: 朱峰; 彭丽
Original assignee: Guangdong Knowledge Gain And Loss Network Technology Co ltd
Current assignee: Guangdong Knowledge Gain And Loss Network Technology Co ltd
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2020-05-08

Abstract

The invention relates to the technical field of data analysis, and in particular to a data preprocessing method, system and terminal based on trademark approximate analysis. The data preprocessing method comprises the following steps: acquiring an input keyword; performing character type recognition on the keyword; judging whether the keyword is a multi-type character combination; judging whether the desensitization recognition result completely hits a sensitive word; judging whether at least one word in the generated set is contained in the A-type sensitive words; judging whether at least one word in the set is contained in the B-type sensitive words; and thereby judging whether the keyword has significance. The method simplifies the user's analysis steps for combined trademarks and improves analysis efficiency.

Description

Data preprocessing method, system and terminal based on trademark approximate analysis

Technical Field

The invention relates to the technical field of data analysis, in particular to a data preprocessing method, a data preprocessing system and a data preprocessing terminal based on trademark approximate analysis.

Background

In recent years, with the rapid development of the world economy and society, the value of trademarks has increased dramatically, and the number of registered trademarks has continued to grow. To register a trademark or protect trademark rights, a trademark owner usually searches, either personally or through an agency, the registered trademarks published periodically by the trademark office so as to find approximate trademarks in time. However, the scope of manual searching is narrow, so the search results are not comprehensive, and persons skilled in the art have therefore continued to develop methods for analyzing trademark approximation.

In this context, how to make the search results more accurate becomes a problem to be solved. To solve this problem, the invention provides a data preprocessing method, a data preprocessing system and a data preprocessing terminal based on trademark approximate analysis.

Disclosure of Invention

The technical problem solved by the invention is to provide a data preprocessing method, a data preprocessing system and a data preprocessing terminal based on trademark approximate analysis, which simplify the user's analysis steps for combined trademarks and improve analysis efficiency.

In order to solve the technical problems, the technical scheme provided by the invention is as follows:

A data preprocessing method based on trademark approximate analysis comprises the following steps:

acquiring an input keyword;

performing character type recognition on the keyword;

judging whether the keyword is a multi-type character combination; if so, judging whether the keyword contains numbers; if not, performing desensitization recognition according to Chinese, English and numbers respectively;

judging whether the desensitization recognition result completely hits a sensitive word; if so, popping up a prompt that the word has no significance; if not, performing word splitting processing on the analysis object to generate a set;

judging whether at least one word in the set is contained in the A-type sensitive words; if so, popping up a prompt that significance is affected;

if not, judging whether at least one word in the set is contained in the B-type sensitive words; if so, popping up a prompt that the word has no significance; if not, entering the registration analysis logic.
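The decision order of the steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Outcome` and `screen`, the use of plain sets for the sensitive word lists, and the reading of "completely hits" as "the keyword itself is a sensitive word" are hypothetical and are not taken from the patent.

```python
from enum import Enum
from typing import Set


class Outcome(Enum):
    """Possible results of the screening flow (names are illustrative)."""
    NO_SIGNIFICANCE = "no significance prompt"
    AFFECTS_SIGNIFICANCE = "influence significance prompt"
    REGISTRATION_ANALYSIS = "enter registration analysis logic"


def screen(keyword: str, fragments: Set[str], sensitive: Set[str],
           type_a: Set[str], type_b: Set[str]) -> Outcome:
    """Apply the checks in the order given by the method steps.

    `fragments` is the set produced by word splitting; it is passed in
    so that this sketch does not depend on a particular splitting rule.
    """
    # Assumption: a "complete hit" means the whole keyword is a sensitive word.
    if keyword in sensitive:
        return Outcome.NO_SIGNIFICANCE
    # A-type words (words affecting registration) are checked first.
    if fragments & type_a:
        return Outcome.AFFECTS_SIGNIFICANCE
    # B-type words (words that cannot be registered) are checked next.
    if fragments & type_b:
        return Outcome.NO_SIGNIFICANCE
    # Nothing was hit, so the keyword proceeds to registration analysis.
    return Outcome.REGISTRATION_ANALYSIS
```

For instance, with `type_a = {"kitten"}` and `type_b = {"ticket"}`, a keyword whose fragments include "kitten" yields `Outcome.AFFECTS_SIGNIFICANCE`, because the A-type check runs before the B-type check.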

Preferably, judging whether numbers are contained specifically comprises:

judging whether numbers are contained;

if yes, performing Arabic numeral extraction, English numeral extraction and Chinese numeral extraction;

judging whether characters remain after extraction;

if not, carrying out unified initialization of the number format;

performing desensitization recognition according to Chinese, English and numbers respectively. Because the Arabic numerals, English and Chinese characters in the keyword are extracted and recognized separately, the accuracy of trademark analysis is improved.
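As a rough illustration of numeral extraction followed by unified initialization of the number format, the sketch below pulls Arabic digits, English digit words and Chinese digit characters out of a keyword and rewrites them all as Arabic digits. The function name `extract_numbers`, the two lookup tables and the digit-by-digit treatment are assumptions made for this example; the patent does not specify how the normalization is performed, and multi-digit forms such as "twenty" or "十" are deliberately left out.

```python
import re
from typing import List, Tuple

# Hypothetical lookup tables covering single digits only.
CHINESE_DIGITS = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
                  "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}
ENGLISH_DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                  "four": "4", "five": "5", "six": "6", "seven": "7",
                  "eight": "8", "nine": "9"}

# English digit words only count when not embedded in a longer run of
# letters, so that "phone" is not read as "ph" + "one".
_NUMBER_TOKEN = re.compile(
    "[0-9]|[" + "".join(CHINESE_DIGITS) + "]|(?<![A-Za-z])(?:"
    + "|".join(ENGLISH_DIGITS) + ")(?![A-Za-z])",
    re.IGNORECASE,
)


def extract_numbers(keyword: str) -> Tuple[str, str]:
    """Return (digits in Arabic form, characters left after extraction)."""
    digits: List[str] = []

    def collect(match: "re.Match[str]") -> str:
        token = match.group(0)
        # Map Chinese or English digits to Arabic; Arabic digits pass through.
        digits.append(CHINESE_DIGITS.get(token)
                      or ENGLISH_DIGITS.get(token.lower())
                      or token)
        return ""  # remove the numeral from the remaining text

    remaining = _NUMBER_TOKEN.sub(collect, keyword)
    return "".join(digits), remaining
```

With these tables, `extract_numbers("color123")` returns `("123", "color")`, so characters remain after extraction and the Chinese and English extraction branch would apply.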

Preferably, in judging whether characters remain after extraction:

if yes, extracting Chinese characters and English characters;

carrying out unified initialization of the number format;

performing desensitization recognition according to Chinese, English and numbers respectively.

Preferably, in judging whether numbers are contained, if not:

extracting Chinese characters and English characters;

carrying out unified initialization of the number format;

performing desensitization recognition according to Chinese, English and numbers respectively.

Further preferably, judging whether the desensitization recognition result completely hits a sensitive word specifically comprises:

establishing a sensitive word corpus;

acquiring the extracted Chinese, English and numbers;

matching the acquired Chinese, English and numbers against the sensitive word corpus to obtain matched words;

and performing similarity analysis between the matched words and the extracted Chinese, English and numbers.
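The matching step can be pictured with the following minimal sketch. It assumes, purely for illustration, that the corpus is a dictionary of word sets keyed by character type and that a "complete hit" means every non-empty extracted part is itself a corpus entry; `match_parts` and `is_full_hit` are hypothetical names, and the similarity analysis itself is not covered here.

```python
from typing import Dict, Set


def match_parts(parts: Dict[str, str],
                corpus: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each extracted part, collect the corpus words that occur in it."""
    matches: Dict[str, Set[str]] = {}
    for kind, text in parts.items():
        words = corpus.get(kind, set())
        # Substring containment stands in for the matching rule.
        matches[kind] = {word for word in words if word and word in text}
    return matches


def is_full_hit(parts: Dict[str, str], corpus: Dict[str, Set[str]]) -> bool:
    """True when every non-empty part is exactly a corpus word (assumed rule)."""
    non_empty = {kind: text for kind, text in parts.items() if text}
    return bool(non_empty) and all(
        text in corpus.get(kind, set()) for kind, text in non_empty.items()
    )
```

A keyword whose parts are only partly covered by the corpus is not a complete hit under this reading, so it would continue to the word splitting step.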

Further preferably, the similarity is determined by a similarity function Y of the form:

Y(…) = α ∗ (…);

wherein α > 0 is an adjustable parameter; the arguments of Y are the obtained Chinese, English and number words and the similar words matched from the sensitive word corpus; and the remaining terms are the levels at which these words are respectively located (the symbols of the formula are not reproduced in this text). In judging similarity, the Y algorithm defines the hierarchical relationship of each character by its position and compares the positions of the same character in the two words; if the same characters occupy nearly the same positions, the words are judged to be similar. This similarity judgment allows similar trademarks to be matched in a database and improves matching accuracy.

Preferably, performing word splitting processing on the analysis object to generate a set specifically comprises: separating the characters of the analysis object, and combining the separated characters one by one in ascending order to generate a set of character combinations.
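The text does not spell out what "combining in ascending order one by one" produces. One possible reading, sketched below as an assumption, is that the set contains every contiguous run of characters, generated from the shortest runs to the longest. The function name `split_into_combinations` is hypothetical.

```python
from typing import List


def split_into_combinations(text: str) -> List[str]:
    """Contiguous character runs of `text`, shortest first, without repeats."""
    seen = set()
    combinations: List[str] = []
    # Grow the run length from 1 up to the full length of the text.
    for length in range(1, len(text) + 1):
        for start in range(0, len(text) - length + 1):
            piece = text[start:start + length]
            if piece not in seen:  # keep each combination only once
                seen.add(piece)
                combinations.append(piece)
    return combinations
```

Under this reading, `split_into_combinations("abc")` gives `["a", "b", "c", "ab", "bc", "abc"]`. The list grows quadratically with the keyword length, which stays small for typical trademark keywords.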

Preferably, the A-type sensitive words are words that affect registration.

Preferably, the B-type sensitive words are words that cannot be registered.

A data pre-processing system based on trademark approximation analysis, comprising:

a keyword acquisition module: the keyword acquisition module is used for acquiring input keywords;

a character type identification module: the character type identification module is used for identifying the character type of the keyword;

a character type judging module: the character type judging module is used for judging whether the keyword is a multi-type character combination; if so, judging whether the keyword contains numbers; if not, performing desensitization recognition according to Chinese, English and numbers respectively;

a desensitization recognition module: the desensitization recognition module is used for judging whether the desensitization recognition result completely hits a sensitive word; if so, popping up a prompt that the word has no significance; if not, performing word splitting processing on the analysis object to generate a set;

a desensitization judging module: the desensitization judging module is used for judging whether at least one word in the set is contained in the A-type sensitive words, and if so, popping up a prompt that significance is affected; if not, judging whether at least one word in the set is contained in the B-type sensitive words; if so, popping up a prompt that the word has no significance; if not, entering the registration analysis logic.

A computer readable storage medium having stored thereon computer program instructions adapted to be loaded by a processor and to execute a method of data pre-processing based on trademark approximation analysis.

A mobile terminal comprises a processor and a memory, wherein the processor is used for executing a program stored in the memory so as to realize a data preprocessing method based on trademark approximate analysis.

Compared with the prior art, the invention has the following beneficial effects: preprocessing the data of the trademark to be analyzed simplifies the user's analysis steps for combined trademarks and improves analysis efficiency. After desensitization recognition is performed on the analysis object, more accurate feedback can be given to the user, which reduces the cases in which a trademark is rejected after being submitted for registration because it contains sensitive words, and improves the accuracy of the subsequent approximate analysis. Specifically, the method is not limited to analyzing English trademarks, Chinese trademarks and Arabic numeral trademarks individually; it can also extract the three components of a combined trademark and analyze them one by one, which improves the intelligence and accuracy of the judgment.

Drawings

The invention is further illustrated with reference to the following figures and examples.

FIG. 1 is a schematic flow chart of a data preprocessing method based on trademark approximation analysis according to the present invention;

FIG. 2 is a block diagram of a data preprocessing system based on trademark approximation analysis according to the present invention.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic drawings and illustrate only the basic flow diagram of the invention, and therefore they show only the flow associated with the invention.

Example 1

As shown in fig. 1, the present invention is a data preprocessing method based on trademark approximation analysis, and the method specifically comprises:

acquiring an input keyword;

performing character type recognition on the keywords;

Performing word splitting processing on the analysis object to generate a set specifically comprises: separating the characters of the analysis object, and combining the separated characters one by one in ascending order to generate a set of character combinations.

The A-type sensitive words are words that affect registration. The B-type sensitive words are words that cannot be registered.

For example, A-type sensitive words such as "kitten" and "puppy" are very likely to have been registered in different classes; because such words have broad relevance, they may be difficult to register. As for B-type sensitive words, "ticket", for example, is a word that cannot be registered: it is a recognized mark, and a recognized mark enjoys cross-class protection, so the system reminds the applicant that the word has no significance when registration is attempted again.

Example 2

Whether the keyword is a multi-type character combination is judged as follows:

Step 1, judging whether numbers are contained, specifically:

judging whether numbers are contained;

if yes, performing Arabic numeral extraction, English numeral extraction and Chinese numeral extraction;

judging whether characters remain after extraction;

if not, carrying out unified initialization of the number format;

performing desensitization recognition according to Chinese, English and numbers respectively.

Step 2, judging whether characters remain after extraction:

if characters remain, Chinese character extraction and English character extraction are carried out;

unified initialization of the number format is carried out;

desensitization recognition is performed according to Chinese, English and numbers respectively.

The specific steps for judging whether the desensitization recognition result completely hits the sensitive word are as follows:

establishing a sensitive word corpus;

acquiring extracted Chinese, English and number;

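determining the similarity by a similarity function Y of the form:

Y(…) = α ∗ (…);

wherein α > 0 is an adjustable parameter; the arguments of Y are the obtained Chinese, English and number words and the similar words matched from the sensitive word corpus; and the remaining terms are the levels at which these words are respectively located (the symbols of the formula are not reproduced in this text).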

For example, trademark registration rules provide that certain words may not be registered as trademarks, such as: words identical or similar to the names, flags and emblems of intergovernmental international organizations, except where the public is not easily misled; the names of administrative divisions at or above the county level, or foreign place names known to the public; and words that are harmful to socialist morals and customs or have other adverse effects.

In one trademark application, the applicant applied to register "love house and house" as a trademark. The wording evolved from the idiom "love and wu" and easily misleads the public into taking "love house and house" as the written form of that idiom. In this case, when the user enters the word, the system acquires it and performs similarity matching against the database. The positions of the same characters are judged first: a position is defined for each character of the two words, and the maximum distance between the positions is selected. The larger the distance, the lower the similarity; the smaller the distance, the higher the similarity. Here "love house and house" is highly similar to "love and wu" and therefore cannot be used. The levels defined in the similarity formula are the positions of the characters within the word.
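The positional comparison described in this example can be illustrated with a small sketch. The exact formula of the patent is not legible in this text, so the function below is not that formula: it is only one hypothetical function with the stated property that a larger maximum position distance gives a lower similarity. The name `position_similarity`, the use of the first occurrence of each shared character, and the form α / (1 + distance) are all assumptions.

```python
def position_similarity(word: str, candidate: str, alpha: float = 1.0) -> float:
    """Score two words by how far their shared characters are displaced.

    Assumed form: alpha / (1 + max position distance of shared characters).
    Returns 0.0 when the two words share no character at all.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    shared = set(word) & set(candidate)
    if not shared:
        return 0.0
    # Compare the first position of each shared character in both words
    # and keep the largest displacement.
    max_distance = max(abs(word.index(ch) - candidate.index(ch)) for ch in shared)
    return alpha / (1 + max_distance)
```

Two four-character words that agree in their first three positions score `alpha`, the maximum under this form, which matches the example's conclusion that a one-character variant of an idiom is highly similar to it. Characters that are not shared are ignored here, a simplification that a real implementation would likely need to revisit.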

For example, if a keyword such as "color123" is obtained, the system first identifies the character types of the word. Where recognition finds that the keyword contains Chinese, English and numbers, the keyword is split into its Chinese, English and number parts, and desensitization recognition is performed on each part separately.
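A minimal sketch of this character type recognition and splitting is given below. It assumes that Chinese characters fall in the CJK Unified Ideographs range U+4E00 to U+9FFF, that English means ASCII letters, and that numbers are already in Arabic form; the function names and the sample keyword "颜色color123" are hypothetical.

```python
import re
from typing import Dict

_CHINESE = re.compile(r"[\u4e00-\u9fff]+")  # basic CJK ideograph block only
_ENGLISH = re.compile(r"[A-Za-z]+")
_DIGITS = re.compile(r"[0-9]+")


def split_by_type(keyword: str) -> Dict[str, str]:
    """Split a keyword into its Chinese, English and number parts."""
    return {
        "chinese": "".join(_CHINESE.findall(keyword)),
        "english": "".join(_ENGLISH.findall(keyword)),
        "number": "".join(_DIGITS.findall(keyword)),
    }


def is_multi_type(keyword: str) -> bool:
    """True when the keyword mixes at least two character types."""
    parts = split_by_type(keyword)
    return sum(1 for part in parts.values() if part) > 1
```

For the sample keyword, `split_by_type("颜色color123")` returns the parts "颜色", "color" and "123", each of which would then be checked against the sensitive words of its own type.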

Example 3

As shown in FIG. 2, the present invention provides a data preprocessing system based on trademark approximation analysis:

keyword acquisition module 1: the keyword acquisition module is used for acquiring input keywords;

character type recognition module 2: the character type identification module is used for identifying the character type of the keyword;

character type judging module 3: the character type judging module is used for judging whether the keyword is a multi-type character combination; if so, judging whether the keyword contains numbers; if not, performing desensitization recognition according to Chinese, English and numbers respectively;

desensitization recognition module 4: the desensitization recognition module is used for judging whether the desensitization recognition result completely hits a sensitive word; if so, popping up a prompt that the word has no significance; if not, performing word splitting processing on the analysis object to generate a set;

desensitization judging module 5: the desensitization judging module is used for judging whether at least one word in the set is contained in the A-type sensitive words, and if so, popping up a prompt that significance is affected; if not, judging whether at least one word in the set is contained in the B-type sensitive words; if so, popping up a prompt that the word has no significance; if not, entering the registration analysis logic.

In the character type judging module 3, the specific process of judging whether the keyword is a multi-type character combination is as follows:

judging whether numbers are contained; if yes, performing Arabic numeral extraction, English numeral extraction and Chinese numeral extraction, and judging whether characters remain after extraction; if not, carrying out unified initialization of the number format; performing desensitization recognition according to Chinese, English and numbers respectively.

In judging whether characters remain after extraction, if characters remain, Chinese character extraction and English character extraction are carried out; unified initialization of the number format is carried out; and desensitization recognition is performed according to Chinese, English and numbers respectively.

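In the desensitization recognition module 4, whether the desensitization recognition result completely hits a sensitive word is judged as follows:

establishing a sensitive word corpus;

acquiring the extracted Chinese, English and numbers;

determining the similarity by a similarity function Y of the form:

Y(…) = α ∗ (…);

wherein α > 0 is an adjustable parameter; the arguments of Y are the obtained Chinese, English and number words and the similar words matched from the sensitive word corpus; and the remaining terms are the levels at which these words are respectively located (the symbols of the formula are not reproduced in this text).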

The above detailed description is specific to possible embodiments of the present invention. These embodiments are not intended to limit the scope of the present invention, and all equivalent implementations or modifications that do not depart from the scope of the present invention fall within the scope of the present claims.

Claims

1. A data preprocessing method based on trademark approximate analysis is characterized by comprising the following steps:

acquiring an input keyword;

performing character type recognition on the keywords;

2. The method of claim 1, wherein judging whether numbers are contained specifically comprises:

judging whether numbers are contained;

judging whether characters remain after extraction;

if not, carrying out unified initialization of the number format;

3. The data preprocessing method based on trademark approximate analysis as claimed in claim 2, wherein, in judging whether characters remain after extraction:

if yes, extracting Chinese characters and English characters;

carrying out unified initialization of the number format;

performing desensitization recognition according to Chinese, English and numbers respectively.

4. The method of claim 2, wherein, in judging whether numbers are contained, if not:

extracting Chinese characters and English characters;

carrying out unified initialization of the number format;

5. The data preprocessing method based on trademark approximate analysis according to claim 1, wherein performing word splitting processing on the analysis object to generate a set specifically comprises: separating the characters of the analysis object, and combining the separated characters one by one in ascending order to generate a set of character combinations.

6. The data preprocessing method based on trademark approximate analysis as claimed in claim 1, wherein the A-type sensitive words are words that affect registration.

7. The data preprocessing method based on trademark approximate analysis as claimed in claim 1, wherein the B-type sensitive words are words that cannot be registered.

8. A data preprocessing system based on trademark approximation analysis, comprising:

9. A computer-readable storage medium, characterized in that it stores computer program instructions adapted to be loaded by a processor and to execute the method of any of claims 1 to 7.

10. A mobile terminal comprising a processor and a memory, the processor being configured to execute a program stored in the memory to implement the method of any one of claims 1 to 7.