CN117493532A - Text processing method, device, equipment and storage medium - Google Patents

Publication number
CN117493532A
Authority
CN
China
Prior art keywords
word
detected
trademark
sequence
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311846910.7A
Other languages
Chinese (zh)
Other versions
CN117493532B (en)
Inventor
李源肇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhihui Chuangxiang Technology Co ltd
Original Assignee
Shenzhen Zhihui Chuangxiang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhihui Chuangxiang Technology Co ltd
Priority to CN202311846910.7A
Publication of CN117493532A
Application granted
Publication of CN117493532B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The application discloses a text processing method, device, equipment and storage medium, and belongs to the technical field of infringement detection. The method comprises: obtaining a description text to be detected; performing word segmentation on the description text to be detected to obtain a word segmentation sequence to be detected; if a word to be detected in the word segmentation sequence to be detected is a single-byte language word, determining a trademark word sequence from a preset trademark library based on the part of speech of the single-byte language word, wherein the part of speech of each trademark word in the trademark word sequence corresponds to the part of speech of the single-byte language word; and querying the description text to be detected by using the trademark word sequence and deleting the part of the description text to be detected that is equivalent to a trademark word in the trademark word sequence, to obtain the description text to be detected without infringement. The method and the device address the technical problem that, when a product is put on a platform, it cannot be quickly and accurately pre-judged whether the filled-in text content infringes trademark words.

Description

Text processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of infringement detection technologies, and in particular, to a text processing method, apparatus, device, and storage medium.
Background
When a commodity is put on a platform, there is usually a document describing the commodity, and there may be descriptions of registered trademark words in the content of the document, which may involve infringement of the trademark words.
At present, in the related art, it is difficult to quickly screen trademark words against a preset trademark library containing on the order of millions of entries, so that when a product is put on a platform, the related art cannot quickly and accurately pre-judge whether the filled-in text content infringes trademark words.
Disclosure of Invention
The main purpose of the application is to provide a text processing method, a device, equipment and a storage medium, which aim to solve the technical problem that whether the filled text content relates to trademark infringement can not be rapidly and accurately pre-judged when a product is put on a platform.
To achieve the above object, in a first aspect, the present application provides a text processing method, including:
acquiring a description text to be detected;
word segmentation is carried out on the descriptive text to be detected, and a word segmentation sequence to be detected is obtained;
if the word to be detected in the word segmentation sequence to be detected is a single-byte language word, determining a trademark word sequence from a preset trademark library based on the part of speech of the single-byte language word, wherein the part of speech of each trademark word in the trademark word sequence corresponds to the part of speech of the single-byte language word;
and inquiring the description text to be detected by using the trademark word sequence, and deleting the part of the description text to be detected, which is equivalent to the trademark word in the trademark word sequence, so as to obtain the description text to be detected without infringement.
Optionally, determining the trademark word sequence from the preset trademark library based on the part of speech of the single-byte language word segmentation includes:
taking a word to be detected whose part of speech is a high-frequency word as a short-tail word to be detected;
taking a phrase consisting of the short-tail word to be detected and a preset number of words to be detected adjacent to it as a long-tail word to be detected;
querying, in the preset trademark library, trademark words whose part of speech is short-tail word, and obtaining a short-tail trademark word sequence containing the short-tail word to be detected;
querying, in the preset trademark library, trademark words whose part of speech is long-tail word, and obtaining a long-tail trademark word sequence containing the long-tail word to be detected;
and generating the trademark word sequence based on the short-tail trademark word sequence and the long-tail trademark word sequence.
Optionally, determining the trademark word sequence from the preset trademark library based on the part of speech of the single-byte language word segmentation includes:
determining a product type corresponding to the descriptive text to be detected;
acquiring a trademark word sequence from a target trademark sub-library corresponding to the product type; the preset trademark library is divided into a plurality of trademark sub-libraries according to the types of products.
Optionally, querying the description text to be detected by using the trademark word sequence and deleting the part of the description text to be detected that is equivalent to a trademark word in the trademark word sequence, to obtain the description text to be detected without infringement, includes:
identifying and deleting functional characters in the description text to be detected, and obtaining the updated description text to be detected;
if the updated description text to be detected contains all the characters of a trademark word, extracting from the description text to be detected the target phrase in which those characters are located;
and if the target phrase is equivalent to the trademark word, deleting the target phrase to obtain the description text to be detected without infringement.
Optionally, before performing word segmentation on the description text to be detected to obtain the word segmentation sequence to be detected, the method further comprises:
cleaning the description text to be detected to obtain the description text to be segmented.
Optionally, performing word segmentation on the description text to be detected to obtain the word segmentation sequence to be detected comprises:
segmenting the description text to be segmented, with the punctuation marks in it as segmentation boundaries, to obtain a first word segmentation sequence;
and, for the single-byte language characters in the first word segmentation sequence, extracting the single-byte language words in the first word segmentation sequence to obtain a single-byte word segmentation sequence to be detected.
Optionally, after obtaining the first word segmentation sequence, the method further includes:
for the multi-byte language characters in the first word segmentation sequence, extracting the multi-byte language words in the first word segmentation sequence to obtain a multi-byte word segmentation sequence to be detected.
In a second aspect, to achieve the above object, the present application provides a text processing apparatus, including:
the acquisition module is used for acquiring the description text to be detected;
the word segmentation module is used for segmenting the descriptive text to be detected to obtain a word segmentation sequence to be detected;
the trademark word confirming module is used for confirming a trademark word sequence from a preset trademark library based on the part of speech of the single-byte language word if the word to be detected in the word sequence to be detected is the single-byte language word, and the part of speech of each trademark word in the trademark word sequence corresponds to the part of speech of the single-byte language word;
and the text processing module is used for inquiring the description text to be detected by utilizing the trademark word sequence, deleting the part equivalent to the trademark word in the trademark word sequence in the description text to be detected, and obtaining the description text to be detected without infringement.
In a third aspect, to achieve the above object, the present application provides a text processing apparatus, including: a processor, a memory and a text processing program stored in the memory, which when executed by the processor, implements the steps of the text processing method described above.
In a fourth aspect, to achieve the above object, the present application provides a computer-readable storage medium having a text processing program stored thereon, which when executed by a processor, implements the above text processing method.
According to the method and the device, the description text to be detected is segmented, the whole description text is converted into the processable word segmentation sequence, a basis is provided for subsequent infringement detection, and trademark words possibly related to infringement are searched in a trademark library by narrowing the search range according to the part of speech during search, so that the problem of inefficiency of searching one by one is avoided, and the detection efficiency is improved while the accuracy is maintained. And finally deleting the content equivalent to the trademark word in the description text to be detected, so that misjudgment can be further reduced, and the accuracy of infringement detection can be improved.
Drawings
FIG. 1 is a schematic diagram of a text processing apparatus of the present application;
FIG. 2 is a schematic flow chart of a first embodiment of a text processing method according to the present application;
FIG. 3 is a second schematic flow chart of a first embodiment of the text processing method of the present application;
FIG. 4 is a third schematic flow chart of a first embodiment of the text processing method of the present application;
FIG. 5 is a fourth schematic flow chart of a first embodiment of the text processing method of the present application;
FIG. 6 is a schematic structural diagram of the text processing device of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
When a commodity is put on a platform, there is usually a document describing the commodity, and there may be descriptions of registered trademark words in the content of the document, which may involve infringement of the trademark words.
At present, in the related art, it is difficult to quickly screen trademark words against a preset trademark library containing on the order of millions of entries, so that an infringement detection method cannot accurately pre-judge, when a product is put on a platform, whether the filled-in text content infringes trademark words.
The following description will explain a text processing method, a device, equipment and a storage medium applied in the implementation of the technology of the present application:
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a text processing device in the hardware running environment according to an embodiment of the present application.
As shown in FIG. 1, the text processing device may include: a processor 1001, such as a CPU, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a voice pick-up module, such as a microphone array, and may optionally also include a display and an input unit such as a keyboard. The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
It can be understood that the text processing device may also include a network interface 1004, and the network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). Optionally, the text processing device may also include RF (Radio Frequency) circuitry, sensors, audio circuitry, Wi-Fi modules, and the like.
It will be appreciated by those skilled in the art that the text processing device structure shown in fig. 1 is not limiting of the text processing device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
Based on the above-described hardware structure of the text processing apparatus, but not limited to the above-described hardware structure, the present application provides a first embodiment of a text processing method. Referring to fig. 2, fig. 2 shows a schematic flow chart of a first embodiment of the text processing method of the present application.
It should be noted that although a logical order is depicted in the flowchart, in some cases the steps depicted or described may be performed in a different order than presented herein.
In this embodiment, the text processing method includes:
Step S100, a description text to be detected is obtained.
In this embodiment, the description text to be detected may be the description filled in when a commodity is put on a platform. It can be understood that the platform refers to an online e-commerce platform, and the description text is presented on a visual page such as a browser or an application program.
Step S200, word segmentation is carried out on the descriptive text to be detected, and a word segmentation sequence to be detected is obtained.
The word segmentation sequence to be detected in this step is obtained by segmenting the obtained description text to be detected, that is, the whole description text is divided into a sequence whose units are words to be detected, so that the subsequent infringement retrieval can be carried out. Specifically, this embodiment divides the description text to be detected into a sequence of character strings, each composed of one or more characters, namely the words to be detected. It can be understood that the words are determined according to the language of the description text to be detected.
For example, if the description text to be detected is the English text "apple is delicious", it can be divided into a word segmentation sequence to be detected consisting of the 3 words to be detected "apple", "is" and "delicious".
Step S300, if the word to be detected in the word sequence to be detected is a single-byte language word, determining a trademark word sequence from a preset trademark library based on the part of speech of the single-byte language word.
The part of speech of each trademark word in the trademark word sequence corresponds to the part of speech of the single-byte language word segmentation.
It can be appreciated that a word to be detected may be a single-byte language word or a multi-byte language word. A single-byte language word is a word consisting of characters that each occupy one byte in UTF-8 encoding, such as the English words composed of English characters mentioned in the above step. A multi-byte language word is a word consisting of characters that each occupy multiple bytes in UTF-8 encoding, such as a Japanese or Chinese word composed of Japanese or Chinese characters.
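As an illustration of the single-byte / multi-byte distinction described above, the following minimal sketch classifies a word by the UTF-8 byte length of its characters. The function names are hypothetical and are not taken from the application.

```python
def is_single_byte_word(word: str) -> bool:
    """True if the word is non-empty and every character occupies one byte in UTF-8."""
    return bool(word) and all(len(ch.encode("utf-8")) == 1 for ch in word)


def is_multi_byte_word(word: str) -> bool:
    """True if the word is non-empty and every character occupies several bytes in UTF-8."""
    return bool(word) and all(len(ch.encode("utf-8")) > 1 for ch in word)


if __name__ == "__main__":
    for w in ["apple", "苹果", "りんご"]:
        print(w, is_single_byte_word(w), is_multi_byte_word(w))
```

A word that mixes both kinds of characters satisfies neither test; how such words are handled is not specified in the text above.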
The preset trademark library is a database for storing trademark words, wherein the trademark words can be obtained from legal websites for recording intellectual property information. Of course, the preset trademark library may be the trademark library of the e-commerce platform itself, or may be a trademark library used by a third party to provide trademark inquiry service.
In this embodiment, the part of speech of a single-byte language word reflects how frequently that word occurs in the preset trademark library. For example, for the word to be detected "apple", if the preset trademark library contains many trademark words that include "apple", the word to be detected "apple" is a high-frequency word.
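The frequency-based notion of part of speech can be sketched as a simple count over the trademark library. This is only an illustration: the threshold value, the in-memory list standing in for the preset trademark library, and the function names are all assumptions, since the application does not state how "high-frequency" is quantified.

```python
def count_trademarks_containing(word: str, trademark_library: list[str]) -> int:
    """Count trademark words that contain `word` as one of their space-separated words."""
    w = word.lower()
    return sum(1 for tm in trademark_library if w in tm.lower().split())


def is_high_frequency(word: str, trademark_library: list[str], threshold: int = 3) -> bool:
    """Hypothetical rule: a word is high-frequency if at least `threshold` trademarks contain it."""
    return count_trademarks_containing(word, trademark_library) >= threshold


if __name__ == "__main__":
    library = ["Apple", "Apple Watch", "Apple Music", "Banana Boat"]
    print(is_high_frequency("apple", library))   # three trademarks contain "apple"
    print(is_high_frequency("banana", library))  # only one trademark contains "banana"
```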
It can be understood that the purpose of this step is to retrieve, from the preset trademark library, the trademark word sequence containing the word to be detected. For example, if a word to be detected is "apple", trademark words containing "apple", such as "apple watch", are searched for during retrieval. If the preset trademark library contains too many trademark words that include the word to be detected, the retrieval takes a long time, so this embodiment designs a more efficient retrieval mode for words to be detected whose part of speech is a high-frequency word, to improve retrieval efficiency. In one embodiment, referring to FIG. 4, step S300 includes:
Step S310, determining the product type corresponding to the description text to be detected.
Step S320, a trademark word sequence is obtained from a target trademark sub-library corresponding to the product type.
The preset trademark library is divided into a plurality of trademark sub-libraries according to the types of products.
In this embodiment, the product types corresponding to the description text to be detected are the product types described by the description text to be detected, and it can be understood that, for an e-commerce platform, products on the platform are classified according to a certain classification rule, for example, apples of a certain brand are classified into fruit categories. Correspondingly, a corresponding trademark sub-library can be set in a preset trademark library according to the product type of the recorded trademark words according to the product classification mode of the electronic commerce platform, and the recorded trademark words are placed in the corresponding trademark sub-library.
It can be understood that, according to the product type corresponding to the description text to be detected, the trademark word sequence is obtained from the corresponding trademark sub-library, so that the search range can be further reduced, and the search efficiency and the search accuracy are improved.
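The sub-library idea of steps S310 and S320 can be sketched as a dictionary keyed by product type, so that retrieval only scans the trademarks recorded for that type. The record layout, the product type names, and the function names below are assumptions made for illustration.

```python
from collections import defaultdict


def build_sub_libraries(records: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Group (product_type, trademark_word) records into one sub-library per product type."""
    subs: dict[str, set[str]] = defaultdict(set)
    for product_type, trademark in records:
        subs[product_type].add(trademark)
    return dict(subs)


def lookup(sub_libraries: dict[str, set[str]], product_type: str, word: str) -> list[str]:
    """Return trademarks in the target sub-library that contain `word` as a whole word."""
    candidates = sub_libraries.get(product_type, set())
    w = word.lower()
    return sorted(tm for tm in candidates if w in tm.lower().split())


if __name__ == "__main__":
    records = [
        ("fruit", "Apple Farm"),
        ("electronics", "Apple Watch"),
        ("fruit", "Golden Pear"),
    ]
    subs = build_sub_libraries(records)
    print(lookup(subs, "fruit", "apple"))
```

Only the "fruit" sub-library is scanned in this usage example, so the electronics trademark is never compared against the word.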
Step S330, the word to be detected with the part of speech being a high-frequency word is used as a short tail word to be detected.
Step S340, taking the word group consisting of the short-tail word to be detected and the preset number of word segments to be detected adjacent to the short-tail word to be detected as the long-tail word to be detected.
Step S350, inquiring trademark words with parts of speech of the short-tail words in a preset trademark library, and obtaining a trademark word sequence of the short-tail words containing the short-tail words to be detected.
Step S360, inquiring trademark words with parts of speech of long-tail words in a preset trademark library, and obtaining a long-tail word trademark word sequence containing the long-tail words to be detected.
Step S370, a trademark word sequence is generated based on the short tail word trademark word sequence and the long tail word trademark word sequence.
In this embodiment, if the part of speech of a word to be detected is a high-frequency word, that word is used as a short-tail word to be detected, and the word together with one or more adjacent words to be detected forms a long-tail word to be detected. Correspondingly, in the preset trademark library, trademark words are marked according to the number of words they contain: a trademark word containing two or more words is marked as a long-tail trademark word, and a trademark word containing one word is marked as a short-tail trademark word. During trademark word retrieval, only the long-tail trademark words matching the long-tail word to be detected and only the short-tail trademark words matching the short-tail word to be detected are retrieved. For a word to be detected whose part of speech is not a high-frequency word, all short-tail trademark words and long-tail trademark words can be searched in the normal retrieval mode.
It can be understood that, for a word to be detected whose part of speech is a high-frequency word, retrieving separately by the short-tail word to be detected and the long-tail word to be detected divides one high-intensity retrieval into two low-intensity retrievals, which can further improve the retrieval efficiency without affecting the retrieval accuracy.
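A minimal sketch of the short-tail / long-tail retrieval for a high-frequency word follows. It assumes that the long-tail word to be detected may extend to either side of the high-frequency word, that `window` stands for the preset number of adjacent words, and that a long-tail trademark "matches" when it contains the phrase as a run of whole words; none of these details are fixed by the text, and all names are hypothetical.

```python
def split_library(library: list[str]) -> tuple[set[str], set[str]]:
    """Mark one-word trademarks as short-tail and multi-word trademarks as long-tail."""
    short = {tm.lower() for tm in library if len(tm.split()) == 1}
    long_ = {tm.lower() for tm in library if len(tm.split()) >= 2}
    return short, long_


def long_tail_phrases(tokens: list[str], i: int, window: int) -> set[str]:
    """Contiguous phrases of 2..window+1 words that include tokens[i]."""
    phrases = set()
    for start in range(max(0, i - window), i + 1):
        for end in range(i + 1, min(len(tokens), i + window + 1) + 1):
            if 2 <= end - start <= window + 1:
                phrases.add(" ".join(tokens[start:end]))
    return phrases


def retrieve_for_high_frequency(tokens: list[str], i: int, library: list[str], window: int = 1) -> list[str]:
    """Short-tail lookup for the word itself plus long-tail lookup for its phrases."""
    short, long_ = split_library(library)
    lowered = [t.lower() for t in tokens]
    hits = set()
    if lowered[i] in short:
        hits.add(lowered[i])
    for phrase in long_tail_phrases(lowered, i, window):
        for tm in long_:
            if f" {phrase} " in f" {tm} ":  # whole-word containment
                hits.add(tm)
    return sorted(hits)


if __name__ == "__main__":
    library = ["Apple", "Apple Watch", "Apple Music", "Pine Apple Juice"]
    print(retrieve_for_high_frequency(["new", "apple", "watch", "band"], 1, library))
```

In this usage example the long-tail trademarks "Apple Music" and "Pine Apple Juice" are skipped, because neither contains "new apple" or "apple watch", which is the narrowing effect the paragraph above describes.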
And step S400, inquiring the description text to be detected by using the trademark word sequence, and deleting the part equivalent to the trademark word in the trademark word sequence in the description text to be detected to obtain the description text to be detected without infringement.
Retrieving the trademark word sequence through the above steps means that content identical to a trademark word in the trademark word sequence may exist in the description text to be detected. This step matches the retrieved trademark word sequence against the description text to be detected, so as to further confirm whether the description text to be detected involves infringing content.
Specifically, if content equivalent to a trademark word in the trademark word sequence exists in the description text to be detected, that content is considered to constitute infringement and needs to be deleted, so that the description text to be detected without infringement is generated. It can be appreciated that matching content that is equivalent to a trademark word, rather than only identical to it, prevents some special cases from affecting the final detection result. For example, special characters may appear in the description text to be detected, such as "A<b>pp</b>le", in which HTML (HyperText Markup Language) tags appear between characters; "A<b>pp</b>le" is equivalent to the trademark word "Apple" and should be regarded as infringing content.
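The HTML-tag case above can be sketched with a simple tag-stripping comparison. The regular expression is a deliberately rough assumption (it treats anything between "<" and ">" as a tag), and the function names are hypothetical.

```python
import re

# Rough assumption: anything of the form <...> is an HTML tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove HTML-like tags from the text."""
    return TAG_PATTERN.sub("", text)


def equivalent_ignoring_tags(fragment: str, trademark: str) -> bool:
    """Compare a fragment with a trademark word after removing tags, ignoring case."""
    return strip_tags(fragment).lower() == trademark.lower()


if __name__ == "__main__":
    print(equivalent_ignoring_tags("A<b>pp</b>le", "Apple"))
```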
It is obvious that the embodiment divides the description text to be detected through the word, and converts the whole description text into a processable word division sequence, so that a foundation is provided for subsequent infringement detection. By narrowing the search range according to the part of speech during search, searching trademark words possibly related to infringed trademark in a trademark library, the accuracy is maintained, and meanwhile, the detection efficiency is improved. By comparing and matching the description text to be detected with the retrieved trademark word sequence one by one, searching and deleting the content equivalent to the trademark word in the description text to be detected, misjudgment can be further reduced, and the accuracy of infringement detection can be improved.
Further, referring to fig. 3, before step S200, in an embodiment, the method further includes:
Step S500, cleaning the description text to be detected to obtain the description text to be segmented.
Before word segmentation, the description text to be detected needs to be cleaned. It can be understood that the description text to be detected may contain special characters, such as line feed characters or HTML tags, which may affect the word segmentation result and need to be cleaned. Specifically, the cleaning may keep the space characters between single-byte language characters, replace the line feed characters between single-byte language characters with space characters, remove the space characters and line feed characters between multi-byte language characters, and remove the HTML tags in the description text to be detected, so as to obtain the description text to be segmented. Cleaning the description text to be detected before word segmentation can improve the quality of word segmentation and thus the accuracy of the preliminary retrieval.
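The cleaning rules of step S500 can be sketched with three regular-expression passes. This is an assumption-laden illustration: multi-byte characters are approximated as non-ASCII characters, tags as anything between "<" and ">", and whitespace between a single-byte and a multi-byte character (a case the text does not cover) is simply kept.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # rough approximation of an HTML tag
MULTI_BYTE = r"[^\x00-\x7f]"          # non-ASCII, i.e. more than one byte in UTF-8


def clean_description(text: str) -> str:
    """Apply the cleaning rules: drop tags, fix whitespace by character type."""
    text = TAG_PATTERN.sub("", text)
    # Remove spaces and line feeds that sit between two multi-byte characters.
    text = re.sub(rf"(?<={MULTI_BYTE})\s+(?={MULTI_BYTE})", "", text)
    # Replace remaining line feeds (and surrounding whitespace) with one space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(clean_description("fresh\napple <b>juice</b>"))
    print(clean_description("苹 果\n好吃"))
```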
Further, referring to fig. 3, in a specific embodiment, step S200 includes:
Step S210, taking the punctuation marks in the description text to be segmented as segmentation boundaries, and segmenting the description text to be segmented to obtain a first word segmentation sequence.
In this step, the description text to be segmented is first preliminarily segmented with its punctuation marks as segmentation boundaries. For example, if the content of the description text to be segmented is "This kind of apple is delicious, it is software for only", this step divides the text into "This kind of apple is delicious" and "it is software for only", and the first word segmentation sequence is obtained.
Step S220, extracting single-byte language word segmentation in the first word segmentation sequence for single-byte language characters in the first word segmentation sequence to obtain a single-byte word segmentation sequence to be detected.
Step S230, extracting multi-byte language word segmentation in the first word segmentation sequence for multi-byte language characters in the first word segmentation sequence to obtain a multi-byte word segmentation sequence to be detected.
Step S220 and step S230 perform word segmentation on the single-byte language characters and the multi-byte language characters in the first word segmentation sequence, respectively, so as to obtain the single-byte word segmentation sequence to be detected and the multi-byte word segmentation sequence to be detected.
Specifically, for single-byte language characters, spaces can be used as segmentation boundaries, and the single-byte language characters between every two spaces are taken as one word. For example, after word segmentation is performed on "This kind of apple is delicious" in the first word segmentation sequence, a single-byte word segmentation sequence consisting of "This", "kind", "of", "apple", "is" and "delicious" is obtained.
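The two-stage segmentation for single-byte text (punctuation first, then spaces) can be sketched as follows. The punctuation set and the function names are assumptions; the example sentence is made up for illustration.

```python
import re

# Assumed punctuation set: common ASCII and full-width marks.
PUNCTUATION = re.compile(r"[,.;:!?，。；：！？、]+")


def split_on_punctuation(text: str) -> list[str]:
    """First pass: cut the text at punctuation marks, dropping empty pieces."""
    return [seg.strip() for seg in PUNCTUATION.split(text) if seg.strip()]


def split_on_spaces(segment: str) -> list[str]:
    """Second pass for single-byte text: each run between spaces is one word."""
    return segment.split()


def segment_single_byte(text: str) -> list[list[str]]:
    """Return one list of words per punctuation-delimited segment."""
    return [split_on_spaces(seg) for seg in split_on_punctuation(text)]


if __name__ == "__main__":
    print(segment_single_byte("This kind of apple is delicious, it is sweet."))
```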
For multi-byte language characters, two adjacent multi-byte characters can be taken as one multi-byte language word. For example, if the first word segmentation sequence is a Chinese sentence meaning "the apple is delicious", multi-byte word segmentation yields a multi-byte word segmentation sequence of two-character words (rendered in translation as "the apple", "the fruit is very beautiful" and "the delicious"). It will be appreciated that the multi-byte language characters may also belong to other languages, such as Japanese. Japanese includes a non-Chinese-character part (e.g., hiragana) and a Chinese-character part, that is, the encoding range of Japanese in the UTF-8 character set partly overlaps with that of Chinese and partly lies outside the Chinese encoding range. Japanese characters within the Chinese encoding range can be segmented in the same way as Chinese characters, while for Japanese characters outside the Chinese encoding range, four adjacent multi-byte characters can be taken as one word.
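The fixed-length grouping for multi-byte text can be sketched as a sliding window: two characters for Chinese characters and four for kana. Whether the windows overlap is not stated in the text, so the overlapping window here is an assumption, as are the function names and the Chinese and Japanese sample strings. The code point ranges used are the CJK Unified Ideographs block (U+4E00 to U+9FFF) and the hiragana and katakana blocks (U+3040 to U+30FF).

```python
def ngrams(run: str, n: int) -> list[str]:
    """Overlapping windows of n characters; a shorter run is returned whole."""
    if len(run) < n:
        return [run] if run else []
    return [run[i:i + n] for i in range(len(run) - n + 1)]


def is_han(ch: str) -> bool:
    """Character in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_kana(ch: str) -> bool:
    """Character in the hiragana or katakana blocks."""
    return "\u3040" <= ch <= "\u30ff"


def segment_multi_byte(run: str) -> list[str]:
    """Segment a run assumed to consist of a single script."""
    if all(is_han(c) for c in run):
        return ngrams(run, 2)
    if all(is_kana(c) for c in run):
        return ngrams(run, 4)
    return [run] if run else []


if __name__ == "__main__":
    print(segment_multi_byte("苹果很好吃"))
    print(segment_multi_byte("ありがとう"))
```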
It will be appreciated that the present embodiment uses different word segmentation methods for single-byte language characters and multi-byte language characters, and the formulation of these word segmentation methods depends on the language characteristics of each language and the characteristics of trademark words corresponding to each language. Through the word segmentation mode, the trademark word retrieval is carried out by selecting the word segments with proper lengths, so that the retrieval efficiency can be improved while the correctness of the retrieval result is ensured, and a good basis is provided for the subsequent trademark word retrieval.
It can be understood that besides language word segmentation, there are non-language word segmentation cases, for example, word segmentation composed of numbers or emoticons, and for such non-language word segmentation, it can be used as an independent word to be detected to search in a preset trademark library.
Further, referring to FIG. 5, in an embodiment, step S400 includes:
Step S410, identifying and deleting functional characters in the description text to be detected, and obtaining the updated description text to be detected.
Step S420, if the updated description text to be detected contains all the characters of a trademark word, extracting from the description text to be detected the target phrase in which those characters are located.
And step S430, deleting the target phrase if the target phrase is equivalent to the trademark word, and obtaining the description text to be detected without infringement.
In this embodiment, a functional character refers to a text character that carries no special meaning but may affect the accuracy of the infringement judgment, such as an HTML tag. The target phrase refers to the phrase in the description text to be detected that contains all the characters of a given trademark word. It is worth noting that the characters matching the trademark word appear in the phrase in the same order as in the trademark word.
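The removal of functional characters in step S410 might look like the following sketch. It assumes that functional characters are limited to HTML tags and HTML character entities, and the name `strip_functional_characters` is hypothetical; a regular expression is a simplification that does not handle every HTML construct.

```python
import html
import re

# Simplified pattern for an HTML tag (assumption: no ">" inside attribute values).
TAG_RE = re.compile(r"<[^>]+>")


def strip_functional_characters(text: str) -> str:
    """Remove HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = TAG_RE.sub(" ", text)      # replace each tag with a space
    unescaped = html.unescape(without_tags)   # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", unescaped).strip()
```

Replacing each tag with a space rather than an empty string keeps words on either side of a tag from being merged, which would otherwise create spurious matches.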
Specifically, the description text to be detected is first pre-cleaned to remove functional characters that may affect infringement detection, which yields the updated description text to be detected. The characters in the updated description text to be detected are then compared one by one with each trademark word in the trademark word sequence. If the description text to be detected contains the characters of a trademark word in the same order, the target phrase containing those characters is determined. The target phrase may be determined by taking spaces, line feeds, indentation, punctuation marks, encoding ranges and the like as boundaries and locating the upper and lower bounds of the phrase in which the characters are located. For example, suppose there is a trademark word "hair" and a piece of content in the description text to be detected is "how to catch air". Taking spaces as boundaries, the target phrase in the description text to be detected is determined to be "catch air". Since "catch air" is not equivalent to "hair", it does not constitute infringement and need not be deleted from the description text to be detected.
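The boundary-based extraction of the target phrase can be illustrated with the sketch below. It rests on an interpretation that is consistent with the "hair" and "catch air" example but is not stated explicitly in the text: the characters of the trademark word are matched consecutively once boundary characters are ignored, and the matched span is then widened to the nearest boundaries on both sides. The name `find_target_phrases` and the choice of whitespace and ASCII punctuation as boundaries are assumptions.

```python
import string

# Assumed boundary characters: whitespace and ASCII punctuation.
BOUNDARIES = set(string.whitespace) | set(string.punctuation)


def find_target_phrases(text: str, trademark: str) -> list[str]:
    """Return each phrase in text that contains all characters of trademark in order."""
    # Keep (original index, lowercased character) for every non-boundary character.
    kept = [(i, ch.lower()) for i, ch in enumerate(text) if ch not in BOUNDARIES]
    compact = "".join(ch for _, ch in kept)
    needle = "".join(ch.lower() for ch in trademark if ch not in BOUNDARIES)
    phrases: list[str] = []
    if not needle:
        return phrases
    pos = compact.find(needle)
    while pos != -1:
        start = kept[pos][0]
        end = kept[pos + len(needle) - 1][0]
        # Widen the span to the nearest boundary on each side.
        while start > 0 and text[start - 1] not in BOUNDARIES:
            start -= 1
        while end + 1 < len(text) and text[end + 1] not in BOUNDARIES:
            end += 1
        phrases.append(text[start:end + 1])
        pos = compact.find(needle, pos + 1)
    return phrases
```

With the example from the text, `find_target_phrases("how to catch air", "hair")` returns `["catch air"]`, which differs from the trademark word and so would not be deleted.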
In this embodiment, whether the target phrase and the trademark word are equivalent may be determined according to preset equivalence rules. For example, "&" is equivalent to "and", so "Bath And Body Works" is detected as "Bath & Body Works"; "-" is equivalent to a space, so "e moji" is detected as "e-moji".
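The equivalence rules mentioned above could be applied by normalising both strings before comparing them, as in this sketch. Only the two rules given in the text are encoded; case-insensitive comparison and the names `normalize` and `is_equivalent` are assumptions added for illustration.

```python
import re


def normalize(phrase: str) -> str:
    """Apply the example equivalence rules and return a canonical form."""
    s = phrase.lower()            # assumption: comparison ignores case
    s = s.replace("&", " and ")   # rule from the text: "&" is equivalent to "and"
    s = s.replace("-", " ")       # rule from the text: "-" is equivalent to a space
    return re.sub(r"\s+", " ", s).strip()


def is_equivalent(target_phrase: str, trademark: str) -> bool:
    """Return True if the two strings are equal after normalisation."""
    return normalize(target_phrase) == normalize(trademark)
```

Under these rules "Bath And Body Works" matches "Bath & Body Works" and "e moji" matches "e-moji", whereas "catch air" does not match "hair".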
In this embodiment, the description text to be detected is compared with each trademark word in the retrieved trademark word sequence to determine whether it contains a part equivalent to a trademark word, so that infringement detection of the description text to be detected is performed accurately. Determining the target phrase by boundary confirmation also avoids the distortion of the detection result that would arise if equivalence were judged from only a single word or a part of a phrase.
In general, by combining text analysis with trademark library retrieval, this embodiment detects and handles infringement in the description text to be detected efficiently, and avoids infringement that might otherwise occur when a commodity is listed for sale.
It will of course be readily appreciated that, although a logical order is illustrated in the present embodiment, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
Referring to fig. 6, based on the same inventive concept, the present application also provides a text processing apparatus, the apparatus comprising:
the acquisition module is used for acquiring the description text to be detected;
the word segmentation module is used for segmenting the descriptive text to be detected to obtain a word segmentation sequence to be detected;
the trademark word confirming module is used for confirming a trademark word sequence from a preset trademark library based on the part of speech of the single-byte language word if the word to be detected in the word sequence to be detected is the single-byte language word, and the part of speech of each trademark word in the trademark word sequence corresponds to the part of speech of the single-byte language word;
and the text processing module is used for inquiring the description text to be detected by utilizing the trademark word sequence, deleting the part equivalent to the trademark word in the trademark word sequence in the description text to be detected, and obtaining the description text to be detected without infringement.
It should be noted that, for the technical effects achieved by the embodiments of the text processing apparatus, reference may be made to the various implementations of the text processing method in the foregoing embodiments, which are not described here again.
In addition, an embodiment of the present application also provides a computer storage medium on which a text processing program is stored; when the text processing program is executed by a processor, the steps of the text processing method described above are implemented. A detailed description is therefore not repeated here, and the description of the beneficial effects, which are the same as those of the method, is likewise omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, please refer to the description of the method embodiments of the present application. As an example, the program instructions may be deployed to be executed on one computing device, on multiple computing devices at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a computer-readable storage medium and which, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It should be further noted that the above-described apparatus embodiments are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present application, the connection relation between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components and the like. Generally, functions performed by computer programs can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application a software implementation is the preferred embodiment in many cases. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk of a computer, and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. A method of text processing, the method comprising:
acquiring a description text to be detected;
word segmentation is carried out on the description text to be detected, and a word segmentation sequence to be detected is obtained;
if the word to be detected in the word sequence to be detected is a single-byte language word, determining a trademark word sequence from a preset trademark library based on the part of speech of the single-byte language word, wherein the part of speech of each trademark word in the trademark word sequence corresponds to the part of speech of the single-byte language word;
and inquiring the description text to be detected by using the trademark word sequence, and deleting the part equivalent to the trademark word in the trademark word sequence in the description text to be detected to obtain the description text to be detected without infringement.
2. The text processing method according to claim 1, wherein the determining the trademark word sequence from the preset trademark library based on the part of speech of the single-byte language word segmentation includes:
taking the word to be detected, the part of speech of which is a high-frequency word, as a short tail word to be detected;
taking the phrase consisting of the short tail word to be detected and the preset number of word segments to be detected adjacent to the short tail word to be detected as a long tail word to be detected;
inquiring trademark words with parts of speech of short-tail words in the preset trademark library to obtain a short-tail word trademark word sequence containing the short-tail words to be detected;
inquiring trademark words with parts of speech of long-tail words in the preset trademark library to obtain a long-tail word trademark word sequence containing the long-tail words to be detected;
and generating the trademark word sequence based on the short tail word trademark word sequence and the long tail word trademark word sequence.
3. The text processing method according to claim 1, wherein the determining the trademark word sequence from the preset trademark library based on the part of speech of the single-byte language word segmentation includes:
determining the product type corresponding to the description text to be detected;
acquiring the trademark word sequence from a target trademark sub-library corresponding to the product type; the preset trademark library is divided into a plurality of trademark sub-libraries according to the types of products.
4. The text processing method according to claim 1, wherein the querying the description text to be detected by using the trademark word sequence, and deleting the part equivalent to the trademark word in the trademark word sequence in the description text to be detected to obtain the description text to be detected without infringement, includes:
identifying and deleting functional characters in the description text to be detected, and obtaining updated description text to be detected;
if all the characters of the trademark word are in the updated description text to be detected, extracting a target phrase where all the characters are located from the description text to be detected;
and if the target phrase is equivalent to the trademark word, deleting the target phrase to obtain the description text to be detected without infringement.
5. The text processing method according to claim 1, wherein before the word segmentation is carried out on the description text to be detected to obtain the word segmentation sequence to be detected, the method further comprises:
and cleaning the description text to be detected to obtain the description text to be segmented.
6. The text processing method according to claim 5, wherein the word segmentation carried out on the description text to be detected to obtain the word segmentation sequence to be detected comprises:
dividing the description text to be segmented by taking punctuation marks in the description text to be segmented as dividing limits to obtain a first segmentation sequence;
and extracting single-byte language word segmentation in the first word segmentation sequence for the single-byte language characters in the first word segmentation sequence to obtain a single-byte word segmentation sequence to be detected.
7. The text processing method of claim 6, wherein after the obtaining the first word segmentation sequence, the method further comprises:
and extracting the multi-byte language word segmentation in the first word segmentation sequence for the multi-byte language characters in the first word segmentation sequence to obtain a multi-byte word segmentation sequence to be detected.
8. A text processing apparatus, characterized in that the text processing apparatus comprises:
the acquisition module is used for acquiring the description text to be detected;
the word segmentation module is used for segmenting the description text to be detected to obtain a word segmentation sequence to be detected;
the trademark word confirming module is used for confirming a trademark word sequence from a preset trademark library based on the part of speech of the single-byte language word segmentation if the word to be detected in the word sequence to be detected is the single-byte language word segmentation, and the part of speech of each trademark word in the trademark word sequence corresponds to the part of speech of the single-byte language word segmentation;
and the text processing module is used for inquiring the description text to be detected by utilizing the trademark word sequence, deleting the part equivalent to the trademark word in the trademark word sequence in the description text to be detected, and obtaining the description text to be detected without infringement.
9. A text processing apparatus, comprising: a processor, a memory and a text processing program stored in the memory, which when executed by the processor, implements the steps of the text processing method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it has stored thereon a text processing program which, when executed by a processor, implements the text processing method according to any one of claims 1 to 7.
CN202311846910.7A 2023-12-29 2023-12-29 Text processing method, device, equipment and storage medium Active CN117493532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311846910.7A CN117493532B (en) 2023-12-29 2023-12-29 Text processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311846910.7A CN117493532B (en) 2023-12-29 2023-12-29 Text processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117493532A true CN117493532A (en) 2024-02-02
CN117493532B CN117493532B (en) 2024-03-29

Family

ID=89676820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311846910.7A Active CN117493532B (en) 2023-12-29 2023-12-29 Text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117493532B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040158559A1 (en) * 2002-10-17 2004-08-12 Poltorak Alexander I. Apparatus and method for identifying potential patent infringement
CN106909630A (en) * 2017-01-26 2017-06-30 武汉奇米网络科技有限公司 Filtering sensitive words method and system based on dynamic dictionary
US20200250374A1 (en) * 2019-07-26 2020-08-06 Alibaba Group Holding Limited Blockchain-based text similarity detection method, apparatus and electronic device
CN112364637A (en) * 2020-11-30 2021-02-12 北京天融信网络安全技术有限公司 Sensitive word detection method and device, electronic equipment and storage medium
CN113032524A (en) * 2021-03-23 2021-06-25 平安科技(深圳)有限公司 Trademark infringement identification method, terminal device and storage medium
CN113094543A (en) * 2021-04-27 2021-07-09 杭州网易云音乐科技有限公司 Music authentication method, device, equipment and medium
CN116645049A (en) * 2023-05-24 2023-08-25 杨之运 E-commerce infringement analysis system and method

Also Published As

Publication number Publication date
CN117493532B (en) 2024-03-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant