CN115186647A - Text similarity detection method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115186647A
Authority
CN
China
Prior art keywords
sequence
target
type
text
participle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210651882.2A
Other languages
Chinese (zh)
Inventor
刘祥业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tenpay Payment Technology Co Ltd
Original Assignee
Tenpay Payment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tenpay Payment Technology Co Ltd filed Critical Tenpay Payment Technology Co Ltd
Priority to CN202210651882.2A priority Critical patent/CN115186647A/en
Publication of CN115186647A publication Critical patent/CN115186647A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Abstract

The embodiment of the application provides a text similarity detection method and device, electronic equipment and a computer readable storage medium, and relates to the technical field of natural language processing. The method comprises the following steps: acquiring at least two texts to be detected; performing word segmentation on each text, and obtaining a target sequence of the text according to the word segmentation result; merging each target element in order with a preset number of subsequent target elements to obtain at least one merged element corresponding to the target element, and obtaining a merged element sequence of the text according to all the target elements and the at least one merged element; obtaining a frequency vector of each text; and, for any two texts, obtaining the text similarity of the two texts according to their frequency vectors. The embodiment of the application is better suited to a network name auditing scene: it can effectively uncover batch malicious registration and group aggregation behavior behind nicknames and issue risk early warnings.

Description

Text similarity detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text similarity detection method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
In the field of natural language processing, the similarity of different texts frequently needs to be measured. In the related art, cosine similarity is often used to calculate the similarity between two text segments, and the process can be summarized as follows: 1. word segmentation; 2. listing all words; 3. word segmentation coding; 4. word frequency vectorization; 5. calculating the similarity of the two sentences by applying a cosine function.
The related art generally segments sentences without considering word order. For example, for "you love me" and "I love you", the similarity calculated by the related-art method is 1, although the two sentences express different meanings. When the related art is applied to a registration name auditing scenario, malicious registration and batch registration often go unidentified.
Disclosure of Invention
Embodiments of the present application provide a method and an apparatus for detecting text similarity, an electronic device, a computer-readable storage medium, and a computer program product, which can solve the above problems in the prior art. The technical scheme is as follows:
according to an aspect of the embodiments of the present application, there is provided a method for detecting text similarity, including:
acquiring at least two texts to be detected;
for each text, segmenting the text, and obtaining a target sequence of the text according to a segmentation result; the target elements in the target sequence are used for representing at least one of the participles at corresponding positions in the participle result or attribute information of the participles;
for the target sequence of each text, sequentially combining each target element with a subsequent preset number of target elements according to the sequence of each target element in the target sequence to obtain at least one combined element corresponding to the target element, and obtaining a combined element sequence of the text according to all the target elements and the at least one combined element;
coding the merging element sequence of each text, and obtaining the frequency vector of each text according to the coding result of the merging element sequence of each text; the feature of each dimension in the frequency count vector is used for representing the frequency count of the corresponding element in the total merging sequence in the merging element sequence of the text, the total merging sequence is obtained by sequencing the target element and the merging element in the merging element sequences of all the texts, and no repeated element exists in the total merging sequence;
and for any two texts, obtaining the text similarity of any two texts according to the frequency vectors of any two texts.
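The last two steps above (frequency vectors over a total merging sequence, then pairwise similarity) can be illustrated with a minimal Python sketch. It is not the implementation of the application. The function names are hypothetical, the two merged element sequences are hand-written inputs for the token sequences ['I', 'love', 'you'] and ['you', 'love', 'I'] with one subsequent element merged, the total merging sequence is assumed to be ordered by first occurrence, and cosine similarity is assumed as the similarity measure.

```python
import math
from collections import Counter

def total_sequence(merged_seqs):
    """Deduplicated elements of all merged sequences (first-occurrence order is an assumption)."""
    seen = {}
    for seq in merged_seqs:
        for elem in seq:
            seen.setdefault(elem, len(seen))
    return list(seen)

def frequency_vector(merged_seq, total):
    """One dimension per element of the total sequence; the value is its count in merged_seq."""
    counts = Counter(merged_seq)
    return [counts[elem] for elem in total]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Hand-written merged element sequences (hypothetical inputs).
merged_1 = ["I", "I love", "love", "love you", "you"]
merged_2 = ["you", "you love", "love", "love I", "I"]

total = total_sequence([merged_1, merged_2])
vec_1 = frequency_vector(merged_1, total)
vec_2 = frequency_vector(merged_2, total)
similarity = cosine(vec_1, vec_2)
print(total)
print(vec_1, vec_2, round(similarity, 3))
```

Because the merged elements 'I love' and 'you love' differ, the two reordered texts no longer share every dimension, and the similarity falls below 1.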
According to another aspect of the embodiments of the present application, there is provided an apparatus for detecting text similarity, including:
the text acquisition module is used for acquiring at least two texts to be detected;
the target sequence obtaining module is used for segmenting the text for each text and obtaining a target sequence of the text according to the segmentation result; the target elements in the target sequence are used for representing at least one of the participles at corresponding positions in the participle result or attribute information of the participles;
the merging module is used for sequentially merging each target element with a subsequent preset number of target elements according to the sequence of each target element in the target sequence for the target sequence of each text to obtain at least one merging element corresponding to the target element, and obtaining a merging element sequence of the text according to all the target elements and the at least one merging element;
the frequency vector module is used for coding the merging element sequence of each text and obtaining the frequency vector of each text according to the coding result of the merging element sequence of each text; the feature of each dimension in the frequency count vector is used for representing the frequency count of the corresponding element in the total merging sequence in the merging element sequence of the text, the total merging sequence is obtained by sequencing the target element and the merging element in the merging element sequences of all the texts, and no repeated element exists in the total merging sequence;
and the similarity calculation module is used for obtaining the text similarity of any two texts according to the frequency vectors of any two texts.
As an alternative embodiment, the target sequence obtaining module comprises:
the word segmentation sequence submodule is used for obtaining a word segmentation sequence of the text according to the word segmentation result;
and the target sequence sub-module is used for obtaining a target sequence of the text according to the word segmentation sequence of the text.
As an alternative embodiment, the word segmentation sequence submodule includes:
the initial word segmentation unit is used for obtaining an initial word segmentation sequence of the text according to the word segmentation result;
the continuous word segmentation judging module is used for determining the initial word segmentation sequence as a word segmentation sequence if continuous first target type word segmentation does not exist in the initial word segmentation sequence;
if continuous participles of the first target type exist in the initial participle sequence, replacing the continuous participles of the first target type as a whole with a preset number of preset participles, and taking the replaced initial participle sequence as the participle sequence.
As an optional embodiment, the continuous word segmentation judging module is specifically configured to:
if continuous first-type participles exist in the initial participle sequence, replacing the continuous first-type participles with first target participles, wherein the first target participles are used for representing the number of the continuous first-type participles;
and if continuous participles of a second specified type exist in the initial participle sequence, replacing the continuous participles of the second specified type with second target participles, wherein the second target participles are a combination of the continuous participles of the second specified type.
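One possible reading of the two replacement rules above is sketched below in Python. This is an illustration under stated assumptions only: "continuous" is taken to mean a run of two or more adjacent participles of the same type; the type labels 'en', 'num' and 'sym', the helper `type_of`, and the format of the count token (for example 'symx3') are all hypothetical and are not specified in this section.

```python
from itertools import groupby

def type_of(token):
    """Hypothetical type classifier: ASCII letters, digits, everything else."""
    if token.isascii() and token.isalpha():
        return "en"
    if token.isdigit():
        return "num"
    return "sym"

def collapse_runs(tokens, count_types, join_types):
    """Replace runs of same-type participles.

    Runs whose type is in count_types become one token encoding the run length;
    runs whose type is in join_types become one token that concatenates the run.
    """
    result = []
    for tok_type, run in groupby(tokens, key=type_of):
        run = list(run)
        if len(run) > 1 and tok_type in count_types:
            result.append(f"{tok_type}x{len(run)}")   # count token (format is an assumption)
        elif len(run) > 1 and tok_type in join_types:
            result.append("".join(run))               # combined token
        else:
            result.extend(run)                        # leave other runs unchanged
    return result

collapsed = collapse_runs(list("ab!!!12"), count_types={"sym"}, join_types={"en"})
print(collapsed)
```

Which types play the role of the first type and the second specified type is left as a parameter here, since the text above does not fix them.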
As an alternative embodiment, the target sequence submodule comprises:
the word segmentation type unit is used for determining the type of each word segmentation in the word segmentation sequence;
the word type sequence unit is used for obtaining a word type sequence of the text according to the type of each participle, and each element in the word type sequence is used for indicating the type of the participle at the corresponding position in the word type sequence;
and the target sequence unit is used for obtaining a target sequence of the text according to the word type sequence of the text, and the attribute information of the participle comprises the type of the participle.
As an alternative embodiment, the target sequence unit comprises:
the type judging unit is used for determining the sequence of the elements in each type element belonging to the second target type in the word type sequence if the type represented by the elements belongs to the second target type; if the type represented by the element does not belong to the second target type, determining the order of the element in each element which does not belong to the second target type in the word type sequence;
and the sequence unit is used for obtaining the target sequence according to the sequence corresponding to each element in the word type sequence, and the attribute information of the participle also comprises the sequence of the element corresponding to the participle.
As an alternative embodiment, the word segmentation type unit is specifically configured to:
and for each participle in the participle sequence, if the participle belongs to multiple candidate types, determining the word frequency of the participle belonging to the candidate type in the participle sequence for each candidate type, and determining the type of the participle according to the candidate type corresponding to the highest word frequency.
As an alternative embodiment, the segmentation type unit comprises:
the first situation determining unit is used for taking the candidate type corresponding to the highest word frequency as the type of the participle if the candidate type corresponding to the highest word frequency is unique;
and the second condition determining unit is used for taking the candidate type with the highest priority and the highest word frequency as the type of the participle according to the predetermined type priority if the candidate type corresponding to the highest word frequency is not unique.
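The type selection rule above (highest word frequency, with a predetermined priority as tie-breaker) can be sketched as follows. This is a minimal sketch under assumptions: the word frequency of a candidate type is taken to be the number of participles in the sequence that may belong to that type, and the candidate table, the type labels 'T1' and 'T2', and the function name are hypothetical.

```python
from collections import Counter

def resolve_types(tokens, candidates, priority):
    """Pick one type per token.

    candidates maps a token to its list of candidate types; priority lists
    the types from highest to lowest priority.
    """
    # Frequency of each candidate type over the whole participle sequence.
    counts = Counter(t for tok in tokens for t in candidates[tok])
    resolved = []
    for tok in tokens:
        cands = candidates[tok]
        best = max(counts[t] for t in cands)
        tied = [t for t in cands if counts[t] == best]
        # A unique maximum is returned directly; ties fall back to priority.
        resolved.append(min(tied, key=priority.index))
    return resolved

candidates = {"a": ["T1"], "b": ["T1", "T2"], "c": ["T2"], "d": ["T2"]}
priority = ["T1", "T2"]

unique_case = resolve_types(["a", "b", "c", "d"], candidates, priority)
tie_case = resolve_types(["a", "b", "c"], candidates, priority)
print(unique_case, tie_case)
```

In the first call the ambiguous token 'b' takes the more frequent type; in the second call both candidate types are equally frequent, so the higher-priority type is chosen.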
According to another aspect of embodiments of the present application, there is provided an electronic device including a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the text similarity detection method described above.
According to still another aspect of embodiments of the present application, there is provided a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method for detecting text similarity described above.
According to an aspect of the embodiments of the present application, there is provided a computer program product, including a computer program, which when executed by a processor implements the steps of the method for detecting text similarity described above.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
At least two texts to be detected are acquired, each text is segmented, and a target sequence of the text is obtained according to the word segmentation result. For each target sequence, each target element is merged in order with a preset number of subsequent target elements to obtain at least one merged element corresponding to the target element, and a merged element sequence of the text is obtained according to all the target elements and the at least one merged element. This represents the order of the target elements in the text more specifically and at a finer granularity, makes it possible to identify subtle differences between texts containing repeated characters, and lays a foundation for analyzing the semantics of the texts more accurately. The merged element sequence of each text is then encoded, and the frequency vector of each text is obtained according to the encoding result; the feature of each dimension in the frequency vector represents the frequency, in the merged element sequence of the text, of the corresponding element in the total merging sequence. For any two texts, the text similarity is obtained according to their frequency vectors. The result is more accurate and better suited to a network name auditing scene: batch malicious registration and group aggregation behavior behind nicknames can be effectively uncovered, and risk early warnings can be issued.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram illustrating cosine similarity in the related art;
FIG. 2 is a schematic illustration of an environment in which an embodiment of the present application may be implemented;
FIG. 3 is a schematic flowchart of a text similarity detection method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an interface for modifying a user name according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of obtaining a merged element sequence according to a target sequence according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a process for determining a position of an element according to an embodiment of the present application;
FIG. 7a is a schematic diagram of an initial interface for setting a detection mode according to an embodiment of the present application;
FIG. 7b is a schematic interface diagram illustrating a second detection mode according to an embodiment of the present application;
FIG. 7c is a schematic interface diagram after a detection party operates a confirmation control according to an embodiment of the present application;
FIG. 8 is a schematic interface diagram illustrating text similarity when determining two detection modes according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a text similarity detection system in an application scenario embodiment of the present application;
FIG. 10 is a schematic structural diagram of a device for detecting text similarity according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "A and/or B" may be implemented as "A", or as "B", or as "A and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms referred to in this application will first be introduced and explained:
Cosine similarity: a vector is plotted in a vector space, such as the most common two-dimensional space, according to its coordinate values. Cosine similarity measures the similarity between two vectors by the cosine of the angle between them. The cosine of a 0-degree angle is 1, and the cosine of any other angle is not more than 1; its minimum value is -1. The cosine of the angle between two vectors thus indicates whether the two vectors point in approximately the same direction.
Please refer to fig. 1, which schematically illustrates the principle of cosine similarity. For two vectors $\vec{a}$ and $\vec{b}$, the cosine of the included angle $\theta$ is
$$\cos\theta = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert\,\lVert\vec{b}\rVert}$$
where the numerator is the inner product of the two vectors and the denominator is the product of their lengths (moduli). The larger the included angle of the two vectors, the farther apart they are and the smaller the cosine similarity; the distance is greatest when the included angle is 180 degrees. The smaller the included angle, the closer the vectors are and the greater the cosine similarity; the distance is smallest when the included angle is 0 degrees, in which case the two vectors point in exactly the same direction.
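As a small numeric illustration of the formula above, the following Python sketch computes the cosine of the angle between two vectors for the three reference cases (0, 90 and 180 degrees). The function name `cosine_similarity` and the sample vectors are chosen here for illustration and do not come from the application.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))       # inner product (numerator)
    norm_a = math.sqrt(sum(x * x for x in a))    # length of a
    norm_b = math.sqrt(sum(y * y for y in b))    # length of b
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 0], [2, 0])    # included angle of 0 degrees
perpendicular = cosine_similarity([1, 0], [0, 3])     # included angle of 90 degrees
opposite = cosine_similarity([1, 0], [-1, 0])         # included angle of 180 degrees
print(same_direction, perpendicular, opposite)
```

Only the direction of the vectors matters: [1, 0] and [2, 0] differ in length but still give the maximum value.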
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between a person and a computer using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. It can be applied to machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, voice recognition, Chinese OCR and the like.
Word segmentation refers to the process of recombining a continuous character sequence into a sequence of semantically independent words according to a certain specification.
The process of calculating the similarity between two texts by using the cosine similarity in the related art can be summarized as follows: 1. word segmentation; 2. listing all words; 3. word segmentation coding; 4. vectorizing word frequency; 5. and measuring the similarity of the two sentences by applying a cosine function.
The following is exemplified by a sample:
Sentence A: I need to write patents with great concentration;
Sentence B: I need to concentrate on writing patents;
1. Word segmentation: phrases are not considered here, and each sentence is divided into single characters:
listA = ['I', 'to', 'special', 'heart', 'write', 'special', 'benefit'];
listB = ['I', 'to', 'special', 'inject', 'write', 'special', 'benefit'];
2. Listing all words: listA and listB are put into one set, resulting in:
set = {'I', 'to', 'special', 'heart', 'inject', 'write', 'benefit'};
The set is converted into a dictionary dict, in which the key is a word in the set and the value is the position of the word in the set, starting from 0:
dict = {'I': 0, 'to': 1, 'special': 2, 'heart': 3, 'inject': 4, 'write': 5, 'benefit': 6}; it can be seen that the word 'I' is ranked first in the set, with index 0.
3. Word segmentation coding: listA and listB are encoded by converting each word into its position in the set. After conversion:
listAcode = [0, 1, 2, 3, 5, 2, 6];
listBcode = [0, 1, 2, 4, 5, 2, 6];
4. Word frequency vectorization: element frequency statistics are performed on listAcode and listBcode, that is, the number of occurrences of each participle is counted. The final results are as follows:
listAcodeOneHot = [1, 1, 2, 1, 0, 1, 1];
listBcodeOneHot = [1, 1, 2, 0, 1, 1, 1];
This completes listing all words, coding the words and calculating the word frequencies;
5. After the frequency vectors of the two sentences are obtained, the cosine of the included angle between the two vectors is calculated; the larger the value, the higher the similarity.
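$$\cos\theta = \frac{A\cdot B}{\lVert A\rVert\,\lVert B\rVert} = \frac{8}{3\times 3} \approx 0.889$$
where $A$ and $B$ denote the frequency vectors listAcodeOneHot and listBcodeOneHot.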
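The five related-art steps above can be reproduced with a short Python sketch. It is a minimal illustration rather than a reference implementation: the tokens are English glosses of the single characters of sentences A and B, the dictionary order is written out explicitly to follow step 2, and the helper names `encode`, `frequency_vector` and `cosine` are hypothetical.

```python
import math

def encode(tokens, vocab):
    """Step 3: replace each token by its position in the dictionary."""
    return [vocab[tok] for tok in tokens]

def frequency_vector(codes, size):
    """Step 4: count how often each dictionary position occurs."""
    vec = [0] * size
    for code in codes:
        vec[code] += 1
    return vec

def cosine(u, v):
    """Step 5: cosine of the angle between two frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Step 1: single-character segmentation (English glosses).
list_a = ["I", "to", "special", "heart", "write", "special", "benefit"]
list_b = ["I", "to", "special", "inject", "write", "special", "benefit"]

# Step 2: all distinct words, indexed from 0 in the order used in the text.
words = ["I", "to", "special", "heart", "inject", "write", "benefit"]
vocab = {word: index for index, word in enumerate(words)}

codes_a = encode(list_a, vocab)
codes_b = encode(list_b, vocab)
vec_a = frequency_vector(codes_a, len(vocab))
vec_b = frequency_vector(codes_b, len(vocab))
similarity = cosine(vec_a, vec_b)
print(codes_a, codes_b)
print(vec_a, vec_b, round(similarity, 3))
```

Note that the frequency vectors discard token order entirely, which is the limitation discussed next.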
The related art has the following problems in calculating the similarity:
1) Word order is not considered when a sentence is segmented. For example, for "you love me" and "I love you", the similarity calculated according to the above scheme is 1, although the two sentences express different meanings;
2) The length and position of repeated characters are not considered. For example, "haha" and "hahaha" have a similarity of 1;
3) The language type is not considered.
When the related-art text similarity method is applied to the detection of registration names, the names contain few characters and often mix language types (for example, Chinese with English, or letters with digits) and contain repeated characters, so the related art cannot accurately identify batch malicious registration and group aggregation behavior.
The text similarity detection method, the text similarity detection device, the electronic equipment, the computer readable storage medium and the computer program product provided by the application aim to solve the technical problems in the prior art.
The technical solutions of the embodiments of the present application and the technical effects produced by the technical solutions of the present application will be described below through descriptions of several exemplary embodiments. It should be noted that the following embodiments may be referred to, referred to or combined with each other, and the description of the same terms, similar features, similar implementation steps, etc. in different embodiments is not repeated.
Refer to fig. 2, which illustrates a schematic diagram of an implementation environment of an embodiment provided in the present application. The implementation environment may be implemented as a text similarity detection system. The implementation environment may include: a terminal device 10 and a server 20.
The text similarity detection system realizes detection of text similarity through the terminal device 10 and the server 20.
The terminal device 10 may be an electronic device such as a mobile phone, a tablet Computer, a PC (Personal Computer), a wearable device, an in-vehicle terminal device, a VR (Virtual Reality) device, and an AR (Augmented Reality) device, which is not limited in this application. A client running a target application may be installed in the terminal device 10. For example, the target application may be any type of application that needs to register a network name, such as a game application, an audio/video playing application, a forum application, a chat application, and the like, and may also be an application that performs lyric/subtitle detection, where the type of application compares the lyric/subtitle to be disclosed with the already disclosed lyric/subtitle corresponding to the same multimedia file by acquiring the lyric/subtitle to be disclosed, and if the text similarity is too high, prompts or cancels disclosure, thereby protecting the rights and interests of the copyright side of the already disclosed lyric/subtitle. The type of the target application is not limited in this application.
The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing a cloud computing service. The server 20 may be a background server of the target application program, and is used for providing a background service for the client of the target application program. The server 20 is used for providing the text publisher with the detection result of the text similarity.
The embodiment of the application provides a text similarity detection method, as shown in fig. 3, the method includes:
S101, acquiring at least two texts to be detected.
The text of the embodiment of the application may come from documents, information, web pages, and the like. In some embodiments, the text may also be a registered network name, such as a nickname of a blog, a user name of a chat application, a user name of an audio/video playing application, and the like.
Referring to fig. 4, which exemplarily shows an interface for modifying personal information according to an embodiment of the present application, a user may edit information such as the user avatar, the user name, and the personal signature through the interface. The interface includes a text box 101 for entering the user name. When an object wants to register or modify a user name in an application, the object enters the new user name in the text box 101. In response to the confirm control 102 being triggered, the new user name is taken as a text and a text similarity check is performed; specifically, the new user name may be compared for text similarity with user names that have already been identified as malicious registrations. If the text similarity check passes, the new user name is determined not to be a malicious registration, and a popup window 103 is displayed on the interface to inform the user that the user name was edited successfully.
S102, for each text, performing word segmentation on the text, and obtaining a target sequence of the text according to the word segmentation result.
The method for segmenting words in text is not limited in particular in this application, and may be, for example, a dictionary word segmentation based algorithm, a statistical-based machine learning algorithm, or the like. The word segmentation result of the present application may be all single characters in the text, and the type of the character may be simplified chinese, traditional chinese, english alphabet, japanese, numeral, special character, emoticon, and the like, and the embodiment of the present application is not particularly limited.
After the segmentation result is obtained, a target sequence of the text may be obtained based on the segmentation result. Elements in the target sequence in the embodiment of the present application are referred to as target elements, and the target elements are used for representing at least one of the participles at corresponding positions in the participle result or attribute information of the participles.
Taking the text G "lost King 2022!!!" as an example, the word segmentation result X can be expressed as:
X = ['lose', 'fall', 'of', 'K', 'i', 'n', 'g', '2', '0', '2', '2', '2', '!', '!', '!']
If each target element in the target sequence is the participle at the corresponding position, the target sequence Y is the same as the word segmentation result X.
In the embodiment of the present application, the attribute information of the segmented word is not specifically limited, and may be, for example, a language type of the segmented word, and the text G includes simplified chinese, english, numbers, and special characters, so when the language type of the segmented word is used to identify the target element, the target sequence is:
Y=['L1','L1','L1','L2','L2','L2','L2','L4','L3','L3','L3','L3','L4','L4','L4']
wherein L1 represents Simplified Chinese, L2 represents English, L3 represents digits, and L4 represents special characters.
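A target sequence of language types such as Y can be derived from single-character participles with a simple classifier. The Python sketch below is an illustration under assumptions, not the classification used by the application: the labels L1 to L4 follow the text above, but the test for L1 uses the CJK Unified Ideographs range, which cannot tell Simplified from Traditional Chinese, and the sample input string is hypothetical.

```python
def language_type(ch):
    """Map a single character to one of the labels L1..L4 (heuristic, see assumptions)."""
    if "\u4e00" <= ch <= "\u9fff":       # CJK Unified Ideographs, used here as a stand-in for L1
        return "L1"
    if ch.isascii() and ch.isalpha():    # ASCII letters
        return "L2"
    if ch.isdigit():                     # digits
        return "L3"
    return "L4"                          # everything else is treated as a special character

def type_sequence(tokens):
    """Target sequence in which each element is the language type of the participle."""
    return [language_type(tok) for tok in tokens]

sample_tokens = list("你好ab#42!")       # hypothetical single-character segmentation
sample_types = type_sequence(sample_tokens)
print(sample_types)
```

The CJK check is placed before the letter check because Python's `str.isalpha` also returns True for Chinese characters.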
S103, for the target sequence of each text, sequentially combining each target element with a subsequent preset number of target elements according to the sequence of each target element in the target sequence to obtain at least one combined element corresponding to the target element, and obtaining the combined element sequence of the text according to all the target elements and the at least one combined element.
As the above example shows, if the similarity between two sequences is determined only from the differences between the elements of the two sequences, the semantics conveyed by the order of the elements are easily ignored (for example, "love me you" and "love you me" contain the same elements). Therefore, after the target sequence of each text is obtained, instead of simply marking the position of each target element in the target sequence, each target element is merged in order with a preset number of subsequent target elements to form at least one merged element corresponding to the target element. The merged elements represent the order relation of the target elements in the text more specifically and at a finer granularity.
Each target element is merged in order with its subsequent 1, 2, ..., k target elements. If the number of target elements following a given target element in the sequence is less than k, merging only proceeds up to the last target element of the sequence. Taking the text G as an example, if k is 2, then for the target element 'lose', the first subsequent target element is 'fall' and the second is 'of', so the obtainable merged elements are 'lose fall' and 'lose fall of'; for the penultimate target element '!', only one merged element, '!!', is obtained.
After obtaining the merge element, the embodiment of the application may obtain the merge element sequence by combining the merge elements according to the order of the target element in the target sequence.
In an alternative embodiment, the relative order of the target elements in the merged element sequence is the same as their relative order in the target sequence, and between two target elements that are adjacent in the target sequence, the merged element sequence further includes at least one merged element of the former target element.
Referring to fig. 5, which exemplarily shows a schematic flow chart of obtaining a merged element sequence from a target sequence in an embodiment of the present application: as shown in the figure, the text is "I love eating apples", and the target sequence obtained from the text is ['I', 'love', 'eat', 'apple', 'fruit'] (the word for "apple" consists of two characters, glossed here as 'apple' and 'fruit'). If k is determined to be 2, that is, each target element is merged with its subsequent 1 and 2 target elements respectively, the obtainable merged elements include:
'I love', 'I love eat',
'love eat', 'love eat apple',
'eat apple', 'eat apple fruit',
'apple fruit',
The merged element sequence is: ['I', 'I love', 'I love eat', 'love', 'love eat', 'love eat apple', 'eat', 'eat apple', 'eat apple fruit', 'apple', 'apple fruit', 'fruit'].
As can be seen from the merged element sequence, the relative order of the target elements in the merged element sequence is the same as their relative order in the target sequence, and between two target elements that are adjacent in the target sequence, the merged element sequence also includes at least one merged element of the former target element. The relative order of the merged elements corresponding to the same target element is also fixed: the fewer target elements a merged element contains, the earlier it appears.
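The merging step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `merge_elements`, the `sep` parameter, and the use of a space to join the English glosses are all hypothetical choices made for readability.

```python
from typing import List


def merge_elements(target: List[str], k: int, sep: str = "") -> List[str]:
    """Return the merged element sequence for one target sequence.

    Each target element is followed by its merges with the next
    1, 2, ..., k target elements (fewer near the end of the sequence).
    """
    merged: List[str] = []
    for i, element in enumerate(target):
        merged.append(element)  # the target element itself comes first
        for span in range(1, k + 1):
            if i + span >= len(target):
                break  # not enough subsequent target elements left
            merged.append(sep.join(target[i:i + span + 1]))
    return merged


# Glossed target sequence from the fig. 5 example, with k = 2.
target = ["I", "love", "eat", "apple", "fruit"]
sequence = merge_elements(target, k=2, sep=" ")
print(sequence)
```

Shorter merged elements are emitted before longer ones for the same target element, which matches the fixed relative order stated above.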
It should be noted that, by sequentially merging each target element with a subsequent preset number of target elements to obtain at least one merged element corresponding to the target element, the present application can, compared with the prior art, also identify more accurately the differences between texts that contain reduplicated words but differ in the number of words.
S104, coding the merging element sequence of each text, and obtaining the frequency vector of each text according to the coding result of the merging element sequence of each text.
It should be understood that after obtaining the merge element sequence of each text, the embodiments of the present application may list the individual elements (including the target element and the merge element) in the merge element sequences of all texts to form an overall merge sequence in which no duplicate elements exist. Specifically, the elements in the merged element sequence of all texts may be placed in a set, the set is converted into a dictionary, and key value pairs are constructed, where the key is each element in the set, the value is a position sequence number of occurrence of each element in the set, and the position sequence numbers may be counted from 0.
Coding each merging element sequence, namely converting each element into a position appearing in the set, and performing element frequency statistics on the merging element sequence, namely calculating the number of times each element appears to finally obtain a frequency sequence corresponding to each text.
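On the basis of the embodiment shown in fig. 5, if there is another text "I love eating nuts", whose target sequence is ['I', 'love', 'eat', 'hard', 'fruit'] (the word for "nut" consists of two characters, glossed here as 'hard' and 'fruit'), the merged element sequence of the text can be determined accordingly as:
['I', 'I love', 'I love eat', 'love', 'love eat', 'love eat hard', 'eat', 'eat hard', 'eat hard fruit', 'hard', 'hard fruit', 'fruit'].
If the merged element sequence of the text "I love eating apples" is defined as List1, and the merged element sequence of the text "I love eating nuts" is defined as List2, all elements in List1 and List2 are collected to form a set:
set = {'I', 'I love', 'I love eat', 'love', 'love eat', 'love eat apple', 'love eat hard', 'eat', 'eat apple', 'eat hard', 'eat apple fruit', 'eat hard fruit', 'apple', 'hard', 'apple fruit', 'hard fruit', 'fruit'}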
In an embodiment, when determining the positions of the elements in the dictionary, the embodiment of the present application may sequentially determine the corresponding positions according to the order of the elements in each sequence.
Referring to fig. 6, a schematic flowchart of determining a position of an element according to an embodiment of the present application is exemplarily shown, and as shown in the drawing, the method includes:
S201, numbering each merged element sequence, and starting the traversal from the merged element sequence with the smallest number;
S202, judging whether an element that has not yet been traversed (including target elements and merged elements) exists in the currently traversed merged element sequence; if so, executing S203; if not, executing S205;
S203, determining the foremost element among the elements that have not yet been traversed as the currently traversed element;
S204, judging whether the element already appears in the dictionary; if not, executing S207; if so, executing S205;
S205, judging whether the number of the currently traversed sequence is the maximum number; if it is not the maximum number, executing S206; if it is the maximum number, ending the process;
S206, taking the sequence following the currently traversed sequence as the new currently traversed sequence, and returning to step S202;
S207, placing the element at the current last position in the dictionary, and returning to step S205.
On the basis of the above embodiments, for the text "I love eating apples" and the text "I love eating nuts", the dictionary obtained by the embodiment of the present application is:
dict = {'I': 0, 'I love': 1, 'I love eat': 2, 'love': 3, 'love eat': 4, 'love eat apple': 5, 'love eat hard': 6, 'eat': 7, 'eat apple': 8, 'eat hard': 9, 'eat apple fruit': 10, 'eat hard fruit': 11, 'apple': 12, 'hard': 13, 'apple fruit': 14, 'hard fruit': 15, 'fruit': 16}
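One way to arrive at the position numbers listed in this dictionary is to visit the sequences in turn, taking one not-yet-traversed element from each sequence per round. The sketch below assumes that round-robin reading of fig. 6 because it reproduces the positions 0 to 16 given above; the function name `build_dictionary` and the space-joined English glosses are hypothetical.

```python
from typing import Dict, List


def build_dictionary(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct element the next free position, visiting the
    sequences round-robin (one element per sequence per round)."""
    positions: Dict[str, int] = {}
    longest = max((len(seq) for seq in sequences), default=0)
    for index in range(longest):      # index-th element of every sequence
        for seq in sequences:         # sequences in ascending number
            if index < len(seq) and seq[index] not in positions:
                positions[seq[index]] = len(positions)
    return positions


list1 = ["I", "I love", "I love eat", "love", "love eat", "love eat apple",
         "eat", "eat apple", "eat apple fruit", "apple", "apple fruit", "fruit"]
list2 = ["I", "I love", "I love eat", "love", "love eat", "love eat hard",
         "eat", "eat hard", "eat hard fruit", "hard", "hard fruit", "fruit"]

dictionary = build_dictionary([list1, list2])
print(dictionary)
```

Because positions are assigned in traversal order, elements shared by both texts receive a single position, and elements unique to one text are interleaved between them.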
List1 and List2 are encoded separately, converting each element into its position in the dictionary. After conversion:
List1code=[0,1,2,3,4,5,7,8,10,12,14,16]
List2code=[0,1,2,3,4,6,7,9,11,13,15,16]
Element frequency statistics are then performed on List1code and List2code, that is, the number of occurrences of each element is counted. The resulting frequency sequences are as follows:
List1codeOneHot=[1,1,1,1,1,1,0,1,1,0,1,0,1,0,1,0,1]
List2codeOneHot=[1,1,1,1,1,0,1,1,0,1,0,1,0,1,0,1,1]
The frequency vector of the text "I love eating apples" is: (1,1,1,1,1,1,0,1,1,0,1,0,1,0,1,0,1);
The frequency vector of the text "i love eating nuts" is: (1,1,1,1,1,0,1,1,0,1,0,1,0,1,0,1,1).
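The encoding and frequency statistics can be illustrated with the short sketch below. The helper names `encode` and `frequency_vector` are hypothetical, and the dictionary is rebuilt here from the ordered element list only so that the example is self-contained.

```python
from collections import Counter
from typing import Dict, List

# Elements in dictionary order (positions 0..16), glossed in English.
ELEMENTS = ["I", "I love", "I love eat", "love", "love eat", "love eat apple",
            "love eat hard", "eat", "eat apple", "eat hard", "eat apple fruit",
            "eat hard fruit", "apple", "hard", "apple fruit", "hard fruit", "fruit"]
dictionary: Dict[str, int] = {e: pos for pos, e in enumerate(ELEMENTS)}


def encode(sequence: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Replace every element by its position in the dictionary."""
    return [dictionary[element] for element in sequence]


def frequency_vector(codes: List[int], size: int) -> List[int]:
    """Count how often each dictionary position occurs in the encoded text."""
    counts = Counter(codes)
    return [counts[position] for position in range(size)]


list1 = ["I", "I love", "I love eat", "love", "love eat", "love eat apple",
         "eat", "eat apple", "eat apple fruit", "apple", "apple fruit", "fruit"]
list1_code = encode(list1, dictionary)
list1_vector = frequency_vector(list1_code, len(dictionary))
print(list1_code)
print(list1_vector)
```

Every dimension of the vector corresponds to one dictionary position, so texts of different lengths are mapped to vectors of the same length.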
And S105, for any two texts, obtaining the text similarity of any two texts according to the frequency vectors of any two texts.
Specifically, the cosine similarity of the frequency vectors of the two texts can be calculated to serve as the text similarity of the two texts.
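As an illustration, the cosine similarity of the two frequency vectors from the "apples" and "nuts" example can be computed as sketched below. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption of this sketch rather than something stated in the application.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length frequency vectors."""
    if len(u) != len(v):
        raise ValueError("frequency vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_u * norm_v)


apples = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
nuts = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
similarity = cosine_similarity(apples, nuts)
print(round(similarity, 4))  # 7 shared elements, 12 elements each: 7/12
```

The two texts share 7 of their 12 elements, so the similarity is 7/12, roughly 0.5833.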
According to the text similarity detection method of the embodiment of the present application, at least two texts to be detected are obtained, word segmentation is performed on each text, and a target sequence of the text is obtained according to the word segmentation result. A target element in the target sequence may be the participle at the corresponding position in the word segmentation result, or attribute information of that participle, so that similarity detection can be performed from multiple angles. For the target sequence of each text, each target element is sequentially merged with a subsequent preset number of target elements according to the order of the target elements in the target sequence to obtain at least one merged element corresponding to the target element, and the merged element sequence of the text is obtained according to all the target elements and the at least one merged element. In this way, the sequential relationship of the target elements in the text is expressed at a finer granularity and subtle differences between texts containing reduplicated words can be recognized, which lays a foundation for analyzing the semantics of the texts more accurately. The merged element sequence of each text is then encoded, and the frequency vector of each text is obtained according to the encoding result of its merged element sequence; the feature of each dimension in the frequency vector represents the frequency with which the corresponding element of the overall merge sequence occurs in the merged element sequence of the text. For any two texts, the text similarity is obtained according to the frequency vectors of the two texts. The result is more accurate and better suited to a net name audit scenario, in which the method can effectively uncover batch malicious registration of nicknames and batch gang aggregation and support risk early warning.
The following describes, with reference to two specific examples, that the embodiment of the present application gives higher attention to word order and word overlap, so that text similarity can be determined more accurately.
Example 1
For the two texts "you love me" and "I love you" (in the original language "I" and "me" are the same character, glossed here as 'me'), assuming that each target element is merged in turn with the next 1 target element, so that each merged element spans at most 2 target elements, then:
the merged element sequence of "you love me" is ['you', 'you love', 'love', 'love me', 'me'];
the merged element sequence of "I love you" is ['me', 'me love', 'love', 'love you', 'you'];
all target elements and merged elements include ['you', 'you love', 'love', 'love me', 'me', 'me love', 'love you'], the frequency vectors are (1,1,1,1,1,0,0) and (1,0,1,0,1,1,1) respectively, and the cosine similarity is:
$$\cos\theta = \frac{1+1+1}{\sqrt{5}\times\sqrt{5}} = \frac{3}{5} = 0.6$$
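If, for the nicknames "you love me" and "I love you", each target element is instead merged in turn with the next 1 and 2 target elements, so that each merged element spans at most 3 target elements, then:
the merged element sequence of "you love me" is ['you', 'you love', 'you love me', 'love', 'love me', 'me'];
the merged element sequence of "I love you" is ['me', 'me love', 'me love you', 'love', 'love you', 'you'];
all target elements and merged elements include ['you', 'you love', 'you love me', 'love', 'love me', 'me', 'me love', 'me love you', 'love you'], the frequency vectors are (1,1,1,1,1,1,0,0,0) and (1,0,0,1,0,1,1,1,1) respectively, and the cosine similarity is:
$$\cos\theta = \frac{1+1+1}{\sqrt{6}\times\sqrt{6}} = \frac{3}{6} = 0.5$$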
As can be seen from Example 1, in some cases, when merged elements are obtained by merging target elements, the more subsequent target elements are involved, the lower the resulting cosine similarity, so that the difference between texts can be distinguished more significantly.
Example 2
For example, for the two texts "haha" and "hahahaha", if each target element is merged with the next 1 target element, so that each merged element spans at most 2 target elements, then:
the merged element sequence of "haha" is ['ha', 'haha', 'ha'];
the merged element sequence of "hahahaha" is ['ha', 'haha', 'ha', 'haha', 'ha', 'haha', 'ha'];
all elements of the merged element sequences include ['ha', 'haha'], the frequency vectors of the two texts are (2, 1) and (4, 3) respectively, and the cosine similarity is:
$$\cos\theta = \frac{2\times 4 + 1\times 3}{\sqrt{2^2+1^2}\times\sqrt{4^2+3^2}} = \frac{11}{5\sqrt{5}} \approx 0.984$$
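If each target element is instead merged with the next 1 and 2 target elements, so that each merged element spans at most 3 target elements, then:
the merged element sequence of "haha" is ['ha', 'haha', 'ha'];
the merged element sequence of "hahahaha" is ['ha', 'haha', 'hahaha', 'ha', 'haha', 'hahaha', 'ha', 'haha', 'ha'];
all elements of the merged element sequences include ['ha', 'haha', 'hahaha'], the frequency vectors of the two texts are (2, 1, 0) and (4, 3, 2) respectively, and the cosine similarity is:
$$\cos\theta = \frac{2\times 4 + 1\times 3 + 0\times 2}{\sqrt{2^2+1^2+0^2}\times\sqrt{4^2+3^2+2^2}} = \frac{11}{\sqrt{5}\times\sqrt{29}} \approx 0.914$$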
Compared with the prior art, which determines the text similarity of "haha" and "hahahaha" to be 1, the present application can detect the difference between the two texts, so that fewer cases are missed when malicious or batch registration of net names is determined. In some cases, when merged elements are obtained by merging target elements, the more subsequent target elements are involved, the lower the resulting cosine similarity, so that the difference between texts can be distinguished more significantly.
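The effect described in the two examples can be checked end to end with the sketch below. It is an illustration under assumptions: merged elements are represented as tuples of target elements rather than joined strings, the names `element_counts` and `text_similarity` are hypothetical, and `k` here denotes the number of subsequent target elements each target element is merged with.

```python
import math
from collections import Counter
from typing import List


def element_counts(tokens: List[str], k: int) -> Counter:
    """Count target elements and merged elements (up to k successors)."""
    counts: Counter = Counter()
    for i in range(len(tokens)):
        for span in range(0, k + 1):
            if i + span >= len(tokens):
                break
            counts[tuple(tokens[i:i + span + 1])] += 1
    return counts


def text_similarity(a: List[str], b: List[str], k: int) -> float:
    """Cosine similarity of the two element-frequency vectors."""
    ca, cb = element_counts(a, k), element_counts(b, k)
    dot = sum(ca[e] * cb[e] for e in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


you_love_me = ["you", "love", "me"]
me_love_you = ["me", "love", "you"]
haha, hahahaha = ["ha"] * 2, ["ha"] * 4

for k in (1, 2):
    print(k, round(text_similarity(you_love_me, me_love_you, k), 4),
          round(text_similarity(haha, hahahaha, k), 4))
```

Elements missing from one text contribute zero to the dot product, so counting with `Counter` gives the same result as building explicit frequency vectors over the overall merge sequence.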
On the basis of the above embodiments, as an alternative embodiment, obtaining the target sequence of the text according to the word segmentation result includes:
obtaining a word segmentation sequence of the text according to the word segmentation result;
and obtaining a target sequence of the text according to the word segmentation sequence of the text.
The word segmentation sequence can be used as a target sequence, and the attributes of each word in the word segmentation sequence can be further analyzed, so that the target sequence is obtained based on the attributes of each word, that is, target elements in the target sequence are used for representing attribute information of corresponding words.
In the embodiment of the present application, the word segmentation sequence of the text is obtained according to the word segmentation result of the text, and the target sequence is obtained according to the word segmentation sequence. Because the word segmentation sequence can either be used directly as the target sequence or be analyzed further to derive the target sequence from the attributes of each participle, several feasible schemes are available for obtaining the target sequence, which makes the similarity detection more flexible.
On the basis of the above embodiments, as an alternative embodiment, obtaining a word segmentation sequence of a text according to a word segmentation result includes:
obtaining an initial word segmentation sequence of the text according to the word segmentation result;
if no consecutive participles of the first target type exist in the initial word segmentation sequence, determining the initial word segmentation sequence as the word segmentation sequence;
if consecutive participles of the first target type exist in the initial word segmentation sequence, replacing the consecutive participles of the first target type as a whole with a preset number of participles, and taking the initial word segmentation sequence after replacement as the word segmentation sequence.
In a net name audit scenario, maliciously registered net names usually carry batch information, such as "fruit wholesale xxx", where "xxx" is a sequence number generated in batch and may run from 001 to 999.
After the initial word segmentation sequence is obtained, it is further judged whether consecutive participles of the first target type exist in the initial word segmentation sequence. It can be understood that the number of characters required to satisfy the "consecutive" condition may be preset in the embodiment of the present application; when the number of adjacent participles of the first target type exceeds the preset number of characters (for example, 3), it is determined that consecutive participles of the first target type exist in the initial word segmentation sequence.
The first target type of the embodiment of the present application may include a first specified type and a second specified type, and for the first specified type and the second specified type, the embodiment of the present application has a corresponding manner of updating the initial word segmentation sequence:
A) If consecutive participles of the first specified type exist in the initial word segmentation sequence, replacing the consecutive participles of the first specified type with a first target participle, wherein the first target participle is used for representing the number of the consecutive participles of the first specified type.
B) And if continuous participles of the second specified type exist in the initial participle sequence, replacing the continuous participles of the second specified type with second target participles, wherein the second target participles are a combination of the continuous participles of the second specified type.
The specific types of the first specified type and the second specified type are not limited in the embodiments of the present application. For example, the first specified type may be a number, and the second specified type may be a letter, such as an English letter, a Latin letter, a pinyin letter, or a Slavic letter.
Taking the text "missing King 2022!!!" as an example, the embodiment of the present application may decompose the text into single characters (including simplified Chinese, traditional Chinese, English letters, Japanese, Korean, numbers, special characters, emoticons, etc.) to obtain an initial word segmentation sequence:
X = ['lose', 'fall', 'of', 'K', 'i', 'n', 'g', '2', '0', '2', '2', '!', '!', '!']
The word segmentation sequence contains participles of the first specified type (numbers) and of the second specified type (English letters). Adjacent numbers are merged until no further number can be merged, and the character length of the merged number string is counted; if the length is n, the number string is replaced by NUM_n. Since the number string in the text is "2022" and its length is 4, it is replaced by "NUM_4". For English letters, consecutive English letters are spliced together and may further be uniformly converted into lower case, giving "king", so that the updated word segmentation sequence is:
X' = ['lose', 'fall', 'of', 'king', 'NUM_4', '!', '!', '!'].
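The replacement of digit runs and letter runs can be sketched as below. The names `char_kind` and `normalize` are hypothetical, only ASCII digits and letters are treated as the first and second specified types, and every run is collapsed regardless of its length, which is a simplification of the preset "consecutive" condition described above.

```python
from itertools import groupby
from typing import List


def char_kind(ch: str) -> str:
    """Classify a single character for the purpose of run merging."""
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "letter"
    return "other"


def normalize(chars: List[str]) -> List[str]:
    """Collapse digit runs into NUM_n and letter runs into one lower-case word."""
    result: List[str] = []
    for kind, group in groupby(chars, key=char_kind):
        run = list(group)
        if kind == "digit":
            result.append(f"NUM_{len(run)}")      # keep only the run length
        elif kind == "letter":
            result.append("".join(run).lower())   # splice and lower-case
        else:
            result.extend(run)                    # other characters unchanged
    return result


print(normalize(list("King2022!!!")))
print(normalize(list("コール001")))
```

Replacing a digit run by its length makes batch-generated names that differ only in their serial number map to the same participle.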
In a net name detection scenario, the embodiment of the application can thus identify cases in which users register with mobile phone numbers or in which batch accounts are distinguished by serial numbers, for example the net names "xiaolu 13X11111111 of a new company", "asheng 15X 11111111111 of a new company", "new customer service 001 of a new company", and "new customer service 002 of a new company".
On the basis of the foregoing embodiments, as an optional embodiment, obtaining a target sequence of a text according to a word segmentation sequence of the text includes:
S301, determining the type of each participle in the word segmentation sequence;
S302, obtaining a word type sequence of the text according to the type of each participle, wherein each element in the word type sequence is used for representing the type of the participle at the corresponding position in the word segmentation sequence;
S303, obtaining a target sequence of the text according to the word type sequence of the text.
The type of the participle in the embodiment of the present application may refer to the language type of the participle, such as chinese simplified, chinese traditional, english, number, and special character.
In some embodiments, for the case that the same participle may belong to multiple types (for example, a chinese character may belong to both a simplified chinese character and a traditional chinese character), the present application sets priority information for each type, so that when a participle belongs to multiple types, the type with higher priority is taken as the type of the participle. For example, the priority may be simplified Chinese, traditional Chinese, english, number, and special character in order from high to low.
In the embodiment of the present application, the word type sequence may be directly used as the target sequence, so that the attribute information of the participle represented by the target element in the target sequence, that is, the type of the participle, may be obtained.
For example, in the text "a fruit wholesale 1", "a" belongs to English, the four characters glossed as "water", "fruit", "batch", and "hair" belong to Chinese, and "1" belongs to numbers. If L1 denotes Chinese, L2 denotes English, and L3 denotes numbers, the word type sequence of the text is ['L2', 'L1', 'L1', 'L1', 'L1', 'L3']. If the word type sequence is used as the target sequence, the target sequence of the text is also ['L2', 'L1', 'L1', 'L1', 'L1', 'L3'].
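Determining the type of each participle by priority can be sketched as follows. Everything named here is an assumption for illustration: the `TYPE_PRIORITY` table, the Unicode range used to recognize Chinese characters, the `L4` fallback for special characters, and the sample string, which is an assumed rendering of the translated example "a fruit wholesale 1". Each participle is taken to be a single character.

```python
from typing import Callable, List, Tuple

# (label, membership test) pairs ordered from highest to lowest priority.
TYPE_PRIORITY: List[Tuple[str, Callable[[str], bool]]] = [
    ("L1", lambda ch: "\u4e00" <= ch <= "\u9fff"),     # Chinese (CJK block)
    ("L2", lambda ch: ch.isascii() and ch.isalpha()),  # English letters
    ("L3", lambda ch: ch.isascii() and ch.isdigit()),  # numbers
]
FALLBACK_TYPE = "L4"  # special characters


def word_type_sequence(chars: List[str]) -> List[str]:
    """Map every single-character participle to its highest-priority type."""
    types: List[str] = []
    for ch in chars:
        label = next((name for name, test in TYPE_PRIORITY if test(ch)),
                     FALLBACK_TYPE)
        types.append(label)
    return types


print(word_type_sequence(list("a水果批发1")))
```

Because the table is scanned in order, a participle that satisfies several tests is assigned the first, that is, the highest-priority, type.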
On the basis of the foregoing embodiments, as an alternative embodiment, obtaining the target sequence of the text according to the word type sequence of the text includes:
for each element in the word type sequence, if the type represented by the element belongs to a second target type, determining the order of the element in each type element belonging to the second target type in the word type sequence; if the type represented by the element does not belong to the second target type, determining the order of the element in each element which does not belong to the second target type in the word type sequence;
and obtaining an attribute sequence according to the sequence corresponding to each type element in the word type sequence, wherein the attribute information of the participle also comprises the sequence of the element corresponding to the participle.
In the embodiment of the present application, the target sequence may be obtained according to whether each element in the word type sequence belongs to the second target type and according to the order of the element within its group (that is, among the elements that belong to the second target type, or among those that do not). The attribute information represented by an element of the target sequence therefore also includes the order of the element corresponding to the participle; in other words, the attribute information of a target element includes both its type and its order among the target elements of the same type, which strengthens the word order information of the text.
Taking the above text "a fruit wholesale 1" as an example, the word type sequence is ['L2', 'L1', 'L1', 'L1', 'L1', 'L3']. If the second target type is determined to be Chinese, neither English nor numbers belong to the second target type. If elements of the second target type are denoted by A and elements not of the second target type are denoted by B, the target sequence is ['B1', 'A1', 'A2', 'A3', 'A4', 'B2']. Specifically, 'L2' does not belong to the second target type and is the first element that does not, so 'L2' corresponds to 'B1' in the target sequence. Similarly, the 'L1' corresponding to 'hair' belongs to the second target type and is the 4th element of that type in the word type sequence, so it corresponds to 'A4' in the target sequence.
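The numbering of elements inside and outside the second target type can be sketched as follows. The function name `order_sequence` is hypothetical; the labels A and B follow the example above.

```python
from typing import List


def order_sequence(type_sequence: List[str], second_target_type: str) -> List[str]:
    """Label each element A<n> or B<n>: A if it belongs to the second target
    type, B otherwise, with n counting elements within the same group."""
    inside = outside = 0
    result: List[str] = []
    for word_type in type_sequence:
        if word_type == second_target_type:
            inside += 1
            result.append(f"A{inside}")
        else:
            outside += 1
            result.append(f"B{outside}")
    return result


types = ["L2", "L1", "L1", "L1", "L1", "L3"]
print(order_sequence(types, second_target_type="L1"))
```

Two texts whose characters differ but whose layout of second-target-type and other characters is the same map to the same target sequence under this scheme.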
Please refer to fig. 7a to 7c, which exemplarily show interface schematic diagrams of a detecting party setting a detection mode in an abnormal net name detection scenario according to an embodiment of the present application. It should be noted that the detection mode may be set either after or before the text to be detected is imported.
The similarity between any two imported texts can be calculated, and the similarity between an imported text and a plurality of sample texts preset in the background can also be calculated.
As shown in fig. 7a, it is a schematic diagram of an initial interface for setting a detection mode, and the interface has two triggerable controls: the first detection control 201 and the second detection control 202 are not limited in the embodiment of the present application to the triggering manner, and may be, for example, a single click, a double click, a slide, and the like.
When the detecting party triggers the first detecting control 201, it indicates that the detecting party wants to perform similarity detection based on the word segmentation of the text (detecting mode 1), in this case, the target element of the target sequence is used to indicate the word segmentation at the corresponding position in the word segmentation result, and this detecting mode belongs to the most basic detecting mode, which is relatively faster in detection.
In some embodiments, after the detecting party triggers the first detection control 201, the website may further display prompt information indicating that the setting is complete, and may further display prompt information asking whether to continue setting other detection modes. If the detecting party decides not to set other detection modes, the text similarity is detected based on detection mode 1, and after the detection is completed, the similarity detection result between each two texts is displayed.
When the detecting party triggers the second detection control 202, it indicates that the detecting party wants to perform similarity detection based on the attribute information of the participles (detection mode 2). In this case, the target element of the target sequence is used for indicating the attribute of the participle at the corresponding position in the word segmentation result. After the detecting party triggers the second detection control 202, the website jumps to a new interface; please refer to fig. 7b, which schematically shows an interface for setting the second detection mode according to the present application.
It should be noted that when the detecting party decides to continue setting other detection modes, the website may also jump to the interface shown in fig. 7b. When the user then sets detection mode 2, the website obtains the similarity between the texts to be detected based on detection mode 1 and detection mode 2 respectively, using parallel service flows, and displays the two similarity detection results at the same time after they are obtained.
As shown in fig. 7b, the currently supported language types are further displayed in the interface, and the user may further adjust the priority order of the language types. In this embodiment, the initial priority order of each language type is displayed, and when the user wants to adjust a certain language type, the user may drag the language type control corresponding to that language type to a position before or after another language type control. For example, in the left drawing the initial priority of simplified Chinese is 1, the initial priority of English is 3, and the priority of Arabic numerals is 4; moving the language type control of Arabic numerals to the left of the language type control of English updates the priority of English to 4 and the priority of Arabic numerals to 3 in the right drawing, so that the priority order of the language types can be adjusted quickly and intuitively. When the detecting party determines that the order of the language types has been set, the detecting party operates the confirmation control 203, meaning that similarity detection can be performed based on the types of the participles.
Please refer to fig. 7c, which exemplarily shows interface schematic diagrams after the detecting party operates the confirmation control. The left diagram shows prompt information 204, which prompts the detecting party whether to further set a second target type. If the second target type is not set, the type of each participle is subsequently determined based on the set priority of the language types, and the text similarity is calculated based on the type of each participle. If the detecting party chooses to set the second target type, the second target type is further set; the right diagram shows the second target type being set to simplified Chinese. After the setting is completed, whether the type of each participle belongs to the second target type can be determined according to the set second target type, so that the attribute information of the participles in the obtained target sequence further includes the order of the elements corresponding to the participles.
Referring to fig. 8, which exemplarily shows an interface schematic diagram displaying text similarity when both the detection mode based on the participles (detection mode 1) and the detection mode based on a specified second target type (detection mode 2) have been set in the embodiment of the present application: as shown in the figure, query interfaces corresponding to the different detection modes are provided in the interface. When the detecting party operates query interface 1, texts whose similarity obtained based on detection mode 1 reaches a preset threshold can be displayed by inputting the preset threshold;
when the detecting party operates query interface 2, texts whose similarity obtained based on detection mode 2 reaches a preset threshold can be displayed by inputting the preset threshold;
when the detecting party operates query interface 3, texts whose similarities under both detection mode 1 and detection mode 2 reach the corresponding preset thresholds can be displayed by inputting two preset thresholds (corresponding to detection mode 1 and detection mode 2 respectively).
For example, there are the following 3 texts to be tested:
A1='コール001';
A2='コール10000000000';
A3='customer service 002';
Word segmentation is performed on the 3 texts respectively, consecutive numbers are replaced as described above, and each participle is merged with the next 1 participle, so that each merged element spans at most 2 participles; the merged element sequences of the 3 texts are:
A11=['コ','コー','ー','ール','ル','ルNUM_3','NUM_3'];
A12=['コ','コー','ー','ール','ル','ルNUM_11','NUM_11'];
A13=['customer','customer service','service','service NUM_3','NUM_3'] (the word for "customer service" consists of two characters, glossed here as 'customer' and 'service');
The second target type is set to Japanese through the foregoing embodiment, that is, the detecting party needs to calculate the similarity of the nicknames with respect to Japanese (assuming that the language library includes simplified Chinese, Japanese, numbers, and English, with priorities consistent with the foregoing embodiment). Whether each element in the sequence is Japanese is judged, and its order within the corresponding type is determined, to obtain new sequences:
L11=['A1','A1A2','A2','A2A3','A3','A3B1','B1']
L12=['A1','A1A2','A2','A2A3','A3','A3B1','B1']
L13=['B1','B1B2','B2','B2B3','B3']
It is easy to obtain that, for A1 and A2, the similarity under detection mode 1 is 0.7143 and the similarity under detection mode 2 is 1;
for A1 and A3, the similarity under detection mode 1 is 0.169 and the similarity under detection mode 2 is 0;
for A2 and A3, the similarity under detection mode 1 is 0 and the similarity under detection mode 2 is 0.
If texts whose similarity under detection mode 1 exceeds 0.6 and whose similarity under detection mode 2 also exceeds 0.6 are to be selected, the text pair ['コール001', 'コール10000000000'] is screened out and displayed for subsequent use.
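The screening by two thresholds can be sketched as below, using the similarity values stated in this example as input. The `results` table layout and the function name `screen_pairs` are hypothetical.

```python
from typing import Dict, List, Tuple

# (mode 1 similarity, mode 2 similarity) for each text pair, as stated above.
results: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("A1", "A2"): (0.7143, 1.0),
    ("A1", "A3"): (0.169, 0.0),
    ("A2", "A3"): (0.0, 0.0),
}


def screen_pairs(results: Dict[Tuple[str, str], Tuple[float, float]],
                 threshold_1: float, threshold_2: float) -> List[Tuple[str, str]]:
    """Keep the pairs whose similarity exceeds both thresholds."""
    return [pair for pair, (sim_1, sim_2) in results.items()
            if sim_1 > threshold_1 and sim_2 > threshold_2]


selected = screen_pairs(results, threshold_1=0.6, threshold_2=0.6)
print(selected)
```

Only the pair A1 and A2 passes both thresholds, which corresponds to the two Japanese net names that differ only in their trailing number.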
On the basis of the above embodiments, as an optional embodiment, the method further includes:
for any two texts, if it is determined that an element belonging to the second target type does not exist in the word type sequence corresponding to one of the two texts and an element belonging to the second target type exists in the word type sequence corresponding to the other text, it is determined that the text similarity of the two texts is not higher than a preset threshold (for example, 0).
In the embodiment of the application, if it is determined that the second target type is preset by the detection party, it is determined that the similarity of the texts in the second target type needs to be calculated, and if the second target type is not involved in one text to be compared and is involved in the other text, it can be directly determined that the similarity of the two texts is not higher than a preset threshold, so that the efficiency of calculating the similarity can be greatly improved.
As a specific example, consider the following texts:
text 1= 'Ann yellow';
text 2= 'bob';
If the second target type is specified as simplified Chinese, that is, the similarity of text 1 and text 2 with respect to simplified Chinese needs to be calculated, then according to the above embodiment, since text 2 does not include simplified Chinese, the cosine similarity of text 1 and text 2 with respect to simplified Chinese is directly set to 0.
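This shortcut can be sketched as a guard placed in front of the full similarity computation. The function name `similarity_with_shortcut`, the `compute` callback, and the word type sequences assumed for text 1 and text 2 are hypothetical.

```python
from typing import Callable, List


def similarity_with_shortcut(types_a: List[str], types_b: List[str],
                             second_target_type: str,
                             compute: Callable[[], float],
                             floor: float = 0.0) -> float:
    """Skip the full computation when exactly one of the two texts
    contains an element of the second target type."""
    has_a = second_target_type in types_a
    has_b = second_target_type in types_b
    if has_a != has_b:
        return floor  # preset value, 0 in the example above
    return compute()


# Assumed word type sequences: text 1 ends with one Chinese character (L1),
# text 2 consists of English letters (L2) only.
text1_types = ["L2", "L2", "L2", "L1"]
text2_types = ["L2", "L2", "L2"]

calls: List[str] = []


def expensive() -> float:
    calls.append("computed")  # records that the full computation ran
    return 0.5


print(similarity_with_shortcut(text1_types, text2_types, "L1", expensive))
```

The full computation is skipped for this pair, so the list `calls` stays empty.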
On the basis of the above embodiments, as an alternative embodiment, the determining the type of each participle in the participle sequence includes:
and for each participle in the participle sequence, if the participle belongs to multiple candidate types, determining the word frequency of the participle belonging to the candidate type in the participle sequence for each candidate type, and determining the type of the participle according to the candidate type corresponding to the highest word frequency.
In this embodiment, for a participle that belongs to multiple candidate types, the candidate type corresponding to the highest word frequency is determined as the type of the participle. This highlights the differences among participles of different types in the text, so that the characteristics of the text can be determined more accurately and the accuracy of the text similarity determination can be improved.
As a specific example, suppose the participle sequence is [ ' Central ', ' GUO ', '! ' ], where the first participle, "middle", can belong to both simplified Chinese and traditional Chinese, so that its candidate types include simplified Chinese and traditional Chinese. If it is treated as simplified Chinese, the participle sequence contains one simplified Chinese participle, one traditional Chinese participle, and one special character; if it is treated as traditional Chinese, the participle sequence contains two traditional Chinese participles and one special character. Traditional Chinese therefore has the higher word frequency when the participle "middle" is assigned to it, and the type sequence [ 'L2', 'L2', 'L5' ] is generated (in this embodiment, L2 represents traditional Chinese and L5 represents a special character).
When the type of the participle is determined, the word frequency of the participle of each candidate type is determined, and the type of the participle is determined according to the candidate type corresponding to the highest word frequency, so that the distribution difference of different types in the obtained target sequence is more remarkable, and the difference between two texts can be more remarkably distinguished when the similarity is calculated.
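The word-frequency rule can be sketched as below, under stated assumptions. The function name `resolve_types` is hypothetical, 'L1' is assumed to denote simplified Chinese (the description only fixes L2 and L5), and the frequency of a candidate type is taken as the number of unambiguous participles of that type plus the participle being resolved. Ties are broken alphabetically here purely as a placeholder; the description handles ties with a type priority instead.

```python
from collections import Counter
from typing import List, Set


def resolve_types(candidates: List[Set[str]]) -> List[str]:
    """candidates[i] holds the candidate types of the i-th participle."""
    # Count the participles whose type is unambiguous.
    base = Counter(next(iter(c)) for c in candidates if len(c) == 1)
    resolved = []
    for cand in candidates:
        if len(cand) == 1:
            resolved.append(next(iter(cand)))
            continue
        # Word frequency each candidate type would have if this
        # participle were assigned to it.
        freq = {t: base[t] + 1 for t in cand}
        best = max(freq.values())
        winners = sorted(t for t, f in freq.items() if f == best)
        resolved.append(winners[0])  # placeholder tie-break
    return resolved


# Mirrors the example above: an ambiguous first participle, one
# traditional Chinese participle, and one special character.
print(resolve_types([{"L1", "L2"}, {"L2"}, {"L5"}]))  # ['L2', 'L2', 'L5']
```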
On the basis of the foregoing embodiments, as an optional embodiment, determining the type of the participle according to the candidate type corresponding to the highest word frequency includes:
if the candidate type corresponding to the highest word frequency is unique, taking the candidate type corresponding to the highest word frequency as the type of the participle;
and if the candidate type corresponding to the highest word frequency is not unique, taking the candidate type with the highest priority and the highest word frequency as the type of the participle according to the predetermined type priority.
In this embodiment, the type of the participle is determined case by case. When the candidate type corresponding to the highest word frequency is unique, that candidate type is used directly as the type of the participle. When it is not unique, the type of the participle is determined from a preset type priority relationship. This allows the detecting party, in actual use, to dynamically adjust the priority of each type according to the main language environment of the application and so obtain similarity comparison results that suit that language environment.
For example, internet users in different regions use different scripts and therefore use different characters when registering network names. In some embodiments, the priority relationship among the different scripts may thus be set dynamically according to the user's place of registration (which may be determined from the user's IP address or the phone number used for registration).
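The priority-based tie-break, with a priority list chosen per registration region, might look like the sketch below. The table `REGION_PRIORITY`, its region keys, the function name `pick_type`, and the label 'L1' are all illustrative assumptions; the description does not specify how the priorities are stored or keyed.

```python
from typing import Dict, List

# Hypothetical per-region type priorities (earlier means higher priority).
REGION_PRIORITY: Dict[str, List[str]] = {
    "region_a": ["L1", "L2"],
    "region_b": ["L2", "L1"],
}


def pick_type(freq: Dict[str, int], priority: List[str]) -> str:
    """Pick the candidate type with the highest word frequency; when
    several candidates share it, pick the one ranked first in `priority`."""
    best = max(freq.values())
    winners = [t for t, f in freq.items() if f == best]
    if len(winners) == 1:
        return winners[0]
    rank = {t: i for i, t in enumerate(priority)}
    # Types missing from the priority list rank last.
    return min(winners, key=lambda t: rank.get(t, len(priority)))


tied = {"L1": 2, "L2": 2}
print(pick_type(tied, REGION_PRIORITY["region_a"]))  # L1
print(pick_type(tied, REGION_PRIORITY["region_b"]))  # L2
```

Keeping the priority list as a parameter lets the same tie-break serve any region without changing the frequency logic.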
Referring to fig. 9, fig. 9 is a schematic structural diagram illustrating a text similarity detection system applied in this scenario embodiment of the present application, where as shown in the figure, the detection system includes a user terminal, a text server, a similarity detection server, and a verification platform.
The user terminal may be a terminal that runs any application having a user registration function, and may communicate with the similarity detection server via a network. The type of application is not limited in this embodiment of the application: it may be an application that the user needs to download and install, a cloud application, or an application within an applet (for example, a game). When the user needs to register or modify a user name, the user terminal uploads the new user name to the similarity detection server over the network.
The text server stores a number of abnormal user names that were registered in batches, each stored as a sample text. The text server may store the abnormal user names by category according to the language types found in the sample texts; that is, each storage partition stores sample texts with the same main language type, where the main language type is the language type with the most participles in the text.
A party testing text similarity (the detecting party) logs in to the verification platform, which offers the detecting party a choice of detection modes (that is, whether a target element in the target sequence represents the participle at the corresponding position in the word segmentation result or attribute information of that participle, where the attribute information may be at least one of the language type or the order of the corresponding element). The verification platform sends the detection mode chosen by the detecting party to the similarity detection server. The similarity detection server takes the new user name as the text to be detected and calculates the similarity between it and each sample text according to the chosen detection mode. If the similarity between the text to be detected and a preset number of sample texts exceeds a threshold, the text to be detected is determined to be an abnormal user name, and prompt information indicating an abnormal name is sent to the terminal so that the user can edit the user name again.
The similarity detection server can also feed the detected abnormal user name back to the verification platform for manual verification by the detecting party; if manual verification confirms that the detection is correct, the detected abnormal user name can be stored in the text server.
An embodiment of the present application provides a device for detecting text similarity, as shown in fig. 10, the device for detecting text similarity may include: a text obtaining module 1001, a target sequence obtaining module 1002, a merging module 1003, a frequency vector module 1004, and a similarity calculating module 1005, wherein,
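The server-side decision in this scenario can be sketched as follows. The names `is_abnormal_username` and `toy_similarity` and the default values are assumptions for illustration; `toy_similarity` is a character-overlap stand-in, not the frequency-vector method of the application, and "a preset number" is read here as "at least `min_hits`".

```python
from typing import Callable, Iterable


def toy_similarity(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard overlap of characters)."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return len(set(a) & set(b)) / len(union)


def is_abnormal_username(
    candidate: str,
    samples: Iterable[str],
    similarity: Callable[[str, str], float] = toy_similarity,
    threshold: float = 0.6,
    min_hits: int = 3,
) -> bool:
    """Flag the candidate when its similarity to at least `min_hits`
    stored sample texts exceeds `threshold`."""
    hits = 0
    for sample in samples:
        if similarity(candidate, sample) > threshold:
            hits += 1
            if hits >= min_hits:
                return True  # enough matches; stop scanning early
    return False


stored = ["abc124", "abc125", "abc126", "zzz"]
print(is_abnormal_username("abc123", stored))  # True
```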
a text acquiring module 1001 configured to acquire at least two texts to be detected;
a target sequence obtaining module 1002, configured to perform word segmentation on each text, and obtain a target sequence of the text according to a word segmentation result; the target elements in the target sequence are used for representing at least one of the participles at corresponding positions in the participle result or attribute information of the participles;
a merging module 1003, configured to, for the target sequence of each text, sequentially merge, according to the sequence of each target element in the target sequence, each target element with a subsequent preset number of target elements, to obtain at least one merged element corresponding to the target element, and obtain a merged element sequence of the text according to all the target elements and the at least one merged element;
the frequency vector module 1004 is configured to encode the merging element sequence of each text, and obtain a frequency vector of each text according to an encoding result of the merging element sequence of each text; the feature of each dimension in the frequency count vector is used for representing the frequency count of the corresponding element in the total merging sequence in the merging element sequence of the text, the total merging sequence is obtained by sequencing the target element and the merging element in the merging element sequences of all the texts, and no repeated element exists in the total merging sequence;
the similarity calculation module 1005 is configured to, for any two texts, obtain a text similarity between any two texts according to the frequency vectors of any two texts.
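The chain formed by the merging module, the frequency vector module, and the similarity calculation module can be sketched as below. Several points are assumptions: each target element is merged with each of its next `window` successors (one reading of "a subsequent preset number of target elements"), merged elements are joined with an underscore, the total merged sequence is ordered by plain string sorting, and cosine similarity is used to compare the frequency vectors.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def merged_element_sequence(target: Sequence[str], window: int = 1,
                            sep: str = "_") -> List[str]:
    """All target elements, followed by each element merged with its
    next 1..window successors."""
    merged = list(target)
    for i in range(len(target)):
        for k in range(1, window + 1):
            if i + k < len(target):
                merged.append(sep.join(target[i:i + k + 1]))
    return merged


def frequency_vectors(
    sequences: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Build the duplicate-free total merged sequence and one frequency
    vector per text, with one dimension per element of that sequence."""
    total = sorted({e for seq in sequences for e in seq})
    vectors = []
    for seq in sequences:
        counts = Counter(seq)
        vectors.append([counts[e] for e in total])
    return total, vectors


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


seq_1 = merged_element_sequence(["a", "b", "c"])
seq_2 = merged_element_sequence(["a", "c", "b"])
total, (vec_1, vec_2) = frequency_vectors([seq_1, seq_2])
print(total)
print(round(cosine_similarity(vec_1, vec_2), 3))  # 0.6
```

The two toy sequences contain the same elements in a different order. Without the merged elements their frequency vectors would be identical, so the merged elements are what make the comparison order-sensitive.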
The apparatus in the embodiment of the present application may execute the method provided in the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus in the embodiments of the present application correspond to the steps in the method in the embodiments of the present application, and for the detailed functional description of the modules in the apparatus, reference may be made to the description in the corresponding method shown in the foregoing, and details are not repeated here.
In an embodiment of the present application, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory, where the processor executes the computer program to implement the steps of the method for detecting text similarity, and compared with the related art, the method can implement:
At least two texts to be detected are obtained, word segmentation is performed on each text, and a target sequence of the text is obtained according to the word segmentation result. The target elements in the target sequence may be the participles at corresponding positions in the word segmentation result or attribute information of those participles, so that similarity can be detected from multiple different angles. For the target sequence of each text, each target element is merged in turn, according to the order of the target elements in the target sequence, with a subsequent preset number of target elements to obtain at least one merged element corresponding to the target element, and a merged element sequence of the text is obtained from all the target elements and the at least one merged element. This represents the order relationship of the target elements in the text at a finer granularity, allows subtle differences between texts with overlapping words to be identified, and lays a foundation for analyzing the semantics of the texts more accurately. The merged element sequence of each text is encoded, and a frequency vector of each text is obtained according to the encoding result; the feature of each dimension in the frequency vector represents the frequency, in the merged element sequence of the text, of the corresponding element in the total merged sequence. For any two texts, the text similarity is obtained according to their frequency vectors. The result is more accurate and better suited to a network name auditing scenario, where it can effectively mine batch malicious nickname registration and batch group aggregation behavior and support risk early warning.
As an alternative embodiment, the target sequence obtaining module comprises:
the word segmentation sequence submodule is used for obtaining a word segmentation sequence of the text according to the word segmentation result;
and the target sequence sub-module is used for obtaining a target sequence of the text according to the word segmentation sequence of the text.
As an alternative embodiment, the word segmentation sequence submodule includes:
the initial word segmentation unit is used for obtaining an initial word segmentation sequence of the text according to the word segmentation result;
the continuous word segmentation judging module is used for determining the initial word segmentation sequence as the word segmentation sequence if no continuous participles of the first target type exist in the initial word segmentation sequence;
and, if continuous participles of the first target type exist in the initial word segmentation sequence, replacing the continuous participles of the first target type as a whole with a preset number of preset participles, and taking the replaced initial word segmentation sequence as the word segmentation sequence.
As an optional embodiment, the continuous word segmentation judging module is specifically configured to:
if continuous participles of the first specified type exist in the initial participle sequence, replacing the continuous participles of the first specified type with first target participles, wherein the first target participles are used for representing the number of the continuous participles of the first specified type;
and if continuous participles of the second specified type exist in the initial participle sequence, replacing the continuous participles of the second specified type with second target participles, wherein the second target participles are a combination of the continuous participles of the second specified type.
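The two replacement rules for runs of continuous participles can be sketched as below. The classifier `toy_type_of`, the type names 'NUM' and 'ALPHA', and the count token format `<NUMx3>` are illustrative assumptions; this passage does not state which types are the first and second specified types or how the count is written.

```python
from itertools import groupby
from typing import Callable, List, Sequence


def toy_type_of(token: str) -> str:
    """Hypothetical classifier used only for this example."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "ALPHA"
    return "OTHER"


def collapse_runs(
    tokens: Sequence[str],
    type_of: Callable[[str], str],
    count_type: str,
    join_type: str,
) -> List[str]:
    """Replace each run of two or more consecutive tokens of `count_type`
    with one token encoding the run length, and each run of `join_type`
    with the concatenation of its tokens."""
    out: List[str] = []
    for token_type, group in groupby(tokens, key=type_of):
        run = list(group)
        if len(run) > 1 and token_type == count_type:
            out.append(f"<{token_type}x{len(run)}>")
        elif len(run) > 1 and token_type == join_type:
            out.append("".join(run))
        else:
            out.extend(run)
    return out


tokens = ["a", "b", "1", "2", "3", "!", "c"]
print(collapse_runs(tokens, toy_type_of, "NUM", "ALPHA"))
# ['ab', '<NUMx3>', '!', 'c']
```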
As an alternative embodiment, the target sequence submodule includes:
the word segmentation type unit is used for determining the type of each word segmentation in the word segmentation sequence;
the word type sequence unit is used for obtaining a word type sequence of the text according to the type of each participle, and each element in the word type sequence is used for indicating the type of the participle at the corresponding position in the participle sequence;
and the target sequence unit is used for obtaining a target sequence of the text according to the word type sequence of the text, and the attribute information of the participles comprises the types of the participles.
As an alternative embodiment, the target sequence unit comprises:
the type judging unit is used for, for each element in the word type sequence, determining the order of the element among the elements belonging to the second target type in the word type sequence if the type represented by the element belongs to the second target type; and determining the order of the element among the elements not belonging to the second target type in the word type sequence if the type represented by the element does not belong to the second target type;
and the sequence unit is used for obtaining the target sequence according to the order corresponding to each element in the word type sequence, wherein the attribute information of the participle also comprises the order of the element corresponding to the participle.
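One possible reading of the order determination is sketched below: each element receives its 1-based position among the elements of its own group, where the two groups are "belongs to the second target type" and "does not belong to it". The function name `order_sequence` and the labels 'L1' and 'L3' are assumptions, and how the orders are then combined into the target sequence is not specified in this passage, so the sketch stops at the list of orders.

```python
from typing import List, Sequence, Set


def order_sequence(word_types: Sequence[str],
                   second_target: Set[str]) -> List[int]:
    """Return, for each element of the word type sequence, its 1-based
    order within its own group (second target type or not)."""
    in_count = 0   # elements of the second target type seen so far
    out_count = 0  # elements of any other type seen so far
    orders: List[int] = []
    for word_type in word_types:
        if word_type in second_target:
            in_count += 1
            orders.append(in_count)
        else:
            out_count += 1
            orders.append(out_count)
    return orders


print(order_sequence(["L1", "L3", "L1", "L5"], {"L1"}))  # [1, 1, 2, 2]
```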
As an alternative embodiment, the word segmentation type unit is specifically configured to:
and for each participle in the participle sequence, if the participle belongs to multiple candidate types, determining the word frequency of the participle belonging to the candidate type in the participle sequence for each candidate type, and determining the type of the participle according to the candidate type corresponding to the highest word frequency.
As an alternative embodiment, the segmentation type unit comprises:
the first situation determining unit is used for taking the candidate type corresponding to the highest word frequency as the type of the participle if the candidate type corresponding to the highest word frequency is unique;
and the second condition determining unit is used for taking the candidate type with the highest priority and the highest word frequency as the type of the participle according to the predetermined type priority if the candidate type corresponding to the highest word frequency is not unique.
In an alternative embodiment, an electronic device is provided. As shown in fig. 11, the electronic device 4000 comprises a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer, and is not limited herein.
The memory 4003 is used for storing computer programs for executing the embodiments of the present application, and is controlled by the processor 4001 to execute. The processor 4001 is configured to execute a computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program may implement the steps and corresponding contents of the foregoing method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps and corresponding contents of the foregoing method embodiments can be implemented.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than described or illustrated herein.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as needed, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The above are only optional embodiments of partial implementation scenarios in the present application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of the present application are also within the scope of protection of the embodiments of the present application without departing from the technical idea of the present application.

Claims (12)

1. A text similarity detection method is characterized by comprising the following steps:
acquiring at least two texts to be detected;
for each text, performing word segmentation on the text, and obtaining a target sequence of the text according to a word segmentation result; the target elements in the target sequence are used for representing at least one of the participles at corresponding positions in the participle result or attribute information of the participles;
for the target sequence of each text, sequentially combining each target element with a subsequent preset number of target elements according to the sequence of each target element in the target sequence to obtain at least one combined element corresponding to the target element, and obtaining a combined element sequence of the text according to all the target elements and the at least one combined element;
coding the merging element sequence of each text, and obtaining the frequency vector of each text according to the coding result of the merging element sequence of each text; the feature of each dimension in the frequency count vector is used for representing the frequency count of a corresponding element in a total merging sequence in a merging element sequence of the text, the total merging sequence is obtained by sequencing target elements and merging elements in the merging element sequences of all the texts, and no repeated element exists in the total merging sequence;
and for any two texts, obtaining the text similarity of the any two texts according to the frequency vectors of the any two texts.
2. The method of claim 1, wherein obtaining the target sequence of text according to the segmentation result comprises:
obtaining a word segmentation sequence of the text according to the word segmentation result;
and obtaining a target sequence of the text according to the word segmentation sequence of the text.
3. The method of claim 2, wherein obtaining the segmentation sequence of the text according to the segmentation result comprises:
obtaining an initial word segmentation sequence of the text according to the word segmentation result;
if the initial word segmentation sequence does not have continuous first target type word segmentation, determining the initial word segmentation sequence as a word segmentation sequence;
if continuous first target type participles exist in the initial participle sequence, replacing the whole continuous first target type participles with preset participles with a preset number, and taking the replaced initial participle sequence as the participle sequence.
4. The method of claim 3, wherein the first target type comprises at least one of a first specified type or a second specified type;
if continuous first target type participles exist in the initial participle sequence, replacing the continuous first target type participle whole with preset participles with a preset number, wherein the preset participles include at least one of the following:
if continuous first specified type participles exist in the initial participle sequence, replacing the continuous first specified type participles with first target participles, wherein the first target participles are used for representing the number of the continuous first specified type participles;
and if continuous second specified type participles exist in the initial participle sequence, replacing the continuous second specified type participles with second target participles, wherein the second target participles are the combination of the continuous second specified type participles.
5. The method according to any one of claims 2-3, wherein obtaining the target sequence of the text according to the segmentation sequence of the text comprises:
determining the type of each participle in the participle sequence;
obtaining a word type sequence of the text according to the type of each word segmentation, wherein each element in the word type sequence is used for representing the type of the word segmentation at the corresponding position in the word segmentation sequence;
and obtaining a target sequence of the text according to the word type sequence of the text, wherein the attribute information of the participle comprises the type of the participle.
6. The method of claim 5, wherein obtaining the target sequence of the text according to the word type sequence of the text comprises:
for each element in the word type sequence, if the type represented by the element belongs to a second target type, determining the order of the element among the elements belonging to the second target type in the word type sequence; if the type represented by the element does not belong to the second target type, determining the order of the element among the elements which do not belong to the second target type in the word type sequence;
and acquiring the target sequence according to the sequence corresponding to each element in the word type sequence, wherein the attribute information of the participle also comprises the sequence of the element corresponding to the participle.
7. The method of claim 5, wherein the determining the type of each participle in the sequence of participles comprises:
and for each participle in the participle sequence, if the participle belongs to multiple candidate types, determining the word frequency of the participle belonging to the candidate type in the participle sequence for each candidate type, and determining the type of the participle according to the candidate type corresponding to the highest word frequency.
8. The method of claim 7, wherein determining the type of the participle according to the candidate type with the highest word frequency correspondence comprises:
if the candidate type corresponding to the highest word frequency is unique, taking the candidate type corresponding to the highest word frequency as the type of the participle;
and if the candidate type corresponding to the highest word frequency is not unique, taking the candidate type with the highest priority and the highest word frequency as the type of the participle according to the predetermined type priority.
9. A text similarity detection device is characterized by comprising:
the text acquisition module is used for acquiring at least two texts to be detected;
the target sequence obtaining module is used for segmenting the texts for each text and obtaining the target sequence of the texts according to the segmentation result; the target elements in the target sequence are used for representing at least one of the participles at corresponding positions in the participle result or attribute information of the participles;
a merging module, configured to merge, for the target sequence of each text, each target element with a subsequent preset number of target elements in sequence according to the sequence of each target element in the target sequence, to obtain at least one merged element corresponding to the target element, and obtain a merged element sequence of the text according to all the target elements and the at least one merged element;
the frequency vector module is used for coding the merging element sequence of each text and obtaining the frequency vector of each text according to the coding result of the merging element sequence of each text; the feature of each dimension in the frequency count vector is used for representing the frequency count of a corresponding element in a total merging sequence in the merging element sequence of the text, the total merging sequence is obtained by sequencing target elements and merging elements in the merging element sequences of all the texts, and no repeated element exists in the total merging sequence;
and the similarity calculation module is used for obtaining the text similarity of any two texts according to the frequency vectors of any two texts.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method for detecting text similarity according to any one of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the text similarity detection method according to any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method for detecting text similarity according to any one of claims 1 to 8.
CN202210651882.2A 2022-06-09 2022-06-09 Text similarity detection method and device, electronic equipment and storage medium Pending CN115186647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210651882.2A CN115186647A (en) 2022-06-09 2022-06-09 Text similarity detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210651882.2A CN115186647A (en) 2022-06-09 2022-06-09 Text similarity detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115186647A true CN115186647A (en) 2022-10-14

Family

ID=83513104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210651882.2A Pending CN115186647A (en) 2022-06-09 2022-06-09 Text similarity detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115186647A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204918A (en) * 2023-01-17 2023-06-02 内蒙古科技大学 Text similarity secret calculation method and equipment in natural language processing
CN116204918B (en) * 2023-01-17 2024-03-26 内蒙古科技大学 Text similarity secret calculation method and equipment in natural language processing

Similar Documents

Publication Publication Date Title
US11106714B2 (en) Summary generating apparatus, summary generating method and computer program
KR102532396B1 (en) Data set processing method, device, electronic equipment and storage medium
AU2017408800B2 (en) Method and system of mining information, electronic device and readable storable medium
CN111695352A (en) Grading method and device based on semantic analysis, terminal equipment and storage medium
CN110457672B (en) Keyword determination method and device, electronic equipment and storage medium
CN107341143B (en) Sentence continuity judgment method and device and electronic equipment
CN112686036B (en) Risk text recognition method and device, computer equipment and storage medium
CN110427612B (en) Entity disambiguation method, device, equipment and storage medium based on multiple languages
CN110134965B (en) Method, apparatus, device and computer readable storage medium for information processing
CN111783443A (en) Text disturbance detection method, disturbance reduction method, disturbance processing method and device
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN108205524B (en) Text data processing method and device
CN111259262A (en) Information retrieval method, device, equipment and medium
CN111767714B (en) Text smoothness determination method, device, equipment and medium
CN113988061A (en) Sensitive word detection method, device and equipment based on deep learning and storage medium
CN114741468B (en) Text deduplication method, device, equipment and storage medium
CN113408278A (en) Intention recognition method, device, equipment and storage medium
CN114861635A (en) Chinese spelling error correction method, device, equipment and storage medium
CN115186647A (en) Text similarity detection method and device, electronic equipment and storage medium
CN108268443B (en) Method and device for determining topic point transfer and acquiring reply text
CN110738056A (en) Method and apparatus for generating information
US20230130662A1 (en) Method and apparatus for analyzing multimodal data
CN111460224A (en) Comment data quality labeling method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40075277
Country of ref document: HK