CN111221960A - Text detection method, similarity calculation method, model training method and device - Google Patents

Text detection method, similarity calculation method, model training method and device

Info

Publication number
CN111221960A
CN111221960A (application CN201911030483.9A)
Authority
CN
China
Prior art keywords
text
vectors
training
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911030483.9A
Other languages
Chinese (zh)
Inventor
曹绍升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN201911030483.9A
Publication of CN111221960A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present specification provide a text detection method, a similarity calculation method, a model training method, a device, and an apparatus. The method comprises: acquiring a first text to be detected and a second text; generating a set of vectors for each word in the first text and the second text, the set of vectors comprising the word vector and the n-gram stroke vectors of each word; inputting the words of the first text, the words of the second text and the vector sets into a pre-trained text similarity calculation model to calculate the similarity between the first text and the second text; and determining whether the first text belongs to a target category based on the similarity and the category of the second text. Because the text similarity is calculated from the n-gram stroke vectors of the words in each text, associations between words can be captured at a finer granularity, the problem of new words appearing at prediction time is alleviated, and texts of the target category can be detected effectively.

Description

Text detection method, similarity calculation method, model training method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text detection method, a similarity calculation method, a model training method, a device, and an apparatus.
Background
In some cases, it is necessary to detect texts of a target category that satisfy certain conditions. Generally, to detect whether a text belongs to a target category, the text can be compared with known texts of that category; if the similarity between them is high, the text can be judged to belong to the target category, so accurately calculating text similarity is critical. For example, some lawbreakers commit insurance fraud by illegal means, such as organizing group fraud through QQ groups. For this fraud pattern, conventional analysis of geographical locations, account registration devices and the like cannot capture the group members well. Careful study of fraudulent insurance claim texts shows that the claim texts filled in by members of the same fraud group are semantically very similar, so potential fraud group members can be mined through text semantic analysis. To identify texts of the target category more effectively and accurately, the methods for calculating text similarity and detecting texts need to be improved.
Disclosure of Invention
Based on the above, the present specification provides a text detection method, a similarity calculation method, a model training method, a device and an apparatus.
According to a first aspect of embodiments of the present specification, there is provided a method for detecting a target text, the method including:
acquiring a second text and a first text to be detected;
generating a set of vectors for each word in the first text and the second text, the set of vectors including word vectors and n-gram stroke vectors for the words;
inputting each word in the first text, each word in the second text and the vector set into a pre-trained text similarity calculation model to calculate the similarity of the first text and the second text;
determining whether the first text is a text of a target category based on the similarity and the category of the second text.
According to a second aspect of embodiments of the present specification, there is provided a method for training a text similarity calculation model, the method including:
acquiring a first training text, a second training text and the similarity between the first training text and the second training text;
generating a set of vectors for each word in the first training text and the second training text, the set of vectors including word vectors for the words and n-gram stroke vectors;
and training according to the words in the first training text, the words in the second training text, the vector set and the similarity to obtain the text similarity calculation model.
According to a third aspect of embodiments herein, there is provided a method of determining text similarity, the method comprising:
acquiring at least two texts;
generating a set of vectors for each word in the at least two texts, the set of vectors including word vectors for the words and n-gram stroke vectors;
and inputting the at least two texts and the vector set into a pre-trained text similarity calculation model, and calculating the similarity between every two texts in the texts.
According to a fourth aspect of embodiments herein, there is provided an apparatus for detecting a target text, the apparatus comprising:
the acquisition module is used for acquiring the second text and the first text to be detected;
a vector generation module configured to generate a set of vectors for each word in the first text and the second text, the set of vectors including word vectors and n-gram stroke vectors for the words;
the calculation module is used for inputting all the words in the first text, all the words in the second text and the vector set into a pre-trained text similarity calculation model so as to calculate the similarity of the first text and the second text;
and the judging module is used for determining whether the first text is the text of the target category or not according to the similarity and the category of the second text.
According to a fifth aspect of embodiments herein, there is provided a training apparatus of a text similarity calculation model, the apparatus including:
the acquisition module is used for acquiring a first training text, a second training text and the similarity between the first training text and the second training text;
a vector generation module, configured to generate a set of vectors for each word in the first training text and the second training text, where the set of vectors includes word vectors for the words and n-gram stroke vectors;
and the training module is used for training according to the words in the first training text, the words in the second training text, the vector set and the similarity to obtain the text similarity calculation model.
According to a sixth aspect of embodiments herein, there is provided an apparatus for determining text similarity, the apparatus comprising:
the acquisition module is used for acquiring at least two texts;
a vector generation module for generating a set of vectors for each word in the at least two texts, the set of vectors including word vectors for the words and n-gram stroke vectors;
and the calculation module is used for inputting each word in the at least two texts and the vector set into a pre-trained text similarity calculation model and calculating the similarity of every two texts in the texts.
According to a seventh aspect of embodiments herein, there is provided a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the embodiments when executing the program.
By applying the solutions of the embodiments of this specification: on the one hand, when training the text similarity calculation model, not only the word vectors of the words in the training texts but also their n-gram stroke vectors are used as model input, so that the semantic information of Chinese words can be described at a finer granularity, words that do not appear in the training data can still be represented through their n-gram stroke vectors, and the similarity calculated by the model is more accurate. On the other hand, when calculating the similarity between texts, the word vectors and n-gram stroke vectors of the words are input into the pre-trained model as features, so that associations between words can be extracted at a finer granularity, the problem of new words appearing at prediction time is alleviated, and the calculation result is more accurate. In this way, the similarity between texts can be calculated more accurately, and texts of the target category can be detected more accurately according to that similarity.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart of a text similarity calculation model training method according to an embodiment of the present disclosure.
Fig. 2 is a flowchart of a text similarity calculation method according to an embodiment of the present specification.
Fig. 3 is a flowchart of a method for detecting a target category of text according to an embodiment of the present disclosure.
Fig. 4 is a block diagram of a logical structure of a text similarity calculation model training apparatus according to an embodiment of the present specification.
Fig. 5 is a block diagram of a logical structure of a text similarity calculation apparatus according to an embodiment of the present specification.
Fig. 6 is a block diagram of a logical structure of an apparatus for detecting a target type of text according to an embodiment of the present specification.
FIG. 7 is a block diagram of a computer device for implementing the methods of the present description, in accordance with one embodiment of the present description.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms, which are only used to distinguish one type of information from another. For example, without departing from the scope of this specification, first information may also be referred to as second information, and similarly, second information may be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
In many application scenarios, it is necessary to detect texts of a target category that satisfy certain conditions. Generally, to detect whether a text belongs to a target category, the text can be compared with known texts of that category; if the similarity between them is high, the text can be judged to belong to the same category, so accurately calculating text similarity is critical. For example, some lawbreakers commit fraud against certain insurance services by illegal means. The conventional approach is to detect such fraud by analyzing information such as geographical location and account registration equipment, but when lawbreakers organize group fraud through QQ groups, the conventional approach cannot capture the group members well. Careful study of fraudulent claim texts shows that the claim texts filled in by members of the same fraud group are semantically very similar, so potential fraud group members can be mined through text semantic analysis.
When detecting fraudulent texts, the text to be detected is compared with texts already determined to be fraudulent or non-fraudulent, the semantic similarity of the sentences in the two texts is analyzed, and whether the text to be detected is fraudulent is determined according to that similarity. The text similarity can be calculated by a text similarity calculation model. To detect texts of the target category effectively, accurately calculating the similarity between two texts is critical.
Based on this, the embodiment of the present specification first provides a training method of a text similarity calculation model, as shown in fig. 1, the method may include the following steps:
s102, acquiring a first training text, a second training text and the similarity between the first training text and the second training text;
s104, generating a vector set of each word in the first training text and the second training text, wherein the vector set comprises word vectors of the words and n-element stroke vectors;
s106, training according to the words in the first training text, the words in the second training text, the vector set and the similarity to obtain the text similarity calculation model.
When training the text similarity calculation model, a large number of training texts can be used. The training texts may include a first training text and a second training text, each of which may be a sentence or a paragraph, and the semantic similarity between them has been determined in advance. The similarity represents how similar or related the two training texts are. In some embodiments, it may be expressed as a numerical value, such as 0 to 100%, with a larger value indicating higher similarity; in other embodiments, it may be expressed directly as "similar" or "dissimilar".
Because a computer cannot process raw text directly, the training texts are first converted into vectors the computer can work with, that is, the texts are given a vectorized representation, and semantic similarity between texts is expressed through these vectors. Therefore, after the first and second training texts and their similarity are obtained, a set of vectors may be generated for each word in the two texts, the set including the word vector and the n-gram stroke vectors of each word. Algorithms for representing words as vectors are mature: some construct word vectors directly from the position of a word in a vocabulary, such as One-hot; some represent a word through its context semantics, such as Word2Vec; and some train word vectors based on the n-gram strokes of the word, such as cw2vec. The embodiments of this specification may obtain the word vectors by any of these methods.
Because word vectors capture associations only at the word level, they work best for words that have been seen before; for new words, the associations may not be determined well. Therefore, when training the text similarity calculation model, the n-gram stroke vectors of each word are used as model input in addition to its word vector, so that associations between words are mined at a finer granularity. An n-gram of strokes is a sequence of n consecutive strokes of a word. For example, the strokes of "林" (forest) can be decomposed into horizontal, vertical, left-falling and right-falling strokes, each stroke corresponding to a number: horizontal (1), vertical (2), left-falling (3), right-falling (4). The 1-gram strokes of "林" are its eight individual strokes, numbered 1, 2, 3, 4, 1, 2, 3, 4; its 2-gram strokes are numbered 12, 23, 34, 41, 12, 23, 34; and its 3-gram strokes are numbered 123, 234, 341, 412, 123, 234. Of course, 4-gram strokes, 5-gram strokes or longer n-grams can be extracted in the same way, which is not described again here.
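As a concrete illustration, here is a minimal sketch of n-gram stroke extraction under the numbering above (horizontal=1, vertical=2, left-falling=3, right-falling=4); the stroke lookup table is a hypothetical stand-in for a full stroke dictionary, not part of the patent:

```python
STROKES = {"林": [1, 2, 3, 4, 1, 2, 3, 4]}  # hypothetical per-character stroke codes

def stroke_ngrams(word: str, n: int) -> list[str]:
    """Return the n-gram stroke codes of a word as strings such as '123'."""
    codes = [c for ch in word for c in STROKES[ch]]
    return ["".join(map(str, codes[i:i + n])) for i in range(len(codes) - n + 1)]

print(stroke_ngrams("林", 3))
# ['123', '234', '341', '412', '123', '234']
```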
An n-gram stroke vector is a vector representing n consecutive strokes of a word. In some embodiments, the n-gram strokes of each word are represented numerically as features of the word, and word vector training is then performed in combination with the word's context semantics, yielding both the word vector and the n-gram stroke vectors of each word.
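One way such a representation could be composed, in the spirit of cw2vec, is to sum the embeddings of a word's stroke n-grams; the hashing trick, bucket count and dimension below are illustrative assumptions, not the patent's exact construction:

```python
import numpy as np

DIM, BUCKETS = 100, 2 ** 16
rng = np.random.default_rng(0)
ngram_table = rng.normal(scale=0.1, size=(BUCKETS, DIM))  # learned during training in practice

def word_vector(stroke_ngrams: list[str]) -> np.ndarray:
    """Sum the embeddings of a word's stroke n-grams (a stable hash would be
    used in practice instead of Python's built-in hash)."""
    idx = [hash(g) % BUCKETS for g in stroke_ngrams]
    return ngram_table[idx].sum(axis=0)
```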
In some embodiments, the set of vectors may also include n-gram pinyin vectors for each word, where an n-gram of pinyin is a sequence of n consecutive characters of the word's pinyin. For example, the pinyin of "森林" (forest) is "senlin", which can be divided into the characters s, e, n, l, i, n. Each distinct character is represented by a number, e.g. s (1), e (2), n (3), l (4), i (5). The 1-gram pinyin of "森林" is then s, e, n, l, i, n, numbered 1, 2, 3, 4, 5, 3; the 2-gram pinyin is se, en, nl, li, in, numbered 12, 23, 34, 45, 53; and the 3-gram pinyin is sen, enl, nli, lin, numbered 123, 234, 345, 453. Of course, 4-gram pinyin, 5-gram pinyin or longer n-grams can be extracted in the same way, which is not described again here. Training the model with the n-gram pinyin vectors as additional input allows associations between words to be characterized at a finer granularity from the phonetic perspective.
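The same n-gram idea applies to pinyin; a minimal sketch over the example above:

```python
def pinyin_ngrams(pinyin: str, n: int) -> list[str]:
    """Return the character-level n-grams of a word's pinyin string."""
    return [pinyin[i:i + n] for i in range(len(pinyin) - n + 1)]

print(pinyin_ngrams("senlin", 2))  # ['se', 'en', 'nl', 'li', 'in']
print(pinyin_ngrams("senlin", 3))  # ['sen', 'enl', 'nli', 'lin']
```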
Because Chinese text contains no spaces between words, a computer cannot tell how a sentence should be divided into words. Therefore, in some embodiments, before generating the set of vectors for each word in the first training text and the second training text, word segmentation may be performed on the two texts to obtain their words. When segmenting, the texts may be compared against a preset lexicon, and any string that appears in the lexicon is taken as a word, as in the sketch below. Of course, other word segmentation algorithms may also be used, and this specification is not limited in this respect.
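A minimal forward maximum-match segmenter against a preset lexicon, as one possible realization of the dictionary lookup just described; the lexicon contents and maximum word length are illustrative assumptions:

```python
LEXICON = {"森林", "覆盖", "面积"}  # hypothetical preset word-bank table
MAX_WORD_LEN = 4

def segment(text: str) -> list[str]:
    """Greedy forward maximum matching; unknown characters become single-char words."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in LEXICON or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

print(segment("森林覆盖面积"))  # ['森林', '覆盖', '面积']
```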
In some embodiments, after the first training text and the second training text have been segmented into one or more words, word vectors, n-gram stroke vectors and n-gram pinyin vectors may be generated for those words by the cw2vec algorithm. Of course, cw2vec is only an example; the embodiments of this specification do not exclude other algorithms with similar functions.
In some embodiments, the text similarity calculation model may be a DSSM (Deep Structured Semantic Model), which has high accuracy and is well suited to matching similar texts. The words of the first training text and the second training text, together with their generated vector sets, are used as the input of the DSSM; the similarity of the two training texts is used as its training target; and the DSSM is trained to obtain the final text similarity calculation model.
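A schematic two-tower model in the DSSM family, for illustration only: each text's bag-of-features vector (built from word and n-gram stroke indices) is encoded into a dense vector, and similarity is the cosine of the two encodings. FEAT_DIM, the layer sizes and the loss are illustrative assumptions, not values prescribed by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 30000  # assumed size of the word + stroke n-gram feature space

class TwoTowerDSSM(nn.Module):
    def __init__(self, feat_dim=FEAT_DIM, hidden=300, out=128):
        super().__init__()
        # Shared encoder: both texts pass through the same tower.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, out), nn.Tanh(),
        )

    def forward(self, x1, x2):
        return F.cosine_similarity(self.encoder(x1), self.encoder(x2), dim=-1)

model = TwoTowerDSSM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step against 0/1 similarity labels.
x1, x2 = torch.rand(4, FEAT_DIM), torch.rand(4, FEAT_DIM)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
opt.zero_grad()
loss = F.mse_loss((model(x1, x2) + 1) / 2, labels)  # map cosine to [0, 1]
loss.backward()
opt.step()
```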
When training the text similarity calculation model, both the word vectors and the n-gram stroke vectors of the words are used as model input. On the one hand, the semantic information of Chinese words can be described at a finer granularity; on the other hand, words that do not appear in the training data can still be represented through their n-gram stroke vectors, alleviating the problem of new words appearing at prediction time. A model trained in this way can calculate the similarity between texts more accurately.
In addition, the embodiment of the present specification further provides a text similarity calculation method, which may be used to calculate the similarity of two or more texts, as shown in fig. 2, where the method includes the following steps:
s202, acquiring at least two texts;
s204, generating a vector set of each word in the at least two texts, wherein the vector set comprises word vectors of the words and n-element stroke vectors;
s206, inputting the at least two texts and the vector set into a pre-trained text similarity calculation model, and calculating the similarity between every two texts in the texts.
The text similarity calculation model in the embodiments of this specification may be obtained through the above training method, or through other training methods, as long as the resulting model can predict text similarity; this is not limited here.
First, at least two texts whose similarity is to be determined are obtained, and a set of vectors is then generated for each word in those texts, the set including the word vector and the n-gram stroke vectors of each word. An n-gram stroke vector is a vector representing n consecutive strokes of a word; the definition of n-gram strokes and the generation of the vector sets are described in the training method above and are not repeated here.
In some embodiments, the set of vectors may also include the n-gram pinyin vectors of each word, an n-gram of pinyin being a sequence of n consecutive characters of the word's pinyin; the definition is given in the description of the training method above and is not repeated here. When calculating the similarity between texts, inputting the n-gram pinyin vectors into the model as additional features allows associations between words to be characterized at a finer granularity from the phonetic perspective, making the similarity prediction more accurate.
In some embodiments, before generating the set of vectors for each word in a text, the text may be segmented to obtain its words. When segmenting, the text may be compared against a preset lexicon, and any string that appears in the lexicon is taken as a word. Of course, other word segmentation algorithms may also be used, and this specification is not limited in this respect.
In some embodiments, after the texts whose similarity is to be determined have been segmented, word vectors, n-gram stroke vectors or n-gram pinyin vectors can be generated for the words by the cw2vec algorithm. Of course, cw2vec is only an example; the embodiments of this specification do not exclude other algorithms with similar functions.
After the vector sets of the texts whose similarity is to be determined have been obtained, the vector sets and the words obtained by segmenting the at least two texts can be input into the pre-trained text similarity calculation model, which calculates the similarity between every two of the texts. When calculating text similarity, not only the word vectors but also the n-gram stroke vectors of the words are input into the pre-trained model as features, so that the semantic information of Chinese words can be extracted at a finer granularity and words that do not appear in the training data can still be represented through their n-gram stroke vectors, alleviating the problem of new words appearing at prediction time. In this way, the similarity between texts can be calculated more accurately.
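A hedged sketch of scoring every pair among the input texts (step S206); `featurize` (text to feature tensor with a batch dimension) and `model` are hypothetical stand-ins for the components described above:

```python
from itertools import combinations

def pairwise_similarity(texts, featurize, model):
    """Return a dict mapping each unordered text pair to its model similarity."""
    feats = {t: featurize(t) for t in texts}
    return {
        (a, b): float(model(feats[a], feats[b]))
        for a, b in combinations(texts, 2)
    }
```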
An embodiment of the present specification further provides a method for detecting texts of a target category. As shown in fig. 3, the method includes the following steps:
s302, acquiring a second text and a first text to be detected;
s304, generating a vector set of each word in the first text and the second text, wherein the vector set comprises word vectors and n-element stroke vectors of the words;
s306, inputting all words in the first text, all words in the second text and the vector set into a pre-trained text similarity calculation model to calculate the similarity of the first text and the second text;
s308, determining whether the first text is the text of the target category or not based on the similarity and the category of the second text.
The target text detection method of this specification can be used in any scenario where text similarity determines whether a text belongs to a target category. For example, it can be used to detect whether an insurance claim text is fraudulent: because fraudulent claim texts are highly similar to one another, whether a text to be detected belongs to the target category can be determined by judging its similarity to known fraudulent or non-fraudulent texts.
When performing such detection, a first text to be detected and a second text whose category has been determined may first be obtained; for an insurance claim text, for example, the category of the second text may be fraudulent or non-fraudulent.
A set of vectors for each word in the first text and the second text may then be generated, the set including the word vector and the n-gram stroke vectors of each word. The word vector of each word can be obtained with one or more of the One-hot, Word2Vec or cw2vec algorithms. An n-gram stroke vector is a vector representing n consecutive strokes of a word; the definition of n-gram strokes is given in the description of the training method above and is not repeated here. In some embodiments, after each word is decomposed into strokes, its n-gram strokes can be represented numerically as features of the word, and word vector training is then performed in combination with the word's context semantics, so that both the word vector and the n-gram stroke vectors of each word are obtained.
In some embodiments, the set of vectors may also include the n-gram pinyin vectors of each word, an n-gram of pinyin being a sequence of n consecutive characters of the word's pinyin; the definition is given in the description of the training method above and is not repeated here. When calculating the similarity between the first text and the second text, inputting the n-gram pinyin vectors into the model as additional features allows associations between words to be characterized at a finer granularity from the phonetic perspective, making the similarity prediction more accurate.
Because Chinese text contains no spaces between words, a computer cannot tell how a sentence should be divided into words. Therefore, in some embodiments, before generating the set of vectors for each word in the first text and the second text, word segmentation may be performed on the two texts to obtain their words. When segmenting, the texts may be compared against a preset lexicon, and any string that appears in the lexicon is taken as a word. Of course, other word segmentation algorithms may also be used, and this specification is not limited in this respect.
In some embodiments, the cw2vec algorithm may be used to generate the vector sets of the words in the first text and the second text: the two texts are first segmented into one or more words, and word vectors, n-gram stroke vectors and n-gram pinyin vectors are then generated for those words by cw2vec.
In some embodiments, the category of the second text may be derived using an unsupervised learning model. For example, when detecting fraudulent claim texts, no labels are available in the early stage, so the texts can first be clustered by an unsupervised learning model to distinguish fraudulent from non-fraudulent texts; after a period of data accumulation, a large number of labeled texts become available, as in the sketch below.
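One illustrative way to bootstrap such labels, using TF-IDF features plus k-means clustering (neither is prescribed by the patent, and the sample texts are hypothetical pre-segmented claims):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-segmented claim texts (whitespace-joined words); in practice
# these would come from the word segmentation step described earlier.
claims = [
    "理赔 申请 材料 齐全",
    "理赔 申请 材料 完整",
    "手机 屏幕 意外 碎裂",
    "手机 屏幕 摔坏 碎裂",
]

X = TfidfVectorizer().fit_transform(claims)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster ids to be inspected and mapped to fraudulent / non-fraudulent
```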
In some embodiments, the pre-trained text similarity calculation model may be obtained by training using the above text similarity calculation model training method, or may be obtained by training using other methods as long as the model obtained by final training has similar functions.
In some embodiments, the text similarity calculation model may be a DSSM (Deep Structured Semantic Model), which has high accuracy and is well suited to matching similar texts. The words of the first training text and the second training text, together with their generated vector sets, are used as the input of the DSSM; the similarity of the two training texts is used as its training target; and the DSSM is trained to obtain the final text similarity calculation model.
After the vector sets of the words in the first text and the second text have been obtained, the words of the segmented first text, the words of the segmented second text, and their vector sets may be input into the pre-trained text similarity calculation model to calculate the similarity between the two texts, and whether the first text belongs to the target category is then determined based on that similarity and the category of the second text. For example, if the similarity exceeds a threshold (set according to actual conditions) and the second text belongs to the target category, the first text is taken to belong to the target category as well, as in the decision rule sketched below.
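A minimal sketch of the decision in step S308; the threshold value is an assumption to be tuned in practice:

```python
def is_target_category(similarity: float, second_is_target: bool,
                       threshold: float = 0.8) -> bool:
    """Flag the first text as target-category when it is sufficiently similar
    to a second text known to belong to the target category."""
    return second_is_target and similarity > threshold
```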
In order to further explain the text similarity calculation model training method, the text similarity calculation method, and the target text detection method provided in the embodiments of the present specification, a specific application scenario is explained below.
Currently, many lawbreakers conduct organized group fraud against certain insurance services. Since fraudulent claim texts are very similar to one another, they can be detected by measuring the similarity between texts. Specifically, detection comprises two stages: a stage in which the text similarity calculation model is trained, and a stage in which the trained model is applied to detect fraudulent claim texts.
The model training phase comprises the following steps:
1. Data collection: obtain sentence pairs from a database that have been accumulated or manually labeled, each pair carrying similarity information for its two sentences.
2. Sentence segmentation: perform word segmentation on each sentence of the pair.
3. Word vector training: use the cw2vec algorithm to train word vectors and n-gram stroke vectors on the segmented sentence pairs, obtaining the word vectors and n-gram stroke vectors of all words in the pairs.
4. DSSM training: take the segmented sentence pairs, the similarity information, and the trained word vectors and n-gram stroke vectors as input, and train the DSSM to obtain the trained text similarity DSSM model.
Fraudulent text detection stage (a consolidated sketch follows this list):
1. Data collection: obtain the sentence to be detected and a sentence carrying label information (the label indicating whether that sentence is fraudulent or normal).
2. Sentence segmentation: perform word segmentation on both sentences.
3. Data processing: use the cw2vec algorithm to generate the word vectors and n-gram stroke vectors of all words in both sentences.
4. Similarity calculation: input the two segmented sentences and the word vectors and n-gram stroke vectors of their words into the trained DSSM model, and calculate the similarity of the two sentences.
5. Target text detection: determine whether the sentence to be detected is a fraudulent text according to the similarity value and the label of the labeled sentence.
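Tying the five steps together, a hedged end-to-end sketch; `segment`, `featurize` and `dssm_model` are hypothetical stand-ins for the components of steps 2-4, not APIs defined by the patent:

```python
def detect_fraud_text(candidate: str, labeled: str, label_is_fraud: bool,
                      segment, featurize, dssm_model,
                      threshold: float = 0.8) -> bool:
    w1, w2 = segment(candidate), segment(labeled)  # step 2: word segmentation
    f1, f2 = featurize(w1), featurize(w2)          # step 3: word + n-gram stroke vectors
    sim = float(dssm_model(f1, f2))                # step 4: DSSM similarity
    return label_is_fraud and sim > threshold      # step 5: decision
```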
When training the text similarity calculation model, the n-gram stroke vectors obtained through cw2vec training are used as DSSM input in addition to the word vectors. On the one hand, the semantic information of Chinese words can be described at a finer granularity; on the other hand, words that do not appear in the training data can still be represented through their n-gram stroke vectors, alleviating the problem of new words appearing at prediction time, so the target text can be detected more accurately.
The technical features of the above embodiments can be combined arbitrarily provided the combinations involve no conflict or contradiction; for brevity, not every combination is described, but any such combination also falls within the scope disclosed in this specification.
As shown in fig. 4, which is a training apparatus for a text similarity calculation model according to an embodiment of the present specification, the apparatus 40 may include:
an obtaining module 41, configured to obtain a first training text, a second training text, and a similarity between the first training text and the second training text;
a vector generation module 42, configured to generate a vector set of the words in the first training text and the second training text, where the vector set includes a word vector and an n-gram stroke vector of each word;
a training module 43, configured to obtain the text similarity calculation model according to the words in the first training text, the words in the second training text, the vector set, and the similarity training.
As shown in fig. 5, which is an apparatus for determining text similarity according to an embodiment of the present specification, the apparatus 50 may include:
an obtaining module 51, configured to obtain at least two texts;
a vector generation module 52, configured to generate a set of vectors for each word in the at least two texts, where the set of vectors includes a word vector and an n-gram stroke vector of the word;
and the calculating module 53 is configured to input the at least two texts and the vector set into a pre-trained text similarity calculation model, and calculate similarity between every two texts in the texts.
As shown in fig. 6, which is a device for detecting a target category of text according to an embodiment of the present disclosure, the device 60 may include:
the acquiring module 61 is configured to acquire the second text and the first text to be detected;
a vector generation module 62, configured to generate a set of vectors for each word in the first text and the second text, where the set of vectors includes a word vector and an n-gram stroke vector of the word;
a calculating module 63, configured to input each word in the first text, each word in the second text, and the vector set into a pre-trained text similarity calculation model to calculate a similarity between the first text and the second text;
and a decision module 64 for determining whether the first text is a text of a target category based on the similarity and the category of the second text.
In one embodiment, the set of vectors further includes the n-gram pinyin vectors of the words.
In one embodiment, before the apparatus is configured to generate the set of vectors for the words in the first text and the second text, the apparatus is further configured to:
and performing word segmentation processing on the first text and the second text to obtain one or more words.
In one embodiment, the category of the second text is obtained through an unsupervised learning model.
In one embodiment, the text similarity calculation model is trained based on the words in the first training text, the words in the second training text, and a vector set consisting of word vectors and n-gram stroke vectors of the words in the first training text and the words in the second training text.
In one embodiment, the text similarity calculation model is a DSSM model.
The specific details of the implementation process of the functions and actions of each module in the device are referred to the implementation process of the corresponding step in the method, and are not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, wherein the modules described as separate parts may or may not be physically separate, and the parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
The device embodiments of this specification can be applied to computer equipment, such as a server or an intelligent terminal, and may be implemented by software, by hardware, or by a combination of the two. Taking software implementation as an example, as a logical device, it is formed by the processor of the computer device in which it is located reading corresponding computer program instructions from non-volatile storage into memory and running them. In terms of hardware, fig. 7 shows a hardware structure diagram of the computer device in which the device of this specification is located; besides the processor 702, memory 704, network interface 706 and non-volatile memory 708 shown in fig. 7, the server or electronic device may also include other hardware according to the actual functions of the computer device, which is not described again. The non-volatile memory 708 stores computer instructions, and the processor 702 executes those instructions to implement the target-category text detection method, the text similarity calculation method and the text similarity calculation model training method of any of the above embodiments.
Accordingly, the embodiments of the present specification also provide a computer storage medium, in which a program is stored, and the program, when executed by a processor, implements the method in any of the above embodiments.
Embodiments of the present description may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
Other embodiments of the present disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. The embodiments of the present specification are intended to cover any variations, uses, or adaptations of the embodiments of the specification following, in general, the principles of the embodiments of the specification and including such departures from the present disclosure as come within known or customary practice in the art to which the embodiments of the specification pertain. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the embodiments being indicated by the following claims.
It is to be understood that the embodiments of the present specification are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the embodiments of the present specification is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (12)

1. A method of detecting text of a target category, the method comprising:
acquiring a second text and a first text to be detected;
generating a set of vectors for each word in the first text and the second text, the set of vectors including word vectors and n-gram stroke vectors for the words;
inputting each word in the first text, each word in the second text and the vector set into a pre-trained text similarity calculation model to calculate the similarity of the first text and the second text;
determining whether the first text is a text of a target category based on the similarity and the category of the second text.
2. The detection method of claim 1, the set of vectors further comprising n-gram pinyin vectors of the words.
3. The detection method of claim 1, prior to generating the set of vectors for the terms in the first text and the second text, the method further comprising:
and performing word segmentation processing on the first text and the second text to obtain one or more words.
4. The detection method of claim 1, the category of the second text being derived by an unsupervised learning model.
5. The detection method according to claim 1, wherein the text similarity calculation model is trained based on the words in the first training text, the words in the second training text, and a vector set consisting of word vectors and n-gram stroke vectors of the words in the first training text and the words in the second training text.
6. The detection method of claim 1, the text similarity calculation model being a DSSM model.
7. A training method of a text similarity calculation model, the method comprising:
acquiring a first training text, a second training text and the similarity between the first training text and the second training text;
generating a set of vectors for each word in the first training text and the second training text, the set of vectors including word vectors for the words and n-gram stroke vectors;
and training according to the words in the first training text, the words in the second training text, the vector set and the similarity to obtain the text similarity calculation model.
8. A method of determining text similarity, the method comprising:
acquiring at least two texts;
generating a set of vectors for each word in the at least two texts, the set of vectors including word vectors for the words and n-gram stroke vectors;
and inputting each word in the at least two texts and the vector set into a pre-trained text similarity calculation model, and calculating the similarity between every two texts in the texts.
9. An apparatus for detecting a target class of text, the apparatus comprising:
the acquisition module is used for acquiring the second text and the first text to be detected;
a vector generation module configured to generate a set of vectors for each word in the first text and the second text, the set of vectors including word vectors and n-gram stroke vectors for the words;
the calculation module is used for inputting all the words in the first text, all the words in the second text and the vector set into a pre-trained text similarity calculation model so as to calculate the similarity of the first text and the second text;
and the judging module is used for determining whether the first text is the text of the target category or not according to the similarity and the category of the second text.
10. A training apparatus of a text similarity calculation model, the apparatus comprising:
the acquisition module is used for acquiring a first training text, a second training text and the similarity between the first training text and the second training text;
a vector generation module, configured to generate a set of vectors for each word in the first training text and the second training text, where the set of vectors includes word vectors for the words and n-gram stroke vectors;
and the training module is used for training according to the words in the first training text, the words in the second training text, the vector set and the similarity to obtain the text similarity calculation model.
11. An apparatus for determining text similarity, the apparatus comprising:
the acquisition module is used for acquiring at least two texts;
a vector generation module for generating a set of vectors for each word in the at least two texts, the set of vectors including word vectors for the words and n-gram stroke vectors;
and the calculation module is used for inputting each word in the at least two texts and the vector set into a pre-trained text similarity calculation model and calculating the similarity of every two texts in the texts.
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 8 when executing the program.
CN201911030483.9A 2019-10-28 2019-10-28 Text detection method, similarity calculation method, model training method and device Pending CN111221960A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911030483.9A CN111221960A (en) 2019-10-28 2019-10-28 Text detection method, similarity calculation method, model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911030483.9A CN111221960A (en) 2019-10-28 2019-10-28 Text detection method, similarity calculation method, model training method and device

Publications (1)

Publication Number Publication Date
CN111221960A true CN111221960A (en) 2020-06-02

Family

ID=70830574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911030483.9A Pending CN111221960A (en) 2019-10-28 2019-10-28 Text detection method, similarity calculation method, model training method and device

Country Status (1)

Country Link
CN (1) CN111221960A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503184A (en) * 2016-10-24 2017-03-15 海信集团有限公司 Determine the method and device of the affiliated class of service of target text
CN108345580A (en) * 2017-01-22 2018-07-31 阿里巴巴集团控股有限公司 A kind of term vector processing method and processing device
CN108491382A (en) * 2018-03-14 2018-09-04 四川大学 A kind of semi-supervised biomedical text semantic disambiguation method
CN109299269A (en) * 2018-10-23 2019-02-01 阿里巴巴集团控股有限公司 A kind of file classification method and device
CN110059155A (en) * 2018-12-18 2019-07-26 阿里巴巴集团控股有限公司 The calculating of text similarity, intelligent customer service system implementation method and device
CN110046340A (en) * 2018-12-28 2019-07-23 阿里巴巴集团控股有限公司 The training method and device of textual classification model
CN109858039A (en) * 2019-03-01 2019-06-07 北京奇艺世纪科技有限公司 A kind of text information identification method and identification device
CN110321433A (en) * 2019-06-26 2019-10-11 阿里巴巴集团控股有限公司 Determine the method and device of text categories

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708884A (en) * 2020-06-02 2020-09-25 上海硬通网络科技有限公司 Text classification method and device and electronic equipment
CN111695333A (en) * 2020-06-24 2020-09-22 华侨大学 Trademark font similarity detection method, device and equipment
CN111695333B (en) * 2020-06-24 2022-09-13 华侨大学 Trademark font similarity detection method, device and equipment
CN111832288A (en) * 2020-07-27 2020-10-27 网易有道信息技术(北京)有限公司 Text correction method and device, electronic equipment and storage medium
CN111832288B (en) * 2020-07-27 2023-09-29 网易有道信息技术(北京)有限公司 Text correction method and device, electronic equipment and storage medium
WO2022095370A1 (en) * 2020-11-06 2022-05-12 平安科技(深圳)有限公司 Text matching method and apparatus, terminal device, and storage medium

Similar Documents

Publication Publication Date Title
CN110147726B (en) Service quality inspection method and device, storage medium and electronic device
CN111221960A (en) Text detection method, similarity calculation method, model training method and device
CN109460455B (en) Text detection method and device
CN109583468B (en) Training sample acquisition method, sample prediction method and corresponding device
JP2019511037A (en) Method and device for modeling machine learning model
US7783581B2 (en) Data learning system for identifying, learning apparatus, identifying apparatus and learning method
KR20080075501A (en) Information classification paradigm
CN109635010B (en) User characteristic and characteristic factor extraction and query method and system
CN110491368B (en) Dialect background-based voice recognition method, device, computer equipment and storage medium
CN112036187A (en) Context-based video barrage text auditing method and system
CN106997350B (en) Data processing method and device
CN114399382A (en) Method and device for detecting fraud risk of user, computer equipment and storage medium
KR102334018B1 (en) Apparatus and method for validating self-propagated unethical text
CN117409419A (en) Image detection method, device and storage medium
CN111612284B (en) Data processing method, device and equipment
CN110879832A (en) Target text detection method, model training method, device and equipment
CN113761137A (en) Method and device for extracting address information
CN115774784A (en) Text object identification method and device
CN113836297B (en) Training method and device for text emotion analysis model
CN114510720A (en) Android malicious software classification method based on feature fusion and NLP technology
CN109492396B (en) Malicious software gene rapid detection method and device based on semantic segmentation
CN113177603A (en) Training method of classification model, video classification method and related equipment
CN113343699A (en) Log security risk monitoring method and device, electronic equipment and medium
CN115186775B (en) Method and device for detecting matching degree of image description characters and electronic equipment
CN111078877A (en) Data processing method, training method of text classification model, and text classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200602)