CN110929530B - Multi-language junk text recognition method and device and computing equipment - Google Patents


Info

Publication number: CN110929530B
Authority: CN (China)
Prior art keywords: text, language, recognized, probability, junk
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201811082749.XA
Other languages: Chinese (zh)
Other versions: CN110929530A (en)
Inventors: 康杨杨, 高喆, 周笑添, 孙常龙, 刘晓钟, 司罗
Current Assignee: Alibaba Group Holding Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Alibaba Group Holding Ltd
Priority date: (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Events:
  • Application filed by Alibaba Group Holding Ltd
  • Priority claimed from CN201811082749.XA
  • Publication of CN110929530A
  • Application granted
  • Publication of CN110929530B
  • Active legal status (current)
  • Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for identifying multilingual junk texts, which comprises the following steps: acquiring a text to be recognized, wherein the text to be recognized comprises at least two languages; converting the text to be recognized into an intermediate text written in a main language, wherein the main language is one of at least two languages; and judging whether the text to be recognized is the junk text or not by adopting a preset classification model according to the intermediate text. The invention also discloses a corresponding recognition device and computing equipment for the multilingual junk text.

Description

Multi-language junk text recognition method and device and computing equipment
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, and a computing device for identifying multilingual junk text.
Background
Driven by profit, or simply to vent negative emotions, some people publish junk texts containing abusive, pornographic, politically sensitive and other objectionable words on platforms such as SMS, instant messaging, and games. Each platform typically inspects the text content published on it (e.g., by sensitive-word matching) to identify and mask junk text. To avoid being identified and masked, these junk texts often mix multiple languages to interfere with the platform's inspection of the text content.
For the recognition of multilingual junk text, one possible method is to manually label a number of multilingual junk texts as training samples, train a classification model, and then use the trained classification model to determine whether a text is junk text. However, training samples of multilingual junk text are few and difficult to acquire, so such a classification model judges junk text inaccurately and generalizes poorly.
Therefore, a more effective method for recognizing multilingual junk text is needed.
Disclosure of Invention
To this end, the present invention provides a method, apparatus and computing device for recognition of multilingual spam in an effort to solve or at least alleviate the above-identified problems.
According to one aspect of the present invention, there is provided a method for recognizing multilingual junk text, comprising: acquiring a text to be recognized, wherein the text to be recognized comprises at least two languages; converting the text to be recognized into an intermediate text written in a main language, wherein the main language is one of at least two languages; and judging whether the text to be recognized is the junk text or not by adopting a preset classification model according to the intermediate text.
According to an aspect of the present invention, there is provided an apparatus for recognizing multilingual junk text, comprising: the acquisition module is suitable for acquiring a text to be identified, wherein the text to be identified comprises at least two languages; the conversion module is suitable for converting the text to be recognized into an intermediate text written in a main language, wherein the main language is one of the at least two languages; and the judging module is suitable for judging whether the text to be identified is the junk text or not by adopting a preset classification model according to the intermediate text.
According to one aspect of the invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be adapted to be executed by the at least one processor, the program instructions comprising instructions for performing the method of recognition of multilingual spam as described above.
According to one aspect of the present invention, there is provided a readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform a method of recognition of multilingual spam as described above.
According to the technical scheme of the present invention, the multilingual text to be recognized is first converted into an intermediate text in a single language (namely, the main language); then, according to the intermediate text, a preset classification model is used to judge whether the text to be recognized is junk text. The classification model in embodiments of the present invention is a single-language model, used to judge whether a text in one specific language is junk text. Compared with multilingual junk text, single-language junk text is easier to acquire and has more training samples, so a single-language classification model can judge more accurately whether the intermediate text is junk text, and therefore whether the corresponding multilingual text to be recognized is junk text.
Further, in one embodiment of the present invention, the classification model may include at least two models, for example, a first classification model for determining whether text in a first language is junk text and a second classification model for determining whether text in a second language is junk text. Respectively translating the intermediate text into a first text in a first language and a second text in a second language; then, respectively inputting the first text and the second text into a first classification model and a second classification model to respectively output a first probability and a second probability that the first text and the second text are junk texts; and finally, comprehensively judging whether the text to be recognized is the junk text or not by combining the first probability and the second probability. By adopting a plurality of classification models, whether the text to be recognized is the junk text can be judged from the angles of a plurality of single languages, and the error of recognizing the junk text by only one classification model is reduced, so that the recognition result is more reliable and the accuracy is higher.
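The exact rule for combining the first and second probabilities is left open above; a minimal sketch, assuming a weighted average compared against a fixed threshold (the weights and the threshold are illustrative assumptions, not values prescribed by the invention):

```python
def is_junk_text(first_probability: float, second_probability: float,
                 first_weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Combine the junk probabilities output by the first and second
    single-language classification models into one overall decision.

    The weighted-average rule and the 0.5 threshold are illustrative
    assumptions; any monotone combination (max, product, ...) would fit
    the multi-model scheme described above equally well.
    """
    combined = (first_weight * first_probability
                + (1.0 - first_weight) * second_probability)
    return combined >= threshold
```

For instance, if the first classification model outputs 0.9 and the second 0.7, the combined score 0.8 exceeds the threshold and the text to be recognized is judged to be junk text.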
The foregoing is merely an overview of the technical solutions of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the contents of the specification, and to make the above and other objects, features, and advantages of the present invention more readily apparent, specific embodiments of the invention are set forth below.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which set forth the various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to fall within the scope of the claimed subject matter. The above, as well as additional objects, features, and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Like reference numerals generally refer to like parts or elements throughout the present disclosure.
FIG. 1 shows a schematic diagram of a spam text recognition system 100 in accordance with one embodiment of the present invention;
FIG. 2 shows a schematic diagram of a computing device 200 according to one embodiment of the invention;
FIG. 3 illustrates a flow diagram of a method 300 of recognition of multilingual spam text in accordance with one embodiment of the present invention;
FIG. 4 shows a schematic diagram of a recognition process of multilingual spam text according to one embodiment of the present invention; and
FIG. 5 shows a schematic diagram of a multilingual spam text recognition device 500 according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a schematic diagram of a spam text recognition system 100 in accordance with one embodiment of the present invention. As shown in fig. 1, the spam text recognition system 100 includes a user terminal 110 and a computing device 200.
The user terminal 110, i.e., the terminal device used by a user, may be a personal computer such as a desktop or notebook computer, or a mobile phone, tablet computer, multimedia device, or smart wearable device, but is not limited thereto. The computing device 200 is used to provide services to the user terminal 110 and may be implemented as a server, e.g., an application server, a Web server, etc.; it may also be a desktop computer, notebook computer, processor chip, mobile phone, tablet computer, or the like, but is not limited thereto.
In an embodiment of the present invention, the computing device 200 may be used to provide a text publishing service to users. For example, the computing device 200 may serve as the server of a social communication application, such as SMS, WeChat, Weibo, or Tieba, on which users may send messages or post content information; for another example, computing device 200 may serve as the server of a game application, on which users may post session messages or publish posts in a community, forum, or the like. The text publishing services that computing device 200 may provide are described above using a social communication application and a game application as examples; however, those skilled in the art will appreciate that computing device 200 may be any device capable of providing a text publishing service to users, and is not limited to the servers of social communication or game applications.
The user publishes text content on the text publishing platform provided by computing device 200 via the user terminal 110. In some cases, driven by profit or negative emotion, the text content published by a user may contain abusive, pornographic, politically sensitive or other objectionable words, which disturb platform order and interfere with other users' normal use. Such low-value text containing objectionable content is junk text. To maintain a good platform environment, computing device 200 may inspect the text content published by users to identify junk text and then mask it, delete it, and so on. To avoid being identified and masked, illegitimate users often mix multiple languages to interfere with computing device 200's inspection of text content. Embodiments of the present invention therefore provide a recognition method for multilingual junk text, so that computing device 200 can identify multilingual junk text more effectively. The recognition method of the present invention is described in detail below.
In one embodiment, the spam text recognition system 100 also includes a data storage device 120. The data storage 120 may be a relational database such as MySQL, ACCESS, etc., or a non-relational database such as NoSQL, etc.; the data storage device 120 may be a local database residing in the computing device 200, or may be a distributed database, such as HBase, disposed at a plurality of geographic locations, and in any case, the data storage device 120 is used to store data, and the specific deployment and configuration of the data storage device 120 is not limited by the present invention. The computing device 200 may connect with the data storage 120 and retrieve data stored in the data storage 120. For example, the computing device 200 may directly read the data in the data storage device 120 (when the data storage device 120 is a local database of the computing device 200), or may access the internet through a wired or wireless manner, and obtain the data in the data storage device 120 through a data interface.
In an embodiment of the present invention, the data storage 120 stores therein historical text content published by the user. It will be appreciated that the data storage device 120 may store all content that is distributed by all users, or may store content that is distributed by some users during some time periods (e.g., the last three months). The storage manner of the data storage device 120 for the historical text content published by the user can be set by those skilled in the art, which is not limited by the present invention. The user published historical text content may be part of a corpus that may include text obtained through other channels, such as text extracted from Wikipedia, etc., in addition to the user published historical text content. The texts in the corpus can be divided according to the number and types of the languages, for example, the corpus is divided into a single-language corpus and a multi-language corpus, wherein the single-language corpus further comprises a Chinese corpus, an English corpus, a Russian corpus, and the like.
In one embodiment, for any language, some corpora may be selected from the corpus of that language for labeling, and the labeled corpora are used as training samples to train machine learning models for natural language processing (NLP). For example, some texts may be selected from the Chinese corpus and labeled with classification labels, where a classification label indicates whether a text is junk text. A Chinese classification model can then be trained on the Chinese texts labeled with classification labels; this classification model can be used to judge whether a Chinese text is junk text. For another example, some texts may be selected from the Chinese corpus to train a Chinese language model, which can be used to determine the smoothness (fluency) of a piece of Chinese text. When training the language model, the existing texts in the Chinese corpus can be used as positive samples; because the texts in the corpus actually appeared in real use of the language, their smoothness is usually high. Negative samples may be derived by processing the texts in the corpus, for example by disrupting the order of words in a text, or by deleting some words in a text, so as to reduce its smoothness.
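The negative-sample construction described above (lowering a sentence's smoothness by disrupting its word order) can be sketched as follows; the function name is a hypothetical helper, not part of the invention:

```python
import random

def make_negative_sample(words: list[str], seed: int = 42) -> list[str]:
    """Turn a fluent corpus sentence (a positive sample) into a
    low-smoothness negative sample by shuffling its word order.
    Deleting some words, as also mentioned above, would be an
    alternative perturbation."""
    rng = random.Random(seed)
    shuffled = list(words)
    # Reshuffle until the order actually changes (for sentences > 1 word).
    while len(words) > 1 and shuffled == words:
        rng.shuffle(shuffled)
    return shuffled
```

The shuffled sentence keeps the same vocabulary but loses its fluent word order, which is exactly what the language model is trained to penalize.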
Implementation of the multi-lingual spam text recognition method of the present invention requires training of machine learning models (e.g., the aforementioned classification models, language models, etc.) based on the corpus stored in the data storage 120. The function and training method of the machine learning model related to the multi-language spam text recognition method of the present invention will be described in detail below.
The recognition method of multilingual spam text of the present invention can be performed in a computing device. FIG. 2 illustrates a block diagram of a computing device 200 according to one embodiment of the invention. As shown in FIG. 2, in a basic configuration 202, computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. Processor 204 may include one or more levels of cache, such as a level-one cache 210 and a level-two cache 212, a processor core 214, and registers 216. The example processor core 214 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 206 may include an operating system 220, one or more applications 222, and program data 224. The application 222 is in effect a plurality of program instructions that instruct the processor 204 to perform corresponding operations. In some implementations, the application 222 may be arranged to cause the processor 204 to operate with the program data 224 on an operating system.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to basic configuration 202 via bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 252. The example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication via one or more I/O ports 258 and external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.). The example communication device 246 may include a network controller 260 that may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private wired network, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
In the computing device 200 according to the invention, the application 222 comprises a multilingual spam text recognition apparatus 500, and the apparatus 500 comprises a plurality of program instructions that instruct the processor 204 to perform the multilingual spam text recognition method 300 to determine whether a multilingual text is spam.
FIG. 3 illustrates a flow diagram of a method 300 of recognition of multilingual spam text in accordance with one embodiment of the present invention. The method 300 is suitable for execution in a computing device, such as the computing device 200 described previously. As shown in fig. 3, the method 300 sequentially includes steps S310 to S330.
In step S310, a text to be recognized is acquired, the text to be recognized including at least two languages.
The text to be recognized includes at least two languages, i.e., the text to be recognized is a multilingual text. Two examples of texts to be recognized: "这个case你来follow一下" ("this case, you come follow it", mixing Chinese and English), and "尊敬的客户，你就是bitch，怎么不去умри" ("dear customer, you are a bitch, why don't you go die", mixing Chinese, English, and Russian).
In step S320, the text to be recognized is converted into an intermediate text written in a main language, which is one of the at least two languages.
The multilingual text to be recognized is converted into intermediate text in a single language (i.e., the main language), via step S320. Step S320 may be further performed according to the following steps S322 to S326:
in step S322, the text to be recognized is cut into a plurality of pieces, each piece corresponding to one language.
There are various ways to implement the segmentation of the text to be recognized. In one embodiment, the text to be recognized may be cut as follows. First, the Unicode code point of each character of the text to be recognized is acquired. The language corresponding to each character is then determined based on the code point: since Unicode assigns a uniform and unique binary code to each character of each language, the language of a character can be determined from its Unicode code point. For example, the Unicode range of Chinese characters is 0x4E00 to 0x9FA5; if the code point of a character falls within this range, the character can be determined to be Chinese. After the language of each character is determined, consecutive characters of the same language in the text to be recognized are cut into one segment, so that the text to be recognized is cut into a plurality of segments. For example, the text to be recognized "这个case你来follow一下" may be split into five segments: "这个", "case", "你来", "follow", "一下"; the text to be recognized "尊敬的客户，你就是bitch，怎么不去умри" may be split into four segments: "尊敬的客户，你就是", "bitch", "，怎么不去", "умри".
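The code-point-based segmentation just described can be sketched in Python as follows (the three Unicode ranges and language tags are illustrative; a production system would map many more Unicode blocks, and the handling of punctuation and digits here is an assumption):

```python
def char_language(ch: str) -> str:
    """Determine a character's language from its Unicode code point."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FA5:           # CJK Unified Ideographs
        return "zh"
    if 0x0400 <= cp <= 0x04FF:           # Cyrillic
        return "ru"
    if ch.isascii() and ch.isalpha():    # basic Latin letters
        return "en"
    return "other"                       # punctuation, digits, spaces

def segment_by_language(text: str) -> list[tuple[str, str]]:
    """Cut text into maximal runs of consecutive characters sharing one
    language; 'other' characters are appended to the current run so
    that punctuation does not split a segment."""
    segments: list[tuple[str, str]] = []
    for ch in text:
        lang = char_language(ch)
        if segments and (lang == segments[-1][1] or lang == "other"):
            segments[-1] = (segments[-1][0] + ch, segments[-1][1])
        else:
            segments.append((ch, lang))
    return segments
```

Applied to the mixed Chinese/English example, `segment_by_language("这个case你来follow一下")` yields the five segments ("这个", "zh"), ("case", "en"), ("你来", "zh"), ("follow", "en"), ("一下", "zh").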
In another embodiment, machine learning models (e.g., sequence annotation models, neural network models, etc.) may also be employed to segment the text to be identified. The machine learning model is generated by training a text marked with language types as a training sample.
Of course, other methods for segmenting the text to be identified may be used in addition to the two methods described above. The method for segmenting the text to be identified is not particularly limited.
In step S324, the non-main language segment is translated into the main language to obtain a translated segment corresponding to the non-main language segment.
A non-main-language segment is a segment, among the plurality of segments cut from the text to be recognized, that is not written in the main language; correspondingly, a segment written in the main language is a main-language segment. The main language is any one of the languages included in the text to be recognized.
There are multiple ways to determine the main language. In one embodiment, the main language may be preset by a person skilled in the art. For example, whenever a text contains Chinese characters, Chinese is set as the main language (regardless of the proportion of Chinese characters in the text); under this setting, the main language of the text to be recognized "这个case你来follow一下" is Chinese. Alternatively, whenever a text contains English characters, English is set as the main language (again regardless of the proportion of English characters in the text); under this setting, the main language of the same text to be recognized is English.
In another embodiment, the main language may be determined according to a predetermined rule. For example, the language with the largest number of characters in the text may be taken as the main language; under this rule, since the text to be recognized "尊敬的客户，你就是bitch，怎么不去умри" contains more Chinese characters than English or Russian ones, its main language is Chinese.
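The largest-character-count rule can be sketched as follows, operating on (segment, language) pairs such as those produced by the segmentation of step S322 (excluding 'other' characters from the count is an assumption):

```python
from collections import Counter

def main_language(segments: list[tuple[str, str]]) -> str:
    """Choose the main language as the language contributing the most
    characters to the text to be recognized; segments is a list of
    (piece, language) pairs."""
    counts: Counter = Counter()
    for piece, lang in segments:
        if lang != "other":   # ignore punctuation/digits (assumption)
            counts[lang] += len(piece)
    return counts.most_common(1)[0][0]
```

For the mixed Chinese/English/Russian example, the Chinese segments contribute the most characters, so the rule returns Chinese.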
Of course, besides the two methods described above, other methods may be used by those skilled in the art to determine the main language of the text to be recognized, and the invention is not limited in this regard.
The process of translating the non-main-language segments into the main language in step S324 may be implemented using an existing machine translation service or SDK (Software Development Kit); the present invention does not limit how the translation is implemented.
It should be noted that, in the translation process, the words of two languages need not be in a one-to-one correspondence; they may also be in a one-to-many relationship. Accordingly, one non-main-language segment may correspond to one or more translation segments. For example, the non-main-language segment "case" may correspond to multiple Chinese translation segments, such as "案例" ("case") and "容器" ("container").
In step S326, the main language segment and the translation segment are combined to obtain an intermediate text written in the main language corresponding to the text to be recognized.
For example, with Chinese as the main language, the text to be recognized "怎么不去умри" may be split into two segments, "怎么不去" and "умри": "怎么不去" is the main-language segment and "умри" is the non-main-language segment. The non-main-language segment "умри" is translated into the main language, giving the translation segment "死" ("die"). Combining the main-language segment "怎么不去" with the translation segment "死" yields the intermediate text "怎么不去死" ("why don't you go die") written in the main language and corresponding to the text to be recognized.
In one embodiment, when one non-subject language segment corresponds to multiple translated segments, step S326 may be further implemented as follows: and combining the main language fragments with one translation fragment of each non-main language fragment respectively to obtain at least one candidate text. The combination process is equivalent to the Cartesian product of the translation fragments corresponding to the non-main language fragments, and the number of the obtained candidate texts is the product of the number of the translation fragments corresponding to the non-main language fragments.
And then, determining the smoothness of each candidate text respectively, and taking the candidate text with the largest smoothness as the intermediate text corresponding to the text to be identified. The smoothness of the candidate texts can be determined by adopting a preset language model, namely, each candidate text is respectively input into the preset language model, and the smoothness of the candidate texts is determined according to the output of the language model. The language model is generated by training a corpus corresponding to the main language.
The language model may be, for example, an n-gram model, a deep learning model, or the like, but is not limited thereto. When the language model is an n-gram model, the process of training and generating the n-gram model is equivalent to generating a conditional probability table among words of a main language according to a corpus corresponding to the main language. And inputting the candidate text into a trained n-gram model, wherein the n-gram model calculates the occurrence probability of the candidate text according to a conditional probability table, and the higher the occurrence probability is, the higher the smoothness of the candidate text is.
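As a toy illustration of how a trained n-gram model (here n = 2) scores smoothness from its conditional probability table — the table values below are invented purely for the example:

```python
import math

def sentence_log_prob(words: list[str],
                      bigram_prob: dict[tuple[str, str], float],
                      floor: float = 1e-6) -> float:
    """Smoothness score of a candidate text under a bigram model: the
    sum of log P(w_i | w_{i-1}) read from the precomputed conditional
    probability table; unseen bigrams fall back to a small floor
    probability (a simplistic stand-in for real smoothing)."""
    return sum(math.log(bigram_prob.get(pair, floor))
               for pair in zip(words, words[1:]))
```

Under the same table, a fluent word order scores higher (has a larger occurrence probability) than a disrupted one, which is exactly the comparison used to rank candidate texts.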
The language model may also be a deep learning model. The training process of the deep learning model is as follows: selecting a plurality of texts from a corpus corresponding to a main language, and marking the smoothness; and taking the text marked with the smoothness as a training sample, and training to generate a deep learning model. And inputting the candidate text into a trained deep learning model, wherein the output of the deep learning model is the smoothness of the candidate text.
For example, with Chinese as the main language, the text to be recognized "这个case你来follow一下" may be split into five segments: "这个", "case", "你来", "follow", "一下". Then "这个", "你来", and "一下" are the main-language segments, and "case" and "follow" are the non-main-language segments. Translating each non-main-language segment into the main language gives the translation segments "案例" and "容器" for "case", and "跟" and "接着" for "follow". Combining the main-language segments "这个", "你来", and "一下" with one translation segment of each non-main-language segment, and noting that each non-main-language segment corresponds to two translation segments, the following four (i.e., 2 × 2) candidate texts are obtained:
1) 这个案例你来跟一下;
2) 这个案例你来接着一下;
3) 这个容器你来跟一下;
4) 这个容器你来接着一下.
And then, respectively inputting the four candidate texts into a preset n-gram model, and respectively determining the smoothness of each candidate text by the n-gram model. And taking the candidate text with the maximum smoothness as an intermediate text corresponding to the text to be recognized.
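The combination in the example above amounts to a Cartesian product over the per-segment options (a single option for each main-language segment, several for each non-main-language segment); a sketch:

```python
from itertools import product

def build_candidate_texts(options: list[list[str]]) -> list[str]:
    """Each position holds the possible strings for one segment of the
    text to be recognized: a single-element list for a main-language
    segment, and the list of translation segments for a
    non-main-language segment. The number of candidate texts is the
    product of the option counts."""
    return ["".join(choice) for choice in product(*options)]
```

With the example's segments, `[["这个"], ["案例", "容器"], ["你来"], ["跟", "接着"], ["一下"]]` produces the four (2 × 2) candidate texts listed above, which are then ranked by smoothness.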
In step S330, according to the intermediate text, a preset classification model is adopted to determine whether the text to be recognized is a junk text.
The classification model may be, for example, but not limited to, a support vector machine (Support Vector Machine, SVM) model, a logistic regression (Logistic Regression, LR) model, a convolutional neural network (Convolutional Neural Network, CNN) model, and the like. The classification model employed in step S330 is typically a classification model adapted to receive a text input, output a classification label (indicating junk text or not) corresponding to the text and/or a probability that the text belongs to junk text. Accordingly, the classification model is generated using text training that has been labeled with classification labels.
More specifically, junk text includes various types, such as abusive text, pornographic text, politics-related text, and the like. A classification model may be used to identify junk text in general or, more specifically, some particular type of junk text, for example abusive text, pornographic text, or politics-related text. Accordingly, if the classification model is specifically used to identify abusive text, it is generated by training on texts labeled with abuse labels (indicating abusive text or not); if it is specifically used to identify pornographic text, it is generated by training on texts labeled with pornography labels (indicating pornographic text or not); if it is specifically used to identify politics-related text, it is generated by training on texts labeled with politics labels (indicating politics-related text or not); and so on.
It should be noted that the present invention does not limit the number of classification models adopted in step S330 or the languages to which they apply; a person skilled in the art may select appropriate classification models as needed. Three embodiments of step S330 are given by way of example.
In the first embodiment, the preset classification model is used to determine the classification label corresponding to a text in the main language; that is, given a text in the main language, the classification model can output a judgment of whether the text is junk text. The classification model may be generated by training on texts, labeled with classification labels, from a corpus of the main language, where the classification labels indicate whether a text is junk text.
Accordingly, step S330 may be implemented as follows: the intermediate text is input into the classification model, and the classification model outputs a judgment of whether the intermediate text is junk text, i.e., a judgment of whether the text to be recognized corresponding to the intermediate text is junk text.
In the second embodiment, the preset classification model is used to determine the classification label corresponding to a text in a non-main language; that is, given a text in that non-main language, the classification model can output a judgment of whether the text is junk text. The classification model may be generated by training on texts, labeled with classification labels, from a corpus of the non-main language, where the classification labels indicate whether a text is junk text.
Accordingly, step S330 may be implemented as follows: the intermediate text is translated into a secondary intermediate text in the non-main language, and the secondary intermediate text is input into the classification model, which outputs a judgment of whether the text to be recognized is junk text.
In the third embodiment, the preset classification model includes a first classification model and a second classification model, which are respectively used to determine the probability that a text in a first language is junk text and the probability that a text in a second language is junk text. It should be noted that the first language and the second language are two different languages; both may be non-main languages, or exactly one of them may be the main language. In a preferred embodiment, to avoid content bias introduced by translation, one of the first language and the second language is set to the main language, for example the first language. In addition, the first language and the second language are preferably languages with rich corpus resources and mature natural language processing research, so as to ensure the accuracy of junk text recognition; for example, the first language is Chinese and the second language is English.
The first classification model may be generated by training on texts, labeled with classification labels, from a corpus of the first language, and the second classification model by training on texts, labeled with classification labels, from a corpus of the second language, where the classification labels indicate whether a text is junk text.
Accordingly, step S330 may be implemented as follows. First, a first text in the first language and a second text in the second language corresponding to the intermediate text are determined. If the first language is the main language, the first text is the intermediate text itself; if the first language differs from the main language, the first text is translated from the intermediate text. Similarly, if the second language is the main language, the second text is the intermediate text itself; if the second language differs from the main language, the second text is translated from the intermediate text. In this embodiment of the present invention, the first language and the second language cannot both be the main language, so at least one translation is performed in determining the first text and the second text.
After the first text and the second text are determined, they are input into the first classification model and the second classification model respectively: the first classification model outputs a first probability that the first text is junk text, and the second classification model outputs a second probability that the second text is junk text.
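The preparation of the first and second texts and the two model calls can be sketched as follows; `translate`, `model1`, and `model2` are hypothetical callables standing in for a machine-translation service and the two trained classification models.

```python
def bilingual_probabilities(intermediate_text, main_lang, first_lang,
                            second_lang, translate, model1, model2):
    """Return the (first, second) probabilities that the text is junk text.

    translate(text, src, dst), model1, and model2 are hypothetical
    placeholders: a translation service and two trained classifiers."""
    # Translate only when the target language differs from the main language;
    # since the two languages differ, at least one translation is performed.
    first_text = (intermediate_text if first_lang == main_lang
                  else translate(intermediate_text, main_lang, first_lang))
    second_text = (intermediate_text if second_lang == main_lang
                   else translate(intermediate_text, main_lang, second_lang))
    return model1(first_text), model2(second_text)

# Stub usage: main language Chinese ("zh"), first language zh, second en.
p1, p2 = bilingual_probabilities(
    "intermediate text in zh", "zh", "zh", "en",
    translate=lambda text, src, dst: text + " [" + dst + "]",
    model1=lambda text: 0.8,   # stub first-language classifier
    model2=lambda text: 0.7)   # stub second-language classifier
```

The stub values 0.8 and 0.7 mirror the probabilities used in the FIG. 4 example; any real values would come from the trained models.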
Whether the text to be recognized is junk text is then judged according to the first probability and the second probability.
In one embodiment, the text to be recognized is judged to be junk text when the first probability is greater than a first threshold and the second probability is greater than a second threshold. The values of the first threshold and the second threshold may be set by those skilled in the art; the present invention is not limited in this respect.
In another embodiment, the text to be recognized is judged to be junk text when the weighted sum of the first probability and the second probability is greater than a third threshold. The weights of the first probability and the second probability and the value of the third threshold may be set by those skilled in the art; the present invention is not limited in this respect.
Of course, other methods may be adopted to judge, according to the first probability and the second probability, whether the text to be recognized is junk text; the present invention does not limit the specific method.
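The two decision rules just described, the dual-threshold rule and the weighted-sum rule, can be sketched as follows. All threshold and weight values are illustrative defaults chosen to match the worked FIG. 4 example (probabilities 0.8 and 0.7, weights 0.7 and 0.3, threshold 0.75).

```python
def judge_dual_threshold(p1, p2, t1=0.5, t2=0.5):
    # Rule 1: junk only if both classifiers exceed their own thresholds.
    return p1 > t1 and p2 > t2

def judge_weighted_sum(p1, p2, w1=0.7, w2=0.3, t3=0.75):
    # Rule 2: junk if the weighted combination of the probabilities is high.
    return w1 * p1 + w2 * p2 > t3

# With a first probability of 0.8 and a second probability of 0.7:
# weighted sum = 0.8 * 0.7 + 0.7 * 0.3 = 0.77 > 0.75, so both rules fire.
assert judge_dual_threshold(0.8, 0.7)
assert judge_weighted_sum(0.8, 0.7)
```

The weighted-sum rule lets a very confident classifier in one language outvote a lukewarm one in the other, while the dual-threshold rule requires agreement from both.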
FIG. 4 illustrates a multilingual junk text recognition process according to one embodiment of the invention; specifically, FIG. 4 illustrates a process for identifying multilingual abusive text.
As shown in FIG. 4, the text to be recognized is a Chinese sentence of the form "respected customer, you are just a bitch, why don't you go [Russian word for "die"]", containing the embedded English word "bitch" and an embedded Russian word.

The text to be recognized is first segmented into five segments: "respected customer", "you are just a", "bitch", "why don't you go", and the Russian word. With Chinese as the main language, "respected customer", "you are just a", and "why don't you go" are main-language segments, while "bitch" and the Russian word are non-main-language segments (English and Russian, respectively).

The non-main-language segments are then translated into the main language to obtain their corresponding translated segments: "bitch" corresponds to two translated segments, rendered here as "shrew" and "harridan", while the Russian word corresponds to a single translated segment, "die".

Next, the main-language segments "respected customer", "you are just a", and "why don't you go" are combined with one translated segment of each of the non-main-language segments, yielding the following two candidate texts:

1) Respected customer, you are just a [shrew], why don't you go [die];

2) Respected customer, you are just a [harridan], why don't you go [die].

The two candidate texts are then input into a preset language model, which determines the smoothness of each; the candidate text with the highest smoothness is taken as the intermediate text of the text to be recognized. In this example, the computed smoothness of candidate text 1) is greater than that of candidate text 2), so candidate text 1) is taken as the intermediate text of the text to be recognized.

The intermediate text is then input into a preset Chinese abusive-text model, yielding a first probability that the intermediate text is abusive text. Meanwhile, the intermediate text is translated into an English text, which is input into a preset English abusive-text model, yielding a second probability that the English text is abusive text. Finally, the first probability and the second probability are combined to judge whether the text to be recognized is abusive text. For example, if the first probability and the second probability are 0.8 and 0.7, their weights are 0.7 and 0.3 respectively, and the threshold is set to 0.75, then the weighted sum is 0.8×0.7 + 0.7×0.3 = 0.77, which is greater than the threshold 0.75, so the text to be recognized is judged to be abusive text.
Fig. 5 shows a schematic diagram of an apparatus 500 for recognition of multi-language spam text according to one embodiment of the invention, the apparatus 500 residing in a computing device (e.g., the computing device 200 described above) to cause the computing device to perform a method of recognition of multi-language spam text of the invention (e.g., the method 300 described above). As shown in fig. 5, the apparatus 500 includes an acquisition module 510, a conversion module 520, and a determination module 530.
The obtaining module 510 is adapted to obtain a text to be recognized, the text to be recognized comprising at least two languages. The acquiring module 510 is specifically configured to perform the method of step S310, and the processing logic and functions of the acquiring module 510 can be referred to the relevant description of step S310, which is not repeated herein.
The conversion module 520 is adapted to convert the text to be recognized into an intermediate text written in a main language, which is one of at least two languages included in the text to be recognized. The conversion module 520 is specifically configured to perform the method as described in step S320, and the processing logic and functions of the conversion module 520 can be referred to the relevant description of step S320, which is not repeated herein.
The judging module 530 is adapted to judge whether the text to be recognized is a junk text according to the intermediate text by using a preset classification model. The determining module 530 is specifically configured to perform the method as described in the foregoing step S330, and the processing logic and functions of the determining module 530 can be referred to the related description of the foregoing step S330, which is not repeated herein.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions of the methods and apparatus of the present invention, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, U-drives, floppy diskettes, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the multi-lingual spam text recognition method of the present invention in accordance with instructions in said program code stored in the memory.
By way of example, and not limitation, readable media comprise readable storage media and communication media. The readable storage medium stores information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with examples of the invention. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is for carrying out the functions performed by the elements for carrying out the objects of the invention.
As used herein, unless otherwise specified the use of the ordinal terms "first," "second," "third," etc., to describe a general object merely denote different instances of like objects, and are not intended to imply that the objects so described must have a given order, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims (12)

1. A method for identifying multilingual junk text comprises the following steps:
acquiring a text to be recognized, wherein the text to be recognized comprises at least two languages;
converting the text to be recognized into an intermediate text written in a main language, wherein the main language is one of the at least two languages;
judging, according to the intermediate text and using a preset classification model, whether the text to be recognized is junk text, wherein the classification model comprises a first classification model and a second classification model, which are respectively used to determine the probability that a text in a first language is junk text and the probability that a text in a second language is junk text;
wherein the step of judging, according to the intermediate text and using the preset classification model, whether the text to be recognized is junk text comprises:
determining a first text of a first language and a second text of a second language corresponding to the intermediate text;
respectively inputting the first text and the second text into the first classification model and the second classification model to respectively output a first probability and a second probability that the first text and the second text are junk texts;
judging the text to be recognized as junk text when the first probability is greater than a first threshold and the second probability is greater than a second threshold, or when a weighted sum of the first probability and the second probability is greater than a third threshold.
2. The method of claim 1, wherein the first language is the primary language.
3. The method of claim 1, wherein the first classification model is generated by training on texts in the first language that have been labeled with classification labels, and the second classification model is generated by training on texts in the second language that have been labeled with classification labels, wherein the classification labels indicate whether a text is junk text.
4. The method according to any one of claims 1-3, wherein the step of converting the text to be recognized into an intermediate text written in the main language comprises:
segmenting the text to be recognized into a plurality of segments, each segment corresponding to one language;
translating non-main-language segments into the main language to obtain translated segments corresponding to the non-main-language segments; and
combining main-language segments and the translated segments to obtain the intermediate text written in the main language corresponding to the text to be recognized,
wherein the main-language segments are those of the plurality of segments that are written in the main language, and the non-main-language segments are those of the plurality of segments that are not written in the main language.
5. The method of claim 4, wherein the step of segmenting the text to be recognized into a plurality of segments comprises:
acquiring the Unicode code of each character of the text to be recognized;
determining the language corresponding to each character according to its Unicode code; and
segmenting consecutive characters corresponding to the same language in the text to be recognized into one segment.
6. The method of claim 4, wherein each non-main-language segment corresponds to at least one translated segment, and the step of combining the main-language segments and the translated segments to obtain the intermediate text written in the main language corresponding to the text to be recognized comprises:
combining the main-language segments with one translated segment of each non-main-language segment to obtain at least one candidate text;
determining the smoothness of each candidate text; and
taking the candidate text with the highest smoothness as the intermediate text corresponding to the text to be recognized.
7. The method of claim 6, wherein the step of determining the smoothness of each candidate text comprises:
inputting each candidate text into a preset language model, and determining the smoothness of the candidate text according to the output of the language model.
8. The method of claim 7, wherein the language model is generated by training on a corpus corresponding to the main language.
9. The method of claim 7, wherein the language model is an n-gram model or a deep learning model.
10. A multilingual junk text recognition apparatus, comprising:
an acquisition module adapted to acquire a text to be recognized, the text to be recognized comprising at least two languages;
a conversion module adapted to convert the text to be recognized into an intermediate text written in a main language, the main language being one of the at least two languages; and
a judging module adapted to determine a first text in a first language and a second text in a second language corresponding to the intermediate text, to input the first text and the second text into a first classification model and a second classification model respectively so as to output a first probability and a second probability that the first text and the second text, respectively, are junk text, and to judge the text to be recognized as junk text when the first probability is greater than a first threshold and the second probability is greater than a second threshold, or when a weighted sum of the first probability and the second probability is greater than a third threshold.
11. A computing device, comprising:
at least one processor; and
a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and comprise instructions for performing the method according to any one of claims 1-9.
12. A readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-9.
CN201811082749.XA 2018-09-17 2018-09-17 Multi-language junk text recognition method and device and computing equipment Active CN110929530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811082749.XA CN110929530B (en) 2018-09-17 2018-09-17 Multi-language junk text recognition method and device and computing equipment


Publications (2)

Publication Number Publication Date
CN110929530A CN110929530A (en) 2020-03-27
CN110929530B true CN110929530B (en) 2023-04-25

Family

ID=69855738

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811082749.XA Active CN110929530B (en) 2018-09-17 2018-09-17 Multi-language junk text recognition method and device and computing equipment

Country Status (1)

Country Link
CN (1) CN110929530B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005093566A1 (en) * 2004-03-22 2005-10-06 Albahith Co. Human interface translator for machines
CN102567529A (en) * 2011-12-30 2012-07-11 北京理工大学 Cross-language text classification method based on two-view active learning technology
CN103186845A (en) * 2011-12-29 2013-07-03 盈世信息科技(北京)有限公司 Junk mail filtering method
CN104536953A (en) * 2015-01-22 2015-04-22 苏州大学 Method and device for recognizing textual emotion polarity
CN105408891A (en) * 2013-06-03 2016-03-16 机械地带有限公司 Systems and methods for multi-user multi-lingual communications
CN106383818A (en) * 2015-07-30 2017-02-08 阿里巴巴集团控股有限公司 Machine translation method and device
CN106445908A (en) * 2015-08-07 2017-02-22 阿里巴巴集团控股有限公司 Text identification method and apparatus
CN106528535A (en) * 2016-11-14 2017-03-22 北京赛思信安技术股份有限公司 Multi-language identification method based on coding and machine learning
CN107239440A (en) * 2017-04-21 2017-10-10 同盾科技有限公司 A kind of rubbish text recognition methods and device
CN107562728A (en) * 2017-09-12 2018-01-09 电子科技大学 Social media short text filter method based on structure and text message
CN108062303A (en) * 2017-12-06 2018-05-22 北京奇虎科技有限公司 The recognition methods of refuse messages and device
CN108536756A (en) * 2018-03-16 2018-09-14 苏州大学 Mood sorting technique and system based on bilingual information


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
K. Mohana Priya et al. Emerging Twitter Using Text Classification Based on Live Streaming. International Journal of Intellectual Advancements and Research in Engineering Computations, Vol. 16, No. 2, 2018. *
Liu Wuying; Wang Ting. A conditional probability ensemble method for junk text stream filtering. Journal of Frontiers of Computer Science and Technology, No. 5, 2010. *
Chen Hongjun; Zhang Lei. Virtual machine instruction design and optimization for a Structured Text language compiler. Microcontrollers & Embedded Systems, No. 5, 2018. *
Huang Zhengwei et al. Junk text recognition based on an SVM classification model. Mathematics in Practice and Theory, Vol. 46, No. 7, 2016. *


Similar Documents

Publication Publication Date Title
US10657332B2 (en) Language-agnostic understanding
CN109416705B (en) Utilizing information available in a corpus for data parsing and prediction
US10558757B2 (en) Symbol management
WO2019153613A1 (en) Chat response method, electronic device and storage medium
CN105183761B (en) Sensitive word replacing method and device
JP5379138B2 (en) Creating an area dictionary
JP2019504413A (en) System and method for proposing emoji
WO2015176518A1 (en) Reply information recommending method and device
CN111680159A (en) Data processing method and device and electronic equipment
AU2017205328A1 (en) Named entity recognition on chat data
WO2021208727A1 (en) Text error detection method and apparatus based on artificial intelligence, and computer device
US11593557B2 (en) Domain-specific grammar correction system, server and method for academic text
US20160188569A1 (en) Generating a Table of Contents for Unformatted Text
WO2018023356A1 (en) Machine translation method and apparatus
US20180032907A1 (en) Detecting abusive language using character n-gram features
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN110781273A (en) Text data processing method and device, electronic equipment and storage medium
CN110069769A (en) Using label generating method, device and storage equipment
CN107111607A (en) The system and method detected for language
CN110738056A (en) Method and apparatus for generating information
CN110929026A (en) Abnormal text recognition method and device, computing equipment and medium
WO2024207762A1 (en) Data identification method and related device
US20180107655A1 (en) Systems and methods for handling formality in translations of text
US20140101596A1 (en) Language and communication system
Verma et al. Leveraging machine translation for cross-lingual fine-grained cyberbullying classification amongst pre-adolescents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant