CN111291551A - Text processing method and device, electronic equipment and computer readable storage medium
- Publication number
- CN111291551A (application CN202010073135.6A)
- Authority
- CN
- China
- Prior art keywords
- text
- preset
- invalid
- meets
- comment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a text processing method and device, electronic equipment and a computer-readable storage medium, and relates to the field of text processing. The method comprises the following steps: acquiring a text of a game; acquiring interaction information of the text; when the interaction information meets a preset condition, determining whether the text contains a preset keyword; when the text does not contain a preset keyword, checking the text against a preset character statistical rule to determine whether the text meets a statistical relevant condition; when the text meets the statistical relevant condition, determining whether the text is semantically valid based on a preset semantic rule; and when the text is determined to be semantically invalid, determining the text to be an invalid text and filtering the text. With the method and the device, users obtain valid comment content from the comment area more efficiently, and the user experience is improved.
Description
Technical Field
The present application relates to the field of processing technologies, and in particular, to a text processing method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of internet technology, users interact in many ways over the internet. For example, a user can post comments in the comment section below a subject, and other users can interact with those comments in the comment area.
Currently, because a single subject may attract a large number of comments, the comments are mixed with highly repetitive content that has no practical meaning, such as "sofa". Some comment areas even contain many meaningless sentences typed at random, such as "the Fuxi pulling place does not take the flavor of the Fuxi Kazakh-Hada-Tu-Shi". Such highly repetitive, meaningless comments bury the valuable text in the comment area, so users obtain valid comment content inefficiently and the user experience is poor.
Disclosure of Invention
The application provides a text processing method and device, electronic equipment and a computer readable storage medium, which can solve the problems that a user is low in efficiency of obtaining effective comment contents from a comment area and poor in user experience. The technical scheme is as follows:
in a first aspect, a text processing method is provided, and the method includes:
acquiring a text of a game;
acquiring interactive information of the text; when the interactive information meets a preset condition, determining whether the text comprises a preset keyword or not;
when the text does not contain preset keywords, detecting the text based on a preset character statistical rule to determine whether the text meets a statistical relevant condition;
when the detection meets the statistic correlation condition, determining whether the text is semantically valid based on a preset semantic rule;
and when the text semantics are determined to be invalid, determining the text to be an invalid text, and filtering the text.
Preferably, the interactive information includes at least one of a number of comments, a number of supports, a number of objections, and a number of shares of the text;
the interaction information accords with preset conditions and comprises the following steps:
and when the comment quantity exceeds a preset comment threshold value, and/or the support quantity exceeds a preset support threshold value, and/or the objection quantity does not exceed a preset objection threshold value, and/or the sharing quantity exceeds a preset sharing threshold value, judging that the interaction information meets a preset condition.
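The threshold test on the interaction information can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `Interaction`, `Thresholds` and `meets_preset_condition`, the default threshold values, and the `require_all` switch (used here to model the "and/or" wording) are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """Interaction counts of one text (hypothetical container)."""
    comments: int = 0
    supports: int = 0
    objections: int = 0
    shares: int = 0


@dataclass
class Thresholds:
    """Preset thresholds; the default values are placeholders."""
    comments: int = 10
    supports: int = 20
    objections: int = 50
    shares: int = 5


def meets_preset_condition(info: Interaction, th: Thresholds,
                           require_all: bool = False) -> bool:
    """Return True when the interaction information meets the preset condition.

    Comments, supports and shares must exceed their thresholds, while
    objections must NOT exceed theirs. `require_all` selects the "and"
    reading; the default is the "or" reading.
    """
    checks = [
        info.comments > th.comments,
        info.supports > th.supports,
        info.objections <= th.objections,
        info.shares > th.shares,
    ]
    return all(checks) if require_all else any(checks)
```

Under the "or" reading, a text with few objections satisfies the condition on that count alone, so a deployment that wants a stricter gate would presumably use the "and" reading or a subset of the four counts.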
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring Chinese characters in the text, and counting the number of the Chinese characters;
and when the number of the Chinese characters exceeds the number threshold of the Chinese characters, determining that the text meets the statistical condition.
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring non-Chinese characters in the text, and counting the number of the non-Chinese characters;
and when the number of the non-Chinese characters is smaller than the threshold value of the number of the non-Chinese characters, determining that the text meets the statistical condition.
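The two counting rules above (number of Chinese characters, number of non-Chinese characters) might be sketched as below. The sketch assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and that whitespace is ignored when counting non-Chinese characters; both choices, the function names and the threshold parameters are assumptions made for illustration.

```python
def is_chinese(ch: str) -> bool:
    """Assumption: only the basic CJK Unified Ideographs block counts."""
    return "\u4e00" <= ch <= "\u9fff"


def count_chinese(text: str) -> int:
    return sum(1 for ch in text if is_chinese(ch))


def count_non_chinese(text: str) -> int:
    # Assumption: whitespace is not counted as a non-Chinese character.
    return sum(1 for ch in text if not is_chinese(ch) and not ch.isspace())


def meets_chinese_count(text: str, min_chinese: int) -> bool:
    """The text passes when its Chinese character count exceeds the threshold."""
    return count_chinese(text) > min_chinese


def meets_non_chinese_count(text: str, max_non_chinese: int) -> bool:
    """The text passes when its non-Chinese character count is below the threshold."""
    return count_non_chinese(text) < max_non_chinese
```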
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring all characters in the text;
detecting whether continuously repeated characters exist in all characters;
counting the repetition times of each continuously repeated character when the continuously repeated characters exist in all the characters;
and when the repetition times of any continuously repeated character do not exceed the repetition time threshold value, determining that the text meets the statistical condition.
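A possible sketch of the consecutive-repetition rule, using `itertools.groupby` from the Python standard library to find runs of the same character. The function names and the default threshold are hypothetical. The sketch reads the rule as "every run stays within the threshold" and treats a text with no repeated characters as passing; both readings are assumptions.

```python
from itertools import groupby


def repeat_runs(text: str) -> list[tuple[str, int]]:
    """Return (character, run length) for every run longer than one."""
    runs = ((ch, sum(1 for _ in group)) for ch, group in groupby(text))
    return [(ch, n) for ch, n in runs if n > 1]


def meets_repeat_condition(text: str, max_repeat: int = 3) -> bool:
    """Pass when no consecutively repeated character exceeds max_repeat."""
    return all(n <= max_repeat for _, n in repeat_runs(text))
```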
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring all Chinese characters in the text and the initial letter of each Chinese character;
counting the continuous occurrence times of each initial;
and when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value, determining that the text meets the statistical condition.
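The initial-letter rule can be illustrated in the same way. The sketch assumes that the "initial letter" of a Chinese character is the first letter of its pinyin reading, and it uses a tiny hand-written lookup table, `PINYIN_INITIAL`, purely as a stand-in for a real pinyin dictionary; the table, the function names and the default threshold are hypothetical. Characters missing from the table break a run rather than extend it, which is also an assumption.

```python
from itertools import groupby

# Hypothetical lookup table; a real system would need a full pinyin dictionary.
PINYIN_INITIAL = {"哈": "h", "好": "h", "很": "h", "玩": "w", "游": "y", "戏": "x"}


def initials(text: str, table: dict[str, str] = PINYIN_INITIAL) -> list:
    """Map each character to its pinyin initial, or None when unknown."""
    return [table.get(ch) for ch in text]


def longest_initial_run(text: str, table: dict[str, str] = PINYIN_INITIAL) -> int:
    """Length of the longest run of consecutive characters sharing an initial."""
    runs = (
        sum(1 for _ in group)
        for key, group in groupby(initials(text, table))
        if key is not None  # unknown characters interrupt a run
    )
    return max(runs, default=0)


def meets_initial_condition(text: str, max_run: int = 3,
                            table: dict[str, str] = PINYIN_INITIAL) -> bool:
    """Pass when no initial occurs more than max_run times in a row."""
    return longest_initial_run(text, table) <= max_run
```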
Preferably, determining whether the text is semantically valid based on a preset semantic rule includes:
calculating the perplexity (confusion degree) of the text;
when the perplexity does not exceed a perplexity threshold, calculating the similarity between the text and each effective text in a preset effective text set to obtain at least one first similarity, and calculating the similarity between the text and each ineffective text in a preset ineffective text set to obtain at least one second similarity;
and when at least one first similarity in the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one second similarity in the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
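The semantic rule above can be sketched under several stated assumptions. The language model and the similarity measure are not specified at this point in the text, so the sketch substitutes a character-bigram model with add-one smoothing for the language model and Jaccard similarity over character bigrams for the similarity; every function name and threshold is hypothetical. Cases that the steps above do not cover (the score exceeds its threshold, or no similarity exceeds a threshold) return `None` instead of a guessed verdict, and checking the effective set before the ineffective set is likewise an assumption.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character bigrams in a corpus; '^' marks the sentence start."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        padded = "^" + sentence
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            contexts[a] += 1
            bigrams[a, b] += 1
    return contexts, bigrams, len(vocab) + 1  # +1 reserves mass for unseen characters


def perplexity(text, model):
    """exp of the negative mean log-probability under the smoothed bigram model."""
    contexts, bigrams, v = model
    padded = "^" + text
    pairs = list(zip(padded, padded[1:]))
    if not pairs:
        return float("inf")
    log_p = sum(math.log((bigrams[a, b] + 1) / (contexts[a] + v)) for a, b in pairs)
    return math.exp(-log_p / len(pairs))


def char_bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def jaccard(a, b):
    """Stand-in similarity: overlap of character bigram sets."""
    x, y = char_bigrams(a), char_bigrams(b)
    return len(x & y) / len(x | y)


def semantic_verdict(text, ppl, valid_set, invalid_set,
                     ppl_max, valid_min, invalid_min):
    """Return 'valid', 'invalid', or None when the described rule does not decide."""
    if ppl > ppl_max:
        return None  # this branch is not described in the steps above
    if any(jaccard(text, v) > valid_min for v in valid_set):
        return "valid"
    if any(jaccard(text, i) > invalid_min for i in invalid_set):
        return "invalid"
    return None
```

A caller would typically compute `ppl = perplexity(text, model)` first and pass it in, which keeps the language model replaceable without touching the decision logic.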
Preferably, the method further comprises the following steps:
storing the invalid text with invalid semantics to the invalid text set; and
and when the text semantics are determined to be valid, determining the text to be a valid text, and storing the valid text to the valid text set.
Preferably, the valid text set and the invalid text set are generated by:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating the perplexity (confusion degree) of each third text sample;
taking at least one third text sample whose perplexity is smaller than the perplexity threshold as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
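The sample-set generation steps above can be sketched as a chain of filters. Each check is passed in as a callable because its concrete form is described elsewhere or left open; all parameter names are hypothetical. The classifier is represented by a plain function returning `True` for an effective sample, which is an assumption. How the first and third sample sets are combined into the final effective text set is not detailed in the steps above, so the sketch stops at returning the intermediate sets.

```python
def build_sample_sets(samples, get_interaction, has_keyword,
                      passes_stats, score, score_max, classify):
    """Follow the listed steps and return the intermediate sample sets.

    get_interaction(s) returns None when no interaction information could be
    acquired; score(s) is the per-sample score compared against score_max;
    classify(s) returns True for an effective sample.
    """
    # First sample set: samples whose interaction information was acquired.
    first_set = [s for s in samples if get_interaction(s) is not None]
    # Second text samples: no preset keyword.
    second = [s for s in first_set if not has_keyword(s)]
    # Third text samples: pass the character statistical rule.
    third = [s for s in second if passes_stats(s)]
    # Second sample set: score below the threshold.
    second_set = [s for s in third if score(s) < score_max]
    # Third / fourth sample sets: effective / ineffective after classification.
    third_set, fourth_set = [], []
    for s in second_set:
        (third_set if classify(s) else fourth_set).append(s)
    return {"first": first_set, "second": second_set,
            "third": third_set, "fourth": fourth_set}
```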
In a second aspect, an apparatus for text processing is provided, the apparatus comprising:
the acquisition module is used for acquiring a text of a game; acquiring interactive information of the text;
the first detection module is used for determining whether the text contains preset keywords or not when the interactive information meets preset conditions;
the second detection module is used for detecting the text based on a preset character statistical rule to determine whether the text meets a statistical relevant condition or not when the text does not contain a preset keyword;
the third detection module is used for determining whether the text is semantically valid based on a preset semantic rule when the text meets the statistical relevant condition;
the judging module is used for determining the text as an invalid text when the text semantic is determined to be invalid;
and the filtering module is used for filtering the text.
Preferably, the obtaining module is specifically configured to:
acquiring at least one of comment quantity, support quantity, objection quantity and sharing quantity of the text;
the interaction information accords with preset conditions and comprises the following steps:
and when the acquired comment quantity exceeds a preset comment threshold value, and/or the support quantity exceeds a preset support threshold value, and/or the objection quantity does not exceed a preset objection threshold value, and/or the sharing quantity exceeds a preset sharing threshold value, judging that the interaction information meets a preset condition.
Preferably, the second detection module comprises:
the first statistic submodule is used for acquiring the Chinese characters in the text and counting the number of the Chinese characters;
a first determining sub-module for determining that the text meets a statistically relevant condition when the number of Chinese characters exceeds a threshold number of Chinese characters.
Preferably, the second detection module comprises:
the second statistic submodule is used for acquiring the non-Chinese characters in the text and counting the number of the non-Chinese characters;
and the second determining sub-module is used for determining that the text meets the statistical correlation condition when the number of the non-Chinese characters is smaller than the threshold value of the number of the non-Chinese characters.
Preferably, the second detection module comprises:
the first obtaining submodule is used for obtaining all characters in the text;
the detection submodule is used for detecting whether continuous repeated characters exist in all the characters;
the third counting submodule is used for counting the repetition times of each continuously repeated character when the continuously repeated characters exist in all the characters;
and the third determining sub-module is used for determining that the text meets the statistical correlation condition when the repetition frequency of any continuously repeated character does not exceed the repetition frequency threshold value.
Preferably, the second detection module comprises:
the second acquisition submodule is used for acquiring all Chinese characters in the text and the initial letters of all the Chinese characters;
the fourth statistic submodule is used for counting the continuous occurrence times of each first letter;
and the fourth determining submodule is used for determining that the text meets the statistical correlation condition when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value.
Preferably, the third detection module comprises:
the similarity calculation submodule is used for calculating the similarity between the text and each effective text in a preset effective text set to obtain at least one first similarity, and calculating the similarity between the text and each ineffective text in a preset ineffective text set to obtain at least one second similarity;
and the judging submodule is used for judging that the text semantics are valid when at least one first similarity in the first similarities exceeds a valid similarity threshold, or judging that the text semantics are invalid when at least one second similarity in the second similarities exceeds an invalid similarity threshold.
Preferably, the apparatus further comprises:
the storage module is used for storing the invalid texts with invalid semantics into the invalid text set; and when the text semantics are determined to be valid, determining that the text is a valid text, and storing the valid text to the valid text set.
Preferably, the valid text set and the invalid text set are generated by:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating the perplexity (confusion degree) of each third text sample;
taking at least one third text sample whose perplexity is smaller than the perplexity threshold as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
In a third aspect, an electronic device is provided, which includes:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to call the operation instructions, which cause the processor to perform the operations corresponding to the text processing method shown in the first aspect of the application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the text processing method shown in the first aspect of the present application.
The technical solution provided by this application brings the following beneficial effects:
in the embodiment of the invention, a game text and interactive information of the text are firstly acquired, when the interactive information meets a preset condition, whether the text contains a preset keyword is determined, when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition, when the text meets the statistical correlation condition, whether the text is semantically effective is determined based on a preset semantic rule, and when the text is semantically ineffective, the text is determined to be an ineffective text and the text is filtered. Therefore, whether the text is effective or not is detected based on a plurality of dimensions such as interactive information, preset keywords, statistical relevant conditions and semantic validity, compared with a single detection mode, the accuracy of multi-dimensional detection is higher, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a text processing method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a text processing method according to another embodiment of the present application;
Fig. 3 is a schematic diagram of data interaction between a comment system and an invalid comment filtering service in the present application;
Fig. 4 is a schematic structural diagram of a text processing apparatus according to another embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device for text processing according to yet another embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The text processing method, the text processing device, the electronic equipment and the computer-readable storage medium provided by the present application aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In one embodiment, a text processing method is provided, as shown in fig. 1, the method including:
step S101, obtaining a game text;
The text of the game may be comment text in a comment system in the game field. Beyond the game field, the embodiment of the present invention may be applied to any comment-related field, such as websites and apps that have a comment system. As shown in fig. 3, in practical applications, when a website or app needs to display comments, the comment system may first obtain the relevant comment data from the back end, then automatically filter all comments through an invalid comment filtering service to obtain the valid comments, and the front end then displays the valid comments, so that the user sees only the filtered comments.
Further, the comment system and the invalid comment filtering service may both be set in the terminal, or the comment system may be set in the terminal, and the invalid comment filtering service may be set in the server, which may be set according to actual requirements in actual applications, which is not limited in the embodiment of the present invention.
Step S102, acquiring interactive information of a text; when the interactive information meets the preset conditions, determining whether the text comprises preset keywords or not;
The interaction information may include other information besides the comment quantity, the support quantity, the objection quantity and the sharing quantity, and may be adjusted according to actual needs in practical applications, which is not limited in the embodiment of the present invention. Because interactions are produced manually by other users after they have read the text, and users understand a text more accurately than a machine does, the semantic understanding of other users can be used in text processing to improve accuracy. For example, a comment that has received 100 user objections is very likely to be problematic.
Further, when the interaction information of the text meets a preset condition, for example, the number of supports exceeds a preset support threshold, or the number of objections does not exceed a preset objection threshold, it can be further determined whether the text includes a preset keyword. In practical application, a blacklist of keywords may be preset, where the blacklist may include a plurality of preset keywords, and when it is detected that a text includes at least one preset keyword in the blacklist, it may be determined that the text is an invalid comment. The invalid comments can be comments without Chinese characters or comments with discordant sentences and the like; a valid comment may be the reverse of an invalid comment, i.e., a valuable or meaningful comment. Invalid comments can be further efficiently screened out through a blacklist of preset keywords.
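The blacklist check described above amounts to a substring test against each preset keyword. A minimal sketch follows; the function name and the sample blacklist entries are hypothetical, and lower-casing both sides before comparison is an added assumption that only matters for Latin-script keywords.

```python
def contains_preset_keyword(text: str, blacklist: list[str]) -> bool:
    """Return True when the text contains at least one blacklisted keyword."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in blacklist)


# Hypothetical blacklist entries, for illustration only.
BLACKLIST = ["沙发", "first!"]
print(contains_preset_keyword("抢沙发", BLACKLIST))          # True
print(contains_preset_keyword("这个游戏很好玩", BLACKLIST))  # False
```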
Step S103, when the text does not contain the preset keywords, detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation conditions;
When it is detected that the text does not contain any preset keyword, that is, the text contains none of the keywords in the preset blacklist, the text is further checked against the preset character statistical rule to determine whether it meets the statistical relevant condition.
Step S104, when the detection accords with the statistic correlation condition, determining whether the text is semantically effective or not based on a preset semantic rule;
after the steps S101 to S103, most of invalid texts may be filtered from the format of the text, the dimensions of keywords, and the like, and then whether the semantics are valid is determined for the remaining texts based on the preset semantic rule, so that the text may be filtered from the dimensions of the real semantics of the text.
And step S105, when the text semantics are determined to be invalid, determining that the text is an invalid text, and filtering the text.
After the text semantics are determined to be valid, the text can be determined to be a valid text, and the text can be displayed; otherwise, the text is judged to be an invalid text, and filtering processing such as shielding and the like can be carried out on the text.
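Steps S101 to S105 can be read as a staged check in which each stage runs only if the previous one lets the text through. The sketch below shows that control flow only; the four checks are supplied as callables, and all names are hypothetical. Two branches rest on assumptions: a text whose interaction information does not meet the preset condition is reported as undecided, because the steps above do not say how it is handled, and a text that fails the character statistical rule is treated as invalid, following the remark that steps S101 to S103 already filter out most invalid texts.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNDECIDED = "undecided"


Check = Callable[[str], bool]


def process_text(text: str, interaction_ok: Check, has_keyword: Check,
                 stats_ok: Check, semantics_valid: Check) -> Verdict:
    """Run the staged checks and return a verdict for one text."""
    if not interaction_ok(text):
        return Verdict.UNDECIDED  # assumption: branch not described in the steps
    if has_keyword(text):
        return Verdict.INVALID    # blacklist hit
    if not stats_ok(text):
        return Verdict.INVALID    # assumption: failing the statistical rule filters the text
    return Verdict.VALID if semantics_valid(text) else Verdict.INVALID


def should_display(verdict: Verdict) -> bool:
    """Only texts judged invalid are shielded."""
    return verdict is not Verdict.INVALID
```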
In the embodiment of the invention, a game text and interactive information of the text are firstly acquired, when the interactive information meets a preset condition, whether the text contains a preset keyword is determined, when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition, when the text meets the statistical correlation condition, whether the text is semantically effective is determined based on a preset semantic rule, when the text is semantically ineffective, the text is determined to be an ineffective text, and the text is filtered. Therefore, whether the text is effective or not is detected based on a plurality of dimensions such as interactive information, preset keywords, statistical relevant conditions and semantic validity, compared with a single detection mode, the accuracy of multi-dimensional detection is higher, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
In another embodiment, a text processing method is provided, as shown in fig. 2, the method including:
step S201, obtaining a game text;
The text of the game may be comment text in a comment system in the game field. Beyond the game field, the embodiment of the present invention can be applied to any comment-related field, such as websites and apps that have a comment system. As shown in fig. 3, in practical applications, when a website or app needs to display comments, the comment system may first obtain the comment data from the back end, then automatically filter all comments through an invalid comment filtering service to obtain the valid comments, and the front end then displays the valid comments, so that the user sees only the filtered comments.
For example, a certain web page of a certain website originally includes 100 comments, and the 100 comments are filtered before the 100 comments are displayed, so that 80 valid comments and 20 invalid comments are obtained, and therefore, the 80 valid comments are displayed, and the 20 invalid comments are shielded. The invalid comments can be comments without Chinese characters or comments with discordant sentences and the like; a valid comment may be the reverse of an invalid comment, i.e., a valuable or meaningful comment.
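The 100-comment example can be sketched as a simple partition of the comment list into displayed and shielded comments. The function name is hypothetical, and the validity test used in the demonstration (a comment counts as valid if it contains at least one letter) is only a placeholder for the multi-stage detection, chosen so that the example reproduces the 80/20 split.

```python
def split_comments(comments, is_valid):
    """Partition comments into (displayed, shielded) using a validity test."""
    displayed, shielded = [], []
    for comment in comments:
        (displayed if is_valid(comment) else shielded).append(comment)
    return displayed, shielded


# 80 placeholder comments with letters and 20 made of punctuation only.
comments = [f"comment {i}" for i in range(80)] + ["!!!"] * 20
displayed, shielded = split_comments(
    comments, lambda c: any(ch.isalpha() for ch in c)
)
print(len(displayed), len(shielded))  # 80 20
```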
Further, the comment system and the invalid comment filtering service may both be set in the terminal, or the comment system may be set in the terminal, and the invalid comment filtering service may be set in the server, which may be set according to actual requirements in actual applications, which is not limited in the embodiment of the present invention. The terminal may have the following features:
(1) in terms of hardware architecture, the device has a central processing unit, a memory, an input unit and an output unit; that is, the device is typically a microcomputer device with a communication function. It may offer various input modes, such as a keyboard, a mouse, a touch screen, a microphone and a camera, which can be adjusted as required. The device also typically has several output modes, such as a receiver and a display screen, which can likewise be adjusted as required;
(2) in terms of the software system, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android or iOS. These operating systems are becoming increasingly open, and personalized applications developed on these open platforms emerge continually, such as address books, calendars, notepads, calculators and various games, which largely meet the needs of individual users;
(3) in terms of communication capability, the device has flexible access modes and high-bandwidth communication performance, and can automatically adjust the communication mode according to the selected service and the environment, which is convenient for users. The device can support GSM (Global System for Mobile Communications), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), Wi-Fi (Wireless Fidelity), WiMAX (Worldwide Interoperability for Microwave Access) and the like, so that it adapts to networks of various standards and supports not only voice services but also various wireless data services;
(4) in terms of function, the device places more emphasis on being user-friendly, personalized and multi-functional. With the development of computer technology, devices have moved from a device-centered mode to a human-centered mode, integrating embedded computing, control technology, artificial intelligence technology, biometric authentication technology and the like, which fully reflects the human-oriented purpose. Thanks to the development of software technology, the device can be adjusted and configured according to individual needs and is therefore more personalized. Meanwhile, the device integrates a large amount of software and hardware, and its functions are increasingly powerful.
Step S202, acquiring interactive information of a text; when the interactive information meets the preset conditions, determining whether the text comprises preset keywords or not;
In a preferred embodiment of the present invention, the interaction information comprises at least one of the number of comments, the number of supports, the number of objections, and the number of shares of the text;
that the interaction information meets the preset condition comprises:
when the acquired number of comments exceeds a preset comment threshold, and/or the number of supports exceeds a preset support threshold, and/or the number of objections does not exceed a preset objection threshold, and/or the number of shares exceeds a preset sharing threshold, determining that the interaction information meets the preset condition.
Specifically, for any text, at least one of the number of comments, the number of supports, the number of objections, and the number of shares of the text may be obtained first. When the number of comments exceeds the preset comment threshold, and/or the number of supports exceeds the preset support threshold, and/or the number of objections does not exceed the preset objection threshold, and/or the number of shares exceeds the preset sharing threshold, it is determined that the interaction information meets the preset condition.
For example, a certain webpage includes five comments. When text processing is performed on the first comment, the number of comments, supports, objections and shares of the first comment are acquired. When at least one of these quantities satisfies its corresponding preset threshold, for example, the number of comments exceeds the preset comment threshold, and/or the number of supports exceeds the preset support threshold, and/or the number of objections does not exceed the preset objection threshold, and/or the number of shares exceeds the preset sharing threshold, it is determined that the interaction information meets the preset condition. At this point the first comment can be determined to be a valid comment as far as this check is concerned, and the remaining four comments are handled in the same way.
Further, the interaction information may include other information besides the number of comments, the number of supports, the number of objections, and the number of shares, and may be adjusted according to actual requirements in practical applications, which is not limited in the embodiment of the present invention. Because the interaction is generated manually by other users after they have read the text, their understanding of the text is more accurate than that of a machine. For example, if a comment has 100 user objections, the comment is very likely problematic. Text processing can therefore draw on the semantic understanding of other users, which improves accuracy.
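The interaction check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the names `Interaction`, `Thresholds` and `interaction_meets_condition`, the default threshold values, and the `require_all` switch (which chooses between the "and" and the "or" reading of "and/or") are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    # Any field may be None when that count was not acquired.
    comments: Optional[int] = None
    supports: Optional[int] = None
    objections: Optional[int] = None
    shares: Optional[int] = None


@dataclass
class Thresholds:
    # Hypothetical default values, chosen only for illustration.
    comments: int = 0
    supports: int = 0
    objections: int = 10
    shares: int = 0


def interaction_meets_condition(info: Interaction, th: Thresholds,
                                require_all: bool = False) -> bool:
    """Return True when the acquired counts satisfy the preset condition."""
    checks = []
    if info.comments is not None:
        checks.append(info.comments > th.comments)
    if info.supports is not None:
        checks.append(info.supports > th.supports)
    if info.objections is not None:
        # Objections pass when they do NOT exceed their threshold.
        checks.append(info.objections <= th.objections)
    if info.shares is not None:
        checks.append(info.shares > th.shares)
    if not checks:
        return False  # no interaction information was acquired
    return all(checks) if require_all else any(checks)
```

With `require_all=False` a single satisfied count is enough; with `require_all=True` every acquired count must pass, so a comment with one reply but 100 objections is rejected.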
Further, when the interaction information of the text meets a preset condition, for example, the number of supports exceeds a preset support threshold, or the number of objections does not exceed a preset objection threshold, it can be further determined whether the text includes a preset keyword. In practical application, a blacklist of keywords may be preset, where the blacklist may include a plurality of preset keywords, and when it is detected that a text includes at least one preset keyword in the blacklist, it may be determined that the text is an invalid comment. Invalid comments can be further efficiently screened out through a blacklist of preset keywords.
For example, if the keyword blacklist includes "number of words hit" and "five words", and a certain comment is "this is five words", it is detected that the comment includes "five words" in the blacklist, then the comment can be determined as an invalid comment, and the comment is shielded; or, if a comment is "5 words together", and it is detected that the comment contains "words together" in the blacklist, the comment can be determined as an invalid comment, and the comment is masked.
Further, when detecting whether the text contains preset keywords, a mode of performing natural language processing on the text and then matching with each preset keyword in the blacklist may be adopted. Of course, other ways of detecting whether the text includes the preset keyword are all applicable to the embodiment of the present invention, and the method may be set according to actual requirements in actual applications, which is not limited in the embodiment of the present invention.
In addition, during matching, in addition to when the text completely contains the preset keywords (as in the above example), a matching degree threshold may also be set, and when the matching degree of the text and the preset keywords exceeds the matching degree threshold, it may be determined that the text contains the keywords. For example, if the preset keyword in the blacklist is "five words," and the keyword included in the text is "five words," the matching degree between the preset keyword and the keyword is very high, and if the matching degree exceeds a threshold value of the matching degree, it can also be determined that the text includes the preset keyword.
Furthermore, the blacklist may include text segments or other content in addition to keywords, and may be set according to actual requirements in practical applications, which is not limited in the embodiment of the present invention.
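The blacklist check, including the optional matching degree threshold, might look like the sketch below. It assumes that the matching degree is a character-level similarity ratio computed with Python's standard `difflib.SequenceMatcher` over sliding windows of the text; the embodiment does not specify the measure, and the function name `contains_blacklisted` is hypothetical.

```python
from difflib import SequenceMatcher


def contains_blacklisted(text, blacklist, match_threshold=1.0):
    """Return True if the text contains, or closely matches, a blacklisted keyword.

    match_threshold=1.0 means exact containment only; a lower value also
    accepts windows of the text whose similarity to a keyword exceeds it.
    """
    for keyword in blacklist:
        if keyword in text:
            return True  # the text completely contains the keyword
        if match_threshold < 1.0:
            n = len(keyword)
            # Slide a window of the keyword's length over the text.
            for start in range(max(1, len(text) - n + 1)):
                window = text[start:start + n]
                ratio = SequenceMatcher(None, window, keyword).ratio()
                if ratio > match_threshold:
                    return True
    return False
```

For example, `contains_blacklisted("this is five wards", ["five words"], match_threshold=0.8)` is accepted as a match even though the keyword does not appear verbatim.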
Step S203, when the text does not contain the preset keywords, detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation conditions;
When it is detected that the text does not contain the preset keywords, the text can be determined to be valid as far as the keyword check is concerned, and the text then continues to be detected based on the preset character statistical rule so as to determine whether it meets the statistical correlation condition.
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring Chinese characters in a text, and counting the number of the Chinese characters;
and when the number of the Chinese characters exceeds the number threshold of the Chinese characters, determining that the text meets the statistical correlation condition.
Specifically, the method includes the steps of firstly obtaining Chinese characters in a text, counting the number of the Chinese characters, then judging whether the number of the Chinese characters exceeds a preset threshold value of the number of the Chinese characters, and if so, determining that the text is valid; if not, the text may be determined to be invalid. Therefore, invalid comments can be further efficiently screened out based on the number of Chinese characters in the text.
For example, if the preset threshold value of the number of Chinese characters is 0, and a certain comment is "Yfdcbhj", or consists only of punctuation marks and symbols such as "…", the comment contains no Chinese characters and may therefore be determined to be invalid.
Further, besides judging whether the number of the Chinese characters exceeds the number threshold of the Chinese characters, the number of all the characters in the text and the number of the Chinese characters in the text can be obtained, then whether the proportion of the Chinese characters to all the characters exceeds the proportion threshold is calculated, and if yes, the text can be determined to be valid; if not, the text may be determined to be invalid.
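The two Chinese-character checks above (absolute count and proportion) can be sketched as follows. This is only an illustration: it assumes that "Chinese character" means a code point in the CJK Unified Ideographs block U+4E00 to U+9FFF, and the function names and default thresholds are hypothetical.

```python
import re

# Assumption: Chinese characters are the CJK Unified Ideographs U+4E00..U+9FFF.
CJK = re.compile(r"[\u4e00-\u9fff]")


def chinese_count(text):
    """Number of Chinese characters in the text."""
    return len(CJK.findall(text))


def passes_chinese_count(text, count_threshold=0):
    """Valid when the number of Chinese characters exceeds the threshold."""
    return chinese_count(text) > count_threshold


def passes_chinese_ratio(text, ratio_threshold=0.5):
    """Valid when Chinese characters make up more than the given share of all characters."""
    if not text:
        return False
    return chinese_count(text) / len(text) > ratio_threshold
```

With a count threshold of 0, `passes_chinese_count("Yfdcbhj")` fails because the comment contains no Chinese characters at all.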
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring non-Chinese characters in a text, and counting the number of the non-Chinese characters;
and when the number of the non-Chinese characters is smaller than the threshold value of the number of the non-Chinese characters, determining that the text meets the statistical correlation condition.
Specifically, the non-chinese characters in the text may be obtained first, the number of the non-chinese characters may be counted, and then it is determined whether the number of the non-chinese characters exceeds a preset threshold value of the number of the non-chinese characters, and if so, it may be determined that the text is invalid; if not, the text may be determined to be valid.
For example, if the preset threshold value of the number of non-Chinese characters is 10, and a certain comment is "\ N,486970735914696644, see fig. 1, \ N,2405296206,191939760947625", or "http:// hck. hckzf111. cn/register? if intr is 1PA88Jocx & type 0& special share cas good things ", the comment is determined to be invalid.
Further, besides judging whether the number of non-Chinese characters exceeds the threshold value of the number of non-Chinese characters, the number of all characters in the text and the number of non-Chinese characters in the text can be obtained, and it is then calculated whether the proportion of non-Chinese characters to all characters exceeds the proportion threshold. If so, the text can be determined to be invalid; if not, the text can be determined to be valid.
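A sketch of the non-Chinese-character checks follows. It assumes that whitespace is ignored and that every other character outside U+4E00 to U+9FFF counts as non-Chinese; both choices, the function names, and the URL used in the usage note are hypothetical.

```python
import re

# Assumption: anything that is neither whitespace nor a CJK Unified Ideograph
# counts as a non-Chinese character.
NON_CJK = re.compile(r"[^\u4e00-\u9fff\s]")


def non_chinese_count(text):
    """Number of non-Chinese, non-whitespace characters in the text."""
    return len(NON_CJK.findall(text))


def passes_non_chinese_count(text, count_threshold=10):
    """Valid when the number of non-Chinese characters is smaller than the threshold."""
    return non_chinese_count(text) < count_threshold


def passes_non_chinese_ratio(text, ratio_threshold=0.5):
    """Invalid when non-Chinese characters exceed the given share of visible characters."""
    visible = len(re.sub(r"\s", "", text))
    if visible == 0:
        return False
    return non_chinese_count(text) / visible <= ratio_threshold
```

A comment that is mostly a link, such as the placeholder `"http://example.com/register?type=0"`, fails the count check with a threshold of 10.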
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring characters in a text;
detecting whether continuously repeated characters exist in the characters;
counting the repetition times of each continuously repeated character when the continuously repeated character is detected to exist in the characters;
and when the repetition times of any continuously repeated character do not exceed the repetition time threshold value, determining that the text meets the statistical correlation condition.
Specifically, all characters in the text may be obtained first, and then whether there are consecutive repeated characters or character strings in all the characters is detected, if yes, the number of times of repetition of each consecutive repeated character or character string is counted, and when the number of times of repetition of any consecutive repeated character or character string does not exceed a threshold number of times of repetition, it is determined that the text meets a statistically relevant condition. Therefore, invalid comments can be further efficiently screened out based on the repetition times of any continuously repeated character in the text.
For example, if the preset threshold value of the number of times of repetition of the character or the character string is 5, and a certain comment is "how you are black", or "how do you see how you see we" is read at the tail of a red machine, it may be determined that the comment is invalid.
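Detection of continuously repeated characters or character strings can be sketched with a back-referencing regular expression. The sketch assumes that a repeated unit is at most `max_unit` characters long and uses a greedy, non-overlapping scan; `max_unit` and the function names are hypothetical, and the repetition threshold of 5 is taken from the example above.

```python
import re


def max_consecutive_repeats(text, max_unit=4):
    """Largest number of back-to-back repetitions of any unit of 1..max_unit characters."""
    best = 1 if text else 0
    for unit_len in range(1, max_unit + 1):
        # (.{k})\1+ matches a k-character unit followed by one or more copies of itself.
        pattern = re.compile(r"(.{%d})\1+" % unit_len, re.DOTALL)
        for match in pattern.finditer(text):
            best = max(best, len(match.group(0)) // unit_len)
    return best


def passes_repeat_check(text, repeat_threshold=5):
    """Meets the condition when no unit repeats more often than the threshold."""
    return max_consecutive_repeats(text) <= repeat_threshold
```

For instance, `max_consecutive_repeats("加油加油加油")` counts the two-character unit "加油" three times, while a run of seven identical characters exceeds a threshold of 5.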
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring all Chinese characters in a text and the initial letter of each Chinese character;
counting the continuous occurrence times of each initial;
and when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value, determining that the text accords with the statistical correlation condition.
Specifically, all chinese characters in the text may be obtained first, then the first letter of each chinese character is obtained, then the number of times each first letter appears continuously is counted, and if the number of times any one first letter appears continuously does not exceed the continuous occurrence threshold, it may be determined that the text meets the statistically relevant condition.
For example, if the threshold value of the number of consecutive occurrences of the initial "H" is 5, and a certain comment is "haha good still good" or "yaha red fire", it may be determined that the comment is invalid. Therefore, invalid comments can be further efficiently screened out based on the continuous occurrence times of any initial letter in the text.
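The initial-letter rule can be sketched as below. Mapping a Chinese character to the first letter of its pinyin normally needs a pronunciation dictionary; here a tiny lookup table `PINYIN_INITIAL` stands in for it. The table, its contents, the choice to skip characters missing from it, and the function names are all hypothetical.

```python
from itertools import groupby

# Hypothetical, deliberately tiny table: Chinese character -> first letter of its pinyin.
PINYIN_INITIAL = {
    "哈": "H", "好": "H", "还": "H", "红": "H", "火": "H", "很": "H",
    "实": "S", "用": "Y",
}


def longest_initial_run(text, table=PINYIN_INITIAL):
    """Length of the longest run of consecutive characters sharing one initial letter."""
    # Characters missing from the table (including non-Chinese ones) are skipped.
    initials = [table[ch] for ch in text if ch in table]
    return max((len(list(group)) for _, group in groupby(initials)), default=0)


def passes_initial_check(text, run_threshold=5):
    """Meets the condition when no initial appears more than run_threshold times in a row."""
    return longest_initial_run(text) <= run_threshold
```

A comment made of seven characters that all start with "H" fails a threshold of 5, whereas a comment whose initials are "H", "S", "Y" passes.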
Step S204, when the detection accords with the statistic correlation condition, determining whether the text is semantically effective or not based on a preset semantic rule;
After steps S201 to S203, most invalid texts have been filtered out along dimensions such as text format and keywords. Whether the semantics are valid is then determined for the remaining texts based on the preset semantic rule, so that texts can also be filtered along the dimension of their actual semantics.
In a preferred embodiment of the present invention, determining whether a text is semantically valid based on a preset semantic rule includes:
calculating the confusion degree of the text;
when the confusion degree does not exceed the confusion degree threshold, calculating the similarity between the text and each valid text in the preset valid text set to obtain at least one first similarity, and calculating the similarity between the text and each invalid text in the preset invalid text set to obtain at least one second similarity;
and when at least one first similarity in the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one second similarity in the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
Wherein the confusion degree (that is, the perplexity) is used for representing how fluent the semantics of the text are. The lower the confusion degree, the more fluent the semantics of the text; the higher the confusion degree, the less fluent the semantics of the text.
Specifically, a confusion degree threshold may be preset, and a preset language model is then used to calculate the confusion degree of the text, which may specifically use the following formula:
wherein $P(\omega_1,\dots,\omega_m)$ is the numerical value of the confusion degree of the text, $\omega$ is a character or word, and $\omega_1$ to $\omega_m$ form a text sentence comprising $m$ characters and/or words. For example, in the text sentence "good weather today", $m$ is 3, $\omega_1$ is "today", $\omega_2$ is "weather", and $\omega_3$ is "good". Then, the formula is modified by using a preset domain dictionary, so that when $\omega_i, \omega_{i-1}, \dots, \omega_{i-(n-1)}$ form a phrase specific to the domain, $P(\omega_i \mid \omega_{i-(n-1)},\dots,\omega_{i-1})$ is set to 1. For example, the formula is modified by using a game-domain dictionary: when a phrase in a text sentence is detected to be a phrase specific to the game domain, $P(\omega_i \mid \omega_{i-(n-1)},\dots,\omega_{i-1})$ is set to 1, and the confusion degree of the whole text is then calculated. The larger the confusion degree, the less fluent the semantics, and the higher the probability that the semantics of the text are invalid and that the text is an invalid text. Meanwhile, because the formula is modified by the domain dictionary, the modified formula can detect comments in the domain more accurately, so that invalid comments in the domain can be further screened out efficiently.
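As an illustration of the calculation described above, the sketch below assumes the usual n-gram definition of perplexity, in which the confusion degree of a sentence of $m$ tokens equals

$$\left(\prod_{i=1}^{m} P(\omega_i \mid \omega_{i-(n-1)},\dots,\omega_{i-1})\right)^{-1/m}.$$

This expression is an assumption that is consistent with the symbols defined above, not a quotation of the embodiment's formula. The function name `perplexity`, the probability table `cond_prob`, the probability floor for unseen n-grams, and the representation of the domain dictionary as a set of token tuples are likewise hypothetical.

```python
import math


def perplexity(tokens, cond_prob, domain_phrases=frozenset(), n=2, floor=1e-6):
    """Perplexity of a token sequence under a toy n-gram model.

    cond_prob maps (context_tuple, token) to P(token | context).
    An n-gram found in domain_phrases gets probability 1, which mirrors
    the domain-dictionary modification described in the text.
    """
    if not tokens:
        raise ValueError("empty text")
    log_sum = 0.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[max(0, i - (n - 1)):i])
        if context + (token,) in domain_phrases:
            p = 1.0  # domain-specific phrase: treated as fully expected
        else:
            # Unseen n-grams fall back to a small floor instead of zero.
            p = max(cond_prob.get((context, token), 0.0), floor)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(tokens))
```

Marking an n-gram as domain-specific can only lower the perplexity, so jargon that a general-purpose model finds surprising no longer pushes a legitimate in-domain comment over the threshold.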
Further, the language model may be a BERT (Bidirectional Encoder Representations from Transformers) model, a language model proposed by Google that can be used for natural language processing tasks such as text classification and reading comprehension. Of course, other language models may also be used and may be set according to actual requirements in practical applications, which is not limited in the embodiment of the present invention.
For example, two texts are "purple sweet potato pudding and purple sweet potato pudding" and "French fdfa to reset and then react to open the conforming card". The two texts are input into the language model, which segments each text to obtain the tokens $\omega_1$ to $\omega_m$. The formula then gives a confusion degree of 1230 for the first text and 1345 for the second. Since the preset confusion degree threshold is 350, both texts can be determined to be semantically invalid.
Further, an effective text set, an invalid text set, an effective similarity threshold and an invalid similarity threshold can be preset, wherein the effective text set comprises at least one text marked as effective, and the invalid text set comprises at least one text marked as invalid.
In practical application, the similarity between the text to be detected and each valid text in the valid text set can be respectively calculated to obtain a plurality of first similarities, the similarity between the text to be detected and each invalid text in the invalid text set can be respectively calculated to obtain a plurality of second similarities, and then whether at least one first similarity in the plurality of first similarities exceeds the valid similarity threshold value or not and whether at least one second similarity in the plurality of second similarities exceeds the invalid similarity threshold value or not are judged. And when at least one first similarity in the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one second similarity in the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
For example, the valid similarity threshold and the invalid similarity threshold are both 0.8. After the similarity between a comment and each valid text and each invalid text is calculated, if the similarity between the comment and some valid text is 0.9, the comment can be determined to be semantically valid; if the similarity between the comment and some invalid text is 0.85, the comment can be determined to be semantically invalid.
Wherein, the similarity calculation may be carried out with the Faiss similarity search tool. Faiss is a high-performance library for similarity search and clustering of high-dimensional vectors, and its open-source license is the BSD license.
In the embodiment of the present invention, Faiss may be used to calculate the similarity between any text and each valid text in the valid text set to obtain the first similarities, and to calculate the similarity between the text and each invalid text in the invalid text set to obtain the second similarities.
Faiss reduces memory usage and supports large-scale data sets, for example similarity search over high-dimensional vectors at the scale of one billion on a single machine. Therefore, the texts in the valid text set and the invalid text set that are most similar to the text to be detected can be found quickly by using Faiss.
It should be noted that, in the embodiment of the present invention, if the similarity between a certain text and a valid text exceeds the valid similarity threshold, the semantics of the two are very similar; likewise, if the similarity between a certain text and an invalid text exceeds the invalid similarity threshold, the semantics of the two are very similar. Therefore, in practical applications, the probability that a text is very similar in semantics both to some valid text and to some invalid text is almost zero, and the judgment result of the embodiment of the present invention is not affected.
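The threshold decision can be sketched as follows. To keep the example self-contained, a plain cosine similarity over small vectors is used as a stand-in for the Faiss search named above; the embedding vectors, the function names, and the "undetermined" result for a text that matches neither set are assumptions not taken from the embodiment.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_verdict(vec, valid_vecs, invalid_vecs, valid_th=0.8, invalid_th=0.8):
    """Compare a text vector against the valid and invalid sets."""
    first = [cosine(vec, v) for v in valid_vecs]     # first similarities
    second = [cosine(vec, v) for v in invalid_vecs]  # second similarities
    if any(s > valid_th for s in first):
        return "valid"
    if any(s > invalid_th for s in second):
        return "invalid"
    return "undetermined"  # assumption: the source does not cover this case
```

In a deployment of the kind described, the two list comprehensions would be replaced by a nearest-neighbour lookup in an index built over each set, so that only the closest entries need to be compared against the thresholds.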
Further, the generation manner of the valid text set and the invalid text set, and the training manner of the preset language model are specifically as follows.
In a preferred embodiment of the present invention, the valid text set and the invalid text set are generated as follows:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating to obtain the confusion degree of each third text sample;
taking at least one third text sample with the confusability smaller than the confusability threshold value as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
Specifically, for all unlabeled text samples, the interaction information of each text sample is first acquired, and the text samples whose interaction information is successfully acquired are collected to obtain a set A.
Then, "detecting whether the text contains preset keywords" is executed on each text sample in the set A; when the text does not contain the preset keywords, the text is detected based on the preset character statistical rule to determine whether it meets the statistical correlation condition, and the text samples meeting the statistical correlation condition are collected to obtain a set B.
A preset language model is then used to calculate the degree of confusion for each text sample in set B. The preset language model is provided with initial parameters and parameter values corresponding to the parameters.
For example, two texts are "purple sweet potato pudding and purple sweet potato pudding" and "French fdfa to reset and then react to open the conforming card". The language model gives a confusion degree of 1230 for the first text and 1345 for the second; since the preset confusion degree threshold is 350, both texts can be determined to be invalid.
Similarly, the texts in the set B whose confusion degree is greater than the confusion degree threshold are filtered out, which removes a large number of invalid comments and yields a set C. Unsupervised binary classification is then performed on the set C.
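The construction of the sets A, B and C can be sketched as a chain of filters. The individual checks are passed in as functions so that the sketch stays independent of how they are implemented; the name `build_sets` and all parameter names are hypothetical.

```python
def build_sets(samples, get_interaction, has_keyword, passes_stats,
               perplexity, ppl_threshold):
    """Filter unlabeled samples into the sets A, B and C described above.

    get_interaction(s) returns None when no interaction information could be
    acquired; has_keyword, passes_stats and perplexity are caller-supplied
    checks (all hypothetical placeholders here).
    """
    # Set A: samples whose interaction information was successfully acquired.
    set_a = [s for s in samples if get_interaction(s) is not None]
    # Set B: no blacklisted keyword and the character statistics pass.
    set_b = [s for s in set_a if not has_keyword(s) and passes_stats(s)]
    # Set C: confusion degree (perplexity) below the threshold.
    set_c = [s for s in set_b if perplexity(s) < ppl_threshold]
    return set_a, set_b, set_c
```

Each stage only sees what survived the previous one, so the comparatively expensive language-model call runs on the smallest candidate set.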
In practical applications, machine learning mainly solves two types of problems, supervised learning and unsupervised learning.
Supervised learning refers to a process in which an external response variable guides a model to learn the task the user cares about and achieve the purpose the user requires. That is, the ultimate goal of supervised learning is to allow the model to model the response variable required by the user more accurately. For example, a user wants to predict house selling prices in a certain area, or the box office of a movie, from a series of feature values. Here, "selling price" and "movie box office" are the response variables in supervised learning.
Stated differently, a model is learned from a given labeled data set, and when new unlabeled data is input, a prediction result can be obtained from the trained model. Supervised learning is often used to deal with the "classification" problem.
Supervised learning may include three types of models:
1. a linear model;
2. a decision tree model;
3. a neural network model.
These three types of supervised learning models can be subdivided into two categories of problems:
1. a classification problem;
2. a regression problem.
The core of the classification problem is how to use a model to discriminate the class of a data point. This category is typically discrete, such as two or more categories. The core of the regression problem is to use a model to output a predicted value. This value is typically a real number and is continuous.
Unsupervised learning refers to learning in which, under normal circumstances, there is no explicit response variable. The core of unsupervised learning is usually to discover the latent structure and regularities in the data, providing a reference for the user's next decision. Typically, unsupervised learning aims to group, i.e., "cluster", data by using data features. Unsupervised learning can often mine structures within the data that capture the essential connections in the data better than the features provided by the user.
The main purpose of unsupervised learning is to mine the connections inherent in the data. The underlying problem here is that different unsupervised learning methods have different assumptions about the structure inside the data. Therefore, unsupervised learning often differs greatly between different models. Of the numerous unsupervised learning models, the clustering model is undoubtedly an important representative, wherein the K-means algorithm (K-means) is the most common and very important algorithm model in the clustering algorithm model.
In the prior art, a classification model is usually trained based on supervised learning, that is, labeled training samples are prepared in advance, and the classification model is then trained with them. For example, ten thousand training samples labeled as valid and ten thousand training samples labeled as invalid are prepared in advance, and the classification model is then trained with these twenty thousand training samples. Each training sample therefore needs to be labeled manually, which consumes both labor and time, especially when the number of training samples is large.
In the embodiment of the present invention, the classification problem is solved based on unsupervised learning, that is, binary classification is performed on the set C by unsupervised learning (the texts fall into two classes, valid text and invalid text), so as to determine whether each text in the set C is a valid text or an invalid text.
Specifically, although most invalid texts can be filtered out based on the confusion degree, some invalid texts with a low confusion degree may still exist in the set C, that is, texts whose sentences are fluent but which carry no substantive content, such as "Is the weather good today?". Therefore, the set C is further classified by an unsupervised classification method (such as the K-means algorithm with K = 2) to obtain a set C1 containing invalid text samples and a set C2 containing valid text samples.
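A minimal K-means with K = 2 is sketched below on plain numeric vectors. It assumes that each text in the set C has already been turned into a feature vector, which the passage does not detail, and it uses a deterministic initialization (the first point and the point farthest from it) that is a choice of this sketch. The function name `kmeans_two` is hypothetical.

```python
import math


def kmeans_two(points, iterations=20):
    """Split points into two clusters; returns (labels, centers)."""
    first = tuple(points[0])
    farthest = tuple(max(points, key=lambda p: math.dist(p, first)))
    centers = [first, farthest]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[k])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers
```

K-means only separates the data into two groups; deciding which group corresponds to C1 (invalid) and which to C2 (valid) needs an extra step, for example inspecting a few members of each cluster.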
Through the above method, the sets A and C2 containing valid text samples and the set C1 containing invalid text samples can be obtained, together with the trained language model. Then, a plurality of texts that are marked as valid in advance but have a high confusion degree are combined with the sets A and C2 to obtain the final valid text set, and a plurality of texts that are marked as invalid in advance but have a low confusion degree are combined with the set C1 to obtain the final invalid text set. In addition, because the original text samples in the present application are unlabeled, they can be classified more accurately by adopting unsupervised binary classification, so that invalid comments can be further screened out efficiently.
Step S205, when the text semantics are determined to be invalid, determining that the text is an invalid text, and filtering the text.
After the text semantics are determined to be valid, the text can be determined to be a valid text, and the text can be displayed; otherwise, the text is judged to be an invalid text, and filtering processing such as shielding and the like can be carried out on the text.
Further, in practical applications, a bad case (Bad Case) may appear, that is, an invalid text that cannot be recognized in the current manner. In this case, the Bad Case may be manually added to the invalid text set, so that the Bad Case and texts similar to it can be recognized next time.
Step S206, storing the invalid text with invalid semantics into an invalid text set;
and step S207, when the text semantics are determined to be valid, determining that the text is a valid text, and storing the valid text into a valid text set.
Specifically, after the invalid text is filtered, the invalid text can be stored in a preset invalid text set, and similarly, when any text semantic is determined to be valid, any text can be determined to be a valid text, and the valid text is stored in the preset valid text set, so that the valid text set and the invalid text set are expanded.
In the embodiment of the invention, a game text and interactive information of the text are firstly acquired, when the interactive information meets a preset condition, whether the text contains a preset keyword is determined, when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition, when the text meets the statistical correlation condition, whether the text is semantically effective is determined based on a preset semantic rule, when the text is semantically ineffective, the text is determined to be an ineffective text, and the text is filtered. Therefore, whether the text is effective or not is detected based on a plurality of dimensions such as interactive information, preset keywords, statistical relevant conditions and semantic validity, compared with a single detection mode, the accuracy of multi-dimensional detection is higher, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
Furthermore, the valid text set and the invalid text set used when determining whether the text is semantically valid are obtained from unlabeled text samples, and the text samples do not need to be labeled manually, so that labor cost and time cost are greatly reduced, and the user experience is further improved.
In practical applications, the embodiment of the present invention can be applied to an APP in the field of games. In the APP, a user can browse articles about a game, such as game strategy guides, game competition news and the like. Each article can be provided with a user comment area, in which the user can post comments on the article; meanwhile, other users can interact with a comment posted by any user, including replying to it, liking it, disliking it, or sharing it.
For example, in the user comment area of a certain game strategy guide, user A posts the comment "Very practical, I happen to be playing this recently"; user B quotes that comment and posts a separate comment, "Me too, thanks to the original poster"; user C likes the comment posted by user A; and user D posts the comment "Just passing by".
Further, when a user browses the detailed content of any article, the APP may present the detailed content of the article and the corresponding comment of the article, but the presented comment is the comment after being filtered.
For example, when a user browses the detailed content of the game strategy, the detailed content of the game strategy and all comments are acquired, that is, "very practical," exactly click here recently "," i am also thank you for building owner "and" pass by way ", then the interactive information of the comments is acquired for the comments of the user a, since the comments have one piece of comment information (the comment of the user B) and one piece of approval (the approval of the user C), the interactive information meets the preset condition, then the comments of the user a are processed in natural language and matched with each preset keyword in the preset blacklist, and since there is no matching item, it can be determined that the comments of the user a do not contain the preset keyword.
The text is then checked against the preset character statistical rule to determine whether it meets the statistical correlation condition. User A's comment can be determined to meet the condition because its number of Chinese characters exceeds the Chinese character number threshold, its number of non-Chinese characters is below the non-Chinese character number threshold, no continuously repeated character is repeated more times than the repetition threshold, and no initial occurs consecutively more times than the continuous occurrence threshold.
Further, a language model is used to calculate the perplexity of the comment, which is compared with a preset perplexity threshold. Because the perplexity of the comment does not exceed the threshold, the similarity between the comment and every valid text in the valid text set is calculated, as is the similarity between the comment and every invalid text in the invalid text set. The similarity between the comment and one of the valid texts exceeds the preset valid similarity threshold, so the comment is judged to be semantically valid, and user A's comment is determined to be a valid text.
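The perplexity step can be illustrated with a minimal sketch. The text above does not specify which language model is used, so the character-level bigram model with add-one smoothing below is an assumption made only for illustration, and the names `train_bigram_model` and `perplexity` are hypothetical.

```python
import math
from collections import Counter

def train_bigram_model(corpus):
    """Count character bigrams over a list of sentences (illustrative model only)."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence)  # "<s>" marks the sentence start
        vocab.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    # One extra vocabulary slot is reserved for characters never seen in training.
    return context_counts, bigram_counts, len(vocab) + 1

def perplexity(text, model):
    """Perplexity of `text` under the add-one-smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    chars = ["<s>"] + list(text)
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log(p)
    n = max(len(chars) - 1, 1)  # number of predicted characters
    return math.exp(-log_prob / n)
```

A comment whose perplexity exceeds the preset threshold would not proceed to the similarity comparison; the threshold value itself is a tuning choice that the text leaves open.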
The same steps are applied to user B's comment and user D's comment. Because the number of objections to user B's comment does not exceed the preset objection threshold, the interaction information of that comment also meets the preset condition, and user B's comment is likewise a valid text. In user D's comment, the number of repetitions of "passing by" exceeds the repetition threshold, so the statistical correlation condition is not met and user D's comment is an invalid text. The other steps are the same as those for user A's comment and are not repeated here.
Therefore, in addition to the detailed content of the game strategy, the user can see user A's comment, user B's comment, and user C's like, but cannot see user D's comment; that is, user D's comment is blocked.
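The cascade walked through in this example can be sketched as a chain of checks that stops at the first failure. This is a minimal sketch, not the implementation of the embodiment: every function and parameter name is hypothetical, each stage is passed in as a plain predicate, and treating a text that fails the interaction check or hits a keyword as invalid is an assumption.

```python
from typing import Callable, Iterable, List

# A stage is any predicate over the comment text (hypothetical type alias).
Stage = Callable[[str], bool]

def is_valid_text(
    text: str,
    interaction_ok: Stage,
    contains_keyword: Stage,
    meets_statistics: Stage,
    semantically_valid: Stage,
) -> bool:
    """Run the four checks in order and stop at the first failure."""
    if not interaction_ok(text):
        return False  # assumption: failing the interaction check means invalid
    if contains_keyword(text):
        return False  # assumption: a blacklist hit means invalid
    if not meets_statistics(text):
        return False  # e.g. a comment with too many repetitions
    return semantically_valid(text)

def filter_comments(comments: Iterable[str], **stages: Stage) -> List[str]:
    """Keep only the comments that pass every stage."""
    return [c for c in comments if is_valid_text(c, **stages)]
```

Passing the stages as callables keeps the ordering logic separate from the individual rules, so thresholds and models can be changed without touching the cascade.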
Fig. 4 is a schematic structural diagram of a text processing apparatus according to another embodiment of the present application, and as shown in fig. 4, the apparatus according to this embodiment may include:
an obtaining module 401, configured to obtain a text of a game; acquiring interactive information of the text;
a first detection module 402, configured to determine whether the text contains a preset keyword when the interactive information meets a preset condition;
a second detection module 403, configured to, when the text does not contain a preset keyword, detect the text based on a preset character statistical rule to determine whether the text meets a statistical correlation condition;
a third detection module 404, configured to determine whether the text is semantically valid based on a preset semantic rule when the text meets the statistical correlation condition;
a determination module 405, configured to determine that the text is an invalid text when the text is determined to be semantically invalid;
and a filtering module 406 for filtering the text.
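The keyword check performed by the first detection module can be sketched as follows. This is a minimal sketch that assumes plain case-insensitive substring matching against a blacklist; the function name is hypothetical, and a real deployment might tokenize or normalize the text first.

```python
from typing import Iterable

def contains_preset_keyword(text: str, blacklist: Iterable[str]) -> bool:
    """Return True if any non-empty blacklisted keyword occurs in the text."""
    lowered = text.lower()
    # Empty keywords are skipped, because "" is a substring of every text.
    return any(keyword and keyword.lower() in lowered for keyword in blacklist)
```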
In a preferred embodiment of the present invention, the obtaining module is specifically configured to:
acquire at least one of the comment quantity, support quantity, objection quantity, and sharing quantity of the text;
the interactive information meeting the preset condition comprises:
determining that the interactive information meets the preset condition when the acquired comment quantity exceeds a preset comment threshold, and/or the support quantity exceeds a preset support threshold, and/or the objection quantity does not exceed a preset objection threshold, and/or the sharing quantity exceeds a preset sharing threshold.
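The threshold comparison above can be sketched as follows. Because the conditions are joined by "and/or", the sketch exposes a hypothetical `require_all` switch rather than fixing one combination; the class and function names are likewise assumptions.

```python
from dataclasses import dataclass

@dataclass
class InteractionCounts:
    """Holds either the observed counts of a text or the preset thresholds."""
    comments: int = 0
    supports: int = 0
    objections: int = 0
    shares: int = 0

def interaction_meets_condition(
    info: InteractionCounts,
    thresholds: InteractionCounts,
    require_all: bool = False,
) -> bool:
    checks = [
        info.comments > thresholds.comments,       # comment quantity exceeds threshold
        info.supports > thresholds.supports,       # support quantity exceeds threshold
        info.objections <= thresholds.objections,  # objection quantity does not exceed threshold
        info.shares > thresholds.shares,           # sharing quantity exceeds threshold
    ]
    # "and/or" in the text: combine with all() or any() depending on configuration.
    return all(checks) if require_all else any(checks)
```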
In a preferred embodiment of the present invention, the second detection module includes:
the first statistic submodule is used for acquiring Chinese characters in the text and counting the number of the Chinese characters;
and the first determining sub-module is used for determining that the text meets the statistical correlation condition when the number of the Chinese characters exceeds the number threshold of the Chinese characters.
In a preferred embodiment of the present invention, the second detection module includes:
the second statistic submodule is used for acquiring non-Chinese characters in the text and counting the number of the non-Chinese characters;
and the second determining sub-module is used for determining that the text meets the statistical correlation condition when the number of the non-Chinese characters is smaller than the number threshold of the non-Chinese characters.
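The two character-count rules can be sketched as follows. This is a minimal sketch: it treats only the basic CJK Unified Ideographs block as Chinese characters and does not count whitespace as non-Chinese characters, both of which are assumptions, and it combines the two rules in one helper for convenience even though the embodiments describe them separately.

```python
def is_chinese_char(ch: str) -> bool:
    # Basic CJK Unified Ideographs block only; extension blocks are ignored here.
    return "\u4e00" <= ch <= "\u9fff"

def count_chinese(text: str) -> int:
    return sum(1 for ch in text if is_chinese_char(ch))

def count_non_chinese(text: str) -> int:
    # Assumption: whitespace is not counted as a non-Chinese character.
    return sum(1 for ch in text if not is_chinese_char(ch) and not ch.isspace())

def meets_character_counts(text: str, min_chinese: int, max_non_chinese: int) -> bool:
    """Chinese count must exceed its threshold; non-Chinese count must stay below its threshold."""
    return count_chinese(text) > min_chinese and count_non_chinese(text) < max_non_chinese
```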
In a preferred embodiment of the present invention, the second detection module includes:
the first obtaining submodule is used for obtaining all characters in the text;
the detection submodule is used for detecting whether continuous repeated characters exist in all the characters;
the third counting submodule is used for counting the repetition times of each continuously repeated character when the continuously repeated characters exist in all the characters;
and the third determining sub-module is used for determining that the text meets the statistical correlation condition when the repetition frequency of any continuously repeated character does not exceed the repetition frequency threshold value.
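The repetition rule can be sketched with run-length counting. The sketch assumes that "repetition times" means the length of a run of one identical character; detecting a repeated multi-character unit would need a different check. The function names are hypothetical.

```python
from itertools import groupby
from typing import Dict

def longest_runs(text: str) -> Dict[str, int]:
    """Map each continuously repeated character to the length of its longest run."""
    runs: Dict[str, int] = {}
    for ch, group in groupby(text):
        length = len(list(group))
        if length > 1:  # only characters that actually repeat back to back
            runs[ch] = max(runs.get(ch, 0), length)
    return runs

def meets_repetition_rule(text: str, max_repeats: int) -> bool:
    """True when no continuously repeated character exceeds the threshold."""
    return all(n <= max_repeats for n in longest_runs(text).values())
```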
In a preferred embodiment of the present invention, the second detection module includes:
the second acquisition submodule is used for acquiring all Chinese characters in the text and the initial letters of all the Chinese characters;
the fourth statistic submodule is used for counting the continuous occurrence times of each first letter;
and the fourth determining submodule is used for determining that the text accords with the statistical correlation condition when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value.
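The initial-letter rule can be sketched in the same way. Obtaining the initial of a Chinese character requires a pinyin lookup, which the text does not specify; the tiny `INITIALS` table below is a hypothetical stand-in for a full dictionary or pinyin library, and characters missing from it simply break a run.

```python
from itertools import groupby
from typing import Dict

# Hypothetical lookup table for illustration only; a real system needs a full pinyin source.
INITIALS: Dict[str, str] = {"啊": "a", "阿": "a", "爱": "a", "好": "h", "玩": "w"}

def longest_initial_run(text: str, initials: Dict[str, str] = INITIALS) -> int:
    """Length of the longest run of consecutive characters sharing one initial."""
    letters = [initials.get(ch) for ch in text]  # None for characters without an entry
    best = 0
    for letter, group in groupby(letters):
        if letter is not None:
            best = max(best, len(list(group)))
    return best

def meets_initial_rule(text: str, max_run: int) -> bool:
    """True when no initial occurs consecutively more often than the threshold."""
    return longest_initial_run(text) <= max_run
```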
In a preferred embodiment of the present invention, the third detecting module includes:
the similarity calculation submodule is used for calculating the similarity between the text and each valid text in the preset valid text set to obtain at least one first similarity, and calculating the similarity between the text and each invalid text in the preset invalid text set to obtain at least one second similarity;
and the judging submodule is used for judging that the text is semantically valid when at least one of the first similarities exceeds a valid similarity threshold, or judging that the text is semantically invalid when at least one of the second similarities exceeds an invalid similarity threshold.
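The similarity comparison can be sketched as follows. The similarity measure is not specified above, so Jaccard similarity over character bigrams is used purely as an assumption; checking the valid set first and returning `None` when neither threshold is exceeded are also choices of this sketch, and the function names are hypothetical.

```python
from typing import Iterable, Optional, Set

def char_bigrams(text: str) -> Set[str]:
    """Set of character bigrams; a text shorter than two characters is its own single feature."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}

def jaccard(a: str, b: str) -> float:
    x, y = char_bigrams(a), char_bigrams(b)
    return len(x & y) / len(x | y)

def judge_semantics(
    text: str,
    valid_set: Iterable[str],
    invalid_set: Iterable[str],
    valid_threshold: float,
    invalid_threshold: float,
) -> Optional[str]:
    """Return "valid", "invalid", or None when neither threshold is exceeded."""
    if any(jaccard(text, v) > valid_threshold for v in valid_set):
        return "valid"
    if any(jaccard(text, s) > invalid_threshold for s in invalid_set):
        return "invalid"
    return None
```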
In a preferred embodiment of the present invention, the apparatus further comprises:
the storage module is used for storing the semantically invalid text into the invalid text set; and, when the text is determined to be semantically valid, determining the text to be a valid text and storing the valid text into the valid text set.
In a preferred embodiment of the present invention, the valid text set and the invalid text set are generated as follows:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating the perplexity of each third text sample;
taking at least one third text sample whose perplexity is smaller than the perplexity threshold as a second sample set;
classifying the second sample set to obtain a third sample set containing valid text samples and a fourth sample set containing invalid text samples;
and obtaining a final valid text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
The text processing apparatus of this embodiment can execute the text processing methods shown in the first embodiment and the second embodiment of this application, and the implementation principles thereof are similar, and are not described herein again.
In another embodiment of the present application, there is provided an electronic device including: a memory and a processor; at least one program is stored in the memory for execution by the processor and, when executed by the processor, implements the text processing method of the foregoing embodiments, with the same beneficial effects as described for those embodiments.
In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 5000 includes a processor 5001 and a memory 5003. The processor 5001 and the memory 5003 are coupled, for example via a bus 5002. Optionally, the electronic device 5000 may also include a transceiver 5004. It should be noted that, in practical applications, the transceiver 5004 is not limited to one, and the structure of the electronic device 5000 does not constitute a limitation on the embodiments of the present application.
The processor 5001 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 5001 may also be a combination of processors implementing computing functionality, e.g., a combination comprising one or more microprocessors, a combination of DSPs and microprocessors, or the like.
The memory 5003 may be, but is not limited to, a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 5003 is used for storing application program codes for executing the present solution, and the execution is controlled by the processor 5001. The processor 5001 is configured to execute application program code stored in the memory 5003 to implement the teachings of any of the foregoing method embodiments.
The electronic device includes, but is not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), and fixed terminals such as digital TVs and desktop computers.
Yet another embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when run on a computer, the program enables the computer to perform the corresponding content of the foregoing method embodiments, with the same beneficial effects as described for those embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence, but may be performed at different times, in turn, or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (12)
1. A method of text processing, comprising:
acquiring a target text of a game;
acquiring interactive information of the text; when the interactive information meets a preset condition, determining whether the text comprises a preset keyword or not;
when the text does not contain a preset keyword, detecting the text based on a preset character statistical rule to determine whether the text meets a statistical correlation condition;
when the text meets the statistical correlation condition, determining whether the text is semantically valid based on a preset semantic rule;
and when the text semantics are determined to be invalid, determining the text to be an invalid text, and filtering the text.
2. The text processing method according to claim 1, wherein the interactive information includes at least one of a comment quantity, a support quantity, an objection quantity, and a sharing quantity of the text;
the interactive information meeting the preset condition comprises:
when the comment quantity exceeds a preset comment threshold, and/or the support quantity exceeds a preset support threshold, and/or the objection quantity does not exceed a preset objection threshold, and/or the sharing quantity exceeds a preset sharing threshold, determining that the interactive information meets the preset condition.
3. The method according to claim 1, wherein the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation condition comprises:
acquiring Chinese characters in the text, and counting the number of the Chinese characters;
and when the number of the Chinese characters exceeds the Chinese character number threshold, determining that the text meets the statistical correlation condition.
4. The method of claim 1, wherein the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation condition comprises:
acquiring non-Chinese characters in the text, and counting the number of the non-Chinese characters;
and when the number of the non-Chinese characters is smaller than the non-Chinese character number threshold, determining that the text meets the statistical correlation condition.
5. The method according to claim 1, wherein the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation condition comprises:
acquiring all characters in the text;
detecting whether continuously repeated characters exist among all the characters;
when continuously repeated characters exist among all the characters, counting the number of repetitions of each continuously repeated character;
and when the number of repetitions of any continuously repeated character does not exceed the repetition threshold, determining that the text meets the statistical correlation condition.
6. The method according to claim 1, wherein the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation condition comprises:
acquiring all Chinese characters in the text and the initial letter of each Chinese character;
counting the number of consecutive occurrences of each initial;
and when the number of consecutive occurrences of any initial does not exceed the continuous occurrence threshold, determining that the text meets the statistical correlation condition.
7. The method of claim 1, wherein determining whether the text is semantically valid based on preset semantic rules comprises:
calculating the perplexity of the text;
when the perplexity does not exceed a perplexity threshold, calculating the similarity between the text and each valid text in a preset valid text set to obtain at least one first similarity, and calculating the similarity between the text and each invalid text in a preset invalid text set to obtain at least one second similarity;
and when at least one of the first similarities exceeds a valid similarity threshold, determining that the text is semantically valid, or when at least one of the second similarities exceeds an invalid similarity threshold, determining that the text is semantically invalid.
8. The text processing method according to claim 1 or 7, further comprising:
storing the semantically invalid text to the invalid text set; and
and when the text semantics are determined to be valid, determining the text to be a valid text, and storing the valid text to the valid text set.
9. The text processing method according to claim 1 or 7, wherein the valid text set and the invalid text set are generated by:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating the perplexity of each third text sample;
taking at least one third text sample whose perplexity is smaller than the perplexity threshold as a second sample set;
classifying the second sample set to obtain a third sample set containing valid text samples and a fourth sample set containing invalid text samples;
and obtaining a final valid text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
10. A text processing apparatus, comprising:
the acquisition module is used for acquiring a text of a game; acquiring interactive information of the text;
the first detection module is used for determining whether the text contains preset keywords or not when the interactive information meets preset conditions;
the second detection module is used for detecting the text based on a preset character statistical rule to determine whether the text meets a statistical correlation condition when the text does not contain a preset keyword;
the third detection module is used for determining whether the text is semantically valid based on a preset semantic rule when the text meets the statistical correlation condition;
the judging module is used for determining the text as an invalid text when the text semantic is determined to be invalid;
and the filtering module is used for filtering the text.
11. An electronic device, comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is used for executing the text processing method of any one of the claims 1-9 by calling the operation instruction.
12. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the text processing method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010073135.6A CN111291551B (en) | 2020-01-22 | 2020-01-22 | Text processing method and device, electronic equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111291551A (en) | 2020-06-16 |
CN111291551B (en) | 2023-04-18 |
Family
ID=71026668
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010073135.6A Active CN111291551B (en) | Text processing method and device, electronic equipment and computer readable storage medium | 2020-01-22 | 2020-01-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111291551B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120226707A1 (en) * | 2011-03-01 | 2012-09-06 | Xerox Corporation | Linguistically enhanced email detector |
US20160306800A1 (en) * | 2015-04-16 | 2016-10-20 | Fluenty Korea Inc. | Reply recommendation apparatus and system and method for text construction |
CN107436875A (en) * | 2016-05-25 | 2017-12-05 | 华为技术有限公司 | File classification method and device |
CN107193801A (en) * | 2017-05-21 | 2017-09-22 | 北京工业大学 | A kind of short text characteristic optimization and sentiment analysis method based on depth belief network |
CN109388743A (en) * | 2017-08-11 | 2019-02-26 | 阿里巴巴集团控股有限公司 | The determination method and apparatus of language model |
CN108446316A (en) * | 2018-02-07 | 2018-08-24 | 北京三快在线科技有限公司 | Recommendation method, apparatus, electronic equipment and the storage medium of associational word |
CN109783657A (en) * | 2019-01-07 | 2019-05-21 | 北京大学深圳研究生院 | Multistep based on limited text space is from attention cross-media retrieval method and system |
CN110717328A (en) * | 2019-07-04 | 2020-01-21 | 北京达佳互联信息技术有限公司 | Text recognition method and device, electronic equipment and storage medium |
Non-Patent Citations (4)
Title |
---|
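DANIEL MAIER et al.: "Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology" * |
GOPINATH SHYAM et al.: "Investigating the relationship between the content of online word of mouth, advertising, and brand performance" * |
黄晟 (Huang Sheng): "基于用户体验的APP设计研究" (Research on APP design based on user experience) * |
齐慧杰 (Qi Huijie) et al.: "探析客户端跟帖评论的管理策略" (An analysis of management strategies for follow-up comments in client apps) * |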
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114547435A (en) * | 2020-11-24 | 2022-05-27 | 腾讯科技(深圳)有限公司 | Content quality identification method, device, equipment and readable storage medium |
CN114547435B (en) * | 2020-11-24 | 2024-10-18 | 腾讯科技(深圳)有限公司 | Content quality identification method, device, equipment and readable storage medium |
CN112529629A (en) * | 2020-12-16 | 2021-03-19 | 北京居理科技有限公司 | Malicious user comment brushing behavior identification method and system |
CN113420234A (en) * | 2021-07-02 | 2021-09-21 | 青海师范大学 | Microblog data acquisition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40024213; Country of ref document: HK |
GR01 | Patent grant | ||