CN111291551B - Text processing method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN111291551B
Authority
CN
China
Prior art keywords
text
preset
invalid
similarity
meets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010073135.6A
Other languages
Chinese (zh)
Other versions
CN111291551A (en)
Inventor
俞一鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010073135.6A
Publication of CN111291551A
Application granted
Publication of CN111291551B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a text processing method and device, electronic equipment and a computer readable storage medium, and relates to the field of processing. The method comprises the following steps: acquiring a text of a game; acquiring interactive information of the text; when the interactive information meets a preset condition, determining whether the text contains a preset keyword; when the text does not contain a preset keyword, detecting the text based on a preset character statistical rule to determine whether the text meets a statistical relevant condition; when the text meets the statistical relevant condition, determining whether the text is semantically valid based on a preset semantic rule; and when the text is determined to be semantically invalid, determining the text to be an invalid text and filtering the text. With the method and the device, users obtain effective comment content from the comment area more efficiently, and the user experience is better.

Description

Text processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of processing technologies, and in particular, to a text processing method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of internet technology, users have a variety of interactions via the internet. Such as: the user can make comments in the comment column below the commented subject, and other users can interact with the comments in the comment area.
Currently, when comments are analyzed, because there is a large amount of comment content for the same comment subject, the comments are mixed with content that has a high repetition rate and no practical significance, such as "sofa"; some comment areas even contain many meaningless sentences typed at random, such as "Fuxi plucks and draws the place and does not have additional Fei A collection ground Hada flavor". Because of such highly repetitive and meaningless comment content, valuable text content in the comment area is submerged, so that the efficiency of obtaining effective comment content from the comment area is low and the user experience is poor.
Disclosure of Invention
The application provides a text processing method and device, an electronic device and a computer readable storage medium, and can solve the problems that the efficiency of obtaining effective comment contents from a comment area by a user is very low and the user experience is poor. The technical scheme is as follows:
in a first aspect, a text processing method is provided, and the method includes:
acquiring a text of a game;
acquiring interactive information of the text; when the interactive information meets a preset condition, determining whether the text comprises a preset keyword or not;
when the text does not contain preset keywords, detecting the text based on a preset character statistical rule to determine whether the text meets a statistical relevant condition;
when the statistical correlation condition is detected to be met, determining whether the text is semantically valid or not based on a preset semantic rule;
and when the text semantics are determined to be invalid, determining the text to be an invalid text, and filtering the text.
Preferably, the interactive information includes at least one of a number of comments, a number of supports, a number of objections, and a number of shares of the text;
the interaction information accords with preset conditions and comprises the following steps:
and when the comment quantity exceeds a preset comment threshold value, and/or the support quantity exceeds a preset support threshold value, and/or the objection quantity does not exceed a preset objection threshold value, and/or the sharing quantity exceeds a preset sharing threshold value, judging that the interactive information meets a preset condition.
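The interaction check described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `Interaction` class, the threshold values, and the `require_all` switch (which models the "and/or" wording) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """Interaction counts for one text; None means the count was not collected."""
    comments: Optional[int] = None
    supports: Optional[int] = None
    objections: Optional[int] = None
    shares: Optional[int] = None


# Hypothetical thresholds, chosen only for illustration.
THRESHOLDS = {"comments": 5, "supports": 10, "objections": 20, "shares": 3}


def meets_preset_condition(info: Interaction, require_all: bool = False) -> bool:
    """Return True when the collected counts satisfy the preset condition.

    Comment, support and share counts must exceed their thresholds;
    the objection count must NOT exceed its threshold.
    """
    checks = []
    if info.comments is not None:
        checks.append(info.comments > THRESHOLDS["comments"])
    if info.supports is not None:
        checks.append(info.supports > THRESHOLDS["supports"])
    if info.objections is not None:
        checks.append(info.objections <= THRESHOLDS["objections"])
    if info.shares is not None:
        checks.append(info.shares > THRESHOLDS["shares"])
    if not checks:
        return False  # assumption: no collected counts means the condition is not met
    return all(checks) if require_all else any(checks)
```

With `require_all=False` a single satisfied check is enough ("or"); with `require_all=True` every collected count must pass ("and").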
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring Chinese characters in the text, and counting the number of the Chinese characters;
and when the number of the Chinese characters exceeds the number threshold of the Chinese characters, determining that the text meets the statistical condition.
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring non-Chinese characters in the text, and counting the number of the non-Chinese characters;
and when the number of the non-Chinese characters is smaller than the threshold value of the number of the non-Chinese characters, determining that the text meets the statistical condition.
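The two character-count rules above (enough Chinese characters, few enough non-Chinese characters) might look like the following sketch. The thresholds are hypothetical, "Chinese character" is approximated here by the CJK Unified Ideographs block U+4E00 to U+9FFF, and ignoring whitespace when counting non-Chinese characters is an assumption of this example.

```python
MIN_CHINESE = 2        # hypothetical threshold for the Chinese-character count
MAX_NON_CHINESE = 10   # hypothetical threshold for the non-Chinese-character count


def is_chinese(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def count_chinese(text: str) -> int:
    """Number of Chinese characters in the text."""
    return sum(1 for ch in text if is_chinese(ch))


def count_non_chinese(text: str) -> int:
    """Number of non-Chinese characters; whitespace is ignored (assumption)."""
    return sum(1 for ch in text if not ch.isspace() and not is_chinese(ch))


def enough_chinese(text: str) -> bool:
    """Rule 1: the Chinese-character count must exceed its threshold."""
    return count_chinese(text) > MIN_CHINESE


def few_non_chinese(text: str) -> bool:
    """Rule 2: the non-Chinese-character count must stay below its threshold."""
    return count_non_chinese(text) < MAX_NON_CHINESE
```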
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring all characters in the text;
detecting whether continuously repeated characters exist in all characters;
counting the repetition times of each continuously repeated character when the continuously repeated characters exist in all the characters;
and when the repetition times of any continuously repeated character do not exceed the repetition time threshold value, determining that the text meets the statistical condition.
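A minimal sketch of the consecutive-repetition rule above, using `itertools.groupby` to find runs of identical characters. The threshold value is hypothetical, and the rule is read here as "every run must stay within the threshold", which is one interpretation of the wording.

```python
from itertools import groupby
from typing import Dict

MAX_REPEAT = 3  # hypothetical repetition threshold


def longest_runs(text: str) -> Dict[str, int]:
    """Map each consecutively repeated character to its longest run length."""
    runs: Dict[str, int] = {}
    for ch, group in groupby(text):
        n = sum(1 for _ in group)
        if n > 1:  # only characters that actually repeat back to back
            runs[ch] = max(runs.get(ch, 0), n)
    return runs


def passes_repeat_rule(text: str, max_repeat: int = MAX_REPEAT) -> bool:
    """True when no character repeats consecutively more than max_repeat times."""
    return all(n <= max_repeat for n in longest_runs(text).values())
```

A text with no consecutive repeats at all passes trivially, because there are no runs to check.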
Preferably, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical condition includes:
acquiring all Chinese characters in the text and the initial letter of each Chinese character;
counting the continuous occurrence times of each initial;
and when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value, determining that the text meets the statistical condition.
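The initial-letter rule above could be sketched as follows, assuming "initial letter" means the first letter of a character's pinyin romanization. The tiny `PINYIN_INITIAL` table is a hand-written stand-in for a real pinyin lookup, the threshold is hypothetical, and characters missing from the table are simply skipped in this example.

```python
from itertools import groupby
from typing import List

# Hypothetical lookup table: first letter of each character's pinyin.
PINYIN_INITIAL = {
    "哈": "h", "好": "h", "很": "h",
    "玩": "w", "我": "w", "游": "y", "戏": "x",
}

MAX_INITIAL_RUN = 3  # hypothetical threshold for consecutive identical initials


def initials(text: str) -> List[str]:
    """Initial letters of the characters found in the table, in text order."""
    return [PINYIN_INITIAL[ch] for ch in text if ch in PINYIN_INITIAL]


def longest_initial_run(text: str) -> int:
    """Length of the longest run of identical consecutive initials (0 if none)."""
    return max((sum(1 for _ in g) for _, g in groupby(initials(text))), default=0)


def passes_initial_rule(text: str) -> bool:
    """True when no initial occurs consecutively more than the threshold."""
    return longest_initial_run(text) <= MAX_INITIAL_RUN
```

A long run of identical initials is typical of text typed by holding or mashing a few keys, which is presumably why the rule counts them.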
Preferably, determining whether the text is semantically valid based on a preset semantic rule includes:
calculating the perplexity (confusion degree) of the text;
when the perplexity does not exceed a perplexity threshold, calculating the similarity between the text and each effective text in a preset effective text set to obtain at least one first similarity, and calculating the similarity between the text and each ineffective text in a preset ineffective text set to obtain at least one second similarity;
and when at least one of the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one of the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
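The semantic rule above can be illustrated with the sketch below. Nothing here is taken from the patent beyond the control flow: the add-one-smoothed character unigram model stands in for whatever language model produces the score, character-bigram Jaccard overlap stands in for the unspecified similarity measure, and all thresholds are hypothetical. Treating a score above the threshold as invalid, and returning `None` when neither similarity threshold is reached, are assumptions of this example.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def unigram_perplexity(text: str, counts: Counter) -> float:
    """Perplexity of text under an add-one-smoothed character unigram model."""
    if not text:
        return float("inf")
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen characters
    log_prob = sum(math.log((counts[ch] + 1) / (total + vocab)) for ch in text)
    return math.exp(-log_prob / len(text))


def char_bigrams(text: str) -> set:
    """Set of adjacent character pairs in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams; 0.0 when both sets are empty."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def semantic_verdict(text: str, perplexity: float,
                     valid_texts: Iterable[str], invalid_texts: Iterable[str],
                     max_perplexity: float = 50.0,
                     valid_sim: float = 0.5,
                     invalid_sim: float = 0.5) -> Optional[bool]:
    """True = semantically valid, False = invalid, None = undecided."""
    if perplexity > max_perplexity:
        return False  # assumption: too-high perplexity is treated as invalid
    if any(jaccard(text, v) > valid_sim for v in valid_texts):
        return True
    if any(jaccard(text, v) > invalid_sim for v in invalid_texts):
        return False
    return None  # assumption: neither threshold reached, left undecided
```

In practice the perplexity would come from a language model trained on real comments, and the similarity from a sentence-level measure; the verdict logic stays the same.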
Preferably, the method further comprises the following steps:
storing the invalid text with invalid semantics to the invalid text set; and
and when the text semantics are determined to be valid, determining that the text is a valid text, and storing the valid text to the valid text set.
Preferably, the valid text set and the invalid text set are generated by:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating the perplexity (confusion degree) of each third text sample;
taking at least one third text sample whose perplexity is smaller than the perplexity threshold as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
In a second aspect, an apparatus for text processing is provided, the apparatus comprising:
the acquisition module is used for acquiring a text of a game; acquiring interactive information of the text;
the first detection module is used for determining whether the text contains preset keywords or not when the interactive information meets preset conditions;
the second detection module is used for detecting the text based on a preset character statistical rule to determine whether the text meets a statistical relevant condition or not when the text does not contain a preset keyword;
the third detection module is used for determining whether the text is semantically valid or not based on a preset semantic rule when the text meets the statistical relevant condition;
the judging module is used for determining the text as an invalid text when the text semantics are determined to be invalid;
and the filtering module is used for filtering the text.
Preferably, the obtaining module is specifically configured to:
acquiring at least one of comment quantity, support quantity, objection quantity and sharing quantity of the text;
the interaction information accords with preset conditions and comprises the following steps:
and when the acquired comment quantity exceeds a preset comment threshold value, and/or the support quantity exceeds a preset support threshold value, and/or the objection quantity does not exceed a preset objection threshold value, and/or the sharing quantity exceeds a preset sharing threshold value, judging that the interaction information meets a preset condition.
Preferably, the second detection module comprises:
the first statistic submodule is used for acquiring the Chinese characters in the text and counting the number of the Chinese characters;
a first determining sub-module for determining that the text meets a statistically relevant condition when the number of Chinese characters exceeds a threshold number of Chinese characters.
Preferably, the second detection module comprises:
the second statistic submodule is used for acquiring the non-Chinese characters in the text and counting the number of the non-Chinese characters;
and the second determining sub-module is used for determining that the text meets the statistical correlation condition when the number of the non-Chinese characters is smaller than the threshold value of the number of the non-Chinese characters.
Preferably, the second detection module comprises:
the first obtaining sub-module is used for obtaining all characters in the text;
the detection submodule is used for detecting whether continuous repeated characters exist in all the characters;
the third counting submodule is used for counting the repetition times of each continuously repeated character when the continuously repeated characters exist in all the characters;
and the third determining sub-module is used for determining that the text meets the statistical correlation condition when the repetition frequency of any continuously repeated character does not exceed the repetition frequency threshold value.
Preferably, the second detection module comprises:
the second obtaining sub-module is used for obtaining all Chinese characters in the text and the first letter of each Chinese character;
the fourth statistic submodule is used for counting the continuous occurrence times of each first letter;
and the fourth determining submodule is used for determining that the text meets the statistical correlation condition when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value.
Preferably, the third detection module comprises:
the similarity calculation submodule is used for calculating the similarity between the text and each effective text in a preset effective text set to obtain at least one first similarity, and calculating the similarity between the text and each ineffective text in a preset ineffective text set to obtain at least one second similarity;
and the judging submodule is used for judging that the text semantics are valid when at least one first similarity in the first similarities exceeds a valid similarity threshold, or judging that the text semantics are invalid when at least one second similarity in the second similarities exceeds an invalid similarity threshold.
Preferably, the apparatus further comprises:
the storage module is used for storing the invalid texts with invalid semantics into the invalid text set; and when the text semantics are determined to be valid, determining that the text is a valid text, and storing the valid text to the valid text set.
Preferably, the valid text set and the invalid text set are generated by:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating the perplexity (confusion degree) of each third text sample;
taking at least one third text sample whose perplexity is smaller than the perplexity threshold as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
In a third aspect, an electronic device is provided, which includes:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is configured to call the operation instructions, and the operation instructions enable the processor to execute operations corresponding to the text processing method shown in the first aspect of the application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the text processing method shown in the first aspect of the present application.
The beneficial effects brought by the technical solution provided by this application are:
in the embodiment of the invention, a game text and interactive information of the text are firstly acquired, when the interactive information meets a preset condition, whether the text contains a preset keyword is determined, when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition, when the text meets the statistical correlation condition, whether the text is semantically effective is determined based on a preset semantic rule, and when the text is semantically ineffective, the text is determined to be an ineffective text and the text is filtered. Therefore, whether the text is effective or not is detected based on a plurality of dimensions such as interactive information, preset keywords, statistical relevant conditions and semantic validity, compared with a single detection mode, the accuracy of multi-dimensional detection is higher, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a text processing method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a text processing method according to another embodiment of the present application;
Fig. 3 is a schematic diagram of data interaction between a review system and an invalid review filtering service in the present application;
Fig. 4 is a schematic structural diagram of a text processing apparatus according to another embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device for text processing according to yet another embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The text processing method, the text processing device, the electronic equipment and the computer-readable storage medium provided by the present application aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In one embodiment, a text processing method is provided, as shown in fig. 1, the method including:
step S101, obtaining a game text;
the text of the game can be comment text in a comment system in the field of games. Further, in addition to the game field, the embodiment of the present invention may be applied to all comment-related fields, such as websites, APPs, and the like having a comment system. As shown in fig. 3, in practical applications, when a website, an APP, or the like needs to display comments, a comment system may first obtain the relevant background comment data, then automatically filter all comments through an invalid comment filtering service to obtain valid comments, and then display the valid comments at the front end, so that the user sees the filtered comments.
Further, the comment system and the invalid comment filtering service may both be set in the terminal, or the comment system may be set in the terminal, and the invalid comment filtering service may be set in the server, which may be set according to actual requirements in actual applications, which is not limited in the embodiment of the present invention.
Step S102, acquiring interactive information of a text; when the interactive information meets the preset conditions, determining whether the text comprises preset keywords or not;
the interactive information may include other information besides the comment quantity, the support quantity, the objection quantity, and the share quantity, and may be adjusted according to actual requirements in actual applications, which is not limited in the embodiment of the present invention. Because interactions are generated manually by other users after reading the text, and users understand the text more accurately than a machine does (for example, if 100 users object to a comment, the comment is very likely problematic), text processing can draw on other users' semantic understanding to improve accuracy.
Further, when the interaction information of the text meets a preset condition, for example, the number of supports exceeds a preset support threshold, or the number of objections does not exceed a preset objection threshold, it can be further determined whether the text includes a preset keyword. In practical application, a blacklist of keywords may be preset, where the blacklist may include a plurality of preset keywords, and when it is detected that a text includes at least one preset keyword in the blacklist, it may be determined that the text is an invalid comment. The invalid comments can be comments without Chinese characters or comments with discordant sentences and the like; a valid comment may be the reverse of an invalid comment, i.e., a valuable or meaningful comment. Invalid comments can be further efficiently screened out through a blacklist of preset keywords.
Step S103, when the text does not contain the preset keywords, detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation conditions;
when the text is detected not to contain the preset keywords, it can be determined that the text does not contain any keyword in the preset blacklist, and the text is then further detected based on the preset character statistical rules to determine whether it meets the statistical relevant conditions.
Step S104, when the detection accords with the statistic correlation condition, determining whether the text is semantically effective or not based on a preset semantic rule;
after the steps S101 to S103, most of invalid texts may be filtered from the format of the text, the dimensions of keywords, and the like, and then whether the semantics are valid is determined for the remaining texts based on the preset semantic rule, so that the text may be filtered from the dimensions of the real semantics of the text.
And step S105, when the text semantics are determined to be invalid, determining that the text is an invalid text, and filtering the text.
After the text semantics are determined to be valid, the text can be determined to be a valid text, and the text can be displayed; otherwise, the text is judged to be an invalid text, and filtering processing such as shielding can be carried out on the text.
In the embodiment of the invention, a game text and interactive information of the text are firstly acquired, when the interactive information meets a preset condition, whether the text contains a preset keyword is determined, when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition, when the text meets the statistical correlation condition, whether the text is semantically effective is determined based on a preset semantic rule, when the text is semantically ineffective, the text is determined to be an ineffective text, and the text is filtered. Therefore, whether the text is effective or not is detected based on multiple dimensions such as interactive information, preset keywords, statistical relevant conditions and effective semantics, the accuracy of multi-dimensional detection is higher compared with a single detection mode, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
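The cascade of steps S101 to S105 summarized above can be sketched as a chain of pluggable checks. The function names and the three-way result are assumptions of this example; in particular, the text only states the outcome for a keyword hit (invalid) and for the semantic check (valid or invalid), so the branches where the interaction condition or a statistical rule is not met are reported here as "undetermined" rather than guessed.

```python
from typing import Callable, List, Sequence

Check = Callable[[str], bool]


def classify(text: str,
             interaction_ok: Check,
             has_keyword: Check,
             stat_checks: Sequence[Check],
             semantically_valid: Check) -> str:
    """Run the staged checks and return 'valid', 'invalid' or 'undetermined'."""
    if not interaction_ok(text):
        return "undetermined"  # branch not specified in the text (assumption)
    if has_keyword(text):
        return "invalid"       # blacklist hit: treated as an invalid comment
    if not all(check(text) for check in stat_checks):
        return "undetermined"  # branch not specified in the text (assumption)
    return "valid" if semantically_valid(text) else "invalid"


def filter_comments(texts: Sequence[str], **stages) -> List[str]:
    """Keep every text that is not classified as invalid."""
    return [t for t in texts if classify(t, **stages) != "invalid"]


# Usage with stub checks; real checks would implement the rules described above.
stages = dict(
    interaction_ok=lambda t: True,
    has_keyword=lambda t: "spam" in t,
    stat_checks=[lambda t: len(t) > 3],
    semantically_valid=lambda t: "great" in t,
)
shown = filter_comments(["great game", "spam spam", "asdfgh", "ok"], **stages)
```

Ordering the cheap checks first means the comparatively expensive semantic check only runs on texts that survive the keyword and character-statistics stages.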
In another embodiment, a text processing method is provided, as shown in fig. 2, the method including:
step S201, obtaining a game text;
the text of the game can be comment text in a comment system in the field of the game. Further, in addition to the game field, the embodiment of the present invention can be applied to all the comment fields, such as a website with a comment system, APP, and the like. As shown in fig. 3, in practical applications, when a website, an APP, or the like needs to display comments, the comment system may first obtain background comment data, then automatically filter all comments through an invalid comment filtering service to obtain valid comments, and then display the valid comments on the front end, so that the comments that are filtered are seen by a user.
For example, a certain web page of a certain website originally includes 100 comments, and the 100 comments are filtered before the 100 comments are displayed, so that 80 valid comments and 20 invalid comments are obtained, and therefore, the 80 valid comments are displayed, and the 20 invalid comments are shielded. The invalid comments can be comments without Chinese characters or comments with discordant sentences and the like; a valid comment may be the reverse of an invalid comment, i.e., a valuable or meaningful comment.
Further, the comment system and the invalid comment filtering service may both be set in the terminal, or the comment system may be set in the terminal, and the invalid comment filtering service may be set in the server, which may be set according to actual requirements in actual applications, which is not limited in the embodiment of the present invention. The terminal may have the following features:
(1) On a hardware architecture, a device has a central processing unit, a memory, an input unit and an output unit, that is, the device is often a microcomputer device having a communication function. In addition, various input modes such as a keyboard, a mouse, a touch screen, a microphone, a camera and the like can be provided, and input can be adjusted as required. Meanwhile, the equipment often has a plurality of output modes, such as a telephone receiver, a display screen and the like, and can be adjusted according to needs;
(2) In a software system, the device must have an operating system, such as Windows Mobile, Symbian, Palm, Android, iOS, and the like. Meanwhile, these operating systems are becoming more and more open, and personalized application programs developed on these open operating system platforms emerge endlessly, such as an address book, a schedule, a notebook, a calculator, various games and the like, so that the requirements of personalized users are met to a great extent;
(3) In terms of communication capability, the equipment has flexible access modes and high-bandwidth communication performance, and can automatically adjust the selected communication mode according to the selected service and the environment, thereby facilitating the use of users. The device can support GSM (Global System for Mobile Communications), WCDMA (Wideband Code Division Multiple Access), CDMA2000 (Code Division Multiple Access 2000), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), Wi-Fi (Wireless Fidelity), WiMAX (Worldwide Interoperability for Microwave Access) and the like, so that it is suitable for various types of networks and supports not only voice services but also various wireless data services;
(4) In the aspect of function use, the equipment focuses more on humanization, individuation and multi-functionalization. With the development of computer technology, devices enter a human-centered mode from a device-centered mode, and the embedded computing, control technology, artificial intelligence technology, biometric authentication technology and the like are integrated, so that the human-oriented purpose is fully embodied. Due to the development of software technology, the equipment can be adjusted and set according to personal requirements, and is more personalized. Meanwhile, the device integrates a plurality of software and hardware, and the function is more and more powerful.
Step S202, acquiring interactive information of a text; when the interactive information meets the preset conditions, determining whether the text comprises preset keywords or not;
In a preferred embodiment of the present invention, the interaction information comprises at least one of the number of comments, the number of supports, the number of objections, and the number of shares of the text;
the interaction information meeting the preset condition comprises:
when the acquired comment quantity exceeds a preset comment threshold, and/or the support quantity exceeds a preset support threshold, and/or the objection quantity does not exceed a preset objection threshold, and/or the sharing quantity exceeds a preset sharing threshold, judging that the interaction information meets the preset condition.
Specifically, for any text, at least one of the comment quantity, the support quantity, the objection quantity, and the share quantity of the text may be obtained first. When the comment quantity exceeds a preset comment threshold, and/or the support quantity exceeds a preset support threshold, and/or the objection quantity does not exceed a preset objection threshold, and/or the share quantity exceeds a preset share threshold, it is determined that the interaction information meets the preset condition.
For example, a certain webpage includes five comments. When text processing is performed on the first comment, the comment quantity, support quantity, objection quantity, and share quantity of the first comment are acquired. When at least one of these quantities satisfies its corresponding preset threshold, that is, the comment quantity exceeds the preset comment threshold, and/or the support quantity exceeds the preset support threshold, and/or the objection quantity does not exceed the preset objection threshold, and/or the share quantity exceeds the preset share threshold, it is determined that the interaction information meets the preset condition. At this point, it can be determined that the first comment is a valid comment, and the other four comments are handled in the same way.
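The threshold test described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `Interaction`, `Thresholds`, and `meets_preset_condition`, the default threshold values, and the `require_all` switch (modelling the "and/or" wording) are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    comments: int = 0
    supports: int = 0
    objections: int = 0
    shares: int = 0


@dataclass
class Thresholds:
    # Hypothetical default values; real thresholds would be tuned per application.
    comments: int = 0
    supports: int = 0
    objections: int = 10
    shares: int = 0


def meets_preset_condition(info, t, require_all=False):
    """Check the interaction information against the preset thresholds.

    Objections pass when they do NOT exceed their threshold; the other
    three counts pass when they exceed theirs.
    """
    checks = [
        info.comments > t.comments,
        info.supports > t.supports,
        info.objections <= t.objections,
        info.shares > t.shares,
    ]
    # "and/or" in the text: any() models "or", all() models "and".
    return all(checks) if require_all else any(checks)
```

With `require_all=False`, a text with no interaction at all still passes because its objection count does not exceed the threshold, which matches the "or" reading of the condition; a stricter deployment might prefer `require_all=True`.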
Further, the interaction information may include other information besides the number of comments, the number of supports, the number of objections, and the number of shares, and may be adjusted according to actual requirements in actual applications, which is not limited in the embodiment of the present invention. Because the interaction information is generated by other users acting manually after reading the text, and users understand the text more accurately than a machine does, the text processing can rely on the semantic understanding of other users, which improves accuracy. For example, a comment with 100 user objections is very likely problematic.
Further, when the interaction information of the text meets a preset condition, for example, the number of supports exceeds a preset support threshold, or the number of objections does not exceed a preset objection threshold, it can be further determined whether the text includes a preset keyword. In practical application, a blacklist of keywords may be preset, where the blacklist may include a plurality of preset keywords, and when it is detected that a text includes at least one preset keyword in the blacklist, it may be determined that the text is an invalid comment. Invalid comments can be further efficiently screened out through a blacklist of preset keywords.
For example, if the keyword blacklist includes "number of words hit" and "five words", and a certain comment is "this is five words", it is detected that the comment includes "five words" in the blacklist, then the comment can be determined as an invalid comment, and the comment is shielded; or, if a comment is "5 words together", and it is detected that the comment contains "words together" in the blacklist, the comment can be determined as an invalid comment, and the comment is masked.
Further, when detecting whether the text contains preset keywords, a mode of performing natural language processing on the text and then matching with each preset keyword in the blacklist may be adopted. Of course, other ways of detecting whether the text includes the preset keyword are all applicable to the embodiment of the present invention, and the method may be set according to actual requirements in actual applications, which is not limited in the embodiment of the present invention.
In addition, during matching, in addition to when the text completely contains the preset keywords (as in the above example), a matching degree threshold may also be set, and when the matching degree of the text and the preset keywords exceeds the matching degree threshold, it may be determined that the text contains the keywords. For example, if the preset keyword in the blacklist is "five words," and the keyword included in the text is "five words," the matching degree between the preset keyword and the keyword is very high, and if the matching degree exceeds a threshold value of the matching degree, it can also be determined that the text includes the preset keyword.
Furthermore, the blacklist may include text segments or other contents besides keywords, and may be set according to actual requirements in actual applications, which is not limited in the embodiment of the present invention.
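The blacklist check with an optional matching degree threshold can be sketched as below. The function name `contains_blacklisted`, the sliding-window comparison, and the use of `difflib.SequenceMatcher` as the "matching degree" are assumptions made for illustration; the source does not specify how the matching degree is computed, and the sample strings are illustrative only.

```python
from difflib import SequenceMatcher


def contains_blacklisted(text, blacklist, match_threshold=1.0):
    """Return True if the text contains (or closely matches) a blacklisted keyword.

    match_threshold=1.0 means exact containment only. A lower value also
    accepts any window of the text whose similarity ratio to a keyword
    exceeds the threshold.
    """
    for keyword in blacklist:
        if keyword in text:
            return True
        if match_threshold < 1.0:
            n = len(keyword)
            # Slide a keyword-sized window over the text (assumed strategy).
            for start in range(max(1, len(text) - n + 1)):
                window = text[start:start + n]
                ratio = SequenceMatcher(None, window, keyword).ratio()
                if ratio > match_threshold:
                    return True
    return False
```

For instance, with the blacklist `["五个字"]`, the text `"这是五个字"` matches exactly, while `"这是5个字"` matches only when a lower threshold such as 0.6 is supplied.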
Step S203, when the text does not contain the preset keywords, detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation condition;
when the text is detected not to contain the preset keywords, the text can be determined to be valid, and then the text is continuously detected based on the preset character statistical rule so as to determine whether the text meets the statistical relevant conditions.
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring Chinese characters in a text, and counting the number of the Chinese characters;
when the number of the Chinese characters exceeds the number threshold of the Chinese characters, the text is determined to meet the statistical correlation condition.
Specifically, the method includes the steps of firstly obtaining Chinese characters in a text, counting the number of the Chinese characters, then judging whether the number of the Chinese characters exceeds a preset threshold value of the number of the Chinese characters, and if so, determining that the text is valid; if not, the text may be determined to be invalid. Therefore, invalid comments can be further efficiently screened out based on the number of Chinese characters in the text.
For example, if the preset threshold of the number of Chinese characters is 0, and a certain comment is "Yfdcbhj" or consists only of punctuation marks (for example, "….!?。"), the comment contains no Chinese characters and is therefore determined to be invalid.
Further, besides judging whether the number of the Chinese characters exceeds the number threshold of the Chinese characters, the number of all the characters in the text and the number of the Chinese characters in the text can be obtained, then whether the proportion of the Chinese characters to all the characters exceeds the proportion threshold is calculated, and if yes, the text can be determined to be valid; if not, the text may be determined to be invalid.
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring non-Chinese characters in a text, and counting the number of the non-Chinese characters;
and when the number of the non-Chinese characters is smaller than the threshold value of the number of the non-Chinese characters, determining that the text meets the statistical correlation condition.
Specifically, the non-chinese characters in the text may be obtained first, the number of the non-chinese characters may be counted, and then it is determined whether the number of the non-chinese characters exceeds a preset threshold value of the number of the non-chinese characters, and if so, it may be determined that the text is invalid; if not, the text may be determined to be valid.
For example, the threshold number of non-Chinese characters is 10, and a comment is "\\ N,486970735914696644, see FIG. 1, \\ N,2405296206,191939760947625", or "http:// hck. Hckzf111.Cn/register? intr =1pa88jocx &type =0 &specific = case is good-share ", then the comment may be determined to be invalid.
Further, besides judging whether the number of non-Chinese characters exceeds the non-Chinese character number threshold, the number of all characters in the text and the number of non-Chinese characters in the text can be obtained, and then whether the proportion of non-Chinese characters to all characters exceeds a proportion threshold is calculated. If so, the text can be determined to be invalid; if not, the text can be determined to be valid.
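The Chinese and non-Chinese character counting rules above can be sketched together. This is an illustrative sketch only: the helper names are hypothetical, "Chinese character" is assumed to mean the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and whitespace is assumed not to count as a character.

```python
def is_chinese(ch):
    # Assumption: only the basic CJK Unified Ideographs block is treated as Chinese.
    return "\u4e00" <= ch <= "\u9fff"


def char_counts(text):
    """Return (chinese, non_chinese) counts, ignoring whitespace (an assumption)."""
    chinese = sum(1 for ch in text if is_chinese(ch))
    non_chinese = sum(1 for ch in text if not is_chinese(ch) and not ch.isspace())
    return chinese, non_chinese


def meets_chinese_rule(text, min_chinese=0):
    """The text passes when its Chinese character count exceeds the threshold."""
    chinese, _ = char_counts(text)
    return chinese > min_chinese


def meets_non_chinese_rule(text, max_non_chinese=10):
    """The text passes when its non-Chinese character count is below the threshold."""
    _, non_chinese = char_counts(text)
    return non_chinese < max_non_chinese


def chinese_ratio(text):
    """Proportion of Chinese characters among all counted characters."""
    chinese, non_chinese = char_counts(text)
    total = chinese + non_chinese
    return chinese / total if total else 0.0
```

The ratio variant would compare `chinese_ratio(text)` (or one minus it, for non-Chinese characters) against a proportion threshold in the same way.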
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring characters in a text;
detecting whether continuously repeated characters exist in the characters;
counting the repetition times of each continuously repeated character when the continuously repeated character is detected to exist in the characters;
and when the repetition times of any continuously repeated character do not exceed the repetition time threshold value, determining that the text meets the statistical correlation condition.
Specifically, all characters in the text may be obtained first, and then whether there are consecutive repeated characters or character strings in all the characters is detected, if yes, the number of times of repetition of each consecutive repeated character or character string is counted, and when the number of times of repetition of any consecutive repeated character or character string does not exceed a threshold number of times of repetition, it is determined that the text meets a statistically relevant condition. Therefore, invalid comments can be further efficiently screened out based on the repetition frequency of any continuously repeated character in the text.
For example, if the preset threshold value of the number of times of repetition of the character or the character string is 5, and a certain comment is "how you are black", or "how do you see how you see we" is read at the tail of a red machine, it may be determined that the comment is invalid.
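One way to count consecutive repetitions of a character or character string is a backreference regular expression, sketched below. The function names are hypothetical, and the scan is a simple left-to-right pass that reports the shortest repeated unit found at each position, so it is an approximation rather than an exhaustive search.

```python
import re

# A unit of one or more characters followed by at least one more copy of itself.
_REPEAT = re.compile(r"(.+?)\1+", re.DOTALL)


def max_consecutive_repeats(text):
    """Return the largest number of back-to-back copies of any repeated unit."""
    best = 1 if text else 0
    for match in _REPEAT.finditer(text):
        unit = match.group(1)
        best = max(best, len(match.group(0)) // len(unit))
    return best


def meets_repeat_rule(text, repeat_threshold=5):
    """The text passes when no unit repeats more often than the threshold."""
    return max_consecutive_repeats(text) <= repeat_threshold
```

With a threshold of 5, a comment made of the same two-character phrase six times in a row fails the rule, while ordinary prose passes.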
In a preferred embodiment of the present invention, the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition includes:
acquiring all Chinese characters in a text and the initial letter of each Chinese character;
counting the continuous occurrence times of each initial;
and when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value, determining that the text accords with the statistical correlation condition.
Specifically, all chinese characters in the text may be obtained first, then the first letter of each chinese character is obtained, and then the number of times that each first letter appears continuously is counted, and if the number of times that any one first letter appears continuously does not exceed the continuous occurrence threshold, it may be determined that the text meets the statistical correlation condition.
For example, if the threshold value of the number of consecutive occurrences of the initial "H" is 5, and a certain comment is "haha good still good" or "yaha red fire", it may be determined that the comment is invalid. Therefore, invalid comments can be further efficiently screened out based on the continuous occurrence times of any initial letter in the text.
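The initial-letter rule can be sketched as follows. Mapping a Chinese character to the first letter of its pinyin needs a pronunciation table, which the source does not name; the tiny `DEMO_INITIALS` dictionary here is a hypothetical stand-in for such a table, and the function names are likewise assumptions.

```python
from itertools import groupby

# Hypothetical, illustration-only table of pinyin initials.
DEMO_INITIALS = {
    "哈": "H", "好": "H", "还": "H", "红": "H", "火": "H",
    "很": "H", "实": "S", "用": "Y",
}


def longest_initial_run(text, initial_of):
    """Return the longest run of consecutive Chinese characters sharing an initial.

    initial_of maps a character to its initial letter, or None if unknown;
    unknown characters break a run.
    """
    initials = [initial_of(ch) for ch in text if "\u4e00" <= ch <= "\u9fff"]
    best = 0
    for letter, group in groupby(initials):
        if letter is not None:
            best = max(best, sum(1 for _ in group))
    return best


def meets_initial_rule(text, initial_of, run_threshold=5):
    """The text passes when no initial occurs consecutively more than the threshold."""
    return longest_initial_run(text, initial_of) <= run_threshold
```

In practice a full pinyin lookup would replace `DEMO_INITIALS.get`; the rest of the logic would stay the same.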
Step S204, when the detection accords with the statistic correlation condition, determining whether the text is semantically effective or not based on a preset semantic rule;
after the steps S201 to S203, most of invalid texts may be filtered from the format of the text, the dimensions of keywords, and the like, and then whether the semantics are valid is determined for the remaining texts based on the preset semantic rule, so that the text may be filtered from the dimensions of the real semantics of the text.
In a preferred embodiment of the present invention, determining whether a text is semantically valid based on a preset semantic rule includes:
calculating the confusion degree of the text;
when the confusion degree does not exceed the confusion degree threshold, calculating the similarity between the text and each effective text in the preset effective text set to obtain at least one first similarity, and calculating the similarity between the text and each invalid text in the preset invalid text set to obtain at least one second similarity;
and when at least one first similarity in the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one second similarity in the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
The confusion degree (that is, the perplexity of the text under a language model) represents how fluent the text is semantically. The lower the confusion degree, the more fluent the semantics of the text; the higher the confusion degree, the less fluent the semantics of the text.
Specifically, a threshold value of the confusability may be preset, and then the preset language model is used to calculate the confusability of the text sample, which may specifically be the following formula:
Figure SMS_1
where $P(\omega_1, \dots, \omega_m)$ is the value of the confusion degree of the text, $\omega$ is a character or word, and $\omega_1$ to $\omega_m$ form a text sentence comprising $m$ characters and/or words. For example, in the text sentence "good weather today", $m$ is 3, $\omega_1$ is "today", $\omega_2$ is "weather", and $\omega_3$ is "good". The formula is then modified using a preset domain dictionary, so that when $\omega_{i-(n-1)}, \dots, \omega_{i-1}, \omega_i$ form a phrase specific to the domain, $P(\omega_i \mid \omega_{i-(n-1)}, \dots, \omega_{i-1})$ is set to 1. For example, the formula is modified using a game domain dictionary: when a phrase in a text sentence is detected to be specific to the game domain, $P(\omega_i \mid \omega_{i-(n-1)}, \dots, \omega_{i-1})$ is set to 1, and the confusion degree of the whole text is then calculated. The larger the confusion degree, the less fluent the semantics, and the higher the probability that the semantics of the text are invalid, that is, that the text is an invalid text. At the same time, because the formula is modified with the domain dictionary, the modified formula detects comments in that domain more accurately, so invalid comments in the domain can be screened out more efficiently.
Further, the language model may be a BERT model (Bidirectional Encoder Representations from Transformers), a language model proposed by Google that can be used for natural language processing tasks such as text classification and reading comprehension. Of course, other language models may also be used and may be chosen according to actual requirements in actual applications, which is not limited in the embodiment of the present invention.
For example, two texts are "purple sweet potato pudding purple sweet potato pudding" and "French fdfa coming from reset and then reacting to open a conforming card". The two texts are input into the language model, which segments each text to obtain $\omega_1$ to $\omega_m$. The formula above then gives a confusion degree of 1230 for the first text and 1345 for the second. Since the preset confusion degree threshold is 350 and both values exceed it, the semantics of both texts can be determined to be invalid.
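The confusion degree calculation with a domain dictionary override can be sketched as below. The formula figure is not reproduced in this text, so the sketch assumes the standard n-gram perplexity (the inverse geometric mean of the conditional probabilities), which may differ in form from the figure. The names `perplexity`, `cond_prob`, and `domain_ngrams` are hypothetical, and `cond_prob` stands in for whatever language model supplies the conditional probabilities.

```python
import math


def perplexity(tokens, cond_prob, n=2, domain_ngrams=frozenset()):
    """Perplexity of a token sequence under an n-gram style model.

    cond_prob(word, context) must return P(word | context) > 0, where
    context is a tuple of up to n-1 preceding tokens. Any n-gram listed
    in domain_ngrams (a set of token tuples) is given probability 1.
    """
    m = len(tokens)
    if m == 0:
        raise ValueError("tokens must not be empty")
    log_sum = 0.0
    for i, word in enumerate(tokens):
        context = tuple(tokens[max(0, i - (n - 1)):i])
        if context + (word,) in domain_ngrams:
            p = 1.0  # domain-specific phrase: probability forced to 1
        else:
            p = cond_prob(word, context)
        log_sum += math.log(p)
    return math.exp(-log_sum / m)
```

Forcing a domain phrase's probability to 1 can only lower the result, so texts that use recognised domain vocabulary are less likely to be rejected as not fluent.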
Further, an effective text set, an invalid text set, an effective similarity threshold and an invalid similarity threshold can be preset, wherein the effective text set comprises at least one text marked as effective, and the invalid text set comprises at least one text marked as invalid.
In practical application, the similarity between the text to be detected and each valid text in the valid text set can be respectively calculated to obtain a plurality of first similarities, the similarity between the text to be detected and each invalid text in the invalid text set can be respectively calculated to obtain a plurality of second similarities, and then whether at least one first similarity in the plurality of first similarities exceeds a valid similarity threshold or not and whether at least one second similarity in the plurality of second similarities exceeds an invalid similarity threshold or not are judged. And when at least one first similarity in the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one second similarity in the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
For example, the effective similarity threshold and the ineffective similarity threshold are both 0.8, after similarity calculation is performed on a comment and each effective text and each ineffective text, if the similarity between the comment and a valid text is 0.9, the text semantic validity can be determined; if the similarity of the comment and an invalid text is 0.85, the text can be determined to be semantically invalid.
The similarity calculation can be carried out with the Faiss similarity search tool. Faiss is a high-performance library for similarity search and clustering of high-dimensional vectors, and its open-source license is the BSD license.
In the embodiment of the present invention, Faiss is used to calculate the similarity between any text and each effective text in the effective text set, giving the first similarities, and between the text and each invalid text in the invalid text set, giving the second similarities.
Faiss reduces memory usage and supports large-scale data sets, such as similarity search over billion-scale sets of high-dimensional vectors on a single machine. Therefore, Faiss can quickly find the texts in the effective text set and the invalid text set that are most similar to the text to be detected.
It should be noted that, in the embodiment of the present invention, if the similarity between a certain text and an effective text exceeds an effective similarity threshold, it indicates that the semantics of the certain text and the effective text are very similar, and similarly, if the similarity between a certain text and an invalid text exceeds an invalid similarity threshold, it indicates that the semantics of the certain text and the invalid text are very similar. Therefore, in practical application, the probability of the situation that the semantics of a certain text is very similar to those of a certain valid text and is very similar to those of a certain invalid text is almost zero, and the judgment result of the embodiment of the invention cannot be influenced.
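The threshold decision on the first and second similarities can be sketched as follows. For the sake of a self-contained example, this sketch uses a brute-force cosine similarity over plain Python lists in place of Faiss, and assumes the texts have already been converted to vectors by some embedding step the source does not specify. The function names, the "undetermined" outcome, and checking the effective set first are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_verdict(vec, valid_vecs, invalid_vecs, valid_thr=0.8, invalid_thr=0.8):
    """Compare a text vector against the effective and invalid text sets."""
    first = [cosine(vec, v) for v in valid_vecs]      # first similarities
    second = [cosine(vec, v) for v in invalid_vecs]   # second similarities
    if any(s > valid_thr for s in first):
        return "valid"
    if any(s > invalid_thr for s in second):
        return "invalid"
    # Assumption: the source does not say what happens when neither threshold is exceeded.
    return "undetermined"
```

For large sets, the two list comprehensions would be replaced by a nearest-neighbour query against a vector index, which is the role Faiss plays in the embodiment.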
Further, the generation manner of the valid text set and the invalid text set, and the training manner of the preset language model are specifically as follows.
In a preferred embodiment of the present invention, the valid text set and the invalid text set are generated as follows:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating to obtain the confusion degree of each third text sample;
taking at least one third text sample with the confusability smaller than the confusability threshold value as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
Specifically, for all the unlabeled text samples, the interaction information of each text sample is acquired first, and the text samples whose interaction information is successfully acquired are gathered into a set A.
Then, "detecting whether the text contains preset keywords" is performed on each text sample in the set A; when a text does not contain the preset keywords, the text is detected based on the preset character statistical rule to determine whether it meets the statistical correlation condition, and the text samples meeting the statistical correlation condition are gathered into a set B.
A preset language model is then used to calculate the confusability of each text sample in the set B. The preset language model is provided with initial parameters and parameter values corresponding to the parameters.
For example, two texts are "purple sweet potato pudding purple sweet potato pudding" and "French fdfa coming from reset and then reacting to open a conforming card". The language model gives a confusion degree of 1230 for the first text and 1345 for the second. Since the preset confusion degree threshold is 350 and both values exceed it, both texts can be determined to be invalid.
And in the same way, filtering out the texts with the confusion degree greater than the threshold value of the confusion degree in the set B so as to filter out a large number of invalid comments to obtain a set C, and then carrying out unsupervised secondary classification processing on the set C.
In practical applications, machine learning mainly solves two types of problems, supervised learning and unsupervised learning.
Supervised learning refers to a process in which an external response variable guides a model to learn the task the user cares about, so as to achieve the user's goal. That is, the ultimate goal of supervised learning is for the model to model the response variable required by the user more accurately. For example, a user may want to predict house selling prices in a certain area from a series of feature values, or to predict the box office of a movie. Here "selling price" and "movie box office" are the response variables in supervised learning.
Put another way, a model is learned from a given labeled data set, and when new unlabeled data is input, a prediction result can be obtained from the trained model. Supervised learning is often used to deal with the "classification" problem.
Supervised learning may include three types of models:
1. a linear model;
2. a decision tree model;
3. a neural network model.
These three types of supervised learning models can be subdivided into two categories of problems:
1. a classification problem;
2. a regression problem.
The core of the classification problem is how to identify a class of a data point using a model. This category is generally discrete, such as two or more categories. The core of the regression problem is to use a model to output a predicted value. This value is typically a real number and is continuous.
Unsupervised learning refers to learning in which, in general, there is no explicit response variable. Its core is to discover the latent structure and regularities in the data, which provide a reference for the user's next decision. Typically, unsupervised learning is expected to group, that is, "cluster", the data using data features. The structures it mines from within the data may capture the essential relationships of the data better than the data features provided by the user.
The main purpose of unsupervised learning is to mine the data-inherent connections. The underlying problem here is that different unsupervised learning methods have different assumptions about the structure inside the data. Therefore, unsupervised learning often differs greatly between different models. Of the numerous unsupervised learning models, the clustering model is undoubtedly an important representative, wherein the K-means algorithm (K-means) is the most common and very important algorithm model in the clustering algorithm model.
In the prior art, the classification model is usually trained based on supervised learning, that is, training samples with labels need to be prepared in advance, and then the classification model is trained by using the training samples with labels. For example, ten thousand training samples marked as valid and ten thousand training samples marked as invalid are prepared in advance, and then the classification model is trained by using the ten thousand training samples. Therefore, each training sample needs to be labeled manually, which wastes both labor cost and time cost, especially when the number of training samples is large.
In the embodiment of the present invention, the problem of classification is solved based on unsupervised learning, that is, the set C is subjected to two classifications (the text includes two classifications, i.e., the valid text and the invalid text) by unsupervised learning, so as to determine whether each text in the set C belongs to the valid text or the invalid text.
Specifically, although most of the invalid texts can be filtered out based on the confusion degree, some invalid texts with a low confusion degree may still exist in the set C, that is, texts whose sentences are fluent but carry no information, such as "Is the weather good today?". Therefore, the set C is further classified by an unsupervised binary classification method (such as the K-means algorithm with K = 2), so as to obtain a set C1 containing invalid text samples and a set C2 containing valid text samples.
In this way, the sets A and C2 containing valid text samples and the set C1 containing invalid text samples are obtained, together with the trained language model. Then, a number of texts marked in advance as valid but with a high confusion degree are combined with the sets A and C2 to obtain the final valid text set, and a number of texts marked in advance as invalid but with a low confusion degree are combined with the set C1 to obtain the final invalid text set. In addition, because the original text samples in the present application are unlabeled, unsupervised binary classification is well suited to classifying them, so that invalid comments can be further and efficiently screened out.
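The unsupervised binary split of the set C can be sketched with a small K-means (K = 2) written in plain Python. This is an illustrative sketch under several assumptions: the texts are assumed to be already represented as numeric feature vectors (the source does not say how), the function name `kmeans_two` is hypothetical, and the initial centres are chosen deterministically as the first point and the point farthest from it.

```python
def kmeans_two(points, iterations=20):
    """Split points into two clusters; returns a list of 0/1 labels."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def mean(cluster):
        return [sum(col) / len(cluster) for col in zip(*cluster)]

    # Deterministic initialisation (an assumption, not from the source).
    c0 = list(points[0])
    c1 = list(max(points, key=lambda p: dist2(p, c0)))
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        labels = [0 if dist2(p, c0) <= dist2(p, c1) else 1 for p in points]
        g0 = [p for p, label in zip(points, labels) if label == 0]
        g1 = [p for p, label in zip(points, labels) if label == 1]
        # Update step: move each centre to the mean of its cluster.
        new0 = mean(g0) if g0 else c0
        new1 = mean(g1) if g1 else c1
        if new0 == c0 and new1 == c1:
            break
        c0, c1 = new0, new1
    return labels
```

K-means only separates the data into two groups; deciding which group corresponds to C1 (invalid) and which to C2 (valid) needs an extra step, for example inspecting a few samples from each cluster.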
And step S205, when the text semantics are determined to be invalid, determining that the text is an invalid text, and filtering the text.
After the text semantics are determined to be valid, the text can be determined to be a valid text, and the text can be displayed; otherwise, the text is judged to be an invalid text, and filtering processing such as shielding can be carried out on the text.
Further, in practical applications, Bad Cases may appear, that is, invalid texts that cannot be recognized in the current manner. In this case, the Bad Case can be manually added to the invalid text set, so that the Bad Case and texts similar to it can be recognized next time.
Step S206, storing the invalid text with invalid semantics into an invalid text set;
and step S207, when the text semantics are determined to be valid, determining that the text is a valid text, and storing the valid text into a valid text set.
Specifically, after the invalid text is filtered, the invalid text can be stored in a preset invalid text set, and similarly, when any text semantic is determined to be valid, any text can be determined to be a valid text, and the valid text is stored in the preset valid text set, so that the valid text set and the invalid text set are expanded.
In the embodiment of the invention, a game text and interactive information of the text are firstly acquired, when the interactive information meets a preset condition, whether the text contains a preset keyword is determined, when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition, when the text meets the statistical correlation condition, whether the text is semantically effective is determined based on a preset semantic rule, when the text is semantically ineffective, the text is determined to be an ineffective text, and the text is filtered. Therefore, whether the text is effective or not is detected based on a plurality of dimensions such as interactive information, preset keywords, statistical relevant conditions and semantic validity, compared with a single detection mode, the accuracy of multi-dimensional detection is higher, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
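The overall cascade, including the storage of results in steps S206 and S207, can be sketched as a chain of checks. The function name `process_text` and the representation of each stage as a named callable are assumptions made for illustration; each callable would wrap one of the checks described above (interaction information, keywords, character statistics), with the semantic check last.

```python
def process_text(text, prefilters, semantic_check, valid_set, invalid_set):
    """Run the staged checks and record the outcome.

    prefilters: list of (stage_name, check) pairs applied in order;
    each check returns True when the text passes that stage.
    Returns (is_valid, failed_stage_name_or_None).
    """
    for name, check in prefilters:
        if not check(text):
            return False, name      # filtered before the semantic stage
    if semantic_check(text):
        valid_set.add(text)         # step S207: expand the valid text set
        return True, None
    invalid_set.add(text)           # step S206: expand the invalid text set
    return False, "semantics"
```

Ordering the cheap checks first means the more expensive semantic stage only runs on texts that survive the earlier filters.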
Furthermore, the effective text set and the ineffective text set which are adopted when determining whether the text is semantically effective are obtained by the training of the text sample without marking, and the text sample does not need to be marked manually, so that the labor cost and the time cost are greatly reduced, and the user experience is further improved.
In practical applications, the embodiment of the present invention can be applied to an APP in the game field. In the APP, a user can browse articles about a game, such as game strategies and game competition news. Each article can be provided with a user comment area, in which the user can post comments on the article. Meanwhile, other users can interact with any posted comment, including commenting on it further, liking it, disliking it, or sharing it.
For example, in the user comment area of a certain game strategy, user A posts the comment "very practical, just recently, the card is written here"; user B quotes that comment and posts a separate comment "i also, thanks to the building owner"; user C likes the comment posted by user A; and user D posts the comment "pass by way".
Further, when a user browses the detailed content of any article, the APP may present the detailed content of the article and the corresponding comment of the article, but the presented comment is the comment after being filtered.
For example, when a user browses the detailed content of the game strategy, the detailed content and all comments are acquired, namely "very practical, just recently, the card is written here", "i also, thanks to the building owner", and "pass by way". For the comment of user A, its interaction information is acquired first. Since the comment has one comment (the comment of user B) and one like (from user C), the interaction information meets the preset condition. The comment of user A is then subjected to natural language processing and matched against each preset keyword in the preset blacklist; since there is no match, it can be determined that the comment of user A does not contain a preset keyword.
Next, the comment is checked against the preset character statistical rule to determine whether it meets the statistical correlation condition. User A's comment meets the condition because its number of Chinese characters exceeds the Chinese character number threshold, its number of non-Chinese characters is less than the non-Chinese character number threshold, the repetition count of any continuously repeated character does not exceed the repetition count threshold, and the number of consecutive occurrences of any initial letter does not exceed the consecutive occurrence threshold.
Further, a language model is used to calculate the perplexity (confusion degree) of the comment, which is compared with a preset perplexity threshold. Because the perplexity of the comment does not exceed the threshold, the similarity between the comment and each valid text in the valid text set is calculated, as is the similarity between the comment and each invalid text in the invalid text set. Because the similarity between the comment and at least one valid text exceeds the preset valid similarity threshold, the comment is judged to be semantically valid, and user A's comment is therefore determined to be a valid text.
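The perplexity check described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the conventional definition of perplexity, PP = P(ω_1, …, ω_m)^(-1/m), which may differ from the formula given in this application, and it stands in a hypothetical character-bigram model with add-one smoothing for the unspecified language model. The names `train_bigram`, `perplexity`, and the start marker "^" are not from the original.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character unigrams and bigrams; '^' is an assumed start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        chars = ["^"] + list(sentence)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def perplexity(text, unigrams, bigrams):
    """Conventional perplexity: exp(-(1/m) * sum(log P(w_i | w_{i-1})))."""
    chars = ["^"] + list(text)
    vocab = len(unigrams) + 1  # +1 reserves probability mass for unseen characters
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        # Add-one smoothed conditional probability (an assumption of this sketch).
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    m = len(chars) - 1
    return math.exp(-log_prob / m) if m else float("inf")


def passes_perplexity_check(text, unigrams, bigrams, threshold):
    """The text proceeds to similarity matching only if its perplexity does not exceed the threshold."""
    return perplexity(text, unigrams, bigrams) <= threshold
```

A text that resembles the training corpus receives a lower perplexity than a scrambled one, so a threshold on this value separates fluent text from noise before the more expensive similarity comparison is run.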
The same steps are applied to the comments of user B and user D. Because the objection quantity of user B's comment does not exceed the preset objection threshold, its interactive information also meets the preset condition, and user B's comment is likewise a valid text. In user D's comment, the number of repetitions of "passing by" exceeds the repetition count threshold, so the statistical correlation condition is not met and user D's comment is an invalid text. The other steps are the same as those for user A's comment and are not described again here.
Therefore, in addition to the detailed content of the game strategy, the user can see user A's comment, user B's comment, and user C's like, but cannot see user D's comment; that is, user D's comment is blocked.
Fig. 4 is a schematic structural diagram of a text processing apparatus according to another embodiment of the present application, and as shown in fig. 4, the apparatus according to this embodiment may include:
an obtaining module 401, configured to obtain a text of a game; acquiring interactive information of the text;
a first detecting module 402, configured to determine whether the text includes a preset keyword when the interactive information meets a preset condition;
a second detection module 403, configured to, when the text does not include a preset keyword, detect the text based on a preset character statistical rule to determine whether the text meets a statistical correlation condition;
a third detecting module 404, configured to determine whether the text is semantically valid based on a preset semantic rule when the text meets the statistical correlation condition;
a decision module 405, configured to determine that the text is an invalid text when it is determined that the text semantics are invalid;
and a filtering module 406 for filtering the text.
In a preferred embodiment of the present invention, the obtaining module is specifically configured to:
acquire at least one of the comment quantity, support quantity, objection quantity, and sharing quantity of the text;
the interactive information meeting the preset condition comprises:
when the acquired comment quantity exceeds a preset comment threshold value, and/or the support quantity exceeds a preset support threshold value, and/or the objection quantity does not exceed a preset objection threshold value, and/or the sharing quantity exceeds a preset sharing threshold value, judging that the interactive information meets the preset condition.
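The interaction check above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the class names, field names, and threshold values are hypothetical, and the "and/or" wording is modeled with a `require_all` switch, because the text does not fix how the individual comparisons are combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """Interaction counts for one text; None means the count was not acquired."""
    comments: Optional[int] = None
    supports: Optional[int] = None
    objections: Optional[int] = None
    shares: Optional[int] = None


@dataclass
class Thresholds:
    """Hypothetical preset thresholds (values chosen only for illustration)."""
    comments: int = 0
    supports: int = 0
    objections: int = 5
    shares: int = 0


def meets_preset_condition(info, th, require_all=False):
    """Compare each acquired count with its threshold and combine the results."""
    checks = []
    if info.comments is not None:
        checks.append(info.comments > th.comments)
    if info.supports is not None:
        checks.append(info.supports > th.supports)
    if info.objections is not None:
        # Objections are the only count that must NOT exceed its threshold.
        checks.append(info.objections <= th.objections)
    if info.shares is not None:
        checks.append(info.shares > th.shares)
    if not checks:
        return False  # nothing was acquired, so nothing can be satisfied
    return all(checks) if require_all else any(checks)
```

With the default "or" combination, a comment that has one reply and one like satisfies the condition, which matches the worked example of user A's comment.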
In a preferred embodiment of the present invention, the second detection module includes:
the first statistic submodule is used for acquiring Chinese characters in the text and counting the number of the Chinese characters;
the first determining sub-module is used for determining that the text meets the statistical correlation condition when the number of the Chinese characters exceeds the threshold value of the number of the Chinese characters.
In a preferred embodiment of the present invention, the second detection module includes:
the second statistic submodule is used for acquiring non-Chinese characters in the text and counting the number of the non-Chinese characters;
and the second determining sub-module is used for determining that the text meets the statistical correlation condition when the number of the non-Chinese characters is smaller than the number threshold of the non-Chinese characters.
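The two counting rules above can be sketched as follows. The sketch assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and that whitespace is not counted as a non-Chinese character; both choices, and all function names, are assumptions made for illustration.

```python
def is_chinese(ch):
    """True for code points in the basic CJK Unified Ideographs block (assumed definition)."""
    return "\u4e00" <= ch <= "\u9fff"


def count_chinese(text):
    """Number of Chinese characters in the text."""
    return sum(1 for ch in text if is_chinese(ch))


def count_non_chinese(text):
    """Number of non-Chinese, non-whitespace characters in the text."""
    return sum(1 for ch in text if not is_chinese(ch) and not ch.isspace())


def enough_chinese(text, chinese_threshold):
    """Rule 1: the number of Chinese characters must exceed the threshold."""
    return count_chinese(text) > chinese_threshold


def few_non_chinese(text, non_chinese_threshold):
    """Rule 2: the number of non-Chinese characters must be below the threshold."""
    return count_non_chinese(text) < non_chinese_threshold
```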
In a preferred embodiment of the present invention, the second detection module includes:
the first obtaining submodule is used for obtaining all characters in the text;
the detection submodule is used for detecting whether continuous repeated characters exist in all the characters;
the third counting submodule is used for counting the repetition times of each continuously repeated character when the continuously repeated characters exist in all the characters;
and the third determining sub-module is used for determining that the text meets the statistical correlation condition when the repetition frequency of any continuously repeated character does not exceed the repetition frequency threshold.
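The repeated-character rule above can be sketched with a run-length scan. The sketch assumes that the "repetition count" of a continuously repeated character is the length of its run, and it handles single characters only; repeated multi-character words, as in the "passing by" example, would need a separate pattern search. The function names are hypothetical.

```python
from itertools import groupby


def max_run_length(text):
    """Length of the longest run of one character repeated back to back."""
    return max((sum(1 for _ in group) for _, group in groupby(text)), default=0)


def passes_repeat_rule(text, repeat_threshold):
    """The rule holds when no run is longer than the threshold."""
    return max_run_length(text) <= repeat_threshold
```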
In a preferred embodiment of the present invention, the second detection module includes:
the second acquisition sub-module is used for acquiring all Chinese characters in the text and the first letter of each Chinese character;
the fourth statistic submodule is used for counting the continuous occurrence times of each first letter;
and the fourth determining submodule is used for determining that the text accords with the statistical correlation condition when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value.
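The initial-letter rule above can be sketched in the same way. How the initial letter of a Chinese character is obtained is not specified here, so the sketch takes a caller-supplied lookup, `initial_of`, which is a hypothetical callable returning the initial letter of a character or None for characters without one (a pinyin library could fill this role in practice).

```python
from itertools import groupby


def max_initial_run(text, initial_of):
    """Longest run of consecutive characters that share the same initial letter."""
    runs = (
        sum(1 for _ in group)
        for key, group in groupby(initial_of(ch) for ch in text)
        if key is not None  # characters without an initial break a run
    )
    return max(runs, default=0)


def passes_initial_rule(text, initial_of, occurrence_threshold):
    """The rule holds when no initial letter occurs more times in a row than the threshold."""
    return max_initial_run(text, initial_of) <= occurrence_threshold
```

For example, with a small illustrative table such as `{"哈": "h", "呵": "h", "好": "h", "的": "d"}` passed as `table.get`, the text "哈呵好的" has a longest initial run of three.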
In a preferred embodiment of the present invention, the third detecting module includes:
the similarity calculation submodule is used for calculating the similarity between the text and each valid text in the preset valid text set to obtain at least one first similarity, and calculating the similarity between the text and each invalid text in the preset invalid text set to obtain at least one second similarity;
and the judging submodule is used for judging that the text semantics are valid when at least one first similarity in the first similarities exceeds a valid similarity threshold, or judging that the text semantics are invalid when at least one second similarity in the second similarities exceeds an invalid similarity threshold.
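The similarity judgment above can be sketched as follows. The similarity measure is not specified in this passage, so the sketch substitutes Jaccard similarity over character bigrams purely as an assumption. The order of the two checks and the "undetermined" result for a text that matches neither set are also choices of this sketch, not statements of the original.

```python
def char_bigrams(text):
    """Set of adjacent character pairs; a one-character text is its own feature."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def jaccard(a, b):
    """Jaccard similarity of the bigram sets of two texts (assumed measure)."""
    sa, sb = char_bigrams(a), char_bigrams(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def judge_semantics(text, valid_set, invalid_set, valid_threshold=0.6, invalid_threshold=0.6):
    """Return 'valid', 'invalid', or 'undetermined' for the given text."""
    # First similarities: compare against every text in the valid set.
    if any(jaccard(text, v) > valid_threshold for v in valid_set):
        return "valid"
    # Second similarities: compare against every text in the invalid set.
    if any(jaccard(text, s) > invalid_threshold for s in invalid_set):
        return "invalid"
    return "undetermined"
```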
In a preferred embodiment of the present invention, the apparatus further comprises:
the storage module is used for storing the semantically invalid text into the invalid text set; and, when the text semantics are determined to be valid, determining the text to be a valid text and storing the valid text into the valid text set.
In a preferred embodiment of the present invention, the valid text set and the invalid text set are generated as follows:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating the perplexity (confusion degree) of each third text sample;
taking at least one third text sample whose perplexity is smaller than the perplexity threshold as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
The text processing apparatus of this embodiment can execute the text processing methods shown in the first embodiment and the second embodiment of this application, and the implementation principles thereof are similar, and are not described herein again.
In the embodiment of the invention, a game text and interactive information of the text are firstly acquired, when the interactive information meets a preset condition, whether the text contains a preset keyword is determined, when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition, when the text meets the statistical correlation condition, whether the text is semantically effective is determined based on a preset semantic rule, when the text is semantically ineffective, the text is determined to be an ineffective text, and the text is filtered. Therefore, whether the text is effective or not is detected based on a plurality of dimensions such as interactive information, preset keywords, statistical relevant conditions and semantic validity, compared with a single detection mode, the accuracy of multi-dimensional detection is higher, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
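The multi-stage flow summarized above can be sketched as a cascade in which each stage is consulted only if the previous one passed. This is an illustrative skeleton: the `Comment` fields and the idea of passing stages as callables are assumptions, and the sketch treats a failure at any stage as making the text invalid, which simplifies the behavior described in the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Comment:
    """Hypothetical container for a text and part of its interactive information."""
    text: str
    likes: int = 0
    replies: int = 0


# A stage inspects one comment and returns True if the comment passes it.
Stage = Callable[[Comment], bool]


def is_valid(comment: Comment, stages: List[Stage]) -> bool:
    """Run the stages in order; all() stops at the first stage that fails."""
    return all(stage(comment) for stage in stages)


def filter_comments(comments: List[Comment], stages: List[Stage]) -> List[Comment]:
    """Keep only the comments that pass every stage."""
    return [c for c in comments if is_valid(c, stages)]
```

In use, the stage list would hold, in order, the interaction check, the blacklist keyword check, the character statistics check, and the semantic check, so that cheap checks run before expensive ones.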
Furthermore, the valid text set and the invalid text set used to determine whether a text is semantically valid are obtained by training on unlabeled text samples, so the samples do not need to be labeled manually. This greatly reduces labor and time costs and further improves the user experience.
In another embodiment of the present application, an electronic device is provided, comprising a memory and a processor, with at least one program stored in the memory for execution by the processor. When executed by the processor, the program implements the following: a game text and interactive information of the text are first acquired; when the interactive information meets a preset condition, whether the text contains a preset keyword is determined; when the text does not contain the preset keyword, the text is detected based on a preset character statistical rule to determine whether the text meets a statistical correlation condition; when the text meets the statistical correlation condition, whether the text is semantically valid is determined based on a preset semantic rule; and when the text is semantically invalid, the text is determined to be an invalid text and is filtered. In this way, whether the text is valid is detected across multiple dimensions, namely interactive information, preset keywords, statistical correlation conditions, and semantic validity. Compared with a single detection mode, multi-dimensional detection is more accurate, users obtain valid comment content from the comment area more efficiently, and the user experience is better.
Furthermore, the valid text set and the invalid text set used to determine whether a text is semantically valid are obtained by training on unlabeled text samples, so the samples do not need to be labeled manually. This greatly reduces labor and time costs and further improves the user experience.
In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 5000 includes a processor 5001 and a memory 5003. The processor 5001 and the memory 5003 are coupled, for example via a bus 5002. Optionally, the electronic device 5000 can also include a transceiver 5004. It should be noted that, in practical applications, the number of transceivers 5004 is not limited to one, and the structure of the electronic device 5000 does not constitute a limitation on the embodiments of the present application.
The processor 5001 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 5001 may also be a combination that implements computing functionality, e.g., a combination comprising one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 5002 may include a path that communicates between the above components. The bus 5002 may be a PCI bus or EISA bus, etc. The bus 5002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
Memory 5003 may be, but is not limited to, ROM or another type of static storage device that can store static information and instructions, RAM or another type of dynamic storage device that can store information and instructions, EEPROM, CD-ROM or other optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 5003 is used for storing application program codes for executing the present solution, and the execution is controlled by the processor 5001. The processor 5001 is configured to execute application program code stored in the memory 5003 to implement the teachings of any of the foregoing method embodiments.
Wherein, the electronic device includes but is not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like.
Yet another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which, when run on a computer, enables the computer to perform the corresponding content in the aforementioned method embodiments. Compared with the prior art, in the embodiment of the invention, the text of the game and the interactive information of the text are firstly obtained, when the interactive information meets the preset condition, whether the text contains the preset keywords is determined, when the text does not contain the preset keywords, the text is detected based on the preset character statistical rule to determine whether the text meets the statistical related condition, when the text meets the statistical related condition, whether the text is semantically effective is determined based on the preset semantic rule, and when the text is semantically ineffective, the text is determined to be an ineffective text and the text is filtered. Therefore, whether the text is effective or not is detected based on a plurality of dimensions such as interactive information, preset keywords, statistical relevant conditions and semantic validity, compared with a single detection mode, the accuracy of multi-dimensional detection is higher, the efficiency of obtaining effective comment contents from a comment area by a user is higher, and the user experience is better.
Furthermore, the valid text set and the invalid text set used to determine whether a text is semantically valid are obtained by training on unlabeled text samples, so the samples do not need to be labeled manually. This greatly reduces labor and time costs and further improves the user experience.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these should also be regarded as falling within the protection scope of the present invention.

Claims (11)

1. A method of text processing, comprising:
acquiring a target text of a game;
acquiring interactive information of the text; when the interactive information meets a preset condition, determining whether the text comprises a preset keyword or not;
when the text does not contain preset keywords, detecting the text based on a preset character statistical rule to determine whether the text meets a statistical correlation condition;
when the detection meets the statistic correlation condition, determining whether the text is semantically valid based on a preset semantic rule;
when the text semantics are determined to be invalid, determining the text to be an invalid text, and filtering the text;
determining whether the text is semantically valid or not based on a preset semantic rule, wherein the determining comprises the following steps of:
calculating the confusability of the text based on the following formula;
Figure QLYQS_1
wherein P(ω_1, …, ω_m) is the numerical value of the confusability of the text, ω is a character or word, and ω_1 to ω_m compose the text;
when the confusion does not exceed a confusion threshold, calculating the similarity between the text and each effective text in a preset effective text set to obtain at least one first similarity, and calculating the similarity between the text and each ineffective text in a preset ineffective text set to obtain at least one second similarity;
and when at least one first similarity in the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one second similarity in the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
2. The text processing method according to claim 1, wherein the interactive information includes at least one of a comment information amount, a support amount, an objection amount, and a share amount of the text;
the interaction information accords with preset conditions and comprises the following steps:
and when the comment quantity exceeds a preset comment threshold value, and/or the support quantity exceeds a preset support threshold value, and/or the objection quantity does not exceed a preset objection threshold value, and/or the sharing quantity exceeds a preset sharing threshold value, judging that the interactive information meets a preset condition.
3. The method according to claim 1, wherein the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition comprises:
acquiring Chinese characters in the text, and counting the number of the Chinese characters;
and when the number of the Chinese characters exceeds the number threshold of the Chinese characters, determining that the text meets the statistical correlation condition.
4. The method of claim 1, wherein the step of detecting the information based on the preset character statistical rule to determine whether the text meets the statistically relevant condition comprises:
acquiring non-Chinese characters in the text, and counting the number of the non-Chinese characters;
when the number of non-Chinese characters is less than a threshold number of non-Chinese characters, determining that the text meets a statistically relevant condition.
5. The method according to claim 1, wherein the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistically relevant condition comprises:
acquiring all characters in the text;
detecting whether continuously repeated characters exist in all characters;
when the continuous repeated characters exist in all the characters, counting the repeated times of each continuous repeated character;
and when the repetition times of any continuously repeated character do not exceed the repetition time threshold value, determining that the text meets the statistical correlation condition.
6. The method of claim 1, wherein the step of detecting the text based on the preset character statistical rule to determine whether the text meets the statistical correlation condition comprises:
acquiring all Chinese characters in the text and the initial letter of each Chinese character;
counting the continuous occurrence times of each initial;
and when the continuous occurrence frequency of any initial does not exceed the continuous occurrence frequency threshold value, determining that the text meets the statistical correlation condition.
7. The text processing method according to claim 1, further comprising:
storing the semantically invalid text to the invalid text set; and
when the text semantics are determined to be valid, determining the text to be a valid text, and storing the valid text to the valid text set.
8. The text processing method of claim 1, wherein the set of valid text and the set of invalid text are generated by:
acquiring interaction information of each text sample in a preset text sample set;
taking at least one first text sample corresponding to the successfully acquired interaction information as a first sample set;
obtaining at least one second text sample which does not contain preset keywords from each first text sample;
verifying each second text sample based on a preset character statistical rule to obtain at least one third text sample passing verification;
calculating to obtain the confusion degree of each third text sample;
taking at least one third text sample with the confusability smaller than the confusability threshold value as a second sample set;
classifying the second sample set to obtain a third sample set containing effective text samples and a fourth sample set containing ineffective text samples;
and obtaining a final effective text set based on the first sample set and the third sample set, and obtaining a final invalid text set based on the fourth sample set.
9. A text processing apparatus, comprising:
the acquisition module is used for acquiring a text of a game; acquiring interactive information of the text;
the first detection module is used for determining whether the text contains preset keywords or not when the interactive information meets preset conditions;
the second detection module is used for detecting the text based on a preset character statistical rule to determine whether the text meets a statistical relevant condition or not when the text does not contain a preset keyword;
the third detection module is used for determining whether the text is semantically valid or not based on a preset semantic rule when the detection accords with the statistic correlation condition;
the judging module is used for determining the text as an invalid text when the text semantic is determined to be invalid;
the filtering module is used for filtering the text;
the third detection module, when determining whether the text is semantically valid based on a preset semantic rule, is specifically configured to:
calculating the confusability of the text based on the following formula;
Figure QLYQS_2
wherein P(ω_1, …, ω_m) is the numerical value of the confusability of the text, ω is a character or word, and ω_1 to ω_m compose the text;
when the confusion does not exceed a confusion threshold, calculating the similarity between the text and each effective text in a preset effective text set to obtain at least one first similarity, and calculating the similarity between the text and each ineffective text in a preset ineffective text set to obtain at least one second similarity;
and when at least one first similarity in the first similarities exceeds an effective similarity threshold, judging that the text semantics are effective, or when at least one second similarity in the second similarities exceeds an ineffective similarity threshold, judging that the text semantics are ineffective.
10. An electronic device, comprising:
a processor, a memory, and a bus;
the bus is used for connecting the processor and the memory;
the memory is used for storing operation instructions;
the processor is used for executing the text processing method of any one of the claims 1-8 by calling the operation instruction.
11. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the method of text processing according to any one of claims 1 to 8.
CN202010073135.6A 2020-01-22 2020-01-22 Text processing method and device, electronic equipment and computer readable storage medium Active CN111291551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010073135.6A CN111291551B (en) 2020-01-22 2020-01-22 Text processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010073135.6A CN111291551B (en) 2020-01-22 2020-01-22 Text processing method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111291551A CN111291551A (en) 2020-06-16
CN111291551B true CN111291551B (en) 2023-04-18

Family

ID=71026668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010073135.6A Active CN111291551B (en) 2020-01-22 2020-01-22 Text processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111291551B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547435A (en) * 2020-11-24 2022-05-27 腾讯科技(深圳)有限公司 Content quality identification method, device, equipment and readable storage medium
CN112529629A (en) * 2020-12-16 2021-03-19 北京居理科技有限公司 Malicious user comment brushing behavior identification method and system
CN113420234B (en) * 2021-07-02 2022-08-02 青海师范大学 Microblog data acquisition method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193801A (en) * 2017-05-21 2017-09-22 北京工业大学 A kind of short text characteristic optimization and sentiment analysis method based on depth belief network
CN107436875A (en) * 2016-05-25 2017-12-05 华为技术有限公司 File classification method and device
CN108446316A (en) * 2018-02-07 2018-08-24 北京三快在线科技有限公司 Recommendation method, apparatus, electronic equipment and the storage medium of associational word
CN109388743A (en) * 2017-08-11 2019-02-26 阿里巴巴集团控股有限公司 The determination method and apparatus of language model
CN109783657A (en) * 2019-01-07 2019-05-21 北京大学深圳研究生院 Multistep based on limited text space is from attention cross-media retrieval method and system
CN110717328A (en) * 2019-07-04 2020-01-21 北京达佳互联信息技术有限公司 Text recognition method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8429141B2 (en) * 2011-03-01 2013-04-23 Xerox Corporation Linguistically enhanced email detector
WO2016167424A1 (en) * 2015-04-16 2016-10-20 주식회사 플런티코리아 Answer recommendation device, and automatic sentence completion system and method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Daniel Maier 等.Applying LDA Topic Modeling in Communication Research: Toward a Valid and Reliable Methodology.《Communication Methods and Measures》.2018,第12卷(第2期),93-118. *
Gopinath Shyam 等.Investigating the relationship between the content of online word of mouth, advertising, and brand performance.《Marketing Science》.2014,第33卷(第2期),241-258. *
黄晟.基于用户体验的APP设计研究.《中国优秀硕士学位论文全文数据库信息科技辑》.2013,(第01期),I136-394. *
齐慧杰 等.探析客户端跟帖评论的管理策略.《网络传播》.2019,(第9期),88-89. *

Also Published As

Publication number Publication date
CN111291551A (en) 2020-06-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40024213; Country of ref document: HK)
GR01 Patent grant