CN113127746A - Information pushing method based on user chat content analysis and related equipment thereof - Google Patents

Information pushing method based on user chat content analysis and related equipment thereof

Info

Publication number
CN113127746A
Authority
CN
China
Prior art keywords
data
text
chat
chat data
sensitive word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110522391.3A
Other languages
Chinese (zh)
Other versions
CN113127746B (en)
Inventor
陈家荣
蓝志毅
丰阳露
解效玄
易页
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xindong Network Co ltd
Original Assignee
Xindong Network Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xindong Network Co ltd filed Critical Xindong Network Co ltd
Priority to CN202110522391.3A priority Critical patent/CN113127746B/en
Publication of CN113127746A publication Critical patent/CN113127746A/en
Application granted granted Critical
Publication of CN113127746B publication Critical patent/CN113127746B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses an information pushing method based on user chat content analysis and related equipment thereof, and relates to semantic analysis technology. After content analysis is performed on chat data uploaded by users in a preset scene, a chat data clustering result is quickly obtained; qualified clusters and their corresponding text topics are acquired from the clustering result to form a target chat data cluster set; and finally, promotion text data corresponding to the text topic of each target chat data cluster is acquired for information pushing.

Description

Information pushing method based on user chat content analysis and related equipment thereof
Technical Field
The invention relates to the technical field of semantic analysis, in particular to an information pushing method based on user chat content analysis and related equipment thereof.
Background
Online games, whether run on mobile phones or on computers, have become an increasingly popular form of entertainment. New in-game activities are launched frequently, and they are generally designed by planners who draw on their own experience after conducting some user market research. That process usually involves designing user questionnaires, collecting them, and statistically analyzing the responses, so obtaining the activity information that users expect is extremely time-consuming, labor costs are high, and data processing efficiency is low.
Disclosure of Invention
The embodiments of the invention provide an information pushing method based on user chat content analysis and related equipment thereof, aiming to solve the problem in the prior art that obtaining the activity information expected by a specific user group generally requires user questionnaire design, questionnaire collection, statistical analysis of the questionnaires and similar steps, which makes data acquisition time-consuming and inefficient and labor costs high.
In a first aspect, an embodiment of the present invention provides an information pushing method based on user chat content analysis, including:
if the chat data uploaded by the user side in the preset scene is detected, acquiring the data type of the chat data;
judging whether the data type is a voice type or a text type;
if the data type is a text type, performing sensitive word detection and sensitive word conversion processing on the chatting data to obtain chatting data subjected to first desensitization processing as current chatting data;
if the data type is a voice type, performing voice text extraction, sensitive word detection and sensitive word conversion processing on the chatting data to obtain second desensitized chatting data serving as current chatting data, and performing text-to-voice conversion on the second desensitized chatting data according to corresponding user voice characteristics to obtain processed chatting data;
binding the current chatting data with the user ID of the corresponding user side and then storing the current chatting data in a local first storage area;
acquiring current system time, and judging whether the time interval between the current system time and the last chatting data analysis time is equal to a preset chatting data analysis time period or not;
if the time interval between the current system time and the last chat data analysis time is equal to the chat data analysis time period, acquiring a currently stored chat data set in the first storage area, and performing text clustering on the chat data set to obtain a corresponding chat data clustering result; the chat data clustering result comprises a plurality of chat data clustering clusters;
obtaining text topics corresponding to all chat data cluster clusters in the chat data cluster results;
if the text similarity between the text subject corresponding to the chat data cluster and the target subject in the preset target subject list exceeds a preset similarity threshold value, acquiring the chat data cluster corresponding to the corresponding text subject as a target chat data cluster to form a target chat data cluster set; and
acquiring promotion text data respectively corresponding to the text topics of the target chat data clusters to form a text data set to be pushed.
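The target-cluster selection step above can be illustrated with a minimal sketch. The patent does not say how text similarity is computed, so `difflib.SequenceMatcher` from the Python standard library is used here purely as a stand-in; the function names, the dictionary layout and the example threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def text_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the patent does not specify the measure."""
    return SequenceMatcher(None, a, b).ratio()


def select_target_clusters(
    cluster_topics: Dict[int, str],
    target_topics: List[str],
    threshold: float,
) -> Dict[int, str]:
    """Keep clusters whose topic is similar enough to any preset target topic."""
    selected = {}
    for cluster_id, topic in cluster_topics.items():
        # "exceeds a preset similarity threshold" is read as a strict comparison.
        if any(text_similarity(topic, t) > threshold for t in target_topics):
            selected[cluster_id] = topic
    return selected


# Hypothetical cluster topics and target topic list.
targets = select_target_clusters(
    {0: "holiday event", 1: "server lag"},
    ["holiday events"],
    threshold=0.8,
)
```

The selected topics would then be used to look up the corresponding promotion text data; how that lookup is keyed is not described in this section.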
In a second aspect, an embodiment of the present invention provides an information pushing apparatus based on user chat content analysis, including:
the chat data type acquisition unit is used for acquiring the data type of the chat data if the chat data uploaded by the user terminal entering a preset scene is detected;
the type judging unit is used for judging whether the data type is a voice type or a text type;
the first desensitization processing unit is used for carrying out sensitive word detection and sensitive word conversion processing on the chatting data if the data type is a text type to obtain the chatting data subjected to the first desensitization processing as the current chatting data;
the second desensitization processing unit is used for extracting a voice text, detecting a sensitive word and converting the sensitive word from the chat data if the data type is a voice type to obtain second desensitized chat data serving as current chat data, and converting the text into voice from the second desensitized chat data according to the voice characteristics of the corresponding user to obtain processed chat data;
the data storage unit is used for binding the current chatting data with the user ID of the corresponding user side and then storing the current chatting data in a local first storage area;
the time judging unit is used for acquiring the current system time and judging whether the time interval between the current system time and the last chatting data analysis time is equal to a preset chatting data analysis time period or not;
the text clustering unit is used for acquiring a currently stored chat data set in the first storage area and performing text clustering on the chat data set to obtain a corresponding chat data clustering result if the time interval between the current system time and the last chat data analysis time is equal to the chat data analysis time period; the chat data clustering result comprises a plurality of chat data clustering clusters;
the text theme extraction unit is used for acquiring text themes corresponding to the chat data cluster clusters in the chat data cluster results;
a target cluster acquiring unit, configured to acquire a chat data cluster corresponding to a corresponding text topic as a target chat data cluster if a text similarity between a text topic corresponding to the chat data cluster and a target topic in a preset target topic list exceeds a preset similarity threshold, so as to form a target chat data cluster set; and
the text to be pushed acquiring unit is used for acquiring promotion text data respectively corresponding to the text topics of the target chat data clusters to form a text data set to be pushed.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the information push method based on the analysis of the chat content of the user according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the information push method based on the analysis of the chat content of the user according to the first aspect.
The embodiment of the invention provides an information pushing method based on user chat content analysis and related equipment thereof, which are characterized in that a chat data clustering result is quickly obtained after content analysis is carried out on chat data uploaded by a user entering a preset scene, a qualified clustering cluster and a corresponding text theme are obtained based on the chat data clustering result to form a target chat data clustering cluster set, and finally corresponding popularization text data are respectively obtained according to the text theme of each target chat data clustering set for information pushing.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic application scenario diagram of an information push method based on user chat content analysis according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of an information push method based on user chat content analysis according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of an information pushing apparatus based on user chat content analysis according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In order to more clearly understand the technical solution of the present application, the following detailed description is made on the execution subject involved. The following describes the technical solution with the server as the execution subject.
The user side is a smart terminal such as a smartphone, a tablet computer or a notebook computer. The user can operate the user side to install a designated application program or client (such as an XXX game client or a YYY chat room client). After starting the application program or client the user can enter a preset scene, and after the user inputs chat data (such as text-type or voice-type chat data) in a chat window in that scene, the chat data is sent to the server for storage and user chat content analysis.
The server can receive chat data uploaded by a plurality of user sides, desensitize the chat data, bind it to user IDs and store it locally, perform cluster analysis on the chat texts, extract the text topics of hot clusters, and obtain promotion text data. In this way, texts suitable for promotion can be derived from the users' chat content and pushed to the user sides for viewing.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of an information push method based on user chat content analysis according to an embodiment of the present invention; fig. 2 is a schematic flowchart of an information push method based on user chat content analysis according to an embodiment of the present invention, where the method is executed by application software installed in a server.
As shown in fig. 2, the method includes steps S101 to S110.
S101, if the chatting data uploaded by the user side entering the preset scene is detected, the data type of the chatting data is obtained.
In this embodiment, when the user operates the user side to start and log in to a designated application program or client installed on the user side, the user can enter a preset scene. After the user inputs chat data (such as text-type or voice-type chat data) in a chat window in the preset scene, the chat data is sent to the server for storage and user chat content analysis. After the server receives the chat data sent by the user side, it can quickly determine the data type of the chat data so as to carry out the next data processing step.
S102, judging whether the data type is a voice type or a text type.
In this embodiment, the chat data sent by the user side generally has one of two data types: voice or text. That is, the user can send voice information or text information to the server, and the server can quickly determine the data type from the file suffix of the chat data; typically the file suffix of text-type data is txt and the file suffix of voice-type data is mp3. This judgment allows the subsequent data desensitization strategy to be determined quickly.
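The suffix-based type judgment can be sketched as follows. Only the txt and mp3 suffixes come from the description; the function name, the table and the "unknown" fallback are assumptions made for illustration.

```python
from pathlib import PurePath

# Only "txt" and "mp3" are named in the description; any other suffix is
# treated as unknown in this sketch (an assumption, not part of the patent).
SUFFIX_TO_TYPE = {".txt": "text", ".mp3": "voice"}


def detect_data_type(filename: str) -> str:
    """Return "text", "voice" or "unknown" based on the file suffix."""
    suffix = PurePath(filename).suffix.lower()
    return SUFFIX_TO_TYPE.get(suffix, "unknown")
```

A server would then route "text" data to step S103 and "voice" data to step S104.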
S103, if the data type is a text type, sensitive word detection and sensitive word conversion processing are carried out on the chatting data, and the chatting data after the first desensitization processing is obtained and used as current chatting data.
In this embodiment, if the data type is a text type, the user has edited text-type chat data and sent it to the server, and the server desensitizes the data and then forwards it to the other user sides for shared viewing. When desensitizing text-type chat data, the server can call a sensitive word lexicon preset in the server and perform sensitive word detection based on it. With this text desensitization approach, sensitive text can be quickly converted and then sent to the user sides for display.
The sensitive word lexicon preset in the server includes a politics sub-lexicon, a contraband sub-lexicon, an abuse sub-lexicon, an illegal-website sub-lexicon, a harassing-advertisement sub-lexicon, an incident sub-lexicon, a pornography sub-lexicon, a gambling sub-lexicon, and the like. With the sensitive word lexicon preset in the server, sensitive word detection can be performed on chat text quickly.
In one embodiment, step S103 includes:
segmenting the chatting data to obtain a first segmentation result; wherein the first word segmentation result comprises a plurality of word segments;
sensitive word detection is carried out on each word segmentation in the first word segmentation result so as to judge whether a sensitive word exists in the first word segmentation result;
if the first word segmentation result contains sensitive words, acquiring corresponding sensitive words to form a first target sensitive word set;
calling a pre-trained sensitive word classification model, inputting each sensitive word in the first target sensitive word set into the sensitive word classification model, and acquiring the sensitive word grade corresponding to each sensitive word in the first target sensitive word set;
if no sensitive word in the first target sensitive word set has a sensitive word grade higher than a preset sensitive word grade threshold, replacing each target sensitive word in the first target sensitive word set with its pinyin initial letters to obtain a first sensitive word conversion result corresponding to each target sensitive word;
and replacing each word in the chatting data, which is the same as the first target sensitive word set, by a corresponding first sensitive word conversion result to obtain chatting data after first desensitization processing as current chatting data.
In this embodiment, a first word segmentation result can be obtained by segmenting the text-type chat data with a word segmentation model based on probability statistics; the first word segmentation result comprises a plurality of word segments. Each word segment in the first word segmentation result is then compared against the sensitive word lexicon to judge whether it is a sensitive word. If sensitive words exist in the first word segmentation result, they are collected to form a first target sensitive word set (for example, the first word segmentation result contains 1 sensitive word belonging to the harassing-advertisement sub-lexicon). If the first word segmentation result contains no sensitive word, the chat data is used as the current chat data.
In order to avoid the adverse effects of displaying a sensitive word directly as text, each target sensitive word in the first target sensitive word set can be obtained, and the sensitive word grade of each one can be determined by a pre-trained sensitive word classification model. In a specific implementation, a convolutional neural network can be adopted as the sensitive word classification model: the word vector of each sensitive word in the first target sensitive word set is input to the model, which outputs the corresponding sensitive word grade. For example, if the grade of a sensitive word in the first target sensitive word set is 1, that word has a first-level sensitivity grade.
In order to apply different processing strategies to sensitive words of different grades, it can first be judged whether any sensitive word in the first target sensitive word set has a sensitive word grade higher than a preset sensitive word grade threshold (for example, the sensitive word grade threshold is set to 3). If no sensitive word in the first target sensitive word set has a grade higher than the threshold, none of the grades is particularly high; in that case the characters of each target sensitive word in the first target sensitive word set can be replaced with the corresponding pinyin initial letters to obtain a first sensitive word conversion result for each target sensitive word.
If a sensitive word in the first target sensitive word set has a grade higher than the sensitive word grade threshold, its grade is particularly high; in that case the sensitive word needs to be masked by replacing it with the same number of desensitization characters as the word has characters. For example, if a sensitive word "AB" consisting of 2 characters is in the first target sensitive word set, "AB" can be replaced with "!!" or "**" to perform desensitization. In this way, sensitive words with a higher sensitivity grade are still masked as in the prior art, so that they are filtered effectively.
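The two desensitization strategies can be sketched as below, assuming the chat text has already been segmented. Everything in the sketch is hypothetical scaffolding: `grade_of` stands in for the pre-trained classification model, `initials` is a small lookup table standing in for a real pinyin conversion, and the decision is made per word (low-grade words get initials, high-grade words get masked), which is one possible reading of the set-level condition in the text.

```python
from typing import Callable, Dict, Iterable, List, Set


def desensitize_tokens(
    tokens: Iterable[str],
    lexicon: Set[str],
    grade_of: Callable[[str], int],
    initials: Dict[str, str],
    threshold: int = 3,
    mask_char: str = "*",
) -> List[str]:
    """Replace sensitive tokens; non-sensitive tokens pass through unchanged."""
    result = []
    for token in tokens:
        if token not in lexicon:
            result.append(token)
        elif grade_of(token) > threshold:
            # High grade: mask with as many characters as the word has.
            result.append(mask_char * len(token))
        else:
            # Low grade: pinyin initials; fall back to masking if no entry
            # exists in the hypothetical table (an assumption of this sketch).
            result.append(initials.get(token, mask_char * len(token)))
    return result


# Hypothetical lexicon, grades and pinyin-initial table.
lexicon = {"笨蛋", "赌博"}
grades = {"笨蛋": 1, "赌博": 5}
initials = {"笨蛋": "bd", "赌博": "db"}

low = desensitize_tokens(["你", "是", "笨蛋"], lexicon, grades.__getitem__, initials)
high = desensitize_tokens(["去", "赌博"], lexicon, grades.__getitem__, initials)
```

Joining the returned tokens gives the first desensitized chat data that is stored as the current chat data.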
S104, if the data type is the voice type, performing voice text extraction, sensitive word detection and sensitive word conversion on the chatting data to obtain second desensitized chatting data serving as current chatting data, and performing text-to-voice conversion on the second desensitized chatting data according to the corresponding user voice characteristics to obtain processed chatting data.
In this embodiment, if the data type is a voice type, the user has recorded voice-type chat data and sent it to the server. The server then needs to perform voice text extraction (for example, voice recognition and text extraction based on an N-gram model stored in the server, where the N-gram model is a multi-gram language model) to obtain extracted text data, and the server desensitizes the extracted text data before the result is forwarded to the other user sides. When the server desensitizes the text converted from voice-type chat data, it can call the sensitive word lexicon preset in the server and perform sensitive word detection based on it. With this voice desensitization approach, sensitive content can be quickly converted and then sent to the user sides.
In one embodiment, step S104 includes:
performing text recognition on the chatting data through a voice recognition model to obtain a text recognition result;
performing word segmentation on the text recognition result to obtain a second word segmentation result; wherein the second word segmentation result comprises a plurality of word segmentations;
sensitive word detection is carried out on each word segmentation in the second word segmentation result so as to judge whether a sensitive word exists in the second word segmentation result;
if the second word segmentation result contains the sensitive words, acquiring the corresponding sensitive words to form a second target sensitive word set;
replacing each target sensitive word in the second target sensitive word set by the first letter of pinyin to obtain a second sensitive word conversion result corresponding to each target sensitive word;
replacing each word in the text recognition result which is the same as the word in the second target sensitive word set by a corresponding second sensitive word conversion result to obtain second desensitized chat data serving as current chat data;
acquiring user identity information corresponding to the user side and user sound characteristics corresponding to the user identity information;
and performing voice synthesis on the chat data after the second desensitization processing through the user voice characteristics to obtain the processed chat data.
In this embodiment, text recognition is performed on the chat data by a voice recognition model based on an N-gram model to obtain a text recognition result, and the text recognition result is then segmented by a word segmentation model based on probability statistics to obtain a second word segmentation result; the second word segmentation result comprises a plurality of word segments. Each word segment in the second word segmentation result can then be compared against the sensitive word lexicon to judge whether it is a sensitive word. If sensitive words exist in the second word segmentation result, they are collected to form a second target sensitive word set (for example, the second word segmentation result contains 2 sensitive words belonging to the abuse sub-lexicon). In order to avoid the adverse effects of displaying a sensitive word directly, each target sensitive word in the second target sensitive word set can be obtained and its characters replaced with the corresponding pinyin initial letters, giving a second sensitive word conversion result for each target sensitive word; this desensitization keeps harmful sensitive words out of the text. If the second word segmentation result contains no sensitive word, the text recognition result is used as the current chat data.
Because the user's voice information is to be restored afterwards, the server can extract the user's sound characteristics (for example, tone features, voiceprint features, and the like) from the user's first voice data that it collected previously. When the server later receives further chat data from that user, it can perform voice synthesis using the previously extracted sound characteristics and the second desensitized chat data to obtain the processed chat data. The processed chat data is sent to the user sides for listening, which effectively avoids the negative effects of sensitive words.
S105, binding the current chatting data with the user ID of the corresponding user side and storing the current chatting data in a local first storage area.
In this embodiment, after the server completes the desensitization of the chat data, it can first obtain the unique user ID corresponding to the user side (which can be understood as the user login account), then bind the current chat data to that user ID and store it in the local first storage area. The server periodically extracts the current chat data stored in the first storage area and analyzes the chat content.
S106, obtaining the current system time, and judging whether the time interval between the current system time and the last chatting data analysis time is equal to a preset chatting data analysis time period or not.
In this embodiment, judging whether the time interval between the current system time and the last chat data analysis time equals the preset chat data analysis time period determines whether the next chat content analysis time, counted from the last chat content analysis time, has arrived. If the interval equals the chat data analysis time period, the current round of chat data analysis can start; if the interval is less than the chat data analysis time period, the method returns to the step of acquiring the data type of the chat data when chat data uploaded by a user side entering the preset scene is detected.
S107, if the time interval between the current system time and the last chatting data analysis time is equal to the chatting data analysis time period, acquiring a chatting data set which is stored in the first storage area at present, and carrying out text clustering on the chatting data set to obtain a corresponding chatting data clustering result; and the chat data clustering result comprises a plurality of chat data clustering clusters.
In this embodiment, if the time interval between the current system time and the last chat data analysis time is equal to the chat data analysis time period (for example, the chat data analysis time period may be set to be one day, one week, or one month, and the chat data analysis time period may be set according to actual requirements in a user-defined manner), it indicates that the current chat content analysis can be performed. After the server finishes analyzing the chat content of the stored chat data set in the first storage area last time, the server can transfer the stored chat data set in the first storage area to the second storage area and empty the first storage area. Therefore, each chatting data analysis period only analyzes the chatting data which is stored in the first storage area and collected recently, the data processing amount is effectively reduced, and the data processing efficiency is improved.
If the time interval between the current system time and the previous chat data analysis time is not equal to the chat data analysis time period, step S107 is executed after waiting until the time interval between the current system time and the previous chat data analysis time is equal to the chat data analysis time period.
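The timing check described above can be sketched as follows. This is only an illustrative sketch: the function name `analysis_due` is hypothetical, and the comparison uses "greater than or equal to" rather than strict equality, which is an assumption made so that a slightly late check still triggers the analysis.

```python
from datetime import datetime, timedelta


def analysis_due(now: datetime, last_analysis: datetime, period: timedelta) -> bool:
    """Return True once a full analysis period has elapsed since the last run."""
    # Assumption: ">=" instead of strict equality, so a late poll still triggers.
    return now - last_analysis >= period


last_run = datetime(2021, 5, 10, 0, 0)
period = timedelta(days=7)  # e.g. a one-week chat data analysis time period

too_early = analysis_due(datetime(2021, 5, 14, 0, 0), last_run, period)
on_time = analysis_due(datetime(2021, 5, 17, 0, 0), last_run, period)
print(too_early, on_time)
```

While the check returns False, the server would simply keep collecting and desensitizing incoming chat data.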
In one embodiment, step S107 includes:
obtaining a semantic vector corresponding to each chatting data in the chatting data set;
and acquiring Euclidean distances among semantic vectors corresponding to the chat data set to perform K-means clustering to obtain a chat data clustering result.
In this embodiment, after each chat data in the chat data set is converted into a semantic vector, the chat data in the chat data set can be clustered based on a vector clustering method (such as a K-means clustering algorithm), so that a chat data clustering result with the same number as that of preset clustering clusters can be obtained quickly. K-means clustering of vectors is prior art and is not described herein in detail.
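For illustration, K-means clustering over Euclidean distances can be sketched in a few lines. The sketch assumes the semantic vectors have already been computed; the function names, the toy two-dimensional vectors, and the choice of seeding the centroids with the first k vectors are assumptions made only to keep the example deterministic.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20):
    """Cluster vectors into k groups; returns one cluster index per vector."""
    # Assumption: the first k vectors seed the centroids (deterministic toy choice).
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy semantic vectors for four chat messages (hypothetical values).
vectors = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels = kmeans(vectors, k=2)
print(labels)
```

Here k plays the role of the preset number of clusters, so the number of resulting chat data clusters matches that preset value.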
And S108, obtaining text topics corresponding to the chat data cluster in the chat data cluster result.
In this embodiment, in order to quickly obtain the set of core topics of the chat data collected in the current chat data analysis time period, the text topic corresponding to each chat data cluster in the chat data clustering result may be analyzed. The server then selects, from the extracted text topics, those that match the activity texts it intends to promote, so that activities related to what users are chatting about can be pushed accurately.
In one embodiment, step S108 includes:
obtaining chat data included in the ith group of chat data clustering results; wherein the initial value of i is 1;
inputting all chat data in the ith group of chat data clustering results into a pre-trained LDA model for topic extraction to obtain topic extraction results respectively corresponding to all the chat data; wherein the LDA model is a document-topic generation model;
obtaining a topic extraction result with the maximum word frequency in all topic extraction results corresponding to the ith group of chat data clustering results, and taking the topic extraction result as a text topic corresponding to the ith group of chat data clustering results;
increasing the value of i by 1 to update the value of i, and judging whether i exceeds N; wherein N represents the total number of the chat data clustering clusters included in the chat data clustering result;
if i does not exceed N, storing the text theme corresponding to the ith group of chat data clustering results, and returning to execute the step of obtaining the chat data included in the ith group of chat data clustering results;
and if i exceeds N, acquiring the text topics respectively corresponding to the 1st through (i-1)th groups of chat data clustering results, so as to obtain the text topic corresponding to each chat data cluster in the chat data clustering result.
In this embodiment, the text topic extraction processing is performed on the N chat data cluster clusters included in the chat data cluster result, so that the text topics corresponding to the chat data cluster clusters in the chat data cluster result can be obtained quickly. When the text topics respectively corresponding to each chat data cluster are extracted, the topic extraction of each chat data is carried out by adopting a document-topic generation model, so that the extraction result is quicker and more accurate.
For example, after the chat data included in the 1st group of chat data clustering results is obtained, the N1 chat records it contains are all input to the LDA model to obtain the topic extraction result corresponding to each chat record in the 1st group. The topic extraction result with the maximum word frequency among all topic extraction results of the 1st group is then taken as the text topic corresponding to the 1st group of chat data clustering results. The text topics of the other groups are determined in the same way as for the 1st group. Once the text topic corresponding to each chat data cluster in the chat data clustering result has been obtained, the main information that the chat data focuses on is known, so that the server can push more relevant information in a more targeted manner.
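The per-group loop can be sketched as follows. The sketch makes two assumptions: "the topic extraction result with the maximum word frequency" is read as the topic that occurs most often within the group, and the pre-trained LDA model is replaced by a hypothetical keyword lookup (`toy_extract_topic`) so that the example is self-contained. It is not an LDA implementation.

```python
from collections import Counter


def cluster_topics(clusters, extract_topic):
    """Return one text topic per cluster: the most frequent per-message topic."""
    topics = []
    i = 1  # groups are counted from 1, as in the description
    n = len(clusters)
    while i <= n:
        per_message = [extract_topic(chat) for chat in clusters[i - 1]]
        if per_message:
            topic, _count = Counter(per_message).most_common(1)[0]
        else:
            topic = None  # an empty cluster has no topic
        topics.append(topic)
        i += 1
    return topics


# Hypothetical stand-in for the pre-trained LDA model: a keyword lookup.
KEYWORD_TOPICS = {"coupon": "discounts", "sale": "discounts", "raid": "game events"}


def toy_extract_topic(chat):
    for word in chat.lower().split():
        if word in KEYWORD_TOPICS:
            return KEYWORD_TOPICS[word]
    return "other"


clusters = [
    ["any coupon today", "big sale soon", "hello there"],
    ["raid tonight", "join the raid", "sale on gear"],
]
topics = cluster_topics(clusters, toy_extract_topic)
print(topics)
```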
S109, if the text similarity between the text topic corresponding to the chat data cluster and the target topic in the preset target topic list exceeds a preset similarity threshold, acquiring the chat data cluster corresponding to the corresponding text topic as a target chat data cluster, and forming a target chat data cluster set.
In this embodiment, the text topic corresponding to each chat data cluster is known, and a target topic list composed of a plurality of target topics is pre-stored in the server. Each target topic in the target topic list is provided with corresponding promotion text data; for example, the promotion text data is a promotion text for a user preferential activity corresponding to the target topic. The text topic corresponding to each chat data cluster is compared with each target topic in the target topic list to calculate the text similarity between them; for example, the word vectors corresponding to the two topics may first be obtained, and the Euclidean distance between the two word vectors is then calculated as the similarity between the two texts. Finally, if the text similarity between the text topic corresponding to a chat data cluster and a target topic in the target topic list exceeds a preset similarity threshold, that chat data cluster is acquired as a target chat data cluster, and the target chat data clusters form a target chat data cluster set. In this way, the hot topics in the user chat data can be screened (the hot topics are also the hot activities that the operator of the server intends to promote), so that promotion texts of interest can be pushed to users based on the hot topics.
If no chat data cluster has a text topic whose text similarity to a target topic in the preset target topic list exceeds the preset similarity threshold, it indicates that the core topics of the users' chat content have no similar topic in the target topic list pre-stored in the server. In this case, the server sends the text topic corresponding to each chat data cluster in the chat data clustering result to a target receiving terminal (which can be understood as an intelligent terminal used by the operator's planning staff), and prompts the user of the target receiving terminal to edit a text to be promoted according to the text topic corresponding to each chat data cluster.
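The screening in step S109 can be sketched as follows. The description uses the Euclidean distance between word vectors as the similarity; because a smaller distance means a closer match, this sketch maps the distance d to a score 1 / (1 + d) so that "exceeds the threshold" means "more similar". That mapping, the function names, and the toy vectors are assumptions for illustration only.

```python
import math


def similarity(a, b):
    """Map Euclidean distance to a score in (0, 1]; 1.0 means identical vectors."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def select_target_clusters(cluster_topic_vecs, target_topic_vecs, threshold):
    """Return indices of clusters whose topic is similar to any target topic."""
    selected = []
    for idx, vec in enumerate(cluster_topic_vecs):
        if any(similarity(vec, t) > threshold for t in target_topic_vecs):
            selected.append(idx)
    return selected


# Hypothetical word vectors for three cluster topics and one target topic.
cluster_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
target_vecs = [[1.0, 0.0]]
targets = select_target_clusters(cluster_vecs, target_vecs, threshold=0.8)
print(targets)

# An empty result corresponds to the fallback case in which the topics are
# forwarded to the target receiving terminal for manual editing.
```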
In an embodiment, as an alternative implementation of step S109, step S109 may be replaced by:
and acquiring promotion text data to be audited corresponding to each text topic, and if the promotion text data to be audited passes the verification of the sensitive words, taking each promotion text data to be audited as promotion text data corresponding to the text topic of each target chat data cluster.
In this embodiment, the first implementation of step S109 is step S109 as described above: it is determined whether a hot text topic in the user chat content matches a topic of the texts the server intends to promote, and if so, the promotion text data corresponding to that hot text topic is obtained. Alternatively, promotion text data to be audited may be set in the server for every text topic found in the user chat content, so that promotion text data can be pushed for all text topics the users care about, giving timely feedback on the content the users care about.
S110, acquiring promotion text data respectively corresponding to the text topics of the target chat data cluster, and forming a text data set to be pushed.
In this embodiment, after the promotion text data corresponding to the text topic of each target chat data cluster has been acquired, the promotion text data form a text data set to be pushed. The server then sends the text data set to be pushed to the user side according to a preset pushing policy (for example, pushing at ten a.m. every Wednesday) to implement information distribution and pushing.
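As an illustration of such a pushing policy, the next push time for the "ten a.m. every Wednesday" example can be computed as below. The function name `next_push_time` and its parameters are hypothetical; the description does not specify how the policy is scheduled.

```python
from datetime import datetime, timedelta


def next_push_time(now: datetime, weekday: int = 2, hour: int = 10) -> datetime:
    """Next occurrence of the given weekday and hour strictly after `now`.

    Weekdays follow datetime.weekday(): Monday is 0, so Wednesday is 2.
    """
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed; use next week's.
        candidate += timedelta(days=7)
    return candidate


# 12 May 2021 was a Wednesday.
before_slot = next_push_time(datetime(2021, 5, 12, 9, 0))
after_slot = next_push_time(datetime(2021, 5, 12, 11, 0))
print(before_slot, after_slot)
```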
In an embodiment, after step S110, the method further includes:
sending the text data set to be pushed to a user side;
and receiving text evaluation information sent by the user side according to the text data set to be pushed.
In this embodiment, after the server sends the text data set to be pushed to the user side, the user may view the text data set to be pushed on the user side and evaluate each text data to be pushed in the set (for example, by editing a piece of evaluation text for each text data to be pushed and sending it to the server). When the server receives the text evaluation information sent by the user side according to the text data set to be pushed, it thus obtains the user's feedback on the text data to be pushed in a timely manner.
In an embodiment, after step S110, the method further includes:
and obtaining the commentary video data corresponding to each text data to be pushed in the text data set to be pushed, and sending the video links of the commentary video data respectively corresponding to each text data to be pushed to the user side.
In this embodiment, in order to help the user grasp the core content of each text data to be pushed more quickly, commentary video data on the relevant activity may be recorded in the server for each text data to be pushed, and a video link is generated for each commentary video data. After the video links of the commentary video data respectively corresponding to the text data to be pushed are sent to the user side, the user can click a link to view the commentary video as needed, which improves the efficiency with which the user obtains correct information.
The method performs automatic semantic analysis of user chat data to quickly obtain the corresponding promotion text data set without manual intervention, which improves data processing efficiency.
The embodiment of the invention also provides an information pushing device based on the analysis of the user chat content, which is used for executing any embodiment of the information pushing method based on the analysis of the user chat content. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of an information pushing apparatus based on user chat content analysis according to an embodiment of the present invention. The information pushing device 100 based on the analysis of the chat content of the user can be configured in a server.
As shown in fig. 3, the information push apparatus 100 based on the analysis of the chat content of the user includes: the chat data type acquiring unit 101, the type judging unit 102, the first desensitization processing unit 103, the second desensitization processing unit 104, the data storage unit 105, the time judging unit 106, the text clustering unit 107, the text topic extracting unit 108, the target clustering cluster acquiring unit 109 and the text to be pushed acquiring unit 110.
The chat data type obtaining unit 101 is configured to obtain a data type of chat data if the chat data uploaded by a user terminal entering a preset scene is detected.
In this embodiment, when the user operates the user side to start and log in a designated application program or a client installed in the user side, a preset scenario may be entered, and in the preset scenario, after the user inputs chat data (such as text-type chat data or voice-type chat data) in a chat window, the chat data may be sent to the server for storage and analysis of the chat content of the user. After the server receives the chat content sent by the user side, the data type of the chat content can be quickly analyzed so as to carry out the data processing of the next step.
A type determining unit 102, configured to determine whether the data type is a voice type or a text type.
In this embodiment, the chat data sent by the user side is generally of one of two data types: voice type or text type. That is, the user can send voice information or text information to the server, and the server can quickly determine the data type from the file suffix of the chat data; typically, the file suffix of text-type data is txt and that of voice-type data is mp3. This determination allows the subsequent data desensitization strategy to be selected quickly.
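A suffix-based type check of this kind can be sketched as follows. The function name `chat_data_type` and the returned labels are hypothetical; only the txt and mp3 suffixes come from the description, and rejecting any other suffix is an assumption.

```python
from pathlib import PurePath


def chat_data_type(filename: str) -> str:
    """Classify chat data as 'text' or 'voice' from its file suffix."""
    suffix = PurePath(filename).suffix.lower()
    if suffix == ".txt":
        return "text"
    if suffix == ".mp3":
        return "voice"
    # Assumption: any other suffix is rejected rather than guessed.
    raise ValueError(f"unsupported chat data suffix: {suffix!r}")


print(chat_data_type("msg_001.txt"), chat_data_type("msg_002.MP3"))
```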
And the first desensitization processing unit 103 is configured to, if the data type is a text type, perform sensitive word detection and sensitive word conversion processing on the chat data to obtain chat data after the first desensitization processing as current chat data.
In this embodiment, if the data type is a text type, it indicates that the user has edited text-type chat data and sent it to the server; the server performs desensitization processing on it and then forwards it to other user sides for viewing. When the server performs desensitization processing on the text-type chat data, a sensitive word thesaurus preset in the server can be called, and sensitive word detection is performed based on the sensitive word thesaurus. With this text desensitization processing, sensitive text can be quickly converted and then sent to the user side for display.
The sensitive word thesaurus preset in the server comprises: a politics sub-thesaurus, a contraband sub-thesaurus, an abuse sub-thesaurus, an illegal website sub-thesaurus, a harassing advertisement sub-thesaurus, an incident sub-thesaurus, a pornography sub-thesaurus, a gambling sub-thesaurus, and the like. With the sensitive words preset in the server, sensitive word detection can be performed rapidly on the chat text based on the sensitive word thesaurus.
In an embodiment, the first desensitization processing unit 103 includes:
the first word segmentation unit is used for segmenting the chat data to obtain a first word segmentation result; wherein the first word segmentation result comprises a plurality of word segments;
the first sensitive word detection unit is used for detecting a sensitive word of each word in the first word segmentation result so as to judge whether the sensitive word exists in the first word segmentation result;
the first sensitive word set acquisition unit is used for acquiring corresponding sensitive words to form a first target sensitive word set if the sensitive words exist in the first segmentation result;
the sensitive word grade obtaining unit is used for calling a pre-trained sensitive word classification model, inputting each sensitive word in the first target sensitive word set into the sensitive word classification model, and obtaining the sensitive word grade corresponding to each sensitive word in the first target sensitive word set;
the first sensitive word conversion unit is used for replacing each target sensitive word in the first target sensitive word set by the pinyin initial letter if the sensitive word level corresponding to each sensitive word in the first target sensitive word set is higher than a preset sensitive word level threshold value, so as to obtain a first sensitive word conversion result corresponding to each target sensitive word;
and the first desensitization result acquisition unit is used for replacing each word in the chat data, which is the same as the first target sensitive word set, by a corresponding first sensitive word conversion result to obtain chat data after first desensitization processing to be used as current chat data.
In this embodiment, a first word segmentation result can be obtained by performing word segmentation on the text-type chat data based on a probability-statistics word segmentation model; the first word segmentation result comprises a plurality of word segments. Each word segment in the first word segmentation result is then compared against the sensitive word thesaurus to judge whether it is a sensitive word. If sensitive words exist in the first word segmentation result, the corresponding sensitive words are acquired to form a first target sensitive word set (for example, the first word segmentation result contains 1 sensitive word belonging to the harassing advertisement sub-thesaurus).
In order to avoid adverse effects caused by direct text display of the sensitive word, each target sensitive word in a first target sensitive word set may be obtained, and a sensitive word grade corresponding to each sensitive word in the first target sensitive word set is obtained through a pre-trained sensitive word classification model (in specific implementation, a convolutional neural network may be adopted as the sensitive word classification model, and a word vector corresponding to each sensitive word in the first target sensitive word set is input to the sensitive word classification model, so that a sensitive word grade corresponding to each sensitive word in the first target sensitive word set may be obtained, for example, if the sensitive word grade corresponding to a sensitive word in the first target sensitive word set is 1, the sensitive word is a sensitive word with a first-level sensitive grade).
In order to apply different processing strategies to sensitive words of different grades, it may first be determined whether the sensitive word grade corresponding to any sensitive word in the first target sensitive word set is higher than a preset sensitive word grade threshold (for example, the sensitive word grade threshold is set to 3). If no sensitive word in the first target sensitive word set has a sensitive word grade higher than the threshold, the sensitive words in the set are not of a particularly high grade. In this case, the text characters of each target sensitive word in the first target sensitive word set can be replaced by the corresponding pinyin initial letters to obtain a first sensitive word conversion result corresponding to each target sensitive word.
If a sensitive word in the first target sensitive word set has a sensitive word grade higher than the sensitive word grade threshold, the grade of that sensitive word is particularly high. In this case, the sensitive word needs to be masked, that is, replaced by desensitization characters equal in number to the characters of the sensitive word. For example, if the first target sensitive word set contains a sensitive word "AB" consisting of 2 characters, "AB" can be replaced with "!!" or "**" for desensitization. In this way, sensitive words of a higher grade are masked as in the prior art, so that sensitive words are effectively filtered.
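The two desensitization strategies described in the preceding paragraphs can be sketched as follows. Several simplifications are assumed: the chat text is already segmented into word segments, the pre-trained sensitive word classification model is replaced by a hypothetical grade lookup (`SENSITIVE_GRADES`), the pinyin initials come from a small hand-written table (`PINYIN_INITIALS`) rather than a real pinyin converter, and the threshold is applied to each word individually.

```python
# Hypothetical stand-ins for the sensitive word thesaurus, the classification
# model, and a pinyin converter; the example words and grades are invented.
SENSITIVE_GRADES = {"广告": 1, "赌博": 4}
PINYIN_INITIALS = {"广告": "gg", "赌博": "db"}
GRADE_THRESHOLD = 3  # example threshold value from the description


def desensitize(tokens, mask_char="*"):
    """Replace sensitive word segments and join the result back into text."""
    out = []
    for token in tokens:
        grade = SENSITIVE_GRADES.get(token)
        if grade is None:
            out.append(token)  # not a sensitive word: keep as is
        elif grade > GRADE_THRESHOLD:
            out.append(mask_char * len(token))  # high grade: mask every character
        else:
            out.append(PINYIN_INITIALS[token])  # lower grade: pinyin initials
    return "".join(out)


low_grade = desensitize(["这是", "广告"])
high_grade = desensitize(["不要", "赌博"])
print(low_grade, high_grade)
```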
And the second desensitization processing unit 104 is configured to, if the data type is a voice type, perform voice text extraction, sensitive word detection, and sensitive word conversion on the chat data to obtain second desensitized chat data serving as current chat data, and perform text-to-speech conversion on the second desensitized chat data according to corresponding user voice characteristics to obtain processed chat data.
In this embodiment, if the data type is a voice type, it indicates that the user has sent voice-type chat data to the server. In this case, the server needs to perform voice text extraction (for example, voice recognition and text extraction based on an N-gram model stored in the server, the N-gram model being a multi-gram model) to obtain extracted text data; the server performs desensitization processing on the extracted text data and then forwards it to other user sides for viewing. When the server performs desensitization processing after converting the voice-type chat data into text data, a sensitive word thesaurus preset in the server can be called, and sensitive word detection is performed based on the sensitive word thesaurus. With this voice desensitization processing, sensitive text can be quickly converted and then sent to the user side for display.
In one embodiment, the second desensitization processing unit 104 includes:
the text recognition unit is used for performing text recognition on the chatting data through a voice recognition model to obtain a text recognition result;
the second word segmentation unit is used for performing word segmentation on the text recognition result to obtain a second word segmentation result; wherein the second word segmentation result comprises a plurality of word segmentations;
the second sensitive word detection unit is used for carrying out sensitive word detection on each word segmentation in the second word segmentation result so as to judge whether a sensitive word exists in the second word segmentation result;
the second sensitive word set obtaining unit is used for obtaining corresponding sensitive words to form a second target sensitive word set if the second word segmentation result has sensitive words;
the second sensitive word conversion unit is used for replacing each target sensitive word in the second target sensitive word set by the first letter of pinyin to obtain a second sensitive word conversion result corresponding to each target sensitive word;
a second desensitization result obtaining unit, configured to replace, by a corresponding second sensitive word conversion result, each word in the text recognition result that is the same as the word in the second target sensitive word set, to obtain second desensitization-processed chat data as current chat data;
the user voice feature acquisition unit is used for acquiring user identity information corresponding to the user side and user voice features corresponding to the user identity information;
and the voice synthesis unit is used for carrying out voice synthesis on the chat data after the second desensitization processing through the voice characteristics of the user to obtain the processed chat data.
In this embodiment, text recognition is performed on the chat data through a voice recognition model based on an N-gram model to obtain a text recognition result, and word segmentation is then performed on the text recognition result through a word segmentation model based on probability statistics to obtain a second word segmentation result; the second word segmentation result comprises a plurality of word segments. Each word segment in the second word segmentation result can then be compared against the sensitive word thesaurus to determine whether it is a sensitive word. If sensitive words exist in the second word segmentation result, the corresponding sensitive words are acquired to form a second target sensitive word set (for example, the second word segmentation result contains 2 sensitive words belonging to the abuse sub-thesaurus). In order to avoid the adverse effects of displaying sensitive words directly as text, each target sensitive word in the second target sensitive word set can be obtained, and the characters of each target sensitive word are replaced by the corresponding pinyin initial letters to obtain a second sensitive word conversion result corresponding to each target sensitive word. This desensitization processing keeps sensitive words with adverse effects out of the text.
Since the user's voice information needs to be restored afterwards, the server may extract the user's voice features (for example, tone features and voiceprint features) from first voice data of the user that it collected previously. When the server later receives other chat data from the user, it can perform voice synthesis based on the previously extracted user voice features and the second desensitized chat data to obtain the processed chat data. Sending the processed chat data to the user side for listening effectively avoids the negative effects caused by sensitive words.
A data storage unit 105, configured to store the current chat data in a local first storage area after binding the current chat data with the user ID of the corresponding user side.
In this embodiment, after the server completes the desensitization processing of the chat data, the unique user ID corresponding to the user side (which may be understood as the user's login account) may be obtained first. The current chat data is then bound with the user ID of the corresponding user side and stored in the local first storage area. The server periodically extracts the current chat data stored in the first storage area and analyzes the chat content.
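The storage behaviour can be sketched as follows: desensitized chat data is bound to a user ID and kept in a first storage area, and after an analysis run the analyzed data is moved to a second storage area. The class `ChatStore`, its method names, and the use of in-memory lists are hypothetical; the description does not specify the storage medium.

```python
from dataclasses import dataclass, field


@dataclass
class ChatStore:
    first_area: list = field(default_factory=list)   # awaiting analysis
    second_area: list = field(default_factory=list)  # already analyzed

    def save(self, user_id: str, chat: str) -> None:
        """Bind desensitized chat data to its user ID and store it."""
        self.first_area.append({"user_id": user_id, "chat": chat})

    def current_batch(self) -> list:
        """Chat data set to be analyzed in the current analysis period."""
        return list(self.first_area)

    def archive(self) -> None:
        """After analysis: move the batch to the second area and empty the first."""
        self.second_area.extend(self.first_area)
        self.first_area.clear()


store = ChatStore()
store.save("user-42", "any coupon today")
batch = store.current_batch()
store.archive()
print(len(batch), len(store.first_area), len(store.second_area))
```

Because the first area is emptied after each run, each analysis period only processes recently collected chat data.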
The time determining unit 106 is configured to obtain a current system time, and determine whether a time interval between the current system time and a previous chat data analysis time is equal to a preset chat data analysis time period.
In this embodiment, determining whether the time interval between the current system time and the last chat data analysis time is equal to the preset chat data analysis time period amounts to determining whether the next scheduled chat content analysis time has arrived. If the time interval between the current system time and the last chat data analysis time is equal to the chat data analysis time period, the current chat data analysis can be started; if the time interval is less than the chat data analysis time period, the method returns to the step of acquiring the data type of the chat data when chat data uploaded by a user side that has entered the preset scene is detected.
A text clustering unit 107, configured to, if a time interval between current system time and last chat data analysis time is equal to the chat data analysis time period, obtain a currently stored chat data set in the first storage area, perform text clustering on the chat data set, and obtain a corresponding chat data clustering result; and the chat data clustering result comprises a plurality of chat data clustering clusters.
In this embodiment, if the time interval between the current system time and the last chat data analysis time is equal to the chat data analysis time period (for example, the chat data analysis time period may be set to be one day, one week, or one month, and the chat data analysis time period may be set according to actual requirements in a user-defined manner), it indicates that the current chat content analysis can be performed. After the server finishes analyzing the chat content of the stored chat data set in the first storage area last time, the server can transfer the stored chat data set in the first storage area to the second storage area and empty the first storage area. Therefore, each chatting data analysis period only analyzes the chatting data which is stored in the first storage area and collected recently, the data processing amount is effectively reduced, and the data processing efficiency is improved.
In one embodiment, the text clustering unit 107 includes:
a semantic vector acquiring unit, configured to acquire a semantic vector corresponding to each chat data in the chat data set;
and the semantic vector clustering unit is used for acquiring Euclidean distances among the semantic vectors corresponding to the chat data set so as to perform K-means clustering and obtain a chat data clustering result.
In this embodiment, after each chat data in the chat data set is converted into a semantic vector, the chat data in the chat data set can be clustered based on a vector clustering method (such as a K-means clustering algorithm), so that a chat data clustering result with the same number as that of preset clustering clusters can be obtained quickly. K-means clustering of vectors is prior art and is not described herein in detail.
And the text theme extraction unit 108 is configured to obtain text themes corresponding to the chat data cluster in the chat data cluster result.
In this embodiment, in order to quickly obtain the set of core topics of the chat data collected in the current chat data analysis time period, the text topic corresponding to each chat data cluster in the chat data clustering result may be analyzed. The server then selects, from the extracted text topics, those that match the activity texts it intends to promote, so that activities related to what users are chatting about can be pushed accurately.
In one embodiment, the text topic extraction unit 108 includes:
the group chat data acquisition unit is used for acquiring chat data included in the ith group chat data clustering result; wherein the initial value of i is 1;
the LDA topic extraction unit is used for inputting all chat data in the ith group of chat data clustering results into a pre-trained LDA model for topic extraction to obtain topic extraction results respectively corresponding to all the chat data; wherein the LDA model is a document-topic generation model;
the word frequency statistical unit is used for acquiring a topic extraction result with the maximum word frequency in all topic extraction results corresponding to the ith group of chat data clustering results to serve as a text topic corresponding to the ith group of chat data clustering results;
the packet number updating unit is used for increasing the value of i by 1 and judging whether i exceeds N; wherein N represents the total number of the chat data clustering clusters included in the chat data clustering result;
the first processing unit is used for storing the text theme corresponding to the ith group of chat data clustering results if i does not exceed N, and returning to execute the step of acquiring the chat data included in the ith group of chat data clustering results;
and the second processing unit is used for acquiring, if i exceeds N, the text topics respectively corresponding to the 1st through (i-1)th groups of chat data clustering results, so as to obtain the text topic corresponding to each chat data cluster in the chat data clustering result.
In this embodiment, text topic extraction is performed on each of the N chat data clusters included in the chat data clustering result, so that the text topic corresponding to each chat data cluster can be obtained quickly. When the text topic of each chat data cluster is extracted, a document-topic generation model is used to extract the topic of each piece of chat data, which makes the extraction faster and more accurate.
For example, after the chat data included in the 1st group of chat data clustering results is acquired, the N1 chat records included in that group may be counted, and all N1 chat records are then input into the LDA model to obtain the topic extraction result corresponding to each piece of chat data in the 1st group. Then, among all the topic extraction results corresponding to the 1st group, the topic extraction result with the highest word frequency is taken as the text topic corresponding to the 1st group of chat data clustering results. The text topics of the other groups can be obtained in the same way as for the 1st group. Once the text topic corresponding to each chat data cluster in the chat data clustering result has been acquired, the main information that the chat data focuses on is known, so that the server can push more relevant information in a more targeted manner.
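The per-group loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `pick_cluster_topics` and `keyword_topic` are hypothetical, the keyword matcher merely stands in for the pre-trained LDA model, and "highest word frequency" is read here as "the topic label that occurs most often among the per-record results", which is one possible interpretation.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional


def keyword_topic(text: str) -> str:
    """Hypothetical stand-in for the pre-trained LDA model: one topic label per record."""
    for keyword in ("skin", "raid"):
        if keyword in text:
            return keyword
    return "other"


def pick_cluster_topics(
    clusters: List[List[str]],
    extract_topic: Callable[[str], str],
) -> Dict[int, Optional[str]]:
    """For groups 1..N, keep the most frequent per-record topic as the group's text topic."""
    topics: Dict[int, Optional[str]] = {}
    for i, chat_records in enumerate(clusters, start=1):
        # Topic extraction result for each piece of chat data in group i.
        per_record = [extract_topic(record) for record in chat_records]
        counts = Counter(per_record)
        # Assumption: the result with the highest count becomes the group's topic.
        topics[i] = counts.most_common(1)[0][0] if counts else None
    return topics


example_clusters = [
    ["coupon for new skin", "skin sale today", "when is the raid"],
    ["raid tonight", "raid team needed"],
]
cluster_topics = pick_cluster_topics(example_clusters, keyword_topic)
print(cluster_topics)
```

In a real deployment the stand-in extractor would be replaced by inference against the trained topic model; the loop over groups and the frequency count would stay the same.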
A target cluster acquiring unit 109, configured to, if the text similarity between the text topic corresponding to a chat data cluster and a target topic in a preset target topic list exceeds a preset similarity threshold, acquire the chat data cluster corresponding to that text topic as a target chat data cluster, so as to form a target chat data cluster set.
In this embodiment, the text topic corresponding to each chat data cluster is known, and a target topic list composed of a plurality of target topics is pre-stored in the server. Each target topic in the target topic list is provided with corresponding promotion text data (for example, a promotional text for a user discount activity related to the target topic). The text topic corresponding to each chat data cluster is compared with each target topic in the target topic list to calculate the text similarity between them (for example, the word vectors corresponding to the two topics may first be obtained, and the Euclidean distance between the two word vectors may then be calculated as the similarity between the two texts). Finally, if the text similarity between the text topic corresponding to a chat data cluster and a target topic in the target topic list exceeds the preset similarity threshold, the chat data cluster corresponding to that text topic is acquired as a target chat data cluster, so as to form the target chat data cluster set. In this way, the hot topics in the user chat data can be screened out (these hot topics are also the popular activities that the operator of the server wants to promote), so that promotion texts of interest can be pushed to users based on the hot topics.
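The screening step can be illustrated with a short sketch. Several things here are assumptions rather than details from the text: the toy two-dimensional vectors, the names `similarity` and `select_target_clusters`, and in particular the mapping 1 / (1 + d) from Euclidean distance d to a similarity score. The mapping is used because a raw distance grows as two topics become less alike, so some conversion is needed before an "exceeds the threshold" test selects similar topics.

```python
import math
from typing import Dict, List, Sequence


def similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed mapping from Euclidean distance to a score in (0, 1]; 1.0 means identical."""
    return 1.0 / (1.0 + math.dist(u, v))


def select_target_clusters(
    cluster_topics: Dict[int, str],
    target_topics: List[str],
    vectors: Dict[str, Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return the ids of clusters whose topic is close enough to any target topic."""
    selected = []
    for cluster_id, topic in cluster_topics.items():
        if any(
            similarity(vectors[topic], vectors[target]) > threshold
            for target in target_topics
        ):
            selected.append(cluster_id)
    return selected


# Toy word vectors (hypothetical values, for illustration only).
toy_vectors = {
    "skin": (1.0, 0.0),
    "skin sale": (1.0, 0.1),
    "raid": (0.0, 1.0),
}
target_cluster_ids = select_target_clusters(
    {1: "skin", 2: "raid"}, ["skin sale"], toy_vectors, threshold=0.8
)
print(target_cluster_ids)
```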
In an embodiment, the target cluster acquiring unit 109 may alternatively be replaced with:
a to-be-audited promotion text data acquisition unit, configured to acquire to-be-audited promotion text data respectively corresponding to each text topic, and, if each piece of to-be-audited promotion text data passes sensitive word verification, take each piece of to-be-audited promotion text data as the promotion text data respectively corresponding to the text topic of each target chat data cluster.
In this embodiment, the first implementation of the target cluster acquiring unit 109 corresponds to step S109: it determines whether a hot text topic that users focus on in their chat content matches the topic of a text that the server is to promote, and if so, the promotion text data corresponding to that hot text topic is acquired. The target cluster acquiring unit 109 may also be implemented in the alternative manner, in which to-be-audited promotion text data is set in the server for every text topic that users focus on in their chat content, so that promotion text data can be pushed for all the text topics that users focus on, providing timely feedback on the content users care about.
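A minimal sketch of the sensitive word verification in this alternative is given below. The function name `audit_promotions`, the substring check, and the sample data are all hypothetical; the text does not say how verification is performed, and this sketch simply keeps the candidate texts that contain none of the listed words.

```python
from typing import Dict, Iterable


def audit_promotions(
    candidates: Dict[str, str],
    sensitive_words: Iterable[str],
) -> Dict[str, str]:
    """Keep only the to-be-audited promotion texts that contain no sensitive word."""
    words = list(sensitive_words)
    return {
        topic: text
        for topic, text in candidates.items()
        if not any(word in text for word in words)
    }


# Hypothetical candidate texts keyed by text topic.
candidate_texts = {
    "skin": "New skin, 20% off this week",
    "raid": "Join the raid and win badword prizes",
}
approved_texts = audit_promotions(candidate_texts, ["badword"])
print(approved_texts)
```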
A to-be-pushed text acquiring unit 110, configured to acquire the promotion text data respectively corresponding to the text topics of the target chat data clusters, to form a text data set to be pushed.
In this embodiment, after the promotion text data corresponding to the text topic of each target chat data cluster has been acquired, these promotion text data form the text data set to be pushed, and the server then sends the text data set to be pushed to the user side according to a preset pushing policy (for example, pushing at ten a.m. every Wednesday), thereby implementing information distribution and pushing.
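As an illustration of this step, the sketch below assembles the set to be pushed and checks the example policy of ten a.m. every Wednesday. The names `build_push_set` and `is_push_time`, the dictionary of promotion texts, and the hour-granularity check are assumptions made for the example, not details from the text.

```python
from datetime import datetime
from typing import Dict, List


def build_push_set(target_topics: List[str], promo_by_topic: Dict[str, str]) -> List[str]:
    """Collect the promotion text for each target topic that has one configured."""
    return [promo_by_topic[topic] for topic in target_topics if topic in promo_by_topic]


def is_push_time(now: datetime, weekday: int = 2, hour: int = 10) -> bool:
    """Example policy: push during the 10 o'clock hour on Wednesday (Monday is 0)."""
    return now.weekday() == weekday and now.hour == hour


# Hypothetical promotion texts stored on the server, keyed by target topic.
promotions = {"skin": "New skin, 20% off this week"}
push_set = build_push_set(["skin", "raid"], promotions)

# 12 May 2021 was a Wednesday.
if is_push_time(datetime(2021, 5, 12, 10, 0)):
    print(push_set)
```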
In an embodiment, the information pushing apparatus 100 based on user chat content analysis further includes:
an information pushing unit, configured to send the text data set to be pushed to the user side;
and an evaluation information acquisition unit, configured to receive text evaluation information sent by the user side for the text data set to be pushed.
In this embodiment, after the server sends the text data set to be pushed to the user side, the user can view it on the user side and evaluate each piece of text data to be pushed in the set (for example, by writing an evaluation text for each piece of text data to be pushed and sending it to the server). When the server receives the text evaluation information sent by the user side for the text data set to be pushed, it obtains the user's feedback on the text data to be pushed in a timely manner.
In an embodiment, the information pushing apparatus 100 based on user chat content analysis further includes:
a commentary video link sending unit, configured to acquire the commentary video data respectively corresponding to each piece of text data to be pushed in the text data set to be pushed, and send the video links corresponding to that commentary video data to the user side.
In this embodiment, to help the user grasp the core content of each piece of text data to be pushed more quickly, the server may record commentary video data on the relevant activity for each piece of text data to be pushed, and a video link is generated for each piece of commentary video data. After the video links corresponding to the commentary video data are sent to the user side, the user can click a link to view the commentary video as needed, which improves the efficiency with which the user obtains correct information.
The apparatus performs automatic semantic analysis on user chat data to quickly acquire the corresponding promotion text data set without manual intervention, improving data processing efficiency.
The information pushing device based on the user chat content analysis can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 4, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, may cause the processor 502 to perform an information push method based on analysis of chat content of a user.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be enabled to execute an information push method based on the analysis of the chat content of the user.
The network interface 505 is used for network communication, such as the transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only the part of the configuration related to the solution of the present invention and does not limit the computer device 500 to which the solution is applied; a particular computer device 500 may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement the information push method based on the analysis of the chat content of the user disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer readable storage medium may be a non-volatile computer readable storage medium or a volatile computer readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the information push method based on the analysis of the chat content of the user disclosed by the embodiment of the invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above generally in terms of their functions in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a logical division, and other divisions are possible in actual implementation; units having the same function may be grouped into one unit, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An information push method based on user chat content analysis is characterized by comprising the following steps:
if the chat data uploaded by the user side in the preset scene is detected, acquiring the data type of the chat data;
judging whether the data type is a voice type or a text type;
if the data type is a text type, performing sensitive word detection and sensitive word conversion processing on the chat data to obtain chat data after first desensitization processing as current chat data;
if the data type is a voice type, performing voice text extraction, sensitive word detection and sensitive word conversion processing on the chat data to obtain chat data after second desensitization processing as current chat data, and performing text-to-speech conversion on the chat data after second desensitization processing according to the corresponding user voice characteristics to obtain processed chat data;
binding the current chat data with the user ID of the corresponding user side and then storing it in a local first storage area;
acquiring the current system time, and judging whether the time interval between the current system time and the last chat data analysis time is equal to a preset chat data analysis time period;
if the time interval between the current system time and the last chat data analysis time is equal to the chat data analysis time period, acquiring the chat data set currently stored in the first storage area, and performing text clustering on the chat data set to obtain a corresponding chat data clustering result; wherein the chat data clustering result comprises a plurality of chat data clusters;
obtaining text topics corresponding to all chat data cluster clusters in the chat data cluster results;
if the text similarity between the text topic corresponding to a chat data cluster and a target topic in a preset target topic list exceeds a preset similarity threshold, acquiring the chat data cluster corresponding to that text topic as a target chat data cluster to form a target chat data cluster set; and
acquiring promotion text data respectively corresponding to the text topics of the target chat data clusters to form a text data set to be pushed.
2. The information push method based on the analysis of the chat content of the user according to claim 1, wherein the performing sensitive word detection and sensitive word conversion processing on the chat data to obtain the chat data after the first desensitization processing as the current chat data comprises:
segmenting the chatting data to obtain a first segmentation result; wherein the first word segmentation result comprises a plurality of word segments;
sensitive word detection is carried out on each word segmentation in the first word segmentation result so as to judge whether a sensitive word exists in the first word segmentation result;
if the first word segmentation result contains sensitive words, acquiring corresponding sensitive words to form a first target sensitive word set;
calling a pre-trained sensitive word classification model, inputting each sensitive word in the first target sensitive word set into the sensitive word classification model, and acquiring the sensitive word grade corresponding to each sensitive word in the first target sensitive word set;
if no sensitive word in the first target sensitive word set has a sensitive word grade higher than a preset sensitive word grade threshold, replacing each target sensitive word in the first target sensitive word set with its pinyin initials to obtain a first sensitive word conversion result corresponding to each target sensitive word;
and replacing each word in the chat data that matches a word in the first target sensitive word set with the corresponding first sensitive word conversion result, to obtain the chat data after first desensitization processing as the current chat data.
3. The information push method based on the analysis of the chat content of the user as claimed in claim 1, wherein the step of performing speech text extraction, sensitive word detection and sensitive word conversion processing on the chat data to obtain second desensitized chat data as current chat data, and performing text-to-speech processing on the second desensitized chat data according to the corresponding user sound characteristics to obtain processed chat data comprises:
performing text recognition on the chatting data through a voice recognition model to obtain a text recognition result;
performing word segmentation on the text recognition result to obtain a second word segmentation result; wherein the second word segmentation result comprises a plurality of word segmentations;
sensitive word detection is carried out on each word segmentation in the second word segmentation result so as to judge whether a sensitive word exists in the second word segmentation result;
if the second word segmentation result contains the sensitive words, acquiring the corresponding sensitive words to form a second target sensitive word set;
replacing each target sensitive word in the second target sensitive word set by the first letter of pinyin to obtain a second sensitive word conversion result corresponding to each target sensitive word;
replacing each word in the text recognition result which is the same as the word in the second target sensitive word set by a corresponding second sensitive word conversion result to obtain second desensitized chat data serving as current chat data;
acquiring user identity information corresponding to the user side and user sound characteristics corresponding to the user identity information;
and performing voice synthesis on the chat data after the second desensitization processing through the user voice characteristics to obtain the processed chat data.
4. The information push method based on the analysis of the chat content of the user according to claim 1, wherein the obtaining of the currently stored chat data set in the first storage area and the text clustering of the chat data set to obtain the corresponding chat data clustering result comprises:
obtaining a semantic vector corresponding to each chatting data in the chatting data set;
and acquiring Euclidean distances among semantic vectors corresponding to the chat data set to perform K-means clustering to obtain a chat data clustering result.
5. The information push method based on the analysis of the chat content of the user according to claim 1, wherein the obtaining of the text topic corresponding to each cluster of the chat data in the cluster result of the chat data comprises:
acquiring the chat data included in the i-th group of chat data clustering results; wherein the initial value of i is 1;
inputting all the chat data in the i-th group of chat data clustering results into a pre-trained LDA model for topic extraction to obtain topic extraction results respectively corresponding to each piece of chat data; wherein the LDA model is a document-topic generation model;
taking, among all the topic extraction results corresponding to the i-th group of chat data clustering results, the topic extraction result with the highest word frequency as the text topic corresponding to the i-th group of chat data clustering results;
increasing the value of i by 1 to update the value of i, and judging whether i exceeds N; wherein N represents the total number of chat data clusters included in the chat data clustering result;
if i does not exceed N, storing the text topic corresponding to the i-th group of chat data clustering results, and returning to the step of acquiring the chat data included in the i-th group of chat data clustering results;
and if i exceeds N, acquiring the text topics respectively corresponding to the 1st through (i-1)-th groups of chat data clustering results, so as to obtain the text topics respectively corresponding to each chat data cluster in the chat data clustering result.
6. The information push method based on the analysis of the chat content of the user according to claim 1, wherein the step of, if the text similarity between the text topic corresponding to a chat data cluster and a target topic in the preset target topic list exceeds the preset similarity threshold, acquiring the chat data cluster corresponding to that text topic as a target chat data cluster to form the target chat data cluster set, is replaced with:
acquiring to-be-audited promotion text data respectively corresponding to each text topic, and if each piece of to-be-audited promotion text data passes sensitive word verification, taking each piece of to-be-audited promotion text data as the promotion text data respectively corresponding to the text topic of each target chat data cluster.
7. The information push method based on the analysis of the chat content of the user according to claim 1, wherein after obtaining the promotion text data corresponding to the text topic of each target chat data cluster to form the text data set to be pushed, the method further comprises:
sending the text data set to be pushed to a user side;
and receiving text evaluation information sent by the user side according to the text data set to be pushed.
8. An information pushing device based on user chat content analysis, comprising:
the chat data type acquisition unit is used for acquiring the data type of the chat data if the chat data uploaded by the user terminal entering a preset scene is detected;
the type judging unit is used for judging whether the data type is a voice type or a text type;
the first desensitization processing unit is used for performing sensitive word detection and sensitive word conversion processing on the chat data if the data type is a text type, to obtain chat data after first desensitization processing as current chat data;
the second desensitization processing unit is used for performing voice text extraction, sensitive word detection and sensitive word conversion processing on the chat data if the data type is a voice type, to obtain chat data after second desensitization processing as current chat data, and performing text-to-speech conversion on the chat data after second desensitization processing according to the corresponding user voice characteristics to obtain processed chat data;
the data storage unit is used for binding the current chat data with the user ID of the corresponding user side and then storing it in a local first storage area;
the time judging unit is used for acquiring the current system time and judging whether the time interval between the current system time and the last chat data analysis time is equal to a preset chat data analysis time period;
the text clustering unit is used for acquiring the chat data set currently stored in the first storage area and performing text clustering on the chat data set to obtain a corresponding chat data clustering result if the time interval between the current system time and the last chat data analysis time is equal to the chat data analysis time period; wherein the chat data clustering result comprises a plurality of chat data clusters;
the text topic extraction unit is used for acquiring the text topics respectively corresponding to the chat data clusters in the chat data clustering result;
the target cluster acquiring unit is used for acquiring, if the text similarity between the text topic corresponding to a chat data cluster and a target topic in a preset target topic list exceeds a preset similarity threshold, the chat data cluster corresponding to that text topic as a target chat data cluster, so as to form a target chat data cluster set; and
the to-be-pushed text acquiring unit is used for acquiring promotion text data respectively corresponding to the text topics of the target chat data clusters to form a text data set to be pushed.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the information push method based on the analysis of the chat content of the user according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which when executed by a processor causes the processor to execute the information push method based on user chat content analysis according to any of claims 1 to 7.
CN202110522391.3A 2021-05-13 2021-05-13 Information pushing method based on user chat content analysis and related equipment thereof Active CN113127746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110522391.3A CN113127746B (en) 2021-05-13 2021-05-13 Information pushing method based on user chat content analysis and related equipment thereof

Publications (2)

Publication Number Publication Date
CN113127746A true CN113127746A (en) 2021-07-16
CN113127746B CN113127746B (en) 2022-10-04

Family

ID=76781747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110522391.3A Active CN113127746B (en) 2021-05-13 2021-05-13 Information pushing method based on user chat content analysis and related equipment thereof

Country Status (1)

Country Link
CN (1) CN113127746B (en)

Also Published As

Publication number Publication date
CN113127746B (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN113127746B (en) Information pushing method based on user chat content analysis and related equipment thereof
US20230222366A1 (en) Systems and methods for semantic analysis based on knowledge graph
CN110020422B (en) Feature word determining method and device and server
CN106919661B (en) Emotion type identification method and related device
CN110297988A (en) Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN108027814B (en) Stop word recognition method and device
CN113360622B (en) User dialogue information processing method and device and computer equipment
CN110909165A (en) Data processing method, device, medium and electronic equipment
CN108536595B (en) Intelligent matching method and device for test cases, computer equipment and storage medium
CN112365894A (en) AI-based composite voice interaction method and device and computer equipment
CN102567534B (en) Interactive product user generated content intercepting system and intercepting method for the same
CN113383362B (en) User identification method and related product
CN110069769B (en) Application label generation method and device and storage device
CN111767393A (en) Text core content extraction method and device
CN115577316A (en) User personality prediction method based on multi-mode data fusion and application
CN106910135A (en) User recommendation method and device
CN113190682B (en) Method and device for acquiring event influence degree based on tree model and computer equipment
KR101894060B1 (en) Advertisement providing server using chatbot
CN107016561B (en) Information processing method and device
KR101965361B1 (en) Server and method for providing online service
CN108711073B (en) User analysis method, device and terminal
CN114242047A (en) Voice processing method and device, electronic equipment and storage medium
CN113593546A (en) Terminal device awakening method and device, storage medium and electronic device
CN110990705B (en) News processing method, device, equipment and medium
CN113971581A (en) Robot control method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant