CN110458296B - Method and device for marking target event, storage medium and electronic device - Google Patents

Method and device for marking target event, storage medium and electronic device

Info

Publication number
CN110458296B
Authority
CN
China
Prior art keywords
target
phrase
training
classification model
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910713377.4A
Other languages
Chinese (zh)
Other versions
CN110458296A (en
Inventor
邹耿鹏
段建波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910713377.4A
Publication of CN110458296A
Application granted
Publication of CN110458296B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for marking a target event, a storage medium and an electronic device. The method comprises the following steps: acquiring content sentences carried in information to be processed, wherein the content sentences are segmented into a plurality of phrases; determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase which appears in the same piece of information to be processed and whose occurrence frequency exceeds a preset frequency threshold within a preset time period; determining, by using a classification model, a target category corresponding to target to-be-processed information that contains the target phrase, wherein different categories, including the target category, correspond to different weights in the classification model, and the weight of the target category indicates the likelihood that the target phrase becomes a target event; and marking the target phrase contained in the target to-be-processed information as a target event when the weight corresponding to the target category exceeds a preset weight threshold. The invention at least solves the problem of low efficiency in detecting target events in the related art.

Description

Method and device for marking target event, storage medium and electronic device
Technical Field
The invention relates to the technical field of game data processing, in particular to a method and a device for marking a target event, a storage medium and an electronic device.
Background
In the related art, detection of network hot events is mainly implemented by training a word vector model using word embedding algorithms. Specifically, a word vector model is used to obtain word-level vector representations; sentence vector representations are then obtained by concatenating word vectors, by extracting the main words from the sentence backbone, by training a model, or by similar means; and the sentence vectors are then clustered with a clustering method to obtain event clusters. However, the approaches currently provided by the related art cannot intelligently recognize the category of an event cluster, that is, they cannot accurately determine whether the event to be detected is a genuine hot event or a routine event whose frequency is periodically high, so manual review is often needed to determine whether the event is a hot event.
That is, the detection method provided by the related art incurs a large labor cost and increases the complexity of event detection, resulting in low detection efficiency.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a method and a device for marking a target event, a storage medium and an electronic device, which are used for at least solving the problem of low efficiency of detecting the target event in the related technology.
According to an aspect of an embodiment of the present invention, there is provided a method for marking a target event, including: acquiring content sentences carried in information to be processed, wherein the content sentences are segmented into a plurality of phrases; determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase which appears in the same piece of information to be processed and the occurrence frequency of the phrase exceeds a preset frequency threshold value in a preset time period; determining a target category corresponding to target to-be-processed information containing the target phrase in the to-be-processed information by using a classification model, wherein different categories including the target category correspond to different weights in the classification model, and the weights of the target category are used for indicating the possibility that the target phrase becomes a target event; and marking the target phrase contained in the target to-be-processed information as the target event under the condition that the weight corresponding to the target category exceeds a preset weight threshold.
According to another aspect of the embodiment of the present invention, there is also provided a marking apparatus for a target event, including: the acquisition module is used for acquiring content sentences carried in the information to be processed, wherein the content sentences are segmented into one or more phrases; the first determining module is used for determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase which appears in the same piece of information to be processed and the occurrence frequency of which exceeds a preset frequency threshold value in a preset time period; the second determining module is used for determining a target category corresponding to target to-be-processed information containing the target phrase in the to-be-processed information by using a classification model, wherein different categories including the target category correspond to different weights in the classification model, and the weights of the target category are used for indicating the possibility that the target phrase becomes a target event; the marking module is used for marking the target phrase contained in the target to-be-processed information as the target event under the condition that the weight corresponding to the target category exceeds a preset weight threshold.
Optionally, the second determining module includes: the input unit is used for inputting the target to-be-processed information into the classification model, wherein the target to-be-processed information contains one or more target phrases, and the classification model is obtained by training an initial classification model by using the phrases contained in the to-be-processed information as training samples; and the output unit is used for outputting the target category corresponding to the target phrase.
Optionally, the apparatus further comprises: the training module is used for training the initial classification model by using the first target to-be-processed information with the determined category as a training sample, wherein the first target to-be-processed information comprises a phrase marked as a target event and a phrase not marked as the target event.
Optionally, the training module includes: the classification unit is used for classifying the first target to-be-processed information with the determined category into a training data set, a verification data set and a test data set, wherein the training data set and the verification data set are used for training the classification model, and the test data set is used for testing the trained classification model; the first segmentation unit is used for segmenting the content sentences contained in the training data set and the verification data set into initial training phrases, and taking the initial training phrases with the occurrence frequency exceeding a preset threshold value as initial training samples, wherein the vector dimension of the initial training samples is the number of the initial training samples; the calculating unit is used for calculating semantic vector characterization of the initial training sample through a vector characterization algorithm; the first training unit is used for inputting the vector dimension of the initial training sample and the semantic vector representation of the initial training sample into the initial classification model for training to obtain the classification model; and the test unit is used for testing the training result of the classification model through the test data set and adjusting the model parameters of the classification model.
Optionally, the training module further comprises: the second segmentation unit is used for segmenting the target content sentence in the target to-be-processed information into a plurality of target training phrases, wherein the target training phrases contain only Chinese characters and contain no stop words, and the stop words at least comprise interjections and/or pronouns and/or modal particles; the determining unit is used for determining the target training phrases whose occurrence frequency exceeds a preset threshold value as a bag of words; the first merging unit is used for merging the bag of words with the current training samples of the classification model to form target training samples; and the second training unit is used for training the classification model by using the target training samples and adjusting model parameters of the classification model.
Optionally, the training module further comprises: the first acquisition unit is used for acquiring the determined second target to-be-processed information in the time period from the last model training ending time to the current time, wherein the second target to-be-processed information comprises a phrase with the occurrence frequency exceeding a preset frequency threshold value in a preset time period; and the second merging unit is used for merging the phrase contained in the second target to-be-processed information into the current training sample of the classification model.
Optionally, the first determining module includes: the first determining unit is used for determining, as a first phrase, a phrase which appears in the same content sentence and whose occurrence frequency exceeds a preset threshold value in the content sentences of a plurality of pieces of information to be processed, wherein the first phrase contains only Chinese characters; the first discarding unit is used for discarding the first phrase whose today's duty ratio is smaller than a first preset duty ratio threshold value and/or whose today's word frequency is smaller than a first preset word frequency threshold value and/or whose today's word frequency increasing rate is smaller than a first preset increasing rate threshold value, to obtain a second phrase, wherein the today's word frequency increasing rate is the increasing rate relative to the word frequency of the previous day; the clustering unit is used for clustering the second phrase to obtain a first phrase cluster; the second discarding unit is used for discarding the first phrase cluster whose today's duty ratio is smaller than a second preset duty ratio threshold value and/or whose today's word frequency is smaller than a second preset word frequency threshold value and/or whose today's word frequency increasing rate is smaller than a second preset increasing rate threshold value, to obtain a second phrase cluster; and the second determining unit is used for determining that the phrase in the second phrase cluster is the target phrase.
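The threshold-based discarding performed by the first discarding unit can be illustrated with a minimal Python sketch. The names `PhraseStats`, `keep_phrase` and `filter_phrases`, the default threshold values, and the reading of "and/or" as "discard when any one metric falls below its threshold" are assumptions made for illustration only; the text does not fix them. The growth rate is taken here as the relative change against the previous day's word frequency.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhraseStats:
    phrase: str
    today_count: int      # word frequency of the phrase today
    yesterday_count: int  # word frequency of the phrase on the previous day
    total_today: int      # total count of all phrases today

    @property
    def today_ratio(self) -> float:
        # Share of today's phrases accounted for by this phrase.
        return self.today_count / self.total_today if self.total_today else 0.0

    @property
    def growth_rate(self) -> float:
        # Relative growth against the previous day's word frequency (assumed definition).
        if self.yesterday_count == 0:
            return float("inf") if self.today_count > 0 else 0.0
        return (self.today_count - self.yesterday_count) / self.yesterday_count


def keep_phrase(s: PhraseStats, min_ratio: float, min_count: int, min_growth: float) -> bool:
    # Assumed reading: a phrase survives only if all three metrics reach their thresholds.
    return (s.today_ratio >= min_ratio
            and s.today_count >= min_count
            and s.growth_rate >= min_growth)


def filter_phrases(stats: List[PhraseStats], min_ratio: float = 0.01,
                   min_count: int = 5, min_growth: float = 0.5) -> List[str]:
    # Returns the "second phrases" that remain after discarding.
    return [s.phrase for s in stats if keep_phrase(s, min_ratio, min_count, min_growth)]
```

The same filter could be reused for phrase clusters by aggregating the counts of the phrases in each cluster before calling `filter_phrases` with the second set of thresholds.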
Optionally, the first discarding unit comprises: an obtaining subunit, configured to obtain the today's duty ratio of the current first phrase using the following formula: P1 = exp{(log P / m) / log n}, wherein P represents the duty ratio of the first phrase on the previous day, and m and n are constants; a determining subunit, configured to determine the first phrase with the smallest today's duty ratio by comparing the today's duty ratios of the first phrases; and a discarding subunit, configured to discard the first phrase with the smallest today's duty ratio.
Optionally, the first determining module further includes:
the second obtaining unit is used for obtaining the fluctuation coefficient of the first phrase in the current time period through the following formula:
wherein x' represents a fluctuation coefficient, x represents word frequency of the first phrase in the current time period, mu represents a word frequency mean value of the first phrase in the same time period of the previous day, and sigma represents a standard deviation of the word frequency of the first phrase in the same time period of the previous day;
and the third discarding unit is used for discarding the first phrase when the fluctuation coefficient is smaller than a preset fluctuation value.
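The formula for the fluctuation coefficient does not appear in the text above. Given that x is the current word frequency and that mu and sigma are the mean and standard deviation for the same time period of the previous day, one plausible reading is a standard score, x' = (x - mu) / sigma. The Python sketch below works under that assumption; the function names, the use of the population standard deviation, and the default threshold are illustrative choices, not details taken from the source.

```python
import math
from statistics import mean, pstdev
from typing import Sequence


def fluctuation_coefficient(current: float, previous_day: Sequence[float]) -> float:
    """Assumed form x' = (x - mu) / sigma, computed from the previous day's samples."""
    mu = mean(previous_day)
    sigma = pstdev(previous_day)  # population standard deviation (assumption)
    if sigma == 0:
        # A perfectly flat previous day: any deviation is treated as unbounded.
        if current == mu:
            return 0.0
        return math.inf if current > mu else -math.inf
    return (current - mu) / sigma


def should_discard(current: float, previous_day: Sequence[float],
                   min_fluctuation: float = 3.0) -> bool:
    # The first phrase is discarded when its fluctuation is below the preset value.
    return fluctuation_coefficient(current, previous_day) < min_fluctuation
```

For example, with previous-day frequencies of 8, 10 and 12 for a given time period, a current frequency of 20 is well above the preset value used here, while a current frequency of 11 is not.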
According to a further aspect of the embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described method for marking a target event when run.
According to still another aspect of the embodiments of the present invention, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor performs the above-described method for marking a target event by means of the computer program.
In the embodiment of the invention, a classification model is adopted to automatically identify the category of the information to be processed. Content sentences carried in the information to be processed are acquired, wherein the content sentences are segmented into a plurality of phrases; a target phrase is determined from the plurality of phrases, wherein the target phrase is a phrase whose occurrence frequency exceeds a preset frequency threshold within a preset time period; a target category corresponding to the target to-be-processed information containing the target phrase is determined by using the classification model, wherein different categories, including the target category, correspond to different weights in the classification model, and the weight of the target category indicates the likelihood that the target phrase becomes a target event; and the target phrase contained in the target to-be-processed information is marked as a target event when the weight corresponding to the target category exceeds a preset weight threshold.
Determining the target phrase selects the phrases whose occurrence frequency exceeds the preset frequency threshold within the preset time period. Using the classification model to determine the target category of the target to-be-processed information containing the target phrase then classifies the information to be processed automatically. Because different weights are set for different target categories in the classification model, and only the target phrase whose category reaches the preset weight threshold is marked as a target event, the phrases that conform to the target event rule are further selected. This avoids the low efficiency caused by manually classifying and screening for target events, achieves the purpose of automatically identifying the category to which the target to-be-processed information belongs, and realizes the technical effect of automatically detecting whether the target phrase in the target to-be-processed information is a target event.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment for an alternative method of tagging target events according to an embodiment of the application;
FIG. 2 is a flow chart of an alternative method of marking a target event according to an embodiment of the application;
FIG. 3 is an alternative schematic diagram of a target event alert interface according to an embodiment of the present application;
FIG. 4 is a further alternative schematic diagram of a target event alert interface according to an embodiment of the application;
FIG. 5 is yet another alternative schematic diagram of a target event alert interface in accordance with an embodiment of the application;
FIG. 6 is an alternative flow chart of a worksheet type recognition method in accordance with an embodiment of the present application;
FIG. 7 is an alternative flow chart of an SVM classification model training method according to an embodiment of the application;
FIG. 8 is an alternative block diagram of a target event marking device according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a fluctuation coefficient display interface according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an alternative early warning month according to an embodiment of the present invention;
FIG. 11 is a schematic structural view of an alternative electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
To solve the above technical problems, an embodiment of the application provides a method for marking a target event. FIG. 1 is a schematic diagram of a hardware environment of an alternative method for marking a target event according to an embodiment of the present application. As shown in FIG. 1, the hardware environment may include, but is not limited to, a first user device 102, a network 110, a server 112, and a second user device 202, where the first user device 102 may include, but is not limited to, a memory 104, a processor 106, and a display 108; the server 112 may include, but is not limited to, a database 114 and a processing engine 116; and the second user device 202 may include, but is not limited to, a memory 204, a processor 206, and a display 208. The user devices herein may be, but are not limited to, terminal devices such as a smart phone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), or a PAD. The method for marking a target event mainly comprises the following steps:
Step S102, the first user device 102 sends the information to be processed to the network 110;
Step S104, the network 110 forwards the information to be processed to the server 112;
Step S106, the server 112 marks the information to be processed that meets the requirements as a target event and pushes the target event to the second user device 202;
Step S108, the server 112 returns the processing result for the information to be processed to the network 110;
Step S110, the network 110 feeds back the processing result to the first user device 102.
It should be noted that the server 112 does not have to feed back the processing result to the first user device 102. When the information to be processed sent by the first user device 102 does not request a particular data result, or is merely malicious content, the server 112 may ignore the information to be processed and not feed back a processing result.
Optionally, in step S106, the marking, by the server 112, of the information to be processed that meets the requirements as a target event may be implemented by the following steps: acquiring content sentences carried in information to be processed, wherein the content sentences are segmented into a plurality of phrases; determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase which appears in the same piece of information to be processed and whose occurrence frequency exceeds a preset frequency threshold within a preset time period; determining, by using a classification model, a target category corresponding to the target to-be-processed information that contains the target phrase, wherein different categories, including the target category, correspond to different weights in the classification model, and the weight of the target category indicates the likelihood that the target phrase becomes a target event; and marking the target phrase contained in the target to-be-processed information as a target event when the weight corresponding to the target category exceeds a preset weight threshold.
FIG. 2 is a flow chart of an alternative method of marking a target event according to an embodiment of the application. As shown in fig. 2, the method includes:
Step S202, obtaining content sentences carried in information to be processed, wherein the content sentences are segmented into a plurality of phrases;
Step S204, determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase which appears in the same piece of information to be processed and whose occurrence frequency exceeds a preset frequency threshold within a preset time period;
Step S206, determining, by using a classification model, a target category corresponding to the target to-be-processed information that contains the target phrase, wherein different categories, including the target category, correspond to different weights in the classification model, and the weight of the target category indicates the likelihood that the target phrase becomes a target event;
Step S208, marking the target phrase contained in the target to-be-processed information as a target event when the weight corresponding to the target category exceeds a preset weight threshold.
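Steps S202 to S208 can be sketched end to end in Python as follows. This is only an illustration under stated assumptions: each message is assumed to arrive already segmented into phrases, the occurrence frequency is counted as the number of messages in the time window that contain a phrase, and `classify` is a hypothetical stand-in for the trained classification model. The function names and the shape of `category_weights` are likewise invented for the example.

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def find_target_phrases(messages: List[List[str]], min_count: int) -> Set[str]:
    # S204: keep phrases contained in more than min_count messages of the window.
    counts = Counter(p for phrases in messages for p in set(phrases))
    return {p for p, c in counts.items() if c > min_count}


def mark_target_events(messages: List[List[str]],
                       classify: Callable[[List[str]], str],
                       category_weights: Dict[str, float],
                       min_count: int,
                       weight_threshold: float) -> Set[str]:
    targets = find_target_phrases(messages, min_count)
    events: Set[str] = set()
    for phrases in messages:
        hits = targets.intersection(phrases)
        if not hits:
            continue  # this message contains no target phrase
        category = classify(phrases)  # S206: category of the target message
        # S208: mark only when the category weight exceeds the threshold.
        if category_weights.get(category, 0.0) > weight_threshold:
            events.update(hits)
    return events
```

A call might pass a trivial rule-based `classify` during testing and swap in the trained model later, since the marking logic only depends on the returned category label.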
Optionally, the above method is not limited to the scenario of obtaining hot work orders, but can be applied to any other application scenario in which text messages need to be identified and annotated, such as shopping, game, instant messaging, and financial application scenarios.
Optionally, the information to be processed may be, but is not limited to, short text information, and the content sentences contained in the information to be processed may include, but are not limited to, punctuation marks, emoticons, modal particles, nouns, verbs, adjectives, and the like. After a content sentence is segmented, characters and symbols other than Chinese characters can be removed, and the content sentence carried in each piece of information to be processed can be segmented into one phrase or a plurality of phrases.
Optionally, the target categories may include, but are not limited to, consultation (e.g., wrong game ranking, inability to update installation packages), payment business (e.g., lost credit card, inability to open the wallet, inability to open a red packet), fraud complaints, malicious messages (e.g., malicious screen flooding, keyword flooding), periodic hot words (e.g., solar term notices, holiday reminders), and the like. Different target categories correspond to different weights in the classification model, and the weight of a target category indicates the likelihood that the target category becomes a target event.
For example, suppose two pieces of target to-be-processed information, A and B, are currently determined. The target category corresponding to A is a consultation about how to update a game version, that is, a plurality of messages ask how to update the same game version. In this case, the weight of the consultation category is determined to be higher than the preset weight threshold, so the target phrase corresponding to A is marked as a target event and pushed to the professionals who handle consultation information, because this sudden event requires urgent handling. The target category corresponding to B is the "Summer Solstice" solar term: on the day of a solar term, a large number of users post their impressions of it, but the solar term is not a sudden hot event. The weight of this category is lower than the preset weight threshold, so the target phrase corresponding to B is not marked as a target event and is not pushed to professionals for processing.
In an optional embodiment, the determining, using the classification model, the target category corresponding to the target to-be-processed information including the target phrase in the to-be-processed information may be implemented by:
S1, inputting the target to-be-processed information into a classification model, wherein the target to-be-processed information contains a plurality of target phrases, and the classification model is obtained by training an initial classification model by using phrases contained in the to-be-processed information as training samples;
S2, outputting the target category corresponding to the target phrase.
Alternatively, the classification model referred to in the embodiments of the present invention may be, but is not limited to, a support vector machine (support vector machine, abbreviated SVM) classification model. Inputting the phrase in the initially collected information to be processed into an initial classification model, and training the initial classification model to obtain a classification model capable of automatically detecting the category of the information to be processed.
In an alternative embodiment, before determining the target category corresponding to the target to-be-processed information including the target phrase in the to-be-processed information by using the classification model, the initial classification model may be trained by:
Training the initial classification model by using first target to-be-processed information whose category has been determined as training samples, wherein the first target to-be-processed information comprises phrases marked as target events and phrases not marked as target events.
In an alternative embodiment, training the initial classification model using the first target pending information of the determined class as a training sample may be achieved by:
S1, dividing the first target to-be-processed information whose category has been determined into a training data set, a verification data set and a test data set, wherein the training data set is used for training the initial classification model, and the test data set is used for testing the trained classification model;
S2, segmenting the content sentences contained in the training data set and the verification data set into initial training phrases, and taking the initial training phrases whose occurrence frequency exceeds a preset threshold value as initial training samples, wherein the vector dimension of the initial training samples is the number of the initial training samples;
S3, calculating semantic vector representations of the initial training samples through a vector representation algorithm;
S4, inputting the vector dimension of the initial training samples and the semantic vector representations of the initial training samples into the initial classification model for training to obtain the classification model;
S5, testing the training result of the classification model with the test data set, and adjusting the model parameters of the classification model.
Alternatively, the first target to-be-processed information of which the category has been determined may be manually classified short text information, which may include, but is not limited to, short text information of a part of sudden hot events, short text information of a part of persistent hot events, short text information of a part of non-hot events, and part of miscreant information. The training data set, the validation data set, and the test data set may each include the aforementioned short text information, which is not limited in this embodiment of the present invention.
The content sentence can be segmented by a word segmentation algorithm to obtain initial training phrases; other word segmentation algorithms or tools can also be used. The segmented phrases can be filtered with regular expressions or similar methods to remove the non-Chinese-character parts (punctuation marks, special symbols, numbers, English and the like), and stop words are then removed from the segmentation result, where the stop words may include, but are not limited to, interjections, mood words and pronouns. For example, suppose the content sentence contained in the information to be processed is "Ah! New version game installation package! Why can't it update!!! Why?" This content sentence can finally be split into the following phrases: "new version, game installation package, update", or "new version, game, installation package, unable, update". The segmentation rule for phrases can be set according to the actual application scenario of the model and is not limited by the embodiment of the invention.
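The filtering described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the sentence has already been segmented by some tool, and the function name `clean_tokens`, the sample tokens and the stop-word set are all hypothetical.

```python
import re

# Matches every character outside the basic CJK Unified Ideographs block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

def clean_tokens(tokens, stop_words):
    """Strip non-Chinese characters from each token, then drop empties and stop words."""
    cleaned = []
    for token in tokens:
        token = NON_CHINESE.sub("", token)
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned

# Hypothetical, already segmented input; a real pipeline would obtain it from a segmenter.
tokens = ["啊", "!", "新版本", "游戏", "安装包", "v2", "更新"]
print(clean_tokens(tokens, stop_words={"啊"}))
```

Punctuation and Latin/digit tokens become empty after the regular expression is applied and are dropped, and the interjection is removed by the stop-word set.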
Optionally, the initial training phrase with the occurrence frequency exceeding the preset threshold is used as an initial training sample, the preset threshold can be 0 or any integer greater than 0, and when the preset threshold is 0, all initial training phrases obtained after segmentation are used as initial training samples.
Optionally, the semantic vector representation of the initial training samples may be calculated by using term frequency-inverse document frequency (TF-IDF) as the vector representation algorithm, or may be calculated using other vector representation algorithms, which is not limited by the embodiment of the present invention.
Optionally, the vector dimension of the initial training samples and the semantic vector representation of the initial training samples are input into the initial classification model for training. For example, when the number of initial training samples is 100, the vector dimension 100 and the semantic vector representation of the initial training samples are input into the SVM classification model for training.
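As a rough illustration of the two preceding paragraphs, the sketch below builds the vocabulary from phrases whose occurrence count exceeds a threshold and computes one TF-IDF vector per document, so that the vector dimension equals the vocabulary size. The particular TF-IDF variant (relative term frequency times log(N/df)) and the helper names are assumptions; the source does not specify them.

```python
import math
from collections import Counter

def build_vocabulary(docs, min_count=0):
    """Keep phrases whose total occurrence count exceeds min_count."""
    counts = Counter(token for doc in docs for token in doc)
    return sorted(t for t, c in counts.items() if c > min_count)

def tfidf_vectors(docs, vocab):
    """Return one vector per document; its length equals len(vocab)."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / doc_freq[t]) if counts[t] else 0.0
            for t in vocab
        ])
    return vectors

# Hypothetical segmented work orders.
docs = [["更新", "版本"], ["更新", "下载"], ["更新"]]
vocab = build_vocabulary(docs, min_count=0)
vectors = tfidf_vectors(docs, vocab)
print(len(vocab), vectors[0])
```

A phrase that appears in every document receives an inverse document frequency of zero under this variant, which is one reason very common phrases contribute little to the representation.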
Optionally, the training result of the classification model may be tested with the test data set through a K-fold cross-validation algorithm. For example, 100 pieces of first target to-be-processed information are divided into k groups of data, of which 2 groups serve as test data sets and the remaining k-2 groups serve as training data sets and/or verification data sets; after the classification model is trained with the k-2 groups, the training result can be tested with the 2 test groups. The model parameters of the classification model are then adjusted and optimized according to the test result; the closer the test result is to the true result, the more stable the model parameters and the more reliable the model.
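For reference, a conventional K-fold split can be sketched as below. Note that this is the standard scheme, in which each fold is held out once, and not the exact "2 test groups, k-2 training groups" arrangement of the example above; the function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

for train, held_out in k_fold_splits(10, 5):
    print(len(train), held_out)
```

Each sample is held out exactly once, so averaging the scores across folds gives an estimate that does not depend on a single lucky or unlucky split.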
In an optional embodiment, after training the initial classification model using the first target pending information of the determined class as a training sample, the method further includes:
S1, dividing target content sentences in target to-be-processed information into a plurality of target training word groups, wherein the target training word groups only comprise Chinese characters and do not comprise stop words, and the stop words at least comprise exclamation words and/or pronouns and/or intonation words;
S2, determining a target training phrase with the occurrence frequency exceeding a preset threshold as a word bag;
S3, combining the word bags with the current training samples of the classification model to form a target training sample;
S4, training a classification model by using the target training sample, and adjusting model parameters of the classification model.
Optionally, before the target information to be processed is input into the classification model for class detection, the target information to be processed may be input into the classification model for training. The high-frequency phrase in the target information to be processed is used as a training word bag, so that the phrase contained in the target information to be processed can be updated into the training model of the classification model in real time, and the condition that the target information to be processed cannot be identified when being input into the classification model for class detection is avoided.
Optionally, the target training phrases may be split using the same method as for the initial training phrases. For example, suppose the content sentence contained in the target to-be-processed information is "Ah! New version reading software installation package! Why can't it update!!! Why?" This content sentence can finally be split into the following phrases: "new version, reading software installation package, update", or "new version, reading software, installation package, unable, update". The segmentation rule can be set according to the actual application scenario of the model and is not limited by the embodiment of the invention.
Optionally, the target training word group with the occurrence frequency exceeding the preset threshold is used as a training word bag, then the training word bag is combined with the current training sample, so that a new training sample is formed, and training operation is also performed on the sample containing the target training word group. And after training, the model parameters of the classification model are adjusted and optimized through the test data set, and the more the test result is close to the real result, the more stable the model parameters are, and the stronger the reliability is.
In an alternative embodiment, after training the initial classification model by using the first target to-be-processed information of which the class has been determined as a training sample, the training sample of the classification model may be updated periodically, and the method mainly includes the following steps:
S1, acquiring determined second target to-be-processed information in a time period from the last model training ending time to the current time, wherein the second target to-be-processed information comprises a phrase with the occurrence frequency exceeding a preset frequency threshold value in a preset time period;
S2, merging the phrases contained in the second target to-be-processed information into the current training sample of the classification model.
Alternatively, taking the training of the classification model of the customer service work order as an example, the updating of the work order classification model may include automatic updating and manual updating.
The automatic updating may include the steps of:
S1, checking a plurality of indicators, such as whether the number of work orders in the supplementary-recording work order library exceeds a preset threshold and whether the interval since the last model training exceeds a preset time threshold; if any one of them is satisfied, model updating is started automatically;
S2, retrieving samples from the supplementary-recording work order library and merging them with the original training samples;
S3, performing word segmentation on the processed sentences using a word segmentation algorithm (such as jieba); filtering out the non-Chinese-character parts (punctuation marks, special symbols, numbers, English and the like) of the segmented words using regular expressions or similar methods; removing stop words (exclamation words, mood words, pronouns and the like) from the segmentation result; and converting the bag of words into an index list;
S4, retraining the work order classification model.
The manual updating may include the following steps:
S1, checking the population stability index (PSI) of the model, and raising an alarm if it exceeds a threshold;
S2, manually redoing feature engineering on the corpus, re-examining the corpus, training a new model, and manually tuning the model parameters until the model is stable.
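The two update triggers can be sketched as follows. The PSI formula used here is the commonly cited one, the sum over categories of (actual share - expected share) x ln(actual share / expected share); the source does not spell it out, and the function names, the smoothing constant `eps` and all thresholds are assumptions for illustration.

```python
import math
from datetime import datetime, timedelta

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between a baseline and a current distribution over the same categories."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = max(e / e_total, eps)  # eps avoids log(0) for empty categories
        a_share = max(a / a_total, eps)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi

def should_auto_update(pending_orders, last_trained, now, max_pending, max_interval):
    """Automatic update starts when either indicator exceeds its threshold."""
    return pending_orders > max_pending or (now - last_trained) > max_interval

print(population_stability_index([50, 50], [80, 20]))
print(should_auto_update(500, datetime(2019, 6, 1), datetime(2019, 6, 2),
                         max_pending=100, max_interval=timedelta(days=7)))
```

A PSI of zero means the predicted category distribution is unchanged; the alarm threshold for manual updating is left to the operator.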
In an alternative embodiment, step S204 determines the target phrase from the plurality of phrases, which may be implemented by the following steps:
S1, determining a phrase whose occurrence frequency in the content sentences exceeds a preset threshold as a first phrase, wherein the first phrase only contains Chinese characters;
S2, discarding first phrases whose today's duty ratio is smaller than a first preset duty ratio threshold, and/or whose today's word frequency is smaller than a first preset word frequency threshold, and/or whose today's word frequency increase rate is smaller than a first preset increase rate threshold, to obtain second phrases, wherein today's word frequency increase rate is the rate of increase relative to the word frequency of the previous day;
S3, clustering the second phrases to obtain first phrase clusters;
S4, discarding first phrase clusters whose today's duty ratio is smaller than a second preset duty ratio threshold, and/or whose today's word frequency is smaller than a second preset word frequency threshold, and/or whose today's word frequency increase rate is smaller than a second preset increase rate threshold, to obtain second phrase clusters;
S5, determining the phrases in the second phrase clusters as target phrases.
Optionally, frequent mining is performed on the segmented phrases. The FP-Growth frequent mining algorithm can be used to obtain the frequent phrases, namely the first phrases. FP-Growth frequent item set mining is one of the association rule mining algorithms; frequent item sets are obtained by constraining the confidence, support and lift between associated items. The first phrases are then filtered twice and clustered once.
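To make the notion of a frequent item set concrete, the sketch below counts co-occurring phrase sets by brute force and keeps those whose support (here an absolute count) reaches a minimum. It is deliberately not FP-Growth: it omits the FP-tree as well as the confidence and lift constraints, and is only practical for small inputs. The function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_phrase_sets(transactions, min_support, max_size=2):
    """Count phrase sets of up to max_size items and keep those seen min_support times."""
    counts = Counter()
    for tokens in transactions:
        unique = sorted(set(tokens))  # each work order counts a phrase set once
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {items: c for items, c in counts.items() if c >= min_support}

# Hypothetical segmented work orders.
orders = [["更新", "版本"], ["更新", "版本", "下载"], ["登录"]]
print(frequent_phrase_sets(orders, min_support=2))
```

FP-Growth reaches the same support counts without enumerating every combination, which matters once the vocabulary and the number of work orders grow.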
The first filtering includes: performing first-layer dynamic filtering on the frequent phrases using a logarithmic curve model and the parameters set by a rule engine, so as to filter out the frequent phrases that do not meet the conditions and finally obtain the second phrases. The second phrases may be obtained by discarding first phrases whose today's duty ratio is smaller than a first preset duty ratio threshold, and/or whose today's word frequency is smaller than a first preset word frequency threshold, and/or whose today's word frequency increase rate is smaller than a first preset increase rate threshold; optionally, today's word frequency increase rate is the rate of increase relative to the word frequency of the previous day.
Optionally, frequent phrases may be determined by checking the word frequency of a phrase on the current day. A phrase whose word frequency meets the preset word frequency threshold may still be an everyday common phrase rather than a burst hot phrase, so the word frequency increase rate, that is, the rate of increase of the phrase's word frequency today relative to the previous day, may also be considered. When the word frequency increase rate also satisfies the preset rule, the phrase can be further confirmed as a frequent phrase. Conversely, if only the word frequency increase rate is considered, the previous day's word frequency may be a very small base, or even 0, so that a phrase appearing only a few times today shows a very high increase rate, which makes the data inaccurate. Therefore, whether today's word frequency and/or today's duty ratio meet the preset conditions is considered at the same time, and the final frequent phrase, that is, the second phrase, is determined after comprehensive consideration.
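The combined check discussed above might look like the following sketch. It reads the "and/or" conditions as "discard when any one indicator is below its threshold", which is only one possible reading, and it replaces the logarithmic curve model and rule engine with fixed thresholds. The class, the function and all counts are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class PhraseStats:
    phrase: str
    today_count: int
    yesterday_count: int
    total_today: int  # number of work orders received today

    @property
    def today_share(self):
        return self.today_count / self.total_today if self.total_today else 0.0

    @property
    def growth_rate(self):
        if self.yesterday_count == 0:
            return math.inf if self.today_count > 0 else 0.0
        return (self.today_count - self.yesterday_count) / self.yesterday_count

def first_filter(stats, min_share, min_count, min_growth):
    """Keep phrases that pass all three thresholds (share, count, growth)."""
    return [s for s in stats
            if s.today_share >= min_share
            and s.today_count >= min_count
            and s.growth_rate >= min_growth]

stats = [
    PhraseStats("更新", 24, 1, 779),     # strong growth and a meaningful volume
    PhraseStats("登录", 3, 0, 779),      # infinite growth but a tiny base
    PhraseStats("你好", 200, 190, 779),  # high volume but an everyday phrase
]
print([s.phrase for s in first_filter(stats, min_share=0.01, min_count=10, min_growth=1.0)])
```

The second and third entries show why no single indicator suffices: one would pass on growth alone and the other on volume alone.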
Optionally, today's duty ratio of the current first phrase may be obtained using the following formula:
p1 = exp{(log P / m) / log n}, wherein P represents the duty ratio of the first phrase on the previous day, and m and n are constants;
by comparing today's duty ratio of each first phrase, the first phrase with the smallest today's duty ratio is determined and discarded.
Optionally, the word frequency of the phrase can be judged according to the n-sigma rule, and the fluctuation coefficient of the first phrase in the current time period can be obtained according to the following formula:
x' = (x - μ) / σ
wherein x' represents the fluctuation coefficient, x represents the word frequency of the first phrase in the current time period, μ represents the mean word frequency of the first phrase in the same time period of the previous day, and σ represents the standard deviation of the word frequency of the first phrase in the same time period of the previous day;
and discarding the first phrase when the fluctuation coefficient x' is smaller than a preset fluctuation value.
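A small sketch of the n-sigma check follows. It reads the fluctuation coefficient as the standard score (x - μ) / σ, an assumption based on the symbol definitions above, and uses the population standard deviation. The function name, the handling of σ = 0 and the sample counts are hypothetical.

```python
from statistics import mean, pstdev

def fluctuation_coefficient(current_count, previous_counts):
    """Standard score of the current word frequency against the previous day's samples."""
    mu = mean(previous_counts)
    sigma = pstdev(previous_counts)
    if sigma == 0:
        # Degenerate baseline: any deviation is treated as unbounded.
        if current_count == mu:
            return 0.0
        return float("inf") if current_count > mu else float("-inf")
    return (current_count - mu) / sigma

# Hypothetical word frequencies sampled in the same period of the previous day.
previous = [4, 6, 5, 5]
print(fluctuation_coefficient(41, previous))  # far above a 3-sigma cut-off
print(fluctuation_coefficient(5, previous))
```

With a preset fluctuation value of, say, 3, the first phrase would be kept and the second discarded.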
After the first filtering, the frequent phrase meeting the requirements, namely the second phrase, is obtained.
Optionally, the filtered frequent phrases may be clustered using the DBSCAN clustering algorithm: all work orders containing the frequent phrases are taken out, the bag-of-words vector is used as the vector representation of each work order's content, and the content vectors are clustered with the DBSCAN algorithm to obtain frequent phrase clusters, that is, the first phrase clusters.
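A compact, unoptimized DBSCAN over bag-of-words vectors is sketched below to show the clustering step. Euclidean distance and the parameters `eps` and `min_pts` are assumptions, since the source gives neither the metric nor the parameter values; a production system would more likely rely on an existing library implementation.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is itself a core point, so keep expanding
                queue.extend(reachable)
    return labels

# Hypothetical bag-of-words count vectors for six work orders.
vectors = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 5), (0, 0, 6), (9, 9, 9)]
print(dbscan(vectors, eps=1.0, min_pts=2))
```

Work orders that share no dense neighbourhood are labelled -1 and would not form a frequent phrase cluster.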
Optionally, a rule engine may be used to perform rule filtering on the clustered frequent phrase clusters to obtain frequent phrase clusters that finally meet the conditions, that is, perform second filtering on the first phrase cluster to obtain a second phrase cluster. The phrase in the finally obtained second phrase cluster is the target phrase, and the information to be processed containing the target phrase is the target information to be processed.
The scheme provided by the embodiment of the invention can be used in any short-text-based emergency mining scenario. Short text messages are monitored; after a hot spot/emergency is found, an alert is sent by mail, applet or instant messaging group, the type of the hot work order is indicated, the title is displayed in the form of a co-occurrence phrase cluster, and the alert combines text, charts and data.
FIG. 3 is an optional schematic diagram of a target event alarm interface according to an embodiment of the present invention. As shown in FIG. 3, the PC-side alarm page displays how the number of clustered work orders changes and uses same-period comparison data to show how the work order volume compares with the same period of the previous day, so that staff can judge the urgency of the clustered work orders from the alarm and handle them in time. The alarm detail page shows the details of a hot spot result mined by the algorithm and the time-series change of the hot phrase. On the optional push interface shown in FIG. 3, the screening time period, the product name and the hot words "update, version download, update" can be seen directly. The trend of the word frequency over time on the current day, 2019-6-2, is shown by the solid line in the figure, with a peak of 41 at about 12 o'clock; the word frequency on the previous day, 2019-6-1, is shown by the dotted line, with a peak of 6 at about 12 o'clock. The interface also automatically computes the average word frequency for 2019-6-2. From these curves it can be intuitively confirmed whether the current hot word is a hot spot/emergency. Specific work order information can also be displayed at the bottom of the interface, for example that user XXX1 submitted, at 8:58 on 2019-6-2, a work order with the content "updated version, how to get" for the product with product code a1, so that staff can grasp specific work order requests in time.
FIG. 4 is a schematic diagram of another optional target event alarm interface according to an embodiment of the present invention. As shown in FIG. 4, using the real-time push function of an applet on the PC side or a mobile terminal, the push content on an optional push interface may include the hot words "update, version download, update", the work order category "consultation category", the product name, the push time, the statistical time period of the work orders containing the hot words (for example, 8:00-12:00), the number of work order feedbacks for the hot words in the current time period (24), the same-period growth rate compared with the same time period yesterday (2300%), the ratio of the current hot words among all work orders (3.08%), and the work order content, which, as shown in FIG. 4, includes "update version, how to get up to the latest version" and the like.
FIG. 5 is a schematic diagram of yet another optional target event alarm interface according to an embodiment of the invention. As shown in FIG. 5, using the real-time push function of an instant messaging group on the PC side or a mobile terminal, a user feedback early warning is pushed on an optional push interface, for example in an instant messaging group named "[product name] user feedback early warning". The push content includes, but is not limited to, the hot words "update, version download, update", the product name, the push time, the statistical time period of the work orders containing the hot words (for example, 8:00-12:00), the number of work order feedbacks for the hot words in the current time period (24), the same-period growth rate compared with the same time period yesterday (2300%), the duty ratio of the current hot words among all work orders (3.08%), and the work order content, which, as shown in FIG. 5, may include "why the update is still me after the update is downloaded, and the update is also me when logging in".
The alarm push display interfaces provided in FIG. 3 to FIG. 5 of the embodiment of the application enable timely, active, real-time tracking and pushing of hot spots/emergencies. The feedback number, the same-period growth rate and the hot word duty ratio intuitively present the word frequency of the current hot word in the current time period, its growth rate relative to the previous day, and its duty ratio among today's phrases, which makes it easy to confirm whether it is a hot spot/emergency. The sender's mood at that moment can also be recognized from the work order content. Staff can thus learn of hot spots/emergencies anytime and anywhere and respond proactively.
Optionally, hot words are mined from the work order content using the FP-Growth model and clustered, and the category of the clustered work orders is determined by the machine learning model, which can be accomplished through the flow shown in FIG. 6. FIG. 6 is an optional flowchart of a work order type identification method; as shown in FIG. 6, the method includes the following steps:
S601, acquiring asynchronous work orders and extracting the work order content;
S602, extracting the problem descriptions in the work orders within a certain time period of the current day and preprocessing them; the preprocessing includes coarse processing of the problem descriptions (including filtering special characters such as emoticons with regular expressions), segmenting them with a word segmentation tool, and filtering out the non-Chinese-character parts (punctuation marks, numbers and the like) of the words using regular expressions or similar methods;
S603, performing frequent phrase mining using the FP-Growth frequent mining algorithm, wherein FP-Growth frequent item set mining is one of the association rule mining algorithms, and frequent item sets are obtained by constraining the confidence, support and lift between associated items;
S604, performing first-layer dynamic filtering on the frequent phrases using a logarithmic curve model and the parameters set by a rule engine, so as to filter out the frequent phrases that do not meet the conditions;
S605, clustering the filtered frequent phrases using the DBSCAN clustering algorithm: taking out all work orders containing the frequent phrases, using the bag-of-words vector as the vector representation of the work order content, and clustering the content vectors with the DBSCAN algorithm to obtain frequent phrase clusters;
S606, performing rule filtering on the clustered frequent phrase clusters using a rule engine to obtain the frequent phrase clusters that finally meet the conditions;
S607, performing classification prediction on the work orders in each frequent phrase cluster using the SVM work order type classification model;
S608, attaching a category label to each frequent phrase cluster that meets the threshold condition.
Alternatively, the SVM classification model training process can be implemented by the following steps. FIG. 7 is an alternative flow chart of a training method for SVM classification models according to an embodiment of the invention, as shown in FIG. 7, comprising the steps of:
S701, labeling the work orders and determining the category of each work order;
S702, performing word segmentation with jieba on the work orders that have been filtered and clustered, and filtering out the non-Chinese-character parts (punctuation marks, special symbols, numbers, English and the like) of the segmented words using regular expressions or similar methods;
S703, removing stop words (exclamation words, mood words, pronouns and the like) from the segmentation result, screening candidate words by word frequency conditions, constructing a bag of words and converting it into an index list;
S704, dividing the historically labeled work orders into a training data set, a verification data set and a test data set;
S705, training the SVM work order classification model through K-fold cross validation, and optimizing the model parameters;
S7051, performing word segmentation on all work orders, removing stop words and single-character words, and selecting part of the words by word frequency limits as the vector dimensions used to represent the texts;
S7052, calculating the TF-IDF vector of each work order as the vector representation of the work order;
S7053, training the SVM work order classification model through K-fold cross validation and tuning its parameters;
S706, performing a model test with the test data set, and evaluating the prediction capability of the model through model evaluation parameters such as the F1 score and the KS value, wherein the larger the F1 and KS values, the higher the accuracy of the model; the model parameters are continuously optimized to strengthen the prediction capability and generalization capability of the model;
S707, outputting the trained classification model.
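The two evaluation parameters named in S706 can be computed as in the sketch below. It treats the problem as binary (one work order category against the rest) and defines KS as the largest gap between the true positive rate and the false positive rate over all score thresholds; both choices are assumptions, as the source names the metrics without defining them. `ks_statistic` expects at least one positive and one negative sample.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def ks_statistic(y_true, scores, positive=1):
    """Maximum |TPR - FPR| over all thresholds taken from the observed scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    best = 0.0
    for threshold in sorted(set(scores)):
        tpr = sum(s >= threshold for s in pos) / len(pos)
        fpr = sum(s >= threshold for s in neg) / len(neg)
        best = max(best, abs(tpr - fpr))
    return best

# Hypothetical labels, hard predictions and decision scores.
print(f1_score([1, 1, 0, 0], [1, 0, 0, 0]))
print(ks_statistic([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
```

For a multi-class work order model, these values would typically be computed per category and then averaged.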
The scheme realizes automatic discovery and statistical recording of work order hot events; realizes accurate auxiliary discrimination of hot spots using the n-sigma parameter; integrates hot spot discovery with hot spot event type discrimination, so that current hot work order events can be discovered without human intervention and their event types reported automatically, which greatly improves customer service efficiency; and uses stability parameters such as PSI to examine the distribution of the work order categories predicted by the model and to monitor whether the model's prediction performance is stable.
By the scheme provided by the embodiment of the invention, the following technical effects are realized:
The embodiment of the invention uses a bag-of-words model in which the bag of words is computed in real time from the current corpus, so the problem of new vocabulary being unusable does not arise.
According to the embodiment of the invention, co-occurrence phrases are mined by the FP-Growth model, low-frequency phrases are filtered out, the co-occurrence phrases are used as the display form, and similar work orders, represented by bag-of-words vectors, are clustered to form event clusters. The rule engine evaluates the mined event clusters; using the co-occurrence phrases, the event clusters can be compared in both the horizontal and vertical dimensions, so that not only the magnitude of an event cluster but also indicators such as its growth rate are considered when mining hot spots and emergencies.
According to the embodiment of the invention, intelligent discrimination of the work order category is realized through the machine learning model, and the problem that the hot spot display in the prior art needs to be selected manually is solved.
The embodiment of the invention realizes the early warning closed loop of the hot work order on the technical side, and the excavation of frequent word groups and the classification marking of the work order category can be automatically completed through a program.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
According to another aspect of the embodiment of the present application, there is also provided a target event marking apparatus for implementing the method for marking a target event. FIG. 8 is an alternative block diagram of a target event marking apparatus according to an embodiment of the present application, as shown in FIG. 8, including:
an obtaining module 802, configured to obtain a content sentence carried in information to be processed, where the content sentence is segmented into one or more phrases;
a first determining module 804, configured to determine a target phrase from a plurality of phrases, where the target phrase is a phrase that appears in the same piece of information to be processed and has a number of occurrences exceeding a preset number of occurrences threshold in a predetermined time period;
a second determining module 806, configured to determine, using the classification model, a target category corresponding to target to-be-processed information including a target phrase in the to-be-processed information, where different categories including the target category correspond to different weights in the classification model, and the weights of the target category are used to indicate a likelihood that the target phrase becomes a target event;
and the marking module 808 is configured to mark a target phrase included in the target to-be-processed information as a target event when the weight corresponding to the target category exceeds a preset weight threshold.
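The decision made by the marking module can be illustrated with a few lines of code. The data layout (a tuple of phrases, predicted category and category weight per cluster), the function name and the threshold value are all hypothetical; the sketch only mirrors the rule that a target phrase is marked as a target event when the weight of its target category exceeds the preset weight threshold.

```python
def mark_target_events(clusters, weight_threshold):
    """Return the clusters whose category weight exceeds the threshold, flagged as events."""
    marked = []
    for phrases, category, weight in clusters:
        if weight > weight_threshold:
            marked.append({"phrases": phrases, "category": category, "target_event": True})
    return marked

# Hypothetical output of the second determining module.
clusters = [
    (("更新", "版本"), "consultation", 0.92),
    (("登录",), "complaint", 0.40),
]
print(mark_target_events(clusters, weight_threshold=0.8))
```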
Optionally, the second determining module includes: the input unit is used for inputting target information to be processed into a classification model, wherein the target information to be processed contains one or more target phrases, and the classification model is obtained by training an initial classification model by using phrases contained in the information to be processed as training samples; and the output unit is used for outputting the target category corresponding to the target phrase.
Optionally, the apparatus further comprises: the training module is used for training the initial classification model by using the first target to-be-processed information with the determined category as a training sample, wherein the first target to-be-processed information comprises a phrase marked as a target event and a phrase not marked as the target event.
Optionally, the training module includes: a dividing unit, configured to divide the first target to-be-processed information whose category has been determined into a training data set, a verification data set and a test data set, wherein the training data set is used for training the initial classification model, and the test data set is used for testing the trained classification model; a first segmentation unit, configured to segment the content sentences contained in the training data set and the verification data set into initial training phrases, and take the initial training phrases whose occurrence frequency exceeds a preset threshold as initial training samples, wherein the vector dimension of the initial training samples is the number of the initial training samples; a calculating unit, configured to calculate the semantic vector representation of the initial training samples through a vector representation algorithm; a first training unit, configured to input the vector dimension of the initial training samples and the semantic vector representation of the initial training samples into the initial classification model for training to obtain the classification model; and a test unit, configured to test the training result of the classification model with the test data set and adjust the model parameters of the classification model.
Optionally, the training module further comprises: a second segmentation unit, configured to segment target content sentences in the target to-be-processed information into a plurality of target training phrases, wherein the target training phrases contain only Chinese characters and contain no stop words, and the stop words include at least interjections and/or pronouns and/or modal particles; a determining unit, configured to determine the target training phrases whose occurrence frequency exceeds a preset threshold as a bag of words; a first merging unit, configured to merge the bag of words with the current training samples of the classification model to form target training samples; and a second training unit, configured to train the classification model with the target training samples and to adjust the model parameters of the classification model.
Optionally, the training module further comprises: a first acquisition unit, configured to acquire second target to-be-processed information determined in the time period from the end of the last model training to the current time, wherein the second target to-be-processed information contains a phrase whose number of occurrences within a preset time period exceeds a preset frequency threshold; and a second merging unit, configured to merge the phrase contained in the second target to-be-processed information into the current training samples of the classification model.
Optionally, the first determining module includes: a first determining unit, configured to determine, among the content sentences of the plurality of pieces of to-be-processed information, a phrase that appears in the same content sentence and whose number of occurrences exceeds a preset threshold as a first phrase, wherein the first phrase contains only Chinese characters; a first discarding unit, configured to discard a first phrase whose today's duty ratio is smaller than a first preset duty ratio threshold and/or whose today's word frequency is smaller than a first preset word frequency threshold and/or whose today's word frequency increase rate is smaller than a first preset increase rate threshold, to obtain a second phrase, wherein today's word frequency increase rate is the increase rate relative to the word frequency of the previous day; a clustering unit, configured to cluster the second phrases to obtain a first phrase cluster; a second discarding unit, configured to discard a first phrase cluster whose today's duty ratio is smaller than a second preset duty ratio threshold and/or whose today's word frequency is smaller than a second preset word frequency threshold and/or whose today's word frequency increase rate is smaller than a second preset increase rate threshold, to obtain a second phrase cluster; and a second determining unit, configured to determine the phrases in the second phrase cluster as the target phrases.
Optionally, the first discarding unit comprises: an obtaining subunit, configured to obtain today's duty ratio of the current first phrase using the following formula: P1 = exp{(log P/m)/log n}, wherein P represents the duty ratio of the first phrase on the previous day, and m and n are constants; a determining subunit, configured to determine the first phrase having the minimum today's duty ratio by comparing today's duty ratio of each first phrase; and a discarding subunit, configured to discard the first phrase having the minimum today's duty ratio.
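As a non-limiting illustration, the first discarding step may be sketched as follows. The sketch assumes that today's duty ratio is the share of a phrase among all phrase occurrences of the day, that the increase rate is (today - yesterday) / yesterday, and that a phrase is discarded when any one of the three criteria falls below its threshold (the "or" reading of "and/or"). The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhraseStats:
    phrase: str
    today_count: int       # today's word frequency
    yesterday_count: int   # word frequency of the previous day
    today_total: int       # all phrase occurrences today (assumed denominator)

    @property
    def today_ratio(self):
        """Today's duty ratio, assumed to be count / total."""
        return self.today_count / self.today_total if self.today_total else 0.0

    @property
    def growth_rate(self):
        """Increase rate relative to the previous day's word frequency."""
        if self.yesterday_count == 0:
            return float("inf") if self.today_count > 0 else 0.0
        return (self.today_count - self.yesterday_count) / self.yesterday_count


def filter_phrases(stats, min_ratio, min_count, min_growth):
    """Keep only phrases that meet all three thresholds."""
    return [
        s for s in stats
        if s.today_ratio >= min_ratio
        and s.today_count >= min_count
        and s.growth_rate >= min_growth
    ]
```

The same filter could be applied a second time to per-cluster statistics with the second set of thresholds; the clustering step itself is not shown here.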
Optionally, the first determining module further includes:
a second acquisition unit, configured to obtain the fluctuation coefficient of the first phrase in the current time period by the following formula:
wherein x' represents a fluctuation coefficient, x represents word frequency of the first phrase in the current time period, mu represents a word frequency mean value of the first phrase in the same time period of the previous day, and sigma represents a standard deviation of the word frequency of the first phrase in the same time period of the previous day;
and the third discarding unit is used for discarding the first phrase when the fluctuation coefficient is smaller than a preset fluctuation value.
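As a non-limiting illustration, the fluctuation check may be sketched as follows. The formula itself is not reproduced in this text; based only on the symbol definitions (x, the mean μ and the standard deviation σ), the sketch assumes the standard score x' = (x - μ) / σ. The use of the population standard deviation and the function names are likewise assumptions.

```python
from statistics import mean, pstdev


def fluctuation_coefficient(x, previous_day_counts):
    """Assumed form: x' = (x - mu) / sigma, where mu and sigma are the
    mean and standard deviation of the previous day's word frequencies
    in the same time period."""
    mu = mean(previous_day_counts)
    sigma = pstdev(previous_day_counts)
    if sigma == 0:
        # Degenerate case: no variation on the previous day.
        return float("inf") if x > mu else 0.0
    return (x - mu) / sigma


def should_discard(x, previous_day_counts, preset_fluctuation_value):
    """Discard the phrase when its coefficient is below the preset value."""
    return fluctuation_coefficient(x, previous_day_counts) < preset_fluctuation_value
```

For example, with previous-day word frequencies of 2, 4, 4, 4, 5, 5, 7 and 9 (mean 5, standard deviation 2) and a current word frequency of 10, the assumed coefficient is 2.5, so the phrase would be discarded under a preset fluctuation value of 5.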
Fig. 9 is a schematic diagram of a fluctuation coefficient display interface according to an embodiment of the present invention. As shown in Fig. 9, the horizontal axis represents the time period, the vertical axis represents the word frequency of the target phrase, the dotted line represents yesterday's (2019-6-1) word frequency, and the solid line represents today's (2019-6-2) word frequency. As seen in Fig. 9, the highest word frequency of the day is 69, and the fluctuation coefficient in the 10:00-14:00 time period is calculated as 51 according to the fluctuation coefficient formula. If the preset fluctuation value is 5, today's fluctuation coefficient of the target phrase exceeds the preset fluctuation value, so the target phrase is retained.
Fig. 10 is an optional schematic diagram of a monthly early warning situation according to an embodiment of the present invention. As shown in Fig. 10, the method for marking a target event provided by the embodiment of the present invention is applied to hot-spot burst early warning for customer service work orders (short texts) of products such as electronic payment, PC games, mobile games and video playback, and the overall early warning accuracy is calculated to reach 81%. Compared with the original manual handling of work orders, early warning goes from nonexistent to available. The effect is particularly notable for products with insufficient customer service staff, and it greatly relieves the discovery and statistics burden on front-line staff.
According to still another aspect of the embodiments of the present invention, there is further provided an electronic device for implementing the above method for marking a target event. The electronic device may be applied to, but is not limited to, the server 112 shown in Fig. 1. As shown in Fig. 11, the electronic device comprises a memory 902 and a processor 904, the memory 902 having a computer program stored therein, and the processor 904 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one of a plurality of network devices of a computer network.
Optionally, in this embodiment, the above-described processor may be configured to execute the following steps by means of the computer program:
Step S1, obtaining content sentences carried in information to be processed, wherein the content sentences are segmented into a plurality of phrases;
Step S2, determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase whose number of occurrences within a preset time period exceeds a preset frequency threshold;
Step S3, determining, by using a classification model, a target category corresponding to target to-be-processed information that contains the target phrase among the to-be-processed information, wherein different categories including the target category correspond to different weights in the classification model, and the weight of the target category indicates the likelihood that the target phrase becomes a target event;
Step S4, marking the target phrase contained in the target to-be-processed information as a target event when the weight corresponding to the target category exceeds a preset weight threshold.
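As a non-limiting illustration, steps S1 to S4 may be sketched end to end as follows. The classification model is represented by a hypothetical callable `classify`, and the category weights by a plain dictionary; both are assumptions made only for this sketch, as are the function and parameter names.

```python
from collections import Counter


def mark_target_events(messages, classify, category_weights,
                       freq_threshold, weight_threshold):
    """messages: segmented content sentences (lists of phrases) collected
    within the preset time period (S1 is assumed to be done already)."""
    # S2: phrases whose occurrence count exceeds the frequency threshold.
    counts = Counter(phrase for phrases in messages for phrase in phrases)
    targets = {p for p, c in counts.items() if c > freq_threshold}

    events = set()
    for phrases in messages:
        hits = targets.intersection(phrases)
        if not hits:
            continue  # not target to-be-processed information
        # S3: category of the message that contains a target phrase.
        category = classify(phrases)
        # S4: mark only when the category weight exceeds the threshold.
        if category_weights.get(category, 0.0) > weight_threshold:
            events |= hits
    return events
```

The function returns the set of target phrases marked as target events; a message whose category weight does not exceed the threshold contributes nothing, even if it contains a frequent phrase.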
Optionally, it will be understood by those skilled in the art that the structure shown in Fig. 11 is only schematic, and the electronic device may also be a terminal device such as a smart phone (e.g. an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Device, MID) or a PAD. Fig. 11 does not limit the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in Fig. 11, or have a different configuration than shown in Fig. 11.
The memory 902 may be used to store software programs and modules, such as the program instructions/modules corresponding to the method and device for marking a target event in the embodiments of the present invention. The processor 904 runs the software programs and modules stored in the memory 902 to execute various functional applications and data processing, that is, to implement the above method for marking a target event. The memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 902 may further include memory located remotely from the processor 904, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 902 may specifically be used for, but is not limited to, storing the program steps of the method for marking a target event. As an example, as shown in fig. 9, the memory 902 may include, but is not limited to, the acquisition module 802, the first determination module 804, the second determination module 806 and the marking module 808 of the above device for marking a target event. It may also include, but is not limited to, other module units of the device for marking a target event, which are not described in detail in this example.
Optionally, the transmission device 906 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 906 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 906 is a Radio Frequency (RF) module for communicating with the internet wirelessly.
In addition, the electronic device further includes: a display 908 for displaying alert pushes of a target event; and a connection bus 910 for connecting the respective module parts in the above-described electronic device.
An embodiment of the invention also provides a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Optionally, in this embodiment, the above-described storage medium may be configured to store a computer program for performing the following steps:
Step S1, obtaining content sentences carried in information to be processed, wherein the content sentences are segmented into a plurality of phrases;
Step S2, determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase whose number of occurrences within a preset time period exceeds a preset frequency threshold;
Step S3, determining, by using a classification model, a target category corresponding to target to-be-processed information that contains the target phrase among the to-be-processed information, wherein different categories including the target category correspond to different weights in the classification model, and the weight of the target category indicates the likelihood that the target phrase becomes a target event;
Step S4, marking the target phrase contained in the target to-be-processed information as a target event when the weight corresponding to the target category exceeds a preset weight threshold.
Optionally, the storage medium is further configured to store a computer program for executing the steps included in the method in the above embodiment, which is not described in detail in this embodiment.
Optionally, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program instructing the relevant hardware of a terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
In the foregoing embodiments of the present application, each embodiment is described with its own emphasis; for a portion not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described device embodiments are merely exemplary. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of protection of the present application.

Claims (9)

1. A method for marking a target event, comprising:
acquiring content sentences carried in information to be processed, wherein the content sentences are segmented into a plurality of phrases;
determining a target phrase from the plurality of phrases, wherein the target phrase is a phrase which appears in the same piece of information to be processed and whose number of occurrences within a preset time period exceeds a preset frequency threshold;
determining a target category corresponding to target to-be-processed information containing the target phrase in the to-be-processed information by using a classification model, wherein different categories including the target category correspond to different weights in the classification model, and the weights of the target category are used for indicating the possibility that the target phrase becomes a target event;
marking the target phrase contained in the target to-be-processed information as the target event under the condition that the weight corresponding to the target category exceeds a preset weight threshold;
the determining the target phrase among the plurality of phrases includes:
determining, among the content sentences of a plurality of pieces of the information to be processed, a phrase which appears in the same content sentence and whose number of occurrences exceeds a preset threshold as a first phrase, wherein the first phrase only contains Chinese characters;
discarding the first phrase whose today's duty ratio is smaller than a first preset duty ratio threshold and/or whose today's word frequency is smaller than a first preset word frequency threshold and/or whose today's word frequency increase rate is smaller than a first preset increase rate threshold, to obtain a second phrase, wherein today's word frequency increase rate is the increase rate relative to the word frequency of the previous day;
clustering the second phrase to obtain a first phrase cluster;
discarding the first phrase cluster whose today's duty ratio is smaller than a second preset duty ratio threshold and/or whose today's word frequency is smaller than a second preset word frequency threshold and/or whose today's word frequency increase rate is smaller than a second preset increase rate threshold, to obtain a second phrase cluster;
and determining the phrase in the second phrase cluster as the target phrase.
2. The method of claim 1, wherein the determining, using a classification model, a target category corresponding to target to-be-processed information including the target phrase in the to-be-processed information includes:
inputting the target information to be processed into the classification model, wherein the classification model is obtained by training an initial classification model by using the phrase contained in the information to be processed as a training sample;
and outputting the target category corresponding to the target information to be processed.
3. The method according to claim 2, wherein before determining a target category corresponding to target to-be-processed information including the target phrase in the to-be-processed information using a classification model, the method further comprises:
and training the initial classification model by using the first target to-be-processed information with the determined category as a training sample, wherein the first target to-be-processed information comprises a phrase marked as a target event and a phrase not marked as the target event.
4. A method according to claim 3, wherein training the initial classification model using the first target pending information for which a class has been determined as a training sample comprises:
dividing the first target to-be-processed information with the determined category into a training data set, a verification data set and a test data set, wherein the training data set and the verification data set are used for training the initial classification model, and the test data set is used for testing the trained classification model;
segmenting the content sentences contained in the training data set and the verification data set into initial training phrases, and taking the initial training phrases with the occurrence frequency exceeding a preset threshold value as initial training samples, wherein the vector dimension of the initial training samples is the number of the initial training samples;
calculating semantic vector characterization of the initial training sample through a vector characterization algorithm;
inputting the vector dimension of the initial training sample and the semantic vector representation of the initial training sample into the initial classification model for training to obtain the classification model;
and testing the training result of the classification model through the test data set, and adjusting model parameters of the classification model.
5. A method according to claim 3, wherein after training the initial classification model using the first target pending information for which a class has been determined as a training sample, the method further comprises:
segmenting target content sentences in the target to-be-processed information into a plurality of target training phrases, wherein the target training phrases contain only Chinese characters and contain no stop words, and the stop words include at least interjections and/or pronouns and/or modal particles;
determining the target training phrases whose occurrence frequency exceeds a preset threshold as a bag of words;
merging the bag of words with the current training samples of the classification model to form target training samples;
training the classification model by using the target training sample, and adjusting model parameters of the classification model.
6. The method according to any one of claims 3 to 5, wherein after training the initial classification model using the first target information to be processed for which the class has been determined as a training sample, the method further comprises:
acquiring second target to-be-processed information determined in a time period from the last model training ending time to the current time, wherein the second target to-be-processed information comprises a phrase with occurrence times exceeding a preset time threshold value in a preset time period;
and combining the phrase contained in the information to be processed of the second target into the current training sample of the classification model.
7. The method of claim 1, wherein discarding the first phrase whose today's duty ratio is smaller than a first preset duty ratio threshold comprises:
obtaining today's duty ratio of the current first phrase using the following formula:
P1 = exp{(log P/m)/log n}, wherein P represents the duty ratio of the first phrase on the previous day, and m and n are constants;
and determining the first phrase with the minimum today's duty ratio by comparing today's duty ratio of each first phrase, and discarding that first phrase.
8. The method of claim 1, wherein after discarding the first phrase having a today's duty cycle less than a first predetermined duty cycle threshold and/or a today's word frequency less than a first predetermined word frequency threshold and/or a today's word frequency increase rate less than a first predetermined increase rate threshold, the method further comprises:
obtaining the fluctuation coefficient of the first phrase in the current time period through the following formula:
wherein x' represents a fluctuation coefficient, x represents word frequency of the first phrase in the current time period, mu represents a word frequency mean value of the first phrase in the same time period of the previous day, and sigma represents a standard deviation of the word frequency of the first phrase in the same time period of the previous day;
and discarding the first phrase when the fluctuation coefficient is smaller than a preset fluctuation value.
9. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 8 by means of the computer program.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910713377.4A CN110458296B (en) 2019-08-02 2019-08-02 Method and device for marking target event, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910713377.4A CN110458296B (en) 2019-08-02 2019-08-02 Method and device for marking target event, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN110458296A CN110458296A (en) 2019-11-15
CN110458296B true CN110458296B (en) 2023-08-29

Family

ID=68484679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910713377.4A Active CN110458296B (en) 2019-08-02 2019-08-02 Method and device for marking target event, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN110458296B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178679A (en) * 2019-12-06 2020-05-19 中能瑞通(北京)科技有限公司 Phase identification method based on clustering algorithm and network search
CN111060325A (en) * 2019-12-13 2020-04-24 斑马网络技术有限公司 Test scene construction method and device, electronic equipment and storage medium
CN111782803B (en) * 2020-06-05 2024-06-18 京东科技控股股份有限公司 Work order processing method and device, electronic equipment and storage medium
CN113419210A (en) * 2021-06-09 2021-09-21 Oppo广东移动通信有限公司 Data processing method and device, electronic equipment and storage medium
CN113645439B (en) * 2021-06-22 2022-07-29 宿迁硅基智能科技有限公司 Event detection method and system, storage medium and electronic device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929977A (en) * 2012-10-16 2013-02-13 浙江大学 Event tracing method aiming at news website
CN106649274A (en) * 2016-12-27 2017-05-10 东华互联宜家数据服务有限公司 Text content tag labeling method and device
CN106682123A (en) * 2016-12-09 2017-05-17 北京锐安科技有限公司 Hot event acquiring method and device
CN108170692A (en) * 2016-12-07 2018-06-15 腾讯科技(深圳)有限公司 A kind of focus incident information processing method and device
CN108563655A (en) * 2017-12-28 2018-09-21 北京百度网讯科技有限公司 Text based event recognition method and device
CN108595519A (en) * 2018-03-26 2018-09-28 平安科技(深圳)有限公司 Focus incident sorting technique, device and storage medium
CN108763272A (en) * 2018-04-08 2018-11-06 平安科技(深圳)有限公司 A kind of event information analysis method, computer readable storage medium and terminal device
CN109271639A (en) * 2018-10-11 2019-01-25 南京中孚信息技术有限公司 Hot ticket finds method and device
CN109726289A (en) * 2018-12-29 2019-05-07 北京百度网讯科技有限公司 Event detecting method and device
CN109918505A (en) * 2019-02-26 2019-06-21 西安电子科技大学 A kind of network security incident visualization method based on text-processing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10331768B2 (en) * 2015-09-21 2019-06-25 Tata Consultancy Services Limited Tagging text snippets
CN107491534B (en) * 2017-08-22 2020-11-20 北京百度网讯科技有限公司 Information processing method and device

Also Published As

Publication number Publication date
CN110458296A (en) 2019-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant