CN115994175A - Information mining method and device for network language and electronic equipment - Google Patents


Publication number
CN115994175A
CN115994175A (application CN202211634781.0A)
Authority
CN
China
Prior art keywords
network
language
category
candidate object
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211634781.0A
Other languages
Chinese (zh)
Inventor
陈珺仪
谢奕
陈佳颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211634781.0A priority Critical patent/CN115994175A/en
Publication of CN115994175A publication Critical patent/CN115994175A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides an information mining method and apparatus for network language, and an electronic device, relating to the technical field of artificial intelligence, and in particular to natural language processing, deep learning and semantic analysis. The specific implementation scheme is as follows: acquiring a network language discussion set of a candidate object, and mining content semantic features and content time sequence features of the discussion set; acquiring behavior features of the candidate object based on the discussion set; predicting the probability that the candidate object executes a target event according to the content semantic features, the content time sequence features and the behavior features; and screening abnormal target objects from the candidate objects according to that probability. By mining the information of the network language, namely by obtaining the content semantic features, content time sequence features and behavior features of the network language discussion set and determining whether an abnormal target object exists according to the probability that the candidate object executes the target event, the method and apparatus improve the accuracy and reliability of determining abnormal target objects.

Description

Information mining method and device for network language and electronic equipment
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of natural language processing, deep learning and semantic analysis, and particularly relates to an information mining method and apparatus for network language, an electronic device and a storage medium.
Background
In the related art, network language information can be analyzed by modeling and scoring based on a Learning To Rank (LTR) approach. However, the LTR approach usually relies on a large amount of data for learning, suffers from insufficient cold-start data and offers poor interpretability. Network language information can also be analyzed by scoring a user's behavior characteristics against manually designed rules. However, the rule-based approach depends on manual configuration: the scores assigned to different intervals depend on business experience, manual tuning is required, and hidden relations among features are difficult to mine. How to improve the accuracy and reliability of network language information mining therefore becomes a problem to be solved.
Disclosure of Invention
The disclosure provides an information mining method, device, electronic equipment, storage medium and program product for network language.
According to a first aspect, there is provided a method for mining information for web-oriented speech, including: acquiring a network language discussion of a candidate object, and mining content semantic features and content time sequence features of the network language discussion; acquiring behavior characteristics of the candidate object based on the internet language discussion; predicting the probability of the candidate object executing a target event according to the content semantic features, the content time sequence features and the behavior features; and screening out the target objects with the abnormality from the candidate objects according to the probability of executing the target event by the candidate objects.
According to a second aspect, there is provided an information mining apparatus for network-oriented speech, comprising: the mining module is used for acquiring the internet language discussion of the candidate object and mining the content semantic features and the content time sequence features of the internet language discussion; the acquisition module is used for acquiring the behavior characteristics of the candidate objects based on the internet language discussion; the prediction module is used for predicting the probability of the candidate object to execute a target event according to the content semantic features, the content time sequence features and the behavior features; and the screening module is used for screening out the target objects with the abnormality from the candidate objects according to the probability of executing the target event by the candidate objects.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the network-oriented language information mining method of the first aspect of the present disclosure.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the network-oriented language information mining method according to the first aspect of the present disclosure.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the network-oriented language information mining method according to the first aspect of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flow diagram of a network-oriented language information mining method according to a first embodiment of the present disclosure;
fig. 2 is a flow diagram of a network-oriented language information mining method according to a second embodiment of the present disclosure;
fig. 3 is a flow diagram of a network-oriented language information mining method according to a third embodiment of the present disclosure;
fig. 4 is a flow diagram of a network-oriented language information mining method according to a fourth embodiment of the present disclosure;
fig. 5 is a flow diagram of a network-oriented language information mining method according to a fifth embodiment of the present disclosure;
fig. 6 is a flow diagram of a network-oriented language information mining method according to a sixth embodiment of the present disclosure;
fig. 7 is a flowchart of a network-oriented language information mining method according to a seventh embodiment of the present disclosure;
FIG. 8 is a flow diagram of a network-oriented language information mining method according to the present disclosure;
FIG. 9 is a block diagram of a network-oriented language information mining apparatus used to implement embodiments of the present disclosure;
fig. 10 is a block diagram of an electronic device used to implement the network-oriented language information mining method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics.
Deep Learning (DL) is a research direction in the field of Machine Learning (ML). It was introduced into machine learning to bring machine learning closer to its original goal, artificial intelligence. Deep learning learns the inherent laws and representation hierarchies of sample data, and the information obtained during this learning greatly helps the interpretation of data such as text, images and sound. Its ultimate goal is to give machines human-like analytical learning abilities, so that they can recognize text, image and sound data.
Semantic analysis is a method for analyzing semantic information in natural language. It goes beyond the grammatical level of lexical and syntactic analysis to the meaning contained in words, phrases, sentences and paragraphs, and aims to express the structure of the language through the semantic structure of its sentences.
An information mining method for network language according to an embodiment of the present disclosure is described below with reference to the accompanying drawings.
Fig. 1 is a flow diagram of a network-oriented language information mining method according to a first embodiment of the present disclosure.
As shown in fig. 1, the information mining method for network language according to the embodiment of the disclosure may specifically include the following steps:
s101, acquiring a network language discussion of the candidate object, and mining content semantic features and content time sequence features of the network language discussion.
Specifically, the execution body of the information mining method for network language according to the embodiments of the present disclosure may be the information mining device for network language provided by the embodiments of the present disclosure, where the information mining device for network language may be a hardware device with data information processing capability and/or software necessary for driving the hardware device to work. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
It should be noted that, the specific manner of acquiring the network arguments of the candidate objects is not limited in this disclosure, and may be selected according to practical situations.
Alternatively, the web language discussion may be obtained by collecting web language comments that the user generates during use of the different applications.
Alternatively, the time at which the network utterance collection is performed may be set.
For example, the network utterances generated by a user while using different applications in the last week may be collected to obtain a network language discussion set, or the network utterances generated in the last month may be collected to obtain a network language discussion set.
In the embodiment of the disclosure, after the internet talk collection is acquired, the internet talk in the internet talk collection can be preprocessed to obtain the preprocessed internet talk, and the content semantic features and the content time sequence features of the preprocessed internet talk are mined.
For example, web language text often contains high-frequency words with no practical meaning, which are useless for mining the content semantic features and content time sequence features of the network language discussion set. Such unnecessary words and spoken filler words, for example particles such as "bar", "o" and "hiccup", can be removed to reduce the noise of the network language text data set.
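As an illustration of this preprocessing step, the following is a minimal Python sketch; the filler-word list and example utterances are assumptions for demonstration and are not taken from the disclosure.

```python
# Minimal preprocessing sketch: strip high-frequency filler/spoken words from utterances.
# FILLER_WORDS is a hypothetical hand-picked list; the disclosure only states that
# meaningless high-frequency spoken words are removed to reduce noise.
FILLER_WORDS = ["吧", "哦", "嗝", "啊", "呢"]

def preprocess(utterance: str) -> str:
    for word in FILLER_WORDS:
        utterance = utterance.replace(word, "")
    return utterance.strip()

utterances = ["这个商品真不错吧", "哦 我再看看呢"]
print([preprocess(u) for u in utterances])  # filler particles removed
```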
It should be noted that if the analysis only counts how many words in a text hit a sensitive vocabulary, without examining the meaning and context of the whole sentence, the analysis result may be wrong. In addition, the network utterances published by a user are often strongly time-sensitive: the more recent an utterance is, the better it reflects the user's current psychological state and behavioral tendency. Therefore, after the network language discussion set is acquired, the content semantic features and content time sequence features of the discussion set can be mined.
S102, based on the internet language discussion, obtaining the behavior characteristics of the candidate object.
In the embodiment of the disclosure, after the internet language discussion set is acquired, the behavior characteristics of the candidate object can be acquired through calculation based on the internet language discussion set.
Alternatively, feature construction may be performed on the search behavior information of the candidate object based on the internet language discussion, so as to obtain the behavior feature of the candidate object.
Alternatively, the behavior characteristics of the candidate object may be a search frequency, an activity day, a last activity time interval, a duration, a number of hits on a particular keyword behavior label, and so on.
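As an illustration, the sketch below constructs several of these behavior features from a user's search records; the record fields (timestamp, query) and the keyword list are hypothetical and only stand in for whatever search-behavior information is actually available.

```python
# Sketch of behavior-feature construction from per-user search records.
# Field names and the keyword list are illustrative assumptions; the disclosure only
# names features such as search frequency, active days, time since last activity,
# duration and the number of hits on particular keywords.
from datetime import datetime

def behavior_features(records, now, keywords):
    active_days = {r["timestamp"].date() for r in records}
    first = min(r["timestamp"] for r in records)
    last = max(r["timestamp"] for r in records)
    return {
        "search_frequency": len(records),
        "active_days": len(active_days),
        "hours_since_last_activity": (now - last).total_seconds() / 3600,
        "duration_days": (last - first).days,
        "keyword_hits": sum(any(k in r["query"] for k in keywords) for r in records),
    }

records = [
    {"timestamp": datetime(2022, 12, 1, 9), "query": "where to buy commodity A"},
    {"timestamp": datetime(2022, 12, 3, 21), "query": "commodity A price"},
]
print(behavior_features(records, datetime(2022, 12, 5), ["commodity A"]))
```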
S103, predicting the probability of the candidate object executing the target event according to the content semantic features, the content time sequence features and the behavior features.
It should be noted that, in the present disclosure, a specific manner of predicting the probability of the candidate object executing the target event according to the content semantic feature, the content timing feature and the behavior feature is not limited, and may be selected according to the actual situation.
Alternatively, scoring may be performed based on a Gradient Boosting Decision Tree (GBDT) model: the model outputs a score and, based on the score, produces the predicted probability that the candidate object will execute the target event.
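A minimal sketch of such GBDT-based scoring is shown below, assuming scikit-learn's GradientBoostingClassifier and randomly generated toy features; the disclosure names a GBDT model but does not fix a library, feature dimensionality or hyperparameters.

```python
# Hedged GBDT scoring sketch: train on labeled feature vectors, then read the
# predicted probability of the positive class as the probability of executing
# the target event. Features and labels here are synthetic toy data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))                 # e.g. [behavior, 1st, 2nd, 3rd quantized feature]
y_train = (X_train[:, 0] + X_train[:, 3] > 0).astype(int)

model = GradientBoostingClassifier().fit(X_train, y_train)
x_candidate = rng.normal(size=(1, 4))
prob = model.predict_proba(x_candidate)[0, 1]       # probability of executing the target event
print(round(float(prob), 3))
```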
It should be noted that, the setting of the target event is not limited in this disclosure, and may be set according to actual situations.
S104, screening out the target objects with the abnormality from the candidate objects according to the probability of executing the target event by the candidate objects.
Alternatively, a probability threshold value of the candidate object to execute the target event may be set, and the target object having the abnormality may be screened from the candidate objects according to the probability of the candidate object to execute the target event and the probability threshold value.
For example, when the probability of the candidate object executing the target event is greater than or equal to the probability threshold, then the candidate object may be determined to be the target object; when the probability of the candidate object executing the target event is smaller than the probability threshold, the candidate object may be determined to be an object in which no abnormality exists.
Alternatively, a probability threshold for the candidate object to execute the target event may be set, the probability of the candidate object to execute the target event may be ranked, and the target object with the abnormality may be screened from the candidate objects according to the probability of the candidate object to execute the target event, the probability threshold, and the ranking result.
Further, after the target objects with the abnormality are screened out, the target objects can be pushed in sequence.
Alternatively, the candidate objects may be ranked in descending order of their probability of executing the target event, and the higher-ranked target objects may be pushed in turn according to the ranking result.
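The screening and pushing steps can be combined as in the sketch below; the probability threshold of 0.8 and the candidate identifiers are illustrative assumptions.

```python
# Screening sketch: keep candidates whose predicted probability of executing the
# target event reaches the threshold, then rank them in descending order for pushing.
def screen_targets(candidate_probs, threshold=0.8):
    targets = [(cid, p) for cid, p in candidate_probs.items() if p >= threshold]
    return sorted(targets, key=lambda t: t[1], reverse=True)

probs = {"candidate_1": 0.92, "candidate_2": 0.35, "candidate_3": 0.81}
print(screen_targets(probs))  # [('candidate_1', 0.92), ('candidate_3', 0.81)]
```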
In summary, according to the information mining method for network language in the embodiment of the disclosure, by acquiring a network language discussion of a candidate object, mining content semantic features and content time sequence features of the network language discussion, acquiring behavior features of the candidate object based on the network language discussion, predicting probability of executing a target event by the candidate object according to the content semantic features, the content time sequence features and the behavior features, and screening out abnormal target objects from the candidate object according to the probability of executing the target event by the candidate object. According to the method and the device, the information of the network language is mined, namely, the content semantic features, the content time sequence features and the behavior features of the network language discussion are obtained, whether an abnormal target object exists or not is determined according to the probability of executing the target event by the candidate object, and the accuracy and the reliability of determining the abnormal target object are improved.
Fig. 2 is a flow diagram of a network-oriented language information mining method according to a second embodiment of the present disclosure.
As shown in fig. 2, on the basis of the embodiment shown in fig. 1, the information mining method for network language according to the embodiment of the disclosure may specifically include the following steps:
s201, acquiring a network language discussion of the candidate object, and mining content semantic features and content time sequence features of the network language discussion.
S202, based on the internet language discussion, obtaining the behavior characteristics of the candidate object.
Specifically, steps S201 to S202 in this embodiment are the same as steps S101 to S102 in the above embodiment, and will not be described here again.
Step S103 "predicting the probability of the candidate object executing the target event based on the content semantic feature, the content timing feature, and the behavior feature" in the above embodiment may specifically include the following steps S203 to S205.
S203, determining a first quantized feature and a second quantized feature of the candidate object tendency execution target event according to the mined content semantic features.
As a possible implementation, as shown in fig. 3, on the basis of the foregoing embodiment, the specific process in step S203 of determining the first quantized feature and the second quantized feature of the candidate object's tendency to execute the target event according to the mined content semantic features includes the following steps:
S301, determining the category confidence level of each trend category of the candidate object trend execution target event according to the mined content semantic features, wherein the trend categories are divided according to the degree of the candidate object trend execution target event.
For example, when the target event is that the candidate object purchases commodity A, category confidences are constructed under each tendency category for this business scenario. Taking the target event "the candidate object purchases commodity A" as an example: a "high" tendency to execute the target event means the utterance clearly reflects an intention to purchase commodity A, typically contains words related to "purchase", and expresses a strong purchase intention; a "medium" tendency means it cannot be fully determined whether the candidate object is inclined to purchase commodity A (for example, commodity A may already have been purchased); and a "low" tendency means an intention to purchase commodity A can be completely excluded.
Further, the degree of tendency to execute the target event may be labeled: "high" is labeled 2, "medium" is labeled 1 and "low" is labeled 0. The labeled data may then be divided into a training set and a verification set, for example with 70% of the labeled data used as the training set and 30% as the verification set.
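A small sketch of this labeling and 70/30 split is given below; the example utterances are invented, and scikit-learn's train_test_split is used only as one convenient way to perform the split.

```python
# Sketch of the 70/30 train/verification split over labeled utterances.
# Labels follow the disclosure's scheme: high = 2, medium = 1, low = 0.
from sklearn.model_selection import train_test_split

texts = ["really want to buy commodity A", "still comparing prices", "not interested at all"] * 10
labels = [2, 1, 0] * 10
X_train, X_val, y_train, y_val = train_test_split(texts, labels, test_size=0.3, random_state=42)
print(len(X_train), len(X_val))  # 21 9
```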
Alternatively, the category confidence of each tendency category of the candidate object's tendency to execute the target event, namely the category confidences for the high (2), medium (1) and low (0) categories, may be determined according to the following formula:
vector[confidence]_{0,1,2} = softmax(W^T · x + b)
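As an illustration of this step, the numpy sketch below applies a linear layer followed by softmax to a semantic feature vector to obtain the three category confidences; the weights W, bias b and feature vector x are random placeholders rather than trained parameters.

```python
# Sketch of the category-confidence computation: softmax(W^T x + b) over the
# three tendency categories low(0), medium(1), high(2). All values are illustrative.
import numpy as np

def category_confidence(x, W, b):
    scores = W.T @ x + b              # one score per tendency category
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()            # softmax -> confidence vector, sums to 1

rng = np.random.default_rng(0)
x = rng.normal(size=8)                # semantic feature vector of one utterance
W = rng.normal(size=(8, 3))
b = np.zeros(3)
print(category_confidence(x, W, b))
```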
s302, determining a first quantization characteristic and a second quantization characteristic according to the category confidence of the network language under different trend categories.
As a possible implementation manner, as shown in fig. 4, on the basis of the foregoing embodiment, the specific process of determining the first quantization feature according to the category confidence of the network language under different trend categories in the step S302 includes the following steps:
s401, obtaining the category confidence of each network speaker in the network speaker set under the target trend category.
It should be noted that, the category confidence of each network speaker in the network speaker set under the target tendency category, that is, the confidence of the category under the three tendency categories of high (2) -medium (1) -low (0), may be obtained.
S402, counting a first number of category confidence degrees which are larger than a set confidence degree threshold from the category confidence degrees under the target trend category.
It should be noted that utterances whose "high" or "low" category confidence is large have the greatest influence on the judgment of abnormal target objects, whereas "medium" utterances are more ambiguous and make abnormal target objects difficult to distinguish. Therefore, from the category confidences under the target tendency categories, a first number may be counted of the "high" and "low" category confidences that are greater than the set confidence threshold.
S403, determining a first total number of network speakers in the network speaker set.
In the embodiment of the disclosure, after the internet talk collection is acquired, a first total number of internet talk in the internet talk collection may be determined.
And S404, determining a high confidence ratio as the first quantized feature according to the first number and the first total number of network utterances.
Optionally, after the first number and the first total number of network utterances are obtained, the high confidence ratio may be determined according to the following formula:
high confidence ratio = first number / first total number
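A direct sketch of this first quantized feature follows; the confidence threshold of 0.9 and the example confidence triples are illustrative assumptions.

```python
# First quantized feature sketch: the share of utterances whose "high" or "low"
# category confidence exceeds the set threshold, over all utterances in the set.
def high_confidence_ratio(confidences, threshold=0.9):
    # confidences: one (low, medium, high) confidence triple per network utterance
    first_number = sum(1 for low, _, high in confidences if high > threshold or low > threshold)
    return first_number / len(confidences)

conf = [(0.05, 0.03, 0.92), (0.30, 0.40, 0.30), (0.95, 0.03, 0.02)]
print(round(high_confidence_ratio(conf), 3))  # 2 of 3 utterances -> 0.667
```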
as a possible implementation manner, as shown in fig. 5, based on the foregoing embodiment, the specific process of determining the second quantization feature according to the category confidence of the network language under the different trend categories in the step S302 includes the following steps:
s501, determining the difference value between the category confidence degrees of the pairwise trend categories of the network language.
It should be noted that, after the category confidence degrees under different trend categories are obtained, the difference between the category confidence degrees of the pairwise trend categories of the network language can be determined through calculation.
S502, obtaining a first total number of network speakers in the network speaker set.
And S503, determining a confidence mean difference as the second quantized feature according to the difference values of the network utterances and the first total number.
Optionally, after the difference values and the first total number of network utterances are obtained, the confidence mean difference may be determined according to the following formula:
confidence mean difference = (sum of the pairwise category-confidence differences of the network utterances) / first total number
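The sketch below follows one plausible reading of this second quantized feature: sum the pairwise differences between the three category confidences of each utterance, then average over the utterance set; the example confidences are invented.

```python
# Second quantized feature sketch (one plausible reading): mean, over all utterances,
# of the summed pairwise differences between their category confidences.
from itertools import combinations

def confidence_mean_difference(confidences):
    total = 0.0
    for triple in confidences:                       # (low, medium, high) per utterance
        total += sum(abs(a - b) for a, b in combinations(triple, 2))
    return total / len(confidences)

conf = [(0.05, 0.03, 0.92), (0.30, 0.40, 0.30)]
print(round(confidence_mean_difference(conf), 3))
```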
s204, determining a third quantization characteristic of the candidate object trend execution target event according to the mined content time sequence characteristic.
As a possible implementation manner, as shown in fig. 6, based on the foregoing embodiment, the specific process of determining, in the step S204, that the candidate object tends to perform the third quantization characteristic of the target event according to the mined content time sequence characteristic includes the following steps:
s601, sorting and tail interception are carried out on the network language discussion according to the time sequence, and a network language sequence is obtained.
Optionally, the network language discussion set may be ordered in time sequence, and the last one sixth of the ordered sequence (its tail) may be intercepted to obtain the network utterance sequence.
S602, obtaining the category confidence of each network speaker in the network speaker sequence under each category.
For example, a network utterance sequence may be [2, 2, 0, 0, 0, 0], where each value is the category with the highest category confidence for the corresponding utterance (2 corresponding to the "high" category and 0 to the "low" category).
And S603, determining the duty ratio of the network language of the low-tendency category in the network language sequence as a third quantization characteristic of the candidate object according to the category confidence degree of each category of the network language.
As a possible implementation manner, as shown in fig. 7, based on the foregoing embodiment, the specific process of determining the duty ratio of the low-tendency speech in the network speech sequence according to the category confidence in the foregoing step S603 includes the following steps:
s701, selecting the category with the highest category confidence in all categories as the recognition tendency type of the network language.
Optionally, the network utterances may be ranked in time sequence to obtain a sequence within a period of time, and a category with the highest category confidence level in all the categories is selected as the recognition tendency type of the network utterances. I.e. the category with the highest confidence level can be taken as the sequence value:
val_i = argmax(vector[confidence])
s702, counting a second number of network speakers with low tendency types identified in the network speaker sequence.
In embodiments of the present disclosure, after the network utterance sequence is obtained, a second number of network utterances in the network utterance sequence for which the recognition tendency type is a low tendency type may be counted.
S703, obtaining a second total number of network utterances in the network utterance sequence.
In the embodiment of the disclosure, after the network utterance sequence is acquired, the network utterances in the sequence may be counted to obtain the second total number of network utterances in the network utterance sequence.
S704, determining a duty ratio of the low-tendency utterances in the network utterance sequence according to the second number and the second total number of the low-tendency-type network utterances.
Optionally, after the second number and the second total number are obtained, the ratio of low-tendency utterances in the network utterance sequence may be determined according to the following formula:
ratio of low-tendency utterances = second number / second total number
For example, for a network utterance sequence [2, 2, 0, 0, 0, 0] that is already ordered by time, the recent-time-point parameter is set to 1/6, so the tail 1/6 of the sequence is taken as the most recent sub-sequence; the ratio of low-tendency utterances in that sub-sequence is calculated to be 1, and the output value is 1.
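Putting the temporal steps together, the sketch below orders the per-utterance argmax categories by time, keeps the most recent tail (1/6 here, as in the example above), and computes the share of low-tendency utterances in that tail; the tail fraction is taken from the worked example and is otherwise an adjustable parameter.

```python
# Third quantized feature sketch: ratio of low-tendency (category 0) utterances in the
# most recent tail of the time-ordered sequence of argmax categories.
import math

def low_tendency_ratio(categories_by_time, tail_fraction=1 / 6):
    # categories_by_time: argmax tendency category (0/1/2) of each utterance, in time order
    tail_len = max(1, math.ceil(len(categories_by_time) * tail_fraction))
    tail = categories_by_time[-tail_len:]
    return sum(1 for c in tail if c == 0) / len(tail)

print(low_tendency_ratio([2, 2, 0, 0, 0, 0]))  # tail is [0] -> 1.0
```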
S205, predicting the probability of the execution target event of the candidate object according to the behavior feature, the first quantization feature, the second quantization feature and the third quantization feature, and obtaining the probability of the execution target event of the candidate object.
Optionally, after the behavior feature, the first quantized feature, the second quantized feature and the third quantized feature are obtained, a feature vector may be constructed from them and used as the input of a gradient boosting decision tree model, so that the probability of the candidate object executing the target event is predicted by the gradient boosting decision tree model and obtained as the output.
S206, screening out the target objects with the abnormality from the candidate objects according to the probability of the candidate objects executing the target event.
Specifically, step S206 in this embodiment is the same as step S104 in the above embodiment, and will not be described here again.
In summary, according to the information mining method for network language in the embodiment of the disclosure, by acquiring a network language discussion of a candidate object, mining content semantic features and content time sequence features of the network language discussion, acquiring behavior features of the candidate object based on the network language discussion, determining a first quantization feature and a second quantization feature of a candidate object tendency execution target event according to the mined content semantic features, determining a third quantization feature of the candidate object tendency execution target event according to the mined content time sequence features, predicting the probability of the candidate object execution target event according to the behavior features, the first quantization feature, the second quantization feature and the third quantization feature, obtaining the probability of the candidate object execution target event, and screening out the abnormal target object from the candidate object according to the probability of the candidate object execution target event. Therefore, the method and the device for determining the abnormal target object improve accuracy and reliability of determining the abnormal target object by mining the information of the network language, namely acquiring the behavior feature, the first quantization feature, the second quantization feature and the third quantization feature of the network language discussion and determining whether the abnormal target object exists according to the probability of executing the target event by the candidate object.
The following explains the information mining method for network language.
For example, as shown in fig. 8, the text corresponding to the network utterances is first input and preprocessed, for example by word segmentation and stop-word removal. The content semantic features are then mined, namely the first quantized feature (the high confidence ratio) and the second quantized feature (the confidence mean difference); the third quantized feature of the candidate object's tendency to execute the target event is determined according to the mined content time sequence features; and the behavior features of the candidate object are obtained. A feature vector is constructed from the behavior feature and the first, second and third quantized features, the probability of the candidate object executing the target event is predicted and obtained, and the abnormal target objects are then screened from the candidate objects according to the probability of the candidate objects executing the target event.
In summary, according to the information mining method for network language in the embodiment of the disclosure, by acquiring a network language discussion of a candidate object, mining content semantic features and content time sequence features of the network language discussion, acquiring behavior features of the candidate object based on the network language discussion, determining a first quantization feature and a second quantization feature of a candidate object tendency execution target event according to the mined content semantic features, determining a third quantization feature of the candidate object tendency execution target event according to the mined content time sequence features, predicting the probability of the candidate object execution target event according to the behavior features, the first quantization feature, the second quantization feature and the third quantization feature, obtaining the probability of the candidate object execution target event, and screening out the abnormal target object from the candidate object according to the probability of the candidate object execution target event. Therefore, the method and the device for determining the abnormal target object have the advantages that through mining the information of the network language, namely, the behavior feature, the first quantization feature, the second quantization feature and the third quantization feature of the network language discussion are obtained, and according to the probability of executing the target event by the candidate object, whether the abnormal target object exists is determined, the accuracy and the reliability of determining the abnormal target object are improved, and the forward feedback rate of a user and the hit rate of judging the target object are improved.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, application, etc. of the related personal information of the user all conform to the rules of the related laws and regulations, and do not violate the popular regulations of the public order.
Fig. 9 is a schematic structural diagram of an information mining apparatus for network-oriented language according to an embodiment of the present disclosure.
As shown in fig. 9, the information mining apparatus 900 for network-oriented language includes: a mining module 910, an acquisition module 920, a prediction module 930, and a screening module 940. Wherein:
the mining module 910 is configured to obtain a web language discussion set of the candidate object, and mine content semantic features and content timing features of the web language discussion set;
an obtaining module 920, configured to obtain a behavioral characteristic of the candidate object based on the internet language discussion;
a prediction module 930, configured to predict a probability of the candidate object executing a target event according to the content semantic feature, the content timing feature, and the behavior feature;
and a screening module 940, configured to screen, according to the probability that the candidate object executes the target event, a target object with an exception from the candidate objects.
Wherein, prediction module 930 is further configured to:
determining a first quantized feature and a second quantized feature of the candidate object tendency execution target event according to the mined content semantic features;
Determining a third quantized feature of the candidate object tendency execution target event according to the mined content time sequence feature;
and predicting the probability of the execution target event of the candidate object according to the behavior feature, the first quantization feature, the second quantization feature and the third quantization feature to obtain the probability of the execution target event of the candidate object.
Wherein, prediction module 930 is further configured to:
determining the category confidence level of each trend category of the candidate object trend execution target event according to the mined content semantic features, wherein the trend categories are divided according to the degree of the candidate object trend execution target event;
and determining the first quantization characteristic and the second quantization characteristic according to the category confidence of the network language under different trend categories.
Wherein, prediction module 930 is further configured to:
acquiring the category confidence level of each network speaker in the network speaker set under the target trend category;
counting a first number of category confidence degrees greater than the set confidence degree threshold from the category confidence degrees under the target trend category;
determining a first total number of network speakers in the set of network speakers;
A high confidence duty cycle is determined as the first quantization feature from the first number and a first total number of the network utterances.
Wherein, prediction module 930 is further configured to:
determining a difference between the class confidence levels of the pairwise trend classes of the network utterances;
acquiring a first total number of network speakers in the network speaker set;
and determining confidence average difference as the second quantization characteristic according to the difference value of the target network language and the first total quantity.
Wherein, prediction module 930 is further configured to:
sequencing and tail interception are carried out on the network language discussion set according to a time sequence to obtain a network language theory sequence;
acquiring category confidence degrees of all network speakers under each category in the network speaker sequence;
and determining the duty ratio of the network language of the low-tendency category in the network language sequence as the third quantization characteristic of the candidate object according to the category confidence degree of each category of the network language.
Wherein, prediction module 930 is further configured to:
selecting the category with the highest category confidence degree from all categories as the identification tendency type of the network language;
counting a second number of network utterances in the sequence of network utterances for which the identified trend type is a low trend type;
Obtaining a second total number of network utterances in the sequence of network utterances;
and determining the duty ratio of the low-tendency speakers in the network speaker sequence according to the second number of the low-tendency network speakers and the second total number.
Wherein, the apparatus 900 is further configured to:
preprocessing the network language in the network language set to obtain a preprocessed network language, and performing feature mining on the preprocessed network language.
It should be noted that, the explanation of the embodiment of the information mining method facing the network language is also applicable to the information mining device facing the network language in the embodiment of the disclosure, and the specific process is not repeated here.
In summary, the information mining device for network language according to the embodiments of the present disclosure obtains the internet language discussion of the candidate object, and mines the content semantic feature and the content time sequence feature of the internet language discussion, obtains the behavior feature of the candidate object based on the internet language discussion, predicts the probability of executing the target event by the candidate object according to the content semantic feature, the content time sequence feature and the behavior feature, and screens out the abnormal target object from the candidate object according to the probability of executing the target event by the candidate object. According to the method and the device, the information of the network language is mined, namely, the content semantic features, the content time sequence features and the behavior features of the network language discussion are obtained, whether an abnormal target object exists or not is determined according to the probability of executing the target event by the candidate object, and the accuracy and the reliability of determining the abnormal target object are improved.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002 and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Various components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as the network-oriented language information mining method. For example, in some embodiments, the network-oriented language information mining method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the network-oriented language information mining method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the network-oriented language information mining method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements a network-oriented language information mining method as described above.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. The information mining method for the network language comprises the following steps:
acquiring a network language discussion of a candidate object, and mining content semantic features and content time sequence features of the network language discussion;
Acquiring behavior characteristics of the candidate object based on the internet language discussion;
predicting the probability of the candidate object executing a target event according to the content semantic features, the content time sequence features and the behavior features;
and screening out the target objects with the abnormality from the candidate objects according to the probability of executing the target event by the candidate objects.
2. The method of claim 1, wherein the predicting the probability of the candidate object performing a target event based on the content semantic features, the content temporal features, and the behavioral features comprises:
determining a first quantized feature and a second quantized feature of the candidate object tendency execution target event according to the mined content semantic features;
determining a third quantized feature of the candidate object tendency execution target event according to the mined content time sequence feature;
and predicting the probability of the execution target event of the candidate object according to the behavior feature, the first quantization feature, the second quantization feature and the third quantization feature to obtain the probability of the execution target event of the candidate object.
3. The method of claim 2, wherein the determining the candidate object propensity to perform the first and second quantized features of the target event based on the mined content semantic features comprises:
Determining the category confidence level of each trend category of the candidate object trend execution target event according to the mined content semantic features, wherein the trend categories are divided according to the degree of the candidate object trend execution target event;
and determining the first quantization characteristic and the second quantization characteristic according to the category confidence of the network language under different trend categories.
4. A method according to claim 3, wherein determining the first quantified characteristic from class confidence of the network utterance under different trending classes comprises:
acquiring the category confidence level of each network speaker in the network speaker set under the target trend category;
counting a first number of category confidence degrees greater than the set confidence degree threshold from the category confidence degrees under the target trend category;
determining a first total number of network speakers in the set of network speakers;
a high confidence duty cycle is determined as the first quantization feature from the first number and a first total number of the network utterances.
5. A method according to claim 3, wherein determining the second quantified characteristic from class confidence of the network utterance under different trending classes comprises:
Determining a difference between the class confidence levels of the pairwise trend classes of the network utterances;
acquiring a first total number of network speakers in the network speaker set;
and determining confidence average difference as the second quantization characteristic according to the difference value of the target network language and the first total quantity.
6. The method of any of claims 1-5, wherein the determining a third quantified characteristic of the candidate object propensity to execute a target event from the mined content timing characteristic comprises:
sequencing and tail interception are carried out on the network language discussion set according to a time sequence to obtain a network language theory sequence;
acquiring category confidence degrees of all network speakers under each category in the network speaker sequence;
and determining the duty ratio of the network language of the low-tendency category in the network language sequence as the third quantization characteristic of the candidate object according to the category confidence degree of each category of the network language.
7. The method of claim 6, wherein the determining the duty cycle of the low-propensity speaker in the sequence of network speakers according to the category confidence comprises:
selecting the category with the highest category confidence degree from all categories as the identification tendency type of the network language;
Counting a second number of network utterances in the sequence of network utterances for which the identified trend type is a low trend type;
obtaining a second total number of network utterances in the sequence of network utterances;
and determining the duty ratio of the low-tendency speakers in the network speaker sequence according to the second number of the low-tendency network speakers and the second total number.
8. The method according to claim 1 or 2, wherein the method further comprises:
preprocessing the network utterances in the set of network utterances to obtain preprocessed network utterances, and performing feature mining on the preprocessed network utterances.
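Claim 8 leaves the preprocessing unspecified; a typical cleanup pass might strip URLs, user mentions and redundant whitespace before feature mining. The rules below are illustrative assumptions rather than the claimed procedure.

```python
import re

def preprocess_utterance(text):
    """Illustrative preprocessing of a single network utterance."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    text = re.sub(r"\s+", " ", text)            # collapse whitespace
    return text.strip()
```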
9. An information mining apparatus for network language, comprising:
a mining module, configured to acquire the network language discussion of a candidate object and mine content semantic features and content time sequence features of the network language discussion;
an acquisition module, configured to acquire behavior features of the candidate object based on the network language discussion;
a prediction module, configured to predict, according to the content semantic features, the content time sequence features and the behavior features, a probability that the candidate object executes a target event;
and a screening module, configured to screen out an abnormal target object from the candidate objects according to the probability that the candidate object executes the target event.
10. The apparatus of claim 9, wherein the prediction module is further configured to:
determine, according to the mined content semantic features, a first quantized feature and a second quantized feature of the tendency of the candidate object to execute the target event;
determine, according to the mined content time sequence features, a third quantized feature of the tendency of the candidate object to execute the target event;
and predict, according to the behavior features, the first quantized feature, the second quantized feature and the third quantized feature, the probability that the candidate object executes the target event.
11. The apparatus of claim 10, wherein the prediction module is further configured to:
determine, according to the mined content semantic features, a category confidence of each tendency category of the tendency of the candidate object to execute the target event, wherein the tendency categories are divided according to the degree to which the candidate object tends to execute the target event;
and determine the first quantized feature and the second quantized feature according to the category confidences of the network utterances under the different tendency categories.
12. The apparatus of claim 11, wherein the prediction module is further configured to:
acquire, for each network utterance in the set of network utterances, the category confidence under a target tendency category;
count, among the category confidences under the target tendency category, a first number of category confidences greater than a set confidence threshold;
determine a first total number of network utterances in the set of network utterances;
and determine, according to the first number and the first total number, a high-confidence proportion of the network utterances as the first quantized feature.
13. The apparatus of claim 11, wherein the prediction module is further configured to:
determine, for each network utterance, a difference between the category confidences of each pair of tendency categories;
acquire a first total number of network utterances in the set of network utterances;
and determine, according to the differences of the target network utterances and the first total number, an average confidence difference as the second quantized feature.
14. The apparatus according to any one of claims 9-13, wherein the prediction module is further configured to:
sort the set of network utterances in time order and perform tail truncation to obtain a sequence of network utterances;
acquire, for each network utterance in the sequence of network utterances, the category confidence under each tendency category;
and determine, according to the category confidences of the network utterances under each tendency category, the proportion of low-tendency-category network utterances in the sequence of network utterances as the third quantized feature of the candidate object.
15. The apparatus of claim 14, wherein the prediction module is further configured to:
select, for each network utterance, the category with the highest category confidence among all the categories as the identified tendency type of the network utterance;
count a second number of network utterances in the sequence of network utterances whose identified tendency type is the low-tendency type;
obtain a second total number of network utterances in the sequence of network utterances;
and determine the proportion of low-tendency-category network utterances in the sequence of network utterances according to the second number and the second total number.
16. The apparatus of claim 9 or 10, wherein the apparatus is further configured to:
preprocess the network utterances in the set of network utterances to obtain preprocessed network utterances, and perform feature mining on the preprocessed network utterances.
17. An electronic device comprising a processor and a memory;
wherein the processor reads executable program code stored in the memory and runs a program corresponding to the executable program code, so as to implement the method according to any one of claims 1-8.
18. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-8.
19. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202211634781.0A 2022-12-19 2022-12-19 Information mining method and device for network language and electronic equipment Pending CN115994175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211634781.0A CN115994175A (en) 2022-12-19 2022-12-19 Information mining method and device for network language and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211634781.0A CN115994175A (en) 2022-12-19 2022-12-19 Information mining method and device for network language and electronic equipment

Publications (1)

Publication Number Publication Date
CN115994175A true CN115994175A (en) 2023-04-21

Family

ID=85989786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211634781.0A Pending CN115994175A (en) 2022-12-19 2022-12-19 Information mining method and device for network language and electronic equipment

Country Status (1)

Country Link
CN (1) CN115994175A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination