CN116431884A - Method, system, computing device and storage medium for auditing link short messages
- Publication number
- CN116431884A (application number CN202310416446.1A)
- Authority
- CN
- China
- Prior art keywords
- link
- short message
- vector
- text
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/02—Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Geometry (AREA)
- Machine Translation (AREA)
Abstract
The application discloses a method, a system, a computing device and a storage medium for auditing a link short message. The auditing method comprises the following steps: generating a first vector representing user behavior characteristics based on user behavior data recorded when a user sends a link short message; processing the link short message to generate text information and link information, wherein the link information comprises image data and text data; processing the text information and the text data in the link information, respectively, to generate a second vector and a third vector representing text characteristics of the short message; and inputting the image data, the first vector, the second vector and the third vector into an auditing model, which outputs the predicted probability that the link short message passes the audit. This scheme can improve auditing accuracy.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, a system, a computing device, and a storage medium for auditing a link short message.
Background
The automatic short message auditing means that when a user sends a short message through a short message platform, a machine automatically judges whether the content of the short message can pass the auditing. Only the short message passing the audit can be sent through the short message platform, so that the safety and reliability of the short message sent through the platform are ensured.
The link short message (also called service information or push information) is a short message in a special format whose text contains a hyperlink. A link short message delivers the link of a site or service to a mobile phone that supports the link short message function, so that the user can access the service directly by reading the short message and opening the link in it. The link short message thus combines the short message with the linked service, saves the time spent searching for the service, and makes it convenient for the user to find and use a preferred service. A link short message contains more data than an ordinary short message, yet existing automatic short message auditing mainly audits only the short message text, a single source of information.
Therefore, a new scheme is needed to audit the content of the link short message so as to improve the accuracy of the link short message audit.
Disclosure of Invention
The present application provides a method, system, computing device, and storage medium for auditing link short messages, in an attempt to solve or at least alleviate at least one of the problems presented above.
According to one aspect of the present application, there is provided an auditing method for a link short message, including: generating a first vector representing user behavior characteristics based on user behavior data recorded when a user sends a link short message; processing the link short message to generate text information and link information, wherein the link information comprises image data and text data; processing the text information and the text data in the link information, respectively, to generate a second vector and a third vector representing text characteristics of the short message; and inputting the image data, the first vector, the second vector and the third vector into an auditing model, which outputs the predicted probability that the link short message passes the audit.
Optionally, in a method according to the present application, the audit model includes a convolutional neural network component, an attention mechanism component, and a fusion component that are coupled to one another, and inputting the image data and the first, second, and third vectors into the audit model includes: inputting the image data into the convolutional neural network component to generate a fourth vector representing the image characteristics of the short message; inputting the fourth vector and the first, second and third vectors into the attention mechanism component to obtain an attention value for each of the four vectors; and inputting the attention values into the fusion component to generate the predicted probability that the link short message passes the audit.
Optionally, in the method according to the present application, generating a first vector representing user behavior characteristics based on user behavior data recorded when the user sends the link short message includes: acquiring the user behavior data and recording the feature values corresponding to the behavior data, wherein the user behavior data comprises: the duration of editing the short message, the time of sending the short message, the number of short messages sent, and the user login time; and encoding the feature values to generate the first vector.
Optionally, in the method according to the present application, the processing of the link short message to generate text information and link information includes: replacing non-text characters in the link short message to obtain text information; and acquiring the webpage pointed by the link short message as link information.
Optionally, in the method according to the present application, the processing of the link short message to generate text information and link information further includes: acquiring an image corresponding to a webpage as image data of link information; text information in the image is identified, and text data of the link information is generated.
Optionally, the method according to the present application further comprises: generating a grayscale image of the image corresponding to the webpage as the image data.
Optionally, in the method according to the present application, before replacing the non-text characters in the link short message to obtain the text information, the method further includes: deleting the stop words in the link short message.
Optionally, the method according to the present application further comprises a step of training the audit model: acquiring historical link short message data, which comprises link short messages, user behavior data and auditing results; processing the link short messages and the user behavior data to generate the corresponding first vector, second vector, third vector and image data, which serve as training samples; and inputting the training samples into an initial auditing model to obtain prediction results, and adjusting the model parameters according to the auditing results and the prediction results until a preset requirement is met, at which point training ends and the trained auditing model is obtained.
Optionally, in the method according to the present application, processing the text information and the text data in the link information, respectively, to generate a second vector and a third vector representing text characteristics of the short message includes: processing the text information and the text data in the link information, respectively, using a pre-trained language representation model to generate the second vector and the third vector.
According to still another aspect of the present application, there is provided an audit system for link short messages, including: an information acquisition unit adapted to acquire the user behavior data and the link short message when the user sends the link short message; an information processing unit adapted to process the link short message to generate text information and link information, wherein the link information comprises image data and text data; an information representation unit adapted to generate a first vector representing user behavior characteristics based on the user behavior data, and further adapted to process the text information and the text data in the link information, respectively, to generate a second vector and a third vector representing text characteristics of the short message; and an auditing prediction unit adapted to input the image data, the first vector, the second vector and the third vector into an auditing model, which outputs the predicted probability that the link short message passes the audit.
According to yet another aspect of the present application, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods described above.
According to yet another aspect of the present application, there is provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described above.
According to the auditing method of the present application, text features, image features and user behavior features of the link short message are extracted based on deep learning, and the link short message is automatically audited using multi-modal data. Compared with prior automatic short message auditing, which mainly relies on text information alone, the auditing method and system of the present application can improve auditing accuracy.
In addition, based on an analysis of where offending content tends to appear in offending short messages, the text features of a link short message are extracted from two data sources: the text information in the link short message and the text data contained in the webpage the link points to. This makes the text features richer and improves the prediction accuracy of the auditing model.
In addition, the user behavior data is defined by analyzing the behavior data of senders of offending short messages. The user behavior characteristics of the present application can therefore improve the prediction accuracy of the auditing model.
The foregoing is only an overview of the technical solutions of the present application. So that the technical means of the present application can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present application become more apparent, specific embodiments of the present application are set forth below.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which set forth various ways in which the principles herein may be practiced, and all aspects and equivalents thereof are intended to fall within the scope of the claimed subject matter. The above, as well as additional purposes, features, and advantages of the present application will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Like reference numerals generally refer to like parts or elements throughout the present application.
FIG. 1 illustrates a schematic diagram of a computing device 100 according to some embodiments of the present application;
FIG. 2 illustrates a flow diagram of an audit method 200 for link short messages according to some embodiments of the present application;
FIG. 3 illustrates a schematic diagram of an audit model 300 according to some embodiments of the present application;
FIG. 4 illustrates a schematic diagram of an audit system 400 for link short messages according to some embodiments of the present application;
FIG. 5 illustrates a schematic diagram of an audit flow for a link short message according to some embodiments of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
According to the embodiment of the application, the short message platform can combine an office system, an external marketing system, a customer service system and the like in an enterprise with a mobile network, and information transmission is realized in a short message mode. In some embodiments, the user issues external marketing information, such as member marketing, business promotions, customer service, holiday blessings, and the like, via a short message platform. The short message platform generally distributes short messages in batches, and once the short messages contain illegal information, the influence range is wider. Therefore, after the short message is edited and before the short message is sent, the auditing and filtering of the short message content by the short message platform are particularly important.
In addition, the link short message is used as a common short message form of a short message platform because the link short message can be directly linked with service information.
A user who sends short messages on the short message platform leaves a large amount of behavior data on the platform, such as editing the short message content and clicking to switch between pages. Meanwhile, opening the link in a link short message leads to a large amount of webpage information. In the past, automatic short message auditing mainly audited the short message text using text information alone. The applicant has found that multimodal data (e.g. user behavior data, webpage information, text information) carries more information about whether a user is sending offending content, so using this information can improve the accuracy of link short message auditing.
According to the embodiment of the application, the auditing scheme of the multi-mode data aiming at the link short message is provided, so that the accuracy of automatic auditing is improved.
The method 200 for auditing the link text messages according to the present application may be implemented in a computing device. Fig. 1 illustrates a block diagram of a computing device 100, according to some embodiments of the present application. It should be noted that, the computing device 100 shown in fig. 1 is only an example, and in practice, the computing device used to implement the auditing method 200 of the link short message of the present application may be any type of device, and the hardware configuration of the computing device may be the same as the computing device 100 shown in fig. 1 or may be different from the computing device 100 shown in fig. 1. In practice, the computing device used to implement the method 200 for auditing the link short message of the present application may add or delete hardware components of the computing device 100 shown in fig. 1, and the present application does not limit the specific hardware configuration situation of the computing device.
As shown in FIG. 1, in a basic configuration 102, a computing device 100 typically includes a system memory 106 and one or more processors 104. The memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including, but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of caches, such as a first level cache 110 and a second level cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a Digital Signal Processing (DSP) core, or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations, the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. Physical memory in a computing device usually refers to volatile memory (RAM), and data on disk must be loaded into physical memory before the processor 104 can read it. The system memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some implementations, the application 122 may be arranged to execute instructions on the operating system by the one or more processors 104 using the program data 124. The operating system 120 may be, for example, Linux, Windows or the like, and includes program instructions for handling basic system services and performing hardware-dependent tasks. The application 122 includes program instructions for implementing various functions desired by the user, and may be, for example, a browser, instant messaging software, or a software development tool (e.g., an integrated development environment (IDE), a compiler, etc.), but is not limited thereto.
When the computing device 100 starts up running, the processor 104 reads the program instructions of the operating system 120 from the memory 106 and executes them. Applications 122 run on top of operating system 120, utilizing interfaces provided by operating system 120 and underlying hardware to implement various user-desired functions. When a user launches the application 122, the application 122 is loaded into the memory 106, and the processor 104 reads and executes the program instructions of the application 122 from the memory 106.
Computing device 100 also includes storage device 132, storage device 132 includes removable storage 136 (e.g., CD, DVD, U disk, removable hard disk, etc.) and non-removable storage 138 (e.g., hard disk drive HDD, etc.), both removable storage 136 and non-removable storage 138 being connected to storage interface bus 134.
Computing device 100 may also include a storage interface bus 134. Storage interface bus 134 enables communication from storage devices 132 (e.g., removable storage 136 and non-removable storage 138) to base configuration 102 via bus/interface controller 130. At least a portion of operating system 120, applications 122, and program data 124 may be stored on removable storage 136 and/or non-removable storage 138, and loaded into system memory 106 via storage interface bus 134 and executed by one or more processors 104 when computing device 100 is powered up or when application 122 is to be executed.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to basic configuration 102 via bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 via one or more communication ports 164 over a network communication link.
The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as a personal computer, including desktop and notebook computer configurations. Computing device 100 may also be implemented as part of a small-sized portable (or mobile) electronic device such as a cellular telephone, a digital camera, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. It may even be implemented as a server, such as a file server, database server, application server, or web server. The embodiments of the present application are not limited in this regard.
In an embodiment according to the present application, the computing device 100 is configured to perform an auditing method 200 of linked text messages according to the present application. Wherein the application 122 disposed on the operating system contains a plurality of program instructions for performing one or more of the methods described above, which may instruct the processor 104 to perform the method 200 described above of the present application.
Fig. 2 illustrates a flow diagram of an audit method 200 for linked text messages according to some embodiments of the present application. As shown in fig. 2, method 200 begins at 210.
At 210, a first vector representing user behavior is generated based on user behavior data when the user sends the link text message.
In the process in which a user logs in to the short message platform and sends a link short message (hereinafter also simply "short message") through the platform, the platform can acquire the corresponding user behavior data. The applicant has found in research that senders of offending short messages follow certain behavior patterns. For example, to avoid being blocked by an operator, the sender of an offending short message shows characteristic patterns in the time of sending the short message and the time of logging in to the platform. As another example, the editing time of an offending short message may be shorter than the typical editing time of a normal short message. As a further example, the number of marketing short messages a normal user sends each time should remain stable, without significant changes. Based on these findings, in some embodiments, the user behavior data includes: the duration of editing the short message, the time of sending the short message, the number of short messages sent, and the user login time.
The following is an exemplary illustration of user behavior data.
The duration of editing the short message is the time from opening the short message editing box to finishing editing the short message, expressed in seconds.
The time of sending the short message is the time at which the user clicks to send the short message, or the automatic sending time set by the user. For ease of calculation, only the hour of this time is recorded.
The number of short messages sent is the number of short messages the user sends in a single batch.
The user login time is the time at which the user logged in to the short message platform to send the short message. For ease of calculation, only the hour of this time is recorded as well.
According to some embodiments of the present application, the user behavior data is first obtained and the feature value corresponding to each item of behavior data is recorded. Continuing the description above, suppose the following. If the user spends 100 seconds editing the short message, the recorded editing duration feature is 100. If the user sends the short message at 18:20, the recorded sending time feature is 18. If the user sends 1000 short messages at a time, the recorded number-sent feature is 1000. If the user logs in to the short message platform at 21:31, the recorded login time feature is 21. These 4 feature values are then expressed in vector form: [100,18,1000,21].
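The recording step above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `behavior_feature_values` and its parameter names are hypothetical, and the source does not specify how the platform actually collects these values.

```python
from datetime import time

def behavior_feature_values(edit_seconds, send_time, message_count, login_time):
    """Return [editing duration (s), send hour, messages sent, login hour].

    Only the hour of each time is kept, as described in the text.
    """
    return [int(edit_seconds), send_time.hour, int(message_count), login_time.hour]

# Values taken from the worked example in the text.
values = behavior_feature_values(
    edit_seconds=100,
    send_time=time(18, 20),
    message_count=1000,
    login_time=time(21, 31),
)
print(values)  # [100, 18, 1000, 21]
```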
And then, encoding the recorded characteristic values to generate a first vector. More specifically, the 4-dimensional vector represented by the eigenvalue is encoded. In some embodiments, the 4-dimensional vector is encoded based on an automatic encoder, and a vector of a fixed dimension is output as the first vector. The automatic encoder neural network consists of two main parts: an encoder and a decoder. The encoder maps the input data into a low-dimensional representation space, and the decoder maps the low-dimensional representation back to the original data. The interaction between the encoder and decoder enables the auto-encoder neural network to learn a minimal representation of the input data. In this embodiment, the automatic encoder can change discrete data into a representation of any dimension of continuous data (selecting a fixed dimension) in a self-supervising manner, facilitating alignment with other feature vector dimensions at the time of subsequent computation.
It should be noted that other encoding methods may also be used to encode the feature values of the user behavior data. Some embodiments of the application select an autoencoder because little information is lost when the dimension is increased or reduced. However, the present application is not limited thereto.
In 220, the link short message is processed to generate text information and link information.
As described above, a link short message generally comprises two parts: a plain-text part and a URL link. According to some embodiments, the link short message is first preprocessed by deleting the stop words in it. Stop words include, for example, modal particles, adverbs, prepositions and conjunctions, and may also include certain other vocabulary. A stop-word list may be predefined according to the service scenario of the short message platform; the application places no particular limit on this.
The short message content from which the stop words have been deleted is then processed further.
According to embodiments of the application, some non-text characters (such as numerical values) in the link short message contribute nothing to the text information, so the text information is obtained by replacing the non-text characters. Optionally, the non-text characters in the link short message are replaced using regular expressions, and the resulting short message content is taken as the text information. In some embodiments, the non-text characters are typically a link address and a telephone number; the link address is replaced with "[link]" and the telephone number with "[number]".
The following is an example.
The original link short message is: "Add the host as a WeChat friend to share benefits; click the link http://wp.sharkshoping.com/nesting/zh or call 094-34485."
After stop-word processing and non-text replacement are applied to the link short message, the resulting text information is: "Add the host as a WeChat friend to share benefits; click the link [link] or call [number]."
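The preprocessing described above (stop-word deletion followed by regular-expression replacement) might look like the sketch below. The two patterns and the stop-word set are illustrative assumptions; the application does not give its regular expressions or its stop-word list, and the example message is an English rendering of the one above with the URL written without stray spaces.

```python
import re

# Assumed patterns: any http(s) URL, and a "digits-digits" phone number.
URL_PATTERN = re.compile(r"https?://\S+")
PHONE_PATTERN = re.compile(r"\b\d{2,4}-\d{4,8}\b")
# Hypothetical stop-word list; a real list depends on the service scenario.
STOP_WORDS = {"the", "to"}


def to_text_information(message, stop_words=STOP_WORDS):
    """Delete stop words, then replace links and phone numbers."""
    kept = [tok for tok in message.split() if tok.lower() not in stop_words]
    text = " ".join(kept)
    text = URL_PATTERN.sub("[link]", text)
    text = PHONE_PATTERN.sub("[number]", text)
    return text


message = ("Add the host on WeChat to share benefits, click the link "
           "http://wp.sharkshoping.com/nesting/zh or call 094-34485.")
print(to_text_information(message))
# Add host on WeChat share benefits, click link [link] or call [number].
```

The URL is replaced before the phone number so that digit runs inside a link address are never mistaken for a telephone number.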
Next, the web page pointed to by the link in the link short message is acquired as the link information. In some embodiments, a web browser is used to open the link (i.e., the link address that was replaced as a non-text character in the previous step), and the information contained in the corresponding web page is the link information.
According to an embodiment of the present application, this link information is further processed.
First, an image corresponding to the web page is acquired as the image data of the link information. Optionally, a screenshot of the web page is taken, and the resulting image is used as the image data.
In still other embodiments, the screenshot of the web page may be further processed to generate a grayscale image of it as the image data. Converting the image to grayscale simplifies the matrix and speeds up subsequent operations.
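To show why grayscale conversion simplifies the matrix, the sketch below turns an RGB image (three values per pixel) into a single-channel image (one value per pixel). The luminance weights are the common ITU-R BT.601 coefficients, which is an assumption; the application does not state which conversion it uses.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to rows of single gray values.

    Uses the BT.601 luma weights 0.299, 0.587 and 0.114 (an assumed choice).
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


# A 2x2 toy "screenshot": white, black, pure red, pure green.
screenshot = [[(255, 255, 255), (0, 0, 0)],
              [(255, 0, 0), (0, 255, 0)]]
gray = to_grayscale(screenshot)
print(gray)  # [[255, 0], [76, 150]]
```

Each pixel shrinks from three numbers to one, so the matrix passed to later stages has a third of the original entries.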
Then, the text in the image corresponding to the web page is recognized to generate the text data of the link information. According to some embodiments of the present application, text recognition, for example OCR (Optical Character Recognition), is performed on the image to obtain the corresponding text data. A convolutional neural network or another model may of course also be used to recognize the text in the image; the application does not limit this.
In summary, the link information according to the present application includes the image data and the text data of the linked page. The text data is the text obtained by recognizing the textual content shown after the link is opened. For offending short messages, the offending text is likely to be contained in the web page opened by the link, so the text data supplements the text information of the short message. The image data is the picture shown after the link is opened. For offending short messages, the offending content may be contained in the web page opened by the link and presented in the form of an image.
At 230, the text information and the text data in the link information are processed separately to generate, respectively, a second vector and a third vector representing text features of the short message.
According to embodiments of the application, the text information and the text data in the link information can each be processed by a language characterization model to generate the corresponding feature vectors. In some embodiments, a pre-trained language characterization model is used to process the text information and the text data in the link information separately, generating the second vector and the third vector respectively. A pre-trained language characterization model is a text characterization model obtained by unsupervised learning on a large amount of text, and it can represent text with richer semantic information. The pre-trained language characterization model may be, for example, a BERT (Bidirectional Encoder Representations from Transformers) model, which pre-trains bidirectional Transformers with an MLM (Masked Language Model) objective to generate deep bidirectional language characterizations. In this embodiment, when processing the text information, an additional output layer is added to the pre-trained BERT model and the model is fine-tuned; when processing the text data in the link information, an additional output layer is likewise added to the pre-trained BERT model and the model is fine-tuned.
In 240, the image data, the first vector, the second vector and the third vector are input into an audit model, which after processing outputs the predicted probability that the link short message passes the audit.
In some embodiments, the audit model is based on a convolutional neural network and an attention mechanism, and includes a convolutional neural network component, an attention mechanism component and a fusion component that are coupled to one another. The convolutional neural network component performs feature extraction on the image data to obtain the image features of the link short message. The attention mechanism component then computes attention values between the image features and the other features (e.g., text features and user behavior features). The attention values are then concatenated and input into the fusion component, which may be a linear layer, to obtain the final audit result. As described here, the attention mechanism maps an input sequence to a set of convolution kernels, each corresponding to a different element of the sequence, and extracts the features of each element by computing the response of each convolution kernel. Each element is then assigned a weight that represents its importance. Finally, the elements are combined by a weighted average using these weights to generate the output sequence.
FIG. 3 illustrates a schematic diagram of an audit model 300 according to some embodiments of the present application. As shown in FIG. 3, the image data is input to the convolutional neural network component to generate a fourth vector representing the image features of the short message. The fourth vector is then input, together with the first, second and third vectors, into the attention mechanism component to obtain the attention values of the fourth vector with the first, second and third vectors respectively. The three attention values are input into the fusion component to generate the predicted probability that the link short message passes the audit.
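One way to read the attention and fusion stages is sketched below in plain Python. It is an interpretation under stated assumptions, not the application's model: the fourth vector acts as the query, scaled dot-product scores with a softmax give one weight per other vector, each attention value is the weighted vector, and the fusion component is a single linear layer followed by a sigmoid. All vectors and the fusion weights are made-up toy numbers.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_values(image_vec, other_vecs):
    """Attend from the image vector to each of the other feature vectors."""
    scale = math.sqrt(len(image_vec))
    weights = softmax([dot(image_vec, v) / scale for v in other_vecs])
    values = [[w * x for x in v] for w, v in zip(weights, other_vecs)]
    return weights, values


def fuse(values, weight_row, bias):
    """Concatenate the attention values and apply a linear layer + sigmoid."""
    concatenated = [x for v in values for x in v]
    logit = dot(weight_row, concatenated) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 4-dimensional vectors standing in for the real features.
fourth = [0.2, 0.7, 0.1, 0.4]   # image features from the CNN component
first = [0.5, 0.1, 0.3, 0.9]    # user behavior features
second = [0.4, 0.8, 0.2, 0.1]   # short message text features
third = [0.3, 0.6, 0.7, 0.2]    # link text features

weights, values = attention_values(fourth, [first, second, third])
probability = fuse(values, weight_row=[0.1] * 12, bias=0.0)
print(len(values), round(sum(weights), 6))
```

Three attention values come out, one per non-image vector, which matches the count given in the description of FIG. 3.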
Whether the link short message finally passes the audit is determined against a preset threshold: if the output probability is not smaller than the preset threshold, the audit is passed; otherwise, if the output probability is smaller than the preset threshold, the audit is not passed.
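The threshold rule can be stated in a few lines. The default threshold of 0.5 is an assumed value for illustration; the application only says the threshold is preset.

```python
def passes_audit(probability, threshold=0.5):
    """Pass when the probability is not smaller than the threshold."""
    return probability >= threshold


print(passes_audit(0.73))  # True
print(passes_audit(0.50))  # True: equal to the threshold still passes
print(passes_audit(0.49))  # False
```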
It should be noted that the foregoing description of the audit model 300 is only an example. Based on the description of the present application, those skilled in the art may replace some modules of the audit model 300 with other existing functions or network structures to achieve the technical effects described herein, all of which fall within the scope of protection of the present application.
It should be noted that the present application also includes a process of generating the audit model. One training procedure is shown below as an example.
First, historical link short message data is acquired, comprising link short messages, user behavior data and audit results, where each audit result is either pass or fail. For a description of the user behavior data, refer to the earlier description associated with 210.
Then, the link short messages and the user behavior data are processed to generate the corresponding first vector, second vector, third vector and image data as training samples. For how the first vector, the second vector, the third vector and the image data are generated, refer to the foregoing description; details are not repeated here.
The training samples are input into an initial audit model to obtain prediction results, and the model parameters of the audit model are adjusted according to the audit results and the prediction results until a preset requirement is met (for example, the value of the loss function converges or the number of iterations reaches a preset value, without limitation). Training then ends, yielding the generated audit model.
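The stopping rule described above (loss convergence or a preset iteration count) can be illustrated with a deliberately simple stand-in model. The sketch below uses logistic regression trained by gradient descent rather than the CNN-plus-attention audit model; the toy samples, learning rate, tolerance and iteration limit are all assumptions made for the example.

```python
import math


def predict(weights, bias, features):
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def train(samples, labels, lr=0.5, max_iterations=2000, tolerance=1e-6):
    """Adjust parameters until the loss converges or the iteration cap is hit."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    previous_loss = float("inf")
    n = len(samples)
    for iteration in range(1, max_iterations + 1):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            p = predict(weights, bias, x)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        loss /= n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        if abs(previous_loss - loss) < tolerance:  # loss has converged
            break
        previous_loss = loss
    return weights, bias, iteration, loss


# Toy fused features with historical audit results (1 = pass, 0 = fail).
samples = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, 0, 0]
weights, bias, iterations, final_loss = train(samples, labels)
print(iterations <= 2000, predict(weights, bias, [0.9, 0.9]) > 0.5)
```

In a real setting the same loop structure would wrap the full audit model, with the historical audit results serving as the labels.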
In summary, the auditing method of the application extracts text features, image features and user behavior features from the link short message based on deep learning, and audits the link short message automatically using multimodal data. Compared with existing automatic short message auditing, which relies mainly on text information alone, the method can improve auditing accuracy.
In addition, based on an analysis of where illegal content is likely to appear in offending short messages, the data source for text features is divided into two parts when the features are extracted: the text in the link short message itself and the text data contained in the web page that the link points to. This enriches the text features and improves the prediction accuracy of the audit model.
In addition, the user behavior data is defined by analyzing the behavior of senders of offending short messages. The user behavior features according to the application can therefore improve the prediction accuracy of the audit model.
FIG. 4 illustrates a schematic diagram of an audit system 400 for link short messages according to some embodiments of the present application. The audit system 400 can be deployed in a short message platform. When a user issues a link short message through the platform, the audit system 400 first audits the link short message to be sent; once the audit confirms that it contains no illegal content, the platform sends it over the mobile network. If the link short message contains illegal content, the platform suspends sending and prompts the user that the message contains illegal content. If the user modifies the link short message after receiving the prompt, the platform calls the audit system 400 to audit the modified message again, and the message is allowed to be sent only once it passes the audit.
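The audit-then-send flow on the platform side can be sketched as a small loop. The function names, the stand-in `toy_audit` rule and the draft messages are all hypothetical; in the described system the audit call would invoke the audit system 400 and the threshold decision.

```python
def submit_until_passed(drafts, audit, send):
    """Audit each successive draft; send the first one that passes.

    Returns the number of the attempt that was sent, or None if none passed.
    """
    for attempt, message in enumerate(drafts, start=1):
        if audit(message):
            send(message)
            return attempt
        # Sending is paused and the user is prompted to modify the message.
        print(f"Attempt {attempt}: message contains illegal content, sending paused")
    return None


def toy_audit(message):
    """Stand-in for the audit system: reject one hard-coded phrase."""
    return "free prize" not in message.lower()


sent = []
attempts = submit_until_passed(
    ["Claim your FREE PRIZE at [link]",
     "Your order has shipped, details at [link]"],
    toy_audit,
    sent.append,
)
print(attempts, sent)
```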
It should be noted that the description of the audit system 400 and the foregoing description of the auditing method 200 complement each other; overlapping parts are not described again.
As shown in FIG. 4, the audit system 400 for link short messages includes: an information acquisition unit 410, an information processing unit 420, an information representation unit 430, and an audit prediction unit 440.
The information acquisition unit 410 acquires the user behavior data and the link short message when the user sends the link short message.
The information processing unit 420 processes the link short message to generate text information and link information. The link information includes image data and text data.
The information representation unit 430 extracts features from the user behavior data, the text information and the link information to generate feature vectors. Specifically, it generates a first vector representing user behavior features based on the user behavior data, and processes the text information and the text data in the link information separately to generate, respectively, a second vector and a third vector representing text features of the short message. The second vector is the feature representation of the text information in the link short message, and the third vector is the feature representation of the text data in the link.
The audit prediction unit 440 inputs the image data, the first vector, the second vector and the third vector into an audit model, which after processing outputs the predicted probability that the link short message passes the audit. The audit system can thus determine from the probability value whether the link short message passes the audit.
FIG. 5 illustrates a schematic diagram of an audit flow for a link short message according to some embodiments of the present application. With reference to FIG. 5, the processing of a link short message according to the present application is described below.
1. The user logs in to the short message platform; the platform records the user login time and uploads it to the database.
2. The user sets the number of link short messages to send; the platform records this number and uploads it to the database.
3. The user edits the content of the short message to be sent; the platform records how long the user spends editing the short message and uploads the duration to the database.
4. The user clicks send, or schedules a time at which the link short message is to be sent; the platform records the sending time and uploads it to the database.
5. The collected user behavior information is encoded by an autoencoder to obtain the user behavior representation (i.e., the first vector).
6. The short message text information ("short message content" in FIG. 5) is extracted from the link short message to be sent and input into a pre-trained language model for processing, yielding a vector representation of the short message text ("short message content representation" in FIG. 5, i.e., the second vector).
7. The link information is extracted from the link short message to be sent; the web page pointed to by the link is acquired and a screenshot of it is taken, yielding a link screenshot. On the one hand, the text in the link screenshot is recognized to obtain the link text data ("link content" in FIG. 5), from which a pre-trained language model produces a vector representation of the link text ("link content representation" in FIG. 5, i.e., the third vector). On the other hand, the link screenshot is converted to grayscale to serve as the link image data ("link image representation" in FIG. 5).
8. The user behavior representation, short message content representation, link content representation and link image representation are input, as a multimodal information representation, into the audit model (which can be described as a convolutional neural network plus an attention mechanism) for processing. The model finally outputs a predicted probability value, from which the audit result is obtained.
It should be noted that this flow can also represent the process of training the audit model: it is only necessary to add, in step 8, the audit result (pass or fail) of each historical link short message in order to train the audit model. Details are not repeated here.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present application, or certain aspects or portions of the methods and apparatus of the present application, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash drives, floppy diskettes, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the application.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store program code; the processor is configured to execute the auditing scheme for link short messages of the present application according to instructions in the program code stored in the memory.
The application also discloses:
A8. The method of any one of A1-7, further comprising the step of training to generate the audit model: acquiring historical link short message data, wherein the historical link short message data comprises link short messages, user behavior data and audit results; processing the link short messages and the user behavior data respectively to generate the corresponding first vector, second vector, third vector and image data as training samples; inputting the training samples into an initial audit model to obtain prediction results, and adjusting model parameters of the audit model according to the audit results and the prediction results until a preset requirement is met, whereupon training ends and the generated audit model is obtained.
A9. The method of any one of A1-8, wherein processing the text information and the text data in the link information respectively to correspondingly generate a second vector and a third vector representing text characteristics of the short message comprises: processing the text information and the text data in the link information respectively using a pre-trained language characterization model to correspondingly generate the second vector and the third vector.
By way of example, and not limitation, readable media comprise readable storage media and communication media. The readable storage medium stores information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the examples herein. The required structure for a construction of such a system is apparent from the description above. In addition, the present application is not directed to any particular programming language. It should be appreciated that the contents of the present application described herein can be implemented using a variety of programming languages, and that the above description of specific languages is provided for disclosure of preferred embodiments of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the present application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various aspects of the application. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the present application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is for performing functions performed by elements for purposes of this disclosure.
As used herein, unless otherwise specified, the use of the ordinal terms "first," "second," "third," etc., to describe a general object merely denotes different instances of like objects, and is not intended to imply that the objects so described must have a given order, either temporally, spatially, in ranking, or in any other manner. Furthermore, the term "plurality" means two or more.
While the application has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the foregoing description, will appreciate that other embodiments are contemplated within the scope of the application as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the subject matter of the application. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present application is illustrative, but not limiting, of the scope of the application, which is defined by the appended claims.
Claims (10)
1. A method for auditing a link short message, comprising:
generating, based on user behavior data when a user sends a link short message, a first vector representing user behavior characteristics;
processing the link short message to generate text information and link information, wherein the link information comprises image data and text data;
processing the text information and the text data in the link information respectively to correspondingly generate a second vector and a third vector representing text characteristics of the short message; and
inputting the image data, the first vector, the second vector and the third vector into an audit model, and after processing outputting a predicted probability that the link short message passes the audit.
2. The method of claim 1, wherein the audit model includes a convolutional neural network component, an attention mechanism component and a fusion component that are coupled to one another, and
inputting the image data and the first, second and third vectors into the audit model comprises:
inputting the image data into the convolutional neural network component to generate a fourth vector representing image characteristics of the short message;
inputting the fourth vector, the first vector, the second vector and the third vector into the attention mechanism component to obtain attention values of the fourth vector with the first vector, the second vector and the third vector respectively; and
inputting the attention values into the fusion component to generate the predicted probability that the link short message passes the audit.
3. The method of claim 1 or 2, wherein generating a first vector representing user behavior characteristics based on user behavior data when the user sends the link short message comprises:
acquiring user behavior data and recording characteristic values corresponding to the behavior data, wherein the user behavior data comprises: the duration of editing the short message, the time of sending the short message, the number of short messages sent and the user login time; and
encoding the characteristic values to generate the first vector.
4. The method of any of claims 1-3, wherein processing the link short message to generate text information and link information comprises:
replacing non-text characters in the link short message to obtain the text information; and
acquiring the web page pointed to by the link short message as the link information.
5. The method of any of claims 1-4, wherein processing the link short message to generate text information and link information further comprises:
acquiring an image corresponding to the web page as the image data of the link information; and
recognizing text in the image to generate the text data of the link information.
6. The method of claim 5, wherein acquiring the image corresponding to the web page as the image data of the link information further comprises:
generating a grayscale image of the image corresponding to the web page as the image data.
7. The method of claim 4, further comprising, before replacing the non-text characters in the link short message to obtain the text information:
deleting stop words in the link short message.
8. An audit system for a link short message, comprising:
an information acquisition unit adapted to acquire user behavior data and the link short message when a user sends the link short message;
an information processing unit adapted to process the link short message to generate text information and link information, wherein the link information comprises image data and text data;
an information representation unit adapted to generate a first vector representing user behavior characteristics based on the user behavior data, and further adapted to process the text information and the text data in the link information respectively to correspondingly generate a second vector and a third vector representing text characteristics of the short message; and
an audit prediction unit adapted to input the image data, the first vector, the second vector and the third vector into an audit model and, after processing, output a predicted probability that the link short message passes the audit.
9. A computing device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310416446.1A CN116431884A (en) | 2023-04-18 | 2023-04-18 | Method, system, computing device and storage medium for auditing link short messages |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310416446.1A CN116431884A (en) | 2023-04-18 | 2023-04-18 | Method, system, computing device and storage medium for auditing link short messages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116431884A true CN116431884A (en) | 2023-07-14 |
Family
ID=87086957
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310416446.1A Pending CN116431884A (en) | 2023-04-18 | 2023-04-18 | Method, system, computing device and storage medium for auditing link short messages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116431884A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116996840A (en) * | 2023-09-26 | 2023-11-03 | 北京百悟科技有限公司 | Short message auditing method, device, equipment and storage medium |
CN116996840B (en) * | 2023-09-26 | 2023-12-29 | 北京百悟科技有限公司 | Short message auditing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||