CN110580308B - Information auditing method and device, electronic equipment and storage medium - Google Patents

Information auditing method and device, electronic equipment and storage medium

Info

Publication number
CN110580308B
CN110580308B
Authority
CN
China
Prior art keywords
data
sequence
preset
information
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810496212.1A
Other languages
Chinese (zh)
Other versions
CN110580308A
Inventor
傅东博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Zhenshi Information Technology Co Ltd
Original Assignee
Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Zhenshi Information Technology Co Ltd filed Critical Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority to CN201810496212.1A priority Critical patent/CN110580308B/en
Publication of CN110580308A publication Critical patent/CN110580308A/en
Application granted granted Critical
Publication of CN110580308B publication Critical patent/CN110580308B/en

Abstract

The disclosure provides an information auditing method and device, belonging to the technical field of data processing. The method comprises the following steps: extracting target data of a plurality of preset dimensions from information to be audited, wherein the target data comprises numerical data and character data; converting the numerical data into a first sequence and the character data into a second sequence; generating a target vector according to the first sequence and the second sequence; and processing the target vector through a machine learning model to obtain a result indicating whether the information to be audited passes the audit. The method and device can audit information containing natural language, achieve high accuracy on complex information, and automate the auditing process.

Description

Information auditing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an information auditing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In recent years, artificial intelligence technology has been increasingly applied to the field of data processing, moving data processing from manual work toward automation and intelligence. Information auditing is an important branch of this field and an indispensable working link in most industries; automating it and making it intelligent helps improve working efficiency and reduce labor cost.
Most existing information auditing methods either formulate rule logic based on auditors' experience, or establish a rule database covering different combinations of conditions, so that a program can judge information according to the rule logic or rule database and thereby perform the audit. However, this approach has the following disadvantages: the information must be converted into a form the rules can match, and for information in forms such as natural language, rule matching is very complicated or entirely infeasible; rule-based judgment usually yields a binary 'yes'/'no' result, making information with fuzzy meaning hard to judge and leading to low accuracy on complex information; and optimizing the system depends on manually updating the rule logic or rule database, a complex process that easily causes rule conflicts and affects system stability.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides an information auditing method and apparatus, an electronic device, and a computer-readable storage medium, which overcome, at least to a certain extent, the limitations of the prior art whereby natural language information cannot be processed and complex information is audited with low accuracy.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, there is provided an information auditing method, including: extracting target data of a plurality of preset dimensions from information to be audited, wherein the target data comprises numerical data and character data; converting the numerical data into a first sequence and the character data into a second sequence; generating a target vector according to the first sequence and the second sequence; and processing the target vector through a machine learning model to obtain a result indicating whether the information to be audited passes the audit.
In an exemplary embodiment of the present disclosure, the converting the character-type data into the second sequence includes: acquiring a preset vocabulary database, wherein the preset vocabulary database comprises preset vocabularies and a unique label corresponding to each preset vocabulary; performing word segmentation processing on the character-type data to obtain a plurality of words; and converting the character-type data into the second sequence according to the unique labels corresponding to the words in the preset vocabulary database.
In an exemplary embodiment of the present disclosure, the acquiring the preset vocabulary database includes: acquiring a plurality of sample data, wherein the sample data comprises character-type sample data; performing word segmentation processing on the character-type sample data to obtain a preset vocabulary set; generating a unique label for each vocabulary in the preset vocabulary set; and generating the preset vocabulary database according to each vocabulary in the preset vocabulary set and its unique label.
In an exemplary embodiment of the disclosure, the generating a unique label for each vocabulary in the preset vocabulary set includes: counting the number of occurrences of each vocabulary in the preset vocabulary set and sorting the vocabularies by occurrence count; and determining each vocabulary's rank in the sorted order as its unique label.
In an exemplary embodiment of the present disclosure, the method further comprises: padding the second sequence with a preset null-word value, or deleting excess values from it, according to the reference sequence length.
In an exemplary embodiment of the present disclosure, the method further comprises: determining, among the plurality of sample data, the reference character-type sample data containing the largest number of vocabularies; converting the reference character-type sample data into a numerical sequence according to the preset vocabulary database; and determining the length of the numerical sequence as the reference sequence length.
In an exemplary embodiment of the present disclosure, further comprising: and when the character type data comprises the vocabulary except the preset vocabulary, converting the vocabulary except the preset vocabulary into a preset new word numerical value.
In an exemplary embodiment of the present disclosure, the sample data further includes numerical sample data and a classification tag corresponding to the sample data; the method further comprises the following steps: converting the numerical sample data into a first sample sequence; converting the character type sample data into a second sample sequence; and training and obtaining the machine learning model through the first sample sequence, the second sample sequence and the corresponding classification labels.
In an exemplary embodiment of the present disclosure, the information to be audited includes prior feature data; the method further comprises: after the target data is extracted, judging whether the prior feature data meets a prior condition; and if the prior feature data does not meet the prior condition, outputting a result that the information to be audited does not pass the audit.
In an exemplary embodiment of the present disclosure, the second sequence comprises a semantic feature vector; the converting the character-type data into a second sequence comprises: dividing the character type data into words, and obtaining a plurality of word vectors through a word2vec model; generating the semantic feature vector from the plurality of word vectors.
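Where the second sequence is a semantic feature vector built from word2vec word vectors, one common way to combine per-word vectors into a single vector is element-wise averaging. The disclosure does not fix the combination method, so averaging is shown below only as an assumed, illustrative choice; the function name and the toy 3-dimensional vectors are not from the original.

```python
def semantic_feature_vector(word_vectors):
    """Combine per-word vectors (e.g. from a trained word2vec model) into
    one semantic feature vector by element-wise averaging. Averaging is an
    assumed choice; the patent only says a vector is generated from the
    word vectors."""
    dim = len(word_vectors[0])
    return [sum(vec[i] for vec in word_vectors) / len(word_vectors)
            for i in range(dim)]

# Hypothetical 3-dimensional word vectors for the words of one sentence.
vecs = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
semantic_feature_vector(vecs)  # [2.0, 2.0, 1.0]
```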
In an exemplary embodiment of the disclosure, the machine learning model comprises a long-short term memory network model or a support vector machine model.
According to an aspect of the present disclosure, there is provided an information auditing apparatus including: a target data extraction module, configured to extract target data of a plurality of preset dimensions from information to be audited, wherein the target data comprises numerical data and character data; a sequence conversion module, configured to convert the numerical data into a first sequence and the character data into a second sequence; a target vector generation module, configured to generate a target vector according to the first sequence and the second sequence; and a machine learning processing module, configured to process the target vector through a machine learning model to obtain a result indicating whether the information to be audited passes the audit.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any one of the above via execution of the executable instructions.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure have the following advantageous effects:
in the method and device provided by the exemplary embodiments of the disclosure, after target data is extracted from the information to be audited, the target data is classified into numerical data and character data, the two are respectively converted into numerical sequences and combined into a target vector to integrate the information, and the target vector is then processed through a machine learning model to determine whether the audit passes. On the one hand, the method of the embodiments can process information in forms such as natural language and thus has a wider application range; since all information in the information to be audited is integrated into the target vector and processed through the machine learning model, information of different dimensions can combine with and influence one another, yielding high auditing accuracy on complex information. On the other hand, numerical data and character data in the information to be audited differ in conversion difficulty, and processing them separately helps the system schedule more resources for the harder-to-convert character data, improving efficiency. Moreover, the whole auditing process is automated, saving labor cost; in some embodiments, the system can be optimized simply by training the machine learning model with more training data and feedback events, a simple process that preserves system stability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 is a diagram schematically showing a system architecture to which the information auditing method in the present exemplary embodiment is applied;
fig. 2 schematically shows a flowchart of an information auditing method in the present exemplary embodiment;
fig. 3 schematically shows a flowchart of another information auditing method in the present exemplary embodiment;
FIG. 4 is a sub-flow diagram that schematically illustrates a method of auditing information in the present exemplary embodiment;
fig. 5 is a block diagram schematically showing the configuration of an information auditing apparatus in the present exemplary embodiment;
fig. 6 schematically illustrates an electronic device for implementing the above method in the present exemplary embodiment;
fig. 7 schematically illustrates a computer-readable storage medium for implementing the above-described method in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
An exemplary embodiment of the present disclosure first provides an information auditing method. Fig. 1 shows an exemplary system architecture diagram to which the information auditing method in the present embodiment may be applied. As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. Network 104 is used to provide communication connections between terminal devices 101, 102, 103 and server 105 and may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to send or receive information, for example, send information to be audited to the server 105, and receive information of the result of audit from the server 105.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, personal computers, etc., and may be installed with various client applications, such as a web browser application, an instant messaging tool, a shopping application, etc.
The server 105 may be a server providing various data supports, for example, a background management server providing support for information to be checked sent by the user through the terminal devices 101, 102, and 103, and the background management server may perform processing such as forwarding, analyzing, checking, and the like on the received information data to be checked, and feed back a processing result to the terminal device.
It should be noted that the information auditing method provided in this exemplary embodiment may be applied to the server 105, where the server 105 receives the to-be-audited information sent by the terminal device and executes the information auditing method of this embodiment, and may also be applied to the terminal devices 101, 102, and 103, where the application programs installed on the terminal devices 101, 102, and 103 independently execute the information auditing method of this embodiment after receiving the to-be-audited information input by the user or sent by other terminal devices.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative, and any number of terminal devices, networks and servers may be provided according to actual needs.
As shown in fig. 2, in an exemplary embodiment, the information auditing method may include the following steps:
step S210, extracting target data of a plurality of preset dimensions from the information to be audited, wherein the target data comprises numerical data and character data.
The information to be audited is usually a set of information input according to a fixed style set by a program. For example, when the information to be audited is account application information, its content may be characters, passwords, and the like input by a user on an account application page; when it is after-sales application information, its content may be options, texts, and the like input by the user on an after-sales application page. The fixed style allows the program to extract the target data by preset dimensions. For example, after the user inputs the information to be audited on the after-sales application page, the program may convert the information into the target data classification list of Table 1, where each field is a preset dimension used for data indexing or for identifying the data's meaning. The target data may include numerical data, such as the 'int'-type fields in Table 1: some are directly input numerical values, such as 'service ticket number', and some map information content to numerical values, such as 'whether it is after-sale to home', which can be converted into a '0'/'1' value according to the option the user selected. Character-type data, such as the 'string'-type fields in Table 1, is usually non-option text content, and the original characters can be extracted directly as data.
[Table 1, an example target data classification list, appears as an image in the original publication and is not reproduced here.]
TABLE 1
It should be noted that Table 1 is merely illustrative: numerical data and character data are not limited to the cases shown there, and in practice other classification and identification conventions may be used, for example identifying numerical data by 'num' or character data by 'txt'; this embodiment imposes no particular limitation.
The method of the embodiment can be directly used for processing numerical data and character data in the information to be audited. In some scenarios, the information to be audited may also include other types of information, such as voice, picture, etc., and then the information may be first converted into numerical data or character data by means of voice-text conversion, image recognition, etc., and then the method of this embodiment is applied to audit the information.
Step S220, converting the numeric data into a first sequence, and converting the character data into a second sequence.
The first sequence and the second sequence are both numerical sequences whose values are arranged in a specific order. The values may be separated by a delimiter (for example, a comma or space) or distinguished by digit positions without separation, and a sequence may be stored as a discrete set of values or in vector form; this embodiment imposes no particular limitation.
When processing the numerical data, conversion can be achieved directly once the order of the values is set. For example, part of the numerical data in Table 1 can be arranged in the order shown in Table 2 to form the first sequence (1, 0, 2, 1, 4, 1), and the program can identify each value's meaning from its position in the first sequence. Furthermore, the original value sequence may be given specific processing: for example, when it is a sequence of '0'/'1' values, a leading '1' may be added on the leftmost side and the whole sequence then converted into a single decimal value, compressing the length of the first sequence without loss of information and facilitating subsequent processing.
afsCategory | isBlackUser | customerGrade | isHasPackage | pickwareType | isPlus
1 | 0 | 2 | 1 | 4 | 1
TABLE 2
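As a rough sketch of this numerical-data conversion (field names are taken from the Table 2 example; the helper names, and reading the '1'-prefixed flag string as a binary number, are assumptions for illustration):

```python
# Sketch of step S220 for numerical data: arrange preset-dimension values
# in a fixed field order to form the first sequence.
FIELD_ORDER = ["afsCategory", "isBlackUser", "customerGrade",
               "isHasPackage", "pickwareType", "isPlus"]

def to_first_sequence(record):
    """Arrange the numerical target data in the fixed field order."""
    return [record[field] for field in FIELD_ORDER]

def compress_binary_flags(flags):
    """Optional compression described in the text: prepend a leading 1 to a
    0/1 sequence (so leading zeros survive) and read the whole digit string
    as one number, shortening the sequence without losing information.
    Interpreting the digits as binary is an assumption."""
    bits = "1" + "".join(str(b) for b in flags)
    return int(bits, 2)

record = {"afsCategory": 1, "isBlackUser": 0, "customerGrade": 2,
          "isHasPackage": 1, "pickwareType": 4, "isPlus": 1}
first_sequence = to_first_sequence(record)   # [1, 0, 2, 1, 4, 1]
packed = compress_binary_flags([0, 1, 1])    # "1011" read as binary -> 11
```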
When processing the character-type data, the text must be converted into numerical values. There are various text-to-value conversion methods, which are described in detail in the embodiments below. The second sequence is obtained by converting the text into values and arranging those values in the order of the original text.
Step S230, generating a target vector according to the first sequence and the second sequence.
Obtaining the first sequence and the second sequence by conversion effectively encodes the numerical data and the character data in a unified manner, so the two sequences can be combined into a target vector, integrating the two kinds of data.
Assuming the first sequence is an m-dimensional numerical sequence and the second sequence is an n-dimensional numerical sequence, the first sequence may be concatenated with the second sequence, i.e., the target vector may be an (m + n)-dimensional vector, where the first sequence may be arranged before or after the second sequence. Alternatively, the first sequence may be added to the second sequence in a weighted manner: for example, the first sequence is multiplied by a weighting factor (if the values of the second sequence have at most 5 digits, the weighting factor may be 1 × 10^6) and then added to the second sequence, i.e., values of corresponding dimensions are added and any surplus values are kept, yielding an n-dimensional target vector (usually n > m; if n < m, the target vector is m-dimensional). It should be noted that, to facilitate processing by the program, the plain numerical sequence needs to be converted into vector form by adding vector notation (for example, "vector(...)", "[...]", etc.).
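The two combination strategies described above can be sketched as follows (function names, and the digit-based interpretation of the weighting factor, are assumptions for illustration):

```python
def concat_target_vector(first_seq, second_seq, first_before=True):
    """Concatenate the two sequences into an (m + n)-dimensional vector,
    with the first sequence before or after the second."""
    return first_seq + second_seq if first_before else second_seq + first_seq

def weighted_add_target_vector(first_seq, second_seq, weight=10**6):
    """Weighted addition: scale the first sequence so its values occupy
    higher digit positions than the second sequence's (assuming the second
    sequence's values stay below the weight), add values of corresponding
    dimensions, and keep the surplus tail of the longer sequence."""
    m, n = len(first_seq), len(second_seq)
    out = []
    for i in range(max(m, n)):
        a = first_seq[i] * weight if i < m else 0
        b = second_seq[i] if i < n else 0
        out.append(a + b)
    return out

v1 = concat_target_vector([1, 0, 2], [5, 3, 4, 2])        # [1, 0, 2, 5, 3, 4, 2]
v2 = weighted_add_target_vector([1, 0, 2], [5, 3, 4, 2])  # [1000005, 3, 2000004, 2]
```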
Step S240, processing the target vector through a machine learning model to obtain a result indicating whether the information to be audited passes the audit.
Through steps S210 to S230, the different forms of information in the information to be audited are integrated into the target vector, which can then be processed by the machine learning model. The machine learning model takes a vector as input and a classification result as output, for example outputting '1' if the audit passes and '0' if it does not. On the premise of sufficient training, inputting the target vector into the machine learning model yields the classification result of whether the audit passes.
In the exemplary embodiment, after the target data is extracted from the information to be audited, it is classified into numerical data and character data, the two are respectively converted into numerical sequences and combined into a target vector to integrate the information, and the target vector is then processed through a machine learning model to determine whether the audit passes. On the one hand, the method of the embodiment can process information in forms such as natural language and thus has a wider application range; since all information in the information to be audited is integrated into the target vector and processed through the machine learning model, information of different dimensions can combine with and influence one another, yielding high auditing accuracy on complex information. On the other hand, numerical data and character data in the information to be audited differ in conversion difficulty, and processing them separately helps the system schedule more resources for the harder-to-convert character data, improving efficiency. Moreover, the whole auditing process is automated, saving labor cost; in some embodiments, the system can be optimized simply by training the machine learning model with more training data and feedback events, a simple process that preserves system stability.
In an exemplary embodiment, referring to fig. 3, the step of converting the character-type data into the second sequence may be implemented by steps S321 to S323:
step S321, acquiring a preset vocabulary database, where the preset vocabulary database includes preset vocabularies and unique labels corresponding to the preset vocabularies.
The preset vocabulary database can be imported from outside or constructed and maintained locally. The preset vocabularies are all the vocabularies contained in the database, each with a corresponding unique identifier, which may take numerical form; that is, the preset vocabulary database contains a mapping between preset vocabularies and numerical values. It should be noted that the database may include vocabularies of multiple languages; for example, in this embodiment it includes Chinese and English vocabularies, and the two languages may share a single set of unique identifiers to facilitate indexing the database.
Step S322, performing word segmentation processing on the character type data to obtain a plurality of vocabularies.
The word segmentation method varies with the language: English can be segmented directly at word boundaries, while Chinese requires dedicated word segmentation, for example with the THULAC (THU Lexical Analyzer for Chinese) tool or the HanLP (Han Language Processing) tokenizer. In this embodiment, segmentation may also be performed against the preset vocabulary database by querying and matching the Chinese characters of the character-type data in the database one by one; when a match succeeds, the vocabulary is recognized. The matching process may be implemented by forward matching, reverse matching, bidirectional matching, and the like.
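Forward matching against the preset vocabulary database can be sketched as a forward maximum matching scan (the function, the single-character fallback, and the tiny vocabulary are illustrative assumptions, not the patent's prescribed implementation):

```python
def forward_max_match(text, vocab, max_word_len=6):
    """Forward maximum matching against the preset vocabulary database:
    at each position take the longest vocabulary entry that matches,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                words.append(candidate)
                i += length
                break
    return words

vocab = {"快递", "破损", "包装"}
forward_max_match("包装破损", vocab)  # ['包装', '破损']
```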
Step S323, converting the character type data into the second sequence according to the unique labels corresponding to the vocabularies in the preset vocabulary database.
After the character-type data is segmented into words, each word's mapped unique identifier is queried in the preset vocabulary database and the word is replaced by that identifier, converting the character-type data into a numerical sequence, which is the second sequence. In the second sequence, the words' unique identifiers can be arranged in the same order as the words in the original text, fully preserving the information of the original text.
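A minimal sketch of this word-to-identifier replacement, assuming a dictionary-backed vocabulary database and an assumed sentinel value for words outside the preset vocabularies:

```python
NEW_WORD_ID = -1  # assumed sentinel for out-of-vocabulary words

def to_second_sequence(words, vocab_db, new_word_id=NEW_WORD_ID):
    """Replace each word with its unique identifier from the preset
    vocabulary database, preserving the original word order; unknown
    words map to a preset new-word value."""
    return [vocab_db.get(w, new_word_id) for w in words]

vocab_db = {"包装": 3, "破损": 7}
to_second_sequence(["包装", "破损", "严重"], vocab_db)  # [3, 7, -1]
```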
In an exemplary embodiment, referring to FIG. 3, the preset vocabulary database may be obtained through the following steps S3211 to S3214:
step S3211, a plurality of sample data is obtained, where the sample data includes character-type sample data.
Sample data refers to sample information extracted from historical information that has already been audited, together with the various data it contains. Sample data generally has the same or a similar form as the target data, and may therefore include numerical sample data and character-type sample data. When constructing the preset vocabulary database, only the character-type sample data need be used. For example, when constructing a preset vocabulary database for after-sales information auditing, referring to Table 1, the character-type data (i.e., 'description of the customer') of a large number of sample data can be retrieved; each sample datum contains one or more sentence texts, yielding a set of sentence texts. Generally, the more sample data, the greater the coverage of the preset vocabularies in the database.
Step S3212, performing word segmentation processing on the character type sample data to obtain a preset vocabulary set.
And performing word segmentation on each sentence in the sentence text set to obtain a large number of words, and forming a preset word set by all the appeared words.
Step S3213, generating a unique label for each vocabulary in the preset vocabulary set.
Unique labels for the vocabularies can be generated by preset logic. The preset logic may generate unique labels randomly, generate them sequentially after sorting by the alphabetical order of the words' pinyin (or English spelling), or generate them sequentially after sorting by word length and Chinese character stroke count, and so on. In an exemplary embodiment, a unique label may also be generated for each vocabulary in the preset vocabulary set as follows: count the number of occurrences of each vocabulary in the preset vocabulary set and sort the vocabularies by occurrence count; then determine each vocabulary's rank in the sorted order as its unique label. When two or more vocabularies occur the same number of times, they can be further sorted by alphabetical order, stroke count, word length, and the like.
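The frequency-ranked labeling scheme can be sketched as follows (starting ranks at 1 to reserve 0 for the null-word value, and breaking ties alphabetically, are assumptions consistent with but not mandated by the text):

```python
from collections import Counter

def build_unique_labels(word_lists):
    """Count occurrences of every word across the segmented sample texts,
    sort by descending frequency (ties broken by string order, one of the
    tie-breakers the text mentions), and use each word's rank as its
    unique label. Label 0 is reserved here for the null-word value."""
    counts = Counter(w for words in word_lists for w in words)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

samples = [["快递", "破损"], ["破损", "严重"], ["破损"]]
build_unique_labels(samples)  # most frequent word ('破损') gets label 1
```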
Step S3214, generating the preset vocabulary database according to each vocabulary in the preset vocabulary set and the unique label of each vocabulary.
The unique labels generated by the above methods are numerical values, and a mapping is established between the preset vocabularies and those values, thereby generating the preset vocabulary database. The advantage of generating the database from character-type sample data is that a specialized vocabulary database can be constructed for each application scenario. For example, since the method of this embodiment audits after-sales information, the database can contain more vocabularies from related categories such as online shopping, commodity quality, and logistics, and fewer low-relevance vocabularies such as those related to politics or military affairs, reducing the size of the database and improving query and read speed.
After the preset vocabulary database is constructed, the character-type data can be converted into the second sequence through steps S321 to S323. However, different pieces of character-type data usually contain different numbers of words and therefore yield second sequences of different lengths. It should be noted that in this exemplary embodiment, sequence length refers to the number of values in the sequence: a sequence containing 5 values has length 5, regardless of the number of digits in each value; for example, the sequences (1,1,1,1,1) and (11,11111,111111,1111,111) both have length 5. In an exemplary embodiment, to unify the lengths of the second sequences, a second sequence may be padded with a preset null-word value or have surplus values deleted according to a reference sequence length. The reference sequence length, i.e. the common length of all second sequences, can be set to a fixed value, so that every subsequently produced second sequence is processed to that length. If a second sequence is shorter than the reference length, it can be padded with the preset null-word value, i.e. the numerical identifier corresponding to 'null' in the preset vocabulary database, which is clearly distinguished from the unique labels of actual words; for example, the preset null-word value can be '0'. If a second sequence exceeds the reference length, the surplus values can be deleted from the front or the back.
Further, the reference sequence length may be determined as follows: step S410, determining the reference character-type sample data, i.e. the sample with the largest number of words among the plurality of sample data; step S420, converting the reference character-type sample data into a numerical sequence according to the preset vocabulary database; step S430, determining the length of that numerical sequence as the reference sequence length. For example, if the sentence text containing the most words in the character-type sample data contains 30 words, the sequence obtained by conversion also has length 30, so the reference sequence length is 30; that is, the maximum length among the character-type sample data is used as the reference sequence length. This ensures a relatively long reference sequence, so that character-type data rarely exceeds the reference length after conversion into a second sequence and values seldom need to be deleted, preserving the information in the character-type data as completely as possible.
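The length-unification step (padding with the null-word value '0', truncation from the back) and the reference-length determination of steps S410 to S430 can be sketched as below; truncating from the back rather than the front is an arbitrary choice, since the text allows either.

```python
def reference_length(tokenized_samples):
    # Steps S410-S430: the sample with the most words sets the reference length.
    return max(len(s) for s in tokenized_samples)

def unify_length(seq, ref_len, pad_value=0):
    """Pad with the preset null-word value, or drop surplus values from the back."""
    if len(seq) < ref_len:
        return seq + [pad_value] * (ref_len - len(seq))
    return seq[:ref_len]

print(unify_length([7, 3, 12], 5))  # → [7, 3, 12, 0, 0]
```

With the reference length set to the longest sample, truncation almost never fires in practice, which is exactly the rationale the text gives for choosing the maximum.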
The preset vocabulary database constructed in steps S3211 to S3214 is based on statistics over the words in the character-type sample data, and the sample data contain only a limited number of words, so in actual use the character-type data may contain words outside the preset vocabulary. In this case, such a word can be converted into a preset new-word value. The preset new-word value is the identifier corresponding to the unrecognizable word 'unknown' in the preset vocabulary database and should be clearly distinguished from the unique labels of recognizable words; for example, when the total number of words in the preset vocabulary database has 5 digits, 'unknown' may be set to '99999'. A word outside the preset vocabulary may be a genuinely new word, a wrong word caused by a misspelling, or a special word such as a name; all of these can be uniformly converted into the preset new-word value. In addition, the program can store these new, wrong, and special words and count their accumulated occurrences; if the occurrence count of a word reaches a certain standard (for example, a threshold can be set), the program can judge it to be a common word and add it to the preset vocabulary database. In this way the preset vocabulary database can be updated and maintained through a process that is simple and easy to implement.
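The unknown-word handling and the threshold-based promotion described above can be sketched as follows; the sentinel `UNK_VALUE` reuses the '99999' example from the text, while `PROMOTE_AT` is a hypothetical threshold of my choosing.

```python
from collections import Counter

UNK_VALUE = 99999   # 'unknown' identifier, as in the 5-digit example above
PROMOTE_AT = 50     # hypothetical occurrence threshold for promotion

unknown_counts = Counter()

def label_for(word, vocab_db):
    """Return the word's unique label, or the preset new-word value.

    Unrecognized words are counted; once a word reaches the threshold it
    is judged a common word and added to the preset vocabulary database
    under the next free label.
    """
    if word in vocab_db:
        return vocab_db[word]
    unknown_counts[word] += 1
    if unknown_counts[word] >= PROMOTE_AT:
        vocab_db[word] = max(vocab_db.values(), default=0) + 1
    return UNK_VALUE
```

For example, with `vocab_db = {"refund": 1}`, the first call `label_for("xzyq", vocab_db)` returns 99999; after the occurrence count reaches the threshold, subsequent calls return the newly assigned label 2.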
Besides processing the character-type data through the word-to-unique-label mapping in the preset vocabulary database, the character-type data can be converted into the second sequence by other methods. In an exemplary embodiment, the second sequence may be a semantic feature vector, which can be obtained by: segmenting the character-type data into words and obtaining a plurality of word vectors through a word2vec model; and generating the semantic feature vector from the plurality of word vectors. word2vec is a natural language processing tool that vectorizes a vocabulary so that relationships between words can be measured quantitatively. After the word vectors of all the words are obtained, the semantic feature vector can be generated by concatenating, adding, or averaging the word vectors.
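Averaging word vectors into a semantic feature vector (one of the three combination modes named above) can be sketched as follows. The two-dimensional toy vectors stand in for vectors from a trained word2vec model (e.g. gensim's `model.wv`), which is assumed rather than trained here.

```python
import numpy as np

def semantic_feature_vector(words, word_vectors):
    """Average the word vectors of a segmented sentence (mean pooling)."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:  # no known words: return a zero vector of the right size
        return np.zeros_like(next(iter(word_vectors.values())))
    return np.mean(vecs, axis=0)

# Hypothetical 2-dimensional word vectors; real word2vec vectors would
# typically have 100-300 dimensions.
word_vectors = {"refund": np.array([1.0, 0.0]),
                "late":   np.array([0.0, 1.0])}
print(semantic_feature_vector(["refund", "late"], word_vectors))  # → [0.5 0.5]
```

Concatenation would instead preserve word order at the cost of a variable-length vector, which is why mean pooling is the simpler choice when a fixed-size second sequence is wanted.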
It should be added that when the character-type data are converted into the second sequence in the above embodiments, every word participates in the conversion and occupies a value in the second sequence. To reduce the length of the second sequence, and hence the dimensionality of the subsequent target vector, irrelevant information in the character-type data can be filtered out. In an exemplary embodiment, words with high semantic relevance can be marked as key words in the preset vocabulary database; for example, when the method is applied to after-sales information auditing, words highly relevant to after-sales service can be marked as key words (determined from experience, a category lexicon, common words, and the like). When processing character-type data, the text is split into sentences at punctuation marks, separators, and the like, and sentences containing no key word are deleted. In this way, text carrying no key information, such as 'hello' and 'thank you', is removed, shortening the character length while retaining sufficient key information.
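The sentence-level filtering described above can be sketched as follows; the keyword set and the punctuation list are illustrative assumptions, not taken from the patent.

```python
import re

KEYWORDS = {"refund", "damaged", "delivery", "quality"}  # hypothetical key words

def keep_key_sentences(text, keywords=KEYWORDS):
    """Split at punctuation marks and keep only sentences with a key word."""
    sentences = re.split(r"[.!?;,]+", text)
    kept = [s.strip() for s in sentences
            if any(k in s.lower() for k in keywords)]
    return ". ".join(kept)

print(keep_key_sentences("Hello. The delivery arrived damaged. Thank you."))
# → The delivery arrived damaged
```

Applied before conversion to the second sequence, this shortens the sequence (and the target vector) while the key-word check guarantees the retained sentences carry audit-relevant information.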
After the first sequence and the second sequence are obtained, a target vector may be generated in step S330 and then input into the machine learning model for analysis. In an exemplary embodiment, the machine learning model may be obtained by training on sample data. Specifically, in addition to the character-type sample data, the sample data may further include numerical sample data and a classification label for each sample. The classification label is the result of whether the original information corresponding to the sample passed the audit, i.e. the correct output of the machine learning model; for example, '1' may represent that the audit passed and '0' that it did not, and each sample is labeled '1' or '0' accordingly. Based on this, referring to fig. 3, the machine learning model can be obtained through steps S341 to S343: step S341, converting the numerical sample data into a first sample sequence; step S342, converting the character-type sample data into a second sample sequence; and step S343, training the machine learning model on the first sample sequence, the second sample sequence, and the corresponding classification labels. The conversion of the first and second sample sequences is the same as the processing of the numerical data and character data described above. On this basis, the two sample sequences can be combined into a sample vector used as the input of the machine learning model; the output of the model is compared with the classification label of the sample, the parameters of the model are adjusted according to the comparison, and a model with higher accuracy is obtained through multiple iterations.
The sample data can generally be divided into training data and test data, for example in a ratio of 8:2. The machine learning model is trained on the training data, and its accuracy is measured on the test data; this helps prevent overfitting during training and improves the generalization ability of the model.
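The 8:2 split can be sketched without external libraries (in practice `sklearn.model_selection.train_test_split` does the same job); the fixed seed is an arbitrary choice for reproducibility.

```python
import random

def split_samples(samples, labels, test_ratio=0.2, seed=42):
    """Shuffle indices, then hold out the last test_ratio share as test data."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx])

X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, y_train, X_test, y_test = split_samples(X, y)
print(len(X_train), len(X_test))  # → 8 2
```

Shuffling before cutting matters: if the samples are ordered by label or by collection date, an unshuffled split would give a biased accuracy estimate.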
In an exemplary embodiment, the machine learning model may include a Long Short-Term Memory network (LSTM) model or a Support Vector Machine (SVM) model. Both LSTM and SVM take multidimensional vectors as input and are suitable for processing the target vector of this embodiment. The LSTM is good at capturing associations between words separated by long intervals, so that words no longer exist in isolation but influence one another; it is particularly effective for semantics that depend on distant context, such as negation. The SVM is suitable for accurate classification based on word semantics and emphasizes distinguishing different word combinations.
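As one concrete instance of the SVM option, scikit-learn's `SVC` can classify target vectors directly. The vectors and labels below are fabricated for illustration; the patent names the model families but does not specify an implementation.

```python
from sklearn.svm import SVC

# Hypothetical target vectors: numeric fields (first sequence) followed by
# word labels (second sequence); label 1 = audit passed, 0 = audit failed.
X = [
    [1, 1, 2, 3, 0, 0],
    [1, 2, 2, 4, 0, 0],
    [9, 8, 7, 9, 9, 8],
    [9, 9, 8, 8, 9, 9],
]
y = [1, 1, 0, 0]

model = SVC(kernel="rbf")  # RBF kernel separates the two clusters here
model.fit(X, y)
```

An LSTM would instead consume the second sequence as an ordered input, one word label per time step, which is what lets it model interactions between distant words; the SVM sees the target vector as an unordered point in feature space.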
In an exemplary embodiment, the information to be audited may include prior characteristic data. Referring to fig. 3, the method may further include: after the target data are extracted, judging whether the prior characteristic data meet the prior conditions; if the prior characteristic data are judged not to meet the prior conditions, outputting the result that the information to be audited fails the audit; if they are met, continuing with steps S320 to S340 to complete the audit. Prior conditions are strict hard conditions set up in the auditing process: when they are not met, no other factor is considered and the audit directly fails. The data associated with the prior conditions are the prior characteristic data. For example, in table 1, whether "the area is exceeded" equals 1, or '"reservation time" − "check-in time" ≤ "client requirement"', can serve as prior conditions. If the prior characteristic data in the information to be audited do not meet the prior conditions, e.g. "the area is exceeded" equals 0 or '"reservation time" − "check-in time" > "client requirement"', the subsequent auditing flow is skipped and a failing result is output directly, further simplifying the whole flow.
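The hard pre-check can be sketched as a short gate function run before the model; the field names below are hypothetical stand-ins for the table-1 columns mentioned above.

```python
def check_prior_conditions(record):
    """Hard pre-checks run before the model-based audit (steps S320-S340).

    Returns False to fail the audit immediately, True to continue.
    Field names are hypothetical stand-ins for the table-1 columns.
    """
    if record.get("area_exceeded") != 1:
        return False
    if record["reservation_time"] - record["check_in_time"] > record["client_requirement"]:
        return False
    return True

print(check_prior_conditions(
    {"area_exceeded": 1, "reservation_time": 5,
     "check_in_time": 2, "client_requirement": 4}))  # → True
```

Running this gate first means the sequence conversion and model inference are skipped entirely for clear-cut rejections, which is the flow simplification the text describes.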
The exemplary embodiment of the present disclosure also provides an information auditing apparatus, which can be applied to the server 105 or the terminal devices 101, 102, 103 in the information auditing system shown in fig. 1. As shown in fig. 5, the information auditing apparatus 500 may include: a target data extraction module 510, configured to extract target data of multiple preset dimensions from information to be audited, where the target data includes numeric data and character data; a sequence conversion module 520, configured to convert the numeric data into a first sequence and the character data into a second sequence; a target vector generating module 530, configured to generate a target vector according to the first sequence and the second sequence; and a machine learning processing module 540, configured to process the target vector through a machine learning model to obtain a result of whether the information to be audited passes the audit.
In an exemplary embodiment, the information auditing apparatus may further include: a preset vocabulary database acquisition module, configured to acquire a preset vocabulary database comprising preset words and the unique label corresponding to each preset word. The sequence conversion module may include: a second sequence conversion unit, configured to segment the character-type data into a plurality of words and convert the character-type data into the second sequence according to the unique labels corresponding to those words in the preset vocabulary database.
In an exemplary embodiment, the preset vocabulary database acquisition module may include: a sample data processing unit, configured to acquire a plurality of sample data, the sample data including character-type sample data, and to perform word segmentation on the character-type sample data to obtain a preset vocabulary set; and a unique label generating unit, configured to generate a unique label for each word in the preset vocabulary set and to generate the preset vocabulary database from each word and its unique label.
In an exemplary embodiment, the unique label generating unit may be configured to count the occurrence times of each vocabulary in the preset vocabulary set, sort the vocabularies according to the occurrence times, and determine the sorted serial numbers as the unique labels of the vocabularies.
In an exemplary embodiment, the second sequence conversion unit may be further configured to pad the second sequence with a preset null-word value or delete surplus values according to the reference sequence length.
In an exemplary embodiment, the sequence conversion module may include: a reference length determining unit, configured to determine the reference character-type sample data having the largest number of words among the sample data, convert the reference character-type sample data into a numerical sequence according to the preset vocabulary database, and determine the length of the numerical sequence as the reference sequence length.
In an exemplary embodiment, the second sequence conversion unit may be further configured to convert words other than the predetermined word into predetermined new word values when the character data includes words other than the predetermined word.
In an exemplary embodiment, the sample data may further include numerical sample data and a classification tag corresponding to the sample data; the machine learning processing module can also be used for converting the numerical value type sample data into a first sample sequence, converting the character type sample data into a second sample sequence, and training and obtaining the machine learning model through the first sample sequence, the second sample sequence and the corresponding classification labels.
In an exemplary embodiment, the information to be audited may include prior characteristic data; the information auditing apparatus may further include: a prior condition judging module, configured to judge whether the prior characteristic data meet the prior conditions after the target data are extracted, and to output the result that the information to be audited fails the audit when the prior characteristic data do not meet the prior conditions.
In an exemplary embodiment, the second sequence may include a semantic feature vector; the sequence conversion module can also be used for segmenting character type data into words, obtaining a plurality of word vectors through a word2vec model, and generating the semantic feature vector according to the word vectors.
In an exemplary embodiment, the machine learning model may include a long-short term memory network model or a support vector machine model.
The details of each module/unit in the above apparatus have been described in detail in the embodiments of the method section, and thus are not described again.
Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 600 according to such an exemplary embodiment of the present disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, a bus 630 connecting the various system components (including the memory unit 620 and the processing unit 610), and a display unit 640.
Wherein the storage unit stores program code that is executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present disclosure as described in the above section "exemplary methods" of this specification. For example, the processing unit 610 may perform the steps as shown in fig. 2: step S210, extracting target data of a plurality of preset dimensions from information to be audited, wherein the target data comprises numerical data and character data; step S220, converting the numerical data into a first sequence, and converting the character data into a second sequence; step S230, generating a target vector according to the first sequence and the second sequence; and step S240, processing the target vector through a machine learning model to obtain a result of whether the information to be audited passes the audit.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)621 and/or a cache memory unit 622, and may further include a read only memory unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, and a processing unit or local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 800 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure as described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 7, a program product 700 for implementing the above method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not so limited, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit according to an exemplary embodiment of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (13)

1. An information auditing method is characterized by comprising the following steps:
extracting target data of a plurality of preset dimensions from information to be audited, wherein the target data comprises numerical data and character data;
converting the numeric data into a first sequence and the character data into a second sequence;
generating a target vector according to the first sequence and the second sequence;
processing the target vector through a machine learning model to obtain a result of whether the information to be audited passes the audit;
wherein said converting the character-type data into a second sequence comprises:
acquiring a preset vocabulary database, wherein the preset vocabulary database comprises preset vocabularies and unique labels corresponding to the preset vocabularies;
performing word segmentation processing on the character type data to obtain a plurality of words;
arranging the unique labels corresponding to the words in the preset vocabulary database according to the order of the words in the original text of the character-type data, so as to convert the character-type data into the second sequence.
2. The method of claim 1, wherein the obtaining a database of predetermined words comprises:
acquiring a plurality of sample data, wherein the sample data comprises character type sample data;
performing word segmentation processing on the character type sample data to obtain a preset word collection;
generating a unique label for each vocabulary in the preset vocabulary set;
and generating the preset vocabulary database according to each vocabulary in the preset vocabulary set and the unique label of each vocabulary.
3. The method of claim 2, wherein generating a unique label for each vocabulary in the predetermined set of vocabularies comprises:
counting the occurrence times of all the words in the preset word set, and sequencing all the words according to the occurrence times;
and determining the sequence number of each vocabulary in the sorted order as its unique label.
4. The method of claim 2, further comprising:
and filling a preset null word numerical value or deleting an unnecessary numerical value for the second sequence according to the length of the reference sequence.
5. The method of claim 4, further comprising:
determining the reference character type sample data with the largest vocabulary quantity in the plurality of sample data;
converting the reference character type sample data into a numerical sequence according to the preset vocabulary database;
determining the length of the numerical value sequence as the length of the reference sequence.
6. The method of claim 1, further comprising:
and when the character type data comprises the vocabulary except the preset vocabulary, converting the vocabulary except the preset vocabulary into a preset new word numerical value.
7. The method of claim 2, wherein the sample data further comprises numerical sample data and a class label corresponding to the sample data; the method further comprises the following steps:
converting the numerical sample data into a first sample sequence;
converting the character type sample data into a second sample sequence;
and training and obtaining the machine learning model through the first sample sequence, the second sample sequence and the corresponding classification labels.
8. The method according to claim 1, wherein the information to be audited includes a priori characteristic data; the method further comprises the following steps:
after the target data is extracted, judging whether the prior characteristic data meets prior conditions or not;
and if the prior characteristic data is judged not to meet the prior condition, outputting the result that the information to be audited fails the audit.
9. The method of claim 1, wherein the second sequence comprises a semantic feature vector; the converting the character-type data into a second sequence comprises:
dividing the character type data into words, and obtaining a plurality of word vectors through a word2vec model;
generating the semantic feature vector from the plurality of word vectors.
10. The method of claim 1, wherein the machine learning model comprises a long-short term memory network model or a support vector machine model.
11. An information auditing apparatus, characterized by comprising:
the target data extraction module is used for extracting target data of a plurality of preset dimensions from the information to be audited, the target data comprising numerical data and character data;
a sequence conversion module for converting the numeric data into a first sequence and the character data into a second sequence;
a target vector generation module, configured to generate a target vector according to the first sequence and the second sequence;
the machine learning processing module is used for processing the target vector through a machine learning model to obtain a result of whether the information to be audited passes audit;
wherein said converting the character-type data into a second sequence comprises:
acquiring a preset vocabulary database, wherein the preset vocabulary database comprises preset vocabularies and unique labels corresponding to the preset vocabularies;
performing word segmentation processing on the character type data to obtain a plurality of words;
arranging the unique labels corresponding to the words in the preset vocabulary database according to the order of the words in the original text of the character-type data, so as to convert the character-type data into the second sequence.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-10 via execution of the executable instructions.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1-10.
CN201810496212.1A 2018-05-22 2018-05-22 Information auditing method and device, electronic equipment and storage medium Active CN110580308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810496212.1A CN110580308B (en) 2018-05-22 2018-05-22 Information auditing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810496212.1A CN110580308B (en) 2018-05-22 2018-05-22 Information auditing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110580308A CN110580308A (en) 2019-12-17
CN110580308B true CN110580308B (en) 2022-06-07

Family

ID=68808837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810496212.1A Active CN110580308B (en) 2018-05-22 2018-05-22 Information auditing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110580308B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222313B * 2019-12-18 2023-08-18 Neusoft Corporation Security measure auditing method, device and equipment
CN111507097B * 2020-04-16 2023-08-04 Tencent Technology (Shenzhen) Co Ltd Title text processing method and device, electronic equipment and storage medium
CN111581344A * 2020-04-26 2020-08-25 Tencent Technology (Shenzhen) Co Ltd Interface information auditing method and device, computer equipment and storage medium
CN111626883A * 2020-05-29 2020-09-04 Shanghai SenseTime Intelligent Technology Co Ltd Authority verification method and device, electronic equipment and storage medium
CN112950170B * 2020-06-19 2022-08-26 Ant Shengxin (Shanghai) Information Technology Co Ltd Auditing method and device
CN112418798A * 2020-11-23 2021-02-26 Ping An Puhui Enterprise Management Co Ltd Information auditing method and device, electronic equipment and storage medium

Citations (2)

Publication number Priority date Publication date Assignee Title
CN106447366A * 2015-08-07 2017-02-22 Baidu Online Network Technology (Beijing) Co Ltd Checking method of multimedia advertisement, and training method and apparatus of advertisement checking model
CN107944447A * 2017-12-15 2018-04-20 Beijing Xiaomi Mobile Software Co Ltd Image classification method and device

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US9720901B2 (en) * 2015-11-19 2017-08-01 King Abdulaziz City For Science And Technology Automated text-evaluation of user generated text
CN107633433B * 2017-09-29 2021-02-05 Beijing Qihoo Technology Co Ltd Advertisement auditing method and device
CN107832458B * 2017-11-27 2021-08-10 Sun Yat-sen University Character-level text classification method based on nested deep network

Also Published As

Publication number Publication date
CN110580308A (en) 2019-12-17

Similar Documents

Publication Publication Date Title
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN109872162B (en) Wind control classification and identification method and system for processing user complaint information
CN109598517B (en) Commodity clearance processing, object processing and category prediction method and device thereof
CN112270379A (en) Training method of classification model, sample classification method, device and equipment
CN111858843B (en) Text classification method and device
CN113064964A (en) Text classification method, model training method, device, equipment and storage medium
CN110555205B (en) Negative semantic recognition method and device, electronic equipment and storage medium
CN112036168B (en) Event main body recognition model optimization method, device, equipment and readable storage medium
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN112487149A (en) Text auditing method, model, equipment and storage medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN111753086A (en) Junk mail identification method and device
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN110941702A (en) Retrieval method and device for laws and regulations and laws and readable storage medium
CN114528845A (en) Abnormal log analysis method and device and electronic equipment
CN112860919A (en) Data labeling method, device and equipment based on generative model and storage medium
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN114416979A (en) Text query method, text query equipment and storage medium
CN115600109A (en) Sample set optimization method and device, equipment, medium and product thereof
CN113486178B (en) Text recognition model training method, text recognition method, device and medium
CN114722198A (en) Method, system and related device for determining product classification code
CN112989050B (en) Form classification method, device, equipment and storage medium
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN115718889A (en) Industry classification method and device for company profile

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant