CN114282097A - Information identification method and device - Google Patents



Publication number
CN114282097A
Authority
CN
China
Prior art keywords
information
word
identified
violation
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111383614.9A
Other languages
Chinese (zh)
Inventor
李小江
阮禄
吴洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Communication Industry Service Co ltd Zhongran Information Branch
Original Assignee
Chongqing Communication Industry Service Co ltd Zhongran Information Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Communication Industry Service Co ltd Zhongran Information Branch
Priority to CN202111383614.9A
Publication of CN114282097A

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The application discloses an information identification method and an information identification device. The method comprises the following steps: acquiring first information to be identified; determining a matching result between the first information to be identified and a violation word data set, the matching result comprising target violation words that are present in both the first information to be identified and the violation word data set; if the matching result meets a preset condition, acquiring, from the first information to be identified, second information to be identified that includes the target violation words; and judging, by using a depth model, whether the second information to be identified is in violation. With this method, violation information can be identified effectively, which helps create a healthier network environment.

Description

Information identification method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to an information identification method and an information identification apparatus.
Background
With the popularization of electronic products, devices such as mobile phones and computers have become an indispensable part of people's lives. Meanwhile, with the rapid development of the internet industry, web pages provide users with more and more information. However, as network information becomes easier to obtain, some lawbreakers or people with ulterior motives spread violation information on the network, which can easily mislead and harm many netizens and enterprises, sometimes irreparably.
Therefore, how to effectively identify the violation information is a technical problem to be solved urgently.
Disclosure of Invention
The application discloses an information identification method and an information identification device, which can effectively identify violation information and are beneficial to creating a healthier network environment.
In a first aspect, an embodiment of the present application provides an information identification method, where the method includes:
acquiring first information to be identified;
determining a matching result between the first information to be identified and a violation word data set, the matching result comprising target violation words that are present in both the first information to be identified and the violation word data set;
if the matching result meets a preset condition, acquiring, from the first information to be identified, second information to be identified that includes the target violation words;
and judging, by using a depth model, whether the second information to be identified is in violation.
In an alternative embodiment, the second information to be recognized includes a plurality of words; the specific implementation method for judging whether the second information to be identified violates the rule by using the depth model is as follows: determining semantic dependency relations among the words in the second information to be recognized by using the depth model; and judging whether the second information to be identified violates rules or not according to the semantic dependency relationship.
In an alternative embodiment, the second information to be identified includes a first word, an attributive, and a second word, which appear in the second information to be identified in that order. Determining the semantic dependency relations between the words in the second information to be identified by using the depth model comprises: determining, by using the depth model, that of the first word and the second word, the object modified by the attributive is the first word.
In an alternative embodiment, there are one or more target violation words; the preset condition includes one or more of the following: the length of the target violation word is smaller than a first threshold; the number of target violation words is smaller than a second threshold; and the violation degree value of the first information to be identified is smaller than a third threshold, where the violation degree value of the first information to be identified is determined by the part of speech of the target violation words.
In an optional implementation manner, a specific implementation manner of obtaining the second information to be recognized including the target violation word from the first information to be recognized is as follows: determining the position of a target violation word in the first information to be identified; according to the position, sentence cutting is carried out on the first information to be identified, and second information to be identified, including the target violation words, is obtained; and the length of the characters included in the second information to be recognized is smaller than a fourth threshold value, and/or the second information to be recognized has a complete sentence structure.
In an alternative embodiment, the first information to be identified is crawl information that does not match objects in a filter object data set, the filter object data set comprising blacklisted objects and/or whitelisted objects.
In an alternative embodiment, the method may further comprise: crawling the crawl information according to a crawling strategy, wherein the crawling strategy comprises one or more of the following: during crawling, sending a preset number of requests using first information and then sending requests using second information, the first information being identity information and/or address information; if the page structure of a crawled web page is detected not to be a preset structure, formatting the page structure of the web page; and if a crawled URL is detected to be incomplete, performing dynamic packet capture on the page corresponding to that URL.
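The first item of the crawling strategy, switching the information used for requests after a preset number of requests, could be sketched roughly as follows. This is only an illustration: the class name `IdentityRotator`, the identity labels, and the choice to cycle back to the first identity are assumptions, not details given in the application.

```python
from itertools import cycle


class IdentityRotator:
    """Hand out a crawl identity, switching after a preset number of requests."""

    def __init__(self, identities, requests_per_identity):
        if not identities:
            raise ValueError("at least one identity is required")
        if requests_per_identity < 1:
            raise ValueError("requests_per_identity must be at least 1")
        self._pool = cycle(identities)  # assumption: identities are reused in turn
        self._limit = requests_per_identity
        self._current = next(self._pool)
        self._sent = 0

    def next_identity(self):
        """Return the identity to attach to the next request."""
        if self._sent >= self._limit:
            # Preset number reached: move on to the next identity.
            self._current = next(self._pool)
            self._sent = 0
        self._sent += 1
        return self._current


# Hypothetical identities; in practice these might be accounts or addresses.
rotator = IdentityRotator(["identity-A", "identity-B"], requests_per_identity=3)
sequence = [rotator.next_identity() for _ in range(7)]
```

With a preset number of 3, the first three requests use the first identity, the next three use the second, and the seventh returns to the first.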
In a second aspect, an embodiment of the present application provides an information identification apparatus, which includes means for implementing the method of the first aspect.
In a third aspect, an embodiment of the present application provides another information identification apparatus, including a processor; the processor is configured to perform the method of the first aspect.
In an alternative embodiment, the information identification device may further include a memory; the memory is used for storing a computer program; the processor is specifically configured to invoke the computer program from the memory and execute the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a chip, where the chip is configured to perform the method of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to the first aspect.
Drawings
Fig. 1 is a schematic flowchart of an information identification method according to an embodiment of the present application;
Fig. 2 is a schematic diagram illustrating a result of identifying violation information by using a depth model according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an information identification system according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a hierarchy in a web page according to an embodiment of the present application;
Fig. 5 is a schematic diagram illustrating a processing flow of a text judging module according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an information identification apparatus according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of another information identification apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to better understand the technical solutions provided by the embodiments of the present application, first, technical terms related to the embodiments of the present application are introduced.
(1) Depth model
The depth model (a deep learning model) is used to identify whether information to be identified is in violation. In the present application, the depth model may use the biLSTM architecture. Alternatively, the depth model may combine the biLSTM and textcnn models. Optionally, the depth model may use biLSTM plus textcnn together with Virtual Adversarial Training as the model architecture for text classification, that is, the perturbation produced by Virtual Adversarial Training is added to the model when the biLSTM is trained.
The idea of adversarial training is to cultivate an opponent (an adversarial network) and to keep improving the network being trained (the generative network) by having it learn against that opponent. For example, the adversarial network and the generative network are trained separately with different objectives and compete with each other. Using Virtual Adversarial Training for the depth model can improve the generalization ability and robustness of the model.
Adversarial training learns from the adversarial examples generated against one particular model, so the trained model is inevitably tailored to that model; when it is attacked with adversarial examples generated by other models, its error rate may therefore be higher than that of the original model. In addition, a model is generally robust to adversarial examples generated by the model it was adversarially trained against. Adversarial training not only fits the perturbations that affect the model, but also weakens the linearity assumption of the model on which single-step attacks rely, which further improves the robustness of the model to black-box attacks.
Adversarial perturbations typically consist of small modifications to many real-valued inputs. For text classification, the input is discrete and is usually represented as a series of high-dimensional one-hot vectors. Since a set of high-dimensional one-hot vectors does not admit infinitesimal perturbations, the depth model in this application defines the perturbations on the continuous word embeddings rather than on the discrete word inputs. Both traditional adversarial training and virtual adversarial training can be interpreted as regularization strategies and as defenses against an adversary who supplies malicious input. Since a perturbed embedding does not map to any word, and an adversary may not have access to the word embedding layer, the training strategy in this application is no longer a defense strategy against an adversary.
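To illustrate what defining the perturbation on continuous word embeddings can look like, the following minimal sketch scales a gradient to a fixed L2 norm and adds it to the embeddings. It covers only this norm-bounded step; the gradient itself, the bound `epsilon`, and the function names are assumptions supplied for illustration, and the application does not state how the perturbation is computed.

```python
import math


def norm_bounded_perturbation(gradient, epsilon):
    """Rescale a gradient (one row per word embedding) to L2 norm epsilon."""
    norm = math.sqrt(sum(g * g for row in gradient for g in row))
    if norm == 0.0:
        # No direction to follow: return an all-zero perturbation.
        return [[0.0 for _ in row] for row in gradient]
    return [[epsilon * g / norm for g in row] for row in gradient]


def perturb_embeddings(embeddings, gradient, epsilon=0.1):
    """Add the perturbation to continuous embeddings, not to discrete words."""
    delta = norm_bounded_perturbation(gradient, epsilon)
    return [
        [e + d for e, d in zip(e_row, d_row)]
        for e_row, d_row in zip(embeddings, delta)
    ]


# Toy values (hypothetical): one word with a 2-dimensional embedding.
perturbed = perturb_embeddings([[1.0, 1.0]], [[3.0, 4.0]], epsilon=0.5)
```

The perturbed vectors generally do not coincide with the embedding of any vocabulary word, which is consistent with the remark above that a perturbed embedding does not map to a word.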
(2) Cloud service
The information identification method provided by the embodiments of the present application may be executed by an information identification apparatus, which may take the form of a virtual device hosted on a cloud server. On one physical server (host), multiple partitions resembling independent servers are virtualized; each partition can run its own operating system and is managed in the same way as a server. A cloud server provides elastic cloud technology in which the configuration of the cloud host can be adjusted, and offers a cloud host rental service that is used on demand and paid for on demand, which greatly improves flexibility, controllability, scalability, and resource reusability. Its management is simpler and more efficient than that of a physical server, and more stable and secure applications can be built quickly, which reduces the difficulty of development, operation and maintenance as well as Information Technology (IT) costs.
The information identification method related to the application can be packaged into a cloud service, and an interface is exposed to the outside. When the information identification method related to the application needs to be used, whether the information to be identified violates rules or not can be identified by calling the interface.
(3) Cloud computing
Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user to be infinitely expandable, available at any time, obtained on demand, and paid for according to use.
The information identification method provided by the application relates to large-scale calculation, and requires large calculation power and storage space, so that in a feasible implementation manner of the information identification method, enough calculation power and storage space can be obtained through a cloud computing technology.
In order to effectively identify violation information and thus create a healthier network environment, the embodiment of the present application provides an information identification method, as shown in fig. 1, which may include, but is not limited to, the following steps:
S101, acquiring first information to be identified.
The information identification method provided by the embodiment of the application can be executed by an information identification device, and the information identification device can be a server or a terminal device. The server may be a cloud server, that is, the cloud server may execute the information identification method. A terminal device may also be referred to as a User Equipment (UE), a terminal (terminal), a Mobile Station (MS), a Mobile Terminal (MT), and so on. The terminal device may be a mobile phone (mobile phone), a wearable device, a tablet computer, a computer with a wireless transceiving function, a Virtual Reality (VR) terminal device, an Augmented Reality (AR) terminal device, a wireless terminal in a smart city (smart city), a wireless terminal in a smart home (smart home), and so on.
The first information to be identified may be information obtained by a crawler, and may include information in one or more of the following forms: text, pictures, video, audio, files, and so on. For example, the first information to be identified is a web page. Alternatively, the information identification apparatus may acquire the first information to be identified from a local database or through a cloud service.
In one implementation, the first to-be-identified information may be crawl information that does not match objects in a filter object data set, the filter object data set including blacklist objects and/or whitelist objects. In other words, the first to-be-identified information may be crawled information filtered by the filtering object dataset, which does not match (or is not associated with) the blacklisted object in case the filtering object dataset comprises the blacklisted object; in the event that the filtered object data set includes a whitelist object, the first information to be identified does not match (or is not associated with) the whitelist object; in the case where the filtered object dataset includes both blacklisted objects and whitelisted objects, the first information to be identified does not match (or is not associated with) either the blacklisted objects or the whitelisted objects. Wherein, the first information to be identified not matching (or not being associated with) the blacklist object may refer to: the blacklist object is not included in the first information to be identified.
The object can be information such as a website, a webpage, words and the like. The blacklisted objects may be objects for which violations were previously detected. The white list object can be a large website, and the data in the website is strictly checked in the process of publishing, so that the information in the website is generally not illegal. The crawling information is filtered through the filtering object data set, some crawling information is filtered without further identification, and only the crawling information obtained after filtering is further identified, so that the calculation resources are saved.
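The filtering step described above could be sketched as follows. The record layout (a dictionary with `site` and `text` keys) and the function name are hypothetical; the sketch assumes the filter objects are websites and, as described above, drops crawled items that match either list.

```python
def filter_crawled(items, blacklist=frozenset(), whitelist=frozenset()):
    """Keep only crawled items whose site is on neither filter list."""
    # Blacklisted sites are already known violators and whitelisted sites are
    # trusted, so neither group needs further identification.
    excluded = set(blacklist) | set(whitelist)
    return [item for item in items if item["site"] not in excluded]


# Hypothetical crawl results.
crawled = [
    {"site": "known-bad.example", "text": "..."},
    {"site": "trusted-portal.example", "text": "..."},
    {"site": "unknown-forum.example", "text": "..."},
]
to_identify = filter_crawled(
    crawled,
    blacklist={"known-bad.example"},
    whitelist={"trusted-portal.example"},
)
```

Only the item from the site that is on neither list remains as first information to be identified, which is where the saving in computing resources comes from.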
S102, determining a matching result between the first information to be recognized and the violation word data set; the matching result includes a target violation word that is present in both the first to-be-identified information and the violation word data set.
The violation word data set comprises a plurality of violation words, which may belong to one or more fields. For example, a violation word may be a word relating to pornography, gambling, politics, viruses, and the like.
The information identification apparatus may search the first information to be identified for the violation words in the violation word data set to obtain the matching result. Note that a violation word in the embodiments of the present application is a word that exists in the violation word data set. The matching result may include the target violation words retrieved from the first information to be identified; it is understood that the target violation words also exist in the violation word data set. In other implementations, another device may perform the retrieval on the first information to be identified, and the information identification apparatus obtains the matching result from that device. The violation word data set may be stored in the information identification apparatus or in a cloud server, which is not limited in the embodiments of the present application.
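As a minimal sketch of step S102, the retrieval can be illustrated with plain substring search. The application does not specify the retrieval algorithm, so the function name and the use of `str.find` are assumptions; a large data set would more likely call for a multi-pattern matcher.

```python
def match_violation_words(text, violation_words):
    """Return the violation words found in text, ordered by first occurrence."""
    hits = [(text.find(word), word) for word in violation_words if word and word in text]
    return [word for _, word in sorted(hits)]


# Hypothetical data set and input text.
violation_word_set = {"casino", "lottery", "dawn casino"}
target_words = match_violation_words("visit the dawn casino tonight", violation_word_set)
```

Here `target_words` holds the words present in both the text and the data set, which is what the matching result is described as containing.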
S103, if the matching result meets a preset condition, second information to be recognized including the target violation word is obtained from the first information to be recognized.
After the information identification apparatus obtains the matching result, it can determine whether the matching result meets a preset condition (for example, through violation word matching). If the preset condition is met, further identification is performed to decide whether there is a violation; if the preset condition is not met, whether the first information to be identified is in violation can be determined without further identification.
In one implementation, there are one or more target violation words; the preset condition may include one or more of the following: the length of the target violation word is smaller than a first threshold; the number of target violation words is smaller than a second threshold; and the violation degree value of the first information to be identified is smaller than a third threshold, where the violation degree value of the first information to be identified is determined by the part of speech of the target violation words.
Violation word matching may include one or more of the following subprocesses: long word calculation, violation degree calculation, and part-of-speech calculation. The long word calculation subprocess detects violation words of greater length. A target violation word whose length is smaller than the first threshold is not a long violation word. Optionally, if the length of the target violation word is detected to be greater than or equal to the first threshold, no further identification is needed and the first information to be identified is determined to be in violation. This is because, compared with the words people conventionally use, long words carry more semantic information and more certainty. In everyday language use, a relatively long string of characters used as a single word very likely denotes a specific event or idea. Therefore, if a long word is matched in the first information to be identified, that information is in violation with high probability. The number of target violation words being smaller than the second threshold indicates that few target violation words are present in the first information to be identified. Optionally, if the number of detected target violation words is greater than or equal to the second threshold, the first information to be identified may likewise be determined to be in violation without further identification, because information containing a large number of violation words is in violation with high probability.
The violation degree calculation subprocess accumulates the parts of speech of the violation words matched in the first information to be identified. The violation degree value of the first information to be identified being smaller than the third threshold indicates that the value determined by the parts of speech of the target violation words present in that information is smaller than the third threshold; in other words, the violation degree of the first information to be identified is low. Optionally, if the violation degree value of the first information to be identified is greater than or equal to the third threshold, no further identification is needed and the first information to be identified may likewise be determined to be in violation. The violation degree calculation is a quantitative process. For example, a large proportion, or even all, of the words appearing in a pornography-related web page have a pornography-related part of speech, and when the accumulated violation degree is greater than or equal to the third threshold, the web page can be determined directly to be in violation without further identification.
The part-of-speech calculation subprocess infers the part of speech of a word and, from it, the semantic information of the information to be identified to which the word belongs. Part-of-speech calculation can be applied in the violation degree calculation subprocess: the part of speech of each target violation word is determined, and the violation degree value of the first information to be identified is obtained from these. In one implementation, each part of speech corresponds to a violation degree value. After the part of speech of each target violation word is determined, the violation degree values corresponding to those parts of speech are added, and the sum is used as the violation degree value of the first information to be identified.
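The preset condition and the violation degree calculation can be illustrated with the sketch below. Several points are assumptions made for the example: the three sub-conditions are combined with a logical AND (the application allows any one or more of them), the part of speech is modelled as a simple label per word, and the threshold values, labels, and names such as `needs_depth_model` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    first: int    # length of a target violation word must stay below this
    second: int   # number of target violation words must stay below this
    third: float  # violation degree value must stay below this


def violation_degree(target_words, pos_of, degree_of_pos):
    """Sum the violation degree values attached to each word's part of speech."""
    return sum(degree_of_pos[pos_of[word]] for word in target_words)


def needs_depth_model(target_words, pos_of, degree_of_pos, thresholds):
    """True when the preset condition holds and further identification is needed."""
    if not target_words:
        return False  # assumption: nothing matched, so nothing to pass on
    all_short = all(len(word) < thresholds.first for word in target_words)
    few_words = len(target_words) < thresholds.second
    low_degree = violation_degree(target_words, pos_of, degree_of_pos) < thresholds.third
    return all_short and few_words and low_degree


# Hypothetical labels and values.
pos_of = {"bet": "gambling", "dawn casino": "gambling"}
degree_of_pos = {"gambling": 2.0}
limits = Thresholds(first=6, second=3, third=5.0)

short_case = needs_depth_model(["bet"], pos_of, degree_of_pos, limits)
long_case = needs_depth_model(["dawn casino"], pos_of, degree_of_pos, limits)
```

In the second call the matched word is longer than the first threshold, so the preset condition fails and, following the text above, the information could be treated as a violation without running the depth model.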
If the preset condition is met, second information to be recognized including the target violation word can be further acquired from the first information to be recognized, and whether the second information to be recognized violates the rule or not is judged. Whether the second information to be identified is illegal may indicate whether the first information to be identified is illegal. And if the second information to be identified is illegal, the first information to be identified is illegal. And if the second information to be identified does not violate the rule, the first information to be identified does not violate the rule.
In one implementation manner, a specific implementation manner of obtaining the second information to be recognized including the target violation word from the first information to be recognized may be: determining the position of a target violation word in the first information to be identified; according to the position, sentence cutting is carried out on the first information to be identified, and second information to be identified, including the target violation words, is obtained; and the length of the characters included in the second information to be recognized is smaller than a fourth threshold value, and/or the second information to be recognized has a complete sentence structure.
When the matching result meets the preset condition, little semantic information is available about the first information to be identified, so the first information to be identified can be cut according to the position of the target violation word and the context around it. Optionally, either of the following two sentence cutting manners may be adopted. In the first manner, a short passage around the target violation word is cut from the first information to be identified; second information to be identified obtained this way may not conform to normal reading habits or logical structure. The character length of the second information to be identified being smaller than the fourth threshold indicates that it is a short passage cut from the first information to be identified. In the second manner, the sentence is cut according to normal reading (that is, according to normal semantics), and the resulting second information to be identified has a complete sentence structure. A complete sentence structure may mean that the second information to be identified includes a subject, a predicate, and an object. The first manner may be called violation short-text sentence cutting or relative sentence cutting, and the second manner may be called semantic sentence cutting or absolute sentence cutting.
S104, judging whether the second information to be identified is in violation by using the depth model.
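The two sentence cutting manners can be sketched as follows. This is a simplification under stated assumptions: the relative cut takes a character window around the word, the absolute cut splits on sentence-ending punctuation and does not actually verify a subject-predicate-object structure, and the function names and sample text are hypothetical.

```python
import re

# Sentence-ending punctuation, Western and Chinese (an assumed rule set).
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def relative_cut(text, word, fourth_threshold):
    """Cut a short window around the word, shorter than fourth_threshold characters."""
    pos = text.find(word)
    if pos < 0:
        return None
    # Characters left over for context; the word itself is always kept.
    budget = max(fourth_threshold - 1 - len(word), 0)
    left = budget // 2
    start = max(pos - left, 0)
    end = min(pos + len(word) + (budget - left), len(text))
    return text[start:end]


def absolute_cut(text, word):
    """Return the whole sentence that contains the word."""
    for sentence in _SENTENCE.findall(text):
        sentence = sentence.strip()
        if word in sentence:
            return sentence
    return None


sample = "The artist arrived. She performed at the dawn casino. Fans cheered."
short_text = relative_cut(sample, "casino", fourth_threshold=20)
full_sentence = absolute_cut(sample, "dawn casino")
```

The relative cut may start or end in the middle of a word, which matches the remark above that it may not conform to normal reading habits, while the absolute cut returns a readable sentence.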
In one implementation, the second information to be recognized may include a plurality of words; judging whether the second information to be identified violates rules by using the depth model, including: determining semantic dependency relations among the words in the second information to be recognized by using the depth model; and judging whether the second information to be identified violates rules or not according to the semantic dependency relationship. The semantic dependency refers to a degree of correlation between words in the second information to be recognized, for example, the second information to be recognized includes an adjective and two nouns, and the semantic dependency may indicate which of the two nouns the adjective is used to modify.
In one implementation, the second information to be identified includes a first word, an attributive, and a second word, which appear in the second information to be identified in that order. Determining the semantic dependency relations between the words in the second information to be identified by using the depth model includes: determining, by using the depth model, that of the first word and the second word, the object modified by the attributive is the first word and not the second word. In other implementations, the object modified by the attributive may instead be the second word. When the attributive modifies the first word, an attributive that appears later in the second information to be identified modifies a word that appears before it, so the depth model in this application can make full use of historical information. For example, in the sentence "this restaurant is dirty and not well-separated", a later expression modifies the degree of "dirty"; the depth model in this application can better capture such bidirectional semantic dependence.
In natural language processing, to combine the representations of words into a representation of a sentence, an additive method may be used, that is, the representations of all words are summed or averaged; however, these methods do not consider the order of the words in the sentence. Take the sentence "I don't feel that he is good". Without the depth model in this application, simply adding all word representations cannot reveal that "not" negates the following "good", so the presence of "good" would cause the sentence to be mistaken for a commendatory one. With the depth model in this application, bidirectional semantic dependence can be captured better, that is, it can be learned that "not" negates "good", so the sentiment of the sentence can be accurately determined to be derogatory. The depth model in the present application also supports fine-grained classification of sentiment words, such as a five-class scheme ranging from commendatory through neutral to derogatory, and in the five-class case it also attends to the interaction between sentiment words, degree words, and negation words.
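The limitation of the additive method can be seen in a few lines. In the sketch below, two sentences with opposite meanings contain the same words, so summing their word vectors gives identical sentence representations. The toy vectors and sentences are invented for illustration only.

```python
def sum_representation(words, vectors):
    """Represent a sentence as the element-wise sum of its word vectors."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total


# Toy 2-dimensional word vectors (hypothetical values).
toy_vectors = {
    "he": [0.5, 0.0],
    "is": [0.0, 0.5],
    "good": [1.0, 1.0],
    "bad": [-1.0, -1.0],
    "not": [0.0, -0.5],
}

sentence_a = "he is good not bad".split()
sentence_b = "he is bad not good".split()
same_representation = (
    sum_representation(sentence_a, toy_vectors)
    == sum_representation(sentence_b, toy_vectors)
)
```

Because addition ignores order, `same_representation` is true even though the two sentences express opposite sentiments, which is why an order-aware model is needed to tell which word "not" negates.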
In one implementation, in the second information to be identified, the number of characters between the first word and the attributive is greater than a preset number, which indicates that the first word and the attributive are relatively far apart. In other words, the depth model in this application can better capture semantic dependency relations over longer distances, which improves the accuracy of identifying violation words.
Optionally, the learning parameters of the depth model in the present application may be updated. Specifically, the learning parameters may be updated using the backpropagation through time (BPTT) algorithm; in this case the depth model differs from a general model in that the hidden layer performs its computation over all time steps both in the forward pass, which computes the error, and in the backward pass, which computes the gradients used to update the model parameters.
In the case where the depth model combines the biLSTM and textcnn models, the similarity to a violation word can be calculated during convolution, and the max pooling layer then indicates whether a violation word the model attends to appears in the information to be recognized. Optionally, it may also be determined how similar the most similar violation word is to the convolution kernel. Assuming the Chinese text is represented as word vectors, a convolution kernel ideally represents a keyword (e.g., a violation word). For example, in a binary classification task, if the whole depth model is treated as a black box and its output is inspected, the model is found to be particularly sensitive to whether the input text contains words such as "like" and "love". This is because, if one or both of these words appear in a large number of training samples, they are common features of that class of data, and the convolution kernel can learn these features. In the depth model of the present application, one convolution kernel may learn only half of a keyword vector while another convolution kernel learns the other half, and the feature values are finally accumulated at the classifier to obtain the final result. Therefore, the depth model of the present application can acquire local semantic features of the text from multiple dimensions.
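The idea that a convolution kernel acts as a keyword detector followed by max pooling can be sketched without any deep learning library. The embeddings and the kernel below are hypothetical, hand-set values; in a trained textcnn they would be learned, and several kernels of different widths would be used.

```python
# Hypothetical 3-dimensional word vectors; real embeddings are learned.
EMBEDDINGS = {
    "dawn": [1, 0, 0],
    "casino": [0, 1, 0],
    "artist": [0, 0, 1],
    "event": [0, 0, 1],
    "picture": [0, 0, 1],
}

def convolve(words, kernel):
    """Slide a kernel that spans len(kernel) words over the text.

    Each score is the dot product between the kernel and one window of
    word vectors, i.e. a similarity between the window and the kernel.
    """
    width = len(kernel)
    scores = []
    for start in range(len(words) - width + 1):
        window = words[start:start + width]
        score = sum(
            w * k
            for word, kernel_row in zip(window, kernel)
            for w, k in zip(EMBEDDINGS[word], kernel_row)
        )
        scores.append(score)
    return scores

def max_pool(scores):
    """Keep only the strongest response, wherever it occurred in the text."""
    return max(scores)

# A width-2 kernel hand-set to respond to the phrase "dawn casino".
KERNEL = [EMBEDDINGS["dawn"], EMBEDDINGS["casino"]]

with_keyword = max_pool(convolve(["artist", "event", "dawn", "casino"], KERNEL))
without_keyword = max_pool(convolve(["artist", "picture", "event"], KERNEL))
print(with_keyword, without_keyword)
```

The pooled score is high only when some window resembles the kernel, which is why the position of the keyword in the text does not matter to this layer.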
Optionally, in the depth model of the present application, a biLSTM layer may be added before the information to be identified enters the textcnn model in order to capture global information of the text. This forms a classification model in which the LSTM layer learns context dependence and textcnn captures locally important information.
Referring to fig. 2, a diagram of the result of identifying violation information using the depth model is shown. In fig. 2, before the improvement, the LSTM model is used to determine whether the text violates the rules; after the improvement, the depth model is used. As can be seen from fig. 2, for the text "see asian american pictures together", the recognition result before the improvement is a violation, and the recognition result after the improvement is a non-violation. For the text "dawn casino", recognizing a violation is accurate, because the probability that such text involves a violation is very high. The text "artist engaged in an event at dawn casino" contains "dawn casino", which is identified as violation information, but in context the text is more likely not a violation. The depth model can distinguish these cases accurately because it can capture semantic dependencies over longer distances, whereas the LSTM model cannot. Therefore, using the depth model improves the accuracy of violation information identification and avoids mistaking normal information for violation information.
It should be noted that the threshold (e.g., the first threshold, the second threshold, the third threshold, etc.) and the preset parameter (e.g., the preset number, etc.) in the embodiment of the present application may be set or modified by the information identification device.
In one implementation, an architecture diagram of the information recognition system is shown in fig. 3. The information identification system may comprise the following modules: the online configuration module, the crawler module, the data parsing, storage and deduplication module, and the text study and judgment module. As shown in FIG. 3, the online configuration module is also referred to as module 1, the crawler module as module 2, the data parsing, storage and deduplication module as module 3, and the text study and judgment module as module 4.
It should be noted that the processing performed by each module constitutes one link in the overall flow, that each link processes and stores data with its own architecture, and that the processing flows within each module are decoupled from one another through message queues. These small functional modules are in effect individual atomic capabilities, each with a relatively single and focused function, so some of the modules can be upgraded, optimized, or remodeled on their own. Communication among the modules is based on the flow of data. The independent module design removes strong dependence among the modules of the system. Each module can provide its service capability independently: when another project or system needs that capability, it can access the corresponding module and use it.
The online configuration module (i.e., module 1) may have one or more of the following functions: letting a user flexibly configure crawler seeds and web sites, configuring the crawling strategy, and setting a crawling target for the web crawler. Configurable items include, for example, the crawl level of a website, whether a headless browser is used to render pages, the number of processes, the number of threads, the amount of memory allocated to the crawler, the network traffic threshold, the restart strategy for program failures, the re-crawl strategy for crawl failures, the response time of each page, and the alarm level. As shown in module 1 in fig. 3, when there is a crawling task, seeds are loaded, and the loaded seed data enters a seed queue so that module 2 can use the seed queue for crawling.
The web sites are the sites that the crawler needs to crawl. Optionally, a provided web site may have one or more of the following features: its browsing amount is less than a fifth threshold, and its search ranking on a search engine platform is within the top N, where N is an integer greater than or equal to 1. A browsing amount below the fifth threshold indicates that the site is not heavily visited, and a search ranking within the top N indicates that the site ranks high on the search engine platform. For example, the web site may be the official website of a medical institution, a school or educational body, or the like.
The information recognition system can use a headless browser to dynamically load and render pages in order to ensure the authenticity of the data. Because the web pages of some real sites are loaded dynamically, a crawler that does not use a headless browser to load the content of a Uniform Resource Locator (URL) obtains only HyperText Markup Language (HTML) tags. In a dynamic page, text and picture information is generally fetched from a remote server through JavaScript (js) requests, so information obtained with an ordinary crawling strategy alone does not reflect the real page. A headless browser is a web browser without a Graphical User Interface (GUI) and is typically controlled programmatically or through a command line interface.
The crawl level can be used to control the deepest level to which web pages are crawled, and thereby the amount of crawling. The number of processes and the number of threads can be used to control the crawling speed. The memory allocation item can be used to safeguard the performance of the crawler system: when the amount of crawled data is large, setting a larger memory for the project safeguards the service capability of the crawler, and the item also ensures that memory requested by the program does not grow without control and harm the server and other projects. Network traffic is the amount of data that can pass through the network in one second, in bits per second. The network can be compared to a highway: the larger the traffic capacity, the wider the road, and the more data can pass along it at the same time. When crawling data, the crawler downloads data in addition to making basic network requests, so the crawling speed and progress can be controlled by setting a reasonable network traffic threshold. The restart strategy for program failures ensures that the program does not lose its service capability. The re-crawl strategy for crawl failures ensures that crawled data is not lost. The page response time is the loading time of the page content; when the content of a web page is loaded from several remote servers, the page can take a long time to load. The alarm level is the violation level of the violation information study and judgment result in the subsequent process. In the embodiments of the present application, violation information may also be referred to as bad information.
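The configuration items described above could be grouped into a single settings object. The sketch below is only one possible layout: every field name and default value is a hypothetical choice made for illustration and is not specified by the application.

```python
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    """Hypothetical container for the online configuration items."""
    max_level: int = 3                    # deepest crawl level
    use_headless_browser: bool = True     # render dynamic pages
    num_processes: int = 4
    num_threads: int = 8
    memory_limit_mb: int = 2048           # upper bound on crawler memory
    traffic_limit_bits_per_s: int = 50_000_000
    restart_on_failure: bool = True       # program failure restart strategy
    max_recrawl_attempts: int = 3         # crawl failure re-crawl strategy
    page_response_timeout_s: float = 30.0
    alarm_level: int = 1

    def validate(self):
        """Reject settings that cannot describe a runnable crawl."""
        if self.max_level < 1:
            raise ValueError("max_level must be at least 1")
        if self.num_processes < 1 or self.num_threads < 1:
            raise ValueError("at least one process and one thread are required")
        if self.page_response_timeout_s <= 0:
            raise ValueError("page_response_timeout_s must be positive")
        return self

config = CrawlConfig(max_level=2, num_threads=16).validate()
print(config)
```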
The crawler module (i.e., module 2) may be used to capture data according to the strategy and task configured by module 1. In fig. 3, the cache in module 2 may be used to temporarily store the loaded seed data and the link and picture data obtained later after URL parsing and deduplication. When the system is running, data in the cache is crawled first; if the cache is empty or all of its data has been loaded, data in the message queue is loaded next. This prevents directly loaded seeds from going unserved for lack of resources when the system is handling a heavy crawling load.
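The cache-first loading order can be sketched with two in-memory queues. The function name and the use of `collections.deque` are assumptions for illustration; the application does not specify how the cache or the message queue is implemented.

```python
from collections import deque

def next_url(cache, message_queue):
    """Return the next URL to crawl, preferring the cache over the message queue.

    Returns None when both are empty.
    """
    if cache:
        return cache.popleft()
    if message_queue:
        return message_queue.popleft()
    return None

cache = deque(["seed-1", "seed-2"])
message_queue = deque(["queued-1"])
order = []
while True:
    url = next_url(cache, message_queue)
    if url is None:
        break
    order.append(url)
print(order)
```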
The message queue can be a URL queue to be captured or other queues. The fetching policy can be used to determine the arrangement order of the URLs in the URL queue to be fetched, and the arrangement order of the URLs can affect the fetching order of the pages corresponding to the URLs.
The crawler system supports multiple processes and multiple threads. The data to be crawled on a site is organized as a tree, which means that the amount of crawled data grows exponentially as the crawl level deepens. FIG. 4 is a hierarchy of web pages in which the letters A, B, C, D … J each represent a hyperlink. The information identification system in the present application can capture data by Depth First Search (DFS) or Breadth First Search (BFS). DFS means that, starting from a URL, the crawler follows links one after another along a branch and does not switch to another branch until all links on the current branch have been processed. The capture order in this case is: A -> B -> D -> H -> I -> E -> J -> C -> F -> G. BFS inserts the links found in a newly downloaded web page directly at the end of the URL queue to be captured. That is, the web crawler first captures all web pages linked from the initial web page, then selects one of those pages and continues to capture all web pages linked from it. The capture order in this case is: A -> B -> C -> D -> E -> F -> G -> H -> I -> J. If the information recognition system crawls by DFS, each page carries a level identifier when it is crawled. While the identifier does not exceed the level set in the configuration, the next level is automatically crawled and parsed, and the parsed data is marked with its level. As the iteration proceeds, the crawl stops descending automatically once the set level is reached.
For some sites, a web crawler sends a large number of requests in a short time, consumes a large amount of server bandwidth, and may affect access by normal users. In addition, data has become a core asset of companies, and an enterprise needs to protect its core data to maintain or improve its competitiveness, so anti-crawler measures are very important. The information identification system in the present application can also cope with anti-crawler measures.
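The two capture orders and the level limit can be reproduced with a short sketch. FIG. 4 is not reproduced in the text, so the link structure below is an assumption chosen to be consistent with both orders stated above; the function names are likewise illustrative.

```python
from collections import deque

# Assumed link tree, consistent with the DFS and BFS orders given in the text.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F", "G"],
    "D": ["H", "I"],
    "E": ["J"],
}

def dfs_order(start, max_level):
    """Depth-first capture; each page carries a level identifier."""
    order = []

    def visit(page, level):
        order.append(page)
        if level >= max_level:  # configured level reached: stop descending
            return
        for child in LINKS.get(page, []):
            visit(child, level + 1)

    visit(start, 1)
    return order

def bfs_order(start, max_level):
    """Breadth-first capture; new links go to the end of the URL queue."""
    order = []
    queue = deque([(start, 1)])
    while queue:
        page, level = queue.popleft()
        order.append(page)
        if level < max_level:
            for child in LINKS.get(page, []):
                queue.append((child, level + 1))
    return order

print(" -> ".join(dfs_order("A", max_level=4)))
print(" -> ".join(bfs_order("A", max_level=4)))
print(" -> ".join(dfs_order("A", max_level=2)))
```

Because the assumed structure is a tree, no visited set is needed here; on a real site, links repeat, which is why the deduplication step matters.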
The anti-crawling problem can be solved by configuring a crawling strategy in the information identification system. In one implementation, the information identification method may further include the following steps: according to the crawling strategy, crawling information is obtained through crawling; wherein the crawling strategy comprises one or more of the following for solving the anti-crawling problem:
Crawling strategy 1: during crawling, send a preset number of requests using first information, and then send requests using second information. The first information is identity information and/or address information. The second information may also be identity information and/or address information. The first information is different from the second information. The address information may be, for example, an IP address. By switching to a different IP address after every few requests, crawling strategy 1 can easily bypass various anti-crawler measures.
Crawling strategy 2: if the page structure of the crawled web page is detected not to be the preset structure, modify the XPath or the regular expression according to the source code of the web page. Some anti-crawling measures change the original HTML page structure through JavaScript, so that the required content can no longer be matched by the program. In crawling strategy 2, when the page structure of the crawled web page is detected not to be the preset structure, the XPath or the regular expression for the web page is changed according to the source code of the web page so that data in a standard form can still be returned, which solves the anti-crawler problem. The preset structure may refer to a preset standard HTML page structure; a page structure that is not the preset structure indicates that the page structure of the crawled web page has changed. XPath is a query language for selecting nodes in XML and HTML documents, and it can be used from Python through third-party libraries to parse web page content.
Crawling strategy 3: if the page structure of the crawled web page is not the preset structure, format the page structure of the web page. Crawling strategy 3 is similar to crawling strategy 2: when the page structure of the crawled web page is detected not to be the preset structure, the page structure is formatted so that data in a standard form can be returned, which solves the anti-crawler problem.
Crawling strategy 4: if the crawled URL is detected to be incomplete, perform dynamic packet capture on the page corresponding to the crawled URL. An incomplete crawled URL may indicate that the data of the corresponding web page is not loaded all at once, so that the data obtained by the crawler is incomplete. In this case, crawling strategy 4 performs dynamic packet capture on the page corresponding to the crawled URL in order to obtain the asynchronously loaded data packets. In this way, the new content loaded by the web page each time can be captured, which solves the anti-crawler problem.
Crawling strategy 5: if the crawled URL is detected to be incomplete or incorrect, acquire an encrypted file, encrypt the crawled information according to the encrypted file, and return the encrypted data. A crawled URL may be detected as incomplete or incorrect because the target site encrypts some parameters with JavaScript. In this case, crawling strategy 5 acquires the encrypted file, analyzes the encryption algorithm, encrypts the crawled information according to the encrypted file, and returns the encrypted data, which solves the anti-crawler problem.
Crawling strategy 6: add Headers to the crawler, and copy the User-Agent of a browser into the Headers of the crawler, or modify the Referer value to the domain name of the target website. For anti-crawler measures that inspect the Headers, copying the User-Agent of a browser into the Headers of the crawler solves the anti-crawler problem. The Headers may be browser identification data, which may include browser configuration information (e.g., the kernel version used, the browser, supported network protocols, hypertext protocols, etc.). The User-Agent may be used to store the user's own information, such as an identity (id), a user name, a user password, session information of the user, and the like. This information is verified when the website is accessed. When a page resource is accessed, the browser uses Referer to tell the server from which page the access was linked, and Referer can be used to verify the validity of the access.
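Crawling strategies 1 and 6 lend themselves to a small sketch. The helper names, the example proxy addresses, and the User-Agent string below are all hypothetical; the code only builds the values and sends no request. `Referer` is the standard spelling of the HTTP request header.

```python
from urllib.parse import urlsplit

def pick_identity(request_index, identities, requests_per_identity):
    """Strategy 1: use one identity for a preset number of requests, then switch."""
    slot = request_index // requests_per_identity
    return identities[slot % len(identities)]

def build_headers(user_agent, target_url):
    """Strategy 6: copy a browser User-Agent and point Referer at the target site."""
    parts = urlsplit(target_url)
    return {
        "User-Agent": user_agent,
        "Referer": f"{parts.scheme}://{parts.netloc}/",
    }

# Hypothetical address pool (documentation-range IPs) and browser string.
PROXIES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"

schedule = [pick_identity(i, PROXIES, requests_per_identity=2) for i in range(7)]
headers = build_headers(BROWSER_UA, "https://www.example.com/news/list?page=2")
print(schedule)
print(headers)
```

The resulting dictionary can be passed as the `headers` argument of `urllib.request.Request` or of a comparable HTTP client.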
The data parsing, storage and deduplication module (i.e., module 3) may be used to parse the data captured by module 2 and then perform a series of deduplication and persistence operations on it. As shown in fig. 3, the data captured by module 2 may include, for example, pictures and texts; correspondingly, the parsing performed by module 3 includes parsing text and parsing pictures (in fig. 3, parsing by the data parsing engine is taken as an example). Parsing yields text, pictures, and links, which may be placed in a text parsing queue, a picture parsing queue, and a link identification queue, respectively. Links in the link identification queue further undergo deduplication and level judgment to obtain the links that need to be crawled, which are placed in the link crawling queue. The data crawling step in module 2 therefore also includes crawling the page content corresponding to the links in the link crawling queue. The level judgment is as follows: if the level identifier of a link obtained from the link identification queue does not exceed the level set in the configuration, the next level of that link can be crawled.
In one implementation, parsing the data captured by the crawler module may refer to parsing the text information in the HTML. This text is plain text, so the various HTML tags need to be removed: they are of no use for the subsequent analysis and judgment of the data, and they place heavy consumption and pressure on data transmission and storage. Optionally, in addition to the plain text in the HTML, the text in certain key tags may also be parsed, such as the TITLE of a web page and the text in its meta tags. Parsing the TITLE and meta tags of a web page allows violating web pages to be detected more efficiently. This is because many bad-information pages place key information or content in the TITLE to attract clicks from potential users. In addition, to raise the search ranking of a bad-information website, a large amount of key information or keywords may be placed in the meta tags, so that when a user searches for the corresponding keywords the website is displayed preferentially near the top of the results, making the bad information easy for the user to find and browse. meta is an auxiliary tag of the HTML language; it is located in the head area at the top of the document and has no content of its own.
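The text parsing described above (dropping tags, keeping plain text, and reading the TITLE and meta tags) can be sketched with Python's standard `html.parser` module. The class name, the sample page, and the decision to skip `script` and `style` content are assumptions made for this example.

```python
from html.parser import HTMLParser

class PageTextExtractor(HTMLParser):
    """Collect the title, named meta tags, and plain body text of a page."""

    SKIPPED = {"script", "style"}  # content that is not readable text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta":
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self.text_parts.append(text)

SAMPLE = """<html><head><title>Sample page</title>
<meta name="keywords" content="alpha,beta">
<script>var x = 1;</script></head>
<body><p>Hello <b>world</b></p></body></html>"""

extractor = PageTextExtractor()
extractor.feed(SAMPLE)
plain_text = " ".join(extractor.text_parts)
print(extractor.title, extractor.meta, plain_text)
```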
In another implementation, parsing the data captured by the crawler module may refer to parsing the URLs of the hyperlinks in the HTML. Websites are developed in different ways, and hyperlinks take different forms on different websites. To parse correct URLs as comprehensively as possible, the information identification system of the present application provides one or more of the following analysis methods for parsing the URL of a hyperlink:
Analysis method 1: match all URLs in the web page that begin with the http or https protocol and consist of characters from a URL character set of the form [-A-Za-z0-9+&@#/%?...] (letters, digits, and punctuation such as |, !, comma, and semicolon), ending at a space, including a space inside text. This approach can be used to parse hyperlinks that conform to the URL format of hypertext links.
Analysis method 2: match all URLs in the web page that begin with www and consist of characters from a URL character set of the form [-A-Za-z0-9+&@#/%?...] (letters, digits, and punctuation such as |, !, comma, and semicolon), ending at a space, including a space inside text. This approach can match the domain names of web sites (including dark links and hyperlinks) in web pages that begin with www.
Analysis method 3: handle encoded sequences such as &amp;amp;, %3A, %2F, and the like that appear in links in the web page, that is, recognize these special sequences in a link as the normal characters they stand for. These sequences are special browser-encoded characters; for example, the sequence &amp;amp; is recognized as &amp;.
Analysis method 4: match the URL in a hyperlink of the form href="URL" and splice it accordingly, for example a URL beginning with //, /, or ?. This reproduces the browser's splicing rule, so that the link obtained by splicing is one the browser can access. This approach handles the case where href is followed by a double quotation mark. href denotes a hyperlink, that is, a connection from a web page to a target. Illustratively, the parent link is https://www.ABC.com and the hyperlink is href="common/declaration">; the link is spliced onto the parent link at the beginning, and the link obtained after splicing is https://ABC.
Analysis method 5: match web pages that carry a base href tag and change the splicing rule. Specifically, child links on a page with base href are not spliced onto the current parent link but onto the main domain name of the parent link. For example, for the link https://a.ABC.com/common/declaration, the child links of the link are spliced onto the main domain name of the link (i.e., https://a.ABC.com).
Analysis method 6: match the URL in a hyperlink of the form href='URL' and splice it accordingly, for example a URL beginning with //, /, or ?. This approach handles the case where href is followed by a single quotation mark.
Analysis method 7: match the URL in a hyperlink of the form href=URL and splice it accordingly, for example a URL beginning with //, /, or ?. This approach handles the case where href is not followed by a quotation mark.
Analysis method 8: match the URL in a hyperlink of the form src="URL" and splice it accordingly, for example a URL beginning with //, /, or ?. This approach handles the case where src is followed by a double quotation mark.
Analysis method 9: match the URL in a hyperlink of the form src='URL' and splice it accordingly, for example a URL beginning with //, /, or ?. This approach handles the case where src is followed by a single quotation mark.
Analysis method 10: match mailboxes, that is, text in which the @ symbol is followed by a name ending with any of the following suffixes: \. edu \, com \, gov \, cn \, org \, cn \, net \, cn \, top \, cp \, net \, org \, wang \, go \, gov \, mil \, co \, biz \, nafe \, info \, pro \, int \. Here, | is used to separate the individual terms, and / is a separator within a term. This approach can handle the case where a mailbox is followed by an illegal link.
Analysis method 11: match Location=. This approach can handle the new link produced when a web page redirects.
Analysis method 12: match the URL in a hyperlink of the form option value="URL" and splice it accordingly, for example a URL beginning with //, /, or ?. This approach handles the case where option value is followed by a double quotation mark.
Analysis method 13: match the URL in a hyperlink of the form option value='URL' and splice it accordingly, for example a URL beginning with //, /, or ?. This approach handles the case where option value is followed by a single quotation mark.
Analysis method 14: handle garbled (mis-encoded) text in a tag by restoring the text and then performing link matching.
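Several of the analysis methods above (quoted and unquoted href and src values, entity decoding, and splicing onto the parent link) can be combined in one short sketch. The regular expression and function name are illustrative assumptions rather than the patterns used by the application, and splicing is delegated to `urllib.parse.urljoin`, which follows the standard relative-reference rules that browsers apply.

```python
import html
import re
from urllib.parse import urljoin

# href or src followed by a double-quoted, single-quoted, or unquoted value.
ATTR_URL = re.compile(
    r"""(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

def extract_links(page_html, parent_url):
    """Find href/src values and splice relative ones onto the parent link."""
    links = []
    for match in ATTR_URL.finditer(page_html):
        raw = next(group for group in match.groups() if group is not None)
        cleaned = html.unescape(raw)  # e.g. "&amp;" becomes "&"
        links.append(urljoin(parent_url, cleaned))
    return links

PAGE = (
    '<a href="/common/declaration">terms</a> '
    "<img src='//img.example.com/a.png'> "
    "<a href=?page=2&amp;lang=en>next</a>"
)
links = extract_links(PAGE, "https://www.example.com/list")
for link in links:
    print(link)
```

A regular expression is used here only because the text describes pattern matching; for pages with irregular markup, an HTML parser is usually the more robust choice.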
Data deduplication is needed because a website contains many repeated links and the number of web pages module 2 has to crawl is very large. Crawling repeated links again and again not only puts the crawler system under heavy load and occupies a large share of software and hardware resources, but can also trap the crawler in an endless loop in which it cannot tell when crawling is finished, which slows down the whole system. If there are N websites in the whole network, the complexity of deduplication is N × log(N), because all web pages need to be traversed once and each duplicate check has a complexity of log(N). The deduplication method used by the information identification system in the present application is a Bloom filter. Its advantage is that, using a fixed amount of memory (which does not grow with the number of URLs), it can determine whether a URL has been crawled in O(1) time.
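A Bloom filter of the kind mentioned above can be sketched in a few lines. The bit-array size, the number of hash functions, and the use of salted SHA-256 digests are choices made for this example only. Note that a Bloom filter can report a URL as already seen when it was not (a false positive), but it never misses a URL that was added.

```python
import hashlib

class BloomFilter:
    """Fixed-size set membership test with a small false-positive rate."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # memory never grows

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def should_crawl(url, seen):
    """Return True the first time a URL is offered, False afterwards."""
    if url in seen:
        return False
    seen.add(url)
    return True

seen = BloomFilter()
first = should_crawl("https://www.example.com/a", seen)
second = should_crawl("https://www.example.com/a", seen)
print(first, second, len(seen.bits))
```

Each lookup costs a constant number of hash computations regardless of how many URLs have been stored, which matches the O(1) behavior described above.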
In the present application, the amount of data to be processed for information identification is large, and much of it is picture data. Sufficient computing power and storage space can be obtained through cloud computing technology. Optionally, the information identification system may have data recovery and error correction functions, so that previously stored data can be recovered quickly and efficiently when a storage server goes down or a disk is damaged. In particular, erasure codes and checksums may be used to protect data against hardware failures and silent data corruption. An erasure code is a mathematical algorithm for recovering lost and corrupted data. Optionally, the present application may use MinIO for data storage, for example to store crawled information. On standard hardware, MinIO has read/write speeds as high as 183 GB/sec and 171 GB/sec. It should be noted that fig. 3 takes storing the pictures to be recognized in MinIO as an example; they may also be stored in another database, which is not limited in the embodiments of the present application.
Optionally, module 3 further has an Optical Character Recognition (OCR) module for recognizing information to be recognized that includes pictures. As shown in fig. 3, the OCR results may be placed in an OCR recognition queue. In many bad-information web pages, a large amount of violating content exists in the form of pictures, and the text contained in those pictures can be recognized by the OCR module. In this way, violating pictures can be effectively identified, which ensures that the information identification system captures bad information comprehensively.
The text study and judgment module (i.e., module 4) may be used to judge whether the information to be identified from module 3 violates the rules. For example, referring to FIG. 3, it determines whether the information to be recognized from the text parsing queue and the OCR recognition queue is in violation. The general processing flow of the text study and judgment module is as follows: match the information to be identified from module 3 against the violation word data set; put the information whose matching result meets the preset condition into a violation word filtering queue; and further judge the information in the violation word filtering queue with the depth model to obtain a study and judgment result, which indicates whether a violation has occurred. For information to be identified whose matching result with the violation word data set does not meet the preset condition, the study and judgment result can be obtained directly, as described in step S102 and step S103.
For an exemplary processing flow of the text study and judgment module, see fig. 5. In fig. 5, wavy boxes represent data sets, parallelograms represent processes, small rectangles represent the sub-processes of a process, and large rectangles represent results. The overall data flow is shown by the solid lines in the figure. Data output by module 3 flows into this link and first undergoes data cleaning, which yields the data that hits neither the blacklist nor the whitelist (i.e., the first information to be identified). The data flows in two directions: one is toward the violation word matching process; in the other, data matching the blacklist can flow directly to the final result, or data matching external links can flow to an interface or a queue, where the depth model performs further semantic analysis and judgment to obtain the result. After data cleaning, the next process is violation word matching. Specifically, the data that hits neither the blacklist nor the whitelist is matched in full text against the violation word data set. The information to be identified that matches a violation word (such as a web page text) then undergoes a reverse full-text search to obtain the position of the violation word in the original text. Once the position is determined, the text can be cut into short sentences to obtain the short text data related to the violation word, and this data is finally sent to an interface or a queue for study and judgment by the depth model. Reverse full-text search is efficient.
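The locate-then-cut step described above can be sketched as follows. The set of sentence delimiters and the function name are assumptions for illustration; in particular, treating every period as a sentence boundary is a simplification that would split decimals and URLs.

```python
import re

# Assumed sentence delimiters (Chinese and Western punctuation, line breaks).
SENTENCE_END = re.compile(r"[。！？!?.;；\n]")

def cut_sentences(text, word):
    """Return the sentence-sized span around each occurrence of word."""
    snippets = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        # Left boundary: just after the last delimiter before the word.
        left = 0
        for match in SENTENCE_END.finditer(text, 0, start):
            left = match.end()
        # Right boundary: the first delimiter after the word, inclusive.
        match = SENTENCE_END.search(text, end)
        right = match.end() if match else len(text)
        snippets.append(text[left:right].strip())
        start = text.find(word, end)
    return snippets

TEXT = (
    "The weather is fine. "
    "An artist attended an event at Dawn Casino. "
    "See you tomorrow."
)
snippets = cut_sentences(TEXT, "Dawn Casino")
print(snippets)
```

Only the short text around the matched word is passed on, so the depth model judges the word in its immediate context rather than the whole page.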
In FIG. 5, the crawler text data set includes the information crawled by the crawler. Entries in the crawler text data set can be numbered by crawling batch or task. The crawler text data set may include one or more of the following: the original unstructured web page data of crawled web pages, websites, official accounts, news media, supervision websites, and the like. Low confidence text refers to web pages that match a violation word; since a web page that matches a violation word is not necessarily in violation, this data is called low confidence text. For the data that hits neither the blacklist nor the whitelist and for violation word matching, refer to the foregoing description; details are not repeated here.
In fig. 5, the dotted lines refer to the generation and operation of violation words. In the early stage of operating violation words in the information recognition system, basic word stock resources (i.e., the word dictionary data sets in the figure) can be found on the internet, and violation words are then added to the word stock through operation (e.g., manual review). Operating violation words may include adding attributes to a violation word, such as one or more of the following: category, sensitivity level, interception rate, interception accuracy, recall rate, and the like. These attributes may be used to calculate the degree of violation of information to be identified that matches the violation word. Optionally, the attributes of violation words can be adjusted according to feedback from actual production to ensure that bad information is intercepted accurately. Optionally, the words in the basic word stock may also be expanded, for example by one or more of the following: synonym expansion, pinyin expansion, allograph expansion, consent expansion, skip expansion, antonym expansion, and the like. The attributes and categories of an expanded word may be the same as those of its source word. Optionally, new violation words and their attribute categories may be obtained automatically through new word discovery and hot word calculation in order to expand the violation word data set.
As shown in fig. 5, violation word matching includes three sub-processes: long word calculation, violation degree calculation, and part-of-speech calculation. In the violation word matching process, information to be identified that meets the preset condition is low confidence violation text. For such text, the position of the violation word is located in the original text, the sentence is then cut, and the second information to be identified obtained by cutting is placed into an interface or a queue, where the depth model performs study and judgment to obtain the result. For information that does not meet the preset condition, the study and judgment result is obtained directly. For details of violation word matching, refer to the description of step S102; they are not repeated here.
In the process of matching against the violation word data set, matching can be based on hashing or regular-expression filtering, or can use a deterministic finite automaton (DFA) algorithm or the Aho-Corasick multi-pattern matching algorithm. A DFA enables efficient filtering of violation words. The Aho-Corasick algorithm is a string searching algorithm that differs from ordinary string matching in that it matches the text against all dictionary strings simultaneously. The algorithm has approximately linear amortized time complexity, roughly the length of the string plus the number of all matches.
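The multi-pattern matching described above can be sketched as follows. This is a compact textbook-style Aho-Corasick automaton written for illustration, not the implementation used in the application; the class name `AhoCorasick` and the sample words are assumptions.

```python
from collections import deque
from typing import Dict, List, Tuple


class AhoCorasick:
    """Trie with failure links; finds every dictionary word in one pass."""

    def __init__(self, words: List[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]  # child transitions per node
        self.fail: List[int] = [0]              # failure link per node
        self.out: List[List[str]] = [[]]        # words recognized at each node
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word: str) -> None:
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self) -> None:
        # Breadth-first: depth-1 nodes keep the root as their failure target.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit words that end at the failure target (proper suffixes).
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def search(self, text: str) -> List[Tuple[int, str]]:
        """Return (start_index, word) for every occurrence of a dictionary word."""
        hits: List[Tuple[int, str]] = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                hits.append((i - len(word) + 1, word))
        return hits


matcher = AhoCorasick(["he", "she", "his", "hers"])
hits = matcher.search("ushers")
```

The text is scanned once regardless of how many words the dictionary contains, which is why the running time grows with the text length plus the number of reported matches rather than with the dictionary size.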
In the present application, "greater than or equal to" may be replaced with "greater than", and in this case, "less than" may be replaced with "less than or equal to".
Identifying violation information with the information identification system has the following beneficial effects. First, comprehensiveness: violation information propagated through different paths and in different modes can be identified, for example violation information in text, pictures, and other types. Second, accuracy: identifying violation information with the depth model can improve identification accuracy. Third, flexibility: each module of the information identification system is designed independently, which can greatly improve how readily the system can be reshaped and its modules coupled, making it more flexible. Fourth, discovery and recognition of new information: through continuous maintenance and training, the depth model can retain its ability to discover new information, which in turn strengthens its ability to recognize violation information. Fifth, high availability and reusability of data: cloud computing provides strong computing power and storage resources, and the large amount of stored data can be used for self-learning or updating of the model.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an information identification device according to an embodiment of the present application. As shown in fig. 6, the information recognition apparatus 60 includes an acquisition unit 601 and a processing unit 602, wherein:
an acquisition unit 601 configured to acquire first information to be identified;
the processing unit 602 is configured to determine a matching result between the first to-be-identified information and the violation word data set; the matching result comprises target violation words existing in the first information to be identified and the violation word data set;
the processing unit 602 is further configured to, if the matching result meets a preset condition, obtain second information to be identified including the target violation word from the first information to be identified;
the processing unit 602 is further configured to determine whether the second information to be identified violates rules by using the depth model.
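The flow carried out by the two units can be sketched as a single function. This is a rough outline under stated assumptions: plain substring search stands in for the matching step, the three callables are hypothetical placeholders for the preset-condition check, the sentence cutting, and the depth model, only the first matched word is used for cutting, and the outcomes returned without calling the model (no match treated as non-violating, a match that fails the preset condition treated as violating) are assumptions rather than statements from the application.

```python
from typing import Callable, List


def identify(
    first_info: str,
    violation_words: List[str],
    meets_preset_condition: Callable[[List[str]], bool],
    cut_sentence: Callable[[str, str], str],
    depth_model_says_violation: Callable[[str], bool],
) -> bool:
    """Return True if the information is judged to violate the rules."""
    # Matching result: violation words present in the first information.
    targets = [word for word in violation_words if word in first_info]
    if not targets:
        return False  # assumption: nothing matched, so nothing to judge
    if not meets_preset_condition(targets):
        return True   # assumption: a confident match is judged directly
    # Low-confidence match: extract the second information and ask the model.
    second_info = cut_sentence(first_info, targets[0])
    return depth_model_says_violation(second_info)


# Usage with trivial stand-ins for the three hypothetical callables.
result = identify(
    "cheap pills now",
    ["pills"],
    meets_preset_condition=lambda targets: True,
    cut_sentence=lambda text, word: text,
    depth_model_says_violation=lambda sentence: False,
)
```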
In an alternative embodiment, the second information to be recognized includes a plurality of words; when determining whether the second information to be identified violates the rules by using the depth model, the processing unit 602 is specifically configured to: determine semantic dependency relations among the words in the second information to be recognized by using the depth model; and judge whether the second information to be identified violates the rules according to the semantic dependency relations.
In an alternative embodiment, the second information to be recognized includes a first word, an attributive (fixed language), and a second word; in the second information to be recognized, the first word, the attributive, and the second word appear in that order; when determining the semantic dependency relationship between the words in the second information to be recognized by using the depth model, the processing unit 602 is specifically configured to: determine, by using the depth model, from the first word and the second word, that the object modified by the attributive is the first word.
In an alternative embodiment, the number of target violating words is one or more; the preset conditions include one or more of the following: the length of the target violation word is smaller than a first threshold; the number of the target violation words is smaller than a second threshold; and the violation degree value of the first information to be processed is smaller than a third threshold, and the violation degree value of the first information to be processed is determined by the part of speech of the target violation word.
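A minimal sketch of such a preset-condition check is given below. The threshold values and all names are assumptions, the violation degree value is taken as an input because its calculation from part of speech is not detailed here, the word-length test is applied to the longest matched word, and treating any single satisfied condition as sufficient is one possible reading of "one or more of the following".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical threshold values; the application does not give numbers."""
    word_length: int = 3            # "first threshold"
    word_count: int = 2             # "second threshold"
    violation_degree: float = 0.5   # "third threshold"


def meets_preset_condition(
    target_words: List[str],
    violation_degree: float,
    thresholds: Thresholds = Thresholds(),
) -> bool:
    """Return True if the match is weak enough to need the depth model."""
    if not target_words:
        return False
    longest = max(len(word) for word in target_words)
    checks = [
        longest < thresholds.word_length,                # short words are ambiguous
        len(target_words) < thresholds.word_count,       # few matched words
        violation_degree < thresholds.violation_degree,  # low violation degree
    ]
    return any(checks)


weak_match = meets_preset_condition(["ab"], violation_degree=0.9)
strong_match = meets_preset_condition(["abcdef", "ghijkl", "mnopqr"], violation_degree=0.9)
```

A deployment could equally require all enabled conditions to hold; swapping `any` for `all` expresses that variant.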
In an optional implementation manner, when obtaining, from the first information to be identified, the second information to be identified that includes the target violation word, the processing unit 602 is specifically configured to: determine the position of the target violation word in the first information to be identified; and perform sentence cutting on the first information to be identified according to the position, to obtain the second information to be identified that includes the target violation word; wherein the length of the characters included in the second information to be recognized is smaller than a fourth threshold value, and/or the second information to be recognized has a complete sentence structure.
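The sentence cutting step might look like the following sketch. The function name `cut_sentence`, the set of sentence terminators, the default limit of 50 characters, and the fallback to a character window when the enclosing sentence is too long are all assumptions for illustration.

```python
import re
from typing import Optional

# Assumed sentence terminators (Western and Chinese punctuation).
SENTENCE_END = re.compile(r"[.!?。！？]")


def cut_sentence(first_info: str, target: str, max_chars: int = 50) -> Optional[str]:
    """Return a short span of first_info that contains the target word."""
    pos = first_info.find(target)
    if pos < 0:
        return None
    # The sentence starts after the last terminator that precedes the word.
    start = 0
    for match in SENTENCE_END.finditer(first_info, 0, pos):
        start = match.end()
    # The sentence ends at the first terminator that follows the word.
    end_match = SENTENCE_END.search(first_info, pos + len(target))
    end = end_match.end() if end_match else len(first_info)
    sentence = first_info[start:end].strip()
    if len(sentence) < max_chars:
        return sentence  # complete sentence, shorter than the limit
    # Fallback: a window around the word, kept under the limit when possible.
    half = max(0, (max_chars - 1 - len(target)) // 2)
    lo = max(0, pos - half)
    hi = min(len(first_info), pos + len(target) + half)
    return first_info[lo:hi]


text = "The weather is nice. He sells fake tickets online! Call now."
second_info = cut_sentence(text, "fake")
```

With the sample text, `second_info` is the single sentence surrounding "fake"; a text with no punctuation falls back to the fixed-size window.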
In an alternative embodiment, the first information to be identified is crawl information that does not match objects in a filter object data set, the filter object data set comprising blacklisted objects and/or whitelisted objects.
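As a small illustration of this filtering step, the sketch below assumes that the objects in the filter object data set are host names, which the application does not state; the function name and the sample hosts are likewise hypothetical.

```python
from typing import Set
from urllib.parse import urlsplit


def needs_identification(url: str, blacklist: Set[str], whitelist: Set[str]) -> bool:
    """Return True if crawled content from this URL matches no filter object."""
    host = urlsplit(url).hostname or ""
    return host not in blacklist and host not in whitelist


blacklist = {"bad.example"}
whitelist = {"trusted.example"}
to_check = [
    url
    for url in (
        "https://bad.example/a",
        "https://trusted.example/b",
        "https://unknown.example/c",
    )
    if needs_identification(url, blacklist, whitelist)
]
```

Only the URL whose host appears in neither list remains in `to_check` and would go on to violation word matching.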
In an alternative embodiment, the processing unit 602 may further be configured to: crawl the crawl information according to a crawling strategy; wherein the crawling strategy comprises one or more of the following: in the crawling process, a preset number of requests are sent by using first information, and subsequent requests are then sent by using second information, where the first information is identity information and/or address information; if the page structure of a crawled webpage is detected not to be the preset structure, the page structure of the webpage is formatted; and if the crawled URL is detected to be incomplete, packets are dynamically captured for the page corresponding to the crawled URL.
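The first strategy, switching identity or address information after a preset number of requests, can be sketched without any network code. The class name `IdentityRotator`, the representation of an identity as a dictionary of request headers, and cycling back to the first identity after the last one are assumptions for illustration.

```python
from typing import Dict, List


class IdentityRotator:
    """Hands out an identity, switching after a preset number of requests."""

    def __init__(self, identities: List[Dict[str, str]], requests_per_identity: int) -> None:
        if not identities or requests_per_identity < 1:
            raise ValueError("need at least one identity and a positive limit")
        self._identities = identities
        self._limit = requests_per_identity
        self._sent = 0  # number of identities handed out so far

    def next_identity(self) -> Dict[str, str]:
        """Return the identity to use for the next request."""
        index = (self._sent // self._limit) % len(self._identities)
        self._sent += 1
        return self._identities[index]


rotator = IdentityRotator(
    [{"User-Agent": "A"}, {"User-Agent": "B"}],
    requests_per_identity=2,
)
used = [rotator.next_identity()["User-Agent"] for _ in range(5)]
```

The caller would attach the returned headers (or a proxy address kept in the same structure) to each outgoing request.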
Information identification device 60 may also be used to implement other functions of the information identification device in the corresponding embodiment of fig. 1, and will not be described herein again.
Referring to fig. 7, fig. 7 is a schematic diagram of another information identification apparatus 70 according to an embodiment of the present disclosure, which can be used to implement the functions of the information identification device in the method embodiments. The information recognition apparatus 70 may include a processor 701. Optionally, the information identification device 70 may further include a memory 702. The processor 701 and the memory 702 may be connected by a bus 703 or by other means. The bus is shown in fig. 7 by a thick line; the connection manner between other components is merely illustrative and is not limiting. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.
The coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, and may be an electrical, mechanical or other form for information interaction between the devices, units or modules. The embodiment of the present application does not limit the specific connection medium between the processor 701 and the memory 702.
The memory 702 may include both read-only memory and random access memory, and provides instructions and data to the processor 701. A portion of the memory 702 may also include non-volatile random access memory.
The processor 701 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor; optionally, the processor 701 may be any conventional processor or the like.
When the information identifying apparatus takes the form shown in fig. 7, the processor in fig. 7 may execute the method performed by the information identifying apparatus in any of the above-described method embodiments.
In an alternative embodiment, the memory 702 is used to store program instructions; the processor 701 is configured to call the program instructions stored in the memory 702 to perform the steps performed by the information identification apparatus in the corresponding embodiment of fig. 1. Specifically, the functions and implementation processes of the acquiring unit and the processing unit in fig. 6 can be implemented by the processor 701 in fig. 7 calling computer-executable instructions stored in the memory 702.
In the embodiment of the present application, a computer program (including program codes) capable of executing the steps related to the method may be run on a general-purpose computing device, such as a computer, including a processing element and a storage element, such as a Central Processing Unit (CPU), a Random Access Memory (RAM), a Read-Only Memory (ROM), and the like, and the method provided by the embodiment of the present application may be implemented. The computer program may be recorded on a computer-readable recording medium, for example, and loaded and executed in the above-described computing apparatus via the computer-readable recording medium.
Based on the same inventive concept, the principle by which the information recognition apparatus 70 provided in the embodiment of the present application solves the problem, and its advantageous effects, are similar to those of the information recognition apparatus in the method embodiments of the present application. For brevity, refer to the principle and advantageous effects of the method implementation; they are not described here again.
The embodiment of the present application further provides a chip, and the chip can execute the relevant steps of the information identification device in the foregoing method embodiments. In one possible implementation, the chip includes at least one processor, at least one first memory, and at least one second memory; the at least one first memory and the at least one processor are interconnected through a line, and instructions are stored in the first memory; the at least one second memory and the at least one processor are interconnected through a line, and the second memory stores the data required to be stored in the method embodiment.
For each device or product applied to or integrated in the chip, each module included in the device or product may be implemented by hardware such as a circuit, or at least a part of the modules may be implemented by a software program running on a processor integrated in the chip, and the rest (if any) part of the modules may be implemented by hardware such as a circuit.
The embodiment of the present application further provides a computer-readable storage medium, in which one or more instructions are stored, and the one or more instructions are adapted to be loaded by a processor and execute the method provided by the foregoing method embodiment.
Embodiments of the present application also provide a computer program product containing instructions, which when run on a computer, cause the computer to perform the method provided by the above method embodiments.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by this application.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The above discloses only preferred embodiments of the present application, which are merely some of its embodiments and should not be used to limit the scope of the present application.

Claims (10)

1. An information identification method, characterized in that the method comprises:
acquiring first information to be identified;
determining a matching result between the first information to be identified and the violation word data set; the matching result comprises target violation words existing in the first information to be identified and the violation word data set;
if the matching result meets a preset condition, second information to be identified comprising the target violation word is acquired from the first information to be identified;
and judging whether the second information to be identified violates the rules or not by using a depth model.
2. The method according to claim 1, wherein the second information to be recognized comprises a plurality of words; the judging whether the second information to be identified violates rules by using the depth model includes:
determining semantic dependency relations among the words in the second information to be recognized by using a depth model;
and judging whether the second information to be identified violates rules or not according to the semantic dependency relationship.
3. The method according to claim 2, wherein the second information to be recognized includes a first word, an attributive (fixed language), and a second word; in the second information to be recognized, the first word, the attributive, and the second word appear in that order;
the determining semantic dependency relationship between words in the second information to be recognized by using the depth model includes:
determining, using a depth model, from the first word and the second word, that the object modified by the attributive is the first word.
4. The method according to any one of claims 1-3, wherein the number of target violation words is one or more; the preset conditions include one or more of the following:
the length of the target violation word is less than a first threshold;
the number of the target violation words is less than a second threshold;
the violation degree value of the first information to be processed is smaller than a third threshold value, and the violation degree value of the first information to be processed is determined by the part of speech of the target violation word.
5. The method according to any one of claims 1 to 3, wherein the obtaining of second information to be identified including the target violation word from the first information to be identified includes:
determining the position of the target violation word in the first information to be identified;
according to the position, sentence cutting is carried out on the first information to be identified, and second information to be identified, including the target violation words, is obtained; and the length of characters included in the second information to be recognized is smaller than a fourth threshold value, and/or the second information to be recognized has a complete sentence structure.
6. The method according to any one of claims 1 to 3,
the first information to be identified is crawl information which is not matched with objects in a filtering object data set, and the filtering object data set comprises blacklist objects and/or whitelist objects.
7. The method of claim 6, further comprising:
crawling the crawling information according to a crawling strategy;
wherein the crawling policy comprises one or more of:
in the crawling process, a preset number of requests are sent by using first information, and then requests are sent by using second information; the first information is identity information and/or address information;
if the page structure of the crawled webpage is detected to be not a preset structure, formatting the page structure of the webpage;
and if the crawled uniform resource locator (URL) is detected to be incomplete, dynamically capturing packets for the page corresponding to the crawled URL.
8. An information recognition apparatus comprising means for performing the method of any one of claims 1 to 7.
9. An information recognition apparatus, comprising a processor;
the processor is used for executing the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed, cause the method of any of claims 1-7 to be performed.
CN202111383614.9A 2021-11-19 2021-11-19 Information identification method and device Pending CN114282097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111383614.9A CN114282097A (en) 2021-11-19 2021-11-19 Information identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111383614.9A CN114282097A (en) 2021-11-19 2021-11-19 Information identification method and device

Publications (1)

Publication Number Publication Date
CN114282097A true CN114282097A (en) 2022-04-05

Family

ID=80869524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111383614.9A Pending CN114282097A (en) 2021-11-19 2021-11-19 Information identification method and device

Country Status (1)

Country Link
CN (1) CN114282097A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822496A (en) * 2023-06-02 2023-09-29 厦门她趣信息技术有限公司 Social information violation detection method, system and storage medium
CN116822496B (en) * 2023-06-02 2024-04-19 厦门她趣信息技术有限公司 Social information violation detection method, system and storage medium

Similar Documents

Publication Publication Date Title
CN102436563B (en) Method and device for detecting page tampering
US20050171932A1 (en) Method and system for extracting, analyzing, storing, comparing and reporting on data stored in web and/or other network repositories and apparatus to detect, prevent and obfuscate information removal from information servers
CN102446255B (en) Method and device for detecting page tamper
WO2002037326A1 (en) System for monitoring publication of content on the internet
US11681765B2 (en) System and method for integrating content into webpages
CN108038173B (en) Webpage classification method and system and webpage classification equipment
CN107526718A (en) Method and apparatus for generating text
WO2013070534A1 (en) Function extension for browsers or documents
CN102591965A (en) Method and device for detecting black chain
Sharma et al. Evaluation of tools and extension for fake news detection
Yang et al. Scalable detection of promotional website defacements in black hat {SEO} campaigns
Kumar World towards advance web mining: A review
CN112818200A (en) Data crawling and event analyzing method and system based on static website
Mehta et al. A comparative study of various approaches to adaptive web scraping
CN104036189A (en) Page distortion detecting method and black link database generating method
CN104778232B (en) Searching result optimizing method and device based on long query
CN114282097A (en) Information identification method and device
Bello et al. Conversion of website users to customers-The black hat SEO technique
CN104077353B (en) A kind of method and device of detecting black chain
CN113742785A (en) Webpage classification method and device, electronic equipment and storage medium
US8195458B2 (en) Open class noun classification
CN104063491B (en) A kind of method and device that the detection page is distorted
CN104063494B (en) Page altering detecting method and black chain data library generating method
JP2001076000A (en) Device and method for searching illegal utilization of contents
Ou et al. Viopolicy-detector: An automated approach to detecting GDPR suspected compliance violations in websites

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination