CN112101335B - APP violation monitoring method based on OCR and transfer learning - Google Patents
- Publication number: CN112101335B (application CN202010862575.XA)
- Authority: CN (China)
- Prior art keywords: app, violation, ocr, data, transfer learning
- Prior art date: 2020-08-25
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06F16/951—Indexing; Web crawling techniques
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
- G06V2201/09—Recognition of logos
- G06V30/10—Character recognition
Abstract
The invention discloses an APP violation monitoring method based on OCR and transfer learning, which comprises the following steps: periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises data packet capture and page screenshots; performing character recognition and extraction on the screenshots based on an OCR algorithm; constructing a sample set from the recognized text content through keywords and regular expressions, and labelling it manually; inputting the manually labelled sample set into a pre-trained deep learning model for model fine-tuning, and realizing violation judgment of texts in different scenes by dividing business scenes; and counting the scores of different APPs according to the judgment results output by the deep learning model to obtain the violation score of each APP. According to the invention, violations in an APP are detected effectively and quickly by collecting and analyzing the data of the APP.
Description
Technical Field
The invention relates to the technical field of data monitoring, in particular to an APP violation monitoring method based on OCR and transfer learning.
Background
Mass network public opinion information can be automatically collected in real time, analyzed, summarized and monitored, key public opinion information can be identified, and relevant personnel can be notified in time, so that an emergency response is made at the first moment; this provides an information platform that directly supports correct public opinion guidance and the collection of netizens' opinions. However, such systems only collect public opinion data and cannot detect special content, and they generally detect only website data rather than mobile-terminal data.
According to user-defined task configurations, semi-structured and unstructured data in target internet web pages can be extracted accurately and in batches, converted into structured records, and stored in a local database for internal use or external release, so that external information is acquired rapidly. However, such tools generally collect only network data and have no way to collect data from mobile-terminal APPs; moreover, websites differ in complexity and anti-crawling measures, so the success rate of data crawling cannot be guaranteed.
Short text refers to text of no more than 200 words, such as microblogs, chat messages, news topics, opinion comments, question texts, mobile phone short messages and document abstracts. The short text classification task aims to automatically process short text input by a user and obtain valuable classification output. However, short text classification models rely on supervised learning, which often needs massive data as support and requires a large amount of manual labelling work.
That is, in the prior art, target information cannot be acquired quickly and effectively: for example, the data required for certain violation monitoring cannot be acquired; if an APP has anti-crawling measures, crawlers cannot be used to crawl its data; target promotional material contains a large number of pictures, and data in picture format cannot be processed; data samples for certain violation monitoring are insufficient, and even when network data are obtained a large amount of manual labelling is needed; a supervised deep learning model needs massive training data, and obtaining good results requires a large amount of machine resources for training; and certain violation monitoring lacks a platform for data review and comparison.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The invention mainly aims to provide an APP violation monitoring method based on OCR and transfer learning, and aims to solve the problem that target information cannot be quickly and effectively acquired in the prior art.
In order to achieve the above object, the present invention provides an APP violation monitoring method based on OCR and transfer learning, which includes the following steps:
periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises a data packet capture and a page screenshot;
performing character recognition and extraction on the screenshot based on an OCR algorithm;
constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking;
inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes;
and according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP.
Optionally, the APP violation monitoring method based on OCR and transfer learning, where the periodically updating the APK and performing data acquisition on the corresponding APP according to the updated APK specifically includes:
and crawling APK of each application by means of a Jsoup library based on Java, regularly updating the APK of the application store, and acquiring data of the corresponding APP according to the updated APK.
Optionally, the APP violation monitoring method based on OCR and transfer learning, wherein the data acquisition mode specifically includes: and (4) directly carrying out propaganda data packet capturing by using a crawler and carrying out page automatic screenshot by using an Appium script.
Optionally, the APP violation monitoring method based on OCR and transfer learning, where the manually labeled sample set is input into a pre-trained deep learning model for model adjustment, and violation determination of texts in different scenes is implemented by dividing service scenes, and the method further includes:
constructing a corpus for supervising training of the deep learning model.
Optionally, the APP violation monitoring method based on OCR and transfer learning, wherein the corpus construction process includes:
acquiring a plurality of keywords and matching the keywords;
and constructing a corpus based on the keywords, and manually labeling the keywords for generating the corpus.
Optionally, the APP violation monitoring method based on OCR and transfer learning, wherein the counting of the scores of different APPs to obtain the violation score of an APP specifically includes:
the APP violation score is given by a weighted average:
wherein,representing a weighted average, f 1-fk are configured weights of violation items of each dimension, x 1-xk are different constants of quality inspection results of the violation items of each dimension, n represents the total number of the dimensions, and different dimensions represent different violation scenes.
Optionally, the APP violation monitoring method based on OCR and transfer learning, wherein the calculating, according to the decision result output by the deep learning model, scores of different APPs to obtain violation scores of APPs further includes:
and setting a timing starting task for all tasks.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, the tasks include: an APP crawling timing task, an APP screenshot timing task and a violation monitoring timing task.
In addition, to achieve the above object, the present invention further provides an intelligent terminal, wherein the intelligent terminal includes: the device comprises a memory, a processor and an APP violation monitoring program based on OCR and transfer learning, wherein the APP violation monitoring program based on OCR and transfer learning is stored on the memory and can run on the processor, and when being executed by the processor, the APP violation monitoring program based on OCR and transfer learning realizes the steps of the APP violation monitoring method based on OCR and transfer learning.
In addition, in order to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores an APP violation monitoring program based on OCR and transfer learning, and the APP violation monitoring program based on OCR and transfer learning implements the steps of the APP violation monitoring method based on OCR and transfer learning when being executed by a processor.
According to the method, the APK is updated periodically, and data acquisition of the corresponding APP is carried out according to the updated APK, wherein the data acquisition comprises data packet capture and page screenshots; character recognition and extraction are performed on the screenshots based on an OCR algorithm; a sample set is constructed from the recognized text content through keywords and regular expressions and is labelled manually; the manually labelled sample set is input into a pre-trained deep learning model for model fine-tuning, and violation judgment of texts in different scenes is realized by dividing business scenes; and the scores of different APPs are counted according to the judgment results output by the deep learning model to obtain the violation score of each APP. According to the invention, violations in an APP are detected effectively and quickly by collecting and analyzing the data of the APP.
Drawings
FIG. 1 is a schematic diagram of a cross-platform, multi-language mobile-side automated testing framework based on a Client/Server architecture;
FIG. 2 is a schematic diagram of the Paddlehub architecture in the pre-trained model management and transfer learning tool;
FIG. 3 is a flow chart of the CTPN algorithm;
FIG. 4 is a block diagram of a monitoring system of a mobile terminal based on OCR and transfer learning;
FIG. 5 is a schematic diagram of a microservice architecture;
FIG. 6 is a flow chart of a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;
FIG. 7 is a schematic diagram illustrating an implementation process of an APP violation monitoring method based on OCR and transfer learning according to a preferred embodiment of the present invention;
FIG. 8 is a schematic diagram of a configuration path table structure formed when capturing images in the preferred embodiment of the APP violation monitoring method based on OCR and transfer learning according to the present invention;
FIG. 9 is a schematic diagram of a monitoring function for monitoring APP advertisement data in real time according to a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;
fig. 10 is a schematic operating environment diagram of an intelligent terminal according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The application uses Appium, a cross-platform, multi-language mobile-terminal automated testing framework based on a Client/Server architecture that supports both Android and iOS. As shown in fig. 1, a client-side class library implementing the WebDriver JSON Wire Protocol (in Java, Python, Javascript, etc.) sends commands to the Appium server, which parses them and executes operations such as screen capture and clicks on the mobile phone device. In addition, the Appium inspector can conveniently and accurately assist the developer in obtaining the coordinates and XPath (a language for locating parts of an XML document) attribute information of each control of the APP.
Transfer Learning is a sub-field of deep learning research which aims to use the similarity between data, tasks or models to transfer knowledge learned in an old field to a new field; usually, fine-tuning is performed on a pre-trained model to realize model transfer, so that the model adapts to the data of the new field.
OCR algorithms can generally be divided into two parts: text detection (detecting the region where the text is located) and character recognition (recognizing the characters in that region). Text detection can be performed with a CTPN (Connectionist Text Proposal Network) model, which greatly simplifies the detection process; by seamlessly combining CNN and RNN (the CNN extracts depth features, the RNN performs sequence feature recognition), it improves detection accuracy and qualitatively improves text detection effect, speed and robustness. The algorithm in the CTPN paper can be implemented with a Keras + TensorFlow framework: specifically, a series of proposals (pre-selected boxes) are generated for detection from the feature map output by the VGG16 convolution, a CTPN text detection model is trained on the VOC2007_text_detection data set, and the text regions of a picture can then be detected with the CTPN algorithm. Moreover, most of the image data acquired by the system through automatic Appium screenshots is in the horizontal direction, has little interfering background, almost no edge overlap and a small inclination angle, so the CTPN text detection algorithm achieves high accuracy. The algorithm steps are as shown in fig. 3:
(1) VGG16 (a classical CNN convolutional neural network model) is used to extract features, producing a feature map of size W × H × C; 3 × 3 sliding windows are applied on the conv5 layer, and each window yields a feature vector of length 3 × 3 × C.
(2) The convolution features obtained in step (1) are used as the input of a 256-dimensional bidirectional LSTM (two 128-dimensional LSTMs), giving an output of length W × 256; the LSTM is introduced to solve the gradient-vanishing problem of the RNN layer and further strengthen it.
(3) The output layer contains three outputs: 2k vertical coordinates, 2k scores and k side-refinements; a standard non-maximum suppression (NMS) algorithm is used to filter out duplicate and redundant text boxes (a minimal sketch of NMS follows below).
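For illustration, a minimal NumPy sketch of the standard non-maximum suppression step referred to in step (3); the box format, threshold value and variable names are illustrative assumptions rather than the patent's implementation:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Standard non-maximum suppression over axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]                        # keep the highest-scoring remaining box
        keep.append(int(i))
        # intersection of box i with every other remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]   # drop boxes that overlap too much
    return keep
```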
After the regions containing text in the picture are obtained, the next task is to recognize the text of each region. DenseNet is one of the character recognition algorithms used: ReLU is selected as the activation function, three Dense Block layers are used for computation, the Dense Block layers are connected through Transition structures to form the DenseNet network, and finally CTC (Connectionist Temporal Classification) loss is used in training to obtain the model. After the data passes through a Dense Block layer, a convolution operation is performed and the result is passed to the Transition structure for parameter integration; the parameters are reduced by pooling and then passed to the next Dense Block structure, so that high precision is achieved. During training, when the accuracy and the loss value reach a plateau and fall into oscillation, the learning rate needs to be reduced exponentially, which can instantly and greatly improve the accuracy (a sketch of such a learning-rate schedule follows below).
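The learning-rate reduction described above can be expressed, for example, with the ReduceLROnPlateau callback of Keras (a framework the system already uses); the monitored metric, decay factor and patience values below are illustrative assumptions:

```python
from tensorflow import keras

# Multiply the learning rate by `factor` whenever the monitored metric stops improving,
# approximating the exponential reduction described above.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation loss for a plateau / oscillation
    factor=0.1,          # illustrative decay factor
    patience=3,          # illustrative number of stagnant epochs before decaying
    min_lr=1e-6,
)

# Usage (model, data and epoch count depend on the actual training setup):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[reduce_lr])
```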
Further, as shown in fig. 4, the technology stack used by the system covers the front end, the back end, the algorithms and the operations, which can be classified as follows:
Front end: Vue, ElementUI, Vuex, Axios;
Back end: Java, SpringBoot, SpringCloud, Nacos, SpringGateWay, SpringAdmin, Feign, XXL-JOB, Mybatis-Plus, Maven;
Algorithms: Flask, TensorFlow, Keras, PyTorch, CTPN, CRNN;
Operation and maintenance: Docker, Linux;
the system adopts a design idea of separating a front end from a rear end, the front end is Vue, a management system interface is built by combining technical stacks such as ElementUI, Vuex, Axios and the like, the background is uniformly deployed in a Linux mirroring mode based on Docker, the Web background realizes a micro-service architecture by combining Java and SpringCloud, a configuration center and a registration center are realized through Nacos, the background interface is uniformly accessed through a SpringGateWay gateway component, and the rear-end service is uniformly monitored through SpringAdmin. The timing task background combines an open source framework XXL-JOB to realize the distributed timing task based on Java. An algorithm background realizes an OCR character recognition algorithm based on the combination of flash and TensorFlow and Keras, and fine adjustment is carried out on the basis of a Paddlehub pre-training model to generate a violation detection model.
In a background data persistence layer, MySQL is used, Mybatis Plus is used for performing operations such as adding, deleting, modifying and searching on data, MongoDb is used for storing HTTP packet capturing data, Redis is used for caching hot spot data and realizing distributed locking.
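As an illustration of the Redis-based distributed locking mentioned above, a minimal Python sketch with redis-py (the patent's backend implements this on the Java side; key names and TTL are assumptions):

```python
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_lock(name, ttl_seconds=30):
    """Try to take a distributed lock; return a token on success, None otherwise."""
    token = str(uuid.uuid4())
    # SET key value NX EX ttl: only succeeds if the key does not exist yet
    if r.set(f"lock:{name}", token, nx=True, ex=ttl_seconds):
        return token
    return None

def release_lock(name, token):
    """Release the lock only if this caller still owns it.
    (A Lua script would make the check-and-delete atomic; omitted for brevity.)"""
    key = f"lock:{name}"
    if r.get(key) == token.encode():
        r.delete(key)
```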
According to the functional requirements, the background is divided into three parts: the Web background, the timing task background and the algorithm background.
Web background: provides, based on Java, basic interfaces for adding, deleting, modifying and querying various types of data, including but not limited to data and model management. Timing task background: based on Java and XXL-JOB, for the periodic execution of timed tasks. Algorithm background: implements the OCR algorithm, the violation detection algorithm and a semantic similarity algorithm based on Python, CTPN, CRNN and PaddleHub.
Drawing on the idea of micro-services, the background can be further abstracted into five services: data query, timed tasks, data acquisition, data violation detection and data analysis. The data query service and the timed-task service face the user directly, while the bottom layer provides the basic functions for the data query, timed-task, data violation detection and data analysis services. The micro-service architecture is shown in fig. 5.
Through the above technical frameworks, the mobile-terminal violation monitoring system based on OCR and transfer learning is constructed, and all modules of the system are integrated on this technical framework, ensuring the maintainability and extensibility of the system.
As shown in fig. 6 and 7, the APP violation monitoring method based on OCR and transfer learning according to the preferred embodiment of the present invention includes the following steps:
and step S10, periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises data packet capture and page screenshot.
Specifically, the APK of each application is crawled by means of a Jsoup library based on Java, the APK of an application store is regularly updated, data collection of the corresponding APP is carried out according to the updated APK, and the collected application store comprises a millet application store, an application treasure, a hundred-degree application assistant, a 360-degree mobile phone assistant, a pea pod, a PP assistant, a dog searching mobile phone assistant and the like.
The data collection aims at monitoring the relevant behaviors of the APP, such as fund profiles, fund managers, fund announcements, fund promotion carousel graphs, fund promotion activity introduction and the like in the APP.
The data acquisition mode specifically comprises the following steps: and (4) directly carrying out propaganda data packet capturing by using a crawler and carrying out page automatic screenshot by using an Appium script.
For example, when data is captured with the crawler, some APPs are collected by packet capture, such as Tiantian Fund, Danjuan Fund, GF (Guangfa) Fund, Xingquan Fund, Guotai Fund and several other fund APPs. First, the URL and parameters of the fund-list interface are obtained with a packet-capture tool, and fund profile and fund announcement data are obtained according to the fund codes in the URL and parameters; then the corresponding HTTP requests are constructed with Python and the results are stored in a MongoDB database; with Python's requests library, the requests can be constructed easily and the corresponding data results obtained.
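A minimal sketch of the request-and-store step described above, using Python's requests library and MongoDB; the endpoint URL, parameters and collection names are hypothetical stand-ins for values recovered by packet capture:

```python
import requests
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["app_monitor"]["fund_announcements"]

# Hypothetical fund-list endpoint and parameters; the real URL and fields come from packet capture.
resp = requests.get(
    "https://api.example-fund.com/fund/announcements",
    params={"fundCode": "000001", "page": 1},
    timeout=10,
)
resp.raise_for_status()

# Store the raw response for later analysis.
collection.insert_one({"fund_code": "000001", "payload": resp.json()})
```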
For example, for the automatic screenshots, the difficulty of Appium-based automatic page screenshots lies in ensuring the stability and comprehensiveness of the automatic click script and the accuracy of the OCR recognition. The method mainly locates control elements through their text, and at the same time uses the Appium inspector to obtain XPath expressions that assist in locating the control elements.
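A minimal sketch of an Appium-driven page screenshot using the Appium Python client; the package name, activity, locator text and server URL are assumptions, and the capability wiring varies with the client and server version:

```python
from appium import webdriver
from appium.options.android import UiAutomator2Options
from appium.webdriver.common.appiumby import AppiumBy

options = UiAutomator2Options().load_capabilities({
    "platformName": "Android",
    "appium:automationName": "UiAutomator2",
    "appium:appPackage": "com.example.fundapp",   # hypothetical package
    "appium:appActivity": ".MainActivity",        # hypothetical activity
})
driver = webdriver.Remote("http://127.0.0.1:4723", options=options)

try:
    # Locate a control by its visible text via an XPath obtained from the inspector.
    driver.find_element(AppiumBy.XPATH, '//*[@text="基金"]').click()
    driver.get_screenshot_as_file("fund_home.png")
finally:
    driver.quit()
```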
For example, because the interfaces of the fund APPs that need screenshot monitoring share the characteristics of low interface depth and few interface search buttons, the automatic screenshot requirements of most APPs can be covered by the design of an APP configuration path table. The configuration path table structure finally formed is shown in fig. 8.
To achieve the versatility of the script, an Appium configuration path is abstracted into 5 fields: the application, the root path, the sub-path, the exception log and the enabled state. The specific format of the root path and the sub-path is defined as follows:
(first page-0|current-0|com.hctformgf.gff:id/risk_ward_tv-5-0);
wherein the 3 path segments separated by the | symbol (first page-0, current-0 and com.hctformgf.gff:id/risk_ward_tv-5-0) represent 3 element controls clicked in sequence; the 0 in first page-0 and current-0 indicates positioning by text control, the 5 in com.hctformgf.gff:id/risk_ward_tv-5-0 indicates positioning in a specified manner, and the 0 there indicates positioning by element ID control. The general format of a path can therefore be summarized as text-positioning mode number-additional parameter | text-positioning mode number-additional parameter. The specific numbers and their meanings are shown in the following table:
in the carousel map of the APP, by providing a positioning carousel map with characters and coordinates, the reality of positioning the carousel map with the characters is still that the carousel map is positioned with the coordinates, the carousel map can be slid left and right by the coordinates, and at the same time, the problem to be considered is that the problem that the number of the obtained carousel maps is abnormal when the automatic sliding of the APP carousel map and the sliding of the script are performed simultaneously, the total number of the carousel maps is defined in an additional parameter for leading out a positioning carousel map mode, the same carousel map is prevented from being processed by means of picture similarity, and the process is exited again until the corresponding carousel map total number is obtained.
Step S20: performing character recognition and extraction on the screenshots based on an OCR algorithm.
Specifically, after all the screenshots are obtained, the characters are recognized and extracted by using an OCR algorithm to obtain the required information.
Step S30: constructing a sample set from the recognized text content through keywords and regular expressions, and labelling it manually.
Specifically, for the recognized text content, a sample set is constructed through keywords and regular expressions (a regular expression is a logical formula for operating on character strings: predefined specific characters, and combinations of them, form a "rule string" that expresses a filtering logic to apply to character strings), and reliability is improved through manual labelling.
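A minimal sketch of the keyword/regular-expression pre-screening described above; the patterns and field names are illustrative assumptions, and the final label is still assigned manually:

```python
import re

# Illustrative patterns only; the real keyword list and regexes follow the compliance rules.
CANDIDATE_PATTERNS = [
    re.compile(r"保本|稳赚|零风险"),                  # "guaranteed principal / sure profit / zero risk"
    re.compile(r"收益率?\s*(高达)?\s*\d+(\.\d+)?%"),   # concrete promised return rates
]

def build_sample_set(ocr_texts):
    """Pre-screen OCR output into candidate samples for manual labelling."""
    samples = []
    for text in ocr_texts:
        hit = any(p.search(text) for p in CANDIDATE_PATTERNS)
        samples.append({"text": text, "candidate": hit, "label": None})  # label filled in by annotators
    return samples
```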
Step S40: inputting the manually labelled sample set into a pre-trained deep learning model for model fine-tuning, and realizing violation judgment of texts in different scenes by dividing business scenes.
Before step S40, the method further includes: constructing a corpus for supervised training of the deep learning model, which specifically comprises: acquiring a plurality of keywords and matching the keywords; and constructing a training corpus based on the keywords and labelling it manually to generate the corpus.
For example, the fund-sale violation discrimination model is trained in a supervised manner, so a corpus needs to be constructed for training the supervised deep learning model. Because the richness of the corpus directly influences the accuracy of the semantic labels, in order to guarantee the quality of the semantic label samples, corpora of fund-sale violation propaganda are collected from the network manually.
By analysing and summarizing network violation cases, keywords to patrol for are compiled, such as high income, zero risk, cash red packet, required purchase rate, guarantee and the like. Based on these keywords, matching is performed in modes such as contains, does not contain, greater than, less than or equal to, and regular expression; a training corpus is constructed from the keywords, and labels are added to it manually for training the model.
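The matching modes listed above (contains, does not contain, greater than, less than or equal to, regular expression) could be dispatched as follows; the mode names and the numeric-extraction rule are assumptions made for illustration:

```python
import re

def keyword_match(text, mode, value):
    """Apply one keyword-matching rule of the kinds listed above."""
    if mode == "contains":
        return value in text
    if mode == "not_contains":
        return value not in text
    if mode == "regex":
        return re.search(value, text) is not None
    if mode in ("greater_than", "less_or_equal"):
        # Compare the first number found in the text (e.g. a promised return rate).
        m = re.search(r"\d+(\.\d+)?", text)
        if m is None:
            return False
        num = float(m.group())
        return num > float(value) if mode == "greater_than" else num <= float(value)
    raise ValueError(f"unknown matching mode: {mode}")
```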
Further, since deep learning places a high requirement on the amount of data in the model training process, an effective model cannot be obtained by training a model directly from scratch. The system therefore adopts the PaddleHub framework to construct the violation monitoring model and performs transfer-learning-based model fine-tuning (i.e. model retraining) on the basis of the ERNIE model, which has been pre-trained on massive network data. In the training process, separate training is carried out for different violation scenes (such as an exaggerated-profit scene, a disparaging-other-fund-managers scene, and so on) so as to obtain violation discrimination models for the different scenes; finally, a classification model for judging whether data violates the rules is obtained for each scene.
That is, in order to obtain a model for classification, a large amount of labelled data is fed into the model for training; the weights of the neural network nodes in the model are adjusted through the data and the labels, and the finally obtained model with specific weights can then be used for data classification.
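A minimal sketch of the PaddleHub fine-tuning flow on an ERNIE sequence-classification module, following the PaddleHub 2.x text-classification examples; the public ChnSentiCorp dataset is used here only as a stand-in for the manually labelled violation corpus of one business scene, and the hyper-parameters are illustrative:

```python
import paddle
import paddlehub as hub

# Pre-trained ERNIE module with a sequence-classification head (2 classes: violation / normal).
model = hub.Module(name="ernie", task="seq-cls", num_classes=2)

# Stand-in dataset: in the patent's setting this is replaced by the labelled violation corpus.
train_dataset = hub.datasets.ChnSentiCorp(
    tokenizer=model.get_tokenizer(), max_seq_len=128, mode="train")
dev_dataset = hub.datasets.ChnSentiCorp(
    tokenizer=model.get_tokenizer(), max_seq_len=128, mode="dev")

optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=model.parameters())
trainer = hub.Trainer(model, optimizer, checkpoint_dir="ckpt_violation", use_gpu=False)
trainer.train(train_dataset, epochs=3, batch_size=32, eval_dataset=dev_dataset, save_interval=1)
```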
Step S50: counting the scores of different APPs according to the judgment results output by the deep learning model to obtain the violation score of each APP.
Specifically, the violation score of an APP is obtained by a weighted average (i.e. the weighted average represents the violation score of the APP):

x̄ = (f1·x1 + f2·x2 + … + fn·xn) / (f1 + f2 + … + fn)

wherein x̄ represents the weighted average, f1 to fn are the configured weights of the violation items of each dimension, x1 to xn are the constants corresponding to the quality-inspection results of the violation items of each dimension, n represents the total number of dimensions, and different dimensions represent different violation scenes (such as "lack of reasonable risk warning", "promised returns", "exaggerated returns", etc.). The score within a specified range is looked up according to the number of anomalies in the quality-inspection result, giving the dimension score; similarly, the weighted average over the dimensions (for example n = 5) gives the overall score, the violation score being calculated over five dimensions such as illegally promised returns, lack of reasonable risk warnings, disparaging other fund managers, promised guaranteed returns and exaggerated returns.
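Expressed as code, the weighted average above amounts to the following; the example weights and per-dimension scores are made up for illustration:

```python
def app_violation_score(weights, dim_scores):
    """Weighted average of per-dimension quality-inspection scores.

    weights:    [f1, ..., fn] configured weight of each violation dimension
    dim_scores: [x1, ..., xn] score looked up from each dimension's inspection result
    """
    assert len(weights) == len(dim_scores) and weights
    return sum(f * x for f, x in zip(weights, dim_scores)) / sum(weights)

# e.g. five illustrative dimensions: promised returns, missing risk warning,
# disparaging other fund managers, guaranteed returns, exaggerated returns
print(app_violation_score([0.3, 0.25, 0.2, 0.15, 0.1], [80, 60, 100, 40, 90]))
```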
Further, after the above steps are implemented, all tasks are set to start on a timer. As shown in fig. 9, the tasks include an APP crawling timing task, an APP screenshot timing task and a violation monitoring timing task, which together satisfy the requirement of monitoring APP promotional data in real time.
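The patent schedules these tasks with XXL-JOB on the Java side; purely as an illustration of the same timed-start idea, a Python sketch with APScheduler (job names and trigger times are arbitrary):

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def crawl_apk_job():
    """Refresh APKs from the application stores."""

def screenshot_job():
    """Re-capture APP pages with Appium."""

def violation_monitor_job():
    """Run OCR, violation discrimination and scoring."""

scheduler = BlockingScheduler()
scheduler.add_job(crawl_apk_job, "cron", hour=1)            # illustrative times
scheduler.add_job(screenshot_job, "cron", hour=2)
scheduler.add_job(violation_monitor_job, "cron", hour=4)
scheduler.start()
```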
The method realizes timed automatic crawling of mobile-terminal APP data; for data that cannot be crawled from the mobile terminal, Appium is used to simulate touch-screen operations and take screenshots; OCR technology is used for character detection and character recognition on the captured images, solving the problem of monitoring the text in fund promotion images; for violation monitoring of promotional text, a corpus construction method based on keywords and regular expressions is proposed; for the problem that a deep learning model needs a large amount of training corpus, a transfer learning model is used, and fine-tuning with labelled data is carried out on the basis of a pre-trained model; a data set is constructed to train the deep learning model, yielding a model capable of classifying and judging fund violation texts; and violation statistics and analysis are carried out on the application stores and the fund APPs according to the violation monitoring results, thereby realizing automatic violation monitoring of fund APPs.
The invention can uniformly collect and monitor network fund promotion data, and the subsequent data and model results can be built into a knowledge base for further fields such as big data analysis and knowledge graphs.
Further, as shown in fig. 10, based on the APP violation monitoring method based on OCR and transfer learning, the present invention further provides an intelligent terminal, where the intelligent terminal includes a processor 10, a memory 20, and a display 30. Fig. 10 shows only some of the components of the smart terminal, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may be an internal storage unit of the intelligent terminal in some embodiments, such as a hard disk or a memory of the intelligent terminal. The memory 20 may also be an external storage device of the Smart terminal in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the Smart terminal. Further, the memory 20 may also include both an internal storage unit and an external storage device of the smart terminal. The memory 20 is used for storing application software installed in the intelligent terminal and various data, such as program codes of the installed intelligent terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In an embodiment, the memory 20 stores an APP violation monitoring program 40 based on OCR and transfer learning, and the APP violation monitoring program 40 based on OCR and transfer learning can be executed by the processor 10, so as to implement the APP violation monitoring method based on OCR and transfer learning in the present application.
The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor or other data Processing chip, and is configured to run program codes stored in the memory 20 or process data, such as executing the APP violation monitoring method based on OCR and transfer learning.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the intelligent terminal and for displaying a visual user interface. The components 10-30 of the intelligent terminal communicate with each other via a system bus.
In one embodiment, the following steps are implemented when the processor 10 executes the APP violation monitoring program 40 based on OCR and transfer learning in the memory 20:
periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises a data packet capture and a page screenshot;
performing character recognition and extraction on the screenshot based on an OCR algorithm;
constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking;
inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes;
and according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP.
Wherein periodically updating the APK and acquiring data of the corresponding APP according to the updated APK specifically includes:
crawling the APK of each application with the Java-based Jsoup library, regularly updating the APKs in the application stores, and acquiring data of the corresponding APP according to the updated APKs.
The data acquisition mode specifically includes: capturing promotional data packets directly with a crawler, and taking automatic page screenshots with an Appium script.
In relation to inputting the manually labelled sample set into the pre-trained deep learning model for model adjustment and realizing violation judgment of texts in different scenes by dividing service scenes, the method further includes:
constructing a corpus for supervising training of the deep learning model.
The corpus construction process comprises the following steps:
acquiring a plurality of keywords and matching the keywords;
and constructing a corpus based on the keywords, and manually labeling the keywords for generating the corpus.
Wherein counting the scores of different APPs to obtain the violation score of the APP specifically includes:
the APP violation score is given by a weighted average:

x̄ = (f1·x1 + f2·x2 + … + fn·xn) / (f1 + f2 + … + fn)

wherein x̄ represents the weighted average, f1 to fn are the configured weights of the violation items of each dimension, x1 to xn are the constants corresponding to the quality inspection results of the violation items of each dimension, n represents the total number of dimensions, and different dimensions represent different violation scenes.
Wherein, according to the discrimination result of the deep learning model output, the score of different APPs is counted to obtain the violation score of the APP, and then the method further comprises the following steps:
and setting a timing starting task for all tasks.
Wherein the tasks include: an APP crawling timing task, an APP screenshot timing task and a violation monitoring timing task.
The invention also provides a storage medium, wherein the storage medium stores an APP violation monitoring program based on OCR and transfer learning, and the steps of the APP violation monitoring method based on OCR and transfer learning are realized when the APP violation monitoring program based on OCR and transfer learning is executed by a processor.
In summary, the present invention provides an APP violation monitoring method based on OCR and transfer learning, the method comprising: periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises data packet capture and page screenshots; performing character recognition and extraction on the screenshots based on an OCR algorithm; constructing a sample set from the recognized text content through keywords and regular expressions, and labelling it manually; inputting the manually labelled sample set into a pre-trained deep learning model for model fine-tuning, and realizing violation judgment of texts in different scenes by dividing business scenes; and counting the scores of different APPs according to the judgment results output by the deep learning model to obtain the violation score of each APP. According to the invention, violations in an APP are detected effectively and quickly by collecting and analyzing the data of the APP.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program may be stored in a computer readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.
Claims (7)
1. The APP violation monitoring method based on OCR and transfer learning is characterized by comprising the following steps:
periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises a data packet capture and a page screenshot;
performing character recognition and extraction on the screenshot based on an OCR algorithm;
constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking;
constructing a corpus used for supervising the training of the deep learning model;
the corpus construction process comprises the following steps:
acquiring a plurality of keywords and matching the keywords;
constructing a training corpus based on keywords, and manually labeling the training corpus to generate the corpus;
in the training process, carrying out independent training work according to different violation scenes to divide violation distinguishing models in different scenes and finally obtain a classification model for judging whether data violate in each scene;
inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes;
according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP;
the statistics of the scores of different APPs is carried out to obtain the violation score of the APP, and the method specifically comprises the following steps:
the APP violation score is given by a weighted average:

x̄ = (f1·x1 + f2·x2 + … + fn·xn) / (f1 + f2 + … + fn)

wherein x̄ represents the weighted average, f1 to fn are the configured weights of the violation items of each dimension, x1 to xn are the constants corresponding to the quality inspection results of the violation items of each dimension, n represents the total number of dimensions, and different dimensions represent different violation scenes.
2. The APP violation monitoring method based on OCR and transfer learning according to claim 1, wherein the APK is periodically updated, and data acquisition of the corresponding APP is performed according to the updated APK, specifically including:
and crawling APK of each application by means of a Jsoup library based on Java, regularly updating the APK of the application store, and acquiring data of the corresponding APP according to the updated APK.
3. The APP violation monitoring method based on OCR and transfer learning of claim 1, wherein the data collection manner specifically comprises: and (4) directly carrying out propaganda data packet capturing by using a crawler and carrying out page automatic screenshot by using an Appium script.
4. The APP violation monitoring method based on OCR and transfer learning according to claim 1, wherein the method performs statistics on scores of different APPs according to a discrimination result output by the deep learning model to obtain violation scores of APPs, and then further comprises:
and setting a timing starting task for all tasks.
5. The APP violation monitoring method based on OCR and transfer learning of claim 4, wherein the tasks comprise: an APP crawling timing task, an APP screenshot timing task and a violation monitoring timing task.
6. An intelligent terminal, characterized in that, intelligent terminal includes: memory, a processor and an OCR and transfer learning based APP violation monitoring program stored on the memory and executable on the processor, the OCR and transfer learning based APP violation monitoring program when executed by the processor implementing the steps of the OCR and transfer learning based APP violation monitoring method of any of claims 1-5.
7. A storage medium storing an OCR and transfer learning based APP violation monitoring program which, when executed by a processor, implements the steps of the OCR and transfer learning based APP violation monitoring method according to any one of claims 1-5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010862575.XA CN112101335B (en) | 2020-08-25 | 2020-08-25 | APP violation monitoring method based on OCR and transfer learning |
PCT/CN2020/120724 WO2022041406A1 (en) | 2020-08-25 | 2020-10-14 | Ocr and transfer learning-based app violation monitoring method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010862575.XA CN112101335B (en) | 2020-08-25 | 2020-08-25 | APP violation monitoring method based on OCR and transfer learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112101335A CN112101335A (en) | 2020-12-18 |
CN112101335B true CN112101335B (en) | 2022-04-15 |
Family
ID=73753383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010862575.XA Active CN112101335B (en) | 2020-08-25 | 2020-08-25 | APP violation monitoring method based on OCR and transfer learning |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112101335B (en) |
WO (1) | WO2022041406A1 (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112686022A (en) * | 2020-12-30 | 2021-04-20 | 平安普惠企业管理有限公司 | Method and device for detecting illegal corpus, computer equipment and storage medium |
CN112948830B (en) * | 2021-03-12 | 2023-11-10 | 安天科技集团股份有限公司 | File risk identification method and device |
CN113076339B (en) * | 2021-03-18 | 2024-08-20 | 北京沃东天骏信息技术有限公司 | Data caching method, device, equipment and storage medium |
CN113221890A (en) * | 2021-05-25 | 2021-08-06 | 深圳市瑞驰信息技术有限公司 | OCR-based cloud mobile phone text content supervision method, system and system |
CN113326376A (en) * | 2021-05-28 | 2021-08-31 | 南京大学 | Code review opinion quality evaluation system and method based on machine learning |
CN113568823A (en) * | 2021-09-27 | 2021-10-29 | 深圳市永达电子信息股份有限公司 | Employee operation behavior monitoring method, system and computer readable medium |
CN113888760B (en) * | 2021-09-29 | 2024-04-23 | 平安银行股份有限公司 | Method, device, equipment and medium for monitoring violation information based on software application |
CN114978936B (en) * | 2022-05-24 | 2024-08-16 | 身边云(北京)信息服务有限公司 | Upgrading method, system and storage medium of shared service platform |
CN117197816B (en) * | 2023-06-19 | 2024-07-30 | 珠海盈米基金销售有限公司 | User material identification method and system |
CN116664825B (en) * | 2023-06-26 | 2024-07-19 | 北京智源人工智能研究院 | Self-supervision contrast learning method and system for large-scene point cloud object detection |
CN116912867B (en) * | 2023-09-13 | 2023-12-29 | 之江实验室 | Teaching material structure extraction method and device combining automatic labeling and recall completion |
CN117272113B (en) * | 2023-10-10 | 2024-09-17 | 友福同享(深圳)智能科技有限公司 | Method and system for detecting illegal behaviors based on virtual social network |
CN117541269B (en) * | 2023-12-08 | 2024-07-02 | 北京中数睿智科技有限公司 | Third party module data real-time monitoring method and system based on intelligent large model |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110837615A (en) * | 2019-11-05 | 2020-02-25 | 福建省趋普物联科技有限公司 | Artificial intelligent checking system for advertisement content information filtering |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108205674B (en) * | 2017-12-22 | 2022-04-15 | 广州爱美互动网络科技有限公司 | Social APP content identification method, electronic device, storage medium and system |
CN109492143A (en) * | 2018-09-21 | 2019-03-19 | 平安科技(深圳)有限公司 | Image processing method, device, computer equipment and storage medium |
CN110210484A (en) * | 2019-04-19 | 2019-09-06 | 成都三零凯天通信实业有限公司 | System and method for detecting and identifying poor text of view image based on deep learning |
CN110210542B (en) * | 2019-05-24 | 2021-10-08 | 厦门美柚股份有限公司 | Picture character recognition model training method and device and character recognition system |
CN110275958B (en) * | 2019-06-26 | 2021-07-27 | 北京市博汇科技股份有限公司 | Website information identification method and device and electronic equipment |
KR20190103088A (en) * | 2019-08-15 | 2019-09-04 | 엘지전자 주식회사 | Method and apparatus for recognizing a business card using federated learning |
CN111400132B (en) * | 2020-03-09 | 2023-08-18 | 北京版信通技术有限公司 | Automatic monitoring method and system for on-shelf APP |
Also Published As
Publication number | Publication date |
---|---|
WO2022041406A1 (en) | 2022-03-03 |
CN112101335A (en) | 2020-12-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112101335B (en) | APP violation monitoring method based on OCR and transfer learning | |
AU2019355933B2 (en) | Software testing | |
US20220004878A1 (en) | Systems and methods for synthetic document and data generation | |
EP3640847A1 (en) | Systems and methods for identifying form fields | |
US20220342921A1 (en) | Systems and methods for parsing log files using classification and a plurality of neural networks | |
CN110972499A (en) | Labeling system of neural network | |
CN111881398B (en) | Page type determining method, device and equipment and computer storage medium | |
CN112818200A (en) | Data crawling and event analyzing method and system based on static website | |
CN113918794B (en) | Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium | |
CN115937887A (en) | Method and device for extracting document structured information, electronic equipment and storage medium | |
CN114676705B (en) | Dialogue relation processing method, computer and readable storage medium | |
EP3640861A1 (en) | Systems and methods for parsing log files using classification and a plurality of neural networks | |
CN114328936B (en) | Method and device for establishing classification model | |
CN116226850A (en) | Method, device, equipment, medium and program product for detecting virus of application program | |
CN115186240A (en) | Social network user alignment method, device and medium based on relevance information | |
CN113220843A (en) | Method, device, storage medium and equipment for determining information association relation | |
CN112487398A (en) | Automatic character type identifying code identifying method, terminal equipment and storage medium | |
Tulsyan et al. | A benchmark system for Indian language text recognition | |
US20230282013A1 (en) | Automated key-value pair extraction | |
CN114765702B (en) | Video processing method and device and computer readable storage medium | |
US11763589B1 (en) | Detection of blanks in documents | |
CN111598159B (en) | Training method, device, equipment and storage medium of machine learning model | |
CN116386068A (en) | Webpage text extraction method, device, equipment and storage medium based on image text | |
CN117215947A (en) | Page white screen detection method and device, computer equipment and storage medium | |
CN117932386A (en) | Analysis method and device for affective influence of open source software development robot on developer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||