CN112101335A - APP violation monitoring method based on OCR and transfer learning - Google Patents

APP violation monitoring method based on OCR and transfer learning

Info

Publication number
CN112101335A
Authority
CN
China
Prior art keywords
app
violation
ocr
data
monitoring method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010862575.XA
Other languages
Chinese (zh)
Other versions
CN112101335B (en)
Inventor
蔡树彬
明仲
林旭恒
吴东阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202010862575.XA priority Critical patent/CN112101335B/en
Priority to PCT/CN2020/120724 priority patent/WO2022041406A1/en
Publication of CN112101335A publication Critical patent/CN112101335A/en
Application granted granted Critical
Publication of CN112101335B publication Critical patent/CN112101335B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/09 Recognition of logos
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

The invention discloses an APP violation monitoring method based on OCR and transfer learning, which comprises the following steps: periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises a data packet capture and a page screenshot; performing character recognition and extraction on the screenshot based on an OCR algorithm; constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking; inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes; and according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP. According to the invention, the illegal use condition of the APP is effectively and quickly detected by collecting and analyzing the data of the APP.

Description

APP violation monitoring method based on OCR and transfer learning
Technical Field
The invention relates to the technical field of data monitoring, in particular to an APP violation monitoring method based on OCR and transfer learning.
Background
Public opinion monitoring systems automatically collect massive amounts of online public opinion information in real time, analyze, summarize and monitor it, identify key public opinion information, and notify the relevant personnel promptly so that an emergency response can be made at the first opportunity. They provide an information platform that directly supports guiding public opinion correctly and collecting the opinions of netizens. However, such systems only collect public opinion data and cannot detect specific content; moreover, they generally monitor only website data and not mobile terminal data.
Web data collection systems, according to user-defined task configurations, extract semi-structured and unstructured data from target web pages accurately and in batches, convert it into structured records, and store it in a local database for internal use or for publication on an external network, so that external information can be acquired quickly. However, they generally collect only web data and cannot collect data from mobile terminal APPs; in addition, websites differ in complexity and in their anti-crawling measures, so the success rate of data crawling cannot be guaranteed.
Short text refers to text of no more than 200 words, such as a microblog post, a chat message, a news headline, an opinion comment, a question, a mobile phone short message, or a document abstract. The short text classification task aims to process the short text input by a user automatically and produce a useful classification output. However, short text classification models rely on supervised learning, which often requires massive data as support and a large amount of manual labeling.
That is, the prior art cannot acquire target information quickly and effectively. For example: the data required for monitoring certain information violations cannot be acquired; if an APP has anti-crawling measures, crawlers cannot be used to collect its data; promotional material for the target information contains a large number of pictures, and data in picture format cannot be processed; data samples for monitoring certain information violations are insufficient, and even when network data is obtained, a large amount of manual labeling is needed; a supervised deep learning model needs massive training data and a large amount of machine resources to achieve good results; and monitoring of certain information violations lacks a platform for reviewing and comparing data.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The invention mainly aims to provide an APP violation monitoring method based on OCR and transfer learning, and aims to solve the problem that target information cannot be quickly and effectively acquired in the prior art.
In order to achieve the above object, the present invention provides an APP violation monitoring method based on OCR and transfer learning, which includes the following steps:
periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises a data packet capture and a page screenshot;
performing character recognition and extraction on the screenshot based on an OCR algorithm;
constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking;
inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes;
and according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, periodically updating the APK and acquiring data of the corresponding APP according to the updated APK specifically includes:
crawling the APK of each application with the Java-based Jsoup library, regularly updating the APKs from the application stores, and acquiring data of the corresponding APP according to the updated APK.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, the data acquisition mode specifically includes: capturing promotional data packets directly with a crawler, and taking automatic page screenshots with an Appium script.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, in connection with inputting the manually labeled sample set into a pre-trained deep learning model for model adjustment and realizing violation judgment of texts in different scenes by dividing service scenes, the method further includes:
constructing a corpus for supervised training of the deep learning model.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, the corpus construction process includes:
acquiring a plurality of keywords and matching on the keywords;
and constructing a corpus based on the keywords, and manually labeling it to generate the corpus.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, counting the scores of different APPs to obtain the violation score of an APP specifically includes:
the APP violation score is given by a weighted average:
$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{n}$$
where $\bar{x}$ represents the weighted average, $f_1, \dots, f_k$ are the configured weights of the violation items of each dimension, $x_1, \dots, x_k$ are the constants assigned to the quality inspection results of the violation items of each dimension, $n$ represents the total number of dimensions, and different dimensions represent different violation scenes.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, after counting the scores of different APPs according to the discrimination result output by the deep learning model to obtain the violation score of the APP, the method further includes:
setting a scheduled start for all tasks.
Optionally, in the APP violation monitoring method based on OCR and transfer learning, the tasks include: an APP crawling scheduled task, an APP screenshot scheduled task, and a violation monitoring scheduled task.
In addition, to achieve the above object, the present invention further provides an intelligent terminal, wherein the intelligent terminal includes: the device comprises a memory, a processor and an APP violation monitoring program based on OCR and transfer learning, wherein the APP violation monitoring program based on OCR and transfer learning is stored on the memory and can run on the processor, and when being executed by the processor, the APP violation monitoring program based on OCR and transfer learning realizes the steps of the APP violation monitoring method based on OCR and transfer learning.
In addition, in order to achieve the above object, the present invention further provides a storage medium, wherein the storage medium stores an APP violation monitoring program based on OCR and transfer learning, and the APP violation monitoring program based on OCR and transfer learning implements the steps of the APP violation monitoring method based on OCR and transfer learning when being executed by a processor.
According to the method, the APK is updated periodically, and data acquisition of the corresponding APP is carried out according to the updated APK, wherein the data acquisition comprises data packet capturing and page screenshot; performing character recognition and extraction on the screenshot based on an OCR algorithm; constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking; inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes; and according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP. According to the invention, the illegal use condition of the APP is effectively and quickly detected by collecting and analyzing the data of the APP.
Drawings
FIG. 1 is a schematic diagram of a cross-platform, multi-language mobile-side automated testing framework based on a Client/Server architecture;
FIG. 2 is a schematic diagram of the Paddlehub architecture in the pre-trained model management and transfer learning tool;
FIG. 3 is a flow chart of the CTPN algorithm;
FIG. 4 is a block diagram of a monitoring system of a mobile terminal based on OCR and transfer learning;
FIG. 5 is a schematic diagram of a microservice architecture;
FIG. 6 is a flow chart of a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;
FIG. 7 is a schematic diagram illustrating an implementation process of an APP violation monitoring method based on OCR and transfer learning according to a preferred embodiment of the present invention;
FIG. 8 is a schematic diagram of a configuration path table structure formed when capturing images in the preferred embodiment of the APP violation monitoring method based on OCR and transfer learning according to the present invention;
FIG. 9 is a schematic diagram of a monitoring function for monitoring APP advertisement data in real time according to a preferred embodiment of the APP violation monitoring method based on OCR and transfer learning of the present invention;
FIG. 10 is a schematic operating environment diagram of an intelligent terminal according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Appium is a cross-platform, multi-language automated testing framework for mobile terminals based on a Client/Server architecture, and it supports both Android and iOS. As shown in FIG. 1, a client written with any Appium Client library that implements the WebDriver JSON Wire Protocol (for example in Java, Python, or JavaScript) sends commands that the Appium Server parses and executes on the mobile phone, such as taking screenshots and clicking. In addition, Appium Desktop helps developers conveniently and accurately obtain the coordinates and XPath attribute information of each control of an APP (XPath is a language for locating parts of an XML document).
Transfer learning is a sub-field of deep learning research. It aims to exploit the similarity between data, tasks, or models in order to apply knowledge learned in an old domain to a new one. Typically, a pre-trained model is fine-tuned to carry out the transfer, so that the model adapts to data in the new domain.
OCR algorithms can generally be divided into two parts: text detection (locating the regions where text appears) and character recognition (recognizing the characters within those regions). For text detection, a CTPN (Connectionist Text Proposal Network) model can be used, for example. CTPN greatly simplifies the detection process: by seamlessly combining a CNN (used to extract deep features) with an RNN (used to recognize sequence features), it improves detection accuracy as well as the effectiveness, speed, and robustness of text detection. The algorithm in the CTPN paper can be implemented with the Keras + TensorFlow framework: a series of proposals (preselected boxes) is generated for detection from the feature map output by the VGG16 convolutional layers, and a CTPN text detection model is trained on the VOC2007_text_detection data set, after which the text regions of a picture can be detected with the CTPN algorithm. Most of the images that the system obtains through automatic Appium screenshots contain horizontal text with little background interference, almost no overlapping edges, and only small tilt angles, so the CTPN text detection algorithm achieves high accuracy on them. The algorithm steps are shown in FIG. 3:
(1) VGG16 (a classic CNN model) is used to extract features, yielding a feature map of size W × H × C. A sliding window of size 3 × 3 is applied on the fifth convolutional layer, and each window yields a feature vector of length 3 × 3 × C.
(2) The convolutional features obtained in step (1) are used as the input of a 256-dimensional bidirectional LSTM (two 128-dimensional LSTMs), producing an output of length W × 256. The LSTM is introduced to address the vanishing-gradient problem of the RNN layer and to extend the RNN layer further.
(3) The output layer produces three outputs: 2k vertical coordinates, 2k scores, and k side-refinement offsets. A standard non-maximum suppression (NMS) algorithm is then used to filter out duplicate, redundant text boxes.
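The filtering in step (3) can be illustrated with a minimal sketch of standard greedy non-maximum suppression. This is not the patent's implementation: the box layout `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the 0.5 overlap threshold are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    by more than iou_threshold, and repeat. Returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    candidate_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    candidate_scores = [0.9, 0.8, 0.7]
    print(nms(candidate_boxes, candidate_scores))
```

The second box overlaps the first heavily and has a lower score, so only the first and third boxes survive.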
After the regions containing text in the picture are obtained, the next task is to recognize the text in each region. DenseNet is one character recognition algorithm. ReLU is chosen as the activation function and three Dense Block layers are used; the Dense Blocks are connected through Transition structures to form the DenseNet network, which is trained with CTC (Connectionist Temporal Classification) loss to obtain the model. After the data passes through a Dense Block, a convolution is applied and the result is passed to the Transition structure, which consolidates the parameters and reduces them through pooling before passing them to the next Dense Block, so that high precision is achieved. During training, when the accuracy and the loss value plateau and begin to oscillate, the learning rate needs to be reduced exponentially, which can raise the accuracy markedly.
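The learning-rate adjustment mentioned above can be sketched as a small "reduce on plateau" helper. The class name `PlateauDecay` and the values of `factor`, `patience`, and `min_delta` are hypothetical; the source only states that the rate is reduced exponentially once accuracy and loss start to oscillate.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` whenever the loss has not
    improved by at least `min_delta` for `patience` consecutive epochs.
    Repeated reductions give an exponential decay: lr * factor ** m."""

    def __init__(self, lr: float, factor: float = 0.1,
                 patience: int = 3, min_delta: float = 1e-4) -> None:
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, loss: float) -> float:
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best - self.min_delta:
            self.best = loss      # real improvement: reset the counter
            self.wait = 0
        else:
            self.wait += 1        # stagnating or oscillating
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr


if __name__ == "__main__":
    scheduler = PlateauDecay(lr=0.01, factor=0.1, patience=2)
    for epoch_loss in [1.0, 0.8, 0.8, 0.8]:
        print(scheduler.step(epoch_loss))
```

Deep learning frameworks usually ship an equivalent callback; the hand-written version here only makes the rule explicit.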
Further, as shown in FIG. 4, the technology stack used by the system covers the front end, the back end, the algorithms, and operation and maintenance:
Front end: Vue, ElementUI, Vuex, Axios;
Back end: Java, Springboot, SpringCloud, Nacos, SpringGateWay, SpringAdmin, Feign, XXL-JOB, Mybatis-Plus, Maven;
Algorithm: Flask, TensorFlow, Keras, PyTorch, CTPN, CRNN;
Operation and maintenance: Docker, Linux.
The system separates the front end from the back end. The front end uses Vue, combined with ElementUI, Vuex, Axios and related technologies, to build the management system interface. The back end is deployed uniformly as Docker images on Linux. The Web back end implements a microservice architecture with Java and SpringCloud: Nacos serves as the configuration center and the registration center, back-end interfaces are accessed uniformly through the SpringGateWay gateway component, and back-end services are monitored uniformly through SpringAdmin. The scheduled-task back end uses the open-source framework XXL-JOB to implement distributed, Java-based scheduled tasks. The algorithm back end implements the OCR character recognition algorithm with Flask, TensorFlow, and Keras, and fine-tunes a PaddleHub pre-trained model to generate the violation detection model.
In the back-end data persistence layer, MySQL is used together with Mybatis-Plus for create, read, update, and delete operations on data; MongoDB stores HTTP packet capture data; and Redis caches hot data and implements distributed locks.
According to the functional requirements, the back end is divided into three parts: the Web back end, the scheduled-task back end, and the algorithm back end.
Web back end: Java-based; provides basic interfaces for creating, reading, updating, and deleting various types of data, including but not limited to data and model management. Scheduled-task back end: based on Java and XXL-JOB; executes scheduled tasks periodically. Algorithm back end: implements the OCR algorithm, the violation detection algorithm, and the semantic similarity algorithm based on Python, CTPN, CRNN, and PaddleHub.
Following the microservice approach, the back end can be further divided into five types of service: data query service, scheduled task service, data acquisition service, data violation detection service, and data analysis service. The data query service and the scheduled task service face the user directly, while the data acquisition, data violation detection, and data analysis services sit in the bottom layer and provide basic functions for them. The microservice architecture is shown in FIG. 5.
Through the various technical frameworks, the violation monitoring system of the mobile terminal based on OCR and transfer learning is constructed, and all modules of the system are integrated based on the technical framework, so that the maintainability and the expansibility of the system are ensured.
As shown in fig. 6 and 7, the APP violation monitoring method based on OCR and transfer learning according to the preferred embodiment of the present invention includes the following steps:
Step S10: periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises data packet capture and page screenshots.
Specifically, the APK of each application is crawled with the Java-based Jsoup library, the APKs from the application stores are updated regularly, and data of the corresponding APP is collected according to the updated APK. The application stores crawled include the Xiaomi App Store, Tencent MyApp (Yingyongbao), Baidu Mobile Assistant, 360 Mobile Assistant, Wandoujia, PP Assistant, Sogou Mobile Assistant, and others.
The purpose of data collection is to monitor relevant content in the APP, such as fund profiles, fund managers, fund announcements, fund promotion carousel images, and introductions to fund promotion activities.
Data is collected in two ways: promotional data is captured directly by a crawler through packet capture, and pages are captured automatically as screenshots by an Appium script.
For example, for crawler-based collection, the data of some APPs is captured with Fiddler, such as day fund, china fund manager, egg roll fund, guang fa yi kou jin, shun fund, xing quan fund, national fund, buyout fund, and reed fund. First, the URL and parameters of the fund list interface are obtained through Fiddler, and the fund profile and fund announcement data are then obtained according to the fund codes in the URL and parameters. Next, the corresponding HTTP requests are constructed in Python; with Python's requests library, a request can be constructed easily and the corresponding result obtained. Finally, the results are stored in a MongoDB database.
For automatic screenshots, the difficulty of Appium-based page capture lies in ensuring the stability and coverage of the automatic click script and the accuracy of OCR recognition. Control elements are located mainly by their text, and XPaths obtained from Appium Desktop are used as an auxiliary means of locating them.
Because the fund APP interfaces that require screenshot monitoring share a shallow page depth and few buttons to explore, an APP configuration path table can cover the automated screenshot requirements of most APPs. The resulting configuration path table structure is shown in FIG. 8.
To make the script general-purpose, the Appium configuration path is abstracted into five fields: application, root path, sub-path, exception log, and enabled state. The root path and the sub-path use the following format:
Home-0|Current-0|com.hctformgf.gff:id/risk_ward_tv-5-0
Here the three paths separated by the | symbol (Home-0, Current-0, and com.hctformgf.gff:id/risk_ward_tv-5-0) represent three controls clicked in sequence. The 0 in Home-0 and Current-0 indicates that the control is located by its text. The 5 in com.hctformgf.gff:id/risk_ward_tv-5-0 indicates that the control is located in a specified manner, and the trailing 0 indicates locating by element ID. The general format of a path can thus be summarized as: locator text-positioning mode number-additional parameter | locator text-positioning mode number-additional parameter. The specific numbers and their meanings are shown in the following table:
[Table of positioning mode numbers and their meanings; provided as images in the original publication.]
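To make the path format concrete, the sketch below parses a configuration path into an ordered list of click steps. It is an illustration under stated assumptions, not the patent's code: the names `ClickStep` and `parse_config_path` are hypothetical, the labels `Home` and `Current` are illustrative English renderings, and the parser assumes that locators never contain a hyphen. The meaning of each mode number is deliberately left uninterpreted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClickStep:
    locator: str                 # text or resource id used to find the control
    mode: int                    # positioning mode number
    extra: Optional[str] = None  # optional additional parameter


def parse_config_path(path: str) -> List[ClickStep]:
    """Split 'locator-mode[-extra]|locator-mode[-extra]|...' into steps.

    Assumption: the locator itself contains no '-' character, so each
    step splits cleanly into two or three hyphen-separated parts.
    """
    steps: List[ClickStep] = []
    for raw in path.split("|"):
        parts = [p.strip() for p in raw.strip().split("-")]
        if len(parts) == 2:
            locator, mode = parts
            extra = None
        elif len(parts) == 3:
            locator, mode, extra = parts
        else:
            raise ValueError(f"unrecognized step: {raw!r}")
        steps.append(ClickStep(locator=locator, mode=int(mode), extra=extra))
    return steps


if __name__ == "__main__":
    example = "Home-0|Current-0|com.hctformgf.gff:id/risk_ward_tv-5-0"
    for step in parse_config_path(example):
        print(step)
```

A driver script could then walk the returned list and dispatch on `mode` to the appropriate locating strategy for each click.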
For the carousel of an APP, two positioning modes are provided: by text and by coordinates. Positioning by text ultimately resolves to coordinates, and the coordinates are then used to swipe the carousel left and right. One problem to consider is that the APP's automatic carousel rotation can coincide with the script's swipe, so that an abnormal number of carousel images is obtained. To handle this, the total number of carousel images is defined in the additional parameter of the carousel positioning mode, the same image is prevented from being processed twice by means of picture similarity, and the process exits only once the specified total number of carousel images has been obtained.
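The de-duplication logic can be sketched as follows. The source does not say which similarity measure is used, so this sketch swaps in a simple average hash over small grayscale pixel grids; `average_hash`, `similarity`, `collect_carousel`, and the 0.9 threshold are all hypothetical, and frames are assumed to have already been downscaled to the same size.

```python
from typing import Iterable, List, Sequence

# Assumed representation: a grayscale image as rows of 0-255 pixel values.
Image = Sequence[Sequence[int]]


def average_hash(img: Image) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def similarity(a: Image, b: Image) -> float:
    """Fraction of matching hash bits (1.0 means identical hashes)."""
    hash_a, hash_b = average_hash(a), average_hash(b)
    same = sum(x == y for x, y in zip(hash_a, hash_b))
    return same / len(hash_a)


def collect_carousel(frames: Iterable[Image], total: int,
                     threshold: float = 0.9) -> List[Image]:
    """Keep frames that differ from every frame kept so far, and stop as
    soon as `total` distinct carousel images have been collected."""
    unique: List[Image] = []
    for frame in frames:
        if all(similarity(frame, seen) < threshold for seen in unique):
            unique.append(frame)
        if len(unique) == total:
            break
    return unique


if __name__ == "__main__":
    a = [[0, 0], [255, 255]]
    b = [[255, 255], [0, 0]]
    c = [[0, 255], [0, 255]]
    # The same slide may be captured twice when auto-rotation and the
    # script's swipe coincide; duplicates are skipped.
    print(len(collect_carousel([a, a, b, b, c, a], total=3)))
```

In practice `frames` would be a generator that swipes and takes a screenshot on each iteration, so stopping early avoids unnecessary swipes.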
Step S20: performing character recognition and extraction on the screenshots based on an OCR algorithm.
Specifically, after all the screenshots are obtained, an OCR algorithm is used to recognize and extract the text in them to obtain the required information.
Step S30: constructing a sample set from the recognized text content through keywords and regular expressions, and labeling it manually.
Specifically, a sample set is constructed from the recognized text content through keywords and regular expressions, and its reliability is improved through manual labeling. (A regular expression is a logical formula for operating on character strings: predefined characters and combinations of them form a "rule string" that expresses filtering logic for strings.)
Step S40: inputting the manually labeled sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing business scenes.
Before step S40, the method further includes constructing a corpus for supervised training of the deep learning model. Specifically: acquiring a plurality of keywords and matching on the keywords; and constructing a corpus based on the keywords, and manually labeling it to generate the corpus.
For example, the fund sales violation discrimination model is trained in a supervised manner, so a corpus must be constructed for training it. Because the richness of the corpus directly affects the accuracy of the semantic labels, the corpus of non-compliant fund sales promotion is collected from the web manually in order to ensure the quality of the labeled samples.
By analyzing and summarizing online violation cases, inspection keywords are compiled, such as high return, zero risk, cash red envelope, purchase rate, and guarantee. Based on these keywords, text is matched using modes such as contains, does not contain, greater than, less than or equal to, and regular expression; a training corpus is constructed from the matches, and its labels are assigned manually for training the model.
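A minimal sketch of rule-based corpus construction with these matching modes is given below. The names `Rule`, `matches`, and `build_candidates` are hypothetical, the English sample texts stand in for Chinese promotional text, and the choice to apply the numeric modes to the first percentage found in the text is an assumption; the source does not say what quantity "greater than" and "less than or equal to" compare.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    name: str
    mode: str   # "contains", "not_contains", "greater_than", "less_or_equal", "regex"
    value: str


_PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def first_percent(text: str) -> Optional[float]:
    """Assumption: numeric modes compare against the first 'NN%' in the text."""
    found = _PERCENT.search(text)
    return float(found.group(1)) if found else None


def matches(rule: Rule, text: str) -> bool:
    if rule.mode == "contains":
        return rule.value in text
    if rule.mode == "not_contains":
        return rule.value not in text
    if rule.mode == "regex":
        return re.search(rule.value, text) is not None
    if rule.mode in ("greater_than", "less_or_equal"):
        number = first_percent(text)
        if number is None:
            return False
        limit = float(rule.value)
        return number > limit if rule.mode == "greater_than" else number <= limit
    raise ValueError(f"unknown mode: {rule.mode}")


def build_candidates(texts: List[str], rules: List[Rule]) -> List[Dict]:
    """Collect texts hit by at least one rule; 'label' is left empty for
    the manual annotation step."""
    candidates = []
    for text in texts:
        hits = [rule.name for rule in rules if matches(rule, text)]
        if hits:
            candidates.append({"text": text, "matched_rules": hits, "label": None})
    return candidates


if __name__ == "__main__":
    rules = [
        Rule("zero_risk", "contains", "zero risk"),
        Rule("high_return", "greater_than", "8"),
        Rule("guarantee", "regex", r"guarantee[sd]?\b"),
    ]
    ocr_texts = [
        "zero risk, annual return 12%",
        "fund announcement for the third quarter",
        "principal guaranteed",
    ]
    for sample in build_candidates(ocr_texts, rules):
        print(sample)
```

Keeping `label` empty reflects the workflow in the text: rules only pre-select candidates, and a human assigns the final violation label.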
Further, because deep learning requires a large volume of training data, an effective model cannot be obtained by training from scratch. The system therefore uses the PaddleHub framework to build the violation monitoring model, and fine-tunes (i.e., retrains) the ERNIE model, which has been pre-trained on massive network data, by means of transfer learning. During training, a separate model is trained for each violation scene (such as exaggerating returns, disparaging other fund managers, and the like), so that there is one violation discrimination model per scene; the final result is, for each scene, a classification model that judges whether the data violates the rules.
That is, to obtain a classification model, a large amount of labeled data is fed into the model for training; the weights of the model's neural network nodes are adjusted according to the data and labels, and the resulting model with its specific weights can then be used to classify data.
S50, counting the scores of different APPs according to the discrimination result output by the deep learning model to obtain the violation score of each APP.
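The idea of adjusting weights from labeled data, with one classifier per violation scene, can be sketched with a deliberately tiny stand-in. This is a bag-of-words perceptron written in plain Python, not ERNIE and not the PaddleHub API; the scene name and the training sentences are invented for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Whitespace tokenizer; a real system would use the model's own tokenizer."""
    return text.lower().split()


def train_perceptron(samples, epochs=20):
    """Fit word weights from (text, label) pairs; label 1 = violation, 0 = compliant."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            tokens = tokenize(text)
            score = bias + sum(weights[t] for t in tokens)
            predicted = 1 if score > 0 else 0
            error = label - predicted
            if error:
                # Move the weights of the words in this text toward the correct label.
                for t in tokens:
                    weights[t] += error
                bias += error
    return dict(weights), bias


def predict(model, text):
    """Return 1 (violation) or 0 (compliant) for a text."""
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return 1 if score > 0 else 0


# One labeled set per violation scene (hypothetical data).
SCENE_DATA = {
    "exaggerated_returns": [
        ("highest returns in the market", 1),
        ("returns may go up or down", 0),
        ("guaranteed to double your money", 1),
        ("read the prospectus before investing", 0),
    ],
}

# One independent model per scene, mirroring the per-scene training described above.
MODELS = {scene: train_perceptron(data) for scene, data in SCENE_DATA.items()}
```

In the described system the same loop structure holds, but the starting weights come from the pre-trained model rather than from zero, which is why far less labeled data is needed.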
Specifically, the violation score of the APP is given by a weighted average:

$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{n}$$

where $\bar{x}$ denotes the weighted average, $f_1$ to $f_k$ are the configured weights of the violation items of each dimension, $x_1$ to $x_k$ are the constants representing the quality inspection results of the violation items of each dimension, and $n$ is the total number of dimensions. Different dimensions represent different violation scenes (such as "lack of reasonable risk warning", "promising returns", "exaggerating returns", and the like). The score of each dimension is looked up within a specified range according to the number of anomalies in its quality inspection result. For example, with $n = 5$ the violation score is the weighted average over five dimensions: illegally promising returns, lacking a reasonable risk warning, disparaging other fund managers, promising guaranteed returns, and exaggerating returns.
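As a worked illustration of the scoring step, the sketch below assumes the weighted average divides the weighted sum of the dimension scores by the number of dimensions n. The dimension names, scores, and weights are invented values, and `violation_score` is a hypothetical helper name.

```python
def violation_score(dimension_scores, weights):
    """Weighted average of per-dimension scores x_i with configured weights f_i.

    Assumes the form sum(x_i * f_i) / n, where n is the number of dimensions.
    """
    if dimension_scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    n = len(dimension_scores)
    if n == 0:
        raise ValueError("at least one dimension is required")
    weighted_sum = sum(dimension_scores[d] * weights[d] for d in dimension_scores)
    return weighted_sum / n


# Five dimensions, matching the n = 5 example in the text (values are illustrative).
SCORES = {
    "promised_returns": 80,
    "missing_risk_warning": 60,
    "disparaging_other_managers": 100,
    "guaranteed_returns": 90,
    "exaggerated_returns": 70,
}
WEIGHTS = {
    "promised_returns": 2.0,
    "missing_risk_warning": 1.0,
    "disparaging_other_managers": 0.5,
    "guaranteed_returns": 1.0,
    "exaggerated_returns": 0.5,
}

APP_SCORE = violation_score(SCORES, WEIGHTS)  # (160 + 60 + 50 + 90 + 35) / 5 = 79.0
```

Keeping the weights in configuration rather than in code lets the relative importance of each violation scene be changed without retraining any model.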
Further, once the above steps are implemented, all tasks are set to start on a timer. As shown in fig. 9, the tasks include an APP crawling timed task, an APP screenshot timed task, and a violation monitoring timed task, so that APP promotional data is monitored in real time.
The method realizes timed, automatic crawling of mobile APP data; for data that cannot be crawled from the mobile terminal, Appium is used to simulate touch-screen operations and take screenshots; text detection and text recognition are performed on the screenshots using OCR, which solves the problem of monitoring text inside fund promotion images; for violation monitoring of promotional text, a corpus construction method based on keywords and regular expressions is provided; because deep learning models need a large training corpus, a transfer learning model is used and fine-tuned on labeled data starting from a pre-trained model; a data set is constructed to train the deep learning model, yielding a model capable of classifying fund violation text; and violation statistics for application stores and fund APPs are compiled from the monitoring results, thereby realizing automatic violation monitoring of fund APPs.
The invention can uniformly collect and monitor online fund promotion data, and the resulting data and model outputs can subsequently be built into a knowledge base for further fields such as big data analysis and knowledge graphs.
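The three timed tasks can be sketched with Python's standard `sched` module. The 60-second interval, the task names, and the `FakeClock` helper are assumptions made so that the example runs instantly; a deployment would use a real clock, or a cron-style scheduler, and real task bodies.

```python
import sched


class FakeClock:
    """Simulated clock so the schedule can run without actually waiting."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def demo(interval=60, runs=2):
    """Schedule the three tasks `runs` times each and return the execution log."""
    clock = FakeClock()
    scheduler = sched.scheduler(clock.time, clock.sleep)
    log = []
    # Priority fixes the order within one tick: crawl, then screenshot, then monitor.
    task_names = ["app_crawl", "app_screenshot", "violation_monitoring"]
    for priority, name in enumerate(task_names, start=1):
        for run in range(1, runs + 1):
            scheduler.enter(
                interval * run,
                priority,
                lambda n=name: log.append((clock.time(), n)),
            )
    scheduler.run()
    return log


LOG = demo()
```

Ordering the tasks by priority within each tick reflects the data dependency in the method: screenshots and monitoring both rely on the APK list that the crawl refreshes.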
Further, as shown in fig. 10, based on the APP violation monitoring method based on OCR and transfer learning, the present invention further provides an intelligent terminal, where the intelligent terminal includes a processor 10, a memory 20, and a display 30. Fig. 10 shows only some of the components of the smart terminal, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may be an internal storage unit of the intelligent terminal in some embodiments, such as a hard disk or a memory of the intelligent terminal. The memory 20 may also be an external storage device of the Smart terminal in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the Smart terminal. Further, the memory 20 may also include both an internal storage unit and an external storage device of the smart terminal. The memory 20 is used for storing application software installed in the intelligent terminal and various data, such as program codes of the installed intelligent terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In an embodiment, the memory 20 stores an APP violation monitoring program 40 based on OCR and transfer learning, and the APP violation monitoring program 40 based on OCR and transfer learning can be executed by the processor 10, so as to implement the APP violation monitoring method based on OCR and transfer learning in the present application.
The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor or other data Processing chip, and is configured to run program codes stored in the memory 20 or process data, such as executing the APP violation monitoring method based on OCR and transfer learning.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the intelligent terminal and for displaying a visual user interface. The components 10-30 of the intelligent terminal communicate with each other via a system bus.
In one embodiment, the following steps are implemented when the processor 10 executes the APP violation monitoring program 40 based on OCR and transfer learning stored in the memory 20:
periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises a data packet capture and a page screenshot;
performing character recognition and extraction on the screenshot based on an OCR algorithm;
constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking;
inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes;
and according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP.
Wherein periodically updating the APK and acquiring data of the corresponding APP according to the updated APK specifically includes:
crawling the APK of each application by means of the Java-based Jsoup library, periodically updating the APKs from the application store, and acquiring data of the corresponding APP according to the updated APK.
The data acquisition specifically comprises: capturing promotional data packets directly with a crawler, and taking automatic page screenshots with an Appium script.
Before the manually labeled sample set is input into the pre-trained deep learning model for model adjustment and violation judgment of texts in different scenes is realized by dividing service scenes, the method further includes:
constructing a corpus for supervised training of the deep learning model.
The corpus construction process comprises the following steps:
acquiring a plurality of keywords and matching the keywords;
and constructing a corpus based on the keywords, and manually labeling the corpus to produce the training corpus.
Wherein counting the scores of different APPs to obtain the violation score of the APP specifically includes:
the APP violation score is given by a weighted average:

$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{n}$$

where $\bar{x}$ denotes the weighted average, $f_1$ to $f_k$ are the configured weights of the violation items of each dimension, $x_1$ to $x_k$ are the constants representing the quality inspection results of the violation items of each dimension, $n$ is the total number of dimensions, and different dimensions represent different violation scenes.
After the scores of different APPs are counted according to the discrimination result output by the deep learning model to obtain the violation score of the APP, the method further includes:
setting a timed start for all tasks.
The tasks include: an APP crawling timed task, an APP screenshot timed task, and a violation monitoring timed task.
The invention also provides a storage medium, wherein the storage medium stores an APP violation monitoring program based on OCR and transfer learning, and the steps of the APP violation monitoring method based on OCR and transfer learning are realized when the APP violation monitoring program based on OCR and transfer learning is executed by a processor.
In summary, the present invention provides an APP violation monitoring method based on OCR and transfer learning, the method including: periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises data packet capture and page screenshots; performing character recognition and extraction on the screenshots based on an OCR algorithm; constructing a sample set from the recognized text content through keywords and regular expressions, and manually labeling it; inputting the manually labeled sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes; and, according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP. By collecting and analyzing APP data, the invention detects APP violations effectively and quickly.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing relevant hardware (such as a processor, a controller, etc.), and the program may be stored in a computer readable storage medium, and when executed, the program may include the processes of the above method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. An APP violation monitoring method based on OCR and transfer learning, characterized by comprising the following steps:
periodically updating the APK, and acquiring data of the corresponding APP according to the updated APK, wherein the data acquisition comprises a data packet capture and a page screenshot;
performing character recognition and extraction on the screenshot based on an OCR algorithm;
constructing a sample set for the recognized text contents through keywords and a regular expression, and manually marking;
inputting the manually marked sample set into a pre-trained deep learning model for model adjustment, and realizing violation judgment of texts in different scenes by dividing service scenes;
and according to the discrimination result output by the deep learning model, counting the scores of different APPs to obtain the violation score of the APP.
2. The APP violation monitoring method based on OCR and transfer learning according to claim 1, wherein periodically updating the APK and acquiring data of the corresponding APP according to the updated APK specifically comprises:
crawling the APK of each application by means of the Java-based Jsoup library, periodically updating the APKs from the application store, and acquiring data of the corresponding APP according to the updated APK.
3. The APP violation monitoring method based on OCR and transfer learning of claim 1, wherein the data acquisition specifically comprises: capturing promotional data packets directly with a crawler, and taking automatic page screenshots with an Appium script.
4. The APP violation monitoring method based on OCR and transfer learning of claim 1, wherein before the manually labeled sample set is input into the pre-trained deep learning model for model adjustment and violation judgment of texts in different scenes is realized by dividing service scenes, the method further comprises:
constructing a corpus for supervised training of the deep learning model.
5. The APP violation monitoring method based on OCR and transfer learning of claim 4, wherein the corpus construction process comprises:
acquiring a plurality of keywords and matching the keywords;
and constructing a corpus based on the keywords, and manually labeling the corpus to produce the training corpus.
6. The APP violation monitoring method based on OCR and transfer learning according to claim 1, wherein counting the scores of different APPs to obtain the violation score of the APP specifically comprises:
the APP violation score is given by a weighted average:

$$\bar{x} = \frac{x_1 f_1 + x_2 f_2 + \cdots + x_k f_k}{n}$$

where $\bar{x}$ denotes the weighted average, $f_1$ to $f_k$ are the configured weights of the violation items of each dimension, $x_1$ to $x_k$ are the constants representing the quality inspection results of the violation items of each dimension, $n$ is the total number of dimensions, and different dimensions represent different violation scenes.
7. The APP violation monitoring method based on OCR and transfer learning according to claim 1, wherein after the scores of different APPs are counted according to the discrimination result output by the deep learning model to obtain the violation score of the APP, the method further comprises:
setting a timed start for all tasks.
8. The APP violation monitoring method based on OCR and transfer learning of claim 7, wherein the tasks comprise: an APP crawling timed task, an APP screenshot timed task, and a violation monitoring timed task.
9. An intelligent terminal, characterized in that the intelligent terminal comprises: a memory, a processor, and an APP violation monitoring program based on OCR and transfer learning that is stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the APP violation monitoring method based on OCR and transfer learning according to any one of claims 1-8.
10. A storage medium storing an OCR and transfer learning based APP violation monitoring program which, when executed by a processor, implements the steps of the OCR and transfer learning based APP violation monitoring method according to any one of claims 1-8.
CN202010862575.XA 2020-08-25 2020-08-25 APP violation monitoring method based on OCR and transfer learning Active CN112101335B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010862575.XA CN112101335B (en) 2020-08-25 2020-08-25 APP violation monitoring method based on OCR and transfer learning
PCT/CN2020/120724 WO2022041406A1 (en) 2020-08-25 2020-10-14 Ocr and transfer learning-based app violation monitoring method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010862575.XA CN112101335B (en) 2020-08-25 2020-08-25 APP violation monitoring method based on OCR and transfer learning

Publications (2)

Publication Number Publication Date
CN112101335A true CN112101335A (en) 2020-12-18
CN112101335B CN112101335B (en) 2022-04-15

Family

ID=73753383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010862575.XA Active CN112101335B (en) 2020-08-25 2020-08-25 APP violation monitoring method based on OCR and transfer learning

Country Status (2)

Country Link
CN (1) CN112101335B (en)
WO (1) WO2022041406A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948830A (en) * 2021-03-12 2021-06-11 哈尔滨安天科技集团股份有限公司 File risk identification method and device
CN113076339A (en) * 2021-03-18 2021-07-06 北京沃东天骏信息技术有限公司 Data caching method, device, equipment and storage medium
CN113221890A (en) * 2021-05-25 2021-08-06 深圳市瑞驰信息技术有限公司 OCR-based cloud mobile phone text content supervision method, system and system
CN113326376A (en) * 2021-05-28 2021-08-31 南京大学 Code review opinion quality evaluation system and method based on machine learning
CN113568823A (en) * 2021-09-27 2021-10-29 深圳市永达电子信息股份有限公司 Employee operation behavior monitoring method, system and computer readable medium
CN113888760A (en) * 2021-09-29 2022-01-04 平安银行股份有限公司 Violation information monitoring method, device, equipment and medium based on software application

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912867B (en) * 2023-09-13 2023-12-29 之江实验室 Teaching material structure extraction method and device combining automatic labeling and recall completion
CN117541269A (en) * 2023-12-08 2024-02-09 北京中数睿智科技有限公司 Third party module data real-time monitoring method and system based on intelligent large model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108205674A (en) * 2017-12-22 2018-06-26 广州爱美互动网络科技有限公司 Content identification method, electronic equipment, storage medium and the system of social APP
CN109492143A (en) * 2018-09-21 2019-03-19 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN110210542A (en) * 2019-05-24 2019-09-06 厦门美柚信息科技有限公司 Picture character identification model training method, device and character identification system
CN110210484A (en) * 2019-04-19 2019-09-06 成都三零凯天通信实业有限公司 System and method for detecting and identifying poor text of view image based on deep learning
CN110275958A (en) * 2019-06-26 2019-09-24 北京市博汇科技股份有限公司 Site information recognition methods, device and electronic equipment
CN110837615A (en) * 2019-11-05 2020-02-25 福建省趋普物联科技有限公司 Artificial intelligent checking system for advertisement content information filtering
CN111400132A (en) * 2020-03-09 2020-07-10 北京版信通技术有限公司 Automatic monitoring method and system for on-shelf APP

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190103088A (en) * 2019-08-15 2019-09-04 엘지전자 주식회사 Method and apparatus for recognizing a business card using federated learning

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948830A (en) * 2021-03-12 2021-06-11 哈尔滨安天科技集团股份有限公司 File risk identification method and device
CN112948830B (en) * 2021-03-12 2023-11-10 安天科技集团股份有限公司 File risk identification method and device
CN113076339A (en) * 2021-03-18 2021-07-06 北京沃东天骏信息技术有限公司 Data caching method, device, equipment and storage medium
CN113221890A (en) * 2021-05-25 2021-08-06 深圳市瑞驰信息技术有限公司 OCR-based cloud mobile phone text content supervision method, system and system
CN113326376A (en) * 2021-05-28 2021-08-31 南京大学 Code review opinion quality evaluation system and method based on machine learning
CN113568823A (en) * 2021-09-27 2021-10-29 深圳市永达电子信息股份有限公司 Employee operation behavior monitoring method, system and computer readable medium
CN113888760A (en) * 2021-09-29 2022-01-04 平安银行股份有限公司 Violation information monitoring method, device, equipment and medium based on software application
CN113888760B (en) * 2021-09-29 2024-04-23 平安银行股份有限公司 Method, device, equipment and medium for monitoring violation information based on software application

Also Published As

Publication number Publication date
WO2022041406A1 (en) 2022-03-03
CN112101335B (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN112101335B (en) APP violation monitoring method based on OCR and transfer learning
AU2019355933B2 (en) Software testing
EP3522078A1 (en) Explainable artificial intelligence
EP3640847A1 (en) Systems and methods for identifying form fields
US20220004878A1 (en) Systems and methods for synthetic document and data generation
WO2018235252A1 (en) Analysis device, log analysis method, and recording medium
US20200125595A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
US9792370B2 (en) Identifying equivalent links on a page
CN111881398B (en) Page type determining method, device and equipment and computer storage medium
CN111552800A (en) Abstract generation method and device, electronic equipment and medium
WO2020101479A1 (en) System and method to detect and generate relevant content from uniform resource locator (url)
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN113918794A (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
CN115937887A (en) Method and device for extracting document structured information, electronic equipment and storage medium
CN114780891A (en) Website key resource analysis method and device based on page rendering contribution degree
CN111428724B (en) Examination paper handwriting statistics method, device and storage medium
CN114518993A (en) System performance monitoring method, device, equipment and medium based on business characteristics
US20230282013A1 (en) Automated key-value pair extraction
CN114328936B (en) Method and device for establishing classification model
CN117077678B (en) Sensitive word recognition method, device, equipment and medium
CN111598159B (en) Training method, device, equipment and storage medium of machine learning model
CN116049213A (en) Keyword retrieval method of form document and electronic equipment
CN117215947A (en) Page white screen detection method and device, computer equipment and storage medium
CN117746446A (en) Identity acquisition method, system, equipment and medium based on bill
Tan Computing jobs monitoring dashboard in Malaysia

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant