CN116991978B - CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium - Google Patents

CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium

Info

Publication number
CN116991978B
Authority
CN
China
Prior art keywords
static
text
fragments
words
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311244461.9A
Other languages
Chinese (zh)
Other versions
CN116991978A (en)
Inventor
郭伟
王闽东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Jinyuan Biaoju Technology Co ltd
Original Assignee
Hangzhou Jinyuan Biaoju Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Jinyuan Biaoju Technology Co ltd
Priority to CN202311244461.9A
Publication of CN116991978A
Application granted
Publication of CN116991978B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/174 Form filling; Merging

Abstract

The invention provides a CMS fragment feature extraction method, a system, an electronic device and a storage medium, which relate to the technical field of feature extraction. The method comprises the following steps: classifying fragments of the CMS into static fragments, dynamic fragments, and code fragments; extracting the static fragments by using a feature extraction method, and establishing a static feature table; extracting the dynamic fragments by using a data extraction method, and recording the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments; and acquiring the names of the code in the code fragments, and recording those names as the fragment features of the code fragments. The invention aims to solve the problem in the prior art that, for lack of an effective feature extraction method for each kind of fragment in a CMS, the fragments cannot be analyzed accurately when CMS fragments are analyzed.

Description

CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of feature extraction technologies, and in particular, to a method, a system, an electronic device, and a storage medium for extracting CMS fragment features.
Background
A CMS, i.e. a content management system, is a software system located between the web front end and the back-end office systems or processes. Content creators, editors and publishers use it to submit, modify, approve and publish content, where "content" may include files, tables, pictures, data in a database, even video, and any other information to be published to Internet, intranet and extranet websites.
Existing improvements to CMS feature extraction usually focus on identifying a single feature of the CMS in detail. For example, Chinese patent publication No. CN110489701A discloses a method and an apparatus for extracting CMS identification features and a CMS identification method.
Disclosure of Invention
In view of the shortcomings of the prior art, the present invention aims to provide a CMS fragment feature extraction method, a system, an electronic device and a storage medium, so as to solve the problem that the prior art lacks an effective feature extraction method for each kind of fragment in a CMS, so that the fragments cannot be analyzed accurately when CMS fragments are analyzed.
In order to achieve the above object, in a first aspect, the present invention provides a CMS fragment feature extraction method, which includes:
classifying fragments of the CMS into static fragments, dynamic fragments, and code fragments;
extracting the static fragments by using a feature extraction method, establishing a static feature table, and placing the abstract features and feature word libraries into the static feature table in one-to-one correspondence based on the extraction results of the feature extraction method;
extracting the dynamic fragments by using a data extraction method, and recording the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments;
acquiring the names of the code in the code fragments, and recording those names as the fragment features of the code fragments.
Further, classifying the fragments of the CMS includes:
based on the different positions of the acquired CMS fragments, recording the fragments used for adding article titles and abstracts as static fragments;
recording the fragments used for pushing data and calling data as dynamic fragments;
recording the fragments in which data is added by editing HTML code as code fragments.
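The classification rule above can be sketched as a lookup from a fragment's purpose to one of the three kinds. This is a minimal illustration only: the purpose labels and the names FragmentKind, PURPOSE_TO_KIND and classify_fragment are hypothetical, since the text states what each kind of fragment is used for but not how that use is detected.

```python
from enum import Enum


class FragmentKind(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"
    CODE = "code"


# Hypothetical purpose labels (assumption): the source only describes the
# purpose of each fragment kind, not a concrete labelling scheme.
PURPOSE_TO_KIND = {
    "article_title": FragmentKind.STATIC,
    "article_abstract": FragmentKind.STATIC,
    "push_data": FragmentKind.DYNAMIC,
    "call_data": FragmentKind.DYNAMIC,
    "html_code_edit": FragmentKind.CODE,
}


def classify_fragment(purpose: str) -> FragmentKind:
    """Map a fragment's purpose label to one of the three fragment kinds."""
    try:
        return PURPOSE_TO_KIND[purpose]
    except KeyError:
        raise ValueError(f"unknown fragment purpose: {purpose!r}") from None
```

Raising on an unknown purpose is a design choice of this sketch; the text does not say how fragments outside the three categories are handled.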
Further, extracting the static fragments using the feature extraction method includes:
acquiring the texts corresponding to the static fragments, and recording them as static text 1 to static text N according to the static fragment to which each text corresponds;
recording the static texts that include an abstract, among static text 1 to static text N, as extracted static text 1 to extracted static text N1;
recording the static texts that do not include an abstract, among static text 1 to static text N, as non-extracted static text 1 to non-extracted static text N2, wherein N, N1 and N2 are positive integers and N is the sum of N1 and N2;
recording the abstracts of extracted static text 1 to extracted static text N1 as the abstract features of extracted static text 1 to extracted static text N1;
applying the feature extraction method to non-extracted static text 1 to non-extracted static text N2.
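As a rough sketch of this split, the static texts can be partitioned by whether they already carry an abstract. The StaticText record and its field names are assumptions made for illustration; the text does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StaticText:
    # Hypothetical record layout (assumption, not from the source).
    fragment_id: str
    body: str
    abstract: Optional[str] = None  # None when the fragment has no abstract


def partition_static_texts(texts):
    """Split static texts into those with an abstract and those without."""
    extracted = [t for t in texts if t.abstract]
    not_extracted = [t for t in texts if not t.abstract]
    return extracted, not_extracted


texts = [
    StaticText("f1", "body one", abstract="short summary"),
    StaticText("f2", "body two"),
    StaticText("f3", "body three", abstract="another summary"),
]
extracted, not_extracted = partition_static_texts(texts)

# For texts that already have an abstract, the abstract is the feature.
abstract_features = {t.fragment_id: t.abstract for t in extracted}
```

Here the two lists play the roles of the N1 texts with an abstract and the N2 texts without one, so their lengths always add up to N.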
Further, the feature extraction method includes:
extracting any one of the non-extracted static text 1 to the non-extracted static text N2;
applying Chinese word segmentation to the non-extracted static text, and recording the resulting Chinese words as Chinese word 1 to Chinese word Q;
acquiring a first standard number of common real words by using a crawler technology, recording them as a real word library, and recording the real words in the real word library as standard real words;
comparing Chinese word 1 to Chinese word Q in sequence against the real word library, and when a Chinese word is completely consistent with any standard real word in the real word library, recording the Chinese word as a text real word, wherein completely consistent means that the characters are the same and the characters are in the same positions;
when a Chinese word is not completely consistent with any standard real word in the real word library, recording the Chinese word as a text virtual word.
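The comparison step amounts to exact-match membership in the real word library. The sketch below assumes the text has already been segmented into a list of words and that the library is a small hand-written list; in the described method the library would be gathered by a crawler. The function name label_words is hypothetical.

```python
def label_words(words, real_word_library):
    """Label each segmented word as a text real word or a text virtual word.

    A word counts as a text real word only if it exactly equals a library
    entry (same characters in the same positions); every other word is a
    text virtual word.
    """
    library = set(real_word_library)
    real, virtual = [], []
    for word in words:
        (real if word in library else virtual).append(word)
    return real, virtual


# Illustrative data only (assumption): a tiny stand-in for the crawled library.
library = ["系统", "内容", "发布", "管理"]
words = ["内容", "管理", "系统", "的", "内容", "发布"]
real_words, virtual_words = label_words(words, library)
```

String equality in Python already checks both the characters and their positions, so a set lookup matches the "completely consistent" condition.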
Further, the feature extraction method further includes:
acquiring the text real words among Chinese word 1 to Chinese word Q, and recording them as text real word 1 to text real word W;
acquiring a second standard number of common nouns and common verbs by using a crawler technology, recording them as a name verb word stock, and recording the Chinese words in the name verb word stock as name feature words;
establishing a table of Y1 rows and Y2 columns, recorded as a text real word table, wherein the cells of the top row other than the first are filled in sequence with the distinct text real words, the second cell of the leftmost column is filled with the label "times", and each count in the table is the number of times the text real word in the same column occurs among text real word 1 to text real word W;
acquiring the text real words whose number of occurrences is less than or equal to a third standard number, and recording them as few-frequency real words;
comparing the text real words in the text real word table in sequence with the name feature words in the name verb word stock, and when a text real word is completely consistent with any name feature word in the name verb word stock, recording the text real word as a first-level feature word, wherein completely consistent means that the characters are the same and the characters are in the same positions;
recording the text real words that are recorded as both few-frequency real words and first-level feature words as standard feature words.
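A minimal sketch of this selection, assuming the text real words are available as a list and the common nouns and verbs as a second list. A Counter stands in for the two-row word/count table; the function and parameter names are hypothetical.

```python
from collections import Counter


def standard_feature_words(text_real_words, noun_verb_words, max_count):
    """Return the word counts and the standard feature words.

    A standard feature word occurs at most max_count times in the text
    and also appears in the list of common nouns and verbs.
    """
    counts = Counter(text_real_words)  # word -> number of occurrences
    noun_verb = set(noun_verb_words)
    few_frequency = {w for w, c in counts.items() if c <= max_count}
    first_level = {w for w in counts if w in noun_verb}
    return counts, few_frequency & first_level


# Illustrative data only (assumption); max_count plays the role of the
# third standard number.
counts, standard = standard_feature_words(
    ["内容", "管理", "系统", "内容", "发布", "内容"],
    ["系统", "发布", "内容"],
    max_count=2,
)
```

In this example "内容" occurs three times and is dropped as too frequent, while "管理" is dropped because it is not in the noun and verb list.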
Further, the feature extraction method further includes:
acquiring the first section and the last section of the non-extracted static text, applying Chinese word segmentation to both, recording the Chinese words of the first section as first-section Chinese words, and recording the Chinese words of the last section as tail-section Chinese words;
acquiring the number of times the standard feature words occur in the first section, and recording it as the first-section count;
acquiring the number of times the standard feature words occur in the tail section, and recording it as the tail-section count;
when the first-section count is greater than or equal to the tail-section count and the first-section count is greater than a fourth standard number, recording the first section of the non-extracted static text as the abstract feature of the non-extracted static text;
when the first-section count is smaller than the tail-section count and the tail-section count is greater than the fourth standard number, recording the tail section of the non-extracted static text as the abstract feature of the non-extracted static text;
when both the first-section count and the tail-section count are smaller than the fourth standard number, recording the standard feature words as the feature word library of the non-extracted static text.
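The three-way rule can be sketched as a small decision function over the segmented first and last sections. One assumption is labelled in the code: the text does not say what happens when a count exactly equals the fourth standard number, so this sketch lets every remaining case fall back to the feature word library. The function name choose_feature is hypothetical.

```python
def choose_feature(first_words, tail_words, feature_words, threshold):
    """Decide which feature represents a text: "first", "tail" or "keywords".

    first_words / tail_words: segmented words of the first and last sections.
    feature_words: the standard feature words of the text.
    threshold: plays the role of the fourth standard number.
    """
    features = set(feature_words)
    first_count = sum(1 for w in first_words if w in features)
    tail_count = sum(1 for w in tail_words if w in features)
    if first_count >= tail_count and first_count > threshold:
        return "first"  # the first section becomes the abstract feature
    if first_count < tail_count and tail_count > threshold:
        return "tail"  # the last section becomes the abstract feature
    # Assumption: all remaining cases, including counts equal to the
    # threshold, use the standard feature words as the feature word library.
    return "keywords"
```

For instance, with a threshold of 1, a first section containing two standard feature words and a last section containing none yields "first".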
Further, placing the abstract features and the feature word libraries into the static feature table in one-to-one correspondence based on the extraction results of the feature extraction method includes:
after the feature extraction method has been applied in sequence to non-extracted static text 1 to non-extracted static text N2, recording non-extracted static text 1 to non-extracted static text N2 as extracted static text N1+1 to extracted static text N, and establishing a table of T1 rows and T2 columns, recorded as the static feature table;
filling extracted static text 1 to extracted static text N in sequence into the top row of the static feature table;
filling the corresponding feature word library or abstract feature into the column of each extracted static text in the static feature table.
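One way to picture the static feature table is as two parallel rows, with the text labels on top and each text's feature directly underneath. The sketch below assumes the features have already been computed and are passed in as a dictionary; the function name and the sample values are illustrative.

```python
def build_static_feature_table(features_by_text):
    """Build a two-row table: text labels on top, their features below.

    features_by_text maps each extracted static text label to either its
    abstract feature (a string) or its feature word library (a list).
    """
    labels = list(features_by_text)  # dicts keep insertion order
    return [labels, [features_by_text[label] for label in labels]]


# Illustrative values only (assumption).
table = build_static_feature_table({
    "extracted static text 1": "abstract of text 1",
    "extracted static text 2": ["系统", "发布"],  # a feature word library
})
```

Each column then pairs one extracted static text with exactly one feature, which is the one-to-one correspondence the text describes.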
Further, the data extraction method comprises the following steps:
after the web page starts to run, capturing dynamic data from the web page with Selenium once every first standard time;
after each capture by Selenium, recording the captured dynamic data as updated dynamic data, and recording the dynamic data captured by Selenium the previous time as non-updated dynamic data;
when dynamic data in the updated dynamic data is identical to the dynamic data at the same position in the non-updated dynamic data, recording the dynamic data at that position as unchanged data.
Further, recording the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments includes:
recording the data in the updated dynamic data other than the unchanged data as changed data, and recording the changed data as the data features of the dynamic fragments;
updating the data features of the dynamic fragments after each capture of dynamic data by Selenium.
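The comparison between two successive captures can be sketched independently of the browser automation. The example below leaves out the Selenium capture itself and assumes each capture has been reduced to a dictionary that maps a position (here a field name) to its value; that representation and the name diff_snapshots are assumptions.

```python
def diff_snapshots(previous, updated):
    """Split the updated capture into unchanged data and changed data.

    Both arguments map a position to the value captured at that position.
    """
    unchanged = {k: v for k, v in updated.items()
                 if k in previous and previous[k] == v}
    changed = {k: v for k, v in updated.items() if k not in unchanged}
    return unchanged, changed


# Illustrative values only (assumption).
previous = {"purchase amount": 90, "product name": "widget"}
updated = {"purchase amount": 100, "product name": "widget"}
unchanged, data_features = diff_snapshots(previous, updated)
```

Running this after every capture, with the latest capture becoming the next "previous" one, keeps the data features of the dynamic fragment current. Positions that appear only in the updated capture are treated as changed, which is a choice of this sketch.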
In a second aspect, the present invention provides a CMS fragment feature extraction system, including a fragment classification module, a static fragment extraction module, a dynamic fragment extraction module, and a code fragment extraction module;
the fragment classification module is used for classifying fragments of the CMS into static fragments, dynamic fragments and code fragments;
the static fragment extraction module is used for extracting the static fragments by using a feature extraction method, establishing a static feature table, classifying extraction results based on extraction objects of the feature extraction method, and correspondingly placing the extraction objects and the extraction results into the static feature table;
the dynamic fragment extraction module is used for extracting the dynamic fragments by using a data extraction method, and marking the obtained data characteristics of the dynamic fragments as the fragment characteristics of the dynamic fragments;
the code fragment extraction module is used for obtaining the names of codes in the code fragments and recording the names of the codes as fragment characteristics of the code fragments.
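The four modules can be pictured as a classifier that routes each fragment to the extractor for its kind. The sketch below is illustrative only: the class name, the dictionary-based fragment and the stand-in extractor functions are assumptions, and the real extractors would be the feature extraction, data extraction and code-name steps described above.

```python
class FragmentFeatureExtractor:
    """Route each fragment to the extractor registered for its kind."""

    def __init__(self, classify, extractors):
        self._classify = classify      # fragment -> kind
        self._extractors = extractors  # kind -> extractor function

    def extract(self, fragment):
        kind = self._classify(fragment)
        return kind, self._extractors[kind](fragment)


# Stand-in functions (assumption): each simply reads a precomputed field.
extractor = FragmentFeatureExtractor(
    classify=lambda fragment: fragment["kind"],
    extractors={
        "static": lambda fragment: fragment["abstract"],
        "dynamic": lambda fragment: fragment["changed"],
        "code": lambda fragment: fragment["name"],
    },
)
result = extractor.extract({"kind": "code", "name": "header_banner"})
```

Keeping classification separate from extraction mirrors the module split in the text and lets each extractor be replaced on its own.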
In a third aspect, the present invention provides an electronic device comprising a processor and a memory storing computer-readable instructions which, when executed by the processor, perform the steps of the above method.
In a fourth aspect, the present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The beneficial effects of the invention are as follows: the invention classifies the fragments of the CMS into static fragments, dynamic fragments and code fragments. Classifying the fragments allows features to be extracted in a targeted way according to the different properties of the fragments, so that the extracted fragment features better reflect those properties;
the invention also extracts the static fragments by using a feature extraction method, establishes a static feature table, and places the abstract features and feature word libraries into the static feature table in one-to-one correspondence based on the extraction results of the feature extraction method; it extracts the dynamic fragments by using a data extraction method, and records the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments. In this way, features are extracted from the static fragments based on their text characteristics and from the dynamic fragments based on their real-time changes, which improves the efficiency of CMS fragment feature extraction.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic block diagram of a system of the present invention;
FIG. 2 is a flow chart of the steps of the method of the present invention;
FIG. 3 is a flow chart of the static fragment extraction strategy of the present invention;
FIG. 4 is a connection block diagram of the electronic device of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
Referring to fig. 1, the present invention provides a CMS fragment feature extraction system, which includes a fragment classification module, a static fragment extraction module, a dynamic fragment extraction module, and a code fragment extraction module;
the fragment classification module is used for classifying fragments of the CMS into static fragments, dynamic fragments and code fragments;
the fragment classification module is configured with a fragment classification strategy comprising:
based on the different positions of the acquired CMS fragments, recording the fragments used for adding article titles and abstracts as static fragments;
recording the fragments used for pushing data and calling data as dynamic fragments;
recording the fragments in which data is added by editing HTML code as code fragments;
in a specific implementation, a static fragment contains content such as information titles, links and introductions that are added manually in the CMS, with the information content retrieved by information ID; a dynamic fragment covers information pushed from the information management page and information called by manually specified information IDs; a code fragment contains call content that is edited manually, either visually or in code mode, and supports automatic backup and restoring from backup;
the static fragment extraction module is used for extracting the static fragments by using a feature extraction method, establishing a static feature table, classifying extraction results based on extraction objects of the feature extraction method, and correspondingly placing the extraction objects and the extraction results into the static feature table;
the static fragment extraction module is configured with a static fragment extraction strategy that includes:
acquiring the texts corresponding to the static fragments, and recording them as static text 1 to static text N according to the static fragment to which each text corresponds;
referring to FIG. 3, recording the static texts that include an abstract, among static text 1 to static text N, as extracted static text 1 to extracted static text N1;
recording the static texts from the static text 1 to the static text N without the abstract as non-extracted static text 1 to non-extracted static text N2, wherein N, N1 and N2 are positive integers and N is the sum of N1 and N2;
the abstracts of the extracted static texts 1 to N1 are recorded as abstract features of the extracted static texts 1 to N1;
extracting the non-extracted static text 1 to the non-extracted static text N2 by using a feature extraction method;
the feature extraction method comprises the following steps:
extracting any one of the non-extracted static text 1 to the non-extracted static text N2;
applying Chinese word segmentation to the non-extracted static text, and recording the resulting Chinese words as Chinese word 1 to Chinese word Q;
acquiring a first standard number of common real words by using a crawler technology, recording them as a real word library, and recording the real words in the real word library as standard real words;
in a specific implementation, the common real words with the highest click rates are acquired from the Baidu library on the Internet by using a crawler technology, every acquired real word is distinct, and the first standard number is set to 300;
comparing Chinese word 1 to Chinese word Q in sequence against the real word library, and when a Chinese word is completely consistent with any standard real word in the real word library, recording the Chinese word as a text real word, wherein completely consistent means that the characters are the same and the characters are in the same positions;
when a Chinese word is not completely consistent with any standard real word in the real word library, recording the Chinese word as a text virtual word;
in a specific implementation, the text virtual words include both virtual words and the uncommon real words that the method excludes;
the feature extraction method further comprises the following steps:
acquiring a plurality of text real words from a Chinese word 1 to a Chinese word Q, and marking the text real words as text real words 1 to W;
acquiring a second standard number of common nouns and common verbs by using a crawler technology, recording them as a name verb word stock, and recording the Chinese words in the name verb word stock as name feature words;
in a specific implementation, the common nouns and common verbs with the highest click rates are acquired from the Baidu library on the Internet by using a crawler technology, every acquired noun and verb is distinct, and the second standard number is set to 100;
referring to Table 1, Y1 is the number of rows and Y2 is the number of columns in Table 1, with Y1 equal to 2 and Y2 equal to W+1; a table of Y1 rows × Y2 columns is established and recorded as the text real word table, wherein the cells of the top row other than the first are filled in sequence with the distinct text real words, the second cell of the leftmost column is filled with the label "times", and each count in the table is the number of times the text real word in the same column occurs among text real word 1 to text real word W;
Table 1: text real word table
acquiring the text real words whose number of occurrences is less than or equal to the third standard number, and recording them as few-frequency real words;
in a specific implementation, the third standard number is set to 10;
comparing the text real words in the text real word table in sequence with the name feature words in the name verb word stock, and when a text real word is completely consistent with any name feature word in the name verb word stock, recording the text real word as a first-level feature word;
recording the text real words that are recorded as both few-frequency real words and first-level feature words as standard feature words;
in a specific implementation, real words that occur less often are more distinctive, and when such real words appear in the first section or the tail section, that section is representative of the text; therefore the text real words that are recorded as both few-frequency real words and first-level feature words are recorded as standard feature words and are used to evaluate the first section or the tail section;
the feature extraction method further comprises the following steps:
acquiring the first section and the last section of the non-extracted static text, applying Chinese word segmentation to both, recording the Chinese words of the first section as first-section Chinese words, and recording the Chinese words of the last section as tail-section Chinese words;
acquiring the number of times the standard feature words occur in the first section, and recording it as the first-section count;
acquiring the number of times the standard feature words occur in the tail section, and recording it as the tail-section count;
when the first-section count is greater than or equal to the tail-section count and the first-section count is greater than the fourth standard number, recording the first section of the non-extracted static text as the abstract feature of the non-extracted static text;
in a specific implementation, the fourth standard number is set to 5;
when the first-section count is smaller than the tail-section count and the tail-section count is greater than the fourth standard number, recording the tail section of the non-extracted static text as the abstract feature of the non-extracted static text;
when both the first-section count and the tail-section count are smaller than the fourth standard number, recording the standard feature words as the feature word library of the non-extracted static text;
the static fragment extraction module is further configured with a feature table establishment strategy, the feature table establishment strategy comprising:
referring to Table 2, T1 is the number of rows and T2 is the number of columns in Table 2, with T1 equal to 2 and T2 equal to N; after the feature extraction method has been applied in sequence to non-extracted static text 1 to non-extracted static text N2, non-extracted static text 1 to non-extracted static text N2 are recorded as extracted static text N1+1 to extracted static text N, and a table of T1 rows × T2 columns is established and recorded as the static feature table;
Table 2: static feature table
filling extracted static text 1 to extracted static text N in sequence into the top row of the static feature table;
filling the corresponding feature word library or abstract feature into the column of each extracted static text in the static feature table;
the dynamic fragment extraction module is used for extracting the dynamic fragments by using a data extraction method, and marking the obtained data characteristics of the dynamic fragments as the fragment characteristics of the dynamic fragments;
the dynamic fragment extraction module is configured with a data extraction method, and the data extraction method is used for extracting the characteristics of the dynamic fragments, and comprises the following steps:
after the web page starts to run, capturing dynamic data from the web page with Selenium once every first standard time;
after each capture by Selenium, recording the captured dynamic data as updated dynamic data, and recording the dynamic data captured by Selenium the previous time as non-updated dynamic data;
when dynamic data in the updated dynamic data is identical to the dynamic data at the same position in the non-updated dynamic data, recording the dynamic data at that position as unchanged data;
in a specific implementation, Selenium is used to simulate a user's operations in a browser; for example, on an e-commerce website, Selenium can simulate a user searching for and purchasing goods, while the operation data generated by Selenium is transmitted to the website back end in real time;
in a specific implementation, if one capture by Selenium reads a purchase amount of 100 and the previous capture read a purchase amount of 90, the purchase amount is recorded as changed data, and the purchase amount together with the value 100 is recorded as the data feature of the dynamic fragment;
recording the data in the updated dynamic data other than the unchanged data as changed data, and recording the changed data as the data features of the dynamic fragments;
updating the data features of the dynamic fragments after each capture of dynamic data by Selenium;
the code fragment extraction module is used for obtaining the names of codes in the code fragments and recording the names of the codes as fragment characteristics of the code fragments.
Example 2
Referring to FIG. 2, in which steps S1 to S4 correspond to the description below, the present invention provides a CMS fragment feature extraction method, which includes:
step S1, classifying fragments of the CMS into static fragments, dynamic fragments and code fragments; step S1 comprises the following steps:
step S101, based on the different positions of the acquired CMS fragments, recording the fragments used for adding article titles and abstracts as static fragments;
step S102, recording the fragments used for pushing data and calling data as dynamic fragments;
step S103, recording the fragments in which data is added by editing HTML code as code fragments.
Step S2, extracting the static fragments by using a feature extraction method, establishing a static feature table, and placing the abstract features and feature word libraries into the static feature table in one-to-one correspondence based on the extraction results of the feature extraction method; step S2 comprises the following steps:
step S201, obtaining texts corresponding to the static fragments, and recording the texts as a static text 1 to a static text N based on the difference of the static fragments corresponding to each text;
step S202, the static texts from the static text 1 to the static text N containing the abstract are recorded as extracted static text 1 to extracted static text N1;
recording the static texts from the static text 1 to the static text N without the abstract as non-extracted static text 1 to non-extracted static text N2, wherein N, N1 and N2 are positive integers and N is the sum of N1 and N2;
step S203, the abstracts of the extracted static texts 1 to N1 are recorded as abstract features of the extracted static texts 1 to N1;
extracting the non-extracted static text 1 to the non-extracted static text N2 by using a feature extraction method;
the feature extraction method comprises the following steps:
step V1, selecting any one of non-extracted static text 1 to non-extracted static text N2;
performing Chinese word segmentation on the non-extracted static text, and marking the resulting Chinese words as Chinese word 1 to Chinese word Q;
obtaining a first standard number of common real words (content words) by using a crawler technology, recording them as a real word library, and marking the real words in the real word library as standard real words;
step V2, comparing Chinese word 1 to Chinese word Q in sequence against the real word library, and when a Chinese word is completely consistent with any standard real word in the real word library, marking the Chinese word as a text real word, wherein completely consistent means that the characters are the same and the positions of the characters are the same;
when a Chinese word is not completely consistent with any standard real word in the real word library, marking the Chinese word as a text virtual word (function word);
step V3, obtaining the text real words among Chinese word 1 to Chinese word Q, and marking them as text real word 1 to text real word W;
obtaining a second standard number of common nouns and common verbs by using a crawler technology, recording them as a noun-verb word library, and marking the Chinese words in the noun-verb word library as name feature words;
establishing a table of Y1 rows and Y2 columns, recorded as a text real word table, wherein the cells of the top row other than the first are filled in sequence with the distinct text real words, the second cell of the leftmost column is filled with the label "times", and each count in the text real word table is the number of times the text real word in that column occurs among text real word 1 to text real word W;
step V4, obtaining the text real words whose count is less than or equal to a third standard number, and marking them as low-frequency real words;
comparing the text real words in the text real word table in sequence with the name feature words in the noun-verb word library, and when a text real word is completely consistent with any name feature word in the noun-verb word library, marking the text real word as a first-level feature word, wherein completely consistent means that the characters are the same and the positions of the characters are the same;
marking the text real words that are marked as both low-frequency real words and first-level feature words as standard feature words;
step V5, obtaining the first segment and the tail segment of the non-extracted static text, performing Chinese word segmentation on each, marking the Chinese words of the first segment as first-segment Chinese words, and marking the Chinese words of the tail segment as tail-segment Chinese words;
obtaining the number of times the standard feature words occur in the first segment, and recording it as the first-segment count;
obtaining the number of times the standard feature words occur in the tail segment, and recording it as the tail-segment count;
step V6, when the first-segment count is greater than or equal to the tail-segment count and the first-segment count is greater than a fourth standard number, marking the first segment of the non-extracted static text as the abstract feature of the non-extracted static text;
when the first-segment count is smaller than the tail-segment count and the tail-segment count is greater than the fourth standard number, marking the tail segment of the non-extracted static text as the abstract feature of the non-extracted static text;
and when both the first-segment count and the tail-segment count are smaller than the fourth standard number, marking the standard feature words as the feature word library of the non-extracted static text.
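Steps V2 to V6 can be sketched in a few lines once the text has been segmented. The sketch below assumes the segmentation of step V1 has already been done elsewhere (for example with a segmenter such as jieba) and that the two crawled word libraries are available as plain sets. The function and parameter names are hypothetical, with `max_count` standing for the third standard number and `min_hits` for the fourth. The text does not say what happens when a count equals the fourth standard number without exceeding it, so the sketch reports that case as `"undetermined"` rather than guessing.

```python
from collections import Counter


def standard_feature_words(tokens, real_words, noun_verb_words, max_count):
    """Steps V2-V4: select the standard feature words of one text."""
    # V2: keep only tokens that exactly match an entry of the real word library.
    text_real_words = [t for t in tokens if t in real_words]
    # V3: the "text real word table" reduces to a word -> count mapping.
    counts = Counter(text_real_words)
    # V4: low-frequency real words that also appear in the noun-verb library.
    return {w for w, c in counts.items()
            if c <= max_count and w in noun_verb_words}


def extract_feature(first_tokens, tail_tokens, all_tokens,
                    real_words, noun_verb_words, max_count, min_hits):
    """Steps V5-V6: choose an abstract segment or fall back to feature words."""
    features = standard_feature_words(
        all_tokens, real_words, noun_verb_words, max_count)
    # V5: occurrences of standard feature words in the first and tail segments.
    first_count = sum(1 for t in first_tokens if t in features)
    tail_count = sum(1 for t in tail_tokens if t in features)
    # V6: the three cases described in the text, in order.
    if first_count >= tail_count and first_count > min_hits:
        return "abstract", first_tokens
    if first_count < tail_count and tail_count > min_hits:
        return "abstract", tail_tokens
    if first_count < min_hits and tail_count < min_hits:
        return "feature_words", features
    # Not covered by the text (e.g. a count equal to min_hits): an assumption.
    return "undetermined", None
```

Using sets for the two libraries makes the exact-match comparison of steps V2 and V4 a constant-time membership test instead of a sequential scan; the result is the same as comparing word by word.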
Step S2 further includes:
step S204, after the feature extraction method has been applied in sequence to non-extracted static text 1 to non-extracted static text N2, re-marking them as extracted static text N1+1 to extracted static text N, and establishing a table of T1 rows and T2 columns, recorded as a static feature table;
step S205, filling extracted static text 1 to extracted static text N in sequence into the top row of the static feature table;
and filling the corresponding abstract feature or feature word library in sequence into the column of each extracted static text in the static feature table.
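Steps S202 to S205 amount to partitioning the static texts by whether they already carry an abstract, running the feature extraction method on the rest, and laying the results out column by column. A minimal sketch follows, assuming each static text is a dictionary with a `body` key and an optional `abstract` key and that `extract` is any callable implementing the feature extraction method; these names and the two-row list layout are illustrative assumptions, not the patent's data format.

```python
def build_static_feature_table(static_texts, extract):
    """Return [header_row, feature_row] for the static feature table."""
    # S202: texts that already contain an abstract vs. those that do not.
    with_abstract = [t for t in static_texts if t.get("abstract")]
    without_abstract = [t for t in static_texts if not t.get("abstract")]

    # S204: after extraction, the remaining texts are numbered N1+1 .. N.
    ordered = with_abstract + without_abstract
    header = [f"extracted static text {i}" for i in range(1, len(ordered) + 1)]

    # S203 / S205: each column holds either the abstract feature or the
    # result of the feature extraction method for that text.
    features = [t["abstract"] for t in with_abstract]
    features += [extract(t["body"]) for t in without_abstract]
    return [header, features]
```

For example, given three texts of which the first and third have abstracts, the header row has three entries and the feature row holds the two abstracts followed by the extraction result for the second text.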
Step S3, extracting data from the dynamic fragments by using a data extraction method, and marking the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments; the data extraction method comprises the following steps:
step X1, after a webpage starts to run, capturing dynamic data from the webpage with Selenium once every first standard time interval;
step X2, after each capture by Selenium, recording the newly captured dynamic data as updated dynamic data, and recording the dynamic data captured by Selenium the previous time as non-updated dynamic data;
step X3, when dynamic data in the updated dynamic data is identical to the dynamic data at the same position in the non-updated dynamic data, recording the dynamic data at that position as unchanged data;
step S3 further comprises:
recording the data in the updated dynamic data other than the unchanged data as changed data, and recording the changed data as the data features of the dynamic fragments;
and updating the data features of the dynamic fragments after each capture of dynamic data by Selenium.
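The comparison in steps X2 and X3 is a position-by-position diff between two consecutive captures. The sketch below leaves the Selenium capture itself out and models each capture as a dictionary from a position key (for instance a CSS selector, which is an assumption) to the captured value. The names `changed_data` and `DynamicFragmentTracker` are hypothetical, and treating everything in the very first capture as changed is also an assumption, since the text does not address the case where no previous capture exists.

```python
_MISSING = object()  # sentinel for positions absent from the previous capture


def changed_data(previous, updated):
    """Return the entries of `updated` that differ from `previous` (step X3)."""
    return {pos: val for pos, val in updated.items()
            if previous.get(pos, _MISSING) != val}


class DynamicFragmentTracker:
    """Keeps the data features of a dynamic fragment current across captures."""

    def __init__(self):
        self.previous = {}   # non-updated dynamic data (last capture)
        self.features = {}   # changed data from the latest comparison

    def on_capture(self, snapshot):
        # X2: the new snapshot is the updated data; the stored one is non-updated.
        self.features = changed_data(self.previous, snapshot)
        self.previous = dict(snapshot)
        return self.features
```

Calling `on_capture` once per polling interval keeps `features` equal to the changed data of the most recent capture, which matches the requirement that the data features be updated after every capture.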
Step S4, obtaining the names of the codes in the code fragments, and marking the names of the codes as the fragment features of the code fragments.
Example 3
Referring to Fig. 4, the present application provides an electronic device 50 comprising a processor 501 and a memory 502, wherein the memory 502 stores computer-readable instructions which, when executed by the processor 501, perform the steps of the method described above. In this technical solution, the processor 501 and the memory 502 are interconnected and communicate with each other through a communication bus and/or another form of connection mechanism (not shown), and the memory 502 stores a computer program executable by the processor 501; when the electronic device 50 is running, the processor 501 executes the computer program to perform the method in any of the alternative implementations of the above embodiments, so as to implement the following functions: classifying fragments of the CMS into static fragments, dynamic fragments and code fragments; extracting features from the static fragments by using a feature extraction method, establishing a static feature table, and placing the abstract features and feature word libraries into the static feature table based on the extraction results of the feature extraction method; extracting data from the dynamic fragments by using a data extraction method; and obtaining the names of the codes in the code fragments and marking the names of the codes as the fragment features of the code fragments.
Example 4
The present application provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above. In this technical solution, the computer program, when executed by the processor, performs the method in any of the alternative implementations of the above embodiments, so as to implement the following functions: classifying fragments of the CMS into static fragments, dynamic fragments and code fragments; extracting features from the static fragments by using a feature extraction method, establishing a static feature table, and placing the abstract features and feature word libraries into the static feature table based on the extraction results of the feature extraction method; extracting data from the dynamic fragments by using a data extraction method; and obtaining the names of the codes in the code fragments and marking the names of the codes as the fragment features of the code fragments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein. The storage medium may be implemented by any type of volatile or nonvolatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
The above examples are only specific embodiments of the present invention for illustrating the technical solution of the present invention, but not for limiting the scope of the present invention, and although the present invention has been described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that the present invention is not limited thereto: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (6)

1. A method for extracting CMS fragment characteristics, comprising:
classifying fragments of the CMS into static fragments, dynamic fragments, and code fragments;
extracting features from the static fragments by using a feature extraction method, establishing a static feature table, and placing abstract features and feature word libraries into the static feature table in one-to-one correspondence based on the extraction results of the feature extraction method;
extracting the dynamic fragments by using a data extraction method, and marking the obtained data characteristics of the dynamic fragments as fragment characteristics of the dynamic fragments;
acquiring the names of codes in the code fragments, and marking the names of the codes as fragment characteristics of the code fragments;
classifying the fragments of the CMS comprises:
based on the differing positions of the acquired CMS fragments, marking the fragments used for adding article titles and abstracts as static fragments;
marking the fragments used for pushing data and calling data as dynamic fragments;
marking the fragments that hold code for editing the HTML to add data as code fragments;
extracting features from the static fragments by using a feature extraction method comprises:
obtaining the texts corresponding to the static fragments, and marking the texts as static text 1 to static text N according to the static fragment to which each text corresponds;
marking those of static text 1 to static text N that contain an abstract as extracted static text 1 to extracted static text N1;
marking those of static text 1 to static text N that do not contain an abstract as non-extracted static text 1 to non-extracted static text N2, wherein N, N1 and N2 are positive integers and N is the sum of N1 and N2;
recording the abstracts of extracted static text 1 to extracted static text N1 as the abstract features of extracted static text 1 to extracted static text N1;
applying the feature extraction method to non-extracted static text 1 to non-extracted static text N2;
the feature extraction method comprises the following steps:
selecting any one of non-extracted static text 1 to non-extracted static text N2;
performing Chinese word segmentation on the non-extracted static text, and marking the resulting Chinese words as Chinese word 1 to Chinese word Q;
obtaining a first standard number of common real words (content words) by using a crawler technology, recording them as a real word library, and marking the real words in the real word library as standard real words;
comparing Chinese word 1 to Chinese word Q in sequence against the real word library, and when a Chinese word is completely consistent with any standard real word in the real word library, marking the Chinese word as a text real word, wherein completely consistent means that the characters are the same and the positions of the characters are the same;
when a Chinese word is not completely consistent with any standard real word in the real word library, marking the Chinese word as a text virtual word (function word);
the feature extraction method further comprises the following steps:
obtaining the text real words among Chinese word 1 to Chinese word Q, and marking them as text real word 1 to text real word W;
obtaining a second standard number of common nouns and common verbs by using a crawler technology, recording them as a noun-verb word library, and marking the Chinese words in the noun-verb word library as name feature words;
establishing a table of Y1 rows and Y2 columns, recorded as a text real word table, wherein the cells of the top row other than the first are filled in sequence with the distinct text real words, the second cell of the leftmost column is filled with the label "times", and each count in the text real word table is the number of times the text real word in that column occurs among text real word 1 to text real word W;
obtaining the text real words whose count is less than or equal to a third standard number, and marking them as low-frequency real words;
comparing the text real words in the text real word table in sequence with the name feature words in the noun-verb word library, and when a text real word is completely consistent with any name feature word in the noun-verb word library, marking the text real word as a first-level feature word, wherein completely consistent means that the characters are the same and the positions of the characters are the same;
marking the text real words that are marked as both low-frequency real words and first-level feature words as standard feature words;
the feature extraction method further comprises the following steps:
obtaining the first segment and the tail segment of the non-extracted static text, performing Chinese word segmentation on each, marking the Chinese words of the first segment as first-segment Chinese words, and marking the Chinese words of the tail segment as tail-segment Chinese words;
obtaining the number of times the standard feature words occur in the first segment, and recording it as the first-segment count;
obtaining the number of times the standard feature words occur in the tail segment, and recording it as the tail-segment count;
when the first-segment count is greater than or equal to the tail-segment count and the first-segment count is greater than a fourth standard number, marking the first segment of the non-extracted static text as the abstract feature of the non-extracted static text;
when the first-segment count is smaller than the tail-segment count and the tail-segment count is greater than the fourth standard number, marking the tail segment of the non-extracted static text as the abstract feature of the non-extracted static text;
when both the first-segment count and the tail-segment count are smaller than the fourth standard number, marking the standard feature words as the feature word library of the non-extracted static text;
placing the abstract features and feature word libraries into the static feature table in one-to-one correspondence based on the extraction results of the feature extraction method comprises:
after the feature extraction method has been applied in sequence to non-extracted static text 1 to non-extracted static text N2, re-marking them as extracted static text N1+1 to extracted static text N, and establishing a table of T1 rows and T2 columns, recorded as a static feature table;
filling extracted static text 1 to extracted static text N in sequence into the top row of the static feature table;
and filling the corresponding abstract feature or feature word library in sequence into the column of each extracted static text in the static feature table.
2. The method for extracting CMS fragment features of claim 1, wherein the data extraction method comprises:
after a webpage starts to run, capturing dynamic data from the webpage with Selenium once every first standard time interval;
after each capture by Selenium, recording the newly captured dynamic data as updated dynamic data, and recording the dynamic data captured by Selenium the previous time as non-updated dynamic data;
when dynamic data in the updated dynamic data is identical to the dynamic data at the same position in the non-updated dynamic data, recording the dynamic data at that position as unchanged data.
3. The method of claim 2, wherein marking the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments comprises:
recording the data in the updated dynamic data other than the unchanged data as changed data, and recording the changed data as the data features of the dynamic fragments;
and updating the data features of the dynamic fragments after each capture of dynamic data by Selenium.
4. A system adapted for the CMS fragment feature extraction method of any one of claims 1-3, comprising a fragment classification module, a static fragment extraction module, a dynamic fragment extraction module, and a code fragment extraction module;
the fragment classification module is used for classifying fragments of the CMS into static fragments, dynamic fragments and code fragments;
the static fragment extraction module is used for extracting the static fragments by using a feature extraction method, establishing a static feature table, classifying extraction results based on extraction objects of the feature extraction method, and correspondingly placing the extraction objects and the extraction results into the static feature table;
the dynamic fragment extraction module is used for extracting the dynamic fragments by using a data extraction method, and marking the obtained data characteristics of the dynamic fragments as the fragment characteristics of the dynamic fragments;
the code fragment extraction module is used for obtaining the names of codes in the code fragments and recording the names of the codes as fragment characteristics of the code fragments.
5. An electronic device comprising a processor and a memory storing computer readable instructions which, when executed by the processor, perform the steps of the method of any of claims 1-3.
6. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1-3.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311244461.9A CN116991978B (en) 2023-09-26 2023-09-26 CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116991978A CN116991978A (en) 2023-11-03
CN116991978B true CN116991978B (en) 2024-01-02

Family

ID=88521638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311244461.9A Active CN116991978B (en) 2023-09-26 2023-09-26 CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116991978B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853737A (en) * 2012-11-29 2014-06-11 怡丰联合(北京)科技有限责任公司 Hypertext markup language (HTML) content visualization compiling method and system
CN105893054A (en) * 2016-04-22 2016-08-24 乐视控股(北京)有限公司 Method and device for updating CMS (content management system) fragments
CN107659570A (en) * 2017-09-29 2018-02-02 杭州安恒信息技术有限公司 Webshell detection methods and system based on machine learning and static and dynamic analysis
CN109101600A (en) * 2018-08-01 2018-12-28 沈文策 The crawling method and device of dynamic data in a kind of webpage
CN110489701A (en) * 2019-08-19 2019-11-22 安徽三实信息技术服务有限公司 Extract the method, apparatus and CMS recognition methods of CMS identification feature
CN111695075A (en) * 2020-06-12 2020-09-22 国网浙江省电力有限公司信息通信分公司 Website CMS (content management system) identification method and security vulnerability detection method and device
CN112445997A (en) * 2020-12-15 2021-03-05 安徽三实信息技术服务有限公司 Method and device for extracting CMS multi-version identification feature rule
CN115022926A (en) * 2022-05-09 2022-09-06 北京邮电大学 Multi-objective optimization container migration method based on resource balance
CN116662327A (en) * 2023-07-28 2023-08-29 南京芯颖科技有限公司 Data fusion cleaning method for database

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7076728B2 (en) * 2000-12-22 2006-07-11 International Business Machines Corporation Method and apparatus for end-to-end content publishing system using XML with an object dependency graph
US9251180B2 (en) * 2012-05-29 2016-02-02 International Business Machines Corporation Supplementing structured information about entities with information from unstructured data sources
CN110020422B (en) * 2018-11-26 2020-08-04 阿里巴巴集团控股有限公司 Feature word determining method and device and server
US11562037B2 (en) * 2019-09-18 2023-01-24 International Business Machines Corporation Crawlability of single page applications
US11842175B2 (en) * 2021-07-19 2023-12-12 Sap Se Dynamic recommendations for resolving static code issues

Also Published As

Publication number Publication date
CN116991978A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN108595583B (en) Dynamic graph page data crawling method, device, terminal and storage medium
CN105677764B (en) Information extraction method and device
CN106354861A (en) Automatic film label indexing method and automatic indexing system
US9436768B2 (en) System and method for pushing and distributing promotion content
CN108090104B (en) Method and device for acquiring webpage information
CN105022803A (en) Method and system for extracting text content of webpage
US8290925B1 (en) Locating product references in content pages
CN111984792A (en) Website classification method and device, computer equipment and storage medium
US20220114269A1 (en) Page processing method, electronic apparatus and non-transitory computer-readable storage medium
CN111353071A (en) Label generation method and device
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN112148956A (en) Hidden net threat information mining system and method based on machine learning
CN116991978B (en) CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium
CN115391711B (en) Webpage text information extraction method, device, equipment and medium
CN110569429A (en) method, device and equipment for generating content selection model
CN114528811B (en) Article content extraction method, device, equipment and storage medium
CN113743982A (en) Advertisement putting scheme recommendation method and device, computer equipment and storage medium
CN114067343A (en) Data set construction method, model training method and corresponding device
Bonsu Weighted accuracy algorithmic approach in counteracting fake news and disinformation
CN113642329A (en) Method and device for establishing term recognition model and method and device for recognizing terms
CN112765444A (en) Method, device and equipment for extracting target text segment and storage medium
CN108073588B (en) Column information extraction method and device
CN111914199B (en) Page element filtering method, device, equipment and storage medium
US10423636B2 (en) Relating collections in an item universe
CN112749294B (en) Page hidden text recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant