CN116991978B

CN116991978B - CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium

Info

Publication number: CN116991978B
Application number: CN202311244461.9A
Authority: CN
Inventors: 郭伟; 王闽东
Original assignee: Hangzhou Jinyuan Biaoju Technology Co ltd
Current assignee: Hangzhou Jinyuan Biaoju Technology Co ltd
Priority date: 2023-09-26
Filing date: 2023-09-26
Publication date: 2024-01-02
Anticipated expiration: 2043-09-26
Also published as: CN116991978A

Abstract

The invention provides a CMS fragment feature extraction method, system, electronic device and storage medium, relating to the technical field of feature extraction. The method comprises: classifying fragments of the CMS into static fragments, dynamic fragments and code fragments; extracting the static fragments by using a feature extraction method, and establishing a static feature table; extracting the dynamic fragments by using a data extraction method, and marking the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments; and acquiring the names of codes in the code fragments, and marking the names of the codes as the fragment features of the code fragments. The invention aims to solve the problem in the prior art that, because there is no effective feature extraction method for each kind of fragment in a CMS, the fragments cannot be analyzed accurately.

Description

CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium

Technical Field

The present invention relates to the field of feature extraction technologies, and in particular, to a method, a system, an electronic device, and a storage medium for extracting CMS fragment features.

Background

A CMS, i.e. a content management system, is a software system located between a web front end and a back-end office system or process. Content creators, editors and publishers use it to submit, modify, approve and publish content, where "content" may include files, tables, pictures, data in a database, video and any other information to be published to Internet, intranet and extranet websites.

Existing improvements for extracting CMS fragment features usually focus on carefully identifying a single feature in the CMS. For example, Chinese patent publication No. CN110489701A discloses a method and an apparatus for extracting CMS identification features and a CMS identification method.

Disclosure of Invention

In view of the shortcomings of the prior art, the present invention aims to provide a CMS fragment feature extraction method, system, electronic device and storage medium, so as to solve the problem that the prior art lacks an effective feature extraction method for each kind of fragment in a CMS, with the result that the fragments cannot be analyzed accurately.

In order to achieve the above object, in a first aspect, the present invention provides a CMS fragment feature extraction method, including:

classifying fragments of the CMS into static fragments, dynamic fragments, and code fragments;

extracting the static fragments by using a feature extraction method, establishing a static feature table, and correspondingly putting abstract features and feature word libraries into the static feature table one by one based on an extraction result extracted by the feature extraction method;

extracting the dynamic fragments by using a data extraction method, and marking the obtained data characteristics of the dynamic fragments as fragment characteristics of the dynamic fragments;

the names of codes in the code fragments are obtained, and the names of the codes are marked as fragment characteristics of the code fragments.

Further, classifying the fragments of the CMS includes:

based on the different positions of the acquired CMS fragments, marking fragments used for adding article titles and abstracts as static fragments;

marking fragments used for pushing data and calling data as dynamic fragments;

marking fragments containing code for editing HTML to add data as code fragments.
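By way of non-limiting illustration, the classification above can be sketched in Python as follows. The purpose labels (`article_title`, `push_data` and so on) and the function name `classify_fragment` are hypothetical; the source does not specify how a CMS exposes the position or purpose of a fragment.

```python
from enum import Enum


class FragmentKind(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"
    CODE = "code"


# Hypothetical purpose labels; a real CMS would expose its own metadata.
_PURPOSE_TO_KIND = {
    "article_title": FragmentKind.STATIC,
    "article_abstract": FragmentKind.STATIC,
    "push_data": FragmentKind.DYNAMIC,
    "call_data": FragmentKind.DYNAMIC,
    "html_code": FragmentKind.CODE,
}


def classify_fragment(purpose: str) -> FragmentKind:
    """Map a fragment's purpose label to one of the three fragment kinds."""
    try:
        return _PURPOSE_TO_KIND[purpose]
    except KeyError:
        # Unknown purposes are rejected rather than silently classified.
        raise ValueError(f"unknown fragment purpose: {purpose!r}") from None
```

Rejecting unknown labels is a design choice of this sketch only; the source describes three classes and does not say how other fragments are handled.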

Further, extracting the static fragments by using the feature extraction method includes:

acquiring the texts corresponding to the static fragments, and marking the texts as static text 1 to static text N according to the static fragment to which each text corresponds;

marking those of static text 1 to static text N that include an abstract as extracted static text 1 to extracted static text N1;

marking those of static text 1 to static text N that do not include an abstract as non-extracted static text 1 to non-extracted static text N2, wherein N, N1 and N2 are positive integers and N is the sum of N1 and N2;

marking the abstracts of extracted static text 1 to extracted static text N1 as the abstract features of extracted static text 1 to extracted static text N1;

applying the feature extraction method to non-extracted static text 1 to non-extracted static text N2.
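As a minimal sketch of the partition described above, the following Python splits static texts by whether an abstract is present. The `StaticText` record and its field names are assumptions made for illustration and are not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StaticText:
    fragment_id: str                 # hypothetical identifier of the source fragment
    body: str
    abstract: Optional[str] = None   # None when the text carries no abstract


def partition_static_texts(
    texts: List[StaticText],
) -> Tuple[List[StaticText], List[StaticText]]:
    """Split static texts into (extracted, non_extracted) by abstract presence."""
    extracted = [t for t in texts if t.abstract]
    non_extracted = [t for t in texts if not t.abstract]
    return extracted, non_extracted


texts = [
    StaticText("f1", "body one", abstract="short summary"),
    StaticText("f2", "body two"),
    StaticText("f3", "body three"),
]
extracted, non_extracted = partition_static_texts(texts)

# Texts that already have an abstract use it directly as their abstract feature.
abstract_features = {t.fragment_id: t.abstract for t in extracted}
```

Because every text lands in exactly one of the two lists, the relation N = N1 + N2 holds by construction.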

Further, the feature extraction method includes:

selecting any one of non-extracted static text 1 to non-extracted static text N2;

performing Chinese word segmentation on the non-extracted static text, and marking the resulting Chinese words as Chinese word 1 to Chinese word Q;

obtaining a first standard quantity of common real words (content words) by using a crawler technology, marking the common real words as a real word library, and marking the real words in the real word library as standard real words;

sequentially comparing Chinese word 1 to Chinese word Q against the real word library, and when a Chinese word is completely consistent with any standard real word in the real word library, marking the Chinese word as a text real word, wherein completely consistent means that the characters are the same and the positions of the characters are the same;

and when a Chinese word is not completely consistent with any standard real word in the real word library, marking the Chinese word as a text virtual word.
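The comparison against the real word library can be sketched as below. This is an illustration under assumptions: the words are taken as already segmented (the source does not name a segmenter), the library is a small hand-written stand-in for the crawled one, and exact string equality is used to model "completely consistent".

```python
from typing import Iterable, List, Set, Tuple


def split_real_and_virtual(
    words: Iterable[str], real_word_library: Set[str]
) -> Tuple[List[str], List[str]]:
    """Label each segmented word by exact membership in the real word library."""
    text_real_words: List[str] = []
    text_virtual_words: List[str] = []
    for word in words:
        # Exact equality: same characters in the same positions.
        if word in real_word_library:
            text_real_words.append(word)
        else:
            text_virtual_words.append(word)
    return text_real_words, text_virtual_words


# Pre-segmented words; in practice these would come from a Chinese word segmenter.
words = ["系统", "的", "内容", "管理", "的", "系统"]
# Toy stand-in for the crawled library of common real words.
real_word_library = {"系统", "内容", "管理"}

real, virtual = split_real_and_virtual(words, real_word_library)
```

Repeated words are kept in the output so that occurrence counts can be computed later.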

Further, the feature extraction method further includes:

acquiring the text real words among Chinese word 1 to Chinese word Q, and marking them as text real word 1 to text real word W;

obtaining a second standard quantity of common nouns and common verbs by using a crawler technology, marking the common nouns and common verbs as a name verb word library, and marking the Chinese words in the name verb word library as name feature words;

establishing a table of Y1 rows and Y2 columns, and marking the table as a text real word table, wherein the cells of the top row other than the first are filled in sequence with the distinct text real words, the second cell of the leftmost column is filled with the label "times", and each count in the table is the number of times the text real word in the same column occurs among text real word 1 to text real word W;

acquiring the text real words whose number of occurrences is less than or equal to a third standard quantity, and marking them as few-frequency real words;

sequentially comparing the text real words in the text real word table with the name feature words in the name verb word library, and when a text real word is completely consistent with any name feature word in the name verb word library, marking the text real word as a first-level feature word, wherein completely consistent means that the characters are the same and the positions of the characters are the same;

and marking the text real words that are marked as both few-frequency real words and first-level feature words as standard feature words.
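The selection of standard feature words can be illustrated with the following sketch, which uses a `Counter` in place of the two-row text real word table. The function name, the toy word lists and the threshold value 2 are assumptions chosen for the example.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def standard_feature_words(
    text_real_words: Iterable[str],
    name_verb_library: Set[str],
    third_standard_quantity: int,
) -> Tuple[Counter, Set[str]]:
    """Return the occurrence counts and the set of standard feature words."""
    # Plays the role of the text real word table: word -> number of occurrences.
    counts = Counter(text_real_words)
    # Few-frequency real words: occurrences <= the third standard quantity.
    few_frequency = {w for w, n in counts.items() if n <= third_standard_quantity}
    # First-level feature words: exact match with a name feature word.
    first_level = {w for w in counts if w in name_verb_library}
    # Standard feature words carry both marks.
    return counts, few_frequency & first_level


text_real_words = ["系统", "内容", "管理", "系统", "系统", "发布"]
name_verb_library = {"系统", "发布", "内容"}   # toy stand-in for the crawled library

counts, standard = standard_feature_words(text_real_words, name_verb_library, 2)
```

In this toy input, "系统" occurs three times and is therefore not a few-frequency real word, while "管理" is absent from the name verb word library; neither becomes a standard feature word.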

Further, the feature extraction method further includes:

acquiring the first segment and the tail segment of the non-extracted static text, performing Chinese word segmentation on both, marking the Chinese words of the first segment as first-segment Chinese words, and marking the Chinese words of the tail segment as tail-segment Chinese words;

acquiring the number of times the standard feature words occur in the first segment, and recording it as the first-segment count;

acquiring the number of times the standard feature words occur in the tail segment, and recording it as the tail-segment count;

when the first-segment count is greater than or equal to the tail-segment count and the first-segment count is greater than a fourth standard quantity, marking the first segment of the non-extracted static text as the abstract feature of the non-extracted static text;

when the first-segment count is smaller than the tail-segment count and the tail-segment count is greater than the fourth standard quantity, marking the tail segment of the non-extracted static text as the abstract feature of the non-extracted static text;

and when both the first-segment count and the tail-segment count are smaller than the fourth standard quantity, marking the standard feature words as the feature word library of the non-extracted static text.
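The three rules above can be sketched as a single decision function. The function name and return labels are hypothetical. The source does not say what happens when the larger count exactly equals the fourth standard quantity; this sketch assumes such cases fall back to the feature word library, and that assumption is marked in the code.

```python
from typing import Iterable, Set


def choose_static_feature(
    first_words: Iterable[str],
    tail_words: Iterable[str],
    standard_words: Set[str],
    fourth_standard_quantity: int,
) -> str:
    """Return "first", "tail" or "library" according to the segment counts."""
    first_count = sum(1 for w in first_words if w in standard_words)
    tail_count = sum(1 for w in tail_words if w in standard_words)
    if first_count >= tail_count and first_count > fourth_standard_quantity:
        return "first"   # the first segment becomes the abstract feature
    if first_count < tail_count and tail_count > fourth_standard_quantity:
        return "tail"    # the tail segment becomes the abstract feature
    # Assumption: every remaining case, including a count equal to the
    # threshold, uses the standard feature words as the feature word library.
    return "library"


standard = {"内容", "发布"}
choice_a = choose_static_feature(["内容", "发布", "系统"], ["系统", "内容"], standard, 1)
choice_b = choose_static_feature(["系统"], ["内容", "发布", "内容"], standard, 1)
choice_c = choose_static_feature(["内容"], ["系统"], standard, 5)
```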

Further, placing the abstract features and the feature word libraries into the static feature table in one-to-one correspondence based on the extraction result of the feature extraction method comprises:

after the feature extraction method has been applied in sequence to non-extracted static text 1 to non-extracted static text N2, marking non-extracted static text 1 to non-extracted static text N2 as extracted static text N1+1 to extracted static text N, establishing a table of T1 rows and T2 columns, and marking the table as a static feature table;

sequentially filling extracted static text 1 to extracted static text N into the top row of the static feature table;

and sequentially filling the corresponding feature word library or abstract feature into the column of each extracted static text in the static feature table.
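A minimal sketch of the static feature table follows, assuming the two-row layout given in the embodiment (T1 is 2 and T2 is N). Representing the table as a list of two lists, and the function name, are choices made for this illustration.

```python
from typing import Dict, List, Union

Feature = Union[str, List[str]]   # an abstract feature or a feature word library


def build_static_feature_table(features_by_text: Dict[str, Feature]) -> List[list]:
    """Build a two-row table: text labels on top, their features underneath."""
    header = list(features_by_text.keys())                    # top row
    values = [features_by_text[label] for label in header]    # same column order
    return [header, values]


table = build_static_feature_table(
    {
        "extracted static text 1": "short summary",    # abstract feature
        "extracted static text 2": ["内容", "发布"],     # feature word library
    }
)
```

Each column pairs one extracted static text with exactly one feature, which is the one-to-one correspondence the method calls for.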

Further, the data extraction method comprises the following steps:

after the webpage starts running, performing dynamic data grabbing on the webpage by using Selenium once every first standard time;

after Selenium performs a dynamic data grab, marking the grabbed dynamic data as updated dynamic data, and marking the dynamic data grabbed by Selenium the previous time as non-updated dynamic data;

when dynamic data in the updated dynamic data is identical to the dynamic data at the same position in the non-updated dynamic data, marking the dynamic data at that position as unchanged data.

Further, marking the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments comprises:

marking the data in the updated dynamic data other than the unchanged data as changed data, and marking the changed data as the data features of the dynamic fragments;

and updating the data features of the dynamic fragments after each dynamic data grab by Selenium.
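The periodic grab and comparison can be sketched as follows. To keep the example self-contained, the grab is passed in as a plain callable; in practice it would wrap the Selenium calls that read the page, which are not shown here. The function names, the dictionary keyed by position, and the choice to report no features after the very first grab are all assumptions of this sketch.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

Snapshot = Dict[str, Any]   # position on the page -> grabbed value


def diff_dynamic_data(
    non_updated: Snapshot, updated: Snapshot
) -> Tuple[Snapshot, Snapshot]:
    """Split the updated snapshot into (unchanged, changed) data."""
    unchanged = {
        pos: val
        for pos, val in updated.items()
        if pos in non_updated and non_updated[pos] == val
    }
    changed = {pos: val for pos, val in updated.items() if pos not in unchanged}
    return unchanged, changed


def poll_dynamic_features(
    grab: Callable[[], Snapshot],
    interval_seconds: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Snapshot:
    """Grab `rounds` snapshots and return the data features after the last grab."""
    previous: Optional[Snapshot] = None
    features: Snapshot = {}
    for i in range(rounds):
        updated = grab()
        if previous is not None:
            # The data features are refreshed after every grab.
            _, features = diff_dynamic_data(previous, updated)
        previous = updated
        if i < rounds - 1:
            sleep(interval_seconds)   # wait one "first standard time"
    return features


# Fake grabs standing in for Selenium; the values are illustrative only.
snapshots = iter(
    [
        {"purchase amount": 90, "title": "A"},
        {"purchase amount": 100, "title": "A"},
    ]
)
features = poll_dynamic_features(lambda: next(snapshots), 0, 2, sleep=lambda s: None)
```

Injecting `grab` and `sleep` keeps the comparison logic testable without a browser.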

In a second aspect, the present invention provides a CMS fragment feature extraction system, including a fragment classification module, a static fragment extraction module, a dynamic fragment extraction module, and a code fragment extraction module;

the fragment classification module is used for classifying fragments of the CMS into static fragments, dynamic fragments and code fragments;

the static fragment extraction module is used for extracting the static fragments by using a feature extraction method, establishing a static feature table, classifying extraction results based on extraction objects of the feature extraction method, and correspondingly placing the extraction objects and the extraction results into the static feature table;

the dynamic fragment extraction module is used for extracting the dynamic fragments by using a data extraction method, and marking the obtained data characteristics of the dynamic fragments as the fragment characteristics of the dynamic fragments;

the code fragment extraction module is used for obtaining the names of codes in the code fragments and recording the names of the codes as fragment characteristics of the code fragments.
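One possible way to wire the four modules together is sketched below. The class name `FragmentFeatureSystem`, the dictionary-based fragment records and the stub extractors are hypothetical; the source describes the modules only functionally.

```python
from typing import Any, Callable, Dict, Tuple

Fragment = Dict[str, Any]


class FragmentFeatureSystem:
    """Routes each fragment to the extraction module matching its class."""

    def __init__(
        self,
        classify: Callable[[Fragment], str],
        extract_static: Callable[[Fragment], Any],
        extract_dynamic: Callable[[Fragment], Any],
        extract_code: Callable[[Fragment], Any],
    ) -> None:
        self._classify = classify
        self._extractors = {
            "static": extract_static,
            "dynamic": extract_dynamic,
            "code": extract_code,
        }

    def extract(self, fragment: Fragment) -> Tuple[str, Any]:
        """Classify the fragment, then apply the matching extractor."""
        kind = self._classify(fragment)
        return kind, self._extractors[kind](fragment)


# Stub modules for illustration; real modules would implement the methods above.
system = FragmentFeatureSystem(
    classify=lambda f: f["kind"],
    extract_static=lambda f: f["abstract"],
    extract_dynamic=lambda f: f["changed"],
    extract_code=lambda f: f["code_name"],
)
result = system.extract({"kind": "code", "code_name": "banner_widget"})
```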

In a third aspect, an electronic device comprises a processor and a memory storing computer readable instructions that, when executed by the processor, perform the steps in the above method.

In a fourth aspect, a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.

The beneficial effects of the invention are as follows: the invention classifies the fragments of the CMS into static fragments, dynamic fragments and code fragments; classifying the fragments allows targeted extraction according to the different properties of the fragments, so that the extracted fragment features better reflect those properties;

the invention also extracts the static fragments by using a feature extraction method, establishes a static feature table, and places the abstract features and feature word libraries into the static feature table in one-to-one correspondence based on the extraction result of the feature extraction method; and extracts the dynamic fragments by using a data extraction method, marking the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments; with the feature extraction method and the data extraction method, features can be extracted from the static fragments based on their textual characteristics and from the dynamic fragments based on their real-time changes, which improves the efficiency of CMS fragment feature extraction.

Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:

FIG. 1 is a schematic block diagram of a system of the present invention;

FIG. 2 is a flow chart of the steps of the method of the present invention;

FIG. 3 is a flow chart of the static fragment extraction strategy of the present invention;

FIG. 4 is a connection block diagram of the electronic device of the present invention.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention.

Embodiments of the invention and features of the embodiments may be combined with each other without conflict.

Example 1

Referring to FIG. 1, the present invention provides a CMS fragment feature extraction system, which includes a fragment classification module, a static fragment extraction module, a dynamic fragment extraction module, and a code fragment extraction module;

the fragment classification module is configured with a fragment classification strategy comprising:

marking fragments containing code for editing HTML to add data as code fragments;

in a specific implementation, a static fragment comprises content such as the titles, links and introductions of information items that are added manually in the CMS, with the information content retrieved by information ID; a dynamic fragment comprises fragments to which information is pushed from the information management page and fragments that call manually specified information IDs; and a code fragment comprises call content edited manually in visual mode or in code mode, wherein code fragments support automatic backup and restoration from backup;

the static fragment extraction module is configured with a static fragment extraction strategy that includes:

referring to FIG. 3, marking those of static text 1 to static text N that include an abstract as extracted static text 1 to extracted static text N1;

applying the feature extraction method to non-extracted static text 1 to non-extracted static text N2;

the feature extraction method comprises the following steps:

in a specific implementation, the common real words with the highest click rates are obtained from the Baidu library on the Internet by using a crawler technology, each obtained real word being distinct, and the first standard quantity is set to 300;

sequentially comparing Chinese word 1 to Chinese word Q against the real word library, and when a Chinese word is completely consistent with any standard real word in the real word library, marking the Chinese word as a text real word, wherein completely consistent means that the characters are the same and the positions of the characters within the word are the same;

when a Chinese word is not completely consistent with any standard real word in the real word library, marking the Chinese word as a text virtual word;

in a specific implementation, text virtual words include virtual words (function words) and uncommon real words, both of which are excluded by the method;

the feature extraction method further comprises the following steps:

in a specific implementation, the common nouns and common verbs with the highest click rates are obtained from the Baidu library on the Internet by using a crawler technology, each obtained noun and verb being distinct, and the second standard quantity is set to 100;

referring to Table 1, where Y1 is the number of rows and Y2 is the number of columns of Table 1, Y1 being 2 and Y2 being W+1, a table of Y1 rows × Y2 columns is established and marked as a text real word table; the cells of the top row other than the first are filled in sequence with the distinct text real words, the second cell of the leftmost column is filled with the label "times", and each count in the table is the number of times the text real word in the same column occurs among text real word 1 to text real word W;

Table 1: Text real word table

in the specific implementation process, the third standard quantity is set to 10;

sequentially comparing the text real words in the text real word table with the name feature words in the name verb word library, and marking a text real word as a first-level feature word when it is completely consistent with any name feature word in the name verb word library;

marking the text real words that are marked as both few-frequency real words and first-level feature words as standard feature words;

in a specific implementation, real words that occur less often are more distinctive, and when such a word appears in the first segment or the tail segment, that segment is representative of the text; therefore, the text real words that are marked as both few-frequency real words and first-level feature words are marked as standard feature words and are used to evaluate the first segment and the tail segment;

the feature extraction method further comprises the following steps:

in a specific implementation process, the fourth standard quantity is set to 5;

when both the number of times the standard feature words occur in the first segment and the number of times they occur in the tail segment are smaller than the fourth standard quantity, marking the standard feature words as the feature word library of the non-extracted static text;

the static fragment extraction module is further configured with a feature table establishment strategy, the feature table establishment strategy comprising:

referring to Table 2, where T1 is the number of rows and T2 is the number of columns of Table 2, T1 being 2 and T2 being N: after the feature extraction method has been applied in sequence to non-extracted static text 1 to non-extracted static text N2, non-extracted static text 1 to non-extracted static text N2 are marked as extracted static text N1+1 to extracted static text N, and a table of T1 rows × T2 columns is established and marked as a static feature table;

Table 2: Static feature table

sequentially filling the corresponding feature word library or abstract features into the column of each extracted static text in the static feature table;

the dynamic fragment extraction module is configured with a data extraction method, and the data extraction method is used for extracting the characteristics of the dynamic fragments, and comprises the following steps:

when dynamic data in the updated dynamic data is identical to the dynamic data at the same position in the non-updated dynamic data, marking the dynamic data at that position as unchanged data;

in a specific implementation, Selenium is used to simulate user operations in a browser; for example, on an e-commerce website, Selenium can simulate a user searching for and purchasing goods, while the data produced by the Selenium operations is transmitted to the website back end in real time;

in a specific implementation, if the value corresponding to the purchase amount is 100 in the dynamic data grabbed by Selenium this time and was 90 in the dynamic data grabbed the previous time, the purchase amount is marked as changed data, and the purchase amount together with the value 100 is marked as a data feature of the dynamic fragment;

updating the data features of the dynamic fragments after each dynamic data grab by Selenium;

Example 2

Referring to FIG. 2, in which steps S1 to S4 correspond to the description below, the present invention provides a CMS fragment feature extraction method, which includes:

step S1, classifying fragments of the CMS into static fragments, dynamic fragments and code fragments; the step S1 comprises the following steps:

step S101, marking fragments for adding titles and summaries of articles as static fragments based on the difference of the positions of the acquired CMS fragments;

step S102, marking fragments for pushing data and calling data as dynamic fragments;

step S103, the fragments of the code for editing the HTML added data are recorded as code fragments.

Step S2, extracting the static fragments by using a feature extraction method, establishing a static feature table, and correspondingly putting abstract features and feature word libraries into the static feature table one by one based on an extracted result of the feature extraction method; the step S2 comprises the following steps:

step S201, obtaining texts corresponding to the static fragments, and recording the texts as a static text 1 to a static text N based on the difference of the static fragments corresponding to each text;

step S202, the static texts from the static text 1 to the static text N containing the abstract are recorded as extracted static text 1 to extracted static text N1;

step S203, the abstracts of the extracted static texts 1 to N1 are recorded as abstract features of the extracted static texts 1 to N1;

the feature extraction method comprises the following steps:

step V1, selecting any one of non-extracted static text 1 to non-extracted static text N2;

step V2, sequentially comparing Chinese word 1 to Chinese word Q against the real word library, and when a Chinese word is completely consistent with any standard real word in the real word library, marking the Chinese word as a text real word, wherein completely consistent means that the characters are the same and the positions of the characters are the same;

step V3, acquiring the text real words among Chinese word 1 to Chinese word Q, and marking them as text real word 1 to text real word W;

step V4, acquiring the text real words whose number of occurrences is less than or equal to the third standard quantity, and marking them as few-frequency real words;

step V5, acquiring the first segment and the tail segment of the non-extracted static text, performing Chinese word segmentation on both, marking the Chinese words of the first segment as first-segment Chinese words, and marking the Chinese words of the tail segment as tail-segment Chinese words;

step V6, when the number of times the standard feature words occur in the first segment is greater than or equal to the number of times they occur in the tail segment and is greater than the fourth standard quantity, marking the first segment of the non-extracted static text as the abstract feature of the non-extracted static text;

Step S2 further includes:

step S204, after the feature extraction method has been applied in sequence to non-extracted static text 1 to non-extracted static text N2, marking non-extracted static text 1 to non-extracted static text N2 as extracted static text N1+1 to extracted static text N, establishing a table of T1 rows and T2 columns, and marking the table as a static feature table;

step S205, sequentially filling the extracted static text 1 to the extracted static text N in the top row of the static feature table;

Step S3, extracting the dynamic fragments by using a data extraction method, and marking the obtained data characteristics of the dynamic fragments as the fragment characteristics of the dynamic fragments; the data extraction method comprises the following steps:

step X1, after the webpage starts running, performing dynamic data grabbing on the webpage by using Selenium once every first standard time;

step X2, after Selenium performs a dynamic data grab, marking the grabbed dynamic data as updated dynamic data, and marking the dynamic data grabbed by Selenium the previous time as non-updated dynamic data;

step X3, when dynamic data in the updated dynamic data is identical to the dynamic data at the same position in the non-updated dynamic data, marking the dynamic data at that position as unchanged data;

the step S3 comprises the following steps:

Step S4, acquiring the names of the codes in the code fragments, and marking the names of the codes as the fragment features of the code fragments.

Example 3

Referring to FIG. 4, the present application provides an electronic device 50, including a processor 501 and a memory 502, where the memory 502 stores computer-readable instructions that, when executed by the processor 501, perform the steps of the method described above. In this technical solution, the processor 501 and the memory 502 are interconnected and communicate with each other through a communication bus and/or another form of connection mechanism (not shown), and the memory 502 stores a computer program executable by the processor 501; when the electronic device 50 is running, the processor 501 executes the computer program to perform the method in any of the alternative implementations of the above embodiments, so as to implement the following functions: classifying fragments of the CMS into static fragments, dynamic fragments and code fragments; extracting the static fragments by using a feature extraction method, establishing a static feature table, and placing abstract features and feature word libraries into the static feature table based on the extraction result of the feature extraction method; extracting the dynamic fragments by using a data extraction method; and acquiring the names of codes in the code fragments, and marking the names of the codes as the fragment features of the code fragments.

Example 4

The present application provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above. In this technical solution, the computer program, when executed by the processor, performs the method in any of the alternative implementations of the above embodiments, so as to implement the following functions: classifying fragments of the CMS into static fragments, dynamic fragments and code fragments; extracting the static fragments by using a feature extraction method, establishing a static feature table, and placing abstract features and feature word libraries into the static feature table based on the extraction result of the feature extraction method; extracting the dynamic fragments by using a data extraction method; and acquiring the names of codes in the code fragments, and marking the names of the codes as the fragment features of the code fragments.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein. The storage medium may be implemented by any type of volatile or nonvolatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

The above examples are only specific embodiments of the present invention for illustrating the technical solution of the present invention, but not for limiting the scope of the present invention, and although the present invention has been described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that the present invention is not limited thereto: any person skilled in the art may modify or easily conceive of the technical solution described in the foregoing embodiments, or perform equivalent substitution of some of the technical features, while remaining within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A method for extracting CMS fragment characteristics, comprising:

acquiring the names of codes in the code fragments, and marking the names of the codes as fragment characteristics of the code fragments;

classifying the fragments of the CMS includes:

marking fragments of a code for editing HTML added data as code fragments;

extracting the static fragments by using a feature extraction method comprises the following steps:

the feature extraction method comprises the following steps:

the feature extraction method further comprises the following steps:

the step of putting the abstract features and the feature word stock into the static feature table in a one-to-one correspondence manner based on the extracted extraction result of the feature extraction method comprises the following steps:

2. The CMS fragment feature extraction method of claim 1, wherein the data extraction method comprises:

3. The CMS fragment feature extraction method of claim 2, wherein marking the obtained data features of the dynamic fragments as the fragment features of the dynamic fragments comprises:

4. A system adapted for the CMS fragment feature extraction method as claimed in any one of claims 1-3, comprising a fragment classification module, a static fragment extraction module, a dynamic fragment extraction module, and a code fragment extraction module;

5. An electronic device comprising a processor and a memory storing computer readable instructions which, when executed by the processor, perform the steps of the method of any of claims 1-3.

6. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1-3.