CN108182210A - Service semantics acquisition methods and system based on page info parsing - Google Patents

Service semantics acquisition methods and system based on page info parsing

Info

Publication number
CN108182210A
Authority
CN
China
Prior art keywords
gfps
block
page
label
dominant character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711368914.3A
Other languages
Chinese (zh)
Other versions
CN108182210B (en
Inventor
曹健
童兰轩
钱诗友
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangyin Zhuri Information Technology Co ltd
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201711368914.3A priority Critical patent/CN108182210B/en
Publication of CN108182210A publication Critical patent/CN108182210A/en
Application granted granted Critical
Publication of CN108182210B publication Critical patent/CN108182210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986Document structures and storage, e.g. HTML extensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention provides a method and system for acquiring service semantics based on page information parsing, comprising: page segmentation: combining the tag arrangement pattern of the web page source code with the visual characteristics of the page, feature tags and non-feature tags are identified, the page content is divided into blocks according to dominant feature tags and non-dominant feature tags, each block is defined as a GFPS-Block, and a GFPS-Tree structure is built; vector construction: the three dimensions of the vector used for classification (text corpus, feature tag, and special structure) are determined, and four categories and the vector expression of each category are defined; vector weight determination: the weight of each vector dimension in the distance calculation is determined by cross-validation on the existing data set, and when the distance is at its maximum and exceeds a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category. The invention removes redundant page information as far as possible and places effective information into separate parts according to the aspect it describes, improving the readability of the data.

Description

Service semantics acquisition methods and system based on page info parsing
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and system for acquiring service semantics based on page information parsing.
Background
With the rapid development of the Internet, Web services have emerged as a new software technology. In recent years, REST-style Web services have attracted growing attention because they are comparatively lightweight, and more and more service providers are migrating their services to RESTful Web services. With the development of open platforms and cloud computing, the number of RESTful Web APIs has grown explosively. Against this background, more and more people without a software engineering background come into contact with Web services, and RESTful Web APIs account for 70% of all Web APIs. To make this information easy to understand, it needs to be structured so that machines can mine and use it.
However, there is currently no standard for the introductory information of RESTful Web APIs, and each company documents its APIs in its own format. For example, Google Maps, the world's largest map Web service, puts the introduction and examples of one group of services on a single page and links to other pages for content such as authentication and usage limits, whereas YouTube puts the introductions of all the APIs it provides on one page. Given Web service data that is inconsistent in format, huge in volume, and growing fast, organizing and managing it manually is clearly impractical.
To enable machines to automatically identify, acquire, and mine this information, semantic annotation needs to be added to the page text, following the idea of the "Semantic Web", and a suitable lightweight model needs to be designed that can process the introductory information accurately and efficiently.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a method and system for acquiring service semantics based on page information parsing.
A method for acquiring service semantics based on page information parsing provided by the present invention comprises:
Page segmentation step: combining the tag arrangement pattern of the web page source code with the visual characteristics of the page, identify feature tags and non-feature tags, divide the page content into blocks according to dominant feature tags and non-dominant feature tags, define each block as a GFPS-Block, and build a GFPS-Tree structure to store these GFPS-Blocks;
Vector construction step: determine the three dimensions of the vector used for classification (text corpus, feature tag, and special structure), define four categories, and define the vector expression of each category;
Vector weight determination step: determine, by cross-validation on the existing data set, the weight of each vector dimension when calculating the distance; when the distance is at its maximum and exceeds a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, which completes the classification.
Preferably, the page segmentation step specifically comprises:
Step 101: remove tags that are unrelated to the description content and redundant tags that contain no text content, then re-position each remaining tag;
Step 102: extract the dominant feature tags on the page and divide the page content into blocks; for effective content on the page that is not covered by a dominant feature tag, divide it into blocks based on non-dominant feature tags;
Step 103: store the blocking result in the GFPS-Tree, re-layer each GFPS-Block according to the strength of its dominant feature tag, and locate its predecessor, successor, and sibling positions.
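Step 101 can be illustrated with a minimal sketch built on Python's standard `html.parser`. The tag sets, the class name `TagCleaner`, and the function `clean_page` are assumptions made for illustration, and the sketch assumes well-formed markup; it is not the implementation claimed by the invention.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; the method does not fix an exhaustive list.
UNRELATED_TAGS = {"script", "img", "a"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # never have an end tag


class TagCleaner(HTMLParser):
    """Keeps only tags that carry their own text; assumes well-formed markup."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of [start_order, tag, text_parts]
        self.skip_depth = 0   # > 0 while inside an unrelated tag
        self.counter = 0
        self.kept = []        # (start_order, tag, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in UNRELATED_TAGS:
            self.skip_depth += 1
            return
        self.open_tags.append([self.counter, tag, []])
        self.counter += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if self.open_tags and self.open_tags[-1][1] == tag:
            order, name, parts = self.open_tags.pop()
            text = " ".join(parts)
            if text:  # tags without text content are treated as redundant
                self.kept.append((order, name, text))

    def handle_data(self, data):
        if not self.skip_depth and self.open_tags and data.strip():
            self.open_tags[-1][2].append(data.strip())


def clean_page(html):
    """Returns (position, tag, text) triples with positions re-assigned from 0."""
    cleaner = TagCleaner()
    cleaner.feed(html)
    cleaner.close()
    ordered = sorted(cleaner.kept)  # restore document order by start position
    return [(pos, tag, text) for pos, (_, tag, text) in enumerate(ordered)]
```

The re-assigned positions stand in for the "re-positioning" of tags after removal; divider tags such as `<br>` are dropped here for brevity and would need separate handling before Step 102.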
Preferably, step 103 specifically comprises:
Find the strongest dominant feature tag, determine the root node of the GFPS-Tree, and initialize the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, take the parent of its corresponding node in the source code as the root node; if there are several such GFPS-Blocks, take the lowest common parent of their corresponding nodes at consistent relative positions in the source code; the remaining content is treated as noise and ignored;
For each strength level, find the GFPS-Blocks containing a dominant feature tag of that strength, and move to that layer the subsequent adjacent GFPS-Blocks that contain a dominant feature tag no stronger than that strength or a non-dominant feature tag;
Evaluate the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags; merge parts whose similarity is too high and split parts whose similarity is too low.
Preferably, the four categories in the vector construction step comprise: HTTP interface address, request parameters, response parameters, and response example.
Preferably, the vector weight determination step specifically comprises:
Use tf-idf to calculate the corpus weights $w_{ic}$ and $w_{icj}$ of the i-th word belonging to category c, in the category and in the j-th GFPS-Block respectively, and obtain the text corpus similarity $TextSim_{jc}$;
Following the idea of K-fold cross-validation, divide the existing description data set into K parts; each time, train the model on K-1 parts and use the remaining part to verify whether each piece of description information is successfully assigned to its category;
Select the parameters that make the largest number of description information items classify successfully as the weights of the three vector dimensions for classification; after the classification results are obtained, add the accurate description information to the data set.
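The text similarity dimension can be sketched as follows. This is a minimal illustration under stated assumptions: the smoothed idf variant and the use of cosine similarity between tf-idf vectors are choices made here, not taken from the patent's own similarity formula, and the sample corpora are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical category corpora and one block, already tokenized.
request_corpus = "parameter name type required description".split()
response_corpus = "status ok results json example".split()
block = "parameter name type description".split()

category_request, category_response, block_vec = tfidf_vectors(
    [request_corpus, response_corpus, block]
)
sim_request = cosine(block_vec, category_request)
sim_response = cosine(block_vec, category_response)
```

In this toy data the block shares terms only with the request-parameter corpus, so `sim_request` is the larger of the two values.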
A system for acquiring service semantics based on page information parsing provided by the present invention comprises:
Page segmentation module: combining the tag arrangement pattern of the web page source code with the visual characteristics of the page, identifies feature tags and non-feature tags, divides the page content into blocks according to dominant feature tags and non-dominant feature tags, defines each block as a GFPS-Block, and builds a GFPS-Tree structure to store these GFPS-Blocks;
Vector construction module: determines the three dimensions of the vector used for classification (text corpus, feature tag, and special structure), defines four categories, and defines the vector expression of each category;
Vector weight determination module: determines, by cross-validation on the existing data set, the weight of each vector dimension when calculating the distance; when the distance is at its maximum and exceeds a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, which completes the classification.
Preferably, the page segmentation module specifically comprises:
Preprocessing submodule: removes tags that are unrelated to the description content and redundant tags that contain no text content, then re-positions each remaining tag;
Blocking submodule: extracts the dominant feature tags on the page and divides the page content into blocks; for effective content on the page that is not covered by a dominant feature tag, divides it into blocks based on non-dominant feature tags;
GFPS-Tree construction submodule: stores the blocking result in the GFPS-Tree, re-layers each GFPS-Block according to the strength of its dominant feature tag, and locates its predecessor, successor, and sibling positions.
Preferably, the GFPS-Tree construction submodule specifically:
Finds the strongest dominant feature tag, determines the root node of the GFPS-Tree, and initializes the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, the parent of its corresponding node in the source code is taken as the root node; if there are several such GFPS-Blocks, the lowest common parent of their corresponding nodes at consistent relative positions in the source code is taken; the remaining content is treated as noise and ignored;
For each strength level, finds the GFPS-Blocks containing a dominant feature tag of that strength, and moves to that layer the subsequent adjacent GFPS-Blocks that contain a dominant feature tag no stronger than that strength or a non-dominant feature tag;
Evaluates the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags, merging parts whose similarity is too high and splitting parts whose similarity is too low.
Preferably, the four categories in the vector construction module comprise: HTTP interface address, request parameters, response parameters, and response example.
Preferably, the vector weight determination module specifically:
Uses tf-idf to calculate the corpus weights $w_{ic}$ and $w_{icj}$ of the i-th word belonging to category c, in the category and in the j-th GFPS-Block respectively, and obtains the text corpus similarity $TextSim_{jc}$;
Following the idea of K-fold cross-validation, divides the existing description data set into K parts; each time, trains the model on K-1 parts and uses the remaining part to verify whether each piece of description information is successfully assigned to its category;
Selects the parameters that make the largest number of description information items classify successfully as the weights of the three vector dimensions for classification; after the classification results are obtained, adds the accurate description information to the data set.
Compared with the prior art, the present invention has the following beneficial effects:
Compared with bulky physical templates, the present invention provides a lightweight template that supports REST-style Web APIs well. By combining the arrangement rules of page tags with visual characteristics, it removes redundant page information as far as possible and places effective information into separate parts according to the aspect it describes, improving the readability of the data. The present invention makes up for the lack of support for RESTful Web API description information in related work; the data filtered by the system is highly readable and better meets the needs of both machines and non-specialist users.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the code of the page description block classification algorithm of the present invention.
Detailed description of embodiments
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those of ordinary skill in the art can make several changes and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention.
For RESTful Web API introduction pages, the present invention proposes a method for automatically acquiring API description information based on GFPS (General Feature Page Segmentation). Following the idea of the "Semantic Web", it combines the tag arrangement pattern of the web page source code with the visual features of the page to extract effective RESTful Web API information, which is then used for training, classified, and organized into machine-readable description data. A feedback mechanism is added to further improve training accuracy.
As shown in Fig. 1, a method for acquiring service semantics based on page information parsing provided by the present invention comprises:
Page segmentation step: combining the tag arrangement pattern of the web page source code with the visual characteristics of the page, identify feature tags and non-feature tags, divide the page content into blocks according to dominant feature tags and non-dominant feature tags, define each block as a GFPS-Block, and build a GFPS-Tree structure to store these GFPS-Blocks;
Vector construction step: determine the three dimensions of the vector used for classification (text corpus, feature tag, and special structure), define four categories, and define the vector expression of each category; the four categories comprise: HTTP interface address, request parameters, response parameters, and response example;
Vector weight determination step: determine, by cross-validation on the existing data set, the weight of each vector dimension when calculating the distance; when the distance is at its maximum and exceeds a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, which completes the classification.
A feedback mechanism is set up for the classification of new description information, so that dominant and non-dominant feature tags are better identified for page segmentation and the classification parameters are optimized, improving accuracy.
The page segmentation step specifically comprises:
Step 101: remove tags that are unrelated to the description content and redundant tags that contain no text content, then re-position each remaining tag;
Step 102: extract the dominant feature tags on the page and divide the page content into blocks; for effective content on the page that is not covered by a dominant feature tag, divide it into blocks based on non-dominant feature tags;
Step 103: store the blocking result in the GFPS-Tree, re-layer each GFPS-Block according to the strength of its dominant feature tag, and locate its predecessor, successor, and sibling positions.
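The blocking in Step 102 can be sketched as below, under stated assumptions: `<h1>` to `<h6>` and `<br>` are taken as dominant feature tags and `<p>` and `<table>` as non-dominant ones, each such tag opens its own block, and text under any other tag is appended to the preceding block. The class `GFPSBlock` and the function `split_into_blocks` are illustrative names only.

```python
from dataclasses import dataclass

# Assumed tag sets for illustration only.
DOMINANT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "br"}
NON_DOMINANT_TAGS = {"p", "table"}


@dataclass
class GFPSBlock:
    tag: str        # tag that opened the block
    text: str       # text collected for the block
    dominant: bool  # True if opened by a dominant feature tag


def split_into_blocks(elements):
    """elements: (tag, text) pairs in document order, after preprocessing."""
    blocks = []
    for tag, text in elements:
        if tag in DOMINANT_TAGS:
            blocks.append(GFPSBlock(tag, text, True))
        elif tag in NON_DOMINANT_TAGS:
            blocks.append(GFPSBlock(tag, text, False))
        elif blocks:
            # Other content stays with the block it follows.
            blocks[-1].text += " " + text
    return blocks


elements = [
    ("h1", "Directions API"),
    ("p", "Computes routes."),
    ("h2", "Parameters"),
    ("table", "origin destination"),
    ("li", "mode is optional"),
]
blocks = split_into_blocks(elements)
```

Keeping dominant and non-dominant blocks separate at this stage leaves the layering decision to the tree-building step that follows.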
Step 103 specifically comprises:
Find the strongest dominant feature tag, determine the root node of the GFPS-Tree, and initialize the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, take the parent of its corresponding node in the source code as the root node; if there are several such GFPS-Blocks, take the lowest common parent of their corresponding nodes at consistent relative positions in the source code; the remaining content is treated as noise and ignored;
For each strength level, find the GFPS-Blocks containing a dominant feature tag of that strength, and move to that layer the subsequent adjacent GFPS-Blocks that contain a dominant feature tag no stronger than that strength or a non-dominant feature tag;
Evaluate the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags; merge parts whose similarity is too high and split parts whose similarity is too low.
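The merging half of the last sub-step can be illustrated with a small sketch. Jaccard similarity over word sets and the threshold of 0.5 are assumptions chosen for illustration; the source does not specify the similarity measure or the thresholds, and splitting of dissimilar parts is left out.

```python
def jaccard(a, b):
    """Jaccard similarity of the lower-cased word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def merge_similar(texts, threshold=0.5):
    """Merges each block into its predecessor when their similarity is high."""
    merged = []
    for text in texts:
        if merged and jaccard(merged[-1], text) >= threshold:
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


adjacent = [
    "origin string required",
    "destination string required",
    "Returns a JSON document",
]
result = merge_similar(adjacent)
```

Here the two parameter rows share two of four distinct words and are merged, while the unrelated third block stays separate.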
The vector weight determination step specifically comprises:
Use tf-idf to calculate the corpus weights $w_{ic}$ and $w_{icj}$ of the i-th word belonging to category c, in the category and in the j-th GFPS-Block respectively, and obtain the text corpus similarity $TextSim_{jc}$;
Following the idea of K-fold cross-validation, divide the existing description data set into K parts; each time, train the model on K-1 parts and use the remaining part to verify whether each piece of description information is successfully assigned to its category;
Select the parameters that make the largest number of description information items classify successfully as the weights of the three vector dimensions for classification; after the classification results are obtained, add the accurate description information to the data set.
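One way to read the weight selection is a grid search evaluated with K-fold cross-validation, sketched below. The candidate grid, the function names, and the toy samples are assumptions for illustration; each sample holds, per category, an assumed (text, tag, format) similarity triple together with its true category.

```python
from itertools import product


def predict(sims, weights):
    """sims: {category: (text_sim, tag_sim, format_sim)}; returns the best category."""
    x, y, z = weights
    return max(sims, key=lambda c: x * sims[c][0] + y * sims[c][1] + z * sims[c][2])


def successes(samples, weights):
    """Number of samples assigned to their true category."""
    return sum(predict(sims, weights) == label for sims, label in samples)


def choose_weights(samples, k=3, grid=(0.2, 0.5, 1.0)):
    candidates = list(product(grid, repeat=3))
    folds = [samples[i::k] for i in range(k)]
    votes = {}
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        # "Training" here means picking the best weights on the K-1 folds.
        best = max(candidates, key=lambda w: successes(train, w))
        # Validation: count successes on the held-out fold.
        votes[best] = votes.get(best, 0) + successes(held_out, best)
    return max(votes, key=votes.get)


# Toy samples with invented similarity values.
type_a = ({"endpoint": (0.9, 0.0, 0.0), "request_param": (0.1, 0.5, 0.0)}, "request_param")
type_b = ({"endpoint": (0.3, 0.0, 1.0), "request_param": (0.5, 0.0, 0.0)}, "endpoint")
samples = [type_a, type_b] * 3
weights = choose_weights(samples, k=3)
```

In this toy data the text similarity alone is misleading for the first sample type, so the selected weights give the feature-tag dimension more weight than the text dimension.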
After the user provides the source code of a page containing RESTful Web API description information, the page segmentation step is carried out first:
First, tags unrelated to the description content, such as <script>, <img>, and <a>, and redundant tags without text content, such as <div> and <sector>, are removed, and each remaining tag is re-positioned.
Second, the dominant feature tags on the page, such as <h1> to <h6> and the divider <br>, are extracted and the page content is split into GFPS-Blocks. Effective content on the page that is not covered by the above tags is divided into blocks based on non-dominant feature tags such as <p> and <table>.
Third, the blocking result is stored in the GFPS-Tree. Each GFPS-Block is re-layered according to the strength of its dominant feature tag (<h1> > ... > <h6> > <br>, seven levels in total), and its predecessor, successor, and sibling positions are located. The specific operation is shown in Algorithm 1 of Fig. 2. Line 4 finds the strongest dominant feature tag tags[visual_index]. Line 5 determines the root node of the GFPS-Tree and initializes the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, the parent of its corresponding node in the source code is taken as the root node; if there are several such GFPS-Blocks, the lowest common parent of their corresponding nodes at consistent relative positions in the source code is taken, and the remaining content is treated as noise and ignored. Lines 6 to 9 find the GFPS-Blocks containing a dominant feature tag of each strength; rearrangeTree() moves to that layer the subsequent adjacent GFPS-Blocks that contain a dominant feature tag no stronger than that strength or a non-dominant feature tag. Line 10 evaluates the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags; rearrangeBlock() merges parts whose similarity is too high and splits parts whose similarity is too low.
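The layering by tag strength can be approximated with a stack-based sketch. This is not Algorithm 1 of Fig. 2, which is not reproduced in the text; it is one simplified interpretation in which a strictly weaker block becomes a child of the preceding stronger block and an equally strong block becomes its sibling. The strength table and the names `Node`, `rank`, and `build_tree` are assumptions.

```python
from dataclasses import dataclass, field

# Smaller number = stronger tag: <h1> > ... > <h6> > <br>, seven levels.
STRENGTH = {f"h{i}": i for i in range(1, 7)}
STRENGTH["br"] = 7
NON_DOMINANT_RANK = 8  # weaker than every dominant feature tag


@dataclass
class Node:
    tag: str
    text: str
    children: list = field(default_factory=list)


def rank(tag):
    return STRENGTH.get(tag, NON_DOMINANT_RANK)


def build_tree(blocks):
    """blocks: (tag, text) pairs in document order. Returns a synthetic root."""
    root = Node("root", "")
    stack = [(0, root)]  # (rank, node); the root outranks every tag
    for tag, text in blocks:
        node = Node(tag, text)
        r = rank(tag)
        # Close layers that are not stronger than the new block.
        while stack[-1][0] >= r:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((r, node))
    return root


tree = build_tree([
    ("h1", "Directions API"),
    ("p", "Overview"),
    ("h2", "Request"),
    ("table", "origin"),
    ("p", "note"),
    ("h2", "Response"),
    ("p", "json"),
])
```

The order of each `children` list gives sibling positions, and the predecessor and successor of a block follow from its index within that list.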
Next, a vector $(F_{text};\ F_{tag};\ F_{format})$ is defined for each GFPS-Block and for the existing classified data, where $F_{text}$ is the set of corpus terms it contains, and $F_{tag}$ and $F_{format}$ record whether a feature tag and a special structure are present, respectively.
The present invention divides effective RESTful Web API description information into four categories: HTTP interface address, request parameters, response parameters, and response example. The vectors corresponding to the four categories are shown in formulas 1-1 to 1-4.
$V_{\text{endpoint}} = (F_{text};\ 0;\ F_{uri})$, (1-1)
$V_{\text{input\_param}} = (F_{text};\ F_{tag};\ 0)$, (1-2)
$V_{\text{output\_param}} = (F_{text};\ F_{tag};\ 0)$, (1-3)
$V_{\text{output\_response}} = (F_{text};\ 0;\ F_{xml|json})$, (1-4)
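Formulas 1-1 to 1-4 can be mirrored in a small data structure, sketched below. The corpus terms and the use of "table" as the feature tag are placeholders invented for illustration; only the pattern of which dimension is fixed to 0 for each category follows the formulas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CategoryVector:
    corpus: frozenset              # F_text: corpus terms of the category
    feature_tag: Optional[str]     # F_tag, None where the formula has 0
    special_format: Optional[str]  # F_format, None where the formula has 0


# Corpus terms and the "table" feature tag are placeholders, not from the source.
CATEGORIES = {
    "endpoint": CategoryVector(frozenset({"url", "endpoint"}), None, "uri"),
    "input_param": CategoryVector(frozenset({"parameter", "required"}), "table", None),
    "output_param": CategoryVector(frozenset({"field", "returns"}), "table", None),
    "output_response": CategoryVector(frozenset({"example", "response"}), None, "xml|json"),
}


def active_dimensions(vector):
    """Names of the dimensions that are not fixed to 0 for this category."""
    dims = ["text"]
    if vector.feature_tag is not None:
        dims.append("tag")
    if vector.special_format is not None:
        dims.append("format")
    return dims
```

Because the two parameter categories share the same vector shape, only their text corpora can tell them apart in this representation.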
Finally, vector values are calculated for each GFPS-Block and the existing classified data. tf-idf is used to calculate the corpus weights $w_{ic}$ and $w_{icj}$ of the i-th word belonging to category c, in the category and in the j-th GFPS-Block respectively, giving the text corpus similarity $TextSim_{jc}$, as shown in formula 2-1. Formulas 2-2 and 2-3 calculate the feature tag similarity and the special structure similarity, respectively. To obtain the final similarity between the j-th GFPS-Block and category c, as shown in formula 2-5, the idea of K-fold cross-validation is used: the existing description data set is divided into K parts, K-1 parts are chosen each time to train the model, and the remaining part is used for verification; the function Suc() indicates whether a piece of description information has been successfully assigned to the category it belongs to. The parameters that make the largest number of description information items classify successfully are chosen as the classification weights of the three vector dimensions, i.e. x, y, and z in formula 2-4. After the classification results are obtained and checked manually, the accurate description information is added to the data set.
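$TagSim_{jc} = exist(\text{some\_tag})$, (2-2)
$FormatSim_{jc} = exist(\text{some\_format})$, (2-3)
$Sim_{jc} = x \times TextSim_{jc} + y \times TagSim_{jc} + z \times FormatSim_{jc}$, (2-4)
$c_i \in \{\text{endpoint}, \text{input\_param}, \text{output\_param}, \text{output\_response}\}$,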
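Formulas 2-2 to 2-4 and the threshold rule can be put together in a short sketch. The regular expressions used to detect a URI or an XML/JSON structure, the threshold value, and the example URL are assumptions for illustration; the text similarity is passed in as a number because its formula is not reproduced here.

```python
import re

# Assumed detectors for the special structures named in formulas 1-1 and 1-4.
FORMAT_PATTERNS = {
    "uri": r"https?://\S+",
    "xml|json": r"^\s*[\[{<]",
}


def tag_sim(block_tags, feature_tag):
    """Formula 2-2: 1.0 if the feature tag exists in the block, else 0.0."""
    return 1.0 if feature_tag in block_tags else 0.0


def format_sim(text, kind):
    """Formula 2-3: 1.0 if the special structure exists in the text, else 0.0."""
    return 1.0 if re.search(FORMAT_PATTERNS[kind], text) else 0.0


def combined_sim(text_sim, tag_s, format_s, weights):
    """Formula 2-4: weighted sum of the three similarities."""
    x, y, z = weights
    return x * text_sim + y * tag_s + z * format_s


def classify(scores, threshold):
    """Returns the best category if its score exceeds the threshold, else None."""
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


# Hypothetical block text and weights.
uri_block = "GET https://maps.example.com/api/directions"
scores = {
    "endpoint": combined_sim(0.6, 0.0, format_sim(uri_block, "uri"), (0.5, 0.3, 0.2)),
    "input_param": combined_sim(0.4, tag_sim({"p"}, "table"), 0.0, (0.5, 0.3, 0.2)),
}
label = classify(scores, threshold=0.4)
```

Returning `None` below the threshold leaves a block unclassified instead of forcing it into the closest category.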
Based on the above method for acquiring service semantics based on page information parsing, the present invention also provides a system for acquiring service semantics based on page information parsing, comprising:
Page segmentation module: combining the tag arrangement pattern of the web page source code with the visual characteristics of the page, identifies feature tags and non-feature tags, divides the page content into blocks according to dominant feature tags and non-dominant feature tags, defines each block as a GFPS-Block, and builds a GFPS-Tree structure to store these GFPS-Blocks;
Vector construction module: determines the three dimensions of the vector used for classification (text corpus, feature tag, and special structure), defines four categories, and defines the vector expression of each category; the four categories comprise: HTTP interface address, request parameters, response parameters, and response example;
Vector weight determination module: determines, by cross-validation on the existing data set, the weight of each vector dimension when calculating the distance; when the distance is at its maximum and exceeds a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, which completes the classification.
The page segmentation module specifically comprises:
Preprocessing submodule: removes tags that are unrelated to the description content and redundant tags that contain no text content, then re-positions each remaining tag;
Blocking submodule: extracts the dominant feature tags on the page and divides the page content into blocks; for effective content on the page that is not covered by a dominant feature tag, divides it into blocks based on non-dominant feature tags;
GFPS-Tree construction submodule: stores the blocking result in the GFPS-Tree, re-layers each GFPS-Block according to the strength of its dominant feature tag, and locates its predecessor, successor, and sibling positions.
The GFPS-Tree construction submodule specifically:
Finds the strongest dominant feature tag, determines the root node of the GFPS-Tree, and initializes the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, the parent of its corresponding node in the source code is taken as the root node; if there are several such GFPS-Blocks, the lowest common parent of their corresponding nodes at consistent relative positions in the source code is taken; the remaining content is treated as noise and ignored;
For each strength level, finds the GFPS-Blocks containing a dominant feature tag of that strength, and moves to that layer the subsequent adjacent GFPS-Blocks that contain a dominant feature tag no stronger than that strength or a non-dominant feature tag;
Evaluates the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags, merging parts whose similarity is too high and splitting parts whose similarity is too low.
The vector weight determination module specifically:
Uses tf-idf to calculate the corpus weights $w_{ic}$ and $w_{icj}$ of the i-th word belonging to category c, in the category and in the j-th GFPS-Block respectively, and obtains the text corpus similarity $TextSim_{jc}$;
Following the idea of K-fold cross-validation, divides the existing description data set into K parts; each time, trains the model on K-1 parts and uses the remaining part to verify whether each piece of description information is successfully assigned to its category;
Selects the parameters that make the largest number of description information items classify successfully as the weights of the three vector dimensions for classification; after the classification results are obtained, adds the accurate description information to the data set.
Those skilled in the art will appreciate that, besides implementing the system provided by the present invention and its devices, modules, and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. The system provided by the present invention and its devices, modules, and units may therefore be regarded as a hardware component, and the devices, modules, and units it contains for implementing various functions may be regarded as structures within the hardware component; they may also be regarded both as software modules implementing the method and as structures within the hardware component.
Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the particular embodiments described; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the present invention. Where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.

Claims (10)

1. a kind of service semantics acquisition methods based on page info parsing, which is characterized in that including:
Segment step:With reference to the tagging arrangements pattern and visual characteristic of page of web page source code, summarize feature tag and Non- feature tag, and content of pages is divided into block according to dominant character label and non-dominant character label, it is defined as GFPS- Block, and build GFPS-Tree tree structures and preserve these GFPS-Block;
Vectorial construction step:Determining classify three corpus of text of vector used, feature tag and special construction dimensions, definition Four classifications and the vector expression for defining each classification;
Vectorial weight determines step:By available data collection cross validation is determined each dimension of vector calculate apart from when Weight, when distance is maximum and higher than threshold value, it is believed that the description information of block GFPS-Block belongs to corresponding classification, realizes and divides Class.
2. The service semantics acquisition method based on page information parsing according to claim 1, characterized in that the page segmentation step specifically comprises:
Step 101: remove tags unrelated to the description content and redundant tags that contain no text content, and re-position each remaining tag;
Step 102: extract the dominant feature tags on the page and divide the page content into blocks; valid content on the page that contains no dominant feature tag is divided into blocks based on non-dominant feature tags;
Step 103: store the blocking result in the GFPS-Tree, re-layer each GFPS-Block according to the strength of its dominant feature tag, and locate its predecessor, successor and sibling positions.
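Steps 101 and 102 can be illustrated with a rough sketch built on Python's standard `html.parser`. The choice of heading tags as dominant feature tags, their strength ranking, and the `script`/`style` filter are assumptions made for the example; the claims do not enumerate the tags. Blocking by non-dominant feature tags is reduced here to a single untitled block for content that precedes the first heading.

```python
from html.parser import HTMLParser

# Assumed dominant feature tags; a smaller number means a stronger tag.
DOMINANT = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}
# Assumed tags unrelated to the description content (Step 101).
IGNORED = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Collect page text into blocks opened by dominant feature tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip = 0
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self._skip += 1
        elif tag in DOMINANT:
            # Step 102: every dominant tag opens a new block.
            self.blocks.append(
                {"tag": tag, "strength": DOMINANT[tag], "title": "", "text": []}
            )
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in IGNORED and self._skip:
            self._skip -= 1
        elif tag in DOMINANT:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if self._skip or not text:
            return
        if not self.blocks:
            # Content before any dominant tag goes into an untitled block.
            self.blocks.append(
                {"tag": None, "strength": None, "title": "", "text": []}
            )
        if self._in_heading:
            self.blocks[-1]["title"] += text
        else:
            self.blocks[-1]["text"].append(text)


def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

The recorded `strength` field is what a later layering pass (Step 103) would use to decide which block nests under which.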
3. The service semantics acquisition method based on page information parsing according to claim 2, characterized in that step 103 specifically comprises:
Find the strongest dominant feature tag, determine the GFPS-Tree root node and initialize the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, take the parent of its corresponding node in the source code as the root node; if there are multiple such GFPS-Blocks, take the lowest common parent of their corresponding nodes whose relative positions in the source code are consistent; the remaining content is noise and is not considered;
For each strength level, find the GFPS-Blocks containing a dominant feature tag of that strength, and move the subsequent adjacent GFPS-Blocks that contain a dominant feature tag of no higher strength, or a non-dominant feature tag, to that layer;
Judge the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags; re-merge the parts whose similarity is too high and split the parts whose similarity is too low.
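The root-selection rule in the first sub-step of claim 3 amounts to a lowest-common-ancestor computation. The sketch below assumes each source node is addressed by a path of child indices from the document root, which is a representation chosen for the example; `gfps_root` is a hypothetical name, and the "consistent relative position" filter is assumed to have been applied before the call.

```python
from typing import Sequence, Tuple

# A node is addressed by the child indices followed from the document root,
# e.g. (0, 1, 2) is the third child of the second child of the first child.
Path = Tuple[int, ...]


def gfps_root(strongest_paths: Sequence[Path]) -> Path:
    """Return the path of the node to use as the GFPS-Tree root.

    One block: the parent of its node. Several blocks: the lowest node that
    is a proper ancestor of all of them, computed as the longest common
    prefix of the parent paths.
    """
    if not strongest_paths:
        raise ValueError("at least one block with the strongest tag is required")
    if any(len(p) == 0 for p in strongest_paths):
        raise ValueError("the document root has no parent")
    parents = [p[:-1] for p in strongest_paths]
    prefix = parents[0]
    for p in parents[1:]:
        n = 0
        while n < min(len(prefix), len(p)) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return prefix
```

Taking the common prefix of the parent paths, rather than of the node paths themselves, keeps the result a proper ancestor even when one of the nodes is nested inside another.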
4. The service semantics acquisition method based on page information parsing according to claim 1, characterized in that the four categories in the vector construction step comprise: HTTP interface address, request parameters, response parameters and response example.
5. The service semantics acquisition method based on page information parsing according to claim 1, characterized in that the vector weight determination step specifically comprises:
Using tf-idf, calculate the corpus weight of the i-th word belonging to category c in that category and its corpus weight in the j-th GFPS-Block, and from these obtain the text corpus similarity TextSim_jc;
Following the idea of K-fold cross-validation, divide the existing description data set into K parts; each time, select K-1 parts to train the model and use the remaining part to verify whether a portion of the description information is successfully assigned to the category it belongs to;
Select the parameters under which the most description information is classified successfully as the weights of the three vector dimensions used for classification; after the classification results are obtained, add the accurately classified description information to the data set.
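The tf-idf step above can be sketched as follows. The exact formulas for the two word weights and for TextSim_jc are not recoverable from this text, so the sketch substitutes a common choice: smoothed tf-idf weights compared by cosine similarity. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(tokens: List[str], doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Weight each word by term frequency times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        w: (c / total) * (math.log((1 + n_docs) / (1 + doc_freq.get(w, 0))) + 1.0)
        for w, c in counts.items()
    }


def text_sim(block_tokens: List[str], category_tokens: List[str],
             corpus: List[List[str]]) -> float:
    """Cosine similarity between the tf-idf vectors of a block and a category.

    `corpus` is the list of tokenised documents used for document frequencies.
    Cosine similarity is an assumed stand-in for the lost TextSim_jc formula.
    """
    n_docs = len(corpus)
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    b = tf_idf(block_tokens, doc_freq, n_docs)
    c = tf_idf(category_tokens, doc_freq, n_docs)
    dot = sum(weight * b.get(w, 0.0) for w, weight in c.items())
    norm = math.sqrt(sum(v * v for v in b.values())) * math.sqrt(
        sum(v * v for v in c.values())
    )
    return dot / norm if norm else 0.0


corpus = [["param", "type", "required"], ["status", "code", "body"], ["url", "get", "post"]]
print(text_sim(["param", "type", "id"], corpus[0], corpus))
```

A block that shares no vocabulary with a category scores exactly 0.0, and an empty block is handled without a division by zero.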
6. A service semantics acquisition system based on page information parsing, characterized by comprising:
Page segmentation module: with reference to the tag organization patterns of the web page source code and the visual characteristics of the page, identify feature tags and non-feature tags, divide the page content into blocks according to dominant feature tags and non-dominant feature tags, define each block as a GFPS-Block, and build a GFPS-Tree tree structure to store these GFPS-Blocks;
Vector construction module: determine the three dimensions used by the classification vector, namely text corpus, feature tags and special structure; define four categories and define the vector representation of each category;
Vector weight determination module: determine, by cross-validation on an existing data set, the weight of each vector dimension used when calculating the distance; when the distance is the largest and exceeds a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, thereby achieving classification.
7. The service semantics acquisition system based on page information parsing according to claim 6, characterized in that the page segmentation module specifically comprises:
Preprocessing submodule: remove tags unrelated to the description content and redundant tags that contain no text content, and re-position each remaining tag;
Blocking submodule: extract the dominant feature tags on the page and divide the page content into blocks; valid content on the page that contains no dominant feature tag is divided into blocks based on non-dominant feature tags;
GFPS-Tree construction submodule: store the blocking result in the GFPS-Tree, re-layer each GFPS-Block according to the strength of its dominant feature tag, and locate its predecessor, successor and sibling positions.
8. The service semantics acquisition system based on page information parsing according to claim 7, characterized in that the GFPS-Tree construction submodule specifically performs:
Find the strongest dominant feature tag, determine the GFPS-Tree root node and initialize the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, take the parent of its corresponding node in the source code as the root node; if there are multiple such GFPS-Blocks, take the lowest common parent of their corresponding nodes whose relative positions in the source code are consistent; the remaining content is noise and is not considered;
For each strength level, find the GFPS-Blocks containing a dominant feature tag of that strength, and move the subsequent adjacent GFPS-Blocks that contain a dominant feature tag of no higher strength, or a non-dominant feature tag, to that layer;
Judge the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags; re-merge the parts whose similarity is too high and split the parts whose similarity is too low.
9. The service semantics acquisition system based on page information parsing according to claim 6, characterized in that the four categories in the vector construction module comprise: HTTP interface address, request parameters, response parameters and response example.
10. The service semantics acquisition system based on page information parsing according to claim 6, characterized in that the vector weight determination module specifically performs:
Using tf-idf, calculate the corpus weight of the i-th word belonging to category c in that category and its corpus weight in the j-th GFPS-Block, and from these obtain the text corpus similarity TextSim_jc;
Following the idea of K-fold cross-validation, divide the existing description data set into K parts; each time, select K-1 parts to train the model and use the remaining part to verify whether a portion of the description information is successfully assigned to the category it belongs to;
Select the parameters under which the most description information is classified successfully as the weights of the three vector dimensions used for classification; after the classification results are obtained, add the accurately classified description information to the data set.
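The K-fold weight selection described in the claims can be sketched roughly as below. This is one possible reading, not the patented procedure: "training" on K-1 parts is taken to mean picking the candidate weight triple with the most correct classifications on those parts, and the final choice is the triple with the most correct classifications on the held-out parts. All names, the candidate grid and the sample layout are hypothetical.

```python
from typing import Dict, List, Tuple

Weights = Tuple[float, float, float]
# A sample pairs per-category scores (text corpus, feature tag, special
# structure) with the category the description truly belongs to.
Sample = Tuple[Dict[str, Weights], str]


def predict(sims: Dict[str, Weights], w: Weights) -> str:
    """Category with the largest weighted score."""
    return max(sims, key=lambda c: sum(wi * si for wi, si in zip(w, sims[c])))


def hits(samples: List[Sample], w: Weights) -> int:
    """Number of samples classified into their true category."""
    return sum(predict(sims, w) == label for sims, label in samples)


def candidate_weights(step: float = 0.25) -> List[Weights]:
    """All weight triples on a grid of the given step that sum to 1."""
    n = round(1 / step)
    return [(a * step, b * step, (n - a - b) * step)
            for a in range(n + 1) for b in range(n + 1 - a)]


def select_weights(data: List[Sample], k: int, candidates: List[Weights]) -> Weights:
    """Pick weights by K-fold cross-validation over a candidate grid."""
    folds = [data[i::k] for i in range(k)]
    held_out_hits = {w: 0 for w in candidates}
    for i, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        best = max(candidates, key=lambda w: hits(train, w))  # fit on K-1 parts
        held_out_hits[best] += hits(held_out, best)           # verify on the rest
    return max(candidates, key=lambda w: held_out_hits[w])
```

A grid search is used only because it keeps the example short; any other search over the three weights would fit the same cross-validation loop.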
CN201711368914.3A 2017-12-18 2017-12-18 Service semantic acquisition method and system based on page information analysis Active CN108182210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711368914.3A CN108182210B (en) 2017-12-18 2017-12-18 Service semantic acquisition method and system based on page information analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711368914.3A CN108182210B (en) 2017-12-18 2017-12-18 Service semantic acquisition method and system based on page information analysis

Publications (2)

Publication Number Publication Date
CN108182210A true CN108182210A (en) 2018-06-19
CN108182210B CN108182210B (en) 2020-11-17

Family

ID=62546363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711368914.3A Active CN108182210B (en) 2017-12-18 2017-12-18 Service semantic acquisition method and system based on page information analysis

Country Status (1)

Country Link
CN (1) CN108182210B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742785A (en) * 2020-05-28 2021-12-03 深信服科技股份有限公司 Webpage classification method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663023B (en) * 2012-03-22 2014-09-17 浙江盘石信息技术有限公司 Implementation method for extracting web content
CN103559234B (en) * 2013-10-24 2017-01-25 北京邮电大学 System and method for automated semantic annotation of RESTful Web services
US20170069009A1 (en) * 2007-09-28 2017-03-09 Great-Circle Technologies, Inc. Contextual execution of automated workflows
CN103823824B (en) * 2013-11-12 2017-04-05 哈尔滨工业大学深圳研究生院 A kind of method and system that text classification corpus is built automatically by the Internet
CN107135092A (en) * 2017-03-15 2017-09-05 浙江工业大学 A kind of Web service clustering method towards global social interaction server net

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XI YANG AND JIAN CAO: "A Fast and Accurate Way for API Network", 《IFIP INTERNATIONAL FEDERATION FOR INFORMATION PROCESSING 2017》 *

Also Published As

Publication number Publication date
CN108182210B (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN105930159B (en) A kind of method and system that the GUI code based on image generates
Sanoja et al. Block-o-matic: A web page segmentation framework
CN106874256A (en) Name the method and device of entity in identification field
CN107169049A (en) The label information generation method and device of application
CN108959566B (en) A kind of medical text based on Stacking integrated study goes privacy methods and system
CN108629043A (en) Extracting method, device and the storage medium of webpage target information
CN109002834A (en) Fine granularity image classification method based on multi-modal characterization
CN106445919A (en) Sentiment classifying method and device
CN107392143A (en) A kind of resume accurate Analysis method based on SVM text classifications
CN106815194A (en) Model training method and device and keyword recognition method and device
CN107679110A (en) The method and device of knowledge mapping is improved with reference to text classification and picture attribute extraction
CN104331438B (en) To novel web page contents selectivity abstracting method and device
CN109522417A (en) Business number extraction method for company name
CN105068990B (en) A kind of English long sentence dividing method of more strategies of Machine oriented translation
CN107273546A (en) Counterfeit application detection method and system
CN106960058A (en) A kind of structure of web page alteration detection method and system
CN106649250A (en) Method and device for identifying emotional new words
CN107895117A (en) Malicious code mask method and device
CN113268615A (en) Resource label generation method and device, electronic equipment and storage medium
Meng et al. Seeds-based part segmentation by seeds propagation and region convexity decomposition
CN109635260A (en) For generating the method, apparatus, equipment and storage medium of article template
CN103631874A (en) UGC label classification determining method and device for social platform
CN106874397A (en) A kind of automatic semanteme marking method of internet of things oriented equipment
CN113762257B (en) Identification method and device for mark in make-up brand image
CN113836933B (en) Method, device, electronic equipment and storage medium for generating graphic mark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20220718
Address after: 214400 room 317-108, building 12, No. 2 Binjiang West Road, Jiangyin City, Wuxi City, Jiangsu Province
Patentee after: JIANGYIN ZHURI INFORMATION TECHNOLOGY Co.,Ltd.
Address before: 200240 No. 800, Dongchuan Road, Shanghai, Minhang District
Patentee before: SHANGHAI JIAO TONG University