CN108182210A

CN108182210A - Service semantics acquisition method and system based on page information parsing

Info

Publication number: CN108182210A
Application number: CN201711368914.3A
Authority: CN
Inventors: 曹健; 童兰轩; 钱诗友
Original assignee: Shanghai Jiaotong University
Current assignee: Jiangyin Zhuri Information Technology Co ltd
Priority date: 2017-12-18
Filing date: 2017-12-18
Publication date: 2018-06-19
Anticipated expiration: 2037-12-18
Also published as: CN108182210B

Abstract

The present invention provides a service semantics acquisition method and system based on page information parsing, comprising: page segmentation: combining the tag arrangement patterns of the web page source code with the visual characteristics of the page, feature tags and non-feature tags are summarized, the page content is divided into blocks according to dominant and non-dominant feature tags, the blocks are defined as GFPS-Blocks, and a GFPS-Tree structure is built; vector construction: the three dimensions of the classification vector (text corpus, feature tags, and special structure) are determined, and four categories and the vector expression of each category are defined; vector weight determination: the weight of each vector dimension in the distance calculation is determined by cross-validation on the existing data set, and when the distance is the largest and above a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category. The invention removes redundant page information as far as possible, places the effective information in different parts according to the aspect it describes, and improves the readability of the data.

Description

Service semantics acquisition method and system based on page information parsing

Technical field

The present invention relates to the technical field of data processing, and in particular to a service semantics acquisition method and system based on page information parsing.

Background art

With the rapid development of the Internet, Web services have emerged as a new software technology. In recent years, REST-style Web services have attracted more and more attention because they are comparatively lightweight, and more and more service providers are migrating their services to RESTful Web services. In fact, with the development of open platforms and cloud computing, the number of RESTful Web APIs has grown explosively. Against this background, more and more people without a software engineering background come into contact with Web services, and RESTful Web APIs account for 70% of Web APIs. To make this information easy to understand, it needs to be structured so that machines can mine and use it.

However, there is currently no standard for the introductory information of RESTful Web APIs; each company documents the usage of its APIs in its own format. For example, Google Map, the world's largest mapping Web service, puts the introduction and examples for a group of services on one page and links to other pages for content such as authentication and usage limits, whereas Youtube puts the introductions of all the APIs it provides on a single page. Faced with Web service data that is inconsistent in format, huge in volume, and fast-growing, organizing and managing it manually is clearly impractical.

To enable machines to automatically identify, acquire, and mine this information, semantic annotations need to be added to the page text, following the idea of the Semantic Web, and a suitable, lightweight model needs to be designed that can process the introductory information accurately and efficiently.

Summary of the invention

In view of the defects in the prior art, the object of the present invention is to provide a service semantics acquisition method and system based on page information parsing.

A service semantics acquisition method based on page information parsing provided according to the present invention comprises:

Page segmentation step: combining the tag arrangement patterns of the web page source code with the visual characteristics of the page, feature tags and non-feature tags are summarized, the page content is divided into blocks according to dominant and non-dominant feature tags, the blocks are defined as GFPS-Blocks, and a GFPS-Tree structure is built to store these GFPS-Blocks;

Vector construction step: the three dimensions of the classification vector, namely text corpus, feature tags, and special structure, are determined, four categories are defined, and the vector expression of each category is defined;

Vector weight determination step: the weight of each vector dimension in the distance calculation is determined by cross-validation on the existing data set; when the distance is the largest and above a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, thereby achieving classification.

Preferably, the page segmentation step specifically comprises:

Step 101: tags unrelated to the description content and redundant tags without text content are removed, and each tag is re-positioned;

Step 102: the dominant feature tags on the page are extracted and the page content is divided into blocks; effective content on the page that contains no dominant feature tag is divided into blocks based on non-dominant feature tags;

Step 103: the segmentation result is stored in the GFPS-Tree, each GFPS-Block is re-layered according to the strength of its dominant feature tag, and its predecessor, successor, and sibling positions are located.
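Steps 101 and 102 can be illustrated with a minimal sketch built on Python's standard `html.parser`. It is not the patented implementation: the tag sets are taken from the examples given later in the description, the names `BlockSplitter` and `split_page` are hypothetical, and dropping the text inside unrelated tags is an assumption.

```python
from html.parser import HTMLParser

# Assumed tag sets, based on the examples in the description.
UNRELATED_TAGS = {"script", "img", "a"}   # removed together with their content
VOID_TAGS = {"img", "br"}                 # tags that never have a closing tag
DOMINANT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "br"}
NON_DOMINANT_TAGS = {"p", "table"}
BLOCK_TAGS = DOMINANT_TAGS | NON_DOMINANT_TAGS


class BlockSplitter(HTMLParser):
    """Collects (leading_tag, text) pairs; each pair stands for one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0   # > 0 while inside an unrelated tag
        self._tag = None       # tag that opened the current block
        self._parts = []       # text fragments of the current block

    def flush(self):
        text = " ".join(" ".join(self._parts).split())
        if self._tag is not None and text:   # blocks without text are dropped
            self.blocks.append((self._tag, text))
        self._tag, self._parts = None, []

    def handle_starttag(self, tag, attrs):
        if tag in UNRELATED_TAGS:
            if tag not in VOID_TAGS:
                self._skip_depth += 1
        elif self._skip_depth == 0 and tag in BLOCK_TAGS:
            self.flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in UNRELATED_TAGS and tag not in VOID_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and self._tag is not None:
            self._parts.append(data)


def split_page(html):
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()
    return splitter.blocks


page = """
<div><script>var x = 1;</script><img src="logo.png">
<h1>Geocoding API</h1>
<p>Converts addresses into coordinates.</p>
<div></div>
<h2>Request</h2>
<table><tr><td>address</td><td>string</td></tr></table>
</div>
"""
blocks = split_page(page)
```

Here `<div>` is never a block boundary, so an empty `<div>` contributes nothing, which approximates the removal of redundant tags without text content.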

Preferably, step 103 specifically comprises:

The strongest dominant feature tag is found, the GFPS-Tree root node is determined, and the GFPS-Tree is initialized: if only one GFPS-Block contains the strongest dominant feature tag, the parent of its corresponding node in the source code is taken as the root node; if there are several such GFPS-Blocks, the lowest common parent of their corresponding nodes whose relative positions in the source code are consistent is taken, and the rest is treated as noise and ignored;

The GFPS-Blocks containing the dominant feature tag of each strength are found, and the subsequent adjacent GFPS-Blocks containing a dominant feature tag no stronger than this strength, or a non-dominant feature tag, are moved to this layer;

The text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags is evaluated; parts whose similarity is too high are merged, and parts whose similarity is too low are split.
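The root-node rule above amounts to finding a lowest common ancestor. The sketch below is an illustration under stated assumptions: each source-code node is represented by a hypothetical path of labels from the document root down to the node itself, and the check for consistent relative position is left out.

```python
def choose_root(node_paths):
    """Return the path of the GFPS-Tree root node.

    node_paths: one path per GFPS-Block that carries the strongest dominant
    tag; each path is a tuple of labels from the document root to the node.
    With a single path the result is its parent; with several paths it is
    their lowest common parent.
    """
    if not node_paths:
        raise ValueError("at least one node path is required")
    parent_paths = [path[:-1] for path in node_paths]
    root = []
    for steps in zip(*parent_paths):
        if len(set(steps)) != 1:   # the paths diverge here
            break
        root.append(steps[0])
    return tuple(root)


single = [("html", "body", "div[0]", "h1[0]")]
several = [
    ("html", "body", "div[0]", "section[0]", "h1[0]"),
    ("html", "body", "div[0]", "section[1]", "h1[0]"),
]
```

Comparing parent paths rather than full paths guarantees that the chosen root is a proper ancestor of every node, so the single-block and multi-block cases share one code path.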

Preferably, the four categories in the vector construction step comprise: HTTP interface address, request parameters, response parameters, and response examples.

Preferably, the vector weight determination step specifically comprises:

Using tf-idf, the corpus weights w_{ic} and w_{icj} of the i-th word belonging to category c are computed in the category and in the j-th GFPS-Block respectively, giving the text corpus similarity TextSim_{jc};

Using the idea of K-fold cross-validation, the existing description data set is divided into K parts; each time, K-1 parts are chosen to train the model and the remaining part is used to verify whether a piece of description information is successfully assigned to the category it belongs to;

The parameters under which the most description information is classified successfully are chosen as the weights of the three dimensions of the classification vector; after the classification results are obtained, accurate description information is added to the data set.
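The weight selection can be sketched as a grid search scored by K-fold cross-validation. This is a minimal illustration, not the procedure of Fig. 2: the candidate grid, the function names, and the `component_sims` callback (which is assumed to return the three per-category similarities of a block given the training part) are all hypothetical.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def classify(sims, weights):
    """Pick the category with the largest weighted sum of the three similarities."""
    x, y, z = weights
    return max(sims, key=lambda c: x * sims[c][0] + y * sims[c][1] + z * sims[c][2])


def select_weights(samples, component_sims, k, grid):
    """samples: list of (block, true_category).
    component_sims(train, block) -> {category: (text_sim, tag_sim, format_sim)}.
    Returns the weights with the most successful classifications, plus all counts.
    """
    successes = {w: 0 for w in grid}
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train = [s for i, s in enumerate(samples) if i not in held]
        for i in held_out:
            block, label = samples[i]
            sims = component_sims(train, block)
            for w in grid:
                successes[w] += classify(sims, w) == label   # plays the role of Suc()
    return max(grid, key=lambda w: successes[w]), successes


# Toy data: each "block" already carries its similarities, so the callback
# ignores the training part. Real similarities would be computed from it.
samples = [
    ({"endpoint": (0.2, 0, 1), "output_response": (0.6, 0, 0)}, "endpoint"),
    ({"endpoint": (0.5, 0, 0), "output_response": (0.3, 0, 1)}, "output_response"),
    ({"endpoint": (0.7, 0, 1), "output_response": (0.1, 0, 0)}, "endpoint"),
    ({"endpoint": (0.1, 0, 0), "output_response": (0.4, 0, 1)}, "output_response"),
]
grid = [(1.0, 0.0, 0.0), (0.5, 0.0, 0.5)]
best, counts = select_weights(samples, lambda train, block: block, k=2, grid=grid)
```

In this toy data the text similarity alone misclassifies two blocks, so the grid entry that also weights the special-structure dimension wins.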

A service semantics acquisition system based on page information parsing provided according to the present invention comprises:

Page segmentation module: combining the tag arrangement patterns of the web page source code with the visual characteristics of the page, feature tags and non-feature tags are summarized, the page content is divided into blocks according to dominant and non-dominant feature tags, the blocks are defined as GFPS-Blocks, and a GFPS-Tree structure is built to store these GFPS-Blocks;

Vector construction module: the three dimensions of the classification vector, namely text corpus, feature tags, and special structure, are determined, four categories are defined, and the vector expression of each category is defined;

Vector weight determination module: the weight of each vector dimension in the distance calculation is determined by cross-validation on the existing data set; when the distance is the largest and above a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, thereby achieving classification.

Preferably, the page segmentation module specifically comprises:

Preprocessing submodule: tags unrelated to the description content and redundant tags without text content are removed, and each tag is re-positioned;

Block division submodule: the dominant feature tags on the page are extracted and the page content is divided into blocks; effective content on the page that contains no dominant feature tag is divided into blocks based on non-dominant feature tags;

GFPS-Tree construction submodule: the segmentation result is stored in the GFPS-Tree, each GFPS-Block is re-layered according to the strength of its dominant feature tag, and its predecessor, successor, and sibling positions are located.

Preferably, the GFPS-Tree construction submodule specifically comprises:

Preferably, the four categories in the vector construction module comprise: HTTP interface address, request parameters, response parameters, and response examples.

Preferably, the vector weight determination module specifically comprises:

Using tf-idf, the corpus weights w_{ic} and w_{icj} of the i-th word belonging to category c are computed in the category and in the j-th GFPS-Block respectively, giving the text corpus similarity TextSim_{jc};

Compared with the prior art, the present invention has the following advantageous effects:

Compared with bulky physical templates, the present invention provides a lightweight template with good support for REST-style Web APIs. It combines the page tag arrangement rules with visual characteristics, removes redundant page information as far as possible, and places the effective information in different parts according to the aspect it describes, improving the readability of the data. The invention compensates for the insufficient support for RESTful Web API description information in related work; the data filtered by the system is highly readable and better meets the needs of both machines and non-professional users.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1 is a flow chart of the present invention;

Fig. 2 is the code of the page description block classification algorithm of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make a number of changes and improvements without departing from the inventive concept, all of which fall within the protection scope of the invention.

For RESTful Web API introduction pages, the present invention proposes a method for automatically acquiring API description information based on GFPS (General Feature Page Segmentation). Following the idea of the Semantic Web, it combines the tag arrangement patterns of the web page source code with the visual features of the page to extract effective RESTful Web API information, trains and classifies it, and organizes it into machine-readable description data. A feedback mechanism is added to better improve training accuracy.

As shown in Fig. 1, a service semantics acquisition method based on page information parsing provided by the present invention comprises:

Vector construction step: the three dimensions of the classification vector, namely text corpus, feature tags, and special structure, are determined, four categories are defined, and the vector expression of each category is defined; the four categories comprise: HTTP interface address, request parameters, response parameters, and response examples;

A feedback mechanism is set up for the classification of new description information, so that dominant and non-dominant feature tags are better identified for page segmentation and the classification parameters are optimized, improving accuracy.

After the user provides the source code of a page containing RESTful Web API description information, the page segmentation step is carried out first:

In the first step, tags unrelated to the description content, such as <script>, <img>, and <a>, and redundant tags without text content, such as <div> and <sector>, are removed, and each remaining tag is re-positioned.

In the second step, the dominant feature tags on the page, such as <h1>-<h6> and the dividing line <br>, are extracted and the page content is split into GFPS-Blocks. Effective content on the page that contains none of these tags is split into blocks based on non-dominant feature tags such as <p> and <table>.

In the third step, the segmentation result is stored in the GFPS-Tree. Each GFPS-Block is re-layered according to the strength of its dominant feature tag (<h1> > ... > <h6> > <br>, 7 levels in total), and its predecessor, successor, and sibling positions are located. The specific operations are shown in Algorithm 1 of Fig. 2. Line 4 finds the strongest dominant feature tag, tags[visual_index]. Line 5 determines the GFPS-Tree root node and initializes the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, the parent of its corresponding node in the source code is taken as the root node; if there are several such GFPS-Blocks, the lowest common parent of their corresponding nodes whose relative positions in the source code are consistent is taken, and the rest is treated as noise and ignored. Lines 6-9 find the GFPS-Blocks containing the dominant feature tag of each strength, and RearrangeTree() moves the subsequent adjacent GFPS-Blocks (those containing a dominant feature tag no stronger than this strength, or a non-dominant feature tag) to this layer. Line 10 evaluates the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags; rearrangeBlock() merges parts whose similarity is too high and splits parts whose similarity is too low.
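The re-layering by tag strength can be pictured with a small stack-based sketch. It is not Algorithm 1 of Fig. 2, which is not reproduced in this text: the `Node` class and `build_tree` are hypothetical names, blocks are assumed to arrive as (tag, text) pairs in page order, and blocks of equal strength are assumed to become siblings.

```python
from dataclasses import dataclass, field

# Strength levels from the description: <h1> > ... > <h6> > <br>, 7 in total.
# A smaller number means a stronger dominant tag.
STRENGTH = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6, "br": 7}


@dataclass
class Node:
    tag: str
    text: str
    children: list = field(default_factory=list)


def build_tree(blocks):
    """Layer (tag, text) blocks, given in page order, into a tree."""
    root = Node("root", "")
    stack = [(0, root)]                  # (strength level, node), strongest first
    for tag, text in blocks:
        node = Node(tag, text)
        level = STRENGTH.get(tag)
        if level is None:                # non-dominant block: leaf of the open layer
            stack[-1][1].children.append(node)
            continue
        while stack[-1][0] >= level:     # close layers that are not stronger
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root


def outline(node):
    """Nested (text, children) view, convenient for inspection."""
    return (node.text, [outline(child) for child in node.children])


tree = build_tree([
    ("h1", "API"), ("p", "intro"),
    ("h2", "Request"), ("table", "params"),
    ("h2", "Response"), ("p", "example"),
    ("h1", "Other"),
])
```

The order of each `children` list preserves page order, so predecessor, successor, and sibling positions can be read directly from a node's index in its parent's list.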

Secondly, a vector (F_{text}; F_{tag}; F_{format}) is defined for each GFPS-Block and for the already-classified data, where F_{text} is the set of corpus words it contains, and F_{tag} and F_{format} record whether feature tags and special structures are present, respectively.
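As an illustration of the vector (F_{text}; F_{tag}; F_{format}), the sketch below builds one for a single block. The tokenizer, the choice of `table` as the feature tag, and the URI, XML, and JSON detection rules are assumptions made for the example; the description does not specify them.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

FEATURE_TAGS = {"table"}   # assumed: parameter lists are usually laid out as tables


@dataclass(frozen=True)
class BlockVector:
    text: frozenset        # F_text: set of words in the block
    tag: bool              # F_tag: whether a feature tag is present
    fmt: Optional[str]     # F_format: "json", "xml", "uri", or None


def detect_format(text):
    """Rough detection of a special structure in the block text."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    if re.fullmatch(r"<(\w+)[^>]*>.*</\1>", stripped, re.S):
        return "xml"
    if re.search(r"https?://\S+", stripped):
        return "uri"
    return None


def build_vector(tag, text):
    words = frozenset(re.findall(r"[a-z0-9_]+", text.lower()))
    return BlockVector(text=words, tag=tag in FEATURE_TAGS, fmt=detect_format(text))


vector = build_vector("table", "address string required")
```

A block such as `GET https://example.com/v1/geocode` (a made-up URL) would yield `fmt == "uri"`, which is the kind of signal the endpoint category relies on.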

The present invention divides effective RESTful Web API description information into four categories: HTTP interface address, request parameters, response parameters, and response examples. The vectors corresponding to the four categories are shown in Formulas 1-1 to 1-4.

V_{endpoint} = (F_{text}; 0; F_{uri}), (1-1)

V_{input_param} = (F_{text}; F_{tag}; 0), (1-2)

V_{output_param} = (F_{text}; F_{tag}; 0), (1-3)

V_{output_response} = (F_{text}; 0; F_{xml|json}), (1-4)

Finally, vector values are computed for each GFPS-Block and for the already-classified data. Using tf-idf, the corpus weights w_{ic} and w_{icj} of the i-th word belonging to category c are computed in the category and in the j-th GFPS-Block respectively, giving the text corpus similarity TextSim_{jc}, as shown in Formula 2-1. Formulas 2-2 and 2-3 compute the feature tag similarity and the special structure similarity, respectively. To obtain the final similarity between the j-th GFPS-Block and category c, as shown in Formula 2-5, the idea of K-fold cross-validation is used: the existing description data set is divided into K parts, K-1 parts are chosen each time to train the model, and the remaining part is used for validation, where the function Suc() indicates whether a piece of description information was successfully assigned to the category it belongs to. The parameters under which the most description information is classified successfully are chosen as the weights of the three dimensions of the classification vector, i.e. x, y, and z in Formula 2-4. After the classification results are obtained, description information confirmed as accurate by manual inspection is added to the data set.

TagSim_{jc} = exist(some_tag), (2-2)

FormatSim_{jc} = exist(some_format), (2-3)

Sim_{jc} = x × TextSim_{jc} + y × TagSim_{jc} + z × FormatSim_{jc}, (2-4)

c_i ∈ {endpoint, input_param, output_param, output_response},
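Formula 2-4 and the threshold rule can be worked through in a few lines. The sketch assumes that the three component similarities have already been computed for each category; the weights, the threshold, and the numbers are made up for illustration.

```python
def combined_similarity(text_sim, tag_sim, format_sim, weights):
    """Formula 2-4: Sim_jc = x * TextSim_jc + y * TagSim_jc + z * FormatSim_jc."""
    x, y, z = weights
    return x * text_sim + y * tag_sim + z * format_sim


def classify_block(component_sims, weights, threshold):
    """Return (category, scores); category is None when no score exceeds the threshold.

    component_sims: {category: (TextSim, TagSim, FormatSim)} for one GFPS-Block.
    """
    scores = {
        category: combined_similarity(*sims, weights)
        for category, sims in component_sims.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores


# Illustrative values for one block that contains a URI but no table.
sims = {
    "endpoint": (0.2, 0, 1),
    "input_param": (0.3, 1, 0),
    "output_param": (0.1, 1, 0),
    "output_response": (0.1, 0, 0),
}
label, scores = classify_block(sims, weights=(0.5, 0.2, 0.3), threshold=0.3)
```

With these numbers the endpoint score is 0.5 × 0.2 + 0.2 × 0 + 0.3 × 1 = 0.4, the largest of the four and above the threshold, so the block is assigned to `endpoint`; raising the threshold to 0.5 would leave it unclassified.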

Based on the above service semantics acquisition method based on page information parsing, the present invention also provides a service semantics acquisition system based on page information parsing, comprising:

Vector construction module: the three dimensions of the classification vector, namely text corpus, feature tags, and special structure, are determined, four categories are defined, and the vector expression of each category is defined; the four categories comprise: HTTP interface address, request parameters, response parameters, and response examples.

Those skilled in the art will appreciate that, besides implementing the system provided by the invention and its devices, modules, and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. The system provided by the invention and its devices, modules, and units can therefore be regarded as a hardware component, and the devices, modules, and units it contains for realizing various functions can be regarded as structures within the hardware component; the devices, modules, and units for realizing various functions can also be regarded both as software modules implementing the method and as structures within the hardware component.

Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the specific embodiments above; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. Where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.

Claims

1. A service semantics acquisition method based on page information parsing, characterized by comprising:

A page segmentation step: combining the tag arrangement patterns of the web page source code with the visual characteristics of the page, summarizing feature tags and non-feature tags, dividing the page content into blocks according to dominant and non-dominant feature tags, defining the blocks as GFPS-Blocks, and building a GFPS-Tree structure to store these GFPS-Blocks;

A vector construction step: determining the three dimensions of the classification vector, namely text corpus, feature tags, and special structure, defining four categories, and defining the vector expression of each category;

A vector weight determination step: determining, by cross-validation on the existing data set, the weight of each vector dimension in the distance calculation; when the distance is the largest and above a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, thereby achieving classification.

2. The service semantics acquisition method based on page information parsing according to claim 1, characterized in that the page segmentation step specifically comprises:

Step 101: removing tags unrelated to the description content and redundant tags without text content, and re-positioning each tag;

Step 102: extracting the dominant feature tags on the page and dividing the page content into blocks; effective content on the page that contains no dominant feature tag is divided into blocks based on non-dominant feature tags;

Step 103: storing the segmentation result in the GFPS-Tree, re-layering each GFPS-Block according to the strength of its dominant feature tag, and locating its predecessor, successor, and sibling positions.

3. The service semantics acquisition method based on page information parsing according to claim 2, characterized in that step 103 specifically comprises:

Finding the strongest dominant feature tag, determining the GFPS-Tree root node, and initializing the GFPS-Tree: if only one GFPS-Block contains the strongest dominant feature tag, the parent of its corresponding node in the source code is taken as the root node; if there are several such GFPS-Blocks, the lowest common parent of their corresponding nodes whose relative positions in the source code are consistent is taken, and the rest is treated as noise and ignored;

Finding the GFPS-Blocks containing the dominant feature tag of each strength, and moving the subsequent adjacent GFPS-Blocks containing a dominant feature tag no stronger than this strength, or a non-dominant feature tag, to this layer;

Evaluating the text and structural similarity of adjacent GFPS-Blocks containing non-dominant feature tags, merging parts whose similarity is too high, and splitting parts whose similarity is too low.

4. The service semantics acquisition method based on page information parsing according to claim 1, characterized in that the four categories in the vector construction step comprise: HTTP interface address, request parameters, response parameters, and response examples.

5. The service semantics acquisition method based on page information parsing according to claim 1, characterized in that the vector weight determination step specifically comprises:

Using the idea of K-fold cross-validation, dividing the existing description data set into K parts, choosing K-1 parts each time to train the model, and using the remaining part to verify whether a piece of description information is successfully assigned to the category it belongs to;

Choosing the parameters under which the most description information is classified successfully as the weights of the three dimensions of the classification vector; after the classification results are obtained, accurate description information is added to the data set.

6. A service semantics acquisition system based on page information parsing, characterized by comprising:

A page segmentation module: combining the tag arrangement patterns of the web page source code with the visual characteristics of the page, summarizing feature tags and non-feature tags, dividing the page content into blocks according to dominant and non-dominant feature tags, defining the blocks as GFPS-Blocks, and building a GFPS-Tree structure to store these GFPS-Blocks;

A vector construction module: determining the three dimensions of the classification vector, namely text corpus, feature tags, and special structure, defining four categories, and defining the vector expression of each category;

A vector weight determination module: determining, by cross-validation on the existing data set, the weight of each vector dimension in the distance calculation; when the distance is the largest and above a threshold, the description information of the GFPS-Block is considered to belong to the corresponding category, thereby achieving classification.

7. The service semantics acquisition system based on page information parsing according to claim 6, characterized in that the page segmentation module specifically comprises:

A preprocessing submodule: removing tags unrelated to the description content and redundant tags without text content, and re-positioning each tag;

A block division submodule: extracting the dominant feature tags on the page and dividing the page content into blocks; effective content on the page that contains no dominant feature tag is divided into blocks based on non-dominant feature tags;

A GFPS-Tree construction submodule: storing the segmentation result in the GFPS-Tree, re-layering each GFPS-Block according to the strength of its dominant feature tag, and locating its predecessor, successor, and sibling positions.

8. The service semantics acquisition system based on page information parsing according to claim 7, characterized in that the GFPS-Tree construction submodule specifically comprises:

9. The service semantics acquisition system based on page information parsing according to claim 6, characterized in that the four categories in the vector construction module comprise: HTTP interface address, request parameters, response parameters, and response examples.

10. The service semantics acquisition system based on page information parsing according to claim 6, characterized in that the vector weight determination module specifically comprises: