CN112328946A - Method and system for automatically generating Tibetan language webpage abstract - Google Patents

Method and system for automatically generating Tibetan language webpage abstract

Info

Publication number
CN112328946A
Authority
CN
China
Prior art keywords
tibetan
webpage
tibetan language
abstract
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011433753.3A
Other languages
Chinese (zh)
Inventor
更太加
魏建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qinghai Nationalities University
Original Assignee
Qinghai Nationalities University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qinghai Nationalities University
Priority to CN202011433753.3A
Publication of CN112328946A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention provides a method and a system for automatically generating a Tibetan language webpage abstract. Based on a comparative analysis of Chinese word segmentation and Tibetan language word segmentation, and with overall optimization in mind, the method performs word frequency statistics on the Tibetan language webpage text according to the Tibetan word segmentation result, calculates a weight for each sentence of the text from the word frequencies, outputs an initial abstract according to the sentence weights, and smooths the initial abstract to obtain the abstract of the Tibetan language webpage.

Description

Method and system for automatically generating Tibetan language webpage abstract
Technical Field
The application relates to the technical field of network security, in particular to a method and a system for automatically generating a Tibetan language webpage abstract.
Background
In 1997, at the 33rd meeting of the second technical committee of the International Organization for Standardization and of its second working group, held in Greece, a resolution on Tibetan character encoding was adopted after discussion, and the Tibetan encoding standard was born. With the spread of Tibetan encoding, the volume of Tibetan language information on the network has grown dramatically. How to find the needed information within this mass of information, and how to quickly read the new information that emerges every day, are urgent problems.
Tibetan script has its own characteristics and is unique in character structure and grammar, so the abstract extraction methods used for English and Chinese cannot be applied directly; abstract extraction must instead build on research into Tibetan text processing.
Therefore, there is a need for a method and system for automatically generating a Tibetan language webpage abstract.
Disclosure of Invention
The invention aims to provide a method and a system for automatically generating a Tibetan language webpage abstract. Taking the characteristics of the Tibetan language into account and based on a comparative analysis of Chinese word segmentation and Tibetan language word segmentation, the method performs word frequency statistics on the Tibetan language webpage text according to the Tibetan word segmentation result, calculates a weight for each sentence of the text from the word frequencies, outputs an initial abstract according to the sentence weights, and smooths the initial abstract to obtain the abstract of the Tibetan language webpage.
In a first aspect, the present application provides a method for automatically generating a Tibetan language webpage abstract, where the method includes:
using a Tibetan web crawler tool to crawl training and testing samples for a Tibetan language webpage abstract system, wherein the Tibetan web crawler tool: extracts a first hyperlink by an extraction program according to i = instr(downloaded Tibetan webpage, "href=") and j = instr(i, ">"), where i and j are storage variables and instr is a character search function; checks the length of the downloaded Tibetan webpage and, if the length is less than 1, crawls the next Tibetan webpage; and, if the hyperlink is not an absolute path, converts it into an absolute path and determines whether the hyperlink is already in the database;
the Tibetan language webpage abstract system comprises: a Tibetan language webpage abstract training corpus module, a Tibetan language webpage purification and text word segmentation module, a Tibetan language webpage text word and sentence weight calculation module, and a Tibetan language webpage text abstract generation module;
the corpus downloaded from the server is used for extracting Tibetan language webpage abstracts and covers one or more of: news, politics, economics, literature, history, educational travel, information technology, medicine, religion, and gosa;
removing noise from the crawled Tibetan language webpages to obtain their plain text, and then automatically segmenting the text into words;
performing word frequency statistics on the segmented Tibetan language webpage text, and calculating the weight of each sentence of the text;
and, after the sentences of the Tibetan language webpage text are sorted by weight, setting a threshold for abstract extraction, extracting an initial abstract of the Tibetan language webpage according to the threshold, restoring the sentences of the initial abstract to their original text order, then smoothing the initial abstract, and outputting the Tibetan language webpage abstract.
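The hyperlink extraction step of the crawler can be illustrated with a minimal Python sketch. It is not the applicant's code: `str.find` stands in for `instr`, an in-memory set named `seen` stands in for the database lookup, and the function name `extract_links` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def extract_links(page, base_url, seen):
    """Return new absolute links found in `page`, adding them to `seen`."""
    links = []
    if len(page) < 1:                      # empty download: move on to the next page
        return links
    pos = 0
    while True:
        i = page.find("href=", pos)        # counterpart of i = instr(page, "href=")
        if i == -1:
            break
        j = page.find(">", i)              # counterpart of j = instr(i, ">")
        if j == -1:
            break
        raw = page[i + len("href="):j].strip()
        if raw[:1] in ('"', "'"):          # quoted value: cut at the closing quote
            end = raw.find(raw[0], 1)
            raw = raw[1:end] if end != -1 else raw[1:]
        else:                              # unquoted value: cut at first whitespace
            parts = raw.split()
            raw = parts[0] if parts else ""
        pos = j + 1
        if not raw:
            continue
        absolute = urljoin(base_url, raw)  # relative path -> absolute path
        if absolute not in seen:           # assumed stand-in for the database check
            seen.add(absolute)
            links.append(absolute)
    return links
```

Calling `extract_links(html, "http://example.com/index.html", set())` returns each distinct link once, already resolved against the page address.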
With reference to the first aspect, in a first possible implementation manner of the first aspect, the Tibetan language webpage crawler stores the crawled webpages: according to the storage space, a specified path is assigned to the link address (URL) of each webpage, the webpage is downloaded via DownloadFile, Tibetan language webpages with the suffix html are saved in utf-8 form, and the file names are assigned automatically incrementing numbers in a loop.
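The storage convention described here (utf-8 files with an html suffix and automatically numbered names) can be sketched as follows. The network download itself is left out, so the sketch takes HTML that has already been fetched; the function names `save_page` and `save_all` and the choice to start numbering at 1 are assumptions made for illustration.

```python
from pathlib import Path


def save_page(html, out_dir, counter):
    """Write one crawled page as UTF-8 under an automatically numbered .html name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)   # the "specified path" in storage
    target = out_dir / f"{counter}.html"
    target.write_text(html, encoding="utf-8")
    return target


def save_all(pages, out_dir, start=1):
    """Store pages in crawl order; file names are consecutive numbers."""
    return [save_page(html, out_dir, n) for n, html in enumerate(pages, start)]
```

Numbering the files rather than deriving names from URLs avoids characters that are invalid in file names, at the cost of needing a separate record that maps each number back to its URL.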
With reference to the first aspect, in a second possible implementation manner of the first aspect, after the sorting the text sentences of the Tibetan language webpage according to the weight, the method further includes: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.
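The text names four weighting bases but gives no formula for them. One way they might be combined is a linear sum, sketched below in Python. Every detail here is an assumption for illustration: the coefficients in `coeffs`, the rule that only the first and last sentences receive a location bonus, the binary cue score, and the title overlap ratio are all hypothetical.

```python
def combined_weight(tokens, position, total, title_tokens,
                    feature_weights, cue_words, coeffs=(1.0, 1.0, 1.0, 1.0)):
    """Score one segmented sentence from four hypothetical components."""
    a, b, c, d = coeffs
    # feature words: average weight of the sentence's words (unknown words count 0)
    feature = sum(feature_weights.get(t, 0.0) for t in tokens) / max(len(tokens), 1)
    # location: assumed bonus for the first and last sentence of the text
    location = 1.0 if position in (0, total - 1) else 0.0
    # cue: 1 if the sentence contains any cue word, else 0
    cue = 1.0 if any(t in cue_words for t in tokens) else 0.0
    # title: share of distinct title words that also occur in the sentence
    title_set = set(title_tokens)
    title = len(set(tokens) & title_set) / max(len(title_set), 1)
    return a * feature + b * location + c * cue + d * title
```

Keeping the four components separate makes it easy to tune or disable one of them through `coeffs` without touching the others.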
With reference to the first aspect, in a third possible implementation manner of the first aspect, the removing noise includes: reading the file through a file stream from beginning to end into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for matching delimiter pairs such as "<script" to "script>", "<" to ">", "{" to "}", and "(" to ")", and replacing the content between each pair with an empty string; replacing remaining useless information in xinxi, for example "[a-z]", "[A-Z]", ";", "-", and "enppropperty- - >", with an empty string; and defining a write stream sv and writing the cleaned text in xinxi into the storage space.
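The noise removal step can be approximated with regular expressions, as in the following Python sketch. The variable names `xinxi` and `sv` follow the text; the function names, the non-greedy matching, and the final whitespace cleanup are assumptions added for illustration.

```python
import re

# Delimiter pairs from the text; the script pair must run before the generic "<" ... ">" pair.
PAIRS = [
    (r"<script", r"script>"),
    (r"<", r">"),
    (r"\{", r"\}"),
    (r"\(", r"\)"),
]


def clean_page(xinxi):
    """Strip markup, script bodies, bracketed spans, and Latin residue from page text."""
    for start, end in PAIRS:
        # non-greedy match so each opening delimiter pairs with the nearest closing one
        xinxi = re.sub(start + r".*?" + end, "", xinxi, flags=re.DOTALL | re.IGNORECASE)
    xinxi = re.sub(r"[A-Za-z;\-]", "", xinxi)   # leftover letters, ";" and "-"
    return re.sub(r"\s+", " ", xinxi).strip()   # assumed extra step: collapse whitespace


def clean_file(src, dst):
    """Read a page from start to end, clean it, and write the result."""
    with open(src, encoding="utf-8") as fh:
        xinxi = fh.read()
    with open(dst, "w", encoding="utf-8") as sv:  # write stream named sv as in the text
        sv.write(clean_page(xinxi))
```

Deleting everything between parentheses or braces also discards any genuine body text that happens to be bracketed, so this rule trades some recall for a simpler cleaner.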
In a second aspect, the present application provides an automatic generation system for a Tibetan language webpage abstract, the system includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute, according to instructions in the program code, the method for automatically generating a Tibetan language webpage abstract according to the first aspect or any of its possible implementation manners.
In a third aspect, the present application provides a computer-readable storage medium for storing program code, where the program code is used to execute the method for automatically generating a Tibetan language webpage abstract according to the first aspect or any of its possible implementation manners.
In a fourth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for automatically generating a Tibetan language webpage abstract according to the first aspect or any of its possible implementation manners.
The invention provides a method and a system for automatically generating a Tibetan language webpage abstract. Based on a comparative analysis of Chinese word segmentation and Tibetan language word segmentation, and with overall optimization in mind, the method performs word frequency statistics on the Tibetan language webpage text according to the Tibetan word segmentation result, calculates a weight for each sentence of the text from the word frequencies, outputs an initial abstract according to the sentence weights, and smooths the initial abstract to obtain the abstract of the Tibetan language webpage.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of the summary weight calculation of the Tibetan web page of the present invention;
FIG. 2 is a flowchart of summary generation for Tibetan web pages.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the invention and so that its scope of protection is more clearly defined.
Fig. 1 is a flowchart of calculating the abstract weight of the Tibetan language webpage provided by the present application, including:
using a Tibetan web crawler tool to crawl training and testing samples for a Tibetan language webpage abstract system, wherein the Tibetan web crawler tool: extracts a first hyperlink by an extraction program according to i = instr(downloaded Tibetan webpage, "href=") and j = instr(i, ">"), where i and j are storage variables and instr is a character search function; checks the length of the downloaded Tibetan webpage and, if the length is less than 1, crawls the next Tibetan webpage; and, if the hyperlink is not an absolute path, converts it into an absolute path and determines whether the hyperlink is already in the database;
the Tibetan language webpage abstract system comprises: a Tibetan language webpage abstract training corpus module, a Tibetan language webpage purification and text word segmentation module, a Tibetan language webpage text word and sentence weight calculation module, and a Tibetan language webpage text abstract generation module;
the corpus downloaded from the server is used for extracting Tibetan language webpage abstracts and covers one or more of: news, politics, economics, literature, history, educational travel, information technology, medicine, religion, and gosa;
removing noise from the crawled Tibetan language webpages to obtain their plain text, and then automatically segmenting the text into words;
and performing word frequency statistics on the segmented Tibetan language webpage text, and calculating the weight of each sentence of the text.
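The word frequency and sentence weight steps of Fig. 1 can be sketched in Python as follows. The sketch assumes that word segmentation has already been done and that the segmented words are separated by spaces, with each sentence ending in a shad (U+0F0D). The normalization used for the sentence weight (summed word frequency divided by sentence length and by total word count) is a hypothetical choice, since the text does not give a formula.

```python
from collections import Counter

SHAD = "\u0f0d"  # Tibetan mark shad, used here as the sentence delimiter


def split_sentences(segmented_text):
    """Split space-segmented text into sentences, each a list of words."""
    return [s.split() for s in segmented_text.split(SHAD) if s.split()]


def sentence_weights(sentences):
    """Weight each sentence by the average relative frequency of its words."""
    freq = Counter(word for sent in sentences for word in sent)  # word frequency statistics
    total = sum(freq.values())
    return [sum(freq[w] for w in sent) / (len(sent) * total) for sent in sentences]
```

Dividing by sentence length keeps long sentences from winning simply because they contain more words.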
Fig. 2 is a flowchart of summary generation of a Tibetan webpage provided in the present application, including:
after the sentences of the Tibetan language webpage text are sorted by weight, setting a threshold for abstract extraction, extracting an initial abstract of the Tibetan language webpage according to the threshold, restoring the sentences of the initial abstract to their original text order, then smoothing the initial abstract, and outputting the Tibetan language webpage abstract.
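The selection flow of Fig. 2 can be sketched in Python as follows. The threshold is interpreted here as the fraction of sentences to keep, which is an assumption; the function name `initial_abstract` and the default `ratio` are hypothetical. The smoothing step is omitted because the text does not specify how it is carried out.

```python
import math


def initial_abstract(sentences, weights, ratio=0.3):
    """Pick the top-weighted sentences, then return them in original text order."""
    if not sentences:
        return []
    keep = max(1, math.ceil(len(sentences) * ratio))   # assumed meaning of the threshold
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    chosen = sorted(ranked[:keep])                     # restore original order
    return [sentences[i] for i in chosen]
```

Sorting the chosen indices before output is what keeps the abstract readable: the sentences are ranked by weight for selection but presented in the order the page wrote them.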
In some preferred embodiments, the Tibetan language webpage crawler stores the crawled webpages: according to the storage space, a specified path is assigned to the link address (URL) of each webpage, the webpage is downloaded via DownloadFile, Tibetan language webpages with the suffix html are saved in utf-8 form, and the file names are assigned automatically incrementing numbers in a loop.
In some preferred embodiments, after sorting the text sentences of the Tibetan language webpage according to the weight, the method further includes: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.
In some preferred embodiments, the removing noise comprises: reading the file through a file stream from beginning to end into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for matching delimiter pairs such as "<script" to "script>", "<" to ">", "{" to "}", and "(" to ")", and replacing the content between each pair with an empty string; replacing remaining useless information in xinxi, for example "[a-z]", "[A-Z]", ";", "-", and "enppropperty- - >", with an empty string; and defining a write stream sv and writing the cleaned text in xinxi into the storage space.
The application provides an automatic generation system for a Tibetan language webpage abstract, the system comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method for automatically generating the Tibetan language webpage abstract according to any one of the embodiments of the first aspect according to instructions in the program code.
The present application provides a computer-readable storage medium for storing a program code for executing the method for automatically generating the Tibetan language webpage abstract described in any of the embodiments of the first aspect.
The present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for automatically generating a Tibetan language webpage abstract according to any one of the embodiments of the first aspect.
In specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments of the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The same and similar parts of the various embodiments in this specification may be referred to one another. In particular, because the system, storage medium, and program product embodiments are substantially similar to the method embodiments, they are described only briefly; for the relevant points, refer to the description of the method embodiments.
The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims (7)

1. An automatic generation method of a Tibetan language webpage abstract is characterized by comprising the following steps:
using a Tibetan web crawler tool to crawl training and testing samples for a Tibetan language webpage abstract system, wherein the Tibetan web crawler tool: extracts a first hyperlink by an extraction program according to i = instr(downloaded Tibetan webpage, "href=") and j = instr(i, ">"), where i and j are storage variables and instr is a character search function; checks the length of the downloaded Tibetan webpage and, if the length is less than 1, crawls the next Tibetan webpage; and, if the hyperlink is not an absolute path, converts it into an absolute path and determines whether the hyperlink is already in the database;
the Tibetan language webpage abstract system comprises: a Tibetan language webpage abstract training corpus module, a Tibetan language webpage purification and text word segmentation module, a Tibetan language webpage text word and sentence weight calculation module, and a Tibetan language webpage text abstract generation module;
the corpus downloaded from the server is used for extracting Tibetan language webpage abstracts and covers one or more of: news, politics, economics, literature, history, educational travel, information technology, medicine, religion, and gosa;
removing noise from the crawled Tibetan language webpages to obtain their plain text, and then automatically segmenting the text into words;
performing word frequency statistics on the segmented Tibetan language webpage text, and calculating the weight of each sentence of the text;
and, after the sentences of the Tibetan language webpage text are sorted by weight, setting a threshold for abstract extraction, extracting an initial abstract of the Tibetan language webpage according to the threshold, restoring the sentences of the initial abstract to their original text order, then smoothing the initial abstract, and outputting the Tibetan language webpage abstract.
2. The method of claim 1, wherein: the Tibetan language webpage crawler stores the crawled webpages; according to the storage space, a specified path is assigned to the link address (URL) of each webpage, the webpage is downloaded via DownloadFile, Tibetan language webpages with the suffix html are saved in utf-8 form, and the file names are assigned automatically incrementing numbers in a loop.
3. The method according to any one of claims 1-2, wherein: after the Tibetan language webpage text sentences are sorted according to the weight, the method further comprises the following steps: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.
4. A method according to any one of claims 1-3, characterized in that the removing noise includes: reading the file through a file stream from beginning to end into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for matching delimiter pairs such as "<script" to "script>", "<" to ">", "{" to "}", and "(" to ")", and replacing the content between each pair with an empty string; replacing remaining useless information in xinxi, for example "[a-z]", "[A-Z]", ";", "-", and "enppropperty- - >", with an empty string; and defining a write stream sv and writing the cleaned text in xinxi into the storage space.
5. An automatic generation system for a Tibetan language webpage abstract is characterized by comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the method for automatically generating the Tibetan language webpage abstract of any one of claims 1-4 according to instructions in the program codes.
6. A computer-readable storage medium storing program code for executing the method for automatically generating the Tibetan language webpage abstract according to any one of claims 1 to 4.
7. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the automated Tibetan language web page summary generation method of any one of claims 1-4.
CN202011433753.3A 2020-12-10 2020-12-10 Method and system for automatically generating Tibetan language webpage abstract Pending CN112328946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011433753.3A CN112328946A (en) 2020-12-10 2020-12-10 Method and system for automatically generating Tibetan language webpage abstract

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011433753.3A CN112328946A (en) 2020-12-10 2020-12-10 Method and system for automatically generating Tibetan language webpage abstract

Publications (1)

Publication Number Publication Date
CN112328946A (en) 2021-02-05

Family

ID=74302279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011433753.3A Pending CN112328946A (en) 2020-12-10 2020-12-10 Method and system for automatically generating Tibetan language webpage abstract

Country Status (1)

Country Link
CN (1) CN112328946A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051913A (en) * 2021-04-09 2021-06-29 中译语通科技股份有限公司 Tibetan word segmentation information processing method, system, storage medium, terminal and application

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526841A (en) * 2017-09-19 2017-12-29 中央民族大学 A kind of Tibetan language text summarization generation method based on Web

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
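安见才让: "Research on automatic webpage summarization in a Tibetan search engine system" (藏文搜索引擎系统中网页自动摘要的研究), 《微处理机》 *
更太加: "Research on an automatic abstract generation system for Tibetan webpages" (藏文网页摘要自动生成系统的研究), 《万方》 *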

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205