CN112328946A - Method and system for automatically generating Tibetan language webpage abstract - Google Patents
Method and system for automatically generating Tibetan language webpage abstract
- Publication number
- CN112328946A (application CN202011433753.3A)
- Authority
- CN
- China
- Prior art keywords
- tibetan
- webpage
- tibetan language
- abstract
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention provides a method and a system for automatically generating a Tibetan language webpage abstract. Based on a comparative analysis of Chinese word segmentation and Tibetan word segmentation, and with overall optimization in mind, the method performs word frequency statistics on the Tibetan webpage text according to the Tibetan word segmentation result, calculates a weight for each sentence of the text from the word frequencies, outputs an initial abstract according to the sentence weights, and smooths the initial abstract to obtain the abstract of the Tibetan language webpage.
Description
Technical Field
The application relates to the technical field of network security, in particular to a method and a system for automatically generating a Tibetan language webpage abstract.
Background
In 1997, at the 33rd meeting of the International Organization for Standardization held in Greece (the conference of the second technical committee and of its second working group), a resolution on Tibetan character coding was passed after discussion, and the Tibetan coding standard dates from that meeting. With the popularization of Tibetan coding, the amount of Tibetan-language information on the network has increased dramatically. How to find the needed information in this mass of information, and how to quickly read the new information that emerges every day, are urgent problems.
Tibetan script has its own characteristics and is unique in character structure and grammar, so abstract extraction methods developed for English and Chinese cannot be applied directly; abstract extraction has to be built on research into Tibetan text processing.
Therefore, there is a need for a method and system for automatically generating a Tibetan language webpage abstract.
Disclosure of Invention
The invention aims to provide a method and a system for automatically generating a Tibetan language webpage abstract. By comparing and analyzing Chinese word segmentation and Tibetan word segmentation, and taking the characteristics of Tibetan into account, the method performs word frequency statistics on the Tibetan webpage text according to the Tibetan word segmentation result, calculates the weight of each sentence of the text from the word frequencies, outputs an initial abstract according to the sentence weights, and smooths the initial abstract to obtain the abstract of the Tibetan language webpage.
In a first aspect, the present application provides a method for automatically generating a Tibetan language webpage abstract, where the method includes:
using a Tibetan web crawler tool to crawl the training and testing samples for a Tibetan webpage abstract system, wherein the Tibetan web crawler tool: extracts the first hyperlink with an extraction program according to i = instr(downloaded Tibetan webpage, "href=") and j = instr(i, ">"), where i and j are storage variables and instr is a character search function; checks the length of the downloaded Tibetan webpage and, if the length is less than 1, fetches the next Tibetan webpage; and, if the hyperlink is not an absolute path, converts it to an absolute path and then checks whether the hyperlink is already in the database;
the Tibetan language webpage abstract system comprises: the Tibetan language webpage abstract training corpus module, the Tibetan language webpage purifying and Tibetan language webpage text word segmentation module, the Tibetan language webpage text word and sentence weight calculation module and the Tibetan language webpage text abstract generating module;
the corpus downloaded from the server is used for extracting the Tibetan language webpage abstract, and the corpus comprises: one or more of news, politics, economics, literature, history, educational travel, information technology, medicine, religion, and gosa;
removing noise from the crawled Tibetan webpages to generate a Tibetan webpage text form, and then automatically segmenting the texts;
performing word frequency statistics on the segmented Tibetan webpage text, and calculating the weight of each sentence of the text;
and, after sorting the sentences of the Tibetan webpage text by weight, setting a threshold for abstract extraction, extracting an initial abstract of the Tibetan webpage according to the threshold, restoring the extracted sentences to their original text order, smoothing the initial abstract, and outputting the Tibetan language webpage abstract.
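The link-extraction step of the crawler described in the first aspect can be sketched as follows. This is a minimal illustration, not the applicant's implementation: Python's `str.find` stands in for `instr`, a `set` stands in for the link database, and the function names `extract_links` and `new_links` are hypothetical. The sketch also loops over every `href=` occurrence, whereas the text only describes extracting the first hyperlink.

```python
from urllib.parse import urljoin


def extract_links(page, base_url):
    """Collect absolute hyperlinks from raw HTML using plain substring search."""
    links = []
    pos = 0
    while True:
        i = page.find("href=", pos)   # analogue of i = instr(page, "href=")
        if i == -1:
            break
        j = page.find(">", i)         # analogue of j = instr(i, ">")
        if j == -1:
            break
        value = page[i + len("href="):j].strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep only the text up to the closing quote.
            end = value.find(value[0], 1)
            value = value[1:end] if end != -1 else value[1:]
        elif value:
            # Unquoted value: keep the first whitespace-delimited token.
            value = value.split()[0]
        if value:
            # urljoin leaves absolute URLs unchanged and resolves relative ones.
            links.append(urljoin(base_url, value))
        pos = j + 1
    return links


def new_links(page, base_url, seen):
    """Return links not yet recorded in `seen` (a stand-in for the database)."""
    if len(page) < 1:                 # empty download: move on to the next page
        return []
    fresh = []
    for link in extract_links(page, base_url):
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

A production crawler would more likely use an HTML parser than substring search, since `href=` can also appear inside scripts or comments.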
With reference to the first aspect, in a first possible implementation manner of the first aspect, the Tibetan webpage crawler stores the crawled webpages: according to the storage space, it assigns a specified path to the link address (URL) of each webpage using the DownloadFile download method, downloads Tibetan webpages with the html suffix in utf-8 encoding, and saves them in a loop under automatically numbered file names.
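The storage step can be illustrated with a short sketch. It assumes the page content has already been fetched as a string, so it does not reproduce the DownloadFile call itself; the function names `save_page` and `save_all` and the numbering scheme starting at 1 are assumptions for illustration.

```python
from pathlib import Path


def save_page(html, out_dir, number):
    """Write one page as <number>.html in UTF-8 and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{number}.html"
    path.write_text(html, encoding="utf-8")
    return path


def save_all(pages, out_dir):
    """Save pages in a loop under automatically numbered file names."""
    return [save_page(html, out_dir, n) for n, html in enumerate(pages, start=1)]
```

Writing with an explicit `encoding="utf-8"` matters here because the platform default encoding may not be able to represent Tibetan characters.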
With reference to the first aspect, in a second possible implementation manner of the first aspect, after the sorting the text sentences of the Tibetan language webpage according to the weight, the method further includes: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.
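The text names four weighting factors but gives no formulas, so the following is only one plausible way to combine them. The scoring rules, the equal coefficients, and all function names are assumptions made for illustration.

```python
def location_score(index, total):
    """Assumed rule: the first and last sentences score 1.0, others 0.5."""
    return 1.0 if index in (0, total - 1) else 0.5


def cue_score(words, cue_words):
    """Assumed rule: 1.0 if the sentence contains any cue word, else 0.0."""
    return 1.0 if any(w in cue_words for w in words) else 0.0


def title_score(words, title_words):
    """Fraction of distinct title words that also occur in the sentence."""
    title_set = set(title_words)
    if not title_set:
        return 0.0
    return len(set(words) & title_set) / len(title_set)


def combined_weight(feature, location, cue, title,
                    coeffs=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of the four scores; the coefficients are assumed."""
    a, b, c, d = coeffs
    return a * feature + b * location + c * cue + d * title
```

In practice the coefficients would be tuned on the training corpus rather than fixed at equal values.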
With reference to the first aspect, in a third possible implementation manner of the first aspect, the removing noise includes: reading the file through a file stream, from beginning to end, into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for spans that run from "<script", "<", "{" or "(" to the matching "script>", ">", "}" or ")", and the like, and replacing the matched content with the empty string; where xinxi still contains useless information such as "[a-z]", "[A-Z]", ";", "-" and "enppropperty- - >", replacing that information with the empty string as well; and defining a write stream sv and writing the cleaned text in xinxi to the storage space.
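The noise-removal step maps naturally onto regular expressions. The sketch below is an interpretation under stated assumptions: the pattern list is reconstructed from the delimiter pairs named above, the variable name `xinxi` is kept from the text, and `clean_text` and `clean_file` are hypothetical names. The write through `dst` plays the role of the write stream sv.

```python
import re
from pathlib import Path

# Patterns are applied in order: scripts first, then tags, then bracketed
# spans, and finally leftover Latin letters, semicolons and hyphens.
NOISE_PATTERNS = [
    re.compile(r"<script.*?script>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<[^>]*>"),
    re.compile(r"\{[^}]*\}"),
    re.compile(r"\([^)]*\)"),
    re.compile(r"[A-Za-z;\-]"),
]


def clean_text(xinxi):
    """Replace every noise match with the empty string."""
    for pattern in NOISE_PATTERNS:
        xinxi = pattern.sub("", xinxi)
    return xinxi.strip()


def clean_file(src, dst):
    """Read a page from src, clean it, and write the result to dst."""
    xinxi = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(clean_text(xinxi), encoding="utf-8")
```

Removing every Latin letter is aggressive and would also delete legitimate foreign words embedded in the Tibetan text; whether that is acceptable depends on the corpus.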
In a second aspect, the present application provides an automatic generation system for a Tibetan language webpage abstract, the system includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute, according to instructions in the program codes, the method for automatically generating the Tibetan language webpage abstract according to any one of the four possible implementations of the first aspect.
In a third aspect, the present application provides a computer-readable storage medium for storing program codes, where the program codes are used to execute the method for automatically generating the Tibetan language webpage abstract according to any one of the four possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for automatically generating the Tibetan language webpage abstract according to any one of the four possible implementations of the first aspect.
The invention thus provides a method and a system for automatically generating a Tibetan language webpage abstract. Based on a comparative analysis of Chinese and Tibetan word segmentation, and with overall optimization in mind, word frequency statistics are computed from the Tibetan word segmentation result, sentence weights are calculated from the word frequencies, an initial abstract is output according to the sentence weights, and the initial abstract is smoothed to obtain the abstract of the Tibetan language webpage.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of the summary weight calculation of the Tibetan web page of the present invention;
FIG. 2 is a flowchart of summary generation for Tibetan web pages.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the invention and so that its scope is more clearly defined.
Fig. 1 is a flowchart of calculating the abstract weight of the Tibetan language webpage provided by the present application, including:
using a Tibetan web crawler tool to crawl the training and testing samples for a Tibetan webpage abstract system, wherein the Tibetan web crawler tool: extracts the first hyperlink with an extraction program according to i = instr(downloaded Tibetan webpage, "href=") and j = instr(i, ">"), where i and j are storage variables and instr is a character search function; checks the length of the downloaded Tibetan webpage and, if the length is less than 1, fetches the next Tibetan webpage; and, if the hyperlink is not an absolute path, converts it to an absolute path and then checks whether the hyperlink is already in the database;
the Tibetan language webpage abstract system comprises: the Tibetan language webpage abstract training corpus module, the Tibetan language webpage purifying and Tibetan language webpage text word segmentation module, the Tibetan language webpage text word and sentence weight calculation module and the Tibetan language webpage text abstract generating module;
the corpus downloaded from the server is used for extracting the Tibetan language webpage abstract, and the corpus comprises: one or more of news, politics, economics, literature, history, educational travel, information technology, medicine, religion, and gosa;
removing noise from the crawled Tibetan webpages to generate a Tibetan webpage text form, and then automatically segmenting the texts;
and performing word frequency statistics on the segmented Tibetan webpage text, and calculating the weight of each sentence of the text.
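The word-frequency and sentence-weight steps of Fig. 1 can be sketched as below. The text does not state the weighting formula, so the sketch assumes one common choice: a sentence's weight is the average document-level frequency of its words. The input is assumed to be already segmented (a list of sentences, each a list of word tokens), and the function names are hypothetical.

```python
from collections import Counter


def word_frequencies(sentences):
    """Count how often each word occurs across all segmented sentences."""
    return Counter(word for sentence in sentences for word in sentence)


def sentence_weights(sentences):
    """Assumed formula: mean frequency of the words in each sentence."""
    freq = word_frequencies(sentences)
    weights = []
    for sentence in sentences:
        if not sentence:
            weights.append(0.0)   # empty sentences carry no weight
        else:
            weights.append(sum(freq[w] for w in sentence) / len(sentence))
    return weights
```

Dividing by sentence length keeps long sentences from dominating simply because they contain more words; a stop-word filter applied before counting would be a natural refinement.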
Fig. 2 is a flowchart of summary generation of a Tibetan webpage provided in the present application, including:
after sorting the sentences of the Tibetan webpage text by weight, setting a threshold for abstract extraction, extracting an initial abstract of the Tibetan webpage according to the threshold, restoring the extracted sentences to their original text order, smoothing the initial abstract, and outputting the Tibetan language webpage abstract.
In some preferred embodiments, the Tibetan webpage crawler stores the crawled webpages: according to the storage space, it assigns a specified path to the link address (URL) of each webpage using the DownloadFile download method, downloads Tibetan webpages with the html suffix in utf-8 encoding, and saves them in a loop under automatically numbered file names.
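The selection step of Fig. 2 can be sketched as follows. The threshold is interpreted here as a minimum sentence weight, which is an assumption (it could equally be a sentence count or a compression ratio). The smoothing step is not specified in the text and is left out. The function name `select_summary` is hypothetical.

```python
def select_summary(sentences, weights, threshold):
    """Pick sentences whose weight reaches the threshold, in original order."""
    # Rank sentence indices from heaviest to lightest.
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    # Keep the ranked sentences that meet the (assumed) weight threshold.
    chosen = [i for i in ranked if weights[i] >= threshold]
    # Restore the original text order so the abstract reads coherently.
    chosen.sort()
    return [sentences[i] for i in chosen]
```

For example, with weights `[0.2, 0.9, 0.5, 0.7]` and a threshold of `0.5`, the second, third and fourth sentences are kept and emitted in their original order.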
In some preferred embodiments, after sorting the text sentences of the Tibetan language webpage according to the weight, the method further includes: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.
In some preferred embodiments, the removing noise comprises: reading the file through a file stream, from beginning to end, into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for spans that run from "<script", "<", "{" or "(" to the matching "script>", ">", "}" or ")", and the like, and replacing the matched content with the empty string; where xinxi still contains useless information such as "[a-z]", "[A-Z]", ";", "-" and "enppropperty- - >", replacing that information with the empty string as well; and defining a write stream sv and writing the cleaned text in xinxi to the storage space.
The application provides a system for automatically generating a Tibetan language webpage abstract, the system comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method for automatically generating the Tibetan language webpage abstract according to any one of the embodiments of the first aspect according to instructions in the program code.
The present application provides a computer-readable storage medium for storing a program code for executing the method for automatically generating the Tibetan language webpage abstract described in any of the embodiments of the first aspect.
The present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for automatically generating a Tibetan language webpage abstract according to any one of the embodiments of the first aspect.
In specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments of the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
For identical or similar parts, the embodiments of this specification may be referred to one another. In particular, the embodiments other than the method embodiments are described briefly because they are substantially similar to the method embodiments; for the relevant points, refer to the description of the method embodiments.
The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.
Claims (7)
1. An automatic generation method of a Tibetan language webpage abstract is characterized by comprising the following steps:
using a Tibetan web crawler tool to crawl the training and testing samples for a Tibetan webpage abstract system, wherein the Tibetan web crawler tool: extracts the first hyperlink with an extraction program according to i = instr(downloaded Tibetan webpage, "href=") and j = instr(i, ">"), where i and j are storage variables and instr is a character search function; checks the length of the downloaded Tibetan webpage and, if the length is less than 1, fetches the next Tibetan webpage; and, if the hyperlink is not an absolute path, converts it to an absolute path and then checks whether the hyperlink is already in the database;
the Tibetan language webpage abstract system comprises: the Tibetan language webpage abstract training corpus module, the Tibetan language webpage purifying and Tibetan language webpage text word segmentation module, the Tibetan language webpage text word and sentence weight calculation module and the Tibetan language webpage text abstract generating module;
the corpus downloaded from the server is used for extracting the Tibetan language webpage abstract, and the corpus comprises: one or more of news, politics, economics, literature, history, educational travel, information technology, medicine, religion, and gosa;
removing noise from the crawled Tibetan webpages to generate a Tibetan webpage text form, and then automatically segmenting the texts;
performing word frequency statistics on the segmented Tibetan webpage text, and calculating the weight of each sentence of the text;
and, after sorting the sentences of the Tibetan webpage text by weight, setting a threshold for abstract extraction, extracting an initial abstract of the Tibetan webpage according to the threshold, restoring the extracted sentences to their original text order, smoothing the initial abstract, and outputting the Tibetan language webpage abstract.
2. The method of claim 1, wherein: the Tibetan webpage crawler stores the crawled webpages: according to the storage space, it assigns a specified path to the link address (URL) of each webpage using the DownloadFile download method, downloads Tibetan webpages with the html suffix in utf-8 encoding, and saves them in a loop under automatically numbered file names.
3. The method according to any one of claims 1-2, wherein: after the Tibetan language webpage text sentences are sorted according to the weight, the method further comprises the following steps: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.
4. A method according to any one of claims 1-3, characterized in that: the removing noise includes: reading the file through a file stream, from beginning to end, into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for spans that run from "<script", "<", "{" or "(" to the matching "script>", ">", "}" or ")", and the like, and replacing the matched content with the empty string; where xinxi still contains useless information such as "[a-z]", "[A-Z]", ";", "-" and "enppropperty- - >", replacing that information with the empty string as well; and defining a write stream sv and writing the cleaned text in xinxi to the storage space.
5. An automatic generation system for a Tibetan language webpage abstract is characterized by comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the method for automatically generating the Tibetan language webpage abstract of any one of claims 1-4 according to instructions in the program codes.
6. A computer-readable storage medium storing program code for executing the method for automatically generating the Tibetan language webpage abstract according to any one of claims 1 to 4.
7. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the automated Tibetan language web page summary generation method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011433753.3A CN112328946A (en) | 2020-12-10 | 2020-12-10 | Method and system for automatically generating Tibetan language webpage abstract |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112328946A (en) | 2021-02-05 |
Family
ID=74302279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011433753.3A Pending CN112328946A (en) | 2020-12-10 | 2020-12-10 | Method and system for automatically generating Tibetan language webpage abstract |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112328946A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113051913A (en) * | 2021-04-09 | 2021-06-29 | 中译语通科技股份有限公司 | Tibetan word segmentation information processing method, system, storage medium, terminal and application |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107526841A (en) * | 2017-09-19 | 2017-12-29 | 中央民族大学 | A kind of Tibetan language text summarization generation method based on Web |
Non-Patent Citations (2)
Title |
---|
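安见才让: "Research on automatic webpage summarization in a Tibetan search engine system", 《微处理机》 (Microprocessors) * |
更太加: "Research on an automatic abstract generation system for Tibetan webpages", 《万方》 (Wanfang) * |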
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113822067A (en) | Key information extraction method and device, computer equipment and storage medium | |
US8423885B1 (en) | Updating search engine document index based on calculated age of changed portions in a document | |
CN111651198B (en) | Automatic code abstract generation method and device | |
CN108416034B (en) | Information acquisition system based on financial heterogeneous big data and control method thereof | |
CN112749284A (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN111506795B (en) | Method and device for acquiring bid information | |
Barbaresi | Generic web content extraction with open-source software | |
CN111554272A (en) | Language model modeling method for Chinese speech recognition | |
Neysiani et al. | Automatic interconnected lexical typo correction in bug reports of software triage systems | |
CN113918794B (en) | Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium | |
CN112328946A (en) | Method and system for automatically generating Tibetan language webpage abstract | |
CN113868568A (en) | Webpage keyword highlighting method, device, equipment and storage medium | |
Barbaresi | htmldate: A Python package to extract publication dates from web pages | |
CN104965902A (en) | Enriched URL (uniform resource locator) recognition method and apparatus | |
CN115391711B (en) | Webpage text information extraction method, device, equipment and medium | |
CN111143642A (en) | Webpage classification method and device, electronic equipment and computer readable storage medium | |
CN113806667B (en) | Method and system for supporting webpage classification | |
CN111966930B (en) | Webpage list analyzing method and system based on XPath sequence | |
Hládek et al. | Text mining and processing for corpora creation in Slovak language | |
Rofiq | Indonesian news extractive text summarization using latent semantic analysis | |
CN113987320A (en) | Real-time information crawler method, device and equipment based on intelligent page analysis | |
CN112115266A (en) | Malicious website classification method and device, computer equipment and readable storage medium | |
CN112380337A (en) | Highlight method and device based on rich text | |
CN111950037A (en) | Detection method, detection device, electronic equipment and storage medium | |
CN106897271B (en) | News text noise removing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210205 |