CN112328946A - Method and system for automatically generating Tibetan language webpage abstract

Info

Publication number: CN112328946A
Application number: CN202011433753.3A
Authority: CN
Inventors: 更太加; 魏建国
Original assignee: Qinghai Nationalities University
Current assignee: Qinghai Nationalities University
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2021-02-05

Abstract

The invention provides a method and a system for automatically generating a Tibetan language webpage abstract. With a view to overall optimization, and based on a comparative analysis of Chinese word segmentation and Tibetan word segmentation, the method performs word frequency statistics on the Tibetan webpage text according to the Tibetan word segmentation result, calculates the weight of each sentence of the Tibetan webpage text from the word frequencies, outputs an initial abstract according to the sentence weights, and smooths the initial abstract to obtain the abstract of the Tibetan webpage.

Description

Method and system for automatically generating Tibetan language webpage abstract

Technical Field

The application relates to the technical field of network security, in particular to a method and a system for automatically generating a Tibetan language webpage abstract.

Background

In 1997, at the 33rd meeting of the second working group of the second technical committee of the International Organization for Standardization, held in Greece, a resolution on Tibetan character encoding was discussed and adopted, and the international standard encoding for Tibetan was born. With the spread of Tibetan encoding, the amount of Tibetan information on the network has grown dramatically. How to find the needed information in this mass of information, and how to quickly read the new information that appears every day, are urgent problems.

Tibetan has its own characteristics as a script and is unique in character structure and grammar, so abstract extraction methods designed for English and Chinese cannot be applied directly. Abstracts must instead be extracted on the basis of research into Tibetan text.

Therefore, there is a need for a method and system for automatically generating a Tibetan language webpage abstract.

Disclosure of Invention

The invention aims to provide a method and a system for automatically generating a Tibetan language webpage abstract. Based on the characteristics of Tibetan and a comparative analysis of Chinese word segmentation and Tibetan word segmentation, the method performs word frequency statistics on the Tibetan webpage text according to the Tibetan word segmentation result, calculates the weight of each sentence of the Tibetan webpage text from the word frequencies, outputs an initial abstract according to the sentence weights, and smooths the initial abstract to obtain the abstract of the Tibetan webpage.

In a first aspect, the present application provides a method for automatically generating a Tibetan language webpage abstract, where the method includes:

using a Tibetan web crawler tool to crawl training and test samples for a Tibetan language webpage abstract system, wherein the Tibetan web crawler tool: extracts a first hyperlink by an extraction program according to i = instr(downloaded Tibetan webpage, "href="), j = instr(i, ">"), where i and j are storage variables and instr is a character search function; checks the length of the downloaded Tibetan webpage and, if the length is less than 1, captures the next Tibetan webpage; and, if a hyperlink is not an absolute path, converts it into an absolute path and then judges whether the hyperlink is already in the database;

the Tibetan language webpage abstract system comprises: a Tibetan webpage abstract training corpus module, a Tibetan webpage purification and text word segmentation module, a Tibetan webpage text word and sentence weight calculation module, and a Tibetan webpage text abstract generation module;

the corpus downloaded from the server is used for extracting Tibetan language webpage abstracts, and the corpus covers one or more of: news, politics, economics, literature, history, educational travel, information technology, medicine, religion, and gosa;

removing noise from the crawled Tibetan webpages to produce plain Tibetan webpage text, and then automatically segmenting the text into words;

performing word frequency statistics on the segmented Tibetan webpage text, and calculating the weight of each Tibetan webpage text sentence;

and, after the Tibetan webpage text sentences are sorted by weight, setting a threshold for extracting the Tibetan webpage abstract, extracting an initial abstract of the Tibetan webpage according to the threshold, reordering the sentences of the initial abstract into their original text order, smoothing the initial abstract, and outputting the Tibetan webpage abstract.
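The sorting, threshold, and reordering steps above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the application: the function name `select_initial_abstract` is hypothetical, the threshold is assumed to be a minimum sentence weight, and the sample weights are invented. Smoothing is omitted because the text does not specify how it is performed.

```python
from typing import List, Sequence


def select_initial_abstract(
    sentences: Sequence[str],
    weights: Sequence[float],
    threshold: float,
) -> List[str]:
    """Rank sentences by weight, keep those at or above the threshold,
    then put the kept sentences back into their original text order.

    Assumption: `threshold` is a minimum weight; the source only says
    that a threshold is set.
    """
    if len(sentences) != len(weights):
        raise ValueError("sentences and weights must have the same length")
    # Sort sentence indices by descending weight; ties keep document order.
    ranked = sorted(range(len(sentences)), key=lambda i: (-weights[i], i))
    # Keep only the sentences whose weight reaches the threshold.
    chosen = [i for i in ranked if weights[i] >= threshold]
    # Restore the original text order of the selected sentences.
    chosen.sort()
    return [sentences[i] for i in chosen]


# Placeholder sentences and invented weights, for illustration only.
sentences = ["s0", "s1", "s2", "s3", "s4"]
weights = [0.2, 0.9, 0.1, 0.7, 0.4]
initial_abstract = select_initial_abstract(sentences, weights, threshold=0.4)
print(initial_abstract)  # ['s1', 's3', 's4']
```

Ranking first and reordering afterwards keeps the two concerns separate: the weights decide which sentences are kept, while the source order decides how they read.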

With reference to the first aspect, in a first possible implementation manner of the first aspect, the Tibetan webpage crawler stores each crawled webpage: according to the storage space, a specified path is allocated to the link address (URL) of the webpage, the webpage is downloaded by means of DownloadFile, Tibetan webpages with the suffix html are saved in utf-8 form, and the file names are assigned cyclically as automatic numbers.
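The crawler behaviour described in the first aspect and in this implementation manner can be sketched roughly as follows. This is a hedged sketch, not the crawler of the application: Python's `str.find` stands in for `instr`, `urllib.parse.urljoin` is used to turn relative links into absolute ones, a Python set stands in for the database, and the helper names `extract_links`, `filter_new`, and `save_page` are hypothetical. The network download itself (DownloadFile in the text) is left out so that the sketch runs offline.

```python
import tempfile
from pathlib import Path
from typing import List, Set
from urllib.parse import urljoin

HREF = "href="


def extract_links(page: str, base_url: str) -> List[str]:
    """Scan for 'href=' and the next '>' (the instr-style search in the text)."""
    links: List[str] = []
    if len(page) < 1:  # empty download: the caller moves on to the next page
        return links
    i = page.find(HREF)
    while i != -1:
        j = page.find(">", i)
        if j == -1:
            break
        value = page[i + len(HREF):j].strip()
        if value[:1] in ('"', "'"):
            # Quoted attribute: read up to the matching closing quote.
            end = value.find(value[0], 1)
            target = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted attribute: read up to the first whitespace.
            target = value.split()[0] if value else ""
        if target:
            # urljoin leaves absolute URLs unchanged and resolves relative ones.
            links.append(urljoin(base_url, target))
        i = page.find(HREF, j)
    return links


def filter_new(links: List[str], seen: Set[str]) -> List[str]:
    """Keep links not yet recorded; `seen` stands in for the database."""
    fresh: List[str] = []
    for url in links:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


def save_page(html: str, out_dir: Path, number: int) -> Path:
    """Store a page as <number>.html, encoded as UTF-8."""
    path = out_dir / f"{number}.html"
    path.write_text(html, encoding="utf-8")
    return path


# Invented sample page and URLs, for illustration only.
page = '<a href="a.html">x</a> <a class="k" href=/b.html>y</a> <a href="http://other.org/c">z</a>'
links = extract_links(page, "http://example.org/news/index.html")
fresh = filter_new(links, seen={"http://example.org/b.html"})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_page("<html>བོད་ཡིག</html>", Path(tmp), number=1)
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")
```

In a real crawler the page number would be incremented for each stored page, and the set would be replaced by a lookup in the hyperlink database.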

With reference to the first aspect, in a second possible implementation manner of the first aspect, after sorting the text sentences of the Tibetan language webpage according to the weight, the method further includes: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.
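One way the four weighting factors named above might be combined is sketched below. The text does not give a formula, so everything specific here is an assumption made for illustration: the linear combination, the coefficient values, the rule that the first and last sentences receive the location bonus, and the names `Coefficients` and `sentence_score`.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class Coefficients:
    """Hypothetical coefficients for the four factors (not from the source)."""
    feature: float = 1.0
    location: float = 0.5
    cue: float = 0.5
    title: float = 1.0


def sentence_score(
    words: Sequence[str],
    index: int,
    total: int,
    feature_words: Set[str],
    cue_words: Set[str],
    title_words: Set[str],
    c: Coefficients = Coefficients(),
) -> float:
    """Combine the four factors linearly for one segmented sentence."""
    if not words:
        return 0.0
    n = len(words)
    # Feature-word factor: share of words that are feature words.
    feature = sum(w in feature_words for w in words) / n
    # Location factor (assumed): bonus for the first and last sentence.
    location = 1.0 if index in (0, total - 1) else 0.0
    # Cue factor: 1 if the sentence contains any cue word.
    cue = 1.0 if any(w in cue_words for w in words) else 0.0
    # Title factor: share of words that also occur in the title.
    title = sum(w in title_words for w in words) / n
    return (
        c.feature * feature
        + c.location * location
        + c.cue * cue
        + c.title * title
    )


# Placeholder tokens, for illustration only.
score = sentence_score(
    words=["a", "b", "c", "d"],
    index=0,
    total=3,
    feature_words={"a", "b"},
    cue_words={"c"},
    title_words={"d"},
)
print(score)  # 1.75
```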

With reference to the first aspect, in a third possible implementation manner of the first aspect, the removing noise includes: reading a file with a file stream and placing its content, from beginning to end, into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for content running from the characters "<script", "<", "{" and "(" to the matching "script>", ">", "}" and ")", and replacing that content with empty; likewise replacing with empty any remaining useless information in xinxi, for example "[a-z]", "[A-Z]", ";", "-" and "enppropperty-->"; and defining a write stream sv and writing the cleaned text in xinxi into the storage space.
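The noise removal described above can be approximated with regular expressions, as in the following sketch. It is an assumption-laden illustration rather than the cleaning routine of the application: the patterns are one reading of the delimiter pairs listed in the text, the order in which they are applied is a choice made here, and the function names `clean_page` and `clean_file` are hypothetical. The variable name `xinxi` follows the text.

```python
import re
from pathlib import Path

# Delimited spans to delete, following the pairs listed in the text.
SCRIPT = re.compile(r"<script.*?script>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]*>")
BRACES = re.compile(r"\{[^}]*\}")
PARENS = re.compile(r"\([^)]*\)")
# Leftover Latin letters, semicolons, and hyphens treated as useless.
LEFTOVER = re.compile(r"[A-Za-z;\-]+")


def clean_page(xinxi: str) -> str:
    """Replace script blocks, tags, bracketed spans, and leftovers with empty."""
    for pattern in (SCRIPT, TAG, BRACES, PARENS, LEFTOVER):
        xinxi = pattern.sub("", xinxi)
    return xinxi.strip()


def clean_file(src: Path, dst: Path) -> None:
    """Read a page from beginning to end, clean it, and write the result."""
    xinxi = src.read_text(encoding="utf-8")
    dst.write_text(clean_page(xinxi), encoding="utf-8")


# Invented sample page, for illustration only.
raw = (
    "<html><head><script>var a = (1 > 0);</script>"
    "<style>p {color: red;}</style></head>"
    "<body><p>བོད་ཡིག</p>-- ; abc</body></html>"
)
cleaned = clean_page(raw)
print(cleaned)  # བོད་ཡིག
```

Script blocks are removed before generic tags so that the script body, which may itself contain ">" characters, is deleted as a whole rather than left behind as stray text.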

In a second aspect, the present application provides an automatic generation system for a Tibetan language webpage abstract, the system includes a processor and a memory:

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute, according to instructions in the program codes, the method for automatically generating a Tibetan language webpage abstract according to any one of the four possible implementations of the first aspect.

In a third aspect, the present application provides a computer-readable storage medium for storing program codes, where the program codes are used to execute the method for automatically generating a Tibetan language webpage abstract according to any one of the four possible implementations of the first aspect.

In a fourth aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for automatically generating a Tibetan language webpage abstract according to any one of the four possible implementations of the first aspect.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. Those skilled in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of the abstract weight calculation for Tibetan webpages according to the present invention;

FIG. 2 is a flowchart of abstract generation for Tibetan webpages.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the present invention can be more easily understood by those skilled in the art and the scope of protection of the present invention can be defined more clearly.

Fig. 1 is a flowchart of calculating the abstract weight of the Tibetan language webpage provided by the present application, including:

performing word frequency statistics on the segmented Tibetan language webpage text, and calculating the weight of each Tibetan language webpage text sentence.
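The word frequency and sentence weight calculation of Fig. 1 can be illustrated with the short sketch below. The text does not state the weight formula, so the one used here (the average frequency of the words in a sentence) is an assumption, as are the function names `word_frequencies` and `sentence_weights`. The input is assumed to be already segmented, with each sentence given as a list of words.

```python
from collections import Counter
from typing import List, Sequence


def word_frequencies(sentences: Sequence[Sequence[str]]) -> Counter:
    """Count how often each word occurs across the whole text."""
    return Counter(word for words in sentences for word in words)


def sentence_weights(sentences: Sequence[Sequence[str]]) -> List[float]:
    """Weight each sentence by the average frequency of its words.

    Assumption: averaging is one plausible choice; the source only says
    that sentence weights are calculated from the word frequencies.
    """
    freq = word_frequencies(sentences)
    weights: List[float] = []
    for words in sentences:
        if not words:
            weights.append(0.0)
        else:
            weights.append(sum(freq[w] for w in words) / len(words))
    return weights


# Placeholder tokens standing in for segmented Tibetan words.
segmented = [["w1", "w2"], ["w1", "w3", "w3"], ["w4"]]
weights = sentence_weights(segmented)
print(weights)  # [1.5, 2.0, 1.0]
```

Dividing by the sentence length keeps long sentences from being favoured simply because they contain more words.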

Fig. 2 is a flowchart of summary generation of a Tibetan webpage provided in the present application, including:

In some preferred embodiments, the Tibetan webpage crawler stores each crawled webpage: according to the storage space, a specified path is allocated to the link address (URL) of the webpage, the webpage is downloaded by means of DownloadFile, Tibetan webpages with the suffix html are saved in utf-8 form, and the file names are assigned cyclically as automatic numbers.

In some preferred embodiments, after sorting the text sentences of the Tibetan language webpage according to the weight, the method further includes: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.

In some preferred embodiments, the removing noise comprises: reading a file with a file stream and placing its content, from beginning to end, into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for content running from the characters "<script", "<", "{" and "(" to the matching "script>", ">", "}" and ")", and replacing that content with empty; likewise replacing with empty any remaining useless information in xinxi, for example "[a-z]", "[A-Z]", ";", "-" and "enppropperty-->"; and defining a write stream sv and writing the cleaned text in xinxi into the storage space.

The application provides a system for automatically generating a Tibetan language webpage abstract, the system comprising a processor and a memory:

the processor is configured to execute the method for automatically generating the Tibetan language webpage abstract according to any one of the embodiments of the first aspect according to instructions in the program code.

The present application provides a computer-readable storage medium for storing a program code for executing the method for automatically generating the Tibetan language webpage abstract described in any of the embodiments of the first aspect.

The present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method for automatically generating a Tibetan language webpage abstract according to any one of the embodiments of the first aspect.

In a specific implementation, the present invention further provides a computer storage medium that may store a program; when executed, the program may perform some or all of the steps in the embodiments of the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

The same and similar parts in the various embodiments of the present specification may be referred to each other. In particular, for the embodiments, since they are substantially similar to the method embodiments, the description is simple, and the relevant points can be referred to the description in the method embodiments.

The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims

1. An automatic generation method of a Tibetan language webpage abstract is characterized by comprising the following steps:

2. The method of claim 1, wherein: the Tibetan webpage crawler stores each crawled webpage; according to the storage space, a specified path is allocated to the link address (URL) of the webpage; the webpage is downloaded by means of DownloadFile; Tibetan webpages with the suffix html are saved in utf-8 form; and the file names are assigned cyclically as automatic numbers.

3. The method according to any one of claims 1-2, wherein: after the Tibetan language webpage text sentences are sorted according to the weight, the method further comprises the following steps: feature word-based weighting, location-based weighting, cue-based weighting, and title-based weighting.

4. A method according to any one of claims 1-3, characterized in that: the removing noise includes: reading a file with a file stream and placing its content, from beginning to end, into a specified variable xinxi; extracting the text of the Tibetan language webpage by searching xinxi for content running from the characters "<script", "<", "{" and "(" to the matching "script>", ">", "}" and ")", and replacing that content with empty; likewise replacing with empty any remaining useless information in xinxi, for example "[a-z]", "[A-Z]", ";", "-" and "enppropperty-->"; and defining a write stream sv and writing the cleaned text in xinxi into the storage space.

5. An automatic generation system for a Tibetan language webpage abstract is characterized by comprising a processor and a memory:

the processor is used for executing the method for automatically generating the Tibetan language webpage abstract of any one of claims 1-4 according to instructions in the program codes.

6. A computer-readable storage medium storing program code for executing the method for automatically generating the Tibetan language webpage abstract according to any one of claims 1 to 4.

7. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the automated Tibetan language web page summary generation method of any one of claims 1-4.