CN113420554A

CN113420554A - Ancient poetry word frequency analysis method and system

Info

Publication number: CN113420554A
Application number: CN202110675786.7A
Authority: CN
Inventors: 韩珍
Original assignee: Zaozhuang Vocational College of Science and Technology
Current assignee: Zaozhuang Vocational College of Science and Technology
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2021-09-21
Anticipated expiration: 2041-06-18
Also published as: CN113420554B

Abstract

The present disclosure relates to an ancient poetry word frequency analysis method, comprising: acquiring a first data set comprising ancient poems, and constructing a first document from the first data set, wherein the first data set comprises at least M poems; performing word frequency analysis on the first document to obtain a first list representing the word frequency ordering, and establishing, according to the first list, a first mapping table from keywords in the first list to the names of the M poems in the first data set; removing function words from the first list according to function word information preset in a function word library to generate a second list, and updating the first mapping table according to the second list to form a second mapping table; screening, according to conditions preset by a user, at least one keyword in the second list that meets the preset conditions and ranks highest in word frequency, and determining the names of N poems according to the correspondence between the keyword and the second mapping table; and displaying the content of each poem according to the names of the N poems.

Description

Ancient poetry word frequency analysis method and system

Technical Field

The invention relates to an information processing method, in particular to an ancient poetry word frequency analysis method and system.

Background

Poetry here is generally understood to mean classical-style verse, namely poems (shi) and lyrics (ci); the well-known Tang poems and Song lyrics both belong to classical-style verse. Generally speaking, poems and lyrics are each considered better suited to a different kind of expression. Recorded Chinese poems originated in the pre-Qin period and flourished in the Tang Dynasty. Chinese lyrics originated in the Sui and Tang dynasties and became popular in the Song Dynasty. Chinese poetry originated among the common people and is a grassroots literature. Passed down with the culture, poetry is still well loved by the general public in China in the 21st century. Moreover, its appeal is not limited to lovers of traditional literature: for ordinary people, especially teenagers and children, exposure to traditional poetry culture is very beneficial to building national confidence and national pride. Therefore, the courses offered by many early childhood education institutions now include poetry. Poetry also makes up a considerable share of the content in electronic products such as early education story machines. However, the poems included in the popular early education story machines differ from product to product, and there is no unified standard. According to statistics, for Tang poetry alone, the Quan Tangshi (Complete Tang Poems) records 55,763 surviving works. Similarly, the Quan Song Ci (Complete Song Lyrics) records only 20,000 Song lyrics. These figures cover only works recorded at the time or later and include only the best-known poetry; a large number of less popular works may never have been recorded, although some of them may well be worth appreciating.
In addition, given the psychological development of children and teenagers and the forms and content of poems, clearly not all poems are suitable as learning and appreciation material for young people. Therefore, effective and reasonable analysis and classification of traditional Chinese ancient poems is urgently needed to guide early traditional literature education for children and teenagers.

Disclosure of Invention

In view of the foregoing problems in the prior art, one aspect of the present invention provides an ancient poetry word frequency analysis method which, through word frequency analysis, can push preset ancient poems to a user at a preset time according to conditions set by the user.

In order to achieve the purpose, the ancient poetry word frequency analysis method provided by the invention comprises the following steps:

acquiring a first data set comprising ancient poems, and constructing a first document from the first data set, wherein the first data set comprises at least M poems;

performing word frequency analysis on the first document to obtain a first list representing the word frequency ordering, and establishing, according to the first list, a first mapping table from keywords in the first list to the names of the M poems in the first data set;

removing function words from the first list according to function word information preset in a function word library to generate a second list, and updating the first mapping table according to the second list to form a second mapping table; the second mapping table at least comprises classification information corresponding to the poems;

screening, according to conditions preset by a user, at least one keyword in the second list that meets the preset conditions and ranks highest in word frequency, and determining the names of N poems according to the correspondence between the keyword and the second mapping table;

displaying the content of each poem according to the names of the N poems; M is larger than N, and M and N are both natural numbers.

In the technical scheme of the invention, the classification information corresponding to the poems follows a conventional classification scheme and is pre-stored on the device or in the cloud. The classes comprise: farewell and gifts to friends, frontier and war, travel and homesickness, odes to objects, meditations on the past and on history, scenery and lyrical expression, and landscape and pastoral.
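The function word removal step and the second mapping table can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the contents of `FUNCTION_WORDS`, the function name `remove_function_words`, the `"unclassified"` fallback and the layout of the mapping tables (keyword to list of poem names, then keyword to list of name and class pairs) are all hypothetical. "Function words" here are what the translated text also calls "virtual words".

```python
from typing import Dict, List, Set, Tuple

# Hypothetical function word library; a real one would be far larger.
FUNCTION_WORDS: Set[str] = {"之", "乎", "者", "也", "矣", "焉", "哉"}


def remove_function_words(
    first_list: List[Tuple[str, int]],
    first_mapping: Dict[str, List[str]],
    categories: Dict[str, str],
    function_words: Set[str] = FUNCTION_WORDS,
) -> Tuple[List[Tuple[str, int]], Dict[str, List[Tuple[str, str]]]]:
    """Return (second_list, second_mapping).

    second_list keeps the order of first_list but drops function words.
    second_mapping maps each remaining keyword to (poem name, class) pairs,
    so the classification information travels with the mapping table.
    """
    second_list = [(w, n) for w, n in first_list if w not in function_words]
    second_mapping = {
        w: [(name, categories.get(name, "unclassified"))
            for name in first_mapping.get(w, [])]
        for w, _ in second_list
    }
    return second_list, second_mapping


# Illustrative data only.
first_list = [("之", 5), ("月", 3), ("山", 2)]
first_mapping = {"之": ["A", "B"], "月": ["静夜思"], "山": ["鹿柴"]}
categories = {"静夜思": "travel and homesickness", "鹿柴": "landscape and pastoral"}
second_list, second_mapping = remove_function_words(first_list, first_mapping, categories)
```

Keeping the class next to each poem name means the later screening step can filter by class without a second lookup.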

Preferably, the obtaining of the first data set including the ancient poems comprises obtaining pre-stored poem information from a local database, and/or obtaining pre-stored poem information from a cloud server, and/or obtaining the poem information through a WebAPI interface.
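One possible way to combine the three sources is sketched below. The text only says the sources may be used alone or together ("and/or"); the merge order, the de-duplication by name and author, and the names `acquire_first_data_set`, `local_db`, `cloud` and `web_api` are assumptions made for illustration. The loaders are stand-ins and do not call any real service.

```python
from typing import Callable, Dict, Iterable, List

Poem = Dict[str, str]  # assumed keys: name, author, era, content


def acquire_first_data_set(sources: Iterable[Callable[[], List[Poem]]]) -> List[Poem]:
    """Merge poems from every reachable source, skipping duplicates."""
    seen = set()
    merged: List[Poem] = []
    for load in sources:
        try:
            poems = load()
        except OSError:
            continue  # source unreachable (ConnectionError is an OSError)
        for poem in poems:
            key = (poem["name"], poem["author"])
            if key not in seen:
                seen.add(key)
                merged.append(poem)
    return merged


# Stand-in loaders for a local database, a cloud server and a WebAPI.
def local_db() -> List[Poem]:
    return [{"name": "静夜思", "author": "李白", "era": "唐", "content": "床前明月光"}]


def cloud() -> List[Poem]:
    raise ConnectionError("cloud server offline")


def web_api() -> List[Poem]:
    return [
        {"name": "静夜思", "author": "李白", "era": "唐", "content": "床前明月光"},
        {"name": "春晓", "author": "孟浩然", "era": "唐", "content": "春眠不觉晓"},
    ]


first_data_set = acquire_first_data_set([local_db, cloud, web_api])
```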

Preferably, constructing a first document from the first data set comprises:

for each poem, collecting its name, author name, era and content, and connecting them with a first fixed separator to form block information; the block information further comprises block sequence information;

and sequentially connecting a plurality of pieces of block information respectively corresponding to each poem according to a second fixed separator, and storing the block information in a text form to generate a first document.
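The construction of the first document can be sketched as below. The separator characters are not specified in the text, so `"|"` and a newline are assumed here, as is storing the block sequence information as a leading integer field; `build_block` and `build_first_document` are hypothetical names. The sketch also assumes that poem fields never contain the separators.

```python
from typing import Dict, List

FIELD_SEP = "|"   # first fixed separator (assumed character)
BLOCK_SEP = "\n"  # second fixed separator (assumed character)


def build_block(seq: int, poem: Dict[str, str]) -> str:
    """Join sequence number, name, author, era and content into one block."""
    fields = [str(seq), poem["name"], poem["author"], poem["era"], poem["content"]]
    return FIELD_SEP.join(fields)


def build_first_document(poems: List[Dict[str, str]]) -> str:
    """Connect one block per poem, in order, into a single text document."""
    return BLOCK_SEP.join(build_block(i, p) for i, p in enumerate(poems))


poems = [
    {"name": "静夜思", "author": "李白", "era": "唐", "content": "床前明月光"},
    {"name": "春晓", "author": "孟浩然", "era": "唐", "content": "春眠不觉晓"},
]
first_document = build_first_document(poems)
```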

Preferably, the word frequency analysis is performed on the first document to obtain a first list representing word frequency ordering, and the method includes:

performing word segmentation processing on the first document to obtain a keyword set;

removing stop words from the keyword set, wherein the stop words at least comprise author names and eras;

and counting the word frequency in the keyword set to obtain a first list representing word frequency ordering.
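The three sub-steps above can be sketched as follows. The text does not name a segmentation algorithm; forward maximum matching against a small vocabulary is swapped in here as a simple stand-in, and a production system would more likely use a dedicated Chinese word segmenter. The vocabulary, the stop word set and the function names are hypothetical, and the document layout (fields joined by `"|"`) is an assumption.

```python
from collections import Counter
from typing import List, Set, Tuple


def segment(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest vocabulary word at each
    position, falling back to a single character."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                # Keep only CJK tokens; digits, separators and punctuation are dropped.
                if "\u4e00" <= piece[0] <= "\u9fff":
                    tokens.append(piece)
                i += size
                break
    return tokens


def word_frequency_list(
    document: str, vocabulary: Set[str], stop_words: Set[str]
) -> List[Tuple[str, int]]:
    """Segment, remove stop words, count, and sort by descending frequency."""
    tokens = [t for t in segment(document, vocabulary) if t not in stop_words]
    counts = Counter(tokens)
    # Ties are broken by the word itself so the ordering is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


document = "0|静夜思|李白|唐|床前明月光疑是地上霜举头望明月低头思故乡"
vocabulary = {"静夜思", "李白", "明月", "故乡"}
stop_words = {"李白", "唐"}  # author names and eras
first_list = word_frequency_list(document, vocabulary, stop_words)
```

Because author names and eras appear in every block, leaving them in would let them dominate the ranking, which is presumably why the text treats them as stop words.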

Preferably, establishing a first mapping table from the keywords in the first list to the names of M poems in the first data set includes:

establishing an index in the first document according to the keywords in the keyword set;

acquiring block sequence information of the key words according to the indexes;

and acquiring a first mapping table of the names of the keywords and the poems according to the block sequence information.
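A minimal sketch of the mapping step follows, assuming the first document stores one block per line with fields joined by `"|"` and the block sequence number in the first field (these layout details and the name `build_first_mapping` are assumptions). The "index" is modeled as a dictionary from keyword to the sequence numbers of the blocks whose content contains it.

```python
from typing import Dict, List, Tuple

FIELD_SEP = "|"   # assumed first fixed separator
BLOCK_SEP = "\n"  # assumed second fixed separator


def build_first_mapping(document: str, keywords: List[str]) -> Dict[str, List[str]]:
    """Map each keyword to the names of the poems whose content contains it."""
    # Parse every block into sequence number -> (poem name, content).
    blocks: Dict[int, Tuple[str, str]] = {}
    for block in document.split(BLOCK_SEP):
        seq, name, _author, _era, content = block.split(FIELD_SEP, 4)
        blocks[int(seq)] = (name, content)

    # Index: keyword -> block sequence numbers where it occurs.
    index = {
        kw: [seq for seq, (_name, content) in blocks.items() if kw in content]
        for kw in keywords
    }

    # Mapping table: resolve sequence numbers to poem names.
    return {kw: [blocks[seq][0] for seq in seqs] for kw, seqs in index.items()}


document = (
    "0|静夜思|李白|唐|床前明月光\n"
    "1|春晓|孟浩然|唐|春眠不觉晓\n"
    "2|月下独酌|李白|唐|举杯邀明月"
)
first_mapping = build_first_mapping(document, ["明月", "春"])
```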

Preferably, the counting the word frequency in the keyword set includes:

performing a cluster analysis on the set of keywords,

a first list characterizing word frequency ordering is generated based on the cluster analysis results.

Preferably, the word frequency in the keyword set is counted, and the method further comprises the step of removing the keywords with the word frequency smaller than a first preset value.
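The threshold filter is small enough to show in full. The name `drop_rare_keywords` and the list-of-pairs representation are assumptions; the rule itself (remove keywords whose frequency is smaller than the first preset value) follows the text.

```python
from typing import List, Tuple


def drop_rare_keywords(
    freq_list: List[Tuple[str, int]], first_preset_value: int
) -> List[Tuple[str, int]]:
    """Keep keywords whose frequency is at least first_preset_value."""
    return [(word, n) for word, n in freq_list if n >= first_preset_value]


kept = drop_rare_keywords([("明月", 5), ("故乡", 2), ("霜", 1)], first_preset_value=2)
```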

The ancient poetry word frequency analysis system provided by the invention comprises:

the data acquisition unit is configured to acquire a first data set comprising ancient poems and construct a first document according to the first data set, wherein the first data set at least comprises M poems;

the word frequency analysis unit is configured to perform word frequency analysis on the first document to obtain a first list representing word frequency sequencing, and according to the first list, establish a first mapping table from keywords in the first list to names of M poems in a first data set;

the information screening module is configured to remove function words from the first list according to function word information preset in a function word library to generate a second list, and update the first mapping table according to the second list to form a second mapping table; the second mapping table at least comprises classification information corresponding to the poems; and to screen, according to conditions preset by a user, at least one keyword in the second list that meets the preset conditions and ranks highest in word frequency, and determine the names of N poems according to the correspondence between the keyword and the second mapping table;

the display unit is configured to display the content of each poem according to the names of the N poems; M is larger than N, and M and N are both natural numbers.

Preferably, the poetry system further comprises a WebAPI interface, a cloud server and/or a storage unit, wherein the WebAPI interface is configured to obtain the first data set from a public API, and the cloud server and/or the storage unit is configured to store poetry information at least containing the first data set.

Preferably, the word frequency analyzing unit includes:

a word segmentation module configured to perform word segmentation processing on the first document to obtain a keyword set;

a stop word removing module configured to remove stop words from the keyword set, the stop words including at least author names and eras;

and the word frequency counting module is configured to count the word frequency in the keyword set to obtain a first list representing word frequency ordering.
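How the units might be wired together is sketched below. This is only one possible arrangement: the class name `PoetryAnalysisSystem`, the choice to model each unit as a callable, and the data shapes passed between them are all assumptions, and the stubs in the usage example return fixed data rather than doing real analysis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

FreqList = List[Tuple[str, int]]
Mapping = Dict[str, List[str]]


@dataclass
class PoetryAnalysisSystem:
    acquire: Callable[[], str]                          # data acquisition unit -> first document
    analyze: Callable[[str], Tuple[FreqList, Mapping]]  # word frequency analysis unit
    screen: Callable[[FreqList, Mapping], List[str]]    # information screening module -> N names
    display: Callable[[List[str]], None]                # display unit

    def run(self) -> List[str]:
        """Run the units in order and return the names that were displayed."""
        document = self.acquire()
        first_list, first_mapping = self.analyze(document)
        names = self.screen(first_list, first_mapping)
        self.display(names)
        return names


# Usage with stub units (illustrative data only).
shown: List[str] = []
system = PoetryAnalysisSystem(
    acquire=lambda: "0|静夜思|李白|唐|床前明月光",
    analyze=lambda doc: ([("明月", 1)], {"明月": ["静夜思"]}),
    screen=lambda freq, mapping: mapping[freq[0][0]],
    display=shown.extend,
)
result = system.run()
```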

Compared with the prior art, the ancient poetry word frequency analysis method and system provided by the invention can be applied to electronic products such as early education story machines. A large number of ancient poems can be preset in the product's built-in storage. When using the product, the user can preset themes, such as particular kinds of poems, according to the characteristics of the child's age group; a certain number of ancient poems can then be read in, randomly or in order, on each run or within a specific time period, and word frequency statistics are computed on them, so that poems matching the age group and the preset requirements are screened out for learning and appreciation.

Drawings

FIG. 1 is a flow chart of the ancient poetry word frequency analysis method of the present invention.

Fig. 2 is a word segmentation flow chart of the ancient poetry word frequency analysis method of the invention.

Fig. 3 is a schematic structural diagram of block information of the ancient poetry word frequency analysis method of the present invention.

Fig. 4 is a system block diagram of the ancient poetry word frequency analysis system of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

Various aspects and features of the present invention are described herein with reference to the drawings.

These and other characteristics of the invention will become apparent from the following description of a preferred form of embodiment, given as a non-limiting example, with reference to the accompanying drawings.

It should also be understood that, although the invention has been described with reference to some specific examples, a person of skill in the art shall certainly be able to achieve many other equivalent forms of the invention, having the characteristics as set forth in the claims and hence all coming within the field of protection defined thereby.

The above and other aspects, features and advantages of the present invention will become more apparent in view of the following detailed description when taken in conjunction with the accompanying drawings.

Specific embodiments of the present invention are described hereinafter with reference to the accompanying drawings; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention, which can be embodied in various forms. Well-known and/or repeated functions and constructions are not described in detail to avoid obscuring the invention in unnecessary or redundant detail. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the present invention in virtually any appropriately detailed structure.

The specification may use the phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments," which may each refer to one or more of the same or different embodiments in accordance with the invention.

As shown in Fig. 1, an embodiment of the present invention provides an ancient poetry word frequency analysis method, including:

S1, obtaining a first data set including ancient poems, and constructing a first document from the first data set, wherein the first data set includes at least M poems;

S2, performing word frequency analysis on the first document to obtain a first list representing the word frequency ordering, and establishing, according to the first list, a first mapping table from the keywords in the first list to the names of the M poems in the first data set;

S3, removing function words from the first list according to function word information preset in a function word library to generate a second list, and updating the first mapping table according to the second list to form a second mapping table; the second mapping table at least comprises classification information corresponding to the poems;

S4, screening, according to conditions preset by a user, at least one keyword in the second list that meets the preset conditions and ranks highest in word frequency, and determining the names of N poems according to the correspondence between the keyword and the second mapping table;

S5, displaying the content of each poem according to the names of the N poems; M is larger than N, and M and N are both natural numbers. In practice, for example when the invention is applied to an early education story machine, M is generally set much larger than N: a parent who wants to explain poems to a child or appreciate them together usually needs to select only 2-3 poems each time. In current early education machines, the poetry appreciation part usually plays a fixed sequence or fixed combination of poems stored in the device. In the invention, by contrast, 100 ancient poems can for example be randomly selected each time the early education machine runs and combined into a first document, on which word frequency analysis is then performed. The purpose of the word frequency analysis is to characterize the content of the randomly obtained ancient poems, so that one or more ancient poems associated with the highest word frequency can be pushed to the user according to the conditions set by the user.
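The random selection of M poems per run and the screening of steps S4 and S5 can be sketched as follows. The text does not define what form the user's preset condition takes; here it is assumed to be a single poem class, and the second mapping table is assumed to map each keyword to (poem name, class) pairs. `sample_poems` and `screen_poems` are hypothetical names.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_poems(library: List[str], m: int, seed: Optional[int] = None) -> List[str]:
    """Randomly pick up to m distinct poems from the stored library for this run."""
    rng = random.Random(seed)
    return rng.sample(library, min(m, len(library)))


def screen_poems(
    second_list: List[Tuple[str, int]],
    second_mapping: Dict[str, List[Tuple[str, str]]],
    wanted_class: str,
    n: int,
) -> List[str]:
    """Walk keywords from highest frequency down and return up to n poem names
    from the first keyword that has poems in the wanted class."""
    for keyword, _count in second_list:  # assumed sorted by descending frequency
        names = [name for name, cls in second_mapping.get(keyword, [])
                 if cls == wanted_class]
        if names:
            return names[:n]
    return []


# Illustrative data only.
second_list = [("月", 3), ("山", 2)]
second_mapping = {
    "月": [("静夜思", "travel and homesickness"), ("月下独酌", "scenery and lyrical expression")],
    "山": [("鹿柴", "landscape and pastoral")],
}
chosen = screen_poems(second_list, second_mapping, "landscape and pastoral", n=2)
run_sample = sample_poems([f"poem{i}" for i in range(10)], m=3, seed=1)
```

In this sketch the most frequent keyword "月" has no poem in the requested class, so the search falls through to "山"; whether the method should fall through or stop at the top keyword is a design choice the text leaves open.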

In the technical scheme of the invention, the classification information corresponding to the poems follows a conventional classification scheme and is pre-stored on the device or in the cloud. The classes comprise: farewell and gifts to friends, frontier and war, travel and homesickness, odes to objects, meditations on the past and on history, scenery and lyrical expression, and landscape and pastoral. The classification information can be stored in a storage unit of the early education story machine, or in a cloud server so that it can be retrieved at any time.

In step S1 of the present invention, obtaining a first data set including ancient poems includes obtaining pre-stored poem information from a local database or a storage unit, and/or obtaining pre-stored poem information from a cloud server, and/or obtaining the poem information through a WebAPI interface. In the ancient poetry word frequency analysis system shown in Fig. 4, the WebAPI used is "Jinrishici" (Today's Poetry, https://www.jinrishici.com/).

After the first data set of ancient poems is obtained in step S1, the first document may be constructed from it as illustrated in Fig. 3: for each poem, the name, author name, era and content are collected and connected with a first fixed separator 10 to form block information, which further comprises block sequence information; the pieces of block information corresponding to the individual poems are then connected in sequence with a second fixed separator 40 and stored in text form to generate the first document. The example of Fig. 3 shows only block information 20 and block information 30; in practice, the first document may link M ancient poems in the same manner.

Furthermore, in step S2, performing word frequency analysis on the first document to obtain a first list representing word frequency ordering may specifically include: performing word segmentation processing on the first document to obtain a keyword set; removing stop words from the keyword set, wherein the stop words at least comprise author names and eras; and counting the word frequency in the keyword set to obtain a first list representing word frequency ordering.

Still further, establishing a first mapping table from the keywords in the first list to the names of the M poems in the first data set, including: establishing an index in the first document according to the keywords in the keyword set; acquiring block sequence information of the key words according to the indexes; and acquiring a first mapping table of the names of the keywords and the poems according to the block sequence information.

Preferably, counting the word frequency in the keyword set includes: performing cluster analysis on the keyword set, and generating a first list representing word frequency ordering based on the cluster analysis result. The cluster analysis in this step may specifically use K-means clustering, mean shift clustering, density-based clustering (DBSCAN), expectation-maximization (EM) clustering with a Gaussian mixture model (GMM), hierarchical clustering, or graph community detection.
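The text does not say which features of the keywords are clustered or how the clusters feed the frequency list. As a deliberately simple stand-in, and not one of the algorithms listed above, the sketch below groups keywords that share at least one character (connected components of a shared-character graph, computed with union-find) and ranks the groups by their summed frequency. The grouping criterion and the name `cluster_keywords` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_keywords(counts: Dict[str, int]) -> List[Tuple[List[str], int]]:
    """Group keywords sharing a character and rank groups by total frequency."""
    parent = {word: word for word in counts}

    def find(word: str) -> str:
        # Union-find lookup with path halving.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    # Union every keyword with the first keyword seen for each of its characters.
    owner: Dict[str, str] = {}
    for word in counts:
        for ch in word:
            if ch in owner:
                parent[find(word)] = find(owner[ch])
            else:
                owner[ch] = word

    clusters: Dict[str, List[str]] = defaultdict(list)
    for word in counts:
        clusters[find(word)].append(word)

    ranked = [(sorted(words), sum(counts[w] for w in words))
              for words in clusters.values()]
    return sorted(ranked, key=lambda c: (-c[1], c[0]))


ranked_clusters = cluster_keywords({"明月": 2, "月光": 1, "故乡": 2, "山": 1})
```

In this example "明月" and "月光" share the character "月" and are counted together, so the moon-related group outranks "故乡" even though no single keyword does.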

Further, counting the word frequency in the keyword set also includes removing the keywords whose word frequency is smaller than a first preset value.

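Preferably, the word frequency analysis unit includes a word segmentation module, a stop word removing module and a word frequency counting module, which respectively perform the word segmentation, stop word removal and word frequency counting of step S2.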

Various specific embodiments of the methods described above, including various software modules, may be implemented on the computer-readable storage media.

Various operations or functions are described above, which may be implemented as or defined as software code or instructions. Such content may be directly executable ("object" or "executable" form), source code, or difference code ("delta" or "patch" code). Software implementations of embodiments described herein may be provided via an article of manufacture having code or instructions stored therein or via a method of operating a communication interface to transmit data via the communication interface. A machine or computer-readable storage medium may cause a machine to perform the functions or operations described, and includes any mechanism for storing information in a form accessible by a machine (e.g., a computing device, an electronic system, etc.), such as recordable/non-recordable media (e.g., Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). A communication interface includes any mechanism that interfaces to any of a hardwired, wireless, optical, etc. medium to communicate with another device, such as a memory bus interface, a processor bus interface, an internet connection, a disk controller, etc. The communication interface may be configured by providing configuration parameters and/or transmitting signals to prepare the communication interface to provide data signals describing the software content. The communication interface may be accessed via one or more commands or signals sent to the communication interface.

The present invention also relates to a system for performing the operations herein. The system may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer-readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.

The above embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and the scope of the present invention is defined by the claims. Various modifications and equivalents may be made by those skilled in the art within the spirit and scope of the present invention, and such modifications and equivalents should also be considered as falling within the scope of the present invention.

Claims

1. An ancient poetry word frequency analysis method, comprising the following steps:

2. The method of claim 1, wherein obtaining the first data set including ancient poems comprises obtaining pre-stored poem information from a local database, and/or obtaining pre-stored poem information from a cloud server, and/or obtaining the poem information through a WebAPI interface.

3. The method of claim 1, wherein constructing a first document from the first data set comprises:

4. The method of claim 1, wherein performing word frequency analysis on the first document to obtain a first list representing word frequency ordering comprises:

5. The method of claim 1, wherein establishing a first mapping table from keywords in the first list to the names of M poems in the first data set comprises:

acquiring block sequence information of the key words according to the indexes;

6. The method of claim 4, wherein counting word frequencies in the keyword set comprises:

performing a cluster analysis on the set of keywords,

7. The method of claim 4, wherein the word frequency of the keyword set is counted, and further comprising removing the keywords with the word frequency less than a first predetermined value.

8. An ancient poetry word frequency analysis system, comprising:

9. The system of claim 8, further comprising a WebAPI interface configured to obtain the first data set from a public API, and a cloud server and/or a storage unit configured to store poetry information including at least the first data set.

10. The system of claim 8, wherein the word frequency analysis unit comprises: