CN114116867A

CN114116867A - Information data identification and conversion method

Info

Publication number: CN114116867A
Application number: CN202111400390.8A
Authority: CN
Inventors: 夏正新; 王东传; 邓鹏�; 李鹏
Original assignee: Nanjing Yizhanshendeng Network Information Technology Co Ltd
Current assignee: Nanjing Yizhanshendeng Network Information Technology Co Ltd
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-03-01

Abstract

The invention discloses an information data identification and conversion method. First, target information is crawled from different websites by a crawler on the basis of a user-defined topic, and the sorting index data of each piece of information is selected. Then, the sorting index data of each piece of information is identified and uniformly processed to obtain its specific quantity value, which is stored in a database to await subsequent sorting. Finally, the information is sorted by sorting index data type according to the sorting index data stored in the database. The method divides the sorting index data into a basic data part and an order-of-magnitude part, identifies each part separately, and obtains a uniform specific quantity value that provides an effective basis for subsequent information sorting. This unified processing improves the efficiency of subsequent information sorting and the utilization rate of the crawled data.

Description

Information data identification and conversion method

Technical Field

The invention relates to the technical field of network information integration, in particular to an information data identification and conversion method.

Background

In prior-art information integration systems, a user generally defines a topic, related information is found across the whole network according to the keywords of that topic, and the integrated information is provided to a client terminal. The topic information obtained from the whole network needs to be sorted, and the sorting criteria are generally data such as reading amount, praise amount, comment amount and collection amount. However, the data formats of different websites are not uniform, and the differences lie mainly in the order-of-magnitude part. For example, the same kind of figure may be displayed as 1K, 1K+, 2.5K, 1.1w, 1w+, "1 thousand", "1 ten thousand" and so on, so a computer cannot sort the values effectively. How to unify the data formats of praise amount and similar figures from different websites so that they can be sorted afterwards is a problem that information integration systems urgently need to solve.

Disclosure of Invention

Purpose of the invention: in view of the problems in the prior art, the invention provides an information data identification and conversion method that divides the sorting index data into a basic data part and an order-of-magnitude part, identifies each part separately, and finally obtains a uniform specific quantity value, providing an effective basis for subsequent information sorting.

Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme.

An information data identification and conversion method includes the following steps:

Step S1, crawling target information from different websites with a crawler based on a custom topic, and selecting the sorting index data of each piece of information;

Step S2, identifying the sorting index data of each piece of information, performing unified processing, obtaining the specific quantity value of the sorting index data, storing the specific quantity value in a database, and waiting for subsequent sorting processing;

Step S3, performing sorting processing by sorting index data type according to the sorting index data stored in the database.

Further, the specific steps of identifying and processing the sorting index data in step S2 are as follows:

S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, storing it directly in the database;

S2.2, when the sorting index data cannot be directly converted, inputting the sorting index data into an order-of-magnitude conversion system; the order-of-magnitude conversion system divides the sorting index data into a number part and an order-of-magnitude part; for the number part, each digit is identified and the identified character string is taken as the basic data; for the order-of-magnitude part, a mapping set is first established for each of the 6 orders of magnitude from 10³ to 10⁸, each mapping set containing several keywords; the identified order-of-magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the magnitude information finally required; the identified basic data is multiplied by the magnitude information to obtain the specific quantity value of the sorting index data; the specific quantity value is stored in the database;

S2.3, when the sorting index data cannot be identified by the order-of-magnitude conversion system, transferring the sorting index data to manual processing, classifying its characters into a mapping set for training the corresponding mapping set, and thereby correcting the order-of-magnitude conversion system.
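The three sub-steps form a fallback cascade, which can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `convert_index`, the `manual_queue` list and the deliberately small suffix table are all assumptions introduced here.

```python
import re
from decimal import Decimal

# Hypothetical, deliberately tiny suffix table. The method itself describes
# six mapping sets (10^3 through 10^8), each holding several keywords.
SUFFIX_TO_MULTIPLIER = {"k": 10**3, "K": 10**3, "w": 10**4, "W": 10**4}


def convert_index(raw, manual_queue):
    """Return an integer quantity, or None if the value needs manual review."""
    text = raw.strip()

    # S2.1: the value is already a plain number, so convert it directly.
    if re.fullmatch(r"[0-9]+", text):
        return int(text)

    # S2.2: split into a number part and an order-of-magnitude part,
    # then multiply the number by the matched magnitude.
    match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*([^0-9]+)", text)
    if match and match.group(2) in SUFFIX_TO_MULTIPLIER:
        base = Decimal(match.group(1))  # Decimal avoids binary float rounding
        return int(base * SUFFIX_TO_MULTIPLIER[match.group(2)])

    # S2.3: anything unrecognised is handed over to manual processing.
    manual_queue.append(raw)
    return None
```

Calling `convert_index("2.5K", queue)` takes the S2.2 branch, while an unknown form such as `"3 bn"` ends up in `queue` for a reviewer to classify.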

Further, in step S3, after the specific sorting index values are obtained, the information is sorted by sorting index type. The sorting index types selected are reading amount and collection amount. When a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data, and sorted results are finally obtained for the reading data and the collection data respectively.

Beneficial effects:

The invention divides the sorting index data into a basic data part and an order-of-magnitude part, identifies each part separately, and finally obtains a uniform specific quantity value that provides an effective basis for sorting the information afterwards. This unified processing improves the efficiency of subsequent information sorting and the utilization rate of the crawled data.

Drawings

FIG. 1 is a flow chart of an information data identification and conversion method provided by the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings.

As shown in FIG. 1, the present invention provides an information data identification and conversion method, which comprises the following steps:

Step S1, crawling target information from different websites with a crawler based on the custom topic, and selecting the sorting index data of each piece of information.

The information is crawled from different websites by the crawler, and different websites display information in different forms. The reading amount, praise amount and comment amount are shown differently on different platforms; for example, some websites show a reading amount as 1K, 1K+, 2.5K, 1.1w, 1w+, "1 thousand", "1 ten thousand", etc. This patent takes the reading amount, praise amount, collection amount and comment amount of each piece of information as the sorting index data and processes them further as follows.

Step S2, identifying the sorting index data of each piece of information, performing unified processing to obtain the specific quantity value of the sorting index data, storing the specific quantity value in the database, and waiting for subsequent sorting processing.

For information sorting, the reading amount, praise amount, comment amount and collection amount of the information on its original platform are important sorting criteria, so the non-uniform praise amounts and reading amounts need to be unified before the related sorting can be performed. The specific unified processing comprises the following steps:

S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, it is stored directly in the database.

S2.2, when the sorting index data cannot be directly converted, inputting the sorting index data into an order-of-magnitude conversion system; the order-of-magnitude conversion system divides the sorting index data into a number part and an order-of-magnitude part; for the number part, each digit is identified and the identified character string is taken as the basic data; for the order-of-magnitude part, a mapping set is first established for each of the 6 orders of magnitude from 10³ to 10⁸, each mapping set containing several keywords; the identified order-of-magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the magnitude information finally required; the identified basic data is multiplied by the magnitude information to obtain the specific quantity value of the sorting index data; the specific quantity value is stored in the database.

For the 10³ level, assume that the thousands-level keyword information is ["k", "K", "thousand"]. When the sorting index data of a piece of information is acquired, it is checked in turn whether the data contains one of the elements of the keyword information. For example, if the acquired reading amount is 5.3k, k is found in the thousands-level keyword list, and the data is divided into 5.3 and k. After the division, the number before k is multiplied by 1000 to give 5.3 × 1000 = 5300, and the value 5300 is stored in the database.

Similarly, for the 10⁴ level, the mapping set becomes ["w", "W", "ten thousand"]. During judgment, it is checked in turn whether the order-of-magnitude part is one of these keywords; once it is confirmed to be at the 10⁴ level, the number on the left is multiplied by 10000 to obtain the final value, which is stored in the database. Data at other levels, such as the ten-million level, is processed in the same way.
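The per-level mapping sets and the split-then-multiply step can be sketched as below. The text suggests only the 10³ and 10⁴ keyword lists. Every other entry, the Chinese characters 千 and 万 standing in for "thousand" and "ten thousand", and all function names are assumptions made purely for illustration.

```python
import re
from decimal import Decimal

# Exponent -> keyword set. Only the 10^3 and 10^4 sets are suggested by the
# text; the remaining entries are hypothetical placeholders.
MAPPING_SETS = {
    3: {"k", "K", "千"},
    4: {"w", "W", "万"},
    5: {"十万"},
    6: {"M", "百万"},
    7: {"千万"},
    8: {"亿"},
}


def find_exponent(order_part):
    """Compare the order part with every mapping set, one level at a time."""
    for exponent, keywords in MAPPING_SETS.items():
        if order_part in keywords:
            return exponent
    return None


def to_quantity(text):
    """Convert e.g. '5.3k' to 5300; return None when no keyword matches."""
    match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)\s*([^0-9]+)", text.strip())
    if match is None:
        return None
    base, order_part = match.groups()
    exponent = find_exponent(order_part)
    if exponent is None:
        return None
    # Decimal keeps 5.3 * 1000 exact instead of relying on float rounding.
    return int(Decimal(base) * 10**exponent)
```

With these sets, `to_quantity("5.3k")` splits the input into `5.3` and `k`, finds `k` in the 10³ set and returns 5300, mirroring the worked example above.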

When the data takes a form such as 1.5k+, only the calculation 1.5 × 1000 = 1500 is used as the final sorting index value, and the influence of the omitted remainder on the sorting is ignored.
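One way to realise this rule is to strip the trailing plus sign before conversion, so that the stored value is the displayed lower bound. The sketch below handles only the thousands level to stay short; `lower_bound_value` and `THOUSAND_KEYWORDS` are hypothetical names.

```python
from decimal import Decimal

THOUSAND_KEYWORDS = ("k", "K")  # assumed thousands-level keywords


def lower_bound_value(text):
    """Drop a trailing '+' and convert what remains, ignoring the omitted part."""
    cleaned = text.strip().rstrip("+").rstrip()
    if cleaned and cleaned[-1] in THOUSAND_KEYWORDS:
        return int(Decimal(cleaned[:-1]) * 1000)
    return int(Decimal(cleaned))
```

The same stripping step could precede a lookup over all six mapping sets; the design choice is simply that "1.5k+" and "1.5k" sort as equal.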

S2.3, order-of-magnitude characters that cannot be matched with any keyword are classified into a mapping set by manual expansion, for training the corresponding mapping set. If some information uses new order-of-magnitude characters, these characters are likewise added to a mapping set manually, so that the mapping sets become more complete and the order-of-magnitude conversion system is corrected.
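The manual expansion loop might look like the following sketch, in which unmatched order-of-magnitude characters are collected for a reviewer and then added to the mapping set the reviewer chooses. The class `MagnitudeTable` and its method names are assumptions, not part of the described system.

```python
class MagnitudeTable:
    """Mapping sets plus a queue of characters awaiting manual classification."""

    def __init__(self, mapping_sets):
        # Copy the sets so later additions do not mutate the caller's data.
        self.mapping_sets = {e: set(k) for e, k in mapping_sets.items()}
        self.pending = []

    def exponent_for(self, order_part):
        """Return the matching exponent, or queue the characters for review."""
        for exponent, keywords in self.mapping_sets.items():
            if order_part in keywords:
                return exponent
        if order_part not in self.pending:
            self.pending.append(order_part)
        return None

    def classify(self, order_part, exponent):
        """Record a reviewer's decision that order_part means 10**exponent."""
        self.mapping_sets.setdefault(exponent, set()).add(order_part)
        if order_part in self.pending:
            self.pending.remove(order_part)
```

After `classify` is called for a new suffix, later lookups of the same suffix succeed without further manual work, which is how the mapping sets become more complete over time.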

After most of the data has been processed, the database sorts each piece of information according to its specific quantity value. If some special data cannot be converted through the above steps, it is set to null.
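The text does not say where null values fall in the sorted result. As one possible arrangement, the sketch below stores values in an in-memory SQLite table and orders them from largest to smallest with nulls last; the table name `info`, its columns and the function `rank_by_reads` are illustrative assumptions.

```python
import sqlite3


def rank_by_reads(rows):
    """rows: iterable of (title, reads), where reads may be None (null)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE info (title TEXT, reads INTEGER)")
    conn.executemany("INSERT INTO info VALUES (?, ?)", rows)
    # 'reads IS NULL' evaluates to 0 or 1, so non-null rows come first;
    # within them, larger values rank higher.
    ranked = conn.execute(
        "SELECT title FROM info ORDER BY reads IS NULL, reads DESC"
    ).fetchall()
    conn.close()
    return [title for (title,) in ranked]
```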

After the specific sorting index values are obtained, the information is sorted by sorting index type. The sorting index types selected are reading amount and collection amount. When a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data, and sorted results are finally obtained for the reading data and the collection data respectively.
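The ratio-based estimate can be illustrated as follows. The text does not give a value for the preset ratio, so the parameter `reads_per_collection` and the figure used in the usage note are placeholders, and rounding to the nearest integer is likewise an assumption.

```python
def fill_missing(reads, collections, reads_per_collection):
    """Estimate whichever of the two values is missing from the preset ratio."""
    if reads is None and collections is not None:
        reads = round(collections * reads_per_collection)
    elif collections is None and reads is not None:
        collections = round(reads / reads_per_collection)
    # If both are missing there is nothing to estimate from.
    return reads, collections
```

For example, with an assumed ratio of 20 reads per collection, `fill_missing(None, 50, 20)` yields an estimated reading amount of 1000, after which the reading and collection rankings can each be computed over complete data.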

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. An information data identification and conversion method is characterized by comprising the following steps:

2. The method as claimed in claim 1, wherein the step of identifying and processing the sorting index data in step S2 comprises the following steps:

3. The method as claimed in claim 1, wherein in step S3, after the specific sorting index values are obtained, the information is sorted by sorting index type; the sorting index types selected are reading amount and collection amount; when a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data, and sorted results are finally obtained for the reading data and the collection data respectively.