CN114116867A - Information data identification and conversion method - Google Patents

Information data identification and conversion method

Publication number
CN114116867A
Authority
CN
China
Prior art keywords
data
sorting
index data
information
order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111400390.8A
Other languages
Chinese (zh)
Inventor
夏正新
王东传
邓鹏�
李鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Yizhanshendeng Network Information Technology Co Ltd
Original Assignee
Nanjing Yizhanshendeng Network Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Yizhanshendeng Network Information Technology Co Ltd
Priority to CN202111400390.8A
Publication of CN114116867A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2291 User-Defined Types; Storage management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information data identification and conversion method. First, target information is crawled from different websites by a crawler on the basis of a user-defined theme, and the sorting index data of each piece of information is selected. Next, the sorting index data of each piece of information is identified and uniformly processed to obtain its specific quantity value, which is stored in a database to await subsequent sorting. Finally, sorting is carried out by sorting index data type, using the sorting index data stored in the database. The method divides the sorting index data into a basic data part and a magnitude part, identifies each separately, and obtains a uniform specific quantity value that provides an effective basis for subsequent information sorting. This unified processing improves the efficiency of subsequent information sorting and the utilization rate of the crawled data.

Description

Information data identification and conversion method
Technical Field
The invention relates to the technical field of network information integration, in particular to an information data identification and conversion method.
Background
In prior-art information integration systems, a user generally defines a theme, related information is found across the whole network according to the keywords of the theme, and the integrated information is provided to a client terminal. The topic information obtained from the whole network needs to be sorted. The sorting standard generally uses data such as reading amount, praise (like) amount, comment amount, and collection (favorite) amount. However, the data formats of different websites are not uniform, and the differences lie mainly in the magnitude part. For example, some websites display a reading amount as 1K, while others use formats such as 1K+, 2.5K, 1.1w, 1w+, 1 thousand, or 1 ten thousand, so a computer cannot sort the values effectively. How to unify the data formats of praise amount and similar counts across websites, so that subsequent sorting is possible, is an urgent problem for information integration systems.
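The sorting problem described above can be seen in a short sketch. The display strings below come from the examples in the text, except "980", which is a hypothetical plain count added for contrast. Sorting the raw strings compares characters rather than magnitudes, so the smallest count ends up ranked first.

```python
# Raw display strings as they might be scraped from different websites.
# "980" is a hypothetical plain count; the others are the formats named in the text.
raw_counts = ["1K", "2.5K", "1.1w", "980", "1w+"]

# A plain descending sort compares the strings character by character.
lexicographic = sorted(raw_counts, reverse=True)

# "980" (the smallest count) is ranked first and "1.1w" (11000) last.
print(lexicographic)
```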
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention provides an information data identification and conversion method that divides the sorting index data into a basic data part and a magnitude part, identifies each separately, and finally obtains a uniform specific quantity value, providing an effective basis for subsequent information sorting.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:
An information data identification and conversion method includes the following steps:
Step S1, crawling target information from different websites through a crawler based on a custom theme, and selecting the sorting index data of each piece of information;
Step S2, identifying the sorting index data of each piece of information, performing unified processing to obtain the specific quantity value of the sorting index data, storing the specific quantity value in a database, and waiting for subsequent sorting processing;
Step S3, sorting by sorting index data type according to the sorting index data stored in the database.
Further, the specific steps of identifying and processing the sorting index data in step S2 are as follows:
Step S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, storing it directly in the database;
Step S2.2, when the sorting index data cannot be directly code-converted, inputting it into an order-of-magnitude conversion system; the order-of-magnitude conversion system divides the sorting index data into a number part and a magnitude part; for the number part, each digit is identified, and the identified character string is taken as the basic data; for the magnitude part, a mapping set is first established for each of the six orders of magnitude from 10^3 to 10^8, each mapping set containing a plurality of keywords; the identified magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the magnitude information finally required; the identified basic data is multiplied by the magnitude information to obtain the specific quantity value of the sorting index data; the specific quantity value is stored in the database;
Step S2.3, when the sorting index data cannot be identified by the order-of-magnitude conversion system, transferring it to manual processing, where the unrecognized characters are classified into the corresponding mapping set to train that mapping set and thereby correct the order-of-magnitude conversion system.
Further, in step S3, after the specific sorting index data is obtained, sorting is performed by sorting index type; reading amount and collection amount are selected as the sorting index types; when a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data; finally, sorting results by reading data and by collection data are obtained respectively.
Advantageous effects:
The invention divides the sorting index data into a basic data part and a magnitude part, identifies each separately, and finally obtains a uniform specific quantity value that provides an effective basis for subsequent information sorting. This unified processing improves the efficiency of subsequent information sorting and the utilization rate of the crawled data.
Drawings
FIG. 1 is a flow chart of an information data identification and conversion method provided by the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides an information data identification and conversion method, which comprises the following steps:
Step S1, crawling target information from different websites through a crawler based on the custom theme, and selecting the sorting index data of each piece of information.
The information is crawled from different websites, and each website displays it differently. The reading amount, praise amount, and comment amount are shown in different formats on different platforms; for example, some websites display a reading amount as 1K, others as 1K+, 2.5K, 1.1w, 1w+, 1 thousand, 1 ten thousand, and so on. This patent takes the reading amount, praise amount, collection amount, and comment amount of each piece of information as the sorting index data and processes it further as follows.
Step S2, identifying the sorting index data of each piece of information, and performing unified processing to obtain the specific quantity value of the sorting index data, storing the specific quantity value into the database, and waiting for the subsequent sorting processing.
For information sorting, the reading amount, praise amount, comment amount, and collection amount of the information on its original platform are important sorting bases, so the non-uniform praise and reading amounts must be unified before sorting. The unified processing comprises the following steps:
Step S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, it is stored directly in the database.
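A minimal sketch of the direct-conversion check in step S2.1, assuming that "directly converted" means the scraped string parses as a plain number. The function name `try_direct_conversion` is hypothetical, and using Python's `float()` as the code conversion part is an assumption of this sketch, not something the text specifies.

```python
import math
from typing import Optional


def try_direct_conversion(raw: str) -> Optional[float]:
    """Return the numeric value if `raw` is a plain number, otherwise None."""
    text = raw.strip()
    try:
        value = float(text)
    except ValueError:
        # Strings such as "5.3k" or "1w+" land here and go on to step S2.2.
        return None
    # float() also accepts "nan" and "inf"; treat those as not convertible.
    if not math.isfinite(value):
        return None
    return value
```

A value returned here would be stored as is; a `None` result signals that the string should be passed to the order-of-magnitude conversion.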
Step S2.2, when the sorting index data cannot be directly code-converted, it is input into an order-of-magnitude conversion system. The order-of-magnitude conversion system divides the sorting index data into a number part and a magnitude part. For the number part, each digit is identified, and the identified character string is taken as the basic data. For the magnitude part, a mapping set is first established for each of the six orders of magnitude from 10^3 to 10^8, each mapping set containing a plurality of keywords; the identified magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the magnitude information finally required. The identified basic data is multiplied by the magnitude information to obtain the specific quantity value of the sorting index data, and the specific quantity value is stored in the database.
For the 10^3 level, assume the thousand-level keyword set is ["k", "K", "thousand"]. When the sorting index data of a piece of information is acquired, the data is checked in turn for each element of the keyword set. For example, if the acquired reading amount is 5.3k, k is found in the thousand-level keyword set, and the data is divided into 5.3 and k. After the division, the number before k is multiplied by 1000 to obtain 5.3 × 1000 = 5300, and finally 5300 is stored in the database.
Similarly, for the 10^4 level, the mapping set becomes ["w", "W", "ten thousand"]. The magnitude part is checked in turn against each keyword; once it is confirmed to belong to the 10^4 level, the number on the left is multiplied by 10000 to obtain the final value, which is stored in the database. Data at other levels, such as ten million, is processed in the same way.
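The split-and-multiply procedure for the thousand and ten-thousand levels can be sketched as follows. This is an illustration under stated assumptions: the names `MAPPING_SETS` and `convert_with_magnitude` are hypothetical, only the two levels whose keywords appear in the text are filled in, the regular expression used for splitting is one possible choice, and truncating the product to an integer count is an assumption.

```python
import re
from decimal import Decimal
from typing import Optional

# One keyword set per order of magnitude. Only the two levels described in the
# text are filled in; the remaining levels up to 10^8 would be added the same way.
MAPPING_SETS = {
    10**3: {"k", "K", "thousand"},
    10**4: {"w", "W", "ten thousand"},
}

# Number part (digits with an optional decimal fraction), then the magnitude part.
_SPLIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(.+?)\s*$")


def convert_with_magnitude(raw: str) -> Optional[int]:
    """Split `raw` into basic data and magnitude, then multiply them."""
    match = _SPLIT.match(raw)
    if match is None:
        return None
    base, magnitude_text = match.groups()
    # Compare the magnitude characters with the keywords of every mapping set.
    for multiplier, keywords in MAPPING_SETS.items():
        if magnitude_text in keywords:
            # Decimal avoids binary floating-point rounding in the product;
            # int() truncates any fractional remainder (an assumption).
            return int(Decimal(base) * multiplier)
    return None  # magnitude characters matched no keyword


print(convert_with_magnitude("5.3k"))   # 5300
print(convert_with_magnitude("1.1w"))   # 11000
```

Matching the whole magnitude part against each keyword, rather than searching for substrings, keeps "ten thousand" from being mistaken for "thousand".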
When the data carries a trailing plus sign, such as 1.5k+, only 1.5 × 1000 = 1500 is used as the final sorting index value, and the effect of the omitted remainder on the sorting is ignored.
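One way to handle the trailing plus sign is to strip it before any further conversion, as in this small sketch. The function name `strip_plus_marker` is hypothetical, and returning a flag that records whether a plus sign was present is an addition of this sketch; the text itself simply ignores the omitted remainder.

```python
from typing import Tuple


def strip_plus_marker(raw: str) -> Tuple[str, bool]:
    """Remove one trailing '+' and report whether it was present."""
    text = raw.strip()
    if text.endswith("+"):
        # "1.5k+" becomes "1.5k"; the part hidden behind '+' is ignored.
        return text[:-1].rstrip(), True
    return text, False


print(strip_plus_marker("1.5k+"))  # ('1.5k', True)
```

The cleaned string "1.5k" then converts as 1.5 × 1000 = 1500, like any other thousand-level value.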
Step S2.3, for magnitude characters that cannot be matched with any keyword, the characters are classified into a mapping set by manual expansion and used to train the corresponding mapping set. If some information uses new magnitude characters, adding them manually makes the mapping set more complete and corrects the order-of-magnitude conversion system.
After most of the data has been processed, the database carries out subsequent sorting of each piece of information according to its specific quantity value. If some special data cannot be converted through the above steps, its value is set to null.
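The manual-expansion loop and the null fallback might look like the sketch below. Everything here is illustrative: the class `MagnitudeConverter`, its `pending_review` list, and the keyword "mil" are hypothetical, and `None` stands in for the database null.

```python
import re
from decimal import Decimal
from typing import Dict, List, Optional, Set


class MagnitudeConverter:
    """Order-of-magnitude conversion with a queue for unmatched characters."""

    def __init__(self, mapping_sets: Dict[int, Set[str]]) -> None:
        self.mapping_sets = {m: set(k) for m, k in mapping_sets.items()}
        self.pending_review: List[str] = []  # magnitude characters awaiting manual review

    def convert(self, raw: str) -> Optional[int]:
        match = re.match(r"^\s*(\d+(?:\.\d+)?)\s*(.+?)\s*$", raw)
        if match is None:
            return None  # no number part at all: stored as null
        base, magnitude_text = match.groups()
        for multiplier, keywords in self.mapping_sets.items():
            if magnitude_text in keywords:
                return int(Decimal(base) * multiplier)
        # No keyword matched: hand the characters over for manual processing.
        self.pending_review.append(magnitude_text)
        return None

    def add_keyword(self, multiplier: int, keyword: str) -> None:
        """Manual expansion: assign a new keyword to a mapping set."""
        self.mapping_sets.setdefault(multiplier, set()).add(keyword)


converter = MagnitudeConverter({10**3: {"k", "K", "thousand"}})
print(converter.convert("3 mil"))      # None, and "mil" is queued for review
converter.add_keyword(10**6, "mil")    # a reviewer assigns "mil" to the 10^6 level
print(converter.convert("3 mil"))      # 3000000
```

Once a reviewer has added the new keyword, the same input converts without further manual work, which is the correction effect the step describes.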
Step S3, sorting by sorting index data type according to the sorting index data stored in the database.
After the specific sorting index data is obtained, sorting is performed by sorting index type. Reading amount and collection amount are selected as the sorting index types. When a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data. Finally, sorting results by reading data and by collection data are obtained respectively.
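The estimation and sorting in step S3 can be illustrated with a short sketch. The ratio value of 50 reads per collection, the field names `reads` and `collections`, the sample items, and the descending sort order are all assumptions made for illustration; the text only states that a preset ratio is used.

```python
from typing import Any, Dict, List

# Hypothetical preset ratio: assumed number of reads per collection.
READS_PER_COLLECTION = 50.0


def fill_missing(item: Dict[str, Any], reads_per_collection: float) -> Dict[str, Any]:
    """Estimate a missing reading or collection count from the other one."""
    filled = dict(item)
    reads, collections = filled.get("reads"), filled.get("collections")
    if reads is None and collections is not None:
        filled["reads"] = collections * reads_per_collection
    elif collections is None and reads is not None:
        filled["collections"] = reads / reads_per_collection
    return filled


def rank(items: List[Dict[str, Any]], key: str) -> List[Dict[str, Any]]:
    """Sort by one index type, largest first; items with no value go last."""
    filled = [fill_missing(i, READS_PER_COLLECTION) for i in items]
    return sorted(filled, key=lambda i: (i.get(key) is None, -(i.get(key) or 0)))


items = [
    {"title": "a", "reads": 5300, "collections": None},
    {"title": "b", "reads": None, "collections": 200},
    {"title": "c", "reads": 11000, "collections": 100},
]
print([i["title"] for i in rank(items, "reads")])        # ['c', 'b', 'a']
print([i["title"] for i in rank(items, "collections")])  # ['b', 'a', 'c']
```

Each index type produces its own ranking, so the same item can appear in different positions in the two results.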
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (3)

1. An information data identification and conversion method is characterized by comprising the following steps:
Step S1, crawling target information from different websites through a crawler based on a custom theme, and selecting the sorting index data of each piece of information;
Step S2, identifying the sorting index data of each piece of information, performing unified processing to obtain the specific quantity value of the sorting index data, storing the specific quantity value in a database, and waiting for subsequent sorting processing;
Step S3, sorting by sorting index data type according to the sorting index data stored in the database.
2. The method as claimed in claim 1, wherein the step of identifying and processing the sorting index data in step S2 comprises the following steps:
Step S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, storing it directly in the database;
Step S2.2, when the sorting index data cannot be directly code-converted, inputting it into an order-of-magnitude conversion system; the order-of-magnitude conversion system divides the sorting index data into a number part and a magnitude part; for the number part, each digit is identified, and the identified character string is taken as the basic data; for the magnitude part, a mapping set is first established for each of the six orders of magnitude from 10^3 to 10^8, each mapping set containing a plurality of keywords; the identified magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the magnitude information finally required; the identified basic data is multiplied by the magnitude information to obtain the specific quantity value of the sorting index data; the specific quantity value is stored in the database;
Step S2.3, when the sorting index data cannot be identified by the order-of-magnitude conversion system, transferring it to manual processing, where the unrecognized characters are classified into the corresponding mapping set to train that mapping set and thereby correct the order-of-magnitude conversion system.
3. The method as claimed in claim 1, wherein in step S3, after the specific sorting index data is obtained, sorting is performed by sorting index type; reading amount and collection amount are selected as the sorting index types; when a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data; finally, sorting results by reading data and by collection data are obtained respectively.
CN202111400390.8A 2021-11-19 2021-11-19 Information data identification and conversion method Pending CN114116867A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111400390.8A CN114116867A (en) 2021-11-19 2021-11-19 Information data identification and conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111400390.8A CN114116867A (en) 2021-11-19 2021-11-19 Information data identification and conversion method

Publications (1)

Publication Number Publication Date
CN114116867A true CN114116867A (en) 2022-03-01

Family

ID=80440748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111400390.8A Pending CN114116867A (en) 2021-11-19 2021-11-19 Information data identification and conversion method

Country Status (1)

Country Link
CN (1) CN114116867A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776808A (en) * 2016-11-23 2017-05-31 百度在线网络技术(北京)有限公司 Information data offering method and device based on artificial intelligence
CN110826902A (en) * 2019-10-31 2020-02-21 北京东软望海科技有限公司 Target object assessment and evaluation method and device, computer equipment and storage medium
CN110851709A (en) * 2019-10-17 2020-02-28 浙江大搜车软件技术有限公司 Information pushing method and device, computer equipment and storage medium
CN111312351A (en) * 2020-01-20 2020-06-19 和宇健康科技股份有限公司 Regional medical record data analysis method and system

Similar Documents

Publication Publication Date Title
CN110688553B (en) Information pushing method and device based on data analysis, computer equipment and storage medium
EP3745276A1 (en) Discovering a semantic meaning of data fields from profile data of the data fields
CN102483745B (en) Co-selected image classification
AU2010249253B2 (en) A method for automatically indexing documents
US8271495B1 (en) System and method for automating categorization and aggregation of content from network sites
CN106649557B (en) Semantic association mining method for defect report and mail list
CN114138784B (en) Information tracing method and device based on storage library, electronic equipment and medium
CN114676279B (en) Image retrieval method, device, equipment and computer readable storage medium
CN106484913A (en) Method and server that a kind of Target Photo determines
CN116881430A (en) Industrial chain identification method and device, electronic equipment and readable storage medium
CN114003803B (en) Method and system for discovering media account numbers of specific regions on social platform
CN105787800B (en) Intelligent social platform potential relationship retrieval device, system and method
CN104408144A (en) Detection method and device for web search keyword
CN113254572A (en) Electronic document classification supervision system based on cloud platform
CN112052310A (en) Information acquisition method, device, equipment and storage medium based on big data
CN114116867A (en) Information data identification and conversion method
CN113094444A (en) Data processing method, data processing apparatus, computer device, and medium
CN114880584B (en) Generator set fault analysis method based on community discovery
CN113869024A (en) Method and system for generating initial guarantee scheme of airplane
CN113139106B (en) Event auditing method and device for security check
CN115018258B (en) Method for identifying enterprise type and industry chain space in target area
CN117668273B (en) Mapping result management method
CN117171676B (en) Decision tree-based soil microorganism identification analysis method, system and storage medium
CN118278970A (en) Method for constructing user space-time portrait array based on big data algorithm
CN113988193A (en) Crowd matching method and system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination