CN114116867A - Information data identification and conversion method - Google Patents
- Publication number
- CN114116867A
- Authority
- CN
- China
- Prior art keywords
- data
- sorting
- index data
- information
- order
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2291—User-Defined Types; Storage management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90348—Query processing by searching ordered data, e.g. alpha-numerically ordered data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information data identification and conversion method. First, target information is crawled from different websites by a crawler based on a user-defined topic, and the sorting index data of each piece of information is selected. Next, the sorting index data of each piece of information is identified and unified to obtain its specific quantity value, which is stored in a database to await subsequent sorting. Finally, the information is sorted by sorting index data type according to the sorting index data stored in the database. The method splits the sorting index data into a basic-data part and an order-of-magnitude part, identifies each part separately, and obtains a general specific quantity value that provides an effective basis for subsequent information sorting. This unification improves the efficiency of subsequent sorting and the utilization of the crawled data.
Description
Technical Field
The invention relates to the technical field of network information integration, in particular to an information data identification and conversion method.
Background
In prior-art information integration systems, a user typically defines a topic, related information is retrieved from the whole network according to the topic's keywords, and the integrated results are provided to the client terminal. The topic information obtained from the whole network needs to be sorted, and the sorting criteria are generally data such as the reading count, like count, comment count, and collection count. However, the data formats of different websites are not uniform, chiefly in the order-of-magnitude part, because each website displays this information in its own form. For example, a reading count may be displayed as 1K, 1K+, 2.5K, 1.1w, 1w+, 1 thousand, or 1 ten thousand, so a computer cannot sort the values effectively. How to unify the data formats of like counts and similar data across websites so that they can be sorted is an urgent problem for information integration systems.
Disclosure of Invention
Purpose of the invention: in view of the problems in the prior art, the invention provides an information data identification and conversion method that splits the sorting index data into a basic-data part and an order-of-magnitude part, identifies each part separately, and finally obtains a general specific quantity value, providing an effective basis for subsequent information sorting.
Technical solution: to achieve the above purpose, the invention adopts the following technical solution:
An information data identification and conversion method comprises the following steps:
Step S1, crawling target information from different websites with a crawler based on a user-defined topic, and selecting the sorting index data of each piece of information;
Step S2, identifying the sorting index data of each piece of information and performing unified processing to obtain the specific quantity value of the sorting index data, storing the specific quantity value in a database, and waiting for subsequent sorting processing;
Step S3, performing sorting processing by sorting index data type according to the sorting index data stored in the database.
Further, the specific steps of identifying and processing the sorting index data in step S2 are as follows:
S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, storing it directly in the database;
S2.2, when the sorting index data cannot be directly converted, inputting it into an order-of-magnitude conversion system; the order-of-magnitude conversion system splits the identified sorting index data into a numeric part and an order-of-magnitude part; for the numeric part, each digit is identified and the identified character string is taken as the basic data; for the order-of-magnitude part, a mapping set is first established for each of the six orders of magnitude from $10^3$ to $10^8$, each mapping set containing several keywords; the identified order-of-magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the order-of-magnitude information finally required; the identified basic data is multiplied by the order-of-magnitude information to obtain the specific quantity value of the sorting index data; the specific quantity value is stored in the database;
S2.3, when the sorting index data cannot be identified by the order-of-magnitude conversion system, transferring it to manual processing, in which the unrecognized characters are classified into a mapping set to train the corresponding mapping set and thereby correct the order-of-magnitude conversion system.
Further, in step S3, after the specific sorting index data is obtained, sorting is performed by sorting index type; the sorting index types selected are reading count and collection count; when a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data, and the sorting results by reading data and by collection data are finally obtained respectively.
Beneficial effects:
The invention splits the sorting index data into a basic-data part and an order-of-magnitude part, identifies each part separately, and finally obtains a general specific quantity value that provides an effective basis for subsequent information sorting. This unification improves the efficiency of subsequent sorting and the utilization of the crawled data.
Drawings
FIG. 1 is a flow chart of an information data identification and conversion method provided by the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides an information data identification and conversion method, which comprises the following steps:
Step S1, crawling target information from different websites with a crawler based on a user-defined topic, and selecting the sorting index data of each piece of information.
The information is crawled from different websites, and each website displays it in a different form. The reading count, like count, and comment count of a piece of information are displayed differently across platforms; for example, a reading count may be shown as 1K, 1K+, 2.5K, 1.1w, 1w+, 1 thousand, or 1 ten thousand. The invention takes the reading count, like count, collection count, and comment count of each piece of information as the sorting index data and processes them further as follows.
Step S2, identifying the sorting index data of each piece of information, and performing unified processing to obtain the specific quantity value of the sorting index data, storing the specific quantity value into the database, and waiting for the subsequent sorting processing.
For information sorting, the reading count, like count, comment count, and collection count of a piece of information on its original platform are important sorting criteria, so the non-uniform like counts and reading counts must be unified before they can be sorted. The unified processing comprises the following steps:
S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, it is stored directly in the database.
S2.2, when the sorting index data cannot be directly converted, inputting it into an order-of-magnitude conversion system; the order-of-magnitude conversion system splits the identified sorting index data into a numeric part and an order-of-magnitude part; for the numeric part, each digit is identified and the identified character string is taken as the basic data; for the order-of-magnitude part, a mapping set is first established for each of the six orders of magnitude from $10^3$ to $10^8$, each mapping set containing several keywords; the identified order-of-magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the order-of-magnitude information finally required; the identified basic data is multiplied by the order-of-magnitude information to obtain the specific quantity value of the sorting index data; the specific quantity value is stored in the database.
For the $10^3$ level, assume the thousand-level keyword set is ["k", "K", "thousand"]. When the sorting index data of a piece of information is acquired, the system checks in turn whether the data contains an element of the keyword set. For example, if the acquired reading count is 5.3k, k is found in the thousand-level keyword list and the data is split into 5.3 and k. The number before k is then multiplied by 1000, giving 5.3 × 1000 = 5300, and 5300 is stored in the database.
Similarly, for the $10^4$ level the mapping set is ["w", "W", "ten thousand"]. The order-of-magnitude part is checked in turn against these keywords; once it is confirmed to be at the $10^4$ level, the number on its left is multiplied by 10000 to obtain the final value, which is stored in the database. Other levels, such as ten million, are processed in the same way.
When the data carries a trailing plus sign, for example 1.5k+, only the result of 1.5 × 1000 = 1500 is used as the final sorting index value, and the effect of the omitted remainder on the sorting is ignored.
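Steps S2.1 and S2.2, together with the plus-sign rule, can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `MAPPING_SETS` and `to_quantity` are hypothetical, only the $10^3$ and $10^4$ keyword sets are taken from the description (a full table would cover $10^3$ to $10^8$), and truncating fractional results to an integer is an assumption.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical mapping sets: order of magnitude -> suffix keywords.
# Only the 10^3 and 10^4 keywords come from the description above.
MAPPING_SETS = {
    10**3: ("k", "K", "thousand"),
    10**4: ("w", "W", "ten thousand"),
}

# Numeric part (digits with an optional decimal fraction), then the suffix.
_SPLIT = re.compile(r"(\d+(?:\.\d+)?)\s*(.*)")


def to_quantity(raw: str) -> Optional[int]:
    """Convert a displayed count such as '5.3k' into an integer, or None."""
    # A trailing '+' is dropped; the omitted remainder is ignored.
    text = raw.strip().rstrip("+").strip()

    # S2.1: try direct conversion to a number first.
    try:
        return int(text)
    except ValueError:
        pass

    # S2.2: split into the basic data and the order-of-magnitude part.
    match = _SPLIT.fullmatch(text)
    if match is None:
        return None
    basic, suffix = match.group(1), match.group(2).strip()

    # Compare the suffix with the keywords of every mapping set in turn.
    for magnitude, keywords in MAPPING_SETS.items():
        if suffix in keywords:
            # Decimal avoids binary floating-point error; int() truncates.
            return int(Decimal(basic) * magnitude)

    # No keyword matched: leave the value empty for manual handling.
    return None
```

Under these assumptions, `to_quantity("5.3k")` gives 5300 and `to_quantity("1.5k+")` gives 1500, matching the worked examples above, while an unknown suffix yields `None`.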
S2.3, for order-of-magnitude characters that cannot be matched to any keyword, the characters are classified into a mapping set by manual expansion, which trains the corresponding mapping set. If some information uses new order-of-magnitude characters, adding them manually makes the mapping sets more complete and corrects the order-of-magnitude conversion system.
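One possible way to support the manual expansion in S2.3 is to record every suffix that fails to match and let a reviewer assign it to a mapping set later. The sketch below assumes this design; `mapping_sets`, `unmatched`, `lookup`, and `register_keyword` are hypothetical names, and the suffix "M" used in the usage note is an invented example rather than one from the description.

```python
from collections import Counter
from typing import Optional

# Hypothetical mapping sets: order of magnitude -> set of suffix keywords.
mapping_sets = {
    10**3: {"k", "K", "thousand"},
    10**4: {"w", "W", "ten thousand"},
}

# Suffixes that matched no keyword, with how often each one was seen.
unmatched: Counter = Counter()


def lookup(suffix: str) -> Optional[int]:
    """Return the order of magnitude for a suffix, or None if unknown."""
    for magnitude, keywords in mapping_sets.items():
        if suffix in keywords:
            return magnitude
    # Unknown suffix: remember it so that it can be reviewed manually.
    unmatched[suffix] += 1
    return None


def register_keyword(suffix: str, magnitude: int) -> None:
    """Manual expansion: add a reviewed suffix to a mapping set."""
    mapping_sets.setdefault(magnitude, set()).add(suffix)
    unmatched.pop(suffix, None)
```

For instance, a first `lookup("M")` returns `None` and records the suffix; after a reviewer calls `register_keyword("M", 10**6)`, the same lookup returns 1000000.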
After most of the data has been processed, each piece of information is sorted in the database according to its specific quantity value. If some special data cannot be converted through the above steps, its value is set to null.
Step S3, performing sorting processing by sorting index data type according to the sorting index data stored in the database.
After the specific sorting index data is obtained, sorting is performed by sorting index type, with reading count and collection count selected as the sorting index types. When a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data, and the sorting results by reading data and by collection data are finally obtained respectively.
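The estimation and sorting in step S3 might look like the following sketch. The description does not give the preset ratio or the sort direction, so `READS_PER_COLLECTION = 50` and descending order are assumptions made purely for illustration; `Item`, `fill_missing`, and `rank` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Hypothetical preset ratio of reading data to collection data.
READS_PER_COLLECTION = 50


@dataclass(frozen=True)
class Item:
    title: str
    reads: Optional[int] = None
    collections: Optional[int] = None


def fill_missing(item: Item, ratio: int = READS_PER_COLLECTION) -> Item:
    """Estimate a missing index from the other one using the preset ratio."""
    if item.reads is None and item.collections is not None:
        return replace(item, reads=item.collections * ratio)
    if item.collections is None and item.reads is not None:
        return replace(item, collections=round(item.reads / ratio))
    return item


def rank(items: List[Item], key: str) -> List[Item]:
    """Sort by 'reads' or 'collections', largest first (an assumption)."""
    filled = [fill_missing(item) for item in items]

    def sort_key(item: Item):
        value = getattr(item, key)
        # Items that still have no value are placed last.
        return (value is not None, value or 0)

    return sorted(filled, key=sort_key, reverse=True)
```

Calling `rank(items, "reads")` and `rank(items, "collections")` on the same list yields the two separate sorting results described above.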
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (3)
1. An information data identification and conversion method is characterized by comprising the following steps:
Step S1, crawling target information from different websites with a crawler based on a user-defined topic, and selecting the sorting index data of each piece of information;
Step S2, identifying the sorting index data of each piece of information and performing unified processing to obtain the specific quantity value of the sorting index data, storing the specific quantity value in a database, and waiting for subsequent sorting processing;
Step S3, performing sorting processing by sorting index data type according to the sorting index data stored in the database.
2. The method as claimed in claim 1, wherein identifying and processing the sorting index data in step S2 comprises the following steps:
S2.1, inputting the obtained sorting index data into a code conversion part for code conversion, and judging whether the sorting index data can be directly converted into a numerical value; when the sorting index data can be directly converted, storing it directly in the database;
S2.2, when the sorting index data cannot be directly converted, inputting it into an order-of-magnitude conversion system; the order-of-magnitude conversion system splits the identified sorting index data into a numeric part and an order-of-magnitude part; for the numeric part, each digit is identified and the identified character string is taken as the basic data; for the order-of-magnitude part, a mapping set is first established for each of the six orders of magnitude from $10^3$ to $10^8$, each mapping set containing several keywords; the identified order-of-magnitude characters are then compared one by one with the keywords in all mapping sets, and the order of magnitude of the matching keyword is the order-of-magnitude information finally required; the identified basic data is multiplied by the order-of-magnitude information to obtain the specific quantity value of the sorting index data; the specific quantity value is stored in the database;
S2.3, when the sorting index data cannot be identified by the order-of-magnitude conversion system, transferring it to manual processing, in which the unrecognized characters are classified into a mapping set to train the corresponding mapping set and thereby correct the order-of-magnitude conversion system.
3. The method as claimed in claim 1, wherein in step S3, after the specific sorting index data is obtained, sorting is performed by sorting index type; the sorting index types selected are reading count and collection count; when a piece of information lacks reading data or collection data, the missing value is estimated from a preset ratio of reading data to collection data, and the sorting results by reading data and by collection data are finally obtained respectively.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111400390.8A CN114116867A (en) | 2021-11-19 | 2021-11-19 | Information data identification and conversion method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114116867A (en) | 2022-03-01 |
Family
ID=80440748
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111400390.8A Pending CN114116867A (en) | 2021-11-19 | 2021-11-19 | Information data identification and conversion method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114116867A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776808A (en) * | 2016-11-23 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Information data offering method and device based on artificial intelligence |
CN110826902A (en) * | 2019-10-31 | 2020-02-21 | 北京东软望海科技有限公司 | Target object assessment and evaluation method and device, computer equipment and storage medium |
CN110851709A (en) * | 2019-10-17 | 2020-02-28 | 浙江大搜车软件技术有限公司 | Information pushing method and device, computer equipment and storage medium |
CN111312351A (en) * | 2020-01-20 | 2020-06-19 | 和宇健康科技股份有限公司 | Regional medical record data analysis method and system |
- 2021-11-19: CN application CN202111400390.8A, published as CN114116867A, status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||