WO2021035976A1

WO2021035976A1 - Scenario application method and system based on information classification, and medium and device

Info

Publication number: WO2021035976A1
Application number: PCT/CN2019/117970
Authority: WO
Inventors: 王旭阳; 孙沛基; 朱悦; 刘晋元; 潘永春
Original assignee: 上海市研发公共服务平台管理中心; 上海科技发展有限公司
Priority date: 2019-08-23
Filing date: 2019-11-13
Publication date: 2021-03-04
Also published as: CN110688453A; CN110688453B

Abstract

Provided are a scenario application method and system based on information classification, and a medium and a device. The scenario application method based on information classification comprises: performing formatting pre-processing on information data; performing information source attribute processing on information text according to an information source so as to generate an information source attribute processing result; performing application scenario attribute processing on the information source attribute processing result according to an information application scenario so as to generate different application scenario feature word libraries after extracting application scenario feature words of the information text; and performing word frequency index calculation on the information text so as to push information in a targeted manner by combining a calculation result with the information source attribute processing result and the application scenario feature word libraries. According to the present invention, information crawled in batches can be flexibly and accurately classified and released.

Description

Scene application method, system, medium and equipment based on information classification

Technical field

The present invention belongs to the field of information data application, and relates to a scene application method of information data, in particular to a scene application method, system, medium and equipment based on information classification.

Background art

With the rapid development of the Internet, information and data from various channels have become complicated, and the accuracy of the information disseminated by some channels cannot be guaranteed, which misleads those who obtain it. How to effectively extract and use this information has become a huge challenge: even with web crawlers, the information data crawled from the web cannot be accurately pushed through authoritative channels.

Taking science and technology information as an example, it is an important part of big data resources and is classified in many ways. Users from different fields and backgrounds often have different retrieval purposes and needs, and users, as information acquirers, cannot accurately identify the information content they need.

Therefore, how to categorize and release information from different information sources, such as webpages and official account news sources, for specific user groups and application scenarios after batch crawling has become a technical problem to be solved by those skilled in the art.

Summary of the invention

In view of the above-mentioned shortcomings of the prior art, the purpose of the present invention is to provide a scene application method, system, medium and equipment based on information classification, to solve the problem that the prior art cannot deliver and push crawled information data in a classified manner for specific user groups and application scenarios.

In order to achieve the above and other related purposes, one aspect of the present invention provides a scene application method based on information classification. The method includes: formatting and preprocessing information data to generate information text that conforms to the format; performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result, where the information source attribute processing result includes an information source feature result and a correlation result of information application scenarios; performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to extract application scenario feature words of the information text and generate different application scenario feature vocabularies; and performing word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabularies for targeted push of information; the targeted push includes hiding operations, update operations, add operations, and/or associated storage operations.

In an embodiment of the present invention, the step of formatting and preprocessing information data to generate information text conforming to the format includes: performing noise reduction processing on the information data to obtain purified information text, where the noise reduction processing includes symbol noise reduction and text noise reduction; using word embedding technology to perform word segmentation and labeling processing on the information text, so that specific phrases are distinguished by labeling, the specific phrases including time phrases, name phrases, and/or institution phrases; grammatically deconstructing the information text with specific phrase annotations through a grammar machine; and using a format machine to store the grammatically deconstructed information text in a preset format, where the preset format is determined by a formatter, and the formatter is used to convert fields of the information text into a standard format and to supplement default values.

In an embodiment of the present invention, the step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result includes: analyzing the information source of the information text to determine the category of the information source, where the categories of information sources include comprehensive media, public platforms, management units, research institutions and/or industry media; and classifying the information text into one of the categories of information sources according to the information source to obtain the information source feature result.

In an embodiment of the present invention, the step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result further includes: calibrating the importance of the categories of information sources for different application scenarios, so as to determine the correlation result of the information application scenarios, where the correlation result refers to the dependency ratio of each application scenario in the different categories of information sources. The categories of application scenarios include: achievement category, obituary category, employment category, enterprise industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, conference category, media hotspot category and/or policy category.

In an embodiment of the present invention, the step of performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to extract application scenario feature words of the information text and generate different application scenario feature vocabularies, includes: extracting nouns and/or verb phrases in the information text as application scenario feature words; counting the number of documents in which each application scenario feature word appears, where the number of documents is counted over all the documents that make up the information text; selecting the application scenario feature words whose number of documents is within a preset range; and calculating the dependency coefficients among the selected application scenario feature words and combining them with the semantic vector of the information text, so as to classify the application scenario feature words into the matching application scenario categories and form the application scenario feature vocabularies.
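The document-count filtering in this step can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `filter_by_document_count`, the bounds `min_docs` and `max_docs`, and the sample words are hypothetical, and the sketch assumes candidate feature words have already been extracted for each document.

```python
from collections import Counter

def filter_by_document_count(docs, min_docs, max_docs):
    """Keep candidate feature words whose document count lies in [min_docs, max_docs].

    docs: list of iterables, each holding the candidate words of one information text.
    """
    counts = Counter()
    for words in docs:
        counts.update(set(words))  # count each word at most once per document
    return {w: n for w, n in counts.items() if min_docs <= n <= max_docs}

# Hypothetical candidate words for three information texts.
docs = [
    ["award", "professor", "university"],
    ["award", "conference", "university"],
    ["policy", "university"],
]
kept = filter_by_document_count(docs, min_docs=2, max_docs=2)
print(kept)
```

Words that appear in every document (here "university") or in only one are dropped, which leaves the words that are most likely to discriminate between scenario categories.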

In an embodiment of the present invention, the step of performing word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push, includes: calculating the word frequency index of the target vocabulary of each paragraph in the information text, and determining the core vocabulary of each paragraph by combining the word frequency index with a preset rule, where the preset rule includes sorting the word frequency indexes in descending order and extracting the target vocabulary corresponding to the first several word frequency indexes, and the target vocabulary refers to vocabulary selected according to the article category, including scientific vocabulary; semantically matching the core vocabulary against the application scenario feature vocabulary to select the information text containing core vocabulary whose matching result is greater than a preset value; combining the information text with the category of the information source to generate an information source triple group, and combining it with the application scenario feature vocabulary to generate a feature word triple group; combining the information source triple group and the feature word triple group to determine the category of the application scenario to which the core vocabulary in the feature word triple group belongs; selecting the top three core vocabulary items after sorting, and looking up the category of the application scenario corresponding to each core vocabulary item to determine the information source on which that application scenario category depends most; and pushing the information text to the determined information source with the highest dependency and performing targeted operations.
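The core-vocabulary selection in this step (sort by word frequency index in descending order, keep the first several items) can be sketched as follows. The text does not give the formula for the word frequency index, so this sketch assumes a raw occurrence count; the function name `core_vocabulary`, the parameter `top_n`, and the sample tokens are hypothetical.

```python
from collections import Counter

def core_vocabulary(paragraph_tokens, target_vocabulary, top_n=3):
    """Return the top_n target words of one paragraph, by descending frequency."""
    # Assumption: the word frequency index is a plain occurrence count.
    freq = Counter(t for t in paragraph_tokens if t in target_vocabulary)
    # most_common sorts by count, descending; ties keep first-seen order.
    return [word for word, _ in freq.most_common(top_n)]

# Hypothetical tokenized paragraph and target vocabulary.
tokens = ["graphene", "battery", "graphene", "team", "battery", "graphene", "anode"]
target = {"graphene", "battery", "anode"}
core = core_vocabulary(tokens, target, top_n=2)
print(core)
```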

In an embodiment of the present invention, the targeted operations include: a hiding operation for experts named in obituaries, an update operation for employing institutions, an add operation for honors and awards, and/or a batch associated storage operation for lists.

Another aspect of the present invention provides a scene application system based on information classification. The system includes: a preprocessing module for formatting and preprocessing information data to generate information text conforming to the format; an information source attribute processing module for performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result, where the result includes an information source feature result and a correlation result of information application scenarios; an application scenario attribute processing module for performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting the application scenario feature words of the information text; and an application module for calculating the word frequency index of the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabularies for targeted information push; the targeted push includes hiding operations, update operations, add operations, and/or associated storage operations.

Another aspect of the present invention provides a medium on which a computer program is stored, and when the program is executed by a processor, the scene application method based on information classification is implemented.

The last aspect of the present invention provides a device including a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device executes the scene application method based on information classification.

As mentioned above, the scene application method, system, medium and equipment based on information classification of the present invention have the following beneficial effects:

The present invention provides a classification method and scene application based on scientific and technological information, which comprehensively covers control of the entire process of collection, classification and scene application of scientific and technological information; it combines the information source with full-text feature word segmentation to improve feature classification, which helps shorten the lexicon construction process and reduce judgment errors; and it uses the collected information to design automatic classification, which saves the later cost of manual classification and application and offers high practical value and scene fit.

Description of the drawings

FIG. 1 shows a schematic flow chart of an embodiment of the scene application method based on information classification of the present invention.

FIG. 2 shows a flow chart of preprocessing in an embodiment of the scene application method based on information classification of the present invention.

FIG. 3 is a schematic diagram of the weight ratio of the scene application method based on information classification in an embodiment of the present invention.

FIG. 4 shows a schematic structural diagram of the scene application system based on information classification in an embodiment of the present invention.

Component label description

4 Scenario application system based on information classification

41 Pre-processing module

42 Information source attribute processing module

43 Application scenario attribute processing module

44 Application Module

S11～S14 Scenario application method steps based on information classification

S111～S114 Information data preprocessing steps

Detailed description

The following describes the implementation of the present invention through specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the case of no conflict, the following embodiments and the features in the embodiments can be combined with each other.

It should be noted that the illustrations provided in the following embodiments only illustrate the basic idea of the present invention in a schematic manner. The figures show only the components related to the present invention and are not drawn according to the number, shape and size of the components in actual implementation. The type, quantity, and proportion of each component can be changed at will during actual implementation, and the component layout may also be more complicated.

The technical principles of the scene application method, system, medium and equipment based on information classification of the present invention are as follows: format and preprocess information data; perform information source attribute processing on the information text according to the information source to generate an information source attribute processing result; perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to extract the application scenario feature words of the information text and then generate different application scenario feature vocabularies; and perform word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabularies for targeted information push.

Example one

This embodiment provides a scene application method based on information classification. The scene application method based on information classification includes:

Format and preprocess the information data to generate information text that conforms to the format;

Information source attribute processing is performed on the information text according to the information source to generate information source attribute processing results; the information source attribute processing results include information source feature results and information application scenarios correlation results;

Perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting application scenario feature words of the information text;

Perform word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hiding operations, update operations, add operations, and/or associated storage operations.

The following will describe in detail the scene application method based on information classification provided by this embodiment in conjunction with the diagrams.

An embodiment of the present invention uses tens of thousands of records crawled with the train browser from more than 100 news sites and self-media accounts over the past year. Related fields, core content, and related experts are extracted as features through the word segmentation methods of natural language processing; relevance is then ranked according to feature word frequency and vector weight; finally, the information is classified into different application scenarios through a combined judgment of the data source and the data content.

Please refer to FIG. 1, which shows a principle flow chart of the scene application method based on information classification in an embodiment of the present invention. As shown in Figure 1, the scene application method based on information classification specifically includes the following steps:

S11, format and preprocess the information data to generate information text conforming to the format.

Specifically, the information data is preprocessed by word segmentation technology to generate a word segmentation model; the accuracy of the word segmentation model is optimized through noise reduction, word segmentation, grammar optimization, and format unification of the information data; and finally a word vector model is established. Further, in the word segmentation process, the information data is first split into sentences and then segmented into words with part-of-speech tagging, and word embedding technology is used to build a word vector model with the sentence as the unit. It should be noted that the word segmentation technology includes a string-matching word segmentation method, a word-meaning word segmentation method, and/or a statistical word segmentation method.

Please refer to FIG. 2, which shows a flow chart of preprocessing in an embodiment of the scene application method based on information classification of the present invention. As shown in Figure 2, the S11 includes:

S111: Perform noise reduction processing on the information data to obtain purified information text; the noise reduction processing includes symbol noise reduction and text noise reduction.

In an actual application of this embodiment, the noise reduction processing includes:

(1) Change a full-width symbol to a half-width symbol, for example, a full-width space into a half-width space.

(2) Replace special symbols with common symbols, such as "①⑨⑧⑤年" with "1985".

(3) Simplify the use of symbols, for example: replace tab symbols with spaces, uniformly replace curly brackets and square brackets with parentheses, and replace variant commas with standard commas, so that the information text is simplified as far as possible by reducing all punctuation to commas and periods.

(4) Correct typos based on a dictionary of commonly used Chinese characters and the Ministry of Education's directory of higher education institutions; for example, "Qi Shui" is corrected to "Soda".

(5) Convert between traditional and simplified characters; for example, the traditional character for "country" (國) is changed to its simplified form (国).

(6) Unify terms; for example, variant renderings of "Santa Barbara" are changed to a single standard form.
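Rules (1) to (3) above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the embodiment's code: it relies on Unicode NFKC normalization from the standard library, which folds full-width forms to half-width and circled digits to plain digits but also folds other compatibility characters, and the helper name `reduce_symbol_noise` and the bracket mapping are hypothetical. Typo correction, traditional/simplified conversion, and term unification (rules 4 to 6) need dictionaries and are not covered.

```python
import unicodedata

# Hypothetical mapping: tabs become spaces, curly and square brackets become parentheses.
_SYMBOL_MAP = str.maketrans({
    "\t": " ",
    "{": "(", "}": ")",
    "[": "(", "]": ")",
})

def reduce_symbol_noise(text):
    """Apply rules (1)-(3): width folding, special-symbol folding, symbol simplification."""
    # NFKC folds full-width characters to half-width and circled digits to digits.
    text = unicodedata.normalize("NFKC", text)
    return text.translate(_SYMBOL_MAP)

# Input mixes circled digits, a tab, a full-width letter and a full-width space.
cleaned = reduce_symbol_noise("①⑨⑧⑤\t[Ａ]　{b}")
print(cleaned)
```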

S112: Use word embedding technology to perform word segmentation and labeling processing on the information text, so that specific phrases can be distinguished by labeling; the specific phrases include: time phrases, name phrases, and/or organization phrases.

In an actual application of this embodiment, the word segmentation and labeling processing includes:

(1) Treat a word or phrase representing time as a single chunk; for example, "December 1998" is kept as one word block. This is a feature that distinguishes the method from mainstream word segmentation systems.

(2) Treat the words representing an organization, institution, or award as a single chunk. For example, "Third World Academy of Sciences" will not be split into smaller segments such as "Third World/Academy of Sciences".

(3) Perform part-of-speech tagging on the word segmentation results, where nouns specifically distinguish time phrases, names, organizations, etc.
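The chunking rules above can be illustrated with a small Python sketch. It is a simplified stand-in: the embodiment uses word embedding technology on Chinese text, whereas this sketch uses a regular expression for time phrases and a dictionary lookup for organization names on an English sentence. The names `segment`, `ORG_DICT`, and `TIME_RE`, and the sample sentence, are hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# A time phrase: an optional month name followed by a four-digit year.
TIME_RE = re.compile(rf"(?:(?:{MONTHS})\s+)?\d{{4}}")
ORG_DICT = ["Third World Academy of Sciences"]  # hypothetical organization dictionary

def segment(sentence):
    """Split on whitespace, but keep time and organization phrases as tagged chunks."""
    patterns = [("org", re.escape(name)) for name in ORG_DICT]
    patterns.append(("time", TIME_RE.pattern))
    combined = re.compile(
        "|".join(f"(?P<{tag}{i}>{p})" for i, (tag, p) in enumerate(patterns))
    )
    tokens, pos = [], 0
    for m in combined.finditer(sentence):
        # Ordinary words between protected chunks are split on whitespace.
        tokens += [(w, "word") for w in sentence[pos:m.start()].split()]
        tokens.append((m.group(), m.lastgroup.rstrip("0123456789")))
        pos = m.end()
    tokens += [(w, "word") for w in sentence[pos:].split()]
    return tokens

result = segment("In December 1998 he joined the Third World Academy of Sciences")
print(result)
```

Protecting the chunks before splitting is what keeps "December 1998" and the organization name from being broken into separate tokens.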

S113: Perform grammatical deconstruction on the information text with a specific phrase mark by a grammar machine.

Specifically, the grammar machine is used for Chinese grammar deconstruction, decomposing complex structures into simple structures. For example, after tagging a sentence in an information text, it is presented in the following form: {time: 1987}, {time: 1990}, {order: successively}, {event: obtained}, {univ: the school}, {title: master degree}, {title: doctorate degree}.

Further, the working process of the grammar machine is:

The "sequence grammar machine" is triggered by {order: successively} in the information text. The sequence grammar machine determines the order of the times, taking {time: 1987} as one branch and {time: 1990} as another branch. It should be noted that the sequence grammar machine is triggered when the sentence contains at least two time words, the times are not the same, and the other components of the sentence contain content words corresponding to the number of times; if these conditions are not satisfied, the sequence grammar machine reports a grammatical error.

The "reference grammar machine" is triggered by {univ: the school} in the information text. It searches backward through the preceding text for the most recently mentioned univ tag to find the specific school name referred to by "the school". It should be noted that the reference grammar machine looks back no more than 10 sentences and stops at the beginning of the article; if no referent is found under these conditions, the reference grammar machine reports a grammatical error.

In this embodiment, the result after processing by the grammar machine is displayed as follows:

Branch 1: {time: 1987} {order: first} {event: obtained} {univ: Jilin University} {title: master};

Branch 2: {time: 1990}{order: later}{event:obtained}{univ: Jilin University}{title: PhD}.
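The two grammar machines described above can be sketched as follows. This is a minimal illustration, assuming each tagged sentence is a list of (tag, value) pairs; the names `GrammarError`, `resolve_reference`, and `split_sequence`, and the sample history sentence, are hypothetical. Title values are left unnormalized here, since that work belongs to the formatter.

```python
class GrammarError(Exception):
    """Raised when a grammar machine's trigger conditions are not met."""

def resolve_reference(history, max_back=10):
    """Reference machine: find the most recent concrete univ tag within max_back sentences."""
    for sentence in reversed(history[-max_back:]):
        for tag, value in reversed(sentence):
            if tag == "univ" and value != "the school":
                return value
    raise GrammarError("no antecedent found for 'the school'")

def split_sequence(tokens, history):
    """Sequence machine: pair each distinct time with its title, one branch per time."""
    times = [v for t, v in tokens if t == "time"]
    titles = [v for t, v in tokens if t == "title"]
    if len(times) < 2 or len(set(times)) != len(times) or len(titles) != len(times):
        raise GrammarError("times and content words do not correspond")
    event = next(v for t, v in tokens if t == "event")
    univ = next(v for t, v in tokens if t == "univ")
    if univ == "the school":
        univ = resolve_reference(history)
    orders = ["first"] + ["later"] * (len(times) - 1)
    return [
        {"time": tm, "order": od, "event": event, "univ": univ, "title": tl}
        for tm, od, tl in zip(times, orders, titles)
    ]

history = [[("event", "studied at"), ("univ", "Jilin University")]]  # earlier sentence
tokens = [("time", "1987"), ("time", "1990"), ("order", "successively"),
          ("event", "obtained"), ("univ", "the school"),
          ("title", "master degree"), ("title", "doctorate degree")]
branches = split_sequence(tokens, history)
print(branches)
```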

It should be noted that, after the sentence of the information text is processed by the grammar machine into the above-mentioned branch 1 or branch 2 format, it is then handed over to the format machine for final processing.

S114: Use a format machine to store the grammatically deconstructed information text in a preset format; the preset format is determined by a formatter, and the formatter is used to convert the fields of the information text into a standardized format and to supplement default values.

Specifically, the format machine stores the components of a sentence in a unified, standardized way according to the field format that meets the classification requirements of scientific and technological information application scenarios. The format machine uses triggers to match the formatter required by the sentence, and then calls the corresponding formatter to perform normalized conversion of the fields and supplement default values.

Further, the processing process of the format machine is:

(1) Determine the trigger according to the part-of-speech tags. For example, a sentence contains the "univ" and "title" tags, and "Jilin University" and "Master/PhD" can be found in the school dictionary and the academic degree dictionary respectively; therefore, the sentence containing "Jilin University" and "Master/PhD" triggers the "Educational Experience Formatter".

(2) Generate field headers, including "entry year", "graduation year", "school", "major", "educational background", and "graduation thesis/graduation design".

(3) Format standardization, including the unification of the expression format of time and the unification of the name. For example, “1987” shall be standardized as “1987-00-00”, and “Jilin University” shall be kept in the default form, which is still “Jilin University”. "Doctorate degree" is standardized as "Doctorate".

(4) Missing values in the information text are uniformly filled with the default value "-".

(5) Assemble the data after the format is normalized to generate the temporary text of the information feature words that conform to the format as the preprocessing result, and store it.
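The format machine steps above can be sketched in Python. This is a hedged illustration only: the field keys mirror the field headers in step (2) but their spelling is hypothetical, the `TITLE_MAP` entry for "master degree" is an assumption (the text only gives "Doctorate degree" to "Doctorate"), and mapping the branch time to the graduation year is also an assumption, since the text does not say which year field a time fills.

```python
import re

# Hypothetical keys mirroring the field headers of the education formatter.
FIELDS = ["entry_year", "graduation_year", "school", "major", "education", "thesis"]
# Hypothetical degree dictionary; only the doctorate mapping comes from the text.
TITLE_MAP = {"doctorate degree": "Doctorate", "master degree": "Master"}

def normalize_time(value):
    """Pad a bare four-digit year to the YYYY-00-00 form used in the text."""
    return f"{value}-00-00" if re.fullmatch(r"\d{4}", value) else value

def format_education(branch):
    """Education formatter: standardize known fields, fill the rest with '-'."""
    record = {field: "-" for field in FIELDS}
    # Assumption: the time of a degree branch is treated as the graduation year.
    record["graduation_year"] = normalize_time(branch["time"])
    record["school"] = branch["univ"]
    record["education"] = TITLE_MAP.get(branch["title"], branch["title"])
    return record

def dispatch(branch):
    """Trigger: a branch with both 'univ' and 'title' tags goes to the education formatter."""
    if "univ" in branch and "title" in branch:
        return format_education(branch)
    raise ValueError("no formatter matches this sentence")

record = dispatch({"time": "1990", "univ": "Jilin University", "title": "doctorate degree"})
print(record)
```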

S12: Perform information source attribute processing on the information text according to the information source to generate information source attribute processing results; the information source attribute processing results include information source feature results and information application scenarios correlation results.

In this embodiment, the information source of the information text is analyzed to determine the category of the information source; the category of the information source includes: comprehensive media, public platforms, management units, research institutions, and/or industry media; The information text is classified into one of the information source categories according to the information source to obtain the information source characteristic result.

In an actual application of this embodiment, the crawled information is preliminarily divided into comprehensive media, public platforms, management units, research institutions, and others according to the characteristics of the data source. Among them, comprehensive media such as Science Network and Science and Technology Daily carry more diverse content, their total amount of information is relatively prominent, and achievement information accounts for a relatively large share; on the WeChat public platform, industry information is mixed, information types are widely distributed, and updates are fast; management units publish mostly policy news, followed by meetings and hot topics, with high authority and public recognition but low publishing frequency; for university institutions, 90% of the information concerns scientific and technological achievements, first-hand data on university development policies, achievements, and talent flow can be obtained, and their institutional characteristics are prominent.

Further, take Xinzhiyuan as an example. As a WeChat official account platform, Xinzhiyuan's main business is planning artificial intelligence-related conferences, and it has cooperative relationships with domestic AI companies. The "Xinzhiyuan" WeChat official account is the first link in its industry chain. The number of items in each category is relatively even, with no obvious focus; categories such as achievements, employment, enterprises, industry hotspots, rankings, conferences, and macro statistics are balanced, and the quality is stable.

In this embodiment, the importance of each category of information source for different application scenarios is calibrated through weight calculation to determine the correlation results of the information application scenarios. The correlation results of the information application scenarios refer to the dependency ratio of each application scenario in the different categories of information sources. The categories of the application scenarios include: achievement category, obituary category, employment category, enterprise industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, conference category, media hotspot category and/or policy category.

In an actual application of this embodiment, the total amount of information differs greatly between information sources. To weigh the information quality of different information sources accurately, the evaluation is therefore based on the share that information of a specific application scenario category takes in the total amount of information provided by the information source, so that the information and the information source serve as mutual references and reflect the authority of the information source.
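The share-based weighting described above can be illustrated with a short Python sketch. The exact formula of the embodiment is not reproduced in the text, so this sketch only shows the stated idea of dividing a category's item count by the source's total; the function name `scenario_shares` and all counts are hypothetical and are not data from the embodiment.

```python
def scenario_shares(counts):
    """Return each scenario category's share of one source's total information."""
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in counts}
    return {k: n / total for k, n in counts.items()}

# Hypothetical item counts per application scenario category for two sources.
comprehensive_media = {"achievement": 60, "conference": 25, "policy": 15}
public_platform = {"achievement": 30, "conference": 10, "policy": 10}

shares_a = scenario_shares(comprehensive_media)
shares_b = scenario_shares(public_platform)
print(shares_a["achievement"], shares_b["achievement"])
```

Using shares rather than raw counts lets a source with a small total volume be compared with a large one on the same scale, which is the motivation the paragraph gives.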

Please refer to FIG. 3, which shows a schematic diagram of the weight ratio of the scene application method based on information classification in an embodiment of the present invention. As shown in Figure 3, the information source categories are labeled as follows: A represents comprehensive media, B represents public platforms, C represents management units, D represents university websites, and E represents others, which include, for example, industry media. The application scenario categories are labeled as follows: a represents the achievement category, b the obituary category, c the employment category, d the enterprise-related category, e the honorary award title category, f the list category, g the conference category, h the hot topic category of news figures in the field, i the policy category, j the integrity and ethical issues category, and k the macro statistical report category.

In an actual application of this embodiment, taking the proportion of each source in the result information as an example, let:

As shown in Figure 3, the final result is judged as:

A comparison of the above calculation results shows that, with the development of information-sharing self-media in recent years, the reliability of the WeChat public platform has surpassed that of comprehensive media.

S13: Perform application scenario attribute processing on the information source attribute processing result according to the information application scenario to extract application scenario feature words of the information text, and then generate different application scenario feature vocabularies.

Specifically, according to the scenarios in which different information can be used, the information can be preliminarily divided into the following categories: a. achievement category, b. obituary category, c. employment category, d. enterprise-related category, e. honorary award and title category, f. list category, g. conference category, h. category of hotspot news figures in the field, i. policy category, j. integrity and ethical issues category, k. macro statistical report category. It should be noted that each application scenario category can also be assigned a label with a specific meaning for identification or retrieval, such as: A-achievement, D-obituary, EM-employment, ET-enterprise related, H-honorary award and title, L-list, M-conference, N-hotspot news figures in the field, P-policy, PO-integrity and ethical issues, ST-macro statistical report.

Specifically, the categories of the application scenarios are described as follows:

(1) Achievement category: Contains personal profiles and information on cooperation between domestic and foreign institutions and research groups. The expert profiles in this information may include honors not yet recorded and rare field subdivisions, which can be added to the existing expert profile; the achievements themselves can be used to define the latest research content and research directions.

(2) Obituary category: Based on this information, the availability and contact status of an expert can be updated to "hidden".

(3) Employment category: Information on the flow of domestic and overseas talents in colleges and universities and global high-tech companies is used to update the latest developments in the institutions and cooperation of experts.

(4) Enterprise industry-related categories: as a supplement to the industry's macro situation, basic information of the enterprise, and important talents of the enterprise.

(5) Honorary award categories: such as co-opted academician titles and awards in various disciplines. Generally, this type of information provides complete information on awarding institutions and winners, which can be used to update expert content and initially assess the authority of the award.

(6) List and ranking category: The scope of ranking includes universities, achievements, disciplines, enterprises, scholars, etc. This information provides not only the selection indicators used by domestic and foreign institutions, but also a large amount of normalized list content suitable for batch acquisition.

(7) Conference category: Includes government conferences, scientific and technological forums, and achievement competitions. Academic conferences hosted by the Mainland provide information on cooperation between foreign professors and the country. International conferences provide background information on participants and institutions. Artificial intelligence conferences are also an important reference for field classification and a source of data on the latest results.

(8) Media hotspots: Media hotspots contain a wider range of content. It is usually the introduction and prospect of new technologies related to production, education and research, the transformation of achievements, the latest achievements of popular technology companies, the detailed introduction of scholars, corporate executives, scientific research teams, and famous teachers.

(9) Policy category: Mainly includes the latest instructions of local governments on talents and infrastructure construction, interpretations of national science and technology policies and situations, new discipline and industry standards established by various institutions, the launch of large-scale projects, international cooperation agreements, major policy adjustments in foreign countries, etc. It can be used by policy researchers as background or comparison material.

(10) Integrity and ethics issues: Common content includes paper retractions and academic scandals in various fields, as well as ethical reflections on emerging disciplines and technologies. On the one hand, it is an important consideration for expert evaluation and employment; on the other hand, it supports the tracking of hot topics in international research disputes.

(11) Macro statistical reports: mainly data from international authoritative institutions and domestic industry media. The level involved includes talent, industry (trend/status), bibliometrics, university research index, patent, subject area, etc.

In this embodiment, the S13 includes:

S131: Extract nouns and/or verb phrases in the information text as application scenario feature words.

Specifically, according to the above-mentioned 11 categories and the part-of-speech tags produced by word segmentation, nouns and noun phrases whose part-of-speech tag starts with n, or verb phrases whose part-of-speech tag is v, are extracted from the segmented information. It should be noted that if the part-of-speech tagging uses the correspondences n-noun, nt-organization or group, and nz-other proper noun, words whose part-of-speech tag begins with nt or nz can also be extracted.
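The tag-based selection described above can be sketched as follows. This is a minimal illustration, assuming the word segmenter has already produced (word, tag) pairs; the function name and the sample tokens are hypothetical and are not taken from the patent.

```python
def extract_feature_candidates(tagged_tokens):
    """Keep words whose tag starts with 'n' (n, nt, nz, ...) or equals 'v'.

    tagged_tokens: iterable of (word, part_of_speech_tag) pairs, assumed
    to come from an upstream word segmentation and tagging step.
    """
    return [
        word
        for word, tag in tagged_tokens
        if tag.startswith("n") or tag == "v"
    ]


# Hypothetical sample input: 'd' marks an adverb and is dropped.
tagged = [("published", "v"), ("university", "nt"),
          ("rapidly", "d"), ("paper", "n")]
candidates = extract_feature_candidates(tagged)
```

Because the check is a prefix match on n, tags such as nt and nz are retained without being listed explicitly.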

S132: Count the number of documents where the application scenario feature words are located; the number of documents refers to the total number of documents formed by all the information texts.

Specifically, the DF value of the feature word of the application scenario is calculated, and the DF value represents the number of documents in which the feature word of the application scenario appears. The DF or df refers to the document frequency, and DF calculation is a feature extraction technology. Because of its linear calculation complexity relative to the scale of the text database, it can be easily used for large-scale document statistics.

S133: Filter out a number of the application scenario feature words whose number of documents is within a preset range.

In an actual application of this embodiment, the application scenario feature words are selected according to a criterion that the DF value of the application scenario feature words is greater than 5 and less than 20% of the total number of documents. It should be noted that the value greater than 5 and less than 20% of the total number of documents is an example of the preset range, and the rest of the numerical ranges that can be used to define and filter application scenario feature words are also within the scope of the present invention.
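Steps S132 and S133 can be sketched as a document-frequency count followed by a range filter. The thresholds below follow the example range given above (DF greater than 5 and less than 20% of the total number of documents); the function names and the representation of documents as token lists are assumptions made for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Return a Counter mapping each word to the number of documents containing it.

    docs: list of token lists, one list per information text.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word at most once per document
    return df


def filter_by_df(df, n_docs, min_df=5, max_ratio=0.2):
    """Keep words with min_df < DF < max_ratio * n_docs."""
    return {w for w, count in df.items()
            if count > min_df and count < max_ratio * n_docs}


# Hypothetical corpus: "the" appears in all 100 documents,
# "award" in 10 of them, and "rare" in 3.
docs = [["the"] + (["award"] if i < 10 else []) + (["rare"] if i < 3 else [])
        for i in range(100)]
selected = filter_by_df(document_frequency(docs), len(docs))
```

The lower bound removes words too rare to characterize a category, while the upper bound removes words so common that they do not discriminate between categories.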

S134: Calculate the dependency coefficients between several of the application scenario feature words and, in combination with the semantic vector of the information text, classify the application scenario feature words into the matching application scenario categories to form an application scenario feature vocabulary.

Specifically, the selected application scenario feature words are formed into a feature extraction vocabulary according to the classification of the application scenario, and 11 extracted word sets are divided accordingly.

It should be noted that there are no words common to all information in the same category. Information in the same category shares only a "family resemblance", so multiple words need to be used to match the semantic vector of the whole article. Words are not searched independently of one another: different words of the same category have dependency coefficients between them, which allows more accurate classification.

Specifically, the categories and feature words of the application scenario are edited in the form of a table to form 11 extracted word sets. Based on the matching and learning results, an example of the feature word extraction word set is as follows, please refer to Table 1 for the extracted word set classification table. It can be seen from Table 1 that "published" as a feature word is classified into the achievement category of the application scenario category.

Table 1: Extraction word set classification table

S14: Perform word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hiding operations, update operations, addition operations, and/or associated storage operations.

Specifically, the word frequency index calculation is performed on the target vocabulary in the formatted information text to determine the number of times each target vocabulary appears in the information text, thereby representing the weight of the target vocabulary in the information text.

In this embodiment, the S14 includes:

S141: Calculate the word frequency index of the target vocabulary of each paragraph in the information text, and determine the core vocabulary of each paragraph by combining the word frequency index with a preset rule. The preset rule includes sorting the word frequency indexes in descending order and extracting the first several target words corresponding to those indexes. The target words are words selected according to the article category, including scientific and technological words.

In an actual application of this embodiment, each scientific and technological information text is regarded as a document. The scientific and technological vocabulary in the full-text data of the scientific and technological information is extracted, the idf value of every word in that vocabulary is calculated, the scientific and technological vocabulary in each paragraph is extracted, and the top-ranked words in descending order of tf-idf value are taken as the core vocabulary. The idf value is the inverse document frequency of a scientific and technological word in the text collection, and the calculation formula is as follows:

Among them, w represents a scientific and technological word, idf(w) represents the inverse document frequency of the word w in the text collection, |D| is the number of documents, and df(w) represents the number of documents containing the word w.
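The formula itself is not reproduced in this text. A commonly used form that is consistent with the symbols defined above is given here as an assumption; the logarithm base and any smoothing term used in the original are not known.

$$
\mathrm{idf}(w) = \log \frac{|D|}{\mathrm{df}(w)}, \qquad
\text{tf-idf}(w, p) = \mathrm{tf}(w, p) \cdot \mathrm{idf}(w)
$$

Here tf(w, p), a symbol introduced for this illustration, denotes the number of times the word w occurs in paragraph p.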

Specifically, taking a paragraph of a scientific and technological information text as an example, the number of sentences L in the paragraph is obtained, and the top L words in descending order of tf-idf value are used as the core vocabulary of the paragraph. It should be noted that the number of core words extracted depends on the number of sentences in the paragraph. A single sentence may yield multiple core words, and the core words of different sentences in the same paragraph may repeat, so the final core words of the whole paragraph are selected by ranking.
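The paragraph-level selection can be sketched as below. This is a minimal illustration, assuming the unsmoothed idf form log(|D| / df(w)), documents represented as sets of words, and sentence boundaries detected by simple punctuation splitting; all function names and sample data are hypothetical.

```python
import math
import re
from collections import Counter


def count_sentences(text):
    """Count sentences by splitting on common Western and Chinese end marks."""
    return len([s for s in re.split(r"[.!?。！？]+", text) if s.strip()])


def idf(word, docs):
    """Inverse document frequency over docs, a list of word sets."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0


def core_vocabulary(paragraph_tokens, n_sentences, docs):
    """Return the top n_sentences words of a paragraph ranked by tf-idf."""
    tf = Counter(paragraph_tokens)
    scores = {w: tf[w] * idf(w, docs) for w in tf}
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n_sentences]


# Hypothetical corpus and paragraph.
docs = [{"graphene", "battery"}, {"battery"}, {"battery", "laser"}, {"policy"}]
text = "Graphene improves batteries. Lasers help too."
tokens = ["graphene", "graphene", "battery", "laser"]
core = core_vocabulary(tokens, count_sentences(text), docs)
```

In this example "battery" is ranked last because it occurs in most documents, which lowers its idf despite appearing in the paragraph.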

S142: Perform semantic matching on the core vocabulary in the application scenario feature vocabulary to filter out the information text where the core vocabulary whose matching result is greater than a preset value is located.

Specifically, the semantic similarity between the core vocabulary and the extracted feature vocabulary is calculated, and the article containing the core vocabulary with a semantic similarity greater than 0.5 is extracted. It should be noted that 0.5 is an embodiment of the preset value, and other preset values that can be used for semantic matching are all included in the scope of the present invention.
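The threshold filter in S142 can be sketched as follows. The patent does not state which similarity measure is used; cosine similarity over word vectors is assumed here, and the embedding table, lexicon vectors, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def matching_articles(articles, lexicon_vectors, embed, threshold=0.5):
    """Return ids of articles with a core word similar to any lexicon entry.

    articles: dict mapping article id -> list of core words.
    lexicon_vectors: vectors of the application scenario feature words.
    embed: dict mapping word -> vector (assumed to exist upstream).
    """
    hits = []
    for article_id, core_words in articles.items():
        matched = False
        for word in core_words:
            if word not in embed:
                continue
            if any(cosine(embed[word], lv) > threshold for lv in lexicon_vectors):
                matched = True
                break
        if matched:
            hits.append(article_id)
    return hits


# Hypothetical two-dimensional embeddings.
embed = {"publish": (1.0, 0.0), "weather": (0.0, 1.0)}
lexicon_vectors = [(0.9, 0.1)]
articles = {"a1": ["publish"], "a2": ["weather"]}
hits = matching_articles(articles, lexicon_vectors, embed)
```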

S143: Combine the information text with the category of the information source to generate an information source triple group, and combine the application scenario feature vocabulary to generate a feature word triple group.

Specifically, the triples containing the name of the information item are extracted from the information crawling result. These triples are mainly of two types. The first is the is-a relationship triple based on the classification of the information source, that is, <information name, isA, information source category name>, where isA indicates the information source of the information text. The second is the triple based on feature words, <information name, feature word category name, attribute value>. The selected information item name is combined with the information source classification and the feature word set to form a <information item, isA, category name> triple group and a <information item, feature word, attribute value> triple group.

Further, according to the classification of the crawled data source, the matching degree with the application scenario of the data source is calculated, and the result is used as the is-a relationship triple of the information source classification, <information name, isA, information source category name>.

Furthermore, the semantic vectors with the highest occurrence frequency and the best relevance among the known application scenarios are filtered to form a feature classification word set, forming a relationship triple based on feature words, <information name, feature word category name, attribute value>.

S144: Combining the information source triple group and the feature word triple group, determine the category of the application scenario to which the core vocabulary in the feature word triple group belongs.
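The two triple types can be represented with a small data structure. This is a sketch only; the class name, helper functions, and sample values are hypothetical and are not taken from the patent.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # information item name
    predicate: str  # "isA" or a feature word category name
    obj: str        # information source category name or attribute value


def source_triple(info_name, source_category):
    """Build <information name, isA, information source category name>."""
    return Triple(info_name, "isA", source_category)


def feature_triples(info_name, features):
    """Build <information name, feature word category name, attribute value>.

    features: dict mapping feature word category name -> attribute value.
    """
    return [Triple(info_name, category, value)
            for category, value in features.items()]


# Hypothetical information item.
src = source_triple("Item 1", "public platform")
feats = feature_triples("Item 1", {"honor": "academician"})
```

Keeping both triple types in one structure lets step S144 join them on the shared information item name.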

Specifically, the application scenario category to which a certain piece of information text belongs is determined by the attribute classification features in the information source triple group and the feature word triple group.

S145: Select the top three core vocabularies after sorting, and search for the category of the application scenario corresponding to each of the core vocabulary to determine the information source with the highest category dependency of the application scenario.

In an actual application of this embodiment, since the entire information text has a core vocabulary sorted by word frequency, the core vocabulary is compared against the application scenario categories of the original ten thousand documents and the 11 information categories in the actual database, and the application scenario feature vocabulary is called to find the application scenario category corresponding to the information text. A one-to-many cross calculation is then performed with the categories of the information source, and the final result is unified according to the scenario with the largest overlap, so as to determine the information source with the highest category dependency of the application scenario.

S146: Push the information text to the determined information source with the highest degree of dependence, and perform targeted operations.

Specifically, according to the weight of the feature words in the information and the weighted ranking of the information source type, the application scenarios to which the first three feature words belong are targeted to be pushed.

It should be noted that the targeted push of the application scenarios to which the first three feature words belong is one of the implementation methods of the present invention, and the application scenarios to which the remaining number of feature words belong can also be selected for targeted push.
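Steps S145 and S146 can be sketched as a lookup over the top-ranked core words. The dependency weights below are hypothetical placeholders (the actual weight ratios of FIG. 3 are not reproduced in this text), and the function name and data layout are assumptions.

```python
def route(ranked_core_words, word_to_category, dependency, top_k=3):
    """For each of the top_k core words, find its scenario category and the
    information source on which that category depends most.

    word_to_category: dict from the application scenario feature vocabulary.
    dependency: dict mapping category -> {source label: dependency weight}.
    Returns a list of (word, category, best_source) tuples.
    """
    routes = []
    for word in ranked_core_words[:top_k]:
        category = word_to_category.get(word)
        if category is None:
            continue  # word is not in the feature vocabulary
        weights = dependency[category]
        best_source = max(weights, key=weights.get)
        routes.append((word, category, best_source))
    return routes


# Hypothetical inputs; source labels A-E follow FIG. 3.
ranked = ["award", "appointed", "summit", "extra"]
word_to_category = {"award": "honor", "appointed": "employment", "extra": "policy"}
dependency = {
    "honor": {"A": 0.2, "C": 0.6},
    "employment": {"D": 0.7, "B": 0.3},
    "policy": {"C": 1.0},
}
routes = route(ranked, word_to_category, dependency)
```

The top_k parameter reflects the note above that three is only one possible choice for the number of feature words considered.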

In this embodiment, the targeted operations include: hiding experts named in obituaries, updating employment institutions, adding honors and awards, and/or batch associated storage of lists. For example, the information text of the list category can be entered directly into the database as incremental data according to the partial word segmentation results.
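The mapping from application scenario category to targeted operation can be sketched as a dispatch table. The record fields, category keys, and function name below are hypothetical; the patent names the operations but does not specify a data model.

```python
# Category -> operation name, following the four operations listed above.
OPERATIONS = {
    "obituary": "hide",
    "employment": "update",
    "honor": "add",
    "list": "associate_store",
}


def targeted_operation(category, record, payload=None):
    """Apply the operation for a category to an expert record (a dict).

    Returns the operation name, or None if the category has no operation.
    """
    op = OPERATIONS.get(category)
    if op == "hide":
        record["visible"] = False
    elif op == "update":
        record["institution"] = payload
    elif op == "add":
        record.setdefault("honors", []).append(payload)
    elif op == "associate_store":
        record.setdefault("list_entries", []).extend(payload)
    return op


# Hypothetical expert record.
expert = {"name": "X", "visible": True, "institution": "Univ A"}
first_op = targeted_operation("employment", expert, "Univ B")
```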

This embodiment provides a computer storage medium on which a computer program is stored, and when the computer program is executed by a processor, the scene application method based on information classification is implemented.

A person of ordinary skill in the art can understand that all or part of the steps in the foregoing method embodiments can be implemented by hardware related to a computer program. The aforementioned computer program can be stored in a computer-readable storage medium. When the program is executed, it executes the steps including the foregoing method embodiments; and the foregoing computer-readable storage medium includes: ROM, RAM, magnetic disk, or optical disk and other computer storage media that can store program codes.

The scenario application method based on information classification in this embodiment can realize the classified placement and flexible operation of specific user groups and application scenarios after batch crawling of information data of different information sources such as webpages and official account news sources.

Example two

This embodiment provides a scene application system based on information classification. The scene application system based on information classification includes:

The preprocessing module is used to format and preprocess the information data to generate information text that conforms to the format;

The information source attribute processing module is used for processing the information source attribute of the information text according to the information source to generate the information source attribute processing result; the information source attribute processing result includes the information source characteristic result and the correlation result of the information application scenario;

The application scenario attribute processing module is configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting application scenario feature words of the information text;

The application module is used to calculate the word frequency index of the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hiding operations, update operations, addition operations, and/or associated storage operations.

The scene application system based on information classification provided by this embodiment will be described in detail below in conjunction with the drawings. It should be understood that the division of the system into the following modules is only a division of logical functions; in actual implementation the modules can be fully or partially integrated into one physical entity, or can be physically separated. These modules can all be implemented as software called by processing elements, or all be implemented as hardware; alternatively, some modules can be implemented as software called by processing elements and others as hardware. For example, the x module can be a separate processing element, or it can be integrated in a chip of the following system. In addition, the x module may also be stored in the memory of the following system in the form of program code, which is called by a processing element of the following system to execute the function of the x module. The implementation of the other modules is similar. All or part of these modules can be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capabilities. In the implementation process, the steps of the above method or the following modules can be completed by integrated logic circuits of hardware in the processor element or by instructions in the form of software.

The following modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), one or more digital signal processors (Digital Signal Processor, DSP for short), one or more field programmable gate arrays (Field Programmable Gate Array, FPGA for short), etc. When one of the following modules is implemented by a processing element calling program code, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU for short) or another processor that can call program code. These modules can be integrated together and implemented in the form of a system-on-a-chip (SOC for short).

Please refer to FIG. 4, which shows a schematic structural diagram of the scene application system based on information classification in an embodiment of the present invention. As shown in FIG. 4, the scene application system 4 based on information classification includes: a preprocessing module 41, an information source attribute processing module 42, an application scenario attribute processing module 43, and an application module 44.

The preprocessing module 41 is used for formatting and preprocessing the information data to generate information text conforming to the format.

In this embodiment, the preprocessing module 41 is specifically configured to: perform noise reduction processing on the information data to obtain purified information text, where the noise reduction processing includes symbol noise reduction and text noise reduction; use word embedding technology to perform word segmentation and labeling on the information text so that specific phrases can be distinguished by their labels, where the specific phrases include time phrases, name phrases, and/or institution phrases; grammatically deconstruct, through a grammar machine, the information text in which the specific phrases have been labeled; and use a format machine to store the grammatically deconstructed information text in a preset format, where the preset format is determined by a formatter, and the formatter is used to convert the fields of the information text into a standard format and to fill in default values.

The information source attribute processing module 42 is configured to perform information source attribute processing on the information text according to the information source to generate information source attribute processing results; the information source attribute processing results include information source feature results and information application scenario correlation results.

In this embodiment, the information source attribute processing module 42 is specifically used to analyze the information source of the information text to determine the category of the information source; the categories of the information source include: integrated media, public platforms, management units, research institutions, and/or industry media. The information text is classified into one of the information source categories according to the information source to obtain the information source characteristic results. Through weight calculation, the importance of the information source categories for different application scenarios is calibrated to determine the correlation results of the information application scenarios. The correlation results of the information application scenarios refer to the ratio of the degree of dependence that each application scenario has on the different information source categories. The categories of the application scenarios include: achievement category, obituary category, employment category, enterprise industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, conference category, media hotspot category, and/or policy category.
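The formatter's role (standard format conversion of fields plus default-value filling) can be sketched as follows. The field names and default values are hypothetical; the patent does not specify the preset format.

```python
# Hypothetical preset format: field name -> default value.
DEFAULTS = {"title": "", "source": "unknown", "published": None, "body": ""}


def format_record(raw):
    """Normalize field names and fill missing or empty fields with defaults.

    raw: dict of crawled fields with possibly untidy keys and values.
    Unknown fields are dropped; string values are stripped of whitespace.
    """
    record = dict(DEFAULTS)
    for key, value in raw.items():
        name = key.strip().lower()
        if name in record and value not in (None, ""):
            record[name] = value.strip() if isinstance(value, str) else value
    return record


formatted = format_record({" Title ": " Award news ", "body": None, "extra": 1})
```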

The application scenario attribute processing module 43 is configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting the application scenario feature words of the information text.

In this embodiment, the application scenario attribute processing module 43 is specifically configured to: extract nouns and/or verb phrases in the information text as application scenario feature words; count the number of documents in which the application scenario feature words are located, where the number of documents refers to the total number of documents formed by all the information texts; filter out a number of the application scenario feature words whose number of documents is within a preset range; and calculate the dependency coefficients between the several application scenario feature words and, in combination with the semantic vector of the information text, classify the application scenario feature words into the matching application scenario categories to form an application scenario feature vocabulary.

The application module 44 is configured to perform word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hiding operations, update operations, addition operations, and/or associated storage operations.

In this embodiment, the application module 44 is specifically configured to: calculate the word frequency index of the target vocabulary of each paragraph in the information text, and determine the core vocabulary of each paragraph by combining the word frequency index with a preset rule, where the preset rule includes sorting the word frequency indexes in descending order and extracting the first several target words corresponding to those indexes, and the target words are words selected according to the article category, including scientific and technological words; perform semantic matching on the core vocabulary in the application scenario feature vocabulary to filter out the information text containing core vocabulary whose matching result is greater than a preset value; combine the information text with the category of the information source to generate an information source triple group, and combine it with the application scenario feature vocabulary to generate a feature word triple group; combine the information source triple group and the feature word triple group to determine the category of the application scenario to which the core vocabulary in the feature word triple group belongs; select the top three core words after sorting, and search for the category of the application scenario corresponding to each core word to determine the information source with the highest category dependency of the application scenario; and push the information text to the determined information source with the highest degree of dependence and perform targeted operations. The targeted operations include: hiding experts named in obituaries, updating employment institutions, adding honors and awards, and/or batch associated storage of lists.

The scene application system based on information classification of this embodiment can realize the classified delivery and flexible operation of specific user groups and application scenarios after batch crawling of information data of different information sources such as webpages and official account news sources.

Example three

This embodiment provides a device including: a processor, a memory, a transceiver, a communication interface and/or a system bus. The memory and the communication interface are connected to the processor and the transceiver through the system bus to communicate with one another; the memory is used to store a computer program, the communication interface is used to communicate with other devices, and the processor and the transceiver are used to run the computer program so that the device executes each step of the scene application method based on information classification.

The aforementioned system bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The system bus can be divided into address bus, data bus, control bus and so on. The communication interface is used to realize the communication between the database access device and other devices (such as client, read-write library and read-only library). The memory may include random access memory (Random Access Memory, RAM for short), and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.

The above-mentioned processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field programmable gate array (Field Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The scope of protection of the scene application method based on information classification of the present invention is not limited to the order of execution of the steps listed in this embodiment, and all schemes implemented through prior-art steps based on the principles of the present invention are included within the protection scope of the present invention.

The present invention also provides a scene application system based on information classification. The scene application system based on information classification can implement the scene application method based on information classification of the present invention, but the devices implementing the scene application method based on information classification of the present invention include, but are not limited to, the structure of the scene application system based on information classification listed in this embodiment. Any structural modification and replacement of the prior art based on the principles of the present invention is included in the protection scope of the present invention.

In summary, the scene application method, system, medium, and equipment based on information classification of the present invention comprehensively consider the control of the entire process of scientific and technological information collection, classification, and scene application. Feature classification is improved by combining information sources with full-text feature word segmentation, which helps to reduce the construction process and judgment errors of the lexicon. Automatic classification is designed around the use cases of the collected information, which saves the cost of later manual classification and application and offers high practical value and scenario fit. The invention effectively overcomes various shortcomings in the prior art and has high industrial value.

The above-mentioned embodiments only exemplarily illustrate the principles and effects of the present invention, but are not used to limit the present invention. Anyone familiar with this technology can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by those with ordinary knowledge in the technical field without departing from the spirit and technical ideas disclosed in the present invention should still be covered by the claims of the present invention.

Claims

A scene application method based on information classification, characterized in that the scene application method based on information classification includes:

Format and preprocess the information data to generate information text that conforms to the format;

Information source attribute processing is performed on the information text according to the information source to generate information source attribute processing results; the information source attribute processing results include information source feature results and information application scenarios correlation results;

Perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting application scenario feature words of the information text;

Perform word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hidden operations, update operations, new operations, and/or associated storage operations.
The scene application method based on information classification according to claim 1, wherein the step of formatting and preprocessing the information data to generate information text conforming to the format comprises:

Perform noise reduction processing on the information data to obtain purified information text; the noise reduction processing includes symbol noise reduction and text noise reduction;

Using word embedding technology to perform word segmentation and labeling processing on the information text, so that specific phrases can be distinguished by labeling; the specific phrases include: time phrases, name phrases, and/or organization phrases;

Grammatically deconstruct the information text marked with specific phrases through a grammar machine;

Use a format machine to store the grammatically deconstructed information text according to a preset format; the preset format is determined by a formatter, and the formatter is used to perform standard format conversion and default value supplementation on the fields of the information text.
The scene application method based on information classification according to claim 1, wherein the step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result comprises:

Analyze the information source of the information text to determine the category of the information source; the category of the information source includes: integrated media, public platforms, management units, research institutions and/or industry media;

Classify the information text into one of the information source categories according to the information source to obtain the information source feature result.
The scene application method based on information classification according to claim 3, wherein the step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result further comprises:

Through weight calculation, the importance of each information source category to the different application scenarios is calibrated to determine the information application scenario relevance results; the relevance results refer to the dependency ratio of each application scenario across the different information source categories;

The categories of the application scenarios include: achievement category, obituary category, employment category, corporate industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, conference category, media hotspot category and/or policy category.
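The dependency ratio described above can be illustrated with a minimal sketch. The claim does not specify how the weights are calculated; the sketch assumes, purely for illustration, that they are derived from counts of already-labeled information texts. The function name dependency_ratios and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def dependency_ratios(labeled_texts):
    """Return, for each application scenario, the share of its texts
    that come from each information source category.

    labeled_texts: iterable of (scenario, source_category) pairs.
    Assumption: a count-based share stands in for the claim's
    unspecified "weight calculation".
    """
    counts = defaultdict(Counter)
    for scenario, source in labeled_texts:
        counts[scenario][source] += 1

    ratios = {}
    for scenario, by_source in counts.items():
        total = sum(by_source.values())
        ratios[scenario] = {src: n / total for src, n in by_source.items()}
    return ratios

# Hypothetical sample: three policy texts and one honor text.
sample = [
    ("policy", "management units"),
    ("policy", "management units"),
    ("policy", "industry media"),
    ("honor", "research institutions"),
]
ratios = dependency_ratios(sample)
```

Under this assumption the ratios for one application scenario sum to 1, so the information source category with the largest ratio is the one that scenario depends on most.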
The scene application method based on information classification according to claim 1, wherein the step of performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting application scenario feature words of the information text, comprises:

Extracting nouns and/or verb phrases in the information text as application scenario feature words;

Count the number of documents in which the application scenario feature words are located; the number of documents refers to the total number of documents composed of all the information texts;

Filter out several of the application scenario feature words whose number of documents is within a preset range;

By calculating the dependency coefficients between several of the application scenario feature words and combining with the semantic vector of the information text, the application scenario feature words are classified into the matching application scenario categories to form an application scenario feature vocabulary.
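The document-count and filtering steps above can be sketched as follows. This is an illustration only: it assumes each information text has already been reduced to a list of candidate feature words, and it treats the preset range as inclusive bounds min_df and max_df. Those parameter names and the function name are hypothetical, and the later dependency-coefficient and semantic-vector steps are not covered.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep candidate feature words whose document count lies in
    the inclusive range [min_df, max_df].

    docs: list of word lists, one list per information text.
    """
    df = Counter()
    for words in docs:
        # set() so that a word is counted once per document
        df.update(set(words))
    return {word for word, n in df.items() if min_df <= n <= max_df}

# Hypothetical candidate words from three information texts.
docs = [
    ["award", "expert", "the"],
    ["award", "policy", "the"],
    ["expert", "the"],
]
kept = filter_by_document_frequency(docs, min_df=2, max_df=2)
```

Here "the" appears in all three texts and "policy" in only one, so both fall outside the range, which leaves "award" and "expert" as feature word candidates.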
The scene application method based on information classification according to claim 1, wherein the step of performing word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push, comprises:

Calculate the word frequency index of the target vocabulary of each paragraph in the information text, and determine the core vocabulary of each paragraph by combining the word frequency index with a preset rule; the preset rule includes sorting the word frequency indexes in descending order and then extracting the target words corresponding to the first several word frequency indexes; the target words refer to vocabulary selected according to the article category, including scientific vocabulary;

Performing semantic matching on the core vocabulary in the application scenario feature vocabulary to filter out the information text where the core vocabulary with a matching result greater than a preset value is located;

Combining the information text with the category of the information source to generate an information source triple, and combining it with the application scenario feature vocabulary to generate a feature word triple;

Combining the information source triple and the feature word triple to determine the category of the application scenario to which the core vocabulary in the feature word triple belongs;

Selecting the top three core vocabulary items after sorting, and looking up the application scenario category corresponding to each of them to determine the information source on which that application scenario category depends most;

Push the information text to the determined information source with the highest degree of dependence, and perform targeted operations.
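The first step of this claim, selecting the core vocabulary of a paragraph by word frequency, can be sketched as follows. The sketch assumes that the word frequency index is plain term frequency (occurrences divided by paragraph length), which the claim does not state, and that the paragraph is already tokenized. The function name, the top_k parameter, and the alphabetical tie-break are hypothetical choices.

```python
from collections import Counter

def core_vocabulary(paragraph_tokens, target_words, top_k=3):
    """Rank the target words of one paragraph by term frequency
    and return the top_k as (word, frequency) pairs.

    Assumption: term frequency stands in for the claim's
    "word frequency index".
    """
    total = len(paragraph_tokens)
    if total == 0:
        return []
    counts = Counter(t for t in paragraph_tokens if t in target_words)
    # Descending by count; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n / total) for word, n in ranked[:top_k]]

# Hypothetical tokenized paragraph and target vocabulary.
tokens = "graphene battery graphene lab graphene battery funding".split()
targets = {"graphene", "battery", "funding"}
core = core_vocabulary(tokens, targets, top_k=2)
```

The resulting core vocabulary would then be matched semantically against the application scenario feature vocabulary, as the remaining steps of the claim describe.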
The scene application method based on information classification according to claim 6, wherein:

The targeted operations include: hidden operations for experts named in obituaries, update operations for employing institutions, new operations for honors and awards, and/or batch associated storage operations for ranking lists.
A scene application system based on information classification, characterized in that the scene application system based on information classification includes:

The preprocessing module is used to format and preprocess the information data to generate information text that conforms to the format;

The information source attribute processing module is used to perform information source attribute processing on the information text according to the information source to generate an information source attribute processing result; the information source attribute processing result includes an information source feature result and an information application scenario relevance result;

The application scenario attribute processing module is configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting application scenario feature words of the information text;

The application module is used to calculate the word frequency index of the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hidden operations, update operations, new operations, and/or associated storage operations.
A medium with a computer program stored thereon, wherein the program is executed by a processor to implement the scene application method based on information classification according to any one of claims 1 to 7.
A device, characterized by comprising: a processor and a memory;

The memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device executes the scene application method based on information classification according to any one of claims 1 to 7.