WO2021035976A1 - Scenario application method and system based on information classification, and medium and device - Google Patents

Info

Publication number
WO2021035976A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
application scenario
information source
category
text
Prior art date
Application number
PCT/CN2019/117970
Other languages
French (fr)
Chinese (zh)
Inventor
王旭阳
孙沛基
朱悦
刘晋元
潘永春
Original Assignee
上海市研发公共服务平台管理中心
上海科技发展有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海市研发公共服务平台管理中心, 上海科技发展有限公司
Publication of WO2021035976A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention belongs to the field of information data application, and relates to a scene application method of information data, in particular to a scene application method, system, medium and equipment based on information classification.
  • Science and technology information is an important part of big data resources, and it falls into many categories. Users from different fields and backgrounds often have different retrieval purposes and needs, and as information acquirers they cannot accurately identify the information content they need.
  • The purpose of the present invention is to provide a scene application method, system, medium and equipment based on information classification, to solve the problem that the prior art cannot deliver and push crawled information data, by category, to specific user groups and application scenarios.
  • The scene application method based on information classification includes: formatting and preprocessing information data to generate information text that conforms to the format; performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result, which includes an information source feature result and an information application scenario correlation result; performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, extracting application scenario feature words from the information text to generate feature vocabularies for the different application scenarios; and performing word frequency index calculation on the information text so that the calculation result, combined with the information source attribute processing result and the application scenario feature vocabulary, drives targeted push of information. The targeted push includes hiding operations, update operations, adding operations, and/or associated storage operations.
  • the step of formatting and preprocessing information data to generate information text conforming to the format includes: performing noise reduction processing on the information data to obtain purified information text;
  • the noise reduction processing includes symbol noise reduction and text noise reduction;
  • word embedding technology is used to perform word segmentation and labeling processing on the information text to distinguish specific phrases through labeling;
  • the specific phrases include: time phrases, name phrases, and/or organization phrases;
  • The step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result includes: analyzing the information source of the information text to determine the category of the information source, the categories including comprehensive media, public platforms, management units, research institutions and/or industry media; and classifying the information text into one of the information source categories according to its information source to obtain the information source feature result.
  • The step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result further includes: calibrating the importance of each category of information source for the different application scenarios to determine the information application scenario correlation result, which refers to the dependency ratio of each application scenario on the different categories of information sources;
  • the categories of application scenarios include: achievement category, obituary category, employment category, enterprise industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, conference category, media hotspot category and/or policy category.
  • The step of performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, extracting application scenario feature words from the information text to generate feature vocabularies for the different application scenarios, includes: extracting nouns and/or verb phrases in the information text as application scenario feature words; counting the number of documents in which each application scenario feature word appears, the documents being all the documents formed from the information text; filtering out the application scenario feature words whose number of documents is within a preset range; and calculating the dependency coefficients between the selected application scenario feature words and, combined with the semantic vector of the information text, classifying the application scenario feature words into the matching application scenario categories to form an application scenario feature vocabulary.
  • The step of performing word frequency index calculation on the information text so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push includes: calculating the word frequency index of the target vocabulary of each paragraph in the information text, and determining the core vocabulary of each paragraph by combining the word frequency index with a preset rule; the preset rule includes sorting the word frequency indexes in descending order and extracting the target words whose word frequency index ranks in the first several places, the target vocabulary being the vocabulary selected according to the article category, including scientific and technological vocabulary; semantically matching the core vocabulary against the application scenario feature vocabulary to filter out the information text containing core vocabulary whose matching result is greater than the preset value; combining the information text with the category of the information source to generate an information source triple group, and combining the application scenario feature vocabulary to generate a feature word triple group; combining the information source triple group and the feature word triple group to determine the category of the application scenario to which the core vocabulary in the feature word triple group
  • The targeted operations include: a hiding operation for experts named in obituaries, an update operation for employing institutions, an adding operation for honors and awards, and/or a batch associated storage operation for lists.
  • the scene application system based on information classification includes: a preprocessing module for formatting and preprocessing information data to generate information text conforming to the format;
  • the information source attribute processing module is used for processing the information source attribute of the information text according to the information source to generate the information source attribute processing result;
  • the information source attribute processing result includes the information source characteristic result and the correlation result of the information application scenario
  • the application scenario attribute processing module is used to perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting the application scenario feature words of the information text
  • the application module is used to calculate the word frequency index of the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hiding operations, update operations, adding operations, and/or associated storage operations.
  • Another aspect of the present invention provides a medium on which a computer program is stored, and when the program is executed by a processor, the scene application method based on information classification is implemented.
  • The last aspect of the present invention provides a device including a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device executes the scene application method based on information classification.
  • The scene application method, system, medium and equipment based on information classification of the present invention have the following beneficial effects:
  • The present invention provides a classification method and scene application based on scientific and technological information, which covers the whole process of collecting, classifying and applying scientific and technological information. It combines the information source with full-text feature word segmentation to improve feature classification, which helps shorten lexicon construction and reduce judgment errors. It uses the collected information to design automatic classification, which saves the later cost of manual classification and application, and offers high practical value and a good fit to the scenarios.
  • FIG. 1 shows a schematic flow chart of an embodiment of the scene application method based on information classification of the present invention.
  • FIG. 2 shows a flow chart of preprocessing in an embodiment of the scene application method based on information classification of the present invention.
  • FIG. 3 is a schematic diagram of the weight ratio of the scene application method based on information classification in an embodiment of the present invention.
  • FIG. 4 shows a schematic structural diagram of the scene application system based on information classification in an embodiment of the present invention.
  • the technical principles of the scene application method, system, medium and equipment based on information classification of the present invention are as follows: format and preprocess information data; perform information source attribute processing on the information text according to the information source to generate information source attributes Processing result; according to the information application scene, the information source attribute processing result is processed by the application scene attribute to extract the application scene feature words of the information text, and then generate different application scene feature vocabularies; perform word frequency on the information text Index calculation, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push.
  • This embodiment provides a scene application method based on information classification.
  • the scene application method based on information classification includes:
  • Information source attribute processing is performed on the information text according to the information source to generate information source attribute processing results;
  • the information source attribute processing results include information source feature results and information application scenarios correlation results;
  • the targeted push includes hiding operations, update operations, adding operations, and/or associated storage operations.
  • In an embodiment of the present invention, tens of thousands of records crawled with the train browser over the past year from more than 100 news websites and self-media outlets are processed with natural language processing word segmentation to extract features for related fields, core content, and related experts; relevance is then ranked according to feature word frequency and vector weights; finally, the information is classified into different application scenarios through a comprehensive judgment of the data source and the data content.
  • FIG. 1 shows a principle flow chart of the scene application method based on information classification in an embodiment of the present invention.
  • the scene application method based on information classification specifically includes the following steps:
  • the information data is preprocessed by word segmentation technology to generate a word segmentation model, and the accuracy of the word segmentation model is optimized by noise reduction, word segmentation, grammar optimization, and format unification of the information data, and finally a word vector model is established.
  • In the word segmentation process, the information data is first split into sentences; each sentence is then segmented into words with part-of-speech tagging, and word embedding technology is used to build a word vector model with the sentence as the unit.
  • The word segmentation technology includes: a string-matching word segmentation method, a semantics-based word segmentation method, and/or a statistical word segmentation method.
  • FIG. 2 shows a flow chart of preprocessing in an embodiment of the scene application method based on information classification of the present invention.
  • the S11 includes:
  • S111 Perform noise reduction processing on the information data to obtain purified information text; the noise reduction processing includes symbol noise reduction and text noise reduction.
  • the noise reduction processing includes:
  • S112 Use word embedding technology to perform word segmentation and labeling processing on the information text, so that specific phrases can be distinguished by labeling; the specific phrases include: time phrases, name phrases, and/or organization phrases.
  • the word segmentation and labeling processing includes:
  • Words representing time are treated as a single chunk, which is a feature that distinguishes this method from mainstream word segmentation systems. For example, "December 1998" is treated as a single word block.
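  • The time-chunking rule can be illustrated with a minimal sketch. It assumes English text and a regular expression for "month name + year" or a bare four-digit year; the names `TIME_CHUNK`, `TOKEN` and `segment_with_time_chunks` are hypothetical and are not taken from the patent, which does not disclose its segmentation code.

```python
import re

# Hypothetical pattern (an assumption): optional English month name followed
# by a four-digit year between 1800 and 2099.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
TIME_CHUNK = re.compile(rf"(?:(?:{MONTHS})\s+)?\b(?:1[89]|20)\d{{2}}\b")
TOKEN = re.compile(r"\w+|[^\w\s]")  # plain words or single punctuation marks


def segment_with_time_chunks(sentence):
    """Split a sentence into (token, label) pairs, keeping each time phrase whole."""
    tokens = []
    pos = 0
    for match in TIME_CHUNK.finditer(sentence):
        # Ordinary tokens that precede the time phrase.
        tokens += [(t, "word") for t in TOKEN.findall(sentence[pos:match.start()])]
        # The whole time phrase becomes one labeled chunk.
        tokens.append((match.group(), "time"))
        pos = match.end()
    tokens += [(t, "word") for t in TOKEN.findall(sentence[pos:])]
    return tokens
```

  • With this sketch, "December 1998" comes back as one token labeled "time" instead of two separate words.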
  • S113 Perform grammatical deconstruction on the information text with a specific phrase mark by a grammar machine.
  • The grammar machine is used for Chinese grammar deconstruction, decomposing complex structures into simple structures. For example, after tagging, a sentence in an information text is presented in the following form: {time: 1987}, {time: 1990}, {order: successively}, {event: obtained}, {univ: the school}, {title: master degree}, {title: doctorate degree}.
  • The "sequence grammar machine" is triggered by {order: successively} in the information text.
  • The "sequence grammar machine" determines the order of the times, using {time: 1987} as one branch and {time: 1990} as another branch. It should be noted that the "sequence grammar machine" is triggered only if the sentence contains at least two time words that are not the same and the other components of the sentence contain content words matching the number of times; if these conditions are not satisfied, the "sequence grammar machine" reports a grammatical error.
  • The "reference grammar machine" is triggered by {univ: the school} in the information text. It searches backward for the most recently mentioned univ tag to find the specific school name referred to by "the school". It should be noted that the "reference grammar machine" looks back no more than 10 sentences and stops at the beginning of the article; if these conditions are not met, the "reference grammar machine" reports a grammatical error.
  • Branch 1: {time: 1987} {order: first} {event: obtained} {univ: Jilin University} {title: master};
  • Branch 2: {time: 1990} {order: later} {event: obtained} {univ: Jilin University} {title: PhD}.
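  • The two grammar machines described above can be sketched as follows. This is only an illustration under stated assumptions: a tagged sentence is modeled as a list of (tag, value) pairs, the order words "first" and "later" are assigned by position, and the names `GrammarError`, `resolve_reference` and `split_by_sequence` are hypothetical rather than the patent's own.

```python
class GrammarError(ValueError):
    """Raised when a grammar machine's preconditions are not met."""


def resolve_reference(sentence, previous_sentences,
                      placeholder="the school", window=10):
    """Replace a placeholder univ value with the most recent concrete mention."""
    resolved = []
    for tag, value in sentence:
        if tag == "univ" and value == placeholder:
            value = None
            # Walk backwards over at most `window` earlier sentences.
            for earlier in reversed(previous_sentences[-window:]):
                names = [v for t, v in earlier
                         if t == "univ" and v != placeholder]
                if names:
                    value = names[-1]
                    break
            if value is None:
                raise GrammarError("no antecedent found for " + placeholder)
        resolved.append((tag, value))
    return resolved


def split_by_sequence(sentence):
    """Split a sentence containing an order word into one branch per time."""
    if not any(tag == "order" for tag, _ in sentence):
        return [sentence]  # the machine is not triggered
    times = [v for t, v in sentence if t == "time"]
    titles = [v for t, v in sentence if t == "title"]
    # Assumption: each time must be matched by exactly one title.
    if len(set(times)) < 2 or len(titles) != len(times):
        raise GrammarError("need distinct times matched by the same number of titles")
    shared = [(t, v) for t, v in sentence
              if t not in ("time", "title", "order")]
    order_words = ["first"] + ["later"] * (len(times) - 1)
    return [
        [("time", time), ("order", order)] + shared + [("title", title)]
        for time, order, title in zip(times, order_words, titles)
    ]
```

  • Running `resolve_reference` before `split_by_sequence` means every branch already carries the concrete school name, which mirrors the two branches shown in the example.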
  • The preset format is determined by a formatter, which performs standardized conversion of the fields of the information text and supplements default values.
  • The formatter unifies and standardizes the storage of the components of a sentence according to field formats that meet the classification requirements of scientific and technological information application scenarios.
  • The formatter uses triggers to match each sentence with the formatter it requires, and then calls the corresponding formatter to normalize the fields and supplement default values.
  • The processing of the formatter is as follows:
  • S12 Perform information source attribute processing on the information text according to the information source to generate information source attribute processing results; the information source attribute processing results include information source feature results and information application scenarios correlation results.
  • the information source of the information text is analyzed to determine the category of the information source; the category of the information source includes: comprehensive media, public platforms, management units, research institutions, and/or industry media;
  • the information text is classified into one of the information source categories according to the information source to obtain the information source characteristic result.
  • The crawled information is preliminarily divided into comprehensive media, public platforms, management units, research institutions, and others according to the characteristics of the data source.
  • Comprehensive media such as Science Network and Science and Technology Daily carry more diverse content, a relatively large total volume of information, and a relatively large share of achievement information;
  • industry information on the WeChat public platform is mixed, its information types are widely distributed, and it is updated quickly; management units publish mostly policy news, and their conference hotspots carry high authority and public recognition but appear at low frequency; 90% of the information from university institutions concerns scientific and technological achievements, provides first-hand data on university development policies, achievements, and talent flow, and shows prominent institutional characteristics.
  • Xinzhiyuan, as a WeChat official account platform, has as its main business the planning of artificial intelligence-related conferences, and it has cooperative relationships with domestic AI companies.
  • The "Xinzhiyuan" WeChat official account is the first link in its industry chain. Its categories are relatively evenly represented with no obvious focus; categories such as achievements, employment, enterprises, industry hotspots, rankings, conferences, and macro statistics are balanced, and the quality is stable.
  • The importance of each category of information source for the different application scenarios is calibrated through weight calculation to determine the information application scenario correlation result.
  • The information application scenario correlation result refers to the dependency ratio of each application scenario on the different categories of information sources; the categories of the application scenarios include: achievement category, obituary category, employment category, enterprise industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, conference category, media hotspot category and/or policy category.
  • Information sources also cite one another, and these mutual references reflect the authority of an information source.
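  • One way to read the "dependency ratio" is as the share of each application scenario's texts that comes from each information source category. The sketch below makes that assumption explicit; the patent does not give the weight formula, and the function name `dependency_ratios` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def dependency_ratios(labeled_texts):
    """Return {scenario: {source_category: share}}.

    `labeled_texts` is an iterable of (source_category, scenario) pairs,
    one pair per information text (an assumed input layout).
    """
    counts = defaultdict(Counter)
    for source_category, scenario in labeled_texts:
        counts[scenario][source_category] += 1
    ratios = {}
    for scenario, by_source in counts.items():
        total = sum(by_source.values())
        # Share of this scenario's texts contributed by each source category.
        ratios[scenario] = {src: n / total for src, n in by_source.items()}
    return ratios
```

  • Under this reading, a scenario whose texts come almost entirely from one source category has a high dependency on that category.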
  • FIG. 3 shows a schematic diagram of the weight ratio of the scene application method based on information classification in an embodiment of the present invention.
  • A represents comprehensive media in the information source categories;
  • B represents public platforms in the information source categories;
  • C represents management units in the information source categories;
  • D represents university websites in the information source categories;
  • E represents others in the information source categories; for example, the other information sources E include industry media;
  • a represents the achievement category in the application scenario categories;
  • b represents the obituary category in the application scenario categories;
  • c represents the employment category in the application scenario categories;
  • d represents the enterprise-related category in the application scenario categories;
  • e represents the honorary award title category in the application scenario categories;
  • f represents the list category in the application scenario categories;
  • g represents the conference category in the application scenario categories;
  • h represents the field news figure hotspot category in the application scenario categories;
  • i represents the policy category in the application scenario categories;
  • j represents the integrity and ethical issues category in the application scenario categories.
  • The categories of the application scenarios can also be assigned labels with specific meanings for identification or retrieval, such as: A-achievement category, D-obituary category, EM-employment category, ET-enterprise related, H-honorary award title, L-list, M-conference, N-field news figure hotspot, P-policy, PO-integrity and ethical issues, ST-macro statistical report.
  • Achievement category: contains profiles of people and cooperation between domestic and foreign institutions and research groups.
  • The expert profiles in the information may include honors not yet on record and rare field subdivisions.
  • These can be added to the expert profile, and the achievements themselves can be used to define the latest research content and research direction.
  • List and ranking category: the scope of rankings includes universities, achievements, disciplines, enterprises, scholars, etc. It offers not only selection indicators from domestic and foreign institutions but also a large amount of normalized list content for batch acquisition.
  • Conference category: includes government conferences, scientific and technological forums, and achievement challenges. Externally hosted academic conferences provide information on cooperation between foreign professors and the country. International conferences provide background information on participants and institutions. Artificial intelligence conferences are also an important reference for field classification and a source of the latest achievement data.
  • Media hotspot category: covers a wider range of content, usually introductions to and prospects for new technologies related to industry, academia and research, the transformation of achievements, the latest achievements of popular technology companies, and detailed profiles of scholars, corporate executives, scientific research teams, and famous teachers.
  • Policy category: mainly includes the latest instructions of local governments on talent and infrastructure construction, interpretations of national science and technology policies and situations, new disciplines and industry standards established by various institutions, the launch of large-scale projects, international cooperation agreements, and major policy adjustments in foreign countries. Policy researchers can use it as background or comparison material.
  • Integrity and ethical issues category: common content includes paper retractions and academic scandals in various fields, as well as ethical reflections on emerging disciplines and technologies. It is an important consideration for expert evaluation and employment, and it also supports tracking of hot topics in international research disputes.
  • the S13 includes:
  • nouns and noun phrases with part of speech starting with n or verb phrases with part of speech of v are extracted from the segmented information. It should be noted that if the following correspondences are set in the part-of-speech tagging: n-noun, nt-organization group, nz-other proper nouns, words with the part-of-speech tag beginning with nt or nz can also be extracted during extraction.
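  • The part-of-speech filter can be sketched in a few lines. The sketch assumes the segmenter yields (word, tag) pairs with the tag set mentioned above (n, nt, nz, v); the function name `extract_candidate_words` is hypothetical.

```python
def extract_candidate_words(tagged_tokens):
    """Keep words whose tag starts with 'n' (n, nt, nz, ...) or is exactly 'v'."""
    return [word for word, pos in tagged_tokens
            if pos.startswith("n") or pos == "v"]
```

  • Because the test is a prefix match on "n", organization names (nt) and other proper nouns (nz) are kept without extra rules.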
  • the DF value of the feature word of the application scenario is calculated, and the DF value represents the number of documents in which the feature word of the application scenario appears.
  • the DF or df refers to the document frequency
  • DF calculation is a feature extraction technology. Because of its linear calculation complexity relative to the scale of the text database, it can be easily used for large-scale document statistics.
  • S133 Filter out a number of the application scenario feature words whose number of documents is within a preset range.
  • the application scenario feature words are selected according to a criterion that the DF value of the application scenario feature words is greater than 5 and less than 20% of the total number of documents. It should be noted that the value greater than 5 and less than 20% of the total number of documents is an example of the preset range, and the rest of the numerical ranges that can be used to define and filter application scenario feature words are also within the scope of the present invention.
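  • The DF count and the example range (greater than 5 and less than 20% of the total number of documents) can be sketched as below. Documents are assumed to be lists of candidate feature words; the function names and the parameters `min_df` and `max_share` are hypothetical.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for words in documents:
        df.update(set(words))  # set(): count each word once per document
    return df


def filter_by_df(documents, min_df=5, max_share=0.20):
    """Keep words with min_df < DF < max_share * (number of documents)."""
    df = document_frequency(documents)
    upper = max_share * len(documents)
    return {word for word, n in df.items() if min_df < n < upper}
```

  • The single pass over the documents reflects the linear cost noted above; the two bounds drop words that are too rare to generalize and words too common to discriminate between scenarios.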
  • S134 Calculate the dependency coefficients between the selected application scenario feature words and, combined with the semantic vector of the information text, classify the application scenario feature words into the matching application scenario categories to form an application scenario feature vocabulary.
  • the selected application scenario feature words are formed into a feature extraction vocabulary according to the classification of the application scenario, and 11 extracted word sets are divided accordingly.
  • the categories and feature words of the application scenario are edited in the form of a table to form 11 extracted word sets.
  • an example of the feature word extraction word set is as follows, please refer to Table 1 for the extracted word set classification table. It can be seen from Table 1 that "published" as a feature word is classified into the achievement category of the application scenario category.
  • the word frequency index calculation is performed on the target vocabulary in the formatted information text to determine the number of times each target vocabulary appears in the information text, thereby representing the weight of the target vocabulary in the information text.
  • the S14 includes:
  • S141 Calculate the word frequency index of the target vocabulary of each paragraph in the information text, and determine the core vocabulary of each paragraph by combining the word frequency index with a preset rule; the preset rule includes sorting the word frequency indexes in descending order and extracting the target words whose word frequency index ranks in the first several places. The target words are words selected according to article category, including scientific and technological words.
  • Each scientific and technological information text is regarded as a document. The scientific and technological vocabulary is extracted from the full-text data of the information, the idf value of every word in that vocabulary is calculated, the scientific and technological words in each paragraph are extracted, and the top-ranked words in descending order of tf-idf value are taken as the core vocabulary.
  • The idf value is the inverse document frequency of a scientific and technological word, and the calculation formula is as follows:
  • $\mathrm{idf}(w) = \log \dfrac{N}{\mathrm{df}(w)}$
  • w represents the scientific and technological vocabulary
  • idf(w) represents the inverse document frequency of the scientific and technological vocabulary w
  • N is the number of documents
  • df(w) represents the number of documents containing the scientific and technological vocabulary w.
  • The number of sentences L in the paragraph is obtained, and the top L words in descending order are taken as the core vocabulary of the paragraph. It should be noted that the number of core words extracted follows the number of sentences in the paragraph: one sentence may yield multiple core words, and the core words of different sentences in a paragraph may repeat, so the words ranked highest by word frequency are selected as the core words of the whole paragraph.
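  • The paragraph-level selection can be sketched as follows, assuming the common definition idf(w) = log(N / df(w)) and a plain term count as tf; both choices, and the names `idf_table` and `core_words`, are assumptions for illustration rather than the patent's stated implementation.

```python
import math
from collections import Counter


def idf_table(documents, vocabulary):
    """idf(w) = log(N / df(w)) for each vocabulary word found in some document."""
    n_docs = len(documents)
    table = {}
    for word in vocabulary:
        df = sum(1 for doc in documents if word in doc)
        if df:
            table[word] = math.log(n_docs / df)
    return table


def core_words(paragraph_tokens, n_sentences, idf):
    """Return the top `n_sentences` vocabulary words of a paragraph by tf-idf."""
    tf = Counter(w for w in paragraph_tokens if w in idf)
    # Descending tf-idf; ties are broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda w: (-tf[w] * idf[w], w))
    return ranked[:n_sentences]
```

  • Here `n_sentences` plays the role of L: a paragraph with more sentences contributes more core words.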
  • S142 Perform semantic matching of the core vocabulary against the application scenario feature vocabulary, to filter out the information text containing core vocabulary whose matching result is greater than a preset value.
  • the semantic similarity between the core vocabulary and the extracted feature vocabulary is calculated, and the articles containing core vocabulary with a semantic similarity greater than 0.5 are extracted.
  • 0.5 is one embodiment of the preset value; other preset values that can be used for semantic matching are all included in the scope of the present invention.
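  • The filtering in S142 can be illustrated with a short sketch. The text does not say how semantic similarity is computed; cosine similarity over word vectors is assumed here, and the `vectors` table, the function names, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_texts(texts, feature_words, vectors, threshold=0.5):
    """Keep the ids of texts that have at least one core word whose best
    similarity to any feature word is strictly greater than the threshold.

    texts: dict mapping text id -> list of core words.
    vectors: dict mapping word -> embedding (assumed to be given).
    """
    kept = []
    for text_id, core_words in texts.items():
        best = max(
            (
                cosine(vectors[c], vectors[f])
                for c in core_words if c in vectors
                for f in feature_words if f in vectors
            ),
            default=0.0,
        )
        if best > threshold:
            kept.append(text_id)
    return kept


vectors = {"award": [1.0, 0.0], "prize": [0.9, 0.1], "obituary": [0.0, 1.0]}
texts = {"a": ["prize"], "b": ["obituary"]}
matched = filter_texts(texts, ["award"], vectors)
```

  • The threshold is a parameter because the text treats 0.5 as only one possible preset value.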
  • S143 Combine the information text with the category of the information source to generate an information source triple group, and combine it with the application scenario feature vocabulary to generate a feature word triple group.
  • the triples containing the name of the information item are mainly of two types: the first is the is-a relation triple based on the classification of the information source, namely <information name, isA, information source category name>, where isA indicates the information source of the information text; the second is the feature-word-based triple <information name, feature word category name, attribute value>.
  • the matching degree of the data source is calculated against the application scenarios, and the result is used as the is-a relation triple of the information source classification, <information name, isA, information source category name>.
  • the application scenario category to which a certain piece of information text belongs is determined by the attribute classification features in the information source triple group and the feature word triple group.
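  • The two kinds of triples can be represented as below. This is a sketch under assumptions: the `lexicon` mapping from feature word to application scenario category is a hypothetical layout of the feature vocabulary, and the category of a text is decided here by a simple majority vote over its feature word triples, which the text does not prescribe.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # information name
    predicate: str  # "isA" or a feature word category name
    value: str      # information source category name or attribute value


def source_triple(info_name, source_category):
    """Build <information name, isA, information source category name>."""
    return Triple(info_name, "isA", source_category)


def feature_triples(info_name, core_words, lexicon):
    """Build <information name, feature word category name, attribute value>
    for every core word that appears in the feature vocabulary."""
    return [Triple(info_name, lexicon[w], w) for w in core_words if w in lexicon]


def scenario_category(triples):
    """Pick the most frequent feature word category (assumed majority vote)."""
    votes = Counter(t.predicate for t in triples if t.predicate != "isA")
    return votes.most_common(1)[0][0] if votes else None


lexicon = {"award": "honor", "prize": "honor", "appointed": "employment"}
src = source_triple("news-1", "research institution")
feats = feature_triples("news-1", ["award", "appointed", "prize", "weather"], lexicon)
category = scenario_category([src] + feats)
```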
  • S145 Select the top three core words after sorting, and look up the application scenario category corresponding to each of them, to determine the information source with the highest category dependency for that application scenario.
  • once the entire information text has a core vocabulary sorted by word frequency, the application scenario categories of the original ten thousand documents are compared with the 11 categories of information in the actual database, and the application scenario feature vocabulary is called to obtain the application scenario category of the information text; a one-to-many cross calculation is then performed with the categories of the information source, and the final result is unified according to the scenario with the largest overlap, so as to determine the information source with the highest category dependency for the application scenario.
  • the application scenarios to which the first three feature words belong are pushed in a targeted manner.
  • targeted push of the application scenarios to which the first three feature words belong is one implementation of the present invention; the application scenarios to which some other number of feature words belong can also be selected for targeted push.
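  • The selection in S145 can be sketched as follows. The sketch assumes a precomputed `dependency` table (application scenario category mapped to per-source-category ratios) and interprets "the scenario with the largest overlap" as the category shared by the most top-ranked core words; both the data layout and that interpretation are assumptions.

```python
from collections import Counter


def pick_scenario_and_source(ranked_core_words, lexicon, dependency, top_k=3):
    """Return (scenario category, information source category) for a text.

    ranked_core_words: core words already sorted by word frequency ranking.
    lexicon: feature word -> application scenario category (hypothetical).
    dependency: scenario category -> {source category: dependency ratio}.
    top_k: how many leading core words vote; the text uses three.
    """
    top = ranked_core_words[:top_k]
    votes = Counter(lexicon[w] for w in top if w in lexicon)
    if not votes:
        return None
    scenario = votes.most_common(1)[0][0]
    ratios = dependency[scenario]
    source = max(ratios, key=ratios.get)
    return scenario, source


lexicon = {"award": "honor", "prize": "honor", "appointed": "employment"}
dependency = {
    "honor": {"research institution": 0.5, "integrated media": 0.3, "industry media": 0.2},
    "employment": {"management unit": 0.7, "integrated media": 0.3},
}
result = pick_scenario_and_source(["award", "appointed", "prize", "team"], lexicon, dependency)
```

  • Keeping `top_k` as a parameter reflects the statement that a number of feature words other than three can also be used.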
  • the targeted operations include: a hiding operation for experts in the obituary category, an update of the employing institution in the employment category, an adding operation for the honors and awards category, and/or a batch association storage operation for the list category.
  • the information text of the list category can be directly entered into the database as incremental data according to the partial word segmentation results.
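  • The four targeted operations can be pictured as a dispatch on the scenario category. The in-memory `experts` dictionary, the record fields, and the category labels below are hypothetical stand-ins for whatever database the embodiment actually uses.

```python
def apply_targeted_operation(experts, scenario, record):
    """Mutate an in-memory expert store according to the scenario category."""
    if scenario == "obituary":
        # Hiding operation: the expert is no longer shown.
        experts[record["expert"]]["hidden"] = True
    elif scenario == "employment":
        # Update operation: replace the employing institution.
        experts[record["expert"]]["institution"] = record["institution"]
    elif scenario == "honor":
        # Adding operation: append a new honor or award.
        experts[record["expert"]].setdefault("awards", []).append(record["award"])
    elif scenario == "list":
        # Batch association storage: link every listed name to the list.
        for name in record["names"]:
            entry = experts.setdefault(name, {})
            entry.setdefault("lists", []).append(record["list_name"])
    else:
        raise ValueError(f"no targeted operation for scenario {scenario!r}")
    return experts


experts = {"Li": {"institution": "A"}}
apply_targeted_operation(experts, "employment", {"expert": "Li", "institution": "B"})
apply_targeted_operation(experts, "honor", {"expert": "Li", "award": "X Prize"})
apply_targeted_operation(experts, "list", {"names": ["Li", "Wang"], "list_name": "2019 fellows"})
apply_targeted_operation(experts, "obituary", {"expert": "Li"})
```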
  • This embodiment provides a computer storage medium on which a computer program is stored, and when the computer program is executed by a processor, the scene application method based on information classification is implemented.
  • a person of ordinary skill in the art can understand that all or part of the steps in the foregoing method embodiments can be implemented by hardware related to a computer program.
  • the aforementioned computer program can be stored in a computer-readable storage medium. When the program is executed, it executes the steps including the foregoing method embodiments; and the foregoing computer-readable storage medium includes: ROM, RAM, magnetic disk, or optical disk and other computer storage media that can store program codes.
  • the scenario application method based on information classification in this embodiment can realize the classified placement and flexible operation of specific user groups and application scenarios after batch crawling of information data of different information sources such as webpages and official account news sources.
  • the scene application system based on information classification includes:
  • the preprocessing module is used to format and preprocess the information data to generate information text that conforms to the format;
  • the information source attribute processing module is used for processing the information source attribute of the information text according to the information source to generate the information source attribute processing result;
  • the information source attribute processing result includes the information source feature result and the relevance result of the information application scenario;
  • the application scenario attribute processing module is configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting application scenario feature words of the information text;
  • the application module is used to calculate the word frequency index of the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hiding operations, update operations, adding operations, and/or associated storage operations.
  • an x module may also be stored in the memory of the system in the form of program code, and be called by a processing element of the system to execute the function of that x module.
  • the implementation of other modules is similar. All or part of these modules can be integrated together or implemented independently.
  • the processing element described here may be an integrated circuit with signal processing capabilities. In the implementation process, the steps of the above method or the following modules can be completed by hardware integrated logic circuits in the processor element or instructions in the form of software.
  • these modules may be one or more integrated circuits configured to implement the above methods, for example: one or more Application Specific Integrated Circuits (ASIC for short), one or more Digital Signal Processors (DSP for short), one or more Field Programmable Gate Arrays (FPGA for short), etc.
  • the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU for short) or other processors that can call program codes.
  • These modules can be integrated together and implemented in the form of System-on-a-chip (SOC for short).
  • FIG. 4 shows a schematic structural diagram of the scene application system based on information classification in an embodiment of the present invention.
  • the scene application system 4 based on information classification includes: a preprocessing module 41, an information source attribute processing module 42, an application scenario attribute processing module 43, and an application module 44.
  • the preprocessing module 41 is used for formatting and preprocessing the information data to generate information text conforming to the format.
  • the preprocessing module 41 is specifically configured to: perform noise reduction processing on the information data to obtain purified information text, the noise reduction processing including symbol noise reduction and text noise reduction; perform word segmentation and labeling processing on the information text using word embedding technology, so that specific phrases can be distinguished by their labels, the specific phrases including time phrases, name phrases, and/or institution phrases; grammatically deconstruct the information text annotated with specific phrases by means of a grammar machine; and store the grammatically deconstructed information text in a preset format using a format machine, the preset format being determined by a formatter that converts the fields of the information text into a standardized format and fills in default values.
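  • Two of the preprocessing steps, noise reduction and the formatter's default-value filling, lend themselves to a short sketch. The field names in `DEFAULTS` and the specific cleaning rules are illustrative assumptions; word segmentation, labeling, and the grammar machine are not covered here.

```python
import re

# Hypothetical preset format: field name -> default value.
DEFAULTS = {"title": "", "source": "unknown", "published": None, "body": ""}


def denoise(raw):
    """Symbol and text noise reduction on one string."""
    text = re.sub(r"<[^>]+>", " ", raw)         # drop leftover markup tags
    text = re.sub(r"[\u200b\ufeff]", "", text)  # drop zero-width space and BOM
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()


def format_record(record):
    """Keep only the preset fields, clean string values, fill in defaults."""
    out = dict(DEFAULTS)
    for key, value in record.items():
        if key in DEFAULTS and value is not None:
            out[key] = denoise(value) if isinstance(value, str) else value
    return out


clean = denoise("<p>Hello\u200b  world</p>\n")
formatted = format_record({"title": " A  b ", "extra": 1})
```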
  • the information source attribute processing module 42 is configured to perform information source attribute processing on the information text according to the information source to generate information source attribute processing results; the information source attribute processing results include information source feature results and relevance results of the information application scenarios.
  • the information source attribute processing module 42 is specifically used to analyze the information source of the information text to determine the category of the information source, the categories of the information source including: integrated media, public platforms, management units, research institutions, and/or industry media; and to classify the information text into one of the information source categories according to its information source, to obtain the information source feature results. Through weight calculation, the importance of each information source category for the different application scenarios is calibrated to determine the relevance results of the information application scenarios; the relevance results of the information application scenarios are the ratios of the degree of dependence that each application scenario has on the different information source categories; the categories of the application scenarios include: achievement category, obituary category, employment category, enterprise industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, meeting category, media hotspot category, and/or policy category.
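  • The text does not spell out the weight calculation. One simple possibility, shown here purely as an assumption, is to estimate each scenario's dependency ratios from the relative frequency of source categories among already-labeled texts; `dependency_ratios` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def dependency_ratios(labeled):
    """Estimate, for each application scenario category, the share of its
    texts that come from each information source category.

    labeled: iterable of (source category, scenario category) pairs.
    Returns scenario -> {source category: ratio}; ratios sum to 1 per scenario.
    """
    counts = defaultdict(Counter)
    for source_category, scenario in labeled:
        counts[scenario][source_category] += 1
    return {
        scenario: {src: n / sum(c.values()) for src, n in c.items()}
        for scenario, c in counts.items()
    }


labeled = [("research institution", "achievement")] * 3 + [("integrated media", "achievement")]
ratios = dependency_ratios(labeled)
```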
  • the application scenario attribute processing module 43 is configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting the application scenario feature words of the information text.
  • the application scenario attribute processing module 43 is specifically configured to: extract nouns and/or verb phrases in the information text as application scenario feature words; count the number of documents in which each application scenario feature word appears, the number of documents being counted over the collection formed by all of the information texts; filter out the application scenario feature words whose document counts fall within a preset range; and, by calculating the dependence coefficients between those application scenario feature words and combining them with the semantic vector of the information text, classify the application scenario feature words into the matching application scenario categories to form an application scenario feature vocabulary.
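  • The document-count filter in this step can be sketched as below, assuming the candidate nouns and verb phrases have already been extracted into token lists. The bounds `min_df` and `max_df` stand in for the unspecified preset range; the dependence coefficient calculation and the semantic vectors are not shown.

```python
from collections import Counter


def candidate_feature_words(documents, min_df, max_df):
    """Return the words whose document count lies in [min_df, max_df].

    documents: list of token lists (candidate nouns / verb phrases per text).
    A word is counted once per document, however often it occurs there.
    """
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return sorted(w for w, n in df.items() if min_df <= n <= max_df)


docs = [["award", "the", "award"], ["award", "the", "policy"], ["the"]]
candidates = candidate_feature_words(docs, min_df=2, max_df=2)
```

  • A bounded range drops both words too rare to characterize a scenario and words so common that they appear in nearly every text.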
  • the application module 44 is configured to perform word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted information push; the targeted push includes hiding operations, update operations, adding operations, and/or associated storage operations.
  • the application module 44 is specifically configured to: calculate the word frequency index of the target vocabulary in each paragraph of the information text, and determine the core vocabulary of each paragraph by combining the word frequency index with a preset rule, the preset rule including sorting the word frequency indexes in descending order and extracting the target words corresponding to the first several word frequency indexes, where the target words are words selected according to the article category, including scientific and technological vocabulary; perform semantic matching of the core vocabulary against the application scenario feature vocabulary, to filter out the information text containing core vocabulary whose matching result is greater than the preset value; combine the information text with the category of the information source to generate an information source triple group, and combine it with the application scenario feature vocabulary to generate a feature word triple group; combine the information source triple group and the feature word triple group to determine the application scenario category to which the core vocabulary in the feature word triple group belongs; select the top three core words after sorting, and look up the application scenario category corresponding to each core word to determine the information source with the highest category dependency for the application scenario; and push the information text to the determined information source with the highest dependency and perform the targeted operations.
  • the scene application system based on information classification of this embodiment can realize the classified delivery and flexible operation of specific user groups and application scenarios after batch crawling of information data of different information sources such as webpages and official account news sources.
  • This embodiment provides a device including: a processor, a memory, a transceiver, a communication interface, and/or a system bus; the memory and the communication interface are connected to the processor and the transceiver through the system bus and communicate with one another; the memory is used to store a computer program, the communication interface is used to communicate with other devices, and the processor and the transceiver are used to run the computer program so that the device executes each step of the scene application method based on information classification.
  • the aforementioned system bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the system bus can be divided into address bus, data bus, control bus and so on.
  • the communication interface is used to realize the communication between the database access device and other devices (such as client, read-write library and read-only library).
  • the memory may include random access memory (Random Access Memory, RAM for short), and may also include non-volatile memory, such as at least one disk memory.
  • the above-mentioned processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field programmable gate array (Field Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the present invention also provides a scene application system based on information classification.
  • the scene application system based on information classification can implement the scene application method based on information classification of the present invention, but the devices implementing that method include, but are not limited to, the structure of the scene application system listed in this embodiment. Any structural modification or replacement of the prior art made according to the principles of the present invention is included in the protection scope of the present invention.
  • the scene application method, system, medium, and equipment based on information classification of the present invention comprehensively consider whole-process control of scientific and technological information collection, classification, and scenario application; feature classification is improved by combining the information source with full-text feature word segmentation, which helps to reduce the lexicon construction effort and judgment errors; automatic classification is designed from use cases of the collected information, which saves the later cost of manual classification and application, and has high practical value and a good fit with the scenarios.
  • the invention effectively overcomes various shortcomings in the prior art and has a high industrial value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a scenario application method and system based on information classification, and a medium and a device. The scenario application method based on information classification comprises: performing formatting pre-processing on information data; performing information source attribute processing on information text according to an information source so as to generate an information source attribute processing result; performing application scenario attribute processing on the information source attribute processing result according to an information application scenario so as to generate different application scenario feature word libraries after extracting application scenario feature words of the information text; and performing word frequency index calculation on the information text so as to push information in a targeted manner by combining a calculation result with the information source attribute processing result and the application scenario feature word libraries. According to the present invention, information crawled in batches can be flexibly and accurately classified and released.

Description

Scenario application method, system, medium, and device based on information classification
Technical field
The present invention belongs to the field of information data applications and relates to a scenario application method for information data, and in particular to a scenario application method, system, medium, and device based on information classification.
Background art
With the rapid development of the Internet, information data from various channels has become vast and complex, and the accuracy of the messages spread through some channels cannot be guaranteed, which can mislead those who obtain the information. How to extract and use this information effectively has become a major challenge; even with web crawlers, the crawled information data cannot be pushed accurately through authoritative channels.
Taking scientific and technological information as an example, it is an important part of scientific and technological big data resources and falls into many categories. Users from different fields and backgrounds often have different retrieval purposes and needs, and users, as information obtainers, cannot accurately find the information content they need.
Therefore, how to classify and deliver information data for specific user groups and application scenarios, after it has been crawled in batches from different information sources such as web pages and official account news sources, has become a technical problem urgently to be solved by those skilled in the art.
Summary of the invention
In view of the above shortcomings of the prior art, the purpose of the present invention is to provide a scenario application method, system, medium, and device based on information classification, to solve the problem that the prior art cannot classify, deliver, and push crawled information data for specific user groups and application scenarios.
To achieve the above purpose and other related purposes, one aspect of the present invention provides a scenario application method based on information classification, which includes: formatting and preprocessing information data to generate information text that conforms to the format; performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result, the information source attribute processing result including an information source feature result and a relevance result of the information application scenario; performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting the application scenario feature words of the information text; and performing word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted push of the information; the targeted push includes hiding operations, update operations, adding operations, and/or associated storage operations.
In an embodiment of the present invention, the step of formatting and preprocessing the information data to generate information text that conforms to the format includes: performing noise reduction processing on the information data to obtain purified information text, the noise reduction processing including symbol noise reduction and text noise reduction; performing word segmentation and labeling processing on the information text using word embedding technology, so that specific phrases can be distinguished by their labels, the specific phrases including time phrases, name phrases, and/or institution phrases; grammatically deconstructing the information text annotated with specific phrases by means of a grammar machine; and storing the grammatically deconstructed information text in a preset format using a format machine, the preset format being determined by a formatter that converts the fields of the information text into a standardized format and fills in default values.
In an embodiment of the present invention, the step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result includes: analyzing the information source of the information text to determine the category of the information source, the categories of the information source including integrated media, public platforms, management units, research institutions, and/or industry media; and classifying the information text into one of the information source categories according to its information source, to obtain the information source feature result.
In an embodiment of the present invention, the step of performing information source attribute processing on the information text according to the information source to generate an information source attribute processing result further includes: calibrating, through weight calculation, the importance of each information source category for the different application scenarios, to determine the relevance result of the information application scenario, the relevance result being the ratios of the degree of dependence that each application scenario has on the different information source categories; the categories of the application scenarios include: achievement category, obituary category, employment category, enterprise industry category, integrity and ethical issues category, ranking category, honor category, macro statistical report category, meeting category, media hotspot category, and/or policy category.
In an embodiment of the present invention, the step of performing application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting the application scenario feature words of the information text, includes: extracting nouns and/or verb phrases in the information text as application scenario feature words; counting the number of documents in which each application scenario feature word appears, the number of documents being counted over the collection formed by all of the information texts; filtering out the application scenario feature words whose document counts fall within a preset range; and, by calculating the dependence coefficients between those application scenario feature words and combining them with the semantic vector of the information text, classifying the application scenario feature words into the matching application scenario categories to form an application scenario feature vocabulary.
In an embodiment of the present invention, the step of performing word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted push of the information, includes: calculating the word frequency index of the target vocabulary in each paragraph of the information text, and determining the core vocabulary of each paragraph by combining the word frequency index with a preset rule, the preset rule including sorting the word frequency indexes in descending order and extracting the target words corresponding to the first several word frequency indexes, where the target words are words selected according to the article category, including scientific and technological vocabulary; performing semantic matching of the core vocabulary against the application scenario feature vocabulary, to filter out the information text containing core vocabulary whose matching result is greater than a preset value; combining the information text with the category of the information source to generate an information source triple group, and combining it with the application scenario feature vocabulary to generate a feature word triple group; combining the information source triple group and the feature word triple group to determine the application scenario category to which the core vocabulary in the feature word triple group belongs; selecting the top three core words after sorting, and looking up the application scenario category corresponding to each core word to determine the information source with the highest category dependency for the application scenario; and pushing the information text to the determined information source with the highest dependency and performing targeted operations.
In an embodiment of the present invention, the targeted operations include: a hiding operation for experts in the obituary category, an update of the employing institution in the employment category, an adding operation for the honors and awards category, and/or a batch association storage operation for the list category.
Another aspect of the present invention provides a scenario application system based on information classification, which includes: a preprocessing module, used to format and preprocess information data to generate information text that conforms to the format; an information source attribute processing module, used to perform information source attribute processing on the information text according to the information source to generate an information source attribute processing result, the information source attribute processing result including an information source feature result and a relevance result of the information application scenario; an application scenario attribute processing module, used to perform application scenario attribute processing on the information source attribute processing result according to the information application scenario, so as to generate different application scenario feature vocabularies after extracting the application scenario feature words of the information text; and an application module, used to perform word frequency index calculation on the information text, so as to combine the calculation result with the information source attribute processing result and the application scenario feature vocabulary for targeted push of the information; the targeted push includes hiding operations, update operations, adding operations, and/or associated storage operations.
A further aspect of the present invention provides a medium on which a computer program is stored; when the program is executed by a processor, the scenario application method based on information classification is implemented.
A final aspect of the present invention provides a device including a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device executes the scenario application method based on information classification.
As described above, the scenario application method, system, medium and device based on information classification of the present invention have the following beneficial effects:
The present invention provides a classification method and scenario application based on science and technology information, which comprehensively considers whole-process control over the collection, classification and scenario application of such information. Combining information sources with full-text feature word segmentation to refine feature classification helps reduce lexicon construction effort and judgment errors. Designing the automatic classification around use cases of information already collected saves the cost of later manual classification and application, and offers high practical value and a good fit to the scenarios.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the scenario application method based on information classification of the present invention in one embodiment.
FIG. 2 is a flowchart of the preprocessing in the scenario application method based on information classification of the present invention in one embodiment.
FIG. 3 is a schematic diagram of the weight ratios in the scenario application method based on information classification of the present invention in one embodiment.
FIG. 4 is a schematic structural diagram of the scenario application system based on information classification of the present invention in one embodiment.
Description of Reference Numerals
4           Scenario application system based on information classification
41          Preprocessing module
42          Information source attribute processing module
43          Application scenario attribute processing module
44          Application module
S11~S14     Steps of the scenario application method based on information classification
S111~S114   Preprocessing steps for information data
Detailed Description
The implementation of the present invention is described below through specific examples. Those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or changed on the basis of different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, where no conflict arises, the following embodiments and the features in them can be combined with one another.
It should be noted that the drawings provided in the following embodiments illustrate the basic concept of the present invention only schematically. The drawings show only the components related to the present invention and are not drawn according to the number, shape and size of components in an actual implementation. In an actual implementation, the form, number and proportion of the components may vary freely, and the component layout may be more complex.
The technical principle of the scenario application method, system, medium and device based on information classification of the present invention is as follows: perform formatting preprocessing on information data; perform information source attribute processing on the information text according to its information source, to generate an information source attribute processing result; perform application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract application scenario feature words from the information text and then generate different application scenario feature lexicons; and compute a word frequency index for the information text, so that the computed result is combined with the information source attribute processing result and the application scenario feature lexicons for targeted pushing of information.
Embodiment 1
This embodiment provides a scenario application method based on information classification, comprising:
performing formatting preprocessing on information data to generate information text that conforms to a format;
performing information source attribute processing on the information text according to its information source, to generate an information source attribute processing result, the result comprising an information source feature result and a relevance result for information application scenarios;
performing application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract application scenario feature words from the information text and then generate different application scenario feature lexicons;
computing a word frequency index for the information text, so that the computed result is combined with the information source attribute processing result and the application scenario feature lexicons for targeted pushing of information, the targeted pushing comprising a hide operation, an update operation, an add operation and/or an associated storage operation.
The scenario application method based on information classification provided by this embodiment is described in detail below with reference to the drawings.
One embodiment of the present invention is based on tens of thousands of items crawled with Huoche Browser (火车浏览器) over nearly one year from more than 100 news websites and self-media outlets. Word segmentation from natural language processing is used to extract features for the fields involved, the core content and the related experts; relevance is then ranked according to feature word frequency and vector weights; finally, the information is assigned to different application scenarios through a combined judgment of the data source and the data content.
Please refer to FIG. 1, which is a schematic flowchart of the scenario application method based on information classification of the present invention in one embodiment. As shown in FIG. 1, the method specifically includes the following steps:
S11: Perform formatting preprocessing on the information data to generate information text that conforms to a format.
Specifically, the information data is preprocessed with word segmentation technology to generate a word segmentation model; the accuracy of the segmentation model is improved through noise reduction, word segmentation, grammar optimization and format unification of the information data; and a word vector model is finally established. Further, during segmentation, the information data is split into sentences, each sentence is segmented with part-of-speech tagging, and the word vector model is built sentence by sentence using word embedding technology. It should be noted that the word segmentation technology includes string-matching segmentation, semantics-based segmentation and/or statistical segmentation.
Please refer to FIG. 2, which is a flowchart of the preprocessing in the scenario application method based on information classification of the present invention in one embodiment. As shown in FIG. 2, S11 includes:
S111: Perform noise reduction on the information data to obtain cleaned information text; the noise reduction includes symbol noise reduction and text noise reduction.
In a practical application of this embodiment, the noise reduction includes:
(1) Converting full-width symbols to half-width symbols, for example a full-width space to a half-width space.
(2) Replacing special symbols with common ones, for example "①⑨⑧⑤年" with "1985年" (the year 1985).
(3) Simplifying symbol usage, for example replacing tab characters with spaces, replacing braces and square brackets with parentheses, and replacing the enumeration comma (、) with a comma, so that as far as possible all symbols are reduced to commas and periods and the information text is simplified to the greatest extent.
(4) Correcting wrongly written characters against a dictionary of common Chinese words and the Ministry of Education directory of higher education institutions, for example changing "气水" to "汽水" (soda).
(5) Converting traditional characters to simplified characters, for example "國家" to "国家" (country).
(6) Unifying wording, for example changing "圣巴巴拉分校" to "圣芭芭拉分校" (two spellings of the Santa Barbara campus).
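The six noise-reduction steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that Unicode NFKC normalization is an acceptable stand-in for steps (1) and (2), and the lookup tables `SYMBOL_MAP`, `TRAD_TO_SIMP` and `TERM_MAP` are hypothetical miniature versions of the dictionaries the text refers to.

```python
import unicodedata

# Hypothetical miniature tables; real dictionaries would be far larger.
SYMBOL_MAP = str.maketrans(
    {"\t": " ", "{": "(", "}": ")", "[": "(", "]": ")", "、": ","}
)
TRAD_TO_SIMP = str.maketrans({"國": "国"})
TERM_MAP = {"气水": "汽水", "圣巴巴拉分校": "圣芭芭拉分校"}


def denoise(text: str) -> str:
    """Apply symbol and text noise reduction to one piece of information text."""
    # Steps (1)-(2): NFKC folds full-width forms to half-width and
    # circled digits such as ① to plain digits (assumed equivalent here).
    text = unicodedata.normalize("NFKC", text)
    # Step (3): reduce the variety of symbols.
    text = text.translate(SYMBOL_MAP)
    # Step (5): traditional to simplified characters, character by character.
    text = text.translate(TRAD_TO_SIMP)
    # Steps (4) and (6): typo correction and term unification by lookup.
    for variant, canonical in TERM_MAP.items():
        text = text.replace(variant, canonical)
    return text
```

Applying the normalization first means the symbol table only has to list half-width symbols, since full-width braces and brackets have already been folded by then.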
S112: Use word embedding technology to segment and tag the information text, so that specific phrases can be distinguished by their tags; the specific phrases include time phrases, name phrases and/or organization phrases.
In a practical application of this embodiment, the segmentation and tagging includes:
(1) Treating a time expression as a single chunk. This is one feature that distinguishes the approach from mainstream word segmentation systems; for example, "1998年12月" (December 1998) is kept as one chunk.
(2) Treating a word denoting an organization, institution or award as a single chunk. For example, "第三世界科学院" (Third World Academy of Sciences) is not split into "第三/世界/科学院" or "第三世界/科学院".
(3) Tagging the segmentation result with parts of speech, where nouns are further distinguished into time phrases, names, organizations and so on.
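One way to keep time expressions and organization names as single chunks is to mark them before general segmentation runs. The sketch below assumes a regular expression for dates and a small organization dictionary; `TIME_PATTERN`, `ORG_DICT` and `protect_chunks` are illustrative names, not part of the source.

```python
import re

# Hypothetical resources: a date pattern and a tiny organization dictionary.
TIME_PATTERN = r"\d{4}年(?:\d{1,2}月)?(?:\d{1,2}日)?"
ORG_DICT = ["第三世界科学院"]

CHUNK_RE = re.compile(
    "(?P<time>" + TIME_PATTERN + ")|(?P<org>"
    + "|".join(map(re.escape, ORG_DICT)) + ")"
)


def protect_chunks(text):
    """Split text into (chunk, tag) pairs.

    Time and organization phrases come back whole with tag "time" or "org";
    everything else is returned with tag None for a downstream segmenter.
    """
    chunks, pos = [], 0
    for match in CHUNK_RE.finditer(text):
        if match.start() > pos:
            chunks.append((text[pos:match.start()], None))
        chunks.append((match.group(), match.lastgroup))
        pos = match.end()
    if pos < len(text):
        chunks.append((text[pos:], None))
    return chunks
```

The untagged remainders would still need an ordinary segmenter and part-of-speech tagger, which this sketch leaves out.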
S113: Use a grammar machine to grammatically deconstruct the information text tagged with the specific phrases.
Specifically, the grammar machine deconstructs Chinese grammar, reducing complex structures to simple ones. For example, after part-of-speech tagging, a sentence in the information text is presented in the following form: {time:1987年}, {time:1990年}, {order:先后}, {event:获}, {univ:该校}, {title:硕士}, {title:博士学位} (that is: 1987, 1990, successively, obtained, that university, master's, doctoral degree).
Further, the grammar machine works as follows:
The {order:先后} tag in the information text triggers the "sequence grammar machine". The sequence grammar machine determines the order of the times, taking {time:1987年} as one branch and {time:1990年} as another. It should be noted that the sequence grammar machine assumes that the sentence contains at least two time words, that these times differ, and that the other components of the sentence contain entity words matching the number of times; if these assumptions are not satisfied, the sequence grammar machine reports a grammar error.
The {univ:该校} tag in the information text triggers the "anaphora grammar machine". It searches back through the preceding text for the most recently mentioned univ tag, to find the specific school name that "该校" (that university) refers to. It should be noted that the anaphora grammar machine steps back no more than 10 sentences and stops at the beginning of the document; if no antecedent is found under these conditions, the anaphora grammar machine reports a grammar error.
In this embodiment, the result after processing by the grammar machine is as follows:
Branch 1: {time:1987年}{order:先}{event:获}{univ:吉林大学}{title:硕士} (1987, first, obtained, Jilin University, master's);
Branch 2: {time:1990年}{order:后}{event:获}{univ:吉林大学}{title:博士学位} (1990, later, obtained, Jilin University, doctoral degree).
It should be noted that, after the grammar machine has processed a sentence of the information text into the format of Branch 1 or Branch 2 above, the sentence is handed to the format machine for final processing.
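The branching and anaphora behaviour described above can be sketched as below. This is only an illustration under stated assumptions: sentences are represented as lists of `(tag, word)` pairs, the "entity words" that must match the number of times are taken to be the `title` tokens, and the function names are hypothetical.

```python
def resolve_anaphora(history, max_back=10):
    """Search back through at most max_back earlier sentences for a univ tag."""
    for sentence in reversed(history[-max_back:]):
        for tag, word in reversed(sentence):
            if tag == "univ" and word != "该校":
                return word
    raise ValueError("anaphora grammar error: no antecedent found")


def sequence_branches(sentence, history):
    """Split a sentence tagged with {order:先后} into one branch per time word."""
    times = [w for t, w in sentence if t == "time"]
    titles = [w for t, w in sentence if t == "title"]
    # Assumptions from the text: at least two distinct times, and as many
    # entity words (here: titles) as there are times.
    if len(times) < 2 or len(set(times)) < len(times) or len(titles) != len(times):
        raise ValueError("sequence grammar error")
    event = next(w for t, w in sentence if t == "event")
    univ = next(w for t, w in sentence if t == "univ")
    if univ == "该校":
        univ = resolve_anaphora(history)
    return [
        {"time": tm, "order": "先" if i == 0 else "后",
         "event": event, "univ": univ, "title": ti}
        for i, (tm, ti) in enumerate(zip(times, titles))
    ]
```

For the example sentence, with an earlier sentence that mentions 吉林大学, this yields two dictionaries corresponding to Branch 1 and Branch 2.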
S114: Use a format machine to store the grammatically deconstructed information text in a preset format. The preset format is determined by a formatter, which converts the fields of the information text into a standard format and fills in default values.
Specifically, the format machine stores the components of a sentence in a unified, standardized way, using the field formats required for classifying science and technology information by application scenario. The format machine uses triggers to match each sentence with the formatter it needs, and then calls that formatter to normalize the fields and fill in default values.
Further, the format machine proceeds as follows:
(1) Determine the trigger from the part-of-speech tags. For example, the sentence carries the tags "univ" and "title", and "吉林大学" (Jilin University) and "硕士/博士" (master's/doctorate) can be found in the school dictionary and the degree dictionary respectively; the sentence therefore triggers the "education experience formatter".
(2) Generate the field headers, namely "enrollment year", "graduation year", "school", "major", "degree" and "thesis/graduation project".
(3) Normalize formats, including unifying time expressions and names. For example, "1987年" is normalized to "1987-00-00"; "吉林大学" keeps its default form and remains "吉林大学"; and "博士学位" (doctoral degree) is normalized to "博士" (doctorate).
(4) Fill missing values in the information text uniformly with "-".
(5) Assemble the normalized data to generate a temporary, format-conforming text of information feature words as the preprocessing result, and store it.
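The five formatter steps can be illustrated with a small "education experience formatter". This is a sketch under assumptions: the field names are English stand-ins for the headers listed above, the dictionaries are hypothetical miniatures, and the time of a degree-award event is assumed to be the graduation year.

```python
import re

# English stand-ins for the field headers; names are illustrative only.
FIELDS = ["enrollment_year", "graduation_year", "school", "major", "degree", "thesis"]
SCHOOL_DICT = {"吉林大学"}            # hypothetical school dictionary
DEGREE_DICT = {"硕士", "博士"}        # hypothetical degree dictionary
DEGREE_MAP = {"博士学位": "博士", "硕士学位": "硕士"}


def normalize_time(text):
    """Turn e.g. '1987年' into '1987-00-00'; unknown parts become 00."""
    match = re.fullmatch(r"(\d{4})年(?:(\d{1,2})月)?(?:(\d{1,2})日)?", text)
    if not match:
        return "-"
    year, month, day = match.groups()
    return f"{year}-{int(month or 0):02d}-{int(day or 0):02d}"


def format_education(branch):
    """Return a normalized record, or None if the trigger does not fire."""
    degree = DEGREE_MAP.get(branch["title"], branch["title"])
    # Step (1): trigger only when both dictionary lookups succeed.
    if branch["univ"] not in SCHOOL_DICT or degree not in DEGREE_DICT:
        return None
    # Steps (2) and (4): create all headers, pre-filled with "-".
    record = dict.fromkeys(FIELDS, "-")
    # Step (3): normalized values (graduation year is an assumption here).
    record["graduation_year"] = normalize_time(branch["time"])
    record["school"] = branch["univ"]
    record["degree"] = degree
    return record  # step (5): assembled record, ready to store
```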
S12: Perform information source attribute processing on the information text according to its information source, to generate an information source attribute processing result; the result includes an information source feature result and a relevance result for information application scenarios.
In this embodiment, the information source of the information text is analyzed to determine the category of the information source. The information source categories include comprehensive media, public platforms, administrative bodies, research institutions and/or industry media. The information text is assigned to one of these categories according to its source, to obtain the information source feature result.
In a practical application of this embodiment, the crawled information is first divided by data source characteristics into comprehensive media, public platforms, administrative bodies, research institutions and others. Comprehensive media such as ScienceNet (科学网) and Science and Technology Daily (科技日报) stand out for diversity and total volume, with a large share of achievement information. WeChat official account platforms carry mixed industry information, with widely distributed information types and fast updates. Administrative bodies publish the most policy news, followed by conferences and hot topics; their authority and public recognition are high, but they publish infrequently. For university institutions, 90% of the information is science and technology achievement information, which gives first-hand data on university development policies, achievements and talent mobility, with distinct institutional characteristics.
Further, take Xinzhiyuan (新智元) as an example. Xinzhiyuan operates a WeChat official account; its main business is organizing artificial intelligence conferences, and it has partnerships with domestic AI companies. The "Xinzhiyuan" WeChat official account is one link in its industrial chain. The numbers of items across categories are fairly even, with no obvious emphasis: achievements, appointments, enterprises, industry hot topics, rankings, conferences, macro statistics and other categories are balanced, and quality is stable.
In this embodiment, weight calculation is used to calibrate how important each information source category is to the different application scenarios, so as to determine the relevance result for the information application scenarios. The relevance result refers to the dependence ratio of each application scenario across the different information source categories. The application scenario categories include: achievements, obituaries, appointments, enterprises and industry, integrity and ethics issues, rankings, honors, macro statistical reports, conferences, media hot topics and/or policy.
In a practical application of this embodiment, because the total volume of information differs greatly between sources, the quality of different sources is weighed on the basis of the weight that a given application scenario category holds in the total information provided by a source. The sources then serve as references for one another, which reflects the authority of each source.
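The weighting idea above can be sketched as follows. The exact formulas of the embodiment are given only as images and are not reproduced here, so this is an assumed reading: the share of a scenario category is computed within each source's own total, and the shares are then normalized across sources to obtain a dependence ratio. The counts in the usage note are made-up illustration values, not data from the patent.

```python
def category_share(counts, category):
    """Share of one scenario category within each source's own total."""
    shares = {}
    for source, per_category in counts.items():
        total = sum(per_category.values())
        if total:
            shares[source] = per_category.get(category, 0) / total
    return shares


def dependence_ratio(counts, category):
    """Normalize the per-source shares so that sources can be compared."""
    shares = category_share(counts, category)
    total = sum(shares.values())
    return {src: (s / total if total else 0.0) for src, s in shares.items()}
```

For example, with hypothetical counts `{"A": {"a": 50, "g": 50}, "B": {"a": 30, "g": 10}}`, source A has far more items overall, yet source B devotes a larger share of its output to category `a` and therefore receives the higher ratio. Normalizing by each source's own total is what prevents high-volume sources from dominating.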
Please refer to FIG. 3, which is a schematic diagram of the weight ratios in the scenario application method based on information classification of the present invention in one embodiment. As shown in FIG. 3, among the information source categories, A denotes comprehensive media, B denotes public platforms, C denotes administrative bodies, D denotes university websites, and E denotes others (for example, industry media fall under E). Among the application scenario categories, a denotes achievements; b denotes obituaries; c denotes appointments; d denotes enterprise-related items; e denotes honors, awards and titles; f denotes lists; g denotes conferences; h denotes field news, people and hot topics; i denotes policy; j denotes integrity and ethics issues; and k denotes macro statistical reports.
In a practical application of this embodiment, taking the share of each source in achievement-category information as an example, let:
Figure PCTCN2019117970-appb-000001
Figure PCTCN2019117970-appb-000002
Figure PCTCN2019117970-appb-000003
Figure PCTCN2019117970-appb-000004
As shown in FIG. 3, the final result is determined as:
Figure PCTCN2019117970-appb-000005
Comparing the above results shows that, with the growth of information-sharing self-media in recent years, the dependability of the WeChat public platform has overtaken that of comprehensive media.
S13: Perform application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract application scenario feature words from the information text and then generate different application scenario feature lexicons.
Specifically, according to the scenarios in which different information can be used, information can be preliminarily divided into the following categories: a. achievements; b. obituaries; c. appointments; d. enterprise-related; e. honors, awards and titles; f. lists; g. conferences; h. field news, people and hot topics; i. policy; j. integrity and ethics issues; k. macro statistical reports. It should be noted that the application scenario categories can also be given labels with specific meanings for identification or retrieval, for example: A for achievements, D for obituaries, EM for appointments, ET for enterprise-related, H for honors, awards and titles, L for lists, M for conferences, N for field news, people and hot topics, P for policy, PO for integrity and ethics issues, and ST for macro statistical reports.
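The category labels listed above lend themselves to a simple lookup table. The sketch below is illustrative: the codes come from the text, while the English names and the helper `scenario_name` are assumptions made for this example.

```python
# Codes from the text; English names are translations chosen for this sketch.
SCENARIO_CODES = {
    "A": "achievements",
    "D": "obituaries",
    "EM": "appointments",
    "ET": "enterprise-related",
    "H": "honors, awards and titles",
    "L": "lists",
    "M": "conferences",
    "N": "field news, people and hot topics",
    "P": "policy",
    "PO": "integrity and ethics issues",
    "ST": "macro statistical reports",
}


def scenario_name(code: str) -> str:
    """Look up a scenario category by its label, ignoring case."""
    return SCENARIO_CODES[code.upper()]
```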
Specifically, the application scenario categories are described as follows:
(1) Achievements: these include profiles of individuals and the cooperation of domestic and foreign institutions and research groups. An expert profile in an item may contain honors not yet on record and uncommon field subdivisions, which can be added to the expert's profile; the achievement itself can be used to identify the latest research content and research direction.
(2) Obituaries: these can be used to update an expert's availability and contact status to "hidden".
(3) Appointments: these carry information on the movement of domestic and overseas talent among universities, institutions and global high-tech enterprises, and are used to update an expert's current institution and cooperation activity.
(4) Enterprise and industry related: these supplement the content on the macro situation of an industry, basic enterprise information and key enterprise personnel.
(5) Honors and awards: for example, newly elected academician titles and awards in various disciplines. Such items generally give complete information on the awarding body and the recipients, which can be used to update expert content and to make an initial assessment of an award's authority.
(6) Lists and rankings: the ranked objects include universities, achievements, disciplines, enterprises, scholars and so on. These provide both the selection criteria of domestic and foreign institutions and a large amount of normalized list content that can be acquired in batches.
(7) Conferences: these include government meetings, science and technology forums, and achievement challenge competitions. Academic conferences hosted in mainland China reveal the cooperation between foreign professors and domestic parties. International conferences provide background on participants and institutions; conferences such as those on artificial intelligence are also an important reference for field classification and a source of data on the latest achievements.
(8) Media hot topics: these cover a broader range of content, typically new technologies related to industry, academia and research, introductions to and outlooks on the commercialization of achievements, the latest achievements of popular technology companies, and detailed profiles of scholars, corporate executives, research teams and renowned teachers.
(9) Policy: these mainly include the latest directives of local governments on talent and infrastructure construction, interpretations of national science and technology policy and the current situation, new discipline or industry standards set by institutions, the launch of large projects, international cooperation agreements, and major policy adjustments abroad. They can serve policy researchers as background or comparison material.
(10) Integrity and ethics issues: common content includes paper retractions and academic scandals in various fields, as well as ethical reflection on emerging disciplines and technologies. These are an important consideration when evaluating and appointing experts, and also track contested hot topics in international research.
(11) Macro statistical reports: these are mainly data from international authoritative bodies and domestic industry media. The dimensions covered include talent, industry (trends and current status), bibliometrics, university research indices, patents, subject areas and so on.
In this embodiment, S13 includes:
S131: Extract nouns and/or verb phrases from the information text as application scenario feature words.
Specifically, for the 11 categories above, and on the basis of the part-of-speech tags produced during segmentation, nouns and noun phrases whose tag begins with n, or verb phrases whose tag is v, are extracted from the segmented information. It should be noted that if the tag set defines n for nouns, nt for organizations and groups, and nz for other proper nouns, words tagged nt or nz can also be extracted.
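The extraction rule in S131 amounts to a filter over part-of-speech tags. The sketch below assumes tokens arrive as `(word, tag)` pairs from an upstream tagger that uses the n/nt/nz/v tag scheme mentioned in the text; the function name is hypothetical.

```python
def extract_feature_candidates(tagged_tokens):
    """Keep words whose tag starts with 'n' (n, nt, nz, ...) or equals 'v'."""
    return [
        word
        for word, tag in tagged_tokens
        if tag.startswith("n") or tag == "v"
    ]
```

Matching on the prefix `n` picks up `nt` and `nz` automatically, which corresponds to the optional extension described above.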
S132: Count the number of documents in which each application scenario feature word appears; the number of documents refers to the total number of documents formed by all of the information texts.
Specifically, the DF value of each application scenario feature word is calculated; the DF value is the number of documents in which that feature word appears. DF (or df) stands for document frequency. DF calculation is a feature extraction technique; because its computational complexity is linear in the size of the text collection, it is easily applied to large-scale document statistics.
S133: Select the application scenario feature words whose document count falls within a preset range.
In one practical application of this embodiment, application scenario feature words are selected if their DF value is greater than 5 and less than 20% of the total number of documents. Note that this range is only one example of the preset range; other numerical ranges that can be used to bound and select application scenario feature words also fall within the scope of the present invention.
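Steps S132 and S133 can be illustrated with a short sketch that counts document frequency and applies the example range (DF greater than 5 and less than 20% of the total). The function names, parameter names and toy corpus are assumptions made for illustration; each document is assumed to be a list of already extracted feature words.

```python
from collections import Counter
from typing import Iterable, Set


def document_frequency(docs: Iterable[Iterable[str]]) -> Counter:
    """Count, for every word, the number of documents that contain it."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # each word counts once per document
    return df


def filter_by_df(df: Counter, total_docs: int, min_df: int = 5, max_ratio: float = 0.2) -> Set[str]:
    """Keep words with min_df < DF < max_ratio * total_docs (the example range)."""
    upper = max_ratio * total_docs
    return {word for word, count in df.items() if min_df < count < upper}


# Toy corpus of 100 documents: "the" is too common, "rare" too rare.
docs = []
for i in range(100):
    words = []
    if i < 90:
        words.append("the")
    if i < 10:
        words.append("published")
    if i < 2:
        words.append("rare")
    docs.append(words)

df = document_frequency(docs)
selected = filter_by_df(df, total_docs=len(docs))
print(selected)
```

A single pass over the corpus is enough, which matches the linear complexity noted in the text.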
S134: Classify the application scenario feature words into the matching application scenario categories by calculating dependency coefficients among the feature words and combining them with the semantic vector of the information text, thereby forming the application scenario feature vocabulary.
Specifically, the selected application scenario feature words are arranged into a feature extraction vocabulary according to application scenario category, yielding 11 extraction word sets.
Note that no single word is shared by all items of information in the same category. Items in the same category merely bear a "family resemblance" to one another, so multiple words must be matched against the semantic vector of the whole article. Words do not perform retrieval independently of one another: different words in the same category have dependency coefficients, which allow more accurate classification.
Specifically, the application scenario categories and their feature words are compiled in tabular form into 11 extraction word sets. Examples of the extraction word sets obtained from the matching and learning results are given in Table 1 (extraction word set classification table). As Table 1 shows, the feature word "published" (发表) is assigned to the achievement category of the application scenario categories.
Table 1: Extraction word set classification table
[Table 1 is reproduced in the original as images PCTCN2019117970-appb-000006 and PCTCN2019117970-appb-000007.]
S14: Perform word frequency index calculation on the information text, so that the calculation result can be combined with the information source attribute processing result and the application scenario feature vocabulary for targeted pushing of information; the targeted pushing includes a hide operation, an update operation, an add operation and/or an associated import operation.
Specifically, a word frequency index is calculated for the target vocabulary in the formatted information text to determine how often each target word occurs in the text, which characterizes the weight of that target word in the information text.
In this embodiment, S14 includes:
S141: Calculate the word frequency index of the target vocabulary in each paragraph of the information text, and determine the core vocabulary of each paragraph by applying a preset rule to the word frequency index; the preset rule includes sorting the word frequency indices in descending order and extracting the target words corresponding to the top-ranked indices. The target vocabulary refers to words selected according to article category, including scientific and technological vocabulary.
In one practical application of this embodiment, each scientific and technological information text is treated as one document. The scientific and technological terms are extracted from the full-text data, the idf value of every word in the scientific and technological vocabulary is calculated, the scientific and technological terms in each paragraph are extracted, and the terms ranked highest by tf-idf value in descending order are taken as the core vocabulary. The idf value is the inverse document frequency of a scientific and technological term, calculated as follows:
[Formula image PCTCN2019117970-appb-000008: definition of idf(w) in terms of |D| and df(w).]
Here, w denotes a scientific and technological term, idf(w) denotes the idf value of w, |D| is the number of documents, and df(w) is the number of documents containing w.
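The original formula is available only as an image, so its exact form cannot be confirmed here. For orientation, the common textbook definition that is consistent with the symbols |D| and df(w) is given below; this is an assumption, and the original may use a smoothed variant (for example, adding 1 to the denominator). The symbol tf(w, p), the number of occurrences of w in paragraph p, is introduced here only for illustration.

```latex
\mathrm{idf}(w) = \log \frac{|D|}{\mathrm{df}(w)},
\qquad
\text{tf-idf}(w, p) = \mathrm{tf}(w, p) \cdot \mathrm{idf}(w)
```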
Specifically, taking one paragraph of a scientific and technological information text as an example, the number of sentences L is obtained and the top L terms in the descending ranking are taken as the core vocabulary of that paragraph. Note that the number of core words extracted is tied to the number of sentences in the paragraph: a single sentence may yield several core words, and the core words of different sentences in a paragraph overlap, so the terms ranked highest by word frequency are taken as the final core vocabulary of the whole paragraph.
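The per-paragraph selection in S141 can be sketched as follows. This is a minimal sketch under stated assumptions: idf is taken as log(|D| / df(w)), which is the common definition rather than a confirmed reading of the original formula image, and the function names, toy documents and tie-breaking rule (alphabetical) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


def idf_table(docs: List[Set[str]], vocabulary: Iterable[str]) -> Dict[str, float]:
    """idf(w) = log(|D| / df(w)) for every vocabulary word that occurs in some document."""
    total = len(docs)
    table = {}
    for word in vocabulary:
        df = sum(1 for doc in docs if word in doc)
        if df:
            table[word] = math.log(total / df)
    return table


def core_vocabulary(paragraph_tokens: List[str], sentence_count: int, idf: Dict[str, float]) -> List[str]:
    """Rank the paragraph's vocabulary words by tf-idf and keep the top L (L = sentence count)."""
    tf = Counter(token for token in paragraph_tokens if token in idf)
    scores = {word: count * idf[word] for word, count in tf.items()}
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:sentence_count]


# Toy corpus: each document is the set of terms it contains.
docs = [{"graphene", "laser"}, {"graphene"}, {"laser", "quantum"}, {"policy"}]
idf = idf_table(docs, vocabulary={"graphene", "laser", "quantum"})

paragraph = ["quantum", "laser", "laser", "laser", "graphene", "policy"]
core = core_vocabulary(paragraph, sentence_count=2, idf=idf)
print(core)
```

In this toy run "policy" is ignored because it is not in the scientific vocabulary, and the paragraph with two sentences yields two core words.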
S142: Perform semantic matching of the core vocabulary against the application scenario feature vocabulary, to select the information texts containing core vocabulary whose matching result is greater than a preset value.
Specifically, the semantic similarity between the core vocabulary and the extracted feature vocabulary is calculated, and articles containing core vocabulary with a semantic similarity greater than 0.5 are selected. Note that 0.5 is one example of the preset value; other preset values usable for semantic matching fall within the scope of the present invention.
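The text does not say how semantic similarity is computed. The sketch below assumes cosine similarity between word vectors, which is one common choice and not necessarily the one used in the original; the toy vectors, function names and article identifiers are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_articles(
    articles: Dict[str, List[str]],
    feature_words: List[str],
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Return ids of articles with a core word whose best similarity exceeds the threshold."""
    matched = []
    for article_id, core_words in articles.items():
        best = max(
            (
                cosine(vectors[c], vectors[f])
                for c in core_words
                for f in feature_words
                if c in vectors and f in vectors
            ),
            default=0.0,
        )
        if best > threshold:
            matched.append(article_id)
    return matched


# Toy 2-dimensional vectors standing in for real word embeddings.
vectors = {"publish": (1.0, 0.0), "paper": (0.8, 0.6), "weather": (0.0, 1.0)}
articles = {"a1": ["paper"], "a2": ["weather"]}
matched = matching_articles(articles, ["publish"], vectors)
print(matched)
```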
S143: Combine the information text with the category of the information source to generate an information source triple group, and combine it with the application scenario feature vocabulary to generate a feature word triple group.
Specifically, triples containing the name of an information item are extracted from the crawled information results. These triples are mainly of two types: the first is an is-a relation triple based on information source classification, i.e. <information name, isA, information source category name>, where isA identifies the information source of the information text; the second is a feature-word-based triple <information name, feature word category name, attribute value>. The selected information item names are combined with the information source classification and the feature word sets to form the <information item, isA, category name> triple group and the <information item, feature word, attribute value> triple group.
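The two triple types can be represented with a small data structure. This is a sketch only: the class name, field names, helper functions and sample values are hypothetical and are not taken from the original.

```python
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    subject: str    # information item name
    predicate: str  # "isA" or a feature word category name
    obj: str        # information source category name or attribute value


def source_triple(title: str, source_category: str) -> Triple:
    """Build <information name, isA, information source category name>."""
    return Triple(title, "isA", source_category)


def feature_triples(title: str, features: Dict[str, str]) -> List[Triple]:
    """Build <information name, feature word category name, attribute value> triples."""
    return [Triple(title, category, value) for category, value in features.items()]


title = "Professor Li elected academy fellow"
src = source_triple(title, "research institution")
feats = feature_triples(title, {"honor": "elected"})
print(src)
print(feats)
```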
Further, based on the classification of the crawled data sources, a matching degree is calculated between each data source and the application scenarios, and the result is recorded as the is-a relation triple of the information source classification, <information name, isA, information source category name>.
Furthermore, among known application scenario instances, the semantic vectors with the highest frequency of occurrence and the strongest relevance are selected to form the feature classification word sets, giving the feature-word-based relation triple <information name, feature word category name, attribute value>.
S144: Combine the information source triple group and the feature word triple group to determine the application scenario category to which the core vocabulary in the feature word triple group belongs.
Specifically, the application scenario category of a given information text is determined from the attribute classification features in the information source triple group and the feature word triple group.
S145: Select the top three core words after sorting, and look up the application scenario category corresponding to each of them, to determine the information source on which that application scenario category depends most.
In one practical application of this embodiment, because the full information text has core vocabulary sorted by word frequency, the application scenario feature vocabulary is invoked, with reference to the initial ten thousand documents and the 11 application scenario categories of information in the actual database, to identify the application scenario category of the information text. A one-to-many cross calculation is then performed against the information source categories, and the final result is settled on the scenario with the greatest overlap, to determine the information source on which that application scenario category depends most.
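One possible reading of S145 is sketched below: the top three core words vote for a scenario category through the feature vocabulary, the category with the most votes (the greatest overlap) wins, and the information source with the highest dependency ratio for that category is then looked up. The voting rule, the dependency table and all names and numbers are assumptions for illustration; the original does not specify the cross calculation in this detail.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def classify(
    core_words: List[str],
    lexicon: Dict[str, str],
    dependency: Dict[str, Dict[str, float]],
    top_n: int = 3,
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (scenario category, most depended-on source category), or None if no word matches."""
    votes = Counter(lexicon[word] for word in core_words[:top_n] if word in lexicon)
    if not votes:
        return None
    # sorted() makes ties deterministic (alphabetical order).
    scenario = max(sorted(votes), key=lambda name: votes[name])
    ratios = dependency.get(scenario, {})
    source = max(sorted(ratios), key=lambda name: ratios[name]) if ratios else None
    return scenario, source


# Hypothetical feature vocabulary and dependency ratios.
lexicon = {"award": "honor", "elected": "honor", "conference": "meeting"}
dependency = {
    "honor": {"research institution": 0.5, "comprehensive media": 0.3, "industry media": 0.2},
    "meeting": {"industry media": 0.6, "public platform": 0.4},
}
result = classify(["award", "elected", "conference", "ignored"], lexicon, dependency)
print(result)
```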
S146: Push the information text to the information source determined to have the highest dependency, and perform the targeted operation.
Specifically, the feature words are ranked by a weighting of their weight in the information and the information source type, and targeted pushing is performed for the application scenarios to which the top three feature words belong.
Note that targeted pushing for the application scenarios of the top three feature words is one embodiment of the present invention; a different number of feature words may also be chosen for targeted pushing.
In this embodiment, the targeted operations include: hiding experts for the obituary category, updating the employing institution for the employment category, adding entries for the honors and awards category, and/or batch associated import into the database for the list category. For example, part of the word segmentation result of a list-category information text can be entered directly into the database as incremental data.
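The four targeted operations can be sketched as a dispatch on the scenario category. The in-memory dictionary standing in for the expert database, the record fields, the category labels and the function names are all hypothetical; a real system would write to its own database.

```python
from typing import Any, Dict


def new_record() -> Dict[str, Any]:
    """Create an empty expert record (hypothetical schema)."""
    return {"hidden": False, "institution": None, "honors": []}


def apply_targeted_operation(experts: Dict[str, Dict[str, Any]], scenario: str, item: Dict[str, Any]) -> None:
    """Apply the operation that corresponds to the scenario category of an information item."""
    if scenario == "list":
        # Batch import: add every listed name that is not yet in the database.
        for name in item["names"]:
            experts.setdefault(name, new_record())
        return
    record = experts.setdefault(item["expert"], new_record())
    if scenario == "obituary":
        record["hidden"] = True                      # hide operation
    elif scenario == "employment":
        record["institution"] = item["institution"]  # update operation
    elif scenario == "honor":
        record["honors"].append(item["honor"])       # add operation
    else:
        raise ValueError(f"no targeted operation for scenario {scenario!r}")


experts: Dict[str, Dict[str, Any]] = {}
apply_targeted_operation(experts, "employment", {"expert": "Li", "institution": "Fudan University"})
apply_targeted_operation(experts, "honor", {"expert": "Li", "honor": "Academy fellow"})
apply_targeted_operation(experts, "obituary", {"expert": "Chen"})
apply_targeted_operation(experts, "list", {"names": ["Wang", "Li"]})
print(experts)
```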
This embodiment provides a computer storage medium storing a computer program which, when executed by a processor, implements the scenario application method based on information classification.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of a computer program. The computer program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The computer-readable storage medium includes any medium capable of storing program code, such as ROM, RAM, a magnetic disk or an optical disc.
The scenario application method based on information classification described in this embodiment makes it possible, after batch crawling of information data from different information sources such as web pages and official-account news feeds, to deliver the information by category to specific user groups and application scenarios and to operate on it flexibly.
Embodiment Two
This embodiment provides a scenario application system based on information classification, comprising:
a preprocessing module, configured to format and preprocess information data to generate information text that conforms to the format;
an information source attribute processing module, configured to perform information source attribute processing on the information text according to its information source, to generate an information source attribute processing result; the information source attribute processing result includes an information source feature result and a relevance result for the information application scenarios;
an application scenario attribute processing module, configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract the application scenario feature words of the information text and then generate the different application scenario feature vocabularies;
an application module, configured to perform word frequency index calculation on the information text, so that the calculation result can be combined with the information source attribute processing result and the application scenario feature vocabulary for targeted pushing of information; the targeted pushing includes a hide operation, an update operation, an add operation and/or an associated import operation.
The scenario application system based on information classification provided by this embodiment is described in detail below with reference to the drawings. Note that the division of the system into the modules below is only a division of logical functions; in an actual implementation the modules may be wholly or partly integrated into one physical entity, or may be physically separate. The modules may all be implemented as software invoked by a processing element, or all as hardware, or some as software invoked by a processing element and some as hardware. For example, module x may be a separately provided processing element, or may be integrated into a chip of the system described below. Module x may also be stored as program code in the memory of the system and be invoked by one of the system's processing elements to perform the function of module x. The other modules are implemented similarly. All or some of the modules may be integrated together or implemented independently. The processing element referred to here may be an integrated circuit with signal processing capability. During implementation, each step of the above method, or each of the modules below, may be carried out by an integrated hardware logic circuit in the processor element or by instructions in the form of software.
The modules below may be one or more integrated circuits configured to implement the above method, for example one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), one or more Field Programmable Gate Arrays (FPGAs), etc. When one of the modules is implemented as program code invoked by a processing element, the processing element may be a general-purpose processor such as a Central Processing Unit (CPU) or another processor capable of invoking program code. The modules may be integrated together and implemented as a System-on-a-Chip (SoC).
Please refer to FIG. 4, which shows a schematic structural diagram of the scenario application system based on information classification in one embodiment of the present invention. As shown in FIG. 4, the scenario application system 4 based on information classification includes a preprocessing module 41, an information source attribute processing module 42, an application scenario attribute processing module 43 and an application module 44.
The preprocessing module 41 is configured to format and preprocess information data to generate information text that conforms to the format.
In this embodiment, the preprocessing module 41 is specifically configured to: perform noise reduction on the information data to obtain cleaned information text, the noise reduction including symbol noise reduction and text noise reduction; perform word segmentation and tagging on the information text using word embedding technology, so that specific phrases can be distinguished by their tags, the specific phrases including time phrases, name phrases and/or institution phrases; grammatically deconstruct the information text tagged with specific phrases by means of a grammar machine; and store the grammatically deconstructed information text in a preset format by means of a format machine, the preset format being determined by a formatter that converts the fields of the information text to a standard format and fills in default values.
The information source attribute processing module 42 is configured to perform information source attribute processing on the information text according to its information source, to generate an information source attribute processing result; the information source attribute processing result includes an information source feature result and a relevance result for the information application scenarios.
In this embodiment, the information source attribute processing module 42 is specifically configured to analyze the information source of the information text to determine the category of the information source, the categories of information source including comprehensive media, public platforms, management units, research institutions and/or industry media, and to assign the information text to one of the information source categories according to its information source, to obtain the information source feature result. Through weight calculation, the importance of each information source category for the different application scenarios is calibrated to determine the relevance result for the information application scenarios; this relevance result refers to the dependency ratio of each application scenario across the different information source categories. The application scenario categories include: achievements, obituaries, employment, enterprise and industry, integrity and ethics issues, rankings, honors, macro statistical reports, conferences, media hot topics and/or policy.
The application scenario attribute processing module 43 is configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract the application scenario feature words of the information text and then generate the different application scenario feature vocabularies.
In this embodiment, the application scenario attribute processing module 43 is specifically configured to: extract nouns and/or verb phrases from the information text as application scenario feature words; count the number of documents in which each application scenario feature word appears, the number of documents referring to the total number of documents formed by all of the information texts; select the application scenario feature words whose document count falls within a preset range; and classify the application scenario feature words into the matching application scenario categories by calculating dependency coefficients among the feature words and combining them with the semantic vector of the information text, thereby forming the application scenario feature vocabulary.
The application module 44 is configured to perform word frequency index calculation on the information text, so that the calculation result can be combined with the information source attribute processing result and the application scenario feature vocabulary for targeted pushing of information; the targeted pushing includes a hide operation, an update operation, an add operation and/or an associated import operation.
In this embodiment, the application module 44 is specifically configured to: calculate the word frequency index of the target vocabulary in each paragraph of the information text, and determine the core vocabulary of each paragraph by applying a preset rule to the word frequency index, the preset rule including sorting the word frequency indices in descending order and extracting the target words corresponding to the top-ranked indices, the target vocabulary referring to words selected according to article category, including scientific and technological vocabulary; perform semantic matching of the core vocabulary against the application scenario feature vocabulary, to select the information texts containing core vocabulary whose matching result is greater than a preset value; combine the information text with the category of the information source to generate an information source triple group, and combine it with the application scenario feature vocabulary to generate a feature word triple group; combine the information source triple group and the feature word triple group to determine the application scenario category to which the core vocabulary in the feature word triple group belongs; select the top three core words after sorting and look up the application scenario category corresponding to each of them, to determine the information source on which that application scenario category depends most; and push the information text to the information source determined to have the highest dependency and perform the targeted operation. The targeted operations include: hiding experts for the obituary category, updating the employing institution for the employment category, adding entries for the honors and awards category, and/or batch associated import into the database for the list category.
The scenario application system based on information classification described in this embodiment makes it possible, after batch crawling of information data from different information sources such as web pages and official-account news feeds, to deliver the information by category to specific user groups and application scenarios and to operate on it flexibly.
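The way modules 41 to 44 hand results to one another can be sketched as a simple pipeline. This is only an illustration of the logical division described above: the class name, the field names and the trivial stand-in functions are hypothetical, and each stand-in would be replaced by the real processing of the corresponding module.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ScenarioApplicationSystem:
    preprocess: Callable[[str], Any]        # stands in for preprocessing module 41
    process_source: Callable[[Any], Any]    # stands in for information source attribute processing module 42
    process_scenario: Callable[[Any], Any]  # stands in for application scenario attribute processing module 43
    apply: Callable[[Any, Any, Any], Any]   # stands in for application module 44

    def run(self, raw: str) -> Any:
        """Pass one piece of information data through the four modules in order."""
        text = self.preprocess(raw)
        source_result = self.process_source(text)
        vocabulary = self.process_scenario(source_result)
        return self.apply(text, source_result, vocabulary)


# Trivial stand-ins so the sketch runs end to end.
system = ScenarioApplicationSystem(
    preprocess=str.strip,
    process_source=lambda text: {"text": text, "source": "industry media"},
    process_scenario=lambda source_result: {"award": "honor"},
    apply=lambda text, source_result, vocabulary: [
        vocabulary[word] for word in text.split() if word in vocabulary
    ],
)
scenarios = system.run("  award ceremony  ")
print(scenarios)
```

Keeping the modules as interchangeable callables mirrors the statement that they may be integrated together or implemented independently.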
Embodiment Three
This embodiment provides a device comprising a processor, a memory, a transceiver, a communication interface and/or a system bus. The memory and the communication interface are connected to the processor and the transceiver via the system bus and communicate with one another; the memory stores a computer program, the communication interface communicates with other devices, and the processor and the transceiver run the computer program so that the device performs the steps of the scenario application method based on information classification.
The system bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus can be divided into an address bus, a data bus, a control bus and so on. The communication interface is used for communication between the database access device and other devices (such as clients, read-write databases and read-only databases). The memory may include Random Access Memory (RAM) and may also include non-volatile memory, such as at least one disk memory.
The processor mentioned above may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The scope of protection of the scenario application method based on information classification of the present invention is not limited to the order of execution of the steps listed in this embodiment; any solution realized by adding, removing or replacing steps using the prior art in accordance with the principles of the present invention falls within the scope of protection of the present invention.
The present invention also provides a scenario application system based on information classification, which can implement the scenario application method based on information classification of the present invention. However, devices implementing the method include but are not limited to the structure of the system listed in this embodiment; any structural modification or replacement using the prior art in accordance with the principles of the present invention falls within the scope of protection of the present invention.
In summary, the scenario application method, system, medium and device based on information classification of the present invention take into account the whole process of collecting, classifying and applying scientific and technological information. Feature classification is refined by combining the information source with feature word segmentation of the full text, which helps reduce the effort of building the vocabulary and the errors of judgment. Automatic classification is designed from use cases of information already collected, which saves the cost of later manual classification and offers high practical value and a close fit to the scenarios. The present invention effectively overcomes various shortcomings of the prior art and has high industrial value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by persons of ordinary knowledge in the technical field without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

  1. A scenario application method based on information classification, characterized in that the method comprises:
    formatting and preprocessing information data to generate information text that conforms to a format;
    performing information source attribute processing on the information text according to its information source to generate an information source attribute processing result, the information source attribute processing result comprising an information source feature result and a relevance result for information application scenarios;
    performing application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract application scenario feature words from the information text and then generate different application scenario feature lexicons;
    performing word frequency index calculation on the information text, so that the calculation result is combined with the information source attribute processing result and the application scenario feature lexicons to carry out targeted pushing of information, the targeted pushing comprising a hide operation, an update operation, an add operation, and/or an associated-storage operation.
  2. The scenario application method based on information classification according to claim 1, characterized in that the step of formatting and preprocessing information data to generate information text that conforms to a format comprises:
    performing noise reduction on the information data to obtain cleaned information text, the noise reduction comprising symbol noise reduction and text noise reduction;
    performing word segmentation and tagging on the information text using word embedding technology, so that specific phrases can be distinguished by their tags, the specific phrases comprising time phrases, name phrases, and/or organization phrases;
    grammatically deconstructing, by a grammar machine, the information text tagged with the specific phrases;
    storing, by a format machine, the grammatically deconstructed information text in a preset format, the preset format being determined by a formatter, and the formatter being used to convert the fields of the information text to a standard format and to fill in default values.
  3. The scenario application method based on information classification according to claim 1, characterized in that the step of performing information source attribute processing on the information text according to its information source to generate an information source attribute processing result comprises:
    analyzing the information source of the information text to determine the category of the information source, the categories of information source comprising general media, public platforms, management units, research institutions, and/or industry media;
    assigning the information text, according to its information source, to one of the information source categories to obtain the information source feature result.
  4. The scenario application method based on information classification according to claim 3, characterized in that the step of performing information source attribute processing on the information text according to its information source to generate an information source attribute processing result further comprises:
    calibrating, through weight calculation, the importance of each information source category to the different application scenarios, so as to determine the relevance result for information application scenarios, the relevance result referring to the ratio of dependency that each application scenario has on the different information source categories;
    the categories of application scenario comprising: achievements, obituaries, appointments, enterprises and industry, integrity and ethics issues, rankings, honors, macro statistical reports, conferences, media hot topics, and/or policies.
  5. The scenario application method based on information classification according to claim 1, characterized in that the step of performing application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract application scenario feature words from the information text and then generate different application scenario feature lexicons, comprises:
    extracting noun and/or verb phrases from the information text as application scenario feature words;
    counting the number of documents in which each application scenario feature word occurs, where the number of documents refers to the total number of documents formed by all of the information texts;
    selecting those application scenario feature words whose number of documents falls within a preset range;
    assigning the application scenario feature words to the matching application scenario categories, by calculating dependency coefficients among the selected application scenario feature words and combining them with the semantic vector of the information text, to form the application scenario feature lexicons.
  6. The scenario application method based on information classification according to claim 1, characterized in that the step of performing word frequency index calculation on the information text, so that the calculation result is combined with the information source attribute processing result and the application scenario feature lexicons to carry out targeted pushing of information, comprises:
    calculating the word frequency index of the target vocabulary in each paragraph of the information text, and determining the core vocabulary of each paragraph from the word frequency index together with a preset rule; the preset rule comprises sorting the word frequency indexes in descending order and extracting the target vocabulary corresponding to the first several word frequency indexes, the target vocabulary being vocabulary selected according to the article category, including scientific and technological vocabulary;
    performing semantic matching of the core vocabulary against the application scenario feature lexicons, so as to select the information text containing core vocabulary whose matching result is greater than a preset value;
    generating an information source triple group from the information text combined with the category of the information source, and generating a feature word triple group in combination with the application scenario feature lexicons;
    determining, from the information source triple group and the feature word triple group together, the application scenario category to which the core vocabulary in the feature word triple group belongs;
    selecting the top three core vocabulary items after sorting and looking up the application scenario category corresponding to each of them, so as to determine the information source on which that application scenario category depends most;
    pushing the information text to the determined information source with the highest dependency, and performing a targeted operation.
  7. The scenario application method based on information classification according to claim 6, characterized in that:
    the targeted operation comprises: a hide operation on experts for the obituary category, an update of the employing institution for the appointment category, an add operation for the honors and awards category, and/or a batch associated-storage operation for the list category.
  8. A scenario application system based on information classification, characterized in that the system comprises:
    a preprocessing module, configured to format and preprocess information data to generate information text that conforms to a format;
    an information source attribute processing module, configured to perform information source attribute processing on the information text according to its information source to generate an information source attribute processing result, the information source attribute processing result comprising an information source feature result and a relevance result for information application scenarios;
    an application scenario attribute processing module, configured to perform application scenario attribute processing on the information source attribute processing result according to the information application scenarios, so as to extract application scenario feature words from the information text and then generate different application scenario feature lexicons;
    an application module, configured to perform word frequency index calculation on the information text, so that the calculation result is combined with the information source attribute processing result and the application scenario feature lexicons to carry out targeted pushing of information, the targeted pushing comprising a hide operation, an update operation, an add operation, and/or an associated-storage operation.
  9. A medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the scenario application method based on information classification according to any one of claims 1 to 7.
  10. A device, characterized by comprising: a processor and a memory;
    the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the device performs the scenario application method based on information classification according to any one of claims 1 to 7.
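The word-frequency and targeted-operation steps of claims 6 and 7 can be pictured with a small sketch. It is a simplification under stated assumptions: the "word frequency index" is taken to be a plain relative term frequency, whitespace tokenization stands in for real word segmentation, semantic matching is reduced to exact lookup, and the names `TARGET_VOCAB`, `SCENARIO_OF`, and `OPERATION_OF` are hypothetical.

```python
from collections import Counter

# Hypothetical target vocabulary and lookup tables (not from the patent).
TARGET_VOCAB = {"award", "prize", "appointed", "professor", "obituary"}
SCENARIO_OF = {
    "award": "honor",
    "prize": "honor",
    "appointed": "employment",
    "obituary": "obituary",
}
# The four targeted operations named in claim 7, keyed by scenario category.
OPERATION_OF = {
    "obituary": "hide",
    "employment": "update",
    "honor": "add",
    "list": "batch_link",
}


def core_words(paragraph, top_n=3):
    """Rank target words by relative frequency (descending) and keep top_n."""
    tokens = paragraph.lower().split()  # punctuation is not stripped here
    if not tokens:
        return []
    counts = Counter(t for t in tokens if t in TARGET_VOCAB)
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1] / len(tokens), kv[0]))
    return [word for word, _ in ranked[:top_n]]


def targeted_operations(text):
    """Collect the operations implied by each paragraph's core words."""
    operations = []
    for paragraph in text.split("\n"):
        for word in core_words(paragraph):
            scenario = SCENARIO_OF.get(word)
            if scenario is None:
                continue  # core word has no scenario in the lexicon
            operation = OPERATION_OF[scenario]
            if operation not in operations:
                operations.append(operation)
    return operations
```

A real system would replace the exact-lookup table with the semantic matching and triple groups described in claim 6; the sketch only shows the ranking-then-dispatch shape of the flow.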
PCT/CN2019/117970 2019-08-23 2019-11-13 Scenario application method and system based on information classification, and medium and device WO2021035976A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910786293.3A CN110688453B (en) 2019-08-23 2019-08-23 Scene application method, system, medium and equipment based on information classification
CN201910786293.3 2019-08-23

Publications (1)

Publication Number Publication Date
WO2021035976A1 (en) 2021-03-04

Family

ID=69108665

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117970 WO2021035976A1 (en) 2019-08-23 2019-11-13 Scenario application method and system based on information classification, and medium and device

Country Status (2)

Country Link
CN (1) CN110688453B (en)
WO (1) WO2021035976A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117874542A (en) * 2024-02-19 2024-04-12 广东省计算技术应用研究所 Big data-based result conversion supply and demand matching method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699669A (en) * 2013-12-30 2014-04-02 北京奇虎科技有限公司 Method for message pushing in browser and browser terminal
CN104679875A (en) * 2015-03-10 2015-06-03 杭州凡闻科技有限公司 Method for classifying information data based on digital newspaper
CN109726298A (en) * 2019-01-08 2019-05-07 上海市研发公共服务平台管理中心 Knowledge mapping construction method, system, terminal and medium suitable for scientific and technical literature

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593336B (en) * 2013-10-30 2017-05-10 中国运载火箭技术研究院 Knowledge pushing system and method based on semantic analysis

Also Published As

Publication number Publication date
CN110688453A (en) 2020-01-14
CN110688453B (en) 2023-09-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943520

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19943520

Country of ref document: EP

Kind code of ref document: A1