CN114238616A - Expert information detection method and storage device - Google Patents


Info

Publication number
CN114238616A
Authority
CN
China
Prior art keywords
expert; information; expert information; detection method; target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111546359.5A
Other languages
Chinese (zh)
Inventor
黄丽丽
石宝玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou Institute Of Data Technology Co ltd
Original Assignee
Fuzhou Institute Of Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou Institute Of Data Technology Co Ltd
Priority to CN202111546359.5A
Publication of CN114238616A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/355 Class or cluster creation or modification
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks


Abstract

The invention relates to the technical field of data processing, and in particular to an expert information detection method and a storage device. The expert information detection method comprises the following steps: collecting target resources and preprocessing them; fusing and aligning multi-source heterogeneous data; and pushing early warnings when expert information changes. With this method, a target expert can be detected quickly, and when the target expert's important information changes, an early-warning update is issued in time, ensuring the accuracy of the data.

Description

Expert information detection method and storage device
Technical Field
The invention relates to the technical field of data processing, and in particular to an expert information detection method and a storage device.
Background
An expert database is one of the most important information-support resources for science and technology activities. To let experts play their full role in consulting on science and technology innovation decisions, construction of science and technology expert databases has accelerated across the country in recent years. Judging from actual construction, a certain amount of expert data has been accumulated, but problems remain: information is not updated in time, databases expand slowly, and information sources are limited.
Selecting and recommending experts with a science and technology expert database requires the database to provide accurate portraits of each expert's professional field, academic ecology, and academic life cycle, and requires effective collection and dynamic updating of expert information and achievements. At present, expert information is mainly collected by crawling massive internet resources and then fully mining and fusing the results. For example, patent application No. 201910976349.1 discloses a method for constructing and applying scholar-oriented user portraits, which obtains scholars' basic information from their Chinese homepages and their research information from well-known academic websites at home and abroad, building an academic resource corpus and thereby an accurate scholar portrait.
Although big-data resources on the internet are rich, the data usually needs extraction processing to yield structured expert information, and multi-source data are very likely to be inconsistent. In this situation, a mechanism is needed that ensures the recall rate of the collected data, keeps the data updated in time, accurately judges the reliability of the data, and minimizes the error rate of data updates; the prior art cannot achieve this.
Disclosure of Invention
In view of the above problems, the present application provides an expert information detection method to solve the technical problem that existing expert-detection techniques cannot simultaneously ensure the recall rate of collected data, keep the data updated in time, and accurately judge the reliability of the data. The specific technical scheme is as follows:
an expert information detection method, comprising the steps of:
collecting target resources and preprocessing the target resources;
fusing and aligning multi-source heterogeneous data;
and pushing the expert information change early warning.
Further, the "collecting target resources and preprocessing the target resources" specifically includes the steps of:
acquiring a target web page, and performing retrieval filtering, block text information extraction, and detailed field information extraction on the target web page.
Further, the "retrieval filtering of the target homepage" specifically includes the steps of:
pre-searching with a preset search strategy;
performing binary classification on each item of the pre-retrieval results with a classifier model to judge whether it contains expert information, setting the label to 1 if it does and to 0 otherwise;
filtering out, through a filter, web pages that meet any one of the following conditions: the label is 0; the classifier label is 1 but the classifier score is less than a preset value; or the title or abstract contains preset keywords.
Further, the "block text information extraction" specifically includes the steps of:
filtering the target homepage through preset rules to obtain text information;
and performing paragraph-level and sentence-level extraction through two preset models.
Further, the preset models comprise BiLSTM and CRF.
Further, the "detailed field information extraction" specifically includes the steps of:
extracting different types of fields with different methods;
the different methods including one or more of: regular matching, named-entity recognition of fields, and dictionary mode.
Further, the target homepage includes one or more of: the employer's official website, an encyclopedia homepage, an academic homepage, and a LinkedIn homepage.
Further, the "fusing and aligning multi-source heterogeneous data" specifically includes the steps of:
aligning the captured experts with the experts in the original expert database in a preset manner, and performing one or more of the following operations: creating new field storage, de-duplication, and merging.
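As an illustration of the fusion-alignment step, a minimal Python sketch of merging a crawled record into the database record; the field names and the merge policy (fill empty fields, de-duplicate and combine list fields) are assumptions, not the patent's exact scheme:

```python
def merge_expert(db_record, crawled):
    """Fuse a crawled expert record into the database record:
    create missing fields, fill empty ones, and de-duplicate/combine
    list-valued fields. Field names are illustrative."""
    merged = dict(db_record)
    for field, value in crawled.items():
        if field not in merged or not merged[field]:
            merged[field] = value  # create new field storage / fill a gap
        elif isinstance(merged[field], list) and isinstance(value, list):
            # de-duplicate and combine list fields (papers, projects, ...)
            merged[field] = sorted(set(merged[field]) | set(value))
    return merged
```

A real system would align records by name disambiguation first; this sketch only shows the field-level merge once two records are known to refer to the same expert.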
Further, the "pushing the expert information change early warning" specifically includes the steps of:
judging whether important fields in the expert information have changed, and if so, pushing an expert information change warning to the target end for display.
To solve the above technical problem, a storage device is also provided. The specific technical scheme is as follows:
a storage device having stored therein a set of instructions for performing: any of the steps of one of the expert information detection methods mentioned above.
The invention has the following beneficial effects. An expert information detection method comprises the steps of: collecting target resources and preprocessing them; fusing and aligning multi-source heterogeneous data; and pushing early warnings when expert information changes. With this method, a target expert can be detected quickly, and when the target expert's important information changes, an early-warning update is issued in time, ensuring the accuracy of the data.
The above description of the invention is only an overview of the technical solutions of the present application. To make these solutions clearer to those skilled in the art, the invention may be further implemented according to the contents of the text and drawings of the present application; and to make the above and other objects, features, and advantages of the present application easier to understand, the following detailed description is given in conjunction with the drawings.
Drawings
The drawings are only for purposes of illustrating the principles, implementations, applications, features, and effects of particular embodiments of the present application, as well as others related thereto, and are not to be construed as limiting the application.
In the drawings of the specification:
FIG. 1 is a flow chart of a method for expert information detection according to an embodiment;
FIG. 2 is a block diagram illustrating an expert information detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the retrieval filtering of a target homepage according to an embodiment;
FIG. 4 is a flowchart of block text information extraction according to an embodiment;
FIG. 5 is a diagram illustrating an algorithm for extracting block text information according to an embodiment;
FIG. 6 is a diagram illustrating expert information fields in accordance with an exemplary embodiment;
FIG. 7 is a data flow graph built with the TensorFlow platform, as described in the detailed description;
FIG. 8 is a diagram illustrating a change information display according to an embodiment;
fig. 9 is a block diagram of a storage device according to an embodiment.
The reference numerals referred to in the above figures are explained below:
900. a storage device.
Detailed Description
In order to explain in detail possible application scenarios, technical principles, practical embodiments, and the like of the present application, the following detailed description is given with reference to the accompanying drawings in conjunction with the listed embodiments. The embodiments described herein are merely for more clearly illustrating the technical solutions of the present application, and therefore, the embodiments are only used as examples, and the scope of the present application is not limited thereby.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with that embodiment can be included in at least one embodiment of the application. The appearances of the phrase "an embodiment" in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate embodiment mutually exclusive of other embodiments. In principle, the technical features mentioned in the embodiments can be combined in any manner to form a corresponding implementable technical solution, as long as there is no technical contradiction or conflict.
Unless defined otherwise, technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the use of relational terms herein is intended only to describe particular embodiments and is not intended to limit the present application.
In the description of the present application, the term "and/or" is an expression describing a logical relationship between objects and covers three cases: for example, "A and/or B" means A alone, B alone, or both A and B. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
In this application, terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
In this application, expressions such as "including", "comprising", and "having" are intended to cover a non-exclusive inclusion: without further limitation, a process, method, or article that includes a list of elements is not limited to those elements, and may include other elements not expressly listed or inherent to the process, method, or article.
Consistent with examination guidelines, terms such as "greater than", "less than", and "more than" are understood to exclude the stated number, while "above", "below", and "within" are understood to include it. In addition, in the description of the embodiments, "a plurality" means two or more (including two), and similar expressions such as "a plurality of groups" and "a plurality of times" are understood likewise, unless specifically defined otherwise.
As mentioned in the background, the prior art lacks a mechanism that ensures the recall rate of collected data, keeps the data updated in time, accurately judges its reliability, and minimizes the error rate of data updates. The core technical idea of the application is therefore to obtain accurate, comprehensive, and usable science and technology expert data by studying data capture from massive internet resources, name disambiguation, information supplementation, and automatic information detection and updating; to ensure continuous dynamic updates that reflect the latest academic activities and achievement output; and to prompt changes and abnormal conditions. The overall modules and architecture can be as shown in fig. 2.
In this embodiment, an expert information detection method may be applied to a storage device, where the storage device includes but is not limited to: personal computers, servers, general purpose computers, special purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, etc. The following is a detailed description:
as shown in fig. 1, an expert information detection method includes the steps of:
step S101: collecting target resources and preprocessing the target resources. The step corresponds to automatic acquisition and extraction of mass resources in fig. 2, is an important process for constructing expert basic resources, and aims to research key technologies for capturing, extracting, cleaning, removing duplicate and the like of mass, multi-source, heterogeneous and fragmented expert scientific research behavior data, and the acquired expert scientific research behavior data is put in storage according to expert data standards. In order to realize dynamic detection and update of expert data, supplementary expert data are collected from the web every month so as to improve the accuracy of expert database data.
Collecting expert data from the web poses several challenges: searching for and selecting personal homepages among open web data, analyzing web pages and extracting the main text content, structuring information organized in different ways, and semantic analysis of Chinese text. To address these challenges, three main processes are set up: retrieval filtering of the target web page, block text information extraction, and detailed field information extraction. The three processes are explained in detail below:
as shown in fig. 3, "search and filter the target homepage", specifically, the method further includes the following steps:
step S301: and pre-searching by using a preset searching strategy.
Step S302: and performing secondary classification on each item of the pre-retrieval result through a classifier model to judge whether the item contains expert information, wherein if the item contains the expert information, the label cable is 1, and if the item does not contain the expert information, the label cable is 0.
Step S303: filtering out the web pages meeting any one of the following conditions through a filter: the web pages with label of 0 or the classifier label of 1 and the score of the classifier less than the preset value, or the title abstract contains preset keywords.
In steps S301 to S303, an expert information retrieval strategy is formulated according to the multi-source nature of expert resources, and automatic identification and filtering of expert homepages based on a classifier model is realized by analyzing the source characteristics of expert homepages. The specific steps are as follows:
Using the information retrieval function of a full-text search engine such as Google or Baidu, different search strategies are tried according to how common the expert's name is, and a crawler crawls the search results as candidate homepage pages. The specific search strategy is:
a) if the name is uncommon, search directly with "name site:.edu.cn";
b) if the name is common, search with "name unit site:";
c) otherwise, the plain "name unit" search is used as before.
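The three strategies above can be sketched as follows. The exact query templates, in particular whether strategy b) also restricts the site to .edu.cn, are assumptions filled in for illustration:

```python
def build_queries(name, unit, name_is_common):
    """Build candidate search queries per strategies a)-c) above.
    The query templates are assumptions based on the text."""
    if not name_is_common:
        return [f'{name} site:.edu.cn']       # a) uncommon name: .edu.cn only
    return [f'{name} {unit} site:.edu.cn',    # b) common name: add the unit
            f'{name} {unit}']                 # c) fallback: plain "name unit"
```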
Each search-result item is then classified with the classifier model to judge whether it contains expert information; label is set to 1 if it does, otherwise to 0. The classifier designs classification features separately for the different parts of an item (title and abstract), selecting features such as the number of occurrences of the searched name, the searched unit, other names, other units, matched feature words, and homepage/LinkedIn/encyclopedia introductions.
To find the expert's official personal-information page as accurately as possible, a filter discards any page meeting one of the following conditions:
a) the label is 0;
b) the classifier label is 1 but the classifier score is below a preset value (for example, 0.6);
c) the title or abstract contains keywords such as "research", "report", "visit", "meeting", "invited", "awarded", "communicating", or "recruiting".
The latest timestamp is then obtained from each remaining homepage; the pages are sorted by time, and the most recent one is selected as the page to be parsed.
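The filter conditions a)-c) can be sketched as follows; the 0.6 threshold and keyword list come from the examples above, while the item structure (a dict with label, score, title, abstract) is an assumption:

```python
FILTER_KEYWORDS = ("research", "report", "visit", "meeting",
                   "invited", "awarded", "communicating", "recruiting")

def keep_page(item, threshold=0.6):
    """Apply filter conditions a)-c): drop label-0 pages, low-score
    positives, and pages whose title/abstract contains an activity keyword."""
    if item["label"] == 0:                                 # a) non-expert page
        return False
    if item["label"] == 1 and item["score"] < threshold:   # b) low-confidence positive
        return False
    text = (item["title"] + " " + item["abstract"]).lower()
    if any(k in text for k in FILTER_KEYWORDS):            # c) news/activity page
        return False
    return True
```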
Referring to fig. 4, block text information extraction is performed on a web page to be parsed. It mainly comprises the following steps:
step S401: and filtering the target homepage through a preset rule to obtain text information.
Step S402: and extracting paragraph level and sentence level through two preset models.
Steps S401 to S402 analyze the content and structural characteristics of expert homepages, study web page partitioning and paragraph information acquisition, and implement an expert-homepage paragraph parsing method based on the BiLSTM and CRF models. The specific steps are as follows:
and partitioning and filtering the input webpage by using semantic and visual features of the HTML label. The purpose of blocking is to convert the structured web page into flat text units, and the text information contained in each text unit is cohesive. When the text of the webpage text is obtained, most webpages contain non-text contents which are irrelevant to the subject information of the webpages, such as navigation bars, pictures, default information in labels of a selector and the like, and the text information can be obtained after filtering is carried out by using rules.
Paragraph-level and sentence-level extraction is performed on the text content with the BiLSTM and CRF models. Because the personal-introduction information mainly concerns professional scholars, the page text is semi-structured: much of the information to be extracted is semantically related and forms text blocks at nearby positions. In the information extraction stage, the text in the page can be divided according to education experience, work experience, part-time experience, patents, and projects. Paragraph-level text extraction can be cast as a sequence labeling problem: for a one-dimensional linear input sequence

x = x1, x2, x3, ..., xi, ..., xn

each element is tagged with some tag from the tag set:

y = y1, y2, y3, ..., yi, ..., yn

This is essentially the problem of classifying each element of a linear sequence according to its context. The combination of BiLSTM and CRF is chosen to solve the sequence labeling problem; the algorithm is sketched in FIG. 5.
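For reference, the objective that a BiLSTM+CRF tagger optimizes can be written as the standard linear-chain CRF log-likelihood (a textbook formulation, not quoted from the patent):

```latex
P(y \mid x) = \frac{\exp\!\Big(\sum_{i=1}^{n} \big(A_{y_{i-1},\,y_i} + P_{i,\,y_i}\big)\Big)}
                   {\sum_{y'} \exp\!\Big(\sum_{i=1}^{n} \big(A_{y'_{i-1},\,y'_i} + P_{i,\,y'_i}\big)\Big)},
\qquad
\mathcal{L} = -\log P(y \mid x)
```

Here P_{i,y} is the BiLSTM emission score for tag y at position i, and A is the learned tag-transition matrix; the denominator sums over all possible tag sequences y'.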
The model comprises the following specific steps:
a) obtaining training annotation corpora
Web page text data containing the personal information of 1000 people is obtained through a search engine and stored as plain text. The data is denoised and labeled; 900 people's pages serve as training data and 100 as test data.
The labeling result uses XML-like paired tags, such as <sender></sender>; the tags are paired and not nested, and there are 15 tags (corresponding to 15 attributes), as shown in FIG. 6.
b) Data preprocessing stage
For data cleaning, non-text punctuation (e.g. full-width spaces and non-breaking space characters) is removed from the page text, multiple consecutive line breaks are converted into one, and so on.
The text data is then labeled character by character with the BIO tag set: B-PER and I-PER mark the first and subsequent characters of a person name, B-ORG and I-ORG mark the first and subsequent characters of an organization, and O marks characters that do not belong to any named entity. Extending this scheme to all 15 attributes gives 31 corresponding tag categories.
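A minimal sketch of producing such character-level BIO tags from labeled entity spans; the (start, end, type) span format is an assumption for illustration:

```python
def to_bio(tokens, spans):
    """Convert character tokens plus (start, end, type) entity spans
    (end exclusive) into BIO tags, e.g. B-PER/I-PER/B-ORG/I-ORG/O."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # non-first characters
    return tags
```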
c) Building and training models
A data flow graph is built on the TensorFlow platform from constants, variables, and operations. The model's data flow graph is shown in FIG. 7.
In the data flow graph, in order to increase the expressive power of the neural network, a layer of LSTM units is added.
In InputData, one-hot characters are mapped to character embeddings; the text is processed into w, a word vector over the time sequence, together with the label of the corresponding hidden state, where the label is one of the 31 tag types from data preprocessing.
The data is then fed into the LSTM cell. LSTMCell carries out the main operation of the LSTM, its entire hidden layer; the cell is unrolled over time, and dropout is added so that a proportion of hidden-layer activations is dropped to prevent overfitting.
The data then enters the Bi-LSTM cell, which contains a forward LSTM cell and a backward LSTM cell of identical structure; when sequence data is input, one cell receives it in normal order and the other in reverse order. The tensors of the forward and backward cells are concatenated, so each tensor's length is twice the number of hidden-layer nodes, i.e. 600. The Bi-LSTM yields a vector b, and dropout again drops a proportion of hidden activations to prevent overfitting.
Finally, the probability of each tag is obtained through a fully connected layer; the output tensor's dimensions are adjusted to the shape [batch_size, max_seq_len, num_tags].
The predicted tag tensor and the true tag sequence are used as input to a CRF layer; the CRF log_likelihood serves as the loss function, and reduce_mean gives the mean loss value.
d) Test model
100 records of data were selected for testing with an evaluation tool.
After the block text information is extracted, detailed fields are extracted: different types of fields are extracted with different methods, including one or more of regular matching, named-entity recognition of fields, and dictionary mode.
This stage mainly extracts the text paragraphs obtained above using templates and rules and stores the structured expert information in a database; the goal is to turn paragraph-level text into structured JSON data. Since detailed-field extraction involves many fields, different methods are adopted for different field types:
a) Regular matching:
Age, year and month of birth, gender, telephone number, email address, honors, the start and end times of education history, work history, and project history, patent numbers, and publication times can easily be identified with regular expressions.
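A hedged sketch of such regular-expression field extraction; the patterns below are illustrative, not the patent's actual expressions:

```python
import re

FIELD_PATTERNS = {  # illustrative patterns, not the patent's
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"1\d{10}",                       # mainland mobile number
    "birth": r"(19|20)\d{2}年\d{1,2}月",        # e.g. 1975年3月
}

def extract_fields(text):
    """Return the first match of each field pattern found in the text."""
    return {name: m.group(0)
            for name, pat in FIELD_PATTERNS.items()
            if (m := re.search(pat, text))}
```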
b) Named-entity recognition of fields:
The organization names and person names involved are complex in composition, huge in number, and impossible to enumerate exhaustively. For recognizing organizations and people, an entity recognition model is trained on the MSRA corpus.
c) Project names, project types, patent names, patent types, award names, award types, and works can be identified with the named-entity method or matched by rules.
d) Entity keyword extraction can be performed in a dictionary mode and mainly comprises word segmentation and combination:
first, word segmentation processing is performed. For each n-gram (n takes 3-10) in the sentence, it is extracted if it is in the position keyword lexicon.
The extraction of the optimal keyword combination is computationally expensive, and therefore, the extraction is performed by preferentially extracting n-grams (greedy algorithm) with long lengths. The long-length keywords can clearly express the semantics of the entity, so that the long words can be preferentially selected to better express the text intention, such as seat teaching and professor teaching.
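The greedy longest-first n-gram lookup can be sketched as follows; the n range follows the text (3 to 10), while the lexicon contents and the non-overlap bookkeeping are assumptions:

```python
def extract_titles(sentence, lexicon, n_min=3, n_max=10):
    """Greedy longest-first n-gram lookup against a position-keyword
    lexicon: longer matches are taken first and shorter overlapping
    n-grams are skipped."""
    found, used = [], [False] * len(sentence)
    for n in range(min(n_max, len(sentence)), n_min - 1, -1):  # long n first
        for i in range(len(sentence) - n + 1):
            gram = sentence[i:i + n]
            if gram in lexicon and not any(used[i:i + n]):
                found.append(gram)
                for j in range(i, i + n):
                    used[j] = True      # block shorter overlapping matches
    return found
```

With a lexicon containing both 讲座教授 ("chair professor") and 教授 ("professor"), the longer title wins, matching the preference described above.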
e) Current unit information extraction:
By inspecting the page texts, the current affiliation appears near the context of the first occurrence of the name in the basic-information module, and can also be obtained from the work-experience text and by domain-name comparison. The following rules are therefore established:
Combine the work-experience text with any text that may contain work units; assuming the current organization appears in the candidate text, identify all candidate organizations with the organization recognition model.
Filter out invalid people and institutions using keywords such as "scholar", "journal", and "laboratory".
Match within the candidate text using a regular expression over markers such as "now", "to date", "currently working at", "incumbent", and "work unit"; text matching this pattern is taken as the current affiliation.
If the current unit is still empty after these steps, match the page's domain name against the domain-name knowledge set; the matched unit is taken as the current unit. If this match is also empty, the original unit is assumed unchanged by default.
f) Current position information extraction:
Similar to the strategy for the current unit, the candidate text is first determined as the context (30 to 50 characters) around keywords such as "incumbent", "year and month of birth", and "gender".
Word segmentation and matching are then performed with the job-title dictionary knowledge set.
g) Academic part-time information extraction
The text is split on the delimiters "。", "\n", "，" and "；" into a list of sentences, each containing a single relation.
For each relation, time information is first extracted with a rule-based temporal regular expression, pattern_time.
Position keywords are then extracted in dictionary mode, and the academic part-time position is selected from them using the optimal-keyword-combination method.
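The pattern_time regular expression is truncated in the published text, so the pattern below is a plausible reconstruction matching spans such as "2016年-2020年" or "2016.09至今"; it is an assumption, not the patent's actual rule.

```python
import re

# Hedged reconstruction of a Chinese time-span pattern: a year (with optional
# month), a dash or "至" (to), then a second year or "至今" (to date).
pattern_time = re.compile(
    r"((19|20)\d{2})[年./]?(\d{1,2}月?)?[-—~至]+"
    r"(((19|20)\d{2})[年./]?(\d{1,2}月?)?|至今|今)"
)

def extract_time(sentence):
    """Return the first time span found in the sentence, or None."""
    m = pattern_time.search(sentence)
    return m.group(0) if m else None
```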
h) Education and work experience information extraction
The text is split on the delimiters "。", "\n", "，" and "；" into a list of sentences, each containing a single relation.
First, it is determined whether the text contains English; mixed Chinese-English text is translated into Chinese.
Time information is extracted with the temporal regular expression.
Organization entities are identified with the organization-recognition model.
Position and degree entities are identified in dictionary mode.
i) Patent information extraction
The text is split on the delimiters "。", "\n" and "；" into a list of sentences, each containing a single relation.
Patent numbers are identified with the patent-number regular expression.
Dates are identified with the temporal regular expression.
The text is split on punctuation marks, and the longer phrase strings obtained are used to identify the patent title and patent type.
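The patent-number regular expression is not given in the published text. As a hedged sketch, Chinese patent application numbers are commonly a 4-digit year, a 1-digit application-type digit, a 7-digit serial, and a check digit, optionally prefixed with "CN" or "ZL"; the pattern below encodes that assumption.

```python
import re

# Assumed shape of a Chinese patent application number, e.g. ZL202111546359.5:
# optional CN/ZL prefix, year, type digit (1/2/8/9), 7-digit serial, check digit.
PATENT_NO = re.compile(r"(?:CN|ZL)?\s*((?:19|20)\d{2}[1289]\d{7})\.?([0-9X])?")

def find_patent_numbers(text):
    """Return all patent-number-like spans found in the text."""
    return [m.group(0).strip() for m in PATENT_NO.finditer(text)]
```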
j) Project information extraction
Project funding amounts are identified with a regular expression.
Project numbers are identified with a regular expression.
Project names and project types are identified with regular expressions and extraction rules.
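The funding- and number-identification rules of step j) can be sketched with illustrative regular expressions. The patent does not give its actual patterns, so both regexes below (FUND_PAT for amounts such as "58万元", GRANT_PAT for grant numbers such as "61976054") are assumptions.

```python
import re

# Illustrative patterns, not the patent's actual rules:
# a funding amount in yuan (optionally 万元, ten-thousand yuan), and a
# 7-11-digit grant number optionally preceded by "编号" (number).
FUND_PAT = re.compile(r"(\d+(?:\.\d+)?)\s*万?元")
GRANT_PAT = re.compile(r"(?:编号[:：]?\s*)?([A-Z]{0,4}\d{7,11})")

def extract_project_fields(sentence):
    """Pull a funding amount and a grant number out of one project sentence."""
    fund = FUND_PAT.search(sentence)
    grant = GRANT_PAT.search(sentence)
    return {
        "funding": fund.group(0) if fund else None,
        "grant_no": grant.group(1) if grant else None,
    }
```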
k) Reward information extraction
The text is split on the delimiters "。", "\n" and "；" into a list of sentences, each containing a single relation.
The award year is identified with the temporal regular expression.
Phrases are obtained by splitting on separators such as "、", square brackets, parentheses, colons, semicolons and quotation marks.
The phrases are sorted by length in descending order, and the following rule is applied: a longer phrase not containing the award keyword is the awarded topic, while a phrase containing the award keyword is the award name.
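The longest-first award rule can be sketched as follows, assuming the award keyword is the character "奖" (prize); the function name and return shape are illustrative.

```python
def classify_award_phrases(phrases):
    """Sort phrases longest-first; a long phrase without the keyword '奖' is
    taken as the awarded topic, a phrase containing '奖' as the award name."""
    topic, award = None, None
    for p in sorted(phrases, key=len, reverse=True):
        if "奖" in p and award is None:
            award = p
        elif "奖" not in p and topic is None:
            topic = p
    return {"topic": topic, "award": award}
```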
l) paper information extraction
An ordered list in the text is extracted with a regular expression (matching entry markers such as "[1]", "[2]", "3.", and so on).
Interference items such as patents, projects and other meaningless text are removed with simple heuristic rules, for example an entry mentioning funding looks like a project, and one containing a patent number looks like a patent.
Each extracted paper is searched on Baidu Scholar to obtain structured information such as title, authors and year of publication. The query is built by splitting the reference string on punctuation marks, selecting the two longest substrings, and combining them into the Baidu Scholar query.
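The query construction just described (split on punctuation, keep the two longest substrings) can be sketched as follows; the exact split-character set is an assumption.

```python
import re

def build_query(reference):
    """Split a reference string on punctuation and join the two longest
    substrings (typically the title and the venue) into a search query."""
    parts = [p.strip()
             for p in re.split(r"[，。,.;；:：\[\]()（）]+", reference)
             if p.strip()]
    parts.sort(key=len, reverse=True)  # stable sort keeps original order on ties
    return " ".join(parts[:2])
```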
In this embodiment, to avoid the problem that a single academic homepage carries only simple information of insufficient breadth, the target homepages include: the unit official website, encyclopedia homepages, academic homepages, the LinkedIn homepage, and so on.
After the detailed field information has been extracted in step S101, step S102 is executed: fusion and alignment of the multi-source heterogeneous data. This mainly comprises the following steps: aligning the captured experts with the experts in the original expert database in a preset manner, and performing one or more of the following operations: creating new field storage, de-duplication, and merging. The aim is to de-duplicate, clean, align and disambiguate each batch of acquired heterogeneous academic resource data (the experts' basic information, papers, patents, projects, award information and so on) against the original expert database, so as to obtain dynamic and complete comprehensive objective data about each expert.
The original expert database and the captured expert-update data are aligned as follows:
a) Read the CSV mapping file to obtain the mapping relation expert_mapping between the id of each expert in the original database and the intermediate mapping value mid_index;
b) compute the intermediate mapping value mid_index of each captured expert with the following formula:
mid_index = row_number + 1000 * (grab_batch - 1)
c) Using the intermediate mapping value mid_index of a captured expert, the id of the corresponding expert can be looked up in expert_mapping, thereby aligning the captured expert with the original-library expert.
d) Merge and update the experts' information fields: the processing modes include creating new field storage, de-duplicating and merging the two data sources into array storage, and overwriting the original information in the expert library.
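Steps a) to c) above can be sketched as follows; the CSV column names ("id", "mid_index") and function names are assumptions about the file layout, which the patent does not specify.

```python
import csv

# a) Load the id <-> mid_index mapping from the CSV file (layout assumed).
def load_expert_mapping(path):
    with open(path, newline="", encoding="utf-8") as f:
        return {int(row["mid_index"]): row["id"] for row in csv.DictReader(f)}

# b) Intermediate mapping value: the row number, offset by 1000 per batch.
def mid_index(row_number, grab_batch):
    return row_number + 1000 * (grab_batch - 1)

# c) Look the captured expert up in the original database via mid_index.
def align(row_number, grab_batch, mapping):
    return mapping.get(mid_index(row_number, grab_batch))
```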
Step S103: pushing the expert-information change early warning. This specifically comprises the following steps:
Determine whether important fields in the expert information have changed; if so, push an expert-information change early warning to a target end for display. This step mainly compares important extracted fields, such as the current unit, position and title, against the fields in the database in a timely manner, and displays any changes on a web page for an administrator to check and confirm. To guarantee the accuracy of change detection, this part uses both substring-containment and exact string matching.
Concretely, for the information collected about experts already in the original database, the post-comparison change information is processed offline with an automatic monitoring method; similarity is compared field by field, and fields with low similarity are submitted to the expert as information awaiting confirmation.
The similarity-comparison algorithm for strings is the edit distance. The edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to turn one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
After computing the minimum edit distance of two texts, the smaller the number, the more similar the texts. The dynamic-programming computation has time complexity O(n*m) for strings of length n and m; computed pairwise, it is well suited to the short strings and short texts produced by extraction and parsing.
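The Levenshtein distance described above can be computed with the classic dynamic programme:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions turning a into b (O(len(a) * len(b)) time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]
```

For the short extracted fields being compared (unit names, titles), a small distance such as 1 signals near-identical strings, e.g. a one-character change in an organization name.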
With this method, usable expert web pages are automatically filtered out of massive Internet data, and the captured page information is disambiguated, parsed and extracted with multiple models, so the multi-source heterogeneous personal information of experts can be accurately returned in batches and stored in structured form. The user only needs to upload expert names and unit data in the agreed data format, without searching and verifying in a search engine, which saves the user's time.
In addition, Web expert-data resources are acquired automatically at regular intervals; each batch of acquired heterogeneous resource data is de-duplicated, cleaned, aligned and disambiguated against the original expert database to obtain expert-information change data, and human judgment and labeling are introduced at the decision stage. This enables change early warning and dynamic detection and updating of expert information, and improves the completeness, accuracy and validity of the expert database.
In some embodiments, an expert-data acquisition and updating system is built from the acquisition, detection and updating steps above, to realize intelligent detection and dynamic updating of the expert information base and improve the accuracy and validity of the expert information. The system comprises the following four modules:
a) data capture and analysis module
Creating a capture task. Expert data are uploaded in the agreed data format, the expert's name and unit name are entered, and a CSV data file with UTF-8 encoding is accepted.
Asynchronous data capture and parsing. Usable web pages must be filtered out of massive Internet data under the constraints of network bandwidth and search-engine rate limiting, and the captured page information must be disambiguated, parsed and extracted by multiple models, so data capture and parsing take a long time; the module is therefore designed as an asynchronous background module, with RabbitMQ as the task queue and Celery for asynchronous task scheduling.
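The patent names RabbitMQ as the broker and Celery as the scheduler. As a self-contained stand-in, the same producer/worker shape can be sketched with the standard library alone; the function names and worker count are illustrative.

```python
import queue
import threading

# Simplified stand-in for the RabbitMQ/Celery design: a task queue feeding a
# pool of background workers, so capture tasks run off the request path.
def run_grab_tasks(experts, grab_and_parse, workers=4):
    tasks, results = queue.Queue(), []
    lock = threading.Lock()

    def worker():
        while True:
            expert = tasks.get()
            if expert is None:          # poison pill: shut this worker down
                tasks.task_done()
                return
            parsed = grab_and_parse(expert)
            with lock:
                results.append(parsed)
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for e in experts:
        tasks.put(e)
    for _ in threads:                   # one pill per worker
        tasks.put(None)
    tasks.join()
    return results
```

In the real system, `grab_and_parse` would be the Celery task that fetches and parses one expert's pages.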
Viewing the task list. According to the paging request, information about multiple capture tasks can be returned, including the task name, creation time, creator and completion status (the number of experts captured and the total number).
Viewing capture results. Given a specified capture task and paging information, a list of the experts in the task is returned; the expert information displayed in the list includes name, homepage, organization, gender, e-mail, position, h-index, citation count and completion state.
b) Expert information change early warning module
Expert-information change reminder. On the expert list page, a marker appears before the name of any expert whose information has changed; it prompts the user that the expert's information has changed, and the details of the change can be viewed by operating on the marker.
Displaying expert change information. This function mainly gives early warning of changes in an expert's e-mail, organization and position information. The details of a change can be obtained from specified paging information or a specified expert, and the e-mail, organization and position before and after the change can be displayed from the change details, as shown in fig. 8.
c) Expert information labeling module
Because web pages from network sources follow no unified standard, their layout styles and content-organization modes vary beyond enumeration, and Chinese expression is flexible, no one model or set of models can correctly parse, extract and structure all text content; manual participation is therefore needed to confirm and label the captured and parsed expert information. The module provides labeling functions for ten types of information: the expert's basic information, awards, education experience, work experience, academic part-time positions, projects, patents, papers, research content and work results.
d) Expert updating data exporting module
To protect the security of the expert-database data, the expert update data must support offline updating, so a function for exporting expert-information update data must be provided to the user. The exported data set takes the task as its unit: given the specified task information, all expert data in the task are returned in file form as a JSON file with UTF-8 encoding.
Referring to fig. 9, in the present embodiment, a storage device 900 is implemented as follows:
a storage device 900 having stored therein a set of instructions for performing any of the steps of one of the expert information detection methods mentioned above.
Finally, it should be noted that, although the above embodiments have been described in the text and drawings of the present application, the scope of patent protection of the present application is not limited thereby. All technical solutions produced by replacing or modifying equivalent structures or equivalent flows according to the contents of the text and drawings of the present application, whether implemented directly or indirectly in other related technical fields, fall within the scope of protection of the present application.

Claims (10)

1. An expert information detection method, characterized by comprising the steps of:
collecting target resources and preprocessing the target resources;
fusing and aligning multi-source heterogeneous data;
and pushing the expert information change early warning.
2. The expert information detection method according to claim 1, wherein the step of collecting and preprocessing the target resource further comprises the steps of:
acquiring a target webpage, and performing retrieval filtering, block text information extraction and detailed field information extraction on the target webpage.
3. The expert information detection method according to claim 2, wherein the "retrieving and filtering a target homepage" further comprises the steps of:
pre-searching using a preset search strategy;
performing binary classification on each item of the pre-search result with a classifier model to judge whether it contains expert information, labeling the item 1 if it does and 0 if it does not;
filtering out, with a filter, web pages meeting any one of the following conditions: the label is 0; the classifier label is 1 but the classifier score is less than a preset value; or the title contains preset keywords.
4. The expert information detection method according to claim 2, wherein the block text information extraction specifically includes the steps of:
filtering the target homepage through preset rules to obtain text information;
and performing paragraph-level and sentence-level extraction through two preset models.
5. The expert information detection method of claim 4, wherein the preset models comprise: BiLSTM and CRF.
6. The expert information detection method according to claim 2, wherein the detailed field information extraction specifically includes the steps of:
extracting different types of fields with different methods;
the different methods comprising one or more of: regular-expression matching, named-entity recognition of fields, and dictionary matching.
7. The expert information detection method of claim 2 wherein the target homepage includes one or more of: the unit official website, encyclopedia homepage, academic homepage, and LinkedIn homepage.
8. The expert information detection method according to claim 1, wherein the "performing fusion alignment on multi-source heterogeneous data" specifically includes the steps of:
aligning the captured experts with the experts in the original expert database in a preset manner, and performing one or more of the following operations: creating new field storage, de-duplication, and merging.
9. The expert information detection method according to claim 8, wherein the "pushing the early warning of the change of the expert information" specifically comprises the steps of:
judging whether the important fields in the expert information have changed and, if so, pushing the expert-information change early warning to a target end for display.
10. A storage device having a set of instructions stored therein, the set of instructions being operable to perform: an expert information detection method as claimed in any one of claims 1 to 9.
CN202111546359.5A 2021-12-16 2021-12-16 Expert information detection method and storage device Pending CN114238616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111546359.5A CN114238616A (en) 2021-12-16 2021-12-16 Expert information detection method and storage device


Publications (1)

Publication Number Publication Date
CN114238616A true CN114238616A (en) 2022-03-25

Family

ID=80757361


Country Status (1)

Country Link
CN (1) CN114238616A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117762692A (en) * 2023-12-27 2024-03-26 湖南长银五八消费金融股份有限公司 Database abnormal data processing method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933711A (en) * 2019-03-04 2019-06-25 上海会米策信息科技有限公司 Experts database system, retrieval method for pushing and computer readable storage medium
CN112768039A (en) * 2020-12-31 2021-05-07 平安国际智慧城市科技股份有限公司 Information monitoring method and device based on artificial intelligence, computer equipment and medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO HONGFEI; GU FU; ZHANG JIN; CHEN JIXI: "Research on expert information extraction methods based on expert homepages", 情报探索 (Information Exploration), no. 12, 15 December 2019 (2019-12-15), pages 1-9 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination