CN114492402A - Scientific and technological new word recognition method and device - Google Patents

Info

Publication number
CN114492402A
Authority
CN
China
Prior art keywords
unit
key
scientific
time
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111624012.8A
Other languages
Chinese (zh)
Inventor
贾永芳
刘�东
陈华雄
王健
韩霜
艾静
曹丽霄
霍瞳
刘烨
殷广丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Center For Science And Technology Evaluation
Beijing Aerospace Intelligent Technology Development Co ltd
Original Assignee
National Center For Science And Technology Evaluation
Beijing Aerospace Intelligent Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Center For Science And Technology Evaluation and Beijing Aerospace Intelligent Technology Development Co ltd
Priority to CN202111624012.8A
Publication of CN114492402A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a method and a device for recognizing scientific and technological new words. Key vocabulary information is crawled from papers, the frequency trend of each key word is counted over a time sequence, and the key words are screened against preset thresholds to obtain scientific and technological new words, so that scientific and technological dynamics and research directions can be grasped comprehensively. The method and the device can acquire scientific and technological new words accurately and efficiently, address the difficulty of building a conventional new word library, reduce the manpower and material resources consumed, shorten the acquisition period, and provide a new approach to acquiring scientific and technological new words.

Description

Scientific and technological new word recognition method and device
Technical Field
The invention relates to the technical field of big data and information, in particular to a method and a device for recognizing scientific and technological new words.
Background
Science and technology is an important force for promoting the development of modern productivity and a key driver of rapid economic growth. At present, new technologies in various fields are developing in increasingly diverse directions, and discovering new terms related to science and technology is especially important for staying at the forefront of scientific and technological development and for advancing related research fields.
Existing new word discovery methods mainly focus on network environments or other specific fields, and fall into two types. In the first, new words are manually labeled in a text, a model is trained by means of natural language processing technology, and the model is applied to new word discovery once it reaches a certain accuracy. In the second, a word bank for finding new words is provided, and continuously emerging hot new words are analyzed and extracted from internet big data according to the new words provided in the word bank.
However, improving the accuracy of the first method requires a lot of manpower to find appropriate texts and label them for training, and the labels must be accurate or the training effect suffers. Moreover, if the training result is not ideal, training must be repeated many times, so the period is long. The second method relies on a word bank, but new words are by definition newly emerging, and the words already in the word bank cannot accurately predict all types of new words that will appear in the future. Therefore, a more convenient and accurate method for identifying scientific and technological new words is needed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for recognizing scientific and technological new words, and aims to solve the problems that in the prior art, the recognition of the scientific and technological new words is large in limitation and low in accuracy.
In order to solve the technical problem, the embodiment of the invention discloses the following technical scheme:
a scientific and technological new word recognition method comprises the following steps:
acquiring key vocabulary information of each paper in a preset paper database, wherein the key vocabulary information at least comprises key words and the publication time corresponding to the key words;
dividing a preset statistical age into a plurality of time units according to a preset time interval, and determining the occurrence frequency and the growth rate of the key words in each time unit by using the publication time corresponding to each key word;
and judging whether the key words are scientific and technological new words or not according to the publication time corresponding to each key word and the occurrence frequency and the growth rate of each time unit.
Optionally, the acquiring of the key vocabulary information of each paper in the preset paper database includes:
acquiring key information of each paper in the preset paper database, wherein the key information at least comprises a title, keywords, an abstract and publication time;
and extracting the key words in each paper and the publication time corresponding to the key words by using the key information.
Optionally, the preset statistical age is divided into a plurality of time units according to a preset time interval, and the frequency and the growth rate of the key words appearing in each time unit are determined by using publication time corresponding to each key word, including:
dividing each year into the first half year and the second half year by taking 6 months as a preset time interval, wherein each half year is a time unit;
taking 20 time units before the time unit to which the current moment belongs as a preset statistical age;
aiming at each key word, determining the frequency of each time unit of the key word within a preset statistical age by using the publication time corresponding to the key word;
respectively calculating the growth rate of each key word in each time unit according to the following formula;
$$R_{im} = \frac{N_m - N_{m-1}}{N_{m-1}}$$
wherein $R_{im}$ is the growth rate of the key word i in the m-th time unit, $N_m$ is the frequency of occurrence of the key word i in the m-th time unit, and $N_{m-1}$ is the frequency of occurrence of the key word i in the (m-1)-th time unit.
Optionally, the determining whether the key words are scientific and technological new words according to the publication time corresponding to each key word and the occurrence frequency and growth rate of each time unit includes:
judging whether a time unit with the occurrence frequency of the key words larger than a preset frequency threshold exists or not for each key word;
if so, judging whether the growth rate of the key words in the time unit is greater than a preset growth rate threshold value, and if so, determining the key words as new words of the time unit;
and screening out the vocabulary of the science and technology class from the determined new words to be used as the finally recognized science and technology new words.
Optionally, the screening out the vocabulary of the science and technology class from the determined new words as the finally recognized science and technology new words includes:
establishing a corpus sample set according to a scientific class vocabulary and a non-scientific class vocabulary which are acquired in advance;
training on the scientific and non-scientific vocabularies in the corpus sample set by using a BERT (Bidirectional Encoder Representations from Transformers) model to respectively obtain sample characteristics of the scientific and non-scientific vocabularies;
and sequentially judging whether each new word belongs to the science and technology class vocabulary or not based on the BERT model, and determining the new words belonging to the science and technology class vocabulary as the science and technology new words in the corresponding time unit.
A scientific and technological new word recognition device, comprising:
a key vocabulary information acquisition unit, used for acquiring key vocabulary information of each paper in a preset paper database, wherein the key vocabulary information at least comprises key words and publication time corresponding to the key words;
the frequency and growth rate calculation unit is used for dividing a preset statistical age into a plurality of time units according to a preset time interval, and determining the frequency and growth rate of the key words appearing in each time unit by using the publication time corresponding to each key word;
and the scientific and technological new word judging unit is used for judging whether the key words are scientific and technological new words or not according to the publication time corresponding to each key word and the frequency and the growth rate of the key words appearing in each time unit.
Optionally, the important vocabulary information obtaining unit includes:
a key information acquisition unit, used for acquiring key information of each paper in a preset paper database, wherein the key information at least comprises a title, keywords, an abstract and publication time;
and an information extraction unit, used for extracting the key words in each paper and the publication time corresponding to the key words by using the key information.
Optionally, the frequency and growth rate calculating unit includes:
the time unit dividing unit is used for dividing each year into the first half year and the second half year at a preset time interval of 6 months, and each half year is a time unit;
a preset statistical age determining unit, configured to use 20 time units before a time unit to which the current time belongs as a preset statistical age;
the frequency calculation unit is used for determining the frequency of the key words appearing in each time unit within a preset statistical year by using the publication time corresponding to the key words for each key word;
the growth rate calculating unit is used for calculating the growth rate of each key word in each time unit according to the following formula;
$$R_{im} = \frac{N_m - N_{m-1}}{N_{m-1}}$$
wherein $R_{im}$ is the growth rate of the key word i in the m-th time unit, $N_m$ is the frequency of occurrence of the key word i in the m-th time unit, and $N_{m-1}$ is the frequency of occurrence of the key word i in the (m-1)-th time unit.
Optionally, the scientific and technological new word determination unit includes:
the frequency judging unit is used for judging whether a time unit with the occurrence frequency of the key words larger than a preset frequency threshold exists or not according to the time sequence for each key word;
the growth rate judging unit is used for judging whether the growth rate of the key words in the time units is greater than a preset growth rate threshold value or not when the time units with the key words appearing frequency greater than a preset frequency threshold value exist;
the new word determining unit is used for determining the key vocabulary as the new word of the time unit when the growth rate of the key vocabulary in the time unit is greater than a preset growth rate threshold value;
and the scientific and technological new word screening unit is used for screening out scientific and technological words from the determined new words to serve as the finally recognized scientific and technological new words.
Optionally, the scientific and technological new word screening unit includes:
the corpus sample set establishing unit is used for establishing a corpus sample set according to a scientific class vocabulary and a non-scientific class vocabulary which are acquired in advance;
the sample training unit is used for training on the scientific and non-scientific vocabularies in the corpus sample set by using a BERT (Bidirectional Encoder Representations from Transformers) model to respectively obtain sample characteristics of the scientific and non-scientific vocabularies;
and the scientific and technological new word determining unit is used for sequentially judging whether each new word belongs to scientific and technological vocabularies based on the BERT model and determining the new words belonging to the scientific and technological vocabularies as the scientific and technological new words in the corresponding time units.
According to the technical scheme, whether a key word is a scientific and technological new word is judged by acquiring the occurrence frequency and the growth rate of the key word in different time units. Key vocabulary information is crawled from papers, the frequency trend of each key word is counted over a time sequence, and the key words are screened against preset thresholds to obtain scientific and technological new words, so that scientific and technological dynamics and research directions can be grasped comprehensively. The method and the device can acquire scientific and technological new words accurately and efficiently, address the difficulty of building a conventional new word library, reduce the manpower and material resources consumed, shorten the acquisition period, and provide a new approach to acquiring scientific and technological new words.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a scientific and technological new word recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of step S101 in Fig. 1 according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of step S103 in Fig. 1 according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of step S1034 in Fig. 3 according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a scientific and technological new word recognition device according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a scientific and technological new word recognition method provided by the present invention, as shown in fig. 1, the method includes the following steps:
Step S101: Acquire key vocabulary information of each paper in a preset paper database.
In the embodiment disclosed by the invention, the data required for finding scientific and technological new words is mainly paper data, and all of the paper data comes from the SCI citation database to ensure the authority of the result.
In one embodiment of the present disclosure, as shown in Fig. 2, step S101 may be completed by the following substeps:
Step S1011: Acquire the key information of each paper in the preset paper database.
To obtain the key information, data such as the title, keywords, abstract and publication time is collected for each paper.
Step S1012: Extract the key words in each paper and the publication time corresponding to the key words by using the key information.
After the key information of each paper is obtained, the key words of the paper are extracted from its title, keywords, abstract and similar information, and the publication time of the paper is taken as the publication time corresponding to each of its key words.
Because the amount of data to be stored is large, in the specific embodiment disclosed by the invention the Scrapy framework for the Python programming language is used to complete data acquisition and storage efficiently: custom acquisition code is written according to the fields to be acquired and the data structure of the target database.
After the papers are collected and stored in the database, the Logstash tool is used to segment the keywords of each paper, that is, to extract the words in the keyword list below each paper's abstract. Logstash is a free and open server-side data processing pipeline that supports a variety of inputs, can capture events from many common sources simultaneously, and can dynamically collect, transform and transmit data.
In the embodiment disclosed by the invention, the important vocabulary information at least comprises the important vocabulary and the publication time corresponding to the important vocabulary. And screening the obtained key words to obtain new science and technology words, wherein the publication time corresponding to each key word is the publication time of the paper to which the key word belongs. For example, a certain key word is extracted from 3 papers, the publication times of the 3 papers are 2/6/2019, 4/21/2020, and 8/5/2021, and then the publication times corresponding to the key word are 2/6/2019, 4/21/2020, and 8/5/2021.
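The correspondence between a key word and its publication times can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the record layout, the sample key words and the helper name `publication_times` are hypothetical and do not come from the embodiment; only the three dates are taken from the example above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical paper records: (publication date, key words extracted from the paper).
papers = [
    (date(2019, 2, 6), ["graph neural network", "drug discovery"]),
    (date(2020, 4, 21), ["graph neural network"]),
    (date(2021, 8, 5), ["graph neural network", "federated learning"]),
]


def publication_times(records):
    """Map each key word to the publication dates of the papers that contain it."""
    times = defaultdict(list)
    for published, words in records:
        for word in set(words):  # count a key word once per paper
            times[word].append(published)
    return {word: sorted(dates) for word, dates in times.items()}


word_times = publication_times(papers)
```

Here the key word that appears in all three hypothetical papers ends up with three publication times, one per paper, as in the example above.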
Step S102: dividing a preset statistical year into a plurality of time units according to a preset time interval, and determining the occurrence frequency and the growth rate of the key words in each time unit by using the publication time corresponding to each key word.
In the embodiment of the present disclosure, the preset statistical age is 10 years, and step S102 may be completed in the following manner.
1) Each year is divided into the first half year and the second half year with 6 months as a preset time interval, and each half year is a time unit.
2) Take the 20 time units before the time unit to which the current time belongs as the preset statistical age.
For example, if the current time falls in the first half of a year, 20 time units are counted back starting from the second half of the previous year; if the current time falls in the second half of a year, 20 time units are counted back starting from the first half of the current year.
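A minimal Python sketch of this window selection follows. The function names and the `(year, half)` tuple representation of a time unit are assumptions made for illustration; the embodiment does not prescribe a data structure.

```python
from datetime import date


def half_year_unit(d):
    """Return (year, half) for a date; half is 1 for January-June, 2 for July-December."""
    return (d.year, 1 if d.month <= 6 else 2)


def previous_unit(unit):
    """Return the half-year unit immediately before the given one."""
    year, half = unit
    return (year, 1) if half == 2 else (year - 1, 2)


def statistical_window(today, n_units=20):
    """List the n_units half-year units preceding the unit that contains today, oldest first."""
    unit = half_year_unit(today)
    window = []
    for _ in range(n_units):
        unit = previous_unit(unit)
        window.append(unit)
    return window[::-1]


# Hypothetical current date in the second half of 2021.
window = statistical_window(date(2021, 12, 27))
```

With 6-month units, 20 units cover 10 years, which matches the preset statistical age stated for this embodiment.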
3) For each key word, determine the frequency of the key word in each time unit within the preset statistical age by using the publication time corresponding to the key word.
According to the publication time of the paper to which each key word belongs, the time information related to each key word can be obtained, and therefore the frequency of each key word appearing in each time unit can be counted. For example, according to the publication time corresponding to a certain important word, the frequency of the occurrence of the important word in the first half-year unit of 2020 is determined to be 20 times, the frequency of the occurrence of the important word in the second half-year unit of 2020 is determined to be 38 times, and the frequency of the occurrence of the important word in the first half-year unit of 2021 is determined to be 60 times.
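The counting can be sketched in Python as follows. The helper names and the toy dates are hypothetical; the sketch assumes that each publication date contributes one occurrence and that dates outside the window are ignored.

```python
from collections import Counter
from datetime import date


def half_year_unit(d):
    """Return (year, half) for a date; half is 1 for January-June, 2 for July-December."""
    return (d.year, 1 if d.month <= 6 else 2)


def unit_frequencies(dates, window):
    """Count how many publication dates fall into each time unit of the window."""
    counts = Counter(half_year_unit(d) for d in dates)
    return {unit: counts.get(unit, 0) for unit in window}


# Hypothetical window of three time units and publication dates of one key word.
window = [(2020, 1), (2020, 2), (2021, 1)]
dates = [date(2020, 3, 1), date(2020, 5, 9), date(2020, 11, 30), date(2021, 8, 5)]
freq = unit_frequencies(dates, window)
```

The date in August 2021 falls in the second half of 2021, which is outside the toy window, so it is not counted.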
4) After the frequency of each key word appearing in each time unit is obtained, the growth rate of each key word in each time unit is respectively calculated according to the following formula:
$$R_{im} = \frac{N_m - N_{m-1}}{N_{m-1}}$$
wherein $R_{im}$ is the growth rate of the key word i in the m-th time unit, $N_m$ is the frequency of occurrence of the key word i in the m-th time unit, and $N_{m-1}$ is the frequency of occurrence of the key word i in the (m-1)-th time unit.
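As a worked illustration, the sketch below reads the growth rate as the relative change (N_m - N_{m-1}) / N_{m-1}, which is an assumption based on the symbol definitions above. Returning `None` for the first time unit and for a previous frequency of zero is also an assumption; the embodiment does not state how these cases are handled.

```python
def growth_rates(frequencies):
    """Return the growth rate for each time unit, given frequencies in time order.

    The first unit has no predecessor, and a predecessor frequency of 0 would
    divide by zero; both cases yield None (an assumption of this sketch).
    """
    rates = [None]
    for prev, cur in zip(frequencies, frequencies[1:]):
        rates.append((cur - prev) / prev if prev > 0 else None)
    return rates


# Frequencies from the example above: first half of 2020, second half of 2020, first half of 2021.
rates = growth_rates([20, 38, 60])
```

Under this reading, the frequencies 20, 38 and 60 give growth rates of 18/20 for the second unit and 22/38 for the third.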
Step S103: and judging whether the key words are new science and technology words or not according to the publication time corresponding to each key word and the occurrence frequency and growth rate of each time unit.
In a specific embodiment of the present disclosure, the preset frequency threshold is 50 and the preset growth rate threshold is 30%. In order of publication time, the frequency and the growth rate of each key word in each time unit are compared with the corresponding preset thresholds; if both the frequency and the growth rate of a key word in a time unit exceed the preset thresholds, the key word is judged to be a new word appearing in that time unit. It is then judged whether the new word belongs to the science and technology class vocabulary, and if so, the new word is determined as a scientific and technological new word in that time unit.
In one embodiment of the present disclosure, as shown in Fig. 3, step S103 may include the following sub-steps:
Step S1031: For each key word, judge in chronological order whether there is a time unit in which the frequency of the key word is greater than the preset frequency threshold.
For example, if the frequency of the key word i in the first half of 2018 is 65, and its frequencies in all time units before the first half of 2018 are less than 50, it is determined that the frequency of the key word in the time unit of the first half of 2018 is greater than the preset frequency threshold.
If the frequency of a key word in a time unit is greater than the preset frequency threshold, step S1032 is executed: judge whether the growth rate of the key word in that time unit is greater than the preset growth rate threshold.
For example, if the frequency of the key word i in the first half of 2018 exceeds the preset frequency threshold, the growth rate of the key word in the first half of 2018 is acquired and compared with the preset growth rate threshold.
If the frequency of a key word in a time unit is greater than the preset frequency threshold and the growth rate of the key word in that time unit is greater than the preset growth rate threshold, step S1033 is performed: determine the key word as a new word of that time unit.
For example, if the frequency of the key word i in the first half of 2018 is 65 and its growth rate is 45%, both the preset frequency threshold and the preset growth rate threshold are exceeded, and the key word is determined to be a new word of the first half of 2018.
If the frequency of a key word in a time unit is greater than the preset frequency threshold but its growth rate in that time unit is not greater than the preset growth rate threshold, the frequency and the growth rate of the key word in the next time unit are compared with the preset thresholds.
For example, if the growth rate of the key word i in the first half of 2018 is 10%, which does not exceed the preset growth rate threshold, the key word is not a new word of the first half of 2018, and it is then judged whether the key word meets the conditions in the second half of 2018.
If there is no time unit among the 20 time units in which both the frequency and the growth rate of the key word i exceed the thresholds, the key word is not a new word.
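The decision flow of steps S1031 to S1033 can be sketched in Python as below. The thresholds 50 and 30% are the values given for this embodiment; the function name, the list-of-frequencies input and the relative-change reading of the growth rate are assumptions of the sketch. The first time unit is skipped because it has no preceding unit from which a growth rate could be computed.

```python
FREQ_THRESHOLD = 50      # preset frequency threshold of this embodiment
GROWTH_THRESHOLD = 0.30  # preset growth rate threshold of this embodiment (30%)


def first_new_word_unit(frequencies,
                        freq_threshold=FREQ_THRESHOLD,
                        growth_threshold=GROWTH_THRESHOLD):
    """Return the index of the first time unit in which both thresholds are exceeded.

    frequencies holds the counts of one key word in chronological order.
    Returns None if no time unit qualifies, i.e. the key word is not a new word.
    """
    for m in range(1, len(frequencies)):
        prev, cur = frequencies[m - 1], frequencies[m]
        if cur <= freq_threshold or prev <= 0:
            continue  # frequency too low, or growth rate undefined
        if (cur - prev) / prev > growth_threshold:
            return m
    return None


# Hypothetical frequency series for three key words.
rising = first_new_word_unit([30, 45, 65])
steady = first_new_word_unit([48, 59, 65, 70])
late = first_new_word_unit([59, 65, 90])
```

In the second hypothetical series the frequency exceeds 50 from the second unit on, but the growth rate never exceeds 30%, so the checks continue through every later unit and the key word is finally rejected.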
After the new word is determined in step S1033, step S1034 is executed: screen out the science and technology class vocabulary from the determined new words as the finally recognized scientific and technological new words.
In one embodiment of the present disclosure, as shown in Fig. 4, step S1034 includes the following sub-steps:
Step S1341: Establish a corpus sample set from science-class vocabulary and non-science-class vocabulary acquired in advance.
After the new word is recognized, it is also necessary to determine whether the new word is a scientific vocabulary. Therefore, in the embodiment of the present disclosure, a corpus sample set including science-based vocabulary and non-science-based vocabulary is established. For example, the corpus sample set includes 2000 science-class vocabularies and 2000 non-science-class vocabularies as samples, and these vocabularies are all from the keywords of the paper.
The corpus sample set is selected from the key words of papers and used as training samples for judging whether a word is scientific. In this way the quantity and field coverage of the scientific and non-scientific words meet the training requirements, and high-frequency keywords can be selected, so that the quality of the training samples is guaranteed by a quantifiable method; compared with the traditional method, training efficiency is improved and the reliability of the training result is ensured.
Step S1342: Train on the scientific and non-scientific vocabularies in the corpus sample set by using a BERT (Bidirectional Encoder Representations from Transformers) model to respectively obtain sample characteristics of the scientific and non-scientific vocabularies.
The samples in the corpus sample set are trained repeatedly with the BERT model until the F1 score (a statistical measure of model accuracy that ranges from 0 to 1, where a larger value indicates a better training effect) is greater than 0.9, so that the system can identify new words efficiently and automatically.
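For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for binary labels in plain Python; the function name and the toy labels are hypothetical, and treating label 1 (science and technology class) as the positive class is an assumption.

```python
def f1_score(true_labels, predicted_labels, positive=1):
    """Compute the F1 score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical verification labels: 1 = science and technology word, 0 = other.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
score = f1_score(y_true, y_pred)
```

A training loop following the description above would keep training until a score of this kind, computed on the verification set, exceeds 0.9.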
The training is carried out in a Python programming environment: the training sample set is used to fine-tune a pre-trained BERT model on the task of classifying whether a word belongs to the science and technology class. The training process is roughly:
1. Read the sample data in the corpus sample set, store the data in a list, and construct a label list in which 1 represents a scientific word and 0 represents a non-scientific word.
2. Randomly shuffle the sample data by using the NumPy library, and take 500 pieces of data from the sample set as a verification set.
3. Construct the function get_dummy to convert the labels into a one-hot representation.
4. Construct the dataset representation by using TensorDataset() provided by PyTorch, and construct the data iterator by using DataLoader().
5. Construct the data iterator for the verification set in the same way as the data iterator for the training set.
6. Build a class for classification that contains the BERT model followed by a Dropout layer, which prevents overfitting, and a Linear fully connected layer.
7. Define a loss function and create an optimizer.
8. Construct a prediction function for calculating the prediction result, and start training.
9. After training is finished, observe the training effect of the model on the verification set.
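Steps 1 to 3 of the list above can be sketched without any third-party dependency. Note the substitution: this sketch uses Python's standard `random` module in place of the NumPy shuffling mentioned in step 2, and the signature of `get_dummy`, the helper `shuffle_and_split` and the toy words are assumptions made for illustration only.

```python
import random


def get_dummy(label, num_classes=2):
    """Convert an integer label into a one-hot list, e.g. 1 -> [0, 1]."""
    one_hot = [0] * num_classes
    one_hot[label] = 1
    return one_hot


def shuffle_and_split(samples, labels, n_validation, seed=0):
    """Shuffle (sample, label) pairs together and hold out n_validation pairs."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    return pairs[n_validation:], pairs[:n_validation]


# Hypothetical toy corpus: 1 marks a science and technology word, 0 any other word.
words = ["quantum dot", "weather", "perovskite", "holiday", "graphene", "recipe"]
labels = [1, 0, 1, 0, 1, 0]

train, val = shuffle_and_split(words, labels, n_validation=2)
train_targets = [get_dummy(label) for _, label in train]
```

In the embodiment the held-out verification set has 500 samples; the toy split above holds out 2 only to keep the example small.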
Step S1343: Based on the BERT model, judge in turn whether each new word belongs to the science and technology class vocabulary, and determine the new words that do as the scientific and technological new words in the corresponding time unit.
Each new word is classified in turn based on the sample characteristics of the scientific and non-scientific vocabularies obtained in training; if the new word is judged to be a scientific word, it is determined as a scientific and technological new word in the corresponding time unit.
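For example, with the list of new words stored in the database loaded as list1, running the trained model outputs the classification of every new word. The code is as follows:
import torch

cls.eval()  # switch the trained classifier to inference mode
# Tokenize every new word and map the tokens to vocabulary ids
tokenized_text = [tokenizer.tokenize(w) for w in list1]
ids = [tokenizer.convert_tokens_to_ids(t) for t in tokenized_text]
# Pad to a common length so the ids form a rectangular tensor
# (assumes id 0 is the padding token, as in the standard BERT vocabularies)
max_len = max(len(x) for x in ids)
input_ids = torch.LongTensor([x + [0] * (max_len - len(x)) for x in ids]).to(device)
mask = (input_ids != 0).long()  # attend only to non-padding positions
with torch.no_grad():  # no gradients are needed for prediction
    output = cls(input_ids, attention_mask=mask)
pred = predict(output)
pred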
Fig. 5 is a schematic structural diagram of a scientific and technological new word recognition device. As shown in the figure, the device includes the following units:
a key vocabulary information obtaining unit 11 configured to obtain key vocabulary information of each thesis in a preset thesis database, the key vocabulary information at least including key vocabularies and publication times corresponding to the key vocabularies;
the frequency and growth rate calculation unit 12 is configured to divide a preset statistical age into a plurality of time units according to a preset time interval, and determine the frequency and growth rate of the occurrence of the key words in each time unit by using the publication time corresponding to each key word;
and the scientific and technological new word judging unit 13 is configured to judge whether the key words are scientific and technological new words according to the publishing time corresponding to each key word and the frequency and the growth rate of the key words appearing in each time unit.
In an embodiment of the present disclosure, the key word information acquisition unit 11 in the foregoing embodiment includes:
a key information acquisition unit configured to acquire key information of each paper in a preset paper database, the key information at least including a title, keywords, an abstract and a publication time;
and an information extraction unit configured to extract, by using the key information, the key words in each paper and the publication time corresponding to each key word.
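The data flow through these two sub-units can be sketched as follows. This is only an illustration under stated assumptions: the record field names (`title`, `keywords`, `abstract`, `publication_time`), the ISO date format, and the lowercasing step are hypothetical, and the sketch uses only the keyword field because the text does not say how key words are derived from titles and abstracts.

```python
from datetime import date


def extract_key_words(papers):
    """Return (key word, publication date) pairs for every paper record."""
    pairs = []
    for paper in papers:
        published = date.fromisoformat(paper["publication_time"])
        for word in paper["keywords"]:
            word = word.strip().lower()  # assumed normalization
            if word:
                pairs.append((word, published))
    return pairs


# Toy records; the field names are assumptions made for this sketch.
papers = [
    {"title": "A", "keywords": ["Graph Neural Network", " BERT "],
     "abstract": "...", "publication_time": "2021-03-15"},
    {"title": "B", "keywords": ["BERT"],
     "abstract": "...", "publication_time": "2021-09-01"},
]
pairs = extract_key_words(papers)
```

Each pair carries the publication time of the paper it came from, which is what the frequency and growth rate calculation needs as input.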
In an embodiment of the disclosure, the frequency and growth rate calculation unit 12 in the foregoing embodiment includes:
a time unit dividing unit configured to divide each year into a first half year and a second half year, with 6 months as the preset time interval, each half year being one time unit;
a preset statistical age determination unit configured to take the 20 time units before the time unit to which the current time belongs as the preset statistical age;
a frequency calculation unit configured to determine, for each key word, the occurrence frequency of the key word in each time unit within the preset statistical age by using the publication time corresponding to the key word;
a growth rate calculation unit configured to calculate the growth rate of each key word in each time unit according to the following formula:
Figure BDA0003439163580000101
where R_im is the growth rate of key word i in the m-th time unit, N_m is the occurrence frequency of key word i in the m-th time unit, and N_(m-1) is the occurrence frequency of key word i in the (m-1)-th time unit.
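The calculation performed by these sub-units can be sketched as follows. The formula image is not reproduced in this text, so the sketch assumes the conventional relative growth rate, R_im = (N_m - N_(m-1)) / N_(m-1), which is consistent with the symbols defined above but is not confirmed by the source. The half-year indexing scheme and the use of `None` when the previous frequency is zero are also assumptions of this sketch.

```python
from collections import Counter
from datetime import date


def time_unit(d):
    """Map a date to a half-year index: 2*year for Jan-Jun, 2*year + 1 for Jul-Dec."""
    return 2 * d.year + (0 if d.month <= 6 else 1)


def frequencies(pub_dates, today, n_units=20):
    """Count occurrences in each of the n_units half-years before the current one."""
    current = time_unit(today)
    counts = Counter(time_unit(d) for d in pub_dates)
    return [counts.get(u, 0) for u in range(current - n_units, current)]


def growth_rates(freqs):
    """Growth rate per time unit, assuming R_m = (N_m - N_(m-1)) / N_(m-1).

    The first unit has no predecessor and a zero predecessor makes the ratio
    undefined; both cases are reported as None.
    """
    rates = [None]
    for prev, cur in zip(freqs, freqs[1:]):
        rates.append((cur - prev) / prev if prev else None)
    return rates


# Toy publication dates for one key word.
dates = [date(2021, 2, 1), date(2021, 8, 1), date(2021, 9, 9), date(2021, 12, 30)]
freqs = frequencies(dates, today=date(2022, 1, 10))
rates = growth_rates(freqs)
```

Here the key word appears once in the first half of 2021 and three times in the second half, so under the assumed formula its growth rate in the last time unit is (3 - 1) / 1 = 2.0.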
In an embodiment of the disclosure, the scientific and technological new word judging unit 13 in the foregoing embodiment includes:
a frequency judging unit configured to judge, for each key word and in chronological order, whether there is a time unit in which the occurrence frequency of the key word is greater than a preset frequency threshold;
a growth rate judging unit configured to judge, when there is a time unit in which the occurrence frequency of the key word is greater than the preset frequency threshold, whether the growth rate of the key word in that time unit is greater than a preset growth rate threshold;
a new word determining unit configured to determine the key word as a new word of that time unit when the growth rate of the key word in the time unit is greater than the preset growth rate threshold;
and a scientific and technological new word screening unit configured to screen out the words of the science and technology class from the determined new words as the finally recognized scientific and technological new words.
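The frequency and growth rate checks can be sketched as a single chronological scan. The threshold values are hypothetical, and the sketch assumes that scanning continues to later time units when a unit passes the frequency check but fails the growth rate check; the text does not state what happens in that case.

```python
def find_new_word_unit(freqs, rates, freq_threshold, rate_threshold):
    """Return the index of the first time unit passing both thresholds, or None.

    freqs[m] is the occurrence frequency in time unit m and rates[m] its growth
    rate (None where the rate is undefined), both in chronological order.
    """
    for m, (n, r) in enumerate(zip(freqs, rates)):
        if n > freq_threshold and r is not None and r > rate_threshold:
            return m
    return None


# A word that surges in the third time unit (index 2).
surging = find_new_word_unit(
    freqs=[2, 3, 12, 30], rates=[None, 0.5, 3.0, 1.5],
    freq_threshold=10, rate_threshold=2.0,
)

# A word that is frequent but stable is not a new word under these thresholds.
stable = find_new_word_unit(
    freqs=[20, 21, 22], rates=[None, 0.05, 0.048],
    freq_threshold=10, rate_threshold=2.0,
)
```

Words for which a time unit is found would then be passed to the screening unit, which keeps only those classified as science and technology vocabulary.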
In an embodiment of the disclosure, the scientific and technological new word screening unit in the foregoing embodiment includes:
a corpus sample set establishing unit configured to establish a corpus sample set from scientific class vocabulary and non-scientific class vocabulary acquired in advance;
a sample training unit configured to train on the scientific and non-scientific vocabularies in the corpus sample set by using a BERT (Bidirectional Encoder Representations from Transformers) model, to obtain sample features of the scientific and non-scientific vocabularies respectively;
and a scientific and technological new word determining unit configured to judge in sequence, based on the BERT model, whether each new word belongs to the science and technology class vocabulary, and to determine the new words that belong to the science and technology class vocabulary as the scientific and technological new words of the corresponding time unit.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A scientific and technological new word recognition method, characterized by comprising the following steps:
acquiring key word information of each paper in a preset paper database, wherein the key word information at least comprises key words and the publication time corresponding to each key word;
dividing a preset statistical age into a plurality of time units according to a preset time interval, and determining the occurrence frequency and the growth rate of each key word in each time unit by using the publication time corresponding to the key word;
and judging whether each key word is a scientific and technological new word according to the publication time corresponding to the key word and its occurrence frequency and growth rate in each time unit.
2. The method of claim 1, wherein the acquiring of key word information of each paper in the preset paper database comprises:
acquiring key information of each paper in the preset paper database, wherein the key information at least comprises a title, keywords, an abstract and a publication time;
and extracting, by using the key information, the key words in each paper and the publication time corresponding to each key word.
3. The method according to claim 1, wherein the dividing of the preset statistical age into a plurality of time units according to a preset time interval, and the determining of the occurrence frequency and the growth rate of each key word in each time unit by using the publication time corresponding to the key word, comprise:
dividing each year into a first half year and a second half year, with 6 months as the preset time interval, wherein each half year is one time unit;
taking the 20 time units before the time unit to which the current time belongs as the preset statistical age;
for each key word, determining the occurrence frequency of the key word in each time unit within the preset statistical age by using the publication time corresponding to the key word;
calculating the growth rate of each key word in each time unit according to the following formula:
Figure FDA0003439163570000011
where R_im is the growth rate of key word i in the m-th time unit, N_m is the occurrence frequency of key word i in the m-th time unit, and N_(m-1) is the occurrence frequency of key word i in the (m-1)-th time unit.
4. The method according to claim 1, wherein the judging of whether each key word is a scientific and technological new word according to the publication time corresponding to the key word and its occurrence frequency and growth rate in each time unit comprises:
for each key word, judging in chronological order whether there is a time unit in which the occurrence frequency of the key word is greater than a preset frequency threshold;
if so, judging whether the growth rate of the key word in that time unit is greater than a preset growth rate threshold, and if so, determining the key word as a new word of that time unit;
and screening out the words of the science and technology class from the determined new words as the finally recognized scientific and technological new words.
5. The method of claim 4, wherein the screening out of the words of the science and technology class from the determined new words as the finally recognized scientific and technological new words comprises:
establishing a corpus sample set from scientific class vocabulary and non-scientific class vocabulary acquired in advance;
training on the scientific and non-scientific vocabularies in the corpus sample set by using a BERT (Bidirectional Encoder Representations from Transformers) model, to obtain sample features of the scientific and non-scientific vocabularies respectively;
and judging in sequence, based on the BERT model, whether each new word belongs to the science and technology class vocabulary, and determining the new words that belong to the science and technology class vocabulary as the scientific and technological new words of the corresponding time unit.
6. A scientific and technological new word recognition device, characterized by comprising:
a key word information acquisition unit, configured to acquire key word information of each paper in a preset paper database, wherein the key word information at least comprises key words and the publication time corresponding to each key word;
a frequency and growth rate calculation unit, configured to divide a preset statistical age into a plurality of time units according to a preset time interval, and to determine the occurrence frequency and the growth rate of each key word in each time unit by using the publication time corresponding to the key word;
and a scientific and technological new word judging unit, configured to judge whether each key word is a scientific and technological new word according to the publication time corresponding to the key word and its occurrence frequency and growth rate in each time unit.
7. The apparatus according to claim 6, wherein the key word information acquisition unit comprises:
a key information acquisition unit, configured to acquire key information of each paper in the preset paper database, wherein the key information at least comprises a title, keywords, an abstract and a publication time;
and an information extraction unit, configured to extract, by using the key information, the key words in each paper and the publication time corresponding to each key word.
8. The apparatus of claim 6, wherein the frequency and growth rate calculation unit comprises:
a time unit dividing unit, configured to divide each year into a first half year and a second half year, with 6 months as the preset time interval, each half year being one time unit;
a preset statistical age determination unit, configured to take the 20 time units before the time unit to which the current time belongs as the preset statistical age;
a frequency calculation unit, configured to determine, for each key word, the occurrence frequency of the key word in each time unit within the preset statistical age by using the publication time corresponding to the key word;
a growth rate calculation unit, configured to calculate the growth rate of each key word in each time unit according to the following formula:
Figure FDA0003439163570000031
where R_im is the growth rate of key word i in the m-th time unit, N_m is the occurrence frequency of key word i in the m-th time unit, and N_(m-1) is the occurrence frequency of key word i in the (m-1)-th time unit.
9. The apparatus of claim 6, wherein the scientific and technological new word judging unit comprises:
a frequency judging unit, configured to judge, for each key word and in chronological order, whether there is a time unit in which the occurrence frequency of the key word is greater than a preset frequency threshold;
a growth rate judging unit, configured to judge, when there is a time unit in which the occurrence frequency of the key word is greater than the preset frequency threshold, whether the growth rate of the key word in that time unit is greater than a preset growth rate threshold;
a new word determining unit, configured to determine the key word as a new word of that time unit when the growth rate of the key word in the time unit is greater than the preset growth rate threshold;
and a scientific and technological new word screening unit, configured to screen out the words of the science and technology class from the determined new words as the finally recognized scientific and technological new words.
10. The apparatus of claim 9, wherein the scientific and technological new word screening unit comprises:
a corpus sample set establishing unit, configured to establish a corpus sample set from scientific class vocabulary and non-scientific class vocabulary acquired in advance;
a sample training unit, configured to train on the scientific and non-scientific vocabularies in the corpus sample set by using a BERT (Bidirectional Encoder Representations from Transformers) model, to obtain sample features of the scientific and non-scientific vocabularies respectively;
and a scientific and technological new word determining unit, configured to judge in sequence, based on the BERT model, whether each new word belongs to the science and technology class vocabulary, and to determine the new words that belong to the science and technology class vocabulary as the scientific and technological new words of the corresponding time unit.
CN202111624012.8A 2021-12-28 2021-12-28 Scientific and technological new word recognition method and device Pending CN114492402A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111624012.8A CN114492402A (en) 2021-12-28 2021-12-28 Scientific and technological new word recognition method and device

Publications (1)

Publication Number Publication Date
CN114492402A true CN114492402A (en) 2022-05-13

Family

ID=81496922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111624012.8A Pending CN114492402A (en) 2021-12-28 2021-12-28 Scientific and technological new word recognition method and device

Country Status (1)

Country Link
CN (1) CN114492402A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823792A (en) * 2014-03-07 2014-05-28 网易(杭州)网络有限公司 Method and equipment for detecting hotspot events from text document
CN106484672A (en) * 2015-08-27 2017-03-08 北大方正集团有限公司 Vocabulary recognition methods and vocabulary identifying system
CN107391504A (en) * 2016-05-16 2017-11-24 华为技术有限公司 New word identification method and device
CN108021628A (en) * 2017-11-22 2018-05-11 华南理工大学 A kind of management system of science and technology theme
US20190304454A1 (en) * 2018-03-30 2019-10-03 Honda Motor Co.,Ltd. Information providing device, information providing method, and recording medium
CN111079419A (en) * 2019-11-28 2020-04-28 中国人民解放军军事科学院军事科学信息研究中心 Big data-based national defense science and technology hot word discovery method and system
CN111125315A (en) * 2019-12-25 2020-05-08 北京中技华软科技服务有限公司 Technical trend prediction method and system
CN111126865A (en) * 2019-12-27 2020-05-08 北京中技华软科技服务有限公司 Technology maturity judging method and system based on scientific and technological big data
CN111563143A (en) * 2020-07-20 2020-08-21 上海二三四五网络科技有限公司 Method and device for determining new words
CN111914554A (en) * 2020-08-19 2020-11-10 网易(杭州)网络有限公司 Training method of field new word recognition model, field new word recognition method and field new word recognition equipment
CN111931501A (en) * 2020-09-22 2020-11-13 腾讯科技(深圳)有限公司 Text mining method based on artificial intelligence, related device and equipment
CN112364628A (en) * 2020-11-20 2021-02-12 创优数字科技(广东)有限公司 New word recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination