CN111985222B - Text keyword recognition method and related equipment - Google Patents

Publication number
CN111985222B
Authority
CN
China
Prior art keywords
pollution
data
classification
adjustment data
classification adjustment
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010859290.0A
Other languages
Chinese (zh)
Other versions
CN111985222A (en)
Inventor
杜佳辉
周琅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202010859290.0A
Publication of CN111985222A
Application granted
Publication of CN111985222B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and provides a text keyword recognition method and related equipment. The method comprises: collecting original data; performing preliminary identification on the original data through an initial identification model to obtain a preliminary pollution classification result; receiving classification adjustment data, and analyzing, according to the classification adjustment data, the factors influencing the preliminary pollution classification result; marking the classification adjustment data according to the factors; extracting environmental pollution characteristics, calculating weights of the environmental pollution characteristics, and constructing a vectorization sample set according to the weights; performing optimization training on the initial recognition model using the vectorization sample set to obtain an ecological environment pollution recognition model; and performing pollution identification on newly issued data using the ecological environment pollution recognition model to obtain an accurate pollution category result. Optionally, the invention also relates to blockchain technology, and the accurate pollution category result can be uploaded to a blockchain. The invention can also be applied to smart environmental protection, thereby promoting the construction of smart cities.

Description

Text keyword recognition method and related equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a text keyword recognition method and related equipment.
Background
With the rapid development of big data and the Internet, online public-opinion clues related to ecological and environmental protection accumulate in large quantities and spread rapidly, and have become one of the important sources from which authorities discover environmental pollution clues. However, in the "Internet+" context, the volume of text information grows dramatically and propagates quickly, which makes the mining of environmental pollution clues challenging. Therefore, how to mine Internet data and correctly identify pollution categories is a technical problem to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text keyword recognition method and related apparatus that can mine and correctly recognize a pollution category from internet data.
The first aspect of the present invention provides a text keyword recognition method, which includes:
collecting original data from the Internet through a web crawler technology;
performing preliminary identification on the original data through an initial identification model to obtain a pollution classification preliminary result;
receiving classification adjustment data input for the pollution classification preliminary result, and analyzing factors influencing the pollution classification preliminary result according to the classification adjustment data;
marking the classification adjustment data according to the factors;
extracting environmental pollution characteristics from the marked classification adjustment data, calculating the weights of the environmental pollution characteristics using the term frequency-inverse document frequency (TF-IDF) algorithm, and constructing a vectorization sample set according to the weights;
performing optimization training on the initial recognition model by using the vectorization sample set to obtain a trained ecological environment pollution recognition model;
and performing pollution identification on newly issued data on the Internet by using the trained ecological environment pollution recognition model to obtain an accurate pollution category result for the newly issued data.
In one possible implementation, the analyzing the factors affecting the preliminary result of the pollution classification according to the classification adjustment data includes:
for each piece of classification adjustment data, judging whether the attribute of the classification adjustment data is an event;
if the attribute of the classification adjustment data is an event, acquiring an event state of the classification adjustment data;
and if the event state indicates that the classification adjustment data is classified as non-pollution, determining that the factor influencing the preliminary result of pollution classification is the event state.
In one possible implementation, the analyzing the factors affecting the preliminary result of the pollution classification according to the classification adjustment data includes:
acquiring the data type of the classification adjustment data for each classification adjustment data;
judging whether the data type is a preset non-pollution data type or not;
if the data type is a preset non-pollution data type, and the classification adjustment data is adjusted to be a non-pollution type, determining factors influencing the preliminary result of pollution classification as the data type.
In one possible implementation manner, the text keyword recognition method further includes:
obtaining pollutant substances of each type of pollution category from the accurate result of the pollution category;
acquiring associated ecological information of the pollutants, wherein the associated ecological information is used for representing ecological chains polluted by the pollutants;
determining a plurality of pollution categories of the pollutants according to the associated ecological information;
updating the accurate result of the pollution category according to a plurality of pollution categories of the pollutant.
In one possible implementation manner, the text keyword recognition method further includes:
determining pollution data belonging to a pollution class from the newly issued data according to the accurate pollution category result;
acquiring the pollution elements of the pollution data, wherein the pollution elements comprise a pollutant, a pollution result, a pollution degree and a pollution area;
weighting the pollutant, the pollution result, the pollution degree and the pollution area to obtain a weighted score;
determining a pollution level of the pollution data according to the weighted score;
and carrying out pollution judgment on the pollution data according to the pollution level.
In one possible implementation manner, the text keyword recognition method further includes:
acquiring a pollution event corresponding to first pollution data belonging to a serious level from the pollution level;
judging whether the release user of the first pollution data is an individual user or not;
if the release user of the first pollution data is an individual user, acquiring second pollution data aiming at the pollution event, which is released by an environmental protection department;
and verifying the data reliability of the second pollution data according to the first pollution data.
In one possible implementation manner, the verifying the data reliability of the second pollution data according to the first pollution data includes:
according to the pollution elements, comparing the first pollution data with the second pollution data to obtain a difference comparison value corresponding to each pollution element;
judging whether the difference comparison value is in a reasonable range;
if the difference comparison value is in the reasonable range, determining that the second pollution data is reliable; or
if the difference comparison value is not in the reasonable range, determining that the second pollution data is unreliable.
A second aspect of the present invention provides a text keyword recognition apparatus including:
the acquisition module is used for acquiring original data from the Internet through a web crawler technology;
the identification module is used for carrying out preliminary identification on the original data through an initial identification model to obtain a pollution classification preliminary result;
the receiving module is used for receiving the classification adjustment data input aiming at the pollution classification preliminary result;
the analysis module is used for analyzing factors influencing the preliminary result of the pollution classification according to the classification adjustment data;
the marking module is used for marking the classification adjustment data according to the factors;
the calculation construction module is used for extracting environmental pollution characteristics from the marked classification adjustment data, calculating the weights of the environmental pollution characteristics using the term frequency-inverse document frequency (TF-IDF) algorithm, and constructing a vectorization sample set according to the weights;
the training module is used for carrying out optimization training on the initial recognition model by using the vectorization sample set to obtain a trained ecological environment pollution recognition model;
the identification module is further used for carrying out pollution identification on newly issued data on the Internet by using the trained ecological environment pollution recognition model, so as to obtain an accurate pollution category result for the newly issued data.
A third aspect of the present invention provides an electronic device comprising a processor and a memory, the processor being configured to implement the text keyword recognition method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text keyword recognition method.
Through initial classification by the model, manual intervention, and model optimization, the method enables newly issued data on the Internet to be identified and classified conveniently and quickly according to the key features or factors of ecological environment text data, thereby improving the efficiency and success rate of mining ecological environment pollution clues.
Drawings
Fig. 1 is a flowchart of a preferred embodiment of a text keyword recognition method of the present disclosure.
Fig. 2 is a functional block diagram of a text keyword recognition apparatus according to a preferred embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing a text keyword recognition method.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms first and second in the description and claims of the present application and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that descriptions such as "first" and "second" in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.
The electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The electronic device may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud, based on cloud computing, composed of a large number of hosts or network servers. The user equipment includes, but is not limited to, any electronic product that can perform human-computer interaction with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice control device, or the like, for example, a personal computer, a tablet computer, a smartphone, or a personal digital assistant (PDA).
Referring to fig. 1, fig. 1 is a flowchart of a text keyword recognition method according to a preferred embodiment of the present invention. The sequence of steps in the flowchart may be changed and some steps may be omitted according to different needs.
S11, acquiring original data from the Internet through a web crawler technology.
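Step S11 does not fix a particular crawler implementation. The following is a minimal sketch, assuming pages are fetched with the Python standard library and only the visible text is kept; the names `extract_text` and `fetch_page` and the user-agent string are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML page as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_page(url, timeout=10.0):
    """Download one page (hypothetical helper; the user agent is an assumption)."""
    request = Request(url, headers={"User-Agent": "pollution-clue-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A real crawler would additionally need URL discovery, rate limiting and de-duplication, none of which are described in the source.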
S12, carrying out preliminary identification on the original data through an initial identification model to obtain a pollution classification preliminary result.
The preliminary pollution classification result includes: water pollution, air pollution, solid waste pollution, noise pollution and non-pollution, of which water pollution, air pollution, solid waste pollution and noise pollution are regarded as pollution.
S13, receiving classification adjustment data input for the pollution classification preliminary result, and analyzing factors influencing the pollution classification preliminary result according to the classification adjustment data.
The user can manually intervene in the pollution classification preliminary result and adjust the classification of data that was not identified or was misclassified.
Specifically, the analyzing the factors affecting the preliminary result of the pollution classification according to the classification adjustment data includes:
for each piece of classification adjustment data, judging whether the attribute of the classification adjustment data is an event;
if the attribute of the classification adjustment data is an event, acquiring an event state of the classification adjustment data;
and if the event state indicates that the classification adjustment data is classified as non-pollution, determining that the factor influencing the preliminary result of pollution classification is the event state.
In this alternative embodiment, some classification adjustment data describe a new event, some describe the handling result or an evaluation of an event, and some may not belong to any event, such as popular-science data. The invention aims to provide an inspector with litigation clues in the ecological environment field and therefore targets new pollution events. If an event state indicates that the event described by the classification adjustment data has already been processed, and the classification adjustment data has been adjusted to the non-pollution class, it may be determined that the factor affecting the preliminary result of pollution classification is the event state.
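The event-state check above can be sketched as follows. This is only an illustration under stated assumptions: the record layout `AdjustmentRecord`, the state value `"processed"` and the function name `event_state_factor` are hypothetical and are not defined in the patent.

```python
from dataclasses import dataclass
from typing import Optional

NON_POLLUTION = "non-pollution"


@dataclass
class AdjustmentRecord:
    """One piece of classification adjustment data (hypothetical layout)."""
    text: str
    adjusted_category: str              # category after manual adjustment
    is_event: bool                      # whether the record describes an event
    event_state: Optional[str] = None   # assumed values: "new", "processed"


def event_state_factor(record):
    """Return "event_state" when the event state explains the adjustment.

    The factor applies only to event records whose event has already been
    processed and whose category was manually adjusted to non-pollution.
    """
    if not record.is_event:
        return None
    if record.event_state == "processed" and record.adjusted_category == NON_POLLUTION:
        return "event_state"
    return None
```

For example, a report that a sewage discharge has already been rectified would yield the factor `"event_state"`, while a newly reported discharge would not.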
Specifically, the analyzing the factors affecting the preliminary result of the pollution classification according to the classification adjustment data includes:
acquiring the data type of the classification adjustment data for each classification adjustment data;
judging whether the data type is a preset non-pollution data type or not;
if the data type is a preset non-pollution data type, and the classification adjustment data is adjusted to be a non-pollution type, determining factors influencing the preliminary result of pollution classification as the data type.
In this alternative embodiment, the non-contaminating data types may include, but are not limited to, science popularization type, evaluation type, weather forecast type, and the like.
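The data-type check can be sketched in the same way. The set below lists only the three example types named in the text; the string labels and the function name `data_type_factor` are assumptions made for illustration.

```python
# Preset non-pollution data types; the patent names these as examples only.
NON_POLLUTION_DATA_TYPES = {"science popularization", "evaluation", "weather forecast"}


def data_type_factor(data_type, adjusted_category):
    """Return "data_type" when the data type explains the adjustment.

    The factor applies when the record has a preset non-pollution data type
    and its category was manually adjusted to non-pollution.
    """
    if data_type in NON_POLLUTION_DATA_TYPES and adjusted_category == "non-pollution":
        return "data_type"
    return None
```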
S14, marking the classification adjustment data according to the factors.
Specifically, each piece of classification adjustment data may be marked with its factor, for example with its event state or with its data type. After marking, the classification adjustment data is used for optimization training, so that the factors are taken into account during training and the subsequently trained model can identify data accurately without manual intervention.
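One way to carry such marks into training is sketched below. The patent does not say how marks are represented; storing them in a `marks` field and encoding them as pseudo-tokens are assumptions of this sketch, as are the names `mark_record` and `mark_tokens`.

```python
def mark_record(record, factor):
    """Return a copy of `record` marked with the value of the given factor.

    `record` is a plain dict and `factor` is a key such as "event_state"
    or "data_type"; the original record is left unchanged.
    """
    marked = dict(record)
    marks = dict(record.get("marks", {}))
    marks[factor] = record[factor]
    marked["marks"] = marks
    return marked


def mark_tokens(marked):
    """Encode the marks as pseudo-tokens so feature extraction can see them."""
    return [f"__{name}={value}" for name, value in sorted(marked["marks"].items())]
```

Appending the pseudo-tokens to a record's word list lets a later weighting step treat each mark as one more feature.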
S15, extracting environmental pollution characteristics from the marked classification adjustment data, calculating weights of the environmental pollution characteristics using the term frequency-inverse document frequency (TF-IDF) algorithm, and constructing a vectorization sample set according to the weights.
Features that are representative and do not impair the classification effect can be selected. For example, typical water pollution data can usually be expressed as "a certain behavior" occurring in "a certain place" and producing "a certain result", so environmental pollution characteristics such as the pollution place, the pollution behavior and the pollution result can be extracted.
TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. The weight calculated by the TF-IDF algorithm reflects the importance of an environmental pollution characteristic in the classification adjustment data.
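A minimal sketch of the weighting step, assuming each record has already been tokenized into feature words. It uses the plain variant tf = count / document length and idf = log(N / df); the patent does not say which TF-IDF variant is used, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build TF-IDF vectors for a list of token lists.

    Returns (vocabulary, vectors); each vector holds one weight per
    vocabulary term, in sorted-vocabulary order.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))        # count each term once per document
    vocabulary = sorted(doc_freq)
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens) or 1            # avoid division by zero for empty docs
        vectors.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, vectors
```

With this variant, a term that occurs in every record receives a weight of zero, while a term concentrated in a few records receives a higher weight.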
And S16, performing optimization training on the initial recognition model by using the vectorization sample set to obtain a trained ecological environment pollution recognition model.
The marked classification adjustment data is used to optimize the model specifically on the data that the initial recognition model identified incorrectly, so that the factors are taken into account during optimization. At the same time, optimization training on the vectorization sample set, obtained after the weights are calculated by the TF-IDF algorithm, enables the model to identify pollution characteristics according to their importance, which can improve the accuracy of model classification.
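The patent does not state what kind of model the initial recognition model is. Purely as an illustration of continuing training from an existing model on the vectorization sample set, the sketch below assumes a multi-class perceptron whose per-category weight vectors are updated only on misclassified samples; `predict` and `optimize` are hypothetical names.

```python
def predict(weights, vector):
    """Return the category whose weight vector scores highest for `vector`."""
    def score(label):
        return sum(w * x for w, x in zip(weights[label], vector))
    # Sorting makes tie-breaking deterministic (first label alphabetically).
    return max(sorted(weights), key=score)


def optimize(initial_weights, samples, epochs=10):
    """Continue training from `initial_weights` on (vector, label) samples.

    Perceptron rule: on a mistake, move the true category toward the sample
    and the wrongly predicted category away from it. The input weights are
    copied, so the initial model is left unchanged.
    """
    weights = {label: list(w) for label, w in initial_weights.items()}
    for _ in range(epochs):
        for vector, label in samples:
            guess = predict(weights, vector)
            if guess != label:
                weights[label] = [w + x for w, x in zip(weights[label], vector)]
                weights[guess] = [w - x for w, x in zip(weights[guess], vector)]
    return weights
```

Because updates happen only on errors, the training effort concentrates on the data the initial model got wrong, which mirrors the targeted optimization described above.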
S17, performing pollution identification on newly issued data on the Internet using the trained ecological environment pollution recognition model to obtain an accurate pollution category result for the newly issued data.
The trained ecological environment pollution recognition model requires no manual intervention and takes the factors influencing the result into account when identifying pollution in newly issued data, so the recognition accuracy can be improved.
Optionally, the method further comprises:
obtaining pollutant substances of each type of pollution category from the accurate result of the pollution category;
acquiring associated ecological information of the pollutants, wherein the associated ecological information is used for representing ecological chains polluted by the pollutants;
determining a plurality of pollution categories of the pollutants according to the associated ecological information;
updating the accurate result of the pollution category according to a plurality of pollution categories of the pollutant.
In this alternative embodiment, the accurate pollution category result is initially one-to-one (that is, one piece of newly issued data matches one pollution category), whereas some pollutants cause environmental pollution of multiple categories. For example, lead is widely used as an industrial raw material; as a pollutant, its associated ecological information is that lead is discharged into the environment in various forms such as waste gas, waste water and waste residue, thereby causing air pollution, water pollution, solid waste pollution, and the like. Therefore, the multiple pollution categories of a pollutant can be redetermined according to its associated ecological information and the accurate pollution category result updated, so that the result becomes one-to-many (that is, one piece of newly issued data corresponds to multiple pollution categories). This increases the accuracy of the accurate pollution category result and provides more comprehensive data support for subsequent association analysis.
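The one-to-many update can be sketched with a lookup table from pollutant to affected categories. The table below encodes only the lead example from the text; its structure and the name `expand_categories` are assumptions, and a real system would need a much fuller source of associated ecological information.

```python
# Associated ecological information, reduced to "pollutant -> categories".
# Only the lead example from the text is included here.
ASSOCIATED_ECOLOGY = {
    "lead": {"air pollution", "water pollution", "solid waste pollution"},
}


def expand_categories(predicted_category, pollutants, ecology=ASSOCIATED_ECOLOGY):
    """Return the sorted set of categories for one piece of newly issued data.

    Starts from the single category predicted by the model and adds every
    category linked to the pollutants found in the data.
    """
    categories = {predicted_category}
    for pollutant in pollutants:
        categories |= ecology.get(pollutant, set())
    return sorted(categories)
```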
Optionally, the method further comprises:
determining pollution data belonging to a pollution class from the newly issued data according to the accurate pollution category result;
acquiring the pollution elements of the pollution data, wherein the pollution elements comprise a pollutant, a pollution result, a pollution degree and a pollution area;
weighting the pollutant, the pollution result, the pollution degree and the pollution area to obtain a weighted score;
determining a pollution level of the pollution data according to the weighted score;
and carrying out pollution judgment on the pollution data according to the pollution level.
In this alternative embodiment, the pollutant is, for example, industrial waste water; the pollution result is, for example, the death of surrounding vegetation; the pollution degree is, for example, the concentration of a gas contained in the air; and the pollution area is, for example, an area of 500 meters. Different pollution elements can be assigned different weighting coefficients, and each pollution element is weighted according to its coefficient to obtain the weighted score. The higher the weighted score, the more serious the pollution; the lower the weighted score, the milder the pollution. The pollution levels can be divided into severe, moderate and mild. The pollution level determined by the weighted score can be used for pollution judgment on the pollution data, so that the severity of the pollution data can be measured more intuitively and conveniently.
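A minimal sketch of the weighted scoring, assuming each pollution element has already been converted to a severity score between 0 and 1. The weighting coefficients and the level thresholds below are illustrative values only; the patent gives neither.

```python
# Illustrative weighting coefficients (assumed; they sum to 1).
WEIGHTS = {"pollutant": 0.3, "result": 0.3, "degree": 0.25, "area": 0.15}


def weighted_score(element_scores, weights=WEIGHTS):
    """Weighted sum of per-element severity scores, each assumed in [0, 1]."""
    return sum(weights[name] * element_scores[name] for name in weights)


def pollution_level(score, severe_at=0.7, moderate_at=0.4):
    """Map a weighted score to a level; the thresholds are assumptions."""
    if score >= severe_at:
        return "severe"
    if score >= moderate_at:
        return "moderate"
    return "mild"
```

For instance, a record scoring 0.9 on every element would be rated severe under these assumed thresholds.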
Optionally, the method further comprises:
acquiring a pollution event corresponding to first pollution data belonging to a serious level from the pollution level;
judging whether the release user of the first pollution data is an individual user or not;
if the release user of the first pollution data is an individual user, acquiring second pollution data aiming at the pollution event, which is released by an environmental protection department;
and verifying the data reliability of the second pollution data according to the first pollution data.
In this alternative embodiment, both individual users and environmental protection departments may publish pollution-related information on the Internet, but the data published by individual users and by the authorities for the same pollution event may differ. In this scheme, for a pollution event at the serious level, the second pollution data issued by the authorities can be checked against the first pollution data of individual users, so that the reliability of the data issued by the authorities is supervised.
Specifically, the verifying, according to the first pollution data, the data reliability of the second pollution data includes:
according to the pollution elements, comparing the first pollution data with the second pollution data to obtain a difference comparison value corresponding to each pollution element;
judging whether the difference comparison value is in a reasonable range;
if the difference comparison value is in the reasonable range, determining that the second pollution data is reliable; or
if the difference comparison value is not in the reasonable range, determining that the second pollution data is unreliable.
In this alternative embodiment, the data collected by individual users and the data collected by the authorities may differ in time and place, and these collection differences may lead to differences when the pollution data are compared. In this scheme, a reasonable range can be preset: if the difference comparison value is within the reasonable range, the data issued by the authorities is reliable; if it is outside the reasonable range, the data issued by the authorities is unreliable. The accuracy of the reliability judgment on the data issued by the authorities can thus be improved according to the size of the difference comparison value.
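The comparison can be sketched as follows, assuming each pollution element has a numeric value and that the preset reasonable range is a per-element tolerance on the absolute difference. Both assumptions, and the function names, are made for illustration; the patent does not define how the difference comparison value is computed.

```python
def difference_values(first, second):
    """Absolute difference per pollution element present in both records."""
    return {name: abs(first[name] - second[name]) for name in first if name in second}


def second_data_reliable(first, second, reasonable_range):
    """Return True when every difference lies within its preset tolerance.

    `first` is the data issued by an individual user, `second` the data
    issued by the environmental protection department, and
    `reasonable_range` maps each element name to its allowed difference.
    """
    diffs = difference_values(first, second)
    return all(diffs[name] <= reasonable_range[name] for name in diffs)
```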
In the method flow described in fig. 1, through initial classification by the model, manual intervention, and model optimization, newly issued data on the Internet can be identified and classified conveniently and quickly according to the key features or factors of ecological environment text data, thereby improving the efficiency and success rate of mining ecological environment pollution clues.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
Referring to fig. 2, fig. 2 is a functional block diagram of a text keyword recognition apparatus according to a preferred embodiment of the present invention.
In some embodiments, the text keyword recognition apparatus is operated in an electronic device. The text keyword recognition means may comprise a plurality of functional modules consisting of program code segments. Program code for each of the program segments in the text keyword recognition means may be stored in a memory and executed by at least one processor to perform some or all of the steps in the text keyword recognition method described in fig. 1.
In this embodiment, the text keyword recognition apparatus may be divided into a plurality of functional modules according to the functions performed by the text keyword recognition apparatus. The functional module may include: the system comprises an acquisition module 201, an identification module 202, a receiving module 203, an analysis module 204, a marking module 205, a calculation construction module 206 and a training module 207. The module referred to in the present invention refers to a series of computer program segments capable of being executed by at least one processor and of performing a fixed function, stored in a memory.
The acquisition module 201 is configured to acquire raw data from the internet through web crawler technology.
The identification module 202 is configured to perform preliminary identification on the raw data through an initial identification model, so as to obtain a preliminary result of pollution classification.
The pollution classification preliminary result includes: water pollution, air pollution, solid waste pollution, noise pollution and non-pollution, wherein water pollution, air pollution, solid waste pollution and noise pollution are all regarded as pollution.
The receiving module 203 is configured to receive classification adjustment data input for the pollution classification preliminary result.
The analysis module 204 is configured to analyze factors affecting the pollution classification preliminary result according to the classification adjustment data.
The user can manually intervene in the pollution classification preliminary result and adjust the classification of data that the model failed to identify or misclassified.
Specifically, analyzing the factors affecting the pollution classification preliminary result according to the classification adjustment data includes:
for each piece of classification adjustment data, judging whether the attribute of the classification adjustment data is an event;
if the attribute of the classification adjustment data is an event, acquiring an event state of the classification adjustment data;
and if the event state indicates that the classification adjustment data is classified into the non-pollution class, determining that the factor affecting the pollution classification preliminary result is the event state.
In this alternative embodiment, some classification adjustment data describe a new event, some describe the handling result or an evaluation of an event, and some may not belong to an event at all, such as science popularization data. The invention aims to provide litigation clues in the ecological environment field for an inspector and therefore targets new pollution events. If an event state indicates that the event described by the classification adjustment data has already been processed, and the classification adjustment data has been adjusted to the non-pollution class, then it may be determined that the factor affecting the pollution classification preliminary result is the event state.
Alternatively, analyzing the factors affecting the pollution classification preliminary result according to the classification adjustment data includes:
for each piece of classification adjustment data, acquiring the data type of the classification adjustment data;
judging whether the data type is a preset non-pollution data type;
if the data type is a preset non-pollution data type and the classification adjustment data is adjusted to the non-pollution class, determining that the factor affecting the pollution classification preliminary result is the data type.
In this alternative embodiment, the non-pollution data types may include, but are not limited to, a science popularization type, an evaluation type, a weather forecast type, and the like.
The marking module 205 is configured to mark the classification adjustment data according to the factors.
In particular, the factor may be marked on each piece of classification adjustment data, for example by marking the event state of the classification adjustment data or marking the data type of the classification adjustment data. After the classification adjustment data is marked, it is used for optimization training, so that the factors are taken into account during the optimization training and the retrained model can identify such data accurately without manual intervention.
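The two factor analyses and the marking step described above could be sketched as follows. This is only an illustrative sketch: the record layout, the field names, the `"processed"` state value and the type labels are assumptions introduced here and are not specified in the source.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical labels; the source names these types only as examples.
NON_POLLUTION_DATA_TYPES = {"science_popularization", "evaluation", "weather_forecast"}
NON_POLLUTION = "non_pollution"


@dataclass
class AdjustmentRecord:
    text: str
    adjusted_class: str                 # class assigned during manual intervention
    is_event: bool = False              # whether the record describes an event
    event_state: Optional[str] = None   # assumed values, e.g. "new" or "processed"
    data_type: Optional[str] = None     # assumed values, e.g. "weather_forecast"
    marks: dict = field(default_factory=dict)


def analyze_factor(record):
    """Return the factor that influenced the preliminary result, or None."""
    adjusted_to_non_pollution = record.adjusted_class == NON_POLLUTION
    # Branch 1: an already processed event that was adjusted to non-pollution.
    if record.is_event and record.event_state == "processed" and adjusted_to_non_pollution:
        return "event_state"
    # Branch 2: a preset non-pollution data type that was adjusted to non-pollution.
    if record.data_type in NON_POLLUTION_DATA_TYPES and adjusted_to_non_pollution:
        return "data_type"
    return None


def mark(record):
    """Attach the identified factor and its value to the record."""
    factor = analyze_factor(record)
    if factor is not None:
        record.marks[factor] = getattr(record, factor)
    return record
```

Records for which no factor is found are left unmarked, so they can still be used for training without an extra label.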
The calculation construction module 206 is configured to extract environmental pollution features from the marked classification adjustment data, calculate the weights of the environmental pollution features by using the term frequency-inverse document frequency (TF-IDF) algorithm, and construct a vectorized sample set according to the weights.
Features that are representative and do not impair the classification effect can be selected. For example, common water pollution data generally take the form "a certain behavior occurring at a certain place produces a certain result", so environmental pollution features such as the pollution location, the pollution behavior and the pollution result can be extracted.
TF-IDF (term frequency-inverse document frequency) is a common weighting technique for information retrieval and data mining. The weight calculated by the TF-IDF algorithm reflects how important an environmental pollution feature is in the classification adjustment data.
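As an illustration of the weighting step, the following minimal sketch computes TF-IDF weights over already extracted feature tokens and builds one vector per sample. The source does not state which TF-IDF variant is used; the unsmoothed form tf = count / length and idf = log(N / df) is assumed here, and the sample tokens are invented for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of samples in which each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty samples
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical feature tokens (pollution location, behavior, result).
docs = [
    ["river", "discharge", "fish_death"],
    ["river", "smoke", "odor"],
    ["river", "slag", "dumping"],
]
vocab, vectors = tfidf_vectors(docs)
```

In this variant a feature that occurs in every sample (here `river`) receives weight zero, while a feature confined to a single sample receives the highest weight, which matches the idea that the weight reflects how distinctive a feature is.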
The training module 207 is configured to perform optimization training on the initial recognition model by using the vectorized sample set, so as to obtain a trained ecological environment pollution recognition model.
The marked classification adjustment data allow targeted optimization on the data that the initial recognition model identified incorrectly, so that the factors are taken into account while the model is optimized. In addition, because the optimization training uses the vectorized sample set obtained from the weights calculated by the TF-IDF algorithm, the model can identify pollution features according to their importance, which improves the classification accuracy of the model.
The identification module 202 is further configured to use the trained ecological environment pollution recognition model to perform pollution recognition on newly published data on the Internet, so as to obtain an accurate pollution category result for the newly published data.
The trained ecological environment pollution recognition model requires no manual intervention and takes the factors affecting the result into account when performing pollution recognition on newly published data, so that the recognition accuracy can be improved.
Optionally, the text keyword recognition device further includes:
the acquisition module is used for acquiring the pollutant of each type of pollution category from the accurate result of the pollution category;
the acquisition module is further used for acquiring associated ecological information of the pollutants, wherein the associated ecological information is used for representing an ecological chain polluted by the pollutants;
the determining module is used for determining a plurality of pollution categories of the pollutant according to the associated ecological information;
and the updating module is used for updating the accurate result of the pollution category according to the plurality of pollution categories of the pollutant.
In this alternative embodiment, the accurate pollution category result is one-to-one (i.e., one piece of newly published data matches one pollution category), yet some pollutants may cause several categories of environmental pollution. For example, lead is widely used as an industrial raw material; when lead is the pollutant, its associated ecological information is: lead is discharged into the environment in various forms such as waste gas, waste water and waste residue, and thus causes air pollution, water pollution, solid waste pollution and the like. Therefore, a plurality of pollution categories of the pollutant can be re-determined according to its associated ecological information and the accurate pollution category result updated, so that the result becomes one-to-many (i.e., one piece of newly published data corresponds to a plurality of pollution categories). This increases the accuracy of the accurate pollution category result and provides more comprehensive data support for subsequent correlation analysis.
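The one-to-one to one-to-many update could look like the sketch below. The lookup table stands in for the "associated ecological information"; its structure, the category labels and the function name are assumptions made for illustration, with the lead entry taken from the example in the text.

```python
# Hypothetical lookup table standing in for the associated ecological information.
ASSOCIATED_CATEGORIES = {
    "lead": ["air_pollution", "water_pollution", "solid_waste_pollution"],
}


def update_categories(results, pollutants):
    """results: {doc_id: single category}; pollutants: {doc_id: pollutant name}.

    Returns {doc_id: list of categories}, keeping the original category first.
    """
    updated = {}
    for doc_id, category in results.items():
        categories = [category]
        for extra in ASSOCIATED_CATEGORIES.get(pollutants.get(doc_id), []):
            if extra not in categories:
                categories.append(extra)
        updated[doc_id] = categories
    return updated
```

Data whose pollutant is unknown or has no entry in the table simply keeps its single original category.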
Optionally, the determining module is further configured to determine, from the newly issued data, pollution data belonging to a pollution class according to the accurate result of the pollution type;
the acquisition module is further used for acquiring pollution elements of the pollution data, wherein the pollution elements comprise pollution substances, pollution results, pollution degrees and pollution areas;
the text keyword recognition device further comprises:
the weighting module is used for weighting the pollutant, the pollution result, the pollution degree and the pollution area to obtain a weighted score;
the determining module is further used for determining the pollution level of the pollution data according to the weighted scores;
and the judging module is used for judging the pollution of the pollution data according to the pollution level.
In this alternative embodiment, the pollutant is, for example, industrial waste water; the pollution result is, for example, the death of surrounding vegetation; the pollution degree is, for example, the concentration of a certain gas in the air; and the pollution area is, for example, an area within 500 meters. Different pollution elements can be given different weighting coefficients, and each pollution element is weighted according to its coefficient to obtain the weighted score. The higher the weighted score, the more serious the pollution; the lower the weighted score, the slighter the pollution. The pollution levels can be divided into: severe, moderate and mild. The pollution level determined by the weighted score can be used for pollution judgment on the pollution data, so that the severity of the pollution data can be measured more intuitively and conveniently.
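A minimal sketch of the weighting and level determination follows. The source gives neither the weighting coefficients nor the level thresholds, and does not say how each pollution element is turned into a number; the coefficients, the thresholds 0.7 and 0.4, and the per-element severity scores in [0, 1] are all assumptions made for illustration.

```python
# Assumed weighting coefficients; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"pollutant": 0.3, "result": 0.3, "degree": 0.2, "area": 0.2}


def weighted_score(elements):
    """elements: per-element severity scores in [0, 1] (an assumed encoding)."""
    return sum(WEIGHTS[name] * elements[name] for name in WEIGHTS)


def pollution_level(score):
    """Map a weighted score to a level using assumed thresholds."""
    if score >= 0.7:
        return "severe"
    if score >= 0.4:
        return "moderate"
    return "mild"


sample = {"pollutant": 0.9, "result": 0.8, "degree": 0.7, "area": 0.6}
level = pollution_level(weighted_score(sample))
```

Keeping the coefficients in a single table makes it easy to tune the relative importance of each pollution element without touching the level logic.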
Optionally, the acquisition module is further configured to acquire, according to the pollution level, a pollution event corresponding to first pollution data belonging to the severe level;
the text keyword recognition device further comprises:
the judging module is used for judging whether the release user of the first pollution data is an individual user or not;
the acquisition module is further used for acquiring second pollution data aiming at the pollution event, which is issued by the environmental protection department, if the issuing user of the first pollution data is an individual user;
and the verification module is used for verifying the data reliability of the second pollution data according to the first pollution data.
In this alternative embodiment, both individual users and official environmental protection departments may publish pollution-related information on the Internet; however, the data published by individual users and by the authorities for the same pollution event may differ. In this scheme, for a pollution event with severe pollution, the second pollution data published by the authorities can be checked against the first pollution data of individual users, so that the reliability of the data published by the authorities is supervised.
Specifically, verifying the data reliability of the second pollution data according to the first pollution data includes:
comparing, according to the pollution elements, the first pollution data with the second pollution data to obtain a difference comparison value corresponding to each pollution element;
judging whether the difference comparison value is in a reasonable range;
if the difference comparison value is in the reasonable range, determining that the second pollution data is reliable; or
if the difference comparison value is not in the reasonable range, determining that the second pollution data is unreliable.
In this alternative embodiment, the data collected by individual users and the data collected by the authorities may differ in collection time and place, and these collection differences may lead to differences when the pollution data are compared. In this scheme, a reasonable range can be preset. If the difference comparison value is within the reasonable range, the data published by the authorities are considered reliable; if it is outside the reasonable range, the data published by the authorities are considered unreliable. Judging by the size of the difference comparison value in this way can improve the accuracy of the reliability judgment on the data published by the authorities.
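The reliability check could be sketched as below. The source does not define how the difference comparison value is computed or what the reasonable range is; this sketch assumes each pollution element has been reduced to a number, uses the absolute difference as the comparison value, and treats a per-element tolerance as the reasonable range. All names and figures are hypothetical.

```python
def difference_values(first, second):
    """Absolute difference per pollution element present in both records."""
    return {name: abs(first[name] - second[name]) for name in first if name in second}


def is_reliable(first, second, tolerances):
    """True if every compared element differs by no more than its tolerance.

    Elements without a configured tolerance are ignored in this sketch.
    """
    diffs = difference_values(first, second)
    return all(diffs[name] <= tolerances[name] for name in tolerances if name in diffs)


# Hypothetical numeric encodings of two pollution elements.
first_data = {"degree": 80.0, "area": 500.0}    # published by an individual user
second_data = {"degree": 75.0, "area": 450.0}   # published by the authority
tolerances = {"degree": 10.0, "area": 100.0}    # assumed reasonable range
reliable = is_reliable(first_data, second_data, tolerances)
```

A per-element tolerance lets elements measured in different units, such as a concentration and an area, each have their own reasonable range.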
In the text keyword recognition apparatus described in fig. 2, initial classification by the model, manual intervention, and model optimization are combined, so that newly published data on the Internet can be identified and classified conveniently and rapidly according to the key features or factors of ecological environment text data, which improves the efficiency and success rate of ecological environment pollution clue mining.
Fig. 3 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing a text keyword recognition method. The electronic device 3 comprises a memory 31, at least one processor 32, a computer program 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34.
It will be appreciated by those skilled in the art that the schematic diagram shown in fig. 3 is merely an example of the electronic device 3 and does not limit the electronic device 3, which may include more or fewer components than illustrated, may combine certain components, or may have different components; for example, the electronic device 3 may further include input-output devices, network access devices, and the like.
The at least one processor 32 may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 32 may be a microprocessor or any conventional processor. The processor 32 is the control center of the electronic device 3 and connects the various parts of the entire electronic device 3 through various interfaces and lines.
The memory 31 may be used to store the computer program 33 and/or modules/units, and the processor 32 may implement various functions of the electronic device 3 by running or executing the computer program and/or modules/units stored in the memory 31 and invoking data stored in the memory 31. The memory 31 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the electronic device 3 (such as audio data) and the like. In addition, the memory 31 may include non-volatile and volatile memory, such as a hard disk, memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), at least one disk storage device, a Flash memory device, or other storage device.
In connection with fig. 1, the memory 31 in the electronic device 3 stores a plurality of instructions for implementing a text keyword recognition method, and the processor 32 can execute the plurality of instructions to implement:
collecting original data from the Internet through a web crawler technology;
performing preliminary identification on the original data through an initial identification model to obtain a pollution classification preliminary result;
receiving classification adjustment data input for the pollution classification preliminary result, and analyzing factors influencing the pollution classification preliminary result according to the classification adjustment data;
marking the classification adjustment data according to the factors;
extracting environmental pollution features from the marked classification adjustment data, calculating the weights of the environmental pollution features by using the term frequency-inverse document frequency (TF-IDF) algorithm, and constructing a vectorized sample set according to the weights;
performing optimization training on the initial recognition model by using the vectorization sample set to obtain a trained ecological environment pollution recognition model;
and carrying out pollution identification on the newly issued data on the Internet by using the trained ecological environment pollution identification model to obtain a pollution type accurate result of the newly issued data.
Specifically, the specific implementation method of the above instructions by the processor 32 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
In the electronic device 3 depicted in fig. 3, initial classification by the model, manual intervention, and model optimization are combined, so that newly published data on the Internet can be identified and classified conveniently and rapidly according to the key features or factors of ecological environment text data, which improves the efficiency and success rate of ecological environment pollution clue mining.
The modules/units integrated in the electronic device 3 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method of the above embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), and a random access memory (RAM).
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned. The various units or means recited in the system claims may also be implemented in software or hardware.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (7)

1. The text keyword recognition method is characterized by comprising the following steps of:
collecting original data from the Internet through a web crawler technology;
performing preliminary identification on the original data through an initial identification model to obtain a pollution classification preliminary result;
receiving classification adjustment data input for the pollution classification preliminary result, and analyzing factors affecting the pollution classification preliminary result according to the classification adjustment data, including: judging whether the attribute of the classification adjustment data is an event according to each classification adjustment data, if the attribute of the classification adjustment data is an event, acquiring an event state of the classification adjustment data, if the event state indicates that the classification adjustment data is classified into a non-pollution type, determining factors influencing the preliminary result of pollution classification as the event state, or acquiring the data type of the classification adjustment data according to each classification adjustment data, judging whether the data type is a preset non-pollution data type, if the data type is the preset non-pollution data type, and the classification adjustment data is adjusted into the non-pollution type, determining factors influencing the preliminary result of pollution classification as the data type;
Marking the classification adjustment data according to the factors;
marking event states of the classification adjustment data or data types of the classification adjustment data, extracting environmental pollution characteristics from the marked classification adjustment data, calculating weights of the environmental pollution characteristics by adopting a word frequency-inverse text frequency index TF-IDF algorithm, and constructing a vectorization sample set according to the weights;
performing optimization training on the initial recognition model by using the vectorization sample set to obtain a trained ecological environment pollution recognition model;
using the trained ecological environment pollution recognition model to carry out pollution recognition on the newly issued data on the Internet to obtain a pollution type accurate result of the newly issued data;
and acquiring pollutant substances of each type of pollutant class from the pollutant class accurate result, wherein each piece of newly issued data in the pollutant class accurate result corresponds to a single pollutant class, acquiring associated ecological information of the pollutant substances, wherein the associated ecological information is used for representing an ecological chain for pollution caused by the pollutant substances, determining a plurality of pollutant classes of the pollutant substances according to the associated ecological information, and updating the pollutant class accurate result according to the plurality of pollutant classes of the pollutant substances, wherein each piece of newly issued data in the updated pollutant class accurate result corresponds to the plurality of pollutant classes.
2. The text keyword recognition method of claim 1, further comprising:
determining pollution data belonging to a pollution class from the newly issued data according to the pollution type accurate result;
the pollution element of the pollution data is obtained, wherein the pollution element comprises a pollution substance, a pollution result, a pollution degree and a pollution area;
weighting the pollutant, the pollution result, the pollution degree and the pollution area to obtain a weighted score;
determining a pollution level of the pollution data according to the weighted score;
and carrying out pollution judgment on the pollution data according to the pollution level.
3. The text keyword recognition method of claim 2, wherein the text keyword recognition method further comprises:
acquiring a pollution event corresponding to first pollution data belonging to a serious level from the pollution level;
judging whether the release user of the first pollution data is an individual user or not;
if the release user of the first pollution data is an individual user, acquiring second pollution data aiming at the pollution event, which is released by an environmental protection department;
And verifying the data reliability of the second pollution data according to the first pollution data.
4. The text keyword recognition method of claim 3, wherein verifying the data reliability of the second pollution data based on the first pollution data comprises:
according to the pollution elements, comparing the first pollution data with the second pollution data to obtain a difference comparison value corresponding to each pollution element;
judging whether the difference comparison value is in a reasonable range or not;
if the difference comparison value is in the reasonable range, determining that the second pollution data is reliable; or
if the difference comparison value is not in the reasonable range, determining that the second pollution data is unreliable.
5. A text keyword recognition apparatus, characterized in that the text keyword recognition apparatus comprises:
the acquisition module is used for acquiring original data from the Internet through a web crawler technology;
the identification module is used for carrying out preliminary identification on the original data through an initial identification model to obtain a pollution classification preliminary result;
the receiving module is used for receiving the classification adjustment data input aiming at the pollution classification preliminary result;
The analysis module is used for analyzing factors influencing the preliminary result of the pollution classification according to the classification adjustment data, and comprises the following steps: judging whether the attribute of the classification adjustment data is an event according to each classification adjustment data, if the attribute of the classification adjustment data is an event, acquiring an event state of the classification adjustment data, if the event state indicates that the classification adjustment data is classified into a non-pollution type, determining factors influencing the preliminary result of pollution classification as the event state, or acquiring the data type of the classification adjustment data according to each classification adjustment data, judging whether the data type is a preset non-pollution data type, if the data type is the preset non-pollution data type, and the classification adjustment data is adjusted into the non-pollution type, determining factors influencing the preliminary result of pollution classification as the data type;
the marking module is used for marking the classification adjustment data according to the factors;
the computing and constructing module is used for marking event states of the classification adjustment data or marking data types of the classification adjustment data, extracting environmental pollution characteristics from the marked classification adjustment data, computing weights of the environmental pollution characteristics by adopting a word frequency-inverse text frequency index TF-IDF algorithm, and constructing a vectorization sample set according to the weights;
The training module is used for carrying out optimization training on the initial recognition model by using the vectorization sample set to obtain a trained ecological environment pollution recognition model;
the recognition module is further used for carrying out pollution recognition on the newly issued data on the Internet by using the trained ecological environment pollution recognition model to obtain a pollution type accurate result of the newly issued data;
the acquisition module is used for acquiring the pollutant of each type of pollution category from the pollution category accurate result, wherein each piece of newly issued data in the pollution category accurate result corresponds to a single pollution category;
the acquisition module is further used for acquiring associated ecological information of the pollutants, wherein the associated ecological information is used for representing an ecological chain polluted by the pollutants;
the determining module is used for determining a plurality of pollution categories of the pollutant according to the associated ecological information;
and the updating module is used for updating the accurate pollution category result according to a plurality of pollution categories of the pollutant, wherein each piece of newly released data in the updated accurate pollution category result corresponds to the plurality of pollution categories.
6. An electronic device comprising a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the text keyword recognition method of any one of claims 1 to 4.
7. A computer readable storage medium storing at least one instruction that when executed by a processor implements the text keyword recognition method of any one of claims 1 to 4.
CN202010859290.0A 2020-08-24 2020-08-24 Text keyword recognition method and related equipment Active CN111985222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010859290.0A CN111985222B (en) 2020-08-24 2020-08-24 Text keyword recognition method and related equipment

Publications (2)

Publication Number Publication Date
CN111985222A CN111985222A (en) 2020-11-24
CN111985222B true CN111985222B (en) 2023-07-18

Family

ID=73443216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010859290.0A Active CN111985222B (en) 2020-08-24 2020-08-24 Text keyword recognition method and related equipment

Country Status (1)

Country Link
CN (1) CN111985222B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification
CN109752502A (en) * 2019-01-29 2019-05-14 苏州科技大学 Establish method, the apparent water pollution classification method of landscape water body of the apparent water pollution classification model of landscape water body
CN110196886A (en) * 2019-04-19 2019-09-03 安徽大学 The multi-source heterogeneous big data correlating method of agricultural non-point source pollution and the big data supervising platform for using this method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
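Detection of environmental pollution events contained in online news text based on joint topic features; Huang Zongcai (黄宗财) et al.; Geo-Information Science (《地球信息科学》); 2019-10-31; Vol. 21 (No. 10); pp. 1510-1517 *
Zhang Mingxia (张明霞). Water pollution and its control. Inorganic Chemistry (《无机化学》). China University of Mining and Technology Press, 2019, p. 252. *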

Also Published As

Publication number Publication date
CN111985222A (en) 2020-11-24

Similar Documents

Publication Publication Date Title
CN109145216B (en) Network public opinion monitoring method, device and storage medium
CN108416198B (en) Device and method for establishing human-machine recognition model and computer readable storage medium
CN109325165B (en) Network public opinion analysis method, device and storage medium
CN103793484B (en) Machine-learning-based fraud identification system for classified information websites
CN106557541B (en) Apparatus and method for performing automatic data analysis, and computer program product
CN104572958B (en) Sensitive information monitoring method based on event extraction
CN102446255B (en) Method and device for detecting page tampering
CN110336838B (en) Account anomaly detection method, device, terminal and storage medium
CN103500405A (en) Method and device for identifying nominal model of target terminal
CN111079029B (en) Sensitive account detection method, storage medium and computer equipment
CN110737821B (en) Similar event query method, device, storage medium and terminal equipment
CN110309234A (en) Knowledge-graph-based customer position-holding early warning method, device and storage medium
CN114492663A (en) Intelligent event distribution method, device, equipment and storage medium
CN106301979B (en) Method and system for detecting abnormal channel
CN109933648B (en) Real user comment distinguishing method and device
CN111985222B (en) Text keyword recognition method and related equipment
CN104036189A (en) Page distortion detection method and black link database generation method
CN111738290B (en) Image detection method, model construction and training method, device, equipment and medium
CN101268465A (en) Method for sorting a set of electronic documents
CN109491970B (en) Bad picture detection method and device for cloud storage and storage medium
CN111813964B (en) Data processing method based on ecological environment and related equipment
CN103605670A (en) Method and device for determining grabbing frequency of network resource points
CN115062200A (en) User behavior mining method and system based on artificial intelligence
CN114553468A (en) Three-level network intrusion detection method based on feature intersection and ensemble learning
CN112052245A (en) Method and device for judging attack behavior in network security training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant