CN113742478B - Directional screening device and method for massive text data - Google Patents

Publication number
CN113742478B
Authority
CN
China
Prior art keywords
text
target text
sentence pattern
suspected target
business
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010474192.5A
Other languages
Chinese (zh)
Other versions
CN113742478A (en)
Inventor
万辛
戚梦苑
孙晓晨
侯炜
宁珊
沈亮
李娅强
王树鹏
田正鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Computer Network and Information Security Management Center
Original Assignee
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Computer Network and Information Security Management Center
Priority to CN202010474192.5A
Publication of CN113742478A
Application granted
Publication of CN113742478B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a directional screening device and a directional screening method for massive text data. The method comprises the following steps: 1) obtaining suspected target texts from the texts to be screened by keyword matching; 2) extracting common sentence patterns from labeled target texts and dividing them into strongly business-related sentence patterns and weakly business-related sentence patterns; performing fuzzy sentence pattern matching on each text to be screened, and judging the text to be a target text if it matches a strongly business-related sentence pattern, and otherwise a suspected target text; 3) classifying each suspected target text; 4) determining an evaluation value E1 of the text according to the number of keywords that the suspected target text matches; determining an evaluation value E2 of the text according to the classification result; determining an evaluation value E3 of the text based on the result of matching the text against an external auxiliary corpus; then computing a final score of the text from the evaluation values E1 to E3 and feeding it back to the business research and judgment layer; 5) the business research and judgment layer determining whether the fed-back text is a target text.

Description

Directional screening device and method for massive text data
Technical Field
The invention relates to a directional screening device and a directional screening method for massive text data, providing a computing device and a closed-loop solution for screening massive text, and belongs to the field of computer science.
Background
With the development of networks and the continuous growth in the number of Internet users, more and more information is generated every day. In the field of natural language processing, finding the content a user is interested in quickly and accurately within such a huge information space is an urgent problem. On the one hand, ever more text data is generated each day, and processing all of it requires greater computing power; on the other hand, the content of interest to the user must be located within this massive text data by screening that reflects the user's interests accurately. Moreover, user interests change over time, so a closed-loop screening device and method for massive text data is needed.
There are two common approaches to text screening.
1. Rule-based: texts that satisfy a certain format or contain certain specific words are screened out. This approach is commonly used for texts with obvious regularities, such as advertisements. It is simple and efficient and can process massive, complex natural language data, but it places high demands on text quality: if the text contains a large amount of noise or even errors, mismatches occur easily and screening accuracy drops. Its applicable scenarios are limited, because the target text must show obvious regularity. Moreover, rules that are set too strictly improve precision but easily miss target texts, while rules that are set too loosely miss fewer texts but reduce precision, so a trade-off is required.
2. Semantics-based: a text classification algorithm is used to screen out the required texts. Text classification may use various text topic models, classical classification algorithms such as SVM, Bayesian classifiers and decision trees, or neural networks. This approach is more flexible, can handle more complex classification problems and has relatively high accuracy, but it is often limited by the training corpus, since more complex algorithms generally require more training data. It suits scenarios in which the text types are clear and limited; if the texts in the corpus are too diverse and their types too complex, a semantics-based text classification algorithm struggles to cope.
Disclosure of Invention
In view of the problems of existing directional screening methods for massive text data, the main purpose of the invention is to provide a directional screening device and method for massive text data that screen specific text data out of massive text data based on multi-channel data and fusion models, and that continuously update the models through user research and judgment, thereby improving screening accuracy. The invention first provides a computing device and method based on a big data platform. The device comprises a data access layer, a storage layer, a computing layer, a model layer, a business research and judgment layer and a knowledge base layer, and provides data access, data storage, real-time and offline data analysis, model training, business research and judgment, dynamic model updating, multi-channel data screening and multi-evidence fusion capabilities.
A single text screening method is insufficient for screening massive text data, so the method combines the rule-based and semantics-based approaches. First, several text preprocessing steps are designed at the computing layer to handle text noise and erroneous text in real time. Then rule-based keyword matching and sentence pattern matching are used at the model layer to obtain suspected target texts. Finally, a TextCNN text classification model further classifies the suspected target texts, and the pre-training of the TextCNN network is optimized. The method thus combines the high efficiency of rule-based methods with the high accuracy of semantics-based methods.
In addition, an automatic feedback correction method is designed: research and judgment capability is provided at the business research and judgment layer, and the new labeling information that business personnel generate while using the system is fully used to correct the model, which improves the accuracy of the text screening model and makes it better meet business requirements.
The technical scheme of the invention is as follows:
A directional screening method for massive text data comprises the following steps:
1) Obtaining suspected target texts from the texts to be screened by keyword matching;
2) Extracting common sentence patterns of the target text from labeled target texts, and dividing the extracted common sentence patterns into strongly business-related sentence patterns and weakly business-related sentence patterns; performing fuzzy sentence pattern matching on each text to be screened, judging the text to be a target text if it matches a strongly business-related sentence pattern, and judging it to be a suspected target text if it matches a weakly business-related sentence pattern;
3) Classifying each suspected target text with a trained text classification model TextCNN;
4) Determining an evaluation value E1 of the suspected target text according to the number of keywords it matches; determining an evaluation value E2 of the suspected target text according to the classification result given by the text classification model TextCNN; determining an evaluation value E3 of the suspected target text based on the result of matching the suspected target text against an external auxiliary corpus; then computing a final weighted score of the suspected target text from the evaluation values E1, E2 and E3; and feeding suspected target texts whose weighted score is higher than a set threshold back to a business research and judgment layer;
5) The business research and judgment layer determines whether each fed-back suspected target text is a target text and labels it accordingly; the keywords are then updated according to the newly labeled data, matching sentence patterns are added, and the text classification model TextCNN is retrained and updated.
Further, if a common sentence pattern always contains a set keyword, it is a strongly business-related sentence pattern; otherwise, it is a weakly business-related sentence pattern.
Further, the labeling information of the target text comprises the text keywords and the entity names in the text.
Further, the keywords are updated as follows: the occurrence frequency of each keyword of the current keyword lexicon in the most recently labeled target texts is calculated; a keyword whose occurrence frequency is lower than a set threshold is discarded, and keywords that newly appear in the most recently labeled target texts are added to the keyword lexicon.
Further, the external auxiliary corpus contains named entity information related to the target text, and the evaluation value E3 is determined according to the number of named entities that appear in both the suspected target text and the external auxiliary corpus.
Further, the external auxiliary corpus and the text to be screened may be the same type of text or different types of text.
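The strong/weak division rule above can be illustrated with a minimal sketch. It assumes one possible reading of "always contains a set keyword": every labeled target text from which a pattern was extracted contains at least one lexicon keyword. The function name `split_patterns` and the `pattern_examples` layout are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Set


def split_patterns(pattern_examples: Dict[str, List[str]],
                   keywords: Set[str]) -> Dict[str, str]:
    """Label each common sentence pattern as 'strong' or 'weak'.

    pattern_examples maps a pattern identifier to the labeled target texts
    the pattern was extracted from (hypothetical data layout).
    """
    labels: Dict[str, str] = {}
    for pattern, texts in pattern_examples.items():
        # Assumption: "always contains a keyword" means every supporting
        # text contains at least one keyword from the lexicon.
        always_has_keyword = bool(texts) and all(
            any(kw in text for kw in keywords) for text in texts
        )
        labels[pattern] = "strong" if always_has_keyword else "weak"
    return labels


labels = split_patterns(
    {"p1": ["buy gold now", "sell gold today"],
     "p2": ["buy gold now", "hello there"]},
    {"gold"},
)
```

Here `p1` is labeled strong because both of its supporting texts contain a keyword, while `p2` is labeled weak because one of its texts does not.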
A directional screening device for massive text data, characterized by comprising a data access layer, a storage layer, a computing layer, a model layer, a business research and judgment layer and a knowledge base layer; the data access layer is used for accessing data; the storage layer is used for persistently storing the accessed data; the model layer is used for screening massive text data with a multi-channel data screening model and a multi-evidence fusion model, and for updating both models according to the research and judgment data and the screening configuration data stored in the knowledge base layer; the multi-channel data screening model obtains suspected target texts from the texts to be screened by keyword matching, extracts common sentence patterns of the target text from labeled target texts, divides the extracted common sentence patterns into strongly business-related sentence patterns and weakly business-related sentence patterns, and then performs fuzzy sentence pattern matching on each text to be screened, judging the text to be a target text if it matches a strongly business-related sentence pattern and a suspected target text if it matches a weakly business-related sentence pattern; the computing layer is used for classifying each suspected target text with the trained text classification model TextCNN, determining an evaluation value E1 of the suspected target text according to the number of keywords it matches, determining an evaluation value E2 of the suspected target text according to the classification result given by the text classification model TextCNN, determining an evaluation value E3 of the suspected target text based on the result of matching the suspected target text against the external auxiliary corpus, and then computing a final weighted score of the suspected target text from the evaluation values E1, E2 and E3; the business research and judgment layer is used for determining whether a fed-back suspected target text is a target text, labeling it accordingly, and issuing screening configuration to the model layer; the model layer then updates the keywords according to the newly labeled data, adds matching sentence patterns, and retrains and updates the text classification model TextCNN; and the knowledge base layer is used for storing the research and judgment data and the screening configuration data of the business research and judgment layer.
Further, the data access layer uses a message queue to implement data access.
In a first aspect, the invention provides a computing device for screening massive text data, giving the business a directional screening capability; in a second aspect, it provides a multi-channel data screening process for screening text data; in a third aspect, it provides a multi-evidence fusion method for scoring the screened text, and screening accuracy is improved through automatic feedback of the labels that business personnel continuously provide.
According to the first aspect of the invention, a computing device is designed for the directional screening of massive data, providing a closed-loop solution that covers data access, storage, real-time computation, offline computation, model screening, business research and judgment, and automatic model updating. The device implements data access and persistence based on a message queue; provides Spark Streaming and Flink development interfaces to support real-time computation; provides a Spark offline computing interface to support the submission of offline computing tasks; and provides a TensorFlow machine learning platform to support the training and publishing of data models. At the model layer, a multi-channel data screening model and a multi-evidence fusion model are provided for the specific scenario of screening massive text data. At the business research and judgment layer, users can review the screened data and issue new screening configuration (such as keywords). The knowledge base layer stores the users' research and judgment data, configuration data and the like. The model layer reads the review results and configuration data stored in the knowledge base layer to update the models, which closes the loop.
In the second aspect of the invention, a multi-channel data screening method is provided for screening text data. In the first channel, a keyword lexicon is established according to business requirements, and texts containing keywords are screened out at the model layer. Because the number of texts is huge, the texts contain noise and errors, and Chinese words are often ambiguous, a text that hits a keyword is not guaranteed to be a target text and must be judged further. In the second channel, common sentence patterns are extracted from the labeled target texts (for the sentence pattern extraction method, see the reference: Li Wei. Research on automatic recognition of modern Chinese sentence patterns [D]. Xiamen University). The labeled texts come from system recommendation: business personnel review the recommended texts, and a text whose review result is positive is a target text. The labeling information includes whether the text is a target text, the text keywords, and the entity names in the text. The common sentence patterns are divided into strongly business-related sentence patterns and weakly business-related sentence patterns according to whether the sentence pattern always contains keywords: if it does, it is strongly related; otherwise, it is weakly related. Fuzzy sentence pattern matching is performed on the texts to be screened at the model layer. A text that matches a strongly business-related sentence pattern is directly judged to be a target text, and a text that matches a weakly business-related sentence pattern goes on to the next processing step.
After these two channels, a text that hits a keyword or matches a weakly business-related sentence pattern still cannot be determined to be a target text. A TextCNN text classification model is therefore trained on the labeled texts. TextCNN extracts shallow text features well, performs well on short-text tasks such as search, dialogue and intent classification, is widely used, and is fast, so it can effectively classify the undetermined texts at the semantic level.
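The patent does not describe the TextCNN architecture in detail. As a rough illustration of the shallow-feature extraction mentioned above, the sketch below shows the operation at the core of TextCNN-style models: one convolution filter slides over windows of consecutive word vectors, and max-over-time pooling keeps its strongest response. It is written in plain Python for readability. A real model would use many filters of several widths followed by a classification layer, typically on a framework such as the TensorFlow platform named in the description. The function name `conv_max_pool` and the toy vectors are assumptions.

```python
from typing import List

Vector = List[float]


def conv_max_pool(embeddings: List[Vector], kernel: List[Vector],
                  bias: float = 0.0) -> float:
    """Apply one convolution filter over word windows, then ReLU and
    max-over-time pooling, returning a single feature value."""
    width = len(kernel)  # number of consecutive words the filter spans
    if len(embeddings) < width:
        raise ValueError("text shorter than filter width; pad it first")
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Element-wise product of filter and window, summed over all cells.
        total = bias + sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # max-over-time pooling


# Three words with 2-dimensional toy embeddings, one filter of width 2.
feature = conv_max_pool([[1, 0], [0, 1], [1, 1]], [[1, 0], [0, 1]])
```

Because pooling keeps only the maximum over positions, the feature does not depend on where in the text the matching window occurs, which is one reason this kind of model suits short texts.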
In the third aspect of the invention, a multi-evidence fusion method is provided for scoring the screened text, and screening accuracy is improved through automatic feedback of the labels that business personnel continuously provide. Three kinds of evidence are used. (1) Keywords. In a text screening task, keyword screening is one of the simplest and most effective methods, and the more keywords a text hits, the more likely it is to be a target text. (2) The text semantic classification model. Corresponding scores are set according to whether the text matches a strongly or a weakly business-related sentence pattern and according to the classification result of the TextCNN model; the text classification capability is provided at the model layer. (3) Information matching based on an external auxiliary corpus. An external corpus is introduced to assist the judgment. The external auxiliary corpus is not necessarily the same type of text as the text to be screened, but it contains information related to the target text, namely named entities such as persons and institutions. Named entity recognition is performed on the auxiliary corpus at the model layer, the related information is extracted, and the number of related entities that appear in the text is calculated. Combining these three kinds of evidence, each text is given a final weighted score, and the system preferentially recommends the texts with higher scores.
In the fourth aspect of the invention, an efficient automatic feedback optimization mechanism is designed. When business personnel use the system, they give new labels to the recommended texts. So that the system can keep improving its screening accuracy, the newly labeled data is used to update the keywords, add matching sentence patterns, and update the TextCNN text classification model. This automatic feedback mechanism makes full use of the labeling information and brings the screening results of the model closer to business requirements. To account for the timeliness of keywords, the system counts the occurrences of each keyword in the screened texts of the most recent period and deletes a keyword that occurs too rarely.
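The fusion of the three kinds of evidence into a weighted score can be sketched as follows. The patent does not disclose the weights or how the evaluation values are scaled, so the weights below, the threshold and the helper names `weighted_score` and `select_for_review` are illustrative assumptions only.

```python
from typing import Dict, List, Tuple


def weighted_score(e1: float, e2: float, e3: float,
                   weights: Tuple[float, float, float] = (0.25, 0.5, 0.25)) -> float:
    """Combine evaluation values E1..E3 into one score (weights are hypothetical)."""
    w1, w2, w3 = weights
    return w1 * e1 + w2 * e2 + w3 * e3


def select_for_review(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return ids of texts scoring above the threshold, highest score first."""
    chosen = [text_id for text_id, score in scores.items() if score > threshold]
    return sorted(chosen, key=lambda text_id: scores[text_id], reverse=True)


scores = {
    "a": weighted_score(4, 1, 2),   # 0.25*4 + 0.5*1 + 0.25*2 = 2.0
    "b": weighted_score(0, 1, 0),   # 0.5
    "c": weighted_score(4, 2, 4),   # 3.0
}
review_queue = select_for_review(scores, threshold=1.0)
```

In this example texts `c` and `a` pass the threshold and are ordered so that the higher-scoring text is recommended to business personnel first.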
Compared with the prior art, the invention has the following positive effects:
(1) For the directional screening of massive text data, a computing device and a closed-loop solution are provided, with offline and real-time computing capability and a multi-channel screening and fusion model, and the accuracy of model screening can be improved continuously based on research and judgment data;
(2) A multi-channel data screening process and method are designed that combine rule-based and semantics-based text screening, process massive text data efficiently, and use multiple screening criteria to ensure screening accuracy;
(3) Evidence from several sources is used to verify the text screening result, which both strengthens the reliability of the result and makes the judgment of business personnel easier;
(4) A multi-channel feedback correction mechanism is designed that takes the timeliness of keywords into account and updates the screening model in real time, so that the text screening system can correct itself in real time according to business requirements.
Drawings
FIG. 1 is an exemplary diagram of a directional screening apparatus for massive text data;
FIG. 2 is a flow chart of multi-channel data screening;
FIG. 3 is a flowchart of multi-evidence scoring of recommended texts;
FIG. 4 is a schematic diagram of a multi-channel feedback correction mechanism.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the directional screening device and method for massive text data are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of the directional screening device for massive text data. As shown in the figure, the device as a whole is divided into five parts: a data access subsystem, a data storage subsystem, a real-time computing subsystem, a model subsystem and a business application subsystem. The data access layer provides real-time access to massive text data. The data storage layer persists the data from the message queue into storage for subsequent offline computation and analysis. The computing layer provides real-time computing capability (Spark Streaming, Flink and the like) and offline computing capability (Spark), and also provides a machine learning platform (TensorFlow) to support model training. The model layer provides the multi-channel data screening model and the multi-evidence fusion model that support the screening of massive text data. The research and judgment layer gives business personnel research and judgment capability through an application interface, supports manual review of the screened text data, and writes the review results to the knowledge base layer for storage. The model layer dynamically updates the massive text data screening model based on the review results in the knowledge base layer.
Fig. 2 is a flow chart of multi-channel data screening. The multi-channel data screening process comprises the following steps. (1) Keyword matching: the knowledge base layer establishes a keyword lexicon, the user configures keywords through the business research and judgment layer, and the model layer traverses the lexicon and counts the keywords contained in each text to be screened. (2) Sentence pattern matching: at the model layer, frequently occurring sentence patterns are extracted from the positive texts in the training corpus and divided into strongly business-related sentence patterns and weakly business-related sentence patterns; sentence pattern matching is then performed on each text to be screened; a text that matches a strongly business-related sentence pattern is directly judged to be a target text, and a text that matches a weakly business-related sentence pattern goes on to the next judgment. (3) Text classification: at the model layer, the training corpus is preprocessed by word segmentation, stop-word removal, error correction, length truncation and the like; word vectors are trained with the word2vec algorithm so that each text is represented by vectors, and a text classification model is trained; texts hit by keywords and texts matching weakly business-related sentence patterns are preprocessed and vectorized in the same way and classified with the trained model. The final result is the union of the texts that match strongly business-related sentence patterns and the texts that the classification model judges to be positive.
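The routing performed by the first two channels can be sketched as below. The patent does not specify how fuzzy sentence pattern matching is computed. This sketch stands in a character-level similarity from Python's standard `difflib` module, whereas a real system would more likely compare syntactic structure. The `min_ratio` value, the function names and the "discard" outcome for texts that match nothing are assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable, Sequence


def keyword_hits(text: str, lexicon: Iterable[str]) -> int:
    """Count how many distinct lexicon keywords occur in the text."""
    return sum(1 for kw in set(lexicon) if kw in text)


def fuzzy_match(text: str, pattern: str, min_ratio: float = 0.8) -> bool:
    """Stand-in fuzzy matcher: character-level similarity ratio (assumption)."""
    return SequenceMatcher(None, pattern, text).ratio() >= min_ratio


def route(text: str, lexicon: Iterable[str],
          strong_patterns: Sequence[str], weak_patterns: Sequence[str]) -> str:
    """Return 'target', 'suspected' (goes on to the classifier) or 'discard'."""
    if any(fuzzy_match(text, p) for p in strong_patterns):
        return "target"
    if keyword_hits(text, lexicon) > 0 or any(
        fuzzy_match(text, p) for p in weak_patterns
    ):
        return "suspected"
    return "discard"  # assumption: texts matching nothing are dropped


strong = ["please transfer the money"]
r1 = route("please transfer the money", {"gold"}, strong, [])
r2 = route("gold price rises", {"gold"}, strong, [])
r3 = route("hello world", {"gold"}, strong, [])
```

Only texts routed as "suspected" need the comparatively expensive classification step, which is how the rule-based channels keep the overall pipeline efficient.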
FIG. 3 is a flow chart of multi-evidence scoring of recommended texts. The multi-evidence fusion method uses three kinds of evidence. Evidence one: keywords. The keywords appearing in the text are counted at the model layer. Evidence two: semantic judgment. At the model layer, corresponding scores are given according to whether the text matches a strongly or a weakly business-related sentence pattern and according to the classification result of the TextCNN model. Evidence three: information matching based on an external auxiliary corpus. The external auxiliary corpus is a specific corpus obtained according to the requirements of the business party, and the match between a text and the auxiliary corpus is the number of different entities contained in the text that also occur in the external auxiliary corpus; repeated occurrences of the same entity are counted only once. Named entity recognition is performed on the external auxiliary corpus at the model layer, person names and organization names are extracted, and the number of these entities appearing in the text is calculated. Combining the three kinds of evidence, the text is given a final weighted score, and the system preferentially recommends the texts with higher scores.
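A minimal sketch of how the three evaluation values might be derived for one text is given below. The entity set is assumed to have been extracted from the external auxiliary corpus beforehand by a named entity recognizer. The numeric scores attached to pattern matches and to a positive classification are hypothetical, since the patent only states that corresponding scores are set.

```python
from dataclasses import dataclass
from typing import Iterable, Set

# Hypothetical score table for evidence two; the patent gives no values.
PATTERN_SCORES = {"strong": 1.0, "weak": 0.5, "none": 0.0}
CLASSIFIER_POSITIVE_SCORE = 1.0


@dataclass
class Evidence:
    e1: float  # keyword evidence
    e2: float  # semantic evidence (pattern match plus classifier result)
    e3: float  # external auxiliary corpus evidence


def collect_evidence(text: str, lexicon: Iterable[str],
                     auxiliary_entities: Set[str],
                     pattern_match: str, classifier_positive: bool) -> Evidence:
    # E1: number of distinct lexicon keywords found in the text.
    e1 = sum(1 for kw in set(lexicon) if kw in text)
    # E2: score for the matched pattern type plus the classifier verdict.
    e2 = PATTERN_SCORES[pattern_match] + (
        CLASSIFIER_POSITIVE_SCORE if classifier_positive else 0.0
    )
    # E3: distinct auxiliary-corpus entities present in the text; using a
    # set means a repeated entity is counted only once.
    e3 = sum(1 for entity in auxiliary_entities if entity in text)
    return Evidence(float(e1), e2, float(e3))


evidence = collect_evidence(
    "Alice of Acme Corp asked Alice to buy gold",
    ["gold", "silver"],
    {"Alice", "Acme Corp", "Bob"},
    pattern_match="weak",
    classifier_positive=True,
)
```

The entity "Alice" appears twice in the example text but contributes only once to E3, in line with the counting rule described above.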
Fig. 4 illustrates the multi-channel feedback correction mechanism. When business personnel use the system, they give new labels to the recommended texts at the business research and judgment layer. The newly labeled data is used as follows: (1) the number of occurrences of each keyword of the original keyword lexicon in the recent positive texts is counted; a keyword that occurs too rarely (for example, fewer than 10 times) is discarded, and the keywords of the positive texts are added to the keyword configuration table of the knowledge base layer; (2) common sentence patterns are extracted and added to the common sentence pattern table; (3) the labeled positive and negative texts are added to the model training corpus, and the TextCNN classification model is retrained.
Although the specific details, implementation algorithms and figures of the present invention have been disclosed for illustrative purposes to aid understanding and implementation of the invention, those skilled in the art will appreciate that various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should not be limited to the preferred embodiments and drawings disclosed in this description; the scope of the invention is defined by the claims.
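The three feedback updates can be sketched as one function over a simple in-memory knowledge base. The `KnowledgeBase` and `Annotation` structures are hypothetical stand-ins for the keyword configuration table, the common sentence pattern table and the training corpus. Sentence pattern extraction and the actual TextCNN retraining are assumed to happen elsewhere, and the default threshold of 10 follows the example in the text.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class KnowledgeBase:
    keywords: Set[str] = field(default_factory=set)
    patterns: Set[str] = field(default_factory=set)
    corpus: List[Tuple[str, bool]] = field(default_factory=list)  # (text, is_target)


@dataclass
class Annotation:
    text: str
    is_target: bool
    keywords: List[str] = field(default_factory=list)  # labeled by reviewers
    patterns: List[str] = field(default_factory=list)  # extracted upstream (assumed)


def apply_feedback(kb: KnowledgeBase, annotations: List[Annotation],
                   min_count: int = 10) -> None:
    positives = [a for a in annotations if a.is_target]
    # (1) Drop keywords that occur too rarely in recent positive texts,
    #     then add the keywords labeled on those positive texts.
    counts = {kw: sum(a.text.count(kw) for a in positives) for kw in kb.keywords}
    kb.keywords = {kw for kw in kb.keywords if counts[kw] >= min_count}
    for a in positives:
        kb.keywords.update(a.keywords)
    # (2) Add newly extracted common sentence patterns to the pattern table.
    for a in positives:
        kb.patterns.update(a.patterns)
    # (3) Add positive and negative texts to the training corpus; the caller
    #     is expected to retrain the classifier on kb.corpus afterwards.
    kb.corpus.extend((a.text, a.is_target) for a in annotations)


kb = KnowledgeBase(keywords={"gold", "silver"})
apply_feedback(
    kb,
    [Annotation("gold gold gold", True, ["bullion"], ["buy X now"]),
     Annotation("weather report", False)],
    min_count=2,
)
```

After this call the stale keyword "silver" is gone, "bullion" has been added, and both the positive and the negative text are available for retraining.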

Claims (10)

1. A directional screening method for massive text data, comprising the following steps:
1) Obtaining suspected target texts from the texts to be screened by using a keyword matching method;
2) Extracting common sentence patterns of target texts from the labeled target texts, and dividing the extracted common sentence patterns into business strong-related sentence patterns and business weak-related sentence patterns; performing fuzzy sentence pattern matching on each text to be screened; if the text to be screened matches a business strong-related sentence pattern, the current text is judged to be a target text, and if it matches a business weak-related sentence pattern, the current text is judged to be a suspected target text;
3) Classifying each suspected target text by using a trained text classification model TextCNN;
4) Determining an evaluation value E1 of the suspected target text according to the number of keywords matched by the suspected target text; determining an evaluation value E2 of the suspected target text according to the classification result of the text classification model TextCNN for the suspected target text; determining an evaluation value E3 of the suspected target text based on the result of matching the suspected target text against an external auxiliary corpus; then calculating a final weighted score of the suspected target text based on the evaluation values E1, E2 and E3; and feeding back suspected target texts whose weighted scores are higher than a set threshold to a business research and judgment layer;
5) The business research and judgment layer determines whether each fed-back suspected target text is a target text and labels it accordingly; the keywords are then updated, matching sentence patterns are added, and the text classification model TextCNN is retrained and updated according to the newly labeled data.
2. The method of claim 1, wherein a common sentence pattern is a business strong-related sentence pattern if the common sentence pattern always contains a set keyword, and is otherwise a business weak-related sentence pattern.
3. The method of claim 1, wherein the annotation information of the target text includes text keywords and the names of entities in the text.
4. The method of claim 1, wherein the keywords are updated as follows: calculating the occurrence frequency, in the most recently labeled target texts, of each keyword in the current keyword library; discarding a keyword if its occurrence frequency is lower than a set threshold; and adding keywords that newly appear in the most recently labeled target texts to the keyword library.
5. The method of claim 1, wherein the external auxiliary corpus contains named entity information related to target text; and the evaluation value E3 is determined according to the number of named entities matched between the suspected target text and the external auxiliary corpus.
6. The method of claim 1 or 5, wherein the external auxiliary corpus and the text to be screened are of the same type of text or different types of text.
7. A directional screening device for massive text data, characterized by comprising a data access layer, a storage layer, a computing layer, a model layer, a business research and judgment layer and a knowledge base layer; wherein:
the data access layer is used for accessing data;
the storage layer is used for carrying out persistent storage on the accessed data;
the model layer is used for performing text screening on massive text data by using a multi-channel data screening model and a multi-evidence fusion model respectively, and for updating the multi-channel data screening model and the multi-evidence fusion model according to the research and judgment data and the screening configuration data stored in the knowledge base layer; the multi-channel data screening model obtains suspected target texts from the texts to be screened by using a keyword matching method; it extracts common sentence patterns of target texts from the labeled target texts, divides the extracted common sentence patterns into business strong-related sentence patterns and business weak-related sentence patterns, and then performs fuzzy sentence pattern matching on each text to be screened; if the text to be screened matches a business strong-related sentence pattern, the current text is judged to be a target text, and if it matches a business weak-related sentence pattern, the current text is judged to be a suspected target text;
the computing layer is used for classifying each suspected target text by using the trained text classification model TextCNN; determining an evaluation value E1 of the suspected target text according to the number of keywords matched by the suspected target text; determining an evaluation value E2 of the suspected target text according to the classification result of the text classification model TextCNN for the suspected target text; determining an evaluation value E3 of the suspected target text based on the result of matching the suspected target text against an external auxiliary corpus; and then calculating a final weighted score of the suspected target text based on the evaluation values E1, E2 and E3;
the business research and judgment layer is used for determining whether each fed-back suspected target text is a target text, labeling it accordingly, and issuing the screening configuration to the model layer; the model layer then updates the keywords, adds matching sentence patterns, and retrains and updates the text classification model TextCNN according to the newly labeled data;
and the knowledge base layer is used for storing the research and judgment data and the screening configuration data of the business research and judgment layer.
8. The directional screening apparatus of claim 7, wherein the data access layer uses a message queue to access data.
9. The directional screening apparatus of claim 7 wherein a common sentence pattern is a business strong-related sentence pattern if the common sentence pattern always contains a set keyword, and is a business weak-related sentence pattern otherwise.
10. The directional screening apparatus of claim 7, wherein the external auxiliary corpus comprises named entity information related to target text; and the evaluation value E3 is determined according to the number of named entities matched between the suspected target text and the external auxiliary corpus.
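The multi-evidence scoring recited in the claims (evaluation values E1, E2 and E3 combined into a weighted score and compared with a threshold) can be illustrated with a short sketch. The claims do not fix the weights, the normalization of the counts, or the threshold, so every numeric value below, as well as the names `Evidence`, `weighted_score` and `select_for_review`, is an assumption chosen for illustration. E2 is assumed here to be the classifier's probability for the target class.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    keyword_hits: int       # number of matched keywords (basis for E1)
    classifier_prob: float  # assumed TextCNN probability of the target class (basis for E2)
    entity_hits: int        # named entities matched in the auxiliary corpus (basis for E3)


def weighted_score(ev, weights=(0.3, 0.5, 0.2), keyword_cap=5, entity_cap=3):
    """Combine the three evaluation values into one score in [0, 1].

    The weights and caps are illustrative assumptions, not values from the patent.
    """
    e1 = min(ev.keyword_hits / keyword_cap, 1.0)  # E1: capped, normalized keyword count
    e2 = ev.classifier_prob                       # E2: classifier confidence
    e3 = min(ev.entity_hits / entity_cap, 1.0)    # E3: capped, normalized entity count
    w1, w2, w3 = weights
    return w1 * e1 + w2 * e2 + w3 * e3


def select_for_review(candidates, threshold=0.6):
    """Return the texts whose weighted score exceeds the (assumed) threshold."""
    return [text for text, ev in candidates if weighted_score(ev) > threshold]
```

For example, a text with five keyword hits, a classifier probability of 0.9 and three entity matches scores about 0.95 under these assumed weights and would be forwarded to the business research and judgment layer, while a text with no hits and a probability of 0.2 scores about 0.1 and would not.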
CN202010474192.5A 2020-05-29 2020-05-29 Directional screening device and method for massive text data Active CN113742478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010474192.5A CN113742478B (en) 2020-05-29 2020-05-29 Directional screening device and method for massive text data

Publications (2)

Publication Number Publication Date
CN113742478A CN113742478A (en) 2021-12-03
CN113742478B true CN113742478B (en) 2023-09-05

Family

ID=78724574

Country Status (1)

Country Link
CN (1) CN113742478B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622443A (en) * 2012-03-13 2012-08-01 北京邮电大学 Customized screening system and method for microblog
CN107993134A (en) * 2018-01-23 2018-05-04 北京知行信科技有限公司 A kind of smart shopper exchange method and system based on user interest
CN109446513A (en) * 2018-09-18 2019-03-08 中国电子科技集团公司第二十八研究所 The abstracting method of event in a kind of text based on natural language understanding
WO2019109918A1 (en) * 2017-12-06 2019-06-13 腾讯科技(深圳)有限公司 Abstract text generation method, computer readable storage medium and computer device
CN109885688A (en) * 2019-03-05 2019-06-14 湖北亿咖通科技有限公司 File classification method, device, computer readable storage medium and electronic equipment
CN109902179A (en) * 2019-03-04 2019-06-18 上海宝尊电子商务有限公司 The method of screening electric business comment spam based on natural language processing
CN110059316A (en) * 2019-04-16 2019-07-26 广东省科技基础条件平台中心 A kind of dynamic scientific and technological resources semantic analysis based on data perception
CN110377692A (en) * 2019-06-03 2019-10-25 广东幽澜机器人科技有限公司 A kind of artificial client service method of image training robot learning by imitation and device
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
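Research on fast similarity detection of massive text based on a distributed architecture; Jin Xiaolin et al.; Journal of Communication University of China (Science and Technology); Vol. 26, No. 1; pp. 39-44 *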

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant