CN113742478B

CN113742478B - Directional screening device and method for massive text data

Info

Publication number: CN113742478B
Application number: CN202010474192.5A
Authority: CN
Inventors: 万辛; 戚梦苑; 孙晓晨; 侯炜; 宁珊; 沈亮; 李娅强; 王树鹏; 田正鑫
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2023-09-05
Anticipated expiration: 2040-05-29
Also published as: CN113742478A

Abstract

The invention discloses a directional screening architecture and a directional screening method for massive text data. The method comprises the following steps: 1) obtaining suspected target texts from the texts to be screened by keyword matching; 2) extracting common sentence patterns from labeled target texts and dividing them into business strongly related sentence patterns and business weakly related sentence patterns; performing fuzzy sentence pattern matching on each text to be screened, judging the text to be a target text if it matches a business strongly related sentence pattern, and otherwise treating it as a suspected target text; 3) classifying each suspected target text; 4) determining an evaluation value E1 of the text according to the number of keywords matched in the suspected target text; determining an evaluation value E2 of the text according to the classification result; determining an evaluation value E3 of the text based on the result of matching the text against an external auxiliary corpus; then calculating a final score of the text from the evaluation values E1 to E3 and feeding it back to the business research and judgment layer; 5) the business research and judgment layer determines whether the fed-back text is a target text.

Description

Directional screening device and method for massive text data

Technical Field

The invention relates to a directional screening device and a directional screening method for massive text data, which can provide a computing device for screening massive text and a closed-loop solution, and belongs to the field of computer science.

Background

With the development of networks and the continuous growth in the number of internet users, more and more information is generated every day. In the field of natural language processing, how to find the content that interests a user quickly and accurately within such a huge information space is an urgent problem. On one hand, more text data is generated every day, and stronger computing power is required to process the full volume of text; on the other hand, the content of interest to the user has to be found by screening the massive text data more precisely against the user's interests. Meanwhile, the user's interests change over time, so a closed-loop screening device and method for massive text data is very necessary.

There are two common approaches to text screening.

1. Rule-based. Texts that satisfy a certain format or contain certain specific words are screened out. This approach is commonly used for texts with obvious regularities, such as advertisements. It is simple and efficient and can process massive natural language data, but it places high demands on text quality: if the text contains a large amount of noise or even errors, mismatches occur easily and screening accuracy falls. Its applicable scenarios are limited, because the target text must show obvious regularity. Moreover, rules that are set too strictly improve accuracy but easily miss target texts, whereas rules that are too loose miss fewer texts but reduce accuracy, so a trade-off is required.

2. Semantics-based. A text classification algorithm is used to screen out the required texts. Text classification may use various text topic models, classical classification algorithms such as SVM, Bayesian classification models and decision trees, or neural networks. This approach is more flexible, can handle more complex classification problems, and has relatively higher accuracy, but it is often limited by the training corpus: more complex algorithms generally require larger training corpora. It suits scenarios in which the text types are clear and limited; if the texts in the corpus are too diverse and their types too complex, a semantics-based text classification algorithm struggles to cope.

Disclosure of Invention

In view of the problems of existing directional screening methods for massive text data, the main purpose of the invention is to provide a directional screening device and method for massive text data, which screen specific text data out of massive text data based on multi-channel data and fusion models, and continuously update the models through user research and judgment, thereby improving screening accuracy. The invention first provides a computing device and method based on a big data platform. The device comprises a data access layer, a storage layer, a calculation layer, a model layer, a business research and judgment layer and a knowledge base layer, and provides data access, data storage, real-time and offline data analysis, model training, business research and judgment, dynamic model updating, multi-channel data screening and multi-evidence fusion capabilities. Because a single text screening method is insufficient for screening massive text data, the method combines the rule-based and semantics-based ideas: it first designs several text preprocessing steps that handle text noise and erroneous text in real time at the calculation layer; it then uses rule-based keyword matching and sentence pattern matching at the model layer to obtain suspected target texts; finally it uses a TextCNN text classification model to further classify the suspected target texts, and optimizes the pre-training of the TextCNN network. The method thus combines the high efficiency of rule-based methods with the high accuracy of semantics-based methods.

In addition, an automatic feedback correction method is designed: research and judgment capability is provided at the business research and judgment layer, and the new labeling information produced by business personnel during use is fully exploited to correct the model, which improves the accuracy of the text screening model and makes it better meet business requirements.

The technical scheme of the invention is as follows:

A directional screening method for massive text data comprises the following steps:

1) Obtaining suspected target text from the text to be screened by using a keyword matching method;

2) Extracting common sentence patterns of the target text from labeled target texts, and dividing the extracted common sentence patterns into business strongly related sentence patterns and business weakly related sentence patterns; performing fuzzy sentence pattern matching on each text to be screened, judging the current text to be a target text if it matches a business strongly related sentence pattern, and judging it to be a suspected target text if it matches a business weakly related sentence pattern;

3) Classifying each suspected target text by using a trained text classification model TextCNN;

4) Determining an evaluation value E1 of the suspected target text according to the number of keywords matched in the suspected target text; determining an evaluation value E2 of the suspected target text according to the classification result of the text classification model TextCNN for the suspected target text; determining an evaluation value E3 of the suspected target text based on the result of matching the suspected target text against an external auxiliary corpus; then calculating a final weighted score of the suspected target text based on its evaluation values E1, E2 and E3; and feeding suspected target texts whose weighted scores are higher than a set threshold back to the business research and judgment layer;

5) The business research and judgment layer determines whether each fed-back suspected target text is a target text and gives the corresponding label; the keywords are then updated according to the newly labeled data, matching sentence patterns are added, and the text classification model TextCNN is retrained and updated.
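The routing performed in steps 1) and 2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Verdict`, `fuzzy_match` and `route_text` are hypothetical, sentence patterns are represented as plain template strings, and "fuzzy sentence pattern matching" is approximated with `difflib.SequenceMatcher` and an arbitrary 0.8 similarity cutoff.

```python
from difflib import SequenceMatcher
from enum import Enum


class Verdict(Enum):
    TARGET = "target"        # matched a business strongly related pattern
    SUSPECTED = "suspected"  # must still be classified and scored
    DISCARD = "discard"      # no channel matched


def fuzzy_match(text: str, pattern: str, cutoff: float = 0.8) -> bool:
    """Assumed fuzzy matcher: character-level similarity against a pattern."""
    return SequenceMatcher(None, text, pattern).ratio() >= cutoff


def route_text(text, keywords, strong_patterns, weak_patterns):
    """Return (verdict, number of matched keywords) for one text."""
    hits = sum(1 for kw in keywords if kw in text)
    if any(fuzzy_match(text, p) for p in strong_patterns):
        return Verdict.TARGET, hits
    if hits > 0 or any(fuzzy_match(text, p) for p in weak_patterns):
        return Verdict.SUSPECTED, hits
    return Verdict.DISCARD, hits
```

Texts routed to `Verdict.SUSPECTED` would then go through classification (step 3) and scoring (step 4); the keyword count is returned because it later feeds the evaluation value E1.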

Further, if a common sentence pattern always contains a set keyword, the common sentence pattern is a business strongly related sentence pattern; otherwise, it is a business weakly related sentence pattern.
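One possible reading of this rule is sketched below. It assumes that each common sentence pattern is stored together with the labeled target texts from which it was extracted, and that "always contains a set keyword" means every one of those texts contains at least one lexicon keyword. The function name `split_patterns` and the data layout are hypothetical.

```python
def split_patterns(pattern_examples, keywords):
    """Split common sentence patterns into strongly and weakly related ones.

    pattern_examples maps a sentence pattern to the labeled target texts
    it was extracted from (an assumed data layout).
    """
    strong, weak = [], []
    for pattern, examples in pattern_examples.items():
        # Strong only if every supporting text contains some keyword.
        always_has_keyword = bool(examples) and all(
            any(kw in text for kw in keywords) for text in examples
        )
        (strong if always_has_keyword else weak).append(pattern)
    return strong, weak
```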

Further, the labeling information of the target text comprises text keywords and entity names in the text.

Further, the method for updating the keywords comprises the following steps: calculating the occurrence frequency of keywords in the current keyword lexicon in the latest marked target text, discarding the corresponding keywords if the occurrence frequency is lower than a set threshold value, and adding the newly-appearing keywords in the latest marked target text into the keyword lexicon.

Further, the external auxiliary corpus contains named entity information related to the target text; and determining an evaluation value E3 according to the number of the named entities matched in the suspected target text and the external auxiliary corpus.

Further, the external auxiliary corpus and the text to be screened are the same type of text or different types of text.

A directional screening device for massive text data, characterized by comprising a data access layer, a storage layer, a calculation layer, a model layer, a business research and judgment layer and a knowledge base layer; the data access layer is used for accessing data; the storage layer is used for persistently storing the accessed data; the model layer is used for screening massive text data by using a multi-channel data screening model and a multi-evidence fusion model respectively, and for updating the multi-channel data screening model and the multi-evidence fusion model according to the research and judgment data and the screening configuration data stored in the knowledge base layer; the multi-channel data screening model uses a keyword matching method to obtain suspected target texts from the texts to be screened, extracts common sentence patterns of the target text from labeled target texts, divides the extracted common sentence patterns into business strongly related sentence patterns and business weakly related sentence patterns, and then performs fuzzy sentence pattern matching on each text to be screened, judging the current text to be a target text if it matches a business strongly related sentence pattern and a suspected target text if it matches a business weakly related sentence pattern; the calculation layer is used for classifying each suspected target text by using the trained text classification model TextCNN, determining an evaluation value E1 of the suspected target text according to the number of keywords matched in the suspected target text, determining an evaluation value E2 of the suspected target text according to the classification result of the text classification model TextCNN for the suspected target text, determining an evaluation value E3 of the suspected target text based on the result of matching the suspected target text against an external auxiliary corpus, and then calculating a final weighted score of the suspected target text based on its evaluation values E1, E2 and E3; the business research and judgment layer is used for determining whether a fed-back suspected target text is a target text, giving the corresponding label, and issuing screening configuration to the model layer, after which the model layer updates the keywords according to the newly labeled data, adds matching sentence patterns, and retrains and updates the text classification model TextCNN; and the knowledge base layer is used for storing the research and judgment data and the screening configuration data of the business research and judgment layer.

Furthermore, the data access layer uses a message queue to access data.

In a first aspect, the invention provides a computing device for screening massive text data, which offers directional screening capability to business applications; in a second aspect, it provides a multi-channel data screening process for screening text data; in a third aspect, it provides a multi-evidence fusion method for scoring the screened texts, and text screening accuracy is improved through automatic feedback of the labels continuously supplied by business personnel.

According to the first aspect of the invention, a computing device is designed for the directional screening scenario of massive data, providing a closed-loop solution covering data access, storage, real-time computation, offline computation, model screening, business research and judgment, and automatic model updating. The device implements data access and persistence based on a message queue; provides Spark Streaming and Flink development interfaces to support real-time computing; provides a Spark offline computing interface to support the submission of offline computing tasks; and provides a TensorFlow machine learning platform to support the training and publishing of data models. At the model layer, a multi-channel data screening model and a multi-evidence fusion model are provided for the specific scenario of massive text data screening. At the business research and judgment layer, users can research and judge the screened data and issue new screening configuration (such as keywords). The knowledge base layer stores the users' research and judgment data, configuration data and the like. The model layer reads the research and judgment results and the configuration data stored in the knowledge base layer to update the models, so that a closed loop is formed.

In the second aspect of the invention, a multi-channel data screening method is provided for screening text data. In the first approach, a keyword library is established according to business requirements, and texts containing keywords are screened out at the model layer. Because the number of texts is huge, the texts contain noise and errors, and Chinese words are ambiguous, a text that hits a keyword cannot be guaranteed to be a target text and needs further judgment. In the second approach, the common sentence patterns of the labeled target texts are extracted (for the sentence pattern extraction method, see the reference: Li Wei. Research on automatic recognition of modern Chinese sentence pattern [D]. Xiamen University). The labeled texts are obtained as follows: the system recommends texts, and business personnel research and judge the recommended texts; a text whose judgment result is positive is a target text. The labeling information includes whether the text is a target text, the text keywords, and the entity names in the text. The common sentence patterns are divided into business strongly related sentence patterns and business weakly related sentence patterns according to whether the sentence pattern always contains keywords: if it does, the sentence pattern is strongly related; otherwise it is weakly related. Fuzzy sentence pattern matching is then performed on the texts to be screened at the model layer: a text that matches a business strongly related sentence pattern is directly judged to be a target text, and a text that matches a business weakly related sentence pattern goes on to the next processing step.

After these two approaches, it still cannot be determined whether texts that hit keywords and texts that match business weakly related sentence patterns are target texts. A TextCNN text classification model is therefore trained on the labeled texts. TextCNN has a strong ability to extract shallow text features, performs well on short-text tasks such as search, dialogue and intent classification, is widely used, and is fast, so it can effectively classify the undetermined texts at the semantic level.
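The shallow feature extraction attributed to TextCNN rests on one core operation: a filter slides over windows of consecutive word vectors, and max-over-time pooling keeps the strongest response. The sketch below shows that operation in pure Python with hand-picked toy vectors and a single filter. It is only an illustration: a real model learns many filters of several widths and would be trained on a platform such as TensorFlow, and the function name `conv_max_pool` and all values are assumptions.

```python
def conv_max_pool(word_vectors, kernel, bias=0.0):
    """Apply one convolution filter over word windows, then ReLU and
    max-over-time pooling.

    word_vectors: list of equal-length embedding vectors, one per token.
    kernel: list of weight vectors; its length is the window width.
    """
    width = len(kernel)
    if len(word_vectors) < width:
        raise ValueError("text is shorter than the filter width")
    responses = []
    for start in range(len(word_vectors) - width + 1):
        window = word_vectors[start:start + width]
        activation = bias + sum(
            w * x
            for weights, vector in zip(kernel, window)
            for w, x in zip(weights, vector)
        )
        responses.append(max(0.0, activation))  # ReLU
    return max(responses)  # max-over-time pooling


# Toy 2-dimensional embeddings for a four-token text (illustrative values).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]  # window width 2
feature = conv_max_pool(text, bigram_filter)
```

In a full TextCNN, the pooled features from all filters are concatenated and passed to a fully connected layer that outputs the class probabilities.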

In the third aspect of the invention, a multi-evidence fusion method is provided for scoring the screened texts, and screening accuracy is improved through automatic feedback of the labels continuously supplied by business personnel. There are three main kinds of evidence. (1) Keywords. In a text screening task, keyword screening is one of the simplest and most effective methods: the more keywords a text hits, the more likely it is to be a target text. (2) The text semantic classification model. Corresponding scores are set according to whether the text matches a business strongly related or weakly related sentence pattern and according to the classification result of the TextCNN model; the text classification capability is provided at the model layer. (3) Information matching based on an external auxiliary corpus. An external corpus is introduced to assist judgment. The external auxiliary corpus is not necessarily the same type of text as the text to be screened, but it contains information related to the target text, namely named entities such as persons and organizations. Named entity recognition is performed on the auxiliary corpus at the model layer, the related information is extracted, and the number of related entities appearing in the text is calculated. Combining the three kinds of evidence gives the text a final weighted score, and the system preferentially recommends texts with higher scores.
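The fusion of the three kinds of evidence can be sketched as a weighted sum. The source does not give the weights, the normalization, or the threshold, so everything numeric below is an assumption: counts are capped and scaled to [0, 1], E2 is taken to be the classifier's positive-class probability, and the names `FusionWeights`, `fused_score` and `recommend` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionWeights:
    keyword: float = 0.3   # weight of E1 (assumed)
    semantic: float = 0.5  # weight of E2 (assumed)
    entity: float = 0.2    # weight of E3 (assumed)


def saturating(count: int, cap: int) -> float:
    """Map a non-negative count to [0, 1], saturating at cap."""
    return min(count, cap) / cap


def fused_score(keyword_hits, positive_probability, entity_hits,
                weights=FusionWeights(), cap=5):
    """Weighted combination of the evaluation values E1, E2 and E3."""
    e1 = saturating(keyword_hits, cap)
    e2 = positive_probability
    e3 = saturating(entity_hits, cap)
    return weights.keyword * e1 + weights.semantic * e2 + weights.entity * e3


def recommend(scored_texts, threshold=0.6):
    """Return texts scoring above threshold, highest score first."""
    ranked = sorted(scored_texts, key=lambda pair: pair[1], reverse=True)
    return [text for text, score in ranked if score > threshold]
```

Sorting before filtering makes the texts sent for human review arrive in priority order, which matches the stated preference for recommending higher-scoring texts first.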

In the fourth aspect of the invention, an efficient automatic feedback optimization mechanism is designed. When business personnel use the system, they give new labels to the recommended texts. So that the system can continuously improve its screening accuracy, the newly labeled data is used to update the keywords, add matching sentence patterns, and update the TextCNN text classification model; this automatic feedback mechanism makes full use of the labeling information and brings the model's screening results closer to business requirements. Considering the timeliness of keywords, the system counts the number of occurrences of each keyword in the texts screened during the most recent period, and deletes a keyword whose count is too small.

Compared with the prior art, the invention has the following positive effects:

(1) Aiming at directional screening of massive text data, a computing device and a closed-loop solution are provided, offline and real-time computing capacity and a multi-channel screening and fusion model are provided, and the accuracy rate of model screening can be continuously improved based on research and judgment data;

(2) A multi-channel data screening flow and method are designed, a text screening method based on rules and semantics is combined, massive text data is processed efficiently, and screening accuracy is guaranteed by using multiple screening standards;

(3) The multi-aspect evidence is used for verifying the text screening result, so that on one hand, the reliability of the screening result is enhanced, and on the other hand, convenience is provided for the judgment of service personnel;

(4) A multichannel feedback correction mechanism is designed, the timeliness of keywords is considered, and a screening model is updated in real time, so that a text screening system can correct in real time according to service requirements.

Drawings

FIG. 1 is an exemplary diagram of a directional screening apparatus for massive text data;

FIG. 2 is a flow chart of multi-channel data screening;

FIG. 3 is a flowchart of multiple evidence recommendation text scoring;

FIG. 4 is a schematic diagram of a multi-channel feedback correction mechanism.

Detailed Description

In order to make the objects, technical schemes and advantages of the present invention more clear, the following describes the directional screening device and method for massive text data in detail with reference to the accompanying drawings.

FIG. 1 shows a schematic diagram of the directional screening device for massive text data. As shown in the figure, the whole device is divided into 5 parts, mainly comprising a data access subsystem, a data storage subsystem, a real-time computing subsystem, a model subsystem and a business application subsystem. The data access layer mainly provides real-time access to massive text data. The data storage layer mainly persists the data from the message queue into storage for subsequent offline calculation and analysis. The calculation layer mainly provides real-time computing capability (Spark Streaming, Flink and the like) and offline computing capability (Spark), together with a machine learning platform (TensorFlow) that supports model training. The model layer mainly provides the multi-channel data screening model and the multi-evidence fusion model that support massive text data screening. The research and judgment layer gives business personnel research and judgment capability through an application interface, supports manual research and judgment of the screened text data, and writes the results into the knowledge base layer for storage. The model layer dynamically updates the massive text data screening models based on the research and judgment results in the knowledge base layer.

FIG. 2 is a flow chart of multi-channel data screening. The multi-channel data screening process comprises the following steps: (1) Keyword matching: the knowledge base layer maintains a keyword lexicon, the user configures keywords through the business research and judgment layer, and the model layer traverses the lexicon and counts the keywords contained in each text to be screened. (2) Sentence pattern matching: the model layer extracts sentence patterns with high occurrence frequency from the positive texts in the training corpus and divides them into business strongly related and business weakly related sentence patterns; it then performs sentence pattern matching on each text to be screened. A text that matches a business strongly related sentence pattern is directly judged to be a target text; a text that matches a business weakly related sentence pattern goes on to the next judgment. (3) Text classification: at the model layer, the training corpus is preprocessed by word segmentation, stop-word removal, error correction, length truncation and the like; word vectors are trained with a word-to-vector algorithm so that texts are represented as vectors, and a text classification model is trained. Texts hit by keywords and texts matching business weakly related sentence patterns are preprocessed and vectorized in the same way, and are then classified with the trained model. The final result is the union of the texts matching business strongly related sentence patterns and the texts judged positive by the classification model.
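The preprocessing and keyword counting steps can be sketched as below. This is a simplified stand-in: whitespace splitting replaces Chinese word segmentation, the error-correction step is omitted, and counting distinct keywords rather than total occurrences is an assumption, since the source does not specify which is meant. The function names are hypothetical.

```python
def preprocess(text, stop_words, max_len=50):
    """Tokenize, drop stop words, and truncate to max_len tokens.

    Whitespace splitting stands in for Chinese word segmentation, and
    error correction is omitted; both are simplifying assumptions.
    """
    tokens = [tok for tok in text.lower().split() if tok not in stop_words]
    return tokens[:max_len]


def count_keywords(tokens, lexicon):
    """Number of distinct lexicon keywords present in the token list."""
    return len(set(tokens) & set(lexicon))
```

The same `preprocess` function would be applied both to the training corpus and to the texts awaiting classification, so that the classifier sees inputs prepared in the same way at training and screening time.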

FIG. 3 is a flow chart of multi-evidence recommended text scoring. The multi-evidence fusion method uses three kinds of evidence. Evidence one: keywords. The keywords appearing in the text are counted at the model layer. Evidence two: semantic judgment. A corresponding score is given at the model layer according to whether the text matches a business strongly related or weakly related sentence pattern and according to the classification result of the TextCNN model. Evidence three: information matching based on an external auxiliary corpus. The external auxiliary corpus is a specific corpus obtained according to the requirements of the business party; the degree of matching between a text and the auxiliary corpus is the number of distinct entities from the external auxiliary corpus that the text contains, and repeated occurrences of the same entity are counted only once. Named entity recognition is performed on the external auxiliary corpus at the model layer, person names and organization names are extracted, and the number of these entities appearing in the text is calculated. Combining the three kinds of evidence gives the text a final weighted score, and the system preferentially recommends texts with higher scores.
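The entity count behind evidence three can be sketched in a few lines. The sketch assumes named entity recognition has already been run on the auxiliary corpus and that its output is available as a list of entity strings; the function name `count_matched_entities` is hypothetical, and plain substring search stands in for whatever matching the real system uses.

```python
def count_matched_entities(text, auxiliary_entities):
    """Count distinct auxiliary-corpus entities mentioned in text.

    Converting to a set first ensures that an entity listed or mentioned
    several times is counted only once.
    """
    return sum(1 for entity in set(auxiliary_entities) if entity in text)
```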

FIG. 4 illustrates the multi-channel feedback correction mechanism. When business personnel use the system, they give new labels to the recommended texts at the business research and judgment layer. The newly labeled data is used as follows: (1) the number of occurrences of each keyword of the original keyword library in the recent positive texts is counted; if the count is too small (for example, fewer than 10), the keyword is discarded, and the keywords of the positive texts are added to the keyword configuration table of the knowledge base layer; (2) common sentence patterns are extracted and added to the common sentence pattern table; (3) the labeled positive and negative texts are added to the model training corpus, and the TextCNN classification model is retrained.
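One round of this three-channel feedback can be sketched as an update to an in-memory knowledge base. The `KnowledgeBase` class and `apply_feedback` function are hypothetical. Keyword and sentence pattern extraction from the positive texts is assumed to happen elsewhere, retraining itself is not shown, and a keyword's "occurrences" are counted here as the number of positive texts containing it, which is one possible reading of the source.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    keywords: set = field(default_factory=set)
    sentence_patterns: set = field(default_factory=set)
    training_corpus: list = field(default_factory=list)  # (text, is_target)


def apply_feedback(kb, labeled, new_keywords, new_patterns, min_count=10):
    """Apply one round of business feedback to the knowledge base.

    labeled: (text, is_target) pairs from the latest review round.
    new_keywords / new_patterns: items taken from the positive texts;
    their extraction is outside this sketch.
    """
    positives = [text for text, is_target in labeled if is_target]
    counts = Counter(
        kw for kw in kb.keywords for text in positives if kw in text
    )
    # (1) Drop stale keywords, then add the newly annotated ones.
    kb.keywords = {kw for kw in kb.keywords if counts[kw] >= min_count}
    kb.keywords |= set(new_keywords)
    # (2) Add newly extracted common sentence patterns.
    kb.sentence_patterns |= set(new_patterns)
    # (3) Extend the corpus used to retrain the classifier.
    kb.training_corpus.extend(labeled)
    return kb
```

The default `min_count=10` mirrors the example threshold given in the text; in practice it would depend on the volume of texts reviewed per period.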

Although the specific details, algorithms for implementation, and figures of the present invention have been disclosed for illustrative purposes to aid in understanding the contents of the present invention and the implementation thereof, it will be appreciated by those skilled in the art that: various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should not be limited to the preferred embodiments of the present description and the disclosure of the drawings, but the scope of the invention is defined by the claims.

Claims

1. A directional screening method for massive text data comprises the following steps:

2. The method of claim 1 wherein a common sentence is a business strong-related sentence if the common sentence always contains a set keyword, and is otherwise a business weak-related sentence.

3. The method of claim 1, wherein the annotation information of the target text includes text keywords, names of entities in the text.

4. The method of claim 1, wherein the method of updating the keywords is: calculating the occurrence frequency of keywords in the current keyword lexicon in the latest marked target text, discarding the corresponding keywords if the occurrence frequency is lower than a set threshold value, and adding the newly-appearing keywords in the latest marked target text into the keyword lexicon.

5. The method of claim 1, wherein the external auxiliary corpus contains named entity information related to target text; and determining an evaluation value E3 according to the number of the named entities matched in the suspected target text and the external auxiliary corpus.

6. The method of claim 1 or 5, wherein the external auxiliary corpus and the text to be screened are of the same type of text or different types of text.

7. The directional screening device for massive text data is characterized by comprising a data access layer, a storage layer, a calculation layer, a model layer, a service research and judgment layer and a knowledge base layer; wherein:

the data access layer is used for accessing data;

the storage layer is used for carrying out persistent storage on the accessed data;

the model layer is used for carrying out text screening on massive text data by utilizing the multi-channel data screening model and the multi-evidence fusion model respectively; updating the multi-channel data screening model and the multi-evidence fusion model according to the research and judgment data and the screening configuration data stored in the knowledge base layer; the multi-channel data screening model uses a keyword matching method to acquire suspected target text from the text to be screened; extracting a common sentence pattern of the target text from the marked target text, dividing the extracted common sentence pattern into a sentence pattern which is related to the business strongly and a sentence pattern which is related to the business weakly, then carrying out fuzzy sentence pattern matching on the text to be screened, judging the current text to be screened as the target text if the text to be screened is matched with the sentence pattern which is related to the business strongly, and judging the current text to be screened as a suspected target text if the text to be screened is matched with the sentence pattern which is related to the business weakly;

the computing layer is used for classifying each suspected target text by using the trained text classification model TextCNN; determining an evaluation value E1 of the suspected target text according to the number of keywords matched with the suspected target text; determining an evaluation value E2 of the suspected target text according to a classification discrimination result of the text classification model textCNN on the suspected target text; determining an evaluation value E3 of the suspected target text based on the information matching result of the suspected target text and the external auxiliary corpus; then calculating to obtain a final weighted score of the suspected target text based on the evaluation value E1, the evaluation value E2 and the evaluation value E3 of the suspected target text;

the business research judging layer is used for determining whether the fed back suspected target text is a target text, giving corresponding labels and issuing screening configuration to the model layer; then the model layer updates keywords according to the newly marked data, adds matching sentence patterns, and trains and updates text classification model textCNN;

and the knowledge base layer is used for storing the research and judgment data and the screening configuration data of the business research and judgment layer.

8. The directional screening apparatus of claim 7, wherein the data access layer employs a message queue to enable access to data.

9. The directional screening apparatus of claim 7 wherein a common sentence pattern is a business strong-related sentence pattern if the common sentence pattern always contains a set keyword, and is a business weak-related sentence pattern otherwise.

10. The directional screening apparatus of claim 7, wherein the external auxiliary corpus comprises named entity information related to target text; and determining an evaluation value E3 according to the number of the named entities matched in the suspected target text and the external auxiliary corpus.