CN113779250A - Standardized text data processing system - Google Patents

Info

Publication number
CN113779250A
Authority
CN
China
Prior art keywords
module
text
quality inspection
texts
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111047940.2A
Other languages
Chinese (zh)
Inventor
彭明齐
耿峰
周振泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Songxin Intelligent Technology Co ltd
Original Assignee
Shanghai Songxin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-09-08
Filing date
2021-09-08
Publication date
2021-12-10
Application filed by Shanghai Songxin Intelligent Technology Co ltd
Priority to CN202111047940.2A
Publication of CN113779250A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a standardized text data processing system in the technical field of computers. The system comprises an acquisition module, a topic screening module, a requirement screening module, a quality inspection module, an early warning module and an output module. The acquisition module acquires text information published on information sources on the Internet; the topic screening module takes text information containing words related to the requirement topic as target texts; the requirement screening module determines the requirement direction of each target text with respect to the requirement topic from the emotional words in that text; the quality inspection module performs quality inspection on the target texts; the early warning module issues a warning for illegal texts detected by the quality inspection module, prompting the user that the corresponding texts carry risk; and the output module outputs the screened texts to the user. The system reduces the amount of data that the data processing and mining system has to mine, shortens processing time, improves the user's search efficiency, and better meets user requirements.

Description

Standardized text data processing system
Technical Field
The application relates to the technical field of computers, in particular to a standardized text data processing system.
Background
At present, data analysis refers to analyzing a large amount of collected data with appropriate statistical and analytical methods, and summarizing, understanding and digesting the data so as to make the fullest use of it. Data analysis is the process of studying and summarizing data in detail to extract useful information and form conclusions. Data, also called observations, are the results of experiments, measurements, observations, investigations and the like. The data handled in data analysis are divided into qualitative data and quantitative data. Data that can only be assigned to a category and cannot be measured numerically are called qualitative data. Qualitative data whose categories have no order are categorical data, such as gender or brand; qualitative data whose categories are ordered are ordinal data, such as education level or the quality grade of goods.
In the related art, collected text data often contains many uncertain factors in both format and content, so it usually has to be processed to some extent before being passed to subsequent processes. Compared with text, multimedia lets people obtain information more easily and quickly. For example, a few seconds or a few tens of seconds of multimedia can give people a general understanding of an object, such as the performance of a product, the content of a news item or the historical interest of a place. When multimedia is generated for a plurality of objects from collected text data, the text data usually has to be processed to some extent first, and the processed text data is then used to generate the multimedia for each object.
With regard to the related art, the inventors consider that the data an existing data processing and mining system has to mine is voluminous and complicated, so mining takes a long time, is inefficient, and cannot satisfy users well.
Disclosure of Invention
In order to solve the problem that data processing and mining take a long time, the application provides a standardized text data processing system.
The standardized text data processing system provided by the application adopts the following technical scheme:
A standardized text data processing system comprises an acquisition module, a topic screening module, a requirement screening module, a quality inspection module, an early warning module and an output module; the acquisition module is used for acquiring at least one piece of text information published on at least one information source on the Internet; the topic screening module is used for taking text information containing words related to the requirement topic as a target text; the requirement screening module is used for determining the requirement direction of each target text with respect to the requirement topic from the emotional words in the target text; the quality inspection module is used for performing quality inspection on the screened target text and determining the illegal words it contains; the early warning module is used for issuing a warning for illegal texts detected by the quality inspection module, prompting the user that the corresponding texts carry risk; and the output module is used for outputting the screened texts to the user.
Optionally, a memory module is connected to the quality inspection module; before use, a designer can enter illegal words into the memory module to serve as the quality inspection basis of the quality inspection module.
Optionally, a feedback module is connected to the output module and to the memory module; the feedback module can enter illegal words reported by the user into the memory module, and the illegal words entered through the feedback module also serve as the quality inspection basis of the quality inspection module.
Optionally, a selection preference module is connected to the output module and to the acquisition module; after the output module outputs a plurality of texts, the selection preference module records the user's selection and sends it to the acquisition module, and the acquisition module performs targeted acquisition according to the user's selection.
Optionally, a segmentation module is connected between the requirement screening module and the quality inspection module; the segmentation module segments the text screened by the requirement screening module into a plurality of entries, and the quality inspection module performs quality inspection on the entries produced by the segmentation module.
Optionally, a conversion module is connected between the early warning module and the output module; the conversion module converts the screened text into a standardized text, and the output module outputs the standardized text produced by the conversion module.
In summary, the standardized text data processing system of the present application provides at least the following beneficial technical effect:
In application, a user inputs search keywords and the acquisition module acquires a plurality of pieces of text information published on a plurality of information sources on the Internet. The topic screening module takes text information containing words related to the requirement topic as target texts, and the requirement screening module determines the requirement direction of each target text with respect to the requirement topic from the emotional words in that text. The quality inspection module performs quality inspection on the screened target texts and determines the illegal words they contain, and the early warning module issues a warning for the illegal texts detected by the quality inspection module, prompting the user that the corresponding texts carry risk. The output module then outputs the screened texts to the user. This reduces the amount of data the data processing and mining system has to mine, shortens processing time, improves the user's search efficiency, and better meets user requirements.
Drawings
FIG. 1 is a flow chart of a standardized text data processing system according to the present embodiment.
Reference numerals: 1. acquisition module; 2. topic screening module; 21. matrix building module; 22. feature word acquisition module; 23. first matching module; 24. topic screening submodule; 3. requirement screening module; 31. second matching module; 32. requirement screening submodule; 33. type identification module; 4. segmentation module; 5. quality inspection module; 51. memory module; 6. early warning module; 7. conversion module; 8. output module; 81. feedback module; 82. audit module; 83. selection preference module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings of the embodiments of the present application. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the application without any inventive step, are within the scope of protection of the application.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. The use of the terms "a" or "an" and the like in the description and in the claims of the present application do not denote a limitation of quantity, but rather denote the presence of at least one.
In the description of the present specification and claims, the terms "upper", "lower", "horizontal", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of describing the present application and simplifying the description, but do not indicate or imply that the referred device or unit must have a specific direction, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present application.
The present application is described in further detail below with reference to fig. 1.
The embodiment of the application discloses a standardized text data processing system.
Referring to fig. 1, a standardized text data processing system includes an acquisition module 1, a topic screening module 2, a requirement screening module 3, a quality inspection module 5, an early warning module 6 and an output module 8. In use, a user inputs a plurality of keywords, and the acquisition module 1 acquires a plurality of pieces of text information published on a plurality of information sources on the Internet. The topic screening module 2 then takes the text information containing words related to the requirement topic as target texts. Once the target texts have been screened out, the requirement screening module 3 determines the requirement direction of each target text with respect to the requirement topic from the emotional words in that text. The quality inspection module 5 then performs quality inspection on the screened target texts to determine the illegal words they contain. When the quality inspection module 5 detects illegal words in a target text, the early warning module 6 issues a warning for the text containing them, prompting the user that the text carries risk. After the user removes the risky texts, the output module 8 outputs the screened texts to the user. This reduces the amount of data the data processing and mining system has to handle, shortens processing time, and improves the user's search efficiency.
The topic screening module 2 comprises a matrix building module 21, a feature word acquisition module 22, a first matching module 23 and a topic screening submodule 24. In use, the matrix building module 21 finds a plurality of keywords in each piece of text information and constructs a distribution matrix of those keywords; the feature word acquisition module 22 determines the feature words among the keywords of each piece of text information by a chi-square test, yielding a feature word set for each piece of text information; the first matching module 23 matches the words related to the requirement topic against the feature words in each feature word set; and the topic screening submodule 24 takes the text information whose feature words are successfully matched with the words related to the requirement topic as the target text.
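The chi-square selection and matching steps can be sketched as follows. The application does not say what the chi-square test is computed against, so this sketch assumes a small reference set of keyword sets labeled on-topic or off-topic; the function names, the 2x2 contingency layout and the 3.84 threshold (the 5% critical value for one degree of freedom) are illustrative assumptions, not the applicant's implementation.

```python
from typing import Dict, List, Set


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 contingency table.

    a: on-topic texts containing the word   b: off-topic texts containing it
    c: on-topic texts without the word      d: off-topic texts without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def discriminative_words(docs: List[Set[str]], labels: List[bool],
                         threshold: float = 3.84) -> Set[str]:
    """Words whose presence depends on the topic label (assumed labeled reference set)."""
    n_on = sum(1 for on in labels if on)
    n_off = len(labels) - n_on
    selected = set()
    for word in set().union(*docs):
        a = sum(1 for doc, on in zip(docs, labels) if on and word in doc)
        b = sum(1 for doc, on in zip(docs, labels) if not on and word in doc)
        if chi_square(a, b, n_on - a, n_off - b) >= threshold:
            selected.add(word)
    return selected


def screen_by_topic(feature_sets: Dict[str, Set[str]],
                    related_words: Set[str]) -> List[str]:
    """Keep the ids of texts whose feature word set shares a word with the topic's related words."""
    return [tid for tid, feats in feature_sets.items() if feats & related_words]
```

In this sketch the feature word set of a single text would be the intersection of its keywords with the discriminative words, and `screen_by_topic` then plays the role of the first matching module 23 and the topic screening submodule 24.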
The requirement screening module 3 comprises a second matching module 31, a requirement screening submodule 32 and a type identification module 33.
The second matching module 31 matches each requirement word in a requirement word bank against the adjective keywords of each target text in the distribution matrix. When a match succeeds, the requirement screening submodule 32 takes the requirement direction and requirement degree corresponding to the matched requirement word as the requirement direction of the target text with respect to the requirement topic. The type identification module 33 determines whether the keywords of the target text in the distribution matrix include a description that reflects the requirement type.
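A minimal sketch of the matching step, assuming the requirement word bank is a dictionary from word to a direction and a degree. The lexicon entries, the direction labels and the rule of summing signed degrees are hypothetical; the application does not specify how several matches in one text are combined.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical requirement word bank: word -> (direction, degree).
REQUIREMENT_LEXICON: Dict[str, Tuple[str, int]] = {
    "excellent": ("positive", 2),
    "good": ("positive", 1),
    "poor": ("negative", 1),
    "terrible": ("negative", 2),
}


def requirement_direction(
    adjectives: List[str],
    lexicon: Dict[str, Tuple[str, int]] = REQUIREMENT_LEXICON,
) -> Optional[Tuple[str, int]]:
    """Return (direction, degree) for a target text, or None if nothing matches.

    Assumption: signed degrees of all matched words are summed and the sign
    of the total gives the overall direction.
    """
    score = 0
    matched = False
    for adjective in adjectives:
        if adjective in lexicon:
            direction, degree = lexicon[adjective]
            score += degree if direction == "positive" else -degree
            matched = True
    if not matched:
        return None
    if score == 0:
        return ("neutral", 0)
    return ("positive", score) if score > 0 else ("negative", -score)
```

Returning `None` for texts with no matched requirement word keeps "no signal" distinct from a balanced, neutral text, which a downstream module may want to treat differently.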
Some of the texts selected by the requirement screening module 3 may contain words that violate social values. Some teenagers search only out of curiosity, and when they receive words and videos that violate social values, their psychological development is easily distorted. The quality inspection module 5 performs quality inspection on the texts screened by the requirement screening module 3 and thereby helps protect the mental health of teenagers.
A memory module 51 is connected to the quality inspection module 5. Before use, a designer can enter illegal words and websites into the memory module 51 as the quality inspection basis of the quality inspection module 5. When a text screened by the requirement screening module 3 contains words or websites stored in the memory module 51, the quality inspection module 5 filters that text further. The memory module 51 can store a large number of illegal words and websites, and as designers add entries over time, the safety of use is further enhanced.
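The relationship between the memory module and the quality inspection step can be sketched as a lookup against two stored sets. The class name, the field names and the case-insensitive substring matching are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MemoryModule:
    """Illustrative store for designer-entered illegal words and websites."""
    words: Set[str] = field(default_factory=set)
    sites: Set[str] = field(default_factory=set)


def inspect(text: str, memory: MemoryModule) -> List[str]:
    """Return the stored words and websites found in the text (case-insensitive)."""
    lowered = text.lower()
    hits = [w for w in sorted(memory.words) if w.lower() in lowered]
    hits += [s for s in sorted(memory.sites) if s.lower() in lowered]
    return hits
```

A non-empty result would be what triggers the early warning; an empty list means the text passed inspection against the current contents of the memory module.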
In use, the text passed to the quality inspection module 5 is usually a whole text in which many words and phrases are repeated, which lengthens the quality inspection performed by the quality inspection module 5 and prolongs the workers' waiting time. As an improvement, the requirement screening module 3 is connected to the quality inspection module 5 through a segmentation module 4. The segmentation module 4 segments the text screened by the requirement screening module 3 so that the original text becomes a plurality of simple words, which greatly shortens the quality inspection time of the quality inspection module 5.
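One way to read this improvement is that segmentation collapses repeated words, so each distinct entry is checked once by a set lookup. The sketch below assumes a simple regular-expression tokenizer; real Chinese text would need a proper word segmenter, which is outside this illustration, and the function names are hypothetical.

```python
import re
from typing import List, Set


def segment(text: str) -> List[str]:
    """Split text into lowercase entries, dropping repeats but keeping first-seen order."""
    seen = set()
    entries = []
    for token in re.findall(r"\w+", text.lower()):
        if token not in seen:
            seen.add(token)
            entries.append(token)
    return entries


def inspect_entries(entries: List[str], illegal_words: Set[str]) -> List[str]:
    """Check each distinct entry once against the stored illegal words."""
    return [entry for entry in entries if entry in illegal_words]
```

Because `illegal_words` is a set, each lookup takes constant time on average, so the inspection cost grows with the number of distinct entries rather than with the full length of the text.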
After the quality inspection module 5 detects a text containing illegal content, the early warning module 6 issues a warning for that text to alert the user to its risk and to discourage the user from searching for it. After the early warning module 6 has issued the warning, the output module 8 outputs the texts.
Texts on the Internet come in many styles. Standardized texts usually share the same format and differ only in content, but the quality inspection module 5 only screens out illegal texts and does not unify formats, so the output module 8 outputs texts of the same type in a variety of formats. Although the user can still understand them, they are less comfortable to read. As an improvement, a conversion module 7 is connected between the early warning module 6 and the output module 8. The conversion module 7 converts the texts that have passed the quality inspection module 5 into standardized texts, which greatly improves the user's reading experience.
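The application does not define what a standardized text looks like. As one possible reading, the sketch below assumes the conversion module maps every text onto the same three fields and normalizes Unicode width variants and whitespace; the field names and the choice of NFKC normalization are assumptions.

```python
import unicodedata
from typing import Dict


def to_standard(title: str, body: str, source: str) -> Dict[str, str]:
    """Map a text onto one fixed layout with normalized characters and spacing."""

    def clean(value: str) -> str:
        # NFKC folds full-width and other compatibility characters to a common form.
        value = unicodedata.normalize("NFKC", value)
        # Collapse every run of whitespace (tabs, newlines, wide spaces) to one space.
        return " ".join(value.split())

    return {"title": clean(title), "source": clean(source), "body": clean(body)}
```

With every output sharing one layout, the output module can present texts from different sources side by side without per-source formatting rules.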
A feedback module 81 is also connected to the output module 8. The illegal words already in the memory module 51 were entered by designers before use, and as time passes the words and websites in the memory module 51 may no longer be sufficient for quality inspection of the texts screened by the requirement screening module 3. After the output module 8 outputs a text, a user who finds illegal words in it can report them to the memory module 51 through the feedback module 81, and the memory module 51 stores them. Words stored through the feedback module 81 are also used by the quality inspection module 5 for quality inspection. As the system is used, the memory module 51 accumulates more and more words, and the quality inspection performed by the quality inspection module 5 becomes correspondingly more thorough.
To prevent a user from mistakenly reporting legal words as illegal words, an audit module 82 is connected between the feedback module 81 and the memory module 51. The audit module 82 audits the words reported through the feedback module 81, and only words approved by the audit module 82 are stored in the memory module 51. This greatly improves the accuracy with which the quality inspection module 5 inspects the texts screened by the requirement screening module 3.
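The feedback and audit path can be sketched as a pending queue whose entries reach the stored word set only after approval. The class name, the method names and the `auditor` callback are hypothetical; the application does not say whether the audit is manual or automatic.

```python
from typing import Callable, List, Set


class FeedbackLoop:
    """Illustrative feedback module 81 with an audit step before the memory module."""

    def __init__(self, memory_words: Set[str], auditor: Callable[[str], bool]):
        self.memory_words = memory_words  # shared with the quality inspection step
        self.auditor = auditor            # returns True when a reported word is approved
        self.pending: List[str] = []

    def report(self, word: str) -> None:
        """Queue a word the user considers illegal."""
        self.pending.append(word)

    def audit_pending(self) -> List[str]:
        """Store approved, not-yet-known words; return the words that were added."""
        accepted = []
        for word in self.pending:
            if self.auditor(word) and word not in self.memory_words:
                self.memory_words.add(word)
                accepted.append(word)
        self.pending.clear()
        return accepted
```

Keeping reports in a queue until they are audited means a mistaken report never affects quality inspection, which is the purpose the description gives for the audit module.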
A selection preference module 83 is connected to the output module 8. After quality inspection by the quality inspection module 5, the output module 8 may output a plurality of standardized texts. When the user needs only one of them, the selection preference module 83 records the type of standardized text the user selects each time. The selection preference module 83 is also connected to the acquisition module 1 and feeds the selected type back to it each time, and the acquisition module 1 then acquires texts according to the corresponding words and types. This greatly reduces the workload and running time of the subsequent modules and lets the user obtain the desired texts quickly.
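A minimal sketch of the preference loop, assuming each text carries a type label and that the acquisition step simply restricts itself to the most frequently chosen types. The class, the `type` key and the filtering rule are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List


class SelectionPreference:
    """Illustrative selection preference module 83: counts the types a user picks."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, text_type: str) -> None:
        self.counts[text_type] += 1

    def preferred(self, top_n: int = 1) -> List[str]:
        """Return the most frequently selected types, most frequent first."""
        return [text_type for text_type, _ in self.counts.most_common(top_n)]


def collect(candidates: List[Dict[str, str]], preferred_types: List[str]) -> List[Dict[str, str]]:
    """Acquire only candidates of a preferred type; with no history, acquire everything."""
    if not preferred_types:
        return list(candidates)
    return [c for c in candidates if c["type"] in preferred_types]
```

Filtering at acquisition time is what reduces the load on the later modules, since texts of unwanted types never enter topic screening or quality inspection.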
The implementation principle of the standardized text data processing system in the embodiment of the application is as follows. Before use, a designer enters illegal words into the memory module 51 as the quality inspection basis of the quality inspection module 5. In use, the acquisition module 1 acquires a plurality of pieces of text information published on a plurality of information sources on the Internet; the topic screening module 2 takes the text information containing words related to the requirement topic as target texts; and the requirement screening module 3 determines the requirement direction of each target text with respect to the requirement topic from the emotional words in that text. The segmentation module 4 then divides each text into a plurality of words, and the quality inspection module 5 performs quality inspection on the divided words. When a divided word coincides with a word in the memory module 51, the early warning module 6 indicates that the text carries risk. When the text contains no illegal words, the conversion module 7 converts it into a standardized text, which is finally output through the output module 8. If a text output by the output module 8 still contains illegal words, the user reports them to the audit module 82 through the feedback module 81, and the memory module 51 stores the words once the audit module 82 has approved them. The selection preference module 83 records the texts the user prefers and feeds them back to the acquisition module 1, which then acquires information according to the user's preferences.
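The order of operations described above can be sketched end to end as follows. This is a deliberately reduced illustration: requirement screening, feedback and preference handling are omitted, the tokenizer is a simple regular expression, and all names are hypothetical.

```python
import re
from typing import Dict, List, Set


def run_pipeline(texts: List[str], topic_words: Set[str],
                 illegal_words: Set[str]) -> Dict[str, List[str]]:
    """Topic screening, segmentation, quality inspection, warning, conversion, output."""
    output: List[str] = []
    warnings: List[str] = []
    for text in texts:
        entries = set(re.findall(r"\w+", text.lower()))  # segmentation into entries
        if not entries & topic_words:                    # topic screening
            continue
        if entries & illegal_words:                      # quality inspection
            warnings.append(text)                        # early warning: flag risky text
            continue
        output.append(" ".join(text.split()))            # conversion: unify whitespace
    return {"output": output, "warnings": warnings}
```

For example, with the topic word "phone" and the stored illegal word "scamword", a text mentioning both lands in `warnings`, an off-topic text is dropped, and a clean on-topic text is returned in `output` with its spacing normalized.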
The above embodiments are preferred embodiments of the present application, and the protection scope of the present application is not limited by the above embodiments, so: all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.

Claims (6)

1. A standardized text data processing system, characterized in that: the system comprises an acquisition module (1), a topic screening module (2), a requirement screening module (3), a quality inspection module (5), an early warning module (6) and an output module (8);
the acquisition module (1) is used for acquiring at least one piece of text information published on at least one information source on the Internet;
the topic screening module (2) is used for taking the text information containing words related to the requirement topic as a target text;
the requirement screening module (3) is used for determining the requirement direction of each target text with respect to the requirement topic from the emotional words in the target text;
the quality inspection module (5) is used for performing quality inspection on the screened target text and determining the illegal words contained in the target text;
the early warning module (6) is used for issuing a warning for the illegal texts detected by the quality inspection module (5) to prompt a user that the corresponding texts carry risk;
and the output module (8) is used for outputting the screened texts to the user.
2. A standardized text data processing system according to claim 1, characterized in that: a memory module (51) is connected to the quality inspection module (5), and before use, a designer can enter illegal words into the memory module (51) as the quality inspection basis of the quality inspection module (5).
3. A standardized text data processing system according to claim 2, characterized in that: a feedback module (81) is connected to the output module (8), and the feedback module (81) is connected to the memory module (51); the feedback module (81) can enter illegal words reported by the user into the memory module (51), and the illegal words entered through the feedback module (81) also serve as the quality inspection basis of the quality inspection module (5).
4. A standardized text data processing system according to claim 1, characterized in that: a selection preference module (83) is connected to the output module (8), and the selection preference module (83) is further connected to the acquisition module (1); after the output module (8) outputs a plurality of texts, the selection preference module (83) records the selection of the user and sends it to the acquisition module (1), and the acquisition module (1) performs targeted acquisition according to the selection of the user.
5. A standardized text data processing system according to claim 1, characterized in that: a segmentation module (4) is connected between the requirement screening module (3) and the quality inspection module (5), the segmentation module (4) segments the text screened by the requirement screening module (3) into a plurality of entries, and the quality inspection module (5) performs quality inspection on the entries segmented by the segmentation module (4).
6. A standardized text data processing system according to claim 1, characterized in that: a conversion module (7) is connected between the early warning module (6) and the output module (8), the conversion module (7) converts the screened texts into standardized texts, and the output module (8) outputs the standardized texts converted by the conversion module (7).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111047940.2A CN113779250A (en) 2021-09-08 2021-09-08 Standardized text data processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111047940.2A CN113779250A (en) 2021-09-08 2021-09-08 Standardized text data processing system

Publications (1)

Publication Number Publication Date
CN113779250A true CN113779250A (en) 2021-12-10

Family

ID=78841828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111047940.2A Pending CN113779250A (en) 2021-09-08 2021-09-08 Standardized text data processing system

Country Status (1)

Country Link
CN (1) CN113779250A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
CN102253988A (en) * 2011-06-30 2011-11-23 北京新媒传信科技有限公司 Method for filtering sensitive words in network text service
US20170046434A1 (en) * 2014-05-01 2017-02-16 Sha LIU Universal internet information data mining method
CN105740302A (en) * 2014-12-12 2016-07-06 北京海尔广科数字技术有限公司 Screening method and system for demand information
US20160299955A1 (en) * 2015-04-10 2016-10-13 Musigma Business Solutions Pvt. Ltd. Text mining system and tool
CN106372168A (en) * 2016-08-30 2017-02-01 湖北银速物联网科技有限公司 Data processing system based on internet
CN111309855A (en) * 2019-12-24 2020-06-19 中国银行股份有限公司 Text information processing method and system
CN112364216A (en) * 2020-11-23 2021-02-12 上海竞信网络科技有限公司 Edge node content auditing and filtering system and method
CN113360566A (en) * 2021-08-06 2021-09-07 成都明途科技有限公司 Information content monitoring method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392826A (en) * 2023-12-11 2024-01-12 吉林大学 Network information early warning method and system based on big data
CN117392826B (en) * 2023-12-11 2024-02-13 吉林大学 Network information early warning method and system based on big data

Similar Documents

Publication Publication Date Title
JP4398992B2 (en) Information search apparatus, information search method, and information search program
CN104025077B (en) The real-time natural language processing of data flow
US9235643B2 (en) Method and system for generating search results from a user-selected area
US20130018894A1 (en) System and method of sentiment data generation
US20090216524A1 (en) Method and system for estimating a sentiment for an entity
CN109446376B (en) Method and system for classifying voice through word segmentation
US20040163035A1 (en) Method for automatic and semi-automatic classification and clustering of non-deterministic texts
US9015168B2 (en) Device and method for generating opinion pairs having sentiment orientation based impact relations
KR20190076381A (en) Healthy content recommendation service system using big datas
US20130018874A1 (en) System and method of sentiment data use
US20120221324A1 (en) Document Processing Apparatus
JP2001075966A (en) Data analysis system
Murray et al. Interpretation and transformation for abstracting conversations
JP2020135891A (en) Methods, apparatus, devices and media for providing search suggestions
US9256805B2 (en) Method and system of identifying an entity from a digital image of a physical text
CN107632974B (en) Chinese analysis platform suitable for multiple fields
US20190215579A1 (en) Derivative media content systems and methods
US20240104302A1 (en) Minutes processing method and apparatus, device, and storage medium
CN113779250A (en) Standardized text data processing system
US10499121B2 (en) Derivative media content systems and methods
Mahmud et al. Comparison of machine learning algorithms for sentiment classification on fake news detection
CN111435375A (en) Threat information automatic labeling method based on FastText
CN117420998A (en) Client UI interaction component generation method, device, terminal and medium
CN111859032A (en) Method and device for detecting character-breaking sensitive words of short message and computer storage medium
CN111090977A (en) Intelligent writing system and intelligent writing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination