CN113821594A - Text processing method and device and readable storage medium - Google Patents

Info

Publication number
CN113821594A
Authority
CN
China
Prior art keywords
keyword
word
text
target
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110796094.8A
Other languages
Chinese (zh)
Inventor
刘志煌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110796094.8A
Publication of CN113821594A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a text processing method and related equipment that can improve the accuracy of text distribution. The method comprises: matching each word in a target text against a positive sample matching relationship and a negative sample matching relationship of a first subject, wherein the positive sample matching relationship comprises a matching relationship between keywords of the first subject and their support degrees, and the negative sample matching relationship comprises a matching relationship between keywords of a second subject and their support degrees; if the matching fails, determining mutual information between each word and a first keyword, and mutual information between each word and a second keyword; determining an association score of the target text according to the mutual information between each word and the first keyword, the support degree of the first keyword, the mutual information between each word and the second keyword, and the support degree of the second keyword; and if the association score of the target text meets a text distribution condition, distributing the target text to the first subject.

Description

Text processing method and device and readable storage medium
Technical Field
The present application relates to the field of digital government affairs, and in particular, to a text processing method, device and readable storage medium.
Background
In the field of digital government affairs, automatic document distribution is essential to the digital transformation of government affairs and the online handling of civil services. New government and service concepts that combine civil services and government services with the internet have accelerated the digital development of governments. The large amount of government data generated by civil services and social governance, such as data on civil affairs handling, official documents, and digital services, needs to be better mined and analyzed so that the government affairs industry can truly become intelligent and so that handling affairs becomes more convenient for both the public and government workers.
At present, classification and grading systems for electronic official documents are mainly built with template-based classification methods. Such a method constructs a sensitive word bank and matching rules for the document distribution labels, learns from the input sensitive words and an imported source file to generate a template, and then performs sensitive word matching and rule identification on texts according to the exported template to classify the official documents, thereby realizing automatic distribution of the texts.
However, template-based electronic official document classification depends heavily on manually specified rules and templates, and constructing the sensitive word bank and matching rules consumes considerable time and labor. Because rules are inherently limited and official document texts are free in format, the generalization capability of the constructed rules tends to decline over time and their universality is insufficient, so that many official documents cannot be accurately distributed.
Disclosure of Invention
The application provides a text processing method and device and a readable storage medium, which improve the accuracy of official document text distribution.
A first aspect of the embodiments of the present application provides a text processing method, including:
matching each word in a target text against a positive sample matching relationship and a negative sample matching relationship of a first subject, wherein the positive sample matching relationship comprises a matching relationship between keywords of the first subject and their support degrees, and the negative sample matching relationship comprises a matching relationship between keywords of a second subject and their support degrees;
if the matching fails, determining mutual information between each word and a first keyword, and determining mutual information between each word and a second keyword, wherein the first keyword is the keyword with the most characters among the keywords of the first subject, and the second keyword is the keyword with the most characters among the keywords of the second subject;
determining an association score of the target text according to the mutual information between each word and the first keyword, the support degree of the first keyword, the mutual information between each word and the second keyword, and the support degree of the second keyword, wherein the association score represents the degree of association between the target text and the first subject;
and if the association score of the target text meets a text distribution condition, distributing the target text to the first subject.
A second aspect of the embodiments of the present application provides a text processing apparatus, including:
a matching unit, configured to match each word in a target text against a positive sample matching relationship and a negative sample matching relationship of a first subject, wherein the positive sample matching relationship comprises a matching relationship between keywords of the first subject and their support degrees, and the negative sample matching relationship comprises a matching relationship between keywords of a second subject and their support degrees;
a first determining unit, configured to determine, if the matching fails, mutual information between each word and a first keyword and mutual information between each word and a second keyword, wherein the first keyword is the keyword with the most characters among the keywords of the first subject, and the second keyword is the keyword with the most characters among the keywords of the second subject;
a second determining unit, configured to determine an association score of the target text according to the mutual information between each word and the first keyword, the support degree of the first keyword, the mutual information between each word and the second keyword, and the support degree of the second keyword, wherein the association score represents the degree of association between the target text and the first subject;
and a distribution unit, configured to distribute the target text to the first subject if the association score of the target text meets a text distribution condition.
In one possible design, the second determining unit is specifically configured to:
determining a first association score of each word according to the mutual information of each word and the first keyword and the support degree of the first keyword;
determining a second association score of each word according to the mutual information of each word and the second keyword and the support degree of the second keyword;
and determining the association score of the target text according to the first association score of each word and the second association score of each word.
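The way the per-word scores are combined is not spelled out in this passage, so the following is only a minimal sketch under stated assumptions: each per-word score is taken to be mutual information multiplied by keyword support, and the text score is taken to be the sum of first scores minus the sum of second scores. The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, Iterable


def word_score(mutual_info: float, support: float) -> float:
    # Assumption: a word's score is its mutual information with the keyword,
    # weighted by that keyword's support degree.
    return mutual_info * support


def association_score(
    words: Iterable[str],
    mi_first: Dict[str, float],
    mi_second: Dict[str, float],
    support_first: float,
    support_second: float,
) -> float:
    """Combine per-word first and second scores into one text-level score."""
    words = list(words)
    # First association scores: evidence that the text belongs to the first subject.
    first_total = sum(word_score(mi_first.get(w, 0.0), support_first) for w in words)
    # Second association scores: evidence that it belongs to the second subject.
    second_total = sum(word_score(mi_second.get(w, 0.0), support_second) for w in words)
    # Assumption: the text score is the difference between the two totals.
    return first_total - second_total
```

Under these assumptions, a positive result suggests the target text leans toward the first subject, and the text distribution condition could be a threshold on this value.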
In one possible design, the first determination unit is further configured to:
if the matching is successful, determining a first keyword set matched by each word in the positive sample matching relationship, and determining a second keyword set matched by each word in the negative sample matching relationship;
determining, among the first keywords in the first keyword set, a first target keyword that is the hit keyword with the maximum number of characters in the positive sample matching relationship, and determining, among the second keywords in the second keyword set, a second target keyword that is the hit keyword with the maximum number of characters in the negative sample matching relationship;
determining the number of first sample clauses hit by the first target keyword in the sample clause set associated with the positive sample matching relationship, the number of second sample clauses hit by the second target keyword in the sample clause set associated with the negative sample matching relationship, and the target number of all sample clauses in the sample clause set associated with the positive sample matching relationship;
determining the support degree weight of the target text according to the number of first sample clauses, the number of second sample clauses, and the target number;
and distributing the target text according to the support degree weight of the target text.
In one possible design, the determining, by the first determining unit, the support degree weight of the target text according to the number of first sample clauses, the number of second sample clauses, and the target number includes:
determining the positive support degree weight of the target text according to the number of first sample clauses and the target number;
determining the negative support degree weight of the target text according to the number of second sample clauses and the target number;
and determining the support degree weight of the target text according to the positive support degree weight of the target text and the negative support degree weight of the target text.
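As an illustration of the three steps above, the sketch below assumes that each weight is a simple ratio of hit clauses to the target number and that the final weight is the positive weight minus the negative weight. The exact formulas are not given in this passage, so both choices, and the function name, are assumptions.

```python
def support_weight(first_count: int, second_count: int, target_count: int) -> float:
    """Support degree weight of a target text from clause hit counts."""
    if target_count <= 0:
        raise ValueError("target_count must be positive")
    # Positive weight: share of sample clauses hit by the first target keyword.
    positive = first_count / target_count
    # Negative weight: share of sample clauses hit by the second target keyword.
    negative = second_count / target_count
    # Assumption: the two weights are combined by subtraction.
    return positive - negative
```

For example, with 6 positive hits, 2 negative hits, and 8 sample clauses in total, this sketch yields a weight of 0.5.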
In one possible design, the apparatus further includes:
a third determination unit for:
acquiring a training text set, wherein the training text set comprises training texts associated with the first subject and training texts associated with the second subject;
dividing each text in the training text set into clauses to obtain a clause set corresponding to each text;
processing the clause set corresponding to each text to obtain a first word sequence corresponding to each text;
removing, from the first word sequence, the words whose support degree is below a support degree threshold to obtain a second word sequence corresponding to each text;
determining the keywords of the second word sequence and the support degree corresponding to each keyword;
determining the keywords of the second word sequence corresponding to the first subject, together with the support degrees corresponding to those keywords, as the positive sample matching relationship of the first subject;
and determining the keywords of the second word sequence corresponding to the second subject, together with the support degrees corresponding to those keywords, as the negative sample matching relationship of the first subject.
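The construction of the two matching relationships can be sketched as follows. This is a simplified illustration, not the patented procedure: it treats every single word as a candidate keyword, defines support as the fraction of a subject's clauses that contain the word (an assumption), and merges all other subjects into one negative pool. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Clause = List[str]


def keyword_support(clauses: List[Clause], threshold: float) -> Dict[str, float]:
    """Map each word to its support, keeping only words at or above threshold."""
    counts: Counter = Counter()
    for clause in clauses:
        counts.update(set(clause))  # count each word at most once per clause
    total = len(clauses)
    if total == 0:
        return {}
    return {w: c / total for w, c in counts.items() if c / total >= threshold}


def matching_relations(
    clauses_by_subject: Dict[str, List[Clause]],
    first_subject: str,
    threshold: float,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (positive, negative) keyword-to-support maps for first_subject."""
    positive = keyword_support(clauses_by_subject[first_subject], threshold)
    # Clauses of every other subject act as negative samples for first_subject.
    negative_clauses = [
        clause
        for subject, clauses in clauses_by_subject.items()
        if subject != first_subject
        for clause in clauses
    ]
    negative = keyword_support(negative_clauses, threshold)
    return positive, negative
```

A real implementation would presumably store multi-word keywords as well; the single-word form is used here only to keep the positive and negative roles easy to see.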
In one possible design, the processing, by the third determining unit, the set of clauses corresponding to each text to obtain the first word sequence corresponding to each text includes:
filtering stop words of a clause set corresponding to each text based on a preset stop word bank;
carrying out named entity recognition on a clause set corresponding to each text after the stop words are filtered;
and splitting a clause set corresponding to each text after the named entity is identified according to word units to obtain a first word sequence.
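A minimal sketch of this preprocessing is given below, assuming punctuation-based clause splitting and whitespace tokenization; the named entity recognition step is omitted because no specific recognizer is named in the text, and for simplicity stop words are filtered after tokenization. The delimiter set, the stop word list, and the function name are illustrative assumptions.

```python
import re
from typing import List, Set

# Assumed clause delimiters: common Western and Chinese punctuation marks.
CLAUSE_DELIMITERS = r"[,.;!?，。；！？]"


def to_word_sequences(text: str, stop_words: Set[str]) -> List[List[str]]:
    """Split text into clauses, drop stop words, and return word units per clause."""
    clauses = [c.strip() for c in re.split(CLAUSE_DELIMITERS, text) if c.strip()]
    sequences = []
    for clause in clauses:
        # Whitespace tokenization stands in for a real word segmenter here.
        words = [w for w in clause.lower().split() if w not in stop_words]
        if words:
            sequences.append(words)
    return sequences
```

For Chinese official documents a word segmenter would replace the whitespace split, and recognized named entities would be kept as single word units.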
In one possible design, the determining, by the third determining unit, the keyword of the second word sequence includes:
determining third keywords having i characters in the second word sequence and, in a target word unit set, the keyword set associated with the third keywords, wherein the target word unit set is the word unit set corresponding to at least one of the first subject and the second subject, and i is a value of the number of characters of the keywords in the second word sequence;
removing, from the keyword set, the keywords whose support degree is below the support degree threshold;
and recursing on the number of characters of the keywords in the second word sequence until the keyword with the maximum number of characters in the second word sequence, and the keyword set corresponding to that keyword in the target word unit set, are determined, so as to obtain the keywords of every number of characters in the second word sequence.
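The recursion described above resembles prefix-projection sequential pattern mining. The sketch below uses a PrefixSpan-style projection, which is a swapped-in, well-known technique chosen for illustration and not confirmed by the text as the exact procedure. Here the pattern length is counted in word units, and support is assumed to be the fraction of sequences containing the pattern as an ordered subsequence.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def mine_frequent_sequences(
    sequences: List[List[str]], min_support: float
) -> Dict[Pattern, float]:
    """Return every ordered word pattern whose support is at least min_support."""
    total = len(sequences)
    results: Dict[Pattern, float] = {}

    def grow(prefix: Pattern, projected: List[List[str]]) -> None:
        # Count, per word, how many projected suffixes contain it.
        counts: Dict[str, int] = {}
        for suffix in projected:
            for word in set(suffix):
                counts[word] = counts.get(word, 0) + 1
        for word, count in sorted(counts.items()):
            support = count / total
            if support < min_support:
                continue  # prune patterns below the support threshold
            pattern = prefix + (word,)
            results[pattern] = support
            # Project: keep what follows the first occurrence of the word.
            next_projected = [s[s.index(word) + 1:] for s in projected if word in s]
            grow(pattern, next_projected)  # recurse to patterns one unit longer

    if total > 0:
        grow((), [list(s) for s in sequences])
    return results
```

Each recursion level extends the patterns by one word unit, so the longest surviving pattern plays the role of the keyword with the maximum number of characters.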
Another aspect of the embodiments of the present application provides a computer device, which includes at least one processor, a memory, and a transceiver that are connected to one another, wherein the memory is used for storing program code, and the processor is used for calling the program code in the memory to execute the steps of the text processing method according to the above aspects.
Another aspect of the embodiments of the present application provides a computer storage medium including instructions that, when executed on a computer, cause the computer to perform the steps of the text processing method according to the above aspects.
In summary, in the embodiments of the present application, each word of the target text is matched against the positive sample matching relationship and the negative sample matching relationship of a first subject. If the matching fails, the mutual information between each word in the target text and the keyword with the largest number of characters among the keywords of the first subject is calculated, as is the mutual information between each word and the keyword with the largest number of characters among the keywords of a second subject. An association score of the target text is then determined from these two kinds of mutual information and from the support degrees of the two keywords, where the association score represents the degree of association between the target text and the first subject. If the association score of the target text meets the text distribution condition, the target text is distributed to the first subject. Therefore, rules and templates do not need to be formulated manually when a text is distributed. Instead, the positive and negative sample matching relationships of a subject are introduced, the association score of the target text is calculated from the mutual information between each word and the keyword with the largest number of characters together with the support degrees, and the target text is distributed based on the association score. This addresses the problem that official document texts cannot be accurately distributed because manually formulated rules and templates generalize poorly.
Drawings
Fig. 1 is a network architecture diagram of a text distribution system according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an embodiment of a text processing method provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of another embodiment of a text processing method provided in an embodiment of the present application;
Fig. 4 is a schematic view of a virtual structure of a text processing apparatus according to an embodiment of the present application;
Fig. 5 is a schematic hardware structure diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
The terms "first," "second," and the like in the description, the claims, and the above drawings of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It will be appreciated that data so used may be interchanged under appropriate circumstances, such that the embodiments described herein may be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprise," "include," and "have," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The division of modules presented herein is merely a logical division and may be implemented in other ways in a practical application; for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communicative connections shown or discussed may be made through interfaces, and indirect couplings or communicative connections between modules may be electrical or of other similar forms, which is not limited in this application. The modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed across a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the present application.
Referring to fig. 1, fig. 1 is a schematic diagram of the architecture of a text distribution system in an embodiment of the present application. As shown in fig. 1, the system includes a user 101, a client 102, a network 103, a server 104, and K databases 105. Each of the K databases stores the positive sample matching relationship of one subject; the positive sample matching relationship of one subject serves as a negative sample matching relationship for the other subjects, and the positive sample matching relationship of a subject includes a matching relationship between the keywords of that subject and their support degrees.
The user 101 inputs a target text through the client 102, and the client 102 uploads the target text to the server 104 through the network 103. After obtaining the target text, the server 104 processes it to obtain its word segments and matches each word segment against the positive sample matching relationship and the negative sample matching relationship of a first subject. If the matching fails, the server 104 calculates the mutual information between each word in the target text and the keyword with the maximum number of characters among the keywords of the first subject, and the mutual information between each word and the keyword with the maximum number of characters among the keywords of a second subject. It then determines the association score of the target text from these two kinds of mutual information and from the support degrees of the two keywords. The association score represents the degree of association between the target text and the first subject, and if the association score of the target text meets the text distribution condition, the target text is distributed to the database 105 corresponding to the first subject.
When the matching is successful, the server 104 determines a first keyword set matched by each word in the positive sample matching relationship and a second keyword set matched by each word in the negative sample matching relationship; determines, among the first keywords in the first keyword set, a first target keyword that is the hit keyword with the maximum number of characters in the positive sample matching relationship, and, among the second keywords in the second keyword set, a second target keyword that is the hit keyword with the maximum number of characters in the negative sample matching relationship; determines the number of first sample clauses hit by the first target keyword in the sample clause set associated with the positive sample matching relationship, the number of second sample clauses hit by the second target keyword in the sample clause set associated with the negative sample matching relationship, and the target number of all sample clauses in the sample clause set associated with the positive sample matching relationship; determines the support degree weight of the target text according to the number of first sample clauses, the number of second sample clauses, and the target number; and finally distributes the target text to the database 105 of the corresponding subject according to the support degree weight of the target text.
In summary, when a text is distributed in the present application, rules and templates do not need to be formulated manually. Instead, a positive sample matching relationship and a negative sample matching relationship of a subject are introduced. When matching fails, the association score of the target text is calculated from the mutual information between each word and the keyword with the largest number of characters in the positive sample matching relationship, the mutual information between each word and the keyword with the largest number of characters in the negative sample matching relationship, and the support degrees of those keywords, and the target text is distributed based on the association score. When each word is successfully matched against the keywords in the positive and negative sample matching relationships, a positive support degree weight is calculated for the keywords matched in the positive sample matching relationship and a negative support degree weight for the keywords matched in the negative sample matching relationship, the support degree weight of the target text is determined from these two weights, and the target text is distributed according to the support degree weight. This addresses the problem that official document texts cannot be accurately distributed because manually formulated rules and templates generalize poorly.
Referring to fig. 2, fig. 2 is a schematic diagram of an embodiment of a text processing method according to an embodiment of the present application, including:
201. Constructing a document distribution database to obtain the training sets of the distribution departments.
In this embodiment, a document distribution database may be constructed and the training sets of a plurality of distribution departments may be obtained. The distribution departments may be, for example, the "city exclusive agency", the "city education bureau", and the "city management committee", and the training set of a department consists of the official document texts of that department. For example, an official document text of the "city exclusive agency" may be "I am reporting a first abnormal case to a specific organization and hope that a case will be filed as soon as possible."
202. Mining the positive pattern features and negative pattern features of the distribution departments through sequential pattern mining.
In this embodiment, the positive pattern features and negative pattern features of the distribution departments can be mined from text sequence patterns, where the negative patterns of one distribution department are the positive patterns of the other distribution departments. Based on the training positive samples and training negative samples of the official documents of each distribution department, frequent word sequence pattern mining is used to obtain the positive sequence pattern features and the negative sequence pattern features of that department's official documents.
203. Matching the official document to be predicted against the positive pattern features and negative pattern features to obtain the support degree weight.
In this embodiment, after the positive pattern features and negative pattern features of each distribution department have been mined, the official document to be predicted may be matched against them, and if the matching is successful, the support degree weight corresponding to the official document to be predicted is determined.
204. Calculating the mutual information between the official document to be predicted and the positive pattern features, and the mutual information between the official document to be predicted and the negative pattern features.
In this embodiment, if no feature of the official document to be predicted is matched in the positive pattern features and negative pattern features of the distribution departments, the mutual information between each word in the official document to be predicted and the positive pattern features of each distribution department is calculated, as is the mutual information between each word and the negative pattern features of each distribution department.
205. Distributing the official document to be predicted according to the support degrees and the mutual information.
In this embodiment, after the mutual information between each word and the positive pattern features of each distribution department and the mutual information between each word and the negative pattern features of each distribution department are obtained, the official document to be predicted may be distributed according to these two kinds of mutual information, the support degrees of the positive pattern features, and the support degrees of the negative pattern features.
In summary, in the embodiments provided by the present application, the positive pattern features and negative pattern features of each distribution department are constructed and matched against the official document to be predicted. If the matching is successful, the support degree weight is obtained and the official document to be predicted is distributed according to it. If the official document to be predicted cannot be matched, the mutual information weight between each word in the document and the positive pattern features of each distribution department and the mutual information weight between each word and the negative pattern features of each distribution department are calculated, and the document is distributed according to these weights. Therefore, rules and templates do not need to be formulated manually when a text is distributed, which avoids the problem that official documents cannot be accurately distributed because manually formulated rules and templates generalize poorly.
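The mutual information used in step 204 is not given as a formula in this passage. One common way to estimate it, shown here only as an assumed illustration, is pointwise mutual information (PMI) computed from clause-level co-occurrence counts in the training clauses. The function name and the convention of returning 0.0 when the two terms never co-occur are assumptions of this sketch.

```python
import math
from typing import List, Set


def pmi(word: str, keyword: str, clauses: List[Set[str]]) -> float:
    """Pointwise mutual information of word and keyword over a clause collection."""
    total = len(clauses)
    if total == 0:
        return 0.0
    n_word = sum(1 for c in clauses if word in c)
    n_keyword = sum(1 for c in clauses if keyword in c)
    n_both = sum(1 for c in clauses if word in c and keyword in c)
    if n_both == 0:
        # Assumption: no co-occurrence is treated as zero association
        # rather than negative infinity.
        return 0.0
    p_word = n_word / total
    p_keyword = n_keyword / total
    p_both = n_both / total
    return math.log2(p_both / (p_word * p_keyword))
```

A higher value indicates that the word and the keyword appear together more often than chance would predict, which is the kind of evidence the fallback branch relies on when direct matching fails.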
Based on the above description, the text processing method in the present application is described below from the perspective of a text processing apparatus, which may be a server or a service unit in the server.
Referring to fig. 3, fig. 3 is a schematic diagram of another embodiment of a text processing method according to an embodiment of the present application, including:
301. Matching each word in the target text based on the positive sample matching relationship and the negative sample matching relationship of the first subject.
In this embodiment, the text processing apparatus may obtain a target text, where the target text is an official document text to be distributed; for example, the target text may be "Construction starts at 8 a.m. on holidays at a certain intersection on a certain road, and the noise disturbs residents". The text processing apparatus may then process the official document text to be distributed, where the processing includes but is not limited to dividing the target text into clauses according to punctuation, filtering stop words, and performing named entity recognition, and finally obtains the word set corresponding to the target text. (It may be understood that, to simplify matching, a threshold may be set, for example 3, and words appearing fewer than 3 times in the target text are removed; the threshold may also be set according to the actual situation.) The apparatus then matches each word in the target text against the pre-constructed positive sample matching relationship and negative sample matching relationship of the first subject, where the positive sample matching relationship comprises a matching relationship between the keywords of the first subject and their support degrees, and the negative sample matching relationship comprises a matching relationship between the keywords of the second subject and their support degrees.
The following describes the construction of the positive sample matching relationship and the negative sample matching relationship of the first subject:
Step 1, obtaining a training text set, where the training text set includes training samples associated with a first subject and training samples associated with a second subject.
In this step, the text processing apparatus may obtain training samples associated with the first subject and training samples associated with the second subject. The first subject may be, for example, "city exclusive agency", and the second subject is every subject other than "city exclusive agency", such as "city education bureau", "city science and technology bureau", and "city management committee". That is, the text processing apparatus may obtain the historically reviewed and approved official document text data corresponding to each subject and use it as the training text set for the positive sample matching relationship and the negative sample matching relationship of the first subject. For each subject, N related case texts (i.e., official document texts) are collected as positive training samples, and the subjects are labeled with category identifiers (IDs), for example 0 to n, to construct the official document text data in Table 1 below:
TABLE 1
Figure RE-GDA0003282774310000091
Figure RE-GDA0003282774310000101
The official document texts and their corresponding subjects are illustrated below by way of example, with reference to Table 2:
TABLE 2
Figure RE-GDA0003282774310000102
Figure RE-GDA0003282774310000111
After the positive training samples of each subject are determined, the official documents of the other subjects serve as the negative samples of that subject. For example, an official document of the subject "city exclusive agency" is a positive sample of "city exclusive agency", and an official document of "city education bureau" is a negative sample of "city exclusive agency".
Step 2, performing clause segmentation on each text in the training text set to obtain a clause set corresponding to each text.
In this embodiment, after obtaining the training text set, the text processing apparatus may perform clause segmentation on each text in the training text set. Each official document text is a single recognition object, but mining the keyword features in the text requires splitting that object into clauses; the keywords mined from each clause together form the pattern feature set of the whole official document text. Clause segmentation splits the official document text at punctuation marks matched by regular expressions, yielding a clause set for each official document text. For example, segmenting the official document text "I report to the exclusive agency of a certain district to reflect a first abnormal means case, and hope the case is filed as soon as possible" yields "I report to the exclusive agency of a certain district to reflect a first abnormal means case" and "hope the case is filed as soon as possible".
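The clause segmentation described above can be sketched as follows. This is only a minimal illustration: the punctuation set `CLAUSE_DELIMITERS` and the function name `split_clauses` are assumptions, since the application only states that clauses are obtained by regular-expression matching of punctuation marks.

```python
import re

# Hypothetical punctuation set (ASCII and full-width marks); the application
# does not list the exact punctuation used for splitting.
CLAUSE_DELIMITERS = re.compile(r"[,.;!?，。；！？]+")

def split_clauses(text: str) -> list[str]:
    """Split a document text into non-empty clauses at punctuation marks."""
    return [part.strip() for part in CLAUSE_DELIMITERS.split(text) if part.strip()]

clauses = split_clauses(
    "I report a case to the exclusive agency of a certain district, "
    "and hope the case is filed as soon as possible."
)
```

Each element of `clauses` would then be processed independently in the following steps.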
Step 3, processing the clause set corresponding to each text to obtain a first word sequence corresponding to each text.
In this embodiment, after obtaining the clause set corresponding to each text, the text processing apparatus may process the clause set to obtain the first word sequence corresponding to each text. Specifically, the text processing apparatus may filter stop words from the clause set corresponding to each text based on a preset stop word lexicon, where stop word filtering includes but is not limited to filtering out useless information such as dates and times, personal names, mailboxes, and mobile phone numbers. Named entity recognition is then performed on the clause set after stop word filtering. For the distribution of official document texts, organization name entities appearing in the text, such as the names of exclusive agencies, often contribute greatly to the automatic classification and distribution of official documents. Organization names are therefore recognized with a Named Entity Recognition (NER) tool to obtain the organization name entities in the official document text, and each occurrence of the same organization name entity is replaced with a uniform code, for example "#" for an exclusive agency, so that organization name entities are not split apart when the positive sample matching relationship and the negative sample matching relationship are constructed. For ease of understanding, stop word filtering and named entity recognition are described below with reference to specific examples; Tables 3 and 4 show positive official document samples of the subject "city exclusive agency" after stop word filtering and named entity recognition:
TABLE 3
Figure RE-GDA0003282774310000121
TABLE 4
Figure RE-GDA0003282774310000131
After stop word filtering and named entity recognition are performed on an official document text, the clause set corresponding to each text can be split into word units to obtain the first word sequence. It can be understood that a word unit corresponds to a character if the text is Chinese and to a word if the text is English, which is not specifically limited. For example, the clauses "i alarms and reflects a first abnormal means case to a certain area #" and "hopes to put a case as soon as possible" of sample 1 in Table 4 are split into word units to obtain: "i.
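The preprocessing of a single clause (stop word filtering, replacement of organization names by a uniform code, and splitting into word units) might look like the sketch below. The stop word list `STOP_WORDS`, the lookup table `ORG_CODES`, and the function `to_word_units` are hypothetical; in particular, the dictionary lookup merely stands in for the NER tool mentioned above, and English words are used as word units.

```python
# Hypothetical stop word lexicon and organization-name table (assumptions).
STOP_WORDS = {"the", "a", "to", "of", "please"}
ORG_CODES = {"district exclusive agency": "#"}

def to_word_units(clause: str) -> list[str]:
    """Replace known organization names with their code, then drop stop words."""
    lowered = clause.lower()
    for name, code in ORG_CODES.items():
        # Pad the code with spaces so it survives as one word unit.
        lowered = lowered.replace(name, f" {code} ")
    return [unit for unit in lowered.split() if unit not in STOP_WORDS]

units = to_word_units("Please report the fraud case to the District Exclusive Agency")
```

Because the organization name is replaced before splitting, it is kept as the single unit "#" rather than being broken into separate words.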
Step 4, removing the keywords in the first word sequence whose support is below the support degree threshold, to obtain a second word sequence corresponding to each text.
In this embodiment, the text processing apparatus may count, for each word unit in all word sequences, the number of sample clauses in which it appears, and filter out the keywords whose support is below the support degree threshold. Assuming the support degree threshold is 1/3, a keyword satisfies the threshold only if it appears in at least 2 of the 6 sample clauses; otherwise it is filtered out. Taking the positive training samples of "city exclusive agency" in Table 3 as an example, the keywords below the support degree threshold are removed from the first word sequence to obtain the second word sequence corresponding to each text. The result is shown in Table 5, which lists each keyword and the number of sample clauses it hits:
TABLE 5
Words and phrases | Table (A table) | Stand | Newspaper | Check the | Core | Piece | Police | Fraud prevention | #
Number of sample clauses | 5 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2
It should be noted that, in the present application, word sequences are the objects of keyword mining. Keywords of every character count that satisfy the support degree threshold in the official document text are mined based on the PrefixSpan algorithm, and the support degree threshold can be calculated by the following formula:
min_sup=a×n
where min_sup is the support degree threshold, n is the number of official documents of a given subject, and a is the minimum support rate, which is adjusted according to the size of the training data set.
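As a rough illustration of the threshold min_sup = a × n, the following sketch counts the number of sample clauses in which each word unit appears and keeps only the units that reach the threshold. The function name `frequent_units` and the sample data are assumptions; n is taken here as the number of sample clauses, following the 6-clause example above.

```python
from collections import Counter

def frequent_units(clauses: list[list[str]], min_rate: float) -> dict[str, int]:
    """Keep word units whose clause count reaches min_sup = min_rate * n."""
    min_sup = min_rate * len(clauses)
    # set(clause) ensures a unit is counted at most once per clause.
    counts = Counter(unit for clause in clauses for unit in set(clause))
    return {unit: count for unit, count in counts.items() if count >= min_sup}

# Hypothetical word-unit sequences standing in for six sample clauses.
sample_clauses = [
    ["#", "report", "fraud", "case"],
    ["case", "file", "verify"],
    ["fraud", "#", "report", "check", "case"],
    ["case", "file"],
    ["hope", "case", "file"],
    ["soon", "case"],
]
kept = frequent_units(sample_clauses, min_rate=1 / 3)
```

With a minimum support rate of 1/3 and 6 clauses, a unit must appear in at least 2 clauses to be kept.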
Step 5, determining the keywords of the second word sequence and the support degrees corresponding to the keywords.
In this step, the text processing apparatus may determine the keywords in the second word sequence and the support degrees corresponding to the keywords, as described below:
Step 51, determining a third keyword whose number of characters is i in the second word sequence, and the keyword set associated with the third keyword in a target word unit set, where the target word unit set is the word unit set corresponding to the first subject or the second subject, and i takes the values of the character counts in the second word sequence. Taking Table 5 as an example, i starts at 1 and increases by 1 with each recursion, and at each value the third keyword with i characters and its associated keyword set in the target word unit set are determined. Table 6 illustrates the case i = 1:
TABLE 6
Figure RE-GDA0003282774310000141
Figure RE-GDA0003282774310000151
Wherein, the target word unit set is shown in table 7:
TABLE 7
Sample 1 # alarm fraud case Plan table
Sample 2 Case Plan verification
Sample 3 Fraud prevention # alarm core filing case
Take "#", whose number of characters is 1, as an example: "#" appears in both sample 1 and sample 3, and its associated keyword sets in the target word unit set are "alarm fraud case" and "alarm core case finding".
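Determining the keyword set associated with a prefix corresponds to what the PrefixSpan literature calls building a projected database: for every sample containing the prefix, the units following its first occurrence are kept. A minimal sketch follows, in which the function `project` and the English sample tokens are illustrative assumptions rather than the contents of Table 7.

```python
def project(sequences: list[list[str]], prefix_unit: str) -> list[list[str]]:
    """Return, for each sequence containing prefix_unit, the units after its first occurrence."""
    suffixes = []
    for seq in sequences:
        if prefix_unit in seq:
            suffixes.append(seq[seq.index(prefix_unit) + 1:])
    return suffixes

# Hypothetical word-unit samples; "#" occurs in the first and third sample.
samples = [
    ["#", "report", "fraud", "case"],
    ["case", "file", "verify"],
    ["fraud", "#", "report", "check", "case"],
]
suffixes = project(samples, "#")
```

The two returned suffixes play the role of the keyword sets associated with "#" in the example above.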
Step 52, removing the keywords in the keyword set whose support is below the support degree threshold.
In this step, the text processing apparatus may remove the keywords in the keyword set whose support is below the support degree threshold. If the support degree threshold is 1/3 (that is, a keyword must appear in at least 1 of every 3 clauses), the keywords below the threshold are removed from the keyword set. Taking the third keyword "#" as an example, the keywords below the threshold are removed from the keyword set associated with "#" in the target word unit set, and the keywords that satisfy the threshold are shown in Table 8. That is, "fraud", "piece", "kernel", "standing", and "checking" in Table 6 do not reach the support degree threshold and are removed, leaving the keywords and sample clause counts shown in Table 8:
TABLE 8
Keywords in a set of keywords | Newspaper | Police | Table (A table)
Number of sample clauses | 2 | 2 | 2
Step 53, recursing on the number of characters of the keywords in the second word sequence until the keyword with the largest number of characters in the second word sequence, and the keyword set corresponding to that keyword in the target word unit set, are determined, so as to obtain the keywords of the second word sequence.
In this step, the text processing apparatus repeats the above determination, increasing the number of characters by 1 with each recursion, until the keyword with the largest number of characters and its corresponding keyword set in the target word unit set have been determined, thereby obtaining the keywords of the second word sequence.
Taking the third keyword with 2 characters obtained after the second recursion on "#" as an example, the following describes how to determine the third keyword with 2 characters and the keyword set associated with it in the target word unit set, as shown in Table 9:
TABLE 9
Figure RE-GDA0003282774310000161
The keyword sets associated with "# newspaper" in the target word unit set are "alarm fraud case" and "alarm verification case lookup"; that is, matching "# newspaper" against the target word unit set in Table 7 yields the corresponding keyword set.
Continuing with the third keyword with 3 characters obtained after three recursions on "#" as an example, the following describes how to determine the third keyword with 3 characters and the keyword set associated with it in the target word unit set, as shown in Table 10:
TABLE 10
Figure RE-GDA0003282774310000162
Taking the first row of Table 10 as an example, the keyword sets associated with "# alarm" in the target word unit set are "fraud case" and "core case finding"; that is, matching "# alarm" against the target word unit set in Table 7 yields the corresponding keyword set.
Continuing with the third keyword with 4 characters obtained after four recursions on "#" as an example, the following describes how to determine the third keyword with 4 characters and the keyword set associated with it in the target word unit set, as shown in Table 11:
TABLE 11
Figure RE-GDA0003282774310000171
The keyword sets associated with "# alarm case" in the target word unit set are "piece" and "lookup"; that is, matching "# alarm case" against the target word unit set in Table 7 yields the corresponding keyword set, and the keyword recursion for "#" ends. In this way, the third keyword for each character count and the keyword set associated with it in the target word unit set can be obtained.
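The recursion of steps 51 to 53 can be sketched as a compact PrefixSpan-style routine for sequences of single word units. This is an illustrative sketch only: the function `prefixspan`, the absolute threshold `min_count`, and the sample data are assumptions, and the application's own implementation details are not given.

```python
from collections import Counter

def prefixspan(sequences, min_count, prefix=()):
    """Yield (pattern, count) for every sequential pattern hit by at least min_count sequences."""
    counts = Counter(unit for seq in sequences for unit in set(seq))
    for unit, count in sorted(counts.items()):
        if count < min_count:
            continue  # step 52: drop units below the support threshold
        pattern = prefix + (unit,)
        yield pattern, count
        # Step 51: project each sequence onto the suffix after the unit.
        projected = [seq[seq.index(unit) + 1:] for seq in sequences if unit in seq]
        # Step 53: recurse with a prefix that is one unit longer.
        yield from prefixspan(projected, min_count, pattern)

# Hypothetical word-unit samples.
samples = [
    ["#", "report", "fraud", "case"],
    ["case", "file", "verify"],
    ["fraud", "#", "report", "check", "case"],
]
patterns = dict(prefixspan(samples, min_count=2))
```

Dividing each count in `patterns` by the total number of sample clauses would give the support degree of the corresponding keyword.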
Step 54, determining the keywords of the second word sequence corresponding to the first subject and the support degrees corresponding to the keywords as the positive sample matching relationship of the first subject.
In this embodiment, after obtaining the keywords of the second word sequence corresponding to each text in the training text set, the text processing apparatus may determine the support degrees corresponding to the keywords, and determine the keywords of the second word sequence corresponding to the first subject, together with their support degrees, as the positive sample matching relationship of the first subject. This is described below with reference to Table 12, which shows, for the keyword "#", the positive sample matching relationship corresponding to the first subject:
TABLE 12
Figure RE-GDA0003282774310000172
Figure RE-GDA0003282774310000181
Taking "#" as an example, when the number of characters is 1 the corresponding support degree is 1/3, that is, "#" appears in 2 of the 6 clauses; the same holds for "# newspaper". In this way, the keywords in the second word sequence corresponding to the first subject and their support degrees can be obtained.
Step 55, determining the keywords of each character count in the second word sequence corresponding to the second subject, and the support degrees corresponding to those keywords, as the negative sample matching relationship of the first subject.
In this step, the first subject and the second subject are positive and negative samples of each other: the positive sample matching relationship of the first subject is the negative sample matching relationship of the second subject, and the positive sample matching relationship of the second subject is the negative sample matching relationship of the first subject. Therefore, once the keywords corresponding to each text in the training text set and their support degrees are obtained, the keywords of each character count in the second word sequence corresponding to the second subject, and their support degrees, are obtained and serve as the negative sample matching relationship of the first subject.
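One possible way to hold the two matching relationships is as keyword-to-support dictionaries, as in the sketch below. The function `matching_relations`, its arguments, and the rule of keeping the larger support when several other subjects share a keyword are all assumptions made for illustration; the application does not specify a data structure.

```python
def matching_relations(hits_by_subject, clause_totals, subject):
    """Return (positive, negative) keyword -> support maps for one subject."""
    def supports(name):
        total = clause_totals[name]
        return {kw: hits / total for kw, hits in hits_by_subject[name].items()}

    positive = supports(subject)
    negative = {}
    for other in hits_by_subject:
        if other == subject:
            continue
        for kw, support in supports(other).items():
            # Assumption: on a keyword collision keep the larger support.
            negative[kw] = max(negative.get(kw, 0.0), support)
    return positive, negative

# Hypothetical mined keywords with the number of sample clauses each one hits.
hits_by_subject = {
    "city exclusive agency": {"#": 2, "# report": 2},
    "city education bureau": {"school": 3},
}
clause_totals = {"city exclusive agency": 6, "city education bureau": 6}
positive, negative = matching_relations(
    hits_by_subject, clause_totals, "city exclusive agency"
)
```

Calling the same function with "city education bureau" as the subject swaps the roles of the two dictionaries, reflecting that the subjects are positive and negative samples of each other.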
302. If the matching fails, determining mutual information between each word and the first keyword, and determining mutual information between each word and the second keyword.
In this embodiment, the text processing apparatus matches each word in the target text against the keywords in the positive sample matching relationship and the negative sample matching relationship of the first subject. If no word in the target text is identical to any keyword in the positive sample matching relationship or the negative sample matching relationship, the matching is considered to have failed; if the target text contains a word identical to a keyword in these relationships, the matching is considered successful. Take the target text "mobile phone stolen case frequently occurs recently, please refer to relevant departments to establish case verification" and the keywords and support degrees of the positive sample matching relationship in Table 12 as an example. After each word in the target text is matched against the positive sample matching relationship, if the positive sample matching relationship contains a keyword identical to the two words "related department" and "case" in the target text, namely "# case" in Table 12, the matching is determined to be successful; if the positive sample matching relationship in Table 12 contains no keyword identical to any word in the target text, the matching is determined to have failed. When the matching fails, the text processing apparatus may calculate the mutual information between each word in the target text and the first keyword, and the mutual information between each word and the second keyword, where the first keyword is the keyword with the largest number of characters among the keywords of the first subject, and the second keyword is the keyword with the largest number of characters among the keywords of the second subject; for example, the first keyword is "# alarm case" in Table 12. Mutual information is explained below:
If keyword x and keyword y often occur together, the mutual information of keyword x and keyword y is large. The mutual information of keyword x and keyword y can be defined as:
Figure RE-GDA0003282774310000191
In addition, word frequency is taken into account when determining mutual information, so that high-frequency words receive more attention:
Figure RE-GDA0003282774310000192
where x is the word to be mined, y is a high-frequency word that often appears together with x, I(x, y) is the mutual information of x and y, p(x, y) is the probability that x and y appear together, p(x) is the probability that x appears on its own, p(y) is the probability that y appears on its own, and a takes a value from 0.5 to 1. In the present application, x is each word in the target text, and y is the first keyword or the second keyword. When calculating with the formula, the target text is first segmented to obtain its words; each word is then vectorized with a word2vec word vector model, and the first keyword and the second keyword are vectorized with the same model; finally, the vectorized words, first keyword, and second keyword are input into the formula to calculate the mutual information between each word and the first keyword and between each word and the second keyword. It can be understood that the mutual information between two word vectors may also be calculated in other manners, for example with a toolkit in Python or Matlab, which is not specifically limited.
It can be understood that the word2vec word vector model is described above only as an example for vectorizing each word, the first keyword, and the second keyword; each word, the first keyword, and the second keyword may of course also be vectorized in other manners, which is not specifically limited.
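For orientation, the sketch below estimates mutual information from clause co-occurrence counts, assuming that the first formula is the standard pointwise mutual information I(x, y) = log(p(x, y) / (p(x) p(y))), which matches the symbol definitions given above. It does not attempt the frequency-weighted variant with parameter a, and it works on raw counts rather than word2vec vectors; the function name `pmi` and the sample data are illustrative assumptions.

```python
import math

def pmi(clauses, x, y):
    """Pointwise mutual information of x and y, estimated from clause co-occurrence."""
    n = len(clauses)
    p_x = sum(x in clause for clause in clauses) / n
    p_y = sum(y in clause for clause in clauses) / n
    p_xy = sum(x in clause and y in clause for clause in clauses) / n
    if p_xy == 0:
        return float("-inf")  # x and y never occur together
    return math.log(p_xy / (p_x * p_y))

# Hypothetical clauses represented as sets of word units.
toy_clauses = [{"theft", "case"}, {"theft", "case"}, {"school"}, {"school", "fee"}]
together = pmi(toy_clauses, "theft", "case")
apart = pmi(toy_clauses, "theft", "school")
```

Words that always occur together, such as "theft" and "case" here, obtain a positive value, while words that never co-occur obtain negative infinity under this estimate.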
303. Determining the association score of the target text according to the mutual information between each word and the first keyword, the support degree of the first keyword, the mutual information between each word and the second keyword, and the support degree of the second keyword.
In this embodiment, after determining the mutual information between each word and the first keyword, the support degree of the first keyword, the mutual information between each word and the second keyword, and the support degree of the second keyword, the text processing apparatus may determine the association score of the target text from these four quantities. The specific steps are as follows:
the text processing apparatus determines a first association score of each word according to the mutual information between each word and the first keyword and the support degree of the first keyword;
determines a second association score of each word according to the mutual information between each word and the second keyword and the support degree of the second keyword;
and determines the association score of the target text according to the first association score of each word and the second association score of each word.
That is, the text processing apparatus may multiply the mutual information between each word and the first keyword by the support degree of the first keyword to obtain the first association score of each word. It can be understood that, when there are a plurality of first keywords, the support degree of each first keyword is multiplied by the mutual information between the word and that first keyword, and the products are added to obtain the first association score of the word. The second association score is calculated in the same manner. Because the second association score measures the association of each word with the negative sample matching relationship of the first subject, it is converted to a negative number after being calculated, and the first association scores and second association scores of all words are then added to obtain the association score of the target text. It can be understood that, for convenience of calculation, the first association score and the second association score of each word may be normalized so that they fall within [-1, 1], with the first association score positive and the second association score negative; the association score of the target text is then obtained by adding the first association score and the second association score of each word.
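The scoring rule described above can be sketched as follows, assuming that mutual information is available through a callable `mi(word, keyword)`. The function `association_score`, the lookup table, and the numeric values are hypothetical, and the optional normalization to [-1, 1] is omitted.

```python
def association_score(words, mi, first_keywords, second_keywords):
    """Sum of MI * support over first keywords minus the same sum over second keywords."""
    score = 0.0
    for word in words:
        first = sum(mi(word, kw) * sup for kw, sup in first_keywords.items())
        second = sum(mi(word, kw) * sup for kw, sup in second_keywords.items())
        score += first - second  # the second score counts as negative
    return score

# Hypothetical mutual information values and keyword supports.
mi_table = {
    ("stolen", "# report case"): 0.6,
    ("stolen", "school enroll"): 0.1,
    ("verify", "# report case"): 0.4,
    ("verify", "school enroll"): 0.2,
}
score = association_score(
    ["stolen", "verify"],
    lambda word, kw: mi_table[(word, kw)],
    {"# report case": 0.5},
    {"school enroll": 0.5},
)
```

A positive result indicates that the words of the target text are more strongly associated with the keywords of the first subject than with those of the second subject.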
304. If the association score of the target text meets the text distribution condition, distributing the target text to the first subject.
In this embodiment, after obtaining the association score of the target text, the text processing apparatus may determine whether the association score satisfies the text distribution condition, and distribute the target text to the first subject when it does. As in the above example, the association score of the target text with respect to a subject is a probability value that the target text belongs to that subject. The target text may be distributed to the subject with the highest association score, or an association score threshold may be set and the target text distributed to every subject whose association score satisfies the threshold. For example, if the association score of the target text is 0.8 with respect to "city exclusive agency", 0.3 with respect to "city science and technology bureau", and 0.6 with respect to "city education bureau", the target text may be distributed directly to "city exclusive agency", or, when a threshold is used, to the subjects whose scores satisfy it, such as "city exclusive agency" and "city education bureau".
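The two distribution conditions (highest score, or all scores reaching a threshold) can be illustrated with the scores from the example above. The function `distribute` and the threshold value 0.5 are assumptions; the application does not state a concrete threshold.

```python
def distribute(scores, threshold=None):
    """Return the subjects to which a target text is distributed."""
    if threshold is None:
        # No threshold: pick the single subject with the highest score.
        return [max(scores, key=scores.get)]
    return [subject for subject, score in scores.items() if score >= threshold]

scores = {
    "city exclusive agency": 0.8,
    "city science and technology bureau": 0.3,
    "city education bureau": 0.6,
}
best_only = distribute(scores)
above_threshold = distribute(scores, threshold=0.5)  # 0.5 is a hypothetical value
```

With a threshold, a single target text may be distributed to more than one subject.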
In one embodiment, if each word in the target text is matched based on the positive sample matching relationship and the negative sample matching relationship of the first subject, and the matching is successful, the text processing apparatus further performs the following operations:
determining a first keyword set matched with each word in the positive sample matching relationship, and determining a second keyword set matched with each word in the negative sample matching relationship;
determining, among the first keywords in the first keyword set, a first target keyword that hits the maximum number of characters in the positive sample matching relationship, and determining, among the second keywords in the second keyword set, a second target keyword that hits the maximum number of characters in the negative sample matching relationship;
determining the number of first sample clauses hit by a first target keyword in a sample clause set associated with a positive sample matching relationship, and determining the number of second sample clauses hit by a second target keyword in a sample clause set associated with a negative sample matching relationship and the target number of all sample clauses in the sample clause set associated with the positive sample matching relationship;
determining the support degree weight of the target text according to the number of the first sample clauses, the number of the second sample clauses and the target number;
and distributing the target text according to the support degree weight of the target text.
In this embodiment, each word in the target text is matched against the positive sample matching relationship and the negative sample matching relationship of the first subject, and when the matching succeeds, a first keyword set matched by the words in the positive sample matching relationship may be determined. Taking Table 12 as the positive sample matching relationship, after each word in the target text is matched against the keywords of each character length in Table 12, the first keyword set obtained is "#", "# report", "# alarm", "# case", "# alarm". Similarly, each word in the target text can be matched against the negative sample matching relationship to obtain the second keyword set;
then, the first target keyword with the largest number of characters in the first keyword set and the second target keyword with the largest number of characters in the second keyword set may be determined; as shown in Table 7, the first target keywords are "# alarm", "# report", and "# alarm". The number of first sample clauses hit by the first target keyword in the sample clause set associated with the positive sample matching relationship is then determined, as shown in Table 7, together with the number of second sample clauses hit by the second target keyword in the sample clause set associated with the negative sample matching relationship, and the target number of all sample clauses in the sample clause set associated with the positive sample matching relationship, which, as shown in Table 7, is 6;
and finally, the support degree weight of the target text is determined according to the number of first sample clauses, the number of second sample clauses, and the target number. Specifically, the positive support degree weight of the target text is determined according to the number of first sample clauses and the target number, and the negative support degree weight of the target text is determined according to the number of second sample clauses and the target number. The positive support degree weight and the negative support degree weight of the target text can be calculated by the following formulas:
Figure RE-GDA0003282774310000221
and then determining the support degree weight of the target text according to the positive support degree weight of the target text and the negative support degree weight of the target text.
It should be noted that, when there are a plurality of first target keywords, there are a plurality of corresponding first sample clause counts; the positive support degree weights calculated from each first sample clause count and the target number may be added to obtain the positive support degree weight of the target text, and the same applies when there are a plurality of second target keywords. After the positive support degree weight and the negative support degree weight of the target text are calculated, they may be normalized to the range [-1, 1], with the negative support degree weight set to a negative number, so that the support degree weight of the target text can be obtained by subtracting the negative support degree weight from the positive support degree weight. The target text is then distributed according to its support degree weight in the manner described in detail in step 304, with the association score threshold replaced by a support degree weight threshold.
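The successful-match branch can be sketched as below. Since the weight formulas themselves are only available as an image, the sketch assumes that each weight is the ratio of the hit clause count to the target number, which is consistent with the surrounding description but is not confirmed by it. The functions `longest_keywords` and `support_weight` and all sample values are hypothetical, and normalization is omitted.

```python
def longest_keywords(matched):
    """Return every matched keyword (a tuple of word units) of maximal length."""
    if not matched:
        return []
    longest = max(len(keyword) for keyword in matched)
    return [keyword for keyword in matched if len(keyword) == longest]

def support_weight(first_counts, second_counts, target_number):
    """Positive weight minus negative weight; each is a sum of hit counts over target_number."""
    positive = sum(first_counts) / target_number  # assumed form of the formula
    negative = sum(second_counts) / target_number  # assumed form of the formula
    return positive - negative

targets = longest_keywords([("#",), ("#", "report"), ("#", "case")])
weight = support_weight(first_counts=[2], second_counts=[1], target_number=6)
```

When several target keywords share the maximal length, their hit counts are summed, mirroring the addition of per-keyword weights described above.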
In summary, in the embodiments provided by the present application, rules and templates do not need to be formulated manually when text is distributed. Instead, the positive sample matching relationship and the negative sample matching relationship of a subject are introduced, the association score of the target text is calculated from the support degree and the mutual information between each word and the keyword with the largest number of characters, and the target text is distributed based on the association score. This avoids the problem that official documents cannot be distributed accurately because manually formulated rules and templates generalize poorly.
The embodiments of the present application are described above from the viewpoint of a processing method of a text, and are described below from the viewpoint of a text processing apparatus.
Referring to fig. 4, in an embodiment of the present application, a text processing apparatus is provided, where the text processing apparatus 400 includes:
a matching unit 401, configured to match each word in the target text based on a positive sample matching relationship and a negative sample matching relationship of the first subject, where the positive sample matching relationship includes a matching relationship between a keyword of the first subject and a support degree, and the negative sample matching relationship includes a matching relationship between a keyword of the second subject and a support degree;
a first determining unit 402, configured to determine mutual information between each word and a first keyword if matching fails, and determine mutual information between each word and a second keyword, where the first keyword is a keyword with a largest number of characters in keywords of a first main body, and the second keyword is a keyword with a largest number of characters in keywords of a second main body;
a second determining unit 403, configured to determine a relevance score of the target text according to the mutual information between each word and the first keyword, the support degree of the first keyword, the mutual information between each word and the second keyword, and the support degree of the second keyword, where the relevance score indicates a relevance degree of the target text to the first subject;
a distributing unit 404, configured to distribute the target text to the first main body if the relevance score of the target text meets a text distribution condition.
In one possible design, the second determining unit 403 is specifically configured to:
determining a first association score of each word according to the mutual information of each word and the first keyword and the support degree of the first keyword;
determining a second association score of each word according to the mutual information of each word and the second keyword and the support degree of the second keyword;
and determining the association score of the target text according to the first association score of each word and the second association score of each word.
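The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the per-word score is taken to be the product of mutual information and support, the text score is the sum of first scores minus the sum of second scores, and the threshold is a hypothetical text distribution condition. None of these specific forms is stated in this passage.

```python
from typing import Mapping


def word_score(mi: float, support: float) -> float:
    """Per-word association score; the product form is an assumption."""
    return mi * support


def text_association_score(
    mi_first: Mapping[str, float],
    mi_second: Mapping[str, float],
    support_first: float,
    support_second: float,
) -> float:
    """Combine per-word scores: first-subject total minus second-subject total."""
    first = sum(word_score(mi, support_first) for mi in mi_first.values())
    second = sum(word_score(mi, support_second) for mi in mi_second.values())
    return first - second


# Hypothetical mutual information values for two words of a target text.
score = text_association_score(
    mi_first={"refund": 0.5, "form": 0.25},
    mi_second={"refund": 0.125, "form": 0.25},
    support_first=0.5,
    support_second=0.25,
)
DISTRIBUTION_THRESHOLD = 0.0  # hypothetical text distribution condition
distribute_to_first_subject = score > DISTRIBUTION_THRESHOLD
```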
In one possible design, the first determining unit 402 is further configured to:
if the matching is successful, determining a first keyword set matched with each word in the positive sample matching relationship, and determining a second keyword set matched with each word in the negative sample matching relationship;
determining, from the first keywords in the first keyword set, a first target keyword that is hit in the positive sample matching relationship and has the largest number of characters, and determining, from the second keywords in the second keyword set, a second target keyword that is hit in the negative sample matching relationship and has the largest number of characters;
determining the number of first sample clauses hit by the first target keyword in the sample clause set associated with the positive sample matching relationship, the number of second sample clauses hit by the second target keyword in the sample clause set associated with the negative sample matching relationship, and the target number of all sample clauses in the sample clause set associated with the positive sample matching relationship;
determining the support degree weight of the target text according to the number of the first sample clauses, the number of the second sample clauses and the target number;
and distributing the target text according to the support degree weight of the target text.
In one possible design, the determining, by the first determining unit 402, the support degree weight of the target text according to the first sample clause number, the second sample clause number, and the target number includes:
determining the positive support degree weight of the target text according to the first sample clause quantity and the target quantity;
determining the negative support degree weight of the target text according to the second sample clause quantity and the target quantity;
and determining the support degree weight of the target text according to the positive support degree weight of the target text and the negative support degree weight of the target text.
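A minimal sketch of this weight calculation is given below. It assumes that each weight is a simple ratio of the hit clause count to the target number and that the two weights are combined by subtraction; both choices, and the function name `support_weight`, are illustrative assumptions rather than formulas stated in this passage.

```python
def support_weight(first_hits: int, second_hits: int, target_number: int) -> float:
    """Support degree weight of a target text (assumed ratio-and-difference form).

    first_hits: sample clauses hit by the first target keyword.
    second_hits: sample clauses hit by the second target keyword.
    target_number: all sample clauses associated with the positive relation.
    """
    if target_number <= 0:
        raise ValueError("target_number must be positive")
    positive_weight = first_hits / target_number
    negative_weight = second_hits / target_number
    return positive_weight - negative_weight


# Hypothetical counts: 6 positive hits and 2 negative hits over 8 sample clauses.
weight = support_weight(first_hits=6, second_hits=2, target_number=8)
```

Under these assumptions, a weight above zero would favor distributing the target text to the first subject.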
In one possible design, the apparatus further includes:
a third determining unit 405, the third determining unit 405 being configured to:
acquiring a training text set, wherein the training text set comprises training texts associated with the first subject and training texts associated with the second subject;
performing clause division on each text in the training text set to obtain a clause set corresponding to each text;
processing the clause set corresponding to each text to obtain a first word sequence corresponding to each text;
removing keywords in the first word sequence whose support degree is smaller than the support degree threshold to obtain a second word sequence corresponding to each text;
determining keywords of the second word sequence and the support degrees corresponding to the keywords;
determining the keywords of the second word sequence corresponding to the first subject and the support degrees corresponding to the keywords as the positive sample matching relationship of the first subject;
and determining the keywords of the second word sequence corresponding to the second subject and the support degrees corresponding to the keywords as the negative sample matching relationship of the first subject.
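The construction of the two matching relationships can be sketched as below. The sketch assumes that support degree means the fraction of a subject's sample clauses that contain a word unit, and it treats single word units as keywords for brevity; the function name `matching_relation` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def matching_relation(clauses: List[List[str]], min_support: float) -> Dict[str, float]:
    """Map each word unit to its support, dropping units below min_support.

    Assumption: support is the fraction of clauses that contain the unit.
    """
    if not clauses:
        return {}
    # Count each unit at most once per clause.
    counts = Counter(unit for clause in clauses for unit in set(clause))
    total = len(clauses)
    return {u: c / total for u, c in counts.items() if c / total >= min_support}


# Hypothetical tokenized training clauses for the two subjects.
first_subject_clauses = [
    ["tax", "refund"],
    ["tax", "audit"],
    ["tax", "refund", "form"],
    ["road", "tax"],
]
second_subject_clauses = [["road", "repair"], ["road", "closure"]]

positive_relation = matching_relation(first_subject_clauses, min_support=0.5)
negative_relation = matching_relation(second_subject_clauses, min_support=0.5)
```

The relation built from the second subject's texts serves as the negative sample matching relationship of the first subject, mirroring the last step above.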
In one possible design, the processing, by the third determining unit 405, the set of clauses corresponding to each text to obtain the first word sequence corresponding to each text includes:
filtering stop words of a clause set corresponding to each text based on a preset stop word bank;
carrying out named entity recognition on a clause set corresponding to each text after the stop words are filtered;
and splitting a clause set corresponding to each text after the named entity is identified according to word units to obtain a first word sequence.
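The preprocessing steps above can be sketched as follows. This is an illustration only: the stop word bank, the entity lexicon standing in for a real named entity recognition model, and whitespace tokenization are all assumptions, and a Chinese-language implementation would need a proper segmenter instead.

```python
from typing import Iterable, List, Set

STOP_WORDS: Set[str] = {"the", "of", "a", "to"}  # hypothetical stop word bank
KNOWN_ENTITIES: Set[str] = {"tax bureau"}  # stand-in for a real NER model


def to_word_units(clause: str) -> List[str]:
    """Turn one clause into word units, following the three steps in order."""
    # 1. Filter stop words.
    tokens = [t for t in clause.lower().split() if t not in STOP_WORDS]
    # 2. "Recognize" named entities by merging adjacent tokens found in the lexicon.
    units: List[str] = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if pair in KNOWN_ENTITIES:
            units.append(pair)
            i += 2
        else:
            units.append(tokens[i])
            i += 1
    # 3. The resulting list is the clause split into word units.
    return units


def first_word_sequence(clauses: Iterable[str]) -> List[List[str]]:
    """Apply the per-clause processing to every clause of a text."""
    return [to_word_units(c) for c in clauses]


sequence = first_word_sequence(["Notice of the Tax Bureau", "Refund to a company"])
```

Keeping a recognized entity as a single unit prevents it from being broken apart when keywords are later counted.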
In one possible design, the determining of the keyword of the second word sequence by the third determining unit 405 includes:
determining a third keyword with the number of characters i in the second word sequence and a keyword set associated with the third keyword in a target word unit set, wherein the target word unit set is a sample clause set corresponding to at least one of the first subject and the second subject, and the value of i is a number of characters occurring in the second word sequence;
removing keywords in the keyword set whose support degree is smaller than the support degree threshold;
and performing recursion on the number of characters of the keywords in the second word sequence until the keyword with the largest number of characters in the second word sequence and the keyword set corresponding to that keyword in the target word unit set are determined, so as to obtain the keywords of each number of characters in the second word sequence.
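The length-by-length recursion described above resembles Apriori-style frequent sequence mining, and the sketch below illustrates it under that reading. It is an assumption that keywords are contiguous runs of word units, that support is the fraction of sequences containing a keyword, and that a keyword of length i is only grown from a surviving keyword of length i-1; the name `mine_keywords` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

Keyword = Tuple[str, ...]


def mine_keywords(
    sequences: List[List[str]], min_support: float
) -> Dict[int, Dict[Keyword, float]]:
    """Return, for each length i, the keywords of i units with enough support."""
    total = len(sequences)
    result: Dict[int, Dict[Keyword, float]] = {}
    survivors: Optional[Set[Keyword]] = None  # keywords of length i-1 that were kept
    i = 1
    while total > 0:
        counts: Counter = Counter()
        for seq in sequences:
            grams = {tuple(seq[j:j + i]) for j in range(len(seq) - i + 1)}
            if survivors is not None:
                # Apriori-style pruning: extend only surviving prefixes.
                grams = {g for g in grams if g[:-1] in survivors}
            counts.update(grams)
        kept = {g: c / total for g, c in counts.items() if c / total >= min_support}
        if not kept:
            break  # no longer keyword meets the support threshold
        result[i] = kept
        survivors = set(kept)
        i += 1
    return result


# Hypothetical word sequences after stop word filtering and splitting.
sequences = [
    ["tax", "refund", "form"],
    ["tax", "refund", "notice"],
    ["tax", "audit"],
    ["road", "repair"],
]
keywords_by_length = mine_keywords(sequences, min_support=0.5)
longest_length = max(keywords_by_length)
```

The entry for the largest length corresponds to the keyword with the largest number of units, which is the one used when the mutual information is later computed.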
In summary, in the embodiment provided by the application, text distribution does not require rules and templates to be formulated manually. Instead, a positive sample matching relationship and a negative sample matching relationship of a subject are introduced, the association score of a target text is calculated from the mutual information between each word and the keyword with the largest number of characters together with the support degree of that keyword, and the target text is distributed based on the association score. This avoids the problem that official texts cannot be distributed accurately because manually formulated rules and templates generalize poorly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a server provided in an embodiment of the present invention. The server 500 may vary considerably depending on its configuration or performance, and may include one or more Central Processing Units (CPUs) 522 (e.g., one or more processors), a memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) storing an application program 542 or data 544. The memory 532 and the storage medium 530 may be transient storage or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 522 may be configured to communicate with the storage medium 530 and execute, on the server 500, the series of instruction operations in the storage medium 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input-output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the text processing apparatus in the above-described embodiment may be based on the server configuration shown in fig. 5.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a computer, implements the method flow related to the text processing apparatus in any of the above method embodiments. Correspondingly, the computer can be the text processing device.
The present application also provides a computer program or a computer program product including a computer program, which when executed on a computer causes the computer to implement the method flows related to the text processing apparatus in any of the above method embodiments. Correspondingly, the computer may be the text processing device described above.
In the embodiments corresponding to fig. 2 or 3 described above, all or part of the embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product.
For the text processing apparatus and the server disclosed in the present application, a plurality of servers may form a blockchain, with each server being a node on the blockchain.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be understood that the processor referred to in this application may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be understood that the number of processors in the present application may be one or more, and may be specifically adjusted according to an actual application scenario, and this is merely an example and is not limited herein. The number of the memories in the embodiment of the present application may be one or multiple, and may be specifically adjusted according to an actual application scenario, and this is merely an exemplary illustration and is not limited.
It should be further noted that, when the text processing apparatus includes a processor (or a processing unit) and a memory, the processor in this application may be integrated with the memory, or the processor and the memory may be connected through an interface, which may be specifically adjusted according to an actual application scenario, and is not limited.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or other devices) to execute all or part of the steps of the method described in the embodiment of fig. 2 or 3 of the present application.
It will be appreciated that the storage media or memories referred to in this application may comprise volatile memory or non-volatile memory, or may comprise both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
It should be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method for processing text, comprising:
matching each word in the target text based on a positive sample matching relationship and a negative sample matching relationship of a first subject, wherein the positive sample matching relationship comprises a matching relationship between a keyword of the first subject and the support degree, and the negative sample matching relationship comprises a matching relationship between a keyword of a second subject and the support degree;
if the matching fails, determining mutual information of each word and a first keyword, and determining mutual information of each word and a second keyword, wherein the first keyword is the keyword with the largest number of characters among the keywords of the first subject, and the second keyword is the keyword with the largest number of characters among the keywords of the second subject;
determining an association score of the target text according to the mutual information of each word and a first keyword, the support degree of the first keyword, the mutual information of each word and a second keyword and the support degree of the second keyword, wherein the association score represents the association degree of the target text and the first subject;
and if the association score of the target text meets the text distribution condition, distributing the target text to the first subject.
2. The method of claim 1, wherein determining the relevance score of the target text according to the mutual information of each word and the first keyword, the support of the first keyword, the mutual information of each word and the second keyword, and the support of the second keyword comprises:
determining a first association score of each word according to the mutual information of each word and a first keyword and the support degree of the first keyword;
determining a second association score of each word according to the mutual information of each word and a second keyword and the support degree of the second keyword;
and determining the association score of the target text according to the first association score of each word and the second association score of each word.
3. The method of claim 1, further comprising:
if the matching is successful, determining a first keyword set matched with each word in the positive sample matching relationship, and determining a second keyword set matched with each word in the negative sample matching relationship;
determining, from the first keywords in the first keyword set, a first target keyword that is hit in the positive sample matching relationship and has the largest number of characters, and determining, from the second keywords in the second keyword set, a second target keyword that is hit in the negative sample matching relationship and has the largest number of characters;
determining the number of first sample clauses hit by the first target keyword in a sample clause set associated with the positive sample matching relationship, the number of second sample clauses hit by the second target keyword in a sample clause set associated with the negative sample matching relationship, and the target number of all sample clauses in the sample clause set associated with the positive sample matching relationship;
determining the support degree weight of the target text according to the first sample clause quantity, the second sample clause quantity and the target quantity;
and distributing the target text according to the support degree weight of the target text.
4. The method of claim 3, wherein determining the support weight of the target text according to the first number of sample clauses, the second number of sample clauses and the target number comprises:
determining the positive support degree weight of the target text according to the first sample clause quantity and the target quantity;
determining the negative support degree weight of the target text according to the second sample clause quantity and the target quantity;
and determining the support degree weight of the target text according to the positive support degree weight of the target text and the negative support degree weight of the target text.
5. The method of any one of claims 1 to 4, wherein before matching each word in the target text based on the positive and negative sample matching relationships of the first subject, the method further comprises:
acquiring a training text set, wherein the training text set comprises training texts associated with the first subject and training texts associated with the second subject;
each text in the training text set is subjected to clause division to obtain a clause set corresponding to each text;
processing the clause set corresponding to each text to obtain a first word sequence corresponding to each text;
removing keywords in the first word sequence whose support degree is smaller than a support degree threshold, and obtaining a second word sequence corresponding to each text;
determining keywords of the second word sequence and corresponding support degrees of the keywords;
determining the keywords of the second word sequence corresponding to the first subject and the support degrees corresponding to the keywords as the positive sample matching relationship of the first subject;
and determining the keywords of the second word sequence corresponding to the second subject and the support degrees corresponding to the keywords as the negative sample matching relationship of the first subject.
6. The method of claim 5, wherein the processing the set of clauses corresponding to each text to obtain the first word sequence corresponding to each text comprises:
filtering stop words of a clause set corresponding to each text based on a preset stop word bank;
carrying out named entity recognition on the clause set corresponding to each text after the stop words are filtered;
and splitting the clause set corresponding to each text after the named entity is identified according to word units to obtain the first word sequence.
7. The method of claim 5, wherein determining the keyword of the second word sequence comprises:
determining a third keyword with the number of characters i in the second word sequence and a keyword set associated with the third keyword in a target word unit set, where the target word unit set is a word unit set corresponding to at least one of the first subject and the second subject, and the value of i is a number of characters occurring in the second word sequence;
removing keywords in the keyword set whose support degree is smaller than the support degree threshold;
and performing recursion on the number of characters of the keywords in the second word sequence until the keyword with the largest number of characters in the second word sequence and the keyword set corresponding to that keyword in the target word unit set are determined, so as to obtain the keywords of each number of characters in the second word sequence.
8. A text processing apparatus, comprising:
the matching unit is used for matching each word in the target text based on a positive sample matching relationship and a negative sample matching relationship of a first subject, wherein the positive sample matching relationship comprises a matching relationship between a keyword of the first subject and the support degree, and the negative sample matching relationship comprises a matching relationship between a keyword of a second subject and the support degree;
a first determining unit, configured to, if the matching fails, determine mutual information between each word and a first keyword, and determine mutual information between each word and a second keyword, where the first keyword is the keyword with the largest number of characters among the keywords of the first subject, and the second keyword is the keyword with the largest number of characters among the keywords of the second subject;
a second determining unit, configured to determine a relevance score of the target text according to mutual information between each word and a first keyword, support of the first keyword, mutual information between each word and a second keyword, and support of the second keyword, where the relevance score indicates a relevance degree between the target text and the first subject;
and a distribution unit, configured to distribute the target text to the first subject if the association score of the target text meets a text distribution condition.
9. A computer device, comprising:
a memory, a processor, and a bus system;
wherein the memory is used for storing programs, and the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate;
the processor is configured to execute the program in the memory, and the processor is configured to execute the text processing method according to any one of claims 1 to 7 according to instructions in the program code.
10. A computer storage medium characterized in that it comprises instructions which, when run on a computer, cause the computer to perform the method of processing text of any one of claims 1-7.
CN202110796094.8A 2021-07-14 2021-07-14 Text processing method and device and readable storage medium Pending CN113821594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110796094.8A CN113821594A (en) 2021-07-14 2021-07-14 Text processing method and device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110796094.8A CN113821594A (en) 2021-07-14 2021-07-14 Text processing method and device and readable storage medium

Publications (1)

Publication Number Publication Date
CN113821594A true CN113821594A (en) 2021-12-21

Family

ID=78912669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110796094.8A Pending CN113821594A (en) 2021-07-14 2021-07-14 Text processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113821594A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501867A (en) * 2023-03-29 2023-07-28 北京数美时代科技有限公司 Variant knowledge mastery detection method, system and storage medium based on mutual information

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501867A (en) * 2023-03-29 2023-07-28 北京数美时代科技有限公司 Variant knowledge mastery detection method, system and storage medium based on mutual information
CN116501867B (en) * 2023-03-29 2023-09-12 北京数美时代科技有限公司 Variant knowledge mastery detection method, system and storage medium based on mutual information

Similar Documents

Publication Publication Date Title
CN109190110B (en) Named entity recognition model training method and system and electronic equipment
CN109635298B (en) Group state identification method and device, computer equipment and storage medium
CN108763952B (en) Data classification method and device and electronic equipment
WO2020057022A1 (en) Associative recommendation method and apparatus, computer device, and storage medium
US20160275148A1 (en) Database query method and device
US20130097157A1 (en) System and method for matching of database records based on similarities to search queries
CN109189888B (en) Electronic device, infringement analysis method, and storage medium
CN112215008B (en) Entity identification method, device, computer equipment and medium based on semantic understanding
AU2017299435B2 (en) Record matching system
CN110929125A (en) Search recall method, apparatus, device and storage medium thereof
WO2021196934A1 (en) Question recommendation method and apparatus based on field similarity calculation, and server
US20210089667A1 (en) System and method for implementing attribute classification for pii data
CN106815265B (en) Method and device for searching referee document
CN109977233B (en) Idiom knowledge graph construction method and device
US11914626B2 (en) Machine learning approach to cross-language translation and search
CN112241458B (en) Text knowledge structuring processing method, device, equipment and readable storage medium
CN109947903B (en) Idiom query method and device
CN112149387A (en) Visualization method and device for financial data, computer equipment and storage medium
TWI745777B (en) Data archiving method, device, computer device and storage medium
CN111651574A (en) Event type identification method and device, computer equipment and storage medium
CN113821594A (en) Text processing method and device and readable storage medium
US20160267586A1 (en) Methods and devices for computing optimized credit scores
CN112800179A (en) Associated database query method and device, storage medium and electronic equipment
CN110674383A (en) Public opinion query method, device and equipment
CN115905885A (en) Data identification method, device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination