CN112149403A

CN112149403A - Method and device for determining confidential text

Info

Publication number: CN112149403A
Application number: CN202011111708.6A
Authority: CN
Inventors: 李昊达; 高欣; 刘兵; 杨雨婷; 陈旭
Original assignee: MILITARY SECRECY QUALIFICATION CERTIFICATION CENTER
Current assignee: MILITARY SECRECY QUALIFICATION CERTIFICATION CENTER
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2020-12-29

Abstract

The disclosure relates to a method and a device for determining a confidential text. The method comprises the following steps: acquiring, from a text, a sub-text containing confidential keywords and their context information; determining the dependency relationship among the confidential keywords according to the sub-text; and matching that dependency relationship against the dependency relationships of confidential keywords in a preset confidential information rule base, which contains dependency relationships among a plurality of confidential keywords, and, if the matching is successful, determining that the text is a confidential text. Because the method determines the confidential content of the text from the contextual semantics of the confidential keywords, it determines confidential text more accurately and faster.

Description

Method and device for determining confidential text

Technical Field

The present disclosure relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for determining a secret-related text.

Background

Whether a file is confidential, and its security level, are determined by the specific confidential contents contained in the file; these specific confidential contents are called confidential points. In the related art, confidential text is mostly determined manually. Manual classification is highly subjective and its standards are not uniform: the classification standards of different units in different fields often differ greatly, so the classification results are inaccurate.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure provides a method and apparatus for determining a confidential document.

According to a first aspect of the embodiments of the present disclosure, there is provided a method for determining a confidential text, including:

acquiring, from the text, a sub-text containing the confidential keywords and their context information;

determining the dependency relationship among the confidential keywords according to the sub-texts;

and matching the dependency relationship with the dependency relationship of the secret-related keywords in a preset secret-related information rule base containing the dependency relationship among the plurality of secret-related keywords, and if the matching is successful, determining that the text is the secret-related text.

In a possible implementation manner, the secret-related keyword is set to be determined in the following manner, including:

acquiring words in the text;

matching the words with secret-related keywords in a preset secret-related keyword library;

and if the matching is successful, determining the words as the secret-related keywords.

In a possible implementation manner, after the matching the word with the secret-related keywords in the preset secret-related keyword library, the method further includes:

if the matching is unsuccessful, inputting the terms into a preset synonym conversion model, and outputting synonyms of the terms through the synonym conversion model;

and matching the synonyms with the secret-related keywords.

In one possible implementation, the synonym transformation model includes at least one of:

a sound change model, a deformation model, a wrongly written character model and a traditional character model.

In a possible implementation manner, obtaining, from the text, the sub-text containing the confidential keywords and their context information includes:

determining secret-related keywords in the text;

and determining, as the sub-text, the text between the preset cut-off symbols immediately preceding and immediately following the position of the confidential keyword.

In a possible implementation manner, determining, according to the sub-text, a dependency relationship between the confidential keywords includes:

and inputting the sub-text into a dependency syntax analysis model, and outputting the dependency relationship between two entity words in the sub-text through the dependency syntax analysis model.

In one possible implementation manner, the dependency relationship is matched with a dependency relationship of a secret-related keyword in a preset secret-related information rule base containing dependency relationships among a plurality of secret-related keywords, wherein a condition that matching succeeds is set to be at least one of the following manners:

the confidential keywords in the sub-text and the confidential keywords in the confidential information rule base are the same words or words with the same word meanings, the dependency relationship among the confidential keywords in the sub-text is the same as the dependency relationship among the confidential keywords in the confidential information rule base,

and the confidential keywords in the sub text belong to the category range of the confidential keywords in the confidential information rule base, and the dependency relationship among the confidential keywords in the sub text is the same as the dependency relationship among the confidential keywords in the confidential information rule base.

In a possible implementation manner, before the obtaining the subfile containing the confidential keywords and the context information thereof from the text, the method further includes:

and under the condition that the file format of the text is a non-standard format, converting the file format of the text into a preset standard format.

In one possible implementation, the dependency relationship includes at least one of:

a subject-predicate relationship, a core relationship, an attributive relationship, a quantitative relationship, an adverbial relationship, and a coordinate relationship.

In one possible implementation manner, the secret-related information rule base includes secret-related information rule bases for a plurality of technical fields, and before the matching of the dependency relationship with the dependency relationship of the secret-related keywords in the preset secret-related information rule base containing dependency relationships among a plurality of secret-related keywords, the method further includes:

and determining a secret-related information rule base matched with the technical field according to the technical field of the sub-text.

In one possible implementation, the secret-related information rule base is configured to be generated as follows:

merging data of a plurality of preset classified texts in the same technical field;

extracting secret-related keywords in the secret-related texts from the merged data;

and analyzing and storing the dependency relationship among the secret-related keywords.

According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for determining a confidential document, including:

the acquisition module is used for acquiring, from the text, a sub-text containing the confidential keywords and their context information;

the first determining module is used for determining the dependency relationship among the confidential keywords according to the sub-texts;

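the second determining module is used for matching the dependency relationship with the dependency relationship of the secret-related keywords in a preset secret-related information rule base containing the dependency relationship among the plurality of secret-related keywords, and if the matching is successful, determining that the text is the secret-related text.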

In one possible implementation, the method includes: the secret-related keywords are set to be determined in the following manner, including:

acquiring words in the text;

and matching the synonyms with the secret-related keywords.

In one possible implementation manner, the obtaining module includes:

the first determining sub-module is used for determining the confidential keywords in the text;

and the second determining sub-module is used for determining, as the sub-text, the text between the preset cut-off symbols immediately preceding and immediately following the position of the confidential keyword.

In one possible implementation manner, the first determining module includes:

and the third determining sub-module is used for inputting the sub-text into a dependency syntax analysis model and outputting the dependency relationship between the two entity words in the sub-text through the dependency syntax analysis model.

In one possible implementation, the apparatus further includes:

and the conversion module is used for converting the file format of the text into a preset standard format under the condition that the file format of the text is a non-standard format.

In one possible implementation, the apparatus further includes:

and the matching module is used for determining a secret-related information rule base matched with the technical field according to the technical field of the sub-text.

According to a third aspect of the present disclosure, there is provided an apparatus for determining a confidential document, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to: performing a method according to any embodiment of the present disclosure.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having instructions stored thereon which, when executed by a processor, enable the processor to perform a method according to any one of the embodiments of the present disclosure.

The technical scheme provided by the embodiments of the disclosure can have the following beneficial effects: the confidential keywords and their context information are acquired from the text, and the dependency relationship among the confidential keywords is matched against the dependency relationships among confidential keywords in the preset confidential information rule base. Compared with conventional manual classification or keyword-only classification, the confidential content of the text can be determined from the contextual semantics of the confidential keywords, so that confidential text is determined more accurately and faster.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a flow diagram illustrating a method of determining classified text in accordance with an exemplary embodiment.

FIG. 2 is a block diagram illustrating an apparatus for determining classified text in accordance with one exemplary embodiment.

FIG. 3 is a flow diagram illustrating a method of determining classified text in accordance with an exemplary embodiment.

FIG. 4 is a flow diagram illustrating a method of determining a secret-related keyword in accordance with an example embodiment.

FIG. 5 illustrates a method for constructing a secret-related information rule base according to an exemplary embodiment.

FIG. 6 is a block diagram illustrating an apparatus for determining classified text in accordance with an example embodiment.

FIG. 7 is a block diagram illustrating an apparatus for determining classified text in accordance with an example embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

To help those skilled in the art understand the technical solutions provided by the embodiments of the present disclosure, the technical environment in which the solutions are implemented is described first.

Confidential content is expressed in confidential text in diverse forms. It may exist as an individual name or concept, such as the model of a certain piece of equipment or a performance index of a certain piece of equipment. It may also exist as a sentence or short text with a semantic form such as: "Project background: the flight speed achievable by the flight control software of a certain equipment model is a km/h." In the related art, if confidential text is determined by confidential keywords alone, sentences of this semantic form are missed. In addition, some keywords are confidential in some fields but not in others. Determining whether a text is confidential by relying on confidential keywords alone is therefore unreliable.

Based on practical technical needs similar to those described above, the present disclosure provides a method for determining classified text.

FIG. 3 is a flow diagram illustrating a method of determining classified text in accordance with an exemplary embodiment. Referring to FIG. 3, the confidentiality of a text to be determined is judged as follows. First, text extraction 301 is performed: it is judged whether the file format of the text belongs to a preset standard file format, and if not, the file is converted into the standard file format; the file format may include word, ppt, excel, pdf, txt, and the like. Second, confidential keywords are determined 302 for the extracted text: word segmentation and part-of-speech tagging are performed on the text, and the segmented words are filtered against a confidential keyword library; when the library contains a word identical to a segmented word, or a phonetic or morphological variant of it, the text is determined to contain a confidential keyword. Third, sub-text representation 303 is performed: the sentences in which the confidential keywords are located are extracted as sub-texts, and the sub-texts are represented using a syntax tree or a bag-of-words model. Finally, the dependency relationship among the confidential keywords is determined according to the sub-text and matched against the dependency relationships of confidential keywords in a preset confidential information rule base containing dependency relationships among a plurality of confidential keywords; if the matching is successful, the text is determined to be a confidential text.

FIG. 4 is a flow diagram illustrating a method of determining a secret-related keyword in accordance with an example embodiment. Referring to FIG. 4, the text 406 in the standard file format is input into the confidential keyword filter 407 and matched against the confidential keywords stored in the confidential keyword library 401. The confidential keywords in the library are normalized 402 and then stored 403 in a dictionary tree. Words that are not matched enter the sound-change and deformation filter 409, where pronunciation conversion and glyph conversion are applied to them using the Chinese pinyin library / glyph library 404. The converted words are matched against the confidential keyword library again; if the matching succeeds, the word is determined to be a confidential keyword, and if it fails, the word is determined to be a non-confidential keyword. The hit result 410 for the confidential keywords is then output.

The method for determining the confidential text in the present disclosure is described in detail below with reference to fig. 1. FIG. 1 is a flow diagram illustrating a method for determining classified text in accordance with an exemplary embodiment. Although the present disclosure provides method steps as illustrated in the following examples or figures, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In steps where no necessary causal relationship exists logically, the order of execution of the steps is not limited to that provided by the disclosed embodiments.

Specifically, an embodiment of the method for determining a confidential document provided by the present disclosure is shown in fig. 1, where the method may be applied to a terminal or a server and includes:

step S101, obtaining a sub text containing the confidential keywords and the context information thereof from the text.

In the embodiment of the disclosure, the confidential keywords include words that carry confidential information, and may include nouns, verbs, adjectives, and the like. In one example, a single noun or a single verb in an electronic text may not be confidential by itself, whereas a combination with a number, "noun + number" or "verb + number", is often confidential content. For example, "the flight speed of the flight control software is fast" is not confidential, whereas "the flight speed of the flight control software reaches 50 km/h" is confidential content. Therefore, in the embodiment of the present disclosure, the confidential keywords may include numbers.
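The "noun + number" observation can be illustrated with a minimal sketch. The keyword list, the number pattern, and the function name `keyword_with_number` are assumptions made for illustration only; the disclosure does not specify how the combination is detected.

```python
import re

# Hypothetical keyword list and number pattern, for illustration only.
KEYWORDS = ["flight speed"]
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def keyword_with_number(sentence: str) -> bool:
    """Return True if a keyword is followed, later in the sentence, by a number."""
    lowered = sentence.lower()
    for keyword in KEYWORDS:
        pos = lowered.find(keyword)
        # Search for a numeric value only in the part after the keyword.
        if pos != -1 and NUMBER.search(lowered, pos + len(keyword)):
            return True
    return False
```

With this sketch, the sentence containing "reaches 50 km/h" is flagged while the sentence ending in "is fast" is not, which mirrors the two examples in the paragraph above.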

In the embodiment of the present disclosure, the length of the sub-text is not limited. Taking the confidential keyword as the reference, a passage of preset length before the keyword and a passage of preset length after it may be used as the sub-text. In one example, if the keyword is the first word of the text, the keyword and the passage of preset length that follows it are used as the sub-text; in another example, if the keyword is the last word of the text, the keyword and the passage of preset length that precedes it are used as the sub-text. In one example, the text between two preset punctuation marks may be used as the sub-text, according to how the text is punctuated.

Step S102, determining the dependency relationship among the confidential keywords according to the sub-text.

In the embodiment of the disclosure, the dependency relationship between the confidential keywords may be determined by a relationship extraction method. In one example, the relationship extraction method may include pattern-matching-based relationship extraction, dictionary-based relationship extraction, and machine-learning-based relationship extraction. Pattern-matching-based relationship extraction constructs a number of pattern sets from the parts of speech or the semantics of words before the extraction task is executed; during extraction, the sentence segment containing the confidential keywords is matched against the patterns in the pattern sets, and if the matching succeeds, the segment is assigned the relationship attribute of the matching pattern. Dictionary-based relationship extraction sets dictionary entries for entity words in a dictionary, where the words serving as dictionary entries may include verbs; if the confidential keywords in the sub-text are linked by such a verb relation, the dependency relationship among the confidential keywords is extracted. Machine-learning-based relationship extraction builds a classifier from a pre-labeled corpus using a specific learning algorithm and then applies the classifier to decide the relationship category of new text.
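As a rough illustration of the pattern-matching approach, the sketch below matches a sentence segment against a small, hand-written pattern set and returns the relationship attribute of the first pattern that matches. The patterns, the labels `part_of` and `is_a`, and the function name `extract_relation` are all hypothetical; a real pattern set would be built from parts of speech or word semantics as described above.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern set: each surface pattern carries a relation label.
# More specific patterns are listed first so they take precedence.
PATTERNS = [
    (re.compile(r"(?P<head>[\w ]+?) is a part of (?P<tail>[\w ]+)"), "part_of"),
    (re.compile(r"(?P<head>[\w ]+?) is a (?P<tail>[\w ]+)"), "is_a"),
]

def extract_relation(segment: str) -> Optional[Tuple[str, str, str]]:
    """Return (head, relation, tail) for the first matching pattern, else None."""
    cleaned = segment.strip().rstrip(".")
    for pattern, label in PATTERNS:
        match = pattern.fullmatch(cleaned)
        if match:
            return (match.group("head").strip(), label, match.group("tail").strip())
    return None
```

Surface patterns of this kind are brittle, which is one reason the disclosure also lists dictionary-based and machine-learning-based extraction as alternatives.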

In the embodiment of the present disclosure, the dependency relationship may include: an instance relationship, which indicates that one thing is an instance of another thing, expressed by "is a", for example a named individual is a cat; a classification relationship, which indicates that one thing belongs to the category of another thing, also expressed by "is a", for example a football is a ball; a membership relationship, which indicates that one thing is a member of another, such as an individual and a collective, for example Xiaohong is a firefighter in a fire brigade; an attribute relationship, which indicates that one node has an attribute represented by another node, for example a monkey can climb trees; an aggregation relationship, which represents the relationship between a part and a whole, for example an arm is a part of the body; a positional relationship, which represents the orientation of an object, for example a mouse is on a table; and a similarity relationship, which indicates that things are similar in shape, content and the like, for example the lion and the tiger both dominate the forest. It should be noted that the dependency relationship is not limited to the above examples. Those skilled in the art may make other modifications in light of the technical spirit of the present application, and such modifications are encompassed in the scope of the present application as long as the functions and effects they achieve are the same as or similar to those of the present application.

Step S103, matching the dependency relationship with the dependency relationship of the confidential keywords in a preset confidential information rule base containing the dependency relationship among the confidential keywords, and if the matching is successful, determining that the text is the confidential text.

In the embodiment of the present disclosure, matching the dependency relationship against the dependency relationships of the confidential keywords in the preset confidential information rule base involves two comparisons: the dependency relationship between the confidential keywords in the sub-text is matched against the dependency relationship between the confidential keywords in the rule base, and the confidential keywords in the sub-text are matched against the confidential keywords in the rule base.

In the embodiment of the disclosure, the dependency relationships between the confidential keywords in the confidential information rule base may be constructed from texts already determined to be confidential. The confidential keywords of those texts are stored, and the dependency relationships among the confidential keywords in the different texts are stored under a uniform rule. In one example, the confidential keywords and the dependency relationships may be stored in the form of a knowledge graph. The dependency relationship acquired from the sub-text is matched against the pre-stored dependency relationships, and if the matching is successful, the text is determined to be a confidential text.
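The matching step can be sketched as a comparison of (head, relation, dependent) triples, which is one simple way to hold knowledge-graph edges. The sketch below assumes the success conditions stated earlier in the disclosure: the relation must be identical, and each keyword must be the same word, a word with the same meaning, or a word falling within the category of the rule-base keyword. The rule base contents, the synonym table, the category table, and all names are hypothetical.

```python
from typing import Dict, NamedTuple, Set

class Dependency(NamedTuple):
    head: str        # governing keyword
    relation: str    # dependency label
    dependent: str   # dependent keyword

# All entries below are hypothetical illustration data.
RULE_BASE: Set[Dependency] = {Dependency("flight speed", "value", "NUMBER")}
SYNONYMS: Dict[str, str] = {"flying speed": "flight speed"}
CATEGORIES: Dict[str, str] = {"50 km/h": "NUMBER"}

def keyword_matches(word: str, rule_word: str) -> bool:
    """Same word, same meaning, or within the rule keyword's category."""
    canonical = SYNONYMS.get(word, word)
    return canonical == rule_word or CATEGORIES.get(word) == rule_word

def is_confidential(found: Dependency) -> bool:
    """True if the found dependency matches any rule in the rule base."""
    return any(
        found.relation == rule.relation
        and keyword_matches(found.head, rule.head)
        and keyword_matches(found.dependent, rule.dependent)
        for rule in RULE_BASE
    )
```

A linear scan over the rule base is used here only for brevity; a knowledge-graph store would normally index the rules by keyword.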

According to the method, the confidential keywords and their context information are acquired from the text, and the dependency relationship among the confidential keywords is matched against the dependency relationships among confidential keywords in the preset confidential information rule base. Compared with conventional manual classification or keyword-only classification, the confidential content of the text can be determined from the contextual semantics of the confidential keywords, so that confidential text is determined more accurately and faster.

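In a possible implementation manner, the secret-related keyword is determined in the following manner:

step S201, obtaining words in the text;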

step S202, matching the words with secret-related keywords in a preset secret-related keyword library;

step S203, if the matching is successful, determining the words as the secret-related keywords.

In the embodiment of the disclosure, the confidential keyword library is generated by extracting confidential keywords from texts already determined to be confidential. The confidential keywords in the library may include Chinese, English, traditional Chinese, simplified Chinese, letters, numbers, and the like, and may also include mixed forms of these, such as Chinese text + letters. In one example, the confidential keywords in the library, including homophones and homographs, may be stored in the form of a dictionary tree (Trie). Storing keywords in a dictionary tree uses the common prefixes of character strings to reduce query time, avoids as far as possible comparing strings that cannot match, and gives high query efficiency.
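A dictionary tree of this kind can be sketched as follows. This is a minimal character-level Trie, not the implementation used in the disclosure; the class names and the `find_all` scan are assumptions for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}       # character -> TrieNode
        self.is_keyword = False  # True if a stored keyword ends at this node

class KeywordTrie:
    def __init__(self, keywords=()):
        self.root = TrieNode()
        for keyword in keywords:
            self.insert(keyword)

    def insert(self, keyword):
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.is_keyword = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_keyword

    def find_all(self, text):
        """Return (start, keyword) for every stored keyword occurring in text."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break  # no stored keyword shares this prefix
                if node.is_keyword:
                    hits.append((start, text[start:end + 1]))
        return hits
```

Because keywords that share a prefix share a path, the scan abandons a starting position as soon as the next character has no child node, which is the prefix-sharing benefit described above.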

In the embodiment of the disclosure, word segmentation may be performed on a text to obtain the words in the text. In one example, the words may be tagged with their parts of speech so that matching against the confidential keywords is more accurate. In one example, if a word is identical to a confidential keyword, the two are determined to match; in another example, if a phonetic variant or a morphological variant of the word is the same as a confidential keyword, the matching can also be determined to be successful.

In the embodiment of the disclosure, words in the text may be stored using a bag-of-words (BOW) model. The BOW model ignores the word order, grammar and syntax of a text and treats the text only as a set or combination of words; the occurrence of each confidential-point keyword in the text is treated as independent of whether other confidential-point keywords occur. It is expressed as follows:

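That is, for the word set $W_i$ of document $doc_i$, if the $j$-th entry $t_j$ of the keyword library appears in $W_i$, the vector component $V_{ij}$ at that position is the word frequency of $t_j$ in the document; otherwise it is 0:

$$V_{ij} = \begin{cases} \mathrm{tf}(t_j, doc_i), & t_j \in W_i \\ 0, & \text{otherwise} \end{cases}$$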

An example of the bag-of-words representation of a confidential-point short text is as follows:

(0, 0, 0, 0, ..., 1, ..., 0, 0, 0, 0)

A 0 in the bag-of-words vector indicates that the corresponding confidential-point keyword in the keyword library does not appear in the short text. A value of 1 or more indicates that the keyword does appear in the short text, and the value is the number of times it appears.
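The vector described above can be computed with a few lines of code. The sketch assumes the text has already been segmented into words and that the keyword library is an ordered list; the function name `bow_vector` and the sample data are illustrative only.

```python
from collections import Counter
from typing import List

def bow_vector(doc_words: List[str], keyword_library: List[str]) -> List[int]:
    """Component j is the frequency of library entry j in the document, else 0."""
    counts = Counter(doc_words)  # a Counter returns 0 for words it has not seen
    return [counts[keyword] for keyword in keyword_library]

# Hypothetical library and segmented short text.
library = ["radar", "speed", "range"]
words = ["radar", "range", "radar", "is"]
vector = bow_vector(words, library)
```

Each position in the result corresponds to one entry of the keyword library, so vectors from different texts are directly comparable.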

In the embodiment of the disclosure, the number of times each word appears in the text is counted through the bag-of-words model, and the weights of the confidential keywords are then set so that keywords which frequently appear in non-confidential text receive lower weights, which improves the accuracy of confidential keyword matching.

In the embodiment of the disclosure, a user can extend the confidential-point keyword library with keywords from a professional field as required. The user only needs to define the category and the weight of each confidential-point keyword, and a keyword library at the million-entry scale can be extended incrementally in real time. The operation steps are as follows:

a. the keyword library of the tool can be edited offline, including adding and deleting entries;

b. the tool supports bulk import of a user-specific list of confidential-point keywords. Each line holds one confidential-point keyword entry in the following format: word, category, weight.
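A bulk import of the "word, category, weight" format might look like the sketch below. It assumes the three fields are separated by ASCII commas and that the weight is a decimal number; the disclosure states only the field order, so the separator, the numeric type, and the function name `load_keywords` are assumptions.

```python
from typing import Dict, Iterable, Tuple

def load_keywords(lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Parse lines of 'word, category, weight' into {word: (category, weight)}."""
    library: Dict[str, Tuple[str, float]] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # A line without exactly three fields raises ValueError here.
        word, category, weight = (field.strip() for field in line.split(","))
        library[word] = (category, float(weight))
    return library
```

Because later lines overwrite earlier entries for the same word, re-importing an updated list acts as an incremental update of the library.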

In a possible implementation manner, after the matching in step S202 of the words with the secret-related keywords in the preset secret-related keyword library, the method further includes:

step S211, if the matching is unsuccessful, inputting the terms into a preset synonym conversion model, and outputting synonyms of the terms through the synonym conversion model;

step S212, matching the synonym with the secret-related keyword.

In the embodiment of the disclosure, if a word in the text is not identical to any confidential keyword in the confidential keyword library, the word is input into a preset synonym conversion model. In the embodiment of the present disclosure, the synonym conversion model is trained using a plurality of pre-labeled synonyms as training samples; given an input word, the model outputs words with the same meaning as that word.

In the disclosed embodiments, a word with the same meaning as the input word may have the same form as that word or a different form. In one example, words of the same form include "query" and "find". In one example, the different forms include words in different languages, such as the Chinese "look up" and the English "look for". In another example, the different forms include phonetic variants, such as the pinyin "hangtian" and the abbreviation "HT" as synonyms of "space". In another example, the different forms include deformed words, such as "kakan" as a synonym of "space". In another example, the different forms include wrongly written words, such as "keep-clean" as a synonym of "keep-still". In another example, the different forms include internet slang, such as "give force", "monkey rey", and so on.

According to the embodiment of the disclosure, the words which are not successfully matched are input into the synonym conversion model, converted into the synonyms of the words, and then matched with the confidential keywords, so that whether the words have confidentiality or not can be further found, the words which have confidentiality semantically and are not stored in the confidential keyword library are prevented from being missed, and the accuracy of the matching result is ensured.

In the embodiment of the disclosure, a sound-change word has the same meaning as the original word but is written as the pinyin or the abbreviated letters of the word; the corresponding sound-change model is trained using pre-labeled sound-change words as training samples. A deformed word has the same meaning as the original word but a different written form, for example "forest" written with a variant glyph; the corresponding deformation model is trained using pre-labeled deformed words as training samples. A wrongly written word has the same meaning as the original word but is a misspelled or confused form of it, for example "resolution" and "discrimination" being confused in different contexts; the corresponding wrongly-written-character model is trained using pre-labeled wrongly written words as training samples. A traditional-character word has the same meaning as the original word but is written in traditional Chinese characters, for example the word for "Chinese" written in traditional characters; the corresponding traditional-character model is trained in the same way.
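The fallback from exact matching to synonym conversion (steps S211 and S212) can be sketched as below. The disclosure uses trained conversion models; here a small lookup table stands in for the sound-change model purely for illustration, and the names `match_keyword`, `sound_change`, and `SOUND_CHANGE_TABLE` are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Set

# Hypothetical stand-in for a trained sound-change model: a lookup table.
SOUND_CHANGE_TABLE = {"hangtian": ["space"], "HT": ["space"]}

def sound_change(word: str) -> List[str]:
    """Return candidate original words for a pinyin or abbreviated form."""
    return SOUND_CHANGE_TABLE.get(word, [])

def match_keyword(
    word: str,
    library: Set[str],
    converters: Iterable[Callable[[str], List[str]]],
) -> Optional[str]:
    """Try an exact match first, then each conversion model in turn."""
    if word in library:
        return word
    for convert in converters:
        for candidate in convert(word):
            if candidate in library:
                return candidate
    return None

LIBRARY = {"space"}  # hypothetical confidential keyword library
```

Deformation, wrongly-written-character and traditional-character converters would be passed in the same `converters` list, each with the same word-in, candidates-out signature.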

In a possible implementation manner, step S101, obtaining a sub-text containing the confidential keywords and their context information from the text, comprises the following steps:

step S501, determining the secret-related keywords in the text;

step S502, determining, as the sub-text, the text between the preset cut-off symbols immediately preceding and immediately following the position of the confidential keyword.

In the embodiment of the present disclosure, the preset cut-off symbols may include punctuation marks, such as periods, commas, semicolons, exclamation marks, ellipses, and the like. In one example, a period usually marks the end of a sentence and the end of a unit of meaning, so the text between the two periods immediately before and after the position of the confidential keyword may be determined as the sub-text. In another example, the preset cut-off symbols may further include control characters, such as a line feed or a carriage return, and the text between the two carriage returns immediately before and after the position of the confidential keyword may be determined as the sub-text. In another example, different cut-off symbols may be used at the two ends, for example taking as the sub-text the text from the preceding period to the position of the confidential keyword together with the text from that position to the following ellipsis.
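Steps S501 and S502 can be sketched as a scan outward from the keyword position until a cut-off symbol is reached on each side. The particular set of cut-off symbols below (sentence-ending punctuation and the line feed) is an assumption; the disclosure leaves the set configurable. The function name `sub_text` is also illustrative.

```python
from typing import Optional

# Assumed cut-off symbols: sentence-ending punctuation and line feed.
CUTOFFS = set(".。;；!！?？\n")

def sub_text(text: str, keyword: str, cutoffs=CUTOFFS) -> Optional[str]:
    """Return the text between the cut-off symbols surrounding the first
    occurrence of keyword, or None if the keyword does not occur."""
    pos = text.find(keyword)
    if pos == -1:
        return None
    start = pos
    while start > 0 and text[start - 1] not in cutoffs:
        start -= 1  # walk back to just after the preceding cut-off symbol
    end = pos + len(keyword)
    while end < len(text) and text[end] not in cutoffs:
        end += 1    # walk forward to the following cut-off symbol
    return text[start:end].strip()

sample = "Project background. The flight speed reaches 50 km/h. End of report."
```

Treating every "." as a cut-off also splits decimal numbers such as "50.5", so a practical version would need a more careful sentence boundary rule.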

In a possible implementation, determining, in step S102, the dependency relationship between the confidential keywords according to the sub-text comprises:

step S601, inputting the sub-text into a dependency parsing model, and outputting, through the dependency parsing model, the dependency relationship between two entity words in the sub-text.

In the embodiment of the disclosure, dependency parsing includes automatically deriving the syntactic structure of a sentence according to a given grammar, identifying the entity words contained in the sentence, and analyzing the dependency relationships among the entity words. In dependency syntax, a dependency relationship between two entity words forms a dependency pair, in which one word is the core word, also called the dominant word, and the other is the modifier, also called the dependent word. The dependency relationship is represented by a directed arc, called a dependency arc, whose starting point is the dominant word (core word) and whose end point is the dependent word. In one example, a syntax tree may be generated by combining a Probabilistic Context-Free Grammar (PCFG) with a dependency grammar.

In the embodiment of the present disclosure, the dependency relationship may include: a subject-verb relationship (SBV), for example, in "I send her a bouquet of flowers", "I" is the subject and "send" is the predicate; a core relationship (HED), which refers to the core of the entire sentence; a verb-object relationship (VOB), for example, in "I send her a bouquet of flowers", "send" is the verb and "flowers" is the direct object; a fronting-object relationship (FOB), for example, in "what book he reads", the object "book" precedes the verb "read"; and an attributive relationship (ATT), for example, "red apples". The dependency relationship may also include an adverbial relationship, a coordinate relationship, and the like.

By using the dependency parsing model to extract the dependency relationships among the entity words in the sub-text, the method and the device can eliminate ambiguity among different grammatical or phrase structures.
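As an illustration of how the output of step S601 might be represented, the sketch below models dependency arcs as (dominant word, dependent word, relation label) records and joins SBV and VOB arcs into triples. The class `DependencyArc`, the helper functions, and the hand-written arcs are hypothetical; in practice the arcs would come from a trained dependency parsing model, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyArc:
    head: str       # dominant (core) word: starting point of the arc
    dependent: str  # dependent word: end point of the arc
    relation: str   # relation label, e.g. "SBV", "VOB", "ATT", "HED"


# Hand-written arcs for "I send her a bouquet of flowers" (assumed parser output).
arcs = [
    DependencyArc("ROOT", "send", "HED"),
    DependencyArc("send", "I", "SBV"),
    DependencyArc("send", "flowers", "VOB"),
]


def relations_between(arcs, keywords):
    """Keep only arcs whose two ends are both in the given keyword set."""
    kw = set(keywords)
    return [a for a in arcs if a.head in kw and a.dependent in kw]


def subject_verb_object(arcs):
    """Join SBV and VOB arcs that share a verb into (subject, verb, object) triples."""
    return [
        (s.dependent, s.head, o.dependent)
        for s in arcs if s.relation == "SBV"
        for o in arcs if o.relation == "VOB" and o.head == s.head
    ]


print(subject_verb_object(arcs))
print(relations_between(arcs, ["send", "flowers"]))
```

Representing each relationship as a triple keeps the later comparison with the rule base a simple element-wise match.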

In one possible implementation, in step S103, the dependency relationship is matched against the dependency relationships of the secret-related keywords in a preset secret-related information rule base containing dependency relationships among a plurality of secret-related keywords, wherein the condition for a successful match is set to be at least one of the following:

In the embodiment of the present disclosure, matching the dependency relationship with the dependency relationships of the secret-related keywords in the preset secret-related information rule base includes two parts: matching the dependency relationship between the secret-related keywords in the sub-text against the dependency relationship between the secret-related keywords in the rule base, and matching the secret-related keywords in the sub-text against the secret-related keywords in the rule base. Words with the same meaning have been explained as synonyms in the above embodiments and are not described again here. In the embodiment of the present disclosure, keyword matching also succeeds when a confidential keyword in the sub-text falls within the category range of a secret-related keyword in the secret-related information rule base. In one example, the rule relationships between secret-related keywords stored in the rule base include "dealer-sell-fruit". If the sub-text reads "Yesterday afternoon, the grocery store downstairs sold apples at 8 yuan per jin", then, since the grocery store falls within the category range of the dealer and the apple falls within the category range of the fruit, the sub-text is determined to be a confidential text.

In a possible implementation, before obtaining, in step S101, the sub-text containing the confidential keywords and the context information thereof from the text, the method further comprises:
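The "dealer-sell-fruit" example can be sketched as follows. The category table `CATEGORY_PARENT`, the word-form table `WORD_FORMS`, and the rule that all three elements of a triple must match are assumptions made for this illustration; the disclosure leaves the exact matching conditions open.

```python
# Hypothetical category hierarchy: each word maps to its broader category.
CATEGORY_PARENT = {"grocery store": "dealer", "apple": "fruit"}

# Hypothetical normalization of inflected forms to the form used in the rule base.
WORD_FORMS = {"sold": "sell", "sells": "sell"}


def belongs_to(word, category, parents=CATEGORY_PARENT):
    """True if `word` equals `category` or lies below it in the hierarchy."""
    seen = set()
    while word is not None and word not in seen:
        if word == category:
            return True
        seen.add(word)
        word = parents.get(word)
    return False


def triple_matches(triple, rule):
    """Match one (subject, relation, object) triple against one rule triple."""
    subj, rel, obj = triple
    rule_subj, rule_rel, rule_obj = rule
    return (
        belongs_to(subj, rule_subj)
        and WORD_FORMS.get(rel, rel) == rule_rel
        and belongs_to(obj, rule_obj)
    )


def is_secret_related(triples, rule_base):
    """A sub-text is flagged if any of its triples matches any rule."""
    return any(triple_matches(t, r) for t in triples for r in rule_base)


rule_base = [("dealer", "sell", "fruit")]
print(is_secret_related([("grocery store", "sold", "apple")], rule_base))
```

A synonym lookup of the kind described in the earlier embodiments could be applied to each word before calling `belongs_to`.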

step S801, converting the file format of the text into a preset standard format when the file format of the text is a non-standard format.

In the embodiment of the present disclosure, the text to be examined may be in any of a plurality of file formats, for example, word, ppt, excel, pdf, txt, jpg, and the like. Because different file formats follow different file construction specifications, the methods for extracting their text content may also differ. Different file formats therefore need to be converted into a uniform standard format, which helps to eliminate ambiguity in the subsequent matching of secret-related keywords. In one example, the file format of the text may be converted into the preset standard format by the conversion methods listed in Table 1.

TABLE 1

In a possible implementation, the secret-related information rule base includes secret-related information rule bases for a plurality of technical fields, and before step S103 matches the dependency relationship with the dependency relationships of the secret-related keywords in the preset secret-related information rule base containing the dependency relationships among the plurality of secret-related keywords, the method further includes:

and step S111, determining a secret-related information rule base matched with the technical field according to the technical field of the sub-text.

In the embodiment of the present disclosure, the technical fields may include aviation, aerospace, ships, traffic, communication, and the like. Some words are secret-related keywords in certain fields but not in others; for example, "navigation" is not a secret-related keyword in the traffic field but may be one in the aviation field or the ship field. The embodiment of the present disclosure therefore sets up a separate secret-related information rule base for each field.

In a possible implementation, the secret-related information rule base is configured to be generated as follows:

step S113, merging data of a plurality of preset classified texts in the same technical field;
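Step S111 can be pictured as a lookup keyed by technical field. In the sketch below, the dictionary `FIELD_RULE_BASES`, its contents, and the assumption that the field of the sub-text is already known are all hypothetical; how the field is determined is not specified here.

```python
# Hypothetical per-field keyword sets standing in for full rule bases.
FIELD_RULE_BASES = {
    "traffic": {"signal"},
    "aviation": {"navigation", "radar"},
    "ships": {"navigation", "sonar"},
}


def select_rule_base(field, rule_bases=FIELD_RULE_BASES):
    """Return the rule base for the given technical field (empty if unknown)."""
    return rule_bases.get(field, set())


def is_keyword_in_field(word, field):
    """Check whether `word` is a secret-related keyword in the given field."""
    return word in select_rule_base(field)


print(is_keyword_in_field("navigation", "traffic"))
print(is_keyword_in_field("navigation", "aviation"))
```

Keeping one rule base per field means the same word can be evaluated differently depending on the field of the sub-text, as in the "navigation" example.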

step S114, extracting secret-related keywords in the secret-related texts from the merged data;

and step S115, analyzing and storing the dependency relationship among the secret-related keywords.

In the embodiment of the disclosure, the secret-related information rule base may be stored in the form of a data table and a knowledge graph. The preset confidential texts include data that has been determined in advance to be confidential text. After the data of a plurality of preset confidential texts is merged, operations such as extracting words, labeling the field of each word, and labeling the weight of each word can be performed on the merged data, and the dependency relationships between the words are analyzed and stored. In one example, the relationships among words can be established in various ways, for example, association rules can be used to mine the association relationships between different words. According to the association relationships, different classified texts are merged into a unified secret-related information rule base by means of syntactic rules; the dependency relationship between the field topic and each classified keyword is determined by combining the words and fields in the classified texts, and the result is stored in a standardized form. In the embodiment of the disclosure, the secret-related information rule base supports expansion of both the secret-related keywords and the dependency relationships.

Fig. 5 illustrates a method for constructing a secret-related information rule base according to an exemplary embodiment. Referring to Fig. 5, first, a plurality of texts already determined to be classified are merged and concept discovery 501 is performed, which includes segmentation by field, entity extraction, and discovery of new concepts. Entity extraction extracts entity words on the basis of the segmentation; discovery of new concepts includes automatically constructing new concepts, so that updates to the classified keywords are handled in time and the classified keywords are expanded. Secondly, secret-related keywords are extracted 502, for example by an information gain method, to mine secret-related keywords with different weights; these are combined with a topic extraction model to form secret-related keyword rules that combine keywords and topic words. Thirdly, secret-related information rules are established: association rules among the secret-related keywords can be mined by an association rule mining algorithm or an artificial neural network model, and similar rules are mined and stored using a uniform semantic matching rule. Finally, a knowledge graph is constructed 504: triple relations among the secret-related keywords can be built from the mined association rules, yielding a knowledge graph for the same technical field.

FIG. 2 is a block diagram illustrating an apparatus for determining classified text in accordance with an exemplary embodiment. Referring to fig. 2, the apparatus includes an obtaining module 201, a first determination module 202, and a second determination module 203.
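The association rule mining step can be illustrated with a deliberately simple sketch that computes support and confidence for keyword pairs. This is a plain pairwise count, not a specific algorithm named in the disclosure; the function name `mine_pair_rules`, the thresholds, and the sample keyword lists are assumptions.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(keyword_sets, min_support=0.5, min_confidence=0.6):
    """Mine rules x -> y between keyword pairs.

    support    = fraction of texts containing both x and y
    confidence = (texts containing both x and y) / (texts containing x)
    Returns a sorted list of (x, y, support, confidence) tuples.
    """
    n = len(keyword_sets)
    item_counts = Counter()
    pair_counts = Counter()
    for keywords in keyword_sets:
        unique = sorted(set(keywords))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Evaluate the rule in both directions.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Keywords extracted from four hypothetical classified texts of one field.
texts = [
    ["radar", "range"],
    ["radar", "range", "frequency"],
    ["radar", "frequency"],
    ["range"],
]
for rule in mine_pair_rules(texts, min_support=0.5, min_confidence=0.9):
    print(rule)
```

Each mined pair could then be stored together with its relation label as a triple when the knowledge graph is built.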

An obtaining module 201, configured to obtain a sub-text containing the confidential keywords and the context information thereof from the text;

a first determining module 202, configured to determine, according to the sub-text, a dependency relationship between the confidential keywords;

and the second determining module 203 is configured to match the dependency relationship with the dependency relationship of the secret-related keywords in a preset secret-related information rule base containing dependency relationships among the plurality of secret-related keywords, and if the matching is successful, determine that the text is a secret-related text.

acquiring words in the text;

and matching the synonyms with the secret-related keywords.

a sound variation model, a deformation model and a wrongly written character model.

In one possible implementation manner, the obtaining module includes:

In one possible implementation manner, the first determining module includes:

In one possible implementation, the apparatus further includes:

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

FIG. 6 is a block diagram illustrating an apparatus 600 for determining classified text in accordance with one exemplary embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 6, apparatus 600 may include one or more of the following components: processing component 602, memory 604, power component 606, multimedia component 608, audio component 610, input/output (I/O) interface 612, sensor component 614, and communication component 616.

The processing component 602 generally controls overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.

The memory 604 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Power supply component 606 provides power to the various components of device 600. The power components 606 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 600.

The multimedia component 608 includes a screen that provides an output interface between the device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 610 is configured to output and/or input audio signals. For example, audio component 610 includes a Microphone (MIC) configured to receive external audio signals when apparatus 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.

The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the apparatus 600. For example, the sensor component 614 may detect an open/closed state of the device 600 and the relative positioning of components, such as the display and keypad of the device 600. The sensor component 614 may also detect a change in position of the device 600 or of a component of the device 600, the presence or absence of user contact with the device 600, the orientation or acceleration/deceleration of the device 600, and a change in temperature of the device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 616 is configured to facilitate communications between the apparatus 600 and other devices in a wired or wireless manner. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Fig. 7 is a block diagram illustrating an apparatus 700 for determining confidential text, according to an example embodiment. For example, the apparatus 700 may be provided as a server. Referring to fig. 7, apparatus 700 includes a processing component 722 that further includes one or more processors and memory resources, represented by memory 732, for storing instructions, such as applications, that are executable by processing component 722. The application programs stored in memory 732 may include one or more modules that each correspond to a set of instructions. Further, the processing component 722 is configured to execute instructions to perform the above-described methods.

The apparatus 700 may also include a power component 726 configured to perform power management of the apparatus 700, a wired or wireless network interface 750 configured to connect the apparatus 700 to a network, and an input/output (I/O) interface 758. The apparatus 700 may operate based on an operating system stored in memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided that includes instructions, such as the memory 732 that includes instructions, which are executable by the processing component 722 of the apparatus 700 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed invention. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method for determining a classified text, comprising:

2. The method according to claim 1, wherein the secret-related keywords are determined as follows:

acquiring words in the text;

3. The method of claim 2, wherein after matching the term with a secret-related keyword in a predetermined secret-related keyword library, further comprising:

and matching the synonyms with the secret-related keywords.

4. The method of claim 3, wherein the synonym transformation model comprises at least one of:

5. The method of claim 1, wherein obtaining a sub-text containing the confidential keywords and the context information thereof from the text comprises:

determining secret-related keywords in the text;

6. The method of claim 1, wherein determining the dependency relationship between the confidential keywords according to the sub-text comprises:

7. The method according to claim 1, wherein the dependency relationship is matched with the dependency relationship of the secret-related key words in a preset secret-related information rule base containing dependency relationships among a plurality of secret-related key words, wherein the condition that the matching is successful is set as at least one of the following ways:

8. The method according to claim 1, further comprising, before the obtaining the sub-text containing the confidential keywords and the context information thereof from the text:

9. The method of claim 1, wherein the dependency comprises at least one of:

10. The method according to claim 1, wherein the secret-related information rule base comprises secret-related information rule bases for a plurality of technical fields, and before the matching of the dependency relationship with the dependency relationship of the secret-related keywords in the preset secret-related information rule base containing the dependency relationship among the plurality of secret-related keywords, the method further comprises:

11. The method of claim 1, wherein the secret-related information rule base is configured to be generated as follows:

12. An apparatus for determining confidential text, comprising:

and the second determining module is used for matching the dependency relationship with the dependency relationship of the secret-related keywords in a preset secret-related information rule base containing the dependency relationship among the plurality of secret-related keywords, and if the matching is successful, determining the text as the secret-related text.

13. The apparatus of claim 12, wherein the secret-related keywords are determined as follows:

acquiring words in the text;

14. The apparatus of claim 13, wherein after matching the word with a secret-related keyword in a predetermined secret-related keyword library, the apparatus further comprises:

and matching the synonyms with the secret-related keywords.

15. The apparatus of claim 14, wherein the synonym transformation model comprises at least one of:

16. The apparatus of claim 12, wherein the obtaining module comprises:

17. The apparatus of claim 12, wherein the first determining module comprises:

18. The apparatus according to claim 12, wherein the dependency relationship is matched with the dependency relationship of the secret-related key words in a preset secret-related information rule base containing dependency relationships among a plurality of secret-related key words, and the condition that the matching is successful is set as at least one of the following ways:

19. The apparatus of claim 12, further comprising:

20. The apparatus of claim 12, wherein the dependency comprises at least one of:

21. The apparatus of claim 12, further comprising:

22. The apparatus of claim 12, wherein the secret-related information rule base is configured to be generated as follows:

23. An apparatus for determining confidential text, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to: performing the method of any one of claims 1 to 11.

24. A non-transitory computer readable storage medium having instructions therein which, when executed by a processor, enable the processor to perform the method of any of claims 1 to 11.