CN110414006A - Theme annotation method, apparatus, electronic device and storage medium for text - Google Patents
- Publication number
- CN110414006A (application CN201910703411.XA)
- Authority
- CN
- China
- Prior art keywords
- theme
- text
- marked
- entity word
- word
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a theme annotation method, apparatus, electronic device and storage medium for text. The method includes: obtaining a text to be annotated; segmenting the text to be annotated with a first word segmentation algorithm to obtain at least one entity word; using the entity word to determine the theme of the text to be annotated according to a preset correspondence rule between entity words and alternative themes; and, according to the theme of the text to be annotated, performing theme annotation on the text and outputting a theme annotation result. Based on the correspondence rule between entity words and alternative themes, the invention segments the text to be annotated and matches the result against the correspondence rule to obtain the theme of the text, thereby achieving efficient and accurate theme annotation of text.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a theme annotation method, apparatus, electronic device and storage medium for text.
Background
In recent years, with the rapid development of the Internet, information resources have been growing exponentially. The abundant information resources of the Internet bring great convenience to people's lives: people can easily and quickly obtain various types of information resources, of which text is an important one. However, in this era of big data, a user facing a massive amount of text finds it difficult to obtain accurately and quickly the texts related to the theme he or she needs. Performing theme annotation on text efficiently and accurately is therefore a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of this, an object of the present invention is to propose a theme annotation method, apparatus, electronic device and storage medium for text that can annotate the theme of a text efficiently and accurately.
Based on the above object, the present invention provides a theme annotation method for text, comprising:
obtaining a text to be annotated;
segmenting the text to be annotated with a first word segmentation algorithm to obtain at least one entity word;
using the entity word to determine the theme of the text to be annotated according to a preset correspondence rule between entity words and alternative themes;
according to the theme of the text to be annotated, performing theme annotation on the text to be annotated and outputting a theme annotation result.
In addition, the present invention also provides a theme annotation apparatus for text, comprising:
an obtaining module, for obtaining a text to be annotated;
a word segmentation module, for segmenting the text to be annotated with a first word segmentation algorithm to obtain at least one entity word;
a determining module, for using the entity word to determine the theme of the text to be annotated according to a preset correspondence rule between entity words and alternative themes;
an annotation module, for performing theme annotation on the text to be annotated according to its theme and outputting a theme annotation result.
In addition, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method described in any one of the above when executing the program.
In addition, the present invention also provides a non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute the method described in any one of the above.
As can be seen from the above, the theme annotation method, apparatus, electronic device and storage medium for text provided by the present invention are based on the correspondence between entity words and alternative themes: the text to be annotated is segmented and the result is matched against the correspondence rule to obtain the theme of the text, thereby achieving efficient and accurate theme annotation of text.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the theme annotation method for text according to an embodiment of the present invention;
Fig. 2 is a flow chart of the steps for determining the theme of the text to be annotated in an embodiment of the present invention;
Fig. 3 is a flow chart of the steps for determining the key entity word in an embodiment of the present invention;
Fig. 4 is a flow chart of the steps for selecting among the alternative themes corresponding to the key entity word in an embodiment of the present invention;
Fig. 5 is a flow chart of the steps for updating the correspondence rule between entity words and alternative themes in an embodiment of the present invention;
Fig. 6 is a flow chart of the external processing steps in an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of the theme annotation apparatus for text according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in more detail below in conjunction with specific embodiments and with reference to the drawings.
It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present invention have the ordinary meaning understood by a person of ordinary skill in the field of this disclosure. The terms "first", "second" and similar words used in this disclosure do not denote any order, quantity or importance, and are only used to distinguish different components. Words such as "comprise" or "include" mean that the element or object preceding the word covers the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Words such as "connect" or "connected" are not limited to physical or mechanical connections, and may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
An embodiment of the present invention provides a theme annotation method for text which, with reference to Fig. 1, comprises the following steps:
Step 101: obtain the text to be annotated.
In this step, the text that needs theme annotation is obtained first. The specific ways of obtaining it may include the following. The text may be received directly from the user, i.e., the user sends it as a text file; after the text file is received, it is read and the text to be annotated is extracted from it. Alternatively, the text may be fetched from a storage location according to location information sent by the user. The location information can be a local storage address or a network storage address; according to the location information, the corresponding storage location is accessed and its stored data is read, so as to extract the text to be annotated.
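The two acquisition routes of step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `obtain_text`, the UTF-8 default and the rule "http/https means network address, anything else is a local path" are all assumptions made for the example.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def obtain_text(location: str, encoding: str = "utf-8") -> str:
    """Read the text to be annotated from a local or network storage address.

    Hypothetical helper: the patent only says the location may be a local
    storage address or a network storage address.
    """
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        # Network storage address: fetch the stored data and decode it.
        with urlopen(location) as response:
            return response.read().decode(encoding)
    # Otherwise treat the location as a local file path.
    return Path(location).read_text(encoding=encoding)
```

A text file received directly from the user can be handled by the same function once it has been saved to a local path.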
Step 102: segment the text to be annotated with a first word segmentation algorithm to obtain at least one entity word.
In this step, word segmentation processing is performed on the obtained text to be annotated. Specifically, word segmentation processing includes a segmentation process applied to the text and a screening process applied to the segmentation result. In the segmentation process, the text to be annotated is divided into several words that conform to natural semantic rules. These words differ in part of speech and generally include nouns, verbs, adverbs, conjunctions, modal particles and so on. In order to reflect more accurately the theme to which the text belongs, the words obtained by segmentation are further screened to obtain the entity words. An entity word is a word that can accurately reflect the theme to which the text to be annotated belongs. The specific determination of entity words can be based on the keyword and high-frequency word determination rules already present in the first word segmentation algorithm used in this embodiment. In general, when a word occurs many times in the text to be annotated, it can be determined to be a high-frequency word. A keyword can generally be such a high-frequency word, or it can be determined in another way. The specific determination rule depends on which algorithm is used as the first word segmentation algorithm. In this embodiment the first word segmentation algorithm can be, for example, NLTK or jieba. The principles and processing details of these word segmentation algorithms are prior art and are not described further in this embodiment.
For example, the text to be annotated obtained in step 101 is: "Bitcoin attracts countless fans, but the heart of the Central Bank lies elsewhere | Interface News". After the text is segmented by a Chinese word segmentation algorithm, the resulting words are: "Bitcoin", "attracts fans", "countless", "but", "Central Bank", "of", "heart", "has another", "belonging", "interface", "news". Screening then removes conjunctions, adverbs and other words unrelated to the content, and the final output of the word segmentation, i.e., the obtained entity words, is: "Bitcoin", "Central Bank", "interface", "news".
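The screening stage can be illustrated with a small frequency-based filter. This is only a sketch of the "high-frequency word" idea mentioned above: the stop-word set, the `min_count` threshold and the function name are assumptions, and the tokens are supplied directly rather than produced by a real segmenter such as jieba or NLTK.

```python
from collections import Counter
from typing import Iterable, List, Set


def high_frequency_words(tokens: Iterable[str],
                         stop_words: Set[str],
                         min_count: int = 2) -> List[str]:
    """Keep words that occur at least `min_count` times and are not stop words.

    Hypothetical screening rule; a real segmenter applies its own
    keyword and high-frequency word rules.
    """
    counts = Counter(t for t in tokens if t not in stop_words)
    # Counter keeps first-seen order, so the result follows the text order.
    return [word for word, n in counts.items() if n >= min_count]


tokens = ["Bitcoin", "but", "Bitcoin", "Central Bank", "but", "Central Bank", "heart"]
entity_words = high_frequency_words(tokens, stop_words={"but"})
```

With this toy input, `entity_words` contains "Bitcoin" and "Central Bank"; "heart" is dropped because it occurs only once, and "but" because it is a stop word.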
Step 103: using the entity word, determine the theme of the text to be annotated according to the preset correspondence rule between entity words and alternative themes.
In this step, the theme of the text to be annotated is determined with a preset correspondence rule between entity words and alternative themes. Specifically, the correspondence rule can be set up in the form of a database. In the correspondence rule, the specific relationship between entity words and alternative themes is generally one-to-many, i.e., one alternative theme corresponds to multiple entity words.
The correspondence rule is set in advance. The raw data on the correspondence between entity words and alternative themes can be obtained from existing relevant databases, or can be built from data provided by professionals in each field.
In this step, the entity words obtained in the preceding step are looked up and matched in the preset correspondence rule. Each matched entity word yields its corresponding alternative themes, and the alternative themes obtained by matching can serve as the theme of the text to be annotated.
For example, take the entity words obtained above by segmenting the text to be annotated: "Bitcoin", "Central Bank", "interface", "news". The correspondence rule gives: the alternative themes corresponding to "Bitcoin" are [economy], [science and technology]; the alternative themes corresponding to "Central Bank" are [economy], [society]; the alternative themes corresponding to "interface" are [science and technology], [culture]; the alternative theme corresponding to "news" is [culture].
Based on the alternative themes corresponding to each entity word, the theme of the text to be annotated is then determined. Specifically, all alternative themes corresponding to all entity words can together be determined as the theme of the text, i.e., the union of the alternative themes of every entity word is taken. The theme of the text to be annotated is then: [economy], [science and technology], [society], [culture].
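The union described above can be sketched in a few lines. The dictionary layout of the correspondence rule and the function name `themes_by_union` are assumptions for illustration; the patent only says the rule can be stored in a database.

```python
from typing import Dict, List

# Hypothetical in-memory form of the correspondence rule:
# entity word -> alternative themes (values taken from the example above).
RULE: Dict[str, List[str]] = {
    "Bitcoin": ["economy", "science and technology"],
    "Central Bank": ["economy", "society"],
    "interface": ["science and technology", "culture"],
    "news": ["culture"],
}


def themes_by_union(entity_words: List[str], rule: Dict[str, List[str]]) -> List[str]:
    """Return the union of the alternative themes of all matched entity words."""
    themes: List[str] = []
    for word in entity_words:
        for theme in rule.get(word, []):  # unmatched words contribute nothing
            if theme not in themes:       # keep first-seen order, no duplicates
                themes.append(theme)
    return themes


result = themes_by_union(["Bitcoin", "Central Bank", "interface", "news"], RULE)
# result == ["economy", "science and technology", "society", "culture"]
```

An ordered list is used instead of a `set` so that the output order matches the order in which themes are first encountered, as in the example in the text.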
It should be noted that the alternative themes and the themes described in this embodiment are identical in substance: both indicate the field involved in the content of a text. The difference in name only reflects the different processing steps in which they appear. Specifically, the term "alternative theme" is used within the correspondence rule and in steps, such as the lookup and matching of entity words, that precede the final annotation result. Once at least one alternative theme has been selected to annotate the text to be annotated, it is referred to as the theme, i.e., the theme of the text to be annotated.
Step 104: according to the theme of the text to be annotated, perform theme annotation on the text and output a theme annotation result.
In this step, the theme obtained above is used to perform theme annotation on the text to be annotated. A specific way of doing so is to add a data label to the text, record the obtained theme in the data label, and establish a correspondence between the data label and the text to be annotated.
The theme annotation result can be understood as the text to be annotated after theme annotation has been completed successfully. An output operation is then performed on the theme annotation result. The output operation in this embodiment can be: storing the theme annotation result, so that it can be called by other processes that need to identify the theme of the text; displaying it on a display component of the device currently executing the method of this embodiment, so that the user can directly see the theme of the text; or sending it to another device through any wired or wireless data communication method for subsequent processing by that device. In a specific implementation, the output of the theme annotation result can use one or more of the above.
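One way to picture the data label of step 104 is a small record that ties the themes to the text and can be serialized for storage or transmission. The class name `ThemeAnnotation`, its fields and the JSON output format are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ThemeAnnotation:
    """Hypothetical data label: the annotated text together with its themes."""
    text: str
    themes: List[str]


def annotate(text: str, themes: List[str]) -> ThemeAnnotation:
    # Copy the list so later changes to `themes` do not alter the label.
    return ThemeAnnotation(text=text, themes=list(themes))


def to_json(annotation: ThemeAnnotation) -> str:
    # ensure_ascii=False keeps Chinese text readable in the stored result.
    return json.dumps(asdict(annotation), ensure_ascii=False)


label = annotate("some text to be annotated", ["economy", "society"])
payload = to_json(label)
```

The JSON string `payload` could then be stored, displayed or sent to another device, matching the three output options listed above.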
It can be seen that the theme annotation method of this embodiment is based on the correspondence rule between entity words and alternative themes: the text to be annotated is segmented and the result is matched against the correspondence rule to obtain the theme of the text, achieving efficient and accurate theme annotation. The correspondence rule between entity words and alternative themes is established in advance from various sources and data, and the theme is determined through the correspondence between entity words and alternative themes. When themes are divided in detail according to the hierarchy of their fields, theme annotation in subdivided fields can be achieved, which gives the method good prospects for practical application.
As an optional embodiment, in the theme annotation method for text, the step of using the entity word to determine the theme of the text to be annotated according to the preset correspondence rule between entity words and alternative themes further takes into account the influence that different semantic interpretations of an entity word have on the correspondence rule. In natural language, the same word may have different meanings, i.e., different explanations in different fields. For example, the word "Cao Cao" is the name of a historical figure, for which the corresponding alternative theme can be [history]; at the same time "Cao Cao" is the title of a song, for which the corresponding alternative theme can be [amusement]. Different semantic interpretations therefore affect the theme to which a word corresponds.
With reference to Fig. 2, this embodiment specifically includes the following steps:
Step 201: for each entity word, determine according to the correspondence rule at least one explanation data item corresponding to the entity word, where each explanation data item corresponds to at least one alternative theme.
In this embodiment, the correspondence rule between entity words and alternative themes builds the correspondence through an intermediate layer of explanation data. That is, an entity word first corresponds to at least one explanation data item, and each explanation data item in turn corresponds to one or more alternative themes. The correspondence between entity words and alternative themes is thereby extended to a correspondence among entity words, explanation data and alternative themes. For each entity word, its at least one explanation data item is determined first; each explanation data item in turn corresponds to at least one alternative theme.
For example, segmenting the text to be annotated "Bitcoin attracts countless fans, but the heart of the Central Bank lies elsewhere | Interface News" yields the entity words: "Bitcoin", "Central Bank", "interface", "news".
The entity word "Bitcoin" corresponds to three explanation data items: 'virtual currency', 'movie name', 'book title'. The alternative themes corresponding to 'virtual currency' are { [economy], [science and technology] }; the alternative themes corresponding to 'movie name' are { [amusement], [film] }; the alternative themes corresponding to 'book title' are { [culture], [painting and calligraphy] }.
The entity word "Central Bank" corresponds to one explanation data item, 'financial institution', whose alternative themes are { [economy], [society] }.
The entity word "interface" corresponds to two explanation data items: 'data mode', 'physical object'. The alternative themes corresponding to 'data mode' are { [culture], [amusement], [science and technology] }; the alternative theme corresponding to 'physical object' is { [nature] }.
The entity word "news" corresponds to two explanation data items: 'style', 'song title'. The alternative theme corresponding to 'style' is { [culture] }; the alternative themes corresponding to 'song title' are { [culture], [amusement] }.
Step 202: determine the entity word that corresponds to the fewest explanation data items as the key entity word.
In this step, one key entity word is determined among the several entity words; it is the entity word that most accurately reflects the theme of the text to be annotated. Specifically, the key entity word is determined according to the number of explanation data items corresponding to each entity word: the entity word with the fewest explanation data items is determined as the key entity word. The fewer explanation data items an entity word has, the more likely it is to have a unique semantic interpretation. Therefore, when such an entity word appears in the text to be annotated, the text very probably concerns that unique semantic interpretation of the entity word.
For example, the entity word "Bitcoin" corresponds to three explanation data items, the entity word "Central Bank" to one, the entity word "interface" to two, and the entity word "news" to two. The entity word "Central Bank" has the fewest explanation data items, so it is determined as the key entity word. That is, "Central Bank" has a unique semantic interpretation, 'financial institution', and the text to be annotated very probably records content directly related to the "Central Bank".
Step 203: determine the alternative themes corresponding to the key entity word as the theme of the text to be annotated.
In this step, the alternative themes corresponding to the determined key entity word are determined as the theme of the text to be annotated.
For example, the key entity word is determined to be "Central Bank", and the alternative themes corresponding to the entity word "Central Bank" are { [economy], [society] }, so [economy] and [society] are determined as the theme of the text to be annotated.
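Steps 201 to 203 can be sketched with a nested mapping from entity word to explanation data to alternative themes. The data layout and the function names `key_entity_word` and `themes_of` are assumptions for illustration; the values come from the example above.

```python
from typing import Dict, List, Optional

# Hypothetical rule layout: entity word -> explanation data -> alternative themes.
Rules = Dict[str, Dict[str, List[str]]]

RULES: Rules = {
    "Bitcoin": {
        "virtual currency": ["economy", "science and technology"],
        "movie name": ["amusement", "film"],
        "book title": ["culture", "painting and calligraphy"],
    },
    "Central Bank": {"financial institution": ["economy", "society"]},
    "interface": {
        "data mode": ["culture", "amusement", "science and technology"],
        "physical object": ["nature"],
    },
    "news": {"style": ["culture"], "song title": ["culture", "amusement"]},
}


def key_entity_word(entity_words: List[str], rules: Rules) -> Optional[str]:
    """Step 202: pick the entity word with the fewest explanation data items."""
    known = [w for w in entity_words if w in rules]
    if not known:
        return None  # no entity word matches the rule
    return min(known, key=lambda w: len(rules[w]))


def themes_of(word: str, rules: Rules) -> List[str]:
    """Step 203: collect the alternative themes of one entity word, without duplicates."""
    themes: List[str] = []
    for theme_list in rules[word].values():
        for theme in theme_list:
            if theme not in themes:
                themes.append(theme)
    return themes


key = key_entity_word(["Bitcoin", "Central Bank", "interface", "news"], RULES)
themes = themes_of(key, RULES)
# key == "Central Bank", themes == ["economy", "society"]
```

Note that `min` returns the first minimal element when several entity words tie; this sketch does not handle such ties in any principled way.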
It can be seen that the method of this embodiment adds explanation data to the correspondence rule between entity words and alternative themes, so as to determine the key entity word that more accurately reflects the content of the text to be annotated, and determines the theme of the text from the alternative themes corresponding to that key entity word, achieving more accurate theme annotation.
As an optional embodiment, building on the preceding embodiment that includes explanation data, more than one entity word may correspond to the fewest explanation data items; this embodiment gives a method of choosing among them. With reference to Fig. 3, the step of determining the key entity word specifically includes:
Step 301: if more than one entity word corresponds to the fewest explanation data items, determine for each of these entity words the number of alternative themes it corresponds to;
Step 302: determine the entity word with the smallest number of corresponding alternative themes as the key entity word.
For example, two entity words each correspond to only one explanation data item. The explanation data item of one entity word corresponds to two alternative themes, and the explanation data item of the other entity word corresponds to one alternative theme. Through the above steps of this embodiment, the entity word whose explanation data item corresponds to one alternative theme is determined as the key entity word. The theme of the text to be annotated is then determined to be the one alternative theme corresponding to that key entity word.
Here, the number of alternative themes corresponding to an entity word is the sum of the numbers of alternative themes corresponding to each of its explanation data items. For example, two entity words each correspond to two explanation data items. For the first entity word, the first explanation data item corresponds to two alternative themes and the second explanation data item also corresponds to two, so the number of alternative themes corresponding to this entity word is four. For the other entity word, the first explanation data item corresponds to two alternative themes and the second corresponds to one, so the number of alternative themes corresponding to this entity word is three. Accordingly, the entity word whose number of alternative themes is three is determined as the key entity word.
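The tie-break of steps 301 and 302 amounts to comparing entity words first by their number of explanation data items and then by their total number of alternative themes. The sketch below expresses this as a tuple sort key; the rule layout, the placeholder names (`word_a`, `theme_1`, and so on) and the function name are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical rule layout: entity word -> explanation data -> alternative themes.
Rules = Dict[str, Dict[str, List[str]]]


def key_entity_word_with_tiebreak(entity_words: List[str], rules: Rules) -> Optional[str]:
    """Fewest explanation data items first; ties broken by fewest alternative themes."""
    known = [w for w in entity_words if w in rules]
    if not known:
        return None

    def rank(word: str) -> Tuple[int, int]:
        explanations = rules[word]
        # Sum of theme counts over all explanation data items, as defined in the text.
        theme_total = sum(len(themes) for themes in explanations.values())
        return (len(explanations), theme_total)

    return min(known, key=rank)


# The second worked example above: both words have two explanation data items,
# with 2 + 2 = 4 and 2 + 1 = 3 alternative themes respectively.
example_rules: Rules = {
    "word_a": {"explanation_1": ["theme_1", "theme_2"], "explanation_2": ["theme_3", "theme_4"]},
    "word_b": {"explanation_1": ["theme_1", "theme_2"], "explanation_2": ["theme_5"]},
}
chosen = key_entity_word_with_tiebreak(["word_a", "word_b"], example_rules)
# chosen == "word_b"
```

If two entity words are still tied on both counts, `min` simply returns the one that appears first; the text does not specify that case.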
With the method of determining the key entity word in this embodiment, the number of themes finally determined for the text to be annotated is smaller, which helps improve the accuracy of theme annotation.
As an optional embodiment, building on the preceding embodiment that includes explanation data, a step of selecting among the alternative themes corresponding to the key entity word is further included; with this step, the text to be annotated can be given exactly one theme. In some cases the number of themes of the text to be annotated needs to be one, to facilitate subsequent applications. With reference to Fig. 4, the step of selecting among the alternative themes corresponding to the key entity word specifically includes:
Step 401: when the key entity word corresponds to multiple alternative themes, obtain the history annotation data of these alternative themes and, according to the history annotation data, determine the number of times each alternative theme has been used for theme annotation.
In this step, when the key entity word corresponds to multiple alternative themes, the history annotation data of each of these alternative themes is obtained. The history annotation data is the historical record of the alternative theme having been used to annotate text. It resembles log data, and is generated and stored by the party providing the theme annotation function whenever theme annotation is performed. In this embodiment, the history annotation data can come from the device implementing the method of this embodiment, or can be obtained from an external source. Specifically, the history annotation data can record data items such as the time at which the alternative theme was used for annotation and the annotated object.
Then, according to the history annotation data corresponding to each alternative theme, the number of times the alternative theme has been used for theme annotation is determined.
Step 402: determine the alternative theme that has been used for theme annotation the most times as the theme of the text to be annotated.
In this step, the alternative theme that has been used for theme annotation the most times is determined as the theme of the text to be annotated.
For example, consider two alternative themes, [economy] and [society]. The alternative theme [economy] has been used for theme annotation 10000 times, and the alternative theme [society] 9000 times. [economy] is then determined as the theme of the text to be annotated.
Under normal circumstances, because the volume of history annotation data is large, the alternative theme used for theme annotation the most times can be determined uniquely. If, in some cases, more than one alternative theme shares the highest number of uses, a prompt can be issued, and one of them is then determined as the theme of the text to be annotated according to the user's feedback.
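Steps 401 and 402, together with the tie case just described, can be sketched as a count over a history log. The log is modelled here as a plain list of theme names, one per past annotation; real history annotation data would also carry times and annotated objects. The function name and the use of `None` to signal "ask the user" are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Optional


def pick_single_theme(candidates: List[str], history: Iterable[str]) -> Optional[str]:
    """Return the candidate used most often in `history`, or None on a tie.

    Hypothetical helper: a None result stands for "prompt the user to choose".
    """
    if not candidates:
        return None
    counts = Counter(history)  # theme name -> number of past uses
    best = max(counts[c] for c in candidates)
    winners = [c for c in candidates if counts[c] == best]
    return winners[0] if len(winners) == 1 else None


# The figures from the example above: 10000 uses of [economy], 9000 of [society].
history_log = ["economy"] * 10000 + ["society"] * 9000
selected = pick_single_theme(["economy", "society"], history_log)
# selected == "economy"
```

Candidates that never appear in the log count as zero uses, because `Counter` returns 0 for missing keys.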
As an optional embodiment, in the theme annotation method for text, after theme annotation of the text to be annotated has been completed and the theme annotation result has been output, a step of updating the correspondence rule between entity words and alternative themes is further included. With reference to Fig. 5, the step of updating the correspondence rule between entity words and alternative themes specifically includes:
Step 501: segment the text to be annotated with a second word segmentation algorithm to obtain supplementary entity words.
In this step, the text to be annotated is segmented again with a second word segmentation algorithm, and the segmentation yields several entity words, which this embodiment calls supplementary entity words. The second word segmentation algorithm can be chosen from the various word segmentation algorithms listed in the preceding embodiment; it is only necessary to ensure that the second word segmentation algorithm differs from the first, so that the supplementary entity words may differ from the entity words obtained by the first word segmentation algorithm.
When the supplementary entity words differ from the entity words, the supplementary entity words can be used to update the correspondence rule between entity words and alternative themes.
Step 502: update the correspondence rule between entity words and alternative themes according to the theme of the text to be annotated and the supplementary entity words.
In this step, the theme of the text to be annotated has already been determined by the preceding steps. On this basis, a correspondence can be established between the theme of the text and the supplementary entity words obtained by the second word segmentation algorithm, which differs from the first. That is, the supplementary entity words obtained by segmenting the text to be annotated can also be mapped to the theme already determined for that text. The correspondence between the supplementary entity words and the theme is added to the preset correspondence rule between entity words and alternative themes, which accomplishes the update of the rule.
With the method of this embodiment, the correspondence rule between entity words and alternative themes can be updated in a feedback manner according to the results of theme annotation, so that the rule is continuously expanded and improved as theme annotation proceeds, effectively improving the accuracy of the method of this embodiment in theme annotation.
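The feedback update of steps 501 and 502 can be sketched as follows. The flat rule layout (entity word to list of alternative themes), the function name and the example words are assumptions for illustration; the two word lists stand in for the outputs of the first and second word segmentation algorithms.

```python
from typing import Dict, List


def update_rule(rule: Dict[str, List[str]],
                entity_words: List[str],
                second_pass_words: List[str],
                themes: List[str]) -> List[str]:
    """Add supplementary entity words to `rule`, mapped to the determined themes.

    Returns the supplementary entity words, i.e. the words found by the second
    segmentation algorithm that the first one did not produce.
    """
    supplementary = [w for w in second_pass_words if w not in entity_words]
    for word in supplementary:
        existing = rule.setdefault(word, [])
        for theme in themes:
            if theme not in existing:  # avoid duplicate themes for a word
                existing.append(theme)
    return supplementary


rule = {"Central Bank": ["economy", "society"]}
added = update_rule(
    rule,
    entity_words=["Bitcoin", "Central Bank"],
    second_pass_words=["Bitcoin", "Central Bank", "currency"],  # hypothetical output
    themes=["economy", "society"],
)
# added == ["currency"]; rule["currency"] == ["economy", "society"]
```

The rule is modified in place, so later annotation runs that share the same dictionary see the new entries immediately.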
As an optional embodiment, in the theme annotation method for text, it may happen that the theme of the text to be annotated cannot be determined through the correspondence rule between entity words and alternative themes. This embodiment provides corresponding follow-up processing steps which, with reference to Fig. 6, specifically include:
Step 601: if the theme of the text to be labeled cannot be determined according to the correspondence rule, input the text to be labeled into a preset theme labeling model.
In this step, the theme corresponding to an entity word may be unobtainable, either because the correspondence rule between entity words and alternative themes does not contain the relevant correspondence or because of other problems such as data processing errors. In that case, the text to be labeled is input into an external, existing theme labeling model. The theme labeling model may be any existing model used for theme labeling, such as LDA (Latent Dirichlet Allocation).
Step 602: receive the output data of the theme labeling model, and use the output data as the theme labeling result.
In this step, the output data of the theme labeling model is received, that is, the theme that the model assigns to the text to be labeled in this embodiment, and this output data is used as the theme labeling result.
With the method of this embodiment, when a theme labeling result cannot be obtained because the correspondence rule between entity words and alternative themes is incomplete or because of a data processing problem, an external theme labeling model supplements and assists the theme labeling. This ensures that a theme labeling result for the text to be labeled can ultimately be output when the method of this embodiment is carried out.
Clearly, based on the output data of the theme labeling model, the update step for the correspondence rule between entity words and alternative themes described in the preceding embodiment can also be performed, further refining that rule.
Based on the same inventive concept, and with reference to Fig. 7, an embodiment of the present invention further provides a theme labeling device for text, comprising:
an obtaining module 701, configured to obtain a text to be labeled;
a word segmentation module 702, configured to segment the text to be labeled by a first word segmentation algorithm to obtain at least one entity word;
a determining module 703, configured to determine the theme of the text to be labeled using the entity word, according to a preset correspondence rule between entity words and alternative themes;
a labeling module 704, configured to perform theme labeling on the text to be labeled according to the theme of the text to be labeled, and to output a theme labeling result.
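The four modules can be pictured as four stages of one pipeline. The sketch below is only an illustration under stated assumptions: the class name `ThemeLabeler`, the use of a caller-supplied `segmenter` callable in place of the first word segmentation algorithm, the simplified rule (entity word mapped directly to one theme), and the placeholder selection logic in `determine` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


class ThemeLabeler:
    """Four stages mirroring the obtaining, word segmentation, determining and labeling modules."""

    def __init__(self, segmenter: Callable[[str], List[str]], rule: Dict[str, str]):
        self.segmenter = segmenter  # stand-in for the first word segmentation algorithm
        self.rule = rule            # simplified rule: entity word -> theme

    def obtain(self, source: str) -> str:
        # Obtaining module: here it only trims surrounding whitespace.
        return source.strip()

    def segment(self, text: str) -> List[str]:
        # Assumption: a token counts as an entity word if the rule knows it.
        return [w for w in self.segmenter(text) if w in self.rule]

    def determine(self, entity_words: List[str]) -> Optional[str]:
        # Placeholder selection: take the theme of the first entity word.
        return self.rule[entity_words[0]] if entity_words else None

    def label(self, text: str, theme: Optional[str]) -> Dict[str, Optional[str]]:
        # Labeling module: attach the theme and return the labeling result.
        return {"text": text, "theme": theme}

    def run(self, source: str) -> Dict[str, Optional[str]]:
        text = self.obtain(source)
        return self.label(text, self.determine(self.segment(text)))
```

For whitespace-delimited text, `ThemeLabeler(str.split, {"apple": "fruit"}).run("fresh apple juice")` labels the text with the theme `fruit`; Chinese text would need a real word segmentation algorithm passed in as `segmenter`.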
In some optional embodiments, the determining module 703 is specifically configured to: for each entity word, determine at least one piece of explanation data corresponding to the entity word according to the correspondence rule, wherein each piece of explanation data corresponds to at least one alternative theme; determine the entity word with the fewest pieces of explanation data as the key entity word; and determine the alternative theme corresponding to the key entity word as the theme of the text to be labeled.
Further, the determining module 703 is specifically configured to: if more than one entity word has the fewest pieces of explanation data, determine the number of alternative themes corresponding to each such entity word, wherein the number of alternative themes corresponding to an entity word is the sum of the numbers of alternative themes corresponding to each piece of explanation data of that entity word; and determine the entity word with the fewest corresponding alternative themes as the key entity word.
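The key entity word selection, including the tie-break on the total number of alternative themes, can be sketched as below. The function name `pick_key_entity_word` and the nested-dict rule layout (entity word, then explanation data, then a list of alternative themes) are assumptions for illustration; the source does not specify a data structure.

```python
from typing import Dict, List, Optional

# Hypothetical rule layout: entity word -> {explanation data -> [alternative themes]}
Rule = Dict[str, Dict[str, List[str]]]


def pick_key_entity_word(entity_words: List[str], rule: Rule) -> Optional[str]:
    """Pick the entity word with the fewest explanations, then the fewest themes."""
    candidates = [w for w in entity_words if rule.get(w)]
    if not candidates:
        return None

    def explanation_count(word: str) -> int:
        return len(rule[word])

    def theme_count(word: str) -> int:
        # Sum of the alternative themes over every piece of explanation data.
        return sum(len(themes) for themes in rule[word].values())

    # Tuple ordering applies the tie-break only when explanation counts are equal.
    return min(candidates, key=lambda w: (explanation_count(w), theme_count(w)))
```

The intuition is that an entity word with fewer explanations and fewer alternative themes is less ambiguous, so it is a more reliable indicator of the theme. If two words still tie on both counts, this sketch keeps the one that appears first, which is a choice the source leaves open.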
Further, the determining module 703 is specifically configured to: when the key entity word corresponds to multiple alternative themes, obtain the historical labeling data of those alternative themes and, according to the historical labeling data, determine the number of times each alternative theme has been used for theme labeling; and determine the alternative theme that has been used for theme labeling the most times as the theme of the text to be labeled.
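Choosing among several alternative themes by historical usage can be sketched as follows. The function name `pick_theme_by_history` is hypothetical, and the historical labeling data is assumed to be a plain list of themes assigned in earlier labeling runs.

```python
from collections import Counter
from typing import List


def pick_theme_by_history(alternative_themes: List[str], history: List[str]) -> str:
    """Return the alternative theme that was used most often in past labeling.

    `history` is assumed to be a list of previously assigned themes.
    Themes never used before count as zero; ties keep the earlier candidate.
    """
    counts = Counter(history)
    return max(alternative_themes, key=lambda theme: counts[theme])
```

Because `Counter` returns zero for missing keys, a theme that has never been used simply loses to any theme with at least one past use.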
In some optional embodiments, the device further comprises an update module, configured to: segment the text to be labeled by a second word segmentation algorithm to obtain supplementary entity words; and update the correspondence rule between entity words and alternative themes according to the theme of the text to be labeled and the supplementary entity words.
In some optional embodiments, the device further comprises an external processing module, configured to: if the theme of the text to be labeled cannot be determined according to the correspondence rule, input the text to be labeled into a preset theme labeling model; and receive the output data of the theme labeling model and use the output data as the theme labeling result.
The device of the above embodiment is used to implement the corresponding method in the preceding embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, an embodiment of the present invention further provides an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method described in any one of the above embodiments when executing the program.
The electronic device of the above embodiment is used to implement the corresponding method in the preceding embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the method described in any one of the above embodiments.
The storage medium of the above embodiment is used to implement the corresponding method in the preceding embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of the present invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the present invention as described above, which are not provided in detail for the sake of brevity.
Although the present invention has been described in conjunction with specific embodiments, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory architectures (for example, dynamic RAM (DRAM)).
The embodiments of the present invention are intended to cover all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A theme labeling method for text, characterized by comprising:
obtaining a text to be labeled;
segmenting the text to be labeled by a first word segmentation algorithm to obtain at least one entity word;
determining the theme of the text to be labeled using the entity word, according to a preset correspondence rule between entity words and alternative themes;
performing theme labeling on the text to be labeled according to the theme of the text to be labeled, and outputting a theme labeling result.
2. The theme labeling method for text according to claim 1, characterized in that determining the theme of the text to be labeled using the entity word, according to the preset correspondence rule between entity words and alternative themes, specifically comprises:
for each entity word, determining at least one piece of explanation data corresponding to the entity word according to the correspondence rule, wherein each piece of explanation data corresponds to at least one alternative theme;
determining the entity word with the fewest pieces of explanation data as the key entity word;
determining the alternative theme corresponding to the key entity word as the theme of the text to be labeled.
3. The theme labeling method for text according to claim 2, characterized in that determining the entity word with the fewest pieces of explanation data as the key entity word specifically comprises:
if more than one entity word has the fewest pieces of explanation data, determining the number of alternative themes corresponding to each such entity word, wherein the number of alternative themes corresponding to an entity word is the sum of the numbers of alternative themes corresponding to each piece of explanation data of that entity word;
determining the entity word with the smallest number of corresponding alternative themes as the key entity word.
4. The theme labeling method for text according to claim 2, characterized in that determining the alternative theme corresponding to the key entity word as the theme of the text to be labeled specifically comprises:
when the key entity word corresponds to multiple alternative themes, obtaining the historical labeling data of the multiple alternative themes and, according to the historical labeling data, determining the number of times each alternative theme has been used for theme labeling;
determining the alternative theme that has been used for theme labeling the most times as the theme of the text to be labeled.
5. The theme labeling method for text according to claim 1, characterized in that, after performing theme labeling on the text to be labeled and outputting the theme labeling result, the method further comprises:
segmenting the text to be labeled by a second word segmentation algorithm to obtain supplementary entity words;
updating the correspondence rule between entity words and alternative themes according to the theme of the text to be labeled and the supplementary entity words.
6. The theme labeling method for text according to claim 1, characterized by further comprising:
if the theme of the text to be labeled cannot be determined according to the correspondence rule, inputting the text to be labeled into a preset theme labeling model;
receiving the output data of the theme labeling model, and using the output data as the theme labeling result.
7. A theme labeling device for text, characterized by comprising:
an obtaining module, configured to obtain a text to be labeled;
a word segmentation module, configured to segment the text to be labeled by a first word segmentation algorithm to obtain at least one entity word;
a determining module, configured to determine the theme of the text to be labeled using the entity word, according to a preset correspondence rule between entity words and alternative themes;
a labeling module, configured to perform theme labeling on the text to be labeled according to the theme of the text to be labeled, and to output a theme labeling result.
8. The theme labeling device for text according to claim 7, characterized in that the determining module is specifically configured to:
for each entity word, determine at least one piece of explanation data corresponding to the entity word according to the correspondence rule, wherein each piece of explanation data corresponds to at least one alternative theme; determine the entity word with the fewest pieces of explanation data as the key entity word; and determine the alternative theme corresponding to the key entity word as the theme of the text to be labeled.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 6 when executing the program.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, the computer instructions being used to cause a computer to perform the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910703411.XA CN110414006B (en) | 2019-07-31 | 2019-07-31 | Text theme labeling method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110414006A (en) | 2019-11-05 |
CN110414006B (en) | 2023-09-08 |
Family
ID=68364783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910703411.XA Active CN110414006B (en) | 2019-07-31 | 2019-07-31 | Text theme labeling method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110414006B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101315624A (en) * | 2007-05-29 | 2008-12-03 | 阿里巴巴集团控股有限公司 | Text subject recommending method and device |
CN103020295A (en) * | 2012-12-28 | 2013-04-03 | 新浪网技术(中国)有限公司 | Problem label marking method and device |
US20140379719A1 (en) * | 2013-06-24 | 2014-12-25 | Tencent Technology (Shenzhen) Company Limited | System and method for tagging and searching documents |
CN106372060A (en) * | 2016-08-31 | 2017-02-01 | 北京百度网讯科技有限公司 | Search text labeling method and device |
CN107291694A (en) * | 2017-06-27 | 2017-10-24 | 北京粉笔未来科技有限公司 | A kind of automatic method and apparatus, storage medium and terminal for reading and appraising composition |
CN107644012A (en) * | 2017-08-29 | 2018-01-30 | 平安科技(深圳)有限公司 | Electronic installation, problem identification confirmation method and computer-readable recording medium |
CN108595519A (en) * | 2018-03-26 | 2018-09-28 | 平安科技(深圳)有限公司 | Focus incident sorting technique, device and storage medium |
CN109033064A (en) * | 2018-05-31 | 2018-12-18 | 华中师范大学 | A kind of primary language composition corpus label extraction method and device based on text snippet |
Non-Patent Citations (1)
Title |
---|
唐培丽等: "基于中文文本主题提取的分词方法研究", 《吉林工程技术师范学院学报》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111062213A (en) * | 2019-11-19 | 2020-04-24 | 竹间智能科技(上海)有限公司 | Named entity identification method, device, equipment and medium |
CN111062213B (en) * | 2019-11-19 | 2024-01-12 | 竹间智能科技(上海)有限公司 | Named entity identification method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN110414006B (en) | 2023-09-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |