CN110414006A - Theme annotation method, device, electronic equipment and storage medium for text - Google Patents

Info

Publication number
CN110414006A
Authority
CN
China
Prior art keywords
theme
text
marked
entity word
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910703411.XA
Other languages
Chinese (zh)
Other versions
CN110414006B (en)
Inventor
许蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd
Priority to CN201910703411.XA
Publication of CN110414006A
Application granted
Publication of CN110414006B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a theme annotation method, device, electronic equipment and storage medium for text. The method includes: obtaining a text to be marked; segmenting the text to be marked with a first word segmentation algorithm to obtain at least one entity word; using the entity word to determine the theme of the text to be marked according to a preset rule of correspondence between entity words and alternative themes; and, according to the theme of the text to be marked, annotating the text to be marked with the theme and outputting a theme annotation result. Based on the rule of correspondence between entity words and alternative themes, the invention segments the text to be marked and matches the result against the rule of correspondence to obtain the theme of the text to be marked, achieving efficient and accurate theme annotation of text.

Description

Theme annotation method, device, electronic equipment and storage medium for text
Technical field
The present invention relates to the field of computer technology, and in particular to a theme annotation method, device, electronic equipment and storage medium for text.
Background art
In recent years, with the rapid development of the Internet, information resources have been growing exponentially. The abundance of information resources on the Internet brings great convenience to people's lives: people can conveniently and quickly obtain many types of information resources, of which text is an important one. In the era of big data, however, a user facing a massive volume of text has difficulty obtaining, accurately and quickly, the texts related to the theme he or she needs. How to annotate text with themes efficiently and accurately is therefore a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
In view of this, an object of the present invention is to provide a theme annotation method, device, electronic equipment and storage medium for text that can annotate the theme of a text efficiently and accurately.
To this end, the present invention provides a theme annotation method for text, comprising:
obtaining a text to be marked;
segmenting the text to be marked with a first word segmentation algorithm to obtain at least one entity word;
using the entity word to determine the theme of the text to be marked according to a preset rule of correspondence between entity words and alternative themes;
according to the theme of the text to be marked, annotating the text to be marked with the theme and outputting a theme annotation result.
In addition, the present invention provides a theme annotation device for text, comprising:
an obtaining module, configured to obtain a text to be marked;
a word segmentation module, configured to segment the text to be marked with a first word segmentation algorithm to obtain at least one entity word;
a determining module, configured to use the entity word to determine the theme of the text to be marked according to a preset rule of correspondence between entity words and alternative themes;
an annotation module, configured to annotate the text to be marked with the theme according to the theme of the text to be marked and to output a theme annotation result.
In addition, the present invention provides electronic equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any one of the methods described above when executing the program.
In addition, the present invention provides a non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, the computer instructions being used to cause a computer to execute any one of the methods described above.
As can be seen from the above, the theme annotation method, device, electronic equipment and storage medium for text provided by the present invention rely on the correspondence between entity words and alternative themes: the text to be marked is segmented and the result is matched against the rule of correspondence to obtain the theme of the text to be marked, achieving efficient and accurate theme annotation of text.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below evidently show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the theme annotation method for text according to an embodiment of the present invention;
Fig. 2 is a flowchart of the steps for determining the theme of the text to be marked in an embodiment of the present invention;
Fig. 3 is a flowchart of the steps for determining the critical entity word in an embodiment of the present invention;
Fig. 4 is a flowchart of the steps for selecting among the alternative themes corresponding to the critical entity word in an embodiment of the present invention;
Fig. 5 is a flowchart of the steps for updating the rule of correspondence between entity words and alternative themes in an embodiment of the present invention;
Fig. 6 is a flowchart of the external processing steps in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the theme annotation device for text according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present invention have the ordinary meaning understood by a person of ordinary skill in the field of this disclosure. "First", "second" and similar words used in this disclosure do not indicate any order, quantity or importance and serve only to distinguish different components. "Comprise", "include" and similar words mean that the element or object preceding the word covers the elements or objects listed after the word and their equivalents, without excluding other elements or objects. "Connect", "connected" and similar words are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right" and the like indicate only relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may change accordingly.
An embodiment of the present invention provides a theme annotation method for text which, with reference to Fig. 1, comprises the following steps:
Step 101, obtain a text to be marked.
In this step, the text to be marked that requires theme annotation is first obtained. One way of obtaining it is to receive text sent directly by the user: the user sends the text in the form of a text file, and after the text file is received it is read and the text to be marked is extracted from it. Alternatively, the text to be marked may be fetched from a storage location according to location information sent by the user. The location information may be a local storage address or a network storage address; according to the location information, the corresponding storage location is accessed and its stored data is read, from which the text to be marked is extracted.
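The two acquisition routes of step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `obtain_text`, its parameters and the UTF-8 default are assumptions made here.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def obtain_text(text=None, location=None, encoding="utf-8"):
    """Return the text to be marked (hypothetical helper).

    text:     content already extracted from a file the user sent directly.
    location: a local storage address or a network storage address.
    """
    if text is not None:
        return text
    if location is None:
        raise ValueError("either text or location is required")
    # A network storage address is fetched over HTTP(S).
    if urlparse(location).scheme in ("http", "https"):
        with urlopen(location) as response:
            return response.read().decode(encoding)
    # Anything else is treated as a local storage address.
    return Path(location).read_text(encoding=encoding)
```

A caller would pass either `text=` for content received directly or `location=` for an address supplied by the user.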
Step 102, segment the text to be marked with a first word segmentation algorithm to obtain at least one entity word.
In this step, word segmentation processing is performed on the obtained text to be marked. Specifically, the word segmentation processing comprises a segmentation pass over the text to be marked and a screening pass over the segmentation result. In the segmentation pass, the text to be marked is divided into several words that conform to natural semantic rules. These words differ in part of speech and generally include nouns, verbs, adverbs, conjunctions, modal particles and so on. To reflect the theme of the text to be marked more accurately, the words obtained by segmentation are then screened to obtain the entity words. An entity word is a word that can accurately reflect the theme to which the text to be marked belongs. How entity words are determined depends on the first word segmentation algorithm used in this embodiment and may follow existing keyword or high-frequency-word rules. In general, a word that occurs many times in the text to be marked can be determined to be a high-frequency word. A keyword may be such a high-frequency word, or may be determined in some other way; the specific rule depends on which algorithm is used as the first word segmentation algorithm. In this embodiment the first word segmentation algorithm may be chosen from tools such as NLTK and jieba. The principles and specific processing of these word segmentation algorithms are prior art and are not detailed in this embodiment.
For example, the text to be marked obtained in step 101 is: "Bitcoin attracts countless fans, but the heart of the Central Bank belongs elsewhere | Interface News". After the text to be marked is segmented by a Chinese word segmentation algorithm, the resulting words are: "Bitcoin", "attracts fans", "countless", "but", "Central Bank", "of", "heart", "elsewhere", "belongs", "interface", "news". The words are then screened to remove conjunctions, adverbs and other irrelevant content, and the final output of the word segmentation, that is, the entity words obtained, is: "Bitcoin", "Central Bank", "interface", "news".
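The screening pass can be illustrated with a small sketch. It assumes the segmentation pass has already produced (word, part-of-speech) pairs; the tags, the noun-only rule and the stop list are assumptions made here for illustration, since the source leaves the screening rule to the chosen word segmentation algorithm.

```python
# Hypothetical output of the segmentation pass: (word, part-of-speech) pairs.
# The tags are illustrative: n = noun, v = verb, m = numeral, c = conjunction,
# u = particle, d = adverb.
SEGMENTED = [
    ("Bitcoin", "n"), ("attracts fans", "v"), ("countless", "m"),
    ("but", "c"), ("Central Bank", "n"), ("of", "u"), ("heart", "n"),
    ("elsewhere", "d"), ("belongs", "v"), ("interface", "n"), ("news", "n"),
]

KEEP_POS = {"n"}        # assumption: only nouns are candidate entity words
STOP_WORDS = {"heart"}  # assumption: nouns judged irrelevant to any theme


def screen_entity_words(segmented, keep_pos=KEEP_POS, stop_words=STOP_WORDS):
    """Keep words whose part of speech is allowed and that are not stop words."""
    return [word for word, pos in segmented
            if pos in keep_pos and word not in stop_words]


print(screen_entity_words(SEGMENTED))
# ['Bitcoin', 'Central Bank', 'interface', 'news']
```

A keyword or high-frequency-word rule, as mentioned in the text, could replace the part-of-speech filter without changing the shape of the function.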
Step 103, using the entity word, determine the theme of the text to be marked according to a preset rule of correspondence between entity words and alternative themes.
In this step, the theme of the text to be marked is determined with a preset rule of correspondence between entity words and alternative themes. Specifically, the rule of correspondence may be set up in the form of a database. Within the rule of correspondence, the correspondence between entity words and alternative themes is generally one-to-many, that is, one alternative theme corresponds to multiple entity words.
The rule of correspondence is set in advance. The raw data on the correspondence between entity words and alternative themes may be taken from existing relevant databases, or may be built from data provided by professionals in each field.
In this step, the entity words obtained in the preceding step are used to search and match in the preset rule of correspondence, yielding the alternative themes corresponding to the matched entity words. The alternative themes obtained by matching can serve as the theme of the text to be marked.
For example, for the entity words obtained by segmenting the text to be marked above, "Bitcoin", "Central Bank", "interface" and "news", the rule of correspondence gives: the alternative themes corresponding to "Bitcoin" are [economy] and [science and technology]; the alternative themes corresponding to "Central Bank" are [economy] and [society]; the alternative themes corresponding to "interface" are [science and technology] and [culture]; the alternative theme corresponding to "news" is [culture].
Based on the alternative themes corresponding to each entity word, the theme of the text to be marked is then determined. Specifically, all the alternative themes corresponding to the entity words may together be determined as the theme of the text to be marked, that is, the union of all alternative themes corresponding to the entity words is taken. The theme of the text to be marked is then: [economy], [science and technology], [society], [culture].
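The lookup and union just described can be sketched with a plain dictionary standing in for the rule of correspondence. The dictionary layout and the function name `themes_by_union` are assumptions; the source only says the rule may be stored in a database.

```python
# Rule of correspondence from the example: entity word -> alternative themes.
RULE = {
    "Bitcoin": ["economy", "science and technology"],
    "Central Bank": ["economy", "society"],
    "interface": ["science and technology", "culture"],
    "news": ["culture"],
}


def themes_by_union(entity_words, rule):
    """Union of the alternative themes of all matched entity words,
    kept in first-seen order."""
    themes = []
    for word in entity_words:
        for theme in rule.get(word, []):  # unmatched words contribute nothing
            if theme not in themes:
                themes.append(theme)
    return themes


print(themes_by_union(["Bitcoin", "Central Bank", "interface", "news"], RULE))
# ['economy', 'science and technology', 'society', 'culture']
```

A list is used instead of a set so that the result keeps a stable order.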
It should be noted that the alternative theme and the theme described in this embodiment are the same in substance: both indicate the field that the content of a text relates to. The difference in name only reflects the different processing steps in which they appear. Specifically, an alternative theme is a theme that appears in the rule of correspondence and in the retrieval and matching of entity words, that is, in steps that do not yet yield the final annotation result. Once at least one alternative theme has been selected to annotate the text to be marked, it is called the theme, that is, the theme of the text to be marked.
Step 104, according to the theme of the text to be marked, annotate the text to be marked with the theme and output a theme annotation result.
In this step, the text to be marked is annotated with the theme obtained above. Specifically, the annotation may consist of adding a data label to the text to be marked: the data label records the theme of the text to be marked obtained above, and a correspondence is established between the data label and the text to be marked.
The theme annotation result can be understood as the text to be marked after theme annotation has been completed successfully. An output operation is then performed on the theme annotation result. The output operation in this embodiment may be storing the theme annotation result so that other processes that need to identify the theme of the text can call it; it may be displaying the result on a display component of the equipment currently executing the method of this embodiment, so that the user can directly learn the theme of the text to be marked; or it may be sending the theme annotation result to other equipment by any wired or wireless data communication means for subsequent processing by that equipment. In a specific implementation, one or more of the above may be used to output the theme annotation result.
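One possible shape for the data label and the output operation is sketched below. The `ThemeAnnotation` class, its field names and the JSON serialisation are illustrative assumptions; the source does not prescribe a storage or transmission format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ThemeAnnotation:
    """A text together with the data label recording its themes (hypothetical)."""
    text: str
    themes: tuple


def annotate(text, themes):
    """Attach the determined themes to the text as a data label."""
    return ThemeAnnotation(text=text, themes=tuple(themes))


def export(annotation):
    """Serialise the annotation result so it can be stored, shown or sent on."""
    return json.dumps(asdict(annotation), ensure_ascii=False)


result = annotate("Bitcoin attracts countless fans ...", ["economy", "society"])
print(export(result))
```

The same serialised string could be written to storage, rendered on a display, or sent to another device, matching the three output options listed above.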
It can be seen that the theme annotation method for text of this embodiment, based on the rule of correspondence between entity words and alternative themes, segments the text to be marked and matches the result against the rule of correspondence to obtain the theme of the text to be marked, achieving efficient and accurate theme annotation of text. The rule of correspondence between entity words and alternative themes is established in advance from data obtained in various ways, and the theme is determined through the correspondence between entity words and alternative themes. When themes are divided in detail according to the hierarchy of their fields, theme annotation of subdivided fields can be achieved, so the method has good prospects for practical application.
As an optional embodiment, in the theme annotation method for text, the step of using the entity word to determine the theme of the text to be marked according to the preset rule of correspondence between entity words and alternative themes further takes into account how different semantic interpretations of an entity word affect the rule of correspondence. In natural language, the same word may have different meanings, that is, different explanations in different fields. For example, "Cao Cao" is the name of a historical figure, for which the corresponding alternative theme may be [history]; "Cao Cao" is also the name of a song, for which the corresponding alternative theme may be [entertainment]. Different semantic interpretations thus affect the theme corresponding to a word.
With reference to Fig. 2, this embodiment specifically comprises the following steps:
Step 201, for each entity word, determine according to the rule of correspondence at least one piece of explanation data corresponding to the entity word, wherein each piece of explanation data corresponds to at least one alternative theme.
In this embodiment, within the rule of correspondence between entity words and alternative themes, the correspondence between an entity word and an alternative theme is further built through explanation data. That is, an entity word first corresponds to at least one piece of explanation data, and each piece of explanation data in turn corresponds to one or more alternative themes. The correspondence between entity words and alternative themes is thereby extended to a correspondence among entity words, explanation data and alternative themes. For each entity word, the corresponding pieces of explanation data are determined first; each piece of explanation data in turn corresponds to at least one alternative theme.
For example, segmenting the text to be marked "Bitcoin attracts countless fans, but the heart of the Central Bank belongs elsewhere | Interface News" yields the entity words "Bitcoin", "Central Bank", "interface" and "news".
The entity word "Bitcoin" corresponds to three pieces of explanation data: 'virtual currency', 'movie title' and 'book title'. The alternative themes corresponding to the explanation data 'virtual currency' are { [economy], [science and technology] }; those corresponding to 'movie title' are { [entertainment], [film] }; those corresponding to 'book title' are { [culture], [painting and calligraphy] }.
The entity word "Central Bank" corresponds to one piece of explanation data, 'financial institution', whose corresponding alternative themes are { [economy], [society] }.
The entity word "interface" corresponds to two pieces of explanation data: 'data mode' and 'physical object'. The alternative themes corresponding to the explanation data 'data mode' are { [culture], [entertainment], [science and technology] }; the alternative theme corresponding to 'physical object' is { [nature] }.
The entity word "news" corresponds to two pieces of explanation data: 'style' and 'song title'. The alternative theme corresponding to the explanation data 'style' is { [culture] }; those corresponding to 'song title' are { [culture], [entertainment] }.
Step 202, determine the entity word that corresponds to the fewest pieces of explanation data as the critical entity word.
In this step, one critical entity word is determined among the entity words; the critical entity word is the entity word that most accurately reflects the theme of the text to be marked. Specifically, the critical entity word is determined by the number of pieces of explanation data each entity word corresponds to: the entity word with the fewest is the critical entity word. The fewer pieces of explanation data an entity word corresponds to, the more likely it is that the entity word has a unique semantic interpretation. So when that entity word appears in the text to be marked, what the text records is very likely that unique semantic interpretation of the entity word.
For example, the entity word "Bitcoin" corresponds to three pieces of explanation data, "Central Bank" to one, "interface" to two and "news" to two. "Central Bank" corresponds to the fewest pieces of explanation data, so it is determined as the critical entity word. That is, "Central Bank" has a unique semantic interpretation, 'financial institution', so the text to be marked very likely records content directly related to the Central Bank.
Step 203, determine the alternative themes corresponding to the critical entity word as the theme of the text to be marked.
In this step, the alternative themes corresponding to the critical entity word determined above are determined as the theme of the text to be marked.
For example, the critical entity word is determined to be "Central Bank", and the alternative themes corresponding to the entity word "Central Bank" are { [economy], [society] }, so [economy] and [society] are determined as the theme of the text to be marked.
It can be seen that the method of this embodiment adds explanation data to the rule of correspondence between entity words and alternative themes, thereby determining the critical entity word that more accurately reflects the content of the text to be marked, and determines the theme of the text to be marked from the alternative themes corresponding to that critical entity word, achieving more accurate theme annotation.
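Steps 201 to 203 can be sketched with a nested dictionary: entity word, then explanation data, then alternative themes. The data is taken from the example above; the dictionary layout and the function names are assumptions made for illustration.

```python
# entity word -> explanation data -> alternative themes (from the example)
RULE = {
    "Bitcoin": {
        "virtual currency": ["economy", "science and technology"],
        "movie title": ["entertainment", "film"],
        "book title": ["culture", "painting and calligraphy"],
    },
    "Central Bank": {"financial institution": ["economy", "society"]},
    "interface": {
        "data mode": ["culture", "entertainment", "science and technology"],
        "physical object": ["nature"],
    },
    "news": {"style": ["culture"], "song title": ["culture", "entertainment"]},
}


def critical_entity_word(entity_words, rule):
    """Step 202: the entity word with the fewest pieces of explanation data.
    On a tie, min() simply returns the first such word."""
    return min(entity_words, key=lambda word: len(rule[word]))


def themes_of(word, rule):
    """Step 203: the alternative themes of a word across its explanation data."""
    themes = []
    for explanation_themes in rule[word].values():
        for theme in explanation_themes:
            if theme not in themes:
                themes.append(theme)
    return themes


word = critical_entity_word(["Bitcoin", "Central Bank", "interface", "news"], RULE)
print(word, themes_of(word, RULE))
# Central Bank ['economy', 'society']
```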
As an optional embodiment, building on the preceding embodiment that includes explanation data, more than one entity word may correspond to the fewest pieces of explanation data; this embodiment provides a way to choose among them. With reference to Fig. 3, the step of determining the critical entity word specifically comprises:
Step 301, if more than one entity word corresponds to the fewest pieces of explanation data, determine, for each of those entity words, the number of alternative themes it corresponds to;
Step 302, determine the entity word with the smallest number of corresponding alternative themes as the critical entity word.
For example, two entity words each correspond to only one piece of explanation data. The explanation data of one entity word corresponds to two alternative themes, and the explanation data of the other corresponds to one alternative theme. Through the steps of this embodiment, the entity word whose explanation data corresponds to one alternative theme is determined as the critical entity word. The theme of the text to be marked is then determined to be that one alternative theme.
Here, the number of alternative themes corresponding to an entity word is the sum of the numbers of alternative themes corresponding to each piece of its explanation data. For example, two entity words each correspond to two pieces of explanation data. The first piece of explanation data of one entity word corresponds to two alternative themes and the second also to two, so that entity word corresponds to four alternative themes. The first piece of explanation data of the other entity word corresponds to two alternative themes and the second to one, so that entity word corresponds to three alternative themes. Accordingly, the entity word with three alternative themes is determined as the critical entity word.
By determining the critical entity word as in this embodiment, the number of themes finally determined for the text to be marked is kept small, which helps improve the accuracy of theme annotation.
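The tie-break of steps 301 and 302 amounts to a two-part sort key: first the number of pieces of explanation data, then the summed number of alternative themes. The sketch below uses hypothetical entity words "A" and "B" and placeholder theme names to mirror the second example; none of these names come from the source.

```python
def theme_count(word, rule):
    """Sum of the numbers of alternative themes over each piece of explanation data."""
    return sum(len(themes) for themes in rule[word].values())


def critical_entity_word(entity_words, rule):
    """Fewest pieces of explanation data first; ties broken by fewest themes."""
    return min(entity_words,
               key=lambda word: (len(rule[word]), theme_count(word, rule)))


# Hypothetical rule: both words have two pieces of explanation data.
RULE = {
    "A": {"e1": ["t1", "t2"], "e2": ["t3", "t4"]},  # 2 + 2 = 4 themes
    "B": {"e1": ["t1", "t2"], "e2": ["t3"]},        # 2 + 1 = 3 themes
}

print(critical_entity_word(["A", "B"], RULE))  # B
```

Because Python compares tuples element by element, the second component is consulted only when the first is equal, which matches the order of steps 301 and 302.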
As an optional embodiment, building on the preceding embodiment that includes explanation data, a step of selecting among the alternative themes corresponding to the critical entity word is further included; through this step the text to be marked can be given exactly one theme. In some cases the text to be marked needs to have exactly one theme to facilitate subsequent applications. With reference to Fig. 4, the step of selecting among the alternative themes corresponding to the critical entity word specifically comprises:
Step 401, when the critical entity word corresponds to multiple alternative themes, obtain the history annotation data of the multiple alternative themes and, according to the history annotation data, determine the number of times each alternative theme has been used for theme annotation.
In this step, when the critical entity word corresponds to multiple alternative themes, the history annotation data of each of those alternative themes is obtained. The history annotation data is the historical record of an alternative theme having been used to annotate text. It is similar to log data: it is generated and stored by the entity providing the theme annotation function when theme annotation is performed. In this embodiment, the history annotation data may come from the equipment implementing the method of this embodiment or may be obtained from an external source. Specifically, the history annotation data may record data items such as the time at which the alternative theme was used for annotation and the object that was annotated.
Then, according to the history annotation data of each alternative theme, the number of times the alternative theme has been used for theme annotation is determined.
Step 402, determine the alternative theme that has been used for theme annotation the most times as the theme of the text to be marked.
In this step, the alternative theme that has been used for theme annotation the most times is determined as the theme of the text to be marked.
For example, consider two alternative themes, [economy] and [society]. The alternative theme [economy] has been used for theme annotation 10000 times, and the alternative theme [society] 9000 times. [economy] is therefore determined as the theme of the text to be marked.
In general, because the volume of history annotation data is large, the alternative theme used most often for theme annotation can be determined uniquely. If in some case more than one alternative theme is tied for the most uses, a prompt may be issued and one of them is then determined as the theme of the text to be marked according to the user's feedback.
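Steps 401 and 402 can be sketched by counting history records per theme. The record layout (a dictionary with a "theme" key, alongside optional time and object fields) and the function name are assumptions; the source only says the records resemble log data.

```python
from collections import Counter


def most_used_themes(candidates, history):
    """Return the candidate themes used most often in the history records.

    A single-element result is the theme of the text to be marked; a longer
    result is a tie, which the caller resolves by prompting the user.
    `candidates` must be non-empty.
    """
    counts = Counter(record["theme"] for record in history)
    best = max(counts[theme] for theme in candidates)
    return [theme for theme in candidates if counts[theme] == best]


# Hypothetical history: [economy] used 3 times, [society] used 2 times.
history = ([{"theme": "economy", "object": "doc"}] * 3
           + [{"theme": "society", "object": "doc"}] * 2)

winners = most_used_themes(["economy", "society"], history)
if len(winners) == 1:
    print("theme:", winners[0])
else:
    print("tie, ask the user to choose among:", winners)
```

Returning the full list of winners keeps the tie case visible to the caller instead of silently picking one theme.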
As an optional embodiment, after theme labeling of the text to be labeled is completed and the theme labeling result is output, the theme labeling method of the text further includes a step of updating the correspondence rule between entity words and alternative themes. Referring to Fig. 5, the updating step specifically includes:
Step 501: segment the text to be labeled using a second word segmentation algorithm to obtain supplement entity words.
In this step, the text to be labeled is segmented again, this time using a second word segmentation algorithm. The segmentation yields several entity words, which are referred to in this embodiment as supplement entity words. The second word segmentation algorithm may be any of the word segmentation algorithms listed in the foregoing embodiments, provided that it differs from the first word segmentation algorithm, so that the supplement entity words may differ from the entity words obtained by the first word segmentation algorithm.
When the supplement entity words differ from the entity words, the supplement entity words can be used to update the correspondence rule between entity words and alternative themes.
Step 502: update the correspondence rule between entity words and alternative themes according to the theme of the text to be labeled and the supplement entity words.
In this step, the theme of the text to be labeled has already been determined by the foregoing steps. On this basis, a correspondence can be established between that theme and the supplement entity words obtained by the second word segmentation algorithm, which differs from the first word segmentation algorithm. In other words, the supplement entity words obtained by segmenting the text to be labeled can also be mapped to the already determined theme of that text. The correspondence between the supplement entity words and the theme is then added to the preset correspondence rule between entity words and alternative themes, which updates the rule.
With the method of this embodiment, the correspondence rule between entity words and alternative themes can be updated in a feedback manner according to the theme labeling results, so that the rule is continuously expanded and refined as theme labeling proceeds. This effectively improves the accuracy of the method of this embodiment in theme labeling.
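As a purely illustrative sketch, which is not part of the original disclosure, steps 501 and 502 could be expressed in Python as follows. The sketch assumes that the correspondence rule is held as a dictionary mapping each entity word to a set of alternative themes, and that the results of the two word segmentation algorithms are supplied as word lists. The function name `update_correspondence_rule` and all parameter names are hypothetical.

```python
from typing import Dict, List, Set


def update_correspondence_rule(
    rule: Dict[str, Set[str]],
    entity_words: List[str],
    supplement_words: List[str],
    theme: str,
) -> List[str]:
    """Add supplement entity words that differ from the first segmentation
    result to the rule, mapped to the already determined theme.

    Returns the list of supplement entity words that were used for the update.
    """
    known = set(entity_words)
    added: List[str] = []
    for word in supplement_words:
        # Assumption: only words not produced by the first segmentation
        # algorithm are treated as new information for the rule.
        if word in known or word in added:
            continue
        rule.setdefault(word, set()).add(theme)
        added.append(word)
    return added
```

Because the rule is modified in place, later texts containing one of the added supplement entity words can be labeled through the ordinary rule lookup.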
As an optional embodiment, in the theme labeling method of the text, it may happen that the theme of the text to be labeled cannot be determined from the correspondence rule between entity words and alternative themes. This embodiment provides corresponding follow-up processing steps, which, referring to Fig. 6, specifically include:
Step 601: if the theme of the text to be labeled cannot be determined according to the correspondence rule, input the text to be labeled into a preset theme labeling model.
In this step, the theme corresponding to the entity word may be unobtainable, either because the correspondence rule between entity words and themes does not contain the relevant correspondence or because of other problems such as data processing errors. In that case, the text to be labeled is input into another, external, existing theme labeling model. The theme labeling model may be any existing model used for theme labeling, such as LDA (Latent Dirichlet Allocation).
Step 602: receive the output data of the theme labeling model and use the output data as the theme labeling result.
In this step, the output data of the theme labeling model, that is, the theme label that the model assigns to the text to be labeled in this embodiment, is received and used as the theme labeling result.
With the method of this embodiment, when a theme labeling result cannot be obtained because the correspondence rule between entity words and alternative themes is incomplete or because of a data processing problem, an external theme labeling model supplements and assists the theme labeling. This ensures that the method of this embodiment can ultimately output a theme labeling result for the text to be labeled.
Clearly, the step of updating the correspondence rule between entity words and alternative themes described in the foregoing embodiment can also be carried out on the basis of the output data of the theme labeling model, so as to further refine the correspondence rule.
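The fallback of steps 601 and 602 can be illustrated with the following minimal Python sketch, which is not part of the original disclosure. It assumes that the rule-based labeler returns `None` when no theme can be determined, and that the external theme labeling model is available as a plain callable. Both callables and the function name `label_with_fallback` are hypothetical, and no particular LDA library is implied.

```python
from typing import Callable, Optional, Tuple


def label_with_fallback(
    text: str,
    rule_based_labeler: Callable[[str], Optional[str]],
    external_model: Callable[[str], str],
) -> Tuple[str, bool]:
    """Return (theme, used_fallback) for the given text.

    The external model is consulted only when the rule-based labeler
    cannot determine a theme (signalled here by returning None).
    """
    theme = rule_based_labeler(text)
    if theme is not None:
        return theme, False
    # Step 601: pass the text to the preset external theme labeling model.
    # Step 602: use the model's output as the theme labeling result.
    return external_model(text), True
```

The returned flag lets a caller decide whether to feed the result back into the correspondence rule, as the paragraph above suggests.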
Based on the same inventive concept, and referring to Fig. 7, an embodiment of the present invention further provides a theme labeling device for text, including:
an acquisition module 701, configured to acquire a text to be labeled;
a word segmentation module 702, configured to segment the text to be labeled using a first word segmentation algorithm to obtain at least one entity word;
a determining module 703, configured to determine the theme of the text to be labeled using the entity word, according to a preset correspondence rule between entity words and alternative themes;
a labeling module 704, configured to perform theme labeling on the text to be labeled according to the theme of the text to be labeled and to output a theme labeling result.
In some optional embodiments, the determining module 703 is specifically configured to: for each entity word, determine at least one piece of explanation data corresponding to the entity word according to the correspondence rule, where each piece of explanation data corresponds to at least one alternative theme; determine the entity word corresponding to the fewest pieces of explanation data as the key entity word; and determine the alternative theme corresponding to the key entity word as the theme of the text to be labeled.
Further, the determining module 703 is specifically configured to: if more than one entity word corresponds to the fewest pieces of explanation data, determine the number of alternative themes corresponding to each such entity word, where the number of alternative themes corresponding to an entity word is the sum of the numbers of alternative themes corresponding to each piece of explanation data of that entity word; and determine the entity word with the fewer corresponding alternative themes as the key entity word.
Further, the determining module 703 is specifically configured to: when the key entity word corresponds to multiple alternative themes, obtain the history labeling data of the multiple alternative themes and determine, according to the history labeling data, the number of times each alternative theme has been used for theme labeling; and determine the alternative theme that has been used for theme labeling the most times as the theme of the text to be labeled.
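The selection logic of the determining module 703 described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the disclosure: the correspondence rule is assumed to map each entity word to a list of explanation data, each of which is a list of alternative themes, and the history labeling data is assumed to be a dictionary of usage counts. The function name `determine_theme` and the parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def determine_theme(
    entity_words: List[str],
    rule: Dict[str, List[List[str]]],
    history_counts: Dict[str, int],
) -> Optional[str]:
    """Pick the key entity word, then pick one of its alternative themes."""
    # Only entity words that the rule knows about can be considered.
    known = [word for word in entity_words if rule.get(word)]
    if not known:
        return None  # the theme cannot be determined from the rule

    def ambiguity(word: str) -> Tuple[int, int]:
        explanations = rule[word]
        # Primary criterion: number of pieces of explanation data.
        # Tie-break: total number of alternative themes over all explanations.
        return len(explanations), sum(len(themes) for themes in explanations)

    key_word = min(known, key=ambiguity)
    candidates = [theme for themes in rule[key_word] for theme in themes]
    if not candidates:
        return None
    # Among several alternative themes, prefer the one used most often
    # in the history labeling data (unseen themes count as zero).
    return max(candidates, key=lambda theme: history_counts.get(theme, 0))
```

When two entity words are still tied after both criteria, or two themes have equal history counts, this sketch simply keeps the first one encountered; the disclosure does not specify that case.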
In some optional embodiments, the device further includes an update module, configured to: segment the text to be labeled using a second word segmentation algorithm to obtain supplement entity words; and update the correspondence rule between entity words and alternative themes according to the theme of the text to be labeled and the supplement entity words.
In some optional embodiments, the device further includes an external processing module, configured to: if the theme of the text to be labeled cannot be determined according to the correspondence rule, input the text to be labeled into a preset theme labeling model; and receive the output data of the theme labeling model and use the output data as the theme labeling result.
The device of the above embodiment is used to implement the corresponding method of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
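One hypothetical way to picture how modules 701 to 704 cooperate is the following Python sketch, which is not part of the original disclosure. Each module is modeled as a callable supplied by the caller; the class name `ThemeLabelingDevice`, its field names, and the dictionary used as the theme labeling result are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ThemeLabelingDevice:
    acquire: Callable[[], str]                        # acquisition module 701
    segment: Callable[[str], List[str]]               # word segmentation module 702
    determine: Callable[[List[str]], Optional[str]]   # determining module 703

    def run(self) -> Dict[str, Optional[str]]:
        text = self.acquire()
        entity_words = self.segment(text)
        theme = self.determine(entity_words)
        # Labeling module 704: attach the theme to the text and output the result.
        return {"text": text, "theme": theme}
```

For example, `ThemeLabelingDevice(lambda: "apple phone", str.split, lambda words: "technology" if "phone" in words else None).run()` runs the four stages in order on a toy input.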
Based on the same inventive concept, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method described in any one of the above embodiments when executing the program.
The electronic device of the above embodiment is used to implement the corresponding method of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Based on the same inventive concept, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to execute the method described in any one of the above embodiments.
The storage medium of the above embodiment is used to implement the corresponding method of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the concept of the present invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
Although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory architectures (for example, dynamic RAM (DRAM)).
The embodiments of the present invention are intended to cover all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A theme labeling method for text, characterized by comprising:
acquiring a text to be labeled;
segmenting the text to be labeled using a first word segmentation algorithm to obtain at least one entity word;
determining the theme of the text to be labeled using the entity word, according to a preset correspondence rule between entity words and alternative themes;
performing theme labeling on the text to be labeled according to the theme of the text to be labeled, and outputting a theme labeling result.
2. The theme labeling method for text according to claim 1, characterized in that determining the theme of the text to be labeled using the entity word, according to the preset correspondence rule between entity words and alternative themes, specifically comprises:
for each entity word, determining at least one piece of explanation data corresponding to the entity word according to the correspondence rule, wherein each piece of explanation data corresponds to at least one alternative theme;
determining the entity word corresponding to the fewest pieces of explanation data as the key entity word;
determining the alternative theme corresponding to the key entity word as the theme of the text to be labeled.
3. The theme labeling method for text according to claim 2, characterized in that determining the entity word corresponding to the fewest pieces of explanation data as the key entity word specifically comprises:
if more than one entity word corresponds to the fewest pieces of explanation data, determining the number of alternative themes corresponding to each such entity word, wherein the number of alternative themes corresponding to an entity word is the sum of the numbers of alternative themes corresponding to each piece of explanation data of that entity word;
determining the entity word with the fewest corresponding alternative themes as the key entity word.
4. The theme labeling method for text according to claim 2, characterized in that determining the alternative theme corresponding to the key entity word as the theme of the text to be labeled specifically comprises:
when the key entity word corresponds to multiple alternative themes, obtaining the history labeling data of the multiple alternative themes, and determining, according to the history labeling data, the number of times each alternative theme has been used for theme labeling;
determining the alternative theme that has been used for theme labeling the most times as the theme of the text to be labeled.
5. The theme labeling method for text according to claim 1, characterized in that, after performing theme labeling on the text to be labeled and outputting the theme labeling result, the method further comprises:
segmenting the text to be labeled using a second word segmentation algorithm to obtain supplement entity words;
updating the correspondence rule between entity words and alternative themes according to the theme of the text to be labeled and the supplement entity words.
6. The theme labeling method for text according to claim 1, characterized by further comprising:
if the theme of the text to be labeled cannot be determined according to the correspondence rule, inputting the text to be labeled into a preset theme labeling model;
receiving the output data of the theme labeling model, and using the output data as the theme labeling result.
7. A theme labeling device for text, characterized by comprising:
an acquisition module, configured to acquire a text to be labeled;
a word segmentation module, configured to segment the text to be labeled using a first word segmentation algorithm to obtain at least one entity word;
a determining module, configured to determine the theme of the text to be labeled using the entity word, according to a preset correspondence rule between entity words and alternative themes;
a labeling module, configured to perform theme labeling on the text to be labeled according to the theme of the text to be labeled and to output a theme labeling result.
8. The theme labeling device for text according to claim 7, characterized in that the determining module is specifically configured to: for each entity word, determine at least one piece of explanation data corresponding to the entity word according to the correspondence rule, wherein each piece of explanation data corresponds to at least one alternative theme; determine the entity word corresponding to the fewest pieces of explanation data as the key entity word; and determine the alternative theme corresponding to the key entity word as the theme of the text to be labeled.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 6 when executing the program.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 6.
CN201910703411.XA 2019-07-31 2019-07-31 Text theme labeling method and device, electronic equipment and storage medium Active CN110414006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910703411.XA CN110414006B (en) 2019-07-31 2019-07-31 Text theme labeling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910703411.XA CN110414006B (en) 2019-07-31 2019-07-31 Text theme labeling method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110414006A (en) 2019-11-05
CN110414006B (en) 2023-09-08

Family

ID=68364783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910703411.XA Active CN110414006B (en) 2019-07-31 2019-07-31 Text theme labeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110414006B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315624A (en) * 2007-05-29 2008-12-03 阿里巴巴集团控股有限公司 Text subject recommending method and device
CN103020295A (en) * 2012-12-28 2013-04-03 新浪网技术(中国)有限公司 Problem label marking method and device
US20140379719A1 (en) * 2013-06-24 2014-12-25 Tencent Technology (Shenzhen) Company Limited System and method for tagging and searching documents
CN106372060A (en) * 2016-08-31 2017-02-01 北京百度网讯科技有限公司 Search text labeling method and device
CN107291694A (en) * 2017-06-27 2017-10-24 北京粉笔未来科技有限公司 A kind of automatic method and apparatus, storage medium and terminal for reading and appraising composition
CN107644012A (en) * 2017-08-29 2018-01-30 平安科技(深圳)有限公司 Electronic installation, problem identification confirmation method and computer-readable recording medium
CN108595519A (en) * 2018-03-26 2018-09-28 平安科技(深圳)有限公司 Focus incident sorting technique, device and storage medium
CN109033064A (en) * 2018-05-31 2018-12-18 华中师范大学 A kind of primary language composition corpus label extraction method and device based on text snippet

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
唐培丽等: "基于中文文本主题提取的分词方法研究", 《吉林工程技术师范学院学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062213A (en) * 2019-11-19 2020-04-24 竹间智能科技(上海)有限公司 Named entity identification method, device, equipment and medium
CN111062213B (en) * 2019-11-19 2024-01-12 竹间智能科技(上海)有限公司 Named entity identification method, device, equipment and medium

Also Published As

Publication number Publication date
CN110414006B (en) 2023-09-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant